Boris Serebrinskiy
5 min read · Feb 6, 2022


Are Integers really better for joins than Strings? — Data Engineering by a Pragmatist

Firstly, I'd like to thank everyone who chose to follow me on Medium. I'm humbled by the thought that my blog is read and followed by hundreds of people, all that in just over a month! Incredible! I hope to maintain the same level of engagement, and while I missed last week's entry (I'd love to commit to one story a week), I think I've got an interesting topic today.

Back in my very first story, Data Engineering Story by a Pragmatist | by Boris Serebrinskiy | Dec, 2021 | Medium, I dropped this phrase: "use 'strings' instead of 'integers' for star schema joins". I meant to start a bit of a debate: does one really need number-based keys for joins, or can one stay with "natural keys", i.e. strings? Over my career I've heard many experts and novices alike go with the obvious: numbers are easier to match, join, order, etc. But what about the reality of modern data engines?

In the follow-up stories collectively called "Thou Shalt Not Skew" (Data Engineering by a Pragmatist — Thou Shalt Not Skew! (Part 1) | by Boris Serebrinskiy | Jan, 2022 | Medium, Data Engineering By a Pragmatist — Thou Shalt Not Skew (Part 2) | by Boris Serebrinskiy | Jan, 2022 | Medium, and How Joins affect Data Skew — Data Engineering by a Pragmatist | by Boris Serebrinskiy | Jan, 2022 | Medium) I went off the deep end talking about hashing, skewing, etc. But it is the hashing that matters to this story. In the big data space, hashing and hash joins are one of the primary ways databases match data, i.e. execute joins.

Hash joins (Hash join — Wikipedia) hash both sides of the join condition, compare the hashes, and then compare the actual values for the rows whose hashes match. What follows below is not a very scientific way of proving my hypothesis, but a pragmatist's one :)
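Conceptually it is just a build/probe loop. Here is a toy sketch in plain Scala, illustrative only and not how Spark actually implements it; real engines partition the data, spill to disk, and re-check the real values on hash collisions. Note that the key type K is generic: the same logic applies whether the key is a number or a string.

// Toy hash join: build a hash table on the smaller (dimension) side, then probe it with the larger (fact) side.
def hashJoin[K](dim: Seq[(K, String)], fact: Seq[(K, Double)]): Seq[(K, String, Double)] = {
  val build = dim.toMap                                                              // "build" phase: hash the dimension keys
  fact.flatMap { case (k, amount) => build.get(k).map(attr => (k, attr, amount)) }   // "probe" phase: hash each fact key and look it up
}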

For the past few weeks I've been playing with a handy piece of cloud hardware: a pretty beefy Ubuntu virtual machine with lots and lots of fast cores. While the size of the VM is not critical to this story, having a "shock and awe" kind of machine always helps to prove a point :)

My plan was to test out its performance on some Spark benchmarks. Within 20 minutes I had the Spark engine up and running. I am a big fan of this technology and consider it indispensable for any “Big” data engineer.

So, let’s see how fast this beauty is. Kicking the tires — let’s sum up a trillion numbers in Scala (go big or go home, right?)

spark.conf.set("spark.sql.codegen.wholeStage", true)

spark.range(1000L * 1000 * 1000 * 1000).selectExpr("sum(id)").show()

+-------------------+
|            sum(id)|
+-------------------+
|1001881602603448320|
+-------------------+

Boom! Time taken to sum a trillion numbers: 6.963524675 seconds. This is truly, ridiculously fast (doing the same thing in single-threaded Python took 16 seconds for just 1 billion, and I decided not to wait 16 * 1000 seconds, roughly 4.5 hours). If you use Spark, you are awesome! If you don't, you're still awesome, but please install it today and enjoy!
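For completeness, this is the kind of wrapper that can capture such a timing. Spark's built-in spark.time prints the elapsed milliseconds of whatever block you hand it, and the timed helper below is a hypothetical variant that prints a custom label in seconds (not necessarily how the number above was produced):

// Built-in: prints "Time taken: ... ms" for the enclosed block.
spark.time(spark.range(1000L * 1000 * 1000 * 1000).selectExpr("sum(id)").show())

// Hypothetical helper if you prefer a custom message and seconds:
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label: ${(System.nanoTime() - start) / 1e9} seconds")
  result
}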

By the way, how did the CPU cores look during that time? As expected, no skew. And while this story is not about skew, I want to demonstrate how it looks on a real machine; the picture from htop shows evenly busy cores.

Picture 1. Non-skewed sum operation.

I was doing some other experiments and screen-grabbed a pretty bad skew on one of the join operations. I can't share my code, but it was based on the Spark benchmark from GitHub: databricks/spark-sql-perf

Picture 2. Serious skew on join operations.

Moving on. Let’s add some hashing operations.

With the same trillion numbers, hash each and find the sum of the hashes.

spark.range(1000L * 1000 * 1000 * 1000).createOrReplaceTempView("test")

spark.sql("SELECT *, hash(id) as p1 from test").selectExpr("sum(p1)").show()

+--------------+
|       sum(p1)|
+--------------+
|26267346032547|
+--------------+

Time taken to hash a trillion numbers and take their sum: 48.670489114 seconds. That means roughly 41 seconds went into the hashing itself (the plain sum was about 7 seconds).

So, I’m thinking — let’s try to hash a trillion strings.

This did not work:

spark.sql("SELECT *, hash(cast(id as string)) as p1 from test").selectExpr("sum(p1)").show()

The cast was crawling; I waited 5 minutes, with the Spark UI showing the job would take over an hour to finish. But the goal is to measure how long hashing strings takes, not how long the cast takes.

Trying something different:

spark.sql("SELECT *, hash('' || id) as p1 from test").selectExpr("sum(p1)").show()

This is going slowly too: about 10 billion rows per minute, roughly 1.5 hours for the full trillion. So I take out the hash function, and the timing is almost the same. It seems that making a string out of the "id" column is the problem, and I still can't isolate the time spent hashing the strings.
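One way to isolate the hashing cost (not measured here) would be to materialize the strings once, cache them, and only then time the hash over the cached data; a rough sketch, assuming the cached strings fit in memory or spill to local disk:

// Pay the int-to-string conversion once, cache the result, then time only hashing + sum.
val strings = spark.range(10L * 1000 * 1000 * 1000).selectExpr("cast(id as string) as s").cache()
strings.count()                                        // force materialization of the cached strings
spark.time(strings.selectExpr("sum(hash(s))").show())  // now (mostly) just the hashing is measured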

So, I am going to scale down the test to 10 billion numbers and run two queries.
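To follow along, the "test" view presumably has to be recreated at the smaller size first; a one-line sketch, reusing the same view name:

spark.range(10L * 1000 * 1000 * 1000).createOrReplaceTempView("test")

First, string conversion plus hash: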

spark.sql("SELECT *, hash('' || id) as p1 from test").selectExpr("sum(p1)").show()

That's about 40 seconds. (Compare this to doing 100 times as many integer hashes, a full trillion, in 48 seconds.)

Now, run just the string conversion:

spark.sql("SELECT *, ('' || id) as p1 from test").selectExpr("max(p1)").show()

This will actually take longer, since MAX over strings is slower than over integers. (That is not relevant to the join performance of the hashes: aggregation of strings is definitely slower than of integers, but we are discussing joins, which is a different matter.)

I will cast the strings back to help the max function:

spark.sql("SELECT *, cast(('' || id) as long) as p1 from test").selectExpr("max(p1)").show()

About 40 seconds again!

Which, to my less-than-scientific mind, suggests that hashing 10 billion strings vs 10 billion integers on a modern machine came out pretty much the same: adding the hash to the string query barely moved its runtime (I ran the integer case separately: 0.6 seconds to hash 10 billion integers). I know Spark's hash function uses Murmur3 (MurmurHash — Wikipedia), and I know strings are slower to hash than numbers. But speaking pragmatically, does it matter that much anymore? Imagine no longer needing to design your data model and ETL processes to convert natural string keys into surrogate numeric keys, and running your star schema joins simply on the natural keys... Whole dissertations have been written on the subject... So much coding overhead could be removed. So many trees could have been saved from those dissertations :)
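To make that concrete, here is a minimal sketch of what such a join looks like in Spark. The sales and customers tables and their columns are made-up names for illustration; the point is simply that the fact table joins straight to the dimension on the natural string key, with no surrogate-key lookup step in between.

// Illustrative only: join a fact table to a dimension directly on a natural string key.
val sales     = spark.table("sales")      // hypothetical fact table: customer_code (string), amount, ...
val customers = spark.table("customers")  // hypothetical dimension:  customer_code (string), segment, ...

sales.join(customers, Seq("customer_code"))  // join directly on the string natural key
  .groupBy("segment")
  .sum("amount")
  .show()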

Given that we don't really need to worry about storage costs (at least in the cloud), and that computing power is so abundant, I conclude that hash join performance on strings is pretty much the same as on integers.

With that, have fun with Spark and I’ll see you at my next story.

