Data Engineering By a Pragmatist — Thou Shalt Not Skew (Part 2)

Boris Serebrinskiy
10 min read · Jan 17, 2022

The first part of “Thou Shalt Not Skew” Data Engineering by a Pragmatist — Thou Shalt Not Skew! (Part 1) | by Boris Serebrinskiy | Jan, 2022 | Medium contained questionable attempts to demonstrate examples of skewed data distribution in real-life situations — getting into a hockey game, finding a parking spot in Brooklyn, or voting for the new mayor of New York City. I want to declare it has been the most resounding… What’s the opposite of “success”?… “Meh”? This conclusion is based on hard data — 1. The viewership and readership are WAY, WAY down. 2. LinkedIn comments are inconsequential (complaints about Brooklyn parking… like that was the point of my blog — call 311, I say, or move to the West Coast! Or sell your car and laugh all the way to the bank, given where used car prices are nowadays). 3. I’ve received fewer “claps” on Medium than the guy who wrote a two-paragraph blog about his last night’s dreams. Incredible, thumbs up, people! I don’t want to sound like Jeb Bush, but he and I both mean it when we say “Please clap!” Jeb Bush to audience: ‘Please clap’ — YouTube

Clearly, based on these 3 points, I must press on and convert some of the resounding “success” of Part 1 into Part 2 — “how do you fight the data skew?” I may even bring some technical conversation into this lifeless blog, as I feel my fellow data engineers are falling asleep (I’m afraid the guy who blogged his dreams was one of my frustrated readers… damn him for getting more claps!). But there is an interesting thing happening — it seems some of my non-technical friends have become interested in learning what I do for a living — one even got to googling “data quality”, “referential integrity”, and “multidimensional modeling”. This is what I wanted to achieve — wake up the “pragmatist” in people, and maybe get them interested in “Data”! Amazing!

With that, let’s ratchet it up. Using the examples from Part 1 Data Engineering by a Pragmatist — Thou Shalt Not Skew! (Part 1) | by Boris Serebrinskiy | Jan, 2022 | Medium (which I reluctantly must recommend reading prior to this opus), we’ll jump into pragmatic solutions and, along the way, demonstrate the differences between a human mind and a computer. I’ll start with the most egregious example — the election voting place setup. Splitting voters by the first initial of their last name, by the first initial of the street they live on, or by any other “humans can easily understand this” model leads to a heavy data skew. For simplicity, we assume all desks are equal in size and each staffer processes an equal number of voters per hour. There are far more people in the United States, and in my voting district, with last names starting with an “S” than with a “Y”. OK, that’s fairly obvious. How do we fix it?

As a data architect, I can tell the election office chose the “Shared Nothing” model Shared-nothing architecture — Wikipedia. While they may not even realize that, the model is prone to two characteristics — a high processing rate (good!) and a high data skew (bad!) — and, most interestingly, these two often compete against each other. Imagine there were only one table where all voters would go — the processing would be quite slow, but there would be no skew at all, as everyone would line up in a single queue. But with 26 different desks, one for each letter of the English alphabet, the rate would be much better, though the “Y” desk would always be empty while the “S” desk is busy as hell. But how much better? Let’s say 5 out of every 26 people have a last name starting with “S”. The processing rate with 26 desks would not improve 26 times over a single desk — there will be a bottleneck at the “S” table. And, sure, you could add more people to that desk, but that might be difficult — what if each desk can only seat one person? In a “Shared Nothing” system the key to the best processing rate is to keep all “desks” equally busy (that property is not unique to SN systems, but it is a critical one).
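To put a rough number on that bottleneck, here is a back-of-the-envelope sketch in Python — the voter count is made up, and the “one time unit per voter” assumption is mine, purely for illustration:

```python
# Completion time in a shared-nothing setup is set by the busiest "desk".
total_voters = 26_000            # made-up number of voters
desks = 26                       # one desk per letter of the alphabet
s_share = 5 / 26                 # assume 5 out of every 26 last names start with "S"

single_desk_time = total_voters              # one time unit per voter, single desk
busiest_desk_load = total_voters * s_share   # the "S" desk gets the biggest pile
skewed_time = busiest_desk_load              # everyone else finishes earlier and waits

print(f"Ideal speedup with {desks} desks: {desks}x")
print(f"Speedup with the 'S' hot spot: {single_desk_time / skewed_time:.1f}x")
# -> roughly 5.2x, nowhere near 26x
```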

Let’s replace the desks-and-election-officials analogy with real computers, keeping the same “Shared Nothing” model. Now, the computers won’t be using the 1st initial of the last names; they are going to “scramble” each last name using logic called a “hash”, i.e. they will be “hashing” the names. Hash function — Wikipedia Each computer also knows its position, from 1 to 26, or as programmers “think” — from 0 to 25. By the way, we won’t even need all 26 computers, but that will become apparent later. There are many hashing functions, but they all do one thing — some arbitrary data gets converted into a shorter, fixed-size “hash” representing that data. Different data generally becomes different hashes; the same data always results in the same hash. Can different data end up as the same hash? Sure, and that’s called a hash collision, but for the sake of data skew it is not critical and is OK to have. Hash collisions are critical to cryptography but not so much to data engineering. A good topic in its own right, maybe for a later blog.

Let’s go through the examples. Taking my last name, Serebrinskiy, and hashing it with the SHA256 tool at https://www.tools4noobs.com/online_tools/hash/, the hash is afc841366f5c61d6ee55c9a01a00d75f91ef423a643c27a5e36a1f0747028a94. Note that the hash of the upper-cased SEREBRINSKIY is very different — 1cffdb3de8cc0061ca5eb8ea41385ff6c84f7883175f7ca54946a859e141177b. Even changing one letter produces massive moves in the hash — let me drop the “y” at the end of my name — Serebrinski — hashes to aadb79fd319d83ecf2053245af3592c41cbb569014db75bdece2fd3d9c10027b.
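If you’d rather skip the online tool, the same kind of digest can be computed with Python’s standard hashlib (a minimal sketch, nothing else assumed):

```python
import hashlib

# SHA-256 of each spelling; note how one changed or upper-cased letter
# produces a completely different digest.
for name in ["Serebrinskiy", "SEREBRINSKIY", "Serebrinski"]:
    print(name, "->", hashlib.sha256(name.encode("utf-8")).hexdigest())
```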

So, what good do these numbers do for us? Two more steps, please be patient. First, take the first four hex characters of the hash and convert them from hex to decimal — that number is the voter’s “bucket”. For my hash, “afc8” converts to 45000. Doing the same for two other “S” voters, Sandman and Sanders, gives buckets of 29717 and 40907. Second, remember that we have 26 computers, each numbered from 1 to 26 — or, as programmers “think”, from 0 to 25. Where should the voter with the bucket of 45000 go? Divide 45000 by the number of computers (26) and take the remainder. This is called a “modulo function”. Using https://www.calculators.org/math/modulo.php, 45000 mod 26 = 20, which means I have to go to the 21st computer (since the computers are numbered 0 to 25, remainder 20 is the 21st machine). Sandman is going to go to “29717 mod 26 = 25” — the 26th computer. Sanders is going to “40907 mod 26 = 9” — the 10th computer.
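Putting the whole routine together, here is a minimal Python sketch of the “hash, take a few hex characters, convert to decimal, then modulo” assignment. The helper name assign_computer is mine, not anything standard, and the exact bucket numbers you get depend on hashing exactly the same strings as above:

```python
import hashlib

def assign_computer(name: str, num_computers: int) -> int:
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    bucket = int(digest[:4], 16)          # first 4 hex characters -> 0..65535
    return bucket % num_computers         # 0-based computer index

for voter in ["Serebrinskiy", "Sandman", "Sanders"]:
    print(voter, "-> computer", assign_computer(voter, 26))
```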

Amazing! 3 people with the same initial now go to three different desks. Again, this is not random; this is well-defined math. Doing this randomly could work but would break data locality, a topic for one of my next blogs. Data locality is critical to processing data joins, and random task assignment would break it. Hashing, on the other hand, is not random and maintains data locality.

If I were to keep adding more voters (a.k.a. data tasks), their hashes would continue to distribute them approximately equally across all 26 computers. But why 26, you say? Great question — and the answer is that we can actually use any number of computers, 1 or 1,000! All that would change is what we put into our modulo function.

Let’s test this theory using just 3 computers. 45000 mod 3 = 0 — I go to the 1st computer. Sandman’s 29717 mod 3 = 2 — the 3rd computer. Sanders’ 40907 mod 3 = 2 — also the 3rd computer.

How about 1000 computers? 45000 mod 1000 = 0 — the 1st computer; 29717 mod 1000 = 717 — the 718th computer; 40907 mod 1000 = 907 — the 908th computer.
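A tiny sanity check of this arithmetic, using the three bucket numbers from the text:

```python
# Same three buckets, different cluster sizes; the remainder is the 0-based computer index.
buckets = {"Serebrinskiy": 45000, "Sandman": 29717, "Sanders": 40907}
for num_computers in (26, 3, 1000):
    print(num_computers, {name: b % num_computers for name, b in buckets.items()})
```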

And so on. Increase the number of hex characters you take from the hash and you can scale the number of computers as high as one can imagine, while maintaining a near-equal number of tasks for each computer. (Four hex characters only give you 65,536 distinct buckets; take more characters and the buckets keep spreading evenly across however many machines you have.)

The same algorithm can be applied to any data, e.g. car license plates — how do we distribute the processing of tolls or new car registrations? Quickly — a car with license plate “NYC123” hashes to aabd70d946fd8c81e74b41bfe98b9208f63b09e2feaa1a41085f9ddaa417372e, with “aabd” converting to 43709 in decimal. Assuming the Department of Motor Vehicles uses 6 computers to process the plates, 43709 mod 6 = 5, meaning it is the 6th computer’s responsibility. If, for example, two of the 6 computers break down, the function becomes 43709 mod 4 = 1, i.e. the 2nd computer has to handle it. Magic!
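The same calculation as a couple of lines of Python — the bucket value 43709 (“aabd”) is taken straight from the example above:

```python
plate_bucket = 0xAABD                  # = 43709, from the hash of "NYC123"
for alive_computers in (6, 4):         # 6 healthy machines, then two break down
    print(alive_computers, "computers ->", plate_bucket % alive_computers)
# -> 5 (the 6th computer), then 1 (the 2nd computer)
```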

There is some danger in too many people having the same last name and all ending up on the same computer. That is an easy problem to solve — just mix something else into the hash input — date of birth, address, etc. The beauty of the “hash and mod” approach is that it always results in a number pointing to the computer doing the work. That number is called a “data partition”, and the process “data partitioning”.
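As a rough illustration of mixing an extra field into the hash input — the field choice and the formatting of the composite key here are just my assumption, not a standard:

```python
import hashlib

def assign_computer_composite(last_name: str, dob: str, num_computers: int) -> int:
    key = f"{last_name}|{dob}"                       # composite key: name + date of birth
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest[:4], 16) % num_computers       # same "hash and mod" as before

print(assign_computer_composite("Smith", "1980-01-15", 26))
print(assign_computer_composite("Smith", "1992-07-03", 26))   # same name, most likely a different computer
```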

By the way, in my last blog I mentioned that a supermarket next to my house uses a single lane for all its customers, and someone directs them to the next available cashier. This is, by far, a much simpler model — no hashing, no modulo, and you can add or remove computers (“cashiers”) as you wish. So, why not use it? Well, for starters, the guy directing everyone is a “bottleneck”, and if he were, say, distracted by texting his girlfriend, the queue would stall. In computer terms, it is no longer a “Shared Nothing” architecture, and it is prone to such bottlenecks. It especially starts to break down when the number of computers gets really high, in the hundreds or thousands. Imagine the poor guy having to direct customers to a thousand cashiers. He would be spinning like a hamster on a wheel. Even if it took him only a second to identify the next available cashier, that’s just 60 customers per minute. With a thousand cashiers spending an average of 5 minutes per customer, the system could theoretically move 1000/5 = 200 customers per minute. But the bottleneck only does 60 assignments per minute, so the system will never reach its maximum rate of 200. What’s worse, it is also very difficult to scale up — if you wanted the bottleneck to work faster, you could put 2 guys on the job, but then they would have to coordinate which cashier they send people to. That’s a tough problem.

This is why most high-end data engineering uses hashing.

Hashing is math-driven and requires very little computer-to-computer coordination. The absence of such coordination allows for very high processing speed. You know, like in real life — the less you need to talk to anyone, the faster you can get your job done.

Hash-based partitioning is not the only way to combat the infamous data skew, but it is one of the most widely used. It has drawbacks, and those are worth their own blog story, which I will embark on in my “processing vs on-disk data skew” blog. One of the obvious challenges is that whenever the number of computers changes, all assignments must shift around to balance out the work. To alleviate this issue, there has been research into “sticky” or “stable” hashing algorithms — usually called consistent hashing — that attempt to preserve assignments during rebalancing.
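For the curious, here is a minimal sketch of that idea — the computer names are made up and real implementations are considerably more sophisticated. Each computer is placed at many points on a “ring” of hash values; a key goes to the first computer found clockwise from the key’s own position, so removing one computer only moves the keys that sat in its slice of the ring:

```python
import bisect
import hashlib

def point(s: str) -> int:
    # Position of a string on the ring (first 8 hex chars of its SHA-256).
    return int(hashlib.sha256(s.encode("utf-8")).hexdigest()[:8], 16)

class HashRing:
    def __init__(self, computers, replicas=100):
        # Each computer gets `replicas` points on the ring to smooth the distribution.
        self.ring = sorted((point(f"{c}#{i}"), c) for c in computers for i in range(replicas))
        self.points = [p for p, _ in self.ring]

    def assign(self, key: str) -> str:
        # First computer clockwise from the key's position (wrapping around).
        idx = bisect.bisect(self.points, point(key)) % len(self.ring)
        return self.ring[idx][1]

ring6 = HashRing([f"computer-{i}" for i in range(6)])
ring5 = HashRing([f"computer-{i}" for i in range(5)])     # one computer removed
for name in ["Serebrinskiy", "Sandman", "Sanders"]:
    print(name, ring6.assign(name), "->", ring5.assign(name))
# Most keys keep their computer; only those that belonged to the removed one move.
```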

OK, enough of the dry talk. Let’s look at some real data. I am pulling down the full FIFA roster of 18,728 soccer (sorry, football :) players ( FIFA 20 complete player dataset | Kaggle courtesy of Stefano Leone | Expert | Kaggle ).

I am using the most practical tool of any data engineer for Big-ish Data — Excel :) And, yes, I know and use Python too.

In 5-ish minutes I got this:

  1. A graph of the 1st initial of each player’s name, with a count per initial.

We see huge spikes in As, Js and Ms.

WHAT HAPPENED TO ALL THE Q GUYS!?!

Distribution By 1st initial — one bucket per initial

The skew is unbelievably bad. Compare the highest bucket to the lowest and the counts differ by a factor of thousands. If the players had to go vote for NYC mayor, some of them would still be at it, while the “Q” guys would be long done! Poor Js and Ms :(

2. Now with the hashing added — I’m adding a few steps. Feature engineering is fun!

The 2nd graph shows what happens if we hash and modulo the names. I’ve played a bit with the code from Is there an Excel function to create a hash value? — Super User and Password hash function for Excel VBA — Stack Overflow

Step 1. Add SHA1 hash

Step 2. Take 5 left most hex digits

Step 3. Convert to decimal

Step 4. Take the modulo, assuming 30 computers running the job (just to keep it close to the 1st graph above). A rough Python version of these four steps is sketched below.
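The sketch, with stand-in player names — the real list is the Kaggle FIFA roster linked earlier:

```python
import hashlib
from collections import Counter

def partition(name: str, num_computers: int = 30) -> int:
    sha1 = hashlib.sha1(name.encode("utf-8")).hexdigest()   # Step 1: SHA1 hash
    bucket = int(sha1[:5], 16)                               # Steps 2-3: 5 hex digits -> decimal
    return bucket % num_computers                            # Step 4: modulo

players = ["L. Messi", "Cristiano Ronaldo", "Neymar Jr"]     # replace with the full roster
print(Counter(partition(p) for p in players))
# With ~18,000 real names, each of the 30 partitions lands close to 600 players.
```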

And here is the result — a near-even distribution across data partitions 0 through 29. The skew is barely noticeable, less than 10% between the min and max partitions. 18,000+ players across 30 computers should average about 600 per machine. We got exactly that.

hashed and modulo over 30 data partitions

Let’s increase the number of computers — 100 now! And they are all equally busy, with 150–200 players per machine.

hashed over 100 data partitions

Everything continues to work nicely! But the skew has increased to about 25%. Still OK though.

On this note, I think I’ve made all the points I wanted. In the next blog entry I am going to talk about how database joins are affected by skew and what to look out for. I will explain what database joins are as well — don’t worry. This is going to be fun!


Boris Serebrinskiy

I love sharing my skills and experiences. Happy to get on a good hike with my 2 sons — too bad they hate hiking. My wife loves dancing, and I don't. Happy family!