Data Engineering by a Pragmatist — Thou Shalt Not Skew! (Part 1) | by Boris Serebrinskiy

6 min read · Jan 9, 2022

I’m going to go out on a limb and claim that Data Engineering is an art. The ability to find brilliant solutions and discover them in unconventional ways is nothing short of the creativity of an artist. But I digress — so, if art imitates life, and data engineering is … an art — thus data engineering imitates life (of course, I am joking, all you math gurus, please sheathe your pitchforks :)

On that transitive and somewhat ridiculous derivation, let’s keep going. As I start this, my 3rd, blog (which will come in 2 parts) I want to get into one of the most obscure and confusing subjects of Big Data processing — the “data skew”. If you followed my last blog (bless the souls of the 267 people who spent an average of 5 minutes 12 seconds reading https://silverbob.medium.com/data-engineering-by-a-pragmatist-tales-of-choosing-a-database-5f9f2f3cc1ab , and my profound respect to the few who wisely chose to close their browser and delete all its history instead of reading that nonsense) — you may recall I went to the Rangers game with my two sons. That event will become pertinent to the subject of the skew.

You ever wonder why there is no parking on your street, and you have to drive in ever-expanding circles to find a spot? It isn’t even a circle, it is more like a drunken monkey Google Maps torture by the time you’ve succeeded. You don’t have this problem? But for us Brooklynites, that’s daily life — pack 2.7 million people into 70 square miles, with a predictable result — the 2nd most densely populated “parking lot” in the entire United States (the top prize goes to Manhattan, of course, but Brooklynites own SO MANY MORE cars). In one sentence — distribution gone wrong! Weirder stuff still — my street is always fully packed with cars, though all the houses have driveways, and the owners keep their cars off the street. Walk a block or two and you see almost no cars on the next streets over — what gives?! Walk a dozen more blocks and the streets are full again. Don’t people know how to distribute their cars evenly? The mystery is a pretty easy one to solve — my street borders a very large residential, business, and arguably famous area known as “Little Odessa” or, officially, Brighton Beach, while the other side of the neighborhood pushes against the fence of Kingsborough Community College, where many students are thankfully taking public transportation, but not all.

Why am I telling you all this (especially if you don’t live in a big city or are lucky enough not to own a car)? Two things I pick up from all the “data” above — firstly, in life, distribution is often uneven (that one is painfully obvious as I look at my bank account and then read news about Jeff or Elon :), and secondly, locality matters — people (or cars) like to park at the nearest spot. Notice, I did not say “distribution is unfair” — this is not a social equality blog, but a data engineering one. So, how do these two very profound observations matter for the pragmatists that we are? And what is “data skew”? Data skew is, simply put, uneven distribution of data.
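For the code-minded, here is one way to put a number on “uneven”. This is only a minimal sketch of a common rule of thumb (busiest partition versus the average one), not a standard definition. The function name `skew_ratio` and the car counts are made up for illustration; nobody actually counted the cars on my street.

```python
from statistics import mean

def skew_ratio(partition_sizes):
    """Largest partition divided by the average partition size.

    1.0 means perfectly even; the bigger the number, the worse the skew.
    """
    sizes = list(partition_sizes)
    if not sizes or sum(sizes) == 0:
        raise ValueError("need at least one non-empty partition")
    return max(sizes) / mean(sizes)

# Hypothetical car counts per street (illustrative numbers only).
even_streets = [20, 20, 20, 20]
my_neighborhood = [60, 5, 5, 10]

print(skew_ratio(even_streets))     # 1.0
print(skew_ratio(my_neighborhood))  # 3.0
```

Same 80 cars in both cases; in the second one a single street carries three times its fair share.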

Life example #1 — We are back at the Rangers game. There are four types of lines (or as they say on the island(s) of Albion — “queues”) you have to pass before you make it to your seats.

Queue #1 — they check your vaccination status. There are several staffers, and you can approach any. Once you enter a particular worker queue, you cannot get out and must wait your turn. “Worker queue” — we are finally getting some real convergence of life and data engineering terms here :) OK, identifiable bottleneck — some people don’t have their vaccine card ready and stall the line. Sucks, but OK. The three of us each took a separate lane, and we were through.

Queue #2 — they put us through metal detectors. There are dozens of those, and you can jump from one lane to another if you see one stalled by some fan of Metallica or Black Sabbath. This is a really efficient system, we fly through.

Queue #3 — a.k.a. disaster! A.k.a. ticket scanning. A.k.a. people think Madison Square Garden technology equals Apple Pay, equals “I’ll just wave my phone next to the scanner and magic…”. Identifiable bottleneck — though there are quite a few lanes, many people travel in groups (like the 3 of us) and have to scan the tickets together if one person bought them all, like I did. This is called “data locality” (keep this term handy). Not only are the scanners faulty (read — “poor processing rate”), but once you get stuck behind a large group all you can do is curse loudly (socially accepted behavior at MSG). So, very inefficient. I should have split my electronic tickets into 3. Too late.

Queue #4 — escalators to the upper seats. There is only one at our gate, so everyone has to pile into that — there is no distribution here; we are in single-threaded mode, so just sit and wait your turn. Sucks.
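The difference between Queue #1 (pick a lane and stay in it) and Queue #2 (jump to whichever lane is moving) can be sketched in a few lines. Everything here is an assumption for illustration: the function names, the service times in seconds, and the simplification that Queue #1 deals people to lanes round-robin.

```python
import heapq

def pinned_lanes(service_times, lanes):
    """Queue #1 style: each person is dealt to a lane up front (round-robin)
    and must stay there. Returns the time at which the last lane finishes."""
    totals = [0] * lanes
    for i, t in enumerate(service_times):
        totals[i % lanes] += t
    return max(totals)

def next_free_lane(service_times, lanes):
    """Queue #2 style: each person walks to whichever lane frees up first.
    Returns the time at which the last person is done."""
    free_at = [0] * lanes  # min-heap of the times at which each lane is free
    heapq.heapify(free_at)
    finish = 0
    for t in service_times:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + t)
        finish = max(finish, start + t)
    return finish

# Hypothetical crowd: three slow fans (30 s each) among quick ones (1 s each).
# Under round-robin all three slow fans land in the same lane.
crowd = [30, 1, 1, 30, 1, 1, 30, 1, 1]

print(pinned_lanes(crowd, lanes=3))    # 90
print(next_free_lane(crowd, lanes=3))  # 33
```

Same people, same three lanes; the only thing that changed is whether a stalled lane can hold everyone behind it hostage.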

Sidebar: Single Threaded vs SMP vs MPP processing. Briefly — we can’t talk seriously about modern data engineering without at least some basic acronyms. Single-threaded means you have one queue, and every piece of work is lined up on it — that’s our escalator. SMP — symmetric multiprocessor system — many queues and many workers checking these queues, but all staying together within the confines of some area (e.g., your personal computer or one gate at MSG). That’s technically queues #1, #2, and #3 above. MPP — massively parallel processing — imagine Madison Square Garden with 6 separate entrances or gates — I only described one, but together they may look like 6 separate computers working together — that’s MPP-ish. In a real MPP system people or tasks from one gate can instantly jump to another gate. In modern data processing we may use hundreds of servers to work on a single problem.
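Here is a rough feel for the first two acronyms in code. It is a toy sketch under loud assumptions: `scan_ticket` is a made-up function, the sleep stands in for a slow scanner, and a thread pool on one machine is only SMP-ish (threads in Python help mostly when the work is waiting, as it is here). MPP would need several machines, which a single snippet can’t show.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def scan_ticket(ticket_id):
    """Pretend to scan one ticket; the sleep stands in for a slow scanner."""
    time.sleep(0.01)
    return f"ticket-{ticket_id}: ok"

tickets = range(20)

# Single-threaded: one escalator, everyone waits their turn.
single = [scan_ticket(t) for t in tickets]

# SMP-ish: several workers on one machine pulling from one shared queue.
with ThreadPoolExecutor(max_workers=4) as gate:
    pooled = list(gate.map(scan_ticket, tickets))

# Same answers in the same order, just less standing around.
assert single == pooled
```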

So, the whole MSG building is operating using queues, distribution of workers, locality of ticket holders and their friends and families. G-d knows what other things go on behind the ticket scanners (I really need to talk to their people, those things are awful), food concessions, etc. — they all use distribution. The more uneven this distribution is, the less efficient the system is, resulting in performance bottlenecks. In a pragmatist’s view — I can’t get my beer and I miss the singing of the national anthem and the start of the game.

Life example #2 — I’ll be short this time. Presidential election, November 2020. New York City passes a law allowing up to 10 days of early in-person voting, so I go a week before the final Election Day. I see many distribution and locality issues — firstly, I have to go to a school a mile away instead of walking to the nearest one. Secondly, the staffers split all voter rolls by the first initial of the last name. NOOOOO! This is wrong — there are a lot more people with the initial “S” (me) than with the letter “X”. And while the staffers bunch a few of the seldom-used letters, such as “X”, “Y”, and “Z”, into one lane, it is still not efficient! But they can’t do anything — the rolls are printed and have been “co-located” (remember this word!) on the staffers’ desks. What they should have done is put computers on each desk and allow people to join a random lane!!! This would then become a non-localized SMP model, which I will explain in Part 2 of the blog. Instead, I am standing in a line of 5–6 people to the poor “S” lady while half the desks are empty.
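The voter-roll problem is easy to reproduce. In this sketch the names are invented (apart from mine), the helper names are hypothetical, and two simplifications are assumed: the letter-to-desk mapping is a plain modulo rather than the hand-picked letter ranges the staffers used, and round-robin stands in for “join a random lane” so the output is repeatable.

```python
from collections import Counter

def lane_by_initial(last_name, lanes):
    """Voter-roll style: the first letter of the last name picks the desk."""
    return (ord(last_name[0].upper()) - ord("A")) % lanes

def lane_round_robin(position, lanes):
    """Any-desk style: the i-th arrival goes to desk i mod lanes."""
    return position % lanes

# Hypothetical voters, deliberately heavy on "S".
voters = ["Smith", "Sanchez", "Serebrinskiy", "Singh",
          "Stein", "Suzuki", "Adams", "Xu"]
lanes = 4

by_initial = Counter(lane_by_initial(v, lanes) for v in voters)
round_robin = Counter(lane_round_robin(i, lanes) for i, _ in enumerate(voters))

print(max(by_initial.values()))   # 6  (the poor "S" desk)
print(max(round_robin.values()))  # 2
```

The key you split on decides the skew: a key that follows the data (last names) inherits its lumps, while a key that ignores the data (arrival order) spreads the load.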

And finally, Life example #3 — my local supermarket, called “Tashkent” (the capital of Uzbekistan). The model of efficiency! 15 cashiers. One queue. There is a guy directing each shopper to the next available cashier. The store moves massive volumes every hour. The only problem — they have another store, and when this one is busy, the other store can’t help it. That is because this is an SMP model, and in SMP two computers can’t help each other. You need MPP for that.
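A back-of-the-envelope version of the two-store problem, as a sketch only. The shopper counts and function names are hypothetical, each cashier is assumed to handle one shopper per “round”, and the pooled case pretends shoppers can move between stores instantly and for free (real MPP systems pay for moving work around).

```python
import math

def isolated_rounds(shoppers_per_store, cashiers_per_store):
    """Each store serves only its own line; the busiest store sets the pace."""
    return max(math.ceil(s / cashiers_per_store) for s in shoppers_per_store)

def pooled_rounds(shoppers_per_store, cashiers_per_store):
    """Idealized MPP: all cashiers in all stores share one big line."""
    total_cashiers = cashiers_per_store * len(shoppers_per_store)
    return math.ceil(sum(shoppers_per_store) / total_cashiers)

# Hypothetical rush: one store slammed, the other empty.
shoppers = [90, 0]

print(isolated_rounds(shoppers, 15))  # 6
print(pooled_rounds(shoppers, 15))    # 3
```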

What have we learned? Distribution, or the lack of it (the skew), is critical to efficient data processing. “Thou Shalt Not Skew”, then, if Thou Wantest Efficient Data Processing!

On how to do that in real life data space we shall talk in Part 2. Best wishes, see you next time!

P.S. Part 2 is published here: Data Engineering By a Pragmatist — Thou Shalt Not Skew (Part 2) (Boris Serebrinskiy, Jan 2022, Medium)

Written by Boris Serebrinskiy