How Joins affect Data Skew — Data Engineering by a Pragmatist

Boris Serebrinskiy
8 min read · Jan 23, 2022

In my two-part mini-series "Thou Shalt Not Skew"
Stats for Data Engineering by a Pragmatist — Thou Shalt Not Skew! (Part 1) (medium.com) and

Stats for Data Engineering By a Pragmatist — Thou Shalt Not Skew (Part 2) (medium.com) I tried to explain how best to distribute data to keep a parallel processing system running at peak efficiency. The key was to spread information evenly across the processing units (computers or humans), specifically by applying a hashing algorithm. After writing those posts, I found a great visual for this at Hashing in Distributed Systems — GeeksforGeeks.
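As a quick refresher, here is a minimal Python sketch of that idea. The row data and node count are made up for illustration, and Python's built-in hash() stands in for whatever hash function the engine actually uses:

# Minimal sketch of hash-based data distribution (illustrative only).
# Each row is assigned to a processing node by hashing its key and
# taking the modulo over the number of nodes.
NUM_NODES = 4

def node_for(key) -> int:
    return hash(key) % NUM_NODES  # hash() stands in for the engine's hash

rows = [("alice", 10), ("bob", 20), ("carol", 30), ("dave", 40)]
nodes = {n: [] for n in range(NUM_NODES)}
for key, value in rows:
    nodes[node_for(key)].append((key, value))

for n, bucket in nodes.items():
    print(f"node {n}: {bucket}")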

What I showed, though, was a very simple system with just one data set and no joins.

Joins, of course, are the bread and butter of relational theory (Relational algebra — Wikipedia), and any RDBMS worth its salt supports them. There is a fascinating side to how relational databases implement the underlying join algorithms when they process join operators, but I want to focus this conversation on parallel processing of joins by MPP systems and the concept of join skew.

Sidebar #1: In so-called "Shared Nothing" ("Independent Compute over Independent Storage") systems (Shared-nothing architecture — Wikipedia), eliminating data skew is absolutely critical to keeping each node's local storage equally allocated and each machine equally busy. This is because there is a finite amount of storage attached to each processing unit; if there is data skew, storage may fill up on some of the nodes while others sit underutilized. This is not the case in systems known as "Independent Compute over Shared Storage" (or Shared Disk), where all storage is accessible to all compute nodes. Separation of compute and storage has been heralded as one of the major features of modern databases. Such systems are often found in cloud computing and use high-scale blob storage. The approach has its drawbacks, though, specifically around high-speed OLTP workloads: multiple readers can happily read the same storage, but multiple competing writers have to cross-coordinate and run slower compared to Shared Nothing engines, where that coordination is not required. Blob storage is also quite slow compared to locally attached disks. Logically, putting a network between the storage and compute layers also adds latency and requires investments into caching layers, with yet more coordination required. See Difference between Shared Nothing Architecture and Shared Disk Architecture — GeeksforGeeks for other interesting details.

Let me now distinguish between "storage" and "join" skew. Storage skew is a static type of skew; it can be seen by looking at how much disk space has been utilized by each node in an MPP system. Join skew occurs only within queries with joins: it leads to one or more servers running "hot", and queries run longer than expected because some machines have to process more data. Many data engineers focus only on storage skew when they lay out their tables: they pick which columns to hash their tables by, to maintain a near-even data distribution across all the machines. Well, let me give you the good news and the bad news. First, the good news: modern cloud databases based on shared storage don't require this technique anymore (see the sidebar above). Now, the bad news: all parallel processing relational engines suffer from join skew, every one of them. I've had to explain to very seasoned data engineers and DBAs how and why, despite their valiant efforts to maintain zero-skew data in their tables, their queries still exhibited signs of inefficient skewing and ran slower than expected.

Let me explain how a join is executed by a parallel query engine. We follow 3 simple rules:

Rule #1 — a join is a type of compute operator and is executed within CPU/memory, not storage.

Rule #2 — a join matches two or more pieces of data by evaluating a "join condition" between tables (typically).

Rule #3 (very important!) — a join cannot be computed unless the data is co-located on the same computer.

It is Rule #3 that brings us to the topic of join skew. A database engine cannot figure out whether a join condition is true or false until it brings the data from both tables onto the same computer. Sounds simple, right? Take this query for example — find the SUM of all orders grouped by CustomerID:

SELECT Customers.CustomerID, SUM(Orders.OrderQuantity)
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
GROUP BY Customers.CustomerID

In this query, the database engine will attempt to satisfy Rule #3 by "moving" all customers and orders with the same CustomerID onto the same CPU and then running the INNER JOIN equality check. That is easy if there is one computer, but what happens if you have 10 or 100? We want to execute the query in parallel on multiple computers, with data residing on different pieces of storage. Clearly, we can't have all the data moved into the memory of a single machine.

Sidebar #2: All relational query systems employ a complex piece of software called a "query optimizer" or "query compiler". It is tasked with creating a plan for how to execute queries: which operators to run first, which second, and so on. In the query sample above, the optimizer would have to choose which table to move to which computer, whether to do the JOIN before the SUM, and when to do the GROUP BY. Parallel query engines are VERY complex and employ many statistical techniques to try to hit the best possible execution path. In this blog I will be referencing modern, advanced optimizers built for analytical workloads. Such workloads are known for routinely involving multiple joins per query.

Let’s put some numbers on the table. Assume we have 100 million orders for 1 million customers.

Problem #1 for the optimizer: how to distribute the work equally across many machines? One of the things the optimizer will attempt is what's called a "hash" join. It will hash each side of the INNER JOIN (the CustomerID fields) and send each row to the machine that corresponds to the modulo of the hash (I've described this process in my previous blog).

Here is what will happen if the optimizer goes for a hash join (a minimal sketch in code follows the list):

  • Hashes of identical values are identical and result in equal modulos, thus:
  • Hash all values of the Customers.CustomerID column and send the data to individual processing nodes according to the hash of CustomerID
  • Hash all values of the Orders.CustomerID column and send the data to individual processing nodes according to the hash of CustomerID
  • We have achieved co-location — the engine can now compare actual values and identify INNER JOIN matches.
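To make those steps concrete, here is a minimal Python sketch of the redistribution. The tiny tables, the node count, and Python's hash() are all stand-ins for the engine's internals, not any particular implementation:

# Sketch: hash-redistribute both sides of a join on CustomerID,
# then join locally on each node. Illustrative, not an engine's code.
NUM_NODES = 4

customers = [(1, "Ann"), (2, "Boris"), (3, "Chen")]            # (CustomerID, Name)
orders = [(101, 1, 5), (102, 2, 7), (103, 2, 3), (104, 3, 9)]  # (OrderID, CustomerID, Qty)

def node_for(customer_id: int) -> int:
    return hash(customer_id) % NUM_NODES  # same modulo for both tables

# Steps 1-2: send each row to the node chosen by the hash of CustomerID
cust_on_node = {n: [] for n in range(NUM_NODES)}
ord_on_node = {n: [] for n in range(NUM_NODES)}
for c in customers:
    cust_on_node[node_for(c[0])].append(c)
for o in orders:
    ord_on_node[node_for(o[1])].append(o)

# Step 3: rows with equal CustomerID are now co-located, so each node
# can evaluate the INNER JOIN condition purely locally.
for n in range(NUM_NODES):
    for cid, name in cust_on_node[n]:
        for oid, ocid, qty in ord_on_node[n]:
            if cid == ocid:
                print(f"node {n}: {name} matched order {oid} (qty {qty})")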

Note: all this distribution happens only as a transient copy of the data; nothing is written to disk on any of the computers. The data is simply read into memory and sent to another machine over the network.

OK, so far so good. But what if one (or a small minority) of your customers has been placing a lot of orders? Imagine the following distribution of order counts by customer.

Each customer, except for Boris, has just a few orders, but Boris has 50 million orders out of 100 million. When the optimizer chooses to hash by CustomerID, it will have no choice but to send all of Boris's orders to a single computer, and that node will run for much, much longer than the other nodes, potentially never completing the query. In a system with, say, 40 nodes, only one will hold all of Boris's orders. The optimizer has made an awful decision and created join skew!
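You can see the effect in a few lines of Python. The numbers below are scaled down from the example, and the hash function is again a stand-in:

# Sketch: count how many order rows each node receives when one
# customer ("Boris") dominates the data set. Illustrative numbers only.
from collections import Counter

NUM_NODES = 40
orders_per_customer = {f"customer_{i}": 3 for i in range(1_000)}
orders_per_customer["Boris"] = 50_000  # one heavily skewed key

rows_per_node = Counter()
for customer, order_count in orders_per_customer.items():
    # every order for a given customer hashes to the same node
    rows_per_node[hash(customer) % NUM_NODES] += order_count

print(rows_per_node.most_common(3))  # one node carries nearly all the work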

I want to emphasize: this has nothing to do with the data distribution on disk. It does not matter how you laid out your tables or which distribution keys you used to store the data. Only the join keys matter.

The operation of moving data between processing nodes based on hashes of the join key is called "distribution" (or "redistribution"), as the engine is temporarily redistributing data over its interconnect network.

The situation gets even more complex when the query has 2, 3, 4, or more joins. This blog does not aim to explain how the optimizer operates in such cases, but suffice it to say, it needs to decide which joins to execute first. Here is a basic example — find the SUM of customers' orders, grouped by CustomerID and the CategoryName of the product:

SELECT Customers.CustomerID, Categories.CategoryName, SUM(Orders.OrderQuantity)
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
INNER JOIN Products ON Orders.ProductID = Products.ProductID
INNER JOIN Categories ON Categories.CategoryID = Products.CategoryID
GROUP BY Customers.CustomerID, Categories.CategoryName

Does the optimizer run the join between Products and Categories first, or does it join Orders to Products first? Every one of these joins can lead to join skew, depending on the data. And, again, that skew is independent of how the data is hashed on disk. Queries with 10–20 joins may take several minutes (!!!) to compile (not execute, just compile!) as the optimizer searches for the best solution.
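The size of the search space is the reason compilation can take that long. As a rough, back-of-the-envelope illustration (real optimizers prune aggressively rather than enumerate by brute force):

# Rough illustration: the number of left-deep join orderings of n tables
# grows factorially, which is why optimizers need pruning heuristics.
from math import factorial

for n in (4, 8, 12, 16, 20):
    print(f"{n} tables: {factorial(n):,} left-deep join orderings")
# 20 tables: 2,432,902,008,176,640,000 left-deep join orderings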

To combat this join skew, optimizers employ a technique called "broadcasting". Using special heuristics (table statistics, data distribution analysis, etc.), an optimizer may choose to broadcast smaller tables to every node in the system. Not redistribute, but broadcast: the Customers table is likely to be replicated fully, all one million records, to every computer in the system. Here is what will happen then (a sketch in code follows the list):

  • The engine will not use a hashing algorithm for the join
  • The engine will distribute the Orders table randomly and evenly across all computers
  • The engine will broadcast the Customers table, in full, to every computer
  • The engine will run the INNER JOIN equality check and discard all mismatches
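Here is a minimal Python sketch of the broadcast strategy, using the same hypothetical tables as in the hash join sketch above. Round-robin stands in for the engine's random, even distribution:

# Sketch: broadcast join. The small table is copied to every node; the
# large table is spread evenly with no hashing at all. Illustrative only.
NUM_NODES = 4

customers = [(1, "Ann"), (2, "Boris"), (3, "Chen")]            # small side
orders = [(101, 1, 5), (102, 2, 7), (103, 2, 3), (104, 3, 9)]  # large side

# Orders are distributed evenly; round-robin stands in for random placement
ord_on_node = {n: [] for n in range(NUM_NODES)}
for i, o in enumerate(orders):
    ord_on_node[i % NUM_NODES].append(o)

# Customers are broadcast: every node receives a full copy
cust_on_node = {n: list(customers) for n in range(NUM_NODES)}

# Each node checks the join condition locally and discards mismatches
for n in range(NUM_NODES):
    for cid, name in cust_on_node[n]:
        for oid, ocid, qty in ord_on_node[n]:
            if cid == ocid:
                print(f"node {n}: {name} matched order {oid} (qty {qty})")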

Broadcasting is not without its flaws and cannot be used in all cases. In general, it can only be done for small tables, as it consumes network bandwidth and scales poorly on larger systems. Also, the optimizer has to make a decision: do I broadcast Table A "toward" Table B, or do I do it in reverse? It gets really complicated with 10–20-join queries. If the optimizer makes a mistake and broadcasts a very large table, it will saturate the network bandwidth between the nodes.

Sidebar #3: Does join skew exist on SMP machines (single-computer data systems)? Yes, it does, but it shows up as different threads/CPU cores having to do different amounts of work. Such skew is difficult to identify and is typically ignored.

Lastly, I’d like to make a note that joins are not the only operators causing skew. GROUP BYs and other group base aggregators will do so as well.

In this example:

SELECT Customers.CustomerID, SUM(Orders.OrderQuantity)
FROM Customers
INNER JOIN Orders ON Customers.CustomerID = Orders.CustomerID
GROUP BY Customers.CustomerID

Even if the Customers table was broadcast, the GROUP BY operator still requires all data for a given customer to be co-located. Not for the purposes of the join, but for the SUM operator. And though one might say, "but I can run SUM on many computers and then roll the results up on one" (because a SUM of SUMs is fully additive), this does not work for all operators; COUNT DISTINCT, for example, is not a fully additive operation. This is a more advanced topic, perhaps for another blog.
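A few lines of Python show the difference. A SUM of per-node SUMs equals the global SUM, but a naive roll-up of per-node COUNT DISTINCTs overcounts values that appear on more than one node:

# Sketch: SUM is fully additive across nodes; COUNT DISTINCT is not.
node_a = [1, 2, 2, 3]   # values seen on one node
node_b = [3, 3, 4]      # values seen on another node

# SUM: sum of per-node sums equals the global sum - safe to roll up
assert sum(node_a) + sum(node_b) == sum(node_a + node_b)

# COUNT DISTINCT: the value 3 lives on both nodes, so a naive
# roll-up of per-node distinct counts is wrong
naive = len(set(node_a)) + len(set(node_b))   # 3 + 2 = 5
true_count = len(set(node_a + node_b))        # {1, 2, 3, 4} -> 4
print(naive, true_count)  # 5 4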

In conclusion: I personally call "join skew" "processing skew" ("runtime" skew, if you will) to highlight the more general nature of this phenomenon, rather than limiting it to joins. I've seen it occur in window functions, in rollups, in cubes, etc. The fundamental idea I wanted to convey in this blog is that processing skew is independent of storage skew, and it can be a confusing subject. Separation of compute and storage does not solve it. Not every query can be made completely skew-free. Optimizers make mistakes, and there is no substitute for careful query plan analysis by data engineers.

With that, thank you and till the next time!

