Easy distributed joins with Pachyderm
Distributing your dataset/database joins can be a daunting task, to say the least. Not only do you need to think about how your data is sharded, indexed, etc., you need to think about what types and sizes of resources you need to allocate based on the scale of your data. For many, these considerations cause them to retreat to brute forcing their data transformations on increasingly beefy boxes.
However, data engineers and data scientists should be able to use the tooling they are familiar with, such as Python pandas or simple SQL, to perform data transformations, and still be able to seamlessly distribute that processing on a cluster. Take, for example, the following join:
An example join
Let’s imagine that we have data in two data sets called
events. As the name implies, the
users data set includes some data about our users:
username, occupation, birthday
bob1234, dentist, 07/01/1972
sallymcgee, software engineer, 03/28/88
events data set includes some data about things our users have done in our fictitious system:
timestamp, username, event_type
2017–01–15 03:02:11, sallymcgee, refund
2017–01–30 14:33:52, bob1234, purchase
If we wanted to fill in the rest of the user information for each of the events in
events, we could join
username. For our purposes, let’s imagine that we have a simple python script
join.py (see here for a concrete example script) that performs this join and outputs:
timestamp, username, event_type, occupation, birthday
2017–01–15 03:02:11, sallymcgee, refund, software engineer, 03/28/88
2017–01–30 14:33:52, bob1234, purchase, dentist, 07/01/1972
Distributing the join
Naively applying this simple python script to large data sets at production scale will result in problems. We are potentially pulling both data sets fully into memory and creating a third data set that is the combination. That being said, maintaining the simplicity of the above approach would be wonderful if it could be distributed across our data.
With Pachyderm, we can easily create a data pipeline with a single worker that executes our python script, performing the join, and outputs the joined data:
However, we can also instruct Pachyderm to increase the parallelism of our data pipeline (which is controlled in a “pipeline specification”). When we increase the parallelism, Pachyderm will begin to distribute the join over multiple workers, each of the workers still running our same simple python script:
Pachyderm is smart enough to know how to distribute the data in
events so that you get the same joined results, without having to modify your naive join logic! Under the hood, Pachyderm will partition the data sets and expose portions of the data to each worker:
In all of these scenarios (1 worker, 2 workers, etc.), every combination of the
users data and
events data is considered by at least one worker, so we don’t miss any joins. This logic of extends seamlessly to multi-table joins as well. In which case, the diagrams above would be 3-dimensional (for three data sets), 4-dimensional (for four data sets), etc.
Also note, here we have considered only one type of join, but there are many different types (inner, left, right, etc.). For many of these joins, Pachyderm offers additional optimizations that are discussed further in our docs.
It’s as easy as that! Write the code you want to write, and let Pachyderm think about how to split up the data and distribute the transformation.
Be sure to: