How to build robust data pipelines in the Big Data ecosystem
Building a data pipeline with a traditional ETL tool is completely different from developing one in the Big Data ecosystem. One reason for this is the vast number of tools, technologies and components at our disposal. Go ahead, try to name and define each and every component in the Hadoop landscape (Hive, Sqoop, Spark…) and you will quickly realize what I mean.
Qualities of a robust pipeline
Just to set the context for this topic, we will define some of the important properties of a robust pipeline.
- Should process only required data
- Should have checkpoints
- Should be very very (not a typo) well documented
Overlooking the above points makes your work more an exercise in loading data than in building a data pipeline.
Let's go over the points in detail:
Processing only required data: With parallel and distributed architectures everywhere, be it Apache Spark or cloud services like Google BigQuery and Amazon Redshift, there is a lot of processing power at our disposal. These massive systems can easily handle hundreds of gigabytes of data within minutes. But that is no excuse for building sloppy pipelines that over-utilize this power.
For example: say you have to build a pipeline that receives monthly data from a source, compares it with a historical table to find new records, and then appends those new rows to the historical table. This means your historical table is only going to get bigger and bigger throughout its lifespan.
With so much power at your disposal, you could simply compare against the whole bulk of the historical table every time, and the pipeline is ready.
Let's look at the drawbacks of this approach:
- At the start the historical table is considerably small, so the job completes within, say, 7–8 minutes. The job runs fine and everyone is happy. Fast forward six months. Would the job still take 7–8 minutes? Absolutely not.
- The time taken by a pipeline should not keep climbing. To prevent that, we should process only the required data. In this example, why compare against the entire history when your source contains only the current month's data?
- What if this pipeline runs in a Hadoop cluster? Within the next six months the job will start failing with out-of-memory exceptions, and you are left with only the option of increasing the memory for the job, which is not a good choice.
- What if this pipeline runs in Google BigQuery or Amazon Redshift? You may escape the out-of-memory exceptions thanks to the serverless capacity the cloud provides. But these services are paid, and they charge you for every byte you process. So although you only need to handle the current month's data, with this design you end up reading the entire history table, making the pipeline very costly to maintain.
So in this example, the historical table should be broken down into smaller chunks. Basically, partition the data so that only the required data is processed, making for a more reliable and cost-effective design.
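The idea can be sketched in plain Python. This is a minimal illustration, not a real table engine: the historical table is modeled as a dict of monthly partitions, the names `history` and `append_new_records` are made up for the example, and it assumes duplicates can only occur within the same month, so the incoming batch is compared only against its own month's partition:

```python
from collections import defaultdict

# Historical table modeled as monthly partitions: month -> set of record ids.
history = defaultdict(set)
history["2024-01"] = {1, 2, 3}
history["2024-02"] = {4, 5}

def append_new_records(month, incoming):
    """Compare the incoming batch only against its month's partition
    (partition pruning), then append the genuinely new rows."""
    partition = history[month]           # only one partition is scanned
    new_rows = set(incoming) - partition
    partition.update(new_rows)
    return new_rows

# Only the 2024-02 partition is read, however large the full history grows.
added = append_new_records("2024-02", [4, 6, 7])  # -> {6, 7}
```

The same principle applies directly in Hive, Spark or BigQuery: partition the historical table by month, and the monthly job touches one partition instead of scanning the full table.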
Having checkpoints: A data pipeline usually consists of several steps and can be a mix of Hive commands, Python scripts, Spark programs, etc. The basic idea of having checkpoints is to avoid repeating the entire flow when a particular step fails.
For example: say your pipeline has 10 steps. A bad design is to simply chain all the steps sequentially in a single shell script. What if the job fails at the 10th step? Since everything is packed into one shell script, rerunning will trigger all the previous steps as well, which is not a good thing.
A couple of points to consider can be:
- Individual steps of a pipeline must be separated out so that they can be triggered individually in case of a failure.
- Interdependent jobs must be clubbed together. Example: two steps of a pipeline are (a) compare records and (b) append new records. In this case, step (b) should not be allowed to run on its own; (a) and (b) should always run together to avoid data issues.
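Both points can be sketched with a tiny checkpoint runner. This is a toy illustration assuming a local JSON file as the checkpoint store; in practice a workflow manager (Oozie, Airflow, etc.) plays this role, and the file name `pipeline_state.json` and all function names here are made up for the example:

```python
import json
import os

STATE_FILE = "pipeline_state.json"  # hypothetical checkpoint store

def load_completed():
    """Read the set of units that finished in a previous run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return set(json.load(f))
    return set()

def run_pipeline(units):
    """units: list of (name, [callables]).

    Each unit groups interdependent steps, e.g. ("dedup", [compare, append]),
    so they either all run or are all retried together. A checkpoint is
    written only after a whole unit succeeds, so a rerun skips completed
    units and resumes at the one that failed.
    """
    completed = load_completed()
    for name, steps in units:
        if name in completed:
            continue                    # checkpoint hit: skip this unit
        for step in steps:
            step()                      # any failure leaves no checkpoint
        completed.add(name)
        with open(STATE_FILE, "w") as f:
            json.dump(sorted(completed), f)
```

If the last unit fails, rerunning `run_pipeline` re-executes only that unit; the earlier ones are skipped via the checkpoint file, which is exactly the behavior a single monolithic shell script cannot give you.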
Documenting the pipeline: This is always the least prioritized task, but it is as important as the construction of the pipeline itself. Some points here:
- Start with the documentation of your data model. This gives a big picture.
- Explain each step in your pipeline and write down what actions need to be taken for each step in case of a failure.
- Always ask a colleague who was not part of the design to read your docs and confirm they can understand them.
This information not only helps others support the pipeline after you have left the organization, but also helps new members redesign things if they need to.
Always take a considerable amount of time to design your pipeline before rushing to load the data. It's easy to develop something that runs successfully for a couple of months or even a year. But the goal of the design should be to sustain the pipeline for several years without much rework.