How to build robust data pipelines in the Big Data ecosystem

Qualities of a robust pipeline

  1. Should process only required data
  2. Should have checkpoints
  3. Should be very very (not a typo) well documented

Should process only required data

  1. Consider a job that compares source records against a historical table. At the start the historical table is small, so the job completes in, say, 7–8 minutes. The job runs fine and everyone is happy. Fast forward 6 months. Would the job still take 7–8 minutes? Absolutely not.
  2. The time a pipeline takes should not keep climbing as the data grows. To prevent that, process only the required data. In this example, why compare against the entire history when the source contains only the current month's data?
  3. What if this pipeline runs on a Hadoop cluster? Within the next 6 months the job will start failing with out-of-memory exceptions, and your only option is to keep increasing the job's memory, which is not a good choice.
  4. What if this pipeline runs on Google BigQuery or Amazon Redshift? You may escape the out-of-memory exceptions thanks to the capacity the cloud provides. But these services are paid, and they can charge for every byte you process. So although you only want to handle the current month's data, with this design you end up reading the entire history table, which makes the pipeline very costly to maintain.
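
As a rough illustration of processing only the required data, here is a minimal Python sketch. It assumes the history is stored in per-month partitions and that each record carries an `id` and an `event_date`. The names `history_by_month`, `month_key` and `new_records` are made up for this example and are not from any particular framework. The comparison reads only the partitions for months that actually appear in the source, so its cost follows the size of the current data and not the size of the whole history.

```python
from datetime import date


def month_key(d: date) -> str:
    """Partition key for a date, e.g. '2024-02' (hypothetical layout)."""
    return f"{d.year:04d}-{d.month:02d}"


def new_records(source, history_by_month):
    """Return source rows whose id is not yet in the matching month partitions."""
    # Only the months present in the source are looked up in the history.
    months = {month_key(row["event_date"]) for row in source}
    seen_ids = set()
    for month in months:
        for row in history_by_month.get(month, []):
            seen_ids.add(row["id"])
    return [row for row in source if row["id"] not in seen_ids]


# Illustrative data: history split by month, source holding only February rows.
history_by_month = {
    "2024-01": [{"id": 1, "event_date": date(2024, 1, 5)}],
    "2024-02": [{"id": 2, "event_date": date(2024, 2, 9)}],
}
source = [
    {"id": 2, "event_date": date(2024, 2, 9)},
    {"id": 3, "event_date": date(2024, 2, 20)},
]

fresh = new_records(source, history_by_month)
print([row["id"] for row in fresh])  # prints [3]
```

On a warehouse the same idea would usually be expressed as a filter on a partition column, so that the engine never scans the older months at all.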

Should have checkpoints

  1. Individual steps of a pipeline must be separated so that each can be re-triggered on its own after a failure.
  2. Interdependent steps must be grouped together. Example: if two steps of a pipeline are (a) compare records and (b) append new records, step (b) should not be allowed to run separately. Steps (a) and (b) should always run together to avoid data issues.
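
The two checkpoint rules can be sketched in a few lines of Python. This is only an illustration under simple assumptions: completed units are recorded in a local JSON file, and the names `run_pipeline`, `units` and `compare_and_append` are invented for the example. Each unit is a checkpoint that can be re-run on its own, while the compare and append steps live in one unit so that append can never run without compare.

```python
import json
import os
import tempfile


def run_pipeline(units, checkpoint_path):
    """Run units in order, skipping those already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, steps in units:
        if name in done:
            continue  # finished in an earlier run, do not repeat it
        for step in steps:
            step()  # an exception here leaves the whole unit unmarked
        done.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return done


log = []
state = {"fail_compare": True}  # simulate a failure on the first run


def extract():
    log.append("extract")


def compare():
    if state["fail_compare"]:
        raise RuntimeError("simulated failure")
    log.append("compare")


def append_new():
    log.append("append")


units = [
    ("extract", [extract]),
    ("compare_and_append", [compare, append_new]),  # (a) and (b) together
]

checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    run_pipeline(units, checkpoint)  # extract succeeds, compare fails
except RuntimeError:
    pass

state["fail_compare"] = False
run_pipeline(units, checkpoint)  # resumes at compare_and_append
print(log)  # prints ['extract', 'compare', 'append']
```

Note that the second run does not repeat `extract`, and that a failure anywhere inside `compare_and_append` causes both of its steps to run again on the retry.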

Should be very very well documented

  1. Start with the documentation of your data model. This gives the big picture.
  2. Explain each step in your pipeline and write down what actions need to be taken for each step in case of a failure.
  3. Always ask a colleague who was not part of the design to read your docs and check whether they understand them.

Gohit Varanasi

Code works like magic and always inspires.