The Data Pipeline

Amit Choudhary · Published in Mentorskool · 3 min read · Aug 28, 2019

Data is the new oil. A small statement with huge ramifications! With so many startups sprouting up in food, education, travel, e-commerce, home services, payments and social media, the ability to capture our behaviour through apps has turned those server farms into a gold mine. This data, if carefully managed and nurtured from its genesis to its consumption, yields invaluable insights into how we eat, shop, travel and spend.

If we visualize the movement of data from the point it is created to the point it is consumed as a water pipeline, with data as the water flowing through it, it would look something like the image above. Data collected in itself does not add much value to the business. On the contrary, it is difficult to justify storing a lot of data early in the journey of a typical startup, since it takes space and hence adds to cost. On top of that, as modern organisations go AI-first, they tend to collect as much data as possible from the outset, which further adds to the infrastructure cost. However, with careful cleaning, storage, processing, visualization and knowledge extraction, the value of that data skyrockets and it becomes the single biggest money-spinner for the business.
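To make those stages concrete, here is a minimal, purely illustrative sketch in Python of the journey a record takes from collection to insight. The records, function names and in-memory "store" are assumptions for the example, not any particular product.

```python
# A toy, in-memory pipeline: collect -> clean -> store -> process -> report.
# Every name here is illustrative; real pipelines swap each step for a
# dedicated system (queues, warehouses, Spark jobs, dashboards).

raw_events = [
    {"user": "a", "item": "dosa", "amount": "120"},
    {"user": "b", "item": "  Dosa ", "amount": "90"},
    {"user": "a", "item": "coffee", "amount": None},  # a dirty record
]

def clean(event):
    """Drop incomplete records and normalise fields."""
    if event["amount"] is None:
        return None
    return {"user": event["user"],
            "item": event["item"].strip().lower(),
            "amount": float(event["amount"])}

# "Storage" layer: keep only the cleaned records.
store = [e for e in (clean(ev) for ev in raw_events) if e]

# "Processing": a metric the business actually cares about.
revenue_per_item = {}
for event in store:
    revenue_per_item[event["item"]] = (
        revenue_per_item.get(event["item"], 0) + event["amount"]
    )

print(revenue_per_item)  # the "insight" that feeds back into the product
```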

With the goal of squeezing as much value out of data as possible, data-driven organizations carefully design and build scalable data pipelines. Some of the top questions asked by mature data teams are:

  1. What are the different sources from which we are collecting our data?
  2. What are the types of data we are storing?
  3. What are the metrics relevant to our business? How do we process data to report these metrics?
  4. What’s the volume of the ingested data?
  5. What’s the speed of data ingestion?
  6. What proportion of our data is ‘dark data’? (e.g. images, free text, videos, audio)
  7. What proportion of our data pipeline programs are compute-hungry? (e.g. the more custom-built AI engines, the higher the compute demand)
  8. To what extent have our data pipelines been automated?
  9. How quickly are the generated insights implemented back in the product or service?
  10. What’s the trade-off between building smarter applications and spending heavily on compute and storage?

Seeking answers to each of the above questions has led to a plethora of technologies being built and adopted in modern data pipelines. Around 80% of data storage systems are still relational, keeping SQL highly relevant. However, with ever newer data needs, adoption of other storage technologies such as NoSQL and graph databases is gradually picking up.
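As a small illustration of why SQL stays relevant, here is a hedged sketch using Python's built-in sqlite3 module; the table, columns and data are made up for the example.

```python
# Minimal relational example with Python's standard sqlite3 module.
# The schema and the rows are purely illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Pune", 120.0), ("Pune", 90.0), ("Mumbai", 250.0)])

# One declarative line answers a business question; this expressiveness
# is what keeps SQL at the heart of most pipelines.
for city, total in conn.execute(
        "SELECT city, SUM(amount) FROM orders GROUP BY city"):
    print(city, total)

conn.close()
```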

Storage has moved from vertically scalable systems to horizontally scalable ones, making way for distributed computing technologies like Hadoop and Spark to take center stage in processing data at scale.
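For a flavour of what that looks like in practice, here is a minimal PySpark sketch. It assumes pyspark is installed and that an orders.csv file with city and amount columns exists; both the file and its columns are assumptions for the example.

```python
# A minimal Spark aggregation: the same code runs on a laptop or on a
# multi-node cluster, with Spark distributing the work across whatever
# executors are available.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("order-totals").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

totals = orders.groupBy("city").agg(F.sum("amount").alias("total_amount"))
totals.show()

spark.stop()
```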

The parallel-processing programming paradigm has pushed languages and frameworks to use every core and every byte of RAM on each machine in the cluster.
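As a small single-machine sketch of the same idea, Python's standard library can already fan work out across all available cores; frameworks like Spark extend this to every machine in a cluster. The "cleaning" function and the fake records below are illustrative.

```python
# Use every core on one machine with the standard multiprocessing module.
from multiprocessing import Pool, cpu_count

def clean_record(record: str) -> str:
    # Stand-in for a CPU-bound cleaning step.
    return record.strip().lower()

if __name__ == "__main__":
    records = [f"  Record {i}  " for i in range(100_000)]
    with Pool(processes=cpu_count()) as pool:
        cleaned = pool.map(clean_record, records, chunksize=1_000)
    print(len(cleaned), "records cleaned on", cpu_count(), "cores")
```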

The cloud has eased the infrastructure headache. No need to own anything: hire what you need and pay per use. Multi-node clusters can be spawned at the click of a button.
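As one hedged example (assuming an AWS account, the boto3 library, and the default EMR roles already set up), a small Spark cluster can be requested with a few lines of Python; the cluster name, region, release label and instance sizes below are placeholders.

```python
# Sketch: spawning a managed Spark cluster on AWS EMR via boto3.
# All values are placeholders; real usage needs proper IAM roles and costs money.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-spark-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster requested:", response["JobFlowId"])
```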

All of the above make way for the rockstar of the data pipeline: AI algorithms, which are both data-hungry and compute-hungry. AI holds the promise of delivering wonders. However, without the rest of the pipeline working in sync and duly playing its part, AI is just expensive math! Not to mention the staggering electricity bills (which, unfortunately, no one talks about).

I will cover each of the above questions in my upcoming articles. I hope this leaves you with a platter of thoughts!

Till then, Happy Learning!

