Data and Data Pipelines

(Introduction & Importance of Managing Data Flows)

Published in

Accredian

7 min readAug 2, 2022

Preface

In the last decade, we saw a boom in the innovations of big data technology due to the emergence of a copious amount of data. The data acquired data from several sources cannot be wasted and requires extensive processing before we can utilize it further.

Scraping data from different sources, storing them in a database, applying countless processes on diverse types, and using compatible tools to gain insights from it becomes a tedious task on a day-to-day basis.

Moreover, a single human error can waste time, energy, and money. It is the main reason why data analysts, scientists, and engineers use data pipelines in their workflows.

Data Pipeline

A data pipeline involves propagating data from a source to a destination. The data can be processed, transformed, and analyzed using various processes before being stored in a system such as a data warehouse.

The data storing requires a concrete architecture and services, which data-based solution providers can apply for several applications such as real-time reporting, metric updates, business insight delivery, and predictive analysis.

Primarily, we inject data into a pipeline through batch processing or stream processing, but there can also exist other types of processing.

Batch Processing

In batch processing, a series of tasks are pre-planned and defined to execute at definite triggers or intervals. It aims to minimize human interaction and is preferred by many in IT, data science, and analytics. These pipelines typically come under ETL processes.

We implement batch processing when we have to deliver either of the following:

End-of-quarter reports
Invoice summary of a monthly period
Payroll processes

Stream Processing

Stream processing enables the real-time movement of data. It continuously collects data from various sources and analyzes it. Real-time business intelligence, predictive analysis, and decision-making become possible using stream processing tools like Apache Spark.

For example, Robinhood uses real-time data sourced by NASDAQ to estimate closing stock prices with high precision.

So how is a Data Pipeline better?

Data flow can be complex and delicate. Data that migrate between systems are prone to a plethora of issues such as data corruption, latency (or bottlenecks from either system), conflicting data types, and even data redundancy (or duplicates). These complexities augment further and take effect when you scale the requirements set by an organization.

Some advantages a well-made data pipeline offers over the traditional data management approaches such as:

Agility to meet the required demands
Easier access to information and insights
Faster decision-making

Components of a Data Pipeline

Source

Origin or source is the point of data entry in a data pipeline. Data sources may include transaction processing applications, IoT devices, social media captions, tweets, user data, Web APIs, any public datasets from Kaggle or GitHub, or storage systems (data warehouse or data lake) of an organization.

Destination

A final point where data is loaded and stored in a pipeline is called a destination. This data can be stored in services like data lake or data warehouse. The stored data can then be sourced to power data dashboards, analytical tools, and business insight developments, or can be accessed by various outlines of an organization.

Dataflow

It indicates the flow of steps involved in transitioning data from source to destination systems. It also indicates the processing which is applied to the data. One such example is an ETL (Extract, Transform, and Load) process.

Workflow

The workflow defines a sequence of execution of jobs and the dependency between them in a data pipeline. Dependencies and their sequencing determine how certain jobs execute in a data pipeline. Some keywords and their description related to the workflow are as follows:

Job: A job is a series of processes (or executables) performed on the data. For example, retrieving raw data, cleaning raw data, and then developing a machine learning model.
Upstream/Downstream: When tasks are streamed down the line, the execution flow is known as downstream aligned. On the other hand, when tasks are streamed up the line or in an upward direction, the execution flow is known as upstream aligned.

Monitoring

It determines the efficiency of a data pipeline by checking if components are working as intended. When the requirements of an organization scale, monitoring tools help to verify the integrity of the execution workflow with the help of a debugger. It also helps to keep track of the load on the system and the consistency of data.

Tools to create a Data Pipeline

ETL tools, including data preparation and data integration tools such as Google Cloud Fusion, Azure Data Factory, Hevo, or AWS Glue.
Data warehouses — for storing structured data such as Amazon Redshift or Google BigQuery. Locally set-up databases can be used for storage purposes as well.
Data lakes — storages for raw, both structured and unstructured data provided by GCP, AWS, Microsoft Azure, and IBM.
Batch workflow schedulers and orchestration tools such as Airflow, Jenkins, and Luigi allow users to programmatically specify workflows with the help of tasks and dependencies between them.

Use-Cases

Many organizations are data-centric and collect lots of data that can be useful for strategic decisions, predictions, and marketing. The following are some of the popular companies that make use of data pipelines in their businesses.

Netflix

Netflix has one of the most complicated data pipeline infrastructures.
The video streaming company’s pipeline deals with petabytes of data and is broken into smaller subsystems focused on data ingestion, predictive modeling, and analytics.
Netflix Technology blog has explained in detail the Evolution of the Netflix Data Pipeline
Also, check out this video to know about the various technologies used by Netflix.

Dollar Shave Club

A well-established eCommerce company for personal grooming products.
Dollar Shave Club has a subscription-based business model with millions of customers.
They have a pretty good data architecture hosted on AWS and are primarily used for their product recommendations to boost future sales.
You can read about their use case in this blog.

Robinhood

Robinhood Logo

Robinhood is a financial services company known for allowing customers to trade stocks, options, exchange-traded funds (ETFs), and cryptocurrencies with zero commission fees.
The California-based company has 31 million users and has a data pipeline architecture hosted on AWS, operated on ELK (Elasticsearch, Logstash, and Kibana), and managed on Airflow.
You can read about Robinhood’s data management architecture on their blogs for ELK and Airflow.

Coursera

Coursera is a popular online education platform that offers massive open online courses (MOOCs), specializations, and even degrees from a variety of universities.
The Stanford-based company has a simple data pipeline that is used by the company’s employees to create dashboards, make informed decisions on launching features, and monitor the health of backend services.
They have provided more information on their blogs regarding their data usage and data infrastructure.

& That’s it. I hope you liked the explanation of data and data pipelines and learned something valuable. If you have anything to share with me, let me know in comment section. I would love to know your thoughts.

Final Thoughts and Closing Comments

There are some vital points many people fail to understand while they pursue their journey in Computer Science, Data Science and AI. If you are one of them and looking for a way to counterbalance these cons, then Follow me and Subscribe for more forthcoming articles related to Python, Computer Science, Data Science, Machine Learning, and Artificial Intelligence.
If you find this read helpful, then hit the Clap👏. Your encouragement will catalyze inspiration to keep me going and develop more cool stuff like this.

What’s Next?

Introduction to Apache Airflow