Data Engineering 101

Data Pipeline Architecture: a variety of ways you can build your Data Pipeline

Data Engineering: Beyond Hype #6

Saikat Dutta
CodeX


Data Pipelines are the main building block of Data Lifecycle Management.

Data Engineers spend 80% of their time working on Data Pipelines: designing, developing, and resolving issues.

Since this is so important for any Data Engineering workload, let’s dive deeper into all the different architectures for building a Data Pipeline.

Recap

In the last episode, we learned what Data Engineering is and the best practices of Data Ingestion.

We discussed batch, streaming, and lambda architecture for data processing.

We discussed 7 best practices of Data Ingestion.

In this episode, we will talk about Data Pipelines as a whole.

Why Data Pipelines:

Pipelines are used for transferring something from one point to another.

However, a Data Pipeline does much more.

A Data pipeline is a sequence of steps that collects raw data and converts it into actionable insights or information about the business.

But why do we need Data Pipelines?

Source systems in a large enterprise are extremely varied. For example, a large enterprise might collect data from:

  • Different transactional systems
  • Social media
  • Documents
  • Videos
  • ERP systems
  • Marketing systems, e.g., HubSpot
  • 3rd-party analytics platforms, like Google Analytics

Hence, raw data from these systems might arrive over different protocols and with different structures. It is extremely complicated to standardize it for analysis.

Data Pipelines are used to collect all the data from disparate sources into a unified system.

Once the data is unified, Data Pipelines also help standardize the data, remove invalid data, cleanse it and perform aggregation and transformations based on subsequent downstream requirements.

Types of Data Pipeline

Data Pipelines can be categorized by different parameters. Here, let us consider types based on schedule, the freshness of data, and the architecture of data processing.

  • Batch Processing: Data Pipelines are scheduled at specific intervals. Useful for daily dashboards or periodic reporting. Data freshness is low.
  • Stream Processing: Data Pipelines execute continuously as data is streamed. Used when the difference between event time and processing time needs to be minimized, e.g., anomaly detection, fraud detection, dynamic pricing, etc. Data is fresh but often not accurate.
  • Batch + Stream Processing: Combines the power of long-running batch models with streaming analytics. Data is still fresh and inconsistencies are removed by using output from Batch processing.

What are the main components of a Data Pipeline?

  1. Sources
  2. Destination / Sinks
  3. Transformations of Data and Business Logic
  4. Data Quality Validation

There can be more components, e.g., Data Modelling and Data Visualization; however, these 4 are the basic building blocks.

So how do you use these building blocks to solve different use cases in ETL?

RAW DATA LOAD

As the name suggests, these pipelines move data in its raw format. The data is unchanged: only Extract and Load are involved, and no transformations are applied. These pipelines typically run once or on specific schedules per day.

Raw Data Load in ETL / ELT
Image source: Author
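To make this concrete, here is a minimal Python sketch of a raw (extract-and-load only) pipeline, assuming a relational source and an object-storage data lake; the connection string, lake path, and table name are hypothetical placeholders.

```python
# Minimal raw-load sketch: extract a source table as-is and land it in a
# data lake as Parquet, partitioned by load date. No transformations applied.
import datetime

import pandas as pd
import sqlalchemy

# Hypothetical connection string and lake path -- replace with your own.
SOURCE_DB_URL = "postgresql://user:password@source-host:5432/sales"
LAKE_PATH = "s3://my-data-lake/raw"

def raw_load(table: str) -> None:
    engine = sqlalchemy.create_engine(SOURCE_DB_URL)
    # Extract: pull the table exactly as it exists in the source.
    df = pd.read_sql_table(table, engine)

    # Load: write unchanged rows to a date-partitioned Parquet folder,
    # so reruns on the same day overwrite the same partition.
    load_date = datetime.date.today().isoformat()
    df.to_parquet(f"{LAKE_PATH}/{table}/load_date={load_date}/{table}.parquet",
                  index=False)

raw_load("orders")
```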

Pros:

  1. Moves data from a source DB to a Data Lake, or from an on-premise data store to object storage on the cloud.
  2. Data is typically loaded in compressed open file formats like Parquet/ORC/Delta Lake/Iceberg, so less storage is occupied in the Data Lake.
  3. Helps make the data transformation pipelines idempotent. E.g., even if the subsequent data transformation pipelines are run multiple times in a day or rerun, they produce the same data.
  4. Helps to debug source data issues without referring to the source data multiple times.
  5. Suited for Data Lake workloads, mostly on the cloud.

Cons:

  1. Duplication of data in another layer.
  2. Not suited for on-premise Data Warehouse workloads.

Extract Transform Load (ETL):

Extract-Transform-Load (ETL) is the most widely used data pipeline pattern. Since the early 1990s, it has been the standard way to integrate data into a data warehouse.

Data is extracted from operational data stores and then unified, cleaned, and transformed per business needs. The data is also modeled to replicate real-world entities.

Extract Transform Load (ETL) Pipeline
Image source: Author

Ex. loan application data is taken from loan origination platforms; loan payments and transactions are taken from loan receivable platforms; and credit and other 3rd-party data is taken from flat files and data feeds.

All this data is then organized into a Loan Fact, Loan Transaction Fact, Loan Payment Fact, and different loan-related dimension tables.
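A minimal Python sketch of this ETL flow using pandas and SQLAlchemy follows; the connection strings, table names, and column names are hypothetical, and a real implementation would likely use a dedicated ETL tool.

```python
# Minimal ETL sketch for the loan example above: extract, transform in
# memory, then load the modeled result into a warehouse table.
import pandas as pd
import sqlalchemy

# Hypothetical source and warehouse connections.
source = sqlalchemy.create_engine("postgresql://user:pwd@loans-db:5432/origination")
warehouse = sqlalchemy.create_engine("postgresql://user:pwd@dwh-host:5432/analytics")

# Extract: pull loan applications and payments from the operational stores.
applications = pd.read_sql("SELECT * FROM loan_applications", source)
payments = pd.read_sql("SELECT * FROM loan_payments", source)

# Transform: cleanse, standardize, and join before anything lands in the warehouse.
applications = applications.dropna(subset=["loan_id"])
applications["currency"] = applications["currency"].str.upper()
loan_fact = applications.merge(
    payments.groupby("loan_id", as_index=False)["amount"].sum(),
    on="loan_id",
    how="left",
)

# Load: only the transformed, modeled data is written to the warehouse.
loan_fact.to_sql("fact_loan", warehouse, if_exists="replace", index=False)
```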

Pros:

  1. Suited for Data Warehouse workloads.
  2. Works well when complex data transformations are required.
  3. Results in only One Copy, as no redundant raw data layer is needed.
  4. If all sources are not ready at the same time, ETL can be useful: the transformation happens only when all data is available.

Cons:

  1. Highly time-consuming and high latency. Micro-batch processing can help, but some latency is inevitable.
  2. High compute power needed for complex transformations.
  3. Due to 1 and 2, it is not practical for Big Data.
  4. You often get vendor lock-in with ETL tools and Data Warehouse providers.

Extract Load Transform (ELT):

Extract-Load-Transform became popular as a replacement for ETL as data volumes increased. Waiting for all transformation work to complete delays the availability of data for business use, so data is loaded immediately after extraction. AI/ML processing can happen without data modeling.

Extract Load Transform (ELT) pipeline
Image source: Author
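Below is a minimal Python sketch of the same idea: the raw extract is loaded first, and the transformation runs later as SQL inside the target engine. Connection strings, schemas, and table names are hypothetical.

```python
# Minimal ELT sketch: land the raw extract first, then transform inside the
# target engine with SQL, where compute can scale with the data.
import pandas as pd
import sqlalchemy

source = sqlalchemy.create_engine("postgresql://user:pwd@app-db:5432/app")
target = sqlalchemy.create_engine("postgresql://user:pwd@lakehouse:5432/analytics")

# Extract + Load: copy the raw table as-is into a raw schema/layer.
raw_orders = pd.read_sql("SELECT * FROM orders", source)
raw_orders.to_sql("orders", target, schema="raw", if_exists="replace", index=False)

# Transform: run later (and re-run as needed) inside the target, on the raw copy.
with target.begin() as conn:
    conn.execute(sqlalchemy.text("""
        CREATE TABLE IF NOT EXISTS curated.daily_orders AS
        SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
        FROM raw.orders
        WHERE amount IS NOT NULL
        GROUP BY order_date
    """))
```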

Pros:

  1. Suitable for Data Lake implementations.
  2. Decreased latency, as data is available much sooner for processing.
  3. All kinds of data (structured, unstructured, etc.) can be loaded.

Cons:

  1. Increased duplication of data in different layers. Results in two or more copies (raw and transformed layers, plus an optional serving layer).
  2. Data quality suffers; data is exposed without deduplication and cleansing.
  3. Often ends up creating a Data Swamp.
  4. People often forget or ignore the T — Transformation.

Extract Transform Load Transform (ETLT):

The Extract-Transform-Load-Transform (ETLT) pattern is a hybrid of ETL and ELT. The first stage consists of light transformations that are independent of all other data sources; data cleansing, format standardization, and masking of sensitive data are typical first-stage transformations.
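A minimal sketch of the ETLT split, assuming hypothetical source and target systems: masking and format standardization happen before the load, while heavier modeling is deferred to the target, as in ELT.

```python
# Minimal ETLT sketch: light, source-independent transforms (masking,
# standardization) before the load; heavier modeling happens later in the target.
import hashlib

import pandas as pd
import sqlalchemy

source = sqlalchemy.create_engine("postgresql://user:pwd@crm-db:5432/crm")
target = sqlalchemy.create_engine("postgresql://user:pwd@lakehouse:5432/analytics")

# Extract.
customers = pd.read_sql("SELECT * FROM customers", source)

# First (light) Transform: mask PII and standardize formats -- nothing that
# depends on other sources.
customers["email"] = customers["email"].apply(
    lambda e: hashlib.sha256(e.encode()).hexdigest() if isinstance(e, str) else None
)
customers["country"] = customers["country"].str.strip().str.upper()

# Load the lightly transformed data into the staging layer.
customers.to_sql("customers", target, schema="staging",
                 if_exists="replace", index=False)

# Second (heavy) Transform: joins, aggregations, and modeling run in the
# target engine later, just as in ELT.
```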

Pros:

  1. Smaller transformations, data cleaning, and filtration are done while extracting data.
  2. A very practical pipeline design that addresses data privacy through masking.
  3. Since some filtration or transformation is done up front, the chance of the data becoming a swamp is lower.
  4. Enables Lakehouse architecture.

Cons:

  1. Still limited by processing power and latency.
  2. Still generates two copies.

All the variations of ETL/ELT with a Data Lake or Warehouse are limited by:

  1. A centralized data team
  2. Increased time to provision new data sources
  3. Less flexibility in serving different departments and use cases (ML/AI/Marketing/Sales/HR, etc.)

Data Mesh and Data Fabric architecture are modern alternatives to Data Lake / Data Warehouse / Lakehouse based centralized processing. The key enablers for Data Mesh/Fabric are Data Virtualization and Dedicated Links.

DATA VIRTUALIZATION:

Data virtualization serves data pipelines differently than the other patterns. Most pipelines create physical copies of data; virtualization delivers data as views without physically storing a separate version.

Virtualization works with layers of data abstraction. The source layer is the least abstract. The integration layer combines and connects data from disparate sources, providing views similar to the transformation results of ETL processing.

Data Virtualization

The business layer presents data with semantic context, and the application layer structures the semantic view for specific use cases.

  1. Unlike ETL processing, virtualization processes are initiated by a query.
  2. The query is issued in the application and semantic context and is translated through the integration and source layers to connect with the right data sources.
  3. The response traverses the path in the opposite direction: acquire source data, transform and integrate it, present a semantic view, and deliver an application view of the data (see the sketch below).
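In practice this query-driven flow is handled by a virtualization or federation engine (for example Denodo or Trino); the sketch below only illustrates the idea with pandas and hypothetical sources: nothing is persisted, and data is fetched and integrated only when the view is queried.

```python
# Minimal sketch of query-time federation in the spirit of data virtualization:
# nothing is persisted; the "view" pulls from both sources only when queried.
import pandas as pd
import sqlalchemy

crm = sqlalchemy.create_engine("postgresql://user:pwd@crm-db:5432/crm")
billing = sqlalchemy.create_engine("postgresql://user:pwd@billing-db:5432/billing")

def customer_revenue_view() -> pd.DataFrame:
    """A virtual 'view': computed on demand, never stored as a copy."""
    customers = pd.read_sql("SELECT customer_id, name FROM customers", crm)
    invoices = pd.read_sql(
        "SELECT customer_id, SUM(amount) AS revenue FROM invoices GROUP BY customer_id",
        billing,
    )
    return customers.merge(invoices, on="customer_id", how="left")

# The query drives the flow: data is fetched, integrated, and returned only now.
print(customer_revenue_view().head())
```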

Pros:

  1. Virtualization can work well when people want the freshest data possible.
  2. Virtualization integrates and transforms only data that is requested and used
  3. Zero Copy architecture, No data replication.

Cons:

  1. Only good for modest data volumes
  2. May struggle to perform with complex transformations or huge data.

DATA STREAM PROCESSING

The best alternative for low latency ETL is Data Stream Processing.

Data stream processing involves a streaming data source, e.g., logs from sensors and IoT devices, or location and trip data from moving vehicles in logistics.

Stream processing involves filtering and transforming a continuous stream of data.

Data Stream Processing (source)

Processing begins by parsing the events to isolate each unique event as a distinct record; only events relevant to the use case are processed further.

At the destination end of the data flow, there are two different patterns that emerge.

For some use cases, the event is posted to a message queue, an event log, or an events archive. Once posted or stored, a downstream data pipeline consumes the data later.

For other use cases, the events are pushed to monitoring or alerting applications, e.g., real-time anomaly detection for freight carriers, dynamic pricing based on demand, etc.
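A minimal, self-contained sketch of this flow follows; in a real pipeline the events would come from a broker such as Kafka or Kinesis rather than the in-memory generator used here, and the event fields are hypothetical.

```python
# Minimal stream-processing sketch: parse each event, keep only relevant ones,
# then either alert immediately or hand off to a downstream queue/archive.
import json

def event_stream():
    # Stand-in for a continuous stream of raw sensor/IoT events.
    raw_events = [
        '{"device_id": "truck-17", "type": "gps", "speed_kmh": 82}',
        '{"device_id": "truck-17", "type": "gps", "speed_kmh": 141}',
        '{"device_id": "sensor-3", "type": "temperature", "celsius": 21}',
    ]
    for raw in raw_events:
        yield raw

def alert(event):
    print(f"ALERT: {event['device_id']} speeding at {event['speed_kmh']} km/h")

def publish_downstream(event):
    print(f"queued for later batch consumption: {event}")

def process(stream):
    for raw in stream:
        event = json.loads(raw)           # parse: isolate one event per record
        if event.get("type") != "gps":    # filter: keep only relevant events
            continue
        if event["speed_kmh"] > 120:
            alert(event)                  # push pattern: real-time alerting
        else:
            publish_downstream(event)     # queue/archive pattern

process(event_stream())
```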

Pros:

  1. Stream processing pipelines work with sensors and Internet of Things (IoT) devices.
  2. This pattern is particularly useful for machine learning.
  3. Useful for log analytics
  4. Data Freshness is high

Cons:

  1. In the absence of a Lambda architecture, data accuracy is not high.

CHANGE DATA CAPTURE

CDC along with an OLAP use case can solve data latency issues. CDC allows fresh data to be available for processing in ETL pipelines.

There are mainly four types of CDC approaches:

  1. Transaction date based (see the sketch below)
  2. Database log based
  3. Trigger-based
  4. Differential method based (Diff)

Change Data Capture (source)
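As a concrete illustration, here is a minimal sketch of the transaction-date/timestamp approach, with hypothetical connections, tables, and an `updated_at` watermark column; log-based CDC would instead read the database transaction log, typically via a tool such as Debezium.

```python
# Minimal timestamp-based CDC sketch: pull only rows changed since the last
# watermark and stage them in the target for a subsequent merge/upsert.
import pandas as pd
import sqlalchemy

source = sqlalchemy.create_engine("postgresql://user:pwd@app-db:5432/app")
target = sqlalchemy.create_engine("postgresql://user:pwd@dwh-host:5432/analytics")

def incremental_load(last_watermark: str) -> str:
    # Extract only rows modified after the previous run's watermark.
    changes = pd.read_sql(
        sqlalchemy.text("SELECT * FROM orders WHERE updated_at > :wm"),
        source,
        params={"wm": last_watermark},
    )
    if changes.empty:
        return last_watermark

    # Load the delta into a staging table; a MERGE/UPSERT into the final
    # table would normally follow.
    changes.to_sql("orders_changes", target, schema="staging",
                   if_exists="append", index=False)
    return str(changes["updated_at"].max())

new_watermark = incremental_load("2024-01-01 00:00:00")
```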

Pros:

  1. Reduced latency for Data Warehouse workloads.
  2. Increased data freshness, near real-time or micro-batch ETL.

Cons:

  1. Increased complexity.
  2. CDC applications can add extra cost.

Now that we have covered the different types and architectures of pipelines, there are a few common design patterns that apply to almost all the processing techniques.

Repeatable and Idempotent:

It's important that ETL pipelines can be re-run from a fixed point in time and produce the same results.

  1. Store data in an immutable way, as a list of transactions with a date-time.
  2. This will allow the computation to be idempotent and produce the same results even when run retrospectively, as in the sketch below.
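A minimal sketch of one common way to achieve this, assuming a hypothetical `daily_sales` table built from immutable `raw_transactions`: each run rebuilds exactly one date partition, so reruns produce identical results.

```python
# Minimal idempotency sketch: each run (or rerun) rebuilds exactly one date
# partition from immutable source transactions.
import sqlalchemy

warehouse = sqlalchemy.create_engine("postgresql://user:pwd@dwh-host:5432/analytics")

def rebuild_daily_sales(run_date: str) -> None:
    with warehouse.begin() as conn:
        # Delete-then-insert for the partition keeps reruns idempotent.
        conn.execute(
            sqlalchemy.text("DELETE FROM daily_sales WHERE sales_date = :d"),
            {"d": run_date},
        )
        conn.execute(
            sqlalchemy.text("""
                INSERT INTO daily_sales (sales_date, revenue)
                SELECT transaction_date, SUM(amount)
                FROM raw_transactions
                WHERE transaction_date = :d
                GROUP BY transaction_date
            """),
            {"d": run_date},
        )

rebuild_daily_sales("2024-05-01")   # running this again yields identical data
```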

Automated Framework — Templatize

It's extremely important to automate data pipelines using an orchestration tool, DAGs, or cron jobs. The pipelines should resolve all dependencies and run on their own.

Without automation, manual error and delay will burden the workforce and kill data quality.

Templates are a must-have for data pipelines. They allow faster provisioning of new sources and shorter development and test cycles.
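One simple way to templatize is sketched below with hypothetical source and target names: a single parameterized pipeline definition driven by configuration, which an orchestrator (Airflow, Dagster, cron, etc.) would then schedule.

```python
# Minimal templating sketch: one parameterized pipeline definition, many sources.
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    source_table: str
    target_table: str
    schedule: str          # e.g. a cron expression for the orchestrator
    primary_key: str

def run_pipeline(cfg: PipelineConfig) -> None:
    # The same extract/validate/load steps run for every source;
    # only the configuration differs.
    print(f"extracting {cfg.source_table}")
    print(f"loading into {cfg.target_table} keyed on {cfg.primary_key}")

# Provisioning a new source becomes a one-line config entry, not a new pipeline.
CONFIGS = [
    PipelineConfig("crm.customers", "staging.customers", "0 2 * * *", "customer_id"),
    PipelineConfig("erp.invoices", "staging.invoices", "0 3 * * *", "invoice_id"),
]

for cfg in CONFIGS:
    run_pipeline(cfg)
```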

Auditable — (Monitoring and logging)

It's important to know, at any point in time, which pipelines are running, which completed successfully, and which failed. In case of any failure, clear logging should be in place.

This will drastically reduce debugging and error handling times and effort.

This should also enable Data Observability and Data Lineage, so everyone can see which transformations have been applied through which pipelines.
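A minimal sketch of what such logging can look like, using only Python's standard logging module; the step names and durations here are illustrative.

```python
# Minimal auditing sketch: structured logging around each pipeline step so it
# is always possible to see what ran, what succeeded, and what failed.
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s pipeline=%(name)s %(message)s")
log = logging.getLogger("orders_daily")

def run_step(name, fn):
    start = time.time()
    log.info("step=%s status=started", name)
    try:
        fn()
        log.info("step=%s status=success duration=%.1fs", name, time.time() - start)
    except Exception:
        log.exception("step=%s status=failed", name)
        raise  # fail loudly so the orchestrator can alert and retry

run_step("extract", lambda: time.sleep(0.1))
run_step("load", lambda: time.sleep(0.1))
```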

Reliable and Secure ( Data Validation and QA)

Data Quality is a key parameter for validating the results of a Data Pipeline. Pipelines should be designed so that the output is reliable and maximum data quality is achieved.

DQ checks must be done at both source and destination, and alerting must be configured.

This will build confidence among the stakeholders to use the data for downstream reporting.
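A minimal sketch of source-vs-destination validation, with hypothetical connections and table names: row counts are compared and a key column is checked for nulls, raising an error (which an orchestrator would turn into an alert) on any mismatch.

```python
# Minimal data-quality sketch: compare source vs. destination row counts and
# check for nulls in a key column.
import pandas as pd
import sqlalchemy

source = sqlalchemy.create_engine("postgresql://user:pwd@app-db:5432/app")
target = sqlalchemy.create_engine("postgresql://user:pwd@dwh-host:5432/analytics")

def row_count(engine, table: str) -> int:
    return pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", engine)["n"].iloc[0]

def validate(table_src: str, table_dst: str, key_column: str) -> None:
    src_rows = row_count(source, table_src)
    dst_rows = row_count(target, table_dst)
    if src_rows != dst_rows:
        raise ValueError(f"row count mismatch: source={src_rows} destination={dst_rows}")

    nulls = pd.read_sql(
        f"SELECT COUNT(*) AS n FROM {table_dst} WHERE {key_column} IS NULL", target
    )["n"].iloc[0]
    if nulls > 0:
        raise ValueError(f"{nulls} null values found in key column {key_column}")

validate("orders", "staging.orders", "order_id")
```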

Conclusion:

This is probably the most exhaustive list of Data Pipeline architectures covered in a single blog. I will cover the practical implementation of some of these in a separate article in the future.

For beginners, this might seem like too much. So take it slow, cover one at a time, and come back to this blog whenever you need a reference.

However, these are the fundamental building blocks for Data Pipelines and are tool-agnostic. If we build strong foundational knowledge of each and are able to weigh their pros and cons, we will be able to build pipelines in any ETL/ELT/DI/Orchestration tool.

References and Further Study:

  1. Data Virtualization — How it works
  2. Data Pipeline Design Patterns
  3. “Data Pipeline Design Principles” by Kovid Rathee
  4. “3 Design Principles for Engineering Data” by Jiaqi Liu
  5. “Data Mesh and Data Virtualization are not the Same Thing” by Teradata

See you again next week.

Whenever you’re ready, there are 3 ways I can help you in your Career Growth:

  1. Let me mentor you in your career journey here.
  2. Download an exhaustive Data Engineering Roadmap here.
  3. Grow Your LinkedIn brand and get diverse here.

Follow the weekly newsletter here.
