Data Engineering 101

7 Best Practices for Data Ingestion

Data Engineering: Beyond Hype #5

Saikat Dutta
CodeX


“Data Engineering is the new sexiest job of 2022.” It has surpassed Data Science in demand and career opportunities.

If you have not already seen the astronomical growth in demand for Data Engineering, chances are you were living in a cave for the last 2 years.

What exactly is the hype all about?

In order to try to answer this, we must first explore,

What is Data Engineering?

Coursera defines it as:

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.

Hmm, interesting, but what is it actually? Let’s dig into each of the points highlighted above.

Collecting Data from different Sources:

Data is now available in a variety of structured and unstructured datasets.

It resides in text, images, videos, social media platforms, Internet of Things (IoT) devices, real-time event streams, legacy databases, and data sourced from data providers and agencies.

The sources have never before been so diverse and dynamic.

It is not easy to connect to different types of sources, read different formats of data and collect it all.

Storing Data in a Standard repository:

Once you connect, it's important to store raw data in a common place.

The data needs to be cleansed before analyzing it and making sense of it.

The data will also need to conform to compliances and standards.

Final Stakeholders / Data Consumers:

We have our business stakeholders, applications, programmers, analysts, and data science use cases all pulling this data from the enterprise data repository.

All these consumers will try to analyze and make sense of the data in order to understand the business and make some key decisions.

Scaling the Infrastructure:

Finally, scaling is one of the most important tasks for a Data Engineer. Data is going to grow exponentially, and the infrastructure to load, store, and analyze it needs to be able to grow along with it.

So, we have covered the basic tasks and challenges involved in Data Engineering. Let’s dig a little deeper into the first two: collecting and storing data.

Data Ingestion:

Data Ingestion is defined as the process of absorbing data from a vast multitude of sources and transferring it to a target site where it can be deposited and analyzed.

A Data Engineer spends more than 50% of their time writing different pipelines that move data from one place to another. There are two basic frameworks to achieve this:

  1. ETL: Extract — Transform — Load
  2. ELT: Extract — Load — Transform

As the names suggest, the basic difference is when you apply the transformation: before or after loading the data.

However, in both the frameworks the common element is to be able to extract the data and load it into another destination. This is Data Ingestion.
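To make the difference concrete, here is a minimal, self-contained sketch using pandas and SQLite; the table names and the cleaning rule are purely illustrative assumptions, not a prescribed implementation.

```python
# A minimal sketch of ETL vs. ELT with pandas and SQLite.
# The table names and cleaning rule are illustrative assumptions.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

def extract() -> pd.DataFrame:
    # Stand-in for pulling data from an API or source database.
    return pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, None, 20.0]})

def etl() -> None:
    # ETL: transform in the pipeline, then load only the cleaned result.
    raw = extract()
    cleaned = raw.dropna().drop_duplicates(subset="order_id")
    cleaned.to_sql("orders_etl", conn, if_exists="replace", index=False)

def elt() -> None:
    # ELT: load the raw data as-is, then transform inside the warehouse with SQL.
    extract().to_sql("raw_orders", conn, if_exists="replace", index=False)
    conn.execute("""
        CREATE TABLE orders_elt AS
        SELECT order_id, MAX(amount) AS amount
        FROM raw_orders
        WHERE amount IS NOT NULL
        GROUP BY order_id
    """)

etl()
elt()
```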

Now, broadly speaking, there are 3 main types of Data Ingestion:

1. Batch-based Data Ingestion:

Batch-based ingestion happens at a regularly scheduled time. The data is ingested in batches. This is important when a business needs to monitor daily reports, e.g., sales reports for different stores. This is the most common data ingestion use case.
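As a rough illustration, a daily batch job might look like the sketch below; the file naming pattern and target table are assumptions, and scheduling would typically come from cron or an orchestrator.

```python
# A sketch of a daily batch ingestion job; the file layout and table name are assumptions.
from datetime import date
import sqlite3
import pandas as pd

def ingest_daily_sales(run_date: date, conn: sqlite3.Connection) -> None:
    # Each store drops a dated extract, e.g. sales_2022-01-31.csv (hypothetical layout).
    df = pd.read_csv(f"sales_{run_date.isoformat()}.csv")
    df["ingested_on"] = run_date.isoformat()
    df.to_sql("daily_sales", conn, if_exists="append", index=False)

if __name__ == "__main__":
    # Triggered once a day by cron or an orchestrator.
    ingest_daily_sales(date.today(), sqlite3.connect("warehouse.db"))
```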

2. Real-time/Streaming Data Ingestion:

The process of gathering and transmitting data from source systems in real time, using solutions such as Change Data Capture (CDC), is known as Real-Time Data Ingestion.

CDC, or streaming ingestion, captures any changes, new transactions, or rollbacks in real time and moves the changed data to the destination, without impacting the database workload.

Real-Time Ingestion is critical in areas like power grid monitoring, operational analytics, stock market analytics, dynamic pricing in airlines, and recommendation engines.
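For illustration, a streaming consumer might look roughly like this sketch, which assumes a Kafka topic carrying Debezium-style CDC events; the topic name, broker address, and event shape are all assumptions.

```python
# A sketch of real-time ingestion from a CDC stream using kafka-python.
# The topic, broker address, and Debezium-style event shape are assumptions.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders.cdc",                       # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for event in consumer:
    change = event.value
    # Apply each change to the destination as it arrives.
    if change.get("op") == "d":
        print("delete", change.get("before"))
    else:
        print("upsert", change.get("after"))
```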

3. Lambda-based Data Ingestion Architecture:

Lambda architecture in Data ingestion tries to use the best practices of both batch and real-time ingestion.

  1. Batch Layer: Computes the data based on the whole picture. This is more accurate but slower to compute.
  2. Speed Layer: Used for real-time ingestion. The computed data might not be completely accurate, but it gives a real-time picture of the data.
  3. Serving Layer: The outputs from the batch layer, in the form of batch views, and those from the speed layer, in the form of near real-time views, are forwarded to the serving layer. This layer indexes the batch views so that they can be queried with low latency on an ad-hoc basis.
(Lambda architecture diagram. Image source: Wikipedia)
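As a toy illustration of the serving layer idea, a query can merge the precomputed batch view with the speed layer’s fresh counts; the two dictionaries below are stand-ins for real batch and streaming outputs.

```python
# A toy sketch of the serving layer: combine the accurate-but-stale batch view
# with the fresh speed view. Both dictionaries are stand-ins for real outputs.
batch_view = {"store_1": 1200, "store_2": 800}   # recomputed from all data, slower
speed_view = {"store_1": 35, "store_3": 12}      # incremental, near real-time

def serve_order_count(store: str) -> int:
    return batch_view.get(store, 0) + speed_view.get(store, 0)

print(serve_order_count("store_1"))  # 1235
```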

Now that we understand what Data Ingestion is, it seems fairly straightforward: just copy the data from the source and paste it into the destination, right?

NO.

Wait, why not?

What are the challenges in Data Ingestion?

  1. Varied data sources need custom protocols to connect
  2. Different formats and standards of Data in sources
  3. The integrity of data while reading and storing
  4. Data Quality and Duplication

And these are just a few to name.

So, how can we ensure we ingest the correct data?

We can, by following some simple best practices that have stood the test of time.

Data Ingestion Best Practices:

  1. Add Alerts at source for data issues

Adding alerts in the source data will save a lot of time trying to debug issues downstream.

Basic data quality checks, like null columns, duplicate records, and invalid data, can be run before loading the data into the repository.

If the checks fail, alerts must be triggered for the source team to fix. The faulty records can be discarded and logged.
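A minimal sketch of such checks, assuming a pandas DataFrame and a simple logging-based alert (a real setup might page the source team or post to Slack):

```python
# A sketch of source-side quality checks; column names are assumptions, and
# logging.warning stands in for a real alerting hook (pager, Slack, etc.).
import logging
import pandas as pd

def validate_before_load(df: pd.DataFrame) -> pd.DataFrame:
    issues = []
    if df["order_id"].isnull().any():
        issues.append("null order_id values")
    if df.duplicated(subset="order_id").any():
        issues.append("duplicate order_id records")
    if (df["amount"] < 0).any():
        issues.append("negative amounts")

    if issues:
        # Alert the source team; the faulty records are discarded but logged here.
        logging.warning("Data quality checks failed: %s", "; ".join(issues))

    good = df.dropna(subset=["order_id"]).drop_duplicates(subset="order_id")
    return good[good["amount"] >= 0]
```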

2. Keep a copy of all your raw data before applying the transformation

The raw data layer must be read-only and no one should have update access.

This will serve as a backup in case of a failure in subsequent layers while trying to cleanse or add transformation.
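One simple way to approximate this is to land raw data in an append-only, timestamp-partitioned path before any transformation; the layout below is an assumption, and true read-only enforcement would come from storage permissions.

```python
# A sketch of an append-only raw landing zone, partitioned by load timestamp;
# the directory layout is an assumption, and real read-only enforcement would
# come from storage permissions (e.g. bucket policies).
from datetime import datetime
from pathlib import Path
import pandas as pd

def land_raw(df: pd.DataFrame, source: str, base_dir: str = "raw") -> Path:
    ts = datetime.utcnow().strftime("%Y/%m/%d/%H%M%S")
    path = Path(base_dir) / source / ts
    path.mkdir(parents=True, exist_ok=False)   # fail rather than overwrite an earlier load
    out = path / "data.csv"
    df.to_csv(out, index=False)                # raw data, no transformation applied
    return out
```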

3. Set expectations and timelines early, Data Ingestion isn’t easy

Business leaders and project managers often either over or underestimate the time needed for data ingestion.

Data Ingestion can often be very complex, and ingestion pipelines need to have proper tests in place.

Hence, it’s always good to set stakeholder expectations on the time needed to build the pipeline and the time taken to load the data.

4. Automate pipelines, use orchestration, set SLAs

Data Ingestion pipelines should be automated, along with all the needed dependencies.

An orchestration tool can be used to synchronize different pipelines.

SLAs must be set for each pipeline, so that monitoring teams can raise a flag if any pipeline runs longer than expected.
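For example, assuming Apache Airflow as the orchestration tool, a daily ingestion DAG with per-task SLAs might look roughly like this sketch; the DAG id, schedule, and task bodies are illustrative stubs.

```python
# A sketch of an automated daily pipeline with SLAs, assuming Apache Airflow;
# the DAG id, schedule, and task bodies are illustrative.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # pull data from the source (stubbed)

def load():
    pass  # load data into the warehouse (stubbed)

with DAG(
    dag_id="daily_sales_ingestion",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract,
                                  sla=timedelta(hours=1))
    load_task = PythonOperator(task_id="load", python_callable=load,
                               sla=timedelta(hours=2))
    extract_task >> load_task   # load runs only after extract succeeds
```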

5. Data Ingestion Pipelines must be idempotent

Idempotency is a critical characteristic for Data Ingestion Pipelines.

Idempotence means that if you execute an operation multiple times, the result will not change after the initial execution.

Mathematically, a function f is idempotent if f(f(x)) = f(x).

In the context of data integration, idempotence makes the data ingestion pipeline self-correcting.

Most importantly, it prevents duplicate records from being loaded.

A few strategies to achieve this are delete-insert, upsert, merge operations, and lookup tasks; a delete-insert sketch follows below.
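Here is a minimal delete-insert sketch using SQLite: re-running the load for the same partition first wipes that slice, so duplicates cannot accumulate. Table and column names are illustrative.

```python
# A sketch of an idempotent delete-insert load with SQLite; table and column
# names are illustrative.
import sqlite3
import pandas as pd

def load_partition(df: pd.DataFrame, run_date: str, conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS sales (run_date TEXT, store TEXT, amount REAL)")
    # Wipe this run's slice first, then reload it, so re-runs never duplicate rows.
    conn.execute("DELETE FROM sales WHERE run_date = ?", (run_date,))
    df.assign(run_date=run_date).to_sql("sales", conn, if_exists="append", index=False)
    conn.commit()

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"store": ["a", "b"], "amount": [10.0, 20.0]})
load_partition(df, "2022-01-31", conn)
load_partition(df, "2022-01-31", conn)   # re-run of the same day
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # still 2 rows
```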

6. Templatize, Reuse frameworks for development

A lot of data ingestion pipelines are repetitive; hence, it’s important to create templates for pipeline development.

If you create a reusable framework in your pipeline, the delivery effort can be reduced massively.

Increased velocity in ingesting new data will always be appreciated by the business.
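One common way to do this is a config-driven template, where adding a new source means adding a config entry rather than writing a new pipeline; the paths and table names below are assumptions.

```python
# A sketch of a config-driven ingestion template: each new source is just a new
# entry in SOURCES, not a new hand-written pipeline. Paths and table names are assumptions.
import sqlite3
import pandas as pd

SOURCES = [
    {"name": "sales",     "path": "extracts/sales.csv",     "target": "stg_sales"},
    {"name": "customers", "path": "extracts/customers.csv", "target": "stg_customers"},
]

def run_ingestion(conn: sqlite3.Connection) -> None:
    for src in SOURCES:
        df = pd.read_csv(src["path"])
        df.to_sql(src["target"], conn, if_exists="replace", index=False)
        print(f"loaded {len(df)} rows from {src['name']} into {src['target']}")
```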

7. Document your pipelines

This is the last but one of the most important habits to inculcate.

It’s extremely important that you document the input, output, and logic inside your pipeline.

This documentation helps save debugging time, explain business logic, and create source-to-destination mappings for the business.
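Even a disciplined docstring goes a long way; here is a sketch of what that might capture for a hypothetical ingestion function.

```python
def ingest_daily_sales(run_date):
    """Load one day of store sales into the warehouse.

    Input:  sales_<run_date>.csv dropped by the source export job (hypothetical).
    Output: one partition of the daily_sales table, keyed by run_date.
    Logic:  rows with null or duplicate order_ids are discarded and logged; the
            partition is reloaded with a delete-insert, so re-runs are idempotent.
    """
    ...
```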

See you again next week.

Whenever you’re ready, there are 3 ways I can help you in your Career Growth:

  1. Let me mentor you on your career journey here.
  2. Grow Your LinkedIn brand and get diverse here.
  3. Take charge of your career growth here.

Book some time with me: https://topmate.io/saikatdutta

Azure Data Engineer| Multi Cloud Data Professional| Data Architect | Career Mentor | Writer(Tech) | https://withsaikatdt.gumroad.com/l/DE2022