Data Engineering 101

How to Create First Data Engineering Project? An Incremental Project Roadmap

Build Data Engineering projects in this incremental approach for guaranteed success

Saikat Dutta
Data Engineer Things
15 min read · Aug 20, 2023


A detailed breakdown of a Data Engineering Project. (Image by the author)

On Apr 16, 2022, I wrote a detailed Roadmap of how to learn Data Engineering as a beginner. The roadmap was a great hit. However, it had a significant gap.

People often fail to stay consistent while learning so many different technologies, and they struggle to piece it all together. The roadmap should have addressed this dwindling attention span. In this post, we will explore how an Incremental Project Roadmap can close that gap.

The Data Engineering Roadmap listed plenty of topics to cover, which are still relevant. It also had a great tracker to plan the topics and track progress. But it lacked some elements of structure around what to learn first and what to implement as a real-world project.

Also, the plan was great for beginners, but what about people who have some experience in the field and want to get hands-on?

While making the move from ETL Developer to Data Engineer, I faced a similar challenge. Now, in 2023, when I look back, I feel I would have done much better with a different approach.

Long post alert: This is going to be a pretty long read, but rest assured it will save you a lot of time while learning. So sit back, grab a cup of coffee, and read on….

  1. Start with the project
  2. Incremental Project Method
    Define the Business Problem
    Define Data Requirements
    Incremental Project Design
    — — Sprint 1: Building a Simple Data Ingestion Pipeline
    — — Sprint 2: Adding Idempotency to the Pipeline
    — — Sprint 3: Adding Unit Testing
    — — Sprint 4: Creating multiple pipelines on the same principles
    — — Sprint 5: Orchestration and Workflow Management
    — — Sprint 6: Automation of Data Pipeline
    — — Sprint 7: Data Quality and Validation
    — — Sprint 8: Continuous Integration and Continuous Deployment
    — — Sprint 9: Infrastructure as Code (IaC)
    — — Sprint 10: Scalability and Optimization
  3. Conclusion

Start with the project

So, in the real world, will you be given months to learn everything you need to solve a problem? You will almost always be given a set of tasks or problem statements, and then you will have to figure out what is needed to solve them. Right?

“Go figure” will probably be the real mantra in an actual project. But we all have a learning problem.

We are trapped in a tutorial hell loop.

How often have you felt underconfident, even after completing a course? You feel great when you follow a tutorial, but the moment you are given a set of tasks, you procrastinate. Why? Because you hit the Valley of Despair and get lost in tutorial hell.

Let me introduce the Incremental Project Method and how it can solve the dwindling motivation problem.

Incremental Project Method

The Incremental Project Method reverses the idea of doing a project only after you have learned everything. Instead, we start with a project, break it down into smaller, manageable tasks, and learn the concepts and technologies needed to solve each specific task and achieve the goals.

So, why is it called incremental then?

Well, once the basic goals are met, you introduce new complexities and functionalities. You can relate this to building an MVP first to validate the idea. Once the idea is validated, you add functionality and scale.

So, what are the benefits of this model?

  1. You replicate an actual work environment.
  2. You learn problem-solving right from the start.
  3. The instant gratification of completing a project helps.
  4. Incremental complexity ensures you always have a new challenge.
  5. It helps you beat the learning plateau and move towards mastery.

Ok, that’s great, but can you explain with a real example?

Sure, let’s try to create a very basic Data Engineering project, and then let’s review how new complexities can be added. But first, let me break a myth once and for all:

Twitter or Uber data analysis is not a real Data Engineering project.

Don’t get me wrong here; these are great topics for starting to learn the end-to-end pipeline of a Data Engineering project. Darshil Parmar has some great beginner Data Engineering projects on these topics. But that’s almost it.

That can be a great first step in an Incremental Project Plan, but can’t be advertised in a resume as an actual production-grade personal project.

If you are someone with a couple of years of experience, this becomes even more evident. Any interviewer can tell you don’t have actual project experience. You can still crack a lot of interviews, but by adding a little more complexity you can take your personal projects to the next level.

Here are some actual real-world Data Projects to add to your resume:

1. Migrating on-prem data to the cloud
2. Building a Sentiment Analytics solution
3. Implementing Master Data Management
4. Designing an automated Data Ingestion Framework
5. Designing a new Data Warehouse system for Analytics
6. Implementing real-time data analytics with streaming data
7. Creating datasets for an ML/AI use case, and feature engineering

Do they sound scary?

Let’s break the process down to ease into it. You can find more such use cases. The important thing is to start with one use case and then align it with your interests.

So, let us start.

Define the Business Problem

For a beginner, knowing the technical details is enough, but for seniors, domain knowledge is almost as important. Try to define a project in the business area you are familiar with. Create a sample business scenario to define your project.

Let’s try to define one with the Banking/lending business (since my experience lies in the lending domain).

“Great Lending Company is a lending company trying to serve the unbanked. It provides credit to people who do not have access to regular banks.

These people might not have great credit scores, but they still need access to capital. Since they can’t get loans anywhere else, they can be charged a higher premium.

But the loans carry higher risk, and since the borrowers’ credit scores are not great, you need to rely on more data to minimize that risk. However, you cannot run a brick-and-mortar business with a huge number of employees, as the risk is so high.”

So Great Lending Company wants to implement the below business processes:

  1. Risk modelling of customers.
  2. Automated loan origination.
  3. Reduced dependency on manual effort.
  4. Minimized credit loss through proper servicing of the loans.
  5. Prediction of likely defaulters and tracking them for EMI payments.

Define Data Requirements

Now let us understand the data requirements to support the business problem. To enable automated processing they need to collect a lot of data.

  1. Customer personal information (structured)
  2. Customer credit details (semi-structured)
  3. Customer employment info (structured)
  4. Location via GIS data (semi-structured)
  5. Document data (unstructured)

As a Data Engineer for Great Lending Company, we should be able to achieve the goals below:

  1. Ingest all the above types of data
  2. The ingestion process should be automated
  3. It should be scalable and robust to minimize data loss
  4. We need to cleanse and transform the data for business needs.
  5. The data should be available for reporting within the agreed SLA.

Incremental Project Design

Now that we have broken the problem statement into data requirements, let’s break them down further and create an incremental project plan. We will use Agile principles to ensure we deliver value with incremental iterations.

Sprint 1: Building a Simple Data Ingestion Pipeline

Let us start by building a very basic data pipeline for our business. This is more of a proof of concept, and we can quickly create an MVP (minimum viable product) of the data pipeline for Great Lending Company.

Data Engineering Pipeline. Image by the author.

Let’s break this down into smaller tasks:

  1. Identify Data Sources: In this case, we have 3 structured, 2 semi-structured and 1 unstructured data sources. Let’s assume the sources are 3 SQL databases, 1 NoSQL database, a shared network path for documents, and an API for credit data. The variety of sources will determine our data ingestion tool/technology.
  2. Choose a Data Ingestion Tool: This is a very critical component and needs to be chosen based on existing skill sets, the requirements of the data sources, etc. However, for the sake of simplicity while learning, let’s use basic Python/Scala for ingestion. We can experiment with different tools in an actual project environment.
  3. Set up the Infrastructure Manually: Install Python/Scala, any dependencies, and any infrastructure needed (shared folder/cloud storage/HDFS storage).
  4. Extract Data: Write Python/Scala code to connect to the different sources and extract data from them (a minimal sketch follows this list).
  5. Load to Data Lake: Create a RAW/Bronze folder in either your local file system, a cloud data lake, or HDFS storage. This will be treated as the data lake. Connect to the folder and copy data from the source into the data lake.
  6. Data Transformation: Create a silver/refined folder in your data lake. This is the second layer of your data transformation. Implement basic transformations like deduplication, data cleansing, merging into a single source of truth, etc.
  7. Load to Destination (Warehouse/Gold Layer): This is the presentation-ready data layer. Ideally, data is loaded into a data warehouse with a star schema/data mart/data vault schema. Data modelling is implemented in this layer. You can also use the relatively new Delta Lake/Iceberg table formats here.
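
To make this concrete, here is a minimal Python sketch of the extract → raw → silver flow, assuming a SQLite database stands in for one of the SQL sources, local folders stand in for the bronze and silver layers, and pandas with a Parquet engine is installed. Table and folder names are illustrative, not part of the design above.

```python
# A minimal sketch of the Sprint 1 flow, assuming a SQLite source and local
# folders standing in for the bronze (raw) and silver (refined) layers.
# Table, file, and folder names are illustrative.
import sqlite3
from pathlib import Path

import pandas as pd

RAW = Path("datalake/raw")        # bronze layer
SILVER = Path("datalake/silver")  # refined layer


def extract(db_path: str, table: str) -> pd.DataFrame:
    """Extract a full table from the source database."""
    with sqlite3.connect(db_path) as conn:
        return pd.read_sql(f"SELECT * FROM {table}", conn)


def load_raw(df: pd.DataFrame, table: str) -> Path:
    """Land the extracted data as-is in the raw layer."""
    RAW.mkdir(parents=True, exist_ok=True)
    path = RAW / f"{table}.parquet"
    df.to_parquet(path, index=False)
    return path


def transform_to_silver(table: str) -> Path:
    """Basic cleansing: deduplicate and drop fully empty rows."""
    df = pd.read_parquet(RAW / f"{table}.parquet")
    df = df.drop_duplicates().dropna(how="all")
    SILVER.mkdir(parents=True, exist_ok=True)
    path = SILVER / f"{table}.parquet"
    df.to_parquet(path, index=False)
    return path


if __name__ == "__main__":
    load_raw(extract("source.db", "customers"), "customers")
    transform_to_silver("customers")
```

In a real project the structure stays the same; only the connectors and storage locations change.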

Congratulations! You have built your very first data pipeline entirely from scratch. Time for your favourite ice cream.

You can learn about different ELT/ETL tools while building this, e.g. Fivetran, Airbyte, Azure Data Factory. You will also learn about transformation engines like Spark and MapReduce.

This whole process should not take you more than a week. In the second week, let’s add some more complexity to the pipeline.

Sprint 2: Adding Idempotency to the Pipeline

Ok, you just said you were adding some complexity, but now you are introducing a completely new term. What does idempotency even mean?

Let me quote Start Data Engineering:

“Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application.” — Wikipedia

Defined as

f(f(x)) = f(x)

In the data engineering context, this can come to mean that running a data pipeline multiple times with the same input will always produce the same output.

  1. Implement Tracking: A unique identifier or tracking mechanism can be used to recognize and ignore repeated operations. This ensures that even if an operation is retried, it either will not run again or will produce the same results. Patterns like delete-write, write-audit-publish, or compare-and-merge can be used (a sketch of the delete-write pattern follows this list). More on this later.
  2. Deduplication: When a data pipeline is accidentally rerun, or rerun on failure, data might get duplicated. Ensure that no data duplication happens even on a rerun. The patterns in point 1 help here too.
  3. Catch Up Missed Data: The pipeline should be able to catch up on data for missed days. One way to implement this is by logging what has already been loaded and only loading new data created since.
  4. Backfill Data: Data pipelines should be parameterized, so that if a specific day’s data needs to be reloaded, the pipeline can do so without any issues and without duplicating data. The patterns in point 1, along with the logging in point 3, help achieve this.
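
Here is a minimal sketch of the delete-write pattern from point 1, assuming the silver layer is laid out as one folder per load date. The paths are illustrative, and extract_for_date is a hypothetical helper that pulls a single day’s data from the source.

```python
# A minimal sketch of the delete-write pattern: each run owns a load_date
# partition, deletes it, and rewrites it, so reruns and backfills never
# duplicate data. Paths are illustrative; extract_for_date is hypothetical.
import shutil
from pathlib import Path

import pandas as pd

SILVER_CUSTOMERS = Path("datalake/silver/customers")


def write_partition(df: pd.DataFrame, run_date: str) -> Path:
    """Idempotent write: delete the run_date partition, then rewrite it."""
    partition = SILVER_CUSTOMERS / f"load_date={run_date}"
    if partition.exists():
        shutil.rmtree(partition)  # delete ...
    partition.mkdir(parents=True)
    df.to_parquet(partition / "data.parquet", index=False)  # ... then write
    return partition


# Rerun or backfill: the same call with the same input yields the same output.
# write_partition(extract_for_date("2023-08-01"), "2023-08-01")
```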

Check this article from Daniel Beach on idempotency in the data pipeline.

Now you have not only created a data pipeline, you have also implemented one of the key design principles for data pipelines. Time to relax, take a break, and come back fresh for the next challenge.

Sprint 3: Adding Unit Testing

Adding unit tests to your pipeline adds immense value. This ensures you don’t break things in production and don’t cause downtime for the business.

  1. Write Testable Code: Use programming best practices along with functional programming to create small chunks of code as functions. Smaller modules of code ensure the functions are all testable.
  2. Write Test Cases: Define different test cases with high coverage. Define the raw (sample) input data and the expected output data, then write the test cases to match (a pytest example follows this list).
  3. Automate Testing: Use the CI/CD process to automatically execute the unit tests along with the notebooks.
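
A minimal pytest sketch of points 1 and 2, with a small clean_customers transformation defined inline to keep the example self-contained; in a real project it would be imported from the pipeline module and this file would live under tests/.

```python
# A minimal pytest sketch: a tiny transformation plus a test that compares
# the actual output against an expected DataFrame. Names are illustrative.
import pandas as pd
import pandas.testing as pdt


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Example transformation under test: dedupe and drop rows with no name."""
    return df.drop_duplicates().dropna(subset=["name"])


def test_clean_customers_removes_duplicates_and_null_names():
    raw = pd.DataFrame({"customer_id": [1, 1, 2], "name": ["Asha", "Asha", None]})
    expected = pd.DataFrame({"customer_id": [1], "name": ["Asha"]})

    result = clean_customers(raw).reset_index(drop=True)

    pdt.assert_frame_equal(result, expected)
```

Running pytest against this file executes the test automatically, which is exactly what the CI step in point 3 will do for you later.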

You can explore different unit testing frameworks, such as pytest, while building this.

Sprint 4: Creating multiple pipelines on the same principles

Creating a complete data analytics system often requires multiple tables to be ingested. One way to do this is to create a separate pipeline for each table; another is to have a few dynamic pipelines that can read multiple tables. In either case, you will often use one pipeline as a template to create the others.

  1. Templatize Pipelines: Ensure all the components of a pipeline are written in a modular fashion, so that they can be turned into templates and reused across multiple pipelines (see the config-driven sketch after this list).
  2. Dependency Management: Any dependency should be defined as an independent configuration, class, or method so that it can be used across different pipelines.
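
A minimal sketch of a config-driven template, assuming the extract, load_raw, and transform_to_silver functions from the Sprint 1 sketch are importable from a module (pipeline_template is a hypothetical name). Each new table then only needs a config entry, not a new pipeline.

```python
# A minimal config-driven template: every table is described by a small
# config dict, and one loop runs the same templated pipeline for all of them.
# pipeline_template is a hypothetical module holding the Sprint 1 functions.
from pipeline_template import extract, load_raw, transform_to_silver

TABLES = [
    {"source_db": "source.db", "table": "customers"},
    {"source_db": "source.db", "table": "loans"},
    {"source_db": "credit.db", "table": "credit_scores"},
]


def run_all() -> None:
    """Run the same templated pipeline for every configured table."""
    for cfg in TABLES:
        df = extract(cfg["source_db"], cfg["table"])
        load_raw(df, cfg["table"])
        transform_to_silver(cfg["table"])


if __name__ == "__main__":
    run_all()
```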

Sprint 5: Orchestration and Workflow Management

All the pipelines need to be combined into an orchestrated workflow. Often a master pipeline is created that triggers the different child pipelines based on success/failure/completion criteria.

The orchestration of the pipelines is represented as a Directed Acyclic Graph (DAG). You can create a DAG using low-code/no-code tools or with tools like Airflow or Mage (a minimal Airflow sketch follows the list below).

  1. Master data workflow
  2. Dependency between child pipelines
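
Sticking with Airflow as one example, here is a minimal sketch of a master workflow with a dependency between two tasks, assuming Airflow 2.x and the hypothetical pipeline_template module from the earlier sketches.

```python
# A minimal Airflow DAG sketch of a master workflow, assuming Airflow 2.x.
# The refine task runs only after the ingest task succeeds.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipeline_template import extract, load_raw, transform_to_silver


def ingest_customers():
    load_raw(extract("source.db", "customers"), "customers")


def refine_customers():
    transform_to_silver("customers")


with DAG(
    dag_id="great_lending_master",
    start_date=datetime(2023, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_customers", python_callable=ingest_customers)
    refine = PythonOperator(task_id="refine_customers", python_callable=refine_customers)

    ingest >> refine  # dependency: refine waits for ingest to succeed
```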

Sprint 6: Automation of Data Pipeline

Once your master pipelines are ready and tested, it’s important to automate the pipelines.

  1. Scheduling: Schedule the pipeline at a specific time to ensure it is executed automatically as needed.
  2. Monitoring: Once the schedules are kicked off, there will be failures. It’s important to add a robust alert mechanism for failures, and logging can be used to store error messages. Consistently monitor the pipeline runs to make sure failures are caught and reruns happen in time.
  3. Automated Rerun and Retry: A rerun can happen in a couple of ways. One is when a support/on-call engineer sees the error, analyses the root cause, and then decides to rerun the pipeline. The other is to automate the rerun: based on the error message, or on a specific set of rules, the pipeline can be rerun automatically upon failure, with a defined number of retry attempts (see the retry settings in the sketch after this list).
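
A minimal sketch of scheduling, automated retries, and failure alerting, again assuming Airflow 2.x. The notify_failure callback is a placeholder; a real project might send a Slack, PagerDuty, or email alert instead.

```python
# A minimal sketch of scheduling, retries, and failure alerting in Airflow 2.x.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_failure(context):
    """Called by Airflow when a task finally fails; swap in a real alert."""
    print(f"Pipeline task failed: {context['task_instance'].task_id}")


def run_pipeline():
    pass  # call the actual ingestion/transformation code here


default_args = {
    "retries": 3,                           # automatic retry attempts
    "retry_delay": timedelta(minutes=10),   # wait between retries
    "on_failure_callback": notify_failure,  # alert once retries are exhausted
}

with DAG(
    dag_id="great_lending_scheduled",
    start_date=datetime(2023, 8, 1),
    schedule_interval="@daily",  # runs automatically every day
    default_args=default_args,
    catchup=True,                # lets Airflow catch up on missed days
) as dag:
    PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)
```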

After Sprint 6 you already have a quality deliverable that implements best practices and is tested to ensure it delivers correct results. You have created all the pipelines needed for a complete analytics solution and automated the data loading process. You can now run the pipelines multiple times and observe any recurring failures or patterns in failures.

In the next few sprints let’s try to add more functionality to ensure no recurring failures are impacting our pipelines.

Azure Data Engineering Pipeline. Image by the author.

Sprint 7: Data Quality and Validation

Data quality is an important aspect of establishing the reliability of the data. Businesses will only use the data products we create if they can trust the data and insights. Hence it’s important to implement a Data Quality Framework within the data pipeline.

Here are some very basic tasks you can perform to build a basic Data Quality Framework in an incremental fashion. Please note that there are plenty of nuances in an advanced project; however, these tasks introduce the basic concepts.

  1. Define Data Quality Metrics and Rules: Define all the data quality checks and rules to be followed. Some example validation rules: phone numbers should be exactly 10 digits, email IDs should follow the xxxx@yyy.com format, the customer’s name cannot be null, etc.
    Along with the data quality rules, define data quality metrics too; for example, the accuracy of customer addresses needs to be at least 90%, and no more than 5% of the columns can be entirely null (a sketch of such rules follows this list).
  2. Automated Data Validation: Based on the data quality rules, automated validations need to be written in code. These validations execute once the data is loaded or a transformation is completed. They ensure data integrity, consistency, and accuracy. Data validation should also be added while extracting data from the source.
  3. Data Quality Monitoring and Alerts: The results of the automated data quality validations need to be monitored for data remediation. Alerts need to be in place to flag any failed DQ validation. If there are data issues in the source itself, the upstream teams need to be alerted. Data that fails validation can be written to error logs and removed from the data load process.
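
A minimal sketch of the rules and metrics from point 1, written as plain pandas checks. Column names and thresholds are illustrative; frameworks like Great Expectations implement the same idea at scale.

```python
# A minimal sketch of rule-based data quality checks on a pandas DataFrame.
# Column names and thresholds are illustrative only.
import pandas as pd


def validate_customers(df: pd.DataFrame) -> dict:
    """Return rule name -> pass/fail so failures can be monitored and alerted."""
    return {
        "phone_is_10_digits": df["phone"].astype(str).str.fullmatch(r"\d{10}").all(),
        "email_has_valid_format": df["email"].astype(str)
            .str.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+").all(),
        "customer_name_not_null": df["name"].notna().all(),
        # Metric-style rule: at most 5% of the columns may be entirely null.
        "few_fully_null_columns": df.isna().all().mean() <= 0.05,
    }


if __name__ == "__main__":
    sample = pd.DataFrame(
        {"name": ["Asha"], "phone": ["9876543210"], "email": ["asha@example.com"]}
    )
    failed = [rule for rule, ok in validate_customers(sample).items() if not ok]
    print("Failed checks:", failed)
```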

Sprint 8: Continuous Integration and Continuous Deployment (CI/CD)

Version control for collaboration and CI/CD for DevOps are extremely important aspects of Data Engineering. CI/CD in a data pipeline is considered part of DataOps, and you need to make yourself familiar with the concepts and at least one tool.

  1. Version Control and Branching: It’s almost a necessity to version your code using a version control tool (Git). A proper branching strategy also ensures smooth collaboration between developers on the team.
    So, add all your code to Git and use a proper branching strategy to check out, make changes, and merge your code.
  2. Build Automation: Once you are done with your changes, implement build automation so that whenever you make a change, your project is built automatically and all unit tests are executed automatically. Ensure all the DAGs are also built in this phase.
    Use CI/CD tools like Azure DevOps or GitHub Actions to create automatic builds for your data pipeline.
  3. Automated Deployment: Once proper branching and automatic builds are in place, consider automatic deployment to higher environments. In a personal project you might not bother about different environments, but in an actual project you will develop and unit test code in a dev environment, while code for all higher environments (test/stage/prod) is auto-deployed using CI/CD pipelines. The pipeline should be made intelligent enough to handle any environment-specific configuration using variables.

Sprint 9: Infrastructure as Code (IaC)

An important part of the CI/CD pipeline is deploying all infrastructure automatically. Infrastructure as Code (IaC) is a relatively new but must-have skill for Data Engineers.

  1. Define Infrastructure as Code: All your infrastructure requirements need to be defined as code or configuration files (e.g. YAML). IaC tools can then auto-deploy the infrastructure from those files.
  2. Version Control for IaC: Just like the pipeline code, the IaC code and configuration files need to be tracked in a version control system like Git.
  3. Automate Provisioning: Use tools like Terraform, Ansible, or Chef to automate the provisioning of different infrastructure: servers/VMs/software/cloud components, etc. These tools read the configuration files and provision the required infrastructure.

By this stage, you have a complete production-grade end-to-end pipeline. Building this whole project incrementally will keep you engaged throughout your learning journey.

But what’s the way forward?

Sprint 10: Scalability and Optimization

Any software spends most of its time in operation rather than in development. Like all software, a data pipeline takes only months to build but runs for years in an actual production environment.

Once the pipelines are moved to production, new data issues crop up; some might need hotfixes, and some might even need redevelopment.

However, one issue that’s almost certain to come up is performance bottlenecks. Our data in dev is too clean and often only a small fraction of the size of the production data. No matter how much validation and optimization is done during development and testing, production data and infrastructure are bound to create performance issues.

This is where your skills in improving performance, optimizing load times, and scaling for data size will come in handy. Hence, as a last step, start adding optimization and scalability to your pipeline:

  1. Scaling Strategies: Plan for handling increasing data volumes by optimizing your pipeline’s scalability. The project structure should be modular so it can be reused and scaled to any data size (one simple tactic is sketched after this list).
  2. Performance Tuning and Optimization: Identify bottlenecks and optimize your pipeline for better performance.
  3. Cluster Management: If you are running your pipelines on a cluster, ideally the cluster should spin up and run only during the data load and transform process. When no job is running, suspend the cluster to save on costs. Ideally, create an automated process to suspend clusters as soon as the data load process has completed. There are other cluster management activities you can attempt too.
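
One simple scaling tactic, sketched below: extract the source in memory-bounded chunks instead of loading everything at once, so the same pipeline code survives growing data volumes. The chunk size and table name are illustrative.

```python
# A minimal sketch of chunked extraction: the source table is yielded in
# memory-bounded chunks instead of being loaded into memory all at once.
import sqlite3

import pandas as pd


def extract_in_chunks(db_path: str, table: str, chunksize: int = 100_000):
    """Yield the source table chunk by chunk instead of all at once."""
    with sqlite3.connect(db_path) as conn:
        yield from pd.read_sql(f"SELECT * FROM {table}", conn, chunksize=chunksize)


# for chunk in extract_in_chunks("source.db", "customers"):
#     process each chunk (e.g. append it to a dated partition in the raw layer)
```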

In a nutshell

In summary, build your data engineering projects incrementally,

  1. Start with an MVP pipeline
  2. Add idempotency, unit testing, orchestration
  3. Build code and infra automatically
  4. Use CI/CD for auto-deployment
  5. Monitor, test, and optimize

This practical yet incremental approach will help you learn different concepts as you go along. The small incremental steps will keep giving you an adrenaline boost whenever you strike a task off the list.

Reference:

  1. Project Funnel
  2. How to Become a Data Engineer
  3. Breaking learning plateau
  4. Idempotency in a data pipeline
  5. 7-end-to-end-data-engineering-projects
