Roadmap for Data Engineering 2024

Darshil Parmar · Published in DataVidhya · 9 min read · Jan 15, 2024

Become a Modern Data Engineer by following this guide in 2024

Data engineering is one of the fastest-growing fields, and as data grows daily, companies need data engineers who can manage and process that data at scale.

AI tools need data engineers for better output and training accuracy!

The problem is that there are so many tools available in the market; just look at this big data landscape.

Big Data Landscape — https://mattturck.com/MAD2023/

It’s always confusing to figure out where to get started, and even if you do get started, you’ll probably get lost along the way.

This blog is your ultimate guide to becoming a data engineer: a fully focused roadmap, with 8 different projects, so that you don’t get lost trying to learn multiple things at once.

I’ve worked as a data engineer for five years: at a start-up, as a freelancer, and remotely for big companies like Wayfair. I’ve seen how data engineering is executed at different companies, so I am going to combine all of my experience and give you a concise roadmap.

Let’s get started!

Section 1: Call To Adventure

You might be at a different stage of your journey: completely new, or already familiar with a few tools and wondering where to go next.

The first thing I always suggest people start their journey with is clearing up computer science fundamentals. You don’t need a computer science degree or 3–6 months of study; you just need an overview of these things.

CS fundamentals don’t change. Every new technology that comes into the market is built on core CS fundamentals, so if you don’t have a tech background, start here.

CS fundamentals include understanding how a computer understands code, how code compiles, basic data structures and algorithms, the building blocks of a programming language, etc.

Here’s the best FREE resource to learn the basics of Computer Science: CS50 By Harvard

Section 2: Building Foundation

After this, you take one step forward to learn about data engineering; this is where you’ll build your foundational skills.

Two skills are essential to learn (a small sketch combining both follows this list):

  1. Python: As a data engineer, you’ll be writing a lot of transformation jobs, deploying scripts, validating and testing them, and for that, you need to master one programming language. The three popular choices are Java, Scala, and Python. If you’re a beginner, Python is a great option as it’s easy to learn and understand.
  2. SQL: Structured Query Language is the king of the data industry. Whether you’re working as a data analyst, data engineer, or data scientist, you’ll find yourself using SQL frequently. SQL is the way you communicate with a relational database, and it’s important to understand how to use it to select, insert, update, and delete data.
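To make this concrete, here is a minimal sketch of the two skills working together: Python for validating and transforming records, and SQL for querying them. The data and table names are hypothetical, and it uses only Python’s built-in sqlite3 module, so it runs as-is:

```python
# Hypothetical example: Python for a transformation step, SQL for querying
# the result. Uses only the standard library (sqlite3).
import sqlite3

# Raw "orders" as they might arrive from an application export
raw_orders = [
    {"id": 1, "amount": "19.99", "status": "shipped"},
    {"id": 2, "amount": "5.00", "status": "cancelled"},
    {"id": 3, "amount": "42.50", "status": "shipped"},
]

# Python: validate and transform (cast strings to floats, drop cancellations)
clean_orders = [
    (o["id"], float(o["amount"]))
    for o in raw_orders
    if o["status"] == "shipped"
]

# SQL: load into a relational table and aggregate
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", clean_orders)

count, revenue = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(f"{count} shipped orders, revenue {revenue:.2f}")
```

In real pipelines the SQL would run against a production database or warehouse, but the select/insert/aggregate pattern is the same.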

You can learn these from anywhere (YouTube, blogs, courses). If you are interested in learning Python & SQL for data engineering, I have published in-depth courses on these topics with an end-to-end project.

You will build a Spotify Data Pipeline using AWS

You can enroll in the course here (recorded videos with lifetime access)

  1. Python for Data Engineering: https://datavidhya.com/courses/python
  2. SQL for Data Engineering: https://datavidhya.com/courses/sqldata

Currently, I am running 50% off on these courses; do check them out. It took me 4–5 months of effort to build them :)

Doing this much will give you a strong foundation to start your journey as a data engineer. You need to focus on high-demand skills: there are hundreds of tools and technologies available, but you will focus on what most companies use, so that it’s easier for you to apply for jobs.

Section 3: Core Data Engineering Foundation

Now I am going to suggest something different: as you take your first steps into the core data engineering world, you will learn from both books and course videos.

You can’t watch videos all day or read books all day. If you decide to spend, say, 2 hours learning daily, you can watch videos for 1 hour and read a book for 1 hour. This way you will not get bored by only watching videos or only reading a book; you’ll always have something new to learn.

Get a copy of The Fundamentals of Data Engineering Book: https://amzn.to/3wdBuAU

Here is what you are going to do:

Get the book Fundamentals of Data Engineering by Joe Reis and Matt Housley (you can get a hard copy or just get an ebook online).

This is one of the best books for data engineers, so start reading it while you learn the other topics of data engineering from courses.

You will read this book in the background whenever you get bored of watching videos; it will keep you motivated and give you a strong foundational understanding of data engineering.

Get this book and just start reading it. You don’t have to finish it; just read for 30 minutes daily and you’ll get ahead of most of the people in the market.

This part is boring, so very few people will do it, which means the competition is much lower :)

So while you read this book, continue with your learning journey through the important tools.

Learn Data Warehouse

Everything you do as a data engineer is likely to end up stored in a data warehouse. In the end, businesses want to run analytical queries to find insights, and data warehouses are designed for exactly this type of workload.

If you want to find answers such as

  • What was our revenue over the last 5 years?
  • Which product category sold the most this year compared to last year?

All of these questions can be answered easily if your data is stored in a data warehouse.

Learning data warehousing has 2 parts:

  1. Learning the core data warehouse fundamentals (which don’t change): OLAP vs. OLTP, dimension tables, Extract Transform Load (ETL), and ER modeling or dimensional modeling, i.e., understanding fact and dimension tables (see the sketch after this list)
  2. Learning about important tools that are available in the market: Snowflake, BigQuery, Redshift, Synapse Analytics
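To ground those fundamentals, here is a minimal star-schema sketch: one fact table and one dimension table, queried to answer a question like the category one above. The schema and data are made up for illustration, and it again uses sqlite3 so it runs anywhere:

```python
# Hypothetical star schema: a fact table (fact_sales) referencing a
# dimension table (dim_product). Dimensional modeling in miniature.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        sale_year  INTEGER,
        revenue    REAL
    );
    INSERT INTO dim_product VALUES (1, 'Electronics'), (2, 'Clothing');
    INSERT INTO fact_sales VALUES
        (1, 1, 2023, 1200.0), (2, 2, 2023, 300.0),
        (3, 1, 2024, 1500.0), (4, 2, 2024, 900.0);
""")

# A typical OLAP-style question: revenue per category per year
rows = conn.execute("""
    SELECT p.category, s.sale_year, SUM(s.revenue) AS total_revenue
    FROM fact_sales s
    JOIN dim_product p ON p.product_id = s.product_id
    GROUP BY p.category, s.sale_year
    ORDER BY p.category, s.sale_year
""").fetchall()

for category, year, total in rows:
    print(category, year, total)
```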

I suggest you start with Snowflake: it’s a modern data warehouse tool, and lots of companies are migrating to it.

Where can you learn all of this? I have created one of the most in-depth courses, Data Warehouse with Snowflake for Data Engineers: https://datavidhya.com/courses/datawarehouse

This course alone took me 2 months to prepare and no course in the market can come near this.

(At the end I will also suggest free resources, so don’t worry if you don’t want to purchase my courses.)

So at this point, you are learning data warehousing from the course and also reading Fundamentals of Data Engineering in the background, so keep reading the book!

Once you finish learning about data warehouses, it is time to learn about data processing.

This is the central part of data engineering. You get data from multiple places, like applications, web analytics, and sensors; all of this data arrives in different formats and at different frequencies, so you need proper tools to process it.

There are two main paradigms:

  • Batch processing: processing data in chunks on a schedule, such as running a job once or twice a day over the data that has accumulated.
  • Real-time processing: processing data as it comes in, in real time.

For batch processing, most companies use Apache Spark. It’s an open-source framework for data processing.

You can start by learning the Apache Spark fundamentals, and then learn a platform that runs Spark, such as Databricks, AWS EMR, GCP Dataproc, or any other tool you find in the market.

My suggestion is to practice Spark on Databricks and use PySpark (Python) as the language.
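Here is a minimal sketch of what a batch job looks like in PySpark (the file paths and column names are hypothetical; assumes the pyspark package is installed):

```python
# Minimal batch job sketch with PySpark: read raw data, transform, write out.
# File paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_sales_batch").getOrCreate()

# Extract: read a day's worth of raw CSV data
orders = spark.read.csv("raw/orders_2024-01-15.csv", header=True, inferSchema=True)

# Transform: keep completed orders and aggregate revenue per category
daily_revenue = (
    orders
    .filter(F.col("status") == "completed")
    .groupBy("category")
    .agg(F.sum("amount").alias("revenue"))
)

# Load: write the result as Parquet, ready to load into a warehouse
daily_revenue.write.mode("overwrite").parquet("processed/daily_revenue")

spark.stop()
```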

For real-time processing, we have frameworks and tools such as Apache Kafka, Apache Flink, and Apache Storm. You can pick one and learn about it.
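To give a taste of real-time processing, here is a minimal Kafka sketch using the kafka-python package (assumes a broker running at localhost:9092; the topic name and event fields are made up):

```python
# Minimal Kafka producer/consumer sketch using the kafka-python package.
# Assumes a broker at localhost:9092; the "clicks" topic is hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: send events as they happen
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clicks", {"user_id": 42, "page": "/pricing"})
producer.flush()

# Consumer: process events as they arrive
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # real-time processing logic would go here
```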


All of this data processing needs to happen in a specific order, in a sequence:

  1. First, read data from 3 sources
  2. Then aggregate it
  3. Then apply some logic operations and store the result

This needs to happen in the proper sequence, and for that we have many data pipeline or workflow orchestration tools available.

One of them is Apache Airflow, which was developed at Airbnb and later open-sourced; today it is used by most companies that do data engineering.
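Here is a minimal Airflow sketch of the three-step sequence above (the task bodies are stubs; assumes a recent Airflow 2.x, where the schedule parameter is available):

```python
# Minimal Airflow DAG sketch for the three-step sequence above.
# Task functions are stubs; assumes Airflow 2.4+.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def read_sources():
    print("reading data from 3 sources")

def aggregate():
    print("aggregating the data")

def transform_and_store():
    print("applying logic and storing the result")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    read = PythonOperator(task_id="read_sources", python_callable=read_sources)
    agg = PythonOperator(task_id="aggregate", python_callable=aggregate)
    store = PythonOperator(task_id="store", python_callable=transform_and_store)

    # The dependency chain enforces the order: read -> aggregate -> store
    read >> agg >> store
```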

Now we are processing big data, huge volumes that you can’t process or even store on your local PC; you need cloud computing for that. The 3 top cloud providers are AWS, Azure, and GCP.

If you already know one platform, then forget about the other clouds. They are all the same; a few things differ here and there, but they are alike on a fundamental level.

If you are new, then start with AWS. Simple! I don’t want to confuse you with multiple options, so I am going to give you one clear answer: AWS. If you would rather learn Azure, go for it; as long as you learn at least one cloud platform, you are safe!
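To give a flavor of what day-to-day cloud work looks like in code, here is a minimal AWS sketch using the boto3 library (assumes AWS credentials are already configured; the bucket and file names are hypothetical):

```python
# Minimal AWS sketch: upload a processed file to S3 with boto3.
# Assumes AWS credentials are configured (e.g., via `aws configure`);
# the bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Land a locally processed file in the data lake / staging area
s3.upload_file(
    Filename="processed/daily_revenue.parquet",
    Bucket="my-company-data-lake",
    Key="staging/daily_revenue/2024-01-15.parquet",
)

# List what's under the staging prefix to confirm the upload
response = s3.list_objects_v2(Bucket="my-company-data-lake", Prefix="staging/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```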

Section 4: Advanced Data Engineering

Here are some other topics you should keep an eye on.

Open Table Formats

One of the topics you will learn about in the book is the data lake: a centralized repository where you can store all of your data as-is and access it as per your requirements. The problem with data lakes was that they don’t support database features such as ACID transactions. To solve this problem, the concept of open table formats emerged, bringing a lot of database-like features on top of the data lake.

There are several tools for this, such as Apache Iceberg, Apache Hudi, and Delta Lake. The space has gained a lot of momentum recently, so staying up-to-date with the industry here is important.
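As a rough illustration (not a runnable recipe), here is what working with an open table format like Apache Iceberg looks like from Spark SQL. It assumes a Spark session already configured with the Iceberg runtime and a catalog named "local"; the table and column names are made up, and it will not run without that setup:

```python
# Sketch of using an open table format (Apache Iceberg) from Spark SQL.
# Assumes a Spark session configured with the Iceberg runtime and SQL
# extensions, plus a catalog named "local". Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg_demo").getOrCreate()

# Create an Iceberg table: it behaves like a database table on data lake files
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        event_id BIGINT,
        payload  STRING
    ) USING iceberg
""")

# ACID, row-level operations that plain data lake files don't give you
spark.sql("INSERT INTO local.db.events VALUES (1, 'signup'), (2, 'click')")
spark.sql("DELETE FROM local.db.events WHERE event_id = 2")

spark.sql("SELECT * FROM local.db.events").show()
```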

Data Observability

You might have hundreds of pipelines running on your cloud machines; if something breaks, how do you track it down and debug it? Data observability tools like Datadog help with that, giving you a complete picture of what’s happening in your pipelines.

Modern Data Stack

You can explore these tools, but don’t get attached to them. New tools always come into the market and go; just understand the core problem each one is trying to solve.

You can also learn about Docker or Kubernetes, which are useful when deploying data pipelines in production.

Additionally, reading customer case studies on platforms such as AWS and GCP can give you a better understanding of how to use these tools in real-world scenarios.

All of this should take around 6–8 months if you are consistent in your learning journey. I have explained this complete roadmap on my YouTube channel.

Here is the complete roadmap link with 8 different projects: https://datawithdarshil.notion.site/Data-Engineering-Roadmap-2024-4ce9c7c864ad4bd4a95276cf285a9344?pvs=4

Hope you found this helpful and don’t forget to applaud :)


Darshil Parmar · Data Engineering | Building @DataVidhya | YouTube (120k+)