Roadmap for Data Engineering 2023

Darshil Parmar
7 min read · Jan 20, 2023


How I Would Learn Data Engineering 2023 (If I could start over)

Data Engineering in 2023

Starting a career in data engineering can be overwhelming with so many different tools and technologies available in the market.

Big Data Landscape — https://mattturck.com/data2021/

It’s common to have questions like, “Should I first learn Databricks or Snowflake? Should I focus on Airflow or Hadoop?”

In this blog post, I will walk you through the resources and skills you need to become a data engineer, from the basic level to the advanced level.

I have divided the skills into three levels:

  1. For people who are completely new to this field and want to switch their career into data engineering from other fields.
  2. For people who know some basics and want to know how to move forward.
  3. For people who have some experience and want to grow in their careers.

Section 1: Exploring the Unknown

Are you looking to switch your career to data engineering, but feeling overwhelmed by the number of tools and technologies available? You’re not alone. Many people find themselves in the same position, whether they’re working in a non-tech job, are a student or fresher, or are working in a different tech job and looking to switch.

If you fall into any of these categories, the first thing you need to do is master your computer science fundamentals.

If you’re completely new to this field, you need to understand the basic concepts and terminology used in computer science before diving into data engineering.

A great resource for this is the series available on YouTube provided by Harvard’s CS50.

https://www.youtube.com/watch?v=IDDmrzzB14M&list=PLhQjrBD2T380F_inVRXMIHCqLaNUd7bN4

By watching this video series, you’ll gain a basic understanding of computer science without needing a degree or spending months learning the fundamentals.

Once you’ve cleared your computer science fundamentals, you can move on to the next step: learning the skills required for data engineering.

There are three fundamental skills required of data engineers:

  1. Programming languages: As a data engineer, you’ll be writing a lot of transformation jobs, deploying scripts, validating and testing them, and for that, you need to master one programming language. The three popular choices are Java, Scala, and Python. If you’re a beginner, Python is a great option as it’s easy to learn and understand.
  2. SQL: Structured Query Language is the king of the data industry. Whether you’re working as a data analyst, data engineer, or data scientist, you’ll find yourself using SQL frequently. SQL is the way you communicate with a relational database, and it’s important to understand how to use it to select, insert, update, and delete data.
  3. Linux Commands: Most data engineering tasks are done on remote machines or cloud platforms, and these machines generally run on Linux operating systems. It’s important to understand how to work with these machines and understand basic Linux commands.
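To make the four basic SQL operations concrete, here is a minimal sketch that runs them from Python using the standard-library sqlite3 module. The `employees` table and its rows are made up for illustration, and SQLite stands in for whatever relational database you end up using.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, team TEXT)"
)

# INSERT: add rows (the names and teams are hypothetical sample data).
conn.executemany(
    "INSERT INTO employees (name, team) VALUES (?, ?)",
    [("Asha", "data"), ("Ben", "web"), ("Chen", "data")],
)

# UPDATE: change an existing row.
conn.execute("UPDATE employees SET team = ? WHERE name = ?", ("data", "Ben"))

# DELETE: remove a row.
conn.execute("DELETE FROM employees WHERE name = ?", ("Chen",))

# SELECT: read the remaining rows back.
rows = conn.execute(
    "SELECT name, team FROM employees ORDER BY name"
).fetchall()
print(rows)

conn.close()
```

The same four statements work in most relational databases, although small syntax details can differ between them.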

Section 2: Building a strong foundation

At this stage, your goal should be to learn the minimum skill set required for data engineering and to kickstart your career as a data engineer.

You don’t have to spend time learning every skill or tool available in the market; at this stage, focus only on the most important and in-demand skills for data engineering.

In this stage, we will focus on building a strong foundation for data engineering.

The first fundamental skill you need to focus on is understanding data warehouses.

There are two parts to this:
— Learning about data warehouse fundamentals
— Learning about tools such as Snowflake or BigQuery.

Data warehouse fundamentals generally include understanding OLTP (and how it differs from OLAP), extract, transform, load (ETL), and data modeling concepts such as fact and dimension tables.
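As a rough illustration of fact and dimension tables, the sketch below builds a tiny star schema in SQLite from Python. The table names (`dim_product`, `fact_sales`), columns, and rows are hypothetical; a real warehouse such as Snowflake or BigQuery would hold far more data, but the modeling idea is the same: the fact table stores measurable events, and the dimension table stores the descriptive attributes you group by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: descriptive attributes of each product.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );

    -- Fact table: one row per sale, pointing at the dimension by key.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity   INTEGER,
        amount     REAL
    );

    INSERT INTO dim_product VALUES
        (1, 'Keyboard', 'Accessories'),
        (2, 'Monitor',  'Displays');

    INSERT INTO fact_sales (product_id, quantity, amount) VALUES
        (1, 2, 50.0),
        (1, 1, 25.0),
        (2, 1, 200.0);
    """
)

# A typical analytical query: join facts to a dimension and aggregate.
revenue_by_category = conn.execute(
    """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
print(revenue_by_category)

conn.close()
```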

If you prefer learning from a book, you can read “The Data Warehouse Toolkit” by Ralph Kimball.

https://amzn.to/3kqXaHh

This is one of the best books written on data warehouses.

Once you’ve learned data warehouse fundamentals, you can apply what you’ve learned to a specific tool.

There are many different data warehouses available in the market, such as Snowflake, BigQuery, and Redshift.

I recommend learning Snowflake, as its demand is increasing day by day.

In addition to understanding data storage, you also need to understand data processing frameworks.

There are two main approaches:
— Batch processing: Processing accumulated data in batches on a schedule, such as processing the previous day’s data once or twice a day.
— Real-time processing: Processing data as it arrives, in real time.
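The difference between the two can be sketched in plain Python. This is only a toy illustration, not how Spark or Kafka work internally; the event data and the `batch_totals` and `StreamingTotals` names are made up for the example.

```python
from collections import defaultdict

# Hypothetical purchase events: (user, amount).
events = [("alice", 10), ("bob", 5), ("alice", 7)]


def batch_totals(all_events):
    """Batch style: wait until all the data is collected, then process it once."""
    totals = defaultdict(int)
    for user, amount in all_events:
        totals[user] += amount
    return dict(totals)


class StreamingTotals:
    """Real-time style: update the result every time a single event arrives."""

    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, user, amount):
        self.totals[user] += amount
        # The up-to-date total is available immediately after each event.
        return self.totals[user]


batch_result = batch_totals(events)

stream = StreamingTotals()
running = [stream.on_event(user, amount) for user, amount in events]

print(batch_result)
print(running)
```

Both approaches end with the same totals. The batch version produces them once at the end, while the streaming version has a current answer after every event.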

For batch processing, most companies use Apache Spark. It’s an open-source framework for data processing.

You can start by learning Apache Spark fundamentals, and then learn a platform that runs Apache Spark, such as Databricks, AWS EMR, GCP Dataproc, or any other tool you find in the market.

My suggestion is to practice with Spark on Databricks and use PySpark (Python) as the language.

For real-time processing, we have frameworks and tools such as Apache Kafka, Apache Flink, and Apache Storm. You can pick one and learn about it.

The way we’re learning is by breaking down different problems into smaller chunks.

First, we focus on learning the fundamentals, and then we learn one in-demand tool to which you can apply that fundamental knowledge.

The third skill you need to master as a data engineer is learning about cloud platforms.
There are three main choices available:
— Microsoft Azure
— Amazon Web Services (AWS)
— Google Cloud Platform (GCP)

Top Cloud Platforms

I started my career with AWS, but you can pick any cloud platform because once you learn one, it will be easier to master the others. The fundamental concepts of cloud platforms are similar, with just slight differences in the user interface, cost, and other factors.

In data engineering, you’ll need to create data pipelines to process your data. Data pipelines, often built as ETL (extract, transform, load) pipelines, extract data from a source such as a relational database, apply transformations and business logic, and then load the data into a target location. To manage these operations, you’ll need to learn about workflow management tools.
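Here is a minimal sketch of the extract, transform, and load steps, using two in-memory SQLite databases as the source and the target. The `orders` and `clean_orders` tables, their columns, and the cleaning rules are assumptions chosen for illustration; real pipelines apply whatever business logic the use case requires.

```python
import sqlite3


def extract(source):
    """Extract: read raw rows from the source database."""
    return source.execute(
        "SELECT id, email, amount_cents FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: drop invalid orders, normalize emails, convert cents to units."""
    cleaned = []
    for order_id, email, amount_cents in rows:
        if amount_cents is None or amount_cents <= 0:
            continue  # Hypothetical business rule: skip empty orders.
        cleaned.append((order_id, email.strip().lower(), amount_cents / 100))
    return cleaned


def load(target, rows):
    """Load: write the cleaned rows into the target table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS clean_orders "
        "(id INTEGER PRIMARY KEY, email TEXT, amount REAL)"
    )
    target.executemany("INSERT INTO clean_orders VALUES (?, ?, ?)", rows)
    target.commit()


# Set up a small source database with made-up sample data.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, email TEXT, amount_cents INTEGER)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " A@Example.com ", 1250), (2, "b@example.com", 0), (3, "c@example.com", 900)],
)

target = sqlite3.connect(":memory:")

# Run the pipeline end to end.
load(target, transform(extract(source)))

loaded = target.execute(
    "SELECT id, email, amount FROM clean_orders ORDER BY id"
).fetchall()
print(loaded)
```

Keeping each step in its own function makes the steps easy to test separately and easy to hand to a scheduler later.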

One popular choice is Apache Airflow.

Airflow is an open-source workflow management tool that allows you to create, schedule, and monitor data pipelines. It’s widely used in the industry and has a large user community. By learning Airflow, you’ll be able to create data pipelines and automate the ETL process, making your job as a data engineer much easier.
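To show the core idea behind a workflow manager, the sketch below runs a set of tasks in dependency order using Python's standard-library graphlib module (Python 3.9+). This is not Airflow's API; the task names and the `run_log` list are hypothetical. It only illustrates the concept of declaring which task depends on which and letting a tool work out the execution order.

```python
from graphlib import TopologicalSorter

run_log = []


def extract():
    run_log.append("extract")


def transform():
    run_log.append("transform")


def load():
    run_log.append("load")


def report():
    run_log.append("report")


tasks = {
    "extract": extract,
    "transform": transform,
    "load": load,
    "report": report,
}

# Each task maps to the set of tasks that must finish before it can start.
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields the task names so that dependencies always come first.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()

print(run_log)
```

A tool like Airflow builds on the same idea and adds scheduling, retries, and monitoring on top of it.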

Section 3: Modern Data Stack and Advanced Level Skills

As a data engineer, you will come across many different tools and approaches in the market.

It’s important to stay up to date with them. On top of that, you also need to learn how to design the entire data infrastructure, how to manage and scale the system, and how to master advanced skills.

In this section, we will focus on learning advanced-level skills required for data engineering.

The first thing I recommend is exploring the Modern Data Stack (MDS).

The MDS is a collection of tools; you can learn about each one and understand its core use case.

One tool that I highly suggest exploring is dbt (data build tool), as many companies are using it and it’s gaining popularity in the market.

However, it’s important not to get attached to too many tools; just understand the core use case of each one.

Another important aspect is understanding security, networking, deployment, and other related topics.

You can also learn about Docker or Kubernetes, which are useful when deploying data pipelines in production.

I recommend reading the books:
— Designing Data-Intensive Applications

https://amzn.to/3XijdOJ

— Fundamentals of Data Engineering

https://amzn.to/3wdBuAU

Additionally, reading customer case studies on platforms such as AWS and GCP can give you a better understanding of how to use these tools in real-world scenarios.

I have created a detailed video on this topic, along with a complete document covering courses and projects.

Hope you found this helpful and don’t forget to applaud :)

