Data Engineering 101

How to Become a Data Engineer: Complete Roadmap

A complete roadmap on how you can learn Data Engineering in 2022

Saikat Dutta
CodeX


In October 2012, HBR declared Data Scientist the sexiest job of the 21st century. For the following decade, that prediction did seem to hold true.

However, companies soon realized that without proper data infrastructure in place and without quality data, data science projects were bound to fail. This resulted in astronomical demand for people who could fix these issues, AKA Data Engineers (DE in short).

A lot of people have asked me on LinkedIn to guide them on how to become a Data Engineer.

So, you are here because you want to become a Data Engineer, but why? Let me answer that first.

Why Become a “Data Engineer” in 2022?

So, the supply of quality data engineers is extremely low at the moment, while demand is astronomical. And as basic economics tells you, when supply cannot match demand, prices are bound to go up.

“With great demand comes great rewards”

As per Glassdoor, AmbitionBox and Payscale, the average Data Engineer salary in India is ₹8–9 lakhs per annum. However, salaries range from ₹3–4 lakhs for freshers to upwards of ₹30 lakhs for people with 10+ years of experience.

More people are even considering moving from other data roles to a Data Engineer role. It's a great move, even if you are not in the data field yet.

Great, now that we have addressed the WHY, let us go deep into HOW.

What are the skills needed to become a Data Engineer?

Just like Data Science or Full Stack Developer roles, the Data Engineering role is multi-disciplinary. You need to learn a number of interrelated topics before you can become a great Data Engineer.

However, not everything is needed just to start or break into the role.

Note to beginners

Beginners shouldn’t feel overwhelmed by the huge set of tools and topics they need to learn.

There are several stages of learning involved, and as a beginner you should only concentrate on perfecting the fundamentals.

Once you feel comfortable, you can move on to the advanced topics; with time and experience, you will feel at home.

As noted above, I will divide the complete set of skills and subjects into Fundamentals, Advanced Topics & Good To Have.

Fundamentals

The base is the most important part of any building, and it's where any construction starts. Hence, it's important to build it well.

It's easy to get distracted. However, it's important to spend 3–4 months building the fundamentals.

Once this part is mastered the next phase of learning will be much easier.

Data Engineering Roadmap. How to become a Data Engineer?

Below are the fundamental topics to cover, in no specific order.

  1. Database Concepts:

Basic Database concepts, normalization, keys, constraints, database storage etc.
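To make keys and constraints concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are my own illustration, not from any specific course):

```python
import sqlite3

# In-memory database to demonstrate primary keys, foreign keys and constraints.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default

conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL CHECK (amount > 0)
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (100, 1, 49.99)")

# Violating the foreign key is rejected by the database, not by application code.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999, 10.0)")
except sqlite3.IntegrityError as e:
    print("Constraint rejected bad row:", e)
```

Playing with small experiments like this is a fast way to internalise why constraints matter before you meet them in a production warehouse.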

2. Programming

Learn basic syntax, file handling, connect databases, build APIs, and work with structured and unstructured data (XML, JSON).

Python on YouTube.

Java on Udemy
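As a tiny taste of the basics listed above, here is a Python sketch covering JSON parsing and CSV file handling with only the standard library (the sample data is made up; io.StringIO stands in for a real file):

```python
import csv
import io
import json

# Structured data: parse a JSON string into a Python dict
raw = '{"name": "Asha", "skills": ["SQL", "Python"]}'
record = json.loads(raw)
print(record["skills"])  # ['SQL', 'Python']

# File handling: write then read a small CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city"])
writer.writeheader()
writer.writerow({"id": 1, "city": "Pune"})
buf.seek(0)
rows = list(csv.DictReader(buf))
print(rows)  # [{'id': '1', 'city': 'Pune'}]
```

Note that everything read back from CSV is a string; deciding when and how to cast types is a surprisingly large part of real data engineering work.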

3. SQL

Basic data extraction, joining tables, keys and constraints, window functions, aggregate functions etc. Data Definition and Data Modification queries.

a. SQL tutorial on W3Schools

b. SQL from khan academy

c. Scenario-based Hands-on SQL series from Mentorskool.
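To illustrate aggregate functions, GROUP BY and HAVING from the list above, here is a runnable sketch using sqlite3 (the sales data is invented for the example; window functions are also worth practising once you are comfortable with this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 100), ('North', 250), ('South', 80), ('South', 10);
""")

# Aggregate + GROUP BY + HAVING: total sales per region above a threshold
rows = conn.execute("""
    SELECT region, SUM(amount) AS total, COUNT(*) AS n
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 100
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('North', 350.0, 2)]
```

The key distinction to internalise: WHERE filters rows before aggregation, while HAVING filters the aggregated groups.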

4. Data Warehouse and Data Modelling

Basic Data Warehouse Concepts, Data Modelling for Data Warehouse, Star-Snowflake schema, Facts and Dimension tables etc.
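A star schema is easier to grasp with a concrete example. Here is a minimal sketch in sqlite3 (the dimension and fact tables are my own toy illustration): descriptive attributes live in dimension tables, measures live in the fact table, and analysis queries join them back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold descriptive attributes
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- The fact table holds measures plus foreign keys to each dimension
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        qty         INTEGER,
        revenue     REAL
    );
    INSERT INTO dim_date VALUES (20220101, 2022, 1);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware');
    INSERT INTO fact_sales VALUES (20220101, 1, 3, 29.97);
""")

# A typical star-schema query: join the fact to its dimensions and aggregate
row = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key    = f.date_key
    GROUP BY p.category, d.year
""").fetchone()
print(row)
```

A snowflake schema simply normalises the dimensions further (e.g. splitting category out of dim_product into its own table).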

5. Cloud Fundamentals

Learn the basics of cloud computing: SaaS, PaaS and IaaS offerings, distributed computing, CapEx vs OpEx, elastic scalability, storage and compute in the cloud, and cloud data stacks.

6. Hadoop Eco-System & Spark

History of Hadoop; Hadoop 1, 2 and 3; HDFS, MapReduce, YARN, Sqoop, Hive, Pig, HBase, Oozie, ZooKeeper; Spark basics

Basic MapReduce programming with Python / Java

Spark with Python in Udemy, With Scala

Spark Dedicated Course
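Before touching a real cluster, it helps to see that MapReduce is just a pattern. Here is a pure-Python word count sketching the map, shuffle and reduce phases (the input lines are made up; a real job would distribute these phases across machines):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle: sort so identical keys are adjacent; Reduce: sum each group's counts
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in grp) for key, grp in groupby(pairs, key=itemgetter(0))}

lines = ["big data big pipelines", "data engineering"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(mapped)
print(counts)  # {'big': 2, 'data': 2, 'engineering': 1, 'pipelines': 1}
```

Once this clicks, Hadoop's MapReduce and Spark's transformations are recognisable as the same idea at scale.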

1st End2End project: At this point, you have all the required skills to create your first basic DE project. Concentrate on the below as you build it:

a. Scrape or collect free data from the web.

b. Convert the data into CSV / JSON and read the data using Python

c. Analyze and Cleanse the data using Python

d. Load the data into a Warehouse / DB server.
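The steps above can be sketched end to end in a few lines. This toy version (the weather data is invented; io.StringIO stands in for a downloaded file, and SQLite stands in for a warehouse) covers reading raw data, cleansing it, and loading it:

```python
import csv
import io
import sqlite3

# Step b: raw CSV as it might arrive from a scrape — note the whitespace and missing value
raw_csv = "city,temp_c\nPune, 31 \nDelhi,\nMumbai,29\n"

# Step c: cleanse — strip whitespace, drop rows with missing values, cast types
clean = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    temp = row["temp_c"].strip()
    if temp:
        clean.append((row["city"].strip(), float(temp)))

# Step d: load into a database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO weather VALUES (?, ?)", clean)
print(conn.execute("SELECT COUNT(*) FROM weather").fetchone()[0])  # 2
```

Your real first project will be bigger, but the shape — extract, cleanse, load — stays exactly the same.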

You cannot miss the Zoomcamp series. It is hands down the best free course on Data Engineering I have found.

Advanced topics

  1. ETL using Spark (Python API or SQL API)

Creating ETL code in Python / Scala, PySpark, Spark SQL, Spark Context, Spark Jobs, Spark submit, Optimizing Spark Jobs.
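As a rough sketch of what this looks like in practice (assuming pyspark is installed; the file paths and column names are hypothetical), here is a small batch ETL job showing both the DataFrame API and the equivalent Spark SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV into a DataFrame
orders = spark.read.csv("raw/orders.csv", header=True, inferSchema=True)

# Transform: filter bad rows and aggregate with the DataFrame API...
daily = (orders
         .filter(F.col("amount") > 0)
         .groupBy("order_date")
         .agg(F.sum("amount").alias("total_amount")))

# ...or express the same transform in Spark SQL
orders.createOrReplaceTempView("orders")
daily_sql = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders WHERE amount > 0 GROUP BY order_date
""")

# Load: write the result as Parquet
daily.write.mode("overwrite").parquet("curated/daily_orders")
spark.stop()
```

In a real deployment you would package this script and launch it on a cluster with spark-submit; optimisation (partitioning, caching, join strategies) comes later.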

2. Data Processing Libraries / Constructs

RDDs, Datasets, DataFrames etc.; NumPy, Pandas

Different file types (CSV, JSON, Avro, Protocol Buffers, Parquet, and ORC).
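The text-based formats are easy to explore with the standard library alone (the event records here are made up). Comparing the same records as CSV and JSON Lines makes the trade-offs visible:

```python
import csv
import io
import json

records = [{"id": 1, "event": "click"}, {"id": 2, "event": "view"}]

# CSV: compact, but schema lives only in the header and all values become strings
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "event"])
writer.writeheader()
writer.writerows(records)

# JSON Lines: one self-describing object per row, types preserved
jsonl = "\n".join(json.dumps(r) for r in records)

print(csv_buf.getvalue().splitlines()[0])  # id,event
print(jsonl.splitlines()[0])               # {"id": 1, "event": "click"}
```

The columnar binary formats (Parquet, ORC) and Avro need third-party libraries — for example, pandas.DataFrame(records).to_parquet(...) with pyarrow installed — and are what you will actually use in warehouses and lakes, because they compress well and support column pruning.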

3. NoSQL DBs

Pick any one (Cassandra / MongoDB). A Graph DB is rarely needed, but good to have.

4. Workflow Management and Schedulers

This is a very important component in the modern data stack. Pick Airflow (the most preferred tool and the market leader) or an alternative (Luigi, Prefect).

Great Airflow Tutorial
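For a flavour of what orchestration code looks like, here is a minimal Airflow DAG sketch (assuming Airflow 2.x; the dag_id and the task callables are placeholders of my own, not from any real pipeline):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task callables — a real pipeline would extract, transform and load data.
def extract():
    print("pull raw data")

def transform():
    print("clean and model data")

def load():
    print("load to warehouse")

with DAG(
    dag_id="daily_sales_etl",       # hypothetical pipeline name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",     # run once per day
    catchup=False,                  # don't backfill missed runs
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Declare the dependency order: extract, then transform, then load
    extract_task >> transform_task >> load_task
```

The value of a scheduler is everything around this file: retries, backfills, alerting, and a UI showing exactly which task of which run failed.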

5. Data Streaming

Data Velocity is one of the key parameters for Big Data and Data Engineering.

We all want real-time analysis and feedback on what's working and what's not. Reverse ETL and real-time analytics have become must-haves for new businesses.

Apache Kafka, Storm, Flink, Spark Streaming.

Creating a streaming data pipeline
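Before picking up Kafka or Spark Streaming, it helps to understand windowed aggregation, the core operation in most streaming pipelines. Here is a tiny pure-Python illustration of a tumbling-window count (the event data is invented; real engines add watermarks, state stores and fault tolerance on top of this idea):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per fixed (tumbling) time window — the idea behind
    micro-batch aggregations in Spark Streaming or Flink."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # bucket timestamp into its window
        counts[(window_start, key)] += 1
    return dict(counts)

# (epoch_seconds, event_type) pairs, as they might arrive from a Kafka topic
events = [(0, "click"), (30, "click"), (61, "view"), (65, "click")]
print(tumbling_window_counts(events))
# {(0, 'click'): 2, (60, 'view'): 1, (60, 'click'): 1}
```

Sliding and session windows are variations on the same bucketing trick, which is why the streaming frameworks all expose them with near-identical APIs.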

6. DE in cloud (AWS / GCP / Azure)

Cover the complete Data Engineering lifecycle in any one of the major cloud providers: complete any one of points 1–3 below, plus point 4. The data offerings of Azure, Google and AWS are conceptually not very different, so once you are comfortable with one, you can easily pick up the others.

EX:

  1. Azure Stack: Azure Data Lake, Azure Synapse, Azure Data Factory, Azure Cosmos DB, Azure Event Hub, Power BI. Refer to this course by Ramesh Retnaswami on Data Factory, and also the one on Spark and Databricks here.
  2. Google Stack: Big Query, Pub-Sub, Dataflow, Dataproc, Looker
  3. AWS Stack: AWS S3, AWS Kinesis, AWS Glue, Redshift, AWS Athena, Lambda, AWS RDS
  4. Cloud Data Warehouses / Lakes: Databricks, Snowflake

2nd End2End project: The second project should involve more hands-on work and should be built more like a real-world project.

Data Engineering Project End to End

Good To Have

  1. Dashboarding Tools: In-depth knowledge of a specific dashboarding tool is not a must-have for a Data Engineering role, but it is a very valuable skill. Dashboarding can help identify potential data quality issues and the impact of bad data, and it can save developers a lot of time when validating the results of their pipelines.

Power BI / Tableau / Looker are the primary players in the segment.

2. Docker: Docker keeps infrastructure-related complexity out of the way, which makes it easy to set up a data environment independently.

3. Devops / Data Ops

4. Modern Data Stack: The Modern Data Stack refers to a set of independent, mostly open-source tools. These tools give businesses flexibility: even SMBs and startups can now easily set up a modern data architecture without worrying about vendor lock-in and high licensing costs.

It's good to have an understanding of the different tools in the stack and how they fit into the whole DE roadmap.

Fivetran for ETL, Airflow for orchestration, Any cloud warehouse/lake, DBT for Data Transformation, Hightouch for reverse ETL, Monte Carlo for Data Observability etc.

https://www.fivetran.com/blog/what-is-the-modern-data-stack

Final Project: Now that all the important lessons are done, it's important to use the learnings and create an end-to-end pipeline as a Capstone Project. The important topics to be incorporated are :

  1. Building containers to run the ETL/ELT pipelines
  2. Creating a pipeline with Python to Load data in the lake.
  3. Creating orchestration to run the codes
  4. Running jobs on Spark, Batch and Stream processing
  5. Data Modelling for Warehouse
  6. Loading data from the lake to the warehouse
  7. Transforming data in the warehouse and preparing the dashboard.
  8. Data visualization and building the dashboard.
  9. Documentation

Conclusion

We might not need every one of these skills in the day-to-day job of a Data Engineer. However, depending on the role, you might need one or many of them frequently.

Learning most of these well will take time. So, keep learning every day. Compounded learning will ensure that with time you get better. There is no shortcut, so don’t believe people who claim to make you a Data Engineer in one or two months.

Stay Up To Date:

Are you someone like me who has been working in the industry as a GUI-based ETL developer or a Data Modeler, or even as a code-based Data Engineer?

The only secret to staying relevant is to stay updated about all the changes happening in the industry. Follow data leaders on LinkedIn, and read blogs and newsletters. And most importantly, keep learning every day.

Free Planner

Follow the below link to get access to a free planner of items to cover as part of your Data Engineer preparation. You can tick the items you already covered and track progress.

The planner is priceless, but I want you all to have it for free. However, if you did get value out of the blog, and only if you want to, you can pay whatever you feel it is worth. This will keep me motivated to continue writing and adding value.

I want the DE2022 Study plan.

You can either follow along or make your own timeline, but ticking off all the items will give you the confidence to face any interviews and also personal satisfaction to have covered the necessary skillsets.

I hope both the blog and the planner inspire you to study along. Go crush your Data Engineering dreams in 2022!

If you still feel lost, don’t hesitate to book some time with me here.

I will be sharing more stories, writings, and experiences in the data industry. You can follow me for more posts like this.

Thanks for reading! If you want to get in touch with me, feel free to reach me at withsaikatdt@gmail.com or my LinkedIn Profile.


Azure Data Engineer| Multi Cloud Data Professional| Data Architect | Career Mentor | Writer(Tech) | https://withsaikatdt.gumroad.com/l/DE2022