The 2024 Guide to Becoming a Data Engineer: Starting from Scratch

What I would do if I had to start over

18 min read · Dec 14, 2023
Photo by Glenn Carstens-Peters on Unsplash

This article is intended as a guide to those who want to break into data engineering. Becoming a data engineer can seem daunting, especially for those transitioning from unrelated fields. But as someone who has done this myself, I can tell you that it’s more than possible. I went from a philosophy graduate to a data engineer in one year. At the start of that year, I knew absolutely nothing about computers, data, or programming. By the end of it, I landed a job as a data engineer at one of the biggest companies in the UK. And while I did a Master’s in Data Science, almost everything I know about data engineering I taught myself.

Looking back, I can see where I went wrong in my own data engineering journey. And if I had to start over now, I would do it quite differently. Many existing roadmaps and guides on becoming a data engineer tend to cast a wide net, covering a broad spectrum of skills, tools, and languages. This approach can often be too general and, at times, irrelevant. The truth is, the landscape of data engineering is vast and diverse, making it impractical, if not impossible, to master every aspect of it. The key, as I have discovered, is not just in acquiring a range of skills but in strategically targeting the right ones.

This guide aims to provide a more focused and streamlined path for those wanting to break into data engineering. It’s not about learning everything under the sun; it’s about learning the right things well. This guide isn’t a one-size-fits-all but is intended to help you create your own personalised roadmap. While there are some essential things that I think you should learn, in general I try to point you in a direction that will help you.

Before reading this article, I would suggest looking for data engineering jobs that appeal to you. Look at the sorts of tools, skills, and languages listed on those job adverts. These are the skills you should focus on. If you are interested in working in the gaming industry, for example, you might find that you need to work with unstructured and streaming data. You should therefore learn how to work with unstructured and streaming data. Not only is learning these things more practical than learning generic data engineering skills, it will actually help you land the jobs that you want.

Keep this in mind as you read this guide. Consider the recommendations here as a foundation to build upon, tailored to align with the specific skills and tools required for the roles you want to get. This will make sure that your learning is not only practical but directly contributes to achieving your career goals in data engineering.

SQL

Learn SQL. SQL’s principles are universal across various database management systems. Whether you work with MySQL, SQL Server, PostgreSQL, or any modern database, SQL remains a constant, and the skills you acquire in one system are easily transferable to others. Learning SQL will truly lay the foundations of your data engineering career.

It will teach you how structured data fundamentally works and what you can do with it, from creating databases and tables, to manipulating and transforming data, to analysing it, and much more. It’s not just about writing efficient, optimised queries that can handle large volumes of data, it’s about understanding and designing efficient data storage. You’ll gain insights into how data is structured and stored in databases, which is crucial for designing scalable and performant data systems. And knowledge of indexing, constraints, and database normalisation will enhance your ability to manage and optimise data.

No matter what programming language you end up using for data engineering, your SQL skills will drastically improve your abilities as a data engineer. In SQL, you have to figure out how to solve the problem in a tabular way or you won’t be able to solve it at all. This forces you to think about data problems in a way that other languages don’t teach you. When you then apply this thinking to other programming languages, you’ll be able to handle data in a way that some experienced programmers might be unaware of. I’ve seen data engineers create complicated for-loops with unions for tasks that can be done in a fraction of the time with simple window functions because they approached problems with a Pythonic mindset. I once refactored a PySpark job by replacing a for-loop with window functions, cutting the runtime from 13 hours to 30 minutes, simply by applying SQL concepts to Python.
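
To make that concrete, here is a minimal PySpark sketch of the idea, using a made-up orders dataset: a window partitioned by customer replaces what might otherwise be a loop over customers with repeated unions.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-demo").getOrCreate()

# Hypothetical orders data: one row per customer order
df = spark.createDataFrame(
    [(1, "2023-01-05", 20.0), (1, "2023-02-10", 35.0), (2, "2023-01-20", 15.0)],
    ["customer_id", "order_date", "amount"],
)

# Instead of looping over customers and unioning the results, define a window
# partitioned by customer and ordered by date, then compute everything in one pass
w = Window.partitionBy("customer_id").orderBy("order_date")

result = (
    df.withColumn("running_total", F.sum("amount").over(w))
      .withColumn("order_rank", F.row_number().over(w))
)
result.show()
```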

SQL is an indispensable tool for data engineers, and it’s important to learn it to a reasonably high level. So, once you’ve learned the basics of SQL, make sure you are competent with DDL (data definition language), DML (data manipulation language), CTEs (common table expressions), subqueries, aggregations, and window functions.
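
If you want to see several of these in one place, here is a small, self-contained sketch using Python’s built-in sqlite3 module (the table and column names are invented, and window functions need SQLite 3.25 or newer): it uses DDL to create a table, DML to populate it, then a CTE, an aggregation, and a window function in a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define a table
cur.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")

# DML: insert some rows
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "widget", 120.0), ("north", "gadget", 80.0), ("south", "widget", 200.0)],
)

# CTE + aggregation + window function: regional totals ranked by revenue
cur.execute("""
    WITH regional AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region,
           total,
           RANK() OVER (ORDER BY total DESC) AS revenue_rank
    FROM regional
""")
print(cur.fetchall())
```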

When I first started learning SQL, I read ‘Practical SQL: A Beginner’s Guide to Storytelling with Data’ by Anthony DeBarros (there’s now a 2nd edition). If you are a beginner, I’d highly recommend this book. It’s easy to understand, practical, and takes you from the basics to the more advanced stuff.

If you already know the basics of SQL, I’d recommend ‘SQL Cookbook’ by Anthony Molinaro and Robert De Graaf. This book goes beyond the basics and offers practical solutions to specific SQL problems you’re likely to encounter in professional data engineering. From intricate query methods to performance optimization, it covers a wide range of topics. It’s a great resource for intermediate SQL learners who want to take their SQL to the next level.

Once you’ve achieved good proficiency in SQL, I’d recommend working through LeetCode’s SQL 50 problems. They will prepare you well for any SQL interview (and they’re free!).

A Programming Language

Next, you’ll need to learn a programming language. As previously mentioned, the best programming language to learn is the one that aligns with your career aspirations. If the job adverts you have found require different languages and you’re not sure which one to learn, the best bet is Python. It’s probably the most popular language for data engineering and it is also relatively easy to learn. Python is known as “the second-best language at everything”, and new data engineering tools are continually being written in Python or offer Python APIs.

Python is known for its readability and simplicity. Its syntax is clear and intuitive, making it a great language for beginners. And this accessibility will help you understand programming concepts without getting overwhelmed by complex syntax. It’s also a good springboard for learning other languages if/when that happens. It’s possible that you’ll never really need to learn another language after Python, since Python is so popular and it is continually developing its data engineering capabilities. That said, if the kinds of jobs you are interested in use languages other than Python, learn those.

If you have the time/interest/need, I would also recommend learning Scala or Java. This is not by any means essential, but Java and Scala can take your data engineering to the next level. They are the next most popular data engineering languages, and they are prevalent in Apache projects like Spark and Hive. JVM (Java Virtual Machine) languages are generally more performant than Python and often give you more fine-grained control than you can achieve with a Python API. Apache Spark, for example, is written in Scala, and Scala provides access to lower-level features that are often not available in PySpark.

Additionally, proficiency in Java or Scala can open doors to specialised roles in data engineering that require technical expertise in big data technologies. Also, these roles are often better paid. So while not essential, Java and/or Scala can greatly improve your capabilities as a data engineer.

Whether you are learning Python, Scala, Java, or something else, after you have mastered the basics, it’s important to jump in and start coding for data engineering. Coding for data engineering is very different from programming for software. It’s often more imperative or functional than object-oriented, and it revolves around the manipulation, processing, and analysis of data, rather than the creation of software applications. That said, software engineering best practices will give you a competitive edge, and being able to jump into a complex codebase will be a lifesaver when certain technical needs arise.

Focus on learning how to work with data in your chosen language. Learn how to read and write different data formats (e.g., CSV, JSON, XML, Parquet) and interact with databases. Understand how to transform, manipulate, and analyse data with operations like filtering, sorting, grouping, aggregating, and so on. Use these skills to build pipelines that extract data from some source, transform it, and load it into your database (ETL). Develop strong skills in writing tests and debugging code. This is really important and ensures the reliability and quality of your data processing scripts.
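
As a rough illustration of what such a pipeline might look like, here is a minimal ETL sketch using pandas and sqlite3; the file name, column names, and table name are placeholders.

```python
import sqlite3
import pandas as pd

# Extract: read raw data (path and schema are placeholders)
raw = pd.read_csv("orders.csv")  # e.g. columns: order_id, customer, amount, order_date

# Transform: fix types, drop bad rows, derive a column
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
clean = raw.dropna(subset=["order_date"])
clean = clean[clean["amount"] > 0]
clean["order_month"] = clean["order_date"].dt.to_period("M").astype(str)

# Load: write the result into a local database table
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```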

One of the first resources I used to learn Python was the Python Essentials course by the Python Institute. It’s designed for beginners and covers all the fundamentals you’ll need to get a strong foothold in Python. It walks you through Python syntax, data structures, basic algorithms, and problem-solving with Python, and also offers interactive exercises for you to apply the concepts you’ve learned. The Python Institute is known for its quality in educational content and teaching methodology, and the course is free!

The free Python Project for Data Engineering course on Coursera is also a great resource for starting to apply your Python skills to data engineering.

The free Coursera course on Scripting with Python and SQL for Data Engineering is a great intro to gathering data and building SQL databases with Python, and will teach you to combine your Python and SQL skills.

If you want to learn Scala and/or Java, I would highly recommend Rock the JVM. He provides some free introductory courses on Java, Scala, and Apache Spark, as well as more advanced paid courses.

Data Modelling

Once you’ve become proficient in SQL and Python (or some other language) and can comfortably handle data manipulation and all the rest, your next step is to dive into data modelling. This is an extremely important skill in my opinion, and one that is often overlooked. Data modelling is about structuring and organising data to make it useful and accessible. A well-designed model can drastically improve the performance of your data retrieval operations and ensure data integrity and consistency. It plays a significant role in how data is processed and analysed, affecting the overall efficiency of data systems.

It’s hard to overstate how powerful data modelling can be. The right data model can transform a sluggish, inefficient system into a high-performing, cost-effective powerhouse. I have been able to dramatically reduce the cost, compute, and storage of a database, while increasing the overall processing speeds by orders of magnitude, simply by applying straightforward data modelling principles.

You’ll also find that data modelling can be used to improve your code. For example, you can model your data in such a way that certain transformations are unnecessary, or if multiple different jobs are generating the same dataframe, you could make this dataframe a persistent table. This will consequently free up your scripts for the important stuff! Data modelling like this not only reduces runtime but also leads to more maintainable and efficient code, as it minimises redundancy and leverages the inherent capabilities of your database system. Data modelling is an essential skill in my view.
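
Here is a small PySpark sketch of that second idea, with invented table names: a join that several jobs would otherwise each recompute is materialised once as a persistent table.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-table").getOrCreate()

# Suppose several jobs all begin by joining orders to customers.
# Rather than repeating that join in every script, compute it once...
orders = spark.table("raw.orders")        # invented table names
customers = spark.table("raw.customers")

enriched = orders.join(customers, on="customer_id", how="left")

# ...and persist it as a table that downstream jobs can simply read
enriched.write.mode("overwrite").saveAsTable("curated.enriched_orders")

# Downstream jobs then start from: spark.table("curated.enriched_orders")
```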

With regards to learning data modelling, I don’t think it’s necessary to go too deep, especially at first. Learning the fundamentals is enough to make a significant impact in your work. Here’s what I recommend:

  • Get comfortable with creating Entity-Relationship (ER) diagrams. They are essential for visualising data structures and relationships, and make it easier to plan and communicate your database designs. (www.lucidchart.com or draw.io are great for creating ER diagrams).
  • Learn about the normalization process, which is crucial for organizing databases effectively. Understand the different normal forms and their importance in reducing data redundancy and improving integrity (I’d say up to 3rd normal form is enough to model data well).
  • Gain a solid understanding of how primary and foreign keys work to establish relationships between tables. This is vital for maintaining data integrity and implementing relational database design.
  • Learn the basics of indexing, which is critical for improving the performance of database queries. Understanding when and how to use indexes can drastically improve the efficiency of data retrieval operations.
  • Finally, learn Ralph Kimball’s dimensional modelling techniques. Kimball’s approach to dimensional modelling is a cornerstone of data warehousing and business intelligence. Understanding Kimball’s principles will help you create data structures that are optimised for reporting and analysis.

Focusing on these 5 fundamental areas of data modelling will take you a long way. There’s a lot more to it than what I have outlined here, but this will be enough to make a big impact on your work and projects, and once you have mastered these fundamentals, you’ll have no problem expanding upon them.
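
To give a feel for the Kimball-style approach mentioned in the last point, here is a minimal dimensional-modelling sketch using sqlite3; the star schema (one fact table, two dimension tables) is invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes, keyed by a surrogate primary key
cur.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        country TEXT
    )
""")
cur.execute("""
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        full_date TEXT,
        year INTEGER,
        month INTEGER
    )
""")

# The fact table holds the measures plus foreign keys to the dimensions
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        amount REAL
    )
""")

# Analytical queries then join the fact table to whichever dimensions they need
# (with data loaded, this would give revenue by month and country)
cur.execute("""
    SELECT d.year, d.month, c.country, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_customer c ON f.customer_key = c.customer_key
    GROUP BY d.year, d.month, c.country
""")
```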

It can be a bit difficult to learn data modelling as there aren’t a whole lot of up-to-date resources out there (not that it’s a particularly fast-changing field). I would highly recommend the 4th edition of ‘Data Modeling Essentials’ by Graeme Simsion and Graham Witt. It covers a range of essential topics, from the basics of data structures and relationships to more advanced concepts in database design and normalisation, and it’s known for its clear explanations and practical examples.

The natural next step would be Ralph Kimball’s ‘The Data Warehouse Toolkit’, which is considered the definitive guide in the field and offers a very thorough exploration of the practical aspects of data warehousing. It’s also very comprehensive and is good for learning to apply data modelling principles to the specific challenges of business intelligence, and to create data structures that are not just well-organised but also optimised for analytical querying and reporting.

While these two books will teach you to be a data modelling master, they are quite big reads. For something more approachable, I recommend this fantastic YouTube course.

The Cloud

It’s essential these days to be able to navigate the cloud. There is, however, a lot to learn in this area, so don’t feel like you need to know it all. You don’t need to become an expert in all aspects of your chosen cloud platform. Instead, just focus on acquiring a solid understanding of the essentials that relate to your projects and work. Start with the fundamentals — the compute and storage services — and then move on to things that you can apply to your own projects.

For example, let’s say you want to build your own database with stock market data that you are scraping from the web. You could learn about the various storage options available to you, and learn to use Azure Synapse Analytics to build your own mini warehouse. You could also use Azure Functions to write and deploy the code for scraping the data. Whatever the project, be it building an analytics dashboard or developing a machine learning model, there’s a cloud service that can streamline and enhance your process. Use these services and learn about them.
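
As a very rough sketch of the ingestion step in that stock-market example, here is how the raw data might be landed in Azure blob storage using the requests library and the azure-storage-blob SDK. The API URL, container, blob path, and environment variable are all placeholders, and in practice this logic would typically run inside something like an Azure Function on a schedule.

```python
import json
import os

import requests
from azure.storage.blob import BlobServiceClient

# Placeholder endpoint standing in for whatever stock-price API you use
resp = requests.get("https://example.com/api/prices?symbol=AAPL", timeout=30)
resp.raise_for_status()
prices = resp.json()

# Land the raw payload in blob storage; a warehouse (e.g. Synapse) can load it from there
service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]  # placeholder setting
)
blob = service.get_blob_client(container="raw", blob="prices/aapl/latest.json")
blob.upload_blob(json.dumps(prices), overwrite=True)
```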

Again, you don’t need to learn everything about the cloud. The key, rather, is to focus on learning the services and tools that are directly relevant to your projects and goals. Your goal is to be effective in your role, not to know every cloud service available. Besides, you won’t be expected to know every facet of Azure or AWS when you work as a data engineer, you’ll only be expected to know the services that help you get your job done well.

It doesn’t matter too much which platform you choose to learn, the general principles are transferable. If you aren’t sure which one to pick, I’d recommend Azure and AWS over Google Cloud. If you are interested in using Databricks, then I’d suggest Azure over AWS as new Databricks features are dished out to Azure first.

But whether you want to learn about Azure, AWS, or Google Cloud, these companies offer plenty of free educational content, like this beginner’s course in data engineering on Azure.

You could also follow the learning paths for Azure’s Data Engineer Associate certification. Whether or not you want to take the exam, these learning paths are very useful.

Ultimately, the best way to learn how to navigate the cloud is to set up your own resources and applications and do things for yourself. Try to incorporate the cloud in your projects as much as possible!

Other Skills

Now that we have covered the main skills and tools for data engineering, let’s discuss some other important things.

Version Control

Version control, particularly with Git, is an essential tool in the toolkit of a data engineer. It lets you track changes in your code, collaborate with others, and manage different versions of a project effectively. You’ll certainly have to use version control at some point in your career as a data engineer, and luckily it’s not too difficult. In most cases, GitHub Desktop will get the job done and is a beginner-friendly UI that will teach you a lot about version control. However, situations crop up where you need more control than GitHub Desktop can offer, and then you’ll need to know how to use Git via the command line.

Bash (or PowerShell)

Being able to use bash commands and CLIs will significantly improve your productivity, especially when it comes to scripting or executing operating system tasks. Data engineers often rely on command-line utilities for file processing within data pipelines or for executing Bash commands through orchestration frameworks. This isn’t an essential skill — many jobs won’t require this at all — but it’s very much a good-to-have. If you are working with Windows, learn PowerShell instead. DataCamp does some good introductory courses in both Bash and PowerShell.

Data Formats

Being able to work with a variety of data and file formats is important, as each format has its unique advantages and use cases. Knowing when to use Parquet vs Avro, for example, can make a big difference in terms of performance and cost. It’s also important to be able to handle JSON, not just with programming languages like Python but also in SQL, where modern databases often support JSON data types and queries. This skill is particularly vital for tasks like API integration, configuration management, and working with semi-structured data. Overall, mastery of various data formats, including JSON, empowers a data engineer to handle diverse datasets efficiently and adapt to different data processing scenarios.
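
A small pandas sketch of the trade-off in practice (it needs pyarrow installed for Parquet support; the file and column names are placeholders):

```python
import pandas as pd

# Semi-structured input: one JSON record per line (placeholder file)
df = pd.read_json("events.jsonl", lines=True)

# CSV and JSON are row-oriented text: easy to inspect, but large and slow to scan
df.to_csv("events.csv", index=False)

# Parquet is columnar and compressed: smaller files, faster analytical reads,
# and the column types travel with the data
df.to_parquet("events.parquet", index=False)

# With a columnar format, reading back only the columns you need is cheap
subset = pd.read_parquet("events.parquet", columns=["user_id", "event_type"])
```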

Big Data Processing Frameworks

If you are going to be working with large amounts of data, you’ll have to know how to use a big data processing framework. Apache Spark is the number one tool in this regard. Capable of processing huge amounts of data efficiently and quickly, Spark is ubiquitous across the data landscape. Adding Spark to your toolkit will greatly improve your capabilities as a data engineer and will give you a leg up in the job market. DataCamp has some good PySpark courses too.
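
For a sense of what working with Spark looks like, here is a minimal PySpark sketch that reads a hypothetical Parquet dataset, filters and aggregates it, and writes the result back out; the paths and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-intro").getOrCreate()

# Read a (hypothetical) Parquet dataset; Spark distributes the work across the cluster
events = spark.read.parquet("/data/events")

# A typical transformation: filter, aggregate, and write the result back out
daily_counts = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("event_date")
    .agg(F.count("*").alias("purchases"), F.sum("amount").alias("revenue"))
)

daily_counts.write.mode("overwrite").parquet("/data/daily_purchase_summary")
```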

DevOps

Integrating DevOps principles into data engineering is vital for enhancing efficiency and reliability. Emphasizing collaboration, automation, continuous integration (CI), and continuous deployment (CD), DevOps practices streamline the development and maintenance of data pipelines. By automating pipeline deployment and implementing CI/CD, data engineers can ensure consistent testing and seamless deployment of changes, minimising errors and downtime. This approach not only optimises operational processes but also aligns closely with the evolving needs of modern data-driven organisations. DevOps knowledge is therefore an invaluable asset for a data engineer and something you should make time for. DataCamp does a good introductory course on DevOps.
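
A simple way to start practising this is to keep your transformation logic in plain, testable functions and have your CI pipeline run the tests on every commit. Here is a minimal sketch with pytest; the function, field names, and file name are invented.

```python
# pipeline_test.py: run locally or as a CI step with `pytest pipeline_test.py`

def add_order_month(records):
    """Derive an 'order_month' field (YYYY-MM) from each record's 'order_date'."""
    return [{**r, "order_month": r["order_date"][:7]} for r in records]


def test_add_order_month():
    rows = [{"order_id": 1, "order_date": "2023-12-14"}]
    result = add_order_month(rows)
    assert result[0]["order_month"] == "2023-12"
```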

Additional Resources

I’ll now offer some additional resources that I found helpful in my data engineering journey.

Fundamentals of Data Engineering

First up is ‘Fundamentals of Data Engineering’ by Joe Reis and Matt Housley. This book had a profound impact on the way I understood my role as a data engineer. Less of a technical book, it focuses more on the principles to follow to succeed as a data engineer. It’s a must-read for any data engineer, beginner or veteran!

IBM Data Engineering Professional Certificate

One of the first courses I ever did was the IBM Data Engineering Professional Certificate on Coursera. It covers a lot of ground and provides a good foundation in data engineering skills and concepts. You can actually enrol for free if you contact Coursera and apply for financial aid.

Spark — The Definitive Guide & Learning Spark

I read two books to learn Spark: ‘Spark — The Definitive Guide’ by Bill Chambers and Matei Zaharia, and ‘Learning Spark’ by Jules Damji, Brooke Wenig, Tathagata Das, and Denny Lee. ‘Spark — The Definitive Guide’ is a tad outdated now as it focuses on Spark 2.0, whereas ‘Learning Spark’ teaches Spark 3.0. That isn’t to say you should avoid the former, however. Matei Zaharia is one of the creators of Spark and his insights into the mechanics of Spark are still useful and relevant.

Databricks Academy

To learn about Databricks I used Databricks’ own learning platform, Databricks Academy. There’s a lot of great free content on there, including a Data Engineering Learning Plan.

DataCamp

Another platform I use a lot is DataCamp. It helped me learn things like Bash, Git, unit testing, and a lot more. It has a wide range of courses and learning paths that cover pretty much anything you might want to learn for data engineering. It’s great for diving into the basics and getting started with new skills or languages, and even has a lot of intermediate-level content. I still use it today when I want to learn a new skill or topic. If you want to try it out, subscribe here.

Projects

All throughout your learning process, whether you are learning SQL, Scala, Git, or AWS, it’s absolutely vital to undertake projects. Taking courses and reading books can only take you so far. The real test of your data engineering skills comes from hands-on experience. This is where you really understand and remember things. The difficulty, however, is that data engineering projects can be harder to come up with than data science or analysis projects.

I think this is partly because people tend to isolate data engineering, as though it is a standalone discipline. So when they try to think of a project, they can’t imagine much beyond taking some data off the internet and storing it on their computer. But it’s important to understand that data engineering doesn’t exist in a vacuum. It’s part of a larger ecosystem that includes data science, analysis, and business intelligence. Your project ideas should therefore encompass this interconnectedness.

Start by envisioning a broader data-related goal or question. Once you have this end goal in mind, work backwards to plan your data engineering pipeline. Write a pipeline to extract relevant data, either by scraping websites, accessing APIs, or connecting to existing databases. Transform, clean and structure the data, storing it in a local database, perhaps using MySQL or PostgreSQL. Then, develop a data model. Use the medallion architecture (bronze, silver, gold layers) to progressively refine your data, applying normalisation principles before creating fact and dimension tables. Then, use this data to build a dashboard highlighting interesting insights in the data. You could host this dashboard on Databricks and schedule a workflow to run your pipeline on a daily basis. This will really put your knowledge and skills to the test.
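
Here is a compressed sketch of that bronze/silver/gold flow using pandas and sqlite3; the source file, column names, and table names are all placeholders standing in for whatever your pipeline actually extracts.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("project.db")

# Bronze: land the raw extract as-is (placeholder CSV standing in for an API or scrape)
bronze = pd.read_csv("raw_prices.csv")
bronze.to_sql("bronze_prices", conn, if_exists="replace", index=False)

# Silver: clean and conform the data (types, deduplication, naming)
silver = bronze.copy()
silver["trade_date"] = pd.to_datetime(silver["trade_date"], errors="coerce")
silver = silver.dropna(subset=["trade_date"]).drop_duplicates(["symbol", "trade_date"])
silver.to_sql("silver_prices", conn, if_exists="replace", index=False)

# Gold: model for consumption, e.g. a monthly summary a dashboard can read directly
gold = (
    silver.assign(month=silver["trade_date"].dt.to_period("M").astype(str))
          .groupby(["symbol", "month"], as_index=False)["close"]
          .mean()
          .rename(columns={"close": "avg_close"})
)
gold.to_sql("gold_monthly_prices", conn, if_exists="replace", index=False)
```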

It might seem irrelevant to perform data analysis as part of your projects, but having a grasp of data science and analytics enhances your perspective as a data engineer. Understanding the end goals — what data scientists and analysts need to derive insights — guides better architectural and processing decisions. This cross-disciplinary understanding helps make sure that your data engineering solutions are not just technically sound, but also relevant and useful for downstream applications. It will make you a better data engineer in the long run.

And if you get stuck for ideas, just ask ChatGPT!

Conclusion

To wrap it all up, I have offered what I believe are four pillars for an aspiring data engineer: SQL, programming, data modelling, and the cloud. If I had to go back and learn data engineering from scratch again, this is what I would focus on. Most guides and roadmaps seem to recommend learning Databricks, Snowflake, Airflow, dbt, Tableau, Power BI, and a plethora of other tools. But this is often a waste of time.

You don’t need to know all of these tools to be a data engineer, and if you are just starting out, you need to focus on the data engineering fundamentals and the skills that will actually land you a job. Besides, part of being a data engineer is being able to adapt to new and different tools and scenarios, to be able to use the skills you have to tackle novel problems. This is a fast-changing field; learning the fundamentals will help you adapt to the new tools that invariably arise.

The other skills I have recommended are not essential, but are things that will greatly improve your data engineering capabilities. These are also things many employers are looking for. Prioritise the main four skills, but try to make time for these additional ones too. Finally, don’t take this guide as a set of hard and fast rules — if the jobs you are looking for require Rust, then learn Rust! If they require NoSQL, then learn NoSQL! Use this to guide your own roadmap and learning process.


Written by Tom Corbin

Data Engineer, Spark Enthusiast, and Databricks Advocate
