Mastering Data: A Strategic Guide for Software Developers

Aleh Belausau
Towards Data Engineering
12 min read · Dec 21, 2023

With the increasing prominence of artificial intelligence and the exponential growth of business data, the demand for skilled data professionals has never been greater. This roadmap is designed to guide experienced software developers in mastering data engineering within ten months, providing a comprehensive overview of the field’s complexities and equipping them with the knowledge and skills required for success.

Photo by Sincerely Media on Unsplash

Step 0: Familiarize Yourself with Data Engineering Essentials

Before diving into the complexities, it’s crucial to understand the ‘Why’ behind data engineering, not just the ‘How.’

You should begin your journey by grasping the fundamentals of data engineering through the concise four-hour course Introduction to the Fundamentals of Data Engineering by Joe Reis.

This course covers essential topics like data engineering principles, the data engineering lifecycle, and data architecture basics.

After taking this short course, you will know whether this field interests you and whether it is worth continuing.

This course is closely related to material in the book Fundamentals of Data Engineering by Joe Reis and Matt Housley, which we will discuss next.

Step 1: Lay the Foundation

For a more in-depth understanding, delve into the book Fundamentals of Data Engineering by Joe Reis and Matt Housley, which helps you gain an overview of the entire data engineering landscape. Don’t let the term “Fundamentals” in the name deceive you; this is a great book, and all this knowledge is crucial for laying the foundation for further learning.

Fundamentals of Data Engineering by Joe Reis and Matt Housley

At a fundamental level, you will gain knowledge in the following areas:

  • Understanding the data engineering lifecycle.
  • Effective data architecture: Lambda Architecture, Kappa Architecture, and Data Mesh, as well as Stream-to-Batch Storage Architecture.
  • Differentiating between Data Warehouse, Data Lake, Data Lakehouse, and Data Catalog.
  • Working with Distributed Systems, understanding Scalability, and designing for Failure.
  • Modern Data Stack and how to choose technologies across the Data Engineering Lifecycle.
  • Understanding Data Management, DataOps, and Data Orchestration.
  • Basics and history of The Hadoop Distributed File System.
  • Streaming Storage, File Storage, Block Storage, Object Storage.
  • Understanding ETL, ELT, Data Migration, Schema Evolution, Change Data Capture (a small ETL sketch follows this list).
  • Working with Indexes, Partitioning, Clustering, Queries, Modeling, Data Transformation, and Normalization.
  • Serving Data for Analytics, Machine Learning, and Reverse ETL, including serving Data in Notebooks.
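To make the ETL item above a bit more concrete, here is a minimal, hypothetical sketch of a batch ETL job in Python (pandas plus SQLite); the file, column, and table names are invented for illustration, and a real pipeline would of course add validation, logging, and orchestration:

```python
import sqlite3

import pandas as pd


def run_etl(csv_path: str = "orders.csv", db_path: str = "warehouse.db") -> None:
    # Extract: read raw data from a (hypothetical) source file
    raw = pd.read_csv(csv_path)

    # Transform: clean and aggregate before loading (ETL); an ELT flow would
    # load the raw data first and transform it inside the warehouse instead
    clean = raw.dropna(subset=["amount"]).copy()
    clean["order_date"] = pd.to_datetime(clean["order_date"]).dt.date
    daily_revenue = (
        clean.groupby("order_date", as_index=False)["amount"]
             .sum()
             .rename(columns={"amount": "revenue"})
    )

    # Load: write the transformed result into a target table
    with sqlite3.connect(db_path) as conn:
        daily_revenue.to_sql("daily_revenue", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    run_etl()
```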

And, most importantly, you will decide whether you wish to proceed.

Step 2: Complete Azure Data Engineer Associate Certification

Certifications are a valuable investment in one’s career, leading to increased opportunities, higher earning potential, and professional growth. However, the most important aspect is that you can validate your knowledge and measure your progress.

A crucial certification option for data engineers is the Azure Data Engineer Associate. Among the various cloud-specific data engineering certifications, the Azure Data Engineer Associate stands out as one of the most challenging and valuable. It equips you with knowledge of both common data engineering principles and Azure-specific technologies.

To clarify, my advice is to initiate preparation for this certification at the beginning to address gaps in fundamentals, concurrently with other steps. By the end of your journey, completing it will serve as a significant milestone in your data engineering career, validating your skills and boosting your self-esteem.

Complete Data Engineer Career Path

Microsoft is famous for its learning paths and educational resources, offering a diverse range of learning materials, tutorials, and certifications through platforms like Microsoft Learn. One notable example is the Data Engineer Career Path — Microsoft Learn Official Collection. This resource serves as a comprehensive guide crafted by Microsoft to provide a structured official pathway for individuals aspiring to pursue a career in data engineering. Beginning with this collection ensures coverage of essential concepts, allowing you to acquire foundational knowledge crucial for a successful journey in the field of data engineering.

One more great resource for preparing for this certification is the Azure Data Engineer Associate Certification Guide by Newton Alex. In this book, you will find in-depth coverage of data engineering topics along with hands-on exercises, practical examples, and guidance for preparing for the certification exam.

Azure Data Engineer Associate Certification Guide by Newton Alex

At a fundamental-to-intermediate level, you will gain knowledge in the following areas:

  • Comparing Avro, Parquet, and ORC
  • Designing storage for efficient querying
  • Designing a distribution strategy
  • Designing a partition strategy for files/analytical workloads
  • Dimensional models
  • Designing SCDs (Slowly Changing Dimensions; a small pandas sketch follows this list)
  • Handling duplicate data
  • Handling missing data
  • Handling late-arriving data
  • Creating data pipelines
  • Designing a stream processing solution
  • Designing and creating windowed aggregates
  • Designing and implementing data encryption for data at rest and in transit
  • Designing row-level and column-level security
  • Designing for incremental loading
  • Compressing files
  • Transforming data using Apache Spark
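As one example from the list above, a Type 2 Slowly Changing Dimension keeps history by expiring the old row and appending a new version. Below is a minimal, hypothetical pandas sketch of that merge logic; the column names (customer_id, city, is_current, start_date, end_date) are invented, and production implementations would typically use the warehouse's MERGE support instead:

```python
import pandas as pd


def scd2_merge(dim: pd.DataFrame, updates: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Toy Type 2 SCD merge: expire changed rows and append new versions."""
    # Compare incoming updates against the currently active dimension rows
    current = dim[dim["is_current"]]
    merged = current.merge(updates, on="customer_id", suffixes=("_old", "_new"))
    changed = merged[merged["city_old"] != merged["city_new"]]

    # Expire the old versions of the changed customers
    expire_mask = dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"]
    dim.loc[expire_mask, "is_current"] = False
    dim.loc[expire_mask, "end_date"] = as_of

    # Append a new current version for each changed customer
    new_rows = changed[["customer_id", "city_new"]].rename(columns={"city_new": "city"})
    new_rows = new_rows.assign(start_date=as_of, end_date=None, is_current=True)
    return pd.concat([dim, new_rows], ignore_index=True)
```

The moving parts (detect the change, expire the old row, insert the new one) are the same whether you write this in pandas, Spark, or SQL.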

At an intermediate-to-advanced level, you will gain knowledge in the following areas:

  • Working with Azure data infrastructure
  • Designing and implementing data lake solutions with batch and stream pipelines
  • Identifying partition strategies available in Azure storage technologies
  • Implementing different table geometries in Azure Synapse Analytics
  • Using transformations available in T-SQL, Spark, and Azure Data Factory
  • Using Azure Databricks or Synapse Spark to process data using Notebooks
  • Designing security using RBAC, ACL, encryption, and data masking
  • Monitoring and optimizing data pipelines with debugging tips

If you want to know more about Azure certifications, please check out my certification guides.

Step 3: Master SQL Queries

Mastering SQL is crucial for individuals working with data, as it enables them to interact with databases effectively, conduct complex data analysis, and contribute to data-driven decision-making processes.

The most effective way to master SQL is through consistent practice, and I recommend LeetCode’s SQL tasks as the optimal way to get that practice. While there are additional essential concepts related to database engines and optimization strategies, these will be covered in subsequent steps.

Practicing SQL through LeetCode tasks is highly effective for several reasons. It provides real-world problem-solving experience, a diverse set of problems, challenges for optimizing performance, a supportive community for discussion, preparation for interviews, immediate feedback, and reinforcement of SQL concepts through regular practice. This approach offers a hands-on and practical way to develop and refine SQL skills.
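To give a taste of the kind of problem you will practice, here is a small, self-contained example using Python’s built-in sqlite3 module; the employees table and its data are made up, and the window-function pattern shown (ranking within a group) appears in many LeetCode SQL tasks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ann', 'Sales', 90), ('Bob', 'Sales', 80),
        ('Eve', 'IT', 120), ('Dan', 'IT', 110), ('Kim', 'IT', 120);
""")

# Find the top earner(s) in each department using a window function
query = """
    SELECT name, department, salary
    FROM (
        SELECT name, department, salary,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
        FROM employees
    ) AS ranked
    WHERE rnk = 1
    ORDER BY department, name;
"""
for row in conn.execute(query):
    print(row)  # ('Eve', 'IT', 120), ('Kim', 'IT', 120), ('Ann', 'Sales', 90)
```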

While LeetCode is a valuable resource, it’s important to supplement it with other learning methods, such as studying theory, working on real-world projects, and exploring advanced database topics. Combining different approaches ensures a well-rounded and comprehensive mastery of SQL.

Step 4: Learn Bash Basics

Many data tools and technologies have command-line interfaces or can be accessed through Bash scripts. Knowing Bash enables you to seamlessly integrate and interact with these tools, contributing to a more cohesive data engineering workflow.

You will gain all necessary knowledge from the Introduction to Bash Scripting course by DataCamp.

Step 5: Master Python as a Scripting Language

If you are a Python developer, consider yourself fortunate! You can simply skip this step. For everyone else:

For a data engineer, it’s essential to master Python as a scripting language, especially for activities like data processing, creating ETL pipelines, and other data-centric tasks. This is because numerous popular tools and frameworks in the domain, such as PySpark, Apache Airflow, Prefect, Great Expectations, and AWS Glue, are built around Python.

While Python supports Object-Oriented Programming (OOP), it’s not always necessary to emphasize it for all tasks, especially in a scripting context. Therefore, you should not focus on this part of Python at first.

Simple Way:

You can acquire all necessary skills to start using Python for data engineering tasks by completing these two courses:

  1. Introduction to Python by DataCamp
  2. Intermediate Python by DataCamp

I won’t detail the content of these courses as I trust the names speak for themselves.

Complex Way:

If you have a strong coding background but don’t know Python and don’t want to waste time bending Python to fit patterns you’ve learned in other languages, you should definitely check out Fluent Python by Luciano Ramalho. In this book, the author guides you through Python’s core language features and libraries, teaching you how to make your code shorter, faster, and more readable.

Fluent Python by Luciano Ramalho.

Step 6: Master Python as a Language for Data Operations

After completing these three courses, you will discover how Python enables efficient and effective handling of diverse data tasks, making it an essential tool for any data professional:

  1. Introduction to Importing Data in Python by DataCamp
  2. Intermediate Importing Data in Python by DataCamp
  3. Data Manipulation with pandas by DataCamp

Oh, how I wish they were real pandas, but it’s just a super useful data analysis and manipulation tool.
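As a quick taste of what these courses cover, here is a small, hypothetical pandas snippet; the file name and columns are invented for illustration:

```python
import pandas as pd

# Import: read a (hypothetical) CSV of website sessions, parsing the timestamp column
sessions = pd.read_csv("sessions.csv", parse_dates=["started_at"])

# Inspect the data and filter out short sessions
sessions.info()
long_sessions = sessions[sessions["duration_sec"] > 300]

# Aggregate: average duration per country, sorted from longest to shortest
avg_by_country = (
    long_sessions.groupby("country")["duration_sec"]
                 .mean()
                 .sort_values(ascending=False)
)
print(avg_by_country.head())
```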

Step 7: Learn Spark & Databricks

The next step is to advance your data skills by mastering Apache Spark. Using PySpark, the Spark Python API, you will leverage parallel computation on large datasets.

Simple Way:

By completing these three DataCamp courses, you will have a comprehensive understanding of Apache Spark and PySpark, enabling you to proficiently leverage parallel computation with large datasets and advance your data skills for real-world applications.

  1. Introduction to PySpark by DataCamp
  2. Big Data with PySpark by DataCamp
  3. Cleaning Data with PySpark by DataCamp

Complex Way:

If you aspire to become a Spark guru and delve beyond surface-level understanding, you should definitely look at Learning Spark, 2nd Edition by Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee. This widely acclaimed book stands out in the Apache Spark literature for its comprehensive coverage and practical approach. Authored by experts associated with Databricks, the book provides both beginners and experienced users with updated insights into the powerful distributed computing framework.

Learning Spark, 2nd Edition by Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee

You can grab a free copy of this book, compliments of Databricks, here.

Additionally, to learn about Databricks and its possibilities, you should create a free Databricks Community Edition account and practice until you become comfortable with its UI and PySpark syntax.
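A first warm-up exercise might look like the sketch below; the file path and column names are invented, and in a Databricks notebook the `spark` session already exists, so the builder line is only needed when running locally:

```python
from pyspark.sql import SparkSession, functions as F

# Locally you create the session yourself; Databricks notebooks provide `spark` for you
spark = SparkSession.builder.appName("pyspark-warmup").getOrCreate()

# Read a (hypothetical) CSV of taxi trips; Spark distributes the work across executors
trips = spark.read.csv("trips.csv", header=True, inferSchema=True)

# A simple parallel aggregation: trip count and average fare per day
daily_stats = (
    trips.withColumn("trip_date", F.to_date("pickup_ts"))
         .groupBy("trip_date")
         .agg(F.count("*").alias("trips"), F.avg("fare").alias("avg_fare"))
         .orderBy("trip_date")
)
daily_stats.show(5)
```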

You may encounter gaps in certain topics, such as Delta, Medallion Architecture, File Formats, etc. However, as you progress through other steps, you’ll gain a deeper understanding of these topics.

Step 8: Dive into Distributed Systems & Storage Engines

In contemporary system design, addressing intricate issues like scalability, consistency, reliability, efficiency, and maintainability is paramount. Moreover, the available toolset is extensive, encompassing relational databases, NoSQL datastores, stream and batch processors, and message brokers. Fortunately, in the realm of data engineering, there exists an excellent book that can serve as a valuable guide in navigating this complex world of data challenges.

I believe this book requires no introduction; within the engineering community, it boasts a formidable reputation. We could even say that Designing Data-Intensive Applications by Martin Kleppmann is a Legend!

At an intermediate-to-advanced level, you will gain knowledge in:

  • Data structures that power different databases
  • Column-oriented storage peculiarities
  • Different formats for encoding data
  • How to work with distributed data
  • Replication, partitioning, and transactions in depth (a toy partitioning sketch follows below)
  • How to work with/design distributed systems
  • What consistency is and why it is so important
  • Batch processing/stream processing in depth

And a vast array of other important data-related topics.

Designing Data-Intensive Applications by Martin Kleppmann

And the best part of this book is the wealth of additional links to great articles, which can become an almost never-ending source of new material for learning enthusiasts.
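To make the partitioning topic from the list above a little more tangible, here is a toy illustration (mine, not the book’s) of key-based hash partitioning: deciding which node owns a record by hashing its key. The node names and keys are invented:

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # a made-up three-node cluster


def owner(key: str, nodes: list[str] = NODES) -> str:
    """Route a record to a node by hashing its key (simple modulo partitioning)."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return nodes[digest % len(nodes)]


for user_id in ["alice", "bob", "carol", "dave"]:
    print(user_id, "->", owner(user_id))

# Caveat: plain modulo partitioning reshuffles most keys when a node is added or
# removed; consistent hashing (covered in the book) keeps that movement small.
```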

For Database enthusiasts:

While the entire first part of Designing Data-Intensive Applications by Martin Kleppmann focuses on the Foundations of Data Systems, you may still have lingering questions after reading it. For individuals who are particularly passionate or deeply interested in database engines, I recommend “Foundations of Data Systems on steroids” — Database Internals by Alex Petrov.

Database Internals by Alex Petrov

At an advanced level, you will gain knowledge in:

  • Storage Engines Internals
  • In-depth exploration of column- versus row-oriented DBMSs
  • B-Tree Basics
  • Various database file formats
  • Implementation of B-Trees
  • In-depth understanding of Transaction Processing and Recovery
  • Log-Structured Storage (see the toy sketch after this list)
  • Comprehensive study of Replication and Consistency
  • Leader Election concepts
  • Practical insights into working with Distributed Transactions

And a multitude of other database topics that database enthusiasts will find intriguing.
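As a tiny illustration of the log-structured storage idea from the list above, here is a hypothetical append-only key-value store with an in-memory index, loosely in the spirit of Bitcask and simplified far beyond anything production-grade (no compaction, no crash recovery of the index):

```python
import json


class AppendOnlyStore:
    """Toy log-structured store: writes append to a log, reads use an in-memory index."""

    def __init__(self, path: str = "data.log"):
        self.path = path
        self.index: dict[str, int] = {}  # key -> byte offset of its latest record
        open(self.path, "ab").close()    # make sure the log file exists

    def put(self, key: str, value: dict) -> None:
        record = (json.dumps({"key": key, "value": value}) + "\n").encode("utf-8")
        with open(self.path, "ab") as log:
            self.index[key] = log.tell()  # newer offsets shadow older versions
            log.write(record)

    def get(self, key: str) -> dict | None:
        if key not in self.index:
            return None
        with open(self.path, "rb") as log:
            log.seek(self.index[key])
            return json.loads(log.readline())["value"]


store = AppendOnlyStore()
store.put("user:1", {"name": "Ann"})
store.put("user:1", {"name": "Ann", "city": "Oslo"})  # an update appends a new record
print(store.get("user:1"))  # {'name': 'Ann', 'city': 'Oslo'}
```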

Step 9: Gain Proficiency in Data Warehousing

Data warehousing remains a crucial and foundational concept in data management and analytics, despite the emergence of new and innovative data architecture styles. It plays a vital role in organizing and analyzing extensive datasets, facilitating informed decision-making, and providing strategic insights. Despite differing opinions on its relevance, the principles of data warehousing persist and continue to be highly significant in the ever-evolving landscape of data technologies.

Furthermore, the importance of modeling, especially dimensional modeling, and the use of schema designs like star and snowflake schemas cannot be overstated. While you have already acquired a wealth of relevant knowledge and skills in Step 2, it’s worth noting that these skills might still be somewhat superficial.
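To ground the terminology, here is a tiny, hypothetical star schema expressed with Python’s built-in sqlite3: one fact table holding measures, surrounded by dimension tables holding descriptive attributes. The table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the who/what/when of each sale
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);

    -- The fact table holds measures plus foreign keys to each dimension
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        quantity     INTEGER,
        revenue      REAL
    );
""")

# A typical analytical query: revenue by product category and month
query = """
    SELECT p.category, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key = d.date_key
    GROUP BY p.category, d.month;
"""
print(conn.execute(query).fetchall())  # empty until the tables are populated
```

A snowflake schema would further normalize the dimensions (for example, splitting category into its own table), trading simpler joins for less redundancy.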

And here comes another legendary book in the world of data engineering — The Data Warehouse Toolkit by Ralph Kimball. Within its pages, you’ll find a comprehensive collection of dimensional modeling techniques, ETL strategies, and various other invaluable insights.

The Data Warehouse Toolkit by Ralph Kimball

At an advanced level, you will gain knowledge in the following areas:

  • Expanded Dimensional Modeling Techniques
  • Fine-Tuning Schema Designs
  • Optimized ETL Strategies
  • Data Warehouse best practices that transcend the foundational level
  • Real-world case studies
  • Integrating data warehousing with emerging technologies
  • Kimball’s approach to continuous learning and adaptation

The Data Warehouse Toolkit is undoubtedly a great book, but while the core principles of data warehousing remain relevant, the field itself is dynamic. It’s a good idea to supplement your learning with additional resources, stay informed about the latest developments in data warehousing, and consider how newer concepts such as Data Lakes, Data Mesh, and Data Vault fit into the broader landscape.

Step 10: Understand Data Architectures

Data fabric, data lakehouse, and data mesh have recently emerged as viable alternatives to the modern data warehouse. These new architectures offer solid benefits, but they are also surrounded by a lot of hyperbole and confusion. It is crucial to clearly understand and define their pros and cons.

One more valuable resource in our list of books is Deciphering Data Architectures by James Serra, a big data and data warehousing solution architect at Microsoft. While this book is currently in early access and is set to be released in May 2024, it can be a good starting point to gain a lot of useful and well-structured information about data architecture.

At an intermediate level, you will gain knowledge in the following areas:

  • Developing a working understanding of various data architectures.
  • Understanding the pros and cons of each approach.
  • Differentiating between data architecture theory and practical implementation.
  • Selecting the most suitable architecture for specific use cases.
  • Grasping distinctions between data warehouses and data lakes.
  • Common data architecture concepts to enhance solution-building.
  • Clarifying each data architecture to alleviate confusion.
  • Understanding which architectures are appropriate for each cloud provider.

Deciphering Data Architectures by James Serra

After becoming familiar with all data architectural concepts, you can now proceed to the primary reason for acquiring this information: understanding how to turn data into value.

With Data Management at Scale by Piethein Strengholt, you’ll learn how to design a next-gen data architecture that takes into account the scale you need for your organization.

At an advanced level, you will gain knowledge in the following areas:

  • Construction of a modern data architecture
  • Recent technology developments and industry trends.
  • Organizing data using data domains and mapping these domains to a technology architecture.
  • Data governance and data security best practices, how to work with regulatory requirements, and address privacy concerns.
  • Data product management, master data, and reference data management.
  • In-depth exploration of domain-driven design and data product design.
  • A detailed examination of data mesh and data fabric.
  • Self-service data marketplaces.
  • The enterprise metadata model.

And on top of all this super useful information, you will get a real example of implementing modern data architecture step-by-step in the final chapter.

Data Management at Scale by Piethein Strengholt

Step 11: Don’t stop

As you embark on this journey to master data engineering, the final step is a reminder that this comprehensive roadmap is merely a starting point. In a world of relentless change, each piece of knowledge acquired becomes a valuable tool, empowering you to adapt, innovate, and excel.

Remember that staying stagnant in your knowledge can quickly render your skills obsolete. So keep your curiosity alive, and maintain a thirst for learning.

Thank you for your attention, and best of luck!

Data and Software Engineer specializing in using cloud technologies for business growth. https://www.linkedin.com/in/aleh-belausau/