Mastering Data: A Strategic Guide for Software Developers

Aleh Belausau
Towards Data Engineering
12 min read · Dec 21, 2023

With the increasing prominence of artificial intelligence and the exponential growth of business data, the demand for skilled data professionals has never been greater. This roadmap is designed to guide experienced software developers in mastering data engineering within ten months, providing a comprehensive overview of the field’s complexities and equipping them with the knowledge and skills required for success.

Photo by Sincerely Media on Unsplash

Step 0: Familiarize Yourself with Data Engineering Essentials

Before diving into the complexities, it’s crucial to understand the ‘Why’ behind data engineering, not just the ‘How.’

You should begin your journey by grasping the fundamentals of data engineering through the concise four-hour course Introduction to the Fundamentals of Data Engineering by Joe Reis.

This course covers essential topics like data engineering principles, the data engineering lifecycle, and data architecture basics.

After taking this short course, you will know whether this field interests you and whether it is worth continuing.

This course is closely related to material in the book Fundamentals of Data Engineering by Joe Reis and Matt Housley, which we will discuss next.

Step 1: Lay the Foundation

For a more in-depth understanding, delve into the book Fundamentals of Data Engineering by Joe Reis and Matt Housley, which helps you gain an overview of the entire data engineering landscape. Don’t let the term “Fundamentals” in the name deceive you; this is a great book, and all this knowledge is crucial for laying the foundation for further learning.

Fundamentals of Data Engineering by Joe Reis and Matt Housley

At a fundamental level, you will gain knowledge in the following areas:

  • Understanding the data engineering lifecycle.
  • Effective data architecture: Lambda Architecture, Kappa Architecture, and Data Mesh, as well as Stream-to-Batch Storage Architecture.
  • Differentiating between Data Warehouse, Data Lake, Data Lakehouse, and Data Catalog.
  • Working with Distributed Systems, understanding Scalability, and designing for Failure.
  • Modern Data Stack and how to choose technologies across the Data Engineering Lifecycle.
  • Understanding Data Management, DataOps, and Data Orchestration.
  • Basics and history of The Hadoop Distributed File System.
  • Streaming Storage, File Storage, Block Storage, Object Storage.
  • Understanding ETL, ELT, Data Migration, Schema Evolution, Change Data Capture (a small ETL sketch follows this list).
  • Working with Indexes, Partitioning, Clustering, Queries, Modeling, Data Transformation, and Normalization.
  • Serving Data for Analytics, Machine Learning, and Reverse ETL, including serving Data in Notebooks.
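To make the ETL item above a bit more concrete, here is a minimal, hypothetical sketch of a batch ETL job in Python (pandas plus SQLite); the file, column, and table names are invented for illustration, and a real pipeline would of course add validation, logging, and orchestration:

```python
import sqlite3

import pandas as pd


def run_etl(csv_path: str = "orders.csv", db_path: str = "warehouse.db") -> None:
    # Extract: read raw data from a (hypothetical) source file
    raw = pd.read_csv(csv_path)

    # Transform: clean and aggregate before loading (ETL); an ELT flow would
    # load the raw data first and transform it inside the warehouse instead
    clean = raw.dropna(subset=["amount"]).copy()
    clean["order_date"] = pd.to_datetime(clean["order_date"]).dt.date
    daily_revenue = (
        clean.groupby("order_date", as_index=False)["amount"]
             .sum()
             .rename(columns={"amount": "revenue"})
    )

    # Load: write the transformed result into a target table
    with sqlite3.connect(db_path) as conn:
        daily_revenue.to_sql("daily_revenue", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    run_etl()
```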

And, most importantly, you will decide whether you wish to proceed.

Step 2: Complete Azure Data Engineer Associate Certification

Certifications are a valuable investment in one’s career, leading to increased opportunities, higher earning potential, and professional growth. However, the most important aspect is that you can validate your knowledge and measure your progress.

A crucial certification option for data engineers is the Azure Data Engineer Associate. Among the various cloud-specific data engineering certifications, the Azure Data Engineer Associate stands out as one of the most challenging and valuable. It equips you with knowledge of both common data engineering principles and Azure-specific technologies.

To clarify, my advice is to initiate preparation for this certification at the beginning to address gaps in fundamentals, concurrently with other steps. By the end of your journey, completing it will serve as a significant milestone in your data engineering career, validating your skills and boosting your self-esteem.

Complete Data Engineer Career Path

Microsoft is famous for its learning paths and educational resources, offering a diverse range of learning materials, tutorials, and certifications through platforms like Microsoft Learn. One notable example is the Data Engineer Career Path — Microsoft Learn Official Collection. This resource serves as a comprehensive guide crafted by Microsoft to provide a structured official pathway for individuals aspiring to pursue a career in data engineering. Beginning with this collection ensures coverage of essential concepts, allowing you to acquire foundational knowledge crucial for a successful journey in the field of data engineering.

One more great resource for preparing for this certification is the Azure Data Engineer Associate Certification Guide by Newton Alex. In this book, you will find in-depth coverage of data engineering topics along with hands-on exercises, practical examples, and guidance for preparing for the certification exam.

Azure Data Engineer Associate Certification Guide by Newton Alex

At a fundamental-to-intermediate level, you will gain knowledge in the following areas:

  • Comparing Avro, Parquet, and ORC
  • Designing storage for efficient querying
  • Designing a distribution strategy
  • Designing a partition strategy for files/analytical workloads
  • Dimensional models
  • Designing SCDs (Slowly Changing Dimensions; a small pandas sketch follows this list)
  • Handling duplicate data
  • Handling missing data
  • Handling late-arriving data
  • Creating data pipelines
  • Designing a stream processing solution
  • Designing and creating windowed aggregates
  • Designing and implementing data encryption for data at rest and in transit
  • Designing row-level and column-level security
  • Designing for incremental loading
  • Compressing files
  • Transforming data using Apache Spark
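As one example from the list above, a Type 2 Slowly Changing Dimension keeps history by expiring the old row and appending a new version. Below is a minimal, hypothetical pandas sketch of that merge logic; the column names (customer_id, city, is_current, start_date, end_date) are invented, and production implementations would typically use the warehouse's MERGE support instead:

```python
import pandas as pd


def scd2_merge(dim: pd.DataFrame, updates: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Toy Type 2 SCD merge: expire changed rows and append new versions."""
    # Compare incoming updates against the currently active dimension rows
    current = dim[dim["is_current"]]
    merged = current.merge(updates, on="customer_id", suffixes=("_old", "_new"))
    changed = merged[merged["city_old"] != merged["city_new"]]

    # Expire the old versions of the changed customers
    expire_mask = dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"]
    dim.loc[expire_mask, "is_current"] = False
    dim.loc[expire_mask, "end_date"] = as_of

    # Append a new current version for each changed customer
    new_rows = changed[["customer_id", "city_new"]].rename(columns={"city_new": "city"})
    new_rows = new_rows.assign(start_date=as_of, end_date=None, is_current=True)
    return pd.concat([dim, new_rows], ignore_index=True)
```

The moving parts (detect the change, expire the old row, insert the new one) are the same whether you write this in pandas, Spark, or SQL.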

At an intermediate-to-advanced level, you will gain knowledge in the following areas:

  • Working with Azure data infrastructure
  • Designing and implementing data lake solutions with batch and stream pipelines
  • Identifying partition strategies available in Azure storage technologies
  • Implementing different table geometries in Azure Synapse Analytics
  • Using transformations available in T-SQL, Spark, and Azure Data Factory
  • Using Azure Databricks or Synapse Spark to process data using Notebooks
  • Designing security using RBAC, ACL, encryption, and data masking
  • Monitoring and optimizing data pipelines with debugging tips

If you want to know more about Azure certifications, please check out my certification guides.

Step 3: Master SQL Queries

Mastering SQL is crucial for individuals working with data, as it enables them to interact with databases effectively, conduct complex data analysis, and contribute to data-driven decision-making processes.

The most effective way to master SQL is through consistent practice, and I recommend LeetCode’s SQL tasks as the optimal way to get that practice. While there are additional essential concepts related to database engines and optimization strategies, these will be covered in subsequent steps.

Practicing SQL through LeetCode tasks is highly effective for several reasons. It provides real-world problem-solving experience, a diverse set of problems, challenges for optimizing performance, a supportive community for discussion, preparation for interviews, immediate feedback, and reinforcement of SQL concepts through regular practice. This approach offers a hands-on and practical way to develop and refine SQL skills.
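To give a taste of the kind of problem you will practice, here is a small, self-contained example using Python’s built-in sqlite3 module; the employees table and its data are made up, and the window-function pattern shown (ranking within a group) appears in many LeetCode SQL tasks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ann', 'Sales', 90), ('Bob', 'Sales', 80),
        ('Eve', 'IT', 120), ('Dan', 'IT', 110), ('Kim', 'IT', 120);
""")

# Find the top earner(s) in each department using a window function
query = """
    SELECT name, department, salary
    FROM (
        SELECT name, department, salary,
               DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
        FROM employees
    ) AS ranked
    WHERE rnk = 1
    ORDER BY department, name;
"""
for row in conn.execute(query):
    print(row)  # ('Eve', 'IT', 120), ('Kim', 'IT', 120), ('Ann', 'Sales', 90)
```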

While LeetCode is a valuable resource, it’s important to supplement it with other learning methods, such as studying theory, working on real-world projects, and exploring advanced database topics. Combining different approaches ensures a well-rounded and comprehensive mastery of SQL.

Step 4: Learn Bash Basics

Many data tools and technologies have command-line interfaces or can be accessed through Bash scripts. Knowing Bash enables you to seamlessly integrate and interact with these tools, contributing to a more cohesive data engineering workflow.

You will gain all necessary knowledge from the Introduction to Bash Scripting course by DataCamp.

Step 5: Master Python as a Scripting Language

If you are a Python developer, consider yourself fortunate! You can simply skip this step. For everyone else:

For a data engineer, it’s essential to master Python as a scripting language, especially for activities like data processing, creating ETL pipelines, and other data-centric tasks. This is because numerous popular tools and frameworks in the domain, such as PySpark, Apache Airflow, Prefect, Great Expectations, and AWS Glue, are built around Python.

While Python supports Object-Oriented Programming (OOP), it’s not always necessary to emphasize it for all tasks, especially in a scripting context. Therefore, you should not focus on this part of Python at first.

Simple Way:

You can acquire all necessary skills to start using Python for data engineering tasks by completing these two courses:

  1. Introduction to Python by DataCamp
  2. Intermediate Python by DataCamp

I won’t detail the content of these courses as I trust the names speak for themselves.

Complex Way:

If you have a strong coding background but don’t know Python and don’t want to waste time bending Python to fit patterns you’ve learned in other languages, you should definitely check out Fluent Python by Luciano Ramalho. In this book, the author guides you through Python’s core language features and libraries, teaching you how to make your code shorter, faster, and more readable.

Fluent Python by Luciano Ramalho.

Step 6: Master Python as a Language for Data Operations

After completing these three courses, you will discover how Python enables efficient and effective handling of diverse data tasks, making it an essential tool for any data professional:

  1. Introduction to Importing Data in Python by DataCamp
  2. Intermediate Importing Data in Python by DataCamp
  3. Data Manipulation with pandas by DataCamp

Oh, how I wish they were real pandas, but it’s just a super useful data analysis and manipulation tool.
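As a quick taste of what these courses cover, here is a small, hypothetical pandas snippet; the file name and columns are invented for illustration:

```python
import pandas as pd

# Import: read a (hypothetical) CSV of website sessions, parsing the timestamp column
sessions = pd.read_csv("sessions.csv", parse_dates=["started_at"])

# Inspect the data and filter out short sessions
sessions.info()
long_sessions = sessions[sessions["duration_sec"] > 300]

# Aggregate: average duration per country, sorted from longest to shortest
avg_by_country = (
    long_sessions.groupby("country")["duration_sec"]
                 .mean()
                 .sort_values(ascending=False)
)
print(avg_by_country.head())
```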

Step 7: Learn Spark & Databricks

The next step is to advance your data skills by mastering Apache Spark. Using PySpark, the Spark Python API, you will leverage parallel computation on large datasets.

Simple Way:

By completing these three DataCamp courses, you will have a comprehensive understanding of Apache Spark and PySpark, enabling you to proficiently leverage parallel computation with large datasets and advance your data skills for real-world applications.

  1. Introduction to PySpark by DataCamp
  2. Big Data with PySpark by DataCamp
  3. Cleaning Data with PySpark by DataCamp

Complex Way:

If you aspire to become a Spark guru and delve beyond surface-level understanding, you should definitely look at Learning Spark, 2nd Edition by Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee. This widely acclaimed book stands out in the Apache Spark literature for its comprehensive coverage and practical approach. Authored by experts associated with Databricks, the book provides both beginners and experienced users with updated insights into the powerful distributed computing framework.

Learning Spark, 2nd Edition by Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee

You can grab a free copy of this book, compliments of Databricks, here.

Additionally, to learn about Databricks and its possibilities, you should create a free Databricks Community Edition account and practice until you become comfortable with its UI and PySpark syntax.
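A first warm-up exercise might look like the sketch below; the file path and column names are invented, and in a Databricks notebook the `spark` session already exists, so the builder line is only needed when running locally:

```python
from pyspark.sql import SparkSession, functions as F

# Locally you create the session yourself; Databricks notebooks provide `spark` for you
spark = SparkSession.builder.appName("pyspark-warmup").getOrCreate()

# Read a (hypothetical) CSV of taxi trips; Spark distributes the work across executors
trips = spark.read.csv("trips.csv", header=True, inferSchema=True)

# A simple parallel aggregation: trip count and average fare per day
daily_stats = (
    trips.withColumn("trip_date", F.to_date("pickup_ts"))
         .groupBy("trip_date")
         .agg(F.count("*").alias("trips"), F.avg("fare").alias("avg_fare"))
         .orderBy("trip_date")
)
daily_stats.show(5)
```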

You may encounter gaps in certain topics, such as Delta, Medallion Architecture, File Formats, etc. However, as you progress through other steps, you’ll gain a deeper understanding of these topics.

Step 8: Dive into Distributed Systems & Storage Engines

In contemporary system design, addressing intricate issues like scalability, consistency, reliability, efficiency, and maintainability is paramount. Moreover, the available toolset is extensive, encompassing relational databases, NoSQL datastores, stream and batch processors, and message brokers. Fortunately, in the realm of data engineering, there exists an excellent book that can serve as a valuable guide in navigating this complex world of data challenges.

I believe this book requires no introduction; within the engineering community, it boasts a formidable reputation. We could even say that Designing Data-Intensive Applications by Martin Kleppmann is a Legend!

At an intermediate-to-advanced level, you will gain knowledge in:

  • Data structures that power different databases
  • Column-oriented storage peculiarities
  • Different formats for encoding data
  • How to work with distributed data
  • Replication, partitioning, and transactions in depth (a toy partitioning sketch follows below)
  • How to work with/design distributed systems
  • What consistency is and why it is so important
  • Batch processing/stream processing in depth

And a vast array of other important data-related topics.

Designing Data-Intensive Applications by Martin Kleppmann

And the best part of this book is the wealth of additional links to great articles, which can become an almost never-ending source of new material for learning enthusiasts.
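To make the partitioning topic from the list above a little more tangible, here is a toy illustration (mine, not the book’s) of key-based hash partitioning: deciding which node owns a record by hashing its key. The node names and keys are invented:

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # a made-up three-node cluster


def owner(key: str, nodes: list[str] = NODES) -> str:
    """Route a record to a node by hashing its key (simple modulo partitioning)."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return nodes[digest % len(nodes)]


for user_id in ["alice", "bob", "carol", "dave"]:
    print(user_id, "->", owner(user_id))

# Caveat: plain modulo partitioning reshuffles most keys when a node is added or
# removed; consistent hashing (covered in the book) keeps that movement small.
```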

For Database enthusiasts:

While the entire first part of Designing Data-Intensive Applications by Martin Kleppmann focuses on the Foundations of Data Systems, you may still have lingering questions after reading it. For individuals who are particularly passionate or deeply interested in database engines, I recommend “Foundations of Data Systems on steroids” — Database Internals by Alex Petrov.

Database Internals by Alex Petrov

At an advanced level, you will gain knowledge in:

  • Storage Engines Internals
  • In-depth exploration of column- versus row-oriented DBMSs
  • B-Tree Basics
  • Various database file formats
  • Implementation of B-Trees
  • In-depth understanding of Transaction Processing and Recovery
  • Log-Structured Storage (see the toy sketch after this list)
  • Comprehensive study of Replication and Consistency
  • Leader Election concepts
  • Practical insights into working with Distributed Transactions

And a multitude of other database topics that database enthusiasts will find intriguing.
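As a tiny illustration of the log-structured storage idea from the list above, here is a hypothetical append-only key-value store with an in-memory index, loosely in the spirit of Bitcask and simplified far beyond anything production-grade (no compaction, no crash recovery of the index):

```python
import json


class AppendOnlyStore:
    """Toy log-structured store: writes append to a log, reads use an in-memory index."""

    def __init__(self, path: str = "data.log"):
        self.path = path
        self.index: dict[str, int] = {}  # key -> byte offset of its latest record
        open(self.path, "ab").close()    # make sure the log file exists

    def put(self, key: str, value: dict) -> None:
        record = (json.dumps({"key": key, "value": value}) + "\n").encode("utf-8")
        with open(self.path, "ab") as log:
            self.index[key] = log.tell()  # newer offsets shadow older versions
            log.write(record)

    def get(self, key: str) -> dict | None:
        if key not in self.index:
            return None
        with open(self.path, "rb") as log:
            log.seek(self.index[key])
            return json.loads(log.readline())["value"]


store = AppendOnlyStore()
store.put("user:1", {"name": "Ann"})
store.put("user:1", {"name": "Ann", "city": "Oslo"})  # an update appends a new record
print(store.get("user:1"))  # {'name': 'Ann', 'city': 'Oslo'}
```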

Step 9: Gain Proficiency in Data Warehousing

Data warehousing remains a crucial and foundational concept in data management and analytics, despite the emergence of new and innovative data architecture styles. It plays a vital role in organizing and analyzing extensive datasets, facilitating informed decision-making, and providing strategic insights. Despite differing opinions on its relevance, the principles of data warehousing persist and continue to be highly significant in the ever-evolving landscape of data technologies.

Furthermore, the importance of modeling, especially dimensional modeling, and the use of schema designs like star and snowflake schemas cannot be overstated. While you have already acquired a wealth of relevant knowledge and skills in Step 2, it’s worth noting that these skills might still be somewhat superficial.
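To ground the terminology, here is a tiny, hypothetical star schema expressed with Python’s built-in sqlite3: one fact table holding measures, surrounded by dimension tables holding descriptive attributes. The table and column names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the who/what/when of each sale
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
    CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);

    -- The fact table holds measures plus foreign keys to each dimension
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        product_key  INTEGER REFERENCES dim_product(product_key),
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        quantity     INTEGER,
        revenue      REAL
    );
""")

# A typical analytical query: revenue by product category and month
query = """
    SELECT p.category, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key = d.date_key
    GROUP BY p.category, d.month;
"""
print(conn.execute(query).fetchall())  # empty until the tables are populated
```

A snowflake schema would further normalize the dimensions (for example, splitting category into its own table), trading simpler joins for less redundancy.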

And here comes another legendary book in the world of data engineering — The Data Warehouse Toolkit by Ralph Kimball. Within its pages, you’ll find a comprehensive collection of dimensional modeling techniques, ETL strategies, and various other invaluable insights.

The Data Warehouse Toolkit by Ralph Kimball

At an advanced level, you will gain knowledge in the following areas:

  • Expanded Dimensional Modeling Techniques
  • Fine-Tuning Schema Designs
  • Optimized ETL Strategies
  • Data Warehouse best practices that transcend the foundational level
  • Real-world case studies
  • Integrating data warehousing with emerging technologies
  • Kimball’s approach to continuous learning and adaptation

The Data Warehouse Toolkit is undoubtedly a great book, but while the core principles of data warehousing remain relevant, the field itself is dynamic. It’s a good idea to supplement your learning with additional resources, stay informed about the latest developments in data warehousing, and consider how newer concepts such as Data Lakes, Data Mesh, and Data Vault fit into the broader landscape.

Step 10: Understand Data Architectures

Data fabric, data lakehouse, and data mesh have recently emerged as viable alternatives to the modern data warehouse. These new architectures offer solid benefits, but they are also surrounded by a lot of hyperbole and confusion. It is crucial to clearly understand and define their pros and cons.

One more valuable resource in our list of books is Deciphering Data Architectures by James Serra, a big data and data warehousing solution architect at Microsoft. While this book is currently in early access and is set to be released in May 2024, it can be a good starting point to gain a lot of useful and well-structured information about data architecture.

At an intermediate level, you will gain knowledge in the following areas:

  • Developing a working understanding of various data architectures.
  • Understanding the pros and cons of each approach.
  • Differentiating between data architecture theory and practical implementation.
  • Selecting the most suitable architecture for specific use cases.
  • Grasping distinctions between data warehouses and data lakes.
  • Common data architecture concepts to enhance solution-building.
  • Clarifying each data architecture to alleviate confusion.
  • Understanding which architectures are appropriate for each cloud provider.

Deciphering Data Architectures by James Serra

After becoming familiar with all data architectural concepts, you can now proceed to the primary reason for acquiring this information: understanding how to turn data into value.

With Data Management at Scale by Piethein Strengholt, you’ll learn how to design a next-gen data architecture that takes into account the scale you need for your organization.

At an advanced level, you will gain knowledge in the following areas:

  • Construction of a modern data architecture
  • Recent technology developments and industry trends.
  • Organizing data using data domains and mapping these domains to a technology architecture.
  • Data governance and data security best practices, how to work with regulatory requirements, and address privacy concerns.
  • Data product management, master data, and reference data management.
  • In-depth exploration of domain-driven design and data product design.
  • A detailed examination of data mesh and data fabric.
  • Self-service data marketplaces.
  • The enterprise metadata model.

And on top of all this super useful information, you will get a real example of implementing modern data architecture step-by-step in the final chapter.

Data Management at Scale by Piethein Strengholt

Step 11: Don’t stop

As you embark on this journey to master data engineering, the final step is a reminder that this comprehensive roadmap is merely a starting point. In a world of relentless change, each piece of knowledge acquired becomes a valuable tool, empowering you to adapt, innovate, and excel.

Remember that staying stagnant in your knowledge can quickly render your skills obsolete. So keep your curiosity alive, and maintain a thirst for learning.

Thank you for your attention, and best of luck!

Data and Software Engineer specializing in using cloud technologies for business growth. https://www.linkedin.com/in/aleh-belausau/