How to Become a Data Engineer: The AI Plumber?

A 21st-century roadmap to becoming a Data Engineer

Natu Lauchande
Omdena
6 min read · Oct 24, 2020


What is a data engineer?

In broad strokes, a data engineer is responsible for engineering the systems and tools that allow companies to collect raw data of varying sources, volumes, and velocities into a format consumable by the broader organization. The most common downstream consumers of data engineering products are a company's AI/Machine Learning and Analytics functions.

The best way to start a discussion of this new and loosely defined role is the Data Science hierarchy of needs brilliantly depicted by Monica Rogati in the pyramid below.

Source: The Medium post “The AI Hierarchy of Needs” by Monica Rogati

A data engineer is the lead player on the first three foundational rows of the pyramid: Collect, Move/Store, and Explore/Transform. A plethora of roles, from Data Analysts and Data Scientists to Machine Learning Engineers, are the heirs and lead players in the higher phases of the value chain.

A Data Engineer is part of the function that provides the base for the highly critical job of the Data Scientist, hiding all the complexities involved in managing, storing, and processing the company's data assets. He or she is a master of data ingestion, enrichment, and operations.

Source: O'Reilly

With the deluge of data available within public and private companies, the ability to unlock this value is the critical factor in providing cheaper and better services to stakeholders and customers.

Skills of the trade

Data Engineers come in different flavors and types. The core skills of the trade can be summarized below, in order from essential to important:

Software Engineering: Data Engineering is, in essence, a discipline of Software Engineering, applying the same rhythms and methodologies of work to get the job done. The use of version control, unit testing, and agile techniques to ensure business alignment and quick delivery is paramount for success.
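
As a small illustration, here is a minimal sketch of the kind of unit test a data engineer might write around a transformation step (the function, test, and data are hypothetical, using Python's built-in unittest):

```python
# Minimal sketch: unit-testing a small data transformation (hypothetical example).
import unittest

def normalize_email(raw: str) -> str:
    """Trim whitespace and lowercase an email address before loading it."""
    return raw.strip().lower()

class TestNormalizeEmail(unittest.TestCase):
    def test_strips_and_lowercases(self):
        self.assertEqual(normalize_email("  Alice@Example.COM "), "alice@example.com")

if __name__ == "__main__":
    unittest.main()
```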

Relational Database/Data Warehouse Systems: Most data access in the data engineering space is democratized through ad-hoc querying of a relational database environment, allowing expert users with basic knowledge of SQL to retrieve the data they need to answer a business question or support a decision.
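
For instance, here is a sketch of that kind of ad-hoc SQL access, run against an in-memory SQLite database (the table and figures are invented for illustration):

```python
# Sketch: ad-hoc SQL against a relational store (SQLite in-memory, invented data).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("ZA", 120.0), ("ZA", 80.0), ("PT", 45.5)])

# The kind of query an analyst might run to answer a business question:
query = "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY 2 DESC"
for country, total in conn.execute(query):
    print(country, total)
```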

Scalable Data Systems/Big Data: Understanding data systems architectures is central to the modern data engineer, starting with a good grasp of how distributed and parallel processing work. Knowing the different types of indexing available in your environment, and how to use them for proper and efficient processing of the data at your disposal, is a great skill to have.
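
To make the idea concrete, here is a toy sketch of the split/map/merge pattern that underlies distributed processing, using Python's multiprocessing on a single machine rather than a real cluster (the data is made up):

```python
# Toy sketch of the map/reduce idea behind distributed processing,
# using a local process pool instead of a cluster.
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    """Map step: count words within one partition of the data."""
    return Counter(word for line in chunk for word in line.split())

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"] * 1000
    partitions = [lines[i::4] for i in range(4)]     # split into 4 partitions
    with Pool(4) as pool:
        partial = pool.map(count_words, partitions)  # parallel map step
    total = sum(partial, Counter())                  # merge/reduce step
    print(total.most_common(3))
```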

Operating Systems / Command Line: Familiarity with your local development environment, be it macOS, *NIX, or Windows, is paramount, particularly the command line, where a lot of ad-hoc wrangling can happen.

Data Visualisation: A fundamental skill for effectively exposing data products to a more general audience and quickly unlocking data value through clear infographics, charts, and interactive analytics. Familiarity with a tool like Tableau, Superset, or Power BI is a must.
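
Even outside those tools, a quick programmatic chart goes a long way; here is a small sketch with matplotlib (assumed installed; the figures are invented):

```python
# Sketch: a quick chart to expose a data product to a general audience
# (matplotlib assumed installed; numbers are invented).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
signups = [120, 180, 150, 210]

plt.bar(months, signups)
plt.title("Monthly signups")
plt.ylabel("Signups")
plt.tight_layout()
plt.savefig("signups.png")  # or plt.show() in an interactive session
```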

Data Science (Basics): An increasingly important user and stakeholder of a Data Engineering organization is the data science team. Understanding how data is used in the context of exploratory data analysis, machine learning, and predictive analytics ensures a virtuous cycle between critical data functions.
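
As a taste of what that downstream usage looks like, here is a minimal exploratory-analysis sketch with pandas (assumed installed; the data is made up):

```python
# Sketch: the kind of exploratory analysis data scientists run downstream
# (pandas assumed installed; data is invented).
import pandas as pd

df = pd.DataFrame({
    "channel": ["web", "app", "web", "app", "web"],
    "revenue": [10.0, 25.0, 7.5, 30.0, 12.0],
})
print(df.describe())                            # quick distribution summary
print(df.groupby("channel")["revenue"].mean())  # simple segment comparison
```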

Data Engineers don’t need to be experts in all of the areas above. Deep expertise in two of them, plus a good understanding of the others, goes a long way toward delivering value on a project.

Data Engineers can come in different shapes and forms, so be very specific about the role you are pursuing. As a nascent profession, it lacks standards and consistent job descriptions.

In industry, successful transitions into data engineering are typically seen from the following backgrounds: Software Developer/Engineer, Data Scientist, Database Administrator, Business Intelligence Developer, and Data Analyst.

The path to mastery

To master data engineering, I would start with the prerequisite of building deep experience and expertise in two or more of the following areas:

  • Distributed Systems / Big Data
  • Database Systems / Data Warehousing
  • Software Development
  • Data Visualization

The most traditional path to mastery is a degree in a discipline with high computing exposure (CS, EE, Information Systems, Applied Maths/Physics, Actuarial Science) or another quantitative degree, followed by a couple of years in Software Development or Data Science with practical exposure to backend services and production systems. That said, the data engineering field is loaded with rockstar engineers from non-traditional backgrounds (high school dropouts, literature majors, etc.).

A couple of top online courses and specializations available on the major platforms (Coursera, Udacity, Udemy, etc.) covering Big Data / Data Engineering tooling can give aspiring Data Engineers a good foundation. The ones with the best reviews on your preferred learning platform will help you build a skill set for the role.

After these initial foundations, I would recommend the following books for fundamentals in architecture:

  1. Designing Data-Intensive Applications — Martin Kleppmann
  2. The Data Engineering Cookbook — Andreas Kretz
  3. Foundations of Architecting Data Solutions — Ted Malaska et al.
  4. Streaming Systems — Tyler Akidau et al.
  5. The Data Warehouse Toolkit — Ralph Kimball

Nothing is more valuable at this stage than getting practical exposure in a real-world data engineer role. Keep practicing and growing the craft for the rest of your career.

Omdena, as an organization that runs AI challenges with volunteers across the world, is an ideal place for anyone to sharpen their data engineering skills. In many Omdena challenges, one of the most important skills needed is data engineering: preparing data, setting up data pipelines, and operationalizing those pipelines.

Typical tools of the trade

With all the excitement in the field, a plethora of tools is popping up in the market, and knowing which one to use becomes a problem because many of them overlap in purpose. A typical data engineering product or service does not differ much in complexity from any other software system.

A typical data engineering pipeline will require expertise in at least one tool per function/category:

Function: Pipeline Creation / Management

Apache Airflow

  • End-to-end workflow authoring and management tool.
  • Provides a computing environment where your processes can run.

Alternatives: Azkaban, Luigi, AWS SWF
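
Here is a minimal sketch of what an Airflow pipeline definition looks like, assuming Airflow 2.x; the DAG and task names are hypothetical:

```python
# Minimal Airflow DAG sketch (assumes Airflow 2.x; names are hypothetical).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def load():
    print("write the cleaned data to the warehouse")

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2020, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # run extract before load
```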

Function: Data Processing

Apache Spark

  • A fundamental tool to process data in many formats at high scalability.
  • Allows easy enrichment and processing in SQL, Scala, and Python.

Alternatives: Apache Flink, Apache Beam, Faust
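
A short PySpark sketch of SQL-based processing at scale (the paths and schema are hypothetical):

```python
# PySpark sketch: aggregate raw events into a curated table
# (paths and schema are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-event-counts").getOrCreate()

events = spark.read.json("s3://my-bucket/raw/events/")  # hypothetical source
events.createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT date(event_time) AS day, COUNT(*) AS n_events
    FROM events
    GROUP BY date(event_time)
""")
daily.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_events/")
```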

Function: Distributed Log/Queueing Systems

Apache Kafka — Scalable distributed queuing system that allows data to be moved and processed at very high speeds and volumes.
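
A sketch of producing and consuming messages using the third-party kafka-python client (assumes a broker on localhost:9092; the topic name is hypothetical):

```python
# Sketch: produce and consume Kafka messages with the kafka-python client
# (assumes kafka-python is installed and a broker runs at localhost:9092).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user_id": 42, "url": "/home"}')
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:   # blocks until records arrive
    print(message.value)   # each record is a raw byte payload
    break
```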

Function: Stream Processing

Apache Flink — a dedicated engine for low-latency, stateful stream processing. Kafka's own Streams API and Spark's streaming mode (above) cover many of the same use cases.

Function: Data/File Format

Apache Parquet — Very efficient columnar data format geared toward analytics and aggregations at scale, in the cloud or on-premises.

Alternatives: Arrow, CSV, etc.
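
A small sketch of Parquet's round trip through pandas (pandas and pyarrow assumed installed; the data is invented):

```python
# Sketch: write and read a columnar Parquet file with pandas + pyarrow
# (both assumed installed; data is invented).
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.99, 4.50, 12.00]})
df.to_parquet("transactions.parquet", engine="pyarrow")  # compressed, columnar

# Column pruning: read back only the column an aggregation needs.
amounts = pd.read_parquet("transactions.parquet", columns=["amount"])
print(amounts["amount"].sum())
```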

Function: Data Warehousing /Querying

BigQuery

  • A cloud-based data warehouse system for structured and relational data storage and analytics.

Alternatives: AWS Redshift, Apache Hive, etc.
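
A sketch of an ad-hoc query via the official google-cloud-bigquery Python client (assumes the library is installed and GCP credentials are configured; the table name is hypothetical):

```python
# Sketch: run an ad-hoc warehouse query with the google-cloud-bigquery client
# (assumes installed client and configured credentials; table is hypothetical).
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT country, COUNT(*) AS n_orders
    FROM `my_project.sales.orders`
    GROUP BY country
    ORDER BY n_orders DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.country, row.n_orders)
```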

Keep in mind that tools come and go over the years. Focusing on the big picture and the functional areas will keep you current and ready to learn the next fancy tool.

Starting or joining an open-source project that uses any data engineering tool is a good move from a growth perspective, and a path to longer-term mentorship by captains of the industry.

The future

In order to fulfill the promise of unlocking the value of data, more investment in the Data Engineering space is expected. There’ll be increasingly intelligent tooling available to handle the current and future challenges around data governance, privacy, and security.

I can see an increasing blend of AI and ML techniques directly in the Data Engineering toolchain, both for operations and for data quality assurance. A good example of such a tool is Deequ from AWS Labs, which applies machine learning to data profiling. Also central to modern Data Engineering is synthetic data generation, which alleviates data privacy issues when the cost of data acquisition and compliance is too high. Tools to watch in the synthetic data space: Snorkel, and the use of generative adversarial networks to generate realistic tabular data.

With the rise of AutoML for prediction and data analytics, a central role will be given to engineering the underpinning data infrastructure and datasets that drive enterprise strategy. From here, we can only see an outlook of increasing relevance and opportunities to contribute positively to society.

I would like to acknowledge Laisha Wadhwa, James Wanderi, and Michael Burkhardt for their input and suggestions on the article.

Natu Lauchande is a Principal Data Engineer in the fintech space, currently tackling problems at the intersection of AI, ML, and Software Engineering.