Data Engineer: The Sexiest Job of the 21st Century

Why We Need Them Before Data Scientists

Matthew Gazzano
Jun 29, 2023 · 9 min read

It's been over 10 years since Harvard Business Review published the viral article Data Scientist: The Sexiest Job of the 21st Century. But since then, organizations have failed to derive the intended value from Data Science roles, for a number of reasons.

Even in light of the recent AI boom and popular services like ChatGPT, many teams are still failing to answer basic questions about their business in a scalable way, let alone create predictive models. But why is this?

Many organizations are missing the leadership, data-driven culture, and infrastructure needed to scale out Data Science projects effectively.

What Specifically is Going Wrong with Data Science Projects?

Throughout the past decade, we’ve seen the following methodologies fail:

  • “A predictive model is the solution, now let's figure out the problem.”
  • “We have a reporting issue? No problem, let's just hire more Data Scientists.”
  • “Let’s have our Data Scientists build an advanced predictive model. The organization will definitely buy in on it.”

It seems as if the hype of new breakthroughs in the analytics domain caused a boom of new roles that many companies simply were not ready for. Additionally, their supporting teams did not have enough of a data literacy background to help Data Scientists succeed.

You may have heard of the 2021 failure of Zillow's iBuying program, which cost the company roughly $500,000,000 and led to a 25% workforce reduction. This was ultimately a case of deploying a machine learning model for the wrong use case, driven largely by a poor understanding of volatile real estate market conditions and compounded by pressure from upper management to launch the model.

It's examples like these that illustrate how organizations have been too quick to jump on the Data Science bandwagon without investing enough in the research and planning needed to be successful.

Organizations Are Starting to Focus on The Fundamentals

Recently, these issues have begun surfacing in many organizations, and as a result the market has seen a boom in Data Engineering roles over the past few years. To understand why this role is so important to the continued growth of the Data Science field, we must first dissect the foundational elements of the data lifecycle.

The Order of Operations: Digital Infrastructure

Data Engineers occupy a mission-critical role, serving as the bridge between operational source systems and the analytics platforms that Data Scientists sit on top of. Before predictive models and BI tools can be built, a series of initiatives, both process-driven and system-driven, must happen first.

1.) A Data-Driven Culture — While not tied to any one piece of technology, the first step to developing a successful analytics infrastructure is securing the support and buy-in of leadership. Without it, initiatives cannot get prioritized or funded, and infrastructure development grinds to a halt.

2.) A Well-Defined Business Process — A clear business process needs to exist and be documented. This means that data deliverables are defined and have a clear meaning to the business. For example, if you are a retail company, how is an order recorded? What data is required to finalize a sale: a credit card number, the date of purchase, a list of order lines, the subtotal, and tax? When a sale happens, what process is triggered afterwards? Is there a delivery / fulfillment process? How is inventory managed?

These questions should be clearly answered and account for as many anomalies as possible.

3.) A Developed Source System — Do you have an operational source system (or systems) that stores data about your business process? No, Microsoft Excel does not count. And does it account for all of the data that you will ultimately need to report on? Source systems, commonly referred to as OLTP (Online Transaction Processing) systems, are the systems that initially record data. Consider a simple example:

Think of a CRM system that collects information on customers. The city field needs to have data validation behind it. If it's a free-form text field, business users might enter "New York" as "NYC", "ny", or "New York city", which slows down or limits Data Scientists' ability to provide meaningful insights. Now imagine more serious problems with the CRM: inadequate associations to orders or products, or omitted information about customers. You can't build any meaningful data product until this is figured out.
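
To make that concrete, here is a minimal Pandas sketch of the kind of cleanup a free-form city field forces downstream; the alias mapping and sample records are hypothetical, and a well-built CRM would ideally enforce a picklist or validation rule at the point of entry instead.

```python
# Minimal sketch: normalizing a free-text "city" field after the fact.
# The alias mapping and records below are hypothetical examples.
import pandas as pd

CITY_ALIASES = {
    "nyc": "New York",
    "ny": "New York",
    "new york city": "New York",
    "new york": "New York",
}

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "city": ["NYC", "ny", "New York city", "Boston"],
})

# Normalize the free-text values so they can be grouped and analyzed reliably.
customers["city_clean"] = (
    customers["city"].str.strip().str.lower().map(CITY_ALIASES)
    .fillna(customers["city"].str.title())
)

print(customers)
```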

4.) A Well-Modeled Analytics Platform

Enter the Data Engineer.

Once the organization has a comprehensive source system (or systems), Data Engineers can start building data pipelines that connect those systems to an analytics environment. These pipelines produce a series of tables that can be consumed by analysts. Typically, organizations use cloud providers such as AWS, Azure, or GCP to host their analytics environment. Alternatively, teams can implement on-prem or independently hosted databases such as MySQL, PostgreSQL, or SQL Server.

In order to be effective, data needs to be modeled and available in these systems in a meaningful way. Teams can implement one of many modeling techniques, which include but are not limited to the following:

  • A Dimensional Model — This divides business entities into Fact and Dimension tables. Fact tables represent the business metrics or measures being analyzed, such as sales, revenue, or quantities. Dimension tables provide descriptive attributes or context for the metrics in the fact table (a minimal sketch follows this list).
  • Entity Relationship Model — Data is broken out into business entities as they relate to source systems. These could be tangible entities like customers, products, and orders, or intangible entities like events and transactions. Data is typically stored in third normal form (3NF), which eliminates redundancy.
  • Data Vault — Data is divided into three types of tables: Hubs, Links, and Satellites. Hubs represent the core business entities or concepts; each hub corresponds to a specific entity, such as customers, products, or locations. Links capture the relationships and associations between hub entities. Satellites store the detailed attributes and historical information related to the hub entities.
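
As a rough illustration of the dimensional approach, here is a minimal Pandas sketch that splits a flat, hypothetical orders extract into a star schema with one fact table and two dimension tables; the column names are illustrative only.

```python
# Minimal sketch: deriving a star schema (fact + dimensions) from a flat extract.
# All table and column names here are hypothetical.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102],
    "customer_name": ["Acme Co", "Globex"],
    "customer_city": ["New York", "Chicago"],
    "product_name": ["Widget", "Gadget"],
    "quantity": [3, 1],
    "revenue": [30.0, 15.0],
})

# Dimension tables hold descriptive attributes, each with a surrogate key.
dim_customer = (orders[["customer_name", "customer_city"]]
                .drop_duplicates().reset_index(drop=True))
dim_customer["customer_key"] = dim_customer.index

dim_product = orders[["product_name"]].drop_duplicates().reset_index(drop=True)
dim_product["product_key"] = dim_product.index

# The fact table keeps the measures (quantity, revenue) plus foreign keys.
fact_sales = (orders
              .merge(dim_customer, on=["customer_name", "customer_city"])
              .merge(dim_product, on="product_name")
              [["order_id", "customer_key", "product_key", "quantity", "revenue"]])

print(fact_sales)
```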

Once the analytics platform has been developed, teams can begin their analysis. This means they can sit BI tools / AI models on top of database tables and begin to answer meaningful questions about the business.

This is when an organization is ready to hire Data Scientists.

Data Engineers in The Data Maturity Lifecycle

A book I would highly recommend, Fundamentals of Data Engineering by Joe Reis and Matt Housley, references Monica Rogati's article on the Data Science Hierarchy of Needs. It outlines the foundational layers of the data lifecycle that organizations must work through before creating Machine Learning models.

[Image: The Data Science Hierarchy of Needs, from Monica Rogati on Hackernoon]

Data Engineers are primarily involved in the Move/Store and Explore/Transform stages, which make data accessible and scalable for model building. Without these steps, Data Scientists are likely to create redundant data cleansing and transformation steps, which limits the amount of time they can spend delivering business value through a dashboard or predictive model.

The Rise of Data Engineering in the Modern Data Stack

Data Engineers have a newfound importance to many organizations for the reasons above, and with it comes an increased scope of responsibilities. At their core, they own all elements surrounding the development of the Data Warehouse. Here are some of the most common projects they are expected to own:

Data Ingestion

This involves gathering data from source systems, typically via an API or files (Parquet, JSON, CSV) loaded to S3 on a schedule. It also involves an orchestration tool such as Apache Airflow, Dagster, or Prefect. These orchestration tools allow a series of tasks to be executed in a defined order so data can be moved across systems, and they are all frameworks configured using Python. The most popular, Airflow, uses the concept of DAGs (Directed Acyclic Graphs), which describe the series of steps that take place in a data pipeline. For example, a typical ETL (Extract, Transform, and Load) process might look like this as an Apache Airflow DAG:

  • Task 1: Read the Salesforce API for relevant data
  • Task 2: Load this data into an S3 bucket
  • Task 3: Transform the data with Pandas or PySpark (Python libraries)
  • Task 4: Load the structured data into the Data Warehouse

[Image: basic example of a DAG, from Apache Airflow]
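
As a rough sketch of what that pipeline could look like in code, here is a minimal Airflow DAG using the TaskFlow API (assuming a recent Airflow 2.x). The Salesforce call, S3 keys, and warehouse load are stubbed, hypothetical placeholders rather than a working integration.

```python
# Minimal sketch of the four-task ETL above as an Airflow DAG (TaskFlow API).
# Endpoints, bucket keys, and the warehouse load are hypothetical stubs.
from datetime import datetime
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def salesforce_to_warehouse():
    @task
    def extract_salesforce() -> list[dict]:
        # Task 1: call the Salesforce API for relevant data
        # (a real implementation might use a Salesforce provider or requests).
        return [{"order_id": 1, "amount": 125.00, "city": "New York"}]

    @task
    def load_raw_to_s3(records: list[dict]) -> str:
        # Task 2: write the raw records to an S3 bucket (e.g., with boto3)
        # and return the object key.
        return "raw/salesforce/orders.json"

    @task
    def transform(raw_key: str) -> str:
        # Task 3: read the raw file with Pandas or PySpark, clean and reshape it,
        # and write a structured Parquet file back to S3.
        return "staged/salesforce/orders.parquet"

    @task
    def load_to_warehouse(staged_key: str) -> None:
        # Task 4: copy the structured file into the data warehouse.
        print(f"Loading {staged_key} into the warehouse")

    load_to_warehouse(transform(load_raw_to_s3(extract_salesforce())))


salesforce_to_warehouse()
```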

There are also low-code / no-code ingestion tools, where consumers pay for the data they ingest and load. The biggest player in this area is Fivetran, which offers a series of predefined source system connectors and lets you sync data to your cloud data warehouse.

Data Modeling & Transformation

As mentioned briefly above, transforming and cleaning data sets tailored to business needs is becoming more important than ever. Due to this increased focus and the popularization of ELT (Extract, Load, and Transform) processes, organizations are hiring Analytics Engineers who focus solely on this work. This is a subset of Data Engineering that interacts more closely with Data Scientists and business users to understand how data should be modeled for a particular user group. Analytics Engineers have the technical skills of Data Engineers but are mostly focused on the "T" in ELT.

These transformations ultimately make the Data Warehouse more quickly accessible, understandable, and meaningful to the business.
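
As a minimal illustration of the "T" in ELT, the sketch below runs a transformation as SQL inside the database itself, after the raw data has already been loaded. SQLite stands in for a cloud warehouse here, and the table names are hypothetical.

```python
# Minimal sketch: an in-warehouse transformation (the "T" in ELT).
# SQLite is a stand-in for a cloud warehouse; tables are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, status TEXT, amount REAL, order_date TEXT);
    INSERT INTO raw_orders VALUES
        (1, 'complete',  120.0, '2023-06-01'),
        (2, 'cancelled',  45.0, '2023-06-01'),
        (3, 'complete',   80.0, '2023-06-02');

    -- The modeled table analysts actually query: cleaned, filtered, aggregated.
    CREATE TABLE daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM raw_orders
    WHERE status = 'complete'
    GROUP BY order_date;
""")

print(conn.execute("SELECT * FROM daily_revenue").fetchall())
```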

Unit Testing

Data Engineers are also responsible for ensuring that there are data quality checks at all points of the data pipeline. For example, we want to make sure that the primary key of a table is unique and never null. Of course, more custom and advanced testing will also take place, such as ensuring that a particular aggregation ties back to a source table. It can also mean testing edge cases, like how the pipeline handles extreme values, and understanding what will cause it to fail if an anomaly appears in a source system.
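
Here is a minimal sketch of what such checks might look like with Pandas; the table and column names are hypothetical, and in practice these checks would be wired into the orchestration or a dedicated testing framework.

```python
# Minimal sketch of pipeline data-quality checks.
# The "orders" table and its columns are hypothetical.
import pandas as pd


def check_primary_key(df: pd.DataFrame, key: str) -> None:
    """Fail the pipeline if the primary key is null or duplicated."""
    assert df[key].notna().all(), f"{key} contains nulls"
    assert df[key].is_unique, f"{key} contains duplicates"


def check_aggregate_ties_out(fact: pd.DataFrame, source_total: float) -> None:
    """Fail if a warehouse aggregation no longer matches the source system."""
    warehouse_total = fact["amount"].sum()
    assert abs(warehouse_total - source_total) < 0.01, (
        f"Revenue mismatch: warehouse={warehouse_total}, source={source_total}"
    )


orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
check_primary_key(orders, "order_id")
check_aggregate_ties_out(orders, source_total=60.0)
```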

Resource and Performance Tuning

It is also the role of a data engineer to ensure that data pipelines are optimized for performance. While most cloud technology scales up as you use it, organizations still pay for the compute they consume, which can quickly add up. This means data engineers need to find the path of least resistance to get data from point A to point B. Additionally, they need to ensure that queries are performant, so Data Analysts and Data Scientists can get the answers they need quickly.
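
As one small, hypothetical illustration of that "path of least resistance": reading only the columns and partitions you need from a Parquet dataset, rather than scanning the whole table, is a common way to cut compute. The S3 path and column names below are placeholders.

```python
# Minimal sketch: push column selection and partition filters down to the scan
# instead of loading an entire (hypothetical) sales table.
import pandas as pd

# Expensive: scans every column of every file under the prefix.
# all_sales = pd.read_parquet("s3://analytics/sales/")

# Cheaper: read only the needed columns and the relevant partition.
march_revenue = pd.read_parquet(
    "s3://analytics/sales/",
    columns=["order_id", "revenue"],
    filters=[("order_month", "=", "2023-03")],
)
print(march_revenue["revenue"].sum())
```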

Data Cataloging

Data is useless unless it has business value and analysts know where to find it. This is where data cataloging comes in: it gives users a roadmap of where to find data, noting what each table represents, its grain, and its column definitions. Data Engineers should tie the physical tables and views back to business entities as they are understood by the rest of the organization. Popular data cataloging tools include Atlan, Data.World, and applications native to a cloud provider, such as AWS Glue Data Catalog or Google Cloud Data Catalog.

Data Governance

Often overlooked, data governance is what ties all of the above pieces together: it is the overall management and control of an organization's data assets. It involves the policies put in place to ensure the proper management, availability, usability, integrity, and security of data throughout its lifecycle. While Data Engineering should have one of the largest voices in data governance, it also requires working with Data Scientists and Data Analysts to holistically understand the organization's analytics ecosystem and agree upon its policies.

Conclusion

Data Science is still a thriving field and will continue to be for the foreseeable future. But more of the generic data science responsibilities, such as data pipeline building, are being shifted to Data Engineering. This is because organizations are finally realizing that they are not yet ready to build flashy models and need better infrastructure first. This further specialization of work will ultimately make analytics teams more productive and scalable.
