From Data Driven to Driving Data — The dysfunctions of Data Engineering

MrTrustworthy
11 min read · Jul 27, 2021


This is not the Data Engineering post you were looking for.

We’ll not talk about how to set up Spark, or what the difference between an HDFS NameNode and a DataNode is. Those things don’t matter. I’ve never seen a Data Engineering project fail because someone picked ORC instead of Parquet. Quite the opposite: many “data driven” initiatives fail even though they have the best engineers on the task and pick the “best” stack of technologies.

The problems we’re facing in Data Engineering are not the technologies. Sure, the entire ecosystem could do with a few more years of maturing, consolidating and bugfixing. But that’s not the reason why so many Data Engineering initiatives end up as overpriced technology POCs. For too many organizations, their Big Data cluster with dozens of VMs can only be called effective when the goal is to parallelize the burn-down of their cloud budget.

We talk so much about Data Driven companies that base all their important decisions on methodical analysis of data. We admire (and curse) companies like Facebook for turning people’s data into a tradable asset. There are so many articles exclaiming that data is now the most valuable resource in the world. And all that sounds pretty reasonable in theory… but also vastly removed from reality. When asking companies whether they’d prefer a few more terabytes of data or a suitcase full of gold, my guess is that most would take the suitcase. At least with gold, they’d know how to turn it into money.

Yes, many companies are doing ok. They have an Analytics team and some dashboards with KPIs. The developers sometimes push events and logs into something called Kafka. There are Data Scientists doing some fancy Neural-Machine-Something. None of the C-level executives really understands it, but the teams seem happy with their work — and that’s a good sign, right? Yes, that is good. But we should be doing so much better by now.

To be successful in the future, we have to take it further. Being Data Driven isn’t enough. Data is the engine as well as the fuel on which our businesses run. Its flow shapes a company, from the value delivery processes to the inter-personal communication paths of its employees. And to realize the potential of our data to its fullest extent, we need to get in the driving seat. We need Data Engineering — not just in the shape of a team of specialized developers — but as a methodical approach to re-designing how a company treats data, processes and projects on a fundamental level.

Strap in, because this post will be quite a journey. It begins in the Shire, where Data Engineers live in small holes in the ground. It will lead through vast valleys where Developers and DevOps Engineers are fighting for supremacy over Kubernetes. Together, they will sail the shores of Lake Data, where Business Analysts and Data Scientists swim in unison. And the journey will end in Mordor, where we’ll throw the cursed ring of business value into the fires of Mount Doom (which is, coincidentally, where most Product Owners live).

What is a Data Engineer?

Jeff Hale did an analysis of the skills required in Data Engineering job offers and wrote a nice post about his findings some time ago. By clicking through some job ads on Indeed, you can easily see the same pattern he found in his search. If you’re familiar with the industry, there’s actually not much of a surprise to be found by browsing those job postings. SQL, ETL, Spark… if you’re a Data Engineer, all those things should at least sound familiar to you.

What is surprising, though, is the combination of breadth and depth of skills that those job postings seem to require. Not only should you possess the skills of multiple different fields (like Cloud, Data Science, DevOps, Software Development and Business Intelligence), but you should also be good at all of them.

Sounds great, right? Someone who can do the work of at least four other jobs! Why even hire Developers or Cloud Engineers when a Data Engineer could also do their job? While we’re at it, why not replace Accounting, Marketing and HR with our golden-egg laying Data Engineers as well?

The different types of Data Engineers

Well, that’s obviously absurd. Looking back at individual job postings, we can quickly see the source of this: no single posting requires all the skills mentioned above; instead, each requires a different subset. The issue is that we don’t have a consistent role definition for a Data Engineer. There are actually four different roles that companies tend to lump together under the broad umbrella of “Data Engineering”:

Type 1: Specialized Software Developer: A developer who’s focusing on data-driven applications. Regularly searched for by companies that have moved to a microservice-oriented architecture that relies on streaming platforms like Kafka for inter-service communication. Common tasks are either writing services that produce streams of events, or services that listen for those events to perform some action. Sometimes, those developers also maintain the Kafka/RabbitMQ clusters, similar to how Search-teams often maintain their Solr/Elasticsearch setups themselves.
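To make that producer/consumer pattern concrete, here’s a minimal Python sketch. It uses an in-memory queue as a stand-in for a Kafka topic (a real Type 1 service would use a client library such as confluent-kafka or kafka-python), and the event names and amounts are entirely made up:

```python
import json
import queue

# Stand-in for a Kafka topic. In production this would be a real
# broker, partitioned and replicated; the shape of the code is the same.
order_events = queue.Queue()

def produce_order_event(order_id, amount_cents):
    """The 'producing' service: emit an event for each new order."""
    event = {"type": "order_created", "order_id": order_id,
             "amount_cents": amount_cents}
    order_events.put(json.dumps(event))

def consume_order_events(handler):
    """The 'listening' service: react to each event with some action."""
    while not order_events.empty():
        event = json.loads(order_events.get())
        handler(event)

# One service produces events, another tallies revenue from them.
produce_order_event(1, 1999)
produce_order_event(2, 500)
total = []
consume_order_events(lambda e: total.append(e["amount_cents"]))
```

The point of the pattern is the decoupling: the producer has no idea who (if anyone) is listening, which is exactly what makes it attractive for micro-service architectures.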

Type 2: Business Intelligence (BI) Developer: Grossly simplified, there are two different segments in the field of BI: one that’s business-focused and concerns itself with KPI definitions and dashboards, and one that’s tech-focused and responsible for Data Warehouses and ETL processes. It’s harder to find people for the second one, which is why (I assume) some companies try to make it more attractive by re-branding the position as Data Engineering. After all, your ETL cronjobs run on half a terabyte of data — that counts as Big Data Engineering, right?
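For the tech-focused half, the shape of the work is roughly this (a toy sketch in plain Python with SQLite; the tables, columns, and numbers are all invented for illustration):

```python
import sqlite3

# A toy nightly ETL job: extract orders from the operational store,
# transform them into a daily revenue figure, load the result into
# the reporting table the dashboards read.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (day TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2021-07-26", 1999), ("2021-07-26", 500), ("2021-07-27", 1200)],
)

# Extract + transform: aggregate revenue per day.
rows = conn.execute(
    "SELECT day, SUM(amount_cents) FROM orders GROUP BY day ORDER BY day"
).fetchall()

# Load: write into the reporting table.
conn.execute("CREATE TABLE daily_revenue (day TEXT, revenue_cents INTEGER)")
conn.executemany("INSERT INTO daily_revenue VALUES (?, ?)", rows)
report = conn.execute("SELECT * FROM daily_revenue").fetchall()
```

Swap SQLite for a warehouse and cron for an orchestrator, and you have the day-to-day of this role, whatever the job title says.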

Type 3: Data Scientist with a systems focus: Often also found under the name of Machine Learning Engineer (MLE), a role with a similarly inconsistent definition. During the height of the Data Science gold rush, companies quickly figured out that if you wanted to crunch through terabytes of data, you not only need algorithms but also infrastructure that can scale to those volumes. Setting up and maintaining Spark and/or Hadoop clusters are typical jobs that fall under this job profile. With surprising regularity, Data Science teams just pick the one among them who’s the least afraid of bash scripting and sacrifice that person to the MLE role. That person isn’t exactly happy about this, but hey, at least it looks good on the CV.

Type 4: Big-Database Admin: When a company starts hiring Data Scientists, they first try to get the Ops/DBA/SysAdmin team to take over the work of managing Data Science infrastructure. The Ops team will then try and convince the CTO that they really don’t have time for that because developers keep breaking MySQL by writing logs into it. If the Ops team is successful, this responsibility might be handed to the Data Science team, who in turn starts recruiting MLEs to get rid of it, and you end up with type 3. As Ops teams typically have lots of practice in arguing against something, this is the most common scenario. If the Ops team somehow isn’t successful in convincing the CTO, you’ll end up with this kind of role instead — Database Admins who manage HDFS and Hive instead of MySQL and MongoDB.

Organic development of the Data Engineering role

There’s an underlying pattern to all of those types, a certain organic development (known to anyone in IT as a “catastrophic mess”) through which the role is defined. It tends to start with some group in the company that’s unsatisfied with their (in)ability to work with data. Further analysis reveals a gap — some set of tasks that nobody wants to do because it’s not their job — and the company starts recruiting to fill it. The specific type of Data Engineer that’s recruited is effectively determined by whichever group argues for their need first.

If it’s the lead developers/architects who want to move from monolith to micro-services first, you end up with a type 1 Data Engineer. If the analysts (or, more likely, the C-level executives) are not happy with their lack of insight into the company’s operations, recruiting for type 2 starts. In the most common case right now, it’s the newly established Data Science team who blames the non-existent business value of their last project on the lack of usable infrastructure — and you end up with type 3 or 4 instead.

Now, there’s a variety of issues with this approach. Let’s start with an obvious one: What happens once a second and third group starts to put in their requests for “more data liberation”? Assume that the Data Scientists got their voice in first, and the company got a type 3 Data Engineer. Now the Lead Developer decides to split the existing PHP/Java monolith into neat little micro-services — and they apparently need Kafka for something called CQRS or DDD. Does the Machine Learning Engineer now have to build an Enterprise Service Bus for your Developers? Does the Lead Developer try to persuade the Ops team to set up and maintain a Kafka cluster for them? Do you hire another type of Data Engineer to fill this new gap? And what happens once the BI Analysts decide that they want to create nice dashboards based on the events that your micro-services send, or on the models that your Data Scientists built? After all, analyzing data is how data driven companies decide what to build next.

The value of integration

This is the point where many companies are getting stuck right now: The integration step. No matter from which perspective you approach Data Engineering first — Developer, Data Scientist, or BI/Analyst — you’ll have trouble integrating it with the other two angles later on.

This integration is definitely the hardest part about Data Engineering. Coincidentally, it’s also where the true value of Data Engineering comes from. It’s not only about the massive speed-up to all individual teams working with data. It’s also about the decoupling of organizational dependencies, the new opportunities that can be taken as a result, and the difference between Dev Teams and Empowered Product Teams.

Making everyone happy at the same time sounds like a hard job. But why should we even bother? We have three groups with their individual needs — creating three specialized, bespoke platforms sounds like a much better idea. Developers get some Kafka Brokers, Data Scientists get a dedicated Spark Cluster somewhere in the cloud, and for Analysts we just buy Tableau or Looker like everyone else does. Three Rings for the Elven-kings under the sky.
Go away Sauron, you can keep your Ring.

True end-to-end cost

Sadly, everyone who’s been in that situation knows that it’s not that easy. The one thing that’s often ignored is the true end-to-end time and effort it takes to turn data into money. This doesn’t happen on purpose, but as a consequence of increasing specialization and complexity. If you’re a company with 5 employees (including 3 working students), you only need a daily Excel export of your four MySQL tables for reporting. Your “recommender” is a bunch of if-elses that runs on the operational database. Everyone knows all processes and systems; no need to over-engineer your data pipelines.

When the company grows, it will piece by piece add new processes, tools, and teams that specialize in them. At some point, you’ll try to determine where a specific metric, KPI or recommendation is actually coming from, and you’ll find yourself on a surprisingly long journey: from the business department that’s accountable for the KPI to the analysts, from the DBAs to the developers responsible for the component. You’ll find yourself calling the old working student (who probably left the company 2 years ago) who originally implemented the first version of that business process. At some point you give up asking “why?” and simply rename your KPI to reflect its actual meaning. Average cart value is now called Average cart value (excluding blue pencils), and you end up intently staring at the ceiling fan whenever someone asks you about it. There are helicopter sounds in the distance.

As predictable as this is, those kinds of data pipeline structures keep appearing and re-appearing. It’s another type of organic growth: this time not for job titles, but for data architectures, pipelines, batch processes, and workflows. And working with data keeps getting slower and slower as time goes on.

Data Engineering as a Product Team

My hypothesis on the main reason for this: Data Engineers are often treated as a service team. Some development team needs a Kafka integration? Send a mail to Data Engineering! An Analyst needs another pipeline? Put a ticket in JIRA, tag Data Engineering, mark it as high prio — hopefully they’ll do it fast. And yes, Data Engineers generally act as enablers for other departments, often simply reacting to requests from an outside user. But, Data Engineers can work much better if you give them more responsibility: as a Product Team.

If you think of a classic product team structure, something like the ubiquitous “Checkout Team” example comes to mind: end-user-facing teams with designers, frontend and backend devs, maybe even some SRE or DevOps sprinkled in.

A Data Engineering product team has a slightly different anatomy, more akin to a B2B or internal-facing product team. Different people like different buzzwords, but I personally prefer the term Platform Team. Think of a team at AWS which is building Redshift, or any other cloud provider building an analytics cloud product — that’s how internal Data Engineering teams can work, too. They don’t have to build a cloud data product that works well for every company. Instead they can use public cloud & open source tools to build a data platform that works well for precisely your company, including all your weird communication structures, workflows, industry regulations, and the CEO’s preferred color scheme.

Compared to a service team, which would own all 100 data pipelines for different departments, a Platform Data Engineering team would just own the core pipeline product. And this product must allow each department to create and manage their own pipelines in self-service.

If a certain data pipeline or report crashes, the specific team that uses this pipeline should fix it themselves. They can’t? Well, then the Data Engineering team has to build better tools for their users, so they can manage (and recover) their pipelines on their own. After all, the platform is their product, and they’re responsible for its success.
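To make the ownership split concrete, here is a deliberately simplified sketch of what such a platform core could look like. The registry, team names, and pipelines are all hypothetical; the point is only who owns what:

```python
# The Data Engineering team owns this registry and runner (the product).
# Each department registers, runs, and fixes its own pipelines.
pipelines = {}

def register_pipeline(name, owner_team, run):
    """Self-service registration: the owning team, not Data Engineering,
    declares the pipeline and stays responsible for it."""
    pipelines[name] = {"owner": owner_team, "run": run}

def run_all():
    """The platform runs everything and routes failures to the owners."""
    results = {}
    for name, p in pipelines.items():
        try:
            p["run"]()
            results[name] = "ok"
        except Exception as exc:
            # A failure is reported to the owning team, not fixed centrally.
            results[name] = f"failed, notify {p['owner']}: {exc}"
    return results

# Two departments bring their own pipelines; one of them is broken.
register_pipeline("daily_revenue", "bi-team", lambda: None)
register_pipeline("churn_model", "data-science", lambda: 1 / 0)
status = run_all()
```

Notice that the Data Engineering team’s code never contains business logic: their success metric is how easily other teams can plug their own pipelines in, and how clearly failures land back with the team that owns them.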

Inversion of responsibility

This is an inversion of responsibility, very similar in spirit to the ideas behind DevOps. Data Engineers are no longer directly responsible for how quickly they can build new pipelines; instead, they are responsible for how fast other teams can build new pipelines on top of their platform. It gives the developers, analysts, and data scientists more responsibility (and control!) over their own work, and Data Engineering more freedom to build scalable self-service solutions for the entire company. Instead of being bogged down with dozens of service requests, they can now focus on building an integrated data platform.

If some other team approaches Data Engineering with a request, their first response should be “what is missing in the current platform for you to be able to do this yourself?”. Their key success metrics should reflect how fast new users can build pipelines and reports on their own. They should write release notes for big updates of their data platform products in the internal message boards, and do internal marketing and educational workshops to teach users how to best work with their platform.

It’s not an easy path to take, especially when there’s already an established “Data Engineering as a service team” structure. Explaining to other teams that, no, Data Engineering will not build your dashboard anymore, but they’ll teach you how to do it yourself… will create some organizational resistance. But with some enthusiasm, a strong vision, and good communication, this can be overcome. In the end, you’ll work step by step towards more self-service-oriented data usage for your company. And nobody will care whether you store your data in ORC or Parquet.
