Building Data Platforms II — The Age of the plumber is over
[Update — Part III is now available here]
In the first part of this series we talked about the ETL bias and the problems it brings to the way Data is perceived and managed in companies. The first chapter ended with the following question that is going to be the motivation for this second part.
“If Data is the most precious asset in a company, does it make sense to have only one team responsible for it?”
The answer is of course no, and if you ask around in your company people will probably have the same feeling. However, if you look around you will probably find a Data team swamped in requests for new ETLs, Data Warehouse schema upgrades and even dashboards. There’s a complete misalignment between what is intuitive to most people and how they act. To fix this, I believe there are 3 main things that must happen:
- Distributed Data ownership
- Data Plumbers become Data Engineers
- The evolution of the Software Engineer (will be covered in the next chapter)
Distributed Data ownership
The foundational problem of today’s Data landscape is not tooling but a misplaced sense of ownership. The entire purpose of microservices and of applying domain-driven design is to have teams work on a given problem, understand it and build functionality to address it. In the modern world, Data is a vital component that companies use to stay ahead of their competitors. If it is so important, why do we centralize it? Why do we have Data Engineers maintaining ETL pipelines that bring information from domain X into a Data Lake and/or Warehouse to answer questions they know nothing about? This puts Data Teams in the middle, between the teams that know the domain and the internal stakeholders that have questions about it. This problem is perfectly described in the great article by ThoughtWorks that introduces the concept of a Data Mesh.
The high level solution to this problem is simple: decentralization. One must bring the teams that own the systems that generate data and understand a given domain together with the consumers of that Data. This concept is so clear and makes so much sense, and it has been there in front of us all this time. It is easier said than done, though, because decentralization brings its own challenges, such as governance or, in more practical terms: how do we ensure things have a certain level of consistency?
Although decentralization is crucial to enable companies to use Data more efficiently, I believe that most companies don’t need to fully embrace a change as elegant, but also as complex, as a Data Mesh. One of the challenges I see with the Data Mesh is the following:
How do you bring a Data Mesh to life?
It is very tempting to believe Data Mesh will solve all of our problems. I strongly believe that we must start with changing the mindset of people so they start seeing and thinking about Data in a more decentralized approach. This is the hardest step that has to be done and this requires a big shift in how we, as an industry, think about Data Engineers and Software Engineers. For the first, we need to totally redefine what they do. For the second, we need to understand what they need to start doing.
Data Plumbers become Data Engineers
Let me ask you to do something simple: open your favorite search engine and type the following: “Data Engineering skills”. You will most likely find results that contain some of the concepts presented below.
What do all of these results have in common? They are all part of a mindset where Data Engineers are responsible for writing some sort of ETL to fetch Data from operational systems into somewhere else. In this mindset, Data Engineers are basically plumbing data around, building pipelines to satisfy a variety of teams whose domains they know nothing about. How many Data Teams do plumbing work to support Analytics and also build datasets for AI teams to train their Machine Learning models? Too many, and this is one of the main reasons why “Data is slowing us down”.
In order to start promoting decentralization of Data we need to clarify what Data Engineers should not be doing.
Data Engineers should not build ETL pipelines
This is a strong statement and it clashes with how a lot of people see the role of Data Engineers in the industry. However, if we want to promote decentralization, we must destroy this main point of centralization. Once we accept that this must happen, a natural follow-up question arises: “What should a Data Engineer do?”
Well, as the name implies, Data Engineers should build systems to manage Data. But what exactly does that mean? If they don’t build ETLs anymore, and they are not plumbers, what else is there to do on the Data side? They need to build software that answers the following questions.
Is Data Correct?
This is one of the most important topics due to the importance that Data has in today’s world. Let’s imagine your company has two Data Pipelines, one to feed internal Analytics and another to generate datasets that will be used to train Machine Learning models. Now let’s assume that one of the source systems that holds information about an important attribute (e.g. the duration of an action) introduces a bug that causes both pipelines to generate data with errors. Most of the time, the observability metrics of a data pipeline focus on answering the following questions:
- Can I connect to the source system?
- Are there any exceptions during transformation?
- Am I able to write to the output destination?
What happens in this case is that, despite all pipeline metrics being green, the system is generating data that is wrong. This type of error usually takes some time to troubleshoot and can have really bad consequences for the company, because most decisions and systems (especially AI-based ones) depend on the quality of the Data.
Therefore, it is crucial that Data Engineers focus on building systems that detect these problems and alert the teams that need to know that something is not correct. Tools such as Cuelang, Great Expectations or Pandas Profiling can provide great support here.
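To make the idea more concrete, here is a minimal sketch of such a check in plain Python. It does not use any of the tools above; the `Rule` class, the `validate` function, the `duration_s` field and the thresholds are all names I made up for illustration. The point is only that the rules look at the content of the data, not at whether the pipeline ran.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Rule:
    """A named expectation that every record should satisfy (hypothetical helper)."""
    name: str
    check: Callable[[dict], bool]


def validate(records: Iterable[dict], rules: list[Rule],
             max_failure_rate: float = 0.0) -> dict[str, float]:
    """Return the failure rate of every rule that exceeds max_failure_rate."""
    rows = list(records)
    violations = {}
    for rule in rules:
        failed = sum(1 for row in rows if not rule.check(row))
        rate = failed / len(rows) if rows else 0.0
        if rate > max_failure_rate:
            violations[rule.name] = rate
    return violations


# Assumed rules for the "duration of an action" attribute from the example.
RULES = [
    Rule("duration_present", lambda r: r.get("duration_s") is not None),
    Rule("duration_non_negative",
         lambda r: r.get("duration_s") is None or r["duration_s"] >= 0),
    Rule("duration_below_one_day",
         lambda r: r.get("duration_s") is None or r["duration_s"] <= 86_400),
]

batch = [
    {"action_id": 1, "duration_s": 12.5},
    {"action_id": 2, "duration_s": -3.0},  # the kind of value an upstream bug produces
    {"action_id": 3, "duration_s": None},
    {"action_id": 4, "duration_s": 40.0},
]

violations = validate(batch, RULES)
for name, rate in violations.items():
    # In a real system this would notify the owning team instead of printing.
    print(f"ALERT {name}: {rate:.0%} of records failed")
```

Every connectivity and exception metric would stay green for this batch, yet two of the three rules fire. A tolerance such as `max_failure_rate` is a design choice: some attributes can live with a small share of bad records, others cannot.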
Is Data up to date?
Having high quality data is not enough for a company these days. Today’s world moves at a very fast pace and that means companies want to get the pulse of how their business is going. What really matters here is not the latency of processing (batch vs streaming) but the freshness of data. Depending on the use case, we might need different service levels for different data items. Let me give you an example of why this is important. The margin of a product does not need to be available every time a customer makes a purchase, but if that data is evaluated on a bi-weekly basis the team must be made aware when the latest entry is 4 or 5 days old. A more advanced way to tackle this problem is to have a system that understands the patterns of data flows and triggers an alarm every time something goes wrong. This system can be anything from an outlier detection model to a set of hard coded rules.
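The hard coded end of that spectrum can be sketched in a few lines of Python. The dataset names and the service levels below are assumptions chosen for the example, and `stale_datasets` is a hypothetical helper, not part of any tool.

```python
from datetime import datetime, timedelta, timezone

# Assumed service levels: the maximum acceptable age of the latest entry.
FRESHNESS_SLA = {
    "product_margin": timedelta(days=3),
    "checkout_events": timedelta(minutes=15),
}


def stale_datasets(latest_entry: dict[str, datetime],
                   sla: dict[str, timedelta],
                   now: datetime) -> dict[str, timedelta]:
    """Return the age of every dataset whose latest entry is older than its SLA."""
    overdue = {}
    for name, max_age in sla.items():
        age = now - latest_entry[name]
        if age > max_age:
            overdue[name] = age
    return overdue


now = datetime(2021, 3, 10, 12, 0, tzinfo=timezone.utc)
latest_entry = {
    "product_margin": now - timedelta(days=5),      # too old for a 3 day SLA
    "checkout_events": now - timedelta(minutes=4),  # well within 15 minutes
}

overdue = stale_datasets(latest_entry, FRESHNESS_SLA, now)
for name, age in overdue.items():
    print(f"ALERT {name}: latest entry is {age} old")
```

Note that the check is about how old the newest record is, not about how fast the pipeline runs. A batch job can satisfy the margin SLA while a streaming job that silently stopped receiving events would violate the checkout one.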
Is Data Accessible and Discoverable?
How many times have you asked yourself (or heard someone asking)
Where can I find information about X and how can I access it?
The common answers to this question are some variation of “Go talk with person Y”, “You have these API endpoints” or “It’s on a spreadsheet, I will send it to you” (you open it and the last edit was 2 quarters ago). It turns out that doing this properly and keeping it up to date is harder than it looks. Concepts have connections with different levels of relevance and they might exist in more than one system. This brings up something very important that companies do not give enough attention to: metadata. Metadata is data about data. Let’s take an article here on Medium, for instance. In its metadata one would see that it was written by the user with ID I, takes around Y time to read, was published on day D, is related to topics A, B and C and has links to other articles or tools. Without reading the article, and only by looking at its metadata, you get a pretty good understanding of it. The only way to work with data at scale is by having good metadata.
Data Engineers should build (or maintain) systems that manage metadata to make the lives of data consumers easier. Those consumers can be people in the Finance Team, a group of Data Scientists, a Product Manager or a Software Engineer that just joined the company. All of them should be able to navigate through the company’s metadata to find definitions of concepts and how they relate to each other. Amundsen is a great example of a tool that tries to solve this problem.
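As a rough illustration of what such a system stores, here is a toy in-memory catalog in Python. It is not how Amundsen works; the `DatasetMetadata` fields, the `Catalog` class and the example entries are all invented for this sketch.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetMetadata:
    """Data about a dataset: who owns it, where it lives, what it relates to."""
    name: str
    owner: str
    description: str
    location: str
    updated_on: date
    topics: set[str] = field(default_factory=set)
    related: set[str] = field(default_factory=set)


class Catalog:
    """A toy metadata store with a naive keyword search."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetMetadata] = {}

    def register(self, entry: DatasetMetadata) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[DatasetMetadata]:
        needle = term.lower()
        hits = [
            entry for entry in self._entries.values()
            if needle in entry.name.lower()
            or needle in entry.description.lower()
            or any(needle in topic.lower() for topic in entry.topics)
        ]
        return sorted(hits, key=lambda entry: entry.name)


catalog = Catalog()
catalog.register(DatasetMetadata(
    name="orders", owner="checkout-team",
    description="One row per confirmed customer order.",
    location="warehouse.sales.orders", updated_on=date(2021, 3, 9),
    topics={"sales"},
))
catalog.register(DatasetMetadata(
    name="product_margin", owner="finance-team",
    description="Margin per product, computed from order and cost data.",
    location="warehouse.finance.product_margin", updated_on=date(2021, 3, 5),
    topics={"finance", "sales"}, related={"orders"},
))

for entry in catalog.search("margin"):
    print(f"{entry.name}: ask {entry.owner}, stored at {entry.location}")
```

With something like this in place, “Where can I find information about X and how can I access it?” becomes a search instead of a conversation, and the `related` field is where the connections between concepts would live.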
Is Data Secure?
Nowadays, Data security has become crucial to every company. Customers and the overall audience are more aware of how companies handle their data. If we want to promote decentralization of Data we must do so with some rules and processes that are adopted by everyone. Data Engineers should work in collaboration with Product Managers, Software Engineers and Security Engineers to:
- Ensure everyone has access to the data they need to do their job (no more and no less)
- Ensure the company has access logs for all data items
- Ensure all systems have proper data retention policies
- Control and centralize the management of sensitive data (e.g. PII, customer contracts, etc.)
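The four points above can be sketched as one small Python component. This is only an illustration under my own assumptions: `DataPolicy`, `AccessGateway`, the role names and the retention periods are hypothetical, and a real setup would rely on the access controls of the underlying storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataPolicy:
    allowed_roles: frozenset[str]  # who may read the dataset, no more and no less
    retention: timedelta           # how long records may be kept
    sensitive: bool = False        # e.g. PII or customer contracts


@dataclass
class AccessGateway:
    policies: dict[str, DataPolicy]
    access_log: list[tuple[datetime, str, str, bool]] = field(default_factory=list)

    def can_read(self, role: str, dataset: str, now: datetime) -> bool:
        """Decide access and log every attempt, whether allowed or denied."""
        policy = self.policies.get(dataset)
        allowed = policy is not None and role in policy.allowed_roles
        self.access_log.append((now, role, dataset, allowed))
        return allowed

    def is_expired(self, dataset: str, created_at: datetime, now: datetime) -> bool:
        """True when a record has outlived the retention period of its dataset."""
        return now - created_at > self.policies[dataset].retention

    def sensitive_datasets(self) -> list[str]:
        """One central place to list everything flagged as sensitive."""
        return sorted(name for name, p in self.policies.items() if p.sensitive)


now = datetime(2021, 3, 10, tzinfo=timezone.utc)
gateway = AccessGateway(policies={
    "customer_contracts": DataPolicy(frozenset({"legal", "finance"}),
                                     timedelta(days=365 * 7), sensitive=True),
    "page_views": DataPolicy(frozenset({"analytics", "product"}),
                             timedelta(days=90)),
})

print(gateway.can_read("analytics", "page_views", now))
print(gateway.can_read("analytics", "customer_contracts", now))
print(gateway.is_expired("page_views", now - timedelta(days=120), now))
print(gateway.sensitive_datasets())
```

The design choice worth noting is that denied attempts are logged too, and that unknown datasets are denied by default. The policies themselves stay in one place even though the data is owned by many teams.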
The last of the three things that need to happen to promote decentralization of Data is the evolution of the Software Engineer. Originally, I wanted to include that section in this article, but I decided to leave the topic for the next part of this series because this part is already long. In the next chapter, I will cover how Software Engineers need to evolve and give some examples of how one can work with Data in a decentralized and more efficient way.