A complete guide on how to break the data monolith.

Juan López López
Published in Packlink Tech
13 min read · Mar 25, 2020

How to deal with the new big goal in the software world.

Break the data monolith!

Motivation

Data monolith

Humans are the only animals that trip twice over the same stone. After years of talking about breaking the monolith into services, we have done the same thing again: data monoliths (a.k.a. data lakes and data warehouses).

I have been working on many projects (in different companies) over the last few years, and I have seen that the problems the data monolith causes are similar:

  • Errors in data because changes are not coordinated between the code and the data pipelines.
  • The monolith usually creates information silos, because we need code specialists and specialized data and machine learning teams.
  • As companies grow, the number of data sources keeps increasing, which makes having all the data in one place a hard challenge.
  • It is usually hard for the people working on the data pipelines to understand the meaning of the data properly (at least not better than the people who produce it), because they have less context.
  • Sometimes bulk updates are needed in the data warehouse to fix misunderstandings and inconsistent data in the ETL processes.
  • It is difficult to reduce the gap between a new idea and that idea being in production. If you are slow, you cannot have a fast test-and-learn cycle, and without that, you cannot innovate.

Besides that, a data monolith is, in fact, a monolith, and we already understand the many bad things (and some good ones) that this involves; so we don't need to explain how these issues (coupled services, problems releasing new features that hinder continuous deployment, a messy test strategy, etc.) can become a problem.

The usual way to do it

The usual approach is to put the focus on our services and then, with an ETL process, get the data the services/sources produce and, after some transformations, put the wrangled data in a place from where it will later be used/served (data marts, APIs, BigQuery, model pipelines, etc.). These processes run inside data pipelines and, as we said before, we can distinguish three clear parts (ETL):

  • First, a process that ingests the data from the different sources.
  • A processing part where we clean the data and apply some transformations along the data pipelines.
  • A serving step for the wrangled data. In our case, BigQuery serves the data.
Data monolith pattern
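
To make the shape of this monolithic pipeline concrete, here is a minimal sketch of the three ETL parts in Python (the CSV source, the pandas/BigQuery usage and the `analytics.orders` table are illustrative assumptions, not Packlink's actual pipeline):

```python
# Minimal sketch of one monolithic ETL job: ingest, wrangle, serve.
import pandas as pd
from google.cloud import bigquery


def extract() -> pd.DataFrame:
    # Ingestion: pull raw data from one of the many sources (a CSV export is assumed here).
    return pd.read_csv("gs://my-bucket/exports/orders.csv")


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Processing: clean the data and apply some transformations.
    clean = raw.dropna(subset=["order_id", "amount"])
    clean.columns = [c.lower() for c in clean.columns]
    return clean


def load(clean: pd.DataFrame) -> None:
    # Serving: in our case, BigQuery serves the wrangled data.
    client = bigquery.Client()
    client.load_table_from_dataframe(clean, "analytics.orders").result()


if __name__ == "__main__":
    load(transform(extract()))
```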

How to break the monolith

The idea of this decoupling is to break this single pipeline into several pipelines, by domain or, well, by service, in the same way that we do with code services.

So, what if, instead of having only one pipeline, each service had its own (or more than one) to process its own data? Each service would then be responsible for cleaning, transforming and sharing its own data as immutable datasets treated as products.

What can we gain with this change? First of all, we must not forget that the different services change a lot. This is where the problems start: either we forget to coordinate the changes with the data people, or that coordination is simply not possible at that moment.
Really, it seems that the responsibility for not breaking the analytics or the other data services lies with the team that is changing the services, right?
Furthermore, who knows better than that team what the data looks like and what format it has?

So, in summary, each domain prepares its data and publishes it as a product to a messaging broker, from where other services, streaming tools, governance, analytics, data science tools (such as Jupyter notebooks), machine learning pipelines, a global storage and much more can consume it.

Data distributed strategy
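
As a rough sketch of what publishing a dataset as a product could look like on GCP (the project, topic and payload fields below are hypothetical; Kafka would work the same way conceptually):

```python
# Sketch: a domain publishes an immutable, versioned record of its data product to Pub/Sub.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "orders-domain-dataset")

record = {"dataset_version": "v1", "order_id": 42, "amount": 19.90}

# Data is immutable: we only append new versions, we never update records in place.
future = publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))
print(f"Published message id: {future.result()}")
```

From this topic, analytics, other domains or a global storage can subscribe to the data product without coupling to the producing service's database.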

Beyond this, I am sure that even if you are doing nothing with machine learning today, you will. So, think about it: is it possible to build a good model without good data? If you want to do a good job in ML you will probably need a model pipeline, also per domain, and it will need this data pipeline. If you want more information about this topic, you can read another post I wrote: continuous deployment in machine learning systems.

Data infrastructure as a platform

So, we are going to break the data monolith and organize the data by domain, but in practice teams have no time to each build their own data infrastructure. To solve this problem, companies have to put the focus on creating a common data infrastructure.

In the data mesh article on Martin Fowler's blog (written by my former Thoughtworks colleague Zhamak Dehghani), they list some capabilities this infrastructure should have, among them: scalable polyglot big data storage, data versioning, data schema, data lineage, data monitoring/alerting/logging, data product quality metrics, data governance, security, and compute and data locality.

With this data platform, teams (developers and data scientists) only need to care about what they want to do with the data they produce and how they are going to serve this data to other teams, just as we do today with CircleCI or Jenkins (teams only need to think about how they want to run their CI, not about the infrastructure). For instance, in this image, Metaflow explains the infrastructure from the data scientists' point of view:

Image from Metaflow

And the next image (from the data mesh post on Martin Fowler's blog) could be a possible final state. In this schema, you can see the different domains with their own pipelines and the transversal data platform.

Data Mesh schema from Martin Fowler’s blog

OK, OK, OK… I understand the concept and everything else, but… what about the strategy to achieve it?

Well, first of all, we love code, and for years we have been improving best practices in code, the best methodologies and all the techniques that XP contains. So, let's also use code in data, together with all the best practices we know.

We could use a lot of tools and frameworks to build our data pipelines with code. In the Packlink Engineering team we love GCP, so in this post I am going to explain this with Airflow, because the Google platform has a click-and-play, fully managed workflow orchestration service built on Airflow: Cloud Composer.

Remember that the important points are the ideas (the same as with services: it does not matter whether you use CircleCI, Jenkins, etc.); the tool is only a means to achieve our aims. Other good tools/frameworks are Metaflow, Luigi, Prefect and Pinball. At this moment every company is building its own framework, and they are all different, but, I insist, the point is the idea and that each of them is code. In the next sections I will go through:

  • Strangler pattern: the strategy to achieve it.
  • Data pipeline: what must this pipeline look like?
  • Code: the tested code to write the pipeline.

Strangler pattern

I am going to assume that you have a data monolith right now. If not, if you have a greenfield project, maybe this chapter is not interesting for you; but sorry, we are talking about breaking the monolith.

We have the data monolith. Which is the best strategy? Doing a waterfall and putting the whole new infrastructure in place at once does not seem like a good idea, so to achieve our objective we are going to use the Strangler pattern. This is a typical way to break the monolith in legacy systems and, as I said before, we have learned a lot and we want to reuse this knowledge.

This pattern is a good fit because we have a lot of things running well and we want to do this work with a useful and agile mindset, in these three steps:

  • Remove a domain from the monolith by building it as a new pipeline.
  • Put it in production and check that everything is working fine while the two systems co-exist.
  • Eliminate the now-legacy part from the monolith.
Strangler pattern in action

So, to start, we can choose the simplest domain and build a new data pipeline, then do a lazy migration with both systems co-existing and, when we are sure that everything is running well, delete the legacy part.

Data pipeline

The main points this pipeline must have are:

  • Data wrangling. Not all the data produced by the services is shared, and not all of it is shared raw. Sometimes we need to make some transformations. We will do this with code and, as I always say, code is code, so we always need to apply best practices.
  • Schema checking. If we want to ensure the quality of our data, we need to use a schema when we share it with other parts of the company (services, analytics, etc.). For instance, if we have people's ages, we want to ensure that the value is an integer and that the maximum value is… 150? 👵 You can use different data serialization systems to achieve this; I recommend Protocol Buffers or Avro. With this, both the producer and the consumer know what each data point means (see the Avro sketch after this list).
  • An export of immutable datasets as products. Each service is responsible for sharing its information with the organization. The optimal approach is to share immutable, versioned datasets via a messaging broker like Kafka or Pub/Sub. A good post about sharing data in distributed systems can be read here. Please note this: data, as we also learned in functional programming, must be immutable; this lets us keep the whole history and compute the aggregations we need later, without having to store them up front.
  • Data versioning/data as code. As we do with code, we need to version our datasets; even more, we want data as code. It has several benefits: we can easily reproduce errors, use the datasets in our tests and our pipelines, and gain a lot in machine learning (more information here), in short all the good things code has. DVC, a data and model version control system based on git, is the solution; to me, one of the best tools in the data world (see the DVC sketch after this list).
DVC diagrams about Data and models as code and sharing data
  • Data governance and lineage. This is, to me, one of the most important points in the data pipeline if we want to break the data monolith. First of all, we need a tool in the data platform to store it. There are some good alternatives, but I am going to recommend only two: Lyft's Amundsen and, if you use GCP, Data Catalog. Both of them have auto-discovery of the data. The important thing here is not the tool, it is the concept. In our pipelines we must save the description of the data we are sharing, push the metadata via API to our tool (like this library does for Data Catalog) and record whatever is needed to have correct lineage of the data. Thanks to that, everybody will know where the data is, who owns it, what it looks like and what each value means.
Data governance with Lyft's Amundsen
  • Observability. We need to understand how our system is working, and we are going to do it across three levels: the infrastructure level, job-level events and the data level. Create a dashboard with clear and useful information; this is the main point. We are building complex systems, so we need a dashboard with only the most important information, to easily understand whether our systems are running properly. With this, when shit happens, and it will, you will be the first one to discover it. A typical error in data is to forget the silent failures. If a source stops producing new data, we need to be aware of it early, because our models, other services or the business may depend heavily on that data. Furthermore, thanks to a good data pipeline we can achieve automated anomaly detection of failures, for instance when there are fewer events than expected in a period of time (see the sketch after this list).
  • Data quality. Tests, tests and more tests. We ❤️ tests. And besides the tests in code, we need some tests on the data. For instance, what do you think about a test in this pipeline that warns us if we are registering more orders than products? It seems logical and useful, right? (There is a sketch of such a test after this list.)
  • Get training data. If you are also working with machine learning, and because we want to make our data scientists' lives easier, you need to produce the training data. There are a lot of critical points here: importance-weighted sampling, cutting the data into smaller slices, anonymizing all sensitive data… more info here.
  • We take care of our clients, so we have to take care of their data, which means we need a lot of security in this data part. It is important to anonymize all sensitive data if we want to work with it.
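
As promised above, here is a hedged sketch of the schema-checking idea using an Avro schema (the fastavro library, the Person record and the age rule are assumptions used only for illustration):

```python
# Sketch: validate a record against an Avro schema before sharing it with other teams.
from fastavro.validation import validate

person_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

record = {"name": "Grandma", "age": 99}

# Both producer and consumer agree on this schema; business rules such as
# "age cannot exceed 150" can be checked alongside the type check.
assert validate(record, person_schema)
assert 0 <= record["age"] <= 150
```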
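And a small sketch of reading a versioned dataset through DVC's Python API (the repository URL, file path and tag are placeholders):

```python
# Sketch: fetch a specific version of a dataset tracked with DVC (version = git revision).
import dvc.api

with dvc.api.open(
    "data/orders.csv",                              # path tracked by DVC
    repo="https://github.com/acme/data-registry",   # placeholder data registry repo
    rev="v1.0",                                     # dataset version (git tag)
) as f:
    print(f.readline())
```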
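For the silent-failure case mentioned in the observability point, a naive first check could simply compare the number of events received in a window with what we usually see (the thresholds and counts are made up; in a real pipeline they would come from your metrics backend):

```python
# Sketch: detect a silent failure when a source produces far fewer events than usual.
def check_event_volume(events_last_hour: int, expected_per_hour: int,
                       min_ratio: float = 0.5) -> None:
    """Raise an alert if we received far fewer events than expected."""
    if expected_per_hour > 0 and events_last_hour / expected_per_hour < min_ratio:
        # In a real pipeline this would page someone or light up the dashboard.
        raise RuntimeError(
            f"Possible silent failure: only {events_last_hour} events received, "
            f"expected around {expected_per_hour}."
        )


check_event_volume(events_last_hour=900, expected_per_hour=1000)   # fine
# check_event_volume(events_last_hour=12, expected_per_hour=1000)  # would raise an alert
```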
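Finally, the orders-versus-products check from the data quality point could live as a plain data test next to the pipeline code (the counts are placeholders; in practice they would come from BigQuery or from the broker):

```python
# Sketch: a data-quality test that fails when the data stops making business sense.
def test_no_more_orders_than_products():
    orders_count = 1_200    # placeholder, e.g. SELECT COUNT(*) FROM analytics.orders
    products_count = 3_500  # placeholder, e.g. SELECT COUNT(*) FROM analytics.products
    assert orders_count <= products_count, (
        "More orders than products: the pipeline is probably duplicating or "
        "dropping events."
    )
```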

Yes, Juan, very beautiful, but…

Extreme programming to the rescue

First of all, we are going to keep treating everything as code, so the data pipeline too.

With the strangler approach, and in the first iteration, the logical thing is to reuse part of the logic we built before, and we are going to do it with TDD, as XP and our common sense dictate.

If you are wondering how, I am going to explain a simple method to achieve it quickly. Beyond that, you must decide your own long-term strategy.

If you know Airflow, you know that there are multiple operators (python_operator, http_operator, bash_operator, mysql_operator, and a long etcetera) depending on the kind of code you want to put in every DAG. These operators have some problems. The worst one, to me, is the complex and ugly way of testing them; furthermore, if you are using Python you will have heard about the dependency mess. With these operators you need to install all the dependencies in the whole project, you cannot use different versions of them, and you have to use the same version of Python. Moreover, we are a modern company and we want to have a polyglot system and try new languages!

My solution, while we apply the strangler pattern, is to use the Docker operator or the kubernetes_pod_operator. If you use Google Composer you should use the Kubernetes one and never the GKEContainerOperator, as the latter can lead to resource competition. With these operators, the code that will be executed in each DAG task lives inside a container (it can contain Python code, Scala, Java, Beam processes, Spark processes, Kafka Streams… in summary, whatever you can imagine), so you can do your TDD in the language you want and you can reuse some of your legacy code. Unit testing and integration testing are done… almost!

We still need to write Airflow's own code, so we will use TDD for that too. It will be the DAG definition testing part, included in the unit testing of the Airflow side. Before showing the code I want to thank Chandu Kavar and Sarang Shinde for their how-to posts about testing strategies in Airflow.

So, since we do TDD, first the unit tests:

DAG unit tests
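
The unit tests are embedded as a gist in the original post; as a hedged sketch (the DAG id packlink_data_pipeline and the task ids are assumptions consistent with the example DAG below), DAG definition tests usually look like this:

```python
# Sketch of DAG definition (unit) tests with pytest and Airflow's DagBag.
import pytest
from airflow.models import DagBag


@pytest.fixture(scope="module")
def dagbag():
    return DagBag(include_examples=False)


def test_dags_load_without_import_errors(dagbag):
    assert dagbag.import_errors == {}


def test_pipeline_has_expected_tasks(dagbag):
    dag = dagbag.get_dag("packlink_data_pipeline")
    assert dag is not None
    task_ids = {task.task_id for task in dag.tasks}
    assert {"packlink_hello_task", "stream_data_task"} <= task_ids


def test_hello_task_runs_before_stream_task(dagbag):
    dag = dagbag.get_dag("packlink_data_pipeline")
    hello = dag.get_task("packlink_hello_task")
    assert "stream_data_task" in hello.downstream_task_ids
```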

This is the pipeline’s production code that we have generated with TDD:

DAG that contains the example data pipeline
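
The production DAG is also embedded as a gist; a minimal sketch consistent with the tasks mentioned in this post (Airflow 1.10-style imports as used by Cloud Composer at the time; the container image and schedule are placeholders) could be:

```python
# Sketch of the example data pipeline DAG: a Python task whose result is handed,
# via XCom, to a containerized task run with the KubernetesPodOperator.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator


def say_hello(**context):
    # The returned value is automatically pushed to XCom.
    return "hello from packlink"


with DAG(
    dag_id="packlink_data_pipeline",
    start_date=datetime(2020, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    packlink_hello_task = PythonOperator(
        task_id="packlink_hello_task",
        python_callable=say_hello,
        provide_context=True,
    )

    stream_data_task = KubernetesPodOperator(
        task_id="stream_data_task",
        name="stream-data",
        namespace="default",
        image="eu.gcr.io/my-project/stream-data:latest",  # placeholder image
        # The container receives the value produced by packlink_hello_task via XCom.
        arguments=["{{ ti.xcom_pull(task_ids='packlink_hello_task') }}"],
    )

    packlink_hello_task >> stream_data_task
```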

But unit testing is not enough; at Packlink Engineering we know that tests are the most important part of the code. So, let's go with more testing.

Now it is time for the integration tests on the Airflow side (the containers have their own unit and integration tests). With this approach, the integration tests here are related to the Airflow config and XComs which, in summary, are the mechanism Airflow gives us to share information between tasks in a DAG. In our example, we need to test that the info saved by packlink_hello_task is the one picked up by stream_data_task:

DAG integration tests
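
The integration test gist is embedded in the original post; a hedged sketch of the XCom check (again using Airflow 1.10-style APIs, the task ids from the DAG above, and assuming an initialized Airflow metadata database) could look like this:

```python
# Sketch of an Airflow-side integration test for the XCom hand-off between tasks.
from datetime import datetime

from airflow.models import DagBag, TaskInstance


def test_hello_task_pushes_value_consumed_by_stream_task():
    dag = DagBag(include_examples=False).get_dag("packlink_data_pipeline")
    task = dag.get_task("packlink_hello_task")

    # Run the producing task once and inspect what it left in XCom.
    ti = TaskInstance(task=task, execution_date=datetime.now())
    ti.run(ignore_ti_state=True, ignore_all_deps=True)

    assert ti.xcom_pull(task_ids="packlink_hello_task") == "hello from packlink"
```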

The last part of the testing pyramid is the e2e tests. In the pyramid, e2e tests are always the smallest slice because they are very expensive; in Airflow, maybe even more so. You can see a good example here: https://github.com/chandulal/airflow-testing. I couldn't have done it better.

Conclusions

This brings me to the end of the post. I just want to leave here some conclusions to summarise it:

  • Code is code. Use XP practices.
  • Data monoliths have problems similar to those of code monoliths.
  • Share immutable datasets as products.
  • Know your data and put love into data governance. Everybody in the company needs to be able to understand the data.
  • Shit happens. Pay attention to observability.
  • Nobody knows the data better than the domain that produces it. Apply a DDD vision to data.
  • Packlink rules!

--

Juan López López
Packlink Tech

Software Engineer, data and Maths lover. Climbing addict. Founder & CTO at clever.gy