adidas Data Mesh Journey: Sharing data efficiently at scale
This post doesn’t intend to explain what Data Mesh is; if you still have doubts, you can check the previous post of the series, where the context of Data Mesh is established.
The goal of this post is not to cover the whole of Data Mesh either; it focuses on one small part of the journey: sharing data and getting data access effectively. It’s also worth mentioning that Data Mesh is a journey, so you cannot implement Data Mesh per se; you need to adopt the principles and make incremental changes.
Here we want to share the insights we gained from applying Data Mesh principles in a prototype for sharing data at scale. In the previous post, Javier described our current data landscape: we have a Data Lake running in a single account, and we need to scale it to the next level.
Spoiler alert: we open-sourced all the code we used in our prototype to share data at scale. You can find the GitHub link at the end of the article.
How to enable data sharing through Data Mesh at scale?
If you search the internet, you can barely find any results that reveal the inner workings of an implementation of data sharing at scale based on Data Mesh principles.
So we started our long journey into Data Mesh. Our first stop is to enable efficient data sharing, and we built an automation that aims for exactly that. Let me guide you through this first part.
Our Data Mesh principles
Before starting any development or architecture research, we needed to gather the requirements, in this case the Data Mesh principles. We examined our current setup and, based on the input from different teams, defined the following principles that our implementation should satisfy:
- Decentralized: Any team can produce analytical data in its own AWS account, reducing the dependency on the big central account, as long as it meets some interoperability conditions.
- Scalability: We have more than 5 Petabytes of data in our big data platform based on AWS, which we need to share across the company in a self-service fashion.
- Security: Capability to expose only specific tables and prevent sharing of PII data.
- Self-service: Automation that enables a self-service, GitOps-based experience.
- Federated governance: Distribute data ownership to the domains; no central domain knowledge should be required.
- Interoperability: A significant part of the data community at adidas has opted for Spark with Delta Lake as the engine for data processing and storage; we should enable interoperability with domains that prefer to consume the data with other engines, such as SageMaker in AWS.
- Discoverability/findability: We need a frontend to present the registry of data products stored in Git. For prototyping purposes we selected our internal wiki system for rendering and searching the data products; the next step will be a proper integration with our Enterprise Data Catalogue.
Initial architectural research
Once we knew the basics of what we needed to build, we looked into different approaches, from ready-made solutions to building our own. During this time we ran several PoCs and evaluated buy versus build.
The most promising of the ready-made solutions had a commercial model that didn’t fit our expectations, so we decided to build our own.
We rely heavily on AWS to process data, currently with EMR. Cross-account access is one of the most important and basic patterns we need to enable for the teams.
Our first naive idea was simply to connect the dots: EMR only needs access to Glue and S3. So we enabled cross-account access to those services and changed the EMR Hive catalog ID property to point to the producer’s Glue catalog. Sadly, this solution comes with two caveats: first, you cannot combine the producer’s Glue catalog with the consumer’s; and second, Glue resource policies don’t scale, since they have a hard limit of 10 KB. This doesn’t fulfil our scalability requirement, so we had to look for another solution.
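For reference, the naive setup looks roughly like the EMR configuration sketched below, assuming a placeholder producer account ID: the hive-site/spark-hive-site classifications point the cluster’s metastore client at the producer’s Glue catalog, while the producer still has to allow the access in its Glue resource policy (the same policy that hits the 10 KB limit).

```python
# Sketch of the EMR configuration for the naive cross-account approach.
# 111111111111 is a placeholder for the producer account that owns the Glue catalog.
emr_configurations = [
    {
        "Classification": "hive-site",
        "Properties": {"hive.metastore.glue.catalogid": "111111111111"},
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {"hive.metastore.glue.catalogid": "111111111111"},
    },
]
```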
The next idea was to use the AWS STS service: create a role in the producer account that can be assumed by the consumer. But again we found an issue: the Hive version used by EMR doesn’t seem to support STS.
The last try was with a relatively recent addition to the AWS catalogue: Lake Formation. AWS Lake Formation looked like a good fit, since it supports cross-account sharing and overcomes the Glue resource policy limitation (it uses AWS RAM underneath). Moreover, it provides a set of features that fits nicely with our principles:
- Support for ACID transactions.
- Column-level and row-level granularity.
- Sharing of Glue databases and tables via tags.
Since one of the principles is data exploration, we thought it would be nice to have a central metadata catalogue in place. This requirement can also be fulfilled with Lake Formation: you can have a central account that shares data with the different accounts in the mesh.
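To illustrate the tag-based sharing model, here is a minimal boto3 sketch run from the central account. The tag key, tag values and database name are placeholders, not our actual taxonomy.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Create an LF-tag in the central catalog (placeholder key and values).
lakeformation.create_lf_tag(
    TagKey="data-product",
    TagValues=["membership-orders"],
)

# Attach the tag to a database so the tables it contains can be shared through it.
lakeformation.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "membership_orders_db"}},
    LFTags=[{"TagKey": "data-product", "TagValues": ["membership-orders"]}],
)
```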
Implementation Detail
Now that we have picked the technology for sharing data in a federated way without replicating it, the next step is to define, in a GitOps way, the different parts that compose a data product based on our Data Mesh principles.
Following the single-responsibility pattern, we identified the following JSON files:
- data-product
- data-product-producer-info
- data-product-consumers
- data-product-inputs
- data-product-outputs
- data-product-visibility
Data product
This file aims to describe the information related to the data product itself:
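As an illustration, the shape of such a file could be something like this (shown as a Python literal; the real file is plain JSON and the field names are hypothetical):

```python
# Hypothetical data-product definition; field names are illustrative only.
data_product = {
    "name": "membership-orders",
    "domain": "membership",
    "description": "Curated order data for membership analytics",
    "version": "1.0.0",
}
```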
Data product producer info
This file aims to gather all information needed from the producer account:
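A hypothetical sketch of its shape (the account ID, bucket and role are placeholders):

```python
# Hypothetical producer info; values are placeholders.
data_product_producer_info = {
    "aws_account_id": "111111111111",
    "s3_bucket": "membership-orders-byob",
    "producer_role_arn": "arn:aws:iam::111111111111:role/mesh-producer-access",
}
```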
Data product consumers
The content of the file describes the consumers of the given data product:
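A hypothetical sketch, with one entry per consuming account:

```python
# Hypothetical consumers file; the account ID and team name are placeholders.
data_product_consumers = {
    "consumers": [
        {"aws_account_id": "222222222222", "team": "demand-forecasting"},
    ],
}
```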
Data product inputs
This file describes the input data systems of the data product. An input can be a system per se, e.g. a data warehouse, or even another data product:
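A hypothetical sketch mixing both kinds of inputs:

```python
# Hypothetical inputs file; an input can be a source system or another data product.
data_product_inputs = {
    "inputs": [
        {"type": "data-warehouse", "name": "sales-dwh"},
        {"type": "data-product", "name": "web-sessions"},
    ],
}
```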
Data product outputs
This file describes the nitty-gritty data details of a given data product. It contains all the tables that compose the data product, which are the result of applying some transformation to the inputs:
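A hypothetical sketch listing the output tables and where they live:

```python
# Hypothetical outputs file; database, table and location are placeholders.
data_product_outputs = {
    "database": "membership_orders_db",
    "tables": [
        {
            "name": "orders",
            "location": "s3://membership-orders-byob/orders/",
            "format": "delta",
        },
    ],
}
```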
Data product visibility
The main goal of this file is to capture the classification of the data product’s data. We classify tables as public, confidential or internal, and columns by whether or not they contain PII or PCI data. To reduce boilerplate, we only list the tables that are not public and the columns that contain PII or PCI data:
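A hypothetical sketch, listing only a non-public table and its PII column:

```python
# Hypothetical visibility file; only non-public tables and PII/PCI columns appear.
data_product_visibility = {
    "tables": [
        {
            "name": "orders",
            "classification": "confidential",
            "columns": [
                {"name": "customer_email", "classification": "pii"},
            ],
        },
    ],
}
```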
We are now halfway through the journey: we have selected the technology for sharing data and we have defined what a data product as code looks like. The next step is to implement the producer and consumer journeys.
Producer Journey
Our first requirement for bringing data into the mesh is that the S3 bucket is the minimum unit of data sharing. This was inspired by a great talk from Zalando, where they explain their concept called Bring Your Own Bucket (BYOB for short).
In addition, the producer’s data needs to be in a supported format; we don’t support all of them, which is covered in detail later in the data interoperability challenges section.
The next step is to define the technical steps that enable the consumption of these S3 buckets (a sketch of steps two and three follows the list):
- Enabling access from the central account to the producer’s bucket.
- Using the AWS Glue Crawler to get the metadata.
- Registering the database and tables in Lake Formation.
- Creating the Lake Formation tags for sharing the resources, plus the visibility tags that prevent sharing PII and PCI data.
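As an illustration of steps two and three, here is a minimal boto3 sketch as run from the central account. The crawler name, role, database and bucket are placeholders, and in our setup these calls live inside a Step Function rather than a standalone script.

```python
import boto3

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

# Step 2: crawl the producer's bucket to derive the table metadata.
glue.create_crawler(
    Name="membership-orders-crawler",
    Role="arn:aws:iam::333333333333:role/mesh-crawler-role",
    DatabaseName="membership_orders_db",
    Targets={"S3Targets": [{"Path": "s3://membership-orders-byob/orders/"}]},
)
glue.start_crawler(Name="membership-orders-crawler")

# Step 3: register the S3 location with Lake Formation so it governs access to it.
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::membership-orders-byob",
    UseServiceLinkedRole=True,
)
```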
To accomplish the first step, an extra requirement comes into play: we need a role in the producer account that the central account can assume in order to grant the access permissions. We chose this pattern because it is the least intrusive one for the producer account, since the role is created by the producer itself.
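For reference, the trust policy of that producer-side role could look roughly like this, where 333333333333 is a placeholder for the central account.

```python
# Hypothetical trust policy attached to the role the producer creates:
# it allows the central account (placeholder ID) to assume the role.
producer_role_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::333333333333:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}
```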
Consumer Journey
To enable the data consumption of a given data product, we need to perform the following technical steps:
- The central account shares the data product resources with the consumer via Lake Formation tags.
- Enable access to the central Glue catalog and to the producer’s S3 bucket.
- Create a linked database (a reference to the actual database in the central account) in the consumer account.
To perform the last step we need some help from the consumer: the central account needs a role in the consumer account that it can assume to create the linked databases.
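Putting the first and last steps together, a minimal boto3 sketch could look like the following. Account IDs, role names and the tag expression are placeholders, and in practice these calls are part of the consumer Step Function.

```python
import boto3

# Step 1 (from the central account): grant the consumer account access to
# everything carrying the data product's LF-tag.
lakeformation = boto3.client("lakeformation")
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "222222222222"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "data-product", "TagValues": ["membership-orders"]}
            ],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)

# Step 3 (after assuming the role the consumer created for us): create a
# resource link pointing at the database in the central catalog.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/mesh-consumer-access",
    RoleSessionName="create-linked-database",
)["Credentials"]
glue_in_consumer = boto3.client(
    "glue",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
glue_in_consumer.create_database(
    DatabaseInput={
        "Name": "membership_orders_db",
        "TargetDatabase": {
            "CatalogId": "333333333333",
            "DatabaseName": "membership_orders_db",
        },
    }
)
```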
Automation of the Journeys
After defining the steps that enable the different journeys, the logical next step is to automate them. To do so, we created three AWS Step Functions that are triggered on the creation or update of a given data product definition.
To execute the Step Functions properly we need the information of the particular data product, so we have to make the content of the JSON files available to the Step Functions somehow; the simplest option is to upload them to an S3 bucket.
To link the files’ lifecycle with the Step Functions we used CloudFormation Custom Resources: we create a CloudFormation stack with the content of the files modelled as custom resources and deploy that stack. Modelling the files as custom resources gives us two main things:
- Fine-grained control of the lifecycle: custom resources have a lifecycle with three states, create, update and delete. Furthermore, custom resources can get the diff of the content when they detect an update.
- It’s serverless: we don’t need a polling mechanism to watch for changes in the files.
To orchestrate the execution of the Step Functions with the custom resources we use an event-driven approach: we created rules in EventBridge that trigger the Step Functions based on the custom resource lifecycle.
Since one of the requirements is to follow GitOps principles, we host the data product JSON files in Git and deploy them through our CD pipeline; that is why we integrated the files with CloudFormation Custom Resources.
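To make this concrete, below is a minimal sketch of the Lambda backing such a custom resource: it forwards the create/update/delete lifecycle events to EventBridge, where rules match on them and start the corresponding Step Function. The event source and detail-type are assumptions for illustration, not our exact naming.

```python
import json
import urllib.request

import boto3

events = boto3.client("events")


def handler(event, context):
    """Lambda backing the data-product custom resources (illustrative sketch)."""
    # CloudFormation tells us which lifecycle phase we are in.
    request_type = event["RequestType"]  # "Create" | "Update" | "Delete"
    properties = event.get("ResourceProperties", {})

    # Forward the lifecycle event to EventBridge; the rules that start the
    # producer/consumer Step Functions match on this (hypothetical) source
    # and detail-type.
    events.put_events(
        Entries=[
            {
                "Source": "data-mesh.data-product",
                "DetailType": f"DataProduct{request_type}",
                "Detail": json.dumps(properties),
            }
        ]
    )

    # Tell CloudFormation the custom resource succeeded, otherwise the stack
    # operation would hang until it times out.
    response = {
        "Status": "SUCCESS",
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }
    req = urllib.request.Request(
        event["ResponseURL"],
        data=json.dumps(response).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": ""},
    )
    urllib.request.urlopen(req)
```

An EventBridge rule whose event pattern matches that source and the relevant detail-types then has the producer or consumer Step Function as its target.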
In this image, you can see all pieces together:
It’s worth mentioning that we create one CloudFormation stack per data product. The Step Functions are created only once and belong to a different CloudFormation stack, which only gets deployed when a change in the Step Functions is needed.
Connecting the dots
With all the steps automated and the data product defined as code, we can create a simple workflow to start using it. For our prototype we wanted to be self-contained and not integrate with external systems yet. With that in mind, we defined the e2e journeys for both producer and consumer.
Producer e2e
- The producer will create a ticket in our ticket system, attaching the data product files or requesting help to define them.
- The Mesh Governance team will review the data product and, in case of approval, create a PR.
- The producer journey Step Function will run and add the data product details to Lake Formation. The data product info is also published on our wiki server so it is discoverable.
Consumer e2e
- The consumer will go to the wiki and look for the data products they are interested in consuming.
- The consumer will create a ticket in our ticket system, which will be reviewed by the Mesh Governance team and the producer.
- In case of approval, the Mesh team will create a pull request that modifies the data product consumers file to include the new consumer.
- The consumer Step Function will be triggered and the consumer will get access.
The following image illustrates the journeys:
You may think that, since the Mesh team reviews the tickets for data product creation, this is not a self-service approach. The truth is that the definition of self-service is debatable: if a team is mature enough, it can create the JSON files and open the pull request without interacting with our issue tracker at all.
In the end, a review of the data products is needed to avoid duplicated ones; the issue tracker route is only for teams that are less mature and need help.
Data Interoperability Challenges
During our journey we natively support the data formats that the AWS Glue Crawler supports: JSON, Parquet, CSV, Avro… but we also want to integrate the Delta Lake format.
The first problem we found is that the Glue Crawler doesn’t support crawling data in Delta format; if you try to use the normal crawler you get a weird table and metadata definition.
The next hurdle we found is that Delta Lake doesn’t store its metadata in Hive in a compatible way. To get a sense of what is going on with a Delta table: instead of the usual table description you only get a placeholder schema, and the only place where the real schema can be found is in the table properties, where it is stored as a JSON literal in the spark.sql.sources.schema parameter.
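For example, for a hypothetical two-column Delta table, the metadata visible to the catalog looks roughly like this (simplified illustration):

```python
import json

# Simplified illustration of what the Hive/Glue catalog holds for a Delta table.
table_metadata = {
    # The declared columns are only a placeholder, not the real schema.
    "columns": [{"name": "col", "type": "array<string>"}],
    "parameters": {
        "spark.sql.sources.provider": "delta",
        # The real schema is hidden here as a JSON string.
        "spark.sql.sources.schema": json.dumps(
            {
                "type": "struct",
                "fields": [
                    {"name": "order_id", "type": "long", "nullable": True, "metadata": {}},
                    {"name": "amount", "type": "double", "nullable": True, "metadata": {}},
                ],
            }
        ),
    },
}
```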
Because the Delta Lake format is not compatible with Lake Formation, we need a way to share data even when the format cannot go through Lake Formation. Therefore, in our Step Function we fall back to the IAM way of granting access to the data.
Recently AWS released support for the Delta format in the AWS Glue Crawler, which makes it possible to import the metadata into Glue. Unfortunately, sharing it through Lake Formation still doesn’t work.
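For completeness, a crawler for Delta tables can be created roughly like this. This is a sketch with placeholder names; check the current Glue documentation for the exact options.

```python
import boto3

glue = boto3.client("glue")

# Crawler targeting Delta tables directly (names and paths are placeholders).
glue.create_crawler(
    Name="membership-orders-delta-crawler",
    Role="arn:aws:iam::333333333333:role/mesh-crawler-role",
    DatabaseName="membership_orders_db",
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://membership-orders-byob/orders/"],
                "WriteManifest": True,
            }
        ]
    },
)
```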
Besides this interoperability challenge, we had another one while updating the Glue resource policies. Because different Step Functions run in parallel (we create a separate CloudFormation stack per data product), we have a race condition when writing the policy: if two Step Functions run in parallel, only one new policy statement gets added instead of two.
To fix this, we update the policy through an architecture that guarantees serial updates: an SQS queue in front acts as backpressure, and behind the queue sits a Lambda with a concurrency limit of 1.
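A minimal sketch of such a Lambda (with a reserved concurrency of 1 configured on the function) could look like this; the read-merge-write pattern and the statement passed in the queue message are illustrative.

```python
import json

import boto3

glue = boto3.client("glue")


def handler(event, context):
    """Serially merge new statements into the Glue resource policy.

    Reserved concurrency is set to 1 on this function, so messages coming
    from the SQS queue are processed one at a time and updates cannot race.
    """
    for record in event["Records"]:                 # SQS batch
        new_statement = json.loads(record["body"])  # statement prepared upstream

        current = glue.get_resource_policy()
        policy = json.loads(current["PolicyInJson"])
        policy["Statement"].append(new_statement)

        # PolicyHashCondition makes the update fail if the policy changed
        # underneath us, as an extra safety net.
        glue.put_resource_policy(
            PolicyInJson=json.dumps(policy),
            PolicyHashCondition=current["PolicyHash"],
        )
```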
Next Steps
We are still at the beginning of the journey and a lot of work needs to be done. Nevertheless, we have identified the following steps to evolve the prototype:
- Fix the issue with sharing Delta tables through Lake Formation.
- Integrate with the Enterprise Data Catalogue.
- Calculate and automate the DATSIS rating for data product quality.
And, as stated in the previous post, we aim to be more ambitious about the scope covered by Data Mesh:
- Include non-S3 data stores in the concept: NoSQL stores, relational databases…
- Include the operational/transactional plane as part of the concept: Kafka topics, REST APIs…
We open-sourced the code of this prototype on GitHub; you can find it here. I would like to take this opportunity to thank the people who contributed to it: Javier, Ruben, David, Ronald and Jonatas.