Introduction to Data Mesh adoption in adidas - motivation and takeaways
Intro and background
In this post, we will share adidas' data context and the takeaways from our approach to Data Mesh; it will be followed by another post, where we will share the implementation of a prototype that is an important enabler for this vision. We believe it can be useful for others on the same journey.
We will assume that the reader is familiar with the definition and basic concepts of Data Mesh. But, to set the ground, we can take as reference this definition from the Thoughtworks Tech Radar Vol. 26:
Data mesh is a decentralized organizational and technical approach in sharing, accessing and managing data for analytics and ML. Its objective is to create a sociotechnical approach that scales out getting value from data as the organization’s complexity grows and as the use cases for data proliferate and the sources of data diversify. Essentially, it creates a responsible data-sharing model that is in step with organizational growth and continuous change.
Others simplify the data mesh definition as applying the Microservices pattern to big data; all oversimplifications are dangerous, but this one reflects the spirit well. In summary, more than a closed definition, it is a set of principles that should be met.
The objective of this post is not to create yet another article covering all the Data Mesh theory and the definition of a data product (there is already a lot of content published about it). The intention is to be much more direct and describe the context for the implementation, analysed by a group with a background in software development and integration. We will also share some strategy trade-offs and takeaways from this analysis.
It is important to remark that we believe this is just the beginning of a long journey; not all companies are ready to implement data mesh “by the book”, and this is the first analysis that should be done when approaching this aspiration.
Motivation for data mesh
Like many other companies, adidas is currently on an intense digitalisation journey. To be successful, scalability of the tech organisation is crucial: scalability as an enabler of speed, with more engineers responding quickly to the demands coming from the business (software features, insights, reports…). Therefore, the first priority is detecting and tackling the bottlenecks that could jeopardise the autonomy of the different units; otherwise, “just hiring people” will not translate into speed.
In the software development world, it is accepted that cloud adoption, containerisation and Microservices are the main patterns to approach that challenge. We could relate immediately to the concepts that Zhamak Dehghani described in her brilliant and inspiring articles (Monolith to mesh & Data Mesh Principles). They were an eye-opener for applying similar concepts to the data space: defining proper boundaries in data management can limit the dependencies and create autonomous data products, analogous to the bounded contexts of Microservices. In this case, the bottleneck to alleviate is the central team managing and providing access to the history of the data that the applications create. If we take the principles described in the article as reference, many of them are inspired by the Microservices pattern. Perhaps the one specific to the data domain is data quality, but even there we can draw a clear parallel with software quality and DevOps, where the teams producing the software should own it in production.
At that moment, adidas had the classical data architecture: a Data Warehouse to capture core data objects coming from the backbone, and a Data Lake based on S3 to centralise the storage of all kinds of sources for the data scientist and data analyst communities. We also had a central team in charge of the heavy lifting of ingesting the data into the Lake.
Therefore, all the notions described in the article resonated in our context:
- The central data team doing the heavy lifting and eventually becoming the scalability constraint, due to the difficulty of scaling domain knowledge centrally.
- Data quality issues given the big disconnect between the producer of the data and the consumer in analytics, with at least three actors in the process, including the enterprise service bus.
- Lack of governance in the integration patterns used to exchange data, leading to inefficient data pipelines connecting the warehouse, the lake and the in-memory data repositories for reporting.
- Inconsistency with the product-led strategy, where responsibilities are pushed to the product teams and the central (platform) team is agnostic of domain knowledge.
- Poor discoverability of data assets, leading to duplicated efforts ingesting the same data and the inefficiency of storing it multiple times.
We can proudly say that adidas has been successful in its Microservices journey; applying the same concepts to democratise access to data was definitely a clear field of investment.
Also, to complete the context, we had implemented a Data Streaming strategy based on Kafka that already follows two of the main Data Mesh principles:
- Domain-oriented decentralized data ownership and architecture
- Self-serve data infrastructure as a platform
Adoption has been massive since the creation of the platform, and even a superficial analysis quickly leads to the question: why not apply exactly the same concepts in the analytics space, ideally interconnecting both platforms with similar domain boundaries? As the article states, and as depicted in the picture below, minimising the barrier between the Operational and Analytical spaces would be the most effective way to alleviate all the issues derived from the disconnection.
Constraints and reality
First of all, we should not oversimplify and state: let’s just use the data in Kafka produced by the applications to feed the creation of insights. Although the idea of the Kappa architecture is quite powerful and inspiring, reality shows that the data captured to generate insights has some intrinsic differences from the data that applications share to interact with each other. For example, in event-driven patterns we are not always serialising all the data, so we need to materialise, compose and keep the historical data. Also, the population of data products created for generating insights is much larger than that of the data products close to the source for integration purposes… One trivial example for explanation purposes: our ecom can produce the data related to the activity of the customer, and other applications can store the reference data describing the footwear; what is useful for analytics is the combination of both sources. We can use Kafka as the streaming technology to derive this combination, but that is far from the notion of simply using the data in Kafka for analytics. Nevertheless, we believe that this pattern will gain traction little by little as stream processing technologies become more accessible to the data science and data engineering communities; this will definitely simplify data architectures to be “data mesh compliant”.
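The customer-activity and footwear example above is essentially a stream-table join, as offered by stream processors such as Kafka Streams or Flink. The minimal sketch below simulates that enrichment in plain Python so the idea stands on its own; the topic names and fields (`article_id`, `model`, `category`) are hypothetical, not adidas' actual data model.

```python
# Minimal sketch of the stream-table join described above: enriching a
# real-time feed of customer activity with slowly-changing reference data.
# In production this would run on a stream processor; here we simulate it.

def enrich_activity(activity_events, footwear_reference):
    """Join each activity event with the reference record for its article.

    activity_events: iterable of event dicts (the "stream" side).
    footwear_reference: dict keyed by article_id (the materialised "table" side).
    """
    for event in activity_events:
        reference = footwear_reference.get(event["article_id"], {})
        # The analytical data product is the combination of both sources.
        yield {**event, **reference}


events = [
    {"customer": "c-1", "article_id": "A100", "action": "add_to_cart"},
    {"customer": "c-2", "article_id": "A200", "action": "view"},
]
reference = {
    "A100": {"model": "Ultraboost", "category": "running"},
    "A200": {"model": "Samba", "category": "originals"},
}

enriched = list(enrich_activity(events, reference))
```

Note how the "table" side has to be materialised before the join can happen: this is exactly the composition and history-keeping step that makes analytical data products different from the raw operational feed.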
Another easy approach to tackle data product ownership would be that the applications running the business, in the same way that they expose APIs to serve data in real time to other components, should also provide the history of their data. The reality is that this would have change management implications, and it is not how IT is structured in our company today; a data-driven mindset would have to be completely adopted as a prerequisite. To put it in a specific example: the success of the operational systems should not be measured only by stability, throughput or agility to incorporate new changes, but also by how good the data they emit is for creating valuable insights and how much it is adopted by consumers.
To summarise, we believe that the maturity of streaming technologies will be key for Kappa architectures that naturally address the disconnection between source and analytics. Also, we consider that in the future the requirement of keeping the history of the data (with all the cost and complexity that it brings) will diminish, since insights can be provided by aggregating the real-time feed of data as it arrives. But, as a first and pragmatic step, we can already gain huge benefits by applying data mesh principles “only” in the analytical space. Therefore, we are starting the journey focusing on the Data Lake, but with the clear ambition of coherently describing the real-time Data Products coming from the Operational space.
Some takeaways: fundamentals, vision and trade-offs
1 — Do you need data mesh?
- A trivial but required question before starting :). We read very often that data mesh is easier to apply in small start-ups than in large companies. But then you might fall into the premature optimisation trap: if the essence of data mesh is enabling scalability in big enterprises… do you have a clear motivation for implementing it in your business? Have you faced the constraints that over-centralisation brings?
2 — The more quality control you have, the more successful your Data Mesh will be… or not?
- One of the main factors for a successful data mesh implementation is having high-quality data products. In order to tackle the data quality issue, the governance process when creating and feeding the data products is critical: consistency, avoiding duplicity, quality of the data, addressability… But we should not lose perspective: the main objective of the data mesh strategy is enabling agility in the company by applying domain-driven design concepts to data. In order to have federated governance, ideally machine-driven, you need to put more standards and rules in place. If you over-apply these enterprise rules, you will have the opposite effect, impacting the speed that was the initial objective.
- It is a trade-off: increase the governance on cross-domain data objects, decrease it on the data products at the edges. Conceptually, the quality controls should be proportional to the potential reusability of the data asset. It is also important to highlight that a demanding entry point for being part of the mesh can lead to lower adoption, whereas a very low entry point might lead to chaos and low quality.
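One way to make "controls proportional to reusability" concrete is a tiered governance gate, where a data product declares its intended audience and the registration checks scale with it. The sketch below is purely illustrative; the tier names and check names are our own assumptions, not an existing standard.

```python
# Hypothetical tiered governance gate: the broader the intended audience of
# a data product, the more checks it must pass to join the mesh.
# Tier and check names are illustrative only.

CHECKS_BY_TIER = {
    "team":       ["has_owner"],                           # edge product, low entry barrier
    "domain":     ["has_owner", "has_schema"],             # shared inside one domain
    "enterprise": ["has_owner", "has_schema", "has_sla"],  # cross-domain, strictest
}

def passes_governance(data_product, tier):
    """A product joins the mesh only if it satisfies every check for its tier."""
    return all(data_product.get(check, False) for check in CHECKS_BY_TIER[tier])

edge_product = {"has_owner": True}
core_product = {"has_owner": True, "has_schema": True, "has_sla": True}
```

With a gate like this, the same product that is perfectly acceptable at the edge is rejected as an enterprise-wide asset until it documents its schema and SLA, which captures the trade-off described above without imposing the strictest rules everywhere.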
3 — How to incentivise adoption
- Adoption comes naturally if you are solving a problem for your customer; it is very difficult to succeed if you need a mandate for adoption. In the case of data mesh, the initiative will suffer if you need to oblige producers to take ownership of their data. First of all, data democratisation should be an incentive per se for the producers. Wishful thinking should be avoided as well: don’t assume that teams will naturally renounce a central commodity just for the cool factor of adopting the new trend; without a clear strategy, they won’t do that. Also, don’t underestimate gamification as a valid strategy to accelerate adoption in these transformations.
4 — Does the scope include real-time?
- Currently, the more widely accepted definition of data mesh focuses only on analytical data products. We don’t share this vision that excludes real-time data products from the concept. There are some emotional debates about it, but bringing the Operational and Analytical worlds as close as possible is the essence of the approach. Wouldn’t it be ideal for some data products to span the operational and analytical planes? Or, at least, to use similar conventions and the same catalogue to be discovered? We also consider that ML is changing the picture, since ML models will increasingly need the operational feed of data in real time to fine-tune their behaviour.
- If you are curious about this debate, there is an interesting thread about it in the vibrant data mesh community.
5 — A central platform is compatible with the Data Mesh concept, though not a requisite
- There is some misinterpretation in the community that having a central platform is not compatible with a data mesh strategy. We disagree: as long as usage and governance are federated and the interaction of the central team is limited to approving the usage, you could implement data mesh on one single platform. If you do it right, a single platform is not a prerequisite either: you could have multiple data platforms contributing to the mesh as long as they adhere to the central standards. Again, it is a trade-off between the operational efficiency you gain by centralising on one platform and the variety of technologies you offer to the users.
6 — Technical “details” are also important
- Sometimes, when approaching these transformations, there are aspects that are perceived as implementation details during the design phase but become crucial in the execution phase. In the case of data mesh, this is clearly the governance of the binary format of the data: you can be optimistic that any new platform might be just another node on the mesh, but not governing the data format might bring you interesting challenges when you aim for seamless interoperability across the nodes of the mesh.
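In practice, governing the data format usually means enforcing schema-compatibility rules at the boundary of the mesh, as schema registries (e.g. Confluent Schema Registry with Avro or Protobuf) do. The sketch below illustrates the idea with a deliberately strict, simplified rule: no fields may be removed, and new fields need defaults. Real registries offer several finer-grained compatibility modes; the schemas shown are hypothetical.

```python
# Illustrative sketch of format governance across mesh nodes: a simplified
# schema-compatibility check between versions of a data product's schema.
# The rule (no removals, new fields need defaults) is a deliberately strict
# simplification of what real schema registries enforce.

def is_compatible(old_schema, new_schema):
    """Accept the new version only if every old field survives and every
    newly added field declares a default value for older data."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    new_fields = {f["name"]: f for f in new_schema["fields"]}
    if not old_fields <= set(new_fields):
        return False  # a field was dropped: existing consumers would break
    added = set(new_fields) - old_fields
    return all("default" in new_fields[name] for name in added)

v1 = {"fields": [{"name": "order_id"}, {"name": "amount"}]}
v2_ok = {"fields": [{"name": "order_id"}, {"name": "amount"},
                    {"name": "currency", "default": "EUR"}]}
v2_bad = {"fields": [{"name": "order_id"}]}
```

A check like this, run automatically when a data product publishes a new schema version, is the kind of machine-driven federated governance mentioned earlier: it keeps the nodes interoperable without a central team reviewing every change by hand.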
Given the previous context and the constraints found during the inception phase, we derived the requirements to build a basic prototype, limiting the scope for now to making our data lake compatible with the data mesh principles. We will be very happy to share this in the next post :)
We hope that you can extract some learning from this post, or that some reflection is triggered when reading it!
The views, thoughts, and opinions expressed in the text belong solely to the author, and do not represent the opinion, strategy or goals of the author’s employer, organization, committee or any other group or individual.