Catching Data In a Data Mesh: Principles (Part I)
An in-depth view of how the Data Mesh architectural paradigm is helping Gloo build a platform to support personal growth. This is part one of two, where we’ll explain how we think about the mesh internally and the guiding principles for our implementation.
At Gloo.us, we’re building technologies that connect individuals looking to start a growth journey with the resources, mentors, spiritual groups, and recovery centers they seek. We have many applications that serve different needs, each requiring unique combinations of backend services. To support this, we’ve created a platform that consists of many heterogeneous processes, microservices, pipelines, and data stores. Arguably, the most important aspect of the system is the data, especially for longitudinal data tracking and projecting a path for a growee. Additionally, we need to keep track of each organization’s offerings and growth plans, and make recommendations for the best ways to serve a community.
One can approach a platform architecture that focuses on microservices with endpoints (REST, for example) that solve the needs of each unique supporting process and application. Typical examples may look like this:
- A data pipeline may offer a service for accessing its derived data
- A BI team may fetch the data in order to create reports
- A core object microservice may provide endpoints that support the UIs and other supporting processes
Fairly quickly, this approach produces endpoints that become increasingly one-to-one relationships, with data organized or transformed in a way that makes consumption and presentation easier. You end up with service bloat, with well-defined, specific use cases baked into several endpoints. (This mesh of services shouldn’t be confused with a service mesh, which emphasizes shared command and control rather than the makeup of the service endpoints.) This problem can be solved in a few ways, but at its core the coordinating services need data from one another, and a source of truth, domain boundaries, and separation of concerns need to be at the heart of the architecture.
This is only one problem, but there can be many. The next section gives examples of a few others.
A platform architecture should avoid the pitfalls that have plagued and frustrated me throughout my career. The most common are:
- Big Pipelines for Data — As the number of use cases and downstream dependencies grows, there is a common tendency to solve problems within the pipeline rather than shifting them to the dependent processes. This leads to bloat and complexity, not to mention the fear and risk of breaking one of the use cases.
- Data In a Lake — I’m likely rocking the boat with this one, given the importance of shared infrastructure in 90% of engineering organizations. The problem arises when the “lake” datastore leads to access contention, bottleneck design/schemas, and convoluted representation of data.
- Bottleneck Development — In many cases, shared infrastructure and tooling segregates teams by expertise in a way that leaves them unable to satisfy a use case until it’s embedded into a pipeline. Overlaying these use cases onto the pipeline means you must manage dependent use cases across teams, leading to fear of deployment and tons of testing — painful even with automation. This creates more dependencies across teams than necessary.
A solution to these problems should empower teams to store and process data in their own domain, in a manner that honors its specific use cases, without fear of causing failure in other parts of the system. This approach will most likely lead to some data redundancy or overlap; however, you can still share underlying infrastructure, like a shared ES instance with per-application indices.
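As a concrete (and hypothetical) illustration of that last point, teams could share one Elasticsearch cluster while each domain owns its own namespaced indices and mappings. The domain, dataset, and field names below are invented for illustration, not our actual schema:

```python
# Hypothetical sketch: per-domain index naming on a shared Elasticsearch
# cluster, so teams share infrastructure without sharing schemas.

def domain_index(domain: str, dataset: str, version: int = 1) -> str:
    """Build a namespaced index name, e.g. 'growth-plans.assessments.v1'."""
    return f"{domain}.{dataset}.v{version}"

def index_config(field_mappings: dict) -> dict:
    """Settings + mappings payload that each team controls independently."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},
        "mappings": {"properties": field_mappings},
    }

# Each application domain owns its own index and mapping; a breaking
# change in one domain's mapping cannot affect another domain's index.
assessments = index_config({
    "growee_id": {"type": "keyword"},
    "score": {"type": "integer"},
    "recorded_at": {"type": "date"},
})

print(domain_index("growth-plans", "assessments"))  # growth-plans.assessments.v1
```

Because the namespace encodes both the owning domain and a version, a team can roll out a v2 index and migrate its own consumers without coordinating a cluster-wide schema change.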
A Data Pivot
What if we separated the creator of the core dataset from specific downstream applications or processes? This allows the creating process or pipeline to focus on building the optimal representation of a core data set, independent of other teams. The attributes of the data (its schema) respect the complete contextual picture of the result from the domain’s perspective. The data becomes a highly focused yet complete picture from within: a product. Any data set derived from this base, like an aggregation or transform, is left to the application/process domain. The data domain is singularly responsible for the curation, normalization, and internal transformations that go into developing the set. It’s holistic rather than customized. I’m dancing around some terminology here because what we call things is extremely important, warranting a section of its own.
To support this data principle, we pivoted our architectural approach to separate the creation of data from the dependent applications and processes. The principles of the data mesh pattern help us achieve this desired result.
Enter Data Mesh
We’ve embraced the Data Mesh ideas as a way to define our architecture and pivot the culture and philosophy for how we solve data connectivity at Gloo. This has led to the introduction of key engineering tenets, a common language, enabling technologies, clear definitions of ownership and team responsibility, and a clear definition of the culture we want as a company. We’ve gained broader buy-in and support by emphasizing building it rather than spending months pontificating on what could be. This doesn’t mean we didn’t gain widespread support up-front — we did, but our approach was “build it and they will come” (when successful, of course). This can be risky, but it worked for us, especially given our base culture of helping others, embracing ideas, and providing solutions to root problems. Over the past quarter, we built a data mesh that is now a central component of our platform and architectural vision. Inspired by Zhamak Dehghani, we’ve added our own interpretation and approach to building it and shifting the architectural philosophy. Moving the organization and culture towards mesh thinking can be hard — possibly the hardest aspect of the shift to a mesh. The technology to realize a mesh is solvable.
A primary principle of the Data Mesh approach is that organizational structure and teams are the first-order problem, rather than physical connectivity or a reference architecture. We chose instead to focus on mesh architecture, definition, concepts, and creation rather than a reorg, reasoning that if we trained the current organizational structure on data products and at least partial ownership, it would help us shape the org for the future. We created a Data Mesh team to help define data domains, declare data products, drive team ownership and, most importantly, establish enablement as a way to jumpstart the effort. Our team built out the infrastructure to support event-driven streams and connectivity to application team data stores. Establishing a team charter is very important to decouple the mindset of teams owning specific technologies from the results providing value throughout the organization.
To influence the organization as a whole, we focused on establishing engineering tenets, clearly defining a set of terminology, establishing the patterns and frameworks using enabling technologies, and creating team charters for a data mesh team vs data engineering team.
Our approach to Data Mesh architecture requires some key tenets to be adopted by the engineering team. Our teams are migrating towards an event-driven architecture, which makes the way we think about data more intuitive. Our key tenets include:
- Embrace event driven streams of data. Our data evolves over time and should be treated as such. Snapshots are often treated as “current” for long periods of time. We want to proactively provide a stream of changes so that dependent applications can always consume the latest state.
- Embrace eventual data consistency. Application teams must be mindful that an application’s data store is not the source of truth, and embrace eventual consistency. We stream data state changes; however, there is latency to consume and transform the data in the application domain. The asynchronous model is something to keep in mind, but in practice the streams are very fast — usually milliseconds, though we have some JDBC connectors configured for five-minute intervals.
- Embrace your data and the responsibilities associated with owning it. Teams should be clear on the responsibilities of providing a source of truth (or data product), the expected quality, and the implementation of ways to access it (or data ports). The usability and accessibility of the data is central to an application domain’s ability to consume, transform, enrich, and store the data in a way that meets its needs. Access ports should always account for the consumer’s expectation that their data store will be eventually consistent. In other words, always be mindful that it is not the source of truth.
- Embrace a new vocabulary. Teams should speak about their data architecture in common terms. Defining new terminology, including Application Process Domains, Data Ports, and Data Domains, is a crucial first step. The next section digs into this terminology.
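The first two tenets can be sketched in a few lines of plain Python (no broker involved): a data domain emits a stream of change events, and a consuming application replays them into its own local store, which is therefore eventually consistent rather than a snapshot. The event shape and field names here are illustrative assumptions, not our actual wire format:

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    key: str       # identity of the record in the data product
    op: str        # "upsert" or "delete"
    payload: dict  # latest state of the record (empty on delete)

def apply_events(local_store: dict, events: list) -> dict:
    """Replay change events in order; the local store converges on the
    producer's latest state instead of holding a stale snapshot."""
    for ev in events:
        if ev.op == "upsert":
            local_store[ev.key] = ev.payload
        elif ev.op == "delete":
            local_store.pop(ev.key, None)  # e.g. a federated delete
    return local_store

stream = [
    ChangeEvent("org-1", "upsert", {"name": "First Church", "plans": 2}),
    ChangeEvent("org-1", "upsert", {"name": "First Church", "plans": 3}),
    ChangeEvent("org-2", "upsert", {"name": "Hope Center", "plans": 1}),
    ChangeEvent("org-2", "delete", {}),
]

replica = apply_events({}, stream)
print(replica)  # {'org-1': {'name': 'First Church', 'plans': 3}}
```

In production the stream would arrive via Kafka and the replica would be the application’s own data store, but the consistency model is the same: the replica lags the source of truth by the consumption latency and then converges.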
An engineering team that believes in these tenets will have a solid foundation for implementing a data mesh. The next most important step is to define and agree upon a set of terminology for the concepts and objects when developing aspects of the mesh.
The name used for each component in an architecture, even if conceptual, is very important. It serves as the foundation for cross-discipline and application team communication. In building a data mesh, where responsibilities and non-responsibilities are crucial, we all have to use the same language. At Gloo, we’ve introduced terminology to help teams communicate and be mindful of our architectural approach.
Data Mesh. A conceptual representation of the interconnection and coordination of principal data nodes. A core tenet of a data mesh is establishing a clear authoritative source of truth, boundaries of ownership, and governance for the data on the mesh. This shouldn’t be conflated with the technologies or skills applied to enable a mesh.
Data Mesh Platform. The tools, technologies, and patterns used to facilitate communication, synchronization, and access to data on the data mesh.
Data Product. An individual, authoritative, curated data set considered a provider/producer node in the data mesh. It is a source of truth for a specific data type with a clearly defined schema representing the best view of the data. Access to a data product is controlled by data governance policies through a data domain boundary. If an application, process, or system results in data that is intended to be shared outside of its boundary, then that data is a data product available on the mesh. A data product shouldn’t be conflated with an application. A data product forms the ideal definition for a data type without transformations to meet the needs of any one consumer or user of the product.
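To make the definition concrete, a data product declaration might look like the following sketch: one type, one owning domain, a versioned schema, and no consumer-specific transforms. All names and fields below are invented for illustration:

```python
# Hypothetical data product declaration: the schema defines the ideal
# view of the "organization" type, independent of any single consumer.
ORGANIZATION_PRODUCT = {
    "name": "organization",     # the data type this product is truth for
    "domain": "organizations",  # owning data domain
    "version": 3,               # schema version, tracked in a registry
    "fields": {
        "org_id":     {"type": "string", "required": True},
        "name":       {"type": "string", "required": True},
        "offerings":  {"type": "array", "items": "string"},
        "updated_at": {"type": "timestamp", "required": True},
    },
}

def missing_required(record: dict, product: dict) -> list:
    """Return the names of required fields absent from a record,
    so the owning domain can enforce quality at the boundary."""
    return [
        field for field, spec in product["fields"].items()
        if spec.get("required") and field not in record
    ]

print(missing_required({"org_id": "org-1", "name": "First Church"},
                       ORGANIZATION_PRODUCT))  # ['updated_at']
```

The point of the sketch is the ownership boundary: the domain team curates this definition and validates records against it; consumers derive their own aggregations downstream rather than pushing their use cases into the schema.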
Data Domain. A set of data products that represent the authoritative source of truth and boundary of ownership. The domain boundary is the specific type definition. A data domain can offer one or more data products and may contain supporting data used to create the data product(s) contained within that are not available on the mesh. Data within a domain can be accessed via a data port.
Application Process Domain. An application process domain (APD) is a logical grouping of systems or processes that help an application deliver on a business goal. This domain is likely composed of many services and data stores which may or may not rely on the data mesh. The data may be classified as one of three types: data consumed from a data product, data produced to form a data product, or data that is internal to the application.
Data sourced from an external data product is no longer a data product but an internal representation; it’s usually aggregated, transformed, or simplified to make it easier for the application to work with. The last type is data that’s only used locally and isn’t based on any external data products.
Data Port. A means of accessing the data of a data product. A port may be SQL, REST APIs, Kafka messages, Webhooks, Kafka streams, or KSQL. We leverage a Schema Registry to track versions. Conceptually, it looks like the following diagram. I’ve mapped Data Governance onto it; that definition is coming up.
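To illustrate the schema-registry role, here is a toy, in-memory stand-in (not Confluent’s actual API): each data port registers schema versions under a subject so consumers can discover the latest definition of the data they’re consuming.

```python
# Toy stand-in for a schema registry: subjects map to an ordered list
# of schema versions, so ports and consumers can agree on structure.
class SchemaRegistry:
    def __init__(self):
        self._subjects = {}  # subject name -> list of schema dicts

    def register(self, subject: str, schema: dict) -> int:
        """Append a new schema version; return its 1-based version number."""
        versions = self._subjects.setdefault(subject, [])
        versions.append(schema)
        return len(versions)

    def latest(self, subject: str) -> dict:
        """Fetch the most recent schema for a subject."""
        return self._subjects[subject][-1]

registry = SchemaRegistry()
v1 = registry.register("organization-value", {"fields": ["org_id", "name"]})
v2 = registry.register("organization-value", {"fields": ["org_id", "name", "offerings"]})
print(v1, v2)  # 1 2
```

A real registry adds compatibility checks between versions (so a producer can’t silently break consumers); the sketch only shows the version-tracking contract a data port depends on.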
Data Governance. The management of availability, usability, integrity, and security of the data in a data domain, based on internal data standards and policies. As we built, we didn’t attempt to solve all the aspects of data governance, such as consent, right from the start, though we are mindful of the complete view of governance. We define Data Governance as an aggregate of these factors:
- Availability. Timeliness, quality, reliable access. We must ensure the data has functioning data ports.
- Consent. The rights to use or revoke access to data. We didn’t conquer this beyond federated deletes via the event stream.
- Usability. The data must be available and in a structure that is reasonably consumable by process domains. We define schemas and ensure we have data ports.
- Data Consistency. Assuming data is accessible, of high quality, and usable, the consuming nodes on the mesh must be able to keep any local variants in sync. Additionally, any derived data products must have a foundation of consistent data. We solve this using event-driven architecture, but ultimately a consuming application domain is responsible for keeping its data in sync through any means that meets its requirements. For example, the consuming data store may be an S3 bucket for import into Databricks, where the application domain must decide how often to import.
- Data Integrity. By isolating the data into specific domains, we narrow focus and sphere of control making quality easier to maintain.
- Data Security. Data is protected by securing the data ports on each data domain.
Domain Ownership. A core tenet of the Data Mesh paradigm is domain ownership. This is stressed a ton in the discussions and is almost always put in the context of breaking down boundaries between the common silos of Data Science, Data Engineering, and Product Engineering teams. Solving this problem warrants a section of its own, but in the spirit of a terse description, we define domain ownership as a team or organization being responsible for the curation of a specific, independent data product that is useful throughout the broader organization. The team owning the domain works with others to define a schema, but ultimately the owner is the expert on the data context. Downstream use cases can influence a schema attribute, but business logic shouldn’t be a driving factor for the data product. This is our interpretation anyway — your mileage may vary.
Summary and Next
The intent of Part 1 is to discuss how we’ve mapped the concepts of a Data Mesh, articulated best by Zhamak. In Part 2 we’ll dig into our architectural vision and how it’s been applied. This post covers our interpretation, assumptions, and foundation for how we build software at Gloo. In the broader community, I see something missing: an architectural example of implementing a data mesh in technical terms rather than organizational ones. While we use cloud vendors, we view them as ingredients that help with a technical component rather than the main course. A vendor’s solution may be a central, key component, but if it’s vendor-specific, I would argue it is not an example architecture. The next part will cover the steps we took to prepare for implementation (re: incremental engineering buy-in) and review an example architecture for feedback.