Catching Data In a Data Mesh: Applied (Part II)
This series will provide an in-depth view of how the Data Mesh paradigm is helping Gloo build a platform to support personal growth. Part one explores how we are thinking about the core mesh principles within Gloo. Part two emphasizes the application of these principles, the path to gain organizational support, and the architectural patterns which leverage the Kafka ecosystem to make it happen.
At Gloo.us, we’re building technologies that connect individuals looking to start a growth journey to the resources, mentors, spiritual groups, and recovery centers they seek. We have many applications that serve different needs and require unique combinations of backend services. To support this, we’ve created a platform that consists of many heterogeneous processes, microservices, pipelines, and data stores. Arguably, the most important aspect of the system is the data, especially the longitudinal data tracking and the projection of a path for growees. The challenge is making platform data accessible and reliable in a way that eliminates access contention, avoids high-latency ETLs, and is consistent for our application teams. Additionally, we want to segment the data into secure domains that a team owns. These domains must still support granular interactions by the processes using them. We solved this problem by adopting Data Mesh principles and leveraging Kafka, Kafka Connect, and Kafka Streams. A data mesh approach helps us avoid the resource contention, pipeline feature creep, and fear of introducing breaking changes that are common to data lake or big data pipeline approaches.
The intent of this post is to discuss our cultural shift as an engineering team towards a data mesh, exhibit an architectural example, and discuss the technologies we use to implement the mesh.
Setting the Stage
While this discussion emphasizes the architectural patterns, we would be remiss to not provide some context for the platform we build at Gloo. The next diagram sets the stage with a very simplified use case for our system.
In this view, a user — a growee — registers with an organization. The organization offers a curated set of catalysts that guide a user through their goals for personal growth. A growee is assigned a set of catalysts, grows over time, and takes assessments as one factor that tracks progress. This data, when combined with many signals, informs intelligence reporting, leading to better recommendations for the growee’s journey.
Our approach to building a mesh culture started with creating terminology, defining initial domains and ownership, clarifying team charters, and focusing on long-term team enablement.
The single most important step is getting everyone aligned on a common language and applying it in their own context. Once the organization is thinking of our data as a product, it takes on a new meaning — one of stronger, purposeful intent serving a specific need within the system. The management of the data becomes more deliberate and focused on the role it serves in the system. Our terminology can be found in part one of this series.
Defining Data Domains and Data Products
Once teams shift to thinking and talking about data as a product serving a need in a broader ecosystem, the data mesh culture forms naturally — Conway’s Law. The owner of the product thinks in terms of the best interest of the data as opposed to molding it to meet the needs of a downstream use case. In other words, it forces a debate about where the business logic belongs — in the data domain or in a process domain. The first principle of the paradigm is that the data product is built to serve the product itself rather than the downstream use cases required in the process domains. Where the logic belongs will always be a debate; however, the outcome should be based on this principle. Following the principle leads to a large degree of organizational scale, as development teams are able to solve their use cases without a dependency on another team. I have an example in a following section that walks through how this can work.
We decided to start defining our products at the lowest, atomic level of detail, making it straightforward for our development and product teams to define. The teams could leverage existing microservices and ERDs to quickly identify the products and domains. It’s unclear if this is our long-term plan or a means of jump-starting our data mesh journey, as the lowest level of detail doesn’t always translate to some business stakeholders. The structure may be similar to the diagram to the left.
Clarifying Team Charters
A key to the data mesh way of thinking is the structure of an organization. There must be clear owners of a data domain and a clear distinction between the data product and application use cases. Defining this structure and shifting the approach to thinking about the work can lead to a lot of concerns, especially when an organization has strongly held beliefs or practices. For example, teams may resist democratizing a “big data” technology, developing with event streams, and/or moving data customizations out of the product. Luckily, our approach didn’t encounter too much resistance in these areas, which I attribute to applying our mesh terminology/rules and forming a new team: Data Mesh Engineering (DME). The intent of this team is to enable teams to connect to the mesh very easily, in order to provide or consume data. The DME team builds the foundation and solves a problem a few times prior to the solution becoming self-serve.
Enter Data Mesh Engineering
The implementation of a data mesh can be intimidating for teams on both sides of the mesh: the one providing data and the one consuming it. To help teams leverage the mesh, we formed a DME team focused on connectivity, implementation, and architecture. Establishing an enablement team like this needs clear goals and a charter to distinguish it from Data Science, Data Engineering (DE), and the application teams. The responsibilities for this team include:
- Supporting frictionless integrations with events, data sources, and streams
- Implementing, guiding, and architecting solutions using mesh technologies
- Providing tools and seamless integrations for monitoring and visibility, such as Prometheus metrics, to teams building with the mesh
- Creating, maintaining, and evangelizing the cross functional principles for an event driven and data mesh architecture
- Advancing team enablement towards self-serve connectivity and usage of the data mesh
Conversely, the team is not responsible for:
- Building or maintaining data pipelines used to derive or enrich data products from external data sources
- Curating data products
- Driving business value from our data products
- Providing business analysis, data science analysis, machine learning models, or reporting for data products or derivative data sets
This approach would still exhibit technology silos if we didn’t honor the self-serve principle. The mesh team is building the self-serve tooling, such as frameworks and examples, to make the connectivity simple and seamless. Additionally, the DME team provides continuous deployment tools to manage topics, configurations, data source/sink connectors, stream definitions, and schemas.
The tools used to implement our data mesh need to support three central themes: access to a single source of truth (the data product), consumption of data products into process data stores, and support of the event-streaming-first philosophy. We chose Kafka tools to implement the core structure of our mesh, although some domains offer REST interfaces — a data port — with federated GraphQL for our applications. All data ports leverage a schema registry to track versioned schemas for the data products. Our Kafka infrastructure is hosted on Confluent.io, which has been an excellent partner in delivering our data mesh by building infrastructure quickly, offering best practices, and generally going out of their way to ramp up our knowledge. Here’s an overview of the Kafka tools we use to solve the data mesh requirements:
- Kafka. The central component for connectivity of streaming events. Internally, we use the Java and NodeJS clients in our services. Using the node libraries led to some challenges due to issues such as poor support for test frameworks, incomplete client configuration, and other implementation concerns. The Java libraries are first-class citizens in the ecosystem.
- Kafka Connect. The workhorse for seamless connection directly between data stores. We’re using the S3 sink, ES source/sink, JDBC source/sink, Redshift sink, HTTP source, generic file system source/sink, and a few homegrown connectors. For example, we may use a JDBC source connector to detect data changes and emit events for consumption by multiple process domains — usually, directly into their data stores such as Postgres, Elasticsearch, or Redshift. Most have been easy to work with and highly configurable. We host the Kafka Connect cluster ourselves (on AWS), though we are evaluating a move to Confluent for this as well. While this approach may lead to copies of the data that will eventually be consistent, it supports our desire to have process domains own their use cases.
- KSQL(db-server). The tool used to perform minor to moderate transformations on stream events. For example, creating an ES document when we don’t have supporting processes to build a specialized index. Some use cases creep into KSQL, but the simplicity, ease of use, and the fact that we aren’t polluting the data product make it a solid choice.
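To make the Kafka Connect piece concrete, here is a sketch of what a JDBC source connector definition looks like in the shape the Kafka Connect REST API accepts. The connector name, connection URL, table, and topic prefix are all hypothetical, not our actual configuration:

```python
import json

# Hypothetical JDBC source connector definition, shaped for the Kafka
# Connect REST API (POST /connectors). All names and credentials below
# are illustrative.
connector = {
    "name": "organizations-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.example.com:5432/orgs",
        "mode": "timestamp+incrementing",      # detect inserts and updates
        "timestamp.column.name": "updated_at",
        "incrementing.column.name": "id",
        "table.whitelist": "organizations",
        "topic.prefix": "data-product.",       # events land on data-product.organizations
        "poll.interval.ms": "5000",
    },
}

payload = json.dumps(connector, indent=2)
print(payload)
```

Once a definition like this is checked into the continuous deployment tooling, standing up a new data flow is mostly a configuration exercise rather than new code.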
The basics of our data mesh architecture can be drawn as a high-level conceptual view. In our design, data products are data providers — never mesh consumers. A process domain is responsible for consuming the state-change events into its own data stores. This allows a domain to assume responsibility for any complex transformations, enrichments, and enhancements on the data to fulfill specific use cases. The process owns these requirements rather than a data domain. While this approach allows for scale and flexibility, the implication is that the system must support eventual consistency, a shared data product, and data transformations. We can mitigate any consistency issues through the use of an event-driven architecture that keeps our data in sync as fast as needed — from sub-seconds to minutes.
The diagram demonstrates the assumptions and generalized architectural patterns. A process domain is represented by colored rectangles. It may consist of zero or more data product consumers. It may also provide one or more data products. A data product, represented by a blue circle, only provides data on the mesh. The process data stores, gray circles, may consume a data product for internal use. This data will eventually be consistent with the source data product.
Example: Role of a Data Pipeline
In our system, we use big data pipelines to normalize, transform, enrich, and enhance large data sets from various external sources. An iPaaS layer serves to manage integrations by handling different data rates, controlling authentication, and feeding a data pipeline. The output of the pipeline is a data product, which is a provider node on our mesh. An applied use case is the ingestion of (advertising) campaign data from several sources such as Google Analytics, Facebook, Xandr, etc. An implementation may include the creation of master campaign data representing a normalized view across all the external data sources. The resulting data product is shaped from the perspective of the source data and the value it provides to the platform. It represents the best, complete picture as opposed to being bloated by aggregates or derived attributes that serve one of many downstream use cases. Be mindful that the external use cases are implicitly supported given the normalized and complete view of the data. For example, if an application requires a summarized view of campaign data, then its local data will have everything it needs to meet this use case; the manipulation is not in the data product or data domain itself.
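The normalization step above can be sketched in a few lines. The field names and source record shapes here are hypothetical stand-ins; the real pipeline handles far more attributes and sources:

```python
# Sketch: map source-specific campaign records onto one shared "master
# campaign" shape. Field names and formats are illustrative only.

def normalize(source: str, record: dict) -> dict:
    """Project a source-specific record onto the shared campaign schema."""
    if source == "facebook":
        return {"source": source, "campaign_id": record["id"],
                "name": record["campaign_name"], "spend": record["spend"]}
    if source == "google_analytics":
        return {"source": source, "campaign_id": record["campaignId"],
                "name": record["name"], "spend": record["cost"]}
    raise ValueError(f"unknown source: {source}")

master = [
    normalize("facebook", {"id": "fb-1", "campaign_name": "Spring", "spend": 120.0}),
    normalize("google_analytics", {"campaignId": "ga-7", "name": "Spring", "cost": 80.0}),
]
print(master)
```

The point of the shared shape is that no downstream aggregate or derived attribute leaks into it; consumers derive those locally.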
In the following diagram, several data sources are ingested into a big data pipeline to yield a data product. In our model, we declare a “Campaign Data Product” with an associated schema. The infrastructure is flexible enough to be able to provide a more granular (data) product, such as a “Facebook Campaign Data Product”, if needed. The flow looks like this:
- Data is ingested into the pipeline and a data product is formed. The data is made available on the mesh via a Kafka topic filled with data events.
- A Kafka connector (orange circles) — an S3 source connector — detects state changes and emits events on a topic. This stream can be consumed in several ways; however, we are using sink connectors here.
- We use several different sink connectors for this stream, such as S3, Elasticsearch, and JDBC. For example, we’re consuming the stream into Elasticsearch for use in creating Growth Intelligence graphs.
The point is: the team owning the data product is not concerned with the application’s graph, only the integrity of the source data.
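To illustrate that separation of concerns, here is a minimal sketch of the kind of consumer-side aggregation that backs an application view. The event shapes are illustrative, and in production this loop would be polling a Kafka consumer rather than iterating a list:

```python
# The consuming domain, not the data product, owns transformations like
# the summary behind a Growth Intelligence graph. Aggregate campaign
# events (as they'd arrive off the topic) into per-campaign totals.
from collections import defaultdict

events = [
    {"campaign_id": "c1", "source": "facebook", "spend": 120.0},
    {"campaign_id": "c1", "source": "google_analytics", "spend": 80.0},
    {"campaign_id": "c2", "source": "xandr", "spend": 40.0},
]

totals = defaultdict(float)
for event in events:   # in production: a Kafka consumer poll loop
    totals[event["campaign_id"]] += event["spend"]

print(dict(totals))    # the app's local, derived view of the data product
```

Nothing here touches the campaign data product itself; the derived totals live entirely in the process domain’s local store.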
In summary, a data pipeline plays a central role in creating a data product that will be made available on the data mesh. The next example is more involved as it focuses on a feedback loop where a consuming process provides results that may inform or improve a source data product.
Example: Feedback Loop for Data Products
The previous example covers how a data product can be created from a data pipeline and how it’s made accessible. The same pattern and principles are applied throughout our mesh architecture. In this example, the focus is on how a consuming domain performs analysis, improves data, and provides feedback to the source data product. One may argue that if a process is influencing the quality of the data product, then it should live in the same domain; however, in our model the consuming domain is doing a ton of related tasks that we think are best left to its own domain. I’ll focus on the role of the mesh in connecting participating domains to form an event-driven feedback loop. The specifics of this example are partially hypothetical, but it wholly represents the real patterns we use in our system.
The use case reviews how data consisting of organizations, locations, and services is sourced on the mesh, with consuming domains performing actions that can influence the quality of the source data products. The original data may be improved due to these quality checks, and these changes must be made available to all interested processes. The analysis is performed again and the workflow repeats. In this case, we have instances where an organization may contain duplicate data due to the many mechanisms for adding an organization’s detail. Several process domains, such as customer support, business intelligence, machine learning, and core services, depend on organization data that doesn’t have duplicates. At a high level, the mesh provides a “live” stream of organization data that is consumed by a de-duplication process. This process identifies duplicates and produces a data set of duplicate organizations, forming a derived data product that can help inform the original Organizations data product — resulting in a feedback loop. Next, we can review the architectural patterns that solve this use case.
As mentioned, our mesh architecture relies heavily on Kafka tools such as Kafka, Kafka Connect, and KSQL. The workflow for the feedback loop starts with an organization change, which results in the change being communicated to the process domains interested in organization events. A Kafka Connect JDBC source connector is configured to watch the organization database and generate an event when a change occurs. The connector configuration allows control over poll rates, authentication, and the topic for the events. The domains interested in organizations include:
- Assessments — handling user feedback
- BA — generating internal dashboards
- Growth Intelligence — tracking growth
- Machine Learning — analyzing organizations for duplicates
The Assessments and BA domains are connected to the mesh using JDBC sink connectors which help us update their internal data stores. These are copies which domain processes use to build tailored data sets to drive their respective applications. The consumption by Growth Intelligence and Machine Learning (ML) uses the Elasticsearch (ES) sink connector due to their requirements for really fast queries. In the case of the ML domain, we use KSQL to combine, sub-select, and perform minor transformations on event streams which results in an ES document. This is shown in the following diagram.
We consider this application of KSQL to be on the consuming side and a responsibility of the consuming domain. We aren’t considering the output of the KSQL stream to be a data product, given it can be thought of as a consumer-side transform of existing data streams. I can foresee the introduction of KSQL on the producer side to create a higher-order data product, but we are still considering this. The cool thing is that while the consuming nodes eventually become near 1:1 copies of the source data, they are updated as event streams, so the latency for consistency is very low.
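The consumer-side KSQL transform is essentially a stream join. Here is a Python sketch of the equivalent logic — joining organization events with location events to build the combined document the ML domain indexes in ES. The field names are hypothetical:

```python
# Sketch of the consumer-side transform KSQL performs for the ML domain:
# join organization events with location events into one document per
# (org, location) pair, roughly SELECT ... FROM locations JOIN orgs.

orgs = {"org-1": {"org_id": "org-1", "name": "Hope Center"}}
locations = [
    {"org_id": "org-1", "city": "Boulder"},
    {"org_id": "org-1", "city": "Denver"},
]

documents = []
for loc in locations:
    org = orgs.get(loc["org_id"])
    if org:                                # inner join on org_id
        documents.append({**org, "city": loc["city"]})

print(documents)                           # what the ES sink would index
```

Because the output is just a reshaping of existing streams, it stays in the consuming domain and never claims to be a data product.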
The ML domain is the workhorse of the workflow. It contains the processes that apply models to identify duplicates in our data. When duplicates are detected, we usually need to merge the entities. Today we want this merge process to be very deliberate, so we rely on REST (data port) calls to update a data product — not another event stream. When the merge happens, a new state change occurs, which triggers an event for the Organizations data product. All the consuming domains pick up the change, and it’s reflected in their local data stores almost immediately. The ML domain applies the deduplication model and the process repeats — completing the feedback loop and keeping the data consistent.
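The loop above can be sketched end to end. The duplicate detector here is a trivial name-match stand-in for the ML domain’s real models, and the REST data-port call is stubbed; only the shape of the flow is the point:

```python
# Sketch of the feedback loop: detect duplicate organizations, merge
# them through a (stubbed) REST data-port call, and let the resulting
# state change flow back out as a new event on the mesh.

def find_duplicates(orgs):
    """Toy stand-in for the ML models: match on normalized name."""
    seen, dupes = {}, []
    for org in orgs:
        key = org["name"].strip().lower()
        if key in seen:
            dupes.append((seen[key], org))   # (canonical, duplicate)
        else:
            seen[key] = org
    return dupes

def merge_via_data_port(canonical, duplicate, emitted_events):
    """Stub for the deliberate REST (data port) merge call; the real call
    would mutate the Organizations data product, triggering a new event."""
    emitted_events.append({"type": "org.merged",
                           "kept": canonical["id"], "removed": duplicate["id"]})

orgs = [{"id": 1, "name": "Hope Center"}, {"id": 2, "name": "hope center "}]
events = []
for canonical, dup in find_duplicates(orgs):
    merge_via_data_port(canonical, dup, events)
print(events)
```

In the real system the emitted event is the JDBC source connector picking up the merge, which is what closes the loop for every consuming domain.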
In summary, domains that depend on a data product managed by another domain access the data indirectly through local data stores. These stores are kept in sync using common tools for connecting to the mesh. If the event stream model is used, any consistency latency is very small; we use Kafka tools as the backbone for moving events. We’ve built out the connectivity infrastructure, tooling, and patterns to the extent that teams can self-serve their use of the mesh.
Challenges and Lessons Learned
Our journey towards a data mesh included challenges, learnings, and challenges that became learnings. The progress towards this vision benefits a great deal from the company culture — one that focuses on personal growth. The company support includes enablement, experimentation, and the cultivation of new ideas. I acknowledge that many other organizations experience resistance to change and roadblocks while attempting to get backing on ideas, however, our culture allowed for action and iteration to demonstrate value and catalyze full buy-in from the organization. The primary challenges or learnings include creating a common language, applying new technology (to us), monitoring mesh health, starting with atomic data products, and the need for discoverability of mesh components.
Data Mesh Terminology
The founding article gave us a head start, and I appreciate how it addressed the data science point of view really well. However, we had a gap. Our internal definitions needed detail for both higher-level, non-technical views and lower-level implementation views. This is the single most important step that requires buy-in and participation. The next blog in this series will cover how we gained adoption.
Over several years we’ve been moving towards an event-driven architecture, so introducing more events is natural. Kafka has worked out well for us as our message broker. While we don’t have large loads or demand high throughput, the primary benefit is having multiple consumers, such as the ability to sink data off an event topic to multiple places. We hit a few challenges as we applied other Kafka tools, such as Kafka Connect.
Kafka Connect is newer to us and we’ve worked through quite a few issues as we wired up the mesh. Some examples:
- We were using a Redshift sink connector that had awful performance and led to a major shift in direction. With more experience we would’ve written our own.
- We’ve struggled a bit to find workarounds for JDBC connectors when an implementation leverages proprietary or non-standard types.
At this point we have a broad set of connectors in use in production, making self-serve a reality, as teams can pull from the set and reference a ton of examples. There isn’t much code at this point, so it’s broadly accessible. We have pretty solid CICD, so deploying with confidence is common and requires limited involvement. It can be tricky to trigger replays of the event stream data, but that need largely arises while we are in development. The use of KSQL is new to our teams, and we have deployments integrated into our CICD workflow, but in the future we’ll revisit how to make that process better. We followed a “migrations” pattern, similar to something like Rails, but we haven’t fully implemented it. The CD process mostly requires happy-path deployments — no rollbacks. The mesh is in flight, and we’re moving data events as they occur in our system.
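The “migrations” pattern we borrow from frameworks like Rails boils down to applying statements in order and recording which have already run. A sketch, with illustrative statement bodies and an in-memory ledger standing in for a durable store:

```python
# Sketch of a Rails-style migrations runner for KSQL deployments:
# forward-only, idempotent, happy-path. The statements and the ledger
# are illustrative; a real runner would submit to the ksqlDB server and
# persist the applied set.

migrations = [
    ("001_create_org_stream", "CREATE STREAM orgs ..."),
    ("002_create_ml_view", "CREATE STREAM ml_docs AS SELECT ..."),
]

applied = set()  # stand-in for a durable record of applied migrations

def run_pending(migrations, applied, execute):
    ran = []
    for name, statement in migrations:
        if name not in applied:      # idempotent: skip anything already applied
            execute(statement)       # in production: submit to ksqlDB
            applied.add(name)
            ran.append(name)
    return ran

ran = run_pending(migrations, applied, execute=lambda stmt: None)
print(ran)
```

Re-running the same deployment is a no-op, which is what makes happy-path CD with no rollbacks tolerable.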
My personal belief is that visibility into system performance, trends, and bottlenecks is required for good software. A mesh makes that requirement critical. We use the jmx_exporter to monitor all of our components, paired with Prometheus/Grafana charts to give us insights. The exporter is solid, and we check the results daily.
We started defining our data products at a very low level, which allowed us to build a mesh very quickly. We got a jump-start due to our microservice architecture dealing with our basic conceptual objects, such as organizations, catalysts, and people. It’s very understandable for engineers, but has the disadvantage of exposing too much of the ERD. With a solid connectivity foundation in place, the next steps will focus on better business objects, as necessary. Additionally, this will be reflected in the GraphQL layer, which will be used by integrating partners.
Need for Discoverability
A Data Mesh Architecture follows the principles of a Service Oriented Architecture, such as contracts, reusability, autonomy, and accessibility. One important aspect of autonomy is discoverability.
As the mesh expands, we need a comprehensive catalog of our data products, schemas, connections, and events. Manually updating a centralized document or repository quickly becomes unmanageable or a hindrance to development — it never works. Providing wrong or outdated documentation can often be more harmful and frustrating than no documentation. This creates the need for a means of auto-generating this catalog of the data domains, schemas, data products, event streams, and configurations.
We’re in the process of evaluating tools and systems that will help us provide this catalog.
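One source of truth we already have is the schema registry, whose REST API (GET /subjects, GET /subjects/&lt;subject&gt;/versions/latest) can seed such a catalog. A sketch, with the HTTP fetch stubbed by canned responses so the catalog shape is visible; the subject name and schema are hypothetical:

```python
import json

# Sketch: auto-generate a minimal catalog entry per subject from schema
# registry responses. fetch() is a stub with canned data; a real version
# would use an HTTP client against the registry's REST API.

def fetch(path):
    canned = {
        "/subjects": ["data-product.organizations-value"],
        "/subjects/data-product.organizations-value/versions/latest": {
            "subject": "data-product.organizations-value",
            "version": 3,
            "schema": json.dumps({"type": "record", "name": "Organization"}),
        },
    }
    return canned[path]

catalog = []
for subject in fetch("/subjects"):
    latest = fetch(f"/subjects/{subject}/versions/latest")
    catalog.append({"subject": subject, "version": latest["version"],
                    "schema": json.loads(latest["schema"])["name"]})
print(catalog)
```

Because the registry is updated by the same CD tooling that deploys connectors and streams, a catalog generated this way can’t drift the way a hand-maintained document does.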
Our journey at Gloo benefits from the value and respect that we place on our data. The company vision is consistent — connect individuals seeking growth to others with strengths in mentoring and helping others. One person’s journey and what they find valuable is not necessarily the same as another’s, even if their goals are aligned. The data forms a picture of leaders, catalysts (actions for growth), growees, resources, and programs. The data forms community. In the future, the mesh needs stronger governance to continue to safeguard relationships, a person’s unique path, and the feedback loops for the personal journeys. We’ll also continue to form the right data products — higher level? atomic? — our growth will guide us.
The future of the mesh will see even better consent management, federated deletes, subject rights control, and the management of the chain of catalysts. To me, this feels like a good application for blockchain. Blockchain with a Data Mesh… If anyone has experience, I’m always looking for a mentor.
This part of the Data Mesh series focuses on application, architecture, and examples. It also lightly covers approach. In talking with colleagues and partners at Gloo, many believe the approach is interesting and could potentially be helpful for others. My partner, product owner, and friend, Alex, will walk us through that in the next part!