Modern Data Analytics Ecosystem — Data Mesh Architecture Delivering Data Value at Scale
I’m super excited to kick off the modern Data Analytics tech blog series with our first topic: Data Mesh. Data Mesh is an up-and-coming paradigm in data analytics architecture that helps organizations address certain pain points associated with modern data platform approaches. It is much more than a technical change; it is a change of People, Process, and Technology. In this blog post, we will cover the following topics:
· What is a Data Mesh?
· How to build a Data Mesh?
· How can Informatica help in architecting a Data Mesh?
· Is Data Mesh the right architecture for you?
1.0 What is Data Mesh?
Introduced by Zhamak Dehghani, Data Mesh has lately been one of the most actively discussed topics in data platform thinking and data architecture. A lot of articles and blogs have been written on this topic. As Albert Einstein said, “Insanity is doing the same thing over and over again and expecting different results.” I won’t bore you with similar examples. Instead, I will explain the Data Mesh concept by drawing a parallel with software programming. Here we go:
1. Local PC: We started software engineering by writing a small program and then executed it on our local PC.
2. Client-Server Architecture: We wanted to share this program with other folks for reusability. Hence we deployed it on a server and accessed it via a client or web application.
3. Single Monolithic Program: Now that this program became popular, everyone wanted to add new features to it. Hence we scaled the dev team and added more functionality. As a best practice, we divided this functionality into different modules and deployed the large program (with multiple modules) onto the server.
4. Distributed Computing: Later, as our consumers (those using this program) kept increasing, we had to scale up and deploy the program on multiple servers for distributed computing.
So far, distributed computing looks good and serves the purpose of the consumers. Many consumers realized good value from the monolithic program. This approach had some advantages, such as a single codebase and a unified tech stack. However, when the functional requests and consumers started growing exponentially, this approach started creating problems such as:
· Since all modules were deployed as a single program, any change to a module, or the introduction of a new one, potentially impacted the entire application
· Deployment became very complex as the teams developing modules grew; a lot of time was wasted integrating and debugging with the main program and coordinating between the developers of various modules
· Even when only a few modules saw heavy usage, the entire program’s infrastructure had to be scaled based on traffic
· Modules were forced to use the same tech stack as the main program (because of the monolithic architecture) even when a more effective or cheaper alternative was available
· As usage grew, new requests were often implemented as new modules because developers were afraid to change the existing ones, resulting in duplicated functionality across many modules
· There was no clear ownership of the main program, which in turn made it difficult to isolate issues. Everyone started blaming other developers’ modules for problems such as application slowness
· Over time, dependency tracking, governance, etc., became extremely challenging
To avoid the above disadvantages, software engineering introduced the microservices architecture:
· Here, modules are converted into autonomous microservices that can be deployed in containers
· Every service has the autonomy to choose its tech stack
· Every service can be scaled independently
· Every service has well-defined APIs for access, input, output, and boundary/scope
· Every service has an owner who is responsible for its functionality, usage, etc.
· There are incentives if many consumers use a service, thereby encouraging owners to deliver better functionality
For solving a complex use case, several microservices need to talk to each other, creating the “Service Mesh” concept.
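To make the microservices-and-mesh idea concrete, here is a minimal sketch in plain Python (standard library only). The `Service`, `Registry`, and the `pricing`/`checkout` services are all hypothetical illustrations, not any real mesh product: each service has an accountable owner and a well-defined interface, and services reach each other through a shared registry rather than hard-wired point-to-point calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Service:
    name: str          # well-defined boundary/scope
    owner: str         # every service has an accountable owner
    handler: Callable  # the service's API: input -> output

class Registry:
    """Toy stand-in for the discovery/routing layer of a service mesh."""
    def __init__(self):
        self._services: Dict[str, Service] = {}

    def register(self, service: Service) -> None:
        self._services[service.name] = service

    def call(self, name: str, payload):
        # Requests are routed through the mesh, not via direct references
        return self._services[name].handler(payload)

mesh = Registry()
mesh.register(Service("pricing", "team-a", lambda order: order["qty"] * 9.99))
mesh.register(Service("checkout", "team-b",
                      lambda order: {"total": mesh.call("pricing", order)}))

# Two independently owned services collaborating through the mesh
print(mesh.call("checkout", {"qty": 3}))
```

In a real deployment, each service would run in its own container with its own tech stack; the registry stands in for the mesh’s service discovery.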
Now, let’s apply the same analogy to data/analytics:
· Module → Individual source systems (domains)
· Single large monolithic program → Centralized data lake / data warehouse
· Developers → Operations plane
· Consumers → Analytical consumers
· Service Mesh → Data Mesh
1. Spreadsheets and databases: When we had small, less complex requirements, we used spreadsheets and relational DBs for analytics
2. Centralized Data Warehouse: When more sources were introduced, we created more database schemas/tables. Then we switched to a separate centralized data warehouse for analytics (DW appliances such as Teradata, Netezza) to better serve customers.
3. Big Data: When the volume, variety, and velocity of data grew, we pivoted to big data (Hadoop) for distributed processing for analytics
4. Cloud Data Warehouse and Data Lake: To achieve better cost and efficiency, we moved to the cloud data warehouse/cloud data lake (CDW/CDL). As consumers and source/LOB systems (domains) keep increasing, the centralized CDW/CDL runs into the same issues described for monolithic programs.
5. Data Mesh: This is the service mesh for data.
2.0 How to build a Data mesh?
In the Data Mesh approach, problems are addressed by shifting the way we think about data. It has four characteristics:
1. Domain-oriented Ownership & Architecture
Decentralize the ownership of sharing analytical data to the business domains that are closest to the data, whether as its source or as its main consumers. Decompose the data artifacts (data, code, metadata, policies) logically, based on the business domain they represent, and manage their life cycles independently. Ease of use and automation are key to building domain-oriented ownership.
2. Data as a Product
Existing or new business domains become accountable for sharing their data as a product to data consumers. Exposing data as an API is key to this requirement.
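As a hedged illustration of “data as a product,” here is a small Python sketch using only the standard library. The `DataProduct` class and the `orders.daily` dataset are invented for this example; the point is that a domain team publishes data behind a discoverable contract (name, owner, schema) and a read API, rather than exposing raw tables directly.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataProduct:
    """A domain team's dataset, published as a product rather than a raw table."""
    name: str
    domain: str
    owner: str                      # accountable domain team
    schema: Dict[str, str]          # discoverable contract for consumers
    records: List[dict] = field(default_factory=list)

    def api_get(self, **filters) -> List[dict]:
        """The product's read API: consumers query a contract, not storage."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in filters.items())]

orders = DataProduct(
    name="orders.daily",
    domain="sales",
    owner="sales-data-team",
    schema={"order_id": "int", "region": "str", "amount": "float"},
    records=[{"order_id": 1, "region": "EMEA", "amount": 120.0},
             {"order_id": 2, "region": "APAC", "amount": 80.0}],
)

# A consumer queries the product through its API, never the underlying store
print(orders.api_get(region="EMEA"))
```

In practice the API would be an HTTP or SQL endpoint with versioning and SLAs, but the contract-first shape is the same.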
3. Self-serve Data Platform
A new generation of self-serve data platforms empowers domain-oriented teams to manage the end-to-end life cycle of their data products. A purpose-built UI, ease of use, and automation are key to this requirement.
4. Federated Data Governance
Federated data governance ensures that data is secure, accurate, and reusable. The technical implementation of data governance, such as collecting lineage, validating data quality, encrypting data at rest and in transit, and enforcing appropriate access controls, can be managed by each of the data domains. However, central data discovery, reporting, and auditing are needed to make it simple for users to find data.
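The split described above can be sketched in a few lines of plain Python; the dataset names and policy rules here are invented purely for illustration. Each domain dataset carries its own access policy (enforcement stays with the domain), while a central catalog provides discovery and a shared audit trail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class DomainDataset:
    name: str
    domain: str
    # The policy is defined and enforced *by the domain*, not a central gatekeeper
    access_policy: Callable[[str], bool]

class CentralCatalog:
    """Central discovery and auditing; access enforcement stays federated."""
    def __init__(self):
        self._datasets: Dict[str, DomainDataset] = {}
        self.audit_log: List[Tuple[str, str, bool]] = []

    def register(self, ds: DomainDataset) -> None:
        self._datasets[ds.name] = ds

    def discover(self) -> List[str]:
        return sorted(self._datasets)  # one place to find all data products

    def request_access(self, user_role: str, name: str) -> bool:
        granted = self._datasets[name].access_policy(user_role)
        self.audit_log.append((user_role, name, granted))  # central audit trail
        return granted

catalog = CentralCatalog()
catalog.register(DomainDataset("patients", "healthcare",
                               lambda role: role == "clinician"))
catalog.register(DomainDataset("orders", "sales", lambda role: True))

print(catalog.discover())
print(catalog.request_access("analyst", "patients"))    # denied by the domain
print(catalog.request_access("clinician", "patients"))  # granted by the domain
```

The design choice is the core of the principle: the catalog never decides who gets access, it only makes every dataset findable and every decision auditable.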
3.0 How Does Informatica Help Build a Data Mesh Architecture?
In April 2021, Informatica launched the industry’s first Intelligent Data Management Cloud (IDMC), the most comprehensive, cloud-native, AI-powered data management platform for driving digital transformation. Thousands of enterprise customers are using the IDMC platform to operationalize and govern data warehouses and data lakes. Informatica is uniquely positioned to support any new data architecture, including the Data Mesh. Here is how Informatica supports the Data Mesh principles:
Domain-oriented ownership & Architecture
IDMC offers a metadata-driven approach to building and scaling data pipelines for any data consumer or producer. The IDMC platform also offers enterprise data catalog, governance and privacy capabilities. This makes it easy for data producers and consumers to register or discover domain-specific datasets to use within their data pipelines.
Data as a Product
IDMC enables enterprises to visualize, analyze, and collaborate on their data regardless of location, type, format, or the underlying source. It is a comprehensive cloud-native platform enabling data and app integration, API and data management at scale, all in a single platform. IDMC also exposes data assets as an API to provide a data marketplace experience for both data suppliers and consumers.
Self-serve Data Infrastructure
IDMC offers self-serve data infrastructure with a low-code or no-code experience allowing customers to go directly from ideation to implementation, responding to dynamic business requirements and changes in real time without the overhead of developing and maintaining code. The IDMC platform includes purpose-built wizards and user experience for every type of user.
Federated Data Governance
IDMC has security and trust as a design principle, not an afterthought. It offers some of the highest industry standards for data security. It delivers governance capabilities such as automatic data asset classification based on domains and the ability to manage and enforce policies, ensuring that the appropriate teams (producers/consumers) can quickly access and understand data and other artifacts such as AI models and pipelines. It ensures trust with consistent enterprise-wide data quality, protects data to minimize privacy risks, and facilitates regulatory compliance. It also offers an enterprise-wide catalog of catalogs.
4.0 Is Data Mesh the right architecture for you?
Current data analytics approaches build centralized cloud data warehouses and data lakes. This centralized platform with a specialized team usually works well for small and medium-sized enterprises and organizations whose data landscape is not constantly changing, or whose business domains are relatively simple. For a Data Mesh architecture, you need to consider the following:
· People: Is the size of your data platform and the team supporting it becoming a bottleneck and, in turn, slowing down innovation?
· Process: Do the individual departments within the organization have the skills, knowledge, and flexibility to make their own decisions? Are you seeing departmental data solutions deviate from the centralized data solutions in meeting the needs of LOB users?
· Technology: Is your organization/LOB good at prototyping and selecting the best technology? Does your business want to adopt the latest technology stack suited for the domain-specific use case without waiting for central IT/data team approval?
If you answered “yes” to one or more of the questions above, you might consider a Data Mesh to resolve scale issues, reduce time, and deliver domain-specific data products.
5.0 Summary & Conclusion
There are many publicly available references to Data Mesh implementations by companies like Intuit, JPMorgan Chase, and HSBC. However, we have yet to see widely proven business benefits of the Data Mesh architecture. Organizations that face the problem of scaling data delivery for analytics are the ones best suited to adopt Data Mesh. But if not carefully planned, and without a data management platform like IDMC, it can easily lead to a more siloed and ungoverned architecture. If you decide to embark on this new Data Mesh architecture, Informatica is uniquely positioned to support you through the industry’s first Intelligent Data Management Cloud, which can enable you to quickly build and realize the value of a Data Mesh architecture. That’s all from my side in this blog. Stay tuned for my next blog. 😊