Implementing Data-as-a-Product (DaaP) using distributed data architecture and Smart Data Platform on GCP
Data has evolved at an unprecedented pace. Big data technologies have revolutionized the way data is captured, stored, and processed. Many organizations have embraced big data platforms, which combine the functions of a data lake, data warehouse, and data marts with data management capabilities, as an effective way to handle data at scale. However, unlocking value from data with existing monolithic data platforms and an incumbent (or non-existent) data architecture remains a challenge for many organizations.
Data platforms appear to be constrained by centralized thinking about data, which diminishes the value of data as it moves through its lifecycle. Domain ownership of data is often lacking, which poses a barrier to creativity and business insights. Because of the disparities in non-domain-specific data, AI/ML initiatives are impeded by increased complications and inconsistencies. There is also a disconnect in how centralized core teams and centralized data teams work together to enforce and control data operations such as data governance, data quality, compliance, and metadata cataloging. The centralized team’s operations and the data teams’ operations are often autonomous and decoupled, with no common interface or communication channel, leading to a lot of friction between the teams and misalignment of business goals and expectations.
Solving this entails a paradigm shift in how we think about data as an organization, and a matching shift in data architecture. Adopting the “Data-as-a-Product” approach can be a great way to address it.
By augmenting your data platforms with the DaaP philosophy and an intelligent technology layer, you can democratize data and gain more insights and economic value from it.
At its most basic level, DaaP involves a logical management layer that can assist in the creation of a more manageable unit of data grouped by domain without the need for physical transfer or data duplication. These manageable data units can be given proper ownership which should include domain representatives, and can be made to follow certain standard principles and policies. They should also support central management and governance.
DaaP implementation can be taxing on both a technical and a human level. As a result, DaaP adoption benefits from an intelligent technology platform. This intelligent technology layer should provide the logical data management layer, use automation to alleviate some of the pain of uniformly implementing principles across multiple domains and data products, and help reduce friction between the centralized team and domain-specific data teams by providing a common set of tools, services, and interfaces.
In this article we will look at how to bring “Data-as-a-Product” to life using state-of-the-art technologies and services on GCP. This next-generation data platform on GCP will be referred to as the “Smart Data Platform”. Based on the concepts¹ of distributed domain-driven data architecture, self-service infrastructure, a centralized CoE team, and the application of DaaP principles, the Smart Data Platform can enable your business to adopt and deploy DaaP. Zhamak Dehghani’s Data Mesh architecture¹ influenced many of these ideas.
The underpinning components
Let’s break it down and make it as simple as possible. Let’s have a look at the fundamental components and their properties from the standpoint of implementation.
- Distributed Domain Driven (DDD) architecture: This notion is inspired by Eric Evans’ Domain-Driven Design. It is a data architecture in which the business or internal operations teams own their domain datasets, process them using standard decoupled pipelines, host them, and provide them in a safe and easily consumable manner. This architecture allows you to break down your data into more manageable chunks. Each logical group in your business or internal operations team that serves or consumes data can be treated as a domain.
DaaP and data domains: A domain is responsible for providing high-quality data products to consumers, both internal and external to the organization. Each domain can consist of one or more data products. A dataset, a file transferred by FTP or shared drive, a data-based report, one or more data pieces ingested or consumed by API, or a stream of data are all examples of data products. Each data product in a domain should follow the DaaP principles.
Designing your domains: How you go about designing your domains is mostly determined by the size of your organization. A domain might be source-oriented (near to data sources and the business), consumer-oriented, or focused on internal operations. A source-oriented domain can be a system of reality, close to the data sources, and can be limited to being a producer only. Each domain should be managed by its own independent cross-functional team, which should include at least a data product owner, a data engineer, a data analyst, and a data steward. For smaller businesses, you can choose to build domains that are more use-case focused, with cross-functional teams limited to one or two data specialists.
- DaaP principles: The following design principles can be incorporated into each of the data products to alleviate concerns about harmonization: discoverable¹, addressable¹, trustworthy¹, self-describable¹, interoperable¹, secured¹, privacy-centric, auditable, timely, version-able, and shareable. This will aid in the development of best-in-class data products for producers as well as consumers.
- DaaP Center of Excellence (CoE): While DaaP emphasizes decentralization, a core team is required to maintain alignment, shared values, and accountability. Think of it as the federal government, with the data domains as state governments. At a high level, its responsibilities may include ensuring uniform adoption of principles, best practices, and standards across domains; conducting education and training sessions; defining and enforcing data governance policies and compliance (especially for external data sharing); creating and maintaining a global knowledge graph; providing domain-agnostic reusable components such as CI/CD, provisioning tools, ingestion frameworks, data connectors, standard API interfaces, and documentation templates; and, last but not least, paving the path for innovation. This is a foundational component, but discussing it further is beyond the scope of this article.
- Smart Data Platform: Choosing a smart data platform will aid in the acceptance and execution of DaaP. It is difficult to construct all of the tools, services, and frameworks required to support DaaP, especially in a monolithic data platform environment. Data quality and metadata cataloging activities and their KPIs, for example, are frequently unclear and lack adequate ownership. Most of the time, developers are held responsible for them as part of application or pipeline development. Because developers are not data specialists, this is difficult to implement, and the approach does not scale as data grows. A Smart Data Platform can help by providing a turnkey data quality solution that runs as data is fed into the data domains and is centrally managed and enforced.
A smart data platform will: 1. provide unified and intelligent data fabric capabilities that ease data management regardless of where the data resides, without the need for data movement or duplication. For example, if your data is stored in GCS, BQ, and other data stores as part of your data lake, it can remain there. 2. facilitate data exchange, sharing, and monetization within and outside the organization. 3. facilitate central alignment and governance.
A data fabric will be responsible for offering features such as a turnkey data quality solution, automated discovery and compliance jobs, unified data governance, schema evolution, and metadata harvesting, among others. These features will make it easier to apply DaaP principles consistently across data domains without requiring a lot of human work.
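To make the idea of centrally managed data quality a little more concrete, here is a minimal Python sketch. It does not show how Dataplex or any other GCP service works internally. The `QualityRule` class, the `CENTRAL_RULES` list, the `ingest` function, and the `customer_id` and `amount` fields are all hypothetical names chosen for illustration. The only point is that rules are defined once by the central team and applied wherever a domain ingests data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


@dataclass(frozen=True)
class QualityRule:
    """A centrally defined check that every domain applies at ingestion."""
    name: str
    check: Callable[[Record], bool]


# Hypothetical rules owned by the central team rather than pipeline developers.
CENTRAL_RULES = [
    QualityRule("customer_id_present", lambda r: bool(r.get("customer_id"))),
    QualityRule(
        "amount_non_negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def ingest(
    records: List[Record], rules: List[QualityRule] = CENTRAL_RULES
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into accepted rows and rejected rows with failed rule names."""
    accepted, rejected = [], []
    for record in records:
        failed = [rule.name for rule in rules if not rule.check(record)]
        if failed:
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected


accepted, rejected = ingest([
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "", "amount": -5},
])
```

In this sketch the second record is rejected with both rule names attached, which gives the domain team something actionable without each pipeline re-implementing the checks.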
Smart Data Platform: Reference Architecture
Now that we’ve established what the important components are, let’s go into the technical aspects of the Smart Data Platform, which is the main topic of this post.
The following are the key building blocks of the Smart Data Platform:
Self-service data infrastructure
Needless to say, the cloud comes packed with self-serve infrastructure services. Google Cloud Platform provides infrastructure as a service, platform as a service, and serverless computing, all backed by Google Cloud’s global presence, secure and efficient data centers, fast and reliable global network, multi-layered security, high availability, and, last but not least, sustainability. Google Cloud’s smart analytics solutions provide a number of data and analytics products to help drive innovation and adopt a DaaP architecture. Below we will explore some of them.
Data fabric — Dataplex
Dataplex is an intelligent data fabric that enables organizations to manage, monitor, and regulate data across several data stores from a single control plane. Dataplex is built for distributed data and allows for data unification via a logical layer, eliminating the need for data migration or duplication. It offers data intelligence through automated data discovery, global data quality checks, and a variety of other features. It also manages, monitors, and audits data authorization and classification policies from a central location.
Dataplex is a logical management layer that can be used in conjunction with your data lake, data warehouse, and data marts to make implementing DaaP principles easier and smarter.
Dataplex can be used to construct and organize domain-oriented data lakes within or across GCP projects, with the option to grant ownership to an independent cross-functional team using IAM. While GCP projects help with organization, they are largely focused on resources and billing, making it difficult to align and link data to a business domain.
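As a rough illustration of why a domain is a better unit of ownership than a project, the sketch below models a domain lake as a purely logical grouping over resources that stay where they are. This is plain Python, not the Dataplex API. The `DomainLake` class, the `owners_by_resource` helper, and every project, resource, and group name are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DomainLake:
    """A logical, domain-oriented lake (illustrative only, not a Dataplex object)."""
    domain: str
    owner_group: str  # hypothetical group for the cross-functional team
    # (project_id, resource_name) pairs; the data itself is never moved.
    assets: List[Tuple[str, str]] = field(default_factory=list)

    def attach(self, project_id: str, resource_name: str) -> None:
        """Register an existing storage resource under this domain."""
        self.assets.append((project_id, resource_name))

    def projects(self) -> List[str]:
        """Projects this lake spans, sorted for stable output."""
        return sorted({project for project, _ in self.assets})


def owners_by_resource(lakes: List[DomainLake]) -> Dict[str, str]:
    """Resolve ownership per resource from the domain, not from the project."""
    return {name: lake.owner_group for lake in lakes for _, name in lake.assets}


# Hypothetical example: one domain spanning two projects.
transactions = DomainLake("transactions", "transactions-cf-team@example.com")
transactions.attach("payments-raw-prj", "bucket/card-transactions-raw")
transactions.attach("analytics-prj", "dataset/card_transactions_curated")
```

Here `transactions.projects()` returns both project names, while ownership of each resource resolves to the single cross-functional team, regardless of which project bills for it.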
Data Exchange — Analytics Hub
Analytics Hub allows you to exchange data and insights across organizational boundaries without surrendering control. It serves as the foundation for monetizing commercial data and analytics, and it has a built-in integration with Data Catalog. Data feeds hosted by data domains can be easily searched for and subscribed to by consumers. Analytics Hub offers both external and internal data exchanges, data monetization, data access monitoring and auditing, and data product versioning (supported by the underlying storage layer), all of which are important DaaP characteristics.
Data Experiences — Looker
Looker’s embedded analytics, in addition to its other advanced ML and BI capabilities, is a proven and practical approach to commercializing and monetizing data.
API Management — Apigee
Apigee lets you design, secure, analyze, and scale APIs anywhere with visibility and control. It also lets you build and monetize API products to maximize the business value of digital assets.
Data governance — Data catalog and Partnering tool
Collibra and GCP, for example, provide a solid framework for central federated data governance in multi-cloud and hybrid systems. With Data Catalog, together with Cloud IAM and Cloud DLP, you can enforce data security policies and maintain compliance.
A sample case study
Here’s an example of a DaaP deployment for a financial services firm that offers both credit cards and payment services. Consider the following lines of business (LOBs) and business units (BUs): credit cards, credit scoring, transactions, customer, issuer, acquirer, merchants, payment and fees, marketing, risk, and fraud.
An autonomous cross-functional (CF) team with domain representation can own a source-oriented domain lake (producer only) nearest to the source of data for the LOBs. BUs can be more consumer-oriented domain lakes that drive both external and internal end-user and application consumption.
As soon as the domains are ready to host the data, the first step is to publish the data product’s metadata, which should include SLOs (data accuracy, confidence score, etc.), ownership information, consumption mechanisms, technical metadata, and business metadata, among other things.
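One way to picture this publishing step is as a small descriptor that must be complete before a data product goes live. The sketch below is an assumption about how such a descriptor could look, not a Data Catalog or Dataplex schema. `DataProductMetadata`, its field names, and the sample values are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class DataProductMetadata:
    """Descriptor a domain publishes before consumers can use a data product."""
    name: str
    domain: str
    owner: str                  # ownership information
    slos: Dict[str, float]      # e.g. data accuracy, confidence score
    consumption: List[str]      # consumption mechanisms
    technical: Dict[str, str]   # technical metadata
    business: Dict[str, str]    # business metadata

    def missing_sections(self) -> List[str]:
        """Names of sections left empty, in field declaration order."""
        return [key for key, value in asdict(self).items() if not value]


# Hypothetical draft with ownership and SLOs still to be filled in.
draft = DataProductMetadata(
    name="card_transactions_daily",
    domain="transactions",
    owner="",
    slos={},
    consumption=["api", "dataset"],
    technical={"format": "parquet"},
    business={"description": "Settled card transactions per day"},
)
print(draft.missing_sections())  # ['owner', 'slos']
```

A check like `missing_sections()` could gate publication, so that a product without an owner or SLOs never becomes discoverable.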
Consumers of data, who are either internal or external to the consumer-oriented domains, can search for and subscribe to the data they want to consume. Data from their domains can be consumed once they have gained the necessary access.
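The search, subscribe, then consume flow can be sketched with a tiny in-memory exchange. This is only a mental model and does not use the Analytics Hub API. The `Listing` and `Exchange` classes, their methods, and the product and consumer names are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Listing:
    """A data product advertised by a domain (hypothetical structure)."""
    product: str
    domain: str
    description: str
    subscribers: Set[str] = field(default_factory=set)


class Exchange:
    """In-memory stand-in for a data exchange."""

    def __init__(self) -> None:
        self._listings: Dict[str, Listing] = {}

    def publish(self, listing: Listing) -> None:
        self._listings[listing.product] = listing

    def search(self, term: str) -> List[str]:
        """Case-insensitive match on product name or description."""
        term = term.lower()
        return sorted(
            item.product
            for item in self._listings.values()
            if term in item.product.lower() or term in item.description.lower()
        )

    def subscribe(self, product: str, consumer: str) -> None:
        """Grant access; raises KeyError if the product is not listed."""
        self._listings[product].subscribers.add(consumer)

    def can_read(self, product: str, consumer: str) -> bool:
        listing = self._listings.get(product)
        return listing is not None and consumer in listing.subscribers


exchange = Exchange()
exchange.publish(Listing("fraud_scores", "fraud", "Daily fraud risk scores"))
exchange.publish(Listing("merchant_fees", "payment and fees", "Fees per merchant"))
```

A marketing analyst would first call `exchange.search("fraud")`, then `exchange.subscribe("fraud_scores", "marketing-analyst")`. Only after that does `can_read` return `True` for that consumer.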
It’s time to re-think and modernize your data architecture if you’re going on a Cloud journey, want to get more value out of your existing data, or want to reduce data management overhead.
Adopting a ready-made smart platform rather than doing it yourself will help you overcome some of the technical and human problems connected with DaaP, allowing you to move forward with its implementation faster. It’s time to go on the journey of data architecture modernization once you’ve figured out how technologies like Smart Data Platform can help.
It’s important to remember that this isn’t about re-architecting or redesigning your data lakes, which would be a huge undertaking. This is a more logical, strategic, and forward-thinking approach to data management. Forward-looking use cases such as open banking and data and analytics monetization can benefit from DaaP. Embrace this sooner rather than later to reap the rewards in the near future.
P.S. If done correctly, this can be a successful endeavor. In a subsequent article, we’ll look at the fallacies of this architecture and how to overcome them. We’ll also talk about how the method may differ and be tailored to your company’s size and complexity.
I hope you found this post to be useful.