Designing a Modern Data Platform on the Cloud

Shailesh Kosambia
13 min read · Nov 10, 2021


The Change and Value Proposition

Most organizations that have traditionally used data platforms for decision making are now transitioning to elevate those platforms: gaining insights that power consumer experiences and creating product/service differentiation by embedding intelligence. The user base has also expanded from business analysts and power users to data scientists, citizen developers and even direct customers. Analytics that was traditionally derived only from structured source tables now also draws on unstructured data, including data scraped from web pages, logs, sensor events and so on.

From a product perspective, data platform architectures have evolved massively by separating storage and compute. The platform becomes modular, with the ability to pick the right type of compute (say Spark) and the right type of storage (say blob storage) for each workload and use case. AI and ML have introduced an entirely new pillar of decision making that did not exist before, bringing a new set of tools and techniques and becoming an integral part of application development that makes applications smarter. Data governance now plays an integral role in the process: it ensures that consumer trust is maintained, that data is secured from breaches yet democratized for easy access, and that regulations like GDPR are followed so the organization suffers no reputational loss.

There is now a huge opportunity to monetize data. Investing in a modern platform helps make money, save money and earn customer trust; the value realized far outweighs the cost and is a game changer for firms.

How to Transition to a Modern Data Platform Architecture

The journey to a modern data platform requires drastic change across people, process and technology. There is a far greater variety of tools and services available now than in the traditional data space, and a complete mindset change is needed to select the right service and tool for the right purpose. I will illustrate this with an example architecture for a modern data platform on the cloud, assuming you are building from scratch, and will provide example tooling options from the Azure, AWS and GCP cloud providers.

Centralized vs. Decentralized Data Architecture

There are two approaches to designing a modern data platform: centralized and decentralized, the latter also called a data mesh. One of the biggest differences between a data mesh and other data platform architectures is that a data mesh is a highly decentralized, distributed data architecture, as opposed to a centralized, monolithic architecture built around a data warehouse or a data lake.

A centralized data architecture means the data from each domain/subject (e.g. finance, HR, operations) is copied to one location in a data lake under one storage account, and the data from the multiple domains is combined to create centralized data models and unified views. It also means centralized ownership of the data, sitting with one horizontal data IT team.

A decentralized, distributed data architecture means the data from each domain is not copied but rather kept within the domain: each domain/subject has its own data lake under its own storage account and its own data models. It also means distributed ownership of the data, with each domain having its own owner.

Which do I select? Is decentralized better than centralized?

The first thing to mention is that a decentralized solution is not for smaller companies; it is only for really big companies with very complex data models, high data volumes and many data domains. I would say that for at least 70% of companies a decentralized solution would be overkill. I will cover the centralized approach in this article and the decentralized data mesh in another article.

Conceptual Centralized Modern Data Platform

The diagram below illustrates the different layers of a centralized modern data platform. Details of each layer are given below.

1. On-Premise Data: consists of data in the various application databases, called systems of record (SOR), that today sit on on-premise data centre infrastructure. Data is ingested from these on-premise SORs into the cloud staging zone for the respective application/data source.

2. Data Hydration: this layer moves on-premise data to the cloud staging zone. For on-premise-to-cloud data hydration there are four ingestion patterns, i.e. batch, stream, replication and IoT event consumption, chosen depending on the use case.

3. Staging Zone: the first landing area for data on the cloud. Data is made available here for fast retrieval. This can also be called the raw/bronze zone, holding unrefined data in its native format such as CSV, JSON, text, binary or XML.

4. Integrated Data Warehouse: data from the staging zone is ingested into the data warehouse for curation, using one of the data hydration patterns mentioned above. The curated version of the data is made available to consumer applications in this layer, and all analytical workloads use this curated copy for ad-hoc business analytics. This layer can also be called the silver zone: refined data to which data quality standards have been applied (a minimal PySpark sketch of the bronze-to-silver-to-gold flow follows this list).

5. Virtual Data Marts: applications and downstream analytics use the curated copy to create de-normalized, materialized views as needed. This can also be called the gold zone; it holds aggregated data sets for downstream systems.

6. Analytical Sandbox: holds production-quality data for data science teams to explore and to create, train and test ML models. This area can also be seen as an ephemeral space for exploration and advanced analytical work.

7. Channels: analytics and insights are consumed through various channels: internal and external facing applications, dashboards and the reporting tools used by end users.

8. Data Governance: all data, from its arrival on the cloud onward, is channelled through the data governance process to ensure data standardization, data classification, data ownership, business and technical metadata, and data quality.

9. Data Security: data is protected through entitlements at both the user and row level, along with encryption, tokenization and masking.
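To make the bronze/silver/gold layering in points 3 to 5 concrete, here is a minimal PySpark sketch of the flow. The paths, the order_id/amount/region columns and the quality rules are hypothetical placeholders; your curation logic and storage layout will differ.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw files landed by the ingestion layer, kept in native format (hypothetical path).
bronze = spark.read.option("header", True).csv("/lake/bronze/sales/2021/11/")

# Silver: apply data quality rules and standardize types before curation.
silver = (
    bronze
    .dropDuplicates(["order_id"])                              # hypothetical business key
    .filter(F.col("amount").cast("double").isNotNull())        # drop rows failing a basic rule
    .withColumn("amount", F.col("amount").cast("double"))
)
silver.write.mode("overwrite").parquet("/lake/silver/sales/")

# Gold: de-normalized, aggregated view for downstream marts and dashboards.
gold = silver.groupBy("region").agg(F.sum("amount").alias("total_sales"))
gold.write.mode("overwrite").parquet("/lake/gold/sales_by_region/")
```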

What to use for What?

The table below provides the sample product choices I would recommend for the above modern data platform architecture, for each of the top three cloud providers you might have selected.

Modern Data Processing Implementation Approaches

Primarily there are two data processing implementation approaches for a modern data platform: the commonly used Lambda architecture and the Kappa architecture.

a. Lambda Architecture: this has two separate processing layers, one for real-time streaming needs called the hot layer and another for batch requirements called the cold layer. It is used when there is a more complex, hybrid requirement to integrate a speed layer with a batch layer.

In the hot path we use event streams such as Event Hubs or Pub/Sub. A stream flowing through a service like Azure Stream Analytics can, after processing, send data directly to a real-time application or to reports and dashboards. We can use a serverless function like Azure Functions or AWS Lambda for any stream processing and grouping logic before sending results to the consuming application, bot or dashboard.
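As an illustration of that hot-path processing, below is a minimal sketch of an Azure Functions handler (Python v1 programming model) reacting to Event Hub events. The payload fields, the threshold and the downstream call are hypothetical, and the Event Hub binding itself lives in function.json, which is not shown.

```python
import json
import logging

import azure.functions as func


def main(event: func.EventHubEvent):
    # Deserialize one event arriving from the hot path.
    payload = json.loads(event.get_body().decode("utf-8"))

    # Hypothetical grouping/threshold logic before forwarding downstream.
    if payload.get("sensor_temp_c", 0) > 90:
        logging.warning("High temperature reading: %s", payload)
        # push_to_dashboard(payload)  # hypothetical call to the consuming app or dashboard
```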

In the cold path we use an ETL service like Azure Data Factory for batch ETL orchestration and store the data as immutable data sets in the data lake, or in a data warehouse such as the Azure Synapse Analytics semantic/MPP layer.

The arrow from the hot path to the cold path shows that a copy of the real-time data can be stored in the data lake or data warehouse for specific analytics or ML use cases that need fresher, near-real-time data.

The data stored via the cold path, combined with the view from the hot path in the serving layer, can support advanced analytics, for example processing ML models in Databricks using Jupyter notebooks. We can test, train and deploy ML models in the cold path as illustrated in the diagram above.
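As a rough illustration of training a model on curated cold-path data, here is a minimal scikit-learn sketch. The dataset path, feature columns and the churn label are hypothetical placeholders, not part of the architecture above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical curated (silver/gold) feature set exported for experimentation.
curated = pd.read_parquet("/lake/gold/churn_features/")

X = curated.drop(columns=["churned"])   # hypothetical label column
y = curated["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```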

Lambda architecture uses both the batch layer and the stream layer, and keeps adding new data to the main storage while ensuring that the existing data remains intact. Companies like Twitter, Netflix and Yahoo use this architecture to meet quality-of-service standards for their data processing needs.

b. Kappa Architecture: used for pure streaming, for example with Kafka. Kappa architecture should not be taken as a substitute for Lambda architecture; rather, it is an alternative for circumstances where an active batch layer is not necessary to meet the required quality of service. It finds its application in real-time processing of distinct events.

Data Platform Practices

Below are some guidelines from my experience to keep in mind while you develop data platforms and their sub-systems.

a. Being able to select the appropriate stack is critical. For example, Databricks is better for advanced analytics initiatives, whereas if you want self-serve analytics, Synapse or Redshift support a semantic layer better. Do your assessment well before choosing, keeping must-have features, performance, concurrency, cost and so on in mind.

b. Design for scenarios such as how to restart a pipeline when something fails, how to incorporate all data validation and what to do when it fails, and a standard, simple way of onboarding new data sources, keeping in mind the orchestration pattern required by business needs and data availability. Make sure your workflow and orchestration patterns are designed with all of this in mind and have strong overall governance around them.

c. Understand what your data patterns are, e.g. offline (lake) vs online (MPP) vs nearline (streaming).

d. Start classifying and securing data as soon as Raw stores are being generated

e. Data storage is cheap, so make the best use of it. As data moves through the pipeline's transformation layers, version your data and track its lineage with the various cloud-provided services.

f. As you work through the data pipeline, test thoroughly at each step and look out for side effects as you test the data. Focus on unit test cases, not just end-to-end integration, to find data quality issues and broken logic.

g. Understand how replaying information through a new pipeline would work: reproducibility, backfilling and recreating results.

h. Don't miss out on tracking costs. Create action rules and integrate them with services like Azure Logic Apps to help control resources.

i. Standardize by implementing governance and DataOps early on. Collaborate better using Git and your DevOps tooling.

j. Identify metrics about your analytics processes and track them to improve troubleshooting and transparency, for example an analysis of pipeline run times over the last three months, or statistics categorizing the various data incidents.

k. Do due diligence in advance and implement remedies so the platform can keep evolving.

l. Write datasets with an atomic and immutable approach.

m. Partition data into a hierarchical, predictable structure aligned to domain/subject area (see the sketch after this list).

n. Leave a folder for corrupt or bad outputs in a similar sub-folder structure.
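A minimal sketch of points l, m and n together, assuming a PySpark job writing a daily snapshot into a hypothetical domain/subject/date folder layout with a quarantine path for bad rows:

```python
from datetime import date

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("atomic-partitioned-write").getOrCreate()
df = spark.read.parquet("/lake/silver/sales/")   # hypothetical curated input

# Hierarchical, predictable layout: domain / subject / yyyy / mm / dd
run_date = date(2021, 11, 10)
target = f"/lake/gold/finance/sales/{run_date:%Y/%m/%d}"

# Write the full daily snapshot in one shot and never update files in place:
# reruns overwrite the whole dated partition, so readers see either the previous
# complete folder or the new complete folder, nothing half-written.
df.write.mode("overwrite").parquet(target)

# Side channel for rejected records so bad rows are visible, not silently dropped.
bad = df.filter("amount IS NULL")                # hypothetical rejection rule
bad.write.mode("overwrite").parquet(f"/lake/quarantine/finance/sales/{run_date:%Y/%m/%d}")
```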

DataOps Best Practices

DataOps is an extension of DevOps, bringing agility and nimbleness so things can be delivered quickly. It applies those practices to data engineering, analytical development and deployment, and involves optimizing the operating model, tooling and processes.

There are different cycles in DataOps. You go through local code verification, then continuous development, where local tests are applied and integrated into an automated process. Your CI platform can then run those tests, which feed into continuous delivery: once the automated test checks pass, the release can be pushed automatically to the target platform. Even after deployment you may still want to rerun certain tests, especially input/output logic tests, to confirm that what you see in production is what you actually expected in development; design for that in DataOps, with a mechanism to exclude certain unit tests from this post-deployment suite so you don't overburden the server. Finally, deployment verification monitoring should be in place, tracking operational metrics about server state so you know resource usage, compliance restrictions and so on; try to automate much of this or include it in a dashboard for transparency.

DataOps follows the DataOps manifesto; you can get details at The DataOps Manifesto — Read The 18 DataOps Principles. Apply these different practices to reach the goal that neither users nor developers worry about what is going to break in the prod system, and your key development team members do not spend hours at night fixing something in prod instead of creating value from the strategic platform investment.

A lot of organizations fail to apply data logic tests correctly. Think in terms of the testing pyramid and test-driven development. The lowest layer applies unit tests and code analysis; it should include naive tests, such as simple counts or checking whether an end date precedes a begin date, which tell you how things look conformity-wise and can be implemented right from the get-go. Integration tests check whether a portion of the pipeline, say data ingested from a data warehouse such as Azure Synapse, processed through Databricks and then landed in blob storage or a relational database, behaves as expected; use data pipeline tests with Azure Data Factory to automate this, along with frameworks in Python or AWS Glue. End-to-end (and remaining manual) tests ensure the information from the raw zone lines up with what is expected in the final curated and aggregated front-end semantic model, such as Azure Analysis Services or even Power BI. Below are the three types of tests that should be done and automated with the above approach; a minimal pytest sketch follows the list.

· Business logic tests: non-naive, domain-based thresholds

· Input tests: counts, conformity, temporal consistency, application consistency, field validation

· Output tests: completeness and range verification
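As a sketch of how these three test types might be automated, here is a minimal pytest example. The output path, column names and the premium threshold are hypothetical; plug in your own loaders and domain rules.

```python
import pandas as pd
import pytest


def load_output(run_date: str) -> pd.DataFrame:
    """Hypothetical helper that loads the pipeline's output for a given run date."""
    return pd.read_parquet(f"/lake/gold/policies/{run_date}/")


@pytest.fixture
def output() -> pd.DataFrame:
    return load_output("2021/11/10")


def test_input_counts(output):
    # Naive input test: the load produced rows at all.
    assert len(output) > 0


def test_temporal_consistency(output):
    # Conformity test from the article: end date must not precede begin date.
    assert (pd.to_datetime(output["end_date"]) >= pd.to_datetime(output["begin_date"])).all()


def test_business_threshold(output):
    # Business-logic test: a domain-based threshold, e.g. premiums stay in a plausible range.
    assert output["premium"].between(0, 1_000_000).all()
```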

Branch and Merge

For your data, ETL, ML models, BI reports and dashboards, leverage Git and agree on a branching methodology such as trunk-based development or Gitflow. Express the stateless parts of objects as code where possible, e.g. YAML files for CI/CD pipelines, SQL for database objects, JSON for ARM templates to deploy infrastructure as code, and Python .py scripts for Databricks. This makes code comparison and review through the pipelining process easy.

Automate delivery

Below is a sample DataOps build pipeline that you would build on the cloud.

You should run local code, unit and lint tests through build pipelines to catch issues up front. Use successful pipelines to trigger releases. Use CD pipelines for deploying code and artifacts to environments and running more tests. Integrate the work with DevOps or other ticketing systems to understand the impact of your code base changes.

Modularize and reuse by parametrizing processing. An example is making the copy activity in Azure Data Factory dynamically reusable across multiple tables. Track run-time parameters in operational tables so you can understand what you ran and when you ran it. Recognize the diminishing returns when implementing modularization and parametrization.
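The same metadata-driven idea can be sketched outside Data Factory as plain Python: one generic copy routine, driven by a table list, with a simple run log. Everything here (table names, paths, the copy_table helper) is hypothetical and stands in for a parameterized ADF copy activity plus an operations table.

```python
# Hypothetical metadata describing every table to copy; in practice this could
# live in a control table or a JSON config checked into Git.
TABLES = [
    {"source": "sales.orders",    "target": "/lake/bronze/sales/orders/"},
    {"source": "sales.customers", "target": "/lake/bronze/sales/customers/"},
    {"source": "hr.employees",    "target": "/lake/bronze/hr/employees/"},
]


def copy_table(source: str, target: str) -> None:
    """Hypothetical generic copy routine: one parameterized activity invoked per table."""
    print(f"copying {source} -> {target}")


def run(run_id: str) -> None:
    for table in TABLES:
        copy_table(table["source"], table["target"])
        # Hypothetical operations log so you can later see what ran and when.
        print(f"run={run_id} table={table['source']} status=succeeded")


if __name__ == "__main__":
    run("2021-11-10T01:00")
```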

Data Governance Practices

Data governance has various pillars, such as data discovery & classification, data quality and data access. Let's cover data discovery first by splitting it into the three buckets below.

Data Classification: here we classify data into categories such as PII and non-PII, and types of PII, based on data sensitivity. Data assets are evaluated and categorized based on privacy and access level. On Azure we can leverage Data Catalog for this.

Metadata Management: this enables data discovery, i.e. what data you have, where it is and how you describe it. It covers both business metadata and system/technical metadata. Again, on Azure we can leverage Data Catalog; other options are Collibra or AWS Macie. For system metadata we should have an automated way of capturing it and integrating it, via the CI/CD pipeline, into the central data catalog tool.

Content and Documentation: improving the centralization and storage of documents, in the form of digital asset management, is very important. We can use Azure DevOps wikis and SharePoint with Teams integration.

The other area is data management and quality, which can be subdivided as below.

Data Quality: this deals with profiling, assessing and testing data I/O and transformations. We have several tools on Azure for this: Data Catalog, Data Factory, SQL Server Data Quality Services (on a VM), and Databricks with Python libraries. Data quality checks should be done at various points, starting from the origin of the data. Many metadata management tools now have data profiling features integrated as well.
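As a simple illustration of profiling in Databricks with Python, the sketch below computes the row count, null ratio and distinct count per column with PySpark. The dataset path is hypothetical, and a real implementation would persist the profile to a quality table rather than print it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("basic-profile").getOrCreate()
df = spark.read.parquet("/lake/silver/customers/")   # hypothetical curated dataset

total = df.count()
print(f"rows: {total}")

# One pass per metric per column; fine for a sketch or modest curated sets.
for column in df.columns:
    nulls = df.filter(F.col(column).isNull()).count()
    distinct = df.select(column).distinct().count()
    print(f"{column}: null_ratio={nulls / total:.2%}, distinct={distinct}")
```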

Data Lineage: tracks data movement, transformation and usage. On Azure we can use Data Factory, Databricks and Power BI; Collibra is also good at data lineage.

For security and access, we should follow the principle of least privilege. We can divide this area as below:

Data Security and Access: define governance standards and rules around security and access. Azure has several services in this area: Azure Active Directory, IAM, RBAC, ACLs, native PaaS security plus encryption, Azure Key Vault and Azure Sentinel. Every cloud provider has a similar set of services in this area.
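For example, secrets such as storage keys should come from Azure Key Vault at run time rather than being embedded in pipeline code or config. A minimal sketch using the azure-identity and azure-keyvault-secrets libraries (the vault URL and secret name are hypothetical):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential picks up a managed identity in Azure, or a developer
# login locally, so no password ever lives in the pipeline code.
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://my-platform-kv.vault.azure.net/",   # hypothetical vault
    credential=credential,
)

storage_key = client.get_secret("staging-storage-key").value   # hypothetical secret name
```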

Data Policies and Compliance: this area includes governance standards around compliance, covering encryption, auditing, monitoring, retention and so on. Azure services here include Azure Monitor, Log Analytics, native PaaS logging and auditing, Azure Blueprints and Azure Policy. Immuta is another addition that does row-level security very well.

API Access: it is important to allow access to data not only from analytics and reporting tools but also from any application in a RESTful way, i.e. exposing data to downstream applications through web interfaces for easy integration. On Azure we have Azure App Service and OData via the Common Data Service (Power Apps).
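As an illustration of RESTful access to curated data, here is a minimal FastAPI sketch that serves a gold-zone extract to downstream applications. The path, column names and endpoint are hypothetical, and Azure App Service is one place such an API could be hosted.

```python
import pandas as pd
from fastapi import FastAPI

app = FastAPI()


@app.get("/sales/{region}")
def sales_by_region(region: str):
    # Hypothetical gold-zone extract; in practice read from the data mart or a cache.
    gold = pd.read_parquet("/lake/gold/sales_by_region/")
    rows = gold[gold["region"] == region]
    return rows.to_dict(orient="records")

# Run locally with:  uvicorn api:app --reload   (assuming this file is api.py)
```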

Conclusion

This article covered modern data platform principles, along with recommended best practices to help you accelerate your journey toward a successful data-driven organization.
