Building a Data Platform to Enable Analytics and AI-Driven Innovation
Build a Data Mesh & Set up MLOps
Businesses realize that as more and more products and services become digitized, there is an opportunity to capture a lot of value by taking better advantage of data. In retail, it could be by avoiding deep discounting by stocking the right items at the right time. In financial services, it could be by identifying unusual activity and behavior faster than the competition. In media, it could be by increasing engagement by offering up more personalized recommendations.
In my talk at Cloud Next OnAir, I describe a few key challenges that you will have to address in order to lead your company toward data-powered innovation:
- The volume of data that you employ will increase 30–100% year over year; you are looking at roughly 5x data growth over the next 3–4 years. Do not build your infrastructure for the data you currently have. Plan for growth.
- 25% of your data will be streaming data. Avoid the temptation of building a batch data processing platform. You will want to unify batch and stream processing.
- Data quality degrades the farther the data gets from the originating team. So, you will have to give domain experts control over the data. Don’t centralize data in IT.
- The greatest value in ML/AI will be obtained by combining data that you have across your organization and even data shared by partners. Breaking silos and building a data culture will be key.
- Much of your data will be unstructured — images, video, audio (chat), and free form text. You will be building data and ML pipelines that derive insights from unstructured data.
- AI/ML skills will be scarce. You will have to take advantage of packaged AI solutions and systems that democratize machine learning.
The platform that you build will need to address all of these challenges and serve as an enabler of innovation.
In this article, I will summarize the key points from my talk, and delve into technical details that I didn’t have time to cover. I recommend both watching the talk and reading this article because the two are complementary.
The 5-step journey
Based on our experience helping many Google Cloud customers go through a digital transformation journey, there are five steps in the journey:
Step 1: Simplify operations and lower the total cost of ownership
The first step for most enterprises is to find the budget. Moving your enterprise data warehouse and data lakes to the cloud can save you anywhere from 50% to 75%, mostly by reducing the need to spend valuable time doing resource provisioning. Ephemeral and spiky workloads will also benefit from autoscaling and the cloud economics of pay-for-what-you-use.
But when doing this, make sure you are setting yourself up for success because this is only the first step of the journey. Your goal is not just to save money; it is to drive innovation. You can get the ability to handle more data, more unstructured data, streaming data, and build a data culture (“modernize your data platform”) and save money at the same time by moving to a capable platform. Make sure to pick a platform that is serverless, self-tuning, highly scalable, provides high-performance streaming ingestion, allows you to operationalize ML without moving data, enables domain experts to “own” the data but share it broadly with the organization, and does all this in a robust, secure way.
When it comes to analytics, Google BigQuery is the recommended destination for structured and semi-structured data, and Google Cloud Storage is what we recommend for unstructured data. We have low-risk migration offers to quickly move on-premises data warehouses (Teradata/Netezza/Exadata), Hadoop and Spark workloads, and point data warehouses like Redshift and Snowflake to BigQuery. Similarly, we have offerings to capture logs and change streams from transactional databases into the cloud for analytics.
Step 2: Break down silos, democratize analytics, and build a data culture
My recommendation to choose the storage layer based on the type of data might seem surprising. Shouldn’t you store “raw” data in a data lake, and “clean” data in a data warehouse? No, not a good idea. Data platforms and roles are converging, and you need to be aware that traditional terminology like Data Lake and Data Warehouse can lead to status quo bias and bad choices. My recommendation instead is to look at what type of data it is, and choose your storage layer accordingly. Some of your “raw” data, if it is structured, will live in BigQuery, and some of your final, fully produced media clips will reside in Cloud Storage.
Don’t fall into the temptation of centralizing control of data in order to break down silos. Data quality degrades the further away from the domain experts you get. Make sure that domain experts create datasets in BigQuery and own buckets in Cloud Storage. This allows for local control, while access to these datasets is governed through Cloud IAM roles and permissions. The use of encryption, access transparency, and masking with Cloud Data Loss Prevention can help ensure org-wide security even if the responsibility for data accuracy lies with the domain teams.
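To make the ownership model concrete, here is a minimal sketch of the IAM policy binding a domain team would attach to its dataset to grant read-only access to the rest of the organization. The group name is hypothetical; the role is BigQuery’s built-in read-only role:

```python
def reader_binding(group_email: str) -> dict:
    """Build an IAM policy binding granting read-only BigQuery access
    to a domain team's Google group (group name below is hypothetical)."""
    return {
        "role": "roles/bigquery.dataViewer",
        "members": [f"group:{group_email}"],
    }


# The retail domain team shares its dataset with the analyst group:
binding = reader_binding("retail-analysts@example.com")
```

In practice you would attach such a binding with the `bq` CLI or the client library; the point is that the domain team, not central IT, decides who goes in the members list.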
Each analytics dataset or bucket will be in a single cloud region (or multi-region such as EU or US). Following Zhamak Dehghani’s nomenclature, you could call such a storage layer a “distributed data mesh” to avoid getting sidetracked by the lake vs. warehouse debate.
Encourage teams to provide wide access to their datasets (“default open”). Owners of data control access to their data, subject to org-wide governance policies; IT teams also retain the ability to tag datasets (for privacy, etc.). Cloud IAM is managed by IT, while permissions to individual datasets are managed by the data owners. Upskill your workforce so that they are discovering and tagging datasets through Data Catalog, and building no-code integration pipelines using Data Fusion to continually increase the breadth and coverage of your data mesh.
One problem you will run into when you build a democratized data culture is analytics silos. Every place a Key Performance Indicator (KPI) is calculated is one more opportunity for it to be calculated the wrong way. So, encourage data analytics teams to build a semantic layer using Looker and apply governance through that semantic layer.
This has the advantage of being multi-vendor and multi-cloud. The actual queries are carried out in the underlying data warehouse, so there is no data duplication.
Regardless of where you store the data, you should bring compute to that data. On Google Cloud, the compute and storage are separate and you can mix and match. For example, your structured data can be in BigQuery, but you can choose to do your processing using SQL in BigQuery, Java/Python Apache Beam in Cloud Dataflow, or Spark on Cloud Dataproc.
Do not make copies of data.
Step 3: Make decisions in context, faster
The value of a business decision, especially a decision that is made in the long tail, drops with latency and distance. For example, suppose you are able to approve a loan in 1 minute or in 1 day. The 1-minute approval is much, much more valuable than the 1-day turnaround. Similarly, if you are able to make a decision that takes into account spatial context (whether it is based on where the user currently lives, or where they are currently visiting), that decision is much more valuable than one devoid of spatial context.
One goal of your platform should be that you can do GIS, streaming, and machine learning on data without making copies of the data. The principle above, of bringing compute to the data, should apply to GIS, streaming, and ML as well.
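As one sketch of what “GIS without copies” looks like, the function below builds a BigQuery GIS query that finds stores near a point using the warehouse’s native geography functions, so the spatial computation runs where the data already lives. The table and its `location` GEOGRAPHY column are hypothetical:

```python
def nearby_stores_sql(table: str, lon: float, lat: float, meters: float) -> str:
    """Build a BigQuery GIS query for stores within `meters` of a point.

    Runs in place on the warehouse; no data is exported to a separate
    GIS system. Table name and columns are hypothetical.
    """
    return (
        f"SELECT store_id, store_name "
        f"FROM `{table}` "
        f"WHERE ST_DWITHIN(location, ST_GEOGPOINT({lon}, {lat}), {meters})"
    )


# Stores within 5 km of a point in Mountain View:
sql = nearby_stores_sql("my-project.retail.stores", -122.08, 37.39, 5000)
```

You would pass this string to the BigQuery client’s query method; the same pattern applies to in-warehouse ML with BigQuery ML.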
On Google Cloud, you can stream data into BigQuery, and queries immediately reflect the streamed data. Even as you are streaming data into BigQuery, you can carry out time-window transformations (to take into account user and business context) in order to power real-time AI and populate real-time dashboards.
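For example, the function below builds a time-windowed aggregation over an events table that is being streamed into; because BigQuery queries see streamed rows within seconds, the result can feed a real-time dashboard directly. The table and column names are hypothetical:

```python
def per_minute_counts_sql(table: str) -> str:
    """Build a windowed query counting events per minute over the
    last hour of a streamed-into table. Table and the `event_time`
    column are hypothetical placeholders."""
    return f"""
    SELECT
      TIMESTAMP_TRUNC(event_time, MINUTE) AS minute,
      COUNT(*) AS n_events
    FROM `{table}`
    WHERE event_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 MINUTE)
    GROUP BY minute
    ORDER BY minute
    """


sql = per_minute_counts_sql("my-project.web.clickstream")
```

For windowing that must happen during ingestion (sessionization, for instance), the same logic can be expressed as Beam windows in Dataflow before the rows reach BigQuery.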
Step 4: Leapfrog with end-to-end AI Solutions
ML/AI is software, and like any software, you should consider whether to build or to buy. Google Cloud’s strategy in AI is to bring the best of Google’s AI to our customers in the form of APIs (e.g., the Vision API) and building blocks (e.g., AutoML Vision, where you can fine-tune the Vision API on your own data, with the advantage that you need much less of it).
When it comes to AI (arguably this is true of all tech, but it is particularly apparent in AI because it’s so new), every vendor seems to check all the boxes. We really encourage you to look at the quality of the underlying services. No competing natural language or text classifier comes close to the Cloud Natural Language API or AutoML Natural Language, and the same holds for our vision, speech-to-text, and other models.
We are also putting together our basic capabilities into higher-value, highly integrated solutions. Contact Center AI, where we do automated call handling, operator assistance, and call analytics as a packaged solution, is one example. So is Document AI, which ties together form parsing and knowledge extraction.
Step 5: Empower data and ML teams with scaled AI platforms
I recommend that you split your portfolio of AI problems into three categories: problems solvable with pretrained APIs, problems where building blocks such as AutoML (fine-tuned on your own data) are sufficient, and problems that call for custom models. For the first two categories, APIs and building blocks will get you most of the way. Build out a data science team to solve the AI problems that will uniquely differentiate you and give you sustainable advantage.
Once you decide to build a data science team, though, make sure that you enable them to do machine learning efficiently. This will require the ability to experiment on models using notebooks, capture ML workflows as experiments, deploy ML models using containers, and do CI/CD for continuous training and evaluation. You should use our ML Pipelines for that; they are well integrated with our data analytics platform and with Cloud AI Platform services.
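The continuous training and evaluation loop mentioned above can be sketched in a few lines: a gate that compares the live model’s metric against the metric recorded at deployment time and decides whether the training pipeline should be re-run. The metric names and tolerance are illustrative assumptions, not a prescribed policy:

```python
def should_retrain(live_metric: float, baseline_metric: float,
                   tolerance: float = 0.02) -> bool:
    """Continuous-evaluation gate: trigger the training pipeline when
    the live model's metric (e.g. AUC on recent labeled traffic) drops
    more than `tolerance` below the metric recorded at deployment."""
    return live_metric < baseline_metric - tolerance


# Deployed at AUC 0.95; live AUC has decayed to 0.90 -> retrain.
assert should_retrain(0.90, 0.95)
```

In a pipeline setup, a scheduled evaluation step would compute `live_metric` and, when this gate fires, kick off the containerized training run automatically rather than waiting for someone to notice the drift.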
At Google Cloud, we will walk with you in every step in this journey. Contact us!
Watch my talk at Cloud Next OnAir.
Here are some articles and white papers that might be useful:
- Data warehouse modernization
- MLOps: Continuous delivery and automation pipelines in machine learning
- How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
- Cloud AI Adoption Framework
For technical details, see these books.