Data and Analytics on Google Cloud Platform

Srivatsan Srinivasan
8 min read · Jun 27, 2019

This article gives an overview of the data and analytics services available on Google Cloud Platform. One of the key strengths of Google Cloud Platform is its data and analytics capabilities, specifically offerings such as BigQuery, Cloud Data Fusion, Cloud Dataflow, Cloud Bigtable and Cloud Dataprep (managed by Trifacta).

If you are looking for Artificial Intelligence and Machine Learning services on Google Cloud Platform, refer to my article below, which is dedicated to that topic. Google Cloud offers multiple levels of abstraction for Artificial Intelligence and Machine Learning, as depicted below, a few of which I covered in detail in that previous article.

Getting back on track, the core focus of this article is data and advanced analytics. It provides a high-level view of GCP services and where they fit in the data lifecycle, from ingestion and curation through to exploration. It also goes into detail on some recently introduced capabilities as well as a few existing services.

Given the breadth of GCP's data products, it is very difficult to cover them all at once, so consider this an introductory article on selected services. The reference section has links with more detail on the information presented here.

To start with, why do big data analytics on the cloud in general?

  • Cloud lets you focus on solving business problems rather than worrying about infrastructure
  • Keep up with innovation through seamless access to new releases and features of open source and managed services
  • Faster time to market and faster value generation from data applications
  • Greater agility and relatively less product lock-in through pay-per-use services
  • Hybrid cloud opens up new possibilities to expand private compute into the cloud during usage spikes
  • Easy access to open source components and partner solutions, allowing multi-cloud portability
  • The boring benefit: cutting costs or keeping them low (subjective, though)

At each stage of the data lifecycle, Google Cloud Platform provides services that you can select from, tailored to your data application's needs and workflow. Below is a view of GCP data and analytics services. I have intentionally left out services such as Cloud Firestore and a few others, as I felt they might not be the right fit for analytics-intensive applications. Partner products are also not included unless they offer native integration with Google Cloud Platform services.

Below is a detailed view covering Data Management, Data Governance and other core aspects of data products, such as:

Metadata Management — Increase visibility of data assets and make data more accessible to everyone

Data Security — Protect data from unauthorized access. Profile and classify sensitive information and comply with regulations pertaining to data security

Let us now compare a few of the services in Google Cloud Platform, starting with storage. With a good number of storage options available, below is a quick comparison of the storage options on GCP along with the use cases each might fit well.

Cloud Dataproc vs Cloud Dataflow

Going past storage, one question that frequently comes up is when to use Cloud Dataproc (a fully managed service for open source Spark/Hadoop) versus Cloud Dataflow (a fully managed serverless service for transforming and enriching batch and real-time data).

Below are some considerations for using Cloud Dataproc:

  • You have a substantial investment in Apache Spark or Hadoop on premises and are considering a move to the cloud
  • You are looking at hybrid cloud and need portability across private/multi-cloud environments
  • Spark is the primary machine learning tool and platform in your current environment
  • Your code depends on custom packages and also needs distributed computing

Now, when to use Cloud Dataflow:

  • None of the considerations listed above for Cloud Dataproc are relevant
  • You need a unified streaming and batch pipeline
  • You are using it as a pre-processing pipeline for an ML model that can be deployed on GCP AI Platform Training (earlier called Cloud ML Engine)
  • You want a simpler DevOps pipeline (not a substantial benefit, though)
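
The two checklists above can be sketched as a small decision helper. This is only an illustration under my own assumptions: the class name, the enum values and the rule that any single Dataproc consideration tips the choice are hypothetical, not a GCP API or official guidance.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper: encodes the Dataproc/Dataflow checklists as a simple rule.
// Names and logic are illustrative assumptions, not part of any GCP library.
final class ProcessingServiceChooser {

    enum Need {
        EXISTING_SPARK_OR_HADOOP_INVESTMENT,
        HYBRID_OR_MULTI_CLOUD_PORTABILITY,
        SPARK_IS_PRIMARY_ML_PLATFORM,
        CUSTOM_PACKAGES_WITH_DISTRIBUTED_COMPUTE,
        UNIFIED_STREAMING_AND_BATCH,
        PREPROCESSING_FOR_AI_PLATFORM_TRAINING
    }

    // Considerations that point toward Cloud Dataproc.
    private static final Set<Need> DATAPROC_SIGNALS = EnumSet.of(
            Need.EXISTING_SPARK_OR_HADOOP_INVESTMENT,
            Need.HYBRID_OR_MULTI_CLOUD_PORTABILITY,
            Need.SPARK_IS_PRIMARY_ML_PLATFORM,
            Need.CUSTOM_PACKAGES_WITH_DISTRIBUTED_COMPUTE);

    // Returns Dataproc if any Dataproc consideration applies, otherwise Dataflow.
    static String recommend(Set<Need> needs) {
        for (Need need : needs) {
            if (DATAPROC_SIGNALS.contains(need)) {
                return "Cloud Dataproc";
            }
        }
        return "Cloud Dataflow";
    }

    public static void main(String[] args) {
        // A streaming-first team with no Spark estate.
        System.out.println(recommend(EnumSet.of(Need.UNIFIED_STREAMING_AND_BATCH)));
        // A team with an existing Spark investment that also wants streaming.
        System.out.println(recommend(EnumSet.of(
                Need.EXISTING_SPARK_OR_HADOOP_INVESTMENT,
                Need.UNIFIED_STREAMING_AND_BATCH)));
    }
}
```

In practice the choice is rarely this binary, so treat the helper as a conversation starter rather than a rule.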

Lifting and shifting on-premises Hadoop/Spark workloads to Cloud Dataproc is a simple three-step process, as highlighted below.

Cloud Dataprep

Cloud Dataprep is a serverless data preparation tool operated by Trifacta. It helps with rapid exploration, cleaning and preparation of data through a visual point-and-click interface.

Cloud Dataprep sits between GCP storage, the processing environment and downstream analytical tools/applications. Dataprep uses Cloud Dataflow under the hood and is natively integrated with Google Cloud services such as Cloud Storage and BigQuery. You can also upload custom datasets.

Below is a simple depiction of how Cloud Dataprep fits into the data exploration and curation flow.

While Dataprep can help with wrangling, processing and cleaning structured and unstructured data, one of its more interesting features is Visual Profiling, which helps you explore and interpret large volumes of data.

Below are some sample screens of the Dataprep profiler, which can assist in rapid exploration of a dataset and enable quick assessment of problems, unusual patterns and changes that your data might require.

BigQuery

Google BigQuery is a fully managed enterprise data warehouse for storing and querying massive datasets, even at petabyte scale. BigQuery lets you focus on the code and leaves the rest to Google's infrastructure.

BigQuery integrates seamlessly with most GCP services and tools, such as TensorFlow, Cloud Dataflow and Google Data Studio, among others. BigQuery Data Transfer Service automates data movement from 100+ SaaS applications to BigQuery on a scheduled, managed basis. There are even connectors available in the Google Cloud marketplace that allow data movement from Amazon S3 to BigQuery.

BigQuery supports streaming ingestion and has built-in machine learning capability via BigQuery ML.
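
To give a feel for BigQuery ML, here is a minimal sketch that assembles the kind of SQL it accepts: a CREATE MODEL statement to train a model and an ML.PREDICT query to score new rows. The Java class and the dataset, table and column names are hypothetical placeholders of mine; the generated statements would be run from the BigQuery console or a client library, which is not shown here.

```java
// Hypothetical helper that assembles BigQuery ML statements as plain SQL text.
// Dataset, table and column names passed in are placeholders chosen by the caller.
final class BigQueryMlSql {

    /** Builds a CREATE MODEL statement that trains a linear regression model. */
    static String createModel(String model, String labelColumn, String sourceTable) {
        return "CREATE OR REPLACE MODEL `" + model + "`\n"
                + "OPTIONS (model_type = 'linear_reg', input_label_cols = ['" + labelColumn + "']) AS\n"
                + "SELECT * FROM `" + sourceTable + "`";
    }

    /** Builds a query that scores new rows with a trained model via ML.PREDICT. */
    static String predict(String model, String inputTable) {
        return "SELECT * FROM ML.PREDICT(MODEL `" + model + "`,\n"
                + "  (SELECT * FROM `" + inputTable + "`))";
    }

    public static void main(String[] args) {
        // Print both statements for an assumed app-ratings dataset.
        System.out.println(createModel("my_dataset.rating_model", "rating", "my_dataset.reviews"));
        System.out.println(predict("my_dataset.rating_model", "my_dataset.new_reviews"));
    }
}
```

The point is that both training and prediction stay inside SQL, so no data has to leave the warehouse.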

In my view, BigQuery is unique and a key offering on GCP. Below is a short video explaining BigQuery; plenty of documentation is also available on the GCP site.

Cloud Bigtable is a NoSQL service that supports low-latency querying and is massively scalable. Refer to the documentation in the reference section below for more details.

Let us now look at some of the new features that were recently announced at the Google Cloud Next ’19 event.

Cloud Data Fusion

Built on the open source CDAP project, Data Fusion features a visual point-and-click interface that enables code-free development of ETL pipelines. Data Fusion comes with a broad open source library of connectors and transformations, shifting the focus away from code and toward insights and action.

Below is a quick sample of what a Data Fusion pipeline looks like, along with the features of Cloud Data Fusion.

Data Catalog

Data Catalog helps enterprises quickly discover, manage and understand their data assets. Data Catalog offers a simple and easy-to-use search interface for data discovery and offers a flexible and powerful cataloging system for capturing technical and business metadata.

Cloud Dataflow SQL and Dataflow FlexRS

Tired of writing Java or Python code? Dataflow SQL helps you build Dataflow pipelines using familiar SQL, with support for both batch and stream data processing.

Dataflow Flexible Resource Scheduling (FlexRS) offers cost benefits for batch workloads. If you are processing data that is not time-sensitive, you can benefit from preemptible resource pricing.

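With Dataflow SQL, complex logic can be expressed as simple SQL instead of custom Java or Python functions. For example, a join and aggregation embedded in a Java pipeline:

// app_ratings is assumed to hold two tables tagged "Apps" and "Reviews"
PCollection<Row> output = app_ratings.apply(
    SqlTransform.query(
        "SELECT Apps.appId, COUNT(Reviews.rating), AVG(Reviews.rating) "
            + "FROM Apps INNER JOIN Reviews ON Apps.appId = Reviews.appId "
            + "GROUP BY Apps.appId"));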

BigQuery BI Engine

BigQuery BI Engine provides an extremely fast, in-memory analysis service for BigQuery. With BI Engine, users can analyze complex datasets interactively in Google Data Studio, with sub-second query response times and high concurrency.

A few key features of BI Engine include:

  • Real-time dashboarding over streaming or static data with sub-second query response
  • Standard SQL, which eliminates the need to move data or create complex data transformation pipelines
  • Serverless, with support for high concurrency

Connected Sheets

Connected Sheets provides a spreadsheet-like interface that combines the simplicity of a spreadsheet with the power of BigQuery. There are no row limits with Connected Sheets: it works with the full dataset from BigQuery, whether that is millions or even billions of rows of data.

Analysis in Connected Sheets can be done with formulas, pivot tables and charts, and the results can be visualized as dashboards and shared with anyone within an organization.

Refer to the first link in the reference section for details on new launches in the data space on Google Cloud Platform.

To finish up this article, below is some of the partner ecosystem of data products available on Google Cloud Platform.

Note: Some of the images used are taken from the Google Cloud website and sessions. A few of the new launches mentioned might not yet be generally available.

References:
