Cloud technologies have made powerful computing infrastructure accessible to everyone. From large corporations to one-person startups, organisations are moving high performance workloads such as data analytics and machine learning to the cloud. Naturally, public cloud providers have been offering solutions tailored to data engineering problems, and Google is at the forefront with its cloud offering, Google Cloud Platform (GCP), and its philosophy of fully managed services. The tools available on GCP for end-to-end data processing can be broadly classified under the four stages of the data lifecycle: ingest, store, process & analyse, and explore & visualise.
Data can be ingested from multiple sources. User data can be captured from applications hosted on compute services such as App Engine and Compute Engine, machine data can be captured via Stackdriver Logging, and data from IoT devices can be ingested through a serverless messaging service such as Cloud Pub/Sub. GCP also addresses the challenge of migrating bulk data, whether from other cloud platforms such as AWS or from on-premises systems, through services like Storage Transfer Service and Transfer Appliance, giving users easy and secure data ingestion into the cloud.
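To make the Pub/Sub part of this concrete, here is a small sketch of decoding the JSON envelope that Pub/Sub delivers to a push subscriber, using only the standard library (no network call is made; the sample envelope, device payload and subscription names are made up for illustration). The payload always arrives base64-encoded under `message.data`:

```python
import base64
import json


def decode_push_message(envelope):
    """Decode the payload and attributes from a Pub/Sub push envelope.

    Pub/Sub POSTs push messages as JSON, with the payload
    base64-encoded under message.data.
    """
    message = envelope["message"]
    data = base64.b64decode(message["data"]).decode("utf-8")
    attributes = message.get("attributes", {})
    return data, attributes


# A minimal envelope shaped like the one Pub/Sub would POST to a
# push endpoint (values here are invented for the example).
envelope = {
    "message": {
        "data": base64.b64encode(
            b'{"device_id": "sensor-42", "temp_c": 21.5}'
        ).decode("ascii"),
        "attributes": {"origin": "iot-gateway"},
        "messageId": "1234567890",
    },
    "subscription": "projects/my-project/subscriptions/my-sub",
}

payload, attrs = decode_push_message(envelope)
print(json.loads(payload)["device_id"])  # sensor-42
```

In a real push endpoint this function would sit inside an HTTP handler (for example on App Engine), with the envelope coming from the request body.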
Once the data has been ingested, it can be stored in a storage option appropriate to the use case. Primary filtering can be done on factors such as structured versus unstructured, SQL versus NoSQL, and relational versus non-relational. Storage options on GCP are either proprietary, such as BigQuery and Cloud Spanner, or based on open source solutions, such as Cloud SQL. Bigtable is a highly scalable, non-relational NoSQL database that, although proprietary, exposes an Apache HBase-compatible API. Other forms of data have their own homes: machine generated data can be retained in Stackdriver, streaming data can be served through Cloud Pub/Sub, and batch data can be stored in Cloud Storage.
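The primary filtering described above can be sketched as a tiny decision function. This is a deliberate oversimplification, a hypothetical helper rather than an official mapping; real workloads also weigh latency, cost, consistency and ecosystem fit:

```python
def suggest_gcp_storage(structured, relational, needs_sql, scale="regional"):
    """Map coarse data characteristics to a GCP storage option.

    A rough, illustrative decision tree only; the product choices
    mirror the categories discussed in the text.
    """
    if not structured:
        return "Cloud Storage"       # objects: batch files, media, backups
    if relational and needs_sql:
        # Cloud Spanner for globally distributed scale, Cloud SQL otherwise.
        return "Cloud Spanner" if scale == "global" else "Cloud SQL"
    if not relational:
        return "Cloud Bigtable"      # wide-column NoSQL, HBase-compatible API
    return "BigQuery"                # analytical warehouse


print(suggest_gcp_storage(structured=False, relational=False, needs_sql=False))
# Cloud Storage
print(suggest_gcp_storage(structured=True, relational=True,
                          needs_sql=True, scale="global"))
# Cloud Spanner
```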
After the data is stored, GCP provides an array of tools covering almost every use case for processing and analysing it into useful insights. ETL jobs, depending on the flavour of choice, can be executed on Cloud Dataproc, an autoscaling, highly available service for the Hadoop ecosystem, or on Cloud Dataflow, which runs embarrassingly parallel workloads using the Apache Beam framework. GCP also provides services tailored to specific tasks: data cleaning, for example, is handled by Cloud Dataprep, which can generate Cloud Dataflow pipelines from its intuitive UI alone. Complete workflow orchestration is possible via Cloud Composer, a managed Apache Airflow service that lets the end user schedule workflows as directed acyclic graphs. A similar solution is also available in Cloud Data Fusion, which requires writing zero lines of code.
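"Embarrassingly parallel" means each record can be processed independently of every other record, which is exactly the property Dataflow exploits when it fans a Beam `Map`/`ParDo` step out across autoscaled workers. The sketch below illustrates that pattern locally with only the standard library, using a thread pool as a stand-in for Dataflow's workers (it is not Beam code itself; the cleaning function is an invented example):

```python
from concurrent.futures import ThreadPoolExecutor


def clean_record(line):
    """Element-wise transform: trim whitespace, lowercase, drop empties.

    Because each record is handled independently, the work is
    embarrassingly parallel and can be distributed freely.
    """
    line = line.strip().lower()
    return line or None


raw = ["  Alice ", "BOB", "   ", "Carol\n"]

# Locally, a thread pool stands in for Dataflow's fleet of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    cleaned = [r for r in pool.map(clean_record, raw) if r is not None]

print(cleaned)  # ['alice', 'bob', 'carol']
```

In Beam proper, the same transform would be expressed declaratively (roughly `lines | beam.Map(clean_record)`), and the runner, not the author, decides how many workers execute it.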
On the machine learning front, GCP offers three options depending on the level of customisation needed by the end user. Cloud AI Platform is an integrated platform on which users can train, validate, evaluate and serve predictions from a custom model without having to worry about the underlying infrastructure; it supports multiple machine learning frameworks, including scikit-learn, Keras, TensorFlow and PyTorch, and can also deploy Kubeflow pipelines. Cloud AutoML uses transfer learning to train semi-custom models on users' data, again with no infrastructure to manage. Finally, GCP enables users to add intelligence to their applications using pre-trained machine learning models served through APIs.
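As one example of the pre-trained tier, the sketch below builds (but does not send) a request body for the Cloud Vision API's `images:annotate` endpoint, where images are passed inline as base64. The image bytes here are fake, and actually POSTing this body would require an authenticated HTTP client, which is omitted:

```python
import base64
import json


def build_vision_request(image_bytes, max_results=5):
    """Build the JSON body for a Cloud Vision images:annotate call.

    Only constructs the request; sending it to
    https://vision.googleapis.com/v1/images:annotate needs auth.
    """
    return {
        "requests": [
            {
                "image": {
                    "content": base64.b64encode(image_bytes).decode("ascii")
                },
                "features": [
                    {"type": "LABEL_DETECTION", "maxResults": max_results}
                ],
            }
        ]
    }


# Stand-in bytes; a real call would read an actual image file.
body = build_vision_request(b"\x89PNG fake image bytes")
print(json.dumps(body)[:60])
```

The response would contain label annotations with confidence scores, letting an application tag images without training any model of its own.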
GCP also offers hosted private Jupyter notebooks in the cloud in the form of Cloud Datalab, which spins up a Compute Engine instance in the background to provide the necessary computational power. Finally, the data can be exported to Google Sheets or Data Studio for exploration and visualisation.
— — — — — — — — — — — — — — — — — — — — — — — — — —
This write-up was heavily inspired by https://cloud.google.com/solutions/data-lifecycle-cloud-platform
The goal is to write more expansively on each of the phases and also cover certain favourite products such as BigQuery and Cloud Dataflow.