How to Automate and Scale SQL Capabilities Using Google Cloud Platform

Dominic Imbuga · Published in DevCNairobi · Jul 23, 2021

The GCP network: the largest of its kind on Earth.

You may have noticed that in our previous article demo at https://link.medium.com/n8UPtodv7hb we ran all our SQL commands in a Termux terminal.

How can we scale this up? Can this approach handle terabytes of data? You may be wondering. Of course not, but the good news is that all the commands we have been writing can be used in the same way on the cloud.

A little about me: I’m a cloud engineer by training, thanks to the Google Africa Developer Scholarship 2020 program, where I was a student, and the 2021 program, where I’m a mentor for the Associate Cloud Engineer track.

This journey has completely changed my perspective and taught me that if I don’t embrace the mindset of being a lifelong learner, I can have a career but I will not be happy with it.

Through the scholarship, I learned about implementing, deploying, migrating and maintaining applications in the cloud. I came to understand different Google Cloud Platform (GCP) services and features, and I'm able to apply that knowledge to analyse requirements, evaluate different options and build my own services.

Why not use this skill set to accelerate my machine learning journey? I thought, especially if using a managed service allows me to outsource much of the administration and maintenance overhead to Google.

In this article we will architect our own data platform infrastructure on GCP and move away from PostgreSQL on Termux. Let us think globally and develop locally; we have to keep the bigger picture for our kiwanda (factory) data infrastructure in mind, right?

We definitely want to choose managed services to make our work easier; this will enable us to focus on first things first, which is uncovering meaningful insights from data using SQL. Our upgraded tool list for this cloud job will be:

• BigQuery

• Cloud Dataflow

• Cloud Dataprep

• Cloud Dataproc

BigQuery

BigQuery is GCP's serverless, highly scalable and cost-effective cloud data warehouse. It is a petabyte-scale data warehouse that allows for super-fast queries using the processing power of Google's infrastructure.

Because there is no infrastructure for you to manage, you can focus on uncovering meaningful insights using familiar SQL, without the need for a database administrator.

BigQuery is used by all types of organisations, and there is a free usage tier to help you get started. You can access BigQuery through the GCP console, through the command-line tool, or by making calls to the BigQuery REST API using a variety of client libraries such as Python, Java or .NET. There are also third-party services that help with data visualisation or loading data.

A query over a table of more than 100 billion rows (about 4.1 TB) takes less than a minute to execute on BigQuery! It would take hours or days through serial execution like we have been doing on our own dataset.
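To make this concrete, here is a minimal sketch of running such a query through the Python client library mentioned above. It is only an illustration under assumptions: the public Wikipedia benchmark table name and the default credentials setup are assumptions, so swap in any large table and project of your own.

```python
# A minimal sketch: running SQL on BigQuery from Python.
# Assumes the google-cloud-bigquery library is installed and that
# application-default credentials are configured
# (e.g. `gcloud auth application-default login`).
from google.cloud import bigquery

# The client picks up your default GCP project from the environment.
client = bigquery.Client()

# The table below is an assumption for illustration (a public Wikipedia
# benchmark sample); the SQL itself is the same Standard SQL we have been
# writing in Termux. Note that BigQuery bills by the columns it scans.
query = """
    SELECT title, SUM(views) AS total_views
    FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`
    WHERE title LIKE '%Nairobi%'
    GROUP BY title
    ORDER BY total_views DESC
    LIMIT 10
"""

# BigQuery distributes the work across Google's infrastructure for us;
# we only wait for the result rows.
for row in client.query(query).result():
    print(row.title, row.total_views)
```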

Cloud Dataflow

Cloud Dataflow is a fully managed service for executing a wide variety of data processing patterns: it transforms and enriches data in stream and batch modes with equal reliability and expressiveness.

This data could come from other GCP services like Cloud Datastore or Pub/Sub, or be ingested from third-party services. After you transform data in Cloud Dataflow, you can analyse it in BigQuery, AI Platform or even Cloud Bigtable.
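As a rough illustration, here is a minimal sketch of a batch pipeline written with the Apache Beam Python SDK, which is what Cloud Dataflow executes. The project, bucket and table names are placeholders, not real resources, and a real pipeline would add proper parsing and error handling.

```python
# A minimal sketch of a Beam pipeline that Cloud Dataflow can run.
# Assumes apache-beam[gcp] is installed; all resource names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",          # use "DirectRunner" to test locally first
    project="my-project",             # placeholder: your GCP project ID
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read raw files" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv")
        | "Parse CSV line" >> beam.Map(
            lambda line: dict(zip(["id", "amount"], line.split(",")))
        )
        | "Drop empty rows" >> beam.Filter(lambda row: row["amount"].strip() != "")
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.cleaned_rows",
            schema="id:STRING,amount:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

The same pipeline code runs unchanged in streaming mode if the source is swapped for Pub/Sub, which is what makes Dataflow attractive for mixed batch and stream workloads.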

Cloud Dataprep

Cloud Dataprep is an intelligent data service for visually exploring, cleaning and preparing structured and unstructured data for analysis, reporting and machine learning.

With automatic detection of schemas, data types, possible joins and anomalies, you can skip time-consuming data profiling and focus on data analysis.

Cloud Dataprep is fully managed and scales on demand to meet your data preparation needs, so you can focus on analysis. The refined data can then be exported to BigQuery or Cloud Storage for analysis and machine learning.

Cloud Dataproc

Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler way.

Without Cloud Dataproc, it can take from 5 to 30 minutes to create Spark and Hadoop clusters on premises or through other IaaS providers.

Cloud Dataproc clusters are quick to start, scale and shut down, with each of these operations taking less than 90 seconds on average. Built-in integration with BigQuery, Cloud Storage and Cloud Bigtable provides you with a complete data platform.
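For a feel of how little ceremony this involves, here is a minimal sketch that creates a small cluster with the google-cloud-dataproc Python client. The project ID, region, cluster name and machine types are all assumptions chosen for illustration.

```python
# A minimal sketch: creating a small Dataproc cluster from Python.
# Assumes the google-cloud-dataproc library is installed; the project,
# region and cluster names below are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"        # placeholder: your GCP project ID
region = "europe-west1"
cluster_name = "kiwanda-spark"   # placeholder cluster name

# Dataproc clients are regional, so point the client at that region's endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

# create_cluster returns a long-running operation; .result() blocks until
# the cluster is ready, after which Spark or Hadoop jobs can be submitted.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```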

I hope this article made you feel more comfortable using different GCP data tools to automate and scale your data platform, getting it ready for machine learning. Now it's your turn: go ahead and apply what you have learnt in this article by architecting your own data infrastructure on GCP.
