What a beautiful stream !!

Mandar Chaphalkar
Google Cloud - Community
4 min read · Jul 15, 2022
near synonyms: rivulet, brook, creek

It’s monsoon season in India .. and it has been pouring heavily in most places .. green everywhere .. streams of water everywhere .. lakes filled up .. time to go on a picnic !!

Makes one wonder how the lakes get filled so quickly .. is it the rainwater, the groundwater seepage, or the streams that feed into the lake?

And the scenario is no different in the “other type” of lake .. the data lake .. with its multiple sources of data flowing in through a variety of channels ..

As a data engineer, I have always found data extraction to be a very tricky area to deal with: considerations around latency, the type of extraction (API, file-based push from the data source, audit-column based, etc.), and, especially when the data source is a bespoke application, unending conversations with the DBAs of the application databases about the extraction methodology. Change Data Capture (CDC) tools came in as a saviour to address some of these challenges. But while the CDC technology has been around for a while, the setup & operations of many of these products have been a tedious job ..

Not anymore !

Google Cloud recently made Datastream CDC generally available ! And like many of Google Cloud’s other data analytics products, it is serverless too ! The best part is that Datastream addresses both extraction and ingestion.

Let us take a shot at creating a stream in 6 simple steps.

Step 1: Define a stream. Select the source and destination types. Currently, Oracle and MySQL are supported on the source side and Cloud Storage on the target side.

We strongly believe in meeting you where you (and your data) are ! MySQL as a source is supported on other clouds as well.

Step 2: Create a connection to the MySQL source and test it.
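As a quick sanity check outside the console, you can also verify from your own machine that the source is reachable and that binary logging is set up the way Datastream’s MySQL CDC expects (row-based binlog). A minimal sketch, assuming the mysql-connector-python client and placeholder host, user and database names:

```python
# Minimal connectivity / binlog sanity check for the MySQL source.
# Host, user, password and database below are placeholders (assumptions).
import mysql.connector

conn = mysql.connector.connect(
    host="10.0.0.5",          # source host reachable from Datastream
    user="datastream_user",   # hypothetical CDC user
    password="change-me",
    database="demo_db",
)
cur = conn.cursor()

# Datastream's MySQL CDC reads from the binary log in ROW format.
cur.execute("SHOW VARIABLES LIKE 'log_bin'")
print(cur.fetchone())         # expect ('log_bin', 'ON')
cur.execute("SHOW VARIABLES LIKE 'binlog_format'")
print(cur.fetchone())         # expect ('binlog_format', 'ROW')

cur.close()
conn.close()
```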

Step 3: Here is an interesting thing .. the selection of the schemas and tables (including future ones) is a breeze .. simplicity, despite this not being a day-to-day activity for the data engineer !

Step 4 & 5: Create a connection to the target bucket/folder where the delta data is to be ingested. Currently, avro and JSON (including a compression feature) file formats are supported. This bucket/folder location could typically be the raw zone on the data lake.

Step 6: Review the stream .. clear for take off !

Let’s take a view of things at cruising altitude.

The stream shows a “running” status now.

Hey, we need to do an initial load of the existing data ! No problem, Datastream does support an initial load with the initiate backfill feature.

Let us put Datastream through a test for the core scenarios: Insert, Update & Delete. In the example below, we execute an initial load for the insert operation .. this is followed by an update to one of the records .. followed by a delete operation of the same record via the backend.
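From the source side, that test is just the three statements fired against the database. A small sketch, using a hypothetical customers table and the same placeholder credentials as above (table name and columns are illustrative, not from the original setup):

```python
# Drive the three core CDC scenarios against the MySQL source.
# Table, columns and credentials are illustrative placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="10.0.0.5", user="datastream_user",
    password="change-me", database="demo_db",
)
cur = conn.cursor()

# 1. Insert — picked up by the initial load / backfill.
cur.execute(
    "INSERT INTO customers (id, name, city) VALUES (%s, %s, %s)",
    (1, "Asha", "Pune"),
)

# 2. Update the same record — captured from the binlog as a change event.
cur.execute("UPDATE customers SET city = %s WHERE id = %s", ("Mumbai", 1))

# 3. Delete the record — also captured as a change event.
cur.execute("DELETE FROM customers WHERE id = %s", (1,))

conn.commit()
cur.close()
conn.close()
```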

So how do we handle the delta data from MySQL on the destination side? Datastream creates a partitioned folder structure in the target bucket by date and time. All three database operations are captured as separate JSON files in three different folders.
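To see that layout for yourself, you can list the stream’s output path and peek at the events. The sketch below assumes the google-cloud-storage client, a placeholder bucket and prefix, and newline-delimited JSON events carrying a source_metadata.change_type field and a source_timestamp; the exact field names are worth double-checking against your own files.

```python
# List the date/time-partitioned output of the stream and print each
# event's change type. Bucket and prefix are placeholders.
import json
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-datalake-raw")         # hypothetical raw-zone bucket
prefix = "datastream/demo_db_customers/"          # stream output root

for blob in client.list_blobs(bucket, prefix=prefix):
    # Paths look roughly like .../2022/07/15/10/05/<file>
    print(blob.name)
    for line in blob.download_as_text().splitlines():
        event = json.loads(line)
        meta = event.get("source_metadata", {})
        print("  ", meta.get("change_type"), event.get("source_timestamp"))
```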

The delta data is staged in multiple folders, so is there an easier way to handle this for target load/processing?

There are multiple ways to solve this. I prefer BigLake here. While BigLake is an all-encompassing platform in itself, here we have a BigLake table sitting over the root folder with all the delta data.
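As a rough sketch of what that looks like in BigQuery DDL (executed here through the Python client), assuming a pre-created Cloud Resource connection, a dataset named raw_zone, and the same placeholder bucket — all names are illustrative:

```python
# Create a BigLake (connection-backed external) table over the stream's
# root folder. Project, dataset, connection and bucket are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE OR REPLACE EXTERNAL TABLE `my-project.raw_zone.customers_cdc`
WITH CONNECTION `my-project.us.biglake-conn`
OPTIONS (
  format = 'JSON',
  uris = ['gs://my-datalake-raw/datastream/demo_db_customers/*']
)
"""
client.query(ddl).result()   # runs the DDL; returns once the table exists
```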

As we can see, there are three rows corresponding to the three different instances of the data record. Having the timestamp, change type and other operational CDC metadata further helps to propagate the change capture on to the target table.
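A common pattern for that propagation is to keep only the latest event per key and MERGE it into the serving table. The sketch below is one way to do it, assuming an id primary key and the hypothetical tables and field names from the previous snippets; verify the metadata column names against your own data before relying on it.

```python
# Collapse the CDC events to the latest state per key and merge them
# into a serving table. Table names and columns are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

merge_sql = """
MERGE `my-project.curated.customers` AS t
USING (
  SELECT * EXCEPT(rn) FROM (
    SELECT
      payload.id   AS id,
      payload.name AS name,
      payload.city AS city,
      source_metadata.change_type AS change_type,
      ROW_NUMBER() OVER (
        PARTITION BY payload.id
        ORDER BY source_timestamp DESC
      ) AS rn
    FROM `my-project.raw_zone.customers_cdc`
  )
  WHERE rn = 1
) AS s
ON t.id = s.id
WHEN MATCHED AND s.change_type = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
WHEN NOT MATCHED AND s.change_type != 'DELETE' THEN
  INSERT (id, name, city) VALUES (s.id, s.name, s.city)
"""
client.query(merge_sql).result()
```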

In conclusion, Datastream has made CDC implementation extremely simple, agile and intuitive. The serverless nature of the technology encapsulates the entire complexity & effort of ops. The data engineer within me has certainly been empowered ;-)
