Google Cloud Serverless Change Data Capture into BigQuery

Ash Broadley
Appsbroker CTS Google Cloud Tech Blog
8 min read · Oct 7, 2022

Google Cloud is further cementing its place as a leader in data by introducing the ability to sink data from MySQL into BigQuery in near real time.

Let’s take a little step back. In November 2021, Google Cloud announced the launch of a new service called Datastream. This enabled users to sink data from a MySQL database into Cloud Storage. That data was then available to do whatever you wanted with, namely loading it into BigQuery.

Then, in July 2022, Google released the ability to stream data from PubSub directly into BigQuery, without the use of Dataflow. This meant that if you had a way of extracting CDC-type data from your operational database into PubSub, writing it into BigQuery was a straightforward process.

You can see where this is going… put two and two together and you’d probably be right!

In mid-September 2022, Google Cloud announced a preview feature that lets you create a Datastream stream joining the extract and the sink together, making it extremely simple to load CDC-type data from MySQL into BigQuery and have it available in near real time.

Why is this important?

This matters because more and more organisations want to use their data more frequently, and with lower latency, to drive better decisions. Making operational data available in an analytics warehouse almost immediately is not an easy challenge, but with the latest announcement in Datastream it becomes a doddle.

I decided to give it a try, as this is something I am particularly interested in.

First off, we need an operational data store. So I deployed a MySQL CloudSQL instance:

Navigate to CloudSQL and hit the Create button:

CloudSQL create instance

Select MySQL as the database engine:

Select MySQL as the database engine

In the create form, there are a few bits to fill out. Enter a name for the instance and generate a password:

CloudSQL instance name and password

Note the password for later use!

Ensure the version is the latest one (in this case 8) and select Production as the configuration:

CloudSQL MySQL version and configuration

Select a region for the instance to live in. Remember this, as the BigQuery dataset we create later needs to reside in the same location. I’ve also selected “Single Zone” seeing as this is just a PoC and I don’t need HA.

CloudSQL region selection

Again, as this is a PoC, I’ve selected a small machine type to keep it cheap.

CloudSQL machine type selection

You need to ensure that point-in-time recovery is enabled on the instance, as this tells MySQL to write to the binary log (binlog), which is how CDC typically works:

Ensure point-in-time recovery is enabled!

Once you’ve filled in all the details as above (and noted the password and region!), go ahead and click the Create button.

Go grab a brew — creation takes a few minutes.

When the instance has finished being set up, you should see a solid circle with a tick inside next to the instance name, something like this:

CloudSQL instance successfully created

Once the instance has been created, we need to perform some additional configuration to enable Datastream to read the binary log and consume those lovely CDC events. To do so, we need to connect to the instance and we can do this through Cloud Shell:

Open Cloud Shell

Cloud Shell will open with the connect command automatically populated:

Connect to the instance via Cloud Shell

Hit the enter key and, when prompted, authorise Cloud Shell to use the Google Cloud APIs. Return to Cloud Shell and you may need to hit enter again. Cloud Shell will allowlist your IP address for the connection and you will then be prompted for the instance password from earlier. You should land in the MySQL client.

Enter the SQL commands below to create a user for Datastream (be sure to replace [YOUR_PASSWORD] with an actual password for the user). We also grant the new user the permissions it needs to read the data, and finally reload the privileges.

mysql> CREATE USER 'datastream'@'%' IDENTIFIED BY '[YOUR_PASSWORD]';
mysql> GRANT REPLICATION SLAVE, SELECT, RELOAD, REPLICATION CLIENT, LOCK TABLES, EXECUTE ON *.* TO 'datastream'@'%';
mysql> FLUSH PRIVILEGES;
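
As an optional sanity check before moving on, you can confirm from the same session that binary logging is enabled (the point-in-time recovery setting from earlier) and that the new user has the grants we just applied:

-- Should return log_bin = ON if point-in-time recovery is enabled
SHOW VARIABLES LIKE 'log_bin';
-- Lists the privileges granted to the Datastream user
SHOW GRANTS FOR 'datastream'@'%';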

While we’re here, we’ll create a table for Datastream to stream. Run the following SQL commands:

CREATE DATABASE IF NOT EXISTS test;
USE test;
CREATE TABLE IF NOT EXISTS test.example_table (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
text_col VARCHAR(50),
int_col INT,
created_at TIMESTAMP
);

And let’s insert some initial data:

INSERT INTO test.example_table (text_col, int_col, created_at) VALUES
('hello', 0, '2020-01-01 00:00:00'),
('goodbye', 1, NULL),
('name', -987, NOW()),
('other', 2786, '2021-01-01 00:00:00');
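
If you want to double-check that the rows landed, a quick SELECT from the same session does the job:

-- Should return the four rows inserted above
SELECT * FROM test.example_table ORDER BY id;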

Source setup complete! Now on to Datastream!

Navigate to the Datastream console and click Create!

Create new Datastream

In the configuration, give the stream a name and the ID will auto-populate:

Datastream stream name

Remember the region we deployed the CloudSQL instance into? Well, be sure to select the same region next:

Datastream stream region

Select the source type as MySQL and the destination as BigQuery:

Datastream stream source & destination types

Click the Continue button. Next we’ll configure the connection settings!

Datastream stream CloudSQL connection name

Now we tell Datastream where and how to connect to our CloudSQL instance. You can find the IP of your instance on the CloudSQL instance page, and for the Username & Password you should use the details from the MySQL commands we issued previously:

Datastream source connection details

Move on to the “Define connectivity method” section and select “IP allowlisting”. This will show the IP addresses that we’ll shortly grant access to the CloudSQL instance. The IPs will differ depending on the region in which you created the stream:

Datastream connectivity method

Copy the list of IP addresses, open a new tab/window and navigate back to the CloudSQL instances page and click Edit:

Edit the CloudSQL instance

Scroll down and expand the “Connections” section. Under the “Authorised networks” heading, click “Add Network”. Enter one of the IP addresses you copied earlier and add “/32” at the end. This converts the IP address into “CIDR” notation. Do this for each of the IP addresses you copied.

Allow each Datastream IP address in CIDR notation

Once added, click the Save button! The instance may take a short while to update with the new configuration.

Once the configuration has been applied, switch back to the Datastream tab/window and at the bottom of the connection configuration, click “Run Test”, and if the demo gods shine upon us, you should see a success message:

Datastream connection success!

If so, click the Create and Continue button. Next, you should see the options for selecting which source databases and tables to stream. Under “Select objects to include”, choose “Specific schemas and tables”. In the list that loads, select the database we created during our setup of CloudSQL:

Source selection

Hit Continue! Now we start to configure the destination for the stream by giving the connection a name:

Destination connection name

Hit Continue! Next we configure which region the BigQuery dataset should live in, and a prefix to show where the dataset originates from.

Destination dataset configuration

Now, this next part is where the real magic of Datastream and its integration with BigQuery comes in. We get to select how stale we want our data to be. For the purposes of this demo, I’ve selected “0 seconds” (as you may have guessed by the previous references to “rt” in my configuration 😏).

Data staleness selection
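
Under the hood, this staleness setting corresponds to the max_staleness option on the BigQuery tables that Datastream creates. If you later decide a little staleness is acceptable (it can reduce the cost of keeping the table fresh), you should be able to change it per table with an ALTER TABLE. A minimal sketch, using placeholder project and dataset names (point it at the table Datastream creates for you):

-- Placeholder project/dataset names; use the table Datastream created
ALTER TABLE `my-project.mysql_rt_test.example_table`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);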

Hit Continue and then Validate. If everything is going to plan, then you should see something like this:

Stream validated

Hit “Create and Start”, confirm the choice, and the stream will start to deploy. Once deployed, you should see a nice little running icon:

Datastream running!

Now we can head over to BigQuery, where we should see the dataset and table in the explorer:

Our MySQL table replicated into BigQuery!

Let’s query it! You should see the initial data we populated the database with:

We have data!
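
For reference, querying the replicated table is just standard BigQuery SQL. A minimal example, with placeholder project and dataset names (use whatever Datastream created in your project):

-- Placeholder project/dataset names; use the dataset Datastream created for you
SELECT *
FROM `my-project.mysql_rt_test.example_table`
ORDER BY id;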

Now, what’s the use in setting all this up if we’re just going to replicate the initial data? Shouldn’t we at least replicate some new data? Of course we should!

Switch back to the CloudSQL instance page and reconnect to the instance via Cloud Shell. Let’s insert some more rows! Run the following query a few times.

INSERT INTO test.example_table (text_col, int_col, created_at) VALUES
('hello', 0, '2020-01-01 00:00:00'),
('goodbye', 1, NULL),
('name', -987, NOW()),
('other', 2786, '2021-01-01 00:00:00');

Return to the BigQuery query interface and resubmit the query.

Well would you look at that! By the time we’ve returned to the BigQuery interface and resubmitted our query, the data has been replicated!
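
If you want to go beyond eyeballing the results, a quick aggregate makes the check concrete. The sketch below assumes the same placeholder project and dataset names as before, and that Datastream has added its datastream_metadata column (containing a uuid and a source_timestamp) to the replicated table, which gives a rough view of when the latest change was captured:

-- Placeholder project/dataset names; use the dataset Datastream created for you
SELECT
  COUNT(*) AS row_count,
  MAX(datastream_metadata.source_timestamp) AS latest_change
FROM `my-project.mysql_rt_test.example_table`;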

And just like that, we have real-time CDC from MySQL into BigQuery, making our operational data ready for analytics almost instantly.

Thanks for reading and following along!

About CTS

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!

