Google Cloud Serverless Change Data Capture into BigQuery
Google Cloud are further cementing their place as a leader in data by introducing the ability to sink data from MySQL, in near-real-time, into BigQuery.
Let’s take a little step back. Back in November 2021, Google Cloud announced the launch of a new service called Datastream. This enabled users to sink data from a MySQL database into Cloud Storage. That data was then available to do whatever you wanted with — namely load this data into BigQuery.
Then, in July 2022, Google released the ability to stream data from Pub/Sub directly into BigQuery, without the use of Dataflow. This meant that if you had a way of extracting CDC-type data from your operational database into Pub/Sub, writing it into BigQuery was a straightforward process.
You can see where this is going… put two and two together and you’d probably be right!
In mid-September 2022, Google Cloud announced a preview feature enabling a user to create a single Datastream stream that joins the extract and sink together, creating an extremely simple experience for loading CDC-type data from MySQL into BigQuery and having it available in near-real-time.
Why is this important?
This matters because more and more organisations want to use their data more frequently, and with lower latency, to drive better decisions. Making operational data available in an analytics warehouse almost immediately is not an easy challenge — but with the latest Datastream announcement, it becomes a doddle.
I decided to give it a try, as this is something I am particularly interested in.
First off, we need an operational data store. So I deployed a MySQL CloudSQL instance:
Navigate to CloudSQL and hit the Create button:
Select MySQL as the database engine:
In the create form, there are a few bits to fill out. Enter a name for the instance and generate a password:
Note the password for later use!
Ensure the version is the latest one (in this case 8) and select Production as the configuration:
Select a region for the instance to live in. Remember this, as the BigQuery dataset we create later needs to reside in the same location. I’ve also selected “Single Zone” seeing as this is just a PoC and I don’t need HA.
Again, as this is a PoC, I’ve selected a small machine type to keep it cheap.
You need to ensure that point-in-time recovery is enabled on the instance, as this turns on the MySQL binary log (binlog), which is what Datastream reads to capture change events:
Once you’ve filled in all the details as above (and noted the password and region!), go ahead and click the Create button.
Go grab a brew — creation takes a few minutes.
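If you’d rather script this than click through the console, the same instance can be provisioned with gcloud. A sketch, with the instance name, tier, region and password as assumptions you should change:

```shell
# Create a small MySQL 8 Cloud SQL instance for the PoC.
# --enable-bin-log turns on point-in-time recovery, i.e. the binlog
# that Datastream needs for CDC.
gcloud sql instances create mysql-cdc-poc \
  --database-version=MYSQL_8_0 \
  --tier=db-g1-small \
  --region=europe-west2 \
  --root-password='[YOUR_ROOT_PASSWORD]' \
  --enable-bin-log
```

The region matters here for the same reason as in the console flow: the BigQuery dataset we create later needs to reside in the same location.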
When the instance has finished being set up, you should see a solid circle with a tick inside next to the instance name, something like this:
Once the instance has been created, we need to perform some additional configuration to enable Datastream to read the binary log and consume those lovely CDC events. To do so, we need to connect to the instance and we can do this through Cloud Shell:
Cloud Shell will open with the connect command automatically populated:
Hit the enter key and, when prompted, authorise Cloud Shell to use the Google Cloud APIs. Return to Cloud Shell; you may need to hit enter again. Cloud Shell will allowlist your IP address for the connection and then prompt you for the instance password from earlier. You should land in the MySQL client.
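For reference, the command Cloud Shell pre-populates looks like this (the instance name is the one you chose when creating it; mine is assumed below):

```shell
# Connect to the instance as root; you'll be prompted for the password.
# Cloud Shell temporarily allowlists your IP before opening the session.
gcloud sql connect mysql-cdc-poc --user=root
```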
Enter the SQL commands below to create a user for Datastream (be sure to replace [YOUR_PASSWORD] with an actual password for the user). We also grant the permissions the new user needs to read the data and the binlog, and finally reload the grants.
mysql> CREATE USER 'datastream'@'%' IDENTIFIED BY '[YOUR_PASSWORD]';
mysql> GRANT REPLICATION SLAVE, SELECT, RELOAD, REPLICATION CLIENT, LOCK TABLES, EXECUTE ON *.* TO 'datastream'@'%';
mysql> FLUSH PRIVILEGES;
While we’re here, we’ll create a table for Datastream to stream. Run the following SQL commands:
CREATE DATABASE IF NOT EXISTS test;
USE test;
CREATE TABLE IF NOT EXISTS test.example_table (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  text_col VARCHAR(50),
  int_col INT,
  created_at TIMESTAMP
);
And let’s insert some initial data:
INSERT INTO test.example_table (text_col, int_col, created_at) VALUES
('hello', 0, '2020-01-01 00:00:00'),
('goodbye', 1, NULL),
('name', -987, NOW()),
('other', 2786, '2021-01-01 00:00:00');
Source setup complete! Now on to Datastream!
Navigate to the Datastream console and click Create!
In the configuration, give the stream a name and the ID will auto-populate:
Remember the region we deployed the CloudSQL instance into? Well, be sure to select the same region next:
Select the source type as MySQL and the destination as BigQuery:
Click the Continue button. Next we’ll configure the connection settings!
Now we’re going to tell Datastream where and how to connect to our CloudSQL instance. You can find the IP of your instance on the CloudSQL instance page, and for the Username & Password you should use the details from the MySQL commands we issued previously:
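If you’re automating this, the connection profile can also be created with gcloud. A sketch, assuming static-IP (allowlist) connectivity; the profile ID, display name, location and credentials are all placeholders:

```shell
# Create a MySQL source connection profile for Datastream.
# Uses the datastream user we created on the instance earlier.
gcloud datastream connection-profiles create mysql-source-profile \
  --location=europe-west2 \
  --type=mysql \
  --display-name="MySQL CDC source" \
  --mysql-hostname='[INSTANCE_PUBLIC_IP]' \
  --mysql-port=3306 \
  --mysql-username=datastream \
  --mysql-password='[YOUR_PASSWORD]' \
  --static-ip-connectivity
```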
Move on to the “Define connectivity method” section and select “IP allowlisting”. This shows the IP addresses that we’ll shortly allow access to the CloudSQL instance. The IPs will differ depending on the region in which you created the stream:
Copy the list of IP addresses, open a new tab/window and navigate back to the CloudSQL instances page and click Edit:
Scroll down and expand the “Connections” section. Under the “Authorised networks” heading, click “Add Network”. Enter one of the IP addresses you copied earlier and add “/32” at the end; this expresses the single address in CIDR notation. Do this for each of the IP addresses you copied.
Once added click the Save button! The instance may take a short while to update with the new configuration.
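The same allowlisting can be done from the command line. Note that --authorized-networks replaces the whole existing list, so include every network you need; the placeholders below stand for the addresses Datastream showed you:

```shell
# Allowlist the Datastream egress IPs on the Cloud SQL instance.
# WARNING: this overwrites any authorised networks already configured.
gcloud sql instances patch mysql-cdc-poc \
  --authorized-networks='[DATASTREAM_IP_1]/32,[DATASTREAM_IP_2]/32'
```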
Once the configuration has been applied, switch back to the Datastream tab/window and at the bottom of the connection configuration, click “Run Test”, and if the demo gods shine upon us, you should see a success message:
If so, click the Create and Continue button. Next, you should see the options for selecting which source databases and tables to stream. Under “Objects to include”, select “Specific schemas and tables”. In the list that loads, select the database we created during our setup of CloudSQL:
Hit Continue! Now we start to configure the destination for the stream by giving the connection a name:
Hit Continue! Next, we configure the region in which the BigQuery dataset should live, and a prefix to indicate where the dataset originates from.
Now, this next part is where the real magic of Datastream and its integration with BigQuery comes in. We get to select how stale we want our data to be. For the purposes of this demo, I’ve selected “0 seconds” (as you may have guessed by the previous references to “rt” in my configuration 😏).
Hit Continue and then Validate. If everything is going to plan, then you should see something like this:
Hit “Create and Start”, confirm the choice, and the stream will start to deploy. Once deployed, you should see a nice little running icon:
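The whole stream can also be created with gcloud, passing the source and destination settings as JSON files. This is a sketch under several assumptions: the JSON shapes follow the Datastream API’s MysqlSourceConfig and BigQueryDestinationConfig, the “rt_” prefix and region are mine, and a BigQuery connection profile named bq-destination-profile is assumed to already exist (the console creates one for you; via the CLI you’d create it with --type=bigquery):

```shell
# Source config: which MySQL databases/tables to include.
cat > mysql_source_config.json <<'EOF'
{"includeObjects": {"mysqlDatabases": [{"database": "test"}]}}
EOF

# Destination config: dataset location, prefix and 0-second staleness.
cat > bq_destination_config.json <<'EOF'
{
  "sourceHierarchyDatasets": {
    "datasetTemplate": {"location": "europe-west2", "datasetIdPrefix": "rt_"}
  },
  "dataFreshness": "0s"
}
EOF

# Create and backfill the stream from the two connection profiles.
gcloud datastream streams create mysql-to-bq-rt \
  --location=europe-west2 \
  --display-name="MySQL to BigQuery" \
  --source=mysql-source-profile \
  --mysql-source-config=mysql_source_config.json \
  --destination=bq-destination-profile \
  --bigquery-destination-config=bq_destination_config.json \
  --backfill-all
```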
Now we can head over to BigQuery and see the dataset and table in the explorer:
Let’s query it! You should see the initial data we populated the database with:
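You can run the same check from the command line with the bq tool. The dataset and table names depend on the prefix and source names you chose; with a prefix of “rt_” and our “test” database, it would look something like:

```shell
# Query the replicated table. The dataset name rt_test is an assumption
# based on a "rt_" prefix plus the "test" source database.
bq query --use_legacy_sql=false \
  'SELECT id, text_col, int_col, created_at
   FROM `rt_test.example_table`
   ORDER BY id'
```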
Now, what’s the use in setting all this up if we’re just going to replicate the initial data? Shouldn’t we at least replicate some new data? Of course we should!
Switch back to the CloudSQL instance page and reconnect to the instance via Cloud Shell. Let’s insert some more rows! Run the query a few times.
INSERT INTO test.example_table (text_col, int_col, created_at) VALUES
('hello', 0, '2020-01-01 00:00:00'),
('goodbye', 1, NULL),
('name', -987, NOW()),
('other', 2786, '2021-01-01 00:00:00');
Return to the BigQuery query interface and resubmit the query.
Well would you look at that! By the time we’ve returned to the BigQuery interface and resubmitted our query, the data has been replicated!
And just like that, we have real-time CDC from MySQL into BigQuery, making our operational data ready for analytics almost instantly.
Thanks for reading and following along!
About CTS
CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.
We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.
We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!