3 Pro Tips For Datastream
Information Technology has been evolving for over half a century, and this evolution is key to company transformation. Recently, the shift to cloud computing has changed where servers and applications are hosted. Moving applications and data to the cloud requires careful planning to ensure business and application availability.
Avoiding service disruptions during this transition is a significant challenge. One way to address this is to replicate data in real time from one location to another, and then move the software (a double run, followed by the switchover).
The Datastream service
On Google Cloud, Datastream is designed for data replication from various sources (Oracle, PostgreSQL, MySQL, and SQL Server) to destinations like BigQuery or Cloud Storage.
Datastream is a Change Data Capture (CDC) tool. It listens for source database events (inserts, updates, and deletes) and forwards them to the destination.
A managed CDC service
Datastream simplifies data replication.
- It’s a managed service, so there’s no need for deployment, installation, or updates.
- You simply configure the source and destination. If your database lives in a private network, you may also need to configure network routes to reach it.
- Select the schema and tables to replicate, run the stream, and that’s it! Datastream reads the source schema, creates the tables in BigQuery, and initiates the backfill.
If you want to replicate data to a destination other than BigQuery, you can use Cloud Storage as the Datastream destination and use one of the various Dataflow templates to replicate the data to your preferred location.
For example, here’s a sample that uses Spanner as the final destination.
The Drawbacks of Managed Services
Managed services, while convenient, often come with limitations and restrictions. Because the service works out of the box, you must comply with its requirements, and customization options are limited.
Reaching the Limits
At Carrefour, we aim to modernize our legacy software, currently running in our data center, and leverage the power of Google Cloud. Specifically, we want to replicate data from our Oracle database to BigQuery, coupled with Looker Studio for its powerful dashboard and analytics capabilities. These features are impossible today with our on-prem Oracle database.
However, while using Datastream, I encountered three challenges, for which I eventually found solutions:
- Unsupported Primary Keys by BigQuery
- Tables with “logical” primary keys (not defined in the schema)
- BigQuery query cost and optimization
The importance of the primary key
When you configure Datastream in “merge” mode, it gathers your data (writing it to a temporary table in BigQuery) and performs deduplication in the target table, either in real time or periodically, depending on the staleness configuration (see below).
To ensure deduplication, you must define primary keys that guarantee row uniqueness in the table. BigQuery supports primary key definitions, even though it does not enforce them.
Unsupported Primary Keys
The first issue I encountered was unsupported primary keys. About 10% of our tables had a primary key with a FLOAT datatype, which BigQuery does not support.
This is a design issue, but since the software and database are managed by a third-party editor, we cannot update the schema ourselves. As a result, these tables cannot be replicated by Datastream automatically.
Solution
Since Datastream is a managed service, there’s no option to tune the automatic table creation or adapt the source schema into something BigQuery accepts.
Therefore, the solution is to create the table yourself. I asked my DBA for the Oracle table DDL. I adapted the field data types according to Datastream’s default type conversion for Oracle sources. For the primary key, I set it to NUMERIC instead of FLOAT.
I ran the CREATE TABLE statement and initiated a new backfill for this table. Datastream detected the table and was able to coerce the FLOAT type into the BigQuery NUMERIC type.
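As an illustration, here’s a minimal sketch of such a manually created table. The project, dataset, table, and column names are hypothetical; the points that matter are the NUMERIC key and the NOT ENFORCED clause:

```sql
-- Hypothetical schema: the Oracle source declared order_id as FLOAT,
-- which BigQuery does not support as a primary key, so we declare it NUMERIC.
CREATE TABLE `my_project.my_dataset.orders` (
  order_id   NUMERIC NOT NULL,
  order_date TIMESTAMP,
  amount     BIGNUMERIC,
  PRIMARY KEY (order_id) NOT ENFORCED  -- BigQuery PKs are declarative only
)
-- Assumed staleness value; see the Cost and Optimization section below.
OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```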
Problem solved!
“Logical” primary key
In this case, the primary key is managed at the software level, not the database level. I assume the software was configured to enforce some uniqueness in these tables, even though the database schema wasn’t designed for it.
The problem with tables without primary keys is Datastream’s inability to perform deduplication. New database source events (inserts, updates, or deletes) are simply appended to the table, with a new ROWID field acting as the primary key.
While we could perform post-processing or use workarounds, I found a solution to let Datastream handle this.
Solution
Datastream is an aggregation of more granular services, one of which is the Pub/Sub CDC feature. One of its requirements is that the BigQuery tables have primary keys defined. I decided to try the same with Datastream.
- I removed the table from the source in the stream.
- I truncated the table.
- I added primary key constraints to the table (see the sketch after this list).
- I added the table back as a source in the stream.
- I started the backfill.
- Then, I performed updates in the source table.
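The key step, adding the constraint, can look like this. The table and key columns are hypothetical; pick the columns your software treats as unique:

```sql
-- Declare a "logical" primary key so Datastream can deduplicate on it.
-- BigQuery will not enforce it, but the merge will use it.
ALTER TABLE `my_project.my_dataset.stock_moves`
  ADD PRIMARY KEY (warehouse_id, product_id) NOT ENFORCED;
```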
This eliminated duplicates, leaving only one entry per “logical” primary key. Manual schema updates were key!
Cost and Optimization
With Datastream, you can optimize BigQuery costs at two levels:
- Ingestion level: to control freshness and merge frequency.
- Consumption level: to optimize data read when a query is run.
When you ingest data with Datastream in MERGE mode, it performs deduplication in BigQuery tables. This merge scans the primary key of every row and performs the correct operation based on the source event.
For large tables, this operation can process a lot of data and incur significant costs.
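Conceptually, the operation resembles a MERGE keyed on the primary key. The actual merge is internal to Datastream; the statement below is only an assumed approximation, with hypothetical table and column names:

```sql
-- Conceptual sketch only: upserts and deletes are resolved by matching
-- on the primary key of every row against the temporary (staging) table.
MERGE `my_dataset.orders` AS t
USING `my_dataset.orders_tmp` AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.change_type = 'DELETE' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET order_date = s.order_date, amount = s.amount
WHEN NOT MATCHED AND s.change_type != 'DELETE' THEN
  INSERT (order_id, order_date, amount)
  VALUES (s.order_id, s.order_date, s.amount);
```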
Datastream includes an optimization parameter called “staleness,” expressed in minutes. The principle is:
- Datastream stores change events in a temporary table.
- Every “staleness” duration, it performs a merge between the temporary table and the destination table.
The consequences are the following:
- Pro: you merge less often in BigQuery, which saves money.
- Con: while the data sits in the temporary table, it is not usable/queryable, so data freshness decreases.
When you set this parameter, Datastream creates the BigQuery table with this staleness value.
Depending on the table type and freshness criticality, you can set your own staleness value directly in BigQuery using the ALTER TABLE SET OPTIONS statement.
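For example, with a hypothetical table name and an assumed one-hour freshness target:

```sql
-- Relax freshness on a large, low-criticality table to reduce merge costs.
ALTER TABLE `my_project.my_dataset.big_history_table`
  SET OPTIONS (max_staleness = INTERVAL 1 HOUR);
```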
Important note: If you created your tables using a CREATE TABLE statement, don’t forget the staleness option!
Once the data is in BigQuery and correctly replicated by Datastream, users can query it.
One best practice for BigQuery to optimize processing and limit the data scanned (and therefore the cost) is to partition the tables.
However, Datastream does not create partitioned tables by default, as partitioning doesn’t always exist in the source (legacy) database engines.
Worse, once a table has been created without partitioning in BigQuery, you cannot add partitioning later. You have to delete and recreate the table with the correct partitioning option.
Solution
To address this limitation:
- Extract the current DDL from BigQuery using the INFORMATION_SCHEMA.TABLES view (its ddl column).
- Add the partitioning option that matches your table’s constraints and query patterns.
- Drop the table.
- Create it with your updated DDL.
- Run the backfill with Datastream again.
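A minimal sketch of these steps, with hypothetical project, dataset, table, and column names:

```sql
-- 1. Retrieve the DDL BigQuery generated for the table.
SELECT ddl
FROM `my_project.my_dataset`.INFORMATION_SCHEMA.TABLES
WHERE table_name = 'orders';

-- 2. Drop the table, then recreate it with a partitioning clause added.
DROP TABLE `my_project.my_dataset.orders`;

CREATE TABLE `my_project.my_dataset.orders` (
  order_id   NUMERIC NOT NULL,
  order_date TIMESTAMP,
  amount     BIGNUMERIC,
  PRIMARY KEY (order_id) NOT ENFORCED
)
PARTITION BY DATE(order_date)                  -- the added optimization
OPTIONS (max_staleness = INTERVAL 15 MINUTE);  -- don't forget the staleness!
```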
You Are Still The Boss
Datastream is a powerful service for replicating data from source to BigQuery. Our business users are very happy with it, and we’re expanding its use to other applications.
Most of the time, Datastream’s default behaviors and limitations are not a problem for your data replication.
However, to go beyond them, you can still take the lead and create your tables yourself. Be proactive!
- Create tables with Primary Key types supported by BigQuery.
- Create tables with your OWN primary key, even if the source schema has none.
- Create tables with your BigQuery optimization parameters: staleness and partitioning.
In fact, you need to adapt Datastream to your context:
Let Datastream do the hard work, but tune it for your use cases when you need to. You are still the boss!