Mirror Mirror on the Wall: BigTable to BigQuery Mirroring

Yaniv Bordeynik
Published in Tech at Trax
4 min read · Sep 4, 2022
Photo by Gabor Szucs on Unsplash

In the world of Big Data, there are cases when data needs to be saved in more than one database. It may sound wasteful, but there are different reasons to do this:

  • Different DB usage (direct access vs scanning).
  • Costs considerations regarding saving the data and querying it.
  • One may be more suitable for BI and reporting than the other.

These are some of the considerations for writing the same data into more than one database. However, there are implications to this practice. When a microservice handles millions of events per day, you must consider the message processing time, the memory usage, the connections to the databases and, of course, the lines of code.

Lately, I have been working on a project where I wrote data into BigQuery for BI needs. Later on, there was a demand to write the same data to BigTable so that another service could read from it with direct access. The first thing I did was add another function whose only job was to write the same data into BigTable. After a while, we noticed that more teams across the company needed to write the data, not only for BI and reporting but also for real-time purposes.

We thought about how we could make it better and decided on the concept of data mirroring. The idea was to create a new service that would be responsible for the writing part.

This helped to solve our problem by:

  • Reducing the message processing time in the queues.
  • Reducing memory usage.
  • Removing duplicated lines of code.
  • Maintaining the code that writes to two databases in one place.

TL;DR — Data mirroring from BigTable to BigQuery by a dedicated service.

In our case the databases in use are BigQuery and BigTable. BigQuery and BigTable are both cloud-native databases created by Google. While BigQuery is a data warehouse for relational structured data, commonly used for storing and analysing data, BigTable is a key-value NoSQL database, ideal for heavy reads and writes and for direct access.

The proposed mirroring method takes advantage of the fact that each time data is written to BigTable, an event is likewise published to a Pub/Sub queue. The mirroring service consumes the event from Pub/Sub and determines (with the help of a configuration file) whether the event is configured to be mirrored into BigQuery.
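The decision step can be sketched as plain Python. This is a minimal sketch under assumptions of mine: the event shape (a dict with `table` and `row` fields), the config keys, and the injected `write_to_bigquery` callable are all illustrative, not the actual production schema.

```python
from typing import Callable, Optional

def should_mirror(event: dict, config: dict) -> Optional[dict]:
    """Return the mirroring rule for this event's source table,
    or None if the table is not configured for mirroring."""
    return config.get(event.get("table"))

def mirror_event(event: dict, config: dict,
                 write_to_bigquery: Callable[[str, dict], None]) -> bool:
    """Mirror a single Pub/Sub event into BigQuery if a rule exists.
    Returns True if the event was written, False if it was ignored."""
    rule = should_mirror(event, config)
    if rule is None:
        return False  # not configured for mirroring; simply ignore it
    # Build the fully qualified destination from the configured properties.
    destination = f'{rule["project"]}.{rule["dataset"]}.{rule["table"]}'
    write_to_bigquery(destination, event["row"])
    return True
```

In a real service, `write_to_bigquery` could be backed by the BigQuery client's streaming-insert call, while the pure decision logic above stays testable without any cloud credentials.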

Service architecture (image by author)

When you have multiple services and ETL processes which are all producing huge amounts of data, you want to save the data in a relational database to be able to query it later on for reports, dashboards and analysis. You may want to save the data as is and duplicate it (or just part of it).

When writing such a service, we want to make it as generic as possible so it can handle different formats of events, and to make it as simple as possible for the user, without refactoring and redeploying the service every time new data needs to be written.

When the mirroring service gets a new event from the queue, it checks whether the event needs to be handled or not.

Configuration

In order to orchestrate the services and their destination tables, there is a configuration file stored in a Git repository.

Why do we even need a configuration file?!

Photo by Ferenc Almasi on Unsplash

A configuration file allows end users to set the properties they need, simply and without writing code or deploying anything.

In the configuration file we define the table that the data needs to be written to, along with some other properties: the BigQuery project name and dataset. Another advantage is that the configuration file can also define the schema. If the data meets the schema requirements the table is expecting, the service writes it to the BigQuery table. If not, the event is simply ignored.
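To make this concrete, here is a hypothetical configuration entry. The YAML layout and all field names are my own illustration of the properties described above (destination project, dataset, table, and schema), not the service's actual format:

```yaml
mirrored_tables:
  - source_table: sales            # BigTable table whose events to mirror
    project: my-gcp-project        # destination BigQuery project
    dataset: analytics             # destination BigQuery dataset
    table: sales_mirror            # destination BigQuery table
    validate: true                 # optionally validate rows against the schema
    schema:
      store_id: INTEGER
      store_name: STRING
```

Adding a new mirrored table is then just a pull request against this file, with no code change or redeployment.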

Using a configuration file also let us add an option to validate the data when the user wishes, or to skip validation otherwise.
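The optional validation step can be sketched like this. The schema representation (`{field_name: type_name}` with a handful of type names) is an assumption for illustration; the real service's schema format may differ.

```python
# Map illustrative schema type names to Python types.
_TYPES = {"STRING": str, "INTEGER": int, "FLOAT": float, "BOOLEAN": bool}

def validate_row(row: dict, schema: dict) -> bool:
    """Return True if every schema field is present in the row
    with a value of the expected type; otherwise return False."""
    for field, type_name in schema.items():
        expected = _TYPES.get(type_name)
        if expected is None or not isinstance(row.get(field), expected):
            return False
    return True
```

Events that fail validation can be dropped or routed to a dead-letter topic, depending on what the `validate` flag in the configuration asks for.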

Final Words

As you saw in this article, our goal was to reduce the performance issues caused by services and ETLs having to write data into two databases. The solution was to do data mirroring using a configuration file. The largest gain was that the configuration file is generic enough to handle different use cases. Working on the process, we took advantage of the fact that the event is already published to Pub/Sub, simply read it with the new mirroring service, and used it to write the data where we need it.

A little prior work is required of the end user before use: create the table they wish to write to in BigQuery, and update the configuration file with the new table. A quite simple task, with no new deployments or code changes needed.

Please feel free to comment and ask anything, and clap if it makes sense :)
