How we are streaming thousands of rows per second into BigQuery — Part II: Google Storage loading
In the previous post, we described how we leverage Google Cloud Dataflow to record our data in BigQuery. Although it has some great advantages, it was becoming more and more expensive.
So we had to figure out a way to reduce these infrastructure costs.
The cost items
The cost came from 3 Google Cloud components:
The basic idea
We quickly looked into the possibility of loading data into BigQuery from Google Cloud Storage files (abbreviated GCS from now on). It is the only other way to load data into BQ if we want to avoid streaming costs.
We found that Dataflow has an option to load inserts into BQ in batches rather than streaming them. But this alone did not address our 2 main cost items, so we had to find a way to reduce those costs as well.
So we searched for a way to write our data into GCS quickly enough to meet our main requirements:
- Don’t impact our running services
- Don’t rely on a message bus system, which would be expensive and would need to scale up as demand grows
- Write data quickly enough for it to be readable in our reporting UI. We need the data to be loaded as often as every 15 minutes to keep it as fresh as possible.
How we did it
To find our solution, we started from the end of the workflow: we knew we had to batch-load files into BQ from GCS.
Then we had to find a way to write JSON to GCS without impacting our services. Since we run on Kubernetes, we knew we could attach sidecars to our pods, and that these sidecars could share the same volumes as our services.
So we simply wrote JSON lines into files directly on the pods’ file system and built a sidecar to transfer these files to GCS. The last step was simply to create a cron job to load these files periodically.
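That periodic load can be as simple as a cron entry invoking the bq CLI; the schedule matches our 15-minute freshness target, and the bucket, dataset, and table names below are placeholders:

```shell
# Every 15 minutes, batch-load the newline-delimited JSON files
# written by the sidecars into BigQuery.
*/15 * * * * bq load --source_format=NEWLINE_DELIMITED_JSON \
    mydataset.events 'gs://my-bucket/events/*.json'
```

Batch loads from GCS are free of streaming insert charges, which is where the bulk of the savings comes from.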
That’s when our solution, called GCS Streaming, was born. The final architecture now looks like this:
When we developed the GCS Streamer, we already had some jobs running in Go, so we naturally went for it.
The code is pretty simple:
- Listen for directory modification events (file created, file moved) using the fsnotify library.
- Transfer those files to GCS
It looks like this:
The cost reduction was between 5 and 10 times compared to the Dataflow costs, depending on the data throughput. We currently use a mix of both solutions: voluminous data goes through this new solution, while critical data still goes through the old one.
We stuck with the old one for critical data for 2 reasons:
- The data is streamed almost instantly into BQ, which allows us to use it in monitoring and reporting tools. Some services also need the data to be available as soon as possible.
- The Dataflow solution is lossless, whereas streaming through the file system, copying locally and then to GCS, causes a minimal loss (mainly due to pod restarts). But this loss, which is constantly monitored, is acceptable given the reduced cost and the criticality of the data. Nowadays we may be losing less than 0.01%.
We were also using the Dataflow streaming in our QA context until recently, because e2e tests needed the data to be loaded as quickly as possible. However, we recently switched to the batch loading version, as our preproduction environments are now able to load GCS files on the fly, which is as fast as the streaming version. It even fixed some issues where data was stuck in the BQ streaming buffer.
Using this new way of inserting data in BQ is saving us a lot of money. It allowed us to scale up peacefully, and to accept much more incoming load without seeing our bills skyrocket.
We chose this architecture because we could tolerate medium-latency insertions, and it allows us to scale up without a linear cost hike.