How we are streaming thousands of rows per second into BigQuery — Part I: Google Cloud Dataflow
The Dataflow job described in this article can be found on our AntVoice GitHub account: https://github.com/AntVoice/bq-streamer
Context of Google Cloud BigQuery usage
At AntVoice, we have 2 main uses for our data:
- Provide analytics dashboards so that our customers can quickly see how their advertising campaigns are performing.
- Use this data to train our models to deliver the right recommendation to the right user.
As we work extensively with Google Cloud, the technology of choice was Google Cloud BigQuery, as it allows quick data ingestion and reasonable query times on very large datasets.
There are mainly 2 ways of loading data into BigQuery:
- You can stream data live from your services into your tables. The data is then buffered and persisted a few minutes later.
- You can write your data to Google Cloud Storage and then load it into BigQuery.
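To make the two options concrete, here is a minimal sketch of both using the google-cloud-bigquery Java client. The dataset, table, and bucket names are made up for the example, and the layout of JSON files in Cloud Storage is an assumption.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class BigQueryIngestionSketch {
  public static void main(String[] args) throws Exception {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    TableId table = TableId.of("analytics", "click_events"); // hypothetical dataset and table

    // 1. Streaming insert: the row becomes queryable within seconds, buffered by BigQuery.
    InsertAllResponse response =
        bigquery.insertAll(
            InsertAllRequest.newBuilder(table)
                .addRow(Map.of("userId", "42", "event", "click"))
                .build());
    if (response.hasErrors()) {
      System.err.println("Streaming insert errors: " + response.getInsertErrors());
    }

    // 2. Batch load: point a load job at newline-delimited JSON files in Cloud Storage.
    LoadJobConfiguration loadConfig =
        LoadJobConfiguration.newBuilder(table, "gs://my-bucket/click_events/*.json")
            .setFormatOptions(FormatOptions.json())
            .build();
    Job job = bigquery.create(JobInfo.of(loadConfig));
    job.waitFor(); // blocks until the load job completes
  }
}
```

Note that streaming inserts are billed per volume of ingested data, while batch load jobs from Cloud Storage are not billed for ingestion, a difference that matters later in this story.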
We chose to stream data directly to BigQuery as it was the simpler approach for our small and growing company.
Our first solution
Back when we first implemented this service, we needed some flexibility and the volume of data to load was reasonable. So we created a Google Cloud PubSub topic and processed all of our data inserts asynchronously in a dedicated microservice.
It worked for a time, but our service was soon overwhelmed and began to crash often. Even after adding new instances, we weren't able to handle the growing throughput.
At that time we weren't yet using Kubernetes, which made it harder to scale properly and handle the growing load. So we looked for a more managed solution that could scale more or less automatically, and we found Google Cloud Dataflow, which could potentially help us meet the demand.
Why did we need to create our own AntVoice Dataflow job?
Dataflow is a managed Google Cloud service that handles flows of data, like an ETL tool. For a given job you create, you define where the data comes from, what to do with it, and where to store it. Google Cloud then creates and manages the VMs your job uses to process your data.
Google provides simple, ready-to-use job templates, such as one that will:
- Load data from the given PubSub topic
- Write those rows into the given BigQuery table
If you only have a few tables to stream to, that might suit you. However, at that time we already had approximately 20 tables to stream to, and we didn't want 1 topic and 1 job per table given our current volumes.
We finally decided to implement our own flow using the Apache Beam SDK, and we chose Java to develop it, as it was close enough to our existing C# and Scala stack.
What are the advantages of our Dataflow job?
The idea behind this BigQuery streamer job is to listen to a single PubSub topic for messages that each contain a row to insert into BigQuery. Each PubSub message defines the target dataset and table in its metadata.
We also added the possibility for the JSON row to be gzipped to save on PubSub billing. Whether the JSON is compressed is also indicated in the metadata.
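To illustrate this contract, here is roughly what publishing such a message could look like with the Google Cloud PubSub Java client. The attribute names (dataset, table, gzip), the project, and the topic name are hypothetical placeholders, not necessarily what we use in production.

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class BqStreamerPublisherSketch {
  public static void main(String[] args) throws Exception {
    String json = "{\"userId\": \"42\", \"event\": \"click\"}";

    // Gzip the JSON payload to reduce the PubSub message size.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
      gzip.write(json.getBytes(StandardCharsets.UTF_8));
    }

    Publisher publisher =
        Publisher.newBuilder(TopicName.of("my-project", "bq-streamer")).build();
    try {
      PubsubMessage message =
          PubsubMessage.newBuilder()
              .setData(ByteString.copyFrom(bytes.toByteArray()))
              // The target table and the compression flag travel as message attributes.
              .putAttributes("dataset", "analytics")
              .putAttributes("table", "click_events")
              .putAttributes("gzip", "true")
              .build();
      publisher.publish(message).get(); // wait for the publish to complete
    } finally {
      publisher.shutdown();
    }
  }
}
```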
As shown below, our job consists of 4 consecutive steps:
- We read incoming rows from a PubSub subscription.
- We extract the information we need, such as the target table and the compression status, and decompress the payload if needed.
- The third step is a technical one: it batches our rows so that several of them are inserted at the same time.
- Finally, we stream the rows into BigQuery.
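To give a more concrete picture, here is a minimal sketch of such a pipeline written with the Apache Beam SDK in Java. This is not the code of our actual job (see the GitHub repository for that): the attribute names match the hypothetical publisher above, the explicit batching step is omitted, and error handling is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class BqStreamerPipelineSketch {

  /** Turns a PubSub message into (tableSpec, TableRow), gunzipping the payload if needed. */
  static class ToTableSpecAndRow extends DoFn<PubsubMessage, KV<String, TableRow>> {
    @ProcessElement
    public void process(ProcessContext c) throws Exception {
      PubsubMessage msg = c.element();
      // Hypothetical attribute names; the real message contract may differ.
      String tableSpec = msg.getAttribute("dataset") + "." + msg.getAttribute("table");
      InputStream payload = new ByteArrayInputStream(msg.getPayload());
      if ("true".equals(msg.getAttribute("gzip"))) {
        payload = new GZIPInputStream(payload);
      }
      TableRow row = TableRowJsonCoder.of().decode(payload, Coder.Context.OUTER);
      c.output(KV.of(tableSpec, row));
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubSub",
            PubsubIO.readMessagesWithAttributes()
                .fromSubscription("projects/my-project/subscriptions/bq-streamer"))
        .apply("ParseAndDecompress", ParDo.of(new ToTableSpecAndRow()))
        .setCoder(KvCoder.of(StringUtf8Coder.of(), TableRowJsonCoder.of()))
        // Explicit batching is omitted in this sketch; BigQueryIO's streaming-insert
        // path also batches rows internally before calling the insertAll API.
        .apply("StreamToBigQuery",
            BigQueryIO.<KV<String, TableRow>>write()
                // Dynamic destination: the table spec travels with each element.
                .to(v -> new TableDestination(v.getValue().getKey(), null))
                .withFormatFunction(KV::getValue)
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    p.run();
  }
}
```

The important part is the dynamic destination passed to BigQueryIO: because the table spec travels with each element, a single job can fan out to any number of tables.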
The metrics displayed in the screenshot do not represent the real throughput this kind of job used to handle. We now use this solution to stream only our most critical data; as you will see in the second article, we complemented this Dataflow job with another kind of batch data loading to mitigate growing costs.
When all our data went through this streaming job, it was streaming approximately 100 million rows every day.
This solution has 2 major advantages:
- Every row entering PubSub is guaranteed to be streamed to the target BigQuery table. This allows for a precise customer reporting UI where customers can find exact data.
- You can very simply stream to a new table through this job, since all the BigQuery destination information is sent in the message metadata, without ever changing your Dataflow job.
Since 2018, this Dataflow job has been running smoothly. As soon as we have an entity in our C# code (or any other language, even Bash) that corresponds to a table schema, streaming to BigQuery is as simple as sending the entity's JSON to PubSub with the correct metadata.
What are the cons of using this solution?
We have encountered 3 issues so far with this solution:
- Some messages may contain malformed JSON that does not match the destination table schema. In this case, the Dataflow job reports BigQuery errors that can block part or all of the job. This happened to us a couple of times, but for now, instead of designing an automatic solution, we prefer to handle those rare cases manually to be certain no meaningful data is lost in the process.
- We tried to adapt the auto-scaling feature to our use case but never managed to make it scale properly. As a temporary fix, we set up manual scale-ups and scale-downs in the morning and the evening.
- Our Dataflow job was becoming very costly. Indeed, this solution requires paying for the PubSub throughput, the Dataflow job VMs, and the BigQuery streaming inserts. As our customer base grew, we faced an increasing amount of data to stream every day, which represented a hefty bill each month.
The next AntVoice solution to decrease Dataflow costs
When we tried to reduce our Google Cloud bill, we looked into the other way of ingesting data into BigQuery: loading it from Google Cloud Storage.
We created a system that writes data to Google Cloud Storage on the fly and loads it periodically into BigQuery.
This new way of loading data will be described in the following post.
To be continued…