Streaming data directly into BigQuery

Ajay Chitre
2 min read · Jul 16, 2018


In my previous post, I pointed out that I am quite excited about the speed of BigQuery, so now it’s time to figure out the most (cost-)effective ways of pumping data into it. My first choice for pumping data into any datastore is always ‘near real time’. I absolutely love one of Confluent’s slogans: “Life doesn’t happen in batch mode.” :-) If, for some reason, streaming doesn’t work, one can always fall back on batch mode, so I decided to focus on streaming to begin with.

I know purists frown upon the idea of streaming directly into a datastore, but for smaller loads it sometimes makes sense to avoid the additional overhead. Of course, inserting one row at a time would be a BAD idea, so BigQuery provides an API, BigQuery.insertAll, for inserting rows in batches. All we have to do is create an InsertAllRequest.Builder object & keep adding rows to it. Once the desired batch size is reached, we pass the built request to the ‘insertAll’ API & then re-create the builder object for the next batch. The snippet below shows how these APIs fit together:
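(This is a minimal sketch of the batching pattern using the com.google.cloud.bigquery Java client; the dataset/table names and the batch size are placeholders, not necessarily what the gcp-tools repo uses.)

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.Map;

public class BigQueryStreamer {

    private static final int BATCH_SIZE = 500; // placeholder batch size

    private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    private final TableId tableId = TableId.of("my_dataset", "tweets"); // placeholder names
    private InsertAllRequest.Builder builder = InsertAllRequest.newBuilder(tableId);
    private int rowCount = 0;

    // Buffer a row; flush once the batch is full.
    public synchronized void addRow(Map<String, Object> row) {
        builder.addRow(row);
        rowCount++;
        if (rowCount >= BATCH_SIZE) {
            flush();
        }
    }

    // Send the buffered rows with a single insertAll call, then re-create the builder.
    public synchronized void flush() {
        if (rowCount == 0) {
            return;
        }
        InsertAllResponse response = bigquery.insertAll(builder.build());
        if (response.hasErrors()) {
            // A real job would inspect and retry the failed rows.
            System.err.println("Insert errors: " + response.getInsertErrors());
        }
        builder = InsertAllRequest.newBuilder(tableId);
        rowCount = 0;
    }
}
```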

For this experiment I decided to use the ‘sample’ tweets made available by Twitter & used the Twitter4j APIs to save the tweets to a table in BigQuery. As for speed, 10,000 rows were consistently inserted in under two seconds. Of course, keep in mind that my row size is fairly small, but that still ain’t a bad speed!
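As a rough illustration of the Twitter4j side, the listener below wires Twitter’s sample stream to the hypothetical BigQueryStreamer class sketched above; the column names are placeholders and would need to match the actual table schema used in the repo.

```java
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

import java.util.HashMap;
import java.util.Map;

public class TweetStreamer {

    public static void main(String[] args) {
        BigQueryStreamer streamer = new BigQueryStreamer(); // sketch from above

        // Credentials are picked up from twitter4j.properties / environment.
        TwitterStream twitterStream = new TwitterStreamFactory().getInstance();
        twitterStream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // Flatten each tweet into a row map for insertAll (placeholder columns).
                Map<String, Object> row = new HashMap<>();
                row.put("id", status.getId());
                row.put("text", status.getText());
                row.put("created_at", status.getCreatedAt().getTime() / 1000);
                row.put("user_screen_name", status.getUser().getScreenName());
                streamer.addRow(row);
            }
        });

        // 'sample' delivers a small random sample of all public tweets.
        twitterStream.sample();
    }
}
```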

If you want to try it out yourself, the source code for this experiment is available at: https://github.com/ajaychitre/gcp-tools

You can follow instructions in the README.md file to run this experiment in your environment.

Conclusion: For slowly changing datasets, writing data directly into BigQuery is certainly an option worth considering. The cost of ‘Streaming Inserts’ is only $0.01 per 200 MB. ‘Active’ & ‘Long-term’ storage costs should also be factored in, but overall the costs look fairly reasonable.
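For a rough sense of scale: at that rate, streaming about 1 GB of tweets a day works out to roughly $0.05 a day, or around $1.50 a month, before storage costs.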

On Deck Circle: In the next blog post I will describe how I wrote a Dataflow job to push data into BigQuery. I used Google’s Pub/Sub messaging service to write the incoming tweets into a topic & then read them from that topic in a Dataflow job. Please stay tuned.

Next post: From the ‘On Deck Circle’ to ‘Home Plate’. It’s now ready for you to read: https://medium.com/@chitre.ajay/using-dataflow-pubsub-to-load-data-into-bigquery-562be26ef71e
