Using Dataflow & PubSub to load data into BigQuery

Ajay Chitre
Jul 16, 2018


In the previous blog post, I wrote about directly inserting rows into BigQuery. That's acceptable if we're dealing with slowly changing datasets, but what if we're dealing with a large amount of constantly changing data? What if, in addition to writing into BigQuery, we want to perform some additional tasks, such as aggregating incoming data over a time window & saving the results to a different table? In such cases, it's better to write the incoming data into a PubSub 'Topic' & then create as many 'Subscribers' as we need to perform the different tasks.

PubSub: This is Google's version of 'Apache Kafka' & Amazon's 'Kinesis'. Unlike Kafka, it's fully managed, so there's no setup involved; you pay as you go. The PubSub API is fairly straightforward. You can simply create a 'PubsubMessage' object & publish it with a 'Publisher'; the publish call returns an 'ApiFuture', so the message is written asynchronously, as shown below:
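Here's a minimal sketch of that publish call (the project ID, topic name & tweet payload below are placeholders, not the exact code from my repo):

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.ProjectTopicName;
import com.google.pubsub.v1.PubsubMessage;

// Placeholder project & topic names.
ProjectTopicName topicName = ProjectTopicName.of("my-project", "tweets");
// newBuilder(...).build() throws IOException; handle or declare it in your method.
Publisher publisher = Publisher.newBuilder(topicName).build();

// Stand-in for a real tweet payload.
String tweetJson = "{\"text\": \"hello from the sample stream\"}";

// Wrap the payload in a PubsubMessage & publish it. publish() returns an
// ApiFuture, so the message is sent asynchronously in the background.
PubsubMessage message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8(tweetJson))
        .build();
ApiFuture<String> messageIdFuture = publisher.publish(message);
```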

One thing you need to be careful of is shutting down the 'Publisher' before terminating your process, so that any buffered messages get flushed. This can be done using a 'Shutdown Hook' as given below:
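A sketch of that hook, reusing the 'publisher' instance created above:

```java
// Register a JVM shutdown hook so the Publisher flushes any buffered
// messages & releases its resources before the process exits.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    try {
        publisher.shutdown();
    } catch (Exception e) {
        e.printStackTrace();
    }
}));
```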

For this experiment I decided to use ‘sample’ tweets made available by Twitter & used the Twitter4j APIs to write Tweets into a ‘Topic’ as described above.
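A rough sketch of that wiring, assuming a small 'publishToTopic' helper that wraps the publish logic shown earlier (the helper name is mine, not necessarily what's in the repo):

```java
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

// Listen to Twitter's 'sample' stream & hand each tweet to the publish logic above.
TwitterStream twitterStream = new TwitterStreamFactory().getInstance();
twitterStream.addListener(new StatusAdapter() {
    @Override
    public void onStatus(Status status) {
        publishToTopic(status.getText()); // hypothetical helper around publisher.publish()
    }
});
twitterStream.sample();
```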

Dataflow: I like the concept of Dataflow. It allows us to write code once & run it with different Runners. The 'DataflowRunner' can be used to run the job in Google's hosted environment. There's no need to start a cluster such as EMR or Dataproc. All we have to do is specify 'numWorkers' & it will allocate that many VM instances to run the Job. But if you already have a Spark cluster, you can simply use the 'SparkRunner' to run the same code! The concept is great. How well does it work, though? That's what I wanted to find out.
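For reference, here's roughly what that runner configuration looks like with the Beam pipeline options (the project & worker count are placeholder values):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Build options for the hosted DataflowRunner; switching to the SparkRunner
// (from the beam-runners-spark module) is just a different setRunner() call.
DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(DataflowRunner.class);
options.setProject("my-project");   // placeholder GCP project
options.setNumWorkers(3);           // Dataflow allocates this many VMs
options.setStreaming(true);         // unbounded PubSub source => streaming job

Pipeline pipeline = Pipeline.create(options);
```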

You can check out the code for the TweetsWriter class, but here are a few code snippets:

1. First you can read the Tweets from a topic & extract the message as follows:
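Something along these lines, assuming the 'pipeline' & placeholder topic from above:

```java
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Read raw PubsubMessages from the topic & pull out the UTF-8 payload.
PCollection<String> tweets = pipeline
        .apply("ReadFromPubSub",
                PubsubIO.readMessages().fromTopic("projects/my-project/topics/tweets"))
        .apply("ExtractPayload",
                MapElements.into(TypeDescriptors.strings())
                        .via(msg -> new String(msg.getPayload(), StandardCharsets.UTF_8)));
```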

2. You can then convert the messages into 'TableRow' objects & finally write them to BigQuery in 'Streaming Inserts' mode as follows:
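A sketch with a made-up single-column schema (the real TweetsWriter class pulls out more fields from each tweet):

```java
import java.util.Collections;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// Placeholder schema with a single STRING column.
TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("tweet").setType("STRING")));

tweets
        .apply("ToTableRow", ParDo.of(new DoFn<String, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                c.output(new TableRow().set("tweet", c.element()));
            }
        }))
        .setCoder(TableRowJsonCoder.of()) // explicit coder for TableRow
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:tweets_dataset.tweets")   // placeholder table
                .withSchema(schema)
                .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

pipeline.run();
```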

You can try this experiment in your environment by following the steps given in this README.md file. Check out the section titled: Using Dataflow & PubSub to load data into BigQuery.

Conclusion: This works like a charm! The speed is GREAT! As soon as a message arrives on the topic, it gets written to BigQuery almost instantaneously! Just make sure you don't run into NullPointerExceptions or other such errors; Dataflow, as it stands right now, doesn't do a good job of reporting errors in the logs! But overall, Dataflow ROCKS!

On Deck Circle: I am going to try running a similar program under 'Spark' & see how it compares with 'Dataflow'. Please stay tuned. Thanks for reading!
