Going further with Cloud Dataflow: designing a real-time polls app — part 2

Learn how to use GCP Cloud Dataflow to aggregate unbounded data streams.

Welcome back to my journey building a real-time, scalable polls app using Firestore! In the last story [part 1], I talked about sharding counters, a common technique to reduce data contention by breaking a counter into N different counters and incrementing one of them at random. However, this approach has a drawback.

If you don’t know your maximum load ahead of time, you can’t pick a shard count that both keeps costs down and maximizes performance. Also, I don’t find managing the shards very convenient.

The solution could be in the Cloud

I discovered GCP Dataflow, a fully managed data-pipeline service for consuming and transforming data at virtually unlimited throughput. This solution looks amazing because we no longer have to think about how many users we have, how fast our entities will be updated, or how many documents we should shard across.

We let Dataflow scale our app automatically, spinning up workers and managing CPU/memory usage 🔥. Also, instead of asking Firestore to aggregate every single vote, we send it batches of pre-aggregated votes at a rate it can handle.

Here is what we are going to build 🔥

Updating the logic

First off, we no longer need shards; we’re simply going to add a votes counter directly to each answer document.

Re-Modeling #1: Answers
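For illustration, here is a minimal sketch of what an answer document could look like after this change; the title field and its value are my own assumptions, only the votes counter matters here:

```python
# Hypothetical shape of an answer document at polls/{pollId}/answers/{answerId}.
# `title` is illustrative; `votes` replaces the shard sub-collection from part 1.
answer = {
    "title": "Yes, absolutely!",
    "votes": 0,  # a single plain counter, no more shards
}
```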

Next, the onVote() Cloud Function will radically change. Instead of updating a shard at random, we’re now going to encode the reference to the answer, i.e. polls/{pollId}/answers/{answerId}. Then, we publish a new message on a Pub/Sub topic.

It’s important to keep the payload as small as possible, both to reduce cost and to simplify the logic behind our pipeline.

onVote() Cloud Function
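In case it helps, here’s a hedged sketch of what the function could look like as an HTTP-triggered Python Cloud Function; the topic name votes, the request field names, and the GCP_PROJECT environment variable are assumptions, not the exact code from the gist:

```python
# A minimal sketch of onVote(): encode the answer reference and publish it.
import os
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# "votes" is a hypothetical topic name.
topic_path = publisher.topic_path(os.environ["GCP_PROJECT"], "votes")

def on_vote(request):
    """Publishes the voted answer's document path to Pub/Sub."""
    body = request.get_json()
    # The payload is just the document path, kept as small as possible.
    path = f"polls/{body['pollId']}/answers/{body['answerId']}"
    publisher.publish(topic_path, data=path.encode("utf-8"))
    return ("", 204)
```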

Writing the pipeline

Alright, we’ve reached the heart of our implementation: the pipeline. I won’t go into details because it may sound complicated at first, but you can find some pipeline examples, such as WordCount.

Just imagine that a pipeline is a giant funnel that takes bounded or unbounded data as input and outputs that data after it has been transformed by several functions (grouping by key, counting values, etc.).

In this example, I send a path that should be unique (because its last part is the answer UID), then I convert each value into a tuple (“path/to/the/answer”, 1), and finally I group them by key. This way, I can count the votes per answer and publish a new Pub/Sub message with the formatted data.

Transforming data
Note: as we are working with unbounded data (a stream), the process doesn’t know when it’s time to output the data it has seen. Therefore, we apply a processing-time window to divide this unbounded data into finite chunks, so we can group them, count them, and output the result. Windowing strategies…
Sample of the pipeline written in Python
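To give you an idea, here is a simplified sketch of such a pipeline with Apache Beam’s Python SDK; the topic names, the “path,count” encoding, and the 60-second window are assumptions, not the exact code from the sample above:

```python
# Sketch of the streaming pipeline: read paths, window, count per answer, publish.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read votes" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/votes")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
        # Chop the unbounded stream into finite 60-second chunks (see the note above).
        | "Window" >> beam.WindowInto(FixedWindows(60))
        # ("polls/{pollId}/answers/{answerId}", 1)
        | "Pair with 1" >> beam.Map(lambda path: (path, 1))
        | "Count per answer" >> beam.CombinePerKey(sum)
        # Encode each ("path", count) pair, e.g. as "path,count".
        | "Encode" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}".encode("utf-8"))
        | "Publish totals" >> beam.io.WriteToPubSub(topic="projects/my-project/topics/vote-totals")
    )
```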

Updating votes

The very last step is to listen to the output Pub/Sub topic; when a message arrives, I decode it and update the related answer’s vote count with the value I get. Here’s the function; there is still room for improvement, but you get the idea.

Pub/Sub Cloud Function to update votes counter
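Here’s a sketch of what that function could look like as a Pub/Sub-triggered Python Cloud Function; it assumes the hypothetical “path,count” encoding from the pipeline sketch above:

```python
# Decode the aggregated message and atomically bump the answer's counter.
import base64
from google.cloud import firestore

db = firestore.Client()

def update_votes(event, context):
    """Pub/Sub-triggered: applies a batch of pre-aggregated votes to Firestore."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    path, count = payload.rsplit(",", 1)
    # firestore.Increment avoids a read-modify-write round trip.
    db.document(path).update({"votes": firestore.Increment(int(count))})
```

Using firestore.Increment keeps the update atomic, so counts from concurrent windows can’t clobber each other.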

And voilà!

We’ve made a real-time polls app that can scale to a planetary level, with near-zero effort!

A working example!

I quite like GCP Cloud Dataflow because we no longer need to think about how to scale; we just sit back, drink some juice, and enjoy it. However, this solution comes at a cost. In streaming mode you need a Compute Engine machine type of n1-standard-2 or higher and at least 1 worker. If you leave this running 24/7 for a month, it will cost you around $130! I’ll let you be the judge of that… 😃

That’s the end of my journey! Thanks for reading.

HAVE FUN! 🔥 🔥 🔥