Aggregate thousands of inputs per second with Firebase

So, it’s about time we deal with this topic. It’s been a while since I first wrote about data aggregation with Firebase, and since I feel the topic is not very well explored, I figured I’d explain a way of updating a Cloud Firestore document in real time with aggregated values from thousands of distributed inputs per second.

Dennis Alund
oddbit

--

The scenario

Imagine that you have a game where thousands of players are producing inputs to the game state simultaneously. Let’s say we’re counting votes for five popular contestants. We want to be able to easily present a real-time scoreboard without being concerned about scaling issues.

An imaginary leaderboard of contestants

How might we deal with this volume of data without hitting the limitations of either the Firebase Realtime Database or Cloud Firestore?

One way of doing it

As described in a previous article and as proposed by Google, you can try to split your input into shards and sum up the counters in each shard on the client side.

The limitation of that technique, though, is that reading data becomes less convenient, and it requires an understanding of the expected load that the algorithm must be able to deal with. Too many shards make the counters harder to read and sum, and too few shards might still lead to data contention.
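For reference, a sharded counter could look something like the sketch below (the collection layout, the number of shards and the use of the v8 JavaScript SDK are my own assumptions for illustration). Notice that reading the total means fetching and summing every shard document.

import firebase from "firebase/app";
import "firebase/firestore";

// Assumes firebase.initializeApp(...) has already been called.
const NUM_SHARDS = 10;

// Spread the writes across shards to avoid contention on a single document.
async function incrementCounter(contestant: string): Promise<void> {
  const shardId = Math.floor(Math.random() * NUM_SHARDS).toString();
  await firebase
    .firestore()
    .collection("contestants")
    .doc(contestant)
    .collection("shards")
    .doc(shardId)
    .set({ count: firebase.firestore.FieldValue.increment(1) }, { merge: true });
}

// Reading the total requires fetching and summing all shard documents.
async function readCount(contestant: string): Promise<number> {
  const shards = await firebase
    .firestore()
    .collection("contestants")
    .doc(contestant)
    .collection("shards")
    .get();
  return shards.docs.reduce((sum, doc) => sum + (doc.data().count ?? 0), 0);
}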

Whether we’re working with the Realtime Database or Cloud Firestore, we’re closing in on performance limits, and we’re required to pay extra attention to the volume of new data that our app is collecting.

But, is there another way?

Of course there is. There’s always another way. Sometimes better, sometimes worse, sometimes just different. I’ll let you be the judge of which one of those this is. I’d be happy to hear your thoughts in the comments below.

I’ll explain how you can utilize Google Cloud Platform’s Cloud Dataflow to start and grow without concerns about scalability. The beauty of this solution is that it is not an overkill setup for small amounts of data, and it has virtually no upper limit on capacity; you can start small and grow to planetary scale.

Let’s have a look at the diagram below and read our way from left to right.

Here’s what we’re going to do

A) Recording individual data

Cloud Firestore is a really good choice of database for both reading and writing data. Not only does it have constant read performance regardless of the size of our database; it currently also has a capacity of 10,000 writes per second… which is planned to be “unlimited” once it’s out of beta.

That’s a much higher write capacity than we can get with the Realtime Database (without creating multiple databases, which is again a form of sharding that we’re trying to avoid in this article).

On top of that, ten thousand writes per second is an enormous amount of data that we’re able to handle straight out of the box, without needing to understand the complex scaling strategies we’d normally have to start reading up on at this kind of load.
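To make the recording step concrete, here is a minimal sketch of a client-side vote write, assuming a top-level votes collection and the v8 JavaScript SDK (both illustrative choices, not anything prescribed by Firebase):

import firebase from "firebase/app";
import "firebase/firestore";

// Assumes firebase.initializeApp(...) has already been called.
// Every vote becomes its own document, so clients never contend
// on a shared counter document.
async function castVote(contestant: string): Promise<void> {
  await firebase.firestore().collection("votes").add({
    contestant,
    castAt: firebase.firestore.FieldValue.serverTimestamp(),
  });
}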

B) Pipe the data over to Cloud Dataflow with Cloud Pub/Sub

For each created document we’ll trigger a Cloud Function that publishes a Cloud Pub/Sub message on a topic of our choice.

The data in the message should be kept to a minimum, just enough for your Cloud Dataflow implementation to do its computation, because the amount of data that you shuffle around also counts toward your API quotas.

In the most minimal scenario, we’d only need to pass over the contestant identifier.

{
  contestant: "Alice"
}
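A sketch of what such a function could look like, assuming a votes collection, a Pub/Sub topic also named votes, the firebase-functions v1 API and the @google-cloud/pubsub client (all of these names are illustrative):

import * as functions from "firebase-functions";
import { PubSub } from "@google-cloud/pubsub";

const pubsub = new PubSub();

// Fires once for every vote document that a client creates.
export const onVoteCreated = functions.firestore
  .document("votes/{voteId}")
  .onCreate(async (snapshot) => {
    const vote = snapshot.data();
    // Forward only the minimal payload that the pipeline needs.
    await pubsub.topic("votes").publishMessage({
      json: { contestant: vote.contestant },
    });
  });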

C) Collecting and transforming data

We use Cloud Dataflow to consume the data posted to the topic on which we published votes in step B, above.

Cloud Dataflow is a fully managed data pipeline for consuming and transforming data at virtually unlimited bandwidth. We’re going to continuously consume a few thousand Pub/Sub messages per second as a stream and process them at a pace that Firestore can handle (about once per second).

Image source: https://cloud.google.com/dataflow/model/windowing#fixed-time-windows

The Beam SDK that we’re using to develop the pipeline allows us to create fixed windows of time, in which we group and process the data that we’ve received since the window opened.

Within each window we’ll group the votes per contestant, count them, and publish a message on a Pub/Sub topic.
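The pipeline itself is written with the Beam SDK (in Java or Python), which deserves its own article; but conceptually, the work done for each fixed window boils down to something like the sketch below (the types and function names are made up for illustration, and this is not Beam code):

interface VoteMessage {
  contestant: string;
}

interface WindowResult {
  contestant: string;
  votes: number;
}

// Group and count the votes received within one fixed window of time.
function aggregateWindow(messages: VoteMessage[]): WindowResult[] {
  const counts = new Map<string, number>();
  for (const message of messages) {
    counts.set(message.contestant, (counts.get(message.contestant) ?? 0) + 1);
  }
  // One result per contestant, i.e. at most five messages per window.
  return Array.from(counts, ([contestant, votes]) => ({ contestant, votes }));
}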

D) Pipe the data back to Firebase with Cloud Pub/Sub

For each window of time we’ll get up to five Pub/Sub messages (one per contestant), each containing the total number of votes for that contestant within that window.

{
  contestant: "Alice",
  votes: 123
}

E) Save the aggregated data

We create a Cloud Function that triggers on the Pub/Sub topic to which we published our aggregated data above.

At this stage we know for sure that the function only triggers around every two seconds for each contestant. That allows us to safely run transactional writes to a Cloud Firestore document without the risk of data contention.
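A sketch of that function, assuming a topic named aggregated-votes, a scoreboard collection keyed by contestant, and the firebase-functions v1 API (all names are illustrative):

import * as functions from "firebase-functions";
import * as admin from "firebase-admin";

admin.initializeApp();

// Fires roughly once per contestant per window with the pre-aggregated count.
export const onAggregatedVotes = functions.pubsub
  .topic("aggregated-votes")
  .onPublish(async (message) => {
    const { contestant, votes } = message.json;
    const ref = admin.firestore().collection("scoreboard").doc(contestant);

    // With at most one write per contestant every couple of seconds,
    // this transaction stays well clear of contention limits.
    await admin.firestore().runTransaction(async (tx) => {
      const snapshot = await tx.get(ref);
      const current = snapshot.data()?.votes ?? 0;
      tx.set(ref, { votes: current + votes }, { merge: true });
    });
  });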

Finally

This is just a conceptual walkthrough of how you can accomplish real-time counting and aggregation of high volumes of data. The actual implementation of the Cloud Dataflow pipeline is a chapter of its own. Getting started with it can be a tough nut to crack, but there are plenty of examples to be found on the website, and I’m thinking I might dedicate a separate article to that after this one.

Please let me know what you think in the comments below. I’d be really happy to hear your own experiences and stories of dealing with scenarios like this: what were your challenges and how did you handle them?
