Stream Millions of events from your client to BigQuery in a Serverless way, Part #1

Abdul Rahman Babil
Google Cloud - Community
4 min read · Mar 21, 2021

One of the keys to success for startups is DATA. Collecting data has never been a bad idea; it may be costly in time and resources, but it's the treasure you will use to understand your product and take your platform to the next level with ML.

I'm going to show you a way to handle millions of data events in a scalable way, using serverless services provided by GCP that also have a free tier!

This article is split into two parts:

Part #1: Building an API to receive raw data from clients and push it to Pub/Sub

Part #2: Running ETL Jobs to transfer and load events into BigQuery

You can find the code for the API, the gcloud commands to create all the required resources, and more in-depth details in the GitHub repository.

Events flow

Producer API!

In most cases you have one endpoint to receive data from the client side, and the client may send a batch of events to it. This endpoint should be reliable and available all the time, with low latency and a high capacity for scaling. For this example I'll choose a simple way to build this API: Cloud Run on GCP.

Cloud Run hosts your containerized app in a serverless way, scaling from 0 to N based on your traffic. It's very easy to deploy and lets you focus on building your application.

I used Node.js with Express to write a simple endpoint that accepts a JSON array of JSON events.

We expect millions of events in a short period of time. These data could be low priority for your app, so there's no need to store them in your primary DB; instead you can store them in a data warehouse/data lake. In this example I will use BigQuery as the final destination for the data.

Stackdriver and Pub/Sub

Question: how can we send events into Pub/Sub?

Usually, by using the Pub/Sub SDK: it provides a set of functions to publish messages to a Pub/Sub topic, and the topic then delivers them to its subscriptions. This approach is fine, but it has some cons, such as the network latency of making an HTTP request to the Pub/Sub service on every API call, or a higher cost when you publish only a small number of messages per call.
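For comparison, here is a minimal sketch of that direct-SDK approach (not the path this article takes), assuming a topic named raw-events and the @google-cloud/pubsub Node.js client:

const {PubSub} = require('@google-cloud/pubsub');
const topic = new PubSub().topic('raw-events'); // topic name is an assumption

// called from the request handler: one publish round-trip per event
async function publishEvents(events) {
  await Promise.all(events.map(event =>
    topic.publishMessage({data: Buffer.from(JSON.stringify(event))})
  ));
}

Every request ends up waiting on those publish calls, which is exactly the latency and cost overhead mentioned above.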

The other way is to use Stackdriver (Cloud Logging) with Stackdriver sinks. By default Stackdriver collects logs from all GCP services, so any logs written to stdout will be collected, and you have the option to store these logs as text files in Cloud Storage, or to route them to BigQuery or Pub/Sub.

So first we have to make the API write events to stdout in JSON format; for Node.js I recommend using "Bunyan".

const express = require('express');
const bunyan = require('bunyan');

// structured JSON logger writing to stdout
const logger = bunyan.createLogger({
  name: 'events-service',
  streams: [{stream: process.stdout, level: 'info'}]
});

const app = express();
app.use(express.json()); // parse the incoming JSON array body

app.post('/receive', (req, res) => {
  req.body.forEach(event => {
    event.receive_timestamp = Date.now();
    logger.info(event); // each event becomes one structured log line
  });
  res.status(200).send({done: true});
});

app.listen(process.env.PORT || 8080); // Cloud Run injects PORT
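A minimal Dockerfile for this service could look something like the sketch below (the Node version and the entry file name index.js are assumptions; adapt them to your project):

# minimal Dockerfile sketch; entry file name is an assumption
FROM node:14-slim
WORKDIR /app
COPY package*.json ./
RUN npm install --only=production
COPY . .
CMD ["node", "index.js"]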

After preparing the code + Dockerfile, we need to deploy it on Cloud Run. The next step is to create a Stackdriver sink that directs these logs to a Pub/Sub topic; you can create the sink from the GCP console or with the gcloud CLI.
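Roughly, the two gcloud steps could look like this (the service name, topic name, region and PROJECT_ID are placeholders; the repository has the exact commands):

# deploy the containerized producer API to Cloud Run
gcloud run deploy events-producer \
  --image gcr.io/PROJECT_ID/events-producer \
  --platform managed --region us-central1 --allow-unauthenticated

# route the events-service logs to a Pub/Sub topic
gcloud logging sinks create events-to-pubsub \
  pubsub.googleapis.com/projects/PROJECT_ID/topics/raw-events \
  --log-filter='jsonPayload.name="events-service"'

Note that creating the sink prints a writer identity (a service account); grant it the Pub/Sub Publisher role on the topic, otherwise the exported logs will never arrive.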

It's better to exclude these logs from Stackdriver's _Default sink, just to save cost!
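One way to do that (assuming a gcloud version that supports sink exclusions; the exclusion name is an assumption) is something like:

gcloud logging sinks update _Default \
  --add-exclusion=name=exclude-events,filter='jsonPayload.name="events-service"'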

You can call your Producer API with some events, then go to Cloud Logging (Stackdriver) and search for logs with this filter:

jsonPayload.name="events-service"

These are the events that were sent when the Producer API was called, now landed in the logs. Within a few seconds, the messages will be pushed into the target topic, waiting for consumer jobs to do the ETL work!

Congratulations! Your events are waiting in Pub/Sub with just a few lines of code. No need to worry about the number of events clients are sending; Pub/Sub and Cloud Run will handle it. Bear with me for part #2, where you'll transform and load these events into BigQuery.

Feel free to share your thoughts in the comments below. If you have any questions or feedback, don't hesitate to ask!

Continue reading Part #2: Running ETL Jobs to transfer and load events into BigQuery.
