Stream Millions of events from your client to BigQuery in a Serverless way, Part #2

Abdul Rahman Babil
Google Cloud - Community
4 min read · Mar 21, 2021

This article is the continuation of a previous article:

Part #1: Building an API to receive raw data from clients and push it to Pub/Sub

Part #2: Running ETL jobs to transform and load events into BigQuery

You can check the GitHub repository to view the code and run this example; full instructions are there, and it takes just one command!

ETL Job deployment

The events subscription could hold 10 messages or 10 million messages, so the number of jobs will vary based on the number of outstanding messages. There are many ways to run background jobs on GCP: Cloud Functions, App Engine, GKE, Cloud Build, Cloud Dataflow, Compute Engine, and Cloud Run.

Cloud Dataflow is a managed solution for ETL, but it restricts you to Apache Beam, which is available in only a few programming languages. The approach in this article works with any programming language you prefer.

For this example I will go with Cloud Run. A job is an API call that runs for less than 10 minutes; during this period it pulls as many messages as possible, transforms each message (validate, map, filter, etc.), writes the transformed messages to a file (CSV or newline-delimited JSON), and calls the BigQuery API to append these events to a table.

We create a new Cloud Run service for the job. Make sure it is independent of the client API service and that only authenticated callers can invoke it. Set the request timeout to around 15 minutes, but the job must finish its work within 10 minutes because we will set the acknowledgement deadline to 600 seconds. The job keeps pulling messages from the events subscription for 7~8 minutes, then loads these events into BigQuery. Once BigQuery reports success, we acknowledge all processed messages in the events subscription so they don't get delivered again; finally, the job returns a 200 response to acknowledge the message coming from the invoke subscription.
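Here is a minimal sketch of that job loop using the Python Pub/Sub client. The project ID, subscription name, batch size, and time budget are assumptions, not values from the repository, and transform() / load_into_bigquery() are placeholders sketched in the following sections.

```python
import time

from google.api_core import exceptions
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"          # assumption: replace with your project
SUBSCRIPTION_ID = "events-sub"     # assumption: the events subscription
PULL_BUDGET_SECONDS = 7 * 60       # pull for ~7 minutes, then load
BATCH_SIZE = 1000                  # max messages per pull request

def run_etl_job():
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

    events, ack_ids = [], []
    deadline = time.monotonic() + PULL_BUDGET_SECONDS

    # Extract: keep pulling until the time budget runs out or the subscription is empty.
    while time.monotonic() < deadline:
        try:
            response = subscriber.pull(
                request={"subscription": sub_path, "max_messages": BATCH_SIZE},
                timeout=30,
            )
        except exceptions.DeadlineExceeded:
            break  # no messages arrived within the wait window
        if not response.received_messages:
            break  # subscription is (currently) empty
        for received in response.received_messages:
            events.append(received.message.data)
            ack_ids.append(received.ack_id)

    if not events:
        return  # nothing to load; the next trigger will call us again

    # Transform + Load: placeholders, sketched in the next sections.
    rows = [row for row in (transform(e) for e in events) if row is not None]
    load_into_bigquery(rows)

    # Acknowledge only after BigQuery confirms the load succeeded,
    # so unprocessed events are redelivered if anything fails.
    for i in range(0, len(ack_ids), BATCH_SIZE):
        subscriber.acknowledge(
            request={"subscription": sub_path, "ack_ids": ack_ids[i:i + BATCH_SIZE]}
        )
```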

ETL Job flow

Extract: clients’ events are now in the events subscription, waiting to be transformed and loaded into their final destination. The ETL job pulls batches of events for as long as the job is running and the subscription still has more events.

Transform: the job applies the required transformations to each event, such as mapping and filtering (sketched below).

Load: the job sends a large batch of transformed events to BigQuery.
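As an illustration, the transform step could look like the snippet below. The event fields (user_id, event_name, timestamp, properties) and the filtering rule are hypothetical; they depend entirely on what your clients send and what your table schema looks like. Rows that come back as None are dropped before loading.

```python
import json
from datetime import datetime, timezone

def transform(raw_bytes: bytes):
    """Map a raw Pub/Sub payload to a BigQuery row; return None to filter it out."""
    try:
        event = json.loads(raw_bytes)
    except json.JSONDecodeError:
        return None  # drop malformed payloads

    # Filter: ignore events we don't care about (hypothetical rule).
    if event.get("event_name") not in {"page_view", "click", "purchase"}:
        return None

    # Map: rename/normalize fields to match the BigQuery table schema.
    return {
        "user_id": str(event.get("user_id", "")),
        "event_name": event["event_name"],
        "event_time": event.get("timestamp")
        or datetime.now(timezone.utc).isoformat(),
        "properties": json.dumps(event.get("properties", {})),
    }
```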

ETL Load to BigQuery

Question: how can we load the transformed event data into BigQuery?

BigQuery has two ways to load data: streaming inserts and batch loads. For this example I recommend batch loading; it is cheaper than streaming inserts and more scalable. You can read about loading data here.

The main idea behind batch loading is to create a structured text file that contains the data rows, then create a load job in BigQuery to load the data from the file into a BigQuery table.
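A minimal sketch of that load step with the Python BigQuery client, assuming the transformed rows are dicts and the destination table already exists; the table ID is made up here.

```python
import io
import json

from google.cloud import bigquery

TABLE_ID = "my-project.analytics.events"  # assumption: project.dataset.table

def load_into_bigquery(rows):
    """Batch-load transformed rows as newline-delimited JSON."""
    client = bigquery.Client()

    # Build an in-memory NDJSON file: one JSON object per line.
    buffer = io.BytesIO(
        "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    )

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )

    load_job = client.load_table_from_file(buffer, TABLE_ID, job_config=job_config)
    load_job.result()  # wait for the load job to finish; raises on failure
```

For larger batches you could first stage the file in Cloud Storage and load from there, but loading from an in-memory file keeps the sketch self-contained.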

Preparing events for a batch load could take a few seconds, which would obviously make the API slower. The better solution is to send these events from the API to a queue service (Pub/Sub), then run jobs later that process a large batch of events and load it into BigQuery.

ETL Job trigger

Now, the ETL job needs to be invoked by an HTTP request to start processing. How can we do that?

1- Using Cloud Scheduler, send a Pub/Sub invoke message on a cron schedule, every 5 minutes for example.

Cons:

a- If the subscription doesn’t receive messages for an hour, the job will run multiple times and do nothing, since the subscription is empty.

b- If the subscription gets too many messages, a job instance has limited capacity to process them in one run, which means messages will be stuck in the subscription for a long time because the processing power doesn’t scale up when the load does.

2- Using Cloud Monitoring, monitor the number of undelivered messages in the events subscription and, based on that number, trigger X ETL jobs (a sketch of such a policy follows below).

You can use both Solution #1 and Solution #2: #1 gives you a regular ETL run every X minutes, while #2 adds more ETL jobs when events in the subscription grow at a fast pace, which helps run parallel ETL processing at the same time.
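For Solution #2, the alerting policy can be created in the Cloud Console or programmatically. Below is a rough sketch with the Cloud Monitoring Python client; it is an assumption-laden illustration, not the article’s exact setup. The subscription ID, threshold, and notification channel ID are placeholders, and the notification channel is assumed to be a Pub/Sub channel pointing at the invoke topic.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"  # placeholder
NOTIFICATION_CHANNEL = (   # placeholder: a Pub/Sub notification channel
    f"projects/{PROJECT_ID}/notificationChannels/1234567890"
)

def create_backlog_alert():
    client = monitoring_v3.AlertPolicyServiceClient()

    # Fire when the events subscription has too many undelivered messages.
    condition = monitoring_v3.AlertPolicy.Condition(
        display_name="events-sub backlog above 1000",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=(
                'resource.type = "pubsub_subscription" '
                'AND resource.labels.subscription_id = "events-sub" '
                'AND metric.type = '
                '"pubsub.googleapis.com/subscription/num_undelivered_messages"'
            ),
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=1000,
            duration=duration_pb2.Duration(seconds=60),
        ),
    )

    policy = monitoring_v3.AlertPolicy(
        display_name="events subscription backlog",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[condition],
        notification_channels=[NOTIFICATION_CHANNEL],
    )

    return client.create_alert_policy(
        name=f"projects/{PROJECT_ID}", alert_policy=policy
    )
```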

Unfortunately, an incident can send only one message

Alert policy over events subscription

We add an alerting policy on the events subscription. The policy has several conditions combined with an OR operator to trigger an incident; the incident sends a Pub/Sub message to the invoke topic, and the invoke subscription posts an HTTP call to trigger the ETL job and waits until the ETL job finishes processing. The incidents will keep sending invoke messages until the subscription is empty!
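On the receiving side, the ETL job is simply an HTTP endpoint on Cloud Run that the invoke (push) subscription calls. A minimal sketch with Flask, reusing the hypothetical run_etl_job() from the earlier snippet:

```python
from flask import Flask, request

from etl import run_etl_job  # hypothetical module containing the earlier sketch

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_invoke():
    envelope = request.get_json(silent=True)
    if not envelope or "message" not in envelope:
        return ("Bad Request: not a valid Pub/Sub push payload", 400)

    # The payload of the invoke message doesn't matter; it only tells us to run.
    run_etl_job()

    # Returning 2xx acknowledges the invoke message on the push subscription.
    return ("", 204)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

If this handler responds slowly or with a non-2xx status, Pub/Sub treats the delivery as failed and redelivers the invoke message, which is why the invoke subscription’s acknowledgement deadline is set to 600 seconds and the job keeps its work under roughly 10 minutes.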

Number of messages sent to invoke topic
Transformed events in BigQuery after the ETL job was invoked

Conclusion

This is one approach to handling big data coming from clients. The data will be delayed for a few minutes before reaching BigQuery, and you might face some challenges when your application receives a high number of events, but rest assured there is always a solution!

I would like to hear your thoughts and experience handling such scenarios: what kind of technologies would you use, and how hard would it be to build?
