Monitoring BigQuery costs in real time using Cloud Pub/Sub and Apache Beam

Nhat · Published in NE Digital · Jan 9, 2023

Background

The economy faced challenges in 2022, and 2023 is expected to be just as difficult, if not more so. In light of this, the Data Team made cost reduction a key performance indicator for the year. As the new year began, the Data Engineering Team recognized that achieving this goal would require proactively preventing issues and optimizing our BigQuery (BQ) costs.

We had already implemented various measures, such as daily reports and dedicated Data Engineers (DEs), but we were still reacting to issues rather than preventing them. In addition, ad hoc queries and new Looker Studio dashboards were proving difficult to manage.

The journey starts

We began researching potential solutions and eventually learned about Apache Beam and Google Cloud’s Dataflow service. We were excited about the possibility of using these tools to build a real-time solution for monitoring BQ costs.

Apache Beam is an open-source data processing framework that provides a unified programming model for defining both batch and streaming data processing pipelines. It enables developers to write code that can be executed on a variety of execution engines, such as Apache Flink, Apache Spark, and Google Cloud Dataflow.

GitHub: https://github.com/apache/beam
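To make the “unified model” claim concrete, here is a minimal sketch: the same pipeline code runs locally or on Dataflow purely by changing its options (the option values below are illustrative).

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "--runner=DirectRunner" for "--runner=DataflowRunner" (plus
# --project, --region and --temp_location) to run the same code on Dataflow.
options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["SELECT 1", "SELECT 2"])
     | "Measure" >> beam.Map(len)
     | "Print" >> beam.Map(print))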

After conducting some tests and proofs of concept, we decided to give it a try. We spent the next few days setting up a pipeline using Apache Beam to read data from a Cloud Pub/Sub topic that contained our BQ query logs.
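The skeleton of that pipeline is simple: read raw messages from Pub/Sub and parse each one as a JSON log entry. A sketch (the project and subscription names are placeholders):

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # Pub/Sub is a streaming source

p = beam.Pipeline(options=options)

logs = (
    p
    # Placeholder subscription -- point it at the subscription attached
    # to the topic that receives the BQ audit logs.
    | "ReadLogs" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/bq-audit-logs-sub")
    # Each Pub/Sub message carries a JSON-serialized LogEntry.
    | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
)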

Image for illustration only.

We set up a route from Cloud Logging to Pub/Sub, and used Apache Beam to read the data from Pub/Sub. We then applied a series of transformations to the data using Apache Beam.
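For reference, the Logging-to-Pub/Sub route is a log sink. A sketch using the google-cloud-logging client (the names are placeholders, and the filter targets the legacy AuditData “jobcompleted” entries; adjust it if your project emits the newer BigQueryAuditMetadata format):

from google.cloud import logging

client = logging.Client(project="my-project")

sink = client.sink(
    "bq-cost-logs-to-pubsub",
    filter_=(
        'resource.type="bigquery_resource" '
        'AND protoPayload.methodName="jobservice.jobcompleted"'
    ),
    destination="pubsub.googleapis.com/projects/my-project/topics/bq-audit-logs",
)
sink.create()  # grant the sink's writer identity publish rights on the topic
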

The first transformation we applied was filtering out any queries that were canceled or failed, as they are not relevant to our cost optimization goals. We could have also filtered out these events by applying the appropriate conditions when routing the events from Logging to Pub/Sub.
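As a sketch of that filter, continuing from the logs collection above (the AuditData field paths are assumptions; verify them against your own log entries):

def is_successful_query(entry):
    # Failed or cancelled jobs carry an "error" object in jobStatus;
    # successful jobs do not.
    job = (entry.get("protoPayload", {})
                .get("serviceData", {})
                .get("jobCompletedEvent", {})
                .get("job", {}))
    return "error" not in job.get("jobStatus", {})

completed = logs | "KeepSuccessful" >> beam.Filter(is_successful_query)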


Next, we aggregated the data by user and project, so that we could see which individuals and teams were generating the most BQ costs. We also added a few additional transformations to clean and enrich the data, such as adding the query duration and the number of bytes processed.
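Continuing the sketch, the aggregation keys each completed job by user and project, converts billed bytes into an approximate dollar figure, and sums per key over short windows. The rate of roughly US$5 per TiB scanned (the on-demand price around the time of writing) and the field paths are assumptions:

from apache_beam.transforms import window

TIB = 1024 ** 4
PRICE_PER_TIB_USD = 5.0  # assumed on-demand rate; check current pricing

def to_cost_record(entry):
    payload = entry["protoPayload"]
    job = payload["serviceData"]["jobCompletedEvent"]["job"]
    user = payload["authenticationInfo"]["principalEmail"]
    project = entry["resource"]["labels"]["project_id"]
    bytes_billed = int(job["jobStatistics"].get("totalBilledBytes", 0))
    return (user, project), bytes_billed / TIB * PRICE_PER_TIB_USD

costs = (
    completed
    | "ToCostRecord" >> beam.Map(to_cost_record)
    | "OneMinuteWindows" >> beam.WindowInto(window.FixedWindows(60))
    | "SumPerUserProject" >> beam.CombinePerKey(sum)
)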

One advantage of Apache Beam is that it enables us to store state and update it as necessary. In our scenario, we can maintain all of the cost information for our users within the state.
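Beam’s per-key state makes the running total straightforward. A minimal stateful DoFn over the keyed costs collection above:

from apache_beam.transforms.userstate import CombiningValueStateSpec

class AccumulateCost(beam.DoFn):
    # One running total per (user, project) key, managed by the runner.
    TOTAL = CombiningValueStateSpec("total_cost_usd", sum)

    def process(self, element, total=beam.DoFn.StateParam(TOTAL)):
        key, cost = element
        total.add(cost)
        yield key, total.read()  # emit the updated running total

running_totals = (
    costs
    # State is scoped per key *and* window, so re-window into the global
    # window to accumulate across the whole stream.
    | "GlobalWindow" >> beam.WindowInto(window.GlobalWindows())
    | "AccumulateCost" >> beam.ParDo(AccumulateCost())
)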


After the data had been transformed, we used Apache Beam to write it to another Cloud Pub/Sub topic. A Cloud Function (or a GKE Pod) triggered on every update to this topic and sent an email notification to the relevant team members whenever a user’s or team’s costs exceeded a certain threshold. This way, we could alert users and DEs about the issue and work together on a solution.
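The tail of the pipeline serializes each running total and publishes it to the alerting topic. The topic name and record fields below are illustrative placeholders:

(running_totals
    | "Serialize" >> beam.Map(lambda kv: json.dumps({
        "user": kv[0][0],
        "project": kv[0][1],
        "total_cost_usd": kv[1],
    }).encode("utf-8"))
    | "Publish" >> beam.io.WriteToPubSub(
        topic="projects/my-project/topics/bq-cost-alerts"))

p.run()

On the receiving side, a background Cloud Function subscribed to that topic could look like this (send_email is a hypothetical helper, not our exact implementation):

import base64
import json

COST_THRESHOLD_USD = 100.0  # example value; in practice it came from the UI below

def send_email(to, subject, body):
    # Hypothetical placeholder -- wire this to your mail provider or a chat webhook.
    print(f"ALERT for {to}: {subject}: {body}")

def notify_on_cost(event, context):
    # Pub/Sub-triggered entry point: event["data"] holds the base64-encoded payload.
    record = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    if record["total_cost_usd"] >= COST_THRESHOLD_USD:
        send_email(
            to=record["user"],
            subject="BigQuery cost alert",
            body=f"Costs for project {record['project']} reached "
                 f"${record['total_cost_usd']:.2f}",
        )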

With our real-time solution for monitoring BQ costs using Apache Beam and Cloud Pub/Sub in place, we wanted to make it even easier for our DE team to manage the threshold for triggering notifications. We decided to build a simple web UI that they could use to modify the threshold for various cases.


With the UI in place, DEs were able to quickly and easily modify the threshold for each case as needed. They no longer had to rely on complex code or manual configuration updates. Instead, they could simply use the UI to make changes on the fly.
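As an illustrative sketch of the backend behind such a UI (Flask and Firestore here are assumptions, not necessarily our stack), an endpoint could persist one threshold per case for the notifier to read instead of a hard-coded constant:

from flask import Flask, jsonify, request
from google.cloud import firestore

app = Flask(__name__)
db = firestore.Client()

@app.route("/thresholds/<case>", methods=["PUT"])
def set_threshold(case):
    # Store one threshold document per case; the notifier reads it
    # instead of a hard-coded constant.
    value = float(request.get_json()["threshold_usd"])
    db.collection("cost_thresholds").document(case).set({"threshold_usd": value})
    return jsonify({"case": case, "threshold_usd": value})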

With the help of Apache Beam and Cloud Pub/Sub, we were able to build a real-time solution for monitoring BQ costs. We were no longer relying on reactive measures and were able to proactively prevent issues and improve our handling of ad hoc queries or new Looker Studio dashboards.


In the future, we plan to adopt flat-rate pricing to keep costs within a fixed budget. We will still need to monitor BigQuery slot usage, however, so that we can easily adjust our logic to handle any changes.

Conclusions

Apache Beam and Pub/Sub are powerful tools that can handle a wide range of simple data processing tasks in real time. They are fully supported on GCP, which means they can easily be integrated with other GCP services. For example, if you wanted to extend the service by adding recommendations for users based on their queries, you could easily integrate with GCP’s Dataproc or Vertex AI. Overall, Apache Beam and Pub/Sub offer a convenient and efficient solution for a variety of data processing needs.

If you’re interested in other work by the Data Engineering team at NE Digital, you can find more information here: https://cloud.google.com/blog/products/data-analytics/ntuc-uses-datahub-data-platform-across-platforms-and-clouds.
