From Naive to User-Ready: Building a Big Data BI report

Sigal Shaharabani
Published in Israeli Tech Radar
Feb 21, 2023 · 7 min read

Business Intelligence (BI) reports are a great tool for viewing and analyzing your data in tabular and graphical formats. But sometimes the volume of the data is greater than you expected, and you find yourself scaling a Big Data BI report, wondering what you did wrong and whether you should scrap everything and start again.

If you can relate to this issue, you, like me, probably wrote a naive BI solution that became a Big Data BI problem, or perhaps you’re about to begin and want to know what’s ahead of you. In this post I will share the journey and the decisions we made, and how we built the architecture to be flexible, scalable and ready for users.

It all started with a search engine: I was working on a team that developed a search engine, and they wanted to analyze the engine’s search journey. The plan was to run experiments of saved searches against different implementations of the search engine, so they would know what happens when they:

  • Fix a bug
  • Change the algorithm
  • Add/Remove data sources

The naive approach was to add an API endpoint to the search engine that returns the entire journey and the results. They did, but it was really slow and they still didn’t know how to visualize the results.

I suggested an alternative: why don’t we run the search engine in a special mode that accumulates events during the search journey and sends them to a message bus? Once the data is in the message bus, we’ll figure out the rest based on our joint experience.

My team leader was willing and we started the journey, with only two limitations:

  1. Since this feature is for research purposes, use managed cloud services to save us the hassle; in our case, managed services on the Google Cloud Platform (aka GCP)
  2. Use the BI tool they were accustomed to: Metabase

Our great plan was simple:

  1. Run search requests
  2. Do “something” to put the events on GCP for analysis
[Diagram: Send events from the search engine to Google Cloud Platform]
Send all the events to the Google Cloud to be analyzed

Our starting point: send all the events to a message bus, in our case Google’s PubSub.

Since the search engine is written in Java with Spring and Spring Boot, we used the PubSub integration provided by Google and Spring (Spring Cloud GCP).
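For illustration, here is a minimal sketch of what publishing a journey event through that integration could look like. The PubSubTemplate API is real (it ships with the spring-cloud-gcp-starter-pubsub dependency), but the class, topic name and payload convention below are assumptions rather than the project’s actual code.

```java
// A hedged sketch: publish search-journey events with Spring Cloud GCP's PubSubTemplate.
// Requires the spring-cloud-gcp-starter-pubsub dependency; the topic name and the
// JSON payload convention are hypothetical.
import com.google.cloud.spring.pubsub.core.PubSubTemplate;
import org.springframework.stereotype.Component;

@Component
public class SearchJourneyEventPublisher {

    private static final String TOPIC = "search-journey-events"; // hypothetical topic

    private final PubSubTemplate pubSubTemplate;

    public SearchJourneyEventPublisher(PubSubTemplate pubSubTemplate) {
        this.pubSubTemplate = pubSubTemplate;
    }

    // Called only when the engine runs in the special "experiment" mode;
    // each event is serialized to JSON by the caller.
    public void publish(String eventJson) {
        pubSubTemplate.publish(TOPIC, eventJson);
    }
}
```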

This brought us to:

[Diagram: The search engine sends data to PubSub; BI reports potentially extract data from PubSub]
Data is in PubSub, make a BI report from it

How do we make a BI report in Metabase from the data?

Our research showed that it’s best to direct the data from PubSub into a relational database instance, and point the report at that database.

For our database we chose Google’s Cloud SQL. To direct the data into it we chose Google’s Dataflow, a unified stream and batch data processing service that is serverless, fast, and cost-effective.

The benefits of using Google’s Dataflow to stream the data from PubSub to different targets are that it is built to scale (automatically and manually), it provides both templates and Apache Beam’s SDK for writing pipelines, and it has excellent support for Google Cloud. Using its automatic scaling feature allowed us not to worry about tuning the jobs repeatedly.

To allow analysts to also read and analyze the raw data, we used Dataflow to also push the events to Google’s BigQuery.
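To make the moving parts concrete, here is a minimal Apache Beam sketch of the streaming job: read raw events from PubSub and stream them into BigQuery (the real pipeline also wrote to Cloud SQL). The subscription, project, dataset and table names are illustrative, and a production job would parse the event JSON instead of storing the raw payload.

```java
// A hedged sketch of the streaming Dataflow job: PubSub -> BigQuery.
// Names are hypothetical; the BigQuery table is assumed to already exist.
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class SearchEventsToBigQuery {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        options.as(StreamingOptions.class).setStreaming(true);
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            // Read raw JSON events from the PubSub subscription (hypothetical name).
            .apply("ReadEvents", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/search-journey-events"))
            // Wrap each payload in a BigQuery row; a real job would parse the JSON fields.
            .apply("ToTableRow", MapElements
                .into(TypeDescriptor.of(TableRow.class))
                .via(json -> new TableRow().set("payload", json)))
            .setCoder(TableRowJsonCoder.of())
            // Stream the rows into the raw-events table (hypothetical name).
            .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:search_experiments.raw_events")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        pipeline.run();
    }
}
```

When the job is submitted with the Dataflow runner, options such as the maximum number of workers and the autoscaling algorithm can be passed on the same pipeline options, which is what let us avoid re-tuning the jobs by hand.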

[Diagram: The search engine sends data to PubSub; Dataflow pulls data from PubSub and puts it in both BigQuery and Cloud SQL; the BI report reads data from Cloud SQL]
Final yet naive architecture

You may rightly ask: why are we putting the events in both Cloud SQL and BigQuery? The reality is that BigQuery would give the fastest results for the analysts’ queries over the raw data, but Metabase doesn’t (at least as of 2021) connect to BigQuery, so we chose to have the reporting tool query a Cloud SQL instance and directed the analysts to BigQuery, since we regard the data in it as the source of truth.

Well, we’re done, aren’t we? Let’s start testing. I ran a few experiments, opened Metabase, and the reports simply would not render:

[Photo by Mike van den Bos on Unsplash]

This actually makes sense: if I need to put the raw data in BigQuery for research, how could a relational database handle the same raw data at the same speed?

Have no fear! Since our reports only use SQL summary (aggregate) functions, we can use the Dataflow jobs to aggregate the data and let the BI report query the aggregated results.

[Diagram: The search engine sends events to PubSub; Dataflow sends the data to BigQuery in mini-batches and writes the events to a file in Cloud Storage (a new file every 5 minutes); a new Dataflow job reads the data from the files, aggregates it and writes it to Cloud SQL; the report queries Cloud SQL]
Final and less naive solution

To allow data aggregation, we wrote files with the raw data to Google’s Cloud Storage every 5 minutes, and added a new Dataflow job to read new files, aggregate their data and write it to Cloud SQL for the report to read. A sketch of that second job appears below.
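The sketch reads the 5-minute raw-event files from Cloud Storage, counts events per experiment, and writes the aggregates to Cloud SQL through Beam’s JdbcIO. The bucket path, table, column names and the parsing helper are assumptions for illustration; the real job computed the aggregations the report actually needed.

```java
// A hedged sketch of the aggregation Dataflow job: Cloud Storage files -> Cloud SQL.
// Requires beam-sdks-java-io-jdbc and the Postgres JDBC driver; all names are hypothetical.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class AggregateEventsToCloudSql {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            // Read the raw-event files that the streaming job writes every 5 minutes.
            .apply("ReadFiles", TextIO.read().from("gs://search-experiments/raw/*.json"))
            // Key each event line by its experiment id (placeholder parser below).
            .apply("KeyByExperiment", MapElements
                .into(TypeDescriptors.strings())
                .via(AggregateEventsToCloudSql::extractExperimentId))
            // Count events per experiment.
            .apply("CountPerExperiment", Count.perElement())
            // Write one aggregate row per experiment into Cloud SQL (Postgres here).
            .apply("WriteToCloudSql", JdbcIO.<KV<String, Long>>write()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "org.postgresql.Driver",
                    "jdbc:postgresql://cloud-sql-host:5432/reports")) // hypothetical connection
                .withStatement("INSERT INTO experiment_event_counts (experiment_id, event_count) VALUES (?, ?)")
                .withPreparedStatementSetter((element, statement) -> {
                    statement.setString(1, element.getKey());
                    statement.setLong(2, element.getValue());
                }));

        pipeline.run();
    }

    // Placeholder: the real events are JSON and the experiment id is one of their fields.
    private static String extractExperimentId(String eventJsonLine) {
        return eventJsonLine; // simplified for the sketch
    }
}
```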

I started testing again, and I was glad to see that the reports were now rendered in Metabase.

Okay, I was ready for the users. We made a presentation and a demo and told the users to “go for it”. The excitement was high, and so was the usage, and very quickly I got complaints that the reports were too slow:

[Photo by Mike van den Bos on Unsplash]

Oh boy, what can I do now?

What if we:

  • Improve the table indexes
  • Scale the Cloud SQL instance vertically and horizontally
  • Direct the Metabase reports to a read replica of the Cloud SQL instance

These actions definitely brought improvements, but we always found a bigger experiment whose report didn’t render fast enough and, more often than not, ended with an SQL timeout.

New Idea: What if we create a separate job that runs once an hour and computes the entire report from the aggregated data?

That only moved the SQL timeouts from the report rendering to the report computation job.

Back to the drawing board — Why isn’t this working?

The problem begins with the unpredictable volume of data:

  • The experiments system is very popular and is executed tens of times during a working day
  • The number of search requests in each experiment varies between 1 and 10,000
  • Each search request produces roughly 100,000 events to PubSub

At the high end, that is on the order of a billion events per experiment (10,000 searches × ~100,000 events). We must treat the report as a Big Data report, but Cloud SQL is not a Big Data database.

Cloud SQL allows running instances of Postgres, MySQL and SQL Server, which are row-based databases; however, column-based databases are better suited for Big Data BI queries. The following diagram shows a simplistic example of a row-based representation of a table vs. a columnar representation of the same table:

[Diagram: A table of employee bonuses; the data can be stored with easy access to each row or to each column]
Row vs. columnar representation of the same data

Though this is very simplistic, we can see from how the data is organized that in a row-based database we can quickly find the bonus of the employee with ID number 2, whereas in a column-based database it is easier to calculate the total bonuses given to all employees.
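Here is a toy code illustration of the difference, using the same hypothetical bonus table (not anything from the real system):

```java
// Row-oriented vs. column-oriented storage of the employee-bonus table from the
// diagram above. This is a toy model: real databases add pages, compression, etc.
import java.util.List;

public class RowVsColumnar {

    // Row layout: each record keeps all of its fields together (Java 16+ record).
    record EmployeeRow(int id, String name, double bonus) {}

    public static void main(String[] args) {
        List<EmployeeRow> rows = List.of(
            new EmployeeRow(1, "Dana", 100.0),
            new EmployeeRow(2, "Noa", 250.0),
            new EmployeeRow(3, "Omer", 175.0));

        // Columnar layout: one array per column, aligned by position.
        int[] ids = {1, 2, 3};
        String[] names = {"Dana", "Noa", "Omer"};
        double[] bonuses = {100.0, 250.0, 175.0};

        // Row layout shines at point lookups: employee 2's whole record is in one place.
        System.out.println("Employee 2 (row layout): " + rows.get(1));

        // In the columnar layout the same lookup touches every column array...
        System.out.println("Employee 2 (columnar): id=" + ids[1]
            + ", name=" + names[1] + ", bonus=" + bonuses[1]);

        // ...but a whole-column aggregation scans just one compact array.
        double total = 0;
        for (double bonus : bonuses) {
            total += bonus;
        }
        System.out.println("Total bonuses (columnar scan): " + total);
    }
}
```

This is essentially why our summary-heavy report queries favor a columnar engine.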

How does that relate to our problem?

Since BI reports generally summarize the values of table columns, in the domain of Big Data BI a columnar database gives faster calculation times.

Does Google Cloud have a columnar solution?

BigQuery is actually a columnar data warehouse. But our BI tool, Metabase, doesn’t connect to BigQuery.

At this point, we realized we had to consider a different BI tool. With the help of the analytics team we considered two options:

  1. DBT — a platform that lets you reliably transform data using SQL and Python code; it can be used to create the transformation layer that would be presented in the report. To read more on DBT and BI reports, see here
  2. BigQuery saved queries together with the Tableau HTTP integration — because apparently other teams in the company use Tableau!

We ended up choosing BigQuery saved queries, and our architecture became much simpler:

[Diagram: The search engine sends events to PubSub, Dataflow writes the data from PubSub to BigQuery, and the reporting tool queries BigQuery]
The architecture that worked

This means that the solution was to replace Metabase with Tableau, which can integrate with BigQuery and give us much better query times.
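To give a feel for what the report computes, here is a hedged sketch of the kind of per-experiment summary involved, run through BigQuery’s Java client purely for illustration; in practice it lived as a BigQuery saved query that Tableau executed, and the table and column names here are hypothetical.

```java
// A hedged sketch: run a per-experiment summary over the raw events in BigQuery.
// Requires the google-cloud-bigquery client library; names are hypothetical.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class ExperimentReportQuery {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // The same kind of aggregation a saved query would hold, over the raw-events table.
        String sql =
            "SELECT experiment_id, COUNT(*) AS event_count " +
            "FROM `my-project.search_experiments.raw_events` " +
            "GROUP BY experiment_id " +
            "ORDER BY event_count DESC";

        TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        result.iterateAll().forEach(row ->
            System.out.println(row.get("experiment_id").getStringValue()
                + ": " + row.get("event_count").getLongValue()));
    }
}
```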

My takeaways

  1. If you cannot predict the volume of the reported data, build the system to scale, preferably auto-scale, at every step of the way (budget permitting, of course)
  2. The selection process for the database/data warehouse was flawed, as it was based only on the needs of the BI tool and on our naivety
  3. It is okay to challenge the requirements; as you can see, once we replaced the BI tool we had more technical options to choose from

More ideas

During the research phase I consulted with my colleagues in Tikal and was given two more ideas I’d like to mention:

  1. Use Presto (or Trino) — that was a very cool idea, but I think it would have been overkill since we only had one data source
  2. Perhaps a dedicated DBA would have been able to help with the Cloud SQL efforts — though that would have helped, I think we ended up making the right decision
