Moving to BigQuery — Data at our Fingertips
Improving data querying speed by upto 50x and savings by nearly 50%
Recently we announced that we were going Back to Basics in 2018 (read more here). This wasn’t only on the product front. In 2018, Kavin Bharti Mittal (our founder/CEO) decided it was also time to re-tool our systems and re-build our culture with one big loaded question:
“How do we supercharge the productivity of our teams so that they can move incredibly fast, take high-quality decisions, to build incredible experiences that our users will absolutely love?”
As a result, there is an incredible amount of work ongoing at Hike working backward of this question.
Today we’re particularly excited to talk about the incredible progress that our team has made on our Data Systems to do exactly that — Supercharge Productivity.
This year, we also began using OKRs, another incredible way to enhance teams productivity (another post coming up soon). Here are the Data teams OKRs we setup for Q3'18:
The Start: “Analyst Bandwidth Nahi Hai”
One of the most common phrases we heard from our PMs, EMs & DMs in 2017 was: “Analyst bandwidth nahi hain”. In English, that means “We don’t have any analyst bandwidth”. For a long time, we thought this was ok. It meant that a lot of work was happening and that we had to prioritize. Well, not so much.
Turns out, we had to completely rethink our data systems, given the complexity of our system, and the billions of rows of data being generated everyday. From our OKRs, you can see we had 3 important goals:
- Data Query should be Extremely Quick (seconds to minutes for most queries)
- Everything should be Automated & Self-Serve (Anyone should be able to pull out data)
- Reduce Cost while doing all of this
- Make sure “Analyst Bandwidth nahi hai” was a phrase that was never used again :)
Our move to BQ was key to achieving this goal
Hue + Hive — Our Previous Data Processing System
Hike is generating more than 10 Billion analytics events a day and which comes to around 4 Terabytes in size and we have accumulated more than 2 Petabytes of data. For processing this amount of data, we were spanning more than 100s of Virtual Machines.
Cost & Performance were clearly 2 big nightmares for us.
Previously our data storage backend was Google Cloud Storage or GCS and the processing unit was Hive on Dataproc. We used to run HQL batch jobs on Hive. All the summarized tables were stored in Hive and BigQuery. We were using BigQuery (BQ) only for the Business Intelligence Tool (Tableau). For all ad hoc analysis we were using HUE .
Cost of dataproc clusters
We were using three different dataproc clusters for processing. One for ad hoc analysis via HUE. Second for all aggregation batch jobs and third for light weight jobs which needed resources in a short time. We were spending quite a lot of money on these dataproc clusters.
Performance on HUE for ad-hoc analysis
All data (raw and summarized) was stored on GCS. When queries ran on HUE, it fetched data from GCS buckets to worker machines and then processed it. This was a slow operation as it involved data migration from GCS. Analysts had to wait a longer time for the results. Many times the expiry of preemptible instances further delayed job completion. Moreover, out-scaling of cluster VMs took some more time in case the cluster capacity was full.
Intro BigQuery (BQ)
After much in-depth work and analysis, to overcome cost and performance challenges, we planned the BiqQuery shift. Here’s what we did
Step 1: BigQuery As Final Data Warehouse and an ad hoc analysis tool:
By making BigQuery the final data warehouse we aimed to make our analysts and other teams more productive by speeding up data extraction. Running queries in BQ takes seconds whereas running the same query on HUE took a longer time. BQ is more than 10x faster for long running queries. As our raw logs were present only in Hive, moving all the raw logs to BQ was not a good move and hence we plan to move last 90 days raw logs to BQ. This will cover more than 95% of our query cases and in case there is a need for older logs, we can use federated tables to query those.
Using BigQuery with an on-demand query pricing is costlier and hence we opted for BigQuery flat pricing model. Now, we can run all the ad hoc queries on BQ without worrying about the query cost.
Step 2: Move to Clustered tables in BigQuery:
In Hive we had the flexibility of creating partitions on multiple columns which helped decrease data scan. BigQuery had the option to create partitions only on date column or ingestion date. We thought of splitting daily raw tables into multiple tables to make our queries faster but this was not a good approach as it involved too many unions to fetch every day’s data. Then Google released a Beta version of Clustered tables. It solved the problem of multiple column partitioning. We also setup a data pipeline where the hourly data is imported into an hourly table in BigQuery and at the end of every day we import data from the hourly table into the clustered table.
Following is the comparison of running a query on Hive and BigQuery with the clustered table. In both the systems, there were enough resources available. In BigQuery, this took around 7.5 seconds whereas in Hive it took around 518 seconds or 70x more time!
Step 3: Moving all the batch jobs from Hive to BigQuery:
Using BigQuery only for ad hoc analysis and Tableau backend was not cost efficient. We already had another system where we were running all batch jobs with Hive on dataproc. Since there is a flat pricing model for BQ, we shifted all aggregation jobs to BQ. We were earlier using our homegrown solution for scheduling and running batch jobs. It was built atop Hive. But we now needed a similar framework for aggregation on BigQuery which could be used by any analyst to schedule a batch job with dependency on different job instances. We considered Airflow and Google Dataprep but these did not meet the requirements we had. So, we built a web-based tool which can enable Analysts to schedule batch jobs in Standard SQL or BQ legacy SQL style. These jobs are executed using Airflow DAG which test, run and retry these scheduled jobs. All the inter job dependencies and external dependencies are handled by Airflow DAG itself.
Step 4: Using require_partition_filter:
Sometimes Analysts can submit queries without thinking of the amount of data the query will scan. For instance:
This query will scan the whole table and if we use on-demand price then the cost for this query will shoot up to bizarre levels. We can control these things by setting require_partition_filter=true during the table creation to block these kind of open-ended queries. By setting this filter, analysts are compelled to give partitioned column name or a logical query with a specific date range.
For hike.analytics_clustered table we have set require_partition_filter=true and BQ reports an error incase we fire queries without specifying partitioned column. For instance:
Here is the current architecture of our data pipeline after shifting completely to BigQuery
It has been over a month since we’ve shifted to BQ. Through the above optimizations, we’ve been able to reduce the cost and achieve incredible performance on our Query Times. Fetching data from BigQuery is now up to 50x faster! And our cost is nearly 50% less!
The Analysts, Product teams, Engineering teams and all those constantly reaching out to us with requests for data are thrilled with the quick turnaround time.
OKRs Key to Moving Work Forward
The “Analyst Bandwidth nahi hai” phrase has nearly become a thing of the past. We’re going to make sure this does indeed become history after we shoot down all our OKRs for Q3’18. OKRs have been an incredible tool to moving forward. They are transparent across the company and have been key to breaking silos to accelerate progress.
There’s a lot more work happening behind the scenes to achieve our Dream State. You should see our Q4’18 OKRs 😃. More on that in another post, soon enough!
Originally published at blog.hike.in on September 11, 2018.