Batch Processing Pipelines for Better Data Analysis
How we generate intelligible insights from our data warehouse using batch pipelines.
Gojek’s Data Warehouse, built by integrating data from multiple applications and sources, serves as a central point of analysis and a source of actionable insights. Our batch pipelines periodically process billions of data points to give our business teams an effective view of this data.
This post explains our approach to building batch pipelines that leverage complex data in an efficient way.
I will start by providing some context on our data warehouse and the data we store in it, and explain the batch processing use cases we tackle at Gojek. Then we’ll talk about how we run batch processing jobs and handle the scheduling dependencies between them.
Data Warehouse setup
At Gojek, we use Google BigQuery (BQ) as our Data Warehouse. All data points, ranging from bookings and searches to locations, are published to it in real time using Beast, our open-source tool.
We also push data from Kafka to our data lake, Google Cloud Storage (GCS). These data points vary in format, a detail that shouldn’t matter to our analysts: they need to gather insights from the data without having to know how it is stored.
To solve this, we wanted to create an abstraction such that the users of data only need to know what the data contains, not where it comes from or what format it is stored in.
Batch processing use cases
Typical use cases of batch processing at Gojek revolve around enriching real-time data with additional data points mined from huge amounts of historical data.
A few examples of use cases include:
Creating a customer profile:
In order to provide our customers with the most relevant discount and deal vouchers, we enrich customer profiles with the last few months of the customer’s order and search history. This enables our team of data analysts and data scientists to experiment with customer segmentation and targeting. This use case has been covered in much detail by my colleague Mayank in this blog.
Personalising search results:
In order to personalise the search results served up by our food delivery app GoFood, we leverage batch processing to gather insights about trending, popular, and highly-rated restaurants near the user that match their taste profile. More details around how we went about this use case are covered in this blog.
Running Batch pipelines
As I previously mentioned, the users of data usually don’t need to know the format in which the data is stored. They benefit from having a unified interface to interact with the data.
We leveraged DataFrames in Apache Spark to provide this unified interface.
DataFrames or Datasets
Spark provides an abstraction on top of the data underneath — called DataFrames or Datasets.
DataFrames are distributed collections of data organised into named columns.
Conceptually, a DataFrame is similar to a table in a relational database.
A few examples of reading from BigQuery and GCS are as follows:
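The snippet below is a minimal sketch of what such reads look like in PySpark. It assumes the spark-bigquery connector is available on the cluster, and the table, bucket, and column names are purely illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-pipeline-example").getOrCreate()

# Read a BigQuery table into a DataFrame using the spark-bigquery connector
# (the connector jar must be available on the cluster).
orders_df = (
    spark.read.format("bigquery")
    .option("table", "my-project.bookings.food_orders")  # illustrative table name
    .load()
)

# Read Parquet files from the GCS data lake into a DataFrame.
searches_df = spark.read.parquet("gs://my-data-lake/searches/2020/06/")

# From here on, both sources look the same to the analyst: just DataFrames.
recent_orders = orders_df.filter(orders_df.order_date >= "2020-01-01")
recent_orders.show(10)
```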
Using these clients makes it very easy for our analysts to read GCS and BigQuery data into Spark and interact with it.
Running Spark Jobs
We run our batch pipelines on Spark clusters hosted on Google Dataproc. Each time a batch job is triggered, we create an ephemeral cluster to run it, which means the cluster is destroyed after the batch job completes.
The batch jobs are written in PySpark, which all our analysts are familiar with, and which provides a good interface for interacting with Spark DataFrames.
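To make the ephemeral-cluster idea concrete, here is a rough sketch of the lifecycle using the google-cloud-dataproc Python client: create a cluster, submit a PySpark job, then tear the cluster down. The project, region, machine types, and GCS paths are placeholders, and this is not our exact orchestration code.

```python
from google.cloud import dataproc_v1

project_id, region = "my-project", "asia-southeast1"   # illustrative values
cluster_name = "ephemeral-batch-cluster"
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Spin up an ephemeral cluster for this run.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Submit the PySpark batch job and wait for it to finish.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://my-jobs/order_history.py"},
}
job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. The cluster is destroyed once the batch job completes.
cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```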
Scheduling Dependencies between Jobs
As the Spark jobs become more complex and handle many responsibilities, it becomes important to break them down into simpler jobs that can be better managed.
But this breakdown brings more challenges.
We now have to make sure that related jobs are scheduled with their dependencies on one another taken into account.
For example:
Say there are two jobs: the first calculates the last six months of a customer’s order history, and the second uses that order history to calculate the locations the customer prefers to order from. The second job must only run after the first has completed.
Our solution for handling such scheduling dependencies is Apache Airflow, a tool to programmatically schedule jobs and manage the dependencies between them.
The scheduling dependencies are written as a Directed Acyclic Graph (DAG) and we set a schedule for the DAG to run. Simple. 🙂
With Airflow, we are also able to assign retries to each job. If a job fails, Airflow reruns it by itself.
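As an illustration, here is a minimal sketch of a DAG for the order-history example above, with retries configured. It assumes Airflow 2.x and job scripts stored in GCS that are submitted to an existing Dataproc cluster via gcloud; all names, paths, and the schedule are hypothetical, and cluster creation and teardown are omitted for brevity.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

default_args = {
    "owner": "data-platform",
    "retries": 2,                      # Airflow reruns a failed task by itself
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="customer_order_profile",
    default_args=default_args,
    start_date=datetime(2020, 6, 1),
    schedule_interval="0 2 * * *",     # run once a day at 02:00
    catchup=False,
) as dag:

    # First job: calculate the last six months of order history.
    order_history = BashOperator(
        task_id="calculate_order_history",
        bash_command=(
            "gcloud dataproc jobs submit pyspark "
            "gs://my-jobs/order_history.py "
            "--cluster=batch-cluster --region=asia-southeast1"
        ),
    )

    # Second job: derive the customer's preferred order locations.
    preferred_locations = BashOperator(
        task_id="calculate_preferred_locations",
        bash_command=(
            "gcloud dataproc jobs submit pyspark "
            "gs://my-jobs/preferred_locations.py "
            "--cluster=batch-cluster --region=asia-southeast1"
        ),
    )

    # The second job only runs after the order history has been calculated.
    order_history >> preferred_locations
```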
As a final precaution, we have also added Slack integration and StatsD metrics to Airflow, so we get alerted when jobs have failed and need to be fixed.
So that’s all for this post. Hope you liked it! If you’d like to work on cool problems and help us scale a #SuperApp for Southeast Asia, make sure to check out gojek.jobs. Until next time. 🖖
Want our updates delivered straight to your inbox? Sign up for our newsletter!