Traveloka’s journey to stream analytics on Google Cloud Platform

Apr 16, 2018

By Rendy Bambang Jr., Data Engineering Lead, Traveloka; Agung Pratama, Data Engineer, Traveloka; and Gaurav Anand, Solutions Engineer, Google Cloud

Traveloka is a travel technology company based in Jakarta, Indonesia, currently operating in six countries. Founded in 2012 by former Silicon Valley engineers, its goal is to revolutionize human mobility.

One of the most strategic parts of our business is a streaming data processing pipeline that powers a number of use cases, including fraud detection, personalization, ads optimization, cross selling, A/B testing, and promotion eligibility. That pipeline is also used by our business analysts for monitoring and understanding business metrics, both for historical analysis and in real time.

In this post, we’ll describe how Traveloka recently migrated this pipeline from a legacy architecture to a multi-cloud solution that includes the Google Cloud Platform (GCP) data analytics platform.

Legacy architecture

Previously, our pipeline relied on Apache Kafka (for user-event ingestion), sharded MongoDB (as an operational data store), and sharded MemSQL (for real-time analytical queries). In this approach, data from Kafka was processed by our Java consumer and stored in MongoDB with user IDs as primary keys. For analytics, event data was also consumed from Kafka and stored in MemSQL, where BI tools could query it in real time.

[Diagram: legacy architecture, with Kafka ingestion feeding a Java consumer into MongoDB and MemSQL]
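To make that flow concrete, here is a minimal, hypothetical sketch of the legacy consumer path: a Kafka consumer that upserts each event into MongoDB keyed by user ID. The topic, connection strings, and field names are illustrative, not our production values.

```java
// Hypothetical sketch of the legacy flow: a Kafka consumer upserting
// user events into MongoDB, keyed by user ID. All names are illustrative.
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.bson.Document;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LegacyEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "user-events-consumer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("user-events"));

        MongoCollection<Document> events = MongoClients.create("mongodb://mongo:27017")
                .getDatabase("ods").getCollection("user_events");

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                // Key the document by user ID so the operational store
                // always holds the latest state per user.
                Document doc = Document.parse(record.value()).append("_id", record.key());
                events.replaceOne(Filters.eq("_id", record.key()), doc,
                        new ReplaceOptions().upsert(true));
            }
        }
    }
}
```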

As Traveloka grew over time, several problems emerged, including:

Based on these pain points, we drafted the following requirements for the future system:

Overview of the new multi-cloud architecture

We did our homework on technology that could support these requirements for our use case. In the end, we opted for a cross-cloud environment (AWS-GCP) including Google Cloud Pub/Sub (for events data ingestion), AWS DynamoDB (for operational data store), Google Cloud Dataflow (for stream processing), and Google BigQuery (for our data warehouse). (Note: Although Google Cloud Datastore was our preferred operational DB, the fact that it is not yet available in Singapore necessitated the use of DynamoDB.)

[Diagram: new multi-cloud architecture, with Cloud Pub/Sub ingestion, Cloud Dataflow processing, DynamoDB as the operational store, and BigQuery as the data warehouse]

Next, we’ll provide some more details about our new pipeline.

Events ingestion (Cloud Pub/Sub)

User events that occur on our services now go through Cloud Pub/Sub. Before publishing, however, we ensure the data complies with a defined schema, which is very important since inconsistent data (for example, an unannounced change in a field's datatype) can disrupt subscribers.

Cloud Pub/Sub is very convenient for us: unlike our previous architecture, which required capacity planning for event ingestion, it autoscales to handle changes in volume and throughput without any work on our part.
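As an illustration of this ingestion step, the sketch below publishes an event only after it passes a schema check. The project and topic names, and the validateAgainstSchema helper, are hypothetical stand-ins for whatever validation is actually in place.

```java
// Minimal, illustrative publisher: validate an event against a schema
// before publishing to Cloud Pub/Sub. Names are hypothetical.
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

import java.nio.charset.StandardCharsets;

public class EventPublisher {
    public static void publish(String eventJson) throws Exception {
        if (!validateAgainstSchema(eventJson)) {
            // Reject events that do not match the agreed schema so that
            // downstream subscribers never see unexpected datatypes.
            throw new IllegalArgumentException("Event does not match schema");
        }

        TopicName topic = TopicName.of("my-project", "user-events");
        Publisher publisher = Publisher.newBuilder(topic).build();
        try {
            PubsubMessage message = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFrom(eventJson, StandardCharsets.UTF_8))
                    .build();
            publisher.publish(message).get(); // wait for the server-assigned message ID
        } finally {
            publisher.shutdown();
        }
    }

    // Placeholder for the actual schema check (e.g. JSON Schema or Avro).
    private static boolean validateAgainstSchema(String eventJson) {
        return eventJson != null && !eventJson.isEmpty();
    }
}
```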

Data processing (Cloud Dataflow)

Cloud Dataflow is a fully managed GCP service for processing data in streaming or batch mode with equal reliability and expressiveness. Its ability to spawn new pipeline workers (and to autoscale) without any user intervention is a big advantage for us, especially when we have to backfill a pipeline to process historical data. For example, we've seen Cloud Dataflow spin up dozens of workers over the course of 30 minutes, sustaining a throughput of 40–50K events per second (throttled by the downstream operational datastore). In most cases, end-to-end latency is less than three seconds.

Cloud Dataflow’s unified programming model (aka Apache Beam) for batch and streaming processing is also very useful. To switch between modes, all we have to do is change the pipeline’s source and sink:
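The snippet below is a simplified sketch (not our production code) of that idea in the Beam Java SDK: a streaming run reads live events from Cloud Pub/Sub, a batch backfill reads the same events from Cloud Storage, and everything downstream, including the BigQuery sink, stays the same. Project, subscription, bucket, and table names are illustrative.

```java
// Sketch of Beam's unified model: the same transforms run in streaming
// or batch simply by swapping the source. Names are illustrative.
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.values.PCollection;

import java.util.Collections;

public class SwitchableSourcePipeline {
    public static void main(String[] args) {
        StreamingOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation().as(StreamingOptions.class);
        Pipeline p = Pipeline.create(options);

        // Streaming reads live events from Pub/Sub; batch (e.g. a backfill)
        // reads the same events archived on Cloud Storage.
        PCollection<String> events = options.isStreaming()
                ? p.apply("ReadPubsub", PubsubIO.readStrings().fromSubscription(
                        "projects/my-project/subscriptions/user-events-sub"))
                : p.apply("ReadGcs", TextIO.read().from("gs://my-bucket/events/*.json"));

        // ... shared parsing/enrichment transforms would go here ...

        TableSchema schema = new TableSchema().setFields(Collections.singletonList(
                new TableFieldSchema().setName("payload").setType("STRING")));

        // BigQueryIO picks streaming inserts or load jobs to match the mode.
        events.apply("WriteBigQuery", BigQueryIO.<String>write()
                .to("my-project:analytics.user_events")
                .withFormatFunction(json -> new TableRow().set("payload", json))
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
    }
}
```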

Cloud Dataflow’s windowing and trigger functions allow us to easily handle late-arriving data as well.
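For example, a window configuration along these lines (the durations are illustrative, not our production settings) emits early speculative counts, updates them whenever late events arrive within a ten-minute allowance, and accumulates panes so downstream consumers always see the latest total:

```java
// Illustrative windowing/trigger setup for late data: one-minute fixed
// windows with early, on-time, and late firings. Values are examples only.
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class LateDataWindowing {
    public static PCollection<Long> countPerMinute(PCollection<String> events) {
        return events
                .apply("WindowPerMinute",
                        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                                .triggering(AfterWatermark.pastEndOfWindow()
                                        // Speculative result 30s after the first element in a pane.
                                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                                                .plusDelayOf(Duration.standardSeconds(30)))
                                        // Re-emit whenever late elements arrive.
                                        .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
                                // Accept data up to ten minutes behind the watermark.
                                .withAllowedLateness(Duration.standardMinutes(10))
                                .accumulatingFiredPanes())
                .apply("CountEvents", Count.<String>globally().withoutDefaults());
    }
}
```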

Real-time data warehouse (BigQuery)

As part of the GCP streaming analytics solution, BigQuery's support for streaming data is a major advantage for our real-time analytics use case. Furthermore, we no longer worry about storing only 14 days of historical data, because BigQuery stores all of it for us, with the required compute resources scaling on demand. (Most of our analytics queries are time-based, so filtering on _PARTITIONTIME is fast and cost-effective for us.)
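As a hypothetical example of such a time-based query, the snippet below uses the BigQuery Java client to count events in a single day's partition by filtering on _PARTITIONTIME; the project, dataset, table, and column names are invented for illustration.

```java
// Hypothetical time-bounded analytics query: filtering on _PARTITIONTIME
// limits the scan to one day of an ingestion-time-partitioned table,
// which keeps the query fast and cheap. Names are illustrative.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class PartitionedQuery {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        String query =
                "SELECT event_type, COUNT(*) AS events\n"
              + "FROM `my-project.analytics.user_events`\n"
              + "WHERE _PARTITIONTIME = TIMESTAMP('2018-04-16')\n"
              + "GROUP BY event_type";

        TableResult result = bigquery.query(
                QueryJobConfiguration.newBuilder(query).setUseLegacySql(false).build());
        result.iterateAll().forEach(row ->
                System.out.printf("%s: %d%n",
                        row.get("event_type").getStringValue(),
                        row.get("events").getLongValue()));
    }
}
```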

Conclusion

For Traveloka, the ability to do large-scale stream analytics on GCP, without having to manually spin up resources or move data around, was the most important factor in the success of this project.

