Data scientists engineering a Data Platform
A 0–1 journey of building a data platform at Pelago
Introduction
Late in 2019, I received a call from BCG DV looking to hire for a data science position at one of the travel start-ups they had been incubating for less than a year. After a few months of discussions and consulting with friends and family, I decided to take a leap of faith and joined Pelago in April 2020.
Pelago is a travel experiences platform in South East Asia backed by Singapore Airlines. We strive to make our customers’ travel journey as seamless and worry-free as possible — from delivering personalised discovery feeds to ensuring a flexible and hassle-free booking and payments experience.
As if Covid wasn’t enough to make me question my decision, we had also decided to rebuild our website from the ground up. While such a situation might excite most software engineers (who are builders), it is far less appealing to a data scientist. The question we heard most often was, “What data science can you do without data?”
The proposed revamp meant we were tasked with architecting and building the Data Platform for Pelago. Our engineering team consisted mainly of software engineers who could offer some help and guidance, but whom we could not rely on entirely. We knew we couldn’t afford the luxury of the data and DevOps engineers that bigger data science teams are equipped with, so it would be up to us to pull off something we had never done before we could get to the data science we had originally joined to do.
Problem
Having found the rare opportunity to design and build something from the ground up, we weren’t going to waste it. A fellow data scientist and I devised a roadmap promising to heroically design and deliver the entire data platform: an event streaming platform to collect user behaviour data, consume it, and serve relevant product recommendations to our users in real time. The use cases would later be extended to handle all future Business Intelligence and Analytics needs.
High-level plan
While we were quite used to utilizing pre-built systems, building a production-ready one ourselves was going to be a first for us. We divided the data platform into four main components:
- Event streaming platform to receive and process events from the website
- Temporary and permanent data storage
- Workflows manager to periodically run ETL pipelines and build personalization algorithms aka models
- Personalization service to deliver real-time product recommendations
The plan was for two data scientists to build an in-house data platform that could collect sufficient data from Pelago’s users to eventually facilitate building personalization algorithms and performing data analyses to improve user experience.
The next part of this article elaborates on our experience of building each of these components. It goes into technical depth to evaluate the choices and trade-offs made at each step, while staying laser-focussed on our primary objective: improving user experience via personalization algorithms. Lastly, I will summarize our achievements and learnings from the project.
Architecture and component details
Pelago is hosted on Amazon Web Services (AWS), so most of the components in the architecture below are native to this cloud provider.
Event streaming platform
This is arguably the most critical component of the Data Platform. Its objective is two-fold: we wanted to enable analyses of the user behaviour tracked on our website, and we also needed our personalization engine to consume this data in real time to provide personalized product recommendations to our users.
However, before we got started, we needed to justify building this platform ourselves over using a third-party customer data platform like segment.io. While segment.io is a great framework for tracking and collecting events, with seamless integrations to multiple third-party applications, we found that accessing the data through it comes with high latency. Moreover, we wanted the flexibility to determine our own data structure in the destination database.
We decided to build an in-house streaming service instead, and once we set our eyes on Kafka, we never looked back. We chose the managed service provided by AWS, Amazon MSK; it was the obvious choice since we did not want the engineering overhead of managing our own cluster. After provisioning an MSK cluster (which takes about five minutes), all we needed to do was write producer and consumer code to write and read data at the specified intervals. Despite the overhead of having to spin up containers (as opposed to using serverless components), we loved how easily Kafka lets you create consumer groups so that the same message can be passed to multiple workers and processed differently.
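As an illustration, a minimal producer might look like the sketch below. It assumes the kafka-python client; the broker address, topic name, and event fields are placeholders rather than our actual configuration.

```python
# A minimal producer sketch using the kafka-python client.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker-1.msk.example:9092"],  # hypothetical MSK broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a single click event; the website would fire something like this
# every time a user interacts with a product card.
producer.send(
    "clickstream-events",
    {"user_id": "u-123", "event": "product_click", "product_id": "p-456"},
)
producer.flush()
```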
Data Storage
Simply put, we needed a place to store and access data from. Some data is accessed frequently and is maintained in a key-value store, while the remaining data is permanently stored in the data warehouse.
Specific consumer groups were created to move data from Kafka into Redshift, which serves as our data warehouse. At the same time, the more frequently accessed data was also written to DynamoDB by another consumer group. This two-pronged approach supported our data analytics needs via Redshift without impacting real-time product recommendations, which rely on lookups of contextual data such as the last five products clicked and the brand and category filters applied.
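A matching consumer sketch shows how one consumer group can keep DynamoDB up to date while another group independently loads the warehouse. This again assumes kafka-python plus boto3, and the topic, group, table name, and item shape are illustrative assumptions.

```python
# A minimal sketch of the DynamoDB-writing consumer group.
import json
import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream-events",
    bootstrap_servers=["broker-1.msk.example:9092"],
    group_id="dynamodb-writer",  # a separate group (e.g. a Redshift loader) gets its own copy
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

table = boto3.resource("dynamodb").Table("user-recent-activity")  # hypothetical table

for message in consumer:
    event = message.value
    # Keep only the contextual data the personalization service needs to look up.
    table.put_item(
        Item={
            "user_id": event["user_id"],
            "last_event": event["event"],
            "last_product_id": event["product_id"],
        }
    )
```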
Workflows manager
This component is the heart of our data platform, doing all the heavy lifting from Extract-Transform-Load (ETL) pipelines to scheduled model deployments. It was crucial for automating all our periodic tasks.
For these tasks, Apache Airflow is head and shoulders above AWS Step Functions. As an open-source platform built by Airbnb in 2014, it enjoys much better community support. We agreed to trade the extra overhead of setting up and deploying the platform for the model training and deployment customizations it supports.
Airflow enabled us to easily modularize our code into sizeable chunks (tasks), each of which executes depending on the success of its upstream task(s). Multiple such tasks combine to form a Directed Acyclic Graph (DAG). Each DAG can be thought of as a series of tasks performed to complete an objective. For example, a DAG to calculate product similarities would look like the sketch below.
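Here is a simplified sketch of such a DAG; the task names and Python callables are placeholders rather than our production pipeline.

```python
# A sketch of a daily product-similarity DAG (placeholder tasks only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_events(): ...        # placeholder: pull recent events from the warehouse
def build_features(): ...        # placeholder: turn raw events into product features
def compute_similarities(): ...  # placeholder: score pairwise product similarity
def publish_scores(): ...        # placeholder: write scores where the recommender can read them


with DAG(
    dag_id="product_similarity",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_events", python_callable=extract_events)
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    similarities = PythonOperator(task_id="compute_similarities", python_callable=compute_similarities)
    publish = PythonOperator(task_id="publish_scores", python_callable=publish_scores)

    # Each task runs only if its upstream task succeeded.
    extract >> features >> similarities >> publish
```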
Further, the status of our daily scheduled tasks can be easily visualized in the Airflow GUI, where coloured boxes indicate which tasks ran successfully and which ones failed or are still pending.
While this was the most complex component in our architecture, it has provided a significant reduction in resource utilization and costs.
Personalization service
Finally, we enabled our application’s core API to call this personalization service on demand, in real time. Product recommendations ranged from generic product listings on our destination pages to similar products on the product details page.
This service is a Lambda function (a serverless compute component) that can talk to other services to retrieve varied data. A connection to DynamoDB enables the identification of repeat users and the retrieval of product similarities. It also maintains a cache for quick lookups of product information, including product hierarchy, price, etc. Moreover, the entire business logic (product assortment, product boosting, etc.) resides in this single component.
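A condensed sketch of such a handler might look like the following; the table names, item shapes, and cache are assumptions for illustration, and the real business logic is more involved.

```python
# A sketch of the personalization Lambda handler (hypothetical tables and shapes).
import boto3

dynamodb = boto3.resource("dynamodb")
activity_table = dynamodb.Table("user-recent-activity")
similarity_table = dynamodb.Table("product-similarity")

PRODUCT_CACHE = {}  # product_id -> {hierarchy, price, ...}, warmed outside the handler


def handler(event, context):
    user_id = event["user_id"]

    # Identify repeat users via their recent activity in DynamoDB.
    activity = activity_table.get_item(Key={"user_id": user_id}).get("Item")

    candidates = []
    if activity:
        # Look up products similar to the last product the user clicked.
        row = similarity_table.get_item(
            Key={"product_id": activity["last_product_id"]}
        ).get("Item", {})
        candidates = row.get("similar_products", [])

    # Business logic (assortment rules, commercial boosts) would reshuffle the
    # candidates here before they are returned to the core API.
    return {"user_id": user_id, "recommendations": candidates[:10]}
```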
While devising our recommendation and personalization strategies we needed to be mindful of the cold start problem and therefore a higher weightage was given to content-based features. We also ensured that our commercial teams were able to boost certain products. This not only allowed our models the time to learn from data but also provided a baseline to evaluate uplift.
We won’t delve deeply into the algorithms that were tried and implemented in this article. In short, we built content- and image-based algorithms to calculate the similarity between products. This similarity score was combined with a multi-armed bandit score to determine the top-k products to recommend to the user.
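As a rough illustration of the idea (not our exact formulation), combining the two scores can be as simple as a weighted blend; the weights and inputs below are made up.

```python
# Illustrative blend of a content similarity score with a bandit score.
import numpy as np


def top_k_products(similarity: np.ndarray, bandit: np.ndarray,
                   k: int = 10, alpha: float = 0.7) -> np.ndarray:
    """Return indices of the k highest-ranked candidate products.

    similarity: content/image similarity of each candidate to the anchor product.
    bandit:     exploration-aware score for each candidate (e.g. from a
                multi-armed bandit over impressions and clicks).
    alpha:      assumed blending weight, not a production value.
    """
    combined = alpha * similarity + (1 - alpha) * bandit
    return np.argsort(combined)[::-1][:k]


# Example: 5 candidate products, recommend the top 3.
sim = np.array([0.9, 0.2, 0.7, 0.4, 0.6])
ban = np.array([0.1, 0.8, 0.3, 0.9, 0.5])
print(top_k_products(sim, ban, k=3))
```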
Results
The entire architecture was built in four months. We succeeded in capturing user behaviour across the platform, ensuring our data flowed immediately from user devices to the event streaming platform. Some workers process the data in real time to feed our personalization algorithms, while others move it into the data warehouse at specified intervals for permanent storage and data analyses.
On the upstream side, our models are trained daily using Airflow, picking up data from the data warehouse, processing it, and calculating product similarities and rankings.
Lastly, an A/B testing platform was designed to help us evaluate the performance of the personalization algorithms deployed online. We optimized for click-through rate (CTR) to increase engagement on the platform. Across multiple sections of the platform, we achieved a 30–50% uplift in CTR over random product recommendations and a 10–15% uplift over heuristically recommended ones.
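For reference, these uplift figures are simple relative CTR comparisons between variants; the counts in the sketch below are made up purely to illustrate the calculation.

```python
# Back-of-the-envelope CTR uplift between two variants (made-up counts).
def ctr(clicks: int, impressions: int) -> float:
    return clicks / impressions


control = ctr(clicks=200, impressions=10_000)  # e.g. random recommendations
variant = ctr(clicks=280, impressions=10_000)  # personalized recommendations

uplift = (variant - control) / control
print(f"CTR uplift: {uplift:.0%}")  # -> 40%
```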
Learnings along the way
Most significantly, we were able to optimize the platform for data science and analytics, achieving a level of data science maturity on day one that many organizations reach only years after inception.
Personally, understanding the intricacies and complexities of the system has held us in good stead, and should continue to do so, when handling requests for data science features.
We ensured that each of our tasks tied quantifiably to Pelago’s core values and organization-wide KPIs. For example, personalized products on the homepage aim to increase engagement by increasing time spent on the platform and reducing drop-off. Doing so kept us focused on creating value and made it easy to justify our work.
Lastly, working in a start-up is very different from working in larger organizations. We wore multiple hats: sometimes that of a product manager, at times DevOps, and most often that of a data engineer. Being able to get out of your comfort zone is key to even moderate success in the endeavour.
Future Work
Over the next few months, we will focus on utilizing data collected from the platform to generate insights and optimize personalization and search algorithms to improve user experience. We will also expand this platform to make data easily available across the organization. Adding a Business Intelligence layer over the data warehouse should solve the most immediate needs.
Further, we will also integrate this data with other third-party software for customer communications, marketing, etc. Lastly, as we expand to multiple destinations, we will scale and evolve this platform to maintain the same level of performance in terms of quality, latency, and ease of access across our use cases.
We are also expanding our team, having proved our worth over the last year. If you are someone who loves playing with data, please do check out our careers page.