The driving force behind Skyscanner? Data Science.

Hear it from the travel economy’s data scientist

Published in SMUBIA · 9 min read · Sep 26, 2019

Our club’s president Zexel (left) and Data Scientist Mr Peh (right)

Data Science prevails in most industries, especially in this pro-tech age: banking, e-commerce, healthcare and travel alike immerse themselves in the sea of data floating around.

One area of application is car hire at Skyscanner, where Singaporean data scientist Shuming Peh works. We invited the SMU alumnus to share his exposure to and experience in Data Science. Read on to discover his analytics projects in detail!

Big dogs, and how it all began

When asked to describe himself and how he got into the data science field, Mr Peh recalls,

“I previously worked at Electronic Arts (EA) in their data science team. I graduated from SMU with a Quantitative Finance major. I also run a data consultancy on the side and love big dogs, so if you require favours from me I accept payments such as playing with big dogs,” he says with a laugh.

“My journey in data science began in 2013 when I managed to secure a position at Ubisoft as a Business Intelligence Intern, which conveniently gave me a doorway to be a Data Analyst at EA in 2014, and eventually I moved on to Skyscanner in 2017.”

Since then, he has stuck to the role of a data scientist.

Facing Today’s Industry Demands

The industry is skill-specific; Mr Peh believes strongly that there are particular traits a budding data scientist must possess.

Sufficient Engineering Standards

“You need to know how to interact with a data warehouse and create your data pipelines,” says Mr Peh. He notes that not many people know how to produce end-to-end solutions, as most data scientists focus on modelling or feature engineering. At Skyscanner (or anywhere), it is practically impossible to stay confined to just those tasks.

Moreover, skills such as Python (or R — but Python is generally better for making API calls and scripting) and SQL are must-knows.

Understanding How Algorithms Work

There are two schools of thought: one is the need to understand algorithms from a Computer Science point of view, the other from a Mathematical point of view.

Many might lean towards understanding how a model operates at a high level, but Mr Peh personally believes that understanding the math behind algorithms benefits data scientists more. It helps them understand things like which model parameters to focus on during hyperparameter optimization, why the L1 penalty in Lasso Regression can be used for feature selection, and why the L2 penalty in Ridge Regression cannot.
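To make the L1 versus L2 point concrete, here is a minimal sketch, assuming the simplified textbook case where each coefficient can be treated on its own (orthonormal features). The function names and the numbers are illustrative and not from Mr Peh's work. Under that assumption, the L1 penalty subtracts a fixed amount from each coefficient and clips it at zero, while the L2 penalty only rescales it.

```python
def lasso_shrink(b, lam):
    """Minimiser of 0.5*(w - b)**2 + lam*abs(w): soft-thresholding."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set to exactly zero


def ridge_shrink(b, lam):
    """Minimiser of 0.5*(w - b)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return b / (1.0 + lam)


# Illustrative unpenalised coefficients and penalty strength (assumptions).
ols_coefs = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefs]
ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]

print("lasso:", lasso_coefs)
print("ridge:", ridge_coefs)
```

With lam = 0.5, the two smallest coefficients become exactly zero under the L1 rule, so those features drop out of the model. The L2 rule scales all four down but never reaches zero, which is why it cannot select features.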

Interpreting the Situation through Pragmatism

Noting that there has been a lot of hype about Machine Learning, or Deep Learning, Mr Peh points out that not everything can be solved through Machine Learning.

“We need to be able to write out solutions to your problems. Understanding what you are solving is important. You must have an end idea that what you’re solving is scalable.”

The Office Life at Skyscanner Singapore

A Day’s Agenda

“Well, I spend 30% of my time on predictive modelling. I spend the next 20% of my time attending to ad-hoc questions (explained below), and the bulk of my time is mainly spent on infrastructure/data piping/turning models into production and beyond.”

There are two types of work in general: Decision Science & Building Data Products.

Decision Science

Ad-hoc questions would fall under Decision Science, in which questions about users are asked, such as “Why did retention level drop this week?”. Mr Peh essentially uses EDA (Exploratory Data Analysis, a method to understand the dynamics of the data at hand) and further analysis, and from there attempts to explain the outcome.

Another form of Decision Science is experimentation, where data scientists test their hypotheses in an experiment and, from the results, relay which improvements can be implemented.

People in the company might also ask predictive questions, such as “What should I do with a certain product in the next 6 months?” As these questions are rather time-specific, Mr Peh also classifies them as ad-hoc questions.

Building Data Products

On the other hand, Building Data Products refers to the features that users interact with on the front-end of the Skyscanner website or application, or that Skyscanner’s stakeholders might interact with.

For example, our sorting feature offers multiple sorting methods, be it “Best Sort”, “Cheapest First” or “Fastest First”. “Best Sort” would be an example of a data product, since it utilizes a learning model to rank results according to relevance.

The projects shared by Mr Peh below lean towards Building Data Products.

Types of Projects Done

Each project has a unique focus: it could be understanding users, deciding on the best marketing strategy or recommending users the next best opportunity.

In understanding user behaviour, there are concerns such as

Propensity (how likely users are to convert, the likelihood of users being retained)
Lead time (refers to time modelling; it is the time between a notification being sent to users and the users actually booking a flight or hiring a car on Skyscanner)

And then there are the more classical marketing problems, such as

Forecasting (an example: Customer Lifetime Value)
Clustering (used more for marketing to understand user behaviours and tailor target-specific campaigns from clusters)
Attribution (finding out how much should be attributed to paid campaigns and other marketing strategies)

Lastly, there are recommendation problems like

Result sorting (example: best sort for flights, hotels and car hire)
Vertical recommendations (which vertical to be recommended after a redirect is done)

With all these different project types, we got Mr Peh to share two specific case studies exclusively with us. Things may get a little technical from here; our tech readers are more than welcome to dive into this detailed section on case studies!

Case Studies

#1 Semi-Supervised

The question at hand: “Is there any sort of common behaviour (or actions) among APAC users who keep coming back to use the Skyscanner app?”

So how was it carried out?

“I set out a certain number of assumptions and constraints because I want this to be under a controlled environment and make a controlled comparison.” Mr Peh tells us that the users selected were the ones who made a new app installation from ‘2018–01–01’ to ‘2018–05–25’. The actions of users during the first 90 days after install were recorded for a more uniform comparison.

He measured different attributes such as:

What the user used within the app
How many times a user searches for a flight
Whether the user logs in
The various app feature interactions users have had over the 90 days (app features including ‘inspiration’ and beyond in the Explore tab)

Front-end display of the many app features users can interact with

K-means clustering was used to distinguish between high and low retention users. Features like sessions and the number of unique days returned (within the allocated 90 days) were used in clustering users. The number of clusters was chosen using the within-cluster SSE (sum of squared errors, indicating how much variation there is inside each cluster) as well as the silhouette score (which tells how well separated the clusters are; distant and distinct clusters are preferable, since clusters that sit close together tend to contain similar users).
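As a rough illustration of how SSE and silhouette score can guide the choice of k, here is a minimal pure-Python sketch. The toy data (sessions, unique days returned), the deterministic farthest-point seeding and all function names are assumptions made for this example, not the setup Mr Peh used; in practice a library such as scikit-learn would typically do this work.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Plain k-means; returns (labels, within-cluster SSE)."""
    # Deterministic farthest-point seeding (an assumption, for reproducibility).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(v) / len(members) for v in zip(*members)) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, sse


def mean_dist(i, cluster, points, labels):
    """Mean distance from point i to the other points in `cluster` (None if there are none)."""
    d = [sq_dist(points[i], q) ** 0.5 for j, q in enumerate(points) if labels[j] == cluster and j != i]
    return sum(d) / len(d) if d else None


def silhouette(points, labels):
    """Mean silhouette score; points in singleton clusters score 0 by convention."""
    total = 0.0
    for i in range(len(points)):
        a = mean_dist(i, labels[i], points, labels)
        if a is None:
            continue
        b = min(mean_dist(i, c, points, labels) for c in set(labels) if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy users: (sessions, unique days returned) over 90 days. Illustrative only.
users = [(1, 1), (2, 1), (1, 2), (2, 2),
         (10, 10), (11, 10), (10, 11), (11, 11),
         (20, 20), (21, 20), (20, 21), (21, 21)]

results = {}
for k in (2, 3, 4):
    labels, sse = kmeans(users, k)
    results[k] = (sse, silhouette(users, labels))
    print(k, round(sse, 2), round(results[k][1], 3))

best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

On this toy data the silhouette score is highest at k = 3, while SSE keeps falling as k grows. That is why SSE is usually read for an "elbow" rather than minimised outright, and why the two measures are used together.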

Assigning clusters for high and low-frequency users

After clustering, it was found that cluster 7 returned the lowest value for the number of days (out of 90) on which users came back to the app. Users in cluster 7 only used the app on the day they installed it and did not return afterwards.

On the other hand, cluster 3 contained users who frequently returned to the app, making it the highly retained cluster.

That’s not all. Across the 34 or more app actions that were categorized, app feature usage was also calculated by dividing the number of sessions that recorded an app action by the total number of sessions. The results were then plotted on a chart.
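The usage ratio described above (sessions that recorded an action, divided by total sessions) can be sketched per cluster as follows. The session log, the action names and the function name are hypothetical; only the ratio itself comes from the description above.

```python
from collections import defaultdict

# Hypothetical session log: (cluster id, session id, app actions recorded in that session).
sessions = [
    (3, "s1", {"flight_search", "login"}),
    (3, "s2", {"flight_search", "explore"}),
    (3, "s3", {"explore"}),
    (3, "s4", {"flight_search"}),
    (7, "s5", {"flight_search"}),
    (7, "s6", set()),
]


def usage_by_cluster(session_log):
    """For each cluster, the share of its sessions in which each action appears."""
    totals = defaultdict(int)                      # sessions per cluster
    hits = defaultdict(lambda: defaultdict(int))   # sessions per cluster per action
    for cluster, _session_id, actions in session_log:
        totals[cluster] += 1
        for action in actions:
            hits[cluster][action] += 1
    return {
        cluster: {action: count / totals[cluster] for action, count in per_action.items()}
        for cluster, per_action in hits.items()
    }


usage = usage_by_cluster(sessions)
print(usage[3])
print(usage[7])
```

Each action is counted at most once per session because the actions are stored as a set, which matches a "sessions that recorded an action" numerator rather than a raw event count.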

Concluding Case Study #1

Unfortunately, the plots (not limited to the images above) did not return a clear distinction that differentiated cluster 3 from the rest.

Given that the nature of the problem is to understand which app features were important, classical machine learning models such as Random Forest are preferred, as they help with feature selection.

After applying a preferred model, the difference in feature usage is drawn between those who use the app frequently and those who do not. From there, the marketing teams can be informed of successful features of the app, which they will then try to direct users to use more often, either by altering their on-boarding process or by sending notification campaigns.

It’s important, however, to note to the stakeholders that what is done above only suggests correlation and not causation. These are only tools of inferential statistics, and it will be difficult to conclude causation from them.

#2 Supervised

The question of this project was: “What can we do to grow the car hire acquisition funnel? And hopefully, increase revenue too?”

For a value chain, a data scientist can usually contribute through these four segments:

Cross-Selling involves bringing traffic from other verticals (within Skyscanner) to car hire.

Redirect means to divert someone’s browser/app outside of Skyscanner, while car hire conversion refers to completed bookings.

Mr Peh shares with us that the main aim of this project is to predict which users will do a car hire search after a flight redirect (on the app).

A proof of concept was still required to check whether this was plausible and worthy of investment. He carried out the project in these 8 steps:

1. Proof of concept
• Check if there are enough users
• Decide on the communication medium → how to reach out to users (be it email, push notification, etc.)
2. Exploratory Data Analysis
• It is always ideal to obtain some level of understanding of the data and gain some intuition before moving on to the modelling.
• Are there cities that are biased towards car hire?
• The time interval between flight (or hotel) bookings and car hire
3. Mapping flight redirect to car hire search
• Using outbound and inbound dates to check against the date of booking car hire
4. Modelling a classification prediction
• Binary classification (using 1 or 0 to indicate whether the user did a car hire search)
• Separating data into training and testing sets
• Cross-validation
• Using a deep neural network model to determine propensity (how likely a user is to hire a car)
5. Iterate on improving the model until evaluation metrics are met
• Hyperparameter tuning
• Adding more features
• Depending on the problem, some metrics are more important than others
6. Lead time modelling
• A limitation of the cross-sell model is that it can only predict the likelihood of users using car hire. It does not tell us when the user will make their car hire search
• An analytical approach is relied upon to inform us of the (minimum and maximum) limit of the number of days after a flight redirect within which we should send a push notification to users
7. Rolling out the model for implementation
• Working backwards from the customer journey (user receiving a notification and moving on from there)
• Data flow: the output list of users is sent to our push notification tool, which picks up and targets users (the scoring step is what happens daily after the model is deployed)
8. Push campaigns that are designed from the model
• More iterations to further improve cross-sell for car hire
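Two of the steps above, mapping flight redirects to car hire searches and lead time modelling, can be sketched in a few lines. Everything here is an assumption made for illustration: the rows are made up, the labelling rule (a car hire search made after the redirect, with a pick-up date inside the trip) is one plausible reading of "checking against outbound and inbound dates", and the 20th and 80th percentiles stand in for whatever analytical limits were actually used.

```python
import math
from datetime import date

# Hypothetical rows: (user, redirect date, outbound, inbound, car hire search date, pick-up date).
# The last two are None when the user never searched for car hire.
rows = [
    ("u1", date(2019, 3, 1), date(2019, 4, 10), date(2019, 4, 17), date(2019, 3, 2), date(2019, 4, 10)),
    ("u2", date(2019, 3, 1), date(2019, 5, 1), date(2019, 5, 8), date(2019, 3, 4), date(2019, 5, 2)),
    ("u3", date(2019, 3, 2), date(2019, 3, 20), date(2019, 3, 25), None, None),
    ("u4", date(2019, 3, 3), date(2019, 6, 1), date(2019, 6, 9), date(2019, 3, 7), date(2019, 6, 1)),
    ("u5", date(2019, 3, 5), date(2019, 4, 1), date(2019, 4, 5), date(2019, 3, 12), date(2019, 4, 1)),
    ("u6", date(2019, 3, 6), date(2019, 7, 1), date(2019, 7, 10), date(2019, 3, 26), date(2019, 7, 2)),
    ("u7", date(2019, 3, 8), date(2019, 4, 2), date(2019, 4, 6), date(2019, 3, 9), date(2019, 8, 15)),
]


def label(row):
    """1 if the car hire search follows the redirect and its pick-up falls inside the trip, else 0."""
    _, redirect, outbound, inbound, searched, pickup = row
    if searched is None:
        return 0
    return int(searched >= redirect and outbound <= pickup <= inbound)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Binary target for the classifier (step 4).
labels = [label(r) for r in rows]

# Days from flight redirect to car hire search, for converted users only (step 6).
lead_times = sorted((r[4] - r[1]).days for r in rows if label(r) == 1)

# Assumed notification window: between the 20th and 80th percentile of lead time.
window = (percentile(lead_times, 20), percentile(lead_times, 80))

print(labels)
print(lead_times)
print("send push between day %d and day %d after the redirect" % window)
```

The labels would feed the binary classifier, while the window answers the question the classifier cannot: not whether a user is likely to search for car hire, but roughly when a push notification would still be timely.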

After 3 months of work, Mr Peh was able to answer the business question. Fortunately, the project turned out to be successful.

Before you take off

Perhaps this article takes on a more scientific lens, with technical jargon strewn across the sentences.

Despite the technical nature, Mr Peh’s sharing with our Data Associates has surely given aspiring data scientists a concrete overview of the field.

Data science is not simply about the algorithms or the steps required to produce an end-to-end solution. In fact, it requires the data scientist to step back from a purely technical point of view and apply the techniques to business concerns. By forming relevant links across this multidisciplinary field, data science offers a challenging but worthwhile learning journey.