Stories by Lucas Bernardi on Medium

Mining the Stars: Learning Quality Ratings with User-facing Explanations for Vacation Rentals

Lucas Bernardi — Tue, 23 Mar 2021 14:26:00 GMT

Published in the 14th ACM International Conference on Web Search and Data Mining WSDM 2021

Online Travel Platforms are virtual two-sided marketplaces where guests search for accommodations and accommodation providers list their properties such as hotels and vacation rentals. The large majority of hotels are rated by official institutions with a number of stars indicating the quality of service they provide. It is a simple and effective mechanism that contributes to match supply with demand by helping guests to find options meeting their criteria and accommodation suppliers to market their product to the right segment directly impacting the number of transactions on the platform. Unfortunately, no similar rating system exists for the large majority of vacation rentals, making it difficult for guests to search and compare options and hard for vacation rentals suppliers to market their product effectively. In this work we describe a machine learned quality rating system for vacation rentals. The problem is challenging, mainly due to explainability requirements and the lack of ground truth. We present techniques to address these challenges and empirical evidence of their efficacy. Our system was successfully deployed and validated through Online Controlled Experiments performed in Booking.com, a large Online Travel Platform, and running for more than one year, impacting more than a million accommodations and millions of guests.

Mining the Stars: Learning Quality Ratings with User-facing Explanations for Vacation Rentals was originally published in Booking.com ML & DS Blog on Medium.

Machine Learning in production: the Booking.com approach

Lucas Bernardi — Tue, 29 Oct 2019 13:22:42 GMT

During the last five years, Machine Learning became a standard tool for Product Development in Booking.com. Today, it plays a role in every step of the customer journey. Hundreds of Data Scientists build, deploy and experiment with hundreds of machine-learned models exposing them to millions of users every day.

Supporting Machine Learning at scale involves many challenges, not least of which is shipping the models to production reliably, as fast as possible and accommodating a large variety of model types, invocation settings, libraries, data sources, monitoring approaches, etc. Inspired by one of the core values of Booking.com (diversity gives us strength), we built a system that supports a large variety of Machine Learning approaches. In this article we present RS, our Machine Learning Productionization System.

A big part of Amsterdam’s charm comes from the diversity of bicycles running all over the city.

Diversity gives us strength

This is one of the main values of Booking.com. Nationalities, languages, ages, backgrounds, genders, hobbies, religions, diets… all are immensely diverse. Machine Learning does not escape this. People building Machine Learning models do it in many very different ways. Some use a small data set and R, others a huge data set and a command line tool like Vowpal Wabbit. Some like to write their own optimization algorithm in Java, others use sklearn or H2O. Some build Deep Learning models in Pytorch, others in Tensorflow, and so on. We believe in such diversity and therefore encourage it and support it with tools, courses, infrastructure and more, but…

Diversity gives us… a challenge

Most of the models we build are only valuable when integrated into a larger system like the main Booking.com website, our mobile apps, our partner-facing services, customer service systems, among others. The process of taking a working machine-learned model and integrating it with the relevant system, making it available to our customers, is what we call productionization. This is a critical step and it entails a set of requirements. Let’s look at the most important ones:

Consistency

This means two things: the online predictions must match the offline predictions, and they must be the same regardless of which Data Center, Server, Pod, etc. is hit by the request. Not meeting these conditions has many bad consequences, such as user experience degradation, models that are hard to debug and invalidated experiments, to name a few.

High availability

Most of our systems are available 24/7, which means our machine-learned models must also serve requests seven days a week, 24 hours a day, worldwide. Upgrading a model, scaling it to a wider audience, maintaining the hosts where it runs, testing new tools, etc. must not interfere with the availability of the model in production. The stakes are high: a whole experiment might end up invalid if one of the competing models fails too often.

Low latency

The model must make predictions really fast. Many models only decide if an icon will be shown on top of an accommodation card in search results. This is a tiny part of the whole web page so we can’t really take a lot of time to do that. Furthermore, many models collaborate to construct a page, so even if individual models are not slow, the aggregated latency might become prohibitively high.

Scalability

Booking.com is constantly growing, not only in terms of customers and transactions but also in terms of the number of listings available, and even the number of products as we expand our offer beyond accommodations. As a consequence, our models must be ready to take in a growing number of requests per second.

Observability

The environments in which our models operate are very volatile. For example, the FIFA World Cup might suddenly bring a higher number of visits to France. Or maybe the website changed, and the “great deal icon” is now displayed in a different location. Such volatility implies we must closely watch the behavior of our models. Has the model output changed? Are the inputs correct? Is the input space changing? We need tools to make sure we can observe our models to react appropriately and promptly.

Reusability

Many of our models solve a rather generic task. For example, one interesting model determines if a hotel is family-friendly, which means that the hotel has good amenities and an environment where families can enjoy themselves together (as opposed to a business hotel, for example). This model can be applied in many different product features. We can use it to highlight those family-friendly hotels in the details page, to create a filter in the search results page, or to reinforce a user decision in the booking process. These are all examples of model reuse, which helps us make the most out of our models.

Meeting all these requirements for models built in such a diverse set of approaches is quite challenging. But it is exactly the type of challenge we constantly face.

The Fantastic Four

Let’s take a look at four basic mechanisms to productionize machine-learned models. These mechanisms form the basis of our system, and they embody the types of trade-offs we make when deploying our models. They work a bit like the Marvel heroes: they all have strengths and weaknesses, but working together they achieve great things.

Lookup tables

Most of our models map an input vector to a prediction (or a set of predictions in the case of recommender systems). A very simple way to deploy a model in production is to precompute the predictions for all the possible inputs and store them in a key-value store. At prediction time all we need to do is look up the prediction using the input as the key. This is a very naive approach but it has many great advantages:

Since no computation is done at prediction time and since most key-value stores are optimized for fast reading, latency is usually very low.
Horizontal scalability is quite straightforward, most key-value stores take care of it.
Huge modeling flexibility. It doesn’t really matter how the model was trained, whether it is a linear model trained in R or a Dual Attention Network trained with Keras. As long as the predictions can be computed and written to the key-value store in time, the model can be productionized reliably.

It also has several drawbacks:

When the input space is large, it might be difficult to store that many combinations, or impossible to precompute all the predictions within a reasonable time. Besides, chances are many of the input combinations will never actually happen in production, resulting in waste of computation and storage resources.
The process of writing the predictions to the key-value store might be cumbersome when we consider model versions, modifications to the feature space, etc.
Continuous inputs are not supported.

For problems with a small and discrete feature space this method offers huge modeling flexibility together with all the nice attributes of a good key-value store: fast reads, high availability and horizontal scalability. In Booking.com this method is quite popular in the Front End, where we usually have discrete feature spaces, or we have a natural key like user / accommodation / destination identifiers.
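As a rough illustration of the lookup-table method, here is a minimal Python sketch. The feature names, the toy model and the in-memory dict standing in for the key-value store are all assumptions made for this example; they are not part of RS.

```python
from itertools import product

# Hypothetical discrete feature space: every combination can be enumerated.
DEVICES = ["desktop", "mobile"]
TRIP_TYPES = ["business", "family", "solo"]


def toy_model(device, trip_type):
    """Stand-in for a trained model; a real one could come from any library."""
    score = 0.2
    if device == "mobile":
        score += 0.1
    if trip_type == "family":
        score += 0.5
    return round(score, 2)


def precompute(model):
    """Offline step: score every possible input and key the result by the input."""
    return {(d, t): model(d, t) for d, t in product(DEVICES, TRIP_TYPES)}


def predict(table, device, trip_type, default=None):
    """Online step: a single read; no model code runs at prediction time."""
    return table.get((device, trip_type), default)


table = precompute(toy_model)  # a plain dict stands in for the key-value store
print(len(table))
print(predict(table, "mobile", "family"))
```

Note how an input that was never precomputed (for instance an unseen device) can only fall back to a default, which is one face of the drawbacks listed above.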

Generalized Linear Models (GLMs)

In this method the model is represented by a scalar weight for each input plus a global bias; at prediction time, we just compute the inner product of the input with the weights, add the bias, apply a scalar function (the “inverted link”) and return. If we are interested in ranking items, we do the same for each item, and sort them by their score, more formally:

Prediction(X) = 𝓕(<W, 𝓣(X)>)

Where <,> means inner product, X is the input vector, W is the weight vector (the model), 𝓕 is the inverted link function (scalar to scalar), and 𝓣 is a vector-to-vector transformation applied before the inner product is computed. The output is a single scalar.
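The prediction rule can be sketched in a few lines of Python. This is only an illustration, not the in-house linear prediction system: the function names are made up, and the bias is assumed to be folded into W by letting 𝓣 append a constant 1 to the input.

```python
import math


def identity(z):
    return z


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def inner(w, x):
    """Inner product <w, x> of two equally sized vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def with_bias(x):
    """Example transformation T: append a constant 1 so the last weight acts as the bias."""
    return list(x) + [1.0]


def glm_predict(w, x, inverse_link=identity, transform=identity):
    """Prediction(X) = F(<W, T(X)>)."""
    return inverse_link(inner(w, transform(x)))


w = [0.5, -1.0, 2.0]  # two feature weights followed by a bias of 2.0
x = [4.0, 4.0]
linear = glm_predict(w, x, identity, with_bias)   # 0.5*4 - 1.0*4 + 2.0
logistic = glm_predict(w, x, sigmoid, with_bias)  # sigmoid of the same score
print(linear, logistic)
```

Swapping the inverse link or the transformation changes the kind of model being served without touching the serving code, which is the point of the method.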

Ranking items is very similar:

Ranking(X) = arg sortᵢ 𝓕(<Wᵢ, 𝓣(X, i)>), i ∈ 𝓘

Here, 𝓘 is a set of items, Wᵢ is a weight vector for the item i (W, the model, is a matrix now) and 𝓣 transforms the input vector X into some other vector that might depend on the specific item i. The output is a sorted list of items. Of course, one can attach the score itself for downstream purposes, or take only the top k items instead of sorting, which can be done in linear time.
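A minimal Python sketch of the ranking rule follows. It assumes that higher scores rank first, and the user and item vectors are made-up values for illustration. The top-k variant uses `heapq.nlargest` from the standard library, which avoids a full sort (its cost grows with the number of items times log k).

```python
import heapq


def inner(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def rank(item_weights, x, transform, inverse_link=lambda z: z, top_k=None):
    """Score every item i with F(<W_i, T(X, i)>) and return (item, score) pairs, best first."""
    scored = [
        (item, inverse_link(inner(w_i, transform(x, item))))
        for item, w_i in item_weights.items()
    ]
    if top_k is None:
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Hypothetical bi-linear setup: X is just a user id, T maps it to a user vector
# (independent of the item), and each W_i is an item vector.
user_vectors = {"u1": [1.0, 0.0]}
item_weights = {
    "amsterdam": [0.9, 0.1],
    "paris": [0.4, 0.8],
    "rome": [0.7, 0.3],
}


def user_embedding(x, item):
    return user_vectors[x]


full_ranking = rank(item_weights, "u1", user_embedding)
top_two = rank(item_weights, "u1", user_embedding, top_k=2)
print([item for item, _ in full_ranking])
```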

Now, depending on how we instantiate all these variables we get very different models, all linear in the parameters W. For example:

𝓕 and 𝓣 identity gives a plain linear regression.
𝓕 sigmoid and 𝓣 identity gives a plain logistic regression.
If X is a 1-dimensional vector with just a user id, 𝓕 is the identity, 𝓣 transforms X into a d-dimensional user vector (independent of i), and Wᵢ is a d-dimensional item-vector for item i, we get bi-linear models like Matrix Factorization and cosine similarity based k-Nearest Neighbors.

In general, this simple prediction rule allows us to serve predictions for many kinds of regressors, binary or multiclass classifiers, recommender systems or rankers, and even simple sequence models like Markov Chains.

Of course the actual model depends on how W is learned from data. The linear regression could actually be an SVM classifier, but we don’t need to care about this. It doesn’t matter which training algorithm is used: as long as the model can be represented by a weight vector, an input transformation and a link function, we can run it in production.

This method directly addresses the issues of lookup tables. GLMs can easily model continuous inputs and a large number of inputs, and they are more efficient since only the predictions actually requested are computed. This flexibility comes at a cost: we can only productionize linear models. Note that this means linear in W; nothing prevents us from introducing non-linearities through feature transformations (𝓣) like interactions, bucketing, clipping, or even using embeddings. One more disadvantage is that we need to transform our model from the format of the library used to train it to the linear predictor format, which can be error-prone and adds one more step to the deployment process.

Due to its flexibility and online performance, this method is very popular and applied to deploy many models in a variety of business cases like user preference models, user context models, destination recommendations, budget prediction, hotel attributes prediction and more.

Native libraries

This method consists of simply using the library used to train the model to make predictions in production. For example, if a model is trained using sklearn, we can save it in pickle format, upload to a production server, where it would be loaded using the sklearn and pickle APIs, making it ready to serve predictions. If we trained a model using H2O we can serialize it using the Java Serialization API, and just as with sklearn, upload it to a server, deserialize it and use it to make predictions. Most libraries offer some form of serialization of an already trained model, so this approach is quite general in principle.

This method brings a new dimension into the picture: ease of use. We can simply train our model and upload it without any transformation to intermediate formats. It also provides high consistency: the same code used to train is used to predict, so there are no surprises in production. On the other hand, native libraries might be difficult to actually deploy, since they require a specific runtime environment. For example, if our servers run on Java, deploying Python models is not straightforward, and if our servers run on Python, deploying an H2O MOJO can be really hard. In general, this leads us to support only libraries compatible with the server runtime, which offers much less flexibility compared to lookup tables and GLMs. Another disadvantage is that native libraries might be optimized for training time rather than serving time, increasing the risk of latency.

We use this approach a lot when our models are Tree-based, like Random Forests or Gradient Boosted Trees or when they are Neural Networks.

Scripted models

A scripted model is simply a script with a predefined interface that is invoked for every request. The script is of course written by the model’s author and they can do pretty much anything they want. This approach gives huge flexibility since it allows, among other things, to control the run-time environment, and to perform complex tasks at prediction time. On the other hand, scripted models are a weak link in the online request life cycle: every line of code will have an impact at prediction time, increasing the risk of latency and failure.

We use this approach to deploy models built with unsupported libraries and models requiring some logic on top of one or several predictions.

Trade-offs and the iterative hypothesis-driven approach

There are many more mechanisms to deploy models, but these are the four canonical methods that allow us to make sensible trade-offs. Not all business cases require the same level of modeling flexibility or robustness, and for one single business case, these requirements are also not at the same level at different stages of the solution. For example, if we want to create a recommender system, the first model might very well be just a popularity model with two or three categorical features, built just using SQL, so a lookup table is a perfect approach at this stage since we don’t want latency to be an issue at all at the beginning of the project. The next step might be to test the hypothesis that more features make the recommendations more relevant. We can still do that with a lookup table, but if the number of features is too big, a GLM allows us to do it without compromising latency, while giving us the freedom to use the programming language and software library we are most comfortable with. As the project succeeds, the model evolves towards higher complexity: we might abandon our slick SKLearn linear model to be able to test the effect of non-linearities with Random Forests, for which H2O deployed as a serialized model will do a great job. Finally, a mature model might evolve into a powerful RNN trained in Pytorch, and we might even be willing to pay a few milliseconds for much better recommendations.

Bakfiets are very heavy, but they can carry lots of people and things. Not the best choice for commuting to the office, but very handy to take four kids to school or a 20 people picnic to the park.

Following a rather informal analysis of the four productionization methods, we find that Flexibility and Robustness are two software attributes creating a trade-off plane: the more Flexibility a method offers, the less Robust it is and vice-versa.

Flexibility can be decomposed into flexibility with respect to the Input Space, to the Modelling Approach or to the Stack (programming language and libraries). Likewise, Robustness can be decomposed into Latency, Consistency (between the training model and the actually running artifact) and Observability.

The following table illustrates these trade-offs, using a -1, 0, +1 (red, yellow, green) scoring system along these 6 aspects:

Adding up the scores (assuming they are all equally important) allows us to locate each method in the trade-off plane:

Lookup Tables and GLMs are both at the origin, offering about the same level of Flexibility and Robustness, but in different flavors: the Flexibility of Lookup Tables is about Modeling, while that of GLMs is about Input Space. The Robustness of Lookup Tables is about Latency, while that of GLMs is about Observability. So we can choose a method depending on which flavor of Flexibility and Robustness we consider a better fit for our problem.

From the origin we can get a bit more Robustness if we are willing to give away a bit of Flexibility using Native Libraries. Or we get a lot more Flexibility if we are willing to lose some Robustness with Scripted models.

Of course this analysis is rather subjective. One could always argue that all these aspects are not equally important and that their weights actually depend on the specific application or even that there are more aspects to consider. But this chart is just an illustration of how these four basic mechanisms do a fairly good job at covering the Robustness vs Flexibility trade-off plane.

RS, our Machine Learning productionization tool

RS is our machine-learned models productionization tool. It provides support to productionize models using all four methods described before and adds a lot of functionality that is method-agnostic, helping model authors to achieve the requirements described at the beginning.

The main idea behind RS is to decouple training from prediction. It doesn’t matter how a model was built, or which productionization method is used: model consumers can use exactly the same API. This simple but powerful idea is what allows RS to support huge diversity.

But prediction is just one functionality. Many others, like monitoring and A/B testing, can also be decoupled from the training and productionization method. The following diagram summarizes the services RS provides for all models, regardless of how they were trained or deployed.

RS in a nutshell

Implementation

Models can be uploaded through a programmatic interface or through a web portal. RS distributes the model across nodes in a cluster where a Java process takes care of loading them into memory and makes them available to serve predictions through a standard HTTP interface. One RS node serves many models and any given model is loaded in many nodes; this approach helps to achieve the High Availability and Scalability requirements.

Lookup tables are implemented using the Cassandra key-value store (or in memory if they are small enough); users can point to a table in our Hadoop cluster and RS imports it into Cassandra. GLMs are served through an in-house developed linear prediction system that uses simple text files as model descriptors. Native Libraries are supported through H2O MOJOs, Tensorflow and Vowpal Wabbit binaries. These are the most popular libraries used in Booking.com, and they are Java friendly, which matches the RS runtime environment. Finally, Scripted models are Python scripts, each running in its own virtual environment, allowing users to upload additional modules and dependencies as needed.

On top of this, RS adds a lot of value through extra functionality, including a web portal that allows users to search and browse all the available models. Each model has its own page with detailed information such as experiments using the model, monitoring tools, documentation, a link to the training code, etc. It also provides a basic state machine that allows the author to transition the model through states like in-testing, production-ready, disabled, etc. Another interesting feature is the Playground, which allows occasional users to just go ahead and use a model to see what it does. All of this helps a lot with the Observability and Reusability requirements.

RS also mitigates many of the red flags identified by the trade-off analysis. For example, caching and batch requests help a lot with latency; the Linear Prediction system was extended to support Factorization Machines, mitigating the model flexibility red flag; test cases are enforced when models are uploaded to mitigate the Consistency red flag. The supported native libraries are friendly with several programming languages (Python, Java, R, C) mitigating the Stack Flexibility red flag. Altogether, RS equips model authors with high flexibility and robustness, regardless of the productionization method they choose.

Lessons learned

RS is one of the many tools in Booking.com that we use every day. Adoption is growing, and RS plays a fundamental role in scaling up our applications of Machine Learning across the whole organization.

Cumulative number of newly created models and experiments with Machine Learning and RS

Embracing diversity was by far the most important success factor, but we also learned many other things along the way, so to wrap it up let’s see three of them:

Solving common concrete problems

RS started as a simple utility to run linear models in the website. This tiny utility achieved huge impact because it solved a common and concrete problem of those days. It removed one obstacle and opened the path for what came later.

Keeping the customer close

Since the very beginning we kept our customers (model authors) as close as possible, brainstorming together, solving business cases together, building up a vision together. We like to think RS is built by our Machine Learning community and not only by the core RS team.

Reinventing the wheel

Reinventing the wheel is usually seen as a bad practice (although it has actually been re-invented many times). Asking ourselves which concrete requirements we wanted to satisfy showed us that building our own system was a much more scalable approach, giving us, among other things, the chance to integrate with other tools like our Experimentation Platform and frontend libraries, and to focus on the aspects we considered fundamental, like latency and high availability. Reinventing the wheel gave us the chance to invent a perfect-fit wheel for our requirements, plus high flexibility to adapt smoothly as they change.

Reinventing the Bicycle —Firefighter Bicycle by Pivari.com [CC BY-SA 3.0]

That is all folks!

So there it is, the Booking.com approach to machine-learned models productionization. In future articles we might explore specifics about Monitoring, Feature Engineering, Experimentation and other topics. We hope you found this one interesting and stay tuned for more.

I want to thank Adolfo Mazorra and Jean Schmidt, two of the main developers behind RS, for their insightful input, Themis Mavridis for reviewing the whole article and Steven Baguley for making it human readable.

Machine Learning in production: the Booking.com approach was originally published in Booking.com ML & DS Blog on Medium.

Don’t be tricked by the Hashing Trick

Lucas Bernardi — Wed, 10 Jan 2018 16:14:23 GMT

In Machine Learning, the Hashing Trick is a technique to encode categorical features. It’s been gaining popularity lately after being adopted by libraries like Vowpal Wabbit and Tensorflow (where it plays a key role) and others like sklearn, where support is provided to enable out-of-core learning.
Unfortunately, the Hashing Trick is not parameter-free; the hashing space size must be decided beforehand. In this article, the Hashing Trick is described in depth, the effects of different hashing space sizes are illustrated with real world data sets, and a criterion to decide the hashing space size is constructed.
Out-of-core learning
Consider the problem of learning a linear model: an out-of-core algorithm learns the model without loading the whole data set in memory. It reads and processes the data row by row, updating feature coefficients on the fly. This makes the algorithm very scalable since its memory footprint is independent of the number of rows, which is a very attractive property when dealing with data sets that don’t fit in memory.
One possible implementation uses a mapping from a feature name (and possibly a value) to its corresponding coefficient. For example, if the data contains ‘country’ as a feature with values like ‘Netherlands’, ‘Argentina’ or ‘Nigeria’, the weights associated with each country are updated like this:
W['country_argentina'] = W['country_argentina'] + delta(X, y, 'country')
This is implicitly applying one-hot encoding to the feature ‘country’. The fact that it happens implicitly is great — we don’t need to create a file with one column for each possible country, use sparse data file formats or any other preprocessing. We don’t even need to know how many categories there are; we can just take our .csv data as is, throw it into our learner and get back a coefficient for each feature. This also plays nicely with numerical features, for which we just need to skip the value:
W['price'] = W['price'] + delta(X, y, 'price')
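To make the update rule concrete, here is a minimal Python sketch of an out-of-core learner that keeps its coefficients in a dictionary. The article does not define delta, so this sketch assumes a logistic regression trained with stochastic gradient descent; the column names, the learning rate and the tiny inline data set are made up for the example.

```python
import csv
import io
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def active_features(row, label, numeric):
    """Implicit one-hot encoding: categorical -> 'name_value' with value 1, numeric -> 'name'."""
    feats = {}
    for name, value in row.items():
        if name == label:
            continue
        if name in numeric:
            feats[name] = float(value)
        else:
            feats[f"{name}_{value.lower()}"] = 1.0
    return feats


def train_out_of_core(rows, label, numeric=(), lr=0.1):
    """One pass over an iterator of rows; only the coefficient table W is kept in memory."""
    W = defaultdict(float)
    for row in rows:
        y = float(row[label])
        feats = active_features(row, label, numeric)
        p = sigmoid(sum(W[f] * v for f, v in feats.items()))
        for f, v in feats.items():
            W[f] += lr * (y - p) * v  # plays the role of delta(X, y, feature)
    return W


DATA = """country,price,booked
Netherlands,1.2,1
Argentina,0.8,0
Nigeria,1.0,1
Argentina,0.9,0
"""

rows = csv.DictReader(io.StringIO(DATA))  # streams one row at a time
W = train_out_of_core(rows, label="booked", numeric={"price"})
print(sorted(W))
```

Memory use does not depend on the number of rows, but the dictionary still grows with the number of distinct features, since every key is stored.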
While this approach looks good at first glance, it relies heavily on the way we implement the mapping from features to coefficients. This takes us to the next topic: Hashing.
Hashing
In order to implement the coefficients table W, we could use a hash table. This is a data structure that maps ‘keys’ of arbitrary length to ‘values’. In our case, the features are the keys and the coefficients are the values. This mapping is performed using a hash function, which takes a key of arbitrary length as input and outputs an integer in a specific range. The hash table uses this integer as an index in a 1-dimensional array where the values are stored.
Good hash functions have an interesting property: they map the expected input evenly over the output range, which means that every hash value has roughly the same probability of being observed after hashing a typical sample of the key space.
Hash functions also come with a not-so-nice side effect: they can hash different keys to the same integer value (this is known as a ‘collision’), which is a big problem for our hash table: we can’t simply store each coefficient in the array position the hash of the key points to, since that could overwrite a previously inserted value.
A collision-resolution strategy is necessary. The most straightforward approach (known as ‘separate chaining’) is to store a list in each position of the array, containing all the keys and values hashed to that bucket. When the value associated with a key is requested, the key is hashed, giving an index in the array; the corresponding list is then scanned until the requested key is found, at which point the associated value is returned. When a new key and value are added to the table, a similar procedure is followed; the only difference is that instead of returning the value, the key and value are added to the list.
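Separate chaining can be sketched in Python as follows. This is a teaching example, not how any particular library implements its hash tables; `zlib.crc32` is used as the hash function only because it is deterministic across runs, and the class name is made up.

```python
import zlib


class ChainedHashTable:
    """Hash table that resolves collisions by keeping a list of (key, value) pairs per bucket."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        return zlib.crc32(key.encode("utf-8")) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (stored_key, _) in enumerate(bucket):
            if stored_key == key:  # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))  # new key: the key itself must be stored

    def get(self, key, default=None):
        for stored_key, value in self.buckets[self._index(key)]:
            if stored_key == key:  # scan the chain until the key matches
                return value
        return default


# With only 2 buckets and 3 keys, at least two keys must share a bucket.
table = ChainedHashTable(n_buckets=2)
table.put("country_netherlands", 0.3)
table.put("country_argentina", -0.1)
table.put("country_nigeria", 0.7)
print(table.get("country_argentina"))
```

Every lookup compares stored keys, which is why the keys have to stay in memory.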
This whole process makes reads and writes slower, though, and forces us to keep the keys in memory. This is bad news for our out-of-core learning algorithm, since it means the whole learning process is slower and its memory footprint is now linear in the number of features.
Unless we are willing to accept collisions...
The Ostrich Algorithm
The Ostrich Algorithm is very simple: ignore the problem. In our case, this would mean we don’t try to resolve collisions; we just go ahead and do:
W = 1-dimensional array of length n
…
W[hash('country_' + X['country'])] = W[hash('country_' + X['country'])] + delta(X, y, 'country')
This means that we don’t need to scan a list nor keep the keys in memory, making our out-of-core learning algorithm constant in memory and fast.
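A minimal Python sketch of this idea is shown below. It assumes the hash value is reduced to the range of the array with a modulo, and it uses `zlib.crc32` instead of Python’s built-in `hash`, because the built-in string hash is randomized between processes by default. The class and function names are invented for the example.

```python
import zlib


def bucket(feature, n):
    """Map a feature name to one of n buckets; collisions are simply ignored."""
    return zlib.crc32(feature.encode("utf-8")) % n


class HashedWeights:
    """Coefficient table of fixed length n; no keys are stored."""

    def __init__(self, n):
        self.w = [0.0] * n

    def update(self, feature, delta):
        self.w[bucket(feature, len(self.w))] += delta

    def get(self, feature):
        return self.w[bucket(feature, len(self.w))]


W = HashedWeights(n=16)
for i in range(1000):  # 1000 distinct features squeezed into 16 buckets
    W.update(f"country_{i}", 1.0)

print(len(W.w))  # the table never grows
print(sum(W.w))  # every update landed somewhere, shared or not
```

With far more features than buckets, most features share a weight with others, which is the price paid for the constant memory footprint.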
Collisions do happen though. What’s the effect of collisions on this approach?
Figures 1, 2 and 3 show the effect of collisions on predictive power for 3 data sets:
Booking.com’s internal dataset containing ~200k features and ~250M examples for a 4-class classification problem;
The Criteo Display Advertising Kaggle Challenge data set, which contains ~30M features and ~40M examples;
The Avazu CTR Prediction Kaggle Challenge data set, with ~40M features and ~40M examples.
For all data sets, the algorithm is a simple logistic regression trained with Vowpal Wabbit (one against all for the multi-class problem).
In all charts, the x-axis is the actual proportion of colliding features during training. The blue curve, associated with the left hand side y-axis is n, the hash size in number of buckets; the red curve, associated with the right hand side axis, is the log loss on a held out data set.
All charts are quite consistent: collisions are bad, and the more there are, the worse the log loss. This is expected, but (surprisingly) the effect is not that bad; even with 50% colliding features, the performance loss is much less than half a percent. This could be considered a big win for the Ostrich Algorithm, but depending on the application such an impact could be unacceptable. Furthermore, the impact also depends on the specific data set: at 50% colliding features the Booking.com dataset is impacted more than twice as much as the Criteo data set.
Figure 1 — Booking.com Dataset
Figure 2 — Criteo Kaggle Challenge Data
Figure 3 — Avazu Kaggle Challenge Data
Another effect of the collisions is the potential loss of interpretability of the model. Even if collisions don’t impact predictive power, they might produce absurd models; for example, binary features like ‘logged in/logged out’ could share exactly the same weight, which wouldn’t make much sense. Another example would be colliding countries; if we want to estimate the price of a hotel and use ‘country’ as a feature, we might expect prices in Denmark to be much higher than in Greece. But if Denmark and Greece are hashed to the same bucket, they would have exactly the same contribution to the price of a hotel, mistakenly challenging our intuition.
Why not just use as big a coefficients table as possible? That would certainly minimise collisions — but at the expense of memory, which defeats one of the purposes of the Hashing Trick: to avoid the consumption of memory by the features. Minimising memory consumption has many advantages, but one of particular interest is that it allows us to train many versions of the model in parallel in a single computer, which can be very helpful for many things like hyperparameter search for example.
Whether we care about a specific metric to the last bit, about how interpretable our model is, or about making the most out of our computational resources, understanding the dynamics of the hashing space size, the feature space size and the number of collisions is key to making principled trade-offs. It is also, as you’ll see, a lot of fun. Let’s dig deeper.
Expecting Collisions
If we know roughly how many features our data has (denoted by k) and we fix the hash space size (denoted by n), what’s the expected number of features colliding? This is a nice puzzle, and it is related to a very famous problem: The Birthday Paradox. In the original version, it is described as follows:
In a group of k people, what is the probability of at least 2 persons having the same birthday?
But in our case we are interested in a slightly different problem:
In a group of k people, how many of them are expected to share the same birthday with at least one other person?
For example, consider a group of 5 people, 2 of them having their birthday on February 17th, and 3 of them on March 23rd, then everybody shares their birthday with at least someone else. But if 2 of them have their birthday on February 17th and the others on April 25th, August 2nd and December 18th, then only 2 out of 5 people share their birthday with someone else. That is 40% colliding people.
So let’s solve this puzzle step by step:
1. The probability of persons X and Y sharing their birthday is simply 1/n.
2. The probability of persons X and Y not sharing their birthday is then 1 − 1/n.
3. The probability of person X not sharing their birthday with any of the other k−1 people in the group is therefore (treating the k−1 comparisons as independent):
(1 − 1/n)^(k−1)
4. Then the probability of X sharing their birthday with at least one other person is:
p = 1 − (1 − 1/n)^(k−1)
5. Define one binary random variable Bᵢ for each person i: it takes the value 1 when the person shares their birthday with at least one other person in the group and 0 otherwise. It follows a Bernoulli distribution with probability of success p.
6. Define another random variable H as the sum of the k previously defined random variables:
H = B₁ + B₂ + … + Bₖ
which simply represents how many of the k people share their birthday with at least one other person.
7. We want the expected value of H. By linearity of expectation we can write:
E[H] = E[B₁] + E[B₂] + … + E[Bₖ] = k · p
8. Finally, in a group of k people, the expected number of people sharing their birthday with at least one other person, when we consider n possible birthdays, is:
E[H] = k · (1 − (1 − 1/n)^(k−1))
Equation 1
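As a quick sanity check of the derivation, the sketch below evaluates the closed form E[H] = k · (1 − (1 − 1/n)^(k−1)) and compares it with a small Monte Carlo simulation. The function names, the number of trials and the seed are illustrative choices made here, not part of the original analysis, and Python's pseudo-random generator stands in for a well-behaved hash function.

```python
import random


def expected_colliding(k: int, n: int) -> float:
    """Closed form: expected number of items (out of k) that share their
    bucket with at least one other item, given n equally likely buckets."""
    return k * (1.0 - (1.0 - 1.0 / n) ** (k - 1))


def simulated_colliding(k: int, n: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity with uniform random buckets."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = {}
        for _ in range(k):
            bucket = rng.randrange(n)
            counts[bucket] = counts.get(bucket, 0) + 1
        # Every item living in a bucket with 2 or more items is "colliding".
        total += sum(c for c in counts.values() if c > 1)
    return total / trials


if __name__ == "__main__":
    k, n = 23, 365  # the classic birthday setting
    print("formula   :", expected_colliding(k, n))
    print("simulation:", simulated_colliding(k, n))
```

With enough trials the two numbers should agree to within sampling noise; a large gap would suggest that the buckets are not close to uniformly distributed.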
Back to hashing. The metaphor is quite apt: the group of people is the feature space, the number of possible birthdays is the hashing space size, and having the same birthday is sharing the hash value, a collision. We assume that all birthdays are equally probable, which is actually not true for real birthdays, but it is very close to true for hashing: we are just assuming the hash function is a good one.
So equation 1 looks good, but one experiment is worth a thousand equations. Does it actually work? For our 3 data sets, we know how many features there are, how big the hash space is, and how many collisions actually happened. Let’s compare what equation 1 says and what reality shows:
The results are conclusive. They show that the model is correct, since it can successfully predict how many collisions will happen in all 3 data sets.
Controlling Collisions
We now have a good model for the expected number of collisions given feature space and hashing space sizes. A very natural next question follows: for a given feature space size k, how big should the hashing space n be to produce an expected number of collisions c? All we need to do is take equation 1, set E[H] = c and write n as a function of k:
n = 1 / (1 − (1 − c/k)^(1/(k−1)))
A pretty ugly function. Let’s see if it works.
Figure 5
Figure 5 shows the results of applying this formula. The x-axis represents the different feature space sizes (k), the y-axis represents the hashing space size (n), and each curve a specific number of collisions (c). The chart can be interpreted as a contour plot of equation 1. One surprising result is that for a fixed number of collisions, the hash size seems to grow quadratically with the number of features. It is also surprising that all curves can be perfectly fit by a parabola, which suggests that it should be possible to find a parabolic form for n given c and k (as an alternative to that ugly formula).
Indeed, under rather weak assumptions (n > 1 and k ≪ n), we can apply a binomial approximation to the power term in equation 1:
(1 − 1/n)^(k−1) ≈ 1 − (k−1)/n
and then plug it back to get:
E[H] ≈ k(k−1)/n
which allows us to write n as a function of k and c:
n ≈ k(k−1)/c ≈ k²/c
Equation 2
To validate this approximation we can simply compare with the original formula:
Equation 2 is a very good approximation and can therefore be used directly to decide the hash space size (given the feature space size and the desired number of collisions). It’s worth noting that for the case of 0 collisions we can conveniently set c=1 to get n=k².
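The comparison can also be done numerically. The sketch below solves equation 1 for n and compares the result with the parabolic approximation, written here as n ≈ k(k − 1)/c (an assumption about the exact form of Equation 2, obtained from the binomial approximation above). The function names are illustrative. Because (1 − c/k)^(1/(k−1)) is extremely close to 1 for large k, the exact form is computed with log1p and expm1 to avoid losing precision.

```python
import math


def expected_colliding(k: int, n: float) -> float:
    """Equation 1: expected number of colliding features."""
    return k * (1.0 - (1.0 - 1.0 / n) ** (k - 1))


def hash_size_exact(k: int, c: float) -> float:
    """Equation 1 solved for n. Requires k >= 2 and 0 < c < k."""
    # 1 - (1 - c/k)^(1/(k-1)) computed as -expm1(log1p(-c/k) / (k-1)).
    exponent = math.log1p(-c / k) / (k - 1)
    return -1.0 / math.expm1(exponent)


def hash_size_approx(k: int, c: float) -> float:
    """Parabolic approximation n ~ k(k-1)/c (assumed form of Equation 2)."""
    return k * (k - 1) / c


if __name__ == "__main__":
    for k, c in [(1000, 1), (1000, 50), (100000, 1)]:
        print(k, c, hash_size_exact(k, c), hash_size_approx(k, c))
```

The approximation is tightest when the target number of collisions is small relative to k, which is the regime where k ≪ n holds.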
Trade-offs
In practice there are two main approaches to implement the hashing trick:
Global hashing space: There’s only one hashing space and one single parameter to decide, but cross-field collisions can happen — countries can collide with user ids, for example. This is how Vowpal Wabbit works.
Per-field hashing space: There’s a hashing space per feature, allowing finer-grained control at the cost of some speed and more parameters to tune. This is how TensorFlow works.
Also, in practice, there are two main types of categorical features:
Moderate cardinality and static: These are features with fewer than a thousand categories that don't change a lot, like country. Usually it’s important to have one weight for each category, so having 0 collisions is what we’re after.
High cardinality and dynamic: These are features with more than a thousand categories that constantly get new ones, like user id or hotel id. Usually it’s not that important to have a weight for each category; the weight distribution over the categories is what matters. Collisions are acceptable but it’s hard to tell how many we are willing to accept. 5% collisions is a pretty conservative ansatz, giving a simple rule: n=20k. Cross-validation techniques can also help, but require more time and resources.
One criterion: After quite a bit of analysis, we have now all the elements to recommend a practical criterion to decide the size of the hashing space that gives a good balance of memory usage, interpretability and predictive power:
If you can choose the hashing space on a per-feature basis, use n=k² for features with fewer than a thousand categories and n=20k for the others.
If there is only one hashing space and fewer than twenty thousand features in total, use n=k²; otherwise use n=20k.
If you want to control for collisions as a proportion r of the features, then use n=k/r.
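The criterion above is simple enough to write down as a small helper. The sketch below is one possible encoding; the function names and the `per_field` flag are hypothetical, and `bits_for` is an extra convenience added here for tools that size their table as a power of two.

```python
import math


def hash_space_size(k: int, per_field: bool = True) -> int:
    """Rule of thumb: k^2 below the threshold, 20k above it.

    The threshold is 1,000 categories for a per-field hashing space and
    20,000 features in total for a single global hashing space.
    """
    threshold = 1_000 if per_field else 20_000
    return k * k if k < threshold else 20 * k


def hash_space_size_for_ratio(k: int, r: float) -> int:
    """Size needed to expect a proportion r of colliding features: n = k / r."""
    return math.ceil(k / r)


def bits_for(n: int) -> int:
    """Smallest b such that 2**b >= n."""
    return max(n - 1, 0).bit_length()


if __name__ == "__main__":
    n = hash_space_size(5000)
    print(n, bits_for(n))
```

For example, a per-field feature with 200 categories gets 200² = 40,000 buckets, while one with 5,000 categories gets 20 × 5,000 = 100,000.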
Conclusion
Of course the proposed criterion is not absolutely general; that is not the intent of this article (text problems might show different behaviours, for example). This work presents general principles that govern the Hashing Trick, the trade-offs involved, and an analysis that gives tools to construct heuristics and criteria to decide the size of the hash in many different regimes. Your mileage may vary.
A couple of findings were surprising:
In practice, the effect of collisions on predictive power is very low.
The minimum hash size required to expect a fixed number of collisions grows quadratically with the feature space size (this is especially relevant for the case of 0 collisions). If we only want to control collisions as a percentage of the number of features, then the hash size n grows linearly with the feature space size k.
I like to see the Hashing Trick as a successful example of the Ostrich Algorithm. Indeed algorithmically, the collisions issue is just ignored.
Finally, two things: all the code to reproduce these results is available on my GitHub, which contains some useful scripts to deal with Vowpal Wabbit models and some freaky snippets like maths with big numbers in Python. And if you are interested in a formal analysis of the hashing trick, I recommend Feature Hashing for Large Scale Multitask Learning by Weinberger, Dasgupta, Attenberg, Langford and Smola.
I want to thank Stas Girkin for asking this (at that moment) awkward question, Kristian Holsheimer for figuring out the power expansion, Denis Bilenko for the fruitful discussions, Themis Mavridis for reviewing the article and Steven Baguley and Kristofer Barber for making it readable.
Don’t be tricked by the Hashing Trick was originally published in Booking.com ML & DS Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

k-Nearest Neighbours: From slow to fast thanks to maths

Lucas Bernardi — Wed, 31 Aug 2016 07:00:00 GMT

Abstract: Building the best travel experience for our customers in Booking.com often involves solving very challenging problems. One that appears very frequently is the k-Nearest Neighbours problem (k-NN). In simple words it can be stated as follows: Given a thing, find the k most similar things. Depending on how thing and similar are defined, the problem becomes more or less complex. In this post we’ll treat the case where the things are represented by vectors, and the similarity of two things is defined by their angle. We’ll discuss solutions and present a practical trick to make it fast, scalable, and simple. All of it, thanks to maths.
Photo by Jason Leung on Unsplash
1. Things and Similarity
Suppose that we are given a database with pictures of handwritten digits, and the task of finding similar digit handwriting. Every picture contains exactly one digit, but we don’t know which one. For every given picture we want to find other pictures that contain the same digit written with a similar style.
First, we need to find a representation for the pictures that we can operate on. In this case it will simply be a vector: the length d of the vector is the number of pixels in the picture, and the components are the RGB values of the pixels. This representation has the advantage of working at both the computational level and the mathematical level.
Second, we need to define similarity. Since the pictures are represented by vectors, we can compute the angle between any two of them; if the angle is small, the vectors point in the same direction and the pictures are similar. On the other hand, if the angle is big, the vectors diverge from each other and the pictures are not similar. More formally, the similarity between two pictures represented by vectors X and Y is given by the cosine of their angle:
similarity(X, Y) = cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖)    (equation 1)
This quantity has a very nice property: It will never be more than 1, or less than -1. Given two vectors, if their similarity is close to 1 then they are very similar; if it is close to -1 they are completely different. 0 is in the middle — not very similar, but not completely different either.
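A minimal sketch of this similarity in plain Python, assuming vectors are given as equal-length sequences of numbers (the function name is illustrative):

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y: dot(x, y) / (|x| * |y|).

    Both vectors must be non-zero, otherwise the division fails.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    print(cosine_similarity([1, 2], [2, 4]))   # same direction
    print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
    print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction
```

Note that scaling either vector by a positive constant leaves the result unchanged, a property the rest of the post relies on.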
Let’s see this in action:
Figure 1: Two very similar 4s. Their similarity according to equation 1 is 0.85
Figure 2: Two very different 4s. Their similarity according to equation 1 is 0.11
Figure 1 shows a graphical representation of the vectors computed from two handwritten digit pictures. These two digits are very similar, and when their similarity is computed by their angle using equation 1 it gives 0.85, which is, accordingly, quite high. On the other hand, Figure 2 shows two quite different numbers; this time their similarity is 0.11, which is quite low but still positive — even though the writing style is very different, both pictures are still a 4.
These pictures were handpicked to illustrate the vector representation and the cosine similarity. We now move on to finding an algorithm that finds similarly handwritten digits.
2. A simple solution
Now that we know how to represent things and compute their similarity, let’s solve the k-NN problem:
1. For every picture in our database, compute its associated vector.
2. When the k-Nearest Neighbours for a picture are requested, compute its similarity to every other picture in the database.
3. Sort the pictures by ascending similarity.
4. Return the last k elements.
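The steps above can be sketched in a few lines of plain Python. This is an illustrative full scan, assuming the feature vectors have already been computed and are stored in a list; the function names are hypothetical.

```python
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def knn_full_scan(query, database, k):
    """Return the indices of the k most similar vectors, most similar first."""
    if k <= 0:
        return []
    # Compute the similarity of the query to every vector in the database.
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(database)]
    # Sort by ascending similarity.
    scored.sort()
    # Keep the last k elements (returned here in descending order).
    return [i for _, i in reversed(scored[-k:])]


if __name__ == "__main__":
    database = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0]]
    print(knn_full_scan([1, 0], database, 2))
```

If the query itself is stored in the database it will come back as its own nearest neighbour, so a caller may want to ask for k + 1 results and drop the first.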
This is a very good solution (especially because it works). Figure 3 shows this algorithm in action. Every row shows the top 9 most similar pictures to the first picture in the row. The first row captures very rounded 3s, the second inclined 3s, the fourth row shows 2s with a loop at the bottom, and the fifth row shows Z-like 2s. Notice that this algorithm has no information about what digit is in the picture (nor, for that matter, about what kind of thing the picture shows), but it nevertheless succeeds in grouping by digit, and even by typographic style.
But let’s take a closer look by analysing its computational complexity. Consider n pictures with d pixels each:
Computing a feature vector is O(d). As this is done for every picture, the first step is O(nd).
The similarity function is O(d), and again this happens n times, so the second step is also O(nd).
The third step, sorting n elements, is O(nlogn).
Finally, returning the last k elements could be constant, but let’s consider it O(k).
In total we get:
O(nd) + O(nd) + O(nlogn) + O(k)
k is very small compared to n and d, so we can neglect the last term. We can also collapse the first two terms into a single O(nd). Now, note that logn is much smaller than d: for example, if we have 10 million pictures with 256 pixels each (d = 256), logn (in base 10) would be 7, much smaller than d. That means that in practice we can absorb the O(nlogn) term into another O(nd). Therefore the computational complexity of this algorithm is O(nd), that is, linear in both the total number of pictures in our database and the number of pixels per picture.
Figure 3: k nearest neighbours for some pictures. Every row depicts the top 9 most similar pictures to the first picture in the row
3. Can we do better?
If we do computational complexity analysis, it is natural to ask ourselves whether we can improve. So let’s try.
One idea to consider is to use a heap to keep track of the k most similar items. The heap would never be larger than k, so every insertion costs O(logk) comparisons of already computed similarity scores, on top of the O(d) needed to compute each similarity. Since there will be n insertions, in total we get O(nd) + O(nlogk), which is still O(nd) and therefore not an improvement.
We could also try to exploit the fact that we do not need to sort the n elements, just to get the top k. The algorithm would be exactly the same, but step 3 would be replaced by applying quick-select instead of sorting. This would change the O(nlogn) term to O(n), which gives O(nd) + O(n), which is O(nd): again, not an improvement.
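For illustration, here is a variant of the full scan that avoids sorting all n scores by using `heapq.nlargest` from the Python standard library, which keeps only the best k candidates while iterating. It is a sketch with hypothetical function names; every similarity still has to be computed, so the scan over the database remains the dominant cost.

```python
import heapq
import math


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def knn_top_k(query, database, k):
    """Indices of the k most similar vectors, most similar first,
    selected without sorting the full list of scores."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(database))
    return [i for _, i in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    database = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0]]
    print(knn_top_k([1, 0], database, 2))
```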
The last idea we will consider is to use a Space Partitioning Tree (SPT). An SPT is a data structure that allows us to find the closest object to another object in logarithmic time. A priori this seems to be the right solution but there is a problem: SPTs can only operate under certain distance functions, specifically metric distances.
SPTs work with distances, not with similarities. But there is a very close relationship between similarity and distance. In the context of k-NN, for every similarity function there exists a distance function such that searching the k most similar items is equivalent to searching the k closest items using that distance function. Just multiplying the similarity by −1 gives such a distance. So now we have a cosine distance that we could use in an SPT, but unluckily this cosine distance is not a metric distance.
A metric distance is a distance that satisfies the following conditions:
distance(x, y) ≥ 0
distance(x, y) = 0 ⇔ x = y
distance(x, y) = distance(y, x)
distance(x, z) ≤ distance(x, y) + distance(y, z)
Cosine distance clearly violates the first condition, but this is easy to fix by just adding 1. The second and the third conditions are met. Finally the fourth condition is violated and this time we cannot fix it. Here is an example of 3 vectors that violate the fourth condition:
And then:
which proves that condition 4 is not met.
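The violation is easy to reproduce numerically. The vectors below are chosen here purely for illustration (they are not necessarily the ones used in the original example), and the cosine distance is taken as 1 − cos, that is, the negated similarity shifted by 1 as described above.

```python
import math


def cosine_distance(x, y):
    """1 - cos(x, y): the negated cosine similarity, shifted to be non-negative."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms


# Illustrative vectors: y sits halfway (45 degrees) between x and z.
x, y, z = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

d_xz = cosine_distance(x, z)
d_xy = cosine_distance(x, y)
d_yz = cosine_distance(y, z)

if __name__ == "__main__":
    print("d(x, z)           =", d_xz)
    print("d(x, y) + d(y, z) =", d_xy + d_yz)
    print("triangle inequality holds:", d_xz <= d_xy + d_yz)
```

Going from x to z directly costs 1, while going through y costs 2 · (1 − 1/√2), which is smaller, so the direct path is longer than the detour.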
In the following sections we are going to show a trick to overcome this limitation.
4. Maths
Let’s introduce some properties of vectors that we’ll exploit later.
Cosine distance is invariant under Normalization
First, let’s make a few definitions. The norm of a vector X is:
‖X‖ = √(X · X)
and the normalized version of X is:
X̂ = X / ‖X‖
A consequence of these two definitions is the following:
‖X̂‖ = 1
which says that the norm of a normalized vector is always 1. This property is quite obvious, but here is a proof:
‖X̂‖ = ‖X / ‖X‖‖ = ‖X‖ / ‖X‖ = 1
Another consequence is the following:
cos(X̂, Ŷ) = cos(X, Y)    (equation 4)
In words, this means that the angle between two vectors doesn’t change when the vectors are normalized. Normalization only changes the length (the norm of the vector), not its direction, and therefore the angle is always kept. Again, here is the proof:
cos(X̂, Ŷ) = (X̂ · Ŷ) / (‖X̂‖ ‖Ŷ‖) = X̂ · Ŷ = (X · Y) / (‖X‖ ‖Y‖) = cos(X, Y)
From Euclidean to Cosine
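The second property we need for this trick is the following:
If X and Y are vectors with norm 1 (unit vectors) then:
‖X − Y‖ = √(2(1 − cos(X, Y)))    (equation 5)
This states that if X and Y are unit vectors then there is an exact relationship between the Euclidean distance from X to Y and the angle between them.
The proof:
‖X − Y‖² = (X − Y) · (X − Y) = ‖X‖² + ‖Y‖² − 2 X · Y = 2 − 2 X · Y = 2(1 − X · Y)
And since X and Y are unit vectors, dividing by their norms is dividing by one:
X · Y = (X · Y) / (‖X‖ ‖Y‖) = cos(X, Y), and therefore ‖X − Y‖ = √(2(1 − cos(X, Y)))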
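The relationship for unit vectors, ‖X − Y‖ = √(2(1 − cos(X, Y))), can be checked numerically. The sketch below normalizes two arbitrary vectors and evaluates both sides; the function names are illustrative, and the `max(0.0, ...)` guard only protects the square root against tiny negative values caused by floating-point rounding.

```python
import math


def normalize(v):
    """Scale v to unit length (v must be non-zero)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def both_sides(x, y):
    """Return (euclidean distance, sqrt(2 * (1 - cos))) for the normalized x and y."""
    xn, yn = normalize(x), normalize(y)
    left = math.sqrt(sum((a - b) ** 2 for a, b in zip(xn, yn)))
    right = math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(xn, yn))))
    return left, right


if __name__ == "__main__":
    print(both_sides([3, 4], [1, 0]))
    print(both_sides([1, 2, 3], [-2, 0.5, 4]))
```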
Cosine ranking is equivalent to Euclidean ranking
By looking at equation 5 we can already see that if 1 − cos(X, Y) is bigger, then ‖X − Y‖ must be bigger. That means that if Y moves away from X in the Euclidean space, it also does in the cosine space, provided both X and Y are unit vectors. This allows us to establish the following:
Consider three arbitrary d-dimensional vectors X, A and B (they don’t need to be unit vectors), and write X̂ = X / ‖X‖ for the normalized version of X (likewise Â and B̂). Then the following holds:
1 − cos(X, A) < 1 − cos(X, B) ⇒ ‖X̂ − Â‖ < ‖X̂ − B̂‖
This says that if the cosine distance between X and A is less than the cosine distance between X and B, then the Euclidean distance between the normalized X and A is also less than the Euclidean distance between the normalized X and B. In other words, if A is closer to X than B in the cosine space, it is also closer in the Euclidean space once the vectors are normalized.
The proof: We start from the left hand side expression and apply operations to get to the right hand side expression.
1 − cos(X, A) < 1 − cos(X, B)
⇒ 1 − cos(X̂, Â) < 1 − cos(X̂, B̂)    (cosine is invariant under normalization, see equation 4)
⇒ √(2(1 − cos(X̂, Â))) < √(2(1 − cos(X̂, B̂)))    (doubling and taking the square root keeps the inequality)
⇒ ‖X̂ − Â‖ < ‖X̂ − B̂‖    (normalized vectors are unit vectors, see equation 5)
This is all the maths we need to apply the trick. Let’s see what it is about.
5. The k-NN Trick
The goal of this trick is to find a way to be able to use cosine similarity with a Space Partitioning Tree, that would give us O(log n) time complexity, which is a huge improvement.
The idea is actually very simple: Since cosine similarity is invariant under normalization, we can just normalize all our feature vectors and the k-nearest neighbours to X will be exactly the same; but now our vectors are all unit vectors, which means that sorting them by cosine distance to X is exactly the same as sorting them by Euclidean distance to X, and since Euclidean distance is a proper metric we can use a Space Partitioning Tree and enjoy the logarithm of n. Here’s the recipe:
Normalize all the feature vectors in the database
Build a Space Partitioning Tree using the normalized vectors
When the k nearest neighbours to an input vector X are requested:
- Normalize X
- Look up the k-NN in the Space Partitioning Tree
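The equivalence at the heart of the recipe can be demonstrated without any tree at all. The sketch below normalizes a small random database once, then ranks by plain Euclidean distance and checks that the result matches the cosine ranking. It deliberately uses a brute-force scan in place of a Space Partitioning Tree; in a real system the `sorted` call in `knn_trick` is the part a ball tree or KD-tree would replace. All names, sizes and the random data are illustrative.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def knn_cosine(query, database, k):
    """Reference answer: rank the raw vectors by descending cosine similarity."""
    order = sorted(range(len(database)),
                   key=lambda i: cosine_similarity(query, database[i]),
                   reverse=True)
    return order[:k]


def knn_trick(query, normalized_db, k):
    """Rank pre-normalized vectors by ascending Euclidean distance to the
    normalized query. A metric tree could answer this query instead."""
    q = normalize(query)
    order = sorted(range(len(normalized_db)),
                   key=lambda i: math.dist(q, normalized_db[i]))
    return order[:k]


rng = random.Random(42)
database = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
normalized_db = [normalize(v) for v in database]  # done once, up front
query = [rng.gauss(0, 1) for _ in range(8)]

if __name__ == "__main__":
    print(knn_cosine(query, database, 10))
    print(knn_trick(query, normalized_db, 10))
```

Both calls should print the same list of indices, which is the guarantee that makes it safe to hand the normalized vectors to any Euclidean nearest-neighbour structure.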
6. Experiments
Experimentation is at the core of Product Development at Booking.com. Every idea is welcomed, turned into a hypothesis and validated through experimentation. And Data Science doesn’t escape that process.
In this case, the idea has been thoroughly described and supported with practical examples and even maths. But let’s see if reality agrees with our understanding. Our hypothesis is the following: We can improve the response time of the algorithm described in section 2 by applying the trick described in section 5 while guaranteeing exactly the same results.
To test this hypothesis we designed an experiment that compares the time needed to solve the k-NN problem using the full scan solution with the time needed by the k-NN trick solution. The k-NN trick is implemented using two different Space Partitioning Trees: Ball Tree, and KD-Tree.
The database consists of handwritten digit pictures from MNIST. For n ranging from 5000 to 40000, we randomly sampled n pictures from the original database, then applied the different solutions to the same sample, computing the 10 most similar pictures for 20 input pictures.
7. Results
The results of our experiment are summarized by Figure 4:
Figure 4: Comparison of the full scan solution (brute force) and the k-NN trick (norm euclidean ball tree, and kd tree) for different database sizes n
From the chart we can draw several conclusions: First, the time complexity of the full scan solution is indeed linear in n, as suggested by the blue dots. This confirms the theoretical analysis in section 2. Second, although it is hard to say if the k-NN trick based solution is logarithmic, it is clearly much better than the full scan, as suggested by the green and red dots. Third, the Ball Tree based solution is better than the KD-Tree solution, though the reason for this is not clear and requires further analysis and experimentation. Overall, the experiment strongly supports the hypothesis.
8. The Trap
Every trick sets up a trap, and every gain in one aspect hides a loss in another. Being aware of these traps is key to successfully applying these tricks. Let’s see what trap the k-NN trick sets, or, in more technical words, what kind of trade-off we are dealing with.
In the simple solution, before being able to answer a k-NN query all we need to do is compute the feature vectors of each object in the database. On the other hand, when using the trick, before we are able to answer a query we need not only to compute the feature vectors but also to build the Space Partitioning Tree. In the experiment we ran, we also recorded the time it takes to be ready to answer queries. The results are displayed in Figure 5 and show that the trick-based solutions scale much worse than the simple solution. This means that when using the trick we are trading start-up time for query response time.
This trade-off must be weighed carefully, and for big databases it can have very negative consequences. Consider an e-commerce website that goes down for whatever reason; imagine that this website uses k-NN to serve some recommendations (a very important yet not critical part of the system). As soon as we fix the problem, we want the system to reboot as quickly as possible, but if the booting process depends on the k-NN system we fall into the trap: users won’t be able to purchase anything until our Space Partitioning Tree is built. Not good.
This can be easily solved by breaking the dependence using parallel or asynchronous processes to boot different parts of the system, but the simple solution is clearly more robust in this instance, up to a point where we don’t even need to care. The k-NN trick forces us to consider this very carefully and act properly. For many applications, this isn’t a bad price to pay for the speed and scalability we get at query time.
9. Conclusion
In this post we described a trick to speed up the standard solution for the k-NN problem with cosine similarity. The mathematical rationale for the trick was presented, as well as experiments that support its validity. We consider this a good example of a scalability problem overcome by applying elementary maths. This is also a good example of Reductionism: The trick is a reduction from the cosine similarity k-NN problem to a Euclidean distance k-NN problem, which is a much more studied and solved problem. Maths and Reductionism are two concepts sitting at the core of applied Data Science at Booking.com, always at the service of the best travelling experience.
Figure 5: Ready-time comparison of the full scan solution and the k-NN trick for different database sizes n
k-Nearest Neighbours: From slow to fast thanks to maths was originally published in Booking.com ML & DS Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.