Over the last five years, Machine Learning has become a standard tool for Product Development at Booking.com. Today, it plays a role in every step of the customer journey. Hundreds of Data Scientists build, deploy and experiment with hundreds of machine-learned models, exposing them to millions of users every day.

Supporting Machine Learning at scale involves many challenges, not least of which is shipping models to production reliably and as fast as possible, while accommodating a large variety of model types, invocation settings, libraries, data sources, monitoring approaches, etc. …

In Machine Learning, the Hashing Trick is a technique for encoding categorical features. It’s been gaining popularity lately after being adopted by libraries like Vowpal Wabbit and TensorFlow (where it plays a key role) and others like sklearn, where support is provided to enable out-of-core learning.

Unfortunately, the Hashing Trick is not parameter-free; the hashing space size must be decided beforehand. In this article, the Hashing Trick is described in depth, the effects of different hashing space sizes are illustrated with real world data sets, and a criterion to decide the hashing space size is constructed.
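
To make the idea concrete, here is a minimal sketch of the Hashing Trick in plain Python. It is an illustration under stated assumptions, not the implementation used by any of the libraries mentioned above: the helper names (`hash_index`, `encode_row`), the choice of MD5 as the hash function, and the `feature=value` key format are all hypothetical choices made for this example.

```python
import hashlib


def hash_index(feature: str, value: str, n_buckets: int) -> int:
    """Map a (feature, value) pair to a bucket in [0, n_buckets).

    MD5 is an assumption made for reproducibility; Python's built-in
    hash() is salted per process for strings, so it is avoided here.
    """
    key = f"{feature}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "little") % n_buckets


def encode_row(row: dict, n_buckets: int) -> dict:
    """Encode a row of categorical features as a sparse vector.

    The result maps bucket index -> count. Two different (feature, value)
    pairs that collide simply add up in the same bucket.
    """
    vec = {}
    for feature, value in row.items():
        idx = hash_index(feature, str(value), n_buckets)
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def distinct_buckets(values, n_buckets: int) -> int:
    """Count how many buckets are actually used by a list of values."""
    return len({hash_index("city", v, n_buckets) for v in values})


if __name__ == "__main__":
    row = {"country": "NL", "device": "mobile", "language": "en"}
    print(encode_row(row, n_buckets=2 ** 10))

    # A small hashing space forces collisions; a large one makes them rarer.
    cities = [f"city_{i}" for i in range(100)]
    print(distinct_buckets(cities, n_buckets=8))
    print(distinct_buckets(cities, n_buckets=2 ** 20))
```

The only parameter is `n_buckets`, the hashing space size: no vocabulary has to be built or stored beforehand, but a space that is too small makes unrelated categories share a coefficient.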

Out-of-core learning

Consider the problem of learning a linear model: an out-of-core algorithm learns the model without loading the whole data set into memory. It reads and processes the data row by row, updating feature coefficients on the fly. This makes the algorithm very scalable since its memory footprint is independent of the number of rows, which is a very attractive property when dealing with data sets that don’t fit in memory. …
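
The row-by-row update described above can be sketched as follows. This is a minimal illustration, assuming squared loss and plain stochastic gradient descent with a fixed learning rate; the function name `sgd_linear` and its parameters are hypothetical, and the article itself may rely on a different loss or update rule.

```python
from typing import Dict, Iterable, List, Tuple

# A row is a sparse feature vector (index -> value) plus a numeric target.
Row = Tuple[Dict[int, float], float]


def sgd_linear(rows: Iterable[Row], n_features: int, lr: float = 0.1) -> List[float]:
    """Fit a linear model in one pass over a stream of rows.

    Only the coefficient vector is kept in memory, so the footprint
    depends on n_features and not on the number of rows.
    Assumption: squared loss with a constant learning rate.
    """
    w = [0.0] * n_features
    for x, y in rows:
        pred = sum(w[i] * v for i, v in x.items())
        err = pred - y
        # Gradient step on the coefficients touched by this row only.
        for i, v in x.items():
            w[i] -= lr * err * v
    return w


def stream(n_rows: int) -> Iterable[Row]:
    """Toy generator standing in for a file read line by line."""
    for k in range(n_rows):
        # Alternate between two one-hot features with targets 2.0 and -1.0.
        if k % 2 == 0:
            yield {0: 1.0}, 2.0
        else:
            yield {1: 1.0}, -1.0


if __name__ == "__main__":
    print(sgd_linear(stream(1000), n_features=2))
```

Because `rows` is any iterable, the same function works with a generator that reads a file line by line, which is what keeps the memory footprint independent of the number of rows.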


Lucas Bernardi

Principal Data Scientist at Booking.com. My focus is Data Science for Product Development, Recommender Systems and Productionization of Machine Learning Models
