NLP at SnapTravel: Hotel search refinement using SpaCy entity recognition

UW Data Scientist · Published in Super.com · Aug 14, 2018 · 6 min read

In this blog post, I will give an overview of the project I’ve been working on for the past few months as an NLP Research Engineer at SnapTravel.

SnapTravel helps people book hotels by talking to a chatbot. Using plain English, the user describes to our bot what kind of hotel she’s looking for, and we recommend the best hotels that fit her criteria. To do this, we need to build NLP models that understand the user’s message and extract the relevant pieces of information.

For example, a typical message might look like this:

In this query, the user specifies several things, including the city she’s travelling to, the check-in date, the duration of stay, and the number of people, and she also mentions that she needs parking.

Another example:

This time, the user wants to book the Bellagio, a luxury resort in Las Vegas. The word “bellagio” could also refer to a town in Italy, so by itself, “bellagio” is ambiguous. However, when combined with “vegas”, we know for sure that she wants the resort, not the Italian town. This is just one of many examples of the linguistic ambiguities that we need to consider when designing our NLP system.

First attempt: Regular expressions

The first thing we tried was a handcrafted set of regular expressions that looked for keywords. This rudimentary system worked well enough for some things — there aren’t that many ways to say you want parking. It also served as a good baseline to compare our more sophisticated models against.

Sample regular expression to detect star preference
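As a rough sketch (not the exact pattern from our production system), a regex that picks up a star preference could look something like this:

import re

# Illustrative pattern: matches phrases like "4 star", "4-star", "four star hotel".
STAR_PATTERN = re.compile(
    r"\b(?P<stars>[1-5]|one|two|three|four|five)[\s-]*stars?\b",
    re.IGNORECASE,
)

WORD_TO_NUM = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def extract_star_preference(message):
    """Return the requested star rating as an int, or None if absent."""
    match = STAR_PATTERN.search(message)
    if not match:
        return None
    stars = match.group("stars").lower()
    return WORD_TO_NUM.get(stars) or int(stars)

print(extract_star_preference("looking for a 4-star hotel in vegas"))  # 4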

We used regex-based heuristics to parse certain simple keywords like star preferences. However, for other categories, like city and hotel names, it was clear that regular expressions weren’t going to scale. With over 1.5 million hotels and 60,000 cities in our catalogue, we needed a smarter approach.

Data labelling

Before we could do any machine learning, we needed data to train our models and to evaluate how well they perform. To do this, we built an annotation tool using PyBossa and hired professional data labelling agents to spend a few hours each week annotating our data.

For higher accuracy, each message was first labelled by one agent and then handed over to a second, different agent who checked for mistakes.

Our data labelling interface

Soon enough, we got a few thousand annotated messages. Let the machine learning begin!

SpaCy neural named entity recognition

In this section, we tackle the problem of resolving city and hotel names in user messages.

If your friend tells you “I drove to Santa Barbara over the weekend”, you can infer from context that “Santa Barbara” is the name of a place, even if you’ve never heard of it before. It’s obvious from the shape of the word and the words surrounding it. Using NLP, computers can do this too.

This problem is a well-studied area of NLP called named entity recognition (NER). NER deals with extracting the tokens in a text that are proper names and deciding what type of entity each one is: a person, a place, an organization, and so on.
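Concretely, NER is usually framed as tagging every token in the sentence. Here’s a toy illustration using the common BIO tagging scheme (the tags are made up for this example):

# Each token gets a tag: B-LOC starts a location name, I-LOC continues it,
# and O marks tokens that are not part of any entity.
sentence = ["I", "drove", "to", "Santa", "Barbara", "over", "the", "weekend"]
tags     = ["O", "O",     "O",  "B-LOC", "I-LOC",   "O",    "O",   "O"]

for token, tag in zip(sentence, tags):
    print(f"{token:>8}  {tag}")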

Neural NER using bidirectional LSTM and a CRF (Lample, 2016)

Lample et al. (2016) were among the first to use recurrent neural networks for NER. In their model, the input sentence is transformed into a sequence of word embeddings, which is fed into a bidirectional LSTM. The LSTM learns to output a sequence of entity tags, one for each word, while a conditional random field (CRF) layer learns constraints between adjacent tags so that the output sequence as a whole is consistent.
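To make the architecture a bit more concrete, here is a rough sketch of the bidirectional LSTM backbone in PyTorch (purely illustrative, not our code; the character-level embeddings and the CRF layer are omitted for brevity):

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)  # 2x: forward + backward states

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        hidden, _ = self.lstm(embedded)       # (batch, seq_len, 2 * hidden_dim)
        return self.out(hidden)               # per-token tag scores

# Toy example: 8 tokens, a vocabulary of 1000 words, 5 possible tags.
model = BiLSTMTagger(vocab_size=1000, num_tags=5)
scores = model(torch.randint(0, 1000, (1, 8)))
print(scores.shape)  # torch.Size([1, 8, 5])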

This sounds complicated, but SpaCy implements a variation of this algorithm so that you don’t have to implement it yourself. All you need is a dataset of sentences with ground-truth NER tags, and it handles the input preprocessing and the training of the network. Better still, it comes with pre-trained models and word vectors that you can fine-tune on your own data.
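A minimal sketch of what that looks like, using the spaCy v2-style training API (the label names and training examples below are illustrative, not our real data):

import random
import spacy

# Training data: character offsets of each entity plus a label.
TRAIN_DATA = [
    ("bellagio vegas next weekend",
     {"entities": [(0, 8, "HOTEL"), (9, 14, "CITY")]}),
    ("2 nights in chicago for 4 people",
     {"entities": [(12, 19, "CITY")]}),
]

nlp = spacy.load("en_core_web_sm")  # pre-trained English model
ner = nlp.get_pipe("ner")
for label in ("HOTEL", "CITY"):
    ner.add_label(label)

# Update only the NER component, leaving the rest of the pipeline untouched.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.3)

doc = nlp("bellagio vegas next weekend")
print([(ent.text, ent.label_) for ent in doc.ents])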

When we trained our SpaCy NER model, it learned to extract location tokens out of a sentence with decent accuracy, but it had difficulty distinguishing between city names and hotel names. Consider the following queries:

  • “Bellagio Vegas next weekend”
  • “Abano Terme Italy next weekend”

In the first example, “Bellagio” is the name of a hotel and “Vegas” is the name of a city. What about the second example: is “Abano” the name of a hotel in a city called “Terme”?

Actually, Abano Terme is a city in Italy, but there’s no way to know that unless you know some Italian geography. The two sentences are syntactically identical, so without access to any external world knowledge, the NER model struggles to learn the difference.

Augmenting NER

As mentioned above, the NER model by itself didn’t give us the accuracy we needed.

To improve our hotel and city resolution, we augmented SpaCy’s NER with our own models (confidential to SnapTravel). This substantially improved our accuracy: while SpaCy’s NER alone resolved the correct hotel with an F1 score of 65%, combining it with other models lifted the F1 score to over 80%.
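I can’t share the details of those models, but to give a flavour of the general idea, one simple way to bring in external world knowledge is to check candidate phrases against a gazetteer of known city and hotel names and then resolve the ambiguous hits jointly with the NER output. A toy sketch (not our actual models):

# Toy gazetteers; the real catalogue has ~60,000 cities and 1.5 million hotels.
KNOWN_CITIES = {"las vegas", "vegas", "abano terme", "bellagio"}  # Bellagio is also an Italian town
KNOWN_HOTELS = {"bellagio", "caesars palace"}

def gazetteer_candidates(message, max_len=3):
    """Return every n-gram (up to max_len tokens) that matches a known name."""
    tokens = message.lower().split()
    hits = []
    for n in range(max_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in KNOWN_CITIES:
                hits.append((phrase, "CITY"))
            if phrase in KNOWN_HOTELS:
                hits.append((phrase, "HOTEL"))
    return hits

print(gazetteer_candidates("bellagio vegas next weekend"))
# [('bellagio', 'CITY'), ('bellagio', 'HOTEL'), ('vegas', 'CITY')]
# "bellagio" matches both lists; deciding between such hits, using the other
# entities found in the message, is where the additional models come in.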

Our model also recognized situations that required asking the user additional questions to clarify what she’s searching for.

Future directions

That’s it for a high-level overview of our search refinement NLP system!

I’ve done internships as a software engineer at several companies before, but this is my first machine learning and data science internship. Doing machine learning is a vastly different experience from doing software engineering. The work is a lot more experimental, and you never know which models are going to work or what F1 score you’re going to achieve until you try. For every model that makes it into production, I have five Jupyter notebooks full of experiments that didn’t work out.

A good approach is to read a lot of research papers to get ideas for things to try, and to understand the characteristics of many different models so you can judge which ones are most suitable for the task at hand. In machine learning, the “no free lunch” theorem says that there’s no universal learning algorithm that works best for all problems. Instead, you must adjust your models to take advantage of implicit business assumptions and the specific properties of your dataset.

In the next few months, we will continue working on the many little details needed to bring this system into production, and keep trying different models to improve its accuracy. Thanks to Leon Jiang and Nehil Jain from our Engineering team for their support in making this project possible!

Interested in learning more about the SnapTravel team or joining in to help us solve these tough challenges? Check out our Careers page here!
