How can AI help people make better decisions?
Machine Translation of Reviews at Booking.com.
Ever since we lived together as hunters and gatherers, we have been accustomed to following what the majority did. Why so?
Because it saved the energy we would otherwise spend on trial and error. It gave us quick and safe paths to pursue, paths that had already been tested for risk and proven worth our time.
The fundamental behaviour of humans hasn’t changed much in the millions of years that have since passed. What has changed between then and now is only the medium through which we gather the wisdom of the crowd.
Back then, people gathered in one spot, usually designated as the public square, to exchange information and collectively decide the next course of action. Today, we gather on the Internet, a virtual meeting place, to assimilate public information and individually decide our actions.
A village is not the world:
Though the decision-making processes then and now may appear to be the same, they are not quite!
With the internet, and the extra information it brings, an individual’s dreams have grown, too. The volume and variety of information available on the internet is much larger than it was a few decades ago. And so, one has to search longer, assimilate and retain a greater amount of relevant information, and spend a significant amount of energy weighing options before deciding.
An example of such a complex decision-making scenario arises when one searches travel websites to find a place to stay or things to do in an unknown part of the world. I will take the case of Booking.com, where I work. On Booking.com, people have a choice of 2 million properties to search through and 160 million reviews to read before they decide to book a hotel.
From user research studies we know that travel reviews are one of the top 3 factors influencing decision making, next to Price and Location.
Price and Location are decided by the market and the user’s needs respectively, and they hardly change. A traveller might be willing to go a little above or below her budget, and she may drop or change her destination based on the political stability of, and threats in, a city, but on the whole these two factors are less subject to change.
Reviews, on the other hand, are opinions and facts shared by other travellers, and they are free to read. They contain a treasure trove of information that can be transformed (summarised or categorised) or translated into something easier to consume than raw text. On a daily basis 300,000 new reviews are added. Humans cannot process that volume to glean the necessary information; only an AI could handle data at such scale.
Why should reviews be machine translated?
Reviews contain on-the-ground facts and perceptions a traveller encountered during her stay. They contain both positive and negative sentiments and provide a well-balanced account of the experience one has had. Such is the power of reviews that they can tip a traveller to either book or abandon a room.
The majority (80%) of hotels and homestays don’t get visited that often. Only a handful of popular ones (20%) get booked repeatedly and garner much of the attention and market share. The Pareto Principle fully applies to hotel bookings.
There are some fundamental reasons why people find only certain hotels favourable to book. A hotel that gets booked gets endorsed by its travellers, and with time it amasses a significant number of reviews. These act as social proof for future bookers to trust and book it. The cycle continues, leaving less and less chance for new hotels to be booked and reviewed.
What should be done to break the vicious cycle?
There are two different cases:
- Hotels that don’t even have a single review
- Hotels that have some reviews
Hotels with no reviews can be helped to get more bookings through other means of improvement. That is a separate discussion and we will not cover it here.
Hotels that have some reviews, but not enough, can be helped. A review in one language can be translated into multiple languages to increase its visibility.
A hotel with a couple of reviews in English will, once they are translated into 10 languages, have 10X more visibility. More visibility has proven to bring in more bookings and, as a result, more reviews.
How did we translate reviews?
Our very first thought was to use Google Translate. We ran an A/B test of Google Translate with actual users and found that it did not help. Why?
There were two main issues:
- Users did not find the quality of translations good enough
- We concluded that having to click a button and wait for translations to load was not an optimal experience (especially on mobile and on slow internet connections)
We then decided to build our own translator within Booking.com.
What went into building a high-quality MT?
At first, we wanted to benchmark ourselves against Google to know how far behind we were and how much effort would be needed to build a good translator ourselves. We thought that a comparison with Google would indicate the time, effort and cost needed to build a model for Reviews. We translated a sample of reviews with both Booking and Google and had them evaluated by professional translators.
As expected, the initial results showed that Google was definitely better than ours!
Why did the Booking translator have problems translating reviews?
We dug a little deeper into the reviews data to find out why Booking MT was worse.
We found that it translated reviews with positive sentiment, correct grammar and good sentence structure well. But it failed to translate ones with negative sentiment, punctuation errors or emoticons.
“Smells so baaaaaad!” :(
“Haaapppyyyyy!!” :)) :O
We surmised that the reason for the bad translations was biased training data. We hypothesized that we could achieve better translation accuracy if we had more, and more varied, training data. So it boiled down to having enough training data to achieve better translation than Google. We knew that more data could solve the problem, but we did not know how to get it.
Hunt for data:
We needed answers to two questions:
- Where do we get training data?
- How much data is good enough?
We went looking for free data sources that might have what we needed, and we found a few. They were mostly movie subtitles, TED talk subtitles, European Parliament translations and many other types of data loosely related to the travel domain.
They were not the purest form of training data, nor did they contain the exact sentences we have in our customer reviews. We did not know then whether such data would improve translation accuracy, but we wanted to explore the possibility and went ahead with using them as training data.
How did open source data improve review translations?
We experimented with the free data sources available on the internet (movie subtitles, TED talk subtitles etc.). We did not know if it would work, nor did we know how to pick sentences similar to customer reviews. After multiple discussions and reading many white papers, we worked out a methodology to isolate sentences good enough for our requirement. We picked the most relevant sentences from the subtitles and added them to the training data. We had the translations evaluated by professional translators, and the translation quality improved. This supported our hypothesis that adding more data with negative sentiment does in fact improve translation accuracy. It tuned the model to translate new contexts and semantics of words. We learnt that we can improve the model’s accuracy by training it with freely available translations on the internet.
Booking: 53.5% | 62.5%
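To make the idea of picking review-like sentences concrete, here is a minimal sketch. It is not the methodology used at Booking.com, which is not detailed here. It assumes a simple vocabulary-overlap heuristic: a subtitle sentence is kept if enough of its words also appear in a sample of customer reviews. The function names, the example sentences and the 0.5 threshold are all illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocab(in_domain_sentences):
    """Count how often each token appears in the in-domain (review) sample."""
    counts = Counter()
    for sentence in in_domain_sentences:
        counts.update(tokenize(sentence))
    return counts


def overlap_score(sentence, vocab):
    """Fraction of a candidate's tokens that also occur in the review vocabulary."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in vocab) / len(tokens)


def select_relevant(candidates, vocab, threshold=0.5):
    """Keep candidates whose overlap score reaches the (assumed) threshold."""
    return [s for s in candidates if overlap_score(s, vocab) >= threshold]


# Made-up examples, not real Booking.com reviews or subtitle data.
reviews = [
    "The room was dirty and the staff was rude",
    "Great breakfast and very clean room",
]
subtitles = [
    "The room was very dirty",
    "Fire the photon torpedoes now",
]

vocab = build_vocab(reviews)
selected = select_relevant(subtitles, vocab)
print(selected)  # ['The room was very dirty']
```

A real selection step would more likely score sentences with a language model trained on reviews, but the shape is the same: score every candidate against in-domain text and keep the best ones.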
But 62.5% was still not enough. Some white papers talked about using Synthetic data to augment the training set.
What is Synthetic data?
Synthetic data is created with our own machine translation system, which is still under development. The sentences are translated by the machine rather than by humans.
These translations aren’t ideal; they contain errors. The good thing about them is that they can teach the model about context and help it form associations between the words in a sentence.
How did it improve Booking MT?
Synthetic data is prone to errors, but it helps the machine learn word usage in a multitude of contexts. With Synthetic data, the machine can be taught many different synonyms and contexts by creating millions of translations, which would otherwise be very costly to create with human translators.
English: I go to office by bicycle
German, Human Translation: Ich gehe mit dem Fahrrad ins Büro
German, Synthetic Translation: Ich fahre Büro (Back translation: I cycle [missing to] office)
Training with Synthetic data improved translation accuracy by 5.5 percentage points.
Booking: 53.5% | 62.5% | 68%
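The mechanics of building a synthetic training set can be sketched as follows. This is only an illustration under stated assumptions: `toy_translate` is a hypothetical word-by-word stand-in for an in-development MT model, and the tiny lexicon is made up. The point it shows is that imperfect machine output is paired with its source sentence and mixed into the training set alongside human translations.

```python
def toy_translate(sentence, lexicon):
    """Hypothetical stand-in for an in-development MT model: word-by-word lookup.

    Words missing from the lexicon are dropped, which mimics the kind of
    imperfect output that synthetic data contains.
    """
    words = sentence.lower().split()
    return " ".join(lexicon[w] for w in words if w in lexicon)


def make_synthetic_pairs(source_sentences, translate):
    """Pair each source sentence with its machine translation, skipping empty outputs."""
    pairs = []
    for src in source_sentences:
        tgt = translate(src)
        if tgt:
            pairs.append({"source": src, "target": tgt, "origin": "synthetic"})
    return pairs


# Tiny illustrative English-to-German lexicon (an assumption, not real model output).
lexicon = {"i": "ich", "cycle": "fahre", "office": "büro"}

human_pairs = [{
    "source": "I go to office by bicycle",
    "target": "Ich gehe mit dem Fahrrad ins Büro",
    "origin": "human",
}]

synthetic_pairs = make_synthetic_pairs(
    ["I cycle to office", "zzz"],
    lambda s: toy_translate(s, lexicon),
)

# Human and synthetic pairs are trained on together; the origin tag keeps them distinguishable.
training_set = human_pairs + synthetic_pairs
for pair in training_set:
    print(pair["origin"], "|", pair["source"], "->", pair["target"])
```

Tagging each pair with its origin is a design choice of this sketch: it makes it easy to weight or filter synthetic pairs later if they turn out to be too noisy.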
The translation quality was still not good enough to test with end users. As a last resort, we had the idea of having a few thousand reviews translated by humans and using them to train the model in addition to all the other data. Human translations are the purest form of training data, with almost no errors in meaning or fluency.
Training with human translations did improve the quality.
Booking: 53.5% | 62.5% | 68% | 77%
So far, we had used movie subtitles, synthetic data and some human translations to improve the performance from 53.5% to 77%. All the experiments we had done thus far were conducted on a smaller neural network. A smaller network architecture allows one to iterate faster, as it takes less time to train. We now wanted to train our MT model on the complete (bigger) architecture.
The quality improved by an extra 16.5 percentage points.
Booking: 53.5% | 62.5% | 68% | 77% | 93.5%
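The contribution of each step can be read off the progression above with a little arithmetic. The snippet below uses only the scores quoted in this post; the stage labels are shorthand for the steps described.

```python
# Quality scores quoted in this post, in the order the steps were applied.
stages = [
    ("baseline", 53.5),
    ("open source data", 62.5),
    ("synthetic data", 68.0),
    ("human translations", 77.0),
    ("bigger architecture", 93.5),
]


def stage_gains(stages):
    """Return (label, gain in percentage points) for each step after the first."""
    return [
        (label, round(score - prev_score, 1))
        for (_, prev_score), (label, score) in zip(stages, stages[1:])
    ]


gains = stage_gains(stages)
total_gain = round(stages[-1][1] - stages[0][1], 1)

for label, gain in gains:
    print(f"{label}: +{gain} points")
print(f"total: +{total_gain} points")
# open source data: +9.0 points
# synthetic data: +5.5 points
# human translations: +9.0 points
# bigger architecture: +16.5 points
# total: +40.0 points
```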
Having reached good accuracy in review translations, it was time to test them on actual users who were booking accommodation for their travel.
A/B Experiment and its results:
50% of the traffic on Booking.com was shown machine-translated reviews, and the rest stayed in control with no translations. The results of the experiment were:
- It increased the number of people who booked.
- It increased bookings for new properties that had few reviews.
- It increased the number of people booking accommodation abroad.
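A 50/50 split like the one described above is commonly implemented by hashing a user identifier into a bucket. The sketch below is a generic illustration, not Booking.com’s experimentation platform; the experiment name, the function name and the user id format are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment="translated_reviews", treatment_share=0.5):
    """Deterministically assign a user to 'treatment' or 'control'.

    The experiment name is mixed into the hash so that different experiments
    split the same users independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same variant.
print(assign_variant("user-42") == assign_variant("user-42"))  # True

# Over many users the split comes out close to 50/50.
variants = [assign_variant(f"user-{i}") for i in range(10_000)]
print(variants.count("treatment") / len(variants))
```

Hashing rather than drawing a random number on each visit keeps a returning user’s experience consistent: they either always see translated reviews or never do.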
Top 2 reasons behind Booking.com’s MT success:
It took 18 months to build it from scratch and bring it up to perfection, well, almost. Booking MT can translate our reviews better than Google, and it can actually make a difference in how people make decisions. The main reasons for its success are:
- High quality of translations: this was possible because we could custom-build the MT for a travel-specific use case.
- We pre-translated all the reviews and showed the translations upfront to users, so they did not have to click and wait for the translations to load. This was a huge improvement over the previous A/B test that used Google Translate.
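The difference between pre-translating and translating on click can be sketched in a few lines. This is a simplified illustration, not the production system: `fake_translate` is a hypothetical stand-in for the MT model, and an in-memory dictionary stands in for whatever storage is actually used.

```python
def pretranslate(reviews, target_langs, translate):
    """Translate every review into every target language ahead of time."""
    store = {}
    for review_id, (source_lang, text) in reviews.items():
        for lang in target_langs:
            if lang == source_lang:
                store[(review_id, lang)] = text  # original text, no translation needed
            else:
                store[(review_id, lang)] = translate(text, source_lang, lang)
    return store


def review_for_user(store, review_id, user_lang):
    """Page-render path: a plain lookup, with no translation call and no waiting."""
    return store.get((review_id, user_lang))


def fake_translate(text, source_lang, target_lang):
    """Hypothetical stand-in for the MT model; it only tags the text."""
    return f"[{source_lang}->{target_lang}] {text}"


# Made-up review keyed by id, stored as (source language, text).
reviews = {101: ("en", "Lovely stay, very clean room")}
store = pretranslate(reviews, ["en", "de", "nl"], fake_translate)

print(review_for_user(store, 101, "de"))  # [en->de] Lovely stay, very clean room
print(review_for_user(store, 101, "en"))  # Lovely stay, very clean room
```

All the translation cost is paid offline in `pretranslate`, so the user-facing path never waits on the model, which matters most on mobile and on slow connections.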
Reviews are one of the most important factors for travellers deciding which accommodation to book. But the majority of hotels and homestays don’t have a sufficient number of reviews in many languages. Translating reviews from one language to another helps people in their decisions. However, translating millions of reviews is costly when done with human translators or with a third party like Google Translate or Bing. So Booking built its own Machine Translation system. With state-of-the-art approaches to data selection and training, it improved translation quality from 53.5% to 93.5%. The machine-translated reviews also proved their value to millions of travellers and hotels, at a scale unimaginable without AI.