Travel information data pipeline

Vincent Osinga
TUI MM Engineering Center
11 min read · Oct 5, 2021

Although the Triposo group is part of TUI MUSEMENT, what we actually do is probably not very well known. With this article we will try to change that. It explains the heart and soul of our group: ‘the pipeline’, as we like to call it. The pipeline is a set of steps that starts by gathering information from all around the world wide web and ends with a database of classified and scored touristic information.

Our pipeline is written entirely in Python. A Make script tracks the dependencies of each step, so that when one source or piece of code changes, it is not necessary to rerun the entire pipeline. Because the pipeline processes huge amounts of data, it is important to be able to run the calculations in parallel processes; that is why we use MapReduce for the steps where this is possible.
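To give a feel for what a parallel step looks like, here is a minimal map/reduce-style sketch in plain Python using multiprocessing. It is purely illustrative: the actual pipeline uses its own MapReduce tooling, and the record format and mapper shown here are assumptions.

```python
# Minimal map/reduce-style sketch (illustrative only; not the real pipeline code).
from collections import defaultdict
from multiprocessing import Pool


def map_item(record):
    # Hypothetical mapper: emit a (category, 1) pair for one parsed record.
    return record["category"], 1


def reduce_pairs(pairs):
    # Reducer: sum the counts per category.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def count_categories(records, processes=4):
    # Run the map phase in parallel worker processes, then reduce locally.
    with Pool(processes) as pool:
        pairs = pool.map(map_item, records)
    return reduce_pairs(pairs)


if __name__ == "__main__":
    records = [{"category": "poi"}, {"category": "location"}, {"category": "poi"}]
    print(count_categories(records))  # {'poi': 2, 'location': 1}
```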

Below is a schematic overview of the main steps of our pipeline. We will describe each step as it is used in TUI MUSEMENT and give an overview of the techniques involved.

Crawling

The pipeline starts by gathering raw information from several sources. We use three ways of importing data from a source. The preferred way is to fetch a new version of a dump, as we do for Wikipedia; importing the latest dump ensures data freshness. When a dump is not available, we either use an API (if there is one) or a batched circular crawl.

If the source in question has an API, we normally use that to extract the information. This usually gives us well-structured data, which makes parsing straightforward. However, in many cases the number of API calls we can make without incurring high costs is limited, which means that extraction can be slow.

A batched crawl means that we incrementally crawl a fixed number of URLs for a source in each round. These URLs come from a queue that assigns priorities based on the date a URL was last crawled: the older the last crawl, the higher the priority, so that the source stays as up to date as possible. A URL that has never been crawled gets the highest priority. URLs that do not fit in the current round get a higher priority in the next round.

Let’s consider an example to make this clearer. We want to crawl 1000 URLs in total, and we can crawl 10 URLs in one go. One “go” (i.e. one batch) takes one day to crawl. After crawling for 100 days in a row, we have all 1000 URLs that we need, but they all have a different age. The 10 pages that were crawled on day 1 are the oldest ones. As we want to keep our data up to date, these are the ones that should be fetched again on day 101 to keep the source fresh.

So in the next go, batch 101, we crawl the same URLs that we crawled in the first batch on day 1. Suppose these pages now contain 5 new URLs that are not in the initial set of 1000. This means we now know about 1005 pages but have crawled only 1000 of them. Batch 102 will therefore consist of these 5 new URLs, which we do not have at all and which therefore get the highest priority, plus 5 pages from batch 2.

This method guarantees a minimum freshness for the entire source (or, equivalently, a maximum age for each page), as long as the batch size is bigger than the rate at which new pages appear. At the same time, it gives us a clear point of control: if a source changes its layout during a certain batch, we can catch that early and update the corresponding parsers before the batches that follow.
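A minimal sketch of how such a freshness-driven batch could be selected is shown below. The field names and batch size are illustrative assumptions, not our production schema.

```python
# Illustrative sketch of batched, freshness-driven URL selection.
from datetime import datetime


def select_batch(urls, batch_size=10):
    """Pick the next batch: never-crawled URLs first, then the oldest ones."""
    never_crawled = [u for u in urls if u.get("last_crawled") is None]
    crawled = [u for u in urls if u.get("last_crawled") is not None]
    # Oldest last_crawled date first, so stale pages are refreshed soonest.
    crawled.sort(key=lambda u: u["last_crawled"])
    return (never_crawled + crawled)[:batch_size]


urls = [
    {"url": "https://example.com/page-a", "last_crawled": datetime(2021, 6, 1)},
    {"url": "https://example.com/page-b", "last_crawled": None},
    {"url": "https://example.com/page-c", "last_crawled": datetime(2021, 9, 1)},
]
print([u["url"] for u in select_batch(urls, batch_size=2)])
# page-b (never crawled) comes first, then page-a (oldest crawl date)
```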

This brings us to the next challenge: parsing. Every source has a different layout that may change over time, so we need to build a collection of parsers, one for every source and one for every version of that source. We also use generic parsers that cover a whole family of sources, but these obviously offer a lower degree of detail.

The tool we mostly use for batched crawls is Scrapy. Scrapy is a mature, actively maintained open-source Python tool that provides rich options for how to crawl something, including the crawl rate, how to process a crawled item, and more.
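For illustration, here is a minimal Scrapy spider in the spirit of our batched crawls. The spider name, start URL, CSS selectors and throttling settings are invented; real parsers are source-specific.

```python
# A minimal Scrapy spider of the kind used for batched crawls.
# Run with: scrapy runspider poi_spider.py
import scrapy


class PoiSpider(scrapy.Spider):
    name = "poi_example"
    start_urls = ["https://example.com/pois"]

    custom_settings = {
        # Throttle the crawl rate so we stay polite towards the source.
        "DOWNLOAD_DELAY": 1.0,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
    }

    def parse(self, response):
        # Extract one item per POI block on the page (selectors are illustrative).
        for poi in response.css("div.poi"):
            yield {
                "name": poi.css("h2::text").get(),
                "address": poi.css(".address::text").get(),
            }
        # Follow pagination links and parse them with the same callback.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```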

Before adopting a new source for our next release, we always check the parsed results. The checks consist mainly of counting specific categories and items in the parsed results. An unexpectedly large change in a counter usually indicates that something has gone wrong and the parsers need to be adapted. Minor changes are almost always needed when importing a new source, and sometimes bigger changes are needed as well.
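A simple count-based sanity check could look like the sketch below; the 20% threshold is an arbitrary example value.

```python
# Sketch of a count-based sanity check between two parse runs.
def check_counts(previous, current, max_relative_change=0.2):
    """Flag categories whose item count changed more than expected."""
    suspicious = []
    for category, old_count in previous.items():
        new_count = current.get(category, 0)
        if old_count and abs(new_count - old_count) / old_count > max_relative_change:
            suspicious.append((category, old_count, new_count))
    return suspicious


print(check_counts({"restaurant": 1000, "museum": 200},
                   {"restaurant": 990, "museum": 40}))
# [('museum', 200, 40)] -- a drop like this usually means a parser broke
```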

Feature Identification

At this step, we identify a category for each parsed object of each source. This is not just location vs Point of Interest (POI) vs article: each of these has several subcategories, like region, island and city for locations; eating out, sightseeing and practical for POIs; and animals, artists and artworks for articles. Together, the locations, POIs and articles are called features.

For identifying our features, we have configured a set of rules for each source which indicate to which of our categories or subcategories a feature belongs. Some features contain indicators of more than one category, so we also weigh the rules and pick the category that is most likely. Of course, not all of the objects that we parse are interesting to us. The Wikipedia page of the UEFA Champions League (https://en.wikipedia.org/wiki/UEFA_Champions_League), for example, might be interesting for millions of people, but not for us: it is definitely not something we want to carry forward, so we throw it away.
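The sketch below shows the general idea of weighted, rule-based category identification. The rules, weights and threshold are invented for illustration.

```python
# Sketch of weighted, rule-based category identification.
from collections import defaultdict

RULES = [
    # (predicate on the parsed object, category, weight) -- example rules only
    (lambda obj: "museum" in obj.get("categories", []), "sightseeing", 2.0),
    (lambda obj: obj.get("cuisine") is not None, "eating_out", 2.0),
    (lambda obj: "park" in obj.get("name", "").lower(), "sightseeing", 0.5),
]


def identify_category(obj, min_score=1.0):
    """Return the best-scoring category, or None if the object is not interesting."""
    scores = defaultdict(float)
    for predicate, category, weight in RULES:
        if predicate(obj):
            scores[category] += weight
    if not scores:
        return None  # nothing matched: drop the object
    category, score = max(scores.items(), key=lambda item: item[1])
    return category if score >= min_score else None
```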

Matching

After we have identified all the features crawled from all of our sources, we need to match them. Each source can have its own description of a location or POI, and we want all descriptions of one feature to end up on the same record. The first round of matching is based on coordinates and name, within each category. Name matching is done through tokenization with stopwords removed (most importantly, we remove the city name if it is known). The name match gets a score which, combined with the distance, leads to the decision whether something is a match or not. After the first pass, we take everything that is still unmatched and try to match it using different criteria: we drop the category and look at things like address and telephone number. This way we can even match sources that do not contain coordinates.
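As an illustration, a first-pass match based on name tokens and coordinates could look roughly like this. The stopword list, similarity measure and thresholds are example choices, not the exact ones we use.

```python
# Illustrative matching sketch: token overlap on names plus coordinate distance.
import math

STOPWORDS = {"the", "of", "de", "la", "restaurant", "hotel"}  # example stopwords


def tokens(name, city=None):
    words = {w for w in name.lower().split() if w not in STOPWORDS}
    if city:
        words.discard(city.lower())
    return words


def distance_km(a, b):
    # Equirectangular approximation; fine for the short distances involved.
    dx = math.radians(b[1] - a[1]) * math.cos(math.radians((a[0] + b[0]) / 2))
    dy = math.radians(b[0] - a[0])
    return 6371 * math.hypot(dx, dy)


def is_match(feat_a, feat_b, city=None):
    ta, tb = tokens(feat_a["name"], city), tokens(feat_b["name"], city)
    if not ta or not tb:
        return False
    name_score = len(ta & tb) / len(ta | tb)  # Jaccard similarity of name tokens
    dist = distance_km(feat_a["coords"], feat_b["coords"])
    # Close together and similarly named -> treat as the same feature.
    return name_score > 0.5 and dist < 0.5
```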

After matching our locations, we have to decide whether they are interesting and important enough to keep for the rest of the pipeline. This is especially important because in the next step we will be assigning POIs to these locations; if we keep too many, the POIs will be scattered and may not end up in the right place. The main indicator for this decision is the number of sources we have for a location. But if we only find one source, we use a machine learning algorithm to decide whether the location is important enough to retain. This algorithm is trained on two sets: one set of places we are sure about (i.e. already assigned to proper locations) and a random set drawn from all other locations.
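Conceptually, training such a keep-or-drop classifier could look like the hedged sketch below, here with scikit-learn’s logistic regression; the feature set is an invented example.

```python
# Sketch of the "keep this single-source location?" classifier idea.
from sklearn.linear_model import LogisticRegression


def location_features(loc):
    # Invented example features; the real model uses richer signals.
    return [
        loc.get("population", 0),
        len(loc.get("description", "")),
        loc.get("num_pois", 0),
    ]


def train_keep_model(confident_locations, random_locations):
    # Positive examples: places we are sure about; negatives: a random sample.
    X = [location_features(l) for l in confident_locations + random_locations]
    y = [1] * len(confident_locations) + [0] * len(random_locations)
    return LogisticRegression(max_iter=1000).fit(X, y)


# model = train_keep_model(confident, random_sample)
# keep = model.predict([location_features(candidate)])[0] == 1
```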

Merging

After the matching, we have to decide to which city our POIs belong and to which region and/or country our cities belong. In some cases, cities are merged into other cities and become districts of those cities while also remaining entities of their own. A good example of this is Hollywood and Los Angeles: tourists visiting LA will definitely want to know about the sights of Hollywood, but you should also be able to find Hollywood on its own.

The logic for these merges is based on the shapes and breadcrumbs found in our sources, as well as populations and the distances between cities and POIs. For instance, if we do not have a shape for a city, we assign POIs to it within a radius that grows with the city’s population.
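A toy version of that shape-less assignment heuristic might look like this; the radius formula is a made-up stand-in for the real logic.

```python
# Sketch of shape-less POI assignment: bigger cities claim a larger radius.
import math


def distance_km(a, b):
    # Equirectangular approximation of the distance between two (lat, lon) points.
    dx = math.radians(b[1] - a[1]) * math.cos(math.radians((a[0] + b[0]) / 2))
    dy = math.radians(b[0] - a[0])
    return 6371 * math.hypot(dx, dy)


def search_radius_km(city):
    population = max(city.get("population", 1), 1)
    # Grows slowly with population: roughly 6 km for a town of 10,000,
    # roughly 9 km for a city of a million inhabitants.
    return 1.5 * math.log10(population)


def assign_city(poi, cities):
    candidates = [c for c in cities
                  if distance_km(poi["coords"], c["coords"]) <= search_radius_km(c)]
    if not candidates:
        return None
    # Among cities whose radius covers the POI, pick the closest one.
    return min(candidates, key=lambda c: distance_km(poi["coords"], c["coords"]))
```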

Opinion mining

By this point, each POI has a lot of text associated with it. We can now go through the relevant reviews and descriptions and look for what we want to know. For many of the things we look for, we manually define the keywords to scan for. Some of our categories, however, are generated dynamically from category pages found on Wikipedia; this happens, for instance, for country-specific specialities, using a page like https://en.wikipedia.org/wiki/Category:Canadian_cuisine.

As this is a computationally expensive procedure, we only do it for POIs that fall under certain categories that interest us (which we know by this point in the pipeline).

“What we want to know” includes dish specialities, features such as takeaway or wheelchair accessibility, and much more. We search for reviews mentioning the relevant features and also identify the sentiment of the review about the feature. For example, does a review mention a “delicious cheesecake” or a “horrible cheesecake”?

For identifying the feature itself, we use collections of regexes as well as word2vec embeddings. For sentiment identification, we use collections of positive and negative regex patterns, each associated with a sentiment score. For example, a review mentioning an “extremely delicious cheesecake” should get a higher score than a review mentioning a “good cheesecake”.

Finally, every review gets a sentiment score about every feature that we are looking for.
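Put together, a much-simplified version of this regex-based feature and sentiment extraction could look like the following. The patterns and weights are tiny invented examples, nothing like the real collections.

```python
# Sketch of regex-based feature + sentiment extraction for one review.
import re

FEATURES = {
    "cheesecake": re.compile(r"\bcheese\s?cakes?\b", re.IGNORECASE),
    "wheelchair": re.compile(r"\bwheel\s?chair\b", re.IGNORECASE),
}

SENTIMENT = [
    # (pattern, weight) -- example values only
    (re.compile(r"\bextremely delicious\b|\bamazing\b", re.IGNORECASE), 2.0),
    (re.compile(r"\bdelicious\b|\bgood\b", re.IGNORECASE), 1.0),
    (re.compile(r"\bhorrible\b|\bterrible\b", re.IGNORECASE), -2.0),
]


def score_review(text):
    """Return a sentiment score per mentioned feature for one review."""
    scores = {}
    for feature, pattern in FEATURES.items():
        if pattern.search(text):
            scores[feature] = sum(weight for p, weight in SENTIMENT if p.search(text))
    return scores


print(score_review("An extremely delicious cheesecake, lovely staff."))
# {'cheesecake': 3.0} -- matches both the 'extremely delicious' and 'delicious' patterns
```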

Tagging

Subsequently, we normalise the total opinion score for a feature by the number of reviews. Every feature also has its own minimum number of reviews required before we can say with confidence that “this bar is wheelchair accessible”.

Having different review-count thresholds makes sense for cases like wheelchair accessibility. Not many people are expected to mention whether it is present, as most will simply not notice if it does not concern them. Nonetheless, if there is even a single mention of its presence, it probably means that someone needed it and it was there.

Furthermore, we discard sentiment about features that are mentioned only a few times compared to the entire sample of reviews; the confidence is too low, so we treat these mentions as noise.

Whether a POI gets a tag or not depends on its total sentiment score for that tag. But this is not all: sources such as OpenStreetMap often provide tags that correspond in some way to Triposo tags. In that case we trust the source, and this alone is enough for the POI to get the tag.

If the POI also has a sentiment score for the tag, this contributes to the POI’s tag score and total score. Otherwise, it simply gets the tag with a minimal indicative score.

Although this might initially seem like incomplete information, it is not a problem, because a tag score is often meaningless. While a score for “good cheesecake” makes sense, a score for wheelchair accessibility would not; it is enough to know whether the feature is present at all.
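To summarise the tag decision described above, here is a hedged sketch; the thresholds and the indicative score are example numbers.

```python
# Sketch of the tag decision: a trusted source tag is enough on its own;
# otherwise the sentiment score has to clear a per-tag review threshold.
MIN_REVIEWS = {"wheelchair_accessible": 1, "good_cheesecake": 5}  # example thresholds


def assign_tag(tag, source_tags, sentiment_score, review_count):
    """Decide whether a POI gets a tag and which score comes with it."""
    if tag in source_tags:
        # A trusted source (e.g. OpenStreetMap) already provides the tag.
        # Use the mined sentiment if we have it, else a minimal indicative score.
        return True, sentiment_score if review_count else 1.0
    if review_count >= MIN_REVIEWS.get(tag, 3) and sentiment_score > 0:
        return True, sentiment_score
    return False, 0.0


print(assign_tag("wheelchair_accessible", {"wheelchair_accessible"}, 0.0, 0))
# (True, 1.0) -- trusted source tag, minimal indicative score
```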

Apart from the opinion-based tags, we also have more obvious tags such as the type of POI (church, museum, restaurant, etc.). These are either already found during feature identification or derived from categories and tags in our sources.

Scoring

Probably the most frequently asked questions for any tourist visiting a city or country are: which sight is not to be missed, what is the best restaurant, and what is the best place for a coffee? To make this information available to our end users, we score everything: we give our POIs, our locations and our tags a score.

Not only that, we also score each of our POIs for each of its tags. A restaurant can be really good for its wine but really bad for its pizza (because they put pineapple on it). The logic for this scoring is quite complex; we take a lot of factors into account, like the length of the descriptions in the different sources, the completeness of the information, the number of languages in which we have information, the number of sources we have for the place, and the number of references from other places.

The tag scores for our POIs, especially for tags belonging to restaurants, depend heavily on the results of the opinion mining. If we have lots of positive reviews for a certain dish, the score for that dish-specific tag will be higher. We also aggregate dishes that belong to a certain cuisine, and they all influence the cuisine-specific score.

After we have calculated all of our raw scores, we normalise them so that they lie between 0 and 10. We normalise within each category (sightseeing, nightlife, eating out, etc.) but also for each specific tag. This normalisation is done across all POIs in the world, but also within each country, to remove the bias that arises because people in some countries are more likely to write about or review POIs than in others.
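A minimal version of that 0 to 10 normalisation, applied within one group (a category, a tag, or one country’s POIs), could be as simple as min-max scaling:

```python
# Sketch of normalising raw scores to a 0..10 range within one group.
def normalise(raw_scores):
    """Map raw scores onto 0..10 by min-max scaling within the group."""
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    if hi == lo:
        return {poi: 5.0 for poi in raw_scores}  # degenerate group: everything equal
    return {poi: 10 * (score - lo) / (hi - lo) for poi, score in raw_scores.items()}


print(normalise({"duomo": 42.0, "pizzeria": 7.5, "kiosk": 1.0}))
# {'duomo': 10.0, 'pizzeria': ~1.6, 'kiosk': 0.0}
```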

Content Generation

Having features categorised, tagged, scored and, in general, analysed pretty much completes the job, but there is one more step: we want to present them in a natural and inspirational way. The best things to use are images. For locations in particular we pick a nice landscape image that tells travellers how exciting a place can be, and we try to do the same for POIs whenever possible.

Along with images, we also want to add meaningful snippets to give more breadth at a glance. The first option is to reuse important parts of the descriptions we find in our main sources. If we cannot find anything, we try to generate our own snippet. This is a process where we fill templates configured for each entity category with the information we mine along the pipeline, including its sentiment if available. We then make the sentence grammatically correct by running part-of-speech tagging and dependency parsing with the spaCy library, and finally pass the sentence’s tokens to the SimpleNLG library to generate the final English sentence. Sometimes the available sources for a POI or location do not yield a usable description; in that case we use the generated snippet as its main description too.
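As a very rough illustration of the template-filling idea, the sketch below fills a template and uses spaCy part-of-speech tags as a sanity check. The template, attributes and fallback behaviour are invented; the real pipeline also runs dependency parsing and realises the sentence with SimpleNLG, which is not shown here.

```python
# Very simplified snippet generation: fill a template, then use spaCy
# part-of-speech tags as a sanity check on one of the slots.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical per-category templates; the real ones are far richer.
TEMPLATES = {
    "eating_out": "{name} is a {adjective} {poi_type} known for its {speciality}.",
}


def generate_snippet(category, attributes):
    snippet = TEMPLATES[category].format(**attributes)
    doc = nlp(snippet)
    # If the word we put in the adjective slot is not actually tagged as an
    # adjective, drop it rather than produce an awkward sentence.
    adjectives = {token.text for token in doc if token.pos_ == "ADJ"}
    if attributes["adjective"] not in adjectives:
        snippet = snippet.replace(attributes["adjective"] + " ", "", 1)
    return snippet


print(generate_snippet("eating_out", {
    "name": "Trattoria Rossi",
    "adjective": "cosy",
    "poi_type": "restaurant",
    "speciality": "cheesecake",
}))
# "Trattoria Rossi is a cosy restaurant known for its cheesecake."
```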

In reality, we always generate our own English description, even if it does not end up in the snippet and/or description field. We do this because we have seen that external machine translation services are pretty good at translating template-based sentences, so we can easily obtain descriptions in multiple languages.

Throughout almost every step of our pipeline we are not working with exact results: we do not know whether the Leaning Tower of Pisa is a better sight than the Duomo of Milan. The heuristics we employ try to get as close as possible to what a tourist could reasonably expect, but in the end there will always be things that people disagree with. We check our results by running lots of automated tests on the output of the pipeline to see if anything has gone wrong. A test like this could be: Milan should be among the top 5 cities in Italy. We try to minimise the number of failing tests, but in the end it is impossible to satisfy all of them. When changing one rule to fix one test, it is quite possible that three other tests that previously passed now fail. It is a really complex system, and it is not always clear what will come out if you tweak one of the steps. On the one hand this can be really frustrating, but on the other hand it is also what makes the work interesting.

If you want to see the results of our data pipeline you can have a look at our API. Some examples can be found at: https://www.triposo.com/api/documentation/latest/.
