Analysing Search Terms: N-grams

Vio.com
Nov 22, 2015

Our search marketing campaigns generate a lot of search terms. Search engines can assign all kinds of search terms (which they think are relevant) to the keywords we generate. While this is a good additional source of traffic, it also carries a significantly higher risk of generating bad traffic, as these search terms haven’t gone through our keyword generation process. As always, we are on the lookout for two things:

  • Problems: Identify any systematically bad search terms and block them by either improving our keyword generation rules or adding negatives.
  • Opportunities: Are we getting good search terms that we don’t have as keywords? We should improve our keyword generation process to include them.

The challenge is that we get a lot of search terms (at least 500k search terms for each of the top languages) and mostly very few clicks per search term, so it is difficult to spot problems or trends. Here is a subset of what the output of a search term report (minus the performance statistics) looks like after aggregation:

As you can see, there are two obviously problematic search terms from people who were not searching for accommodation. These can easily be identified automatically using conversion statistics, but those are only the two cases that popped up. Since the data is very long tail, it is not possible to use performance statistics to identify all the bad search terms, most of which get only a few clicks each. While insignificant by themselves, overall they add up to a large sum. So, as in many other cases of quantitative text analysis, n-grams are very useful here. By splitting search terms into words and then grouping them into n-grams, we are able to notice significant trends. It is easier to explain with an example:

The search term “cheap hotels in Amsterdam” will get split into the following n-grams (a short sketch of this splitting follows the list):

  • 1-grams: “cheap”, “hotels”, “in”, “amsterdam”
  • 2-grams: “cheap hotels”, “hotels in”, “in amsterdam”
  • 3-grams: “cheap hotels in”, “hotels in amsterdam”
  • 4-grams: “cheap hotels in amsterdam”
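
As a minimal sketch of that splitting, here is how it could be done with NLTK’s tokenizer and n-gram helper (assuming NLTK and its tokenizer data are installed; the exact tooling is an illustration, not necessarily what runs in production):

from nltk import word_tokenize
from nltk.util import ngrams

term = "cheap hotels in amsterdam"
tokens = word_tokenize(term)  # ['cheap', 'hotels', 'in', 'amsterdam']

# Print every n-gram level from 1-grams up to 4-grams
for n in range(1, 5):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])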

Then, for each n-gram level, the output from all the search terms is aggregated, which makes it possible to identify the problems and opportunities. In our case n-grams up to 5 are interesting, as longer n-grams appear too infrequently in our data to identify trends.

Here is an example of a Python script that does the job:

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from collections import defaultdict
from nltk import word_tokenize
from nltk.util import ngrams
import string
import operator

sentences = [
    'Find hotel at Amsterdam',
    'Book hotel in Amsterdam?',
    'Book cheap room in ugly busy hotel middle of nowhere',
    'Book busy hotel in the middle of nowhere']

stemmer = SnowballStemmer("english")
ngram_count = defaultdict(int)
N = 5  # Get me those 5-grams!
english_stopwords = stopwords.words('english')

for sentence in sentences:
    dirty_words = word_tokenize(sentence)
    # Remove stopwords and punctuation
    clean_words = [w for w in dirty_words
                   if w not in english_stopwords and w not in string.punctuation]
    # Use stems (maybe not what you want?)
    stems = [stemmer.stem(x) for x in clean_words]
    for gram in ngrams(stems, N):
        ngram_count[gram] += 1

# Print the result, most frequent n-grams first
for ngram, count in sorted(ngram_count.items(), key=operator.itemgetter(1), reverse=True):
    print("{}: {}".format(ngram, count))

And an example output of top n-grams:

[Table: the top 1-grams, 2-grams, 3-grams and 4-grams with their counts]

As you can see, the search terms are now properly aggregated and we can start to make sense of the share of certain n-grams in the overall traffic. For a language with 500k+ search terms, this process generates fewer than 100k n-grams, of which usually only the top few hundred are potentially interesting. This removes the long-tail words (mainly related to specific locations) and highlights the most common words people use to search. A potential learning in this example is to use the short spelling “b and b” as a keyword, in case we had missed it (our marketeers would never forget such a thing). A funny example is “tea”: it turned out people were searching for having tea at hotels in England, of all places.
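
Raw occurrence counts are only part of the picture: to reason about an n-gram’s share of the overall traffic, the same loop can aggregate performance statistics per n-gram instead. Here is a hedged sketch of that idea, assuming a hypothetical mapping of search terms to click counts (the data and names are made up for illustration):

from collections import defaultdict
from nltk import word_tokenize
from nltk.util import ngrams
import operator

# Hypothetical input: clicks per search term, as taken from a search term report
clicks_per_term = {
    'cheap hotels in amsterdam': 42,
    'hotels in amsterdam': 17,
}

clicks_per_ngram = defaultdict(int)
for term, clicks in clicks_per_term.items():
    tokens = word_tokenize(term)
    for gram in ngrams(tokens, 2):  # 2-grams as an example level
        clicks_per_ngram[gram] += clicks  # each n-gram inherits the term's clicks

# The n-grams with the most clicks behind them are the first candidates to inspect
for gram, clicks in sorted(clicks_per_ngram.items(), key=operator.itemgetter(1), reverse=True):
    print("{}: {}".format(gram, clicks))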

This approach could be simplified further by using a “bag-of-words” representation so that “london hotel” and “hotel london” are grouped together. There is a reason other than laziness that I haven’t done this, though: there are cases where the order of the words makes a difference. The two orderings can reflect users searching for different things, or face different levels of competition in the ad auctions.
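
For completeness, here is a minimal sketch of what that bag-of-words grouping could look like, folding an ngram_count dictionary like the one built above into order-insensitive bags (the example counts are made up):

from collections import defaultdict

# ngram_count as built by the script above; a tiny made-up example:
ngram_count = {('london', 'hotel'): 12, ('hotel', 'london'): 7}

bag_count = defaultdict(int)
for gram, count in ngram_count.items():
    bag = tuple(sorted(gram))  # both word orders map to the same key
    bag_count[bag] += count

print(dict(bag_count))  # {('hotel', 'london'): 19}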

So this is one of the first steps in analysing search terms at scale. After this comes the use of dictionaries and more advanced natural language processing methods. Do you have any other ideas on how to make sense of, and draw conclusions from, search terms at scale?

Originally published at https://blog.findhotel.net on November 22, 2015.
