Build a simple autocomplete model with your own Google search history

Budvin Chathura · Published in Analytics Vidhya · 5 min read · Jul 22, 2020

Here we are using a simple probabilistic approach for the autocomplete model.

First, we have to extract your Google search history.

Go to https://takeout.google.com/settings/takeout

Click ‘Deselect all’. (We don’t need all the data; Google Search queries will be sufficient.)

Locate the ‘My Activity’ section and select it.

Then we have to filter the results down to Google Search queries. For that, click ‘All activity data included’ and deselect every option except ‘Search’.

Then select JSON as the output format. For this, click ‘Multiple Formats’ and choose JSON from the drop-down list.

Proceed to the next step and select your preferred download method. After a couple of minutes you will be able to download a zip file named takeout<some_numbers>.zip. Unzip it and locate the MyActivity.json file. This will be the input file for our model.

Let’s do some preprocessing on this MyActivity.json file and extract the search queries into a myactivity.txt file. This text file will be the input for our model.

import json
import re

data = []
processed = []
# Keep only English letters and spaces in each query.
regex = re.compile('[^a-zA-Z ]')

with open('MyActivity.json', encoding='utf8') as f:
    data = json.load(f)

for i in range(len(data)):
    search_query = data[i]['title']
    if search_query.startswith('Visited'):
        # Strip the 'Visited ' prefix and normalise to lowercase.
        search_query = search_query[8:].lower()
        search_query = regex.sub('', search_query)
        if search_query.startswith('http'):
            continue
        processed.append(search_query)
    elif search_query.startswith('Searched for'):
        # Strip the 'Searched for ' prefix and normalise to lowercase.
        search_query = search_query[13:].lower()
        search_query = regex.sub('', search_query)
        if search_query.startswith('http'):
            continue
        processed.append(search_query)
    else:
        pass

print(processed[:50])

with open('myactivity.txt', 'w') as file:
    file.write('\n'.join(processed))

For the next steps we will use JavaScript (😋 why not?). That way you may be able to integrate this model into a simple front-end that shows text suggestions.

Import the required packages. For this, you may have to install the npm package https://www.npmjs.com/package/sentence-tokenization

Now we have to read the file myactivity.txt line by line and tokenize each sentence. Tokenizing is simply splitting the sentence into words and storing them as elements in an array-like structure. For this simple model, I have only considered words made of English letters. For more advanced models you could also use punctuation marks and emojis as tokens.
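As a rough sketch of this step (without the npm package mentioned above, and with helper names of my own), reading and tokenizing the file in Node.js could look like this:

const fs = require('fs');

// Read the preprocessed search queries, one query per line.
const lines = fs
  .readFileSync('myactivity.txt', 'utf8')
  .split('\n')
  .filter((line) => line.trim().length > 0);

// Split a query into lower-cased tokens consisting of English letters only.
function tokenizeLine(line) {
  return line
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => /^[a-z]+$/.test(token));
}

const tokenizedSentences = lines.map(tokenizeLine).filter((tokens) => tokens.length > 0);
console.log(tokenizedSentences.slice(0, 5));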

Before executing the main() function, we need some helper functions for our model.

Now we have to build a small vocabulary. In this example we will build the vocabulary from tokens that appear at least 2 times in the input file. After building the vocabulary, we replace every other token in the input sentences with the <unk> token (which marks it as an out-of-vocabulary word).
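A minimal sketch of this step, reusing tokenizedSentences from the previous snippet (the function names and the minCount parameter are my own):

// Count how many times each token appears across all sentences.
function countTokens(sentences) {
  const counts = new Map();
  for (const sentence of sentences) {
    for (const token of sentence) {
      counts.set(token, (counts.get(token) || 0) + 1);
    }
  }
  return counts;
}

// Keep tokens that appear at least minCount times; everything else becomes <unk>.
function buildVocabulary(sentences, minCount = 2) {
  const counts = countTokens(sentences);
  const vocabulary = new Set(['<unk>']);
  for (const [token, count] of counts) {
    if (count >= minCount) vocabulary.add(token);
  }
  return vocabulary;
}

// Replace out-of-vocabulary tokens with the <unk> token.
function replaceOov(sentences, vocabulary) {
  return sentences.map((sentence) =>
    sentence.map((token) => (vocabulary.has(token) ? token : '<unk>'))
  );
}

const vocabulary = buildVocabulary(tokenizedSentences);
const processedSentences = replaceOov(tokenizedSentences, vocabulary);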

Now let’s dive into the actual logic of the model.

This method is based on an n-gram language model. An n-gram is a contiguous sequence of n tokens. For simplicity, consider n = 2; in this case, the n-grams are called bigrams.

Bigrams of the sentence “this is a simple probabilistic model” would be

["this", "is"], ["is", "a"], ["a", "simple"], ["simple", "probabilistic"], ["probabilistic", "model"]

For n=3, trigrams would be

["this", "is", "a"], ["is", "a", "simple"], ["a", "simple", "probabilistic"], ["simple", "probabilistic", "model"]

Likewise, we have to count the number of times each n-gram appears in the input text. For this we can scan the token list with a window of size n. Before that, we have to add special tokens to denote the start and end of a sentence: <s> and <e>. The <s> token is added n-1 times at the beginning of the sentence, and <e> is added once at the end.
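One possible sketch for counting n-grams with this padding, storing each n-gram as a space-joined key in a Map (this representation is my choice, not necessarily the author's):

// Count every n-gram in the corpus, padding each sentence with
// n-1 start tokens <s> and a single end token <e>.
function countNgrams(sentences, n) {
  const counts = new Map();
  for (const sentence of sentences) {
    const padded = [...Array(n - 1).fill('<s>'), ...sentence, '<e>'];
    // Slide an n-sized window over the padded sentence.
    for (let i = 0; i + n <= padded.length; i++) {
      const ngram = padded.slice(i, i + n).join(' ');
      counts.set(ngram, (counts.get(ngram) || 0) + 1);
    }
  }
  return counts;
}

const bigramCounts = countNgrams(processedSentences, 2);
const trigramCounts = countNgrams(processedSentences, 3);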

Now we have the counts of bigrams and trigrams (i.e. n-grams and (n+1)-grams), so we can estimate the probability of the next word given the previous two words.

Suppose the previous two words (tokens) are [“what”, “is”]. Then, for each word w in the vocabulary, we can calculate the probability as

P(w | “what”, “is”) = count(“what”, “is”, w) / count(“what”, “is”)

There is a possibility for the above counts to be 0. As a solution we can add a smoothing constant k to the equation. Then the equation becomes

P(w | “what”, “is”) = ( count(“what”, “is”, w) + k ) / ( count(“what”, “is”) + k · |V| )

where |V| is the size of the vocabulary.

Likewise we can get the probability for each word in the vocabulary.
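Putting the two counts and the smoothing constant together, a sketch of the probability estimate could look like this (the function name and the default k are my own):

// Estimate P(word | previousTokens) with add-k smoothing.
// ngramCounts holds counts for the context size, nPlus1Counts for context size + 1.
function estimateProbability(word, previousTokens, ngramCounts, nPlus1Counts, vocabularySize, k = 1.0) {
  const contextKey = previousTokens.join(' ');
  const contextCount = ngramCounts.get(contextKey) || 0;
  const nPlus1Count = nPlus1Counts.get(contextKey + ' ' + word) || 0;
  return (nPlus1Count + k) / (contextCount + k * vocabularySize);
}

const p = estimateProbability('function', ['what', 'is'], bigramCounts, trigramCounts, vocabulary.size);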

After calculating the probabilities for every word, we can choose the best suggestion based on the maximum probability. You may also return several suggestions by taking the words with the highest probabilities.
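A sketch of picking the top suggestions, reusing estimateProbability from above (the topN parameter is my own):

// Score every word in the vocabulary and return the topN most probable ones.
function suggestWords(previousTokens, ngramCounts, nPlus1Counts, vocabulary, topN = 3, k = 1.0) {
  const scored = [];
  for (const word of vocabulary) {
    if (word === '<unk>') continue; // never suggest the out-of-vocabulary token
    const probability = estimateProbability(word, previousTokens, ngramCounts, nPlus1Counts, vocabulary.size, k);
    scored.push({ word, probability });
  }
  scored.sort((a, b) => b.probability - a.probability);
  return scored.slice(0, topN);
}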

We can further develop the model by combining several n-gram counts. If the starting words are “what is a”, we can use several models and get more suggestions.

  • Use [“a”] as starting token and use unigram and bigram counts
  • Use [“is”, “a”] as starting tokens and use bigram and trigram counts
  • Use [“what”, “is”, “a”] as starting tokens and use trigram and quadgram counts.

Note that each model always uses an n-gram count together with the corresponding (n+1)-gram count, as in the sketch below.
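A rough sketch of combining several n-gram orders, assuming the count Maps for 1-grams up to 5-grams have been built with countNgrams from earlier (the helper name getSuggestions is mine):

// Collect suggestions from several n-gram models at once.
// ngramCountsList[i] holds the counts for (i+1)-grams: unigrams, bigrams, trigrams, ...
function getSuggestions(previousTokens, ngramCountsList, vocabulary, k = 1.0) {
  const suggestions = [];
  for (let i = 0; i < ngramCountsList.length - 1; i++) {
    const contextSize = i + 1;
    if (previousTokens.length < contextSize) break;
    const context = previousTokens.slice(-contextSize); // last (i+1) tokens
    const ngramCounts = ngramCountsList[i];              // (i+1)-gram counts
    const nPlus1Counts = ngramCountsList[i + 1];         // (i+2)-gram counts
    suggestions.push(...suggestWords(context, ngramCounts, nPlus1Counts, vocabulary, 1, k));
  }
  return suggestions;
}

// Example usage with counts up to 5-grams:
// const countsList = [1, 2, 3, 4, 5].map((n) => countNgrams(processedSentences, n));
// console.log(getSuggestions(['what', 'is', 'a'], countsList, vocabulary));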

Here I have used up to 5-gram counts.

With this approach, for the starting words “what is a”, I got the following suggestions with their respective probability values.

function → 0.020064602834036733
lightweight → 0.008857318249583442
good → 0.0019493177387914227
good → 0.0019529433646424251

Note that the word ‘good’ appears twice because both the 3-gram and 4-gram models predicted it.

Complete source code:

As you can see, we managed to build a simple autocomplete model with a probabilistic approach. If you are interested in this kind of natural language processing model, I highly encourage you to take the Natural Language Processing Specialization by deeplearning.ai on Coursera. You will learn more advanced models based on machine learning and deep neural networks.

Another useful resource is the book Speech and Language Processing by Jurafsky and Martin: https://web.stanford.edu/~jurafsky/slp3/

Good luck !!!
