Since the last TechSEOBoost where I presented my text generator based on GPT-2 on Google Colab, I have seen many innovative projects leveraging the same technologies, primarily ones integrating with Google Search Console data.
I’m going to use JR Oakes’ open source code (released early February 2020), which extracts queries from Google Search Console, aggregates them by category and finds semantically close categories via BERT.
I really like this model as it is easy to understand and easily reproducible. However, I wanted to go the extra mile by adding a Search Console connector (so anyone could work with their own data) and, most importantly, adding a multilingual mode to categorise across 6 languages: French, Italian, Dutch, German, Portuguese and English.
Here’s the Colab workbook if you want to test it. Improvements are welcome! https://colab.research.google.com/drive/14JC2uQniiVDNAUpVEjdNTyK7rmepwjWB
The original article from JR can be found here.
His work is inspired by the Apriori algorithm created by Rakesh Agrawal and Ramakrishnan Srikant in 1994. It is an algorithm used to find sets of items that recur together in a dataset, and to infer related categorisations from them. To find out more, look up association rule learning.
For example, an ecommerce site can utilise this algorithm to find products often purchased together.
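To make the ecommerce example concrete, here is a minimal sketch of the counting step that Apriori builds on: how often two products appear in the same basket. The baskets and the `frequent_pairs` helper are made up for illustration, and this is not the full algorithm (no candidate pruning, pairs only).

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, made up for illustration only.
baskets = [
    {"tent", "sleeping bag", "lantern"},
    {"tent", "sleeping bag"},
    {"tent", "lantern"},
    {"sleeping bag", "stove"},
]


def frequent_pairs(transactions, min_support):
    """Return product pairs bought together in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(baskets, min_support=2)
for pair, n in pairs.items():
    print(pair, n)
```

With a support threshold of 2, only the pairs seen in at least two baskets survive. The same idea applies to search queries once each query is treated as a basket of terms.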
Now here’s JR’s clever twist: rather than focusing on transactions, the script runs on queries from Search Console, leveraging state-of-the-art technologies such as spaCy, Hugging Face, and of course BERT. More on that later.
A QUICK WORD ON GOOGLE COLAB
Google Colab is a cloud service based on Jupyter Notebooks, designed for training and research. With it you can train Machine Learning models without having to install anything on your computer.
First, it’s important to note that the notebook won’t run by default. You must open it in “playground mode” by clicking here:
You will then have to perform each step sequentially, by running the code cells and filling in the forms — do not worry, I will walk you through this below.
Ready? Let’s go!
BEFORE STARTING, A QUICK WORD ON HUGGING FACE
About a year ago, the start-up Hugging Face launched the “Transformers” project on GitHub, aiming to build a community around its own NLP library. More than 250 contributors are currently working on this ambitious project, which is already used by more than a thousand companies around the world.
At the moment, the internal search engine at https://huggingface.co/models does not yet let you find models by language. Therefore, when searching for a language-specific model, you have to know the model name, for instance: ‘camemBERT-french‘, ‘flauBERT-french‘, etc.
Here, I used CamemBERT and FlauBERT, two accurate models for French.
Before starting the notebook, you have to write the name of your model and choose your language:
Once done, let’s fetch our data via the Google Search Console API.
CREATE ACCESS TO THE GOOGLE SEARCH CONSOLE API
To get your data, you first need to authenticate your Google account.
1) Create a new project or use an existing one in the Google API console, via: https://code.google.com/apis/console
2) Click on ‘library’ and verify that you have enabled the Google Search Console API. If not, click on “Enable”: https://console.developers.google.com/apis/library
3) Select “Credentials” from the left menu.
4) Select “Create Credentials”.
5) Generate credentials for your application. Choose an OAuth client ID, since the next step needs a client ID and a client secret.
6) Write down your client ID and your client secret so you can paste them into the following cell.
A link will then be generated, just follow the authentication steps.
Once you have authorised the application, you will get a code that you need to copy and paste into the Notebook. If successful, a project selector will appear.
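For readers curious about what such a link contains, here is a rough sketch of how an OAuth authorisation URL for read-only Search Console access can be assembled. This is not the notebook's code: the `build_auth_url` helper, the client ID and the redirect URI are placeholders, and the notebook may rely on a client library to do this for you.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Google's OAuth 2.0 authorisation endpoint and the read-only Search Console scope.
AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
SCOPE = "https://www.googleapis.com/auth/webmasters.readonly"


def build_auth_url(client_id, redirect_uri):
    """Assemble the link the user opens to authorise the application."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",  # ask for an authorisation code
        "scope": SCOPE,
        "access_type": "offline",
    }
    return AUTH_ENDPOINT + "?" + urlencode(params)


# Placeholder values: replace them with your own credentials.
auth_url = build_auth_url(
    "YOUR_CLIENT_ID.apps.googleusercontent.com",
    "http://localhost:8080/",
)
query = parse_qs(urlparse(auth_url).query)
print(auth_url)
```

The code returned after authorisation is then exchanged, together with the client secret, for an access token.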
With Google Colab, you can create advanced forms. For example, if you use the “Choose your project” form, the code will generate a selector listing all your Search Console projects. This way you can easily test the script on the project you want.
Once selected, the script is ready to use.
Then, the “Get your data” and “Prepare Comparative data” forms put your data into the format the script expects.
Note that if you specify a 30-day period, the script will split the data in two and compare the last 15 days with the first 15.
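The split can be sketched as follows. This is an illustration under assumptions, not the notebook's code: `split_period` and `request_body` are hypothetical helpers, and the dictionaries follow the field names of the Search Console Search Analytics query (`startDate`, `endDate`, `dimensions`, `rowLimit`).

```python
from datetime import date, timedelta


def split_period(end, days):
    """Split the `days` days ending on `end` (inclusive) into two equal halves.

    With an odd number of days, the middle day is left out.
    """
    half = days // 2
    start = end - timedelta(days=days - 1)
    older = (start, start + timedelta(days=half - 1))
    recent = (end - timedelta(days=half - 1), end)
    return older, recent


def request_body(start, end, row_limit=25000):
    """Build a Search Analytics request body for one period (hypothetical helper)."""
    return {
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        "dimensions": ["query"],
        "rowLimit": row_limit,
    }


older, recent = split_period(date(2020, 3, 30), days=30)
older_body = request_body(*older)
recent_body = request_body(*recent)
print(older_body)
print(recent_body)
```

With the Google API Python client, each body would typically be passed to `searchanalytics().query(siteUrl=..., body=...)`, giving one result set per half of the period.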
LET’S TEST THE QUERYCAT LIBRARY TO CATEGORISE QUERIES
The querycat.Categorize function takes the following arguments:
– position 1: a Pandas DataFrame that includes at least one column of search queries.
– col: the name of the query column in the DataFrame.
– alg: apriori (Apriori Algorithm), or fpgrowth (FP Growth Algorithm).
– min_support: the minimum number of times a subset of terms must be found in the dataset to be kept.
When run, it will print the number of found categories, as well as the frequency distribution of queries in each category:
Here for example, the script found 15 categories.
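To give an intuition of what happens behind the scenes, here is a simplified, standard-library sketch of categorising queries by their frequent term sets. It is not querycat's implementation: the stopword list, the `other` label and all helper names are assumptions, it only looks at single terms and pairs, and it skips Apriori's candidate pruning.

```python
from collections import Counter
from itertools import combinations

STOPWORDS = {"to", "in", "for", "the", "a"}  # tiny illustrative list


def tokenize(query):
    """Lowercase a query and return its unique, sorted, non-stopword terms."""
    return sorted({t for t in query.lower().split() if t not in STOPWORDS})


def frequent_term_sets(queries, min_support):
    """Count single terms and term pairs, keeping those seen min_support times."""
    counts = Counter()
    for q in queries:
        terms = tokenize(q)
        for size in (1, 2):
            for combo in combinations(terms, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


def categorize(queries, min_support):
    """Assign each query to the longest, then most frequent, term set it contains."""
    frequent = frequent_term_sets(queries, min_support)
    result = {}
    for q in queries:
        terms = set(tokenize(q))
        matches = [c for c in frequent if set(c) <= terms]
        best = max(matches, key=lambda c: (len(c), frequent[c]), default=None)
        result[q] = " ".join(best) if best else "other"
    return result


queries = [
    "cheap cruise deals",
    "mediterranean cruise deals",
    "cruise deals 2020",
    "train tickets paris",
    "cheap train tickets",
    "weather tomorrow",
]
categories = categorize(queries, min_support=2)
for q, cat in categories.items():
    print(f"{q!r} -> {cat!r}")
```

Raising `min_support` produces fewer, broader categories, while lowering it produces more, narrower ones.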
Then, there is another form called ‘Show original data‘ which appends each category to each search query.
You can also save that dataset to a csv file via this line of code:
Finally, the “Convey upwards and downwards variations” form calculates upward or downward variations between the two periods.
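As a rough sketch of that calculation, assuming each period has already been reduced to (category, clicks) rows: the sample rows and the two helper functions below are made up for illustration and do not come from the notebook.

```python
from collections import defaultdict


def clicks_by_category(rows):
    """Sum clicks per category for one period."""
    totals = defaultdict(int)
    for category, clicks in rows:
        totals[category] += clicks
    return dict(totals)


def variations(older, recent):
    """Return the change in clicks per category between the two periods."""
    out = {}
    for category in sorted(set(older) | set(recent)):
        # A category missing from one period counts as zero clicks.
        out[category] = recent.get(category, 0) - older.get(category, 0)
    return out


# Hypothetical data: (category, clicks) rows for each half of the period.
older_rows = [("cruise deals", 120), ("train tickets", 80), ("cruise deals", 30)]
recent_rows = [("cruise deals", 200), ("train tickets", 50)]

diff = variations(clicks_by_category(older_rows), clicks_by_category(recent_rows))
print(diff)
```

A positive value means the category gained clicks in the most recent half of the period, a negative value means it lost clicks.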
ON TO THE MOST INTERESTING PART: BERT
BERT is available through the Hugging Face library described above, which is in turn used by the querycat.BERTSim class.
This function will identify the categories that are semantically close.
To initialise the BERTSim class, you must specify the model to use (we already defined it above).
Then there are two useful functions:
– get_similar_df retrieves the top matching categories
– diff_plot plots the output in 2d vector space
CALCULATE A SIMILARITY SCORE
The get_similar_df function takes a term as a parameter (corresponding to a category) and returns the top matching categories by cosine similarity (for connoisseurs, a similarity score from 0 to 1):
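For those who want to see what cosine similarity does, here is a small sketch in plain Python. The three-dimensional vectors are toy values invented for the example (real BERT embeddings have 768 dimensions), and `top_matches` is a hypothetical helper, not part of querycat. In general cosine similarity lies between -1 and 1; the toy vectors are non-negative, so the scores stay between 0 and 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_matches(term, embeddings, n=2):
    """Rank the other categories by cosine similarity to `term`."""
    target = embeddings[term]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != term
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:n]


# Toy embeddings, made up for illustration only.
toy = {
    "cruise": [0.9, 0.1, 0.3],
    "voyage": [0.8, 0.2, 0.4],
    "trip": [0.7, 0.3, 0.2],
    "invoice": [0.1, 0.9, 0.0],
}

matches = top_matches("cruise", toy)
for name, score in matches:
    print(name, round(score, 3))
```

Categories whose vectors point in nearly the same direction get a score close to 1, which is how semantically close categories end up at the top of the list.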
The above screenshot shows an example in French where BERT understands the connection between ‘cruise‘, ‘trip‘ and ‘voyage‘.
VIEW THE RESULTS
The diff_plot function can plot your categories in a 2d vector space, adding information about click variations:
– green for a positive variation
– red for a negative variation
– the size of the bubbles indicates the magnitude of the change
With this function, you can test dimension reduction strategies (algorithms such as tsne, umap or pca), which reduce the 768 dimensions of BERT embeddings down to two dimensions.
We will leave it to the reader to investigate these algorithms, but as JR pointed out in his post, “umap” usually has a good mixture of quality and efficiency.
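As a starting point for that investigation, here is a minimal plain-Python sketch of PCA, the simplest of the three, computed with power iteration. It stands in for the library implementations the notebook relies on and is not the notebook's code. The four-dimensional vectors are toy data, and tsne and umap are not shown.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-column mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[dot(ci, cj) / (n - 1) for cj in cols] for ci in cols]


def top_eigenvector(matrix, iterations=200):
    """Approximate the dominant eigenvector by power iteration."""
    vec = [1.0 / (k + 1) for k in range(len(matrix))]  # fixed start vector
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    first = top_eigenvector(cov)
    eigenvalue = dot(first, [dot(row, first) for row in cov])
    size = len(cov)
    # Remove the first component so the next iteration finds the second one.
    deflated = [
        [cov[i][j] - eigenvalue * first[i] * first[j] for j in range(size)]
        for i in range(size)
    ]
    second = top_eigenvector(deflated)
    return [(dot(row, first), dot(row, second)) for row in centered]


# Toy 4-dimensional "embeddings", made up for illustration only.
vectors = [
    [2.0, 0.1, 1.0, 0.0],
    [1.8, 0.3, 0.9, 0.1],
    [0.2, 1.9, 0.1, 1.0],
    [0.1, 2.1, 0.0, 0.9],
    [1.0, 1.0, 0.5, 0.5],
]
points = pca_2d(vectors)
for x, y in points:
    print(round(x, 3), round(y, 3))
```

Vectors that were close in the original space stay close on the 2d plot, which is what makes the bubble chart readable.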
Now you have a plug and play solution to:
– auto-categorise Search Console queries semantically in 6 different languages.
– narrow down the time frame.
– visualise results and follow trends on a weekly basis.
Big thanks to Charly Wargnier https://twitter.com/DataChaz for the editing and proofreading. Also, kudos to JR Oakes, author of the original script.