Decoding Google Topics (1/3)

Nicolas Tastevin
Published in Weborama
10 min read · Mar 31, 2022


At the beginning of 2020, Google announced the Privacy Sandbox initiative, designed to enhance privacy on the Web by making third-party cookies obsolete. But the AdTech ecosystem and interest-based advertising have long relied on third-party cookies, and Chrome represents around 65% of the browser market share.

In order to both ensure user privacy and preserve the advertising business, Google envisaged, among other things, the controversial FLoC (Federated Learning of Cohorts) proposal. Chrome users would be aggregated into cohorts depending on their surfing behavior, and these cohorts would be communicated to third-party players. However, Google recently dropped this solution because FLoC would add fingerprinting surface — the system carried the risk of exposing an individual’s browsing history, according to Google product director Vinay Goel. The Topics API, driven by the aim of providing more human-understandable transparency, now replaces FLoC. Topics involve a certain amount of randomness and are supposed to be far less granular than FLoC cohorts, at least when considered individually.

What can we expect from the Topics API in terms of insights, interest-based advertising capabilities and, of course, performance?

The purpose of this series of articles is to try to answer these questions with the help of our in-house NLP and semantic AI tools.

As we’re writing this first episode, the Topics proposal has not yet been launched for testing. Trials have just been announced, but we decided a while ago not to wait for them: we started exploring the subject by building our own simulations, relying on the collected data accessible to Weborama’s Data Science Team.

Too Long; Didn’t Read

  • Google is replacing the FLoC initiative with the Topics API, which is supposed to provide less granular insights (but to what extent?) and to be more efficient regarding privacy.
  • Weborama wants to question the Topics API approach: usefulness for behavioral targeting, richness of information and insights, consistency, and contextual targeting opportunities.
  • Since Topics are not available yet, Weborama has decided to start working on data computed from Surf History thanks to semantic AI, reproducing Google’s announced methodology: we’ve launched simulations and built a Decoder app.
  • By design, the Decoder allows us to understand what a topic or a topic set is actually made of, semantically speaking: what are its representative lemmas (words)?
  • First results from the Decoder lead us to believe that there will be insights when considering not only single topics but topic sets (up to 3 topics combined), making these combinations relevant for behavioral targeting.

Google Topics briefly explained

We won’t dive deep into the Google Topics process; you may check the Topics GitHub if you’d like a closer look.

A reminder of the key Topics API concepts should be enough to follow the experiments described further on. Third-party players calling the Chrome Topics API will receive a set of at most 3 ad topics (referred to below as a topic set) for a page visitor — one topic per past week. AdTechs will thus be aware of user interests without receiving any kind of identity information.

Topics selection process

A web user’s surf history is divided by Google into 7-day epochs. For each of the last 3 complete epochs, the top 5 most visited topics across web pages are retained (NB: the topics for a web page depend only on the page’s domain; all web pages within a domain get the same topic profile according to Google’s methodology), and 1 additional random topic is considered. Then, with a 95% probability, 1 of the top 5 topics is selected at random per epoch; the remaining 5% of the time, the additional random topic is selected. Whatever topic is selected for a given epoch will continue to be returned to a third-party player on that site for the three following epochs. A caller will not get topics inferred from the user’s navigation on sites outside of the caller’s observed touchpoints.

In a nutshell, it is important to keep these two concepts in mind:

  • The unordered topic set returned for a call made on a site visited by a user remains unchanged during a full epoch
  • Sites visited weekly will learn at most one new topic per week per user

Due to the randomness injected into the selection process, different sites often get different topic sets for the same user, making it hard to cross-identify the user. This significantly strengthens privacy.
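To make the selection mechanism concrete, here is a minimal, heavily simplified Python sketch of the per-epoch draw as we understand it from the explainer. Names and structure are ours, not Google’s implementation; in particular, it ignores the per-caller filtering and the fact that, in Chrome, a user’s topic for an epoch is drawn once per epoch and then reused.

```python
import random

def select_epoch_topic(top5_topics, taxonomy, p_random=0.05):
    """One epoch's topic: 95% of the time one of the user's top 5 topics
    of that week, 5% of the time a uniformly random topic from the taxonomy."""
    if random.random() < p_random:
        return random.choice(taxonomy)
    return random.choice(top5_topics)

def topic_set(last_three_epochs_top5, taxonomy):
    """Unordered set of at most 3 topics exposed to a caller:
    one topic per past epoch (duplicates collapse, hence 'at most' 3)."""
    return {select_epoch_topic(top5, taxonomy) for top5 in last_three_epochs_top5}
```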

Weborama’s simulation

Assigning G-Topics to domains

Google provides an initial hierarchical taxonomy containing 349 topics such as “Sports”, “Sports/Skiing & Snowboarding”, “Finance”, “Finance/Insurance/Health Insurance”, etc. The first thing we have to do is to assign at most 5 of these Google topics, organized in tiers, to every domain in Weborama’s database.

Weborama’s Surfing History database records touchpoints (visits) of users (cookies) on web pages within the scope of our collection layer. We store data such as the visitor ID, the URL and the date of the touchpoint. In addition, we extract the content of the page and lemmatize it. Lemmatization is a Natural Language Processing task that transforms a text into the canonical forms of its inflected words, called lemmas. Lemmatization and morphosyntactic analysis facilitate text mining operations using AI tools. These tasks also include challenges such as N-gram grouping and disambiguation. As a rule of thumb, bigrams are often less ambiguous than unigrams, trigrams less ambiguous than bigrams, and so on. Think of expressions like “Roissy Charles de Gaulle” (which refers directly to the airport) — disambiguated compared to “Roissy” (a city) and “Charles de Gaulle” (a person), or the 4-gram “12 year old whiskey”, which has a completely different meaning than “12 year old” and is more specific than “whiskey” or even “old whiskey”. Part-of-speech tagging and disambiguation are enhanced when relying on an up-to-date lexicon. Weborama’s lexicon is continuously enriched with filtered, annotated, “realistic” Web data.
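As a purely illustrative example (our own pipeline, its lexicon and its N-gram handling are proprietary and not shown here), basic lemmatization of page content can be sketched with an off-the-shelf library such as spaCy:

```python
import spacy

# Off-the-shelf English pipeline; it does not perform the N-gram grouping
# or lexicon-based disambiguation described above.
nlp = spacy.load("en_core_web_sm")

def lemmatize(text: str) -> list[str]:
    """Return the lemmas of a text, dropping punctuation and stop words."""
    doc = nlp(text)
    return [token.lemma_.lower()
            for token in doc
            if not token.is_punct and not token.is_stop]

print(lemmatize("Bodybuilders were lifting heavier weights this winter."))
# e.g. ['bodybuilder', 'lift', 'heavy', 'weight', 'winter']
```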

For every topic in the taxonomy suggested by Google, we started by automatically generating a list of lemmas specific to that topic. Then we scored domains by looking up these lemmas (that’s what we call semantic matching) inside the lemmatized URL content. The lists of lemmas were established thanks to a W2V (Word2Vec) model, which learns semantic proximity between lemmas: when a Google topic is entered as input, the model recommends a list of lemmas corresponding to the topic. Historically, the W2V algorithm has been broadly used in Weborama products. We periodically retrain such a model on the corpus of lemmatized web pages in our database, leading to very accurate and unique word embeddings. That’s why we decided to use these embeddings to score domains. In addition, we included a normalization phase to make sure that low-level, specific, tier-2 or tier-3 topics get a higher chance of being assigned to a domain than high-level, more common topics (e.g. “Pets & Animals/Pets/Dogs” vs. “Pets & Animals”).
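Here is a minimal sketch of the idea using gensim’s Word2Vec as a stand-in for our in-house embeddings. The toy corpus, seed lemmas and scoring function are illustrative assumptions, and the normalization phase favouring tier-2/tier-3 topics is omitted.

```python
from gensim.models import Word2Vec

# Toy corpus: one list of lemmas per lemmatized web page.
# In production such a model is periodically retrained on millions of pages.
corpus = [
    ["bodybuilding", "bench press", "protein", "ifbb", "deadlift"],
    ["powerlifting", "deadlift", "squat", "bodybuilder", "barbell"],
    ["recipe", "apple pie", "oven", "butter", "flour"],
]
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

def lemmas_for_topic(seed_lemmas, topn=50):
    """Expand a topic's seed lemmas into a longer list of semantically
    close lemmas via embedding similarity."""
    return [lemma for lemma, _ in
            model.wv.most_similar(positive=seed_lemmas, topn=topn)]

def score_domain(domain_lemma_counts, topic_lemmas):
    """Score a domain as the share of its lemma occurrences that match
    the topic's lemma list (the 'semantic matching' step)."""
    total = sum(domain_lemma_counts.values())
    hits = sum(count for lemma, count in domain_lemma_counts.items()
               if lemma in topic_lemmas)
    return hits / total if total else 0.0

print(lemmas_for_topic(["bodybuilding"], topn=5))
```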

Example for the “Beauty & Fitness/Fitness/Bodybuilding” tier-3 Google topic.

Here is a list (sample) of lemmas matching the topic:

[‘bodybuilding’, ‘bodybuilder’, ‘powerlifting’, ‘Dorian Yates’, ‘Mr. Olympia’, ‘CrossFit’, ‘powerlifter’, ‘Phil Heath’, ‘Functional Fitness’, ‘Ronnie Coleman’, ‘Olympic weightlifting’, ‘weightlifter’, ‘kickbox’, ‘Arnold Classic’, ‘weight training’, ‘weightlifting’, ‘calisthenics’, ‘IFBB’, ‘Mat Fraser’, ‘kickboxing’…]

And here is a (sample) list of domains within our collection perimeter matching the topic (we also show the other Google topics each domain matches):

  • www.muscleandfitness.com: Beauty & Fitness/Fitness; Beauty & Fitness/Fitness/Bodybuilding; Beauty & Fitness; Sports/Running & Walking
  • strengthlevel.com: Beauty & Fitness/Fitness/Bodybuilding; Beauty & Fitness/Fitness; Sports/Gymnastics; Autos & Vehicles/Motor Vehicles (By Type)/Hybrid & Alternative Vehicles; Beauty & Fitness/Face & Body Care/Clean Beauty
  • www.setforset.com: Beauty & Fitness/Fitness; News; Beauty & Fitness/Fitness/Bodybuilding; Business & Industrial/Business Operations/Flexible Work Arrangements
  • www.bellamagazine.co.uk: Home & Garden/Home & Interior Decor; Arts & Entertainment/Concerts & Music Festivals; Arts & Entertainment/TV Shows & Programs/TV Dramas/TV Soap Operas; Beauty & Fitness/Fitness/Bodybuilding; Hobbies & Leisure/Diving & Underwater Activities
  • www.t-nation.com: Beauty & Fitness/Fitness/Bodybuilding; Beauty & Fitness/Fitness; Sports/Running & Walking; Internet & Telecom/Web Design & Development
  • barbend.com: Internet & Telecom/Email; Beauty & Fitness/Fitness/Bodybuilding; Beauty & Fitness/Fitness; Sports/Running & Walking
  • powerliftingtechnique.com: Beauty & Fitness/Fitness/Bodybuilding; Beauty & Fitness/Fitness; Beauty & Fitness/Fashion & Style; Arts & Entertainment/Visual Art & Design/Design
  • fitnessvolt.com: Beauty & Fitness/Fitness/Bodybuilding; Beauty & Fitness/Fitness; Home & Garden/Home & Interior Decor; Sports/Rugby
  • www.trailrunningmag.co.uk: Sports/Running & Walking; Beauty & Fitness/Fitness/Bodybuilding; Beauty & Fitness/Fitness; Arts & Entertainment/Music & Audio/Musical Instruments

Selecting the topic set

Now that domains are characterized by at most 5 Google Topics, the second part of the job is to emulate the Google Topics API behavior explained earlier. To do this we followed the details described in the Topics GitHub, and we took the discussed issues into account. We wanted to go through the whole process without making it simpler than it purposely already is.

Technically, this simulation wasn’t done on the fly on streaming data. We used a BigQuery snapshot of our Surfing History data between September 2021 and February 2022. The idea was to backtrack the simulation and run it over 6 months. For example, this period on the French market represents 3,728,664,822 web visits, 163,853,950 web users and 40,019,697 URLs.

The topic sets returned after a web visit have been added to each record of the Surfing History database, thus replicating, or rather anticipating, the real-world data we should obtain once the Google Topics API is launched.

Topics Decoder Streamlit App

Over the past four years, we, the Weborama R&D Data Science Team, have been building our own (Flask/Python) web apps, and that’s what we did to explore our new Surfing History table enriched with Google Topics. This time, however, we deviated from our Flask habit to test the popular and simple open-source Streamlit Python library, which perfectly fits our needs here — mainly displaying bar charts.

Home Page of the App

The web app is quite simple. The user selects 1, 2 or 3 Google topics, which are looked up in our Surfing History table. Bar charts of the most important elements characterizing the topic set are then displayed (a minimal code sketch follows this list):

  • Lemmas (e.g. what are the most representative words on web pages related to a user interested in Gardening?)
  • Topics co-occurring with the entered topic set (e.g. if a user is interested in Electric Cars, what are his or her other interests?)
  • Weborama Generic Taxonomy segments (we describe these further down)
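For the curious, here is a heavily stripped-down Streamlit sketch of what such an app can look like. The topic list is truncated and the top_lemmas function is a placeholder standing in for the queries against our enriched Surfing History table.

```python
import pandas as pd
import streamlit as st

# Illustrative subset of the 349-topic taxonomy.
GOOGLE_TOPICS = ["Books & Literature", "Food & Drink", "Sports", "Pets & Animals"]

st.title("Topics Decoder")
topic_set = st.multiselect("Select 1 to 3 Google topics", GOOGLE_TOPICS)
period = st.selectbox("Study period", ["February 2022", "January 2022"])

def top_lemmas(topics, period):
    """Placeholder: the real app queries the Surfing History table
    enriched with simulated topic sets and computes lemma uplifts."""
    return pd.DataFrame(
        {"lemma": ["recipe", "novel", "apple pie"], "uplift": [4.2, 3.1, 2.8]}
    ).set_index("lemma")

if 1 <= len(topic_set) <= 3:
    st.bar_chart(top_lemmas(topic_set, period))
```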

Let’s take the example of “Lemmas” to explain how our Topics Decoder works.

Once the user has selected the topic combination they wish to inspect (example: {“Books & Literature”, “Food & Drink”}), they indicate the period of the desired study (example: February 2022).

The decoder starts by retrieving, from our Web Surfing History table, the visits occurring in the specified period on web pages where the simulated Topics API returned a topic set containing all the topics entered by the user. Then it counts the occurrences of every lemma appearing on these web pages. Simply ranking the lemmas by frequency would be naive, because very common lemmas (“be”, “do”, “have” …) will most of the time have much higher counts than specific lemmas (“house music”, “apple pie” …). To solve this we prefer an uplift metric, which has the same spirit and goal as TF-IDF (a small computational sketch is given after the formula below). We need to calculate both:

  • the counts of lemmas in the global Surfing History table over the desired period
  • the counts in the local table (the one obtained after filtering the global table on the topics requested by the user)
Uplift formula
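Spelled out in our own formulation (the exact formula used in the Decoder may differ): the uplift of a lemma is its relative frequency in the local, topic-filtered table divided by its relative frequency in the global table over the same period. A small pandas sketch, with illustrative counts:

```python
import pandas as pd

def lemma_uplift(local_counts: pd.Series, global_counts: pd.Series) -> pd.Series:
    """Uplift of each lemma: its relative frequency among pages matching the
    requested topic set, divided by its relative frequency in the whole
    Surfing History table over the same period."""
    local_freq = local_counts / local_counts.sum()
    global_freq = global_counts / global_counts.sum()
    return (local_freq / global_freq).dropna().sort_values(ascending=False)

# Illustrative counts only.
local_counts = pd.Series({"recipe": 900, "be": 12_000, "apple pie": 300})
global_counts = pd.Series({"recipe": 5_000, "be": 900_000, "apple pie": 800, "goal": 40_000})
print(lemma_uplift(local_counts, global_counts).head(10))
```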

We also offer the user a set of options for refined results, but that is beyond the scope of this article. All in all, we obtain the following bar chart showing the most important lemmas characterizing the topic set in a Google Topics Web surf context (another approach would be to directly use the topic assignments of Web pages, which in fact corresponds to an option in the Topics Decoder):

Uplift of top 10 lemmas for the topic set {“Books & Literature” , “Food & Drink”}

We notice that the top lemmas could match a “cooking blog” vocabulary, which seems well related to the topic set {“Books & Literature”, “Food & Drink”}.

We also check the topics that co-occur with the topic set entered by the user ({“Books & Literature”, “Food & Drink”}) and display a bar chart of the most relevant co-occurring topics, as it could be very useful for our clients to discover hidden associations between topics.

Top co-occurring topics with the {“Books & Literature”, “Food & Drink”} topic set

Using this type of representation, we can sometimes confirm expectations: we see here that the “Home appliances” topic often co-occurs in user profiles with the {“Books & Literature”, “Food & Drink”} topic set, which seems to make sense. But thanks to this view, we also find out that “Jobs & Education”, appearing in 5th position, is correlated with the input topic couple.

Let’s now introduce the Weborama Generic Taxonomy. In order to analyze web URLs and build audiences, Weborama has created and maintained a Generic Taxonomy for several years. Its philosophy is quite similar to that of the taxonomy we have constructed based on Google’s approach. The main difference is that the Generic Taxonomy is very detailed and accurate, while the one we have emulated for Google Topics has been calculated thanks to W2V recommendations without any human annotation. It becomes interesting for us to visualize the Weborama Generic Taxonomy clusters and how they relate to Google Topics, using Surfing History data, to see what this kind of mapping could look like. We hope this can bring extra insights and widen the results previously obtained with co-occurring topics.

Top Weborama Generic Taxonomy topics describing URLs linked to the {“Books & Literature” , “Food & Drink”} topic set

As expected, topics related to food appear on top, but now we know what kind of food specifically. Moreover, this view is consistent with the “co-occurring Google topics” decoding described earlier: “Kitchen Appliances” pops up again. The Decoder used with the Weborama Generic Taxonomy also reveals that the “Laundry” cluster is not far away.

In the following episodes, we will see:

  • How content can be derived from the Decoder, leading to the creation of visual semantic insights.
  • How this content can be used as is, or extended via lookalike algorithms, or even broken down into specific semantic segments, thereby creating a bridge between Topics — made in the first place for Behavioral Targeting — and Contextual Targeting.
