Decoding Google Topics (2/3)

Nicolas Tastevin
Weborama
Apr 15, 2022 · 12 min read

In the first episode we presented a faithful simulation of the Topics API, the new Google initiative to enhance privacy on the Web while preserving the behavioral ad tech industry. This simulation was based on the Weborama Surfing History database, allowing us to test the Topics API behavior in advance and to share some observations before its broad release. We also introduced the Decoder, a tool that decodes a Google topic set (up to three topics) with bar charts showing co-occurring topics, dominant web domains for a topic or a topic set, representative lemmas (words), and correlated Weborama Generic Taxonomy semantic segments.

The first study revealed hidden connections between simulated Google topics, the possibility of building insights based on topics, and the potential gain for a marketer in studying topic combinations before launching a digital campaign. These first observations encouraged us to pursue our Topics API simulation and to see how far we can leverage it with semantic AI tools.

As a quick reminder, a caller requesting the Topics API when a web user visits a URL will receive at most three topics, derived only from the user’s past web navigation. This will help AdTech companies run campaigns based on user behavior in a cookieless world, and it can come in addition to contextual marketing inferred directly from site content.

We are now going to see how interest-based ad targeting may survive in a cookieless environment, or even be reinforced under certain conditions, and how we could make the most of the solution proposed by Google. What’s at stake for us at Weborama, and for many other companies in the AdTech sector, is the evolution of the product offer. For instance: how should a behavioral targeting feature be adapted to the Topics API, and lead to campaigns targeted on complex topic sets (combinations)? How could a contextual tool be strengthened by the Topics API?

TL;DR

This article describes ways an AdTech company could benefit from the Topics API for digital marketing:

The Topics API can be exploited to enhance contextual targeting with the help of semantic similarity algorithms and data visualization tools

i) a topic set can directly lead to a web inventory — that is, a set of web pages — thanks to a tool like the Decoder:

  • this set of web pages (URLs) can be used as is for contextual targeting (from Google Topics to Contextual Targeting)
  • the URL list can be amplified and turned into a web contextual inventory thanks to the use of lookalikes computed via semantic AI (w2v or SBERT): reach may be increased

ii) another way is to extract insights from the inventory directly matching the topic set:

  • this would allow the user to answer the question: what kind of content is truly, deeply, behind a Google topic set? Is there something more than just a label? Is the content attached to a label, or to a few labels (in intersection mode), consistent? How specific may it be?
  • in order to achieve this, we use SunFish, a Weborama interactive visualization tool allowing the exploration and selection of semantic segments
  • these semantic segments may then be used as a basis in order to build a matching web inventory

The Topics API can be used for rich semantic behavioral targeting, thanks to a recommendation engine

  • a user may enter words, explore the semantic universe, and create a custom segment, which can then be translated into the most representative and specific topic sets for behavioral targeting

Use Case

Let’s consider an organization — for instance, the website nature.org — promoting an ecology event with a digital campaign. Obviously, they can start by directly targeting websites with ecology content. But such a simple interpretation could lead to low volumes and to an audience made of environment-conscious people only.

Nature.org would have to find connected topics and typical user behaviors to broaden their target, and perhaps draw the attention of people showing similar navigation patterns.

The Google Topics API aims at solving this. In this case, the organization targets topic sets including either Science/Ecology & Environment, Food & Drink/Cooking & Recipes/Cuisines/Vegetarian Cuisine, or Food & Drink/Cooking & Recipes/Cuisines/Vegan Cuisine. But, due to the special randomness of the Topics API (one topic is selected among the user’s top five per week), volumes would be lower than the ones obtained with a classic cookie-based scenario. In other words, a user only slightly interested in ecology will not be associated with this Google topic (because it would not be in the user’s top 5), and even a user for whom the topic is a strong interest may not be, simply because of the randomness in the process.
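To make this randomness concrete, here is a minimal sketch of the selection mechanism as described above: each weekly epoch the browser keeps a top 5 of topics, and a caller receives at most one topic per epoch, drawn from that top 5. Refinements of the real API (such as the occasional purely random topic) are ignored, and the data and constants are purely illustrative.

```python
import random

TOP_N = 5    # topics kept per weekly epoch
EPOCHS = 3   # number of past weeks exposed to a caller

def weekly_top_topics(topic_counts, n=TOP_N):
    """Keep the n most frequent topics observed during one week."""
    ranked = sorted(topic_counts, key=topic_counts.get, reverse=True)
    return ranked[:n]

def topics_for_caller(weekly_counts):
    """Return at most one topic per epoch, drawn at random from that
    week's top 5, mimicking the Topics API selection step."""
    returned = set()
    for counts in weekly_counts[-EPOCHS:]:
        top = weekly_top_topics(counts)
        if top:
            returned.add(random.choice(top))
    return returned

# A user mildly interested in ecology: the topic never makes the weekly
# top 5, so it can never be returned, whatever the randomness.
week = {"News": 40, "Sports": 35, "Loans": 20, "Food & Drink": 18,
        "Travel": 12, "Science/Ecology & Environment": 3}
print(topics_for_caller([week, week, week]))
```

Running this repeatedly for the “mildly interested” user above never yields the ecology topic, which is exactly the volume-loss effect described.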

Moreover, nature.org would have no control over what’s behind those Google topics, and cannot be sure they are accurate enough for the campaign’s purpose. That’s why we think it is important to start by decoding a topic set in depth, and we do this by generating rich visual insights.

Visual Semantic Insights

In addition to drawing bar charts, the Decoder allows users to download the set of URLs derived from a topic set. Those URLs are simply all the URLs where the Topics API has returned a combination of topics which includes the targeted topic set. Due to the behavioral nature of the Topics API, it is important to keep in mind that a URL derived from a topic set does not necessarily embed content directly related to it. That’s why decoding this set of URLs with visual semantic insights should bring added value, and widen the picture of the interests of the web users attached to the topic set in the first place.
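As a rough illustration (the table and column names are hypothetical, not the Decoder’s actual schema), extracting such an inventory from a simulated log boils down to a subset filter:

```python
import pandas as pd

# Hypothetical simulated log: one row per API call, with the topic
# combination returned on that URL.
log = pd.DataFrame({
    "url": ["https://a.example/post1", "https://b.example/recipe",
            "https://c.example/news"],
    "topics": [{"Science/Ecology & Environment", "Food & Drink"},
               {"Food & Drink", "Vegan Cuisine"},
               {"News", "Sports"}],
})

target = {"Food & Drink"}  # the topic set we want to decode

# Keep URLs where the returned combination contains the whole target set.
inventory = log.loc[log["topics"].apply(target.issubset), "url"]
print(inventory.tolist())
```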

Weborama’s semantic tool, SunFish, allows the description of a corpus via word bubbles organized hierarchically and grouped according to semantic similarity. (Nota Bene: insight extraction is not limited to web content and can be performed on various kinds of sources. For example, the AI & Big Brother post demonstrates how visual semantic insights may suggest a new reading experience of the novel 1984. Visual insights may also be applied to encyclopedic content like Wikipedia, which we commonly use at Weborama, in particular but not only for training purposes.)

View of a corpus about Ecology (source Wikipedia)

The hierarchical lemma bubble view can have up to 4 depth levels. This means a lemma bubble may have a parent, a grandparent, and a great-grandparent bubble. For instance, the lemma wildfire is included in the Natural disasters bubble, itself included in the Phenomena bubble, itself included in the Natural environment bubble.

Zoom in on a depth-2 bubble Natural Environment

The SunFish Bubble View relies on four main technical modules:

  • a lemmatization module whose goal is to transform a text into canonical forms of inflected words called lemmas
  • a keyword extraction module inspired by the PageRank algorithm, which offers a good compromise between computational speed and accuracy
  • a custom hierarchical clustering algorithm (dendrogram computation), illustrated by the sketch just after this list
  • a bubble tagging method that assigns a common concept label to a set of lemmas. (Nota Bene: this has been and remains a challenge for many reasons. First, no rule could be established from global occurrence on the web: parents can be less specific than children (e.g. sports is more present on the web than handball), or more specific than children (immunology is less present than vaccination). Also, a lemma can have a huge number of acceptable parents (e.g. iPhone could belong to smartphone, Apple products…).)
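As promised above, here is a minimal sketch of the dendrogram idea, using a generic agglomerative clustering on toy lemma vectors. Weborama’s actual algorithm is custom; the lemmas, vectors and cut levels below are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy lemma embeddings (in practice, vectors from w2v or SBERT).
lemmas  = ["wildfire", "flood", "earthquake", "handball", "football"]
vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
                    [0.1, 0.9], [0.2, 0.8]])

# Agglomerative clustering: the linkage matrix encodes a dendrogram
# that can be cut at several levels to produce nested "bubbles".
Z = linkage(vectors, method="ward")
for depth, n_clusters in enumerate([2, 4], start=1):
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    print(f"depth {depth}:", dict(zip(lemmas, labels)))
```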

But let’s go back to our use case — Nature.org.

The organization feeds SunFish with the set of URLs obtained with the Decoder, matching the desired topic sets. We expect to get a very different Bubble View than the one corresponding to a Wikipedia corpus, for two reasons: a) the content is web content, not encyclopedic content; b) and this is a subtle difference: it is a user-centric approach, since content comes from sites visited by users interested in Ecology, not a site-centric approach where content would be parsed directly from sites about Ecology.

Bubble View of {Ecology, Vegan} (source Google Topics API)

Unsurprisingly, we encounter vocabulary about weather, animals, edible fruits, nuclear energy, etc. Those segments can be selected, refined, and activated for a contextual campaign. Other segments with an indirect connection to the underlying topic, such as Social Sciences, may be explored before taking action. As we zoom in on this bubble, sub-bubbles appear, including one tagged with childcare, which is itself divided into new sub-bubbles as we can see:

Zoom in on the Social Sciences bubble

The client, nature.org, can use SunFish to explore the corpus and highlight selected lemmas, which will give context:

Now we understand that web users flagged with the topic set {ecology, vegan} visit websites about parenting and family life. Knowing that, in our use case, the client can create an accurate semantic segment set about parenting, childcare, and working moms, to get a bigger audience for their contextual digital campaign. They could also quickly decode the People & Society/Family & Relationships/Parenting Google topic so as to decide whether or not to include it in the campaign.

Enriching the Contextual Campaign

As we’ve seen, using visual semantic insights helps go beyond the obvious and create large yet specific contextual campaigns based on lemma segments. The visual insights also confirm, in a way, the relevance of the chosen topic set. Another possibility for nature.org is to directly use the URLs obtained in the decoding phase as an advertising inventory. Targeting a URL inventory presents some advantages and drawbacks over targeting topic sets.

A URL set can be enriched using AI methods based on URL similarity calculation. At Weborama, we built a set of algorithms solving this problem with various strategies. Most techniques rely on URL content: it comes down to vectorizing the content in a space where proximity between URLs is measured with a cosine similarity metric. On the one hand we used w2v to take advantage of our lexicon, and on the other hand transformers and state-of-the-art deep learning techniques (SBERT). Another direction is to embed a URL according to user Surfing History. The algorithm mimics the philosophy of the skip-gram w2v implementation, but uses sequences of consecutive URLs visited by a user as input, instead of sentences made of lemmas. We call this “Prime Profiling”.
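To give an idea of what this second direction looks like, here is a minimal sketch (toy sessions and hyperparameters, not Weborama’s actual Prime Profiling implementation) that trains a skip-gram model on URL sequences and retrieves lookalike URLs by cosine similarity:

```python
from gensim.models import Word2Vec

# Each "sentence" is a sequence of URLs visited consecutively by one user
# (toy data; a real surf history table would feed millions of sequences).
sessions = [
    ["nature.org/wildfires", "recipes.example/vegan-chili", "news.example/cop27"],
    ["recipes.example/vegan-chili", "parenting.example/lunchbox", "news.example/cop27"],
    ["nature.org/wildfires", "news.example/cop27", "parenting.example/lunchbox"],
] * 50  # repeat so the toy model has something to learn from

# Skip-gram (sg=1) over URL sequences instead of sentences of lemmas.
model = Word2Vec(sessions, vector_size=32, window=2, min_count=1, sg=1, epochs=20)

# Lookalike URLs for a seed URL, ranked by cosine similarity.
print(model.wv.most_similar("nature.org/wildfires", topn=3))
```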

However, the traffic of a web page evolves over time, and a URL may receive heavy traffic only during a short period; that is of course the case for news articles and blog posts. That’s why targeting a fixed inventory of past URLs could lead to a poor audience, which is not the case a priori for topic sets, since these are frequently refreshed and returned on the fly each time the API is called. But contextual targeting may also be operated in a streaming mode, where new URLs constantly feed the inventory. Furthermore, a temporal model allows us to predict the future traffic of a URL. This is useful to see to what extent the URL set is ephemeral, and thus to anticipate the volume.
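One simple way to picture such a temporal model (an illustration only, not necessarily the model we use in production) is to fit an exponential decay curve on a URL’s daily traffic and extrapolate it:

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, a, lam):
    """Simple exponential decay: traffic(t) = a * exp(-lam * t)."""
    return a * np.exp(-lam * t)

days   = np.arange(7)                                 # days since publication
visits = np.array([900, 500, 260, 150, 80, 45, 25])   # observed daily traffic

(a, lam), _ = curve_fit(decay, days, visits, p0=(visits[0], 0.5))
print(f"expected traffic in 14 days: {decay(14, a, lam):.0f} visits/day")
```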

From Custom Semantic Segments to Google Topics Behavioral Targeting

So we have described an approach which starts with a topic set: we decode the topic set, explore and create a taxonomy, and build an inventory of URLs ready for contextual targeting: From Topics to Contextual. But another strategy would be to use the reciprocal process: starting with a custom taxonomy made of words owned by the brand, and creating a list of relevant matching topic sets that can be used for behavioral targeting: From Custom Semantic Segments to Behavioral.

Classically, when building a cookie-based behavioral campaign, we begin by elaborating a taxonomy of topics we want to target. We collect past data: user IDs having visited websites matching the taxonomy. Then we activate the audience. Allowing marketers to construct taxonomies is crucial: the more lemmas in the segment, and the more accurate they are, the bigger the volume of relevant user IDs available for the client. We are going to see how this method can lead to Google topic sets rather than cookies or user IDs.

Here is a snapshot of the recommendation engine we use to build semantic segments:

MoonFish in action

The user enters a seedword, a word representing their taxonomy, and the engine recommends lemmas that can be added to the taxonomy. The user gets a real-time audience estimation: the number of cookies or IDs matching the words. The accuracy of the engine depends in particular on the underlying lexicon, and on its capacity to handle N-grams and disambiguation. We observed, for instance, that bigrams are usually more specific than unigrams. Let’s illustrate this by comparing the recommendations obtained with the unigram climate, used as seedword, to the ones obtained with the bigram climate change.

Recommendation for a unigram
Recommendations for a bigram
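A bare-bones version of this recommendation step could look like the sketch below, where a toy skip-gram lexicon keeps the bigram climate_change as a single token. The corpus, model and function names are illustrative; MoonFish’s actual engine, lexicon and audience estimation are of course far richer.

```python
from gensim.models import Word2Vec

# Toy corpus where the bigram "climate_change" is kept as a single token,
# standing in for a lexicon that handles N-grams.
corpus = [
    ["climate_change", "carbon", "emissions", "ipcc"],
    ["climate", "weather", "forecast", "temperature"],
    ["climate_change", "global_warming", "sea_level"],
    ["weather", "rain", "temperature", "forecast"],
] * 100

model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, sg=1, epochs=30)

def recommend(seedword, topn=5):
    """Recommend lemmas to add to the taxonomy, ranked by cosine similarity."""
    return [lemma for lemma, _ in model.wv.most_similar(seedword, topn=topn)]

print("climate        ->", recommend("climate"))
print("climate_change ->", recommend("climate_change"))
```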

In a cookieless world, collecting user IDs becomes impossible for third-party players. But we can adapt the strategy by replacing the collection of user IDs with a collection of topic sets.

In our use case, the client nature.org creates in MoonFish, Weborama’s product, a custom segment of lemmas representing the Ecology theme with the recommendation engine. Then MoonFish filters the Surf History table on URLs matching this segment and collects the Google topics returned by the Topics API on those touchpoints. Thanks to an uplift metric, the adapted MoonFish draws a bar chart presenting the dominant topic sets on the matching corpus of URLs.

Topics MoonFish Creation Menu
Topics MoonFish Exploration Menu
Bar Chart Zoom
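The uplift metric mentioned above can be pictured as follows: compare how frequent each topic set is on the segment-matching touchpoints versus on the whole corpus. This is a hedged sketch with made-up observations, not MoonFish’s exact computation.

```python
from collections import Counter

# (topic_set, matches_segment) pairs collected on simulated touchpoints;
# the data are illustrative.
observations = [
    (frozenset({"Food & Drink", "Early Childhood Education"}), True),
    (frozenset({"Food & Drink", "Early Childhood Education"}), True),
    (frozenset({"News", "Sports"}), False),
    (frozenset({"News", "Sports"}), True),
    (frozenset({"Ecology & Environment"}), True),
    (frozenset({"News", "Sports"}), False),
]

all_counts = Counter(ts for ts, _ in observations)
seg_counts = Counter(ts for ts, in_seg in observations if in_seg)
n_all, n_seg = len(observations), sum(seg_counts.values())

# Uplift: how over-represented a topic set is on the matching touchpoints.
uplift = {ts: (seg_counts[ts] / n_seg) / (all_counts[ts] / n_all)
          for ts in seg_counts}
for ts, u in sorted(uplift.items(), key=lambda kv: -kv[1]):
    print(f"{set(ts)}: uplift={u:.2f}")
```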

Let’s have a closer look at the returned topic sets. The combination {Food & Drink, Early Childhood Education, News/Weather} appears in first position and confirms SunFish’s previous insight exploration about childcare.

Regarding the Ecology & Environment topic, it appears only in the fifth topic set. One can legitimately wonder why this topic does not belong to the top topic set. The thing is that Google Topics aims at synthesizing the past navigation of users: topics are derived from browsing behavior patterns, and most people visiting URLs related to ecology have other areas of interest in life. For example, environment-conscious people are overrepresented among students, maybe searching for a way to finance their university tuition with a personal loan, and this could very well explain the presence of the Loans topic.

Moreover, Google decided to assign topics at the domain level, while a web visit happens at the finer URL level. As a consequence, a news website domain dealing with multiple society subjects (politics, sports, economy, people, justice, ecology…) will not necessarily be related to {Ecology & Environment} even if it publishes a large number of ecology articles; in fact, websites focusing specifically on this topic have a higher chance of being profiled with it. And so, in our example, many URLs dealing with ecology apparently belong to domains not restricted to ecology. This is consistent with the coarse-grained approach purposely chosen by Google.
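The dilution effect can be illustrated with a small sketch. This is not Google’s actual classifier (which works on the hostname); it only shows why a topic well represented at the URL level can disappear once labels are aggregated per domain with an arbitrary dominance threshold:

```python
from collections import Counter

# Illustrative URL-level topic labels (made-up data).
url_topics = {
    "news.example/politics-1": "Politics",
    "news.example/sports-1":   "Sports",
    "news.example/eco-1":      "Ecology & Environment",
    "news.example/eco-2":      "Ecology & Environment",
    "news.example/economy-1":  "Economy",
    "greenblog.example/eco-1": "Ecology & Environment",
    "greenblog.example/eco-2": "Ecology & Environment",
}

THRESHOLD = 0.5  # arbitrary share a topic must reach to label the domain

domains = {}
for url, topic in url_topics.items():
    domains.setdefault(url.split("/")[0], []).append(topic)

for domain, topics in domains.items():
    counts = Counter(topics)
    kept = [t for t, c in counts.items() if c / len(topics) >= THRESHOLD]
    print(domain, "->", kept or ["(no dominant topic)"])
```

The generalist news domain ends up with no ecology label despite its ecology articles, while the specialized blog does.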

That’s why converting a custom taxonomy into a collection of topic sets is essential to reach a large audience. Limiting the activation to the a priori matching topic, Ecology & Environment, would result in capturing only people who frequently visit specialized ecology websites, i.e. people strongly involved in the cause.

The client may finally activate a campaign by targeting the obtained topic sets. They now have full control over the description of their audience, in a more accurate and flexible way than directly choosing coarse-grained Google topics. Also, this reciprocal process allows clients to design behavioral audiences in areas that do not exist in the Google Topics Taxonomy. In a way, the adapted MoonFish acts as a Custom Taxonomy ⇒ Google Topics converter.

Conclusion

This post shows how, as a player in the AdTech industry, we can make the most of the Topics API as it is presented by Google, in order to enhance both Contextual and Behavioral Targeting. All this preliminary work has been possible thanks to a simulation of the Google Topics API, the official API not being broadly available at the time this article was written. Topics are currently testable (via the JavaScript API) in the development version of Chrome, Canary, by activating the experimental flag privacy-sandbox-ads-apis.

Trials have been announced by Google, which also shares a timeline for the Privacy Sandbox. Once Topics get deployed in the real world (meaning, available by default on real Chrome instances on the web), we’ll be able to collect them on our network, and to describe real-world tests with observed and collected Google Topics data. How will a third-party player be impacted by Google’s strategy of limiting the scope of accessible data? What will the distribution of topics on URLs look like? Will Google use a TF-IDF strategy? This is something we will be able to observe and monitor. What may be expected in terms of accuracy, reach, and of course, performance? That will be for our next episode.
