Lab notebook: improving topic modeling for digital political ads
C4D is building better ways to understand how advertisers try to influence the public
Topic modeling is an essential tool for surfacing themes and patterns in the flood of online political advertising reaching people across the globe every day. At NYU Cybersecurity for Democracy (C4D), we are working to enhance the way we surface insights on issues such as abortion, guns, and immigration.
Ahead of the 2020 U.S. elections, and again in 2022, C4D created an online, free dashboard, Ad Observatory, designed to provide the public with a way to gain insight into the millions of political ads on Facebook and Instagram. A key element of that dashboard is topic search — available in both Spanish and English — providing ways to visualize patterns in political spending.
With the U.S. midterm elections concluded, we have taken a step back to focus on improving how we detect topics. Our previous approach often failed to identify topics, and we relied heavily on human judgment to fine-tune them. This meant that replicating the tool in other countries would be resource-intensive. We wanted to automate more of the process and improve performance in other languages.
Our new approach identifies more ad topics through contextualized semantic understanding that reduces reliance on human-specified keywords. We will still need experts to tweak the model, but their workload will be significantly reduced.
Our goals: specifications for the ideal topic model
Our databases contain millions of political ads. We need methods to quickly determine the topics of these ads. For our purposes, a good approach must meet the following criteria:
Extensible: flexibility to capture quickly changing political conversations
Topics in political ads are expansive and dynamic. We need the ability to add new topic labels and adjust the scope of topics. This is crucial to countering the phenomenon known as data drift, which occurs when patterns in data shift over time, invalidating the original model. For example, we observed this as the COVID-19 pandemic and the January 6 Capitol riot generated new political conversations.
An extensible topic model also needs to be largely unsupervised, as expert time is a limited resource. Supervised learning is an approach in machine learning where a model learns relationships between an input (such as a political ad) and a label (such as economy) by repeatedly guessing an answer and adjusting its understanding based on whether the guess was correct. Generally, the more labels, the better the model's performance; models often require thousands of labels to perform adequately, and creating such a dataset is extremely time-consuming. By contrast, in unsupervised learning, a model discovers the structure of the data without guidance. Given limited resources, a practical solution is to develop a model that requires only limited supervision.
Multi-label: political ads often have more than one topic
Ads can contain multiple, overlapping topics. A good topic model should be able to identify all the salient topics present in an ad. For example, the ad text below contains the topics abortion and judicial branch:
This precious life is worth fighting for! Sign the After Roe Pledge today to commit your support in the fight for life.
We don’t know when Roe v. Wade will be overturned by the Supreme Court, but we believe its days are numbered. Then the REAL WORK begins. After Roe, each of our 50 states will have the power to choose where they stand on abortion. We choose LIFE! Will you stand with us?
Multilingual: global reach
Our data collection ingests ads from a variety of languages. Meta alone supports over 100 languages. The 2022 version of Ad Observatory is available in both English and Spanish. We plan for future versions of Ad Observatory to handle these languages and retain the possibility of expanding to other languages depending on partnership development.
Precise: getting it right
Two important metrics when evaluating the quality of a predictive system are precision and recall. Precision refers to how often a model is correct when it makes a prediction. For instance, out of all the times the topic model says an ad contains the topic economy, what proportion of the time was it correct? Recall measures how often the model finds the relevant information. In other words, of all the ads whose topic was economy, how many did the topic model find?
However, there is a tradeoff when optimizing these metrics: tuning for higher recall typically lowers precision, and vice versa. We determined that it is more important to ensure that the topics we label are reliable rather than comprehensive.
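To make these definitions concrete, here is a minimal sketch (with made-up ad IDs) computing both metrics for a single topic:

```python
def precision_recall(predicted, actual):
    """Precision and recall for one topic label.

    predicted: set of ad IDs the model labeled with the topic
    actual: set of ad IDs that truly carry the topic
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical example: the model flags ads 1-4 as "economy",
# but ads 2-6 are the ones that actually discuss the economy.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))  # (0.75, 0.6)
```

Three of the four predictions are correct (precision 0.75), but two relevant ads are missed (recall 0.6).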
Efficient: processing speed
As the topics of interest change, we need the ability to regularly and retroactively process millions of ads through the topic model. While the system does not need to be fast enough to provide real time analytics, it needs to be able to identify topics on a daily basis.
Nixed approaches for improving automatic topic detection
As with most problems in data science, there are multiple potential solutions, each with its own pros and cons.
- Classifier: We ruled out creating either a single multi-label classifier or a collection of binary classifiers because of the high volume of labeled training data needed. We do not have the resources to hire a team of experts with the time to do this. Further, this approach is slow to respond to the emergence of new topics, as it would require labeling new data and retraining.
- Latent Dirichlet Allocation: Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm that works by calculating the probability that a word belongs to a topic and, from that, the topics represented in a document. It needs fewer human resources because it can run unsupervised. However, it often produces topics that are ill-defined to human reviewers, and it requires the number of topics to be set a priori. Unsupervised alternatives like BERTopic perform better, as they consider sentence structure rather than words in isolation. In this approach, topics are described by the bundle of words that appear most frequently in a cluster of similar documents relative to the other clusters (i.e., class-based TF-IDF). While it performed better in some ways, it still did not generate topics that were coherent enough to be used.
- Few-shot generative models: The explosion in training data size and language model parameterization has produced models with impressive performance on tasks with little to no training data (think GPT-3). Traditionally, text classifiers learn to map text inputs to numerical classes (e.g. 0, 1, 2) which correspond to categories. In this training paradigm, the classes the model tries to predict carry no meaning independent of the inputs. This same task can be described in natural language terms by giving the model a prompt and having it generate a response. A prompt might be something like “Identify the topics (e.g. economy, immigration, and health care) in the following political ads” along with a few examples of inputs and expected outputs. The benefit of this approach is its high flexibility. Topics can be added and adjusted by changing the prompt. But, this comes at a cost. Mainly, these models are extremely computationally and financially expensive as their size is what makes them so good at understanding language. They require specialized hardware (i.e. GPUs and TPUs) for accelerated computing with large amounts of RAM. This might be an approach we revisit in the future. For now, we are looking for something more lightweight.
Under the hood: the approach we chose for surfacing topics
On to the fun part: here is how the topic modeling approach we implemented works.
Stage one: keyword queries
This is the most crucial step. Effective topic modeling of political ads requires a nuanced socio-political understanding of the election context. In this step, we have political scientists identify a list of topics of interest. You can find a list of our 2022 topics here. Each topic comprises a list of “include” and “exclude” keywords. These are run as queries where ads that contain the include keywords, but not the exclude keywords are labeled with a given topic.
Let’s illustrate this with an example ad and a topic of the economy, which we represent with a single keyword, economy.
If we were to add fuel as an exclude keyword, the following example would return no topics.
But say we created a new topic of employment, represented by the keyword jobs. This would result in two topics being found.
We process all the ads through the keyword queries to create an initial list of topic labels. However, keywords are rigid and often miss many of the ways ideas can be expressed. An ad that is semantically similar, but uses different words, passes by unlabeled. This is why we built the next stage.
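The include/exclude logic above can be sketched in a few lines of Python. The query format and topic names here are hypothetical; in practice these run as database queries, not in-memory string matching:

```python
def keyword_topics(ad_text, topic_queries):
    """Label an ad with every topic whose "include" keywords appear
    in the text and whose "exclude" keywords do not."""
    text = ad_text.lower()
    labels = []
    for topic, query in topic_queries.items():
        included = any(kw in text for kw in query["include"])
        excluded = any(kw in text for kw in query.get("exclude", []))
        if included and not excluded:
            labels.append(topic)
    return labels

# Hypothetical topics mirroring the example above.
queries = {
    "economy": {"include": ["economy"], "exclude": ["fuel"]},
    "employment": {"include": ["jobs"]},
}
ad = "Our plan will grow the economy and create good jobs."
print(keyword_topics(ad, queries))  # ['economy', 'employment']
```

An ad mentioning both "economy" and "fuel" would match the exclude keyword and receive no economy label, exactly as in the example.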
Stage two: topic propagation
Here is where we get technical. The next step is to automatically propagate labels found in ads by the keywords to similar ads to increase recall. But how does a computer determine similarity? First, we need to translate human-readable text into model-readable embeddings — numerical arrays that represent a data input.
There is an old adage in linguistics from J.R. Firth: “You shall know a word by the company it keeps.” Said differently, the definition of a word is based on its context. This is the central premise underlying word embeddings; they numerically represent how close word meanings are to each other in a given context. Through a variety of tasks like next-sentence prediction and masked language modeling, machine learning models inductively learn numerical representations of word meanings without supervision. These embeddings can be aggregated to become document embeddings.
We use the sentence-transformers package and the paraphrase-multilingual-MiniLM-L12-v2 pre-trained model to generate document-level embeddings. These embeddings are multilingual, contextual, and faster to compute than many of the larger models in use.
Once we have embeddings for each ad, assessing similarity boils down to trigonometry. For simplicity, picture two vectors in a two-dimensional space, shown below. Each vector represents an ad. By taking the cosine of the angle formed between the vectors, we get what is known as cosine similarity: the closer to 1, the more similar the two vectors are.
In practice, our data is not two-dimensional. Instead, there are 384 dimensions, one for each entry of the document embeddings. Sparing the details, linear algebra lets us extrapolate from the two-dimensional case to a multidimensional one. While there are other distance measures, the benefit of cosine similarity is that it is agnostic to the magnitudes of the vectors, which are sensitive to features like text length.
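As a sketch, cosine similarity is just a dot product divided by the vector magnitudes. Toy two-dimensional vectors stand in here for the 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes:
    the cosine of the angle between the two embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ad_a = np.array([1.0, 2.0])
ad_b = np.array([2.0, 4.0])   # same direction, larger magnitude
ad_c = np.array([2.0, -1.0])  # perpendicular to ad_a

print(round(cosine_similarity(ad_a, ad_b), 6))  # 1.0
print(round(cosine_similarity(ad_a, ad_c), 6))  # 0.0
```

Note that ad_a and ad_b differ only in magnitude yet score a perfect 1.0; this is the magnitude-agnostic property that makes cosine similarity robust to differences in text length.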
Calculating the cosine similarities from every keyword-labeled ad to every unlabeled ad is extremely memory intensive; memory use grows quadratically as the number of data points increases. This is because it produces an intermediate matrix of size M x N, where M is the number of unlabeled ads and N is the number of keyword-labeled ads. However, we are only interested in the K most similar keyword-labeled ads to a given unlabeled ad.
To prevent loading the entire intermediate matrix into memory, we calculate the distances in batches. This instead produces a shortened matrix of size B x N, where B is the batch size. We then find the K closest neighbors within the shortened matrix, append the results to the final output, and discard the rest. This reduces the similarity matrix to a size of M x K. Fortunately, sentence-transformers provides a method called semantic_search that handles this; it takes an argument called query_chunk_size, which is the equivalent of B.
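The batching trick can be sketched in NumPy. This is an illustrative stand-in for what semantic_search does, not the library's actual implementation, and the function and variable names are ours:

```python
import numpy as np

def batched_top_k(unlabeled, labeled, k=3, batch_size=1000):
    """Find the k most similar labeled ads for each unlabeled ad
    without materializing the full M x N similarity matrix.
    Rows are L2-normalized so a matrix product yields cosine
    similarity; batch_size plays the role of query_chunk_size."""
    labeled = labeled / np.linalg.norm(labeled, axis=1, keepdims=True)
    scores, indices = [], []
    for start in range(0, len(unlabeled), batch_size):
        batch = unlabeled[start:start + batch_size]
        batch = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        sims = batch @ labeled.T                  # B x N block
        top = np.argsort(-sims, axis=1)[:, :k]    # k best per row
        indices.append(top)
        scores.append(np.take_along_axis(sims, top, axis=1))
    # Only the M x K results survive; each B x N block is discarded.
    return np.vstack(scores), np.vstack(indices)
```

Peak memory now depends on B x N rather than M x N, so batch size can be tuned to whatever the machine can hold.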
Not all ads are equally close to their neighbors, meaning they do not always share the same topics. This is where a threshold helps: we require a minimum level of similarity before propagating labels. Ads with neighbors more similar than the threshold receive their neighbors’ topic labels. This measure ensures precision. The threshold currently in use was manually selected by reviewing ads and their closest neighbors at various similarity levels.
Back to our example. Say our previous ad was unlabeled and we wanted to propagate labels to it. We calculate the cosine similarities between our ad and three other labeled ads, resulting in the following:
As we can see, the ads with more in common have a higher similarity score. If we set the threshold to 0.8, our example ad receives the Economy topic label. If the threshold were lowered to 0.6, it would receive the topic labels Economy and Inflation. Choosing the appropriate threshold ensures the correct labels get propagated.
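In code, the propagation rule is a simple threshold test. The neighbor topics and similarity scores here are hypothetical, chosen to be consistent with the example:

```python
def propagate_labels(neighbors, threshold):
    """Collect topic labels from every neighbor whose cosine
    similarity to the unlabeled ad clears the threshold."""
    labels = set()
    for topics, similarity in neighbors:
        if similarity >= threshold:
            labels.update(topics)
    return sorted(labels)

# Hypothetical nearest neighbors of our example ad.
neighbors = [
    (["Economy"], 0.85),
    (["Inflation"], 0.70),
    (["Health Care"], 0.35),
]
print(propagate_labels(neighbors, 0.8))  # ['Economy']
print(propagate_labels(neighbors, 0.6))  # ['Economy', 'Inflation']
```

Raising the threshold trades recall for precision: fewer labels propagate, but the ones that do are more trustworthy.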
Stage three: topic discovery
Even after all that, ads with undetermined topics remain. Many are miscellaneous and irrelevant, but others are not represented by our topic list yet are still pertinent. To facilitate manual topic discovery, we cluster the remaining ads by cosine similarity above a certain threshold and filter the clusters by a minimum size, to ensure the topic occurs frequently enough to be of interest. These clusters are then manually reviewed by a political scientist to identify new trends. Again, the sentence-transformers library has a function called community_detection that handles this quickly.
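The idea can be sketched with a greedy, simplified stand-in for this kind of clustering (not the library's actual algorithm): pick a seed ad, gather everything above the similarity threshold into a cluster, and keep only clusters above a minimum size.

```python
import numpy as np

def cluster_by_similarity(embeddings, threshold=0.75, min_size=2):
    """Greedily group ads whose cosine similarity to a seed ad
    clears the threshold, keeping only clusters of min_size or more."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    unassigned = list(range(len(emb)))
    clusters = []
    while unassigned:
        seed = unassigned[0]
        sims = emb[unassigned] @ emb[seed]
        members = [i for i, s in zip(unassigned, sims) if s >= threshold]
        unassigned = [i for i in unassigned if i not in members]
        if len(members) >= min_size:
            clusters.append(members)
    return clusters

# Toy embeddings: two tight pairs and one stray ad.
ads = np.array([[1.0, 0.0], [0.99, 0.1],
                [0.0, 1.0], [0.1, 0.99],
                [1.0, 1.0]])
print(cluster_by_similarity(ads))  # [[0, 1], [2, 3]]
```

The stray ad forms a singleton below min_size and is dropped, mirroring how small, miscellaneous clusters are filtered out before human review.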
The scatterplot below illustrates how clusters can be derived from an ad’s document embedding. In this example, the algorithm discovered three clusters: in blue, red, and green.
While the clustering algorithm can identify ads which are similar, human review is needed to determine why they are similar. Political scientists can recognize common themes in the clusters and the words and phrases most indicative of these themes. The table below demonstrates how this might look.
We can then compose three new topics — abortion, environment, and immigration — seeded with the identified keywords and kick off a new cycle of classification and propagation to label more ads.
Conclusion: better topics!
Initial evaluations suggest we can identify topics in 50 percent of the previously uncategorized ads while maintaining precision of over 70 percent. This occurs at processing speeds of more than 500 embedded ads per second on CPU. As we prepare for future elections, this topic model can continually adapt to the political conversations of the moment. Political scientists reviewing the data can adjust seed keywords or topics, aided by automated topic discovery, and can tune the propagation similarity threshold to strike the right balance between precision and recall. We look forward to implementing this in future versions of Ad Observatory and other tools.
About NYU Cybersecurity for Democracy
Cybersecurity for Democracy is a research-based, nonpartisan, and independent effort to expose online threats to our social fabric — and recommend how to counter them. It is a part of the Center for Cybersecurity at the NYU Tandon School of Engineering.
Would you like more information on our work? Visit Cybersecurity for Democracy online and see how tools, data, investigations, and analysis are fueling efforts toward platform accountability.