Lab Notebook: New Features

Austin Botelho
Cybersecurity for Democracy
9 min read · Jul 5, 2023

C4D is building better ways to understand how advertisers try to influence the public

Ahead of the 2020 U.S. elections, and again in 2022, NYU Cybersecurity for Democracy created a free online dashboard, Ad Observatory, designed to give the public insight into the millions of political ads on Facebook and Instagram. While Ad Observatory is hibernating until the 2024 U.S. general election cycle, we wanted to highlight a few features we’re working on that you can look forward to next year.

Improved Language Identification

Much of our data handling is dependent upon knowing the language of the ad text. Since this is not provided by the Meta Ad Library, we need to make this determination ourselves accurately, quickly, and cheaply.

There are several free, out-of-the-box methods that meet the third criterion of cost: LanguageDetector, fastText, LangDetect, LangID, and GCLD3. To measure how they fared on the other two, we merged two open-source, labeled datasets (one and two), composed of data scraped from Wikipedia. Between the two, there are 32,337 data points across 28 languages.¹

We ran data from all of the languages through the models to evaluate their generalizability. Because English and Spanish are the most important languages for political communication in the United States, we additionally calculated a weighted performance measure with English and Spanish overweighted by a factor of three. In both measures, accuracy was the micro-averaged F1 score. The speed reported is the total runtime on the evaluation dataset in seconds and should be interpreted as slowest (top) to fastest (bottom).
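As a minimal sketch of the weighted measure described above, the snippet below computes a micro-averaged F1 score in which each sample can carry a weight. It assumes that "overweighted by a factor of three" means every English or Spanish sample counts three times; the function name, the `boost` parameter, and the toy labels are illustrative and are not taken from the evaluation code. For single-label predictions like these, micro-averaged F1 works out to the same number as (weighted) accuracy.

```python
def weighted_micro_f1(y_true, y_pred, boost=None):
    """Micro-averaged F1 where a sample's weight depends on its true label.

    `boost` maps a true label to its weight; labels not listed weigh 1.
    (Hypothetical helper; the weighting scheme is an assumption.)
    """
    boost = boost or {}
    tp = fp = fn = 0.0
    for truth, pred in zip(y_true, y_pred):
        w = boost.get(truth, 1.0)
        if truth == pred:
            tp += w
        else:
            fp += w  # counted as a false positive for the predicted class
            fn += w  # counted as a false negative for the true class
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels, not real evaluation data.
y_true = ["en", "en", "es", "fr", "de", "fr"]
y_pred = ["en", "en", "es", "de", "de", "es"]

raw = weighted_micro_f1(y_true, y_pred)
weighted = weighted_micro_f1(y_true, y_pred, boost={"en": 3.0, "es": 3.0})
print(f"raw={raw:.3f} weighted={weighted:.3f}")
```

In this toy case the weighted score is higher than the raw one because all of the errors fall on the languages that were not overweighted.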

The fastText model achieved the highest raw and weighted accuracy. It was also the second-fastest model, notably faster than every other model except LanguageDetector, which was the least accurate. That combination made it the best candidate to include in our language detection pipelines.
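A comparison like this one can be sketched with a small harness that treats each detector as a plain function from text to language code and records both accuracy and total runtime. The two toy detectors below are stand-ins invented for illustration; they are not the libraries evaluated above, and the real evaluation code may be organized differently.

```python
import time


def benchmark(detect, texts, labels):
    """Return (accuracy, seconds) for a detector callable: text -> language code."""
    start = time.perf_counter()
    preds = [detect(t) for t in texts]
    elapsed = time.perf_counter() - start
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels), elapsed


# Toy stand-ins for real language identification models (hypothetical).
def always_english(text):
    return "en"


def keyword_detector(text):
    tokens = set(text.lower().split())
    if tokens & {"el", "la", "los", "que"}:
        return "es"
    if tokens & {"le", "les", "est", "une"}:
        return "fr"
    return "en"


texts = [
    "the vote is on tuesday",
    "la elección es el martes",
    "le vote est mardi",
    "sign the petition today",
]
labels = ["en", "es", "fr", "en"]

detectors = {"always_english": always_english, "keyword": keyword_detector}
results = {name: benchmark(fn, texts, labels) for name, fn in detectors.items()}
best = max(results, key=lambda name: results[name][0])  # most accurate detector

for name, (accuracy, seconds) in results.items():
    print(f"{name}: accuracy={accuracy:.2f} runtime={seconds:.6f}s")
```

Wrapping a real model only requires a function with the same text-in, label-out shape, so accuracy and speed can be read side by side when choosing between candidates.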

Here is a link to the evaluation code, including a Colab notebook.

Improved ad type classification

For each search, AdObservatory breaks down the ads by type. These types represent the purpose of the ad (which is separate from its topic, although the two are sometimes related). We currently categorize ads into five types:

  • Connect: These ads seek to get the viewer to share their contact information so they can be contacted later. These often take oblique forms, such as asking the viewer to sign a petition or birthday card.
  • Show Up: These ads ask the viewer to take an action in the physical world, such as attending a rally or voting.
  • Donate: These ads aim to get the audience to donate money immediately and offer no goods or services in exchange.
  • Buy: These ads sell goods or services in exchange for money, although this exchange can be couched as a donation or ‘shipping fee’.
  • Persuade: These ads do not seek to get the user to take any immediate action. We therefore conclude that their only purpose is to persuade the audience to hold some belief. This is, to a certain extent, a criterion of exclusion.

This breakdown provides further information about a campaign’s digital strategy. In an earlier blog, we highlighted differences between the Trump and Biden presidential campaigns based on the ad types they employed.

Screenshot of the AdObservatory tool from the 2022 midterm election

Like topic modeling, ad type classification is a multilingual text classification task. Unlike topic modeling, however, type classification does not need to be extensible, since new types are unlikely to emerge. This makes it a multi-class rather than multi-label problem. Therefore, three of the five specifications remain: precise, efficient, and multilingual.

Table of ad types and descriptions

To train a classification model, we need labeled training data. Facing similar constraints around human review resources, we labeled data using a semi-supervised heuristic. This heuristic takes advantage of links and buttons which frequently appear in Meta ads. The contents of these are strong indicators of the purpose of the ad. Since many ads share the same link and button type, labeling the most common can cut down on manual review time. We can then use the human-labeled link/button-type mappings to label the text of all ads where those links and buttons appear. With this labeled data, we can train a machine learning classifier to generalize these labels.
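The labeling heuristic above can be sketched roughly as follows. The field names (`link`, `button`), the button values, and the example domains are hypothetical; the only assumptions carried over from the text are that reviewers label the most common link/button combinations and that those labels are then copied onto every ad sharing the combination.

```python
from collections import Counter

# Toy ads; the fields and values are illustrative, not the Meta Ad Library schema.
ads = [
    {"text": "Chip in $5 before midnight", "link": "donate.example.org", "button": "DONATE_NOW"},
    {"text": "Rush a donation to fight back", "link": "donate.example.org", "button": "DONATE_NOW"},
    {"text": "Sign the petition to protect our parks", "link": "petition.example.org", "button": "SIGN_UP"},
    {"text": "Add your name to the birthday card", "link": "petition.example.org", "button": "SIGN_UP"},
    {"text": "Get your free sticker, just pay shipping", "link": "shop.example.org", "button": "SHOP_NOW"},
    {"text": "Add your name: tell Congress to act", "link": "petition.example.org", "button": "SIGN_UP"},
]


def most_common_pairs(ads, top_n):
    """Link/button pairs ordered by how many ads use them (the review queue)."""
    counts = Counter((ad["link"], ad["button"]) for ad in ads)
    return [pair for pair, _ in counts.most_common(top_n)]


def propagate_labels(ads, pair_to_type):
    """Copy each human-assigned pair label onto the text of every matching ad."""
    labeled = []
    for ad in ads:
        ad_type = pair_to_type.get((ad["link"], ad["button"]))
        if ad_type is not None:
            labeled.append((ad["text"], ad_type))
    return labeled


top_pairs = most_common_pairs(ads, top_n=2)

# A human reviewer labels only the most common pairs.
pair_to_type = {
    ("petition.example.org", "SIGN_UP"): "Connect",
    ("donate.example.org", "DONATE_NOW"): "Donate",
}

labeled = propagate_labels(ads, pair_to_type)
print(f"{len(labeled)} of {len(ads)} ads labeled from {len(pair_to_type)} reviewed pairs")
```

Here two reviewed pairs yield five labeled texts; the ad with the unreviewed pair stays unlabeled and is simply left out of the training set.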

From an initial labeled dataset of 50,000 texts with an 80–10–10 train-test-validation split, we trained two potential replacement models for the Naive Bayes classifier with tf-idf features that is currently in production. The first is a bidirectional LSTM with ReLU activation. The second is a DistilBERT model initialized from the pre-trained distilbert-base-multilingual-cased checkpoint and fine-tuned using Hugging Face’s Trainer API. Detailed training parameters are listed below.
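For readers unfamiliar with the notation, an 80–10–10 split of 50,000 examples can be produced along these lines. This is a sketch using a plain seeded shuffle; the post does not say whether the actual split was stratified by ad type, and the function name and placeholder data are illustrative.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle, then cut into train (80%), test (10%), and validation (10%)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = len(items) * 8 // 10
    n_test = len(items) // 10
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return train, test, validation


# Placeholder (text, label) pairs standing in for the labeled ads.
examples = [(f"ad text {i}", i % 5) for i in range(50_000)]
train, test, validation = split_80_10_10(examples)
print(len(train), len(test), len(validation))
```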

Training hyperparameters

The next table compares the performance of the two new models and the trade-off between them; the transformer model is ~17% more accurate, but nearly 10x slower.

Model performance

Spending (and Spending Changes) Over Time

While the search figure allows users to drill into specific snapshots of the data, there is also value in getting a bird’s eye view of the emergent trends. We wrote a script that produces visualizations of the top ads, sponsors, topics, and types by spending and impressions as well as their week-over-week changes. The overview here can be a good jumping-off point for understanding who is currently most active in the Meta political ad ecosystem, and understanding what changes are underway.

To show you what kind of insights can be gleaned, we’ll use English-language Meta ad data from the week of February 1, 2023 to February 8, 2023 as an example. During this week, America’s Plastic Makers was the top sponsor by number of ads and amount spent, and second in impressions. America’s Plastic Makers is a Facebook page run by the American Chemistry Council, a trade association for plastics manufacturers. In second place is Hulu, not a typical political ad spender. However, during this period they were promoting the release of their documentary series, “The 1619 Project”, which had political implications. Concerningly, we also see three media organizations that have been rated as “low credibility” by Media Bias/Fact Check (Newsmax, China Economic Daily, and PragerU) among the top ad sponsors by impressions during this time window.

Looking at the change in spending compared to the prior week reveals a slightly different picture. America’s Plastic Makers ramped up spending by around 23% from the week before, nearly doubling impressions. However, the biggest change in ad spending (of sponsors that spent at least $500 the week before) was Julie Hartman’s page at more than 25%. Julie Hartman is a young liberal-turned-conservative whose podcast started airing under the Salem Media Group in late November. During the period of data collection, she released four episodes covering topics like Hunter Biden’s laptop, Chinese balloons, and the relocation of undocumented immigrants to NYC. In second place was UNITED24.Media, a fundraising entity for the war in Ukraine. Virginia Unified tripled impressions from the week before, their first week in existence, with ads opposing a Virginia state bill that would reduce the minimum wage for youth.
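The week-over-week comparison can be sketched as a percent change per sponsor, restricted to sponsors that met a minimum spend in the prior week. The $500 floor comes from the text; the function name and the sponsor figures below are made up for illustration and are not the data discussed here.

```python
def week_over_week(prev, curr, min_prev_spend=500.0):
    """Return {sponsor: percent change in spend} for sponsors above the floor.

    Sponsors below the prior-week floor are skipped, since a tiny baseline
    turns a small absolute increase into an enormous percentage.
    """
    changes = {}
    for sponsor, prev_spend in prev.items():
        if prev_spend < min_prev_spend:
            continue
        curr_spend = curr.get(sponsor, 0.0)
        changes[sponsor] = 100.0 * (curr_spend - prev_spend) / prev_spend
    return changes


# Hypothetical weekly spend in dollars.
prev = {"Sponsor A": 10_000.0, "Sponsor B": 800.0, "Sponsor C": 100.0}
curr = {"Sponsor A": 12_300.0, "Sponsor B": 1_200.0, "Sponsor C": 900.0}

changes = week_over_week(prev, curr)
ranked = sorted(changes.items(), key=lambda kv: kv[1], reverse=True)
for sponsor, pct in ranked:
    print(f"{sponsor}: {pct:+.1f}%")
```

Without the floor, Sponsor C would top the ranking with an 800% increase on a $100 baseline, which says little about who is actually moving money.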

Although we don’t typically look at individual ads, it can be illuminating to review the single ads with the highest spend. Among the top ten costliest ads for the week was this ad below by America’s Plastic Makers trying to rebrand plastic as sustainable. Over the month of February, they pumped between $50k and $60k into this ad alone, and it is one of dozens they aired as part of a larger greenwashing campaign.

Another ad among the ten most expensive, sponsored by the conservative America Strong and Free PAC, featured this anti-China message. The PAC was founded by former Republican Arkansas Governor Asa Hutchinson, who has since announced a run for the presidency in 2024.

Likely due to America’s Plastic Makers, Environmental Protection was the top topic by impressions and second highest for spending.

Environmental Protection also had one of the largest changes in impressions, surpassed only by National Security. Spending on National Security doubled, which more than tripled its impressions, whereas Environmental Protection saw nearly the same impression growth with a fifth of the spending increase.

Campaign detection

We have observed interesting trends with the big-picture view, but we can also dig deeper. Deep-dive investigations are time-consuming and can often lead to dead ends. To aid in determining fruitful lines of inquiry, we created tools to surface keywords and named entities from ads in the Meta Ad Library. The frequency of use and cost per use for these are good indications of important ad campaigns.

The keywords are determined using KeyBERT. First, the text is embedded (i.e., converted to a numerical representation) at both the n-gram and document level using the pre-trained paraphrase-multilingual-MiniLM-L12-v2 embeddings. N-grams are groups of consecutive words of a specified length; for this analysis, we consider bigrams and trigrams, sequences of two and three words. Stop words (words that contribute little to the meaning of the text) and words that appear in fewer than five documents are removed from consideration. The keywords are then selected using Maximal Marginal Relevance, which balances the cosine similarity between each n-gram and the document against a desired level of diversity among the selected terms.
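To make the selection step concrete, here is a minimal pure-Python sketch of Maximal Marginal Relevance. It is not KeyBERT's code: the two-dimensional vectors are toy stand-ins for real sentence embeddings, and the function names and the `diversity` values are chosen only for illustration. Each round picks the candidate whose similarity to the document, discounted by its similarity to phrases already chosen, is highest.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(doc_vec, candidates, top_n=2, diversity=0.5):
    """Select phrases by Maximal Marginal Relevance.

    candidates: dict mapping phrase -> embedding vector.
    diversity: 0 ranks purely by relevance; higher values penalize
    phrases that resemble ones already selected.
    """
    relevance = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        def score(phrase):
            redundancy = max(
                (cosine(candidates[phrase], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance[phrase] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D "embeddings" (assumed values, not real model output).
doc_vec = [1.0, 1.0]
candidates = {
    "black history month": [1.0, 0.9],
    "history month celebration": [1.0, 0.8],
    "plastic recycling": [0.2, 1.0],
}

diverse = mmr(doc_vec, candidates, top_n=2, diversity=0.5)
relevant_only = mmr(doc_vec, candidates, top_n=2, diversity=0.0)
print(diverse)
print(relevant_only)
```

With no diversity penalty the two near-duplicate phrases are both returned; with the penalty, the second slot goes to the phrase that adds something new.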

From this, we generate a table for the spending, ad, and sponsor totals for each set of keywords. The top one is “black history month” which started at the beginning of data collection. Sixty-five different advertisers spent more than $21,000 on 381 unique ads containing these keywords. Further down on the list is a collection of keywords related to plastics from the American Chemistry Council, the organization behind the America’s Plastic Makers page mentioned earlier.
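A table like this one can be built with a simple aggregation, sketched below. The record fields are hypothetical, and spend is treated as a single number per ad for simplicity, whereas the figures quoted in this post for individual ads are ranges; how the real script resolves those ranges is not shown here.

```python
from collections import defaultdict


def keyword_totals(ads):
    """Aggregate spend, unique ads, and unique sponsors per keyword."""
    spend = defaultdict(float)
    ad_ids = defaultdict(set)
    sponsors = defaultdict(set)
    for ad in ads:
        for kw in set(ad["keywords"]):  # count each keyword once per ad
            spend[kw] += ad["spend"]
            ad_ids[kw].add(ad["id"])
            sponsors[kw].add(ad["sponsor"])
    rows = [
        {
            "keyword": kw,
            "spend": spend[kw],
            "ads": len(ad_ids[kw]),
            "sponsors": len(sponsors[kw]),
            "spend_per_ad": spend[kw] / len(ad_ids[kw]),
        }
        for kw in spend
    ]
    return sorted(rows, key=lambda row: row["spend"], reverse=True)


# Toy records, not real ad data.
ads = [
    {"id": 1, "sponsor": "Page A", "spend": 300.0, "keywords": ["black history month"]},
    {"id": 2, "sponsor": "Page B", "spend": 200.0, "keywords": ["black history month", "plastic recycling"]},
    {"id": 3, "sponsor": "Page C", "spend": 100.0, "keywords": ["plastic recycling"]},
    {"id": 4, "sponsor": "Page A", "spend": 50.0, "keywords": ["black history month"]},
]

table = keyword_totals(ads)
for row in table:
    print(row)
```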

Named entity recognition aims to extract proper nouns from a text. We used spaCy’s en_core_web_trf language model pipeline, disabling the other pipeline components to improve speed, and filtered the results by the labels EVENT, LAW, LOC (location), NORP (nationalities or religious or political groups), ORG (organization), PERSON, and PRODUCT. Many expected named entities were common in the data, such as Congress, House, Democrats, Trump, and Biden.
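The filtering and counting step can be sketched independently of the model. The snippet below assumes the entities have already been extracted as (text, label) pairs; with spaCy, such pairs can be read from a processed document as `(ent.text, ent.label_) for ent in doc.ents`. The pairs here are hand-written examples rather than model output, and the function name is illustrative.

```python
from collections import Counter

# The labels kept in the analysis described above.
KEEP_LABELS = {"EVENT", "LAW", "LOC", "NORP", "ORG", "PERSON", "PRODUCT"}


def count_entities(entities, keep=KEEP_LABELS):
    """Count mentions of entities whose label is in `keep`.

    entities: iterable of (text, label) pairs.
    """
    return Counter(text for text, label in entities if label in keep)


# Hand-written pairs standing in for NER output.
entities = [
    ("Congress", "ORG"),
    ("Biden", "PERSON"),
    ("Tuesday", "DATE"),
    ("Congress", "ORG"),
    ("$5", "MONEY"),
    ("Democrats", "NORP"),
]

counts = count_entities(entities)
print(counts.most_common())
```

Dates and money amounts are dropped because their labels are not in the kept set, which leaves the people, groups, and organizations that are most useful for spotting campaigns.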

Ilhan (Omar) was among the most common, mentioned 207 times, the most of any legislator. As a progressive Black Muslim woman, she is often targeted, as in the example ad below.

Conclusion

These changes will lead to better insights into political ad trends through more accurate language classification, topic identification, weekly trend summaries, and keyword and named entity extraction. They are among several additional features that may be included in the 2024 version of AdObservatory, should funding be renewed.

About NYU Cybersecurity for Democracy

Cybersecurity for Democracy is a research-based, nonpartisan, and independent effort to expose online threats to our social fabric — and recommend how to counter them. It is a part of the Center for Cybersecurity at the NYU Tandon School of Engineering.

Would you like more information on our work? Visit Cybersecurity for Democracy online and see how tools, data, investigations, and analysis are fueling efforts toward platform accountability.

Footnotes

  1. For those interested in a much larger dataset with millions of entries, there is this one
