Political Ads Investigation

Example of Ad Information Pulled and Published by ProPublica

2018 has been a big year for political ads. As companies get better and better at targeting their ads to audiences of their choosing, the public has grown increasingly skeptical of these techniques. This year alone, investigations into Cambridge Analytica have shown how personal data could be used to profile and target voters, and the Justice Department has indicted Russia’s Internet Research Agency for, among many other charges, using Facebook to spread divisive fake news. And over the last few days, Facebook COO Sheryl Sandberg has been testifying in front of Congress about how the company is addressing these issues. So, how is Facebook addressing them? One way it seeks to fight outside influence and disinformation in political content is through a system of political ad labeling. This system is meant to improve transparency by making sure political ads are labeled as such, including a “paid for by” disclosure. All political ads are also kept in a public archive, which tells users basic demographic information about who saw an ad and how much money went into it.

However, if you want to know more specifically whom these ads are targeting, you won’t get that information from Facebook. If users who may be more susceptible to false information are targeted with malicious content, there may be no one seeing those ads who would be inclined to report or challenge the misinformation. This is one of the reasons ProPublica launched a tool that scrapes ads and targeting information directly from consenting users’ profiles in order to shed more light on how targeted ads are used (you can find more about how the tool works and examples of collected ads here).

Analyzing Ad Data for Insights into Messaging and Targeting

Using ProPublica’s published data on political ads from Facebook, I set out to investigate the following questions:

  • Are there different categories that the posts fit into based on their messages?
  • What sorts of issues tend to come up?
  • Can we find significant differences in the language used between generalist ads and targeted ads? How about between different target audiences?

To start, I wanted to look at the targeting criteria being used. Were certain characteristics used to target users over and over again? What types of audiences are these political ads trying to reach? In general, the majority of ads used only Age and Location to target their audiences (see Figure 1). However, this doesn’t take into account the number of impressions a given ad had, which determines how many people actually saw it.
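For reference, here’s a minimal sketch of how a Figure 1-style breakdown can be computed with pandas, assuming the ProPublica CSV export stores each ad’s targeting criteria as a JSON list in a “targets” column (the file and column names are assumptions):

```python
import json
import pandas as pd

# Load the ProPublica export; file and column names here are assumptions.
ads = pd.read_csv("fbpac-ads-en-US.csv")

def target_types(raw):
    """Extract the targeting categories (e.g. Age, Location) used by one ad."""
    try:
        return list({t["target"] for t in json.loads(raw)})
    except (TypeError, ValueError, KeyError):
        return []

# Percentage of ads that use each targeting category (Figure 1).
categories = ads["targets"].apply(target_types)
pct = (categories.explode().value_counts() / len(ads) * 100).round(1)
print(pct)
```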

What does become apparent is that this data, which ProPublica readers opt into providing, skews fairly liberal in the users being targeted (Figure 3).

From left to right: Figure 1) The percentage of ads that used each targeting category; Figure 2) The top 20 advertisers in the political ad dataset; Figure 3) The top 20 Interests targeted by ads in the dataset
Figure 4) Time series of number of impressions per day in the dataset

Another important thing to note is the observation window (Figure 4). The dataset’s observations begin in October 2017 and run through July 2018. The metric here, impressions, is important because it gives a sense of the scale of an ad’s reach, but it can also be misleading: many users of the plugin may have stopped providing data after the initial release period. We would actually expect the opposite pattern, with impressions rising as races ramp up for primaries in July/August/September and into the general election in November 2018.
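A short sketch of how a Figure 4-style time series can be produced, assuming the export has a “created_at” timestamp and an “impressions” count per ad (both column names are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd

ads = pd.read_csv("fbpac-ads-en-US.csv", parse_dates=["created_at"])

# Sum impressions per day over the observation window (Figure 4).
daily = ads.set_index("created_at")["impressions"].resample("D").sum()

daily.plot(figsize=(10, 4))
plt.ylabel("Impressions per day")
plt.tight_layout()
plt.show()
```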

In order to understand the messaging landscape of the ads, you’d have to look through all the messages and try to group them into categories, or topics. I wanted to know which topics were represented so that I could then measure whether some topics were being pushed or served more to specific audiences. In Natural Language Processing (NLP), there are a few different ways to do this type of analysis. One popular method is Latent Dirichlet Allocation (LDA), a statistical model commonly used for topic modeling that attempts to identify latent groups in a collection of documents. It is an unsupervised machine learning approach, common in NLP when you are not sure what topics are present in the text you are analyzing.

One of the challenges of LDA as an approach is that you have to specify the number of topics ahead of time. To test out building a topic model on this dataset, I set the number of topics to 5 and observed the results (Figure 5).
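The full pipeline is in the companion notebook; below is a minimal sketch of fitting that first 5-topic model with gensim, assuming the ad text lives in a “message” column and using gensim’s stock preprocessing rather than the exact cleaning steps from the notebook:

```python
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string

# Tokenize, lowercase, strip tags/stopwords/punctuation (simplified cleaning).
docs = [preprocess_string(text) for text in ads["message"].fillna("")]

dictionary = corpora.Dictionary(docs)
dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop rare/ubiquitous terms
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
               passes=10, random_state=42)

# Inspect the top words per topic.
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)
```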

Figure 5) Initial topic modeling results show pretty broad topics but decent separation between them. The highlighted topic (topic 4) seems to be broadly about gun violence in communities and schools.

While those results were promising, how are we supposed to know that 5 is even the right number of topics, or whether our model is any good? I wanted to identify the correct number of topics quantitatively. To do this, we can evaluate the LDA model using its coherence score:

  • Coherence Score: In basic terms, this score evaluates the top N words in each of the topics identified. It measures the interpretability of a given topic, and a higher score is better (see a detailed explanation here).
Figure 6) Iterating over topic models to find the optimal number of topics

Using the coherence score, we can iteratively test models at increasing numbers of topics and use the topic count that performs best. In doing this, we find 26 to be the optimal number of topics, and will use that topic model moving forward.
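A sketch of that sweep, reusing the corpus and dictionary from the earlier snippet; the candidate range and step size here are illustrative choices:

```python
from gensim.models import CoherenceModel, LdaModel

scores = {}
for k in range(5, 41, 3):  # candidate topic counts (illustrative range)
    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                     passes=10, random_state=42)
    cm = CoherenceModel(model=model, texts=docs, dictionary=dictionary,
                        coherence="c_v")
    scores[k] = cm.get_coherence()

best_k = max(scores, key=scores.get)  # this post's sweep landed on 26
print(f"Best number of topics: {best_k}")
```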

Using the optimal model, we can look across all the ads and assign a dominant topic to each one. An ad may touch on many topics, or have some overlap between topics, so assigning a dominant topic identifies the single most likely topic present in each ad. Using this approach, we can see how frequently each topic comes up in the whole dataset (Figure 7).
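In gensim terms, the dominant topic is just the highest-probability entry in each ad’s topic distribution. A sketch, continuing from the snippets above with the 26-topic model as “lda”:

```python
def dominant_topic(bow):
    """Return the id of the most probable topic for one bag-of-words document."""
    topic_probs = lda.get_document_topics(bow, minimum_probability=0.0)
    return max(topic_probs, key=lambda pair: pair[1])[0]

ads["dominant_topic"] = [dominant_topic(bow) for bow in corpus]

# How often each topic is dominant across the dataset (Figure 7).
print(ads["dominant_topic"].value_counts())
```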

Figure 7) Number of ads for which each topic is the dominant topic. Topics are labeled by their top 10 keywords
Figure 8) Plot of the average targetedness of ads per topic. Targetedness is a metric defined by ProPublica

Looking at the plot of average targetedness per topic, the topic with keywords “county, state, city, etc.” appears to be the most targeted. This makes sense, as local issues should be targeted to local populations. Interestingly, the two least targeted topics on average both contain the word “grassroots”.
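Figure 8 is then a simple group-by, assuming ProPublica’s targetedness score is exposed as a “targetedness” column in the export:

```python
# Average targetedness per dominant topic (Figure 8).
avg_targetedness = (
    ads.groupby("dominant_topic")["targetedness"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_targetedness)
```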

Figure 9) Plotting the difference in topic distribution between ads targeting Democrats and ads targeting Republicans

In looking to observe the difference in topics targeted to Democrats vs. Republicans (Figure 9), a few things are important to note:

  • Of all the ads observed in this dataset, only 20% target based on Interest, and a much smaller number target these specific party interests. Large campaigns often use lists purchased or generated through email campaigns rather than Facebook interest targeting.
  • Because the sample skews heavily Democratic in its users, there are many more observations of Democratic Party targeting than of Republican Party targeting (85% vs. 15%). This lowers the significance of any differences.

Looking at the results, a few topics show a significant difference in messaging, such as the topics on “grassroots, campaign, fundraising”, “attorney, general, president, accountable”, and “family, child, immigrant, policy”. On the other hand, some topics show virtually no difference, using messages that speak to broad audiences and perhaps to shared values.
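For completeness, a sketch of how a Figure 9-style comparison can be computed; flagging party-targeted ads with a substring match on the raw targeting JSON is a simplification:

```python
# Ads whose targeting criteria mention each party.
dem = ads[ads["targets"].str.contains("Democratic Party", na=False)]
rep = ads[ads["targets"].str.contains("Republican Party", na=False)]

# Share of each dominant topic within each group, then the difference.
dem_share = dem["dominant_topic"].value_counts(normalize=True)
rep_share = rep["dominant_topic"].value_counts(normalize=True)

# Positive values: the topic appears relatively more in Democrat-targeted ads.
diff = dem_share.sub(rep_share, fill_value=0).sort_values()
print(diff)
```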

Conclusions:

Although the sample size of the dataset is relatively small, we were able to observe some pretty interesting differences in messaging across topics, degrees of targeting, and partisan lines. As we dug into the more coherent topics, we found that certain topics were targeted to more granular audiences. We also saw differences in how Democrats and Republicans are targeted, with emphases on campaign fundraising, President Trump, and immigration policy. Lastly, we saw that many topics/issues are talked about across party lines; in future analyses it would be great to look into how the language on those topics differs.

To see all code used to generate this post, you can check out this notebook. A special thank you to Jeremy B. Merrill at ProPublica for making this tool and data available to the public!
