Public Good at Bloomberg D4GX 2017

Michael S. Manley
Public Good Blog Archive
Jan 25, 2018 · 9 min read

I had the pleasure of presenting Public Good’s experience paper, “Automating, Operationalizing and Productizing Journalistic Article Analysis” in a lightning talk session at the Bloomberg Data For Good Exchange 2017. The conference featured dozens of presentations and discussions about the application of data science, machine learning, artificial intelligence and other “big data” techniques to social problems in government, healthcare, environment and nonprofit arenas. We felt honored to be included among so many interesting problem solvers.

The abstract of our paper hints at a few things we’ll soon talk about in more detail here on the blog in 2018:

Public Good Software’s products match journalistic articles and other narrative content to relevant charitable causes and nonprofit organizations so that readers can take action on the issues raised by the articles’ publishers. Previously an expensive and labor-intensive process, application of machine learning and other automated textual analyses now allow us to scale this matching process to the volume of content produced daily by multiple large national media outlets. This paper describes the development of a layered system of tactics working across a general news model that minimizes the need for human curation while maintaining the particular focus of concern for each individual publication. We present a number of general strategies for categorizing heterogenous texts, and suggest editorial and operational tactics for publishers to make their publications and individual content items more efficiently analyzed by automated systems.

If you’d like to read the entire paper, Bloomberg has made it public on arXiv.org, and you can see the lightning talk as I delivered it here. I spoke in the fourth spot of the session, surrounded by people creating some amazing tools and resources using machine learning. I found the session on machine learning applied to catching wildlife poachers particularly fascinating.

The lightning talk format involves rather abstract slides and requires considerable compression of information, so I’d like to share my prepared notes.

Slide 001: Intro

… or, as we’ve been accused of attempting: Saving Journalism.

Slide 002: Who We Are

PGS was founded in 2013 by several alumni of the Obama for America 2012 technology team.

We wanted to apply to nonprofit efforts the campaign’s lessons regarding fundraising as well as recruiting, engaging and mobilizing volunteers.

The team currently includes people from OFA, SitterCity, Leapfrog Online, Orbitz, Kickstarter and GrubHub: We have a lot of marketplace experience, so we built a marketplace bringing together the United States’ 1.5 million 501(c)(3) nonprofit organizations, sponsors (corporations and foundations), media outlets and individuals. In 2015, we started working on driving people to that marketplace through journalism.

Slide 003: What We Do

We make it possible for people to take action on the causes they care about at the time they are most motivated to do so. Our job is to surface actions that a person can take in the context of what they are doing and where they are doing it. Such actions include:

  • Donate (money, material)
  • Volunteer (donating time, effort)
  • Advocate (donating attention)
  • Learn (read more)
  • Share Experience (tell your story)

We surface these actions through iterative contact with individuals who indicate interest in a cause via email, social media, or organizational outreach. We initiate contact through interactive experiences attached to journalistic content.

In its simplest form, we deliver the Take Action Button, which leads people to a cause, campaign or organizational profile on the publicgood.com marketplace site.

Rolling out soon: Take Action Cards, which deliver an interactive UI for individual actions, organizations, campaigns, or causes.

Finally, we deliver a Conversational Interface (currently embedded on the media partner’s web page, soon to be in multiple messaging platforms), currently available in English and Spanish and tailored to specific campaigns and actions.

Slide 004: Take Action Publishers

We work with quite a few media outlets, adding more every month. In 2016, we surfaced 19 million Take Action experiences.

In the summer of 2017, we helped raise $300k for Life After Hate, through Samantha Bee’s promotion of that group’s profile on her show after the organization lost a $400k federal grant.

CNN raised over $1 million for victims of Hurricane Harvey and other disasters through other outlets and campaigns.

With newsrooms that publish this volume and breadth of content, we have to surface actions at scale, accurately and quickly.

Slide 005: First Question

We have two major questions to answer in surfacing the right action.

First, how do we automatically match an article to its underlying cause?

Here, we have an article from PRI about the Barcelona attack. The underlying cause is the fight against terrorism — intervention, deradicalization, cross-cultural outreach.

We have to know what the article is about and match it as best we can to actions people can take with those relevant organizations in our marketplace working on these issues.

Slide 006: Second Question

This raises the second question: How do we match causes to the relevant organizations and the relevant actions with those organizations?

Here’s an example of the kind of cause page the Take Action Button might direct an individual to on PublicGood.com.

(With the Action Card and conversational interfaces, the user doesn’t need to leave the media partner’s property unless the action is with a third party who requires it.)

We need to know what the organization works on and where it does its work (which is different from where it has its offices).

Slide 007: The Constraint

To make this scale, we:

  • Have to do it fast
  • Have to do it as infrequently as possible

Humans are expensive and slow, but they’re pretty good at this categorization problem, so we only use them where they make the most sense:

  • Up-front analysis of a publication
  • Verification of the machine’s work

Slide 008: What Didn’t Work

Our first foray into automating this analysis involved some naive attempts at textual analysis in addition to a lot of human curation and correction.

What didn’t really work: causes based on IRS 990 categorization, plus entity, geographic term, and sentiment extraction using the Alchemy API services, combined with naive term matching and ElasticSearch percolator queries attempting to match on proper names of organizations, locations, and our own tagging of marketplace entities. While we still use similar text analysis tools to generate input features for our current system, naive application of these tools led to:

  • Matching that was frequently less accurate than random chance would have been
  • Very high levels of human intervention
  • Very high incidence of the interactive widgets being hidden on a page due to no relevant matches

And embarrassing results like:

  • All articles with “United States” matching to “Golf” (yes, there are golfing nonprofits)
  • All gun violence articles in Chicago matching to Northwestern University (because victims are often sent to Northwestern Memorial Hospital)
  • An article about “Deep Throat” leading to a “Ear, Nose and Throat Disease Research” cause
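
For illustration only, here’s a minimal sketch of the kind of naive term matching that produces results like the ones above; the terms and causes are hypothetical, but the failure mode is real: every extracted term gets an equal vote, so boilerplate entities like “United States” drown out whatever the story is actually about.

```python
# Minimal sketch of naive term matching (hypothetical terms and causes).
# Every extracted entity votes for whichever causes mention it, with no
# notion of salience -- so boilerplate terms dominate the match.

from collections import Counter

# Hypothetical mapping from extracted terms to marketplace causes.
TERM_TO_CAUSES = {
    "united states": ["Golf"],               # yes, there are golfing nonprofits
    "chicago": ["Northwestern University"],
    "throat": ["Ear, Nose and Throat Disease Research"],
    "hurricane harvey": ["Disaster Relief"],
}

def naive_match(extracted_terms):
    """Return causes ranked by how many extracted terms voted for them."""
    votes = Counter()
    for term in extracted_terms:
        for cause in TERM_TO_CAUSES.get(term.lower(), []):
            votes[cause] += 1
    return votes.most_common()

# An article about U.S. politics matches "Golf" purely on term frequency.
print(naive_match(["United States", "United States", "Congress"]))
# -> [('Golf', 2)]
```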

So what strategies did we put in place?

Slide 009: Content Fingerprinting

First, we have to start eliminating duplicate content so that we are not analyzing and categorizing the same story multiple times.

  • Stories get edited for typos and headline changes
  • Stories get syndicated within and between publications
  • Stories do not get canonical URLs attached

We use the Python library appropriately called “Newspaper” to isolate article content from advertising and navigation chrome.

PGS adopted an implementation of Moses Charikar’s “SimHash” technique to identify duplicate content. If an article’s hash value is within a specified distance of a previously stored article’s hash, we assume the two articles are the same and assign the previously calculated set of matching advice.
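
A rough sketch of how these pieces fit together, assuming the Newspaper package for extraction and a hand-rolled 64-bit SimHash (the tokenization and the distance threshold below are illustrative choices, not our production settings):

```python
# Sketch: extract article text with Newspaper, fingerprint it with SimHash,
# and treat near-identical fingerprints as duplicates.
# The 64-bit hash and the distance threshold below are illustrative choices.

import hashlib
import re

from newspaper import Article  # pip install newspaper3k

def simhash(text, bits=64):
    """Charikar-style SimHash over word tokens."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

def fetch_fingerprint(url):
    article = Article(url)
    article.download()
    article.parse()          # strips ads and navigation chrome
    return simhash(article.text)

def is_duplicate(new_hash, stored_hashes, threshold=3):
    """True if the new article is within `threshold` bits of a known one."""
    return any(hamming_distance(new_hash, h) <= threshold for h in stored_hashes)
```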

This helps us slim down the corpus of articles we have to consider.

Slide 010: Business Rules Engine

This is one place where we make the most effective use of human analysis. While machine learning will always involve a margin of error, articles classified deterministically yield nearly 100% accuracy. But those articles need structured clues in their text, their URLs, and their metadata that people can recognize and formulate rules over. We look at publication structure, content tagging, entity extraction terms, and publisher preferences. We look at whatever history we may have with the reader at any given time. We take all of these facts and run them against multiple sets of production rules that can match articles to causes and causes to organizations.

Here, we can do some naive matching against very event-specific terms, such as “Hurricane Harvey,” or against proper names, such as the author Dan Savage. We can share rules between publishers.

A simple example: Univision has very limited but well-structured tagging on their stories and a small number of causes to which they want to direct traffic. It is very easy for us to assemble a small set of production rules that match their stories to the proper causes.
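
As a hypothetical sketch (the tags, causes, and rule structure here are simplified stand-ins for our actual engine), a deterministic rule set can be as plain as a list of publisher-scoped predicates:

```python
# Hypothetical sketch of deterministic business rules: each rule checks
# structured clues (publisher tags, extracted entities, text patterns) and,
# when it fires, maps the article straight to a cause with full confidence.

RULES = [
    # (publisher, predicate over article metadata, cause)
    ("univision",
     lambda a: "inmigración" in a.get("tags", []),
     "Immigrant & Refugee Support"),
    ("*",  # shared rule across publishers
     lambda a: "hurricane harvey" in a.get("text", "").lower(),
     "Hurricane Harvey Relief"),
]

def apply_rules(article, publisher):
    """Return the first cause whose rule matches, or None to fall through
    to the machine learning models."""
    for rule_publisher, predicate, cause in RULES:
        if rule_publisher in ("*", publisher) and predicate(article):
            return cause
    return None

article = {"tags": ["inmigración"], "text": "…"}
print(apply_rules(article, "univision"))  # -> "Immigrant & Refugee Support"
```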

Slide 011: Taxonomy & Metadata

Turns out, modeling the world is hard.

Terminology is messy, so we can’t rely on nonprofit taxonomies created by the IRS at the beginning of last century. Nor can we rely on taxonomies from people who think about nonprofits all day. Or the news. Or the UN’s development programs. Or Dublin Core metadata.

And we can’t rely on our own categorization biases.

We have to treat each of these taxonomies as individual aspects of the same content or organization or whatever other entity. We have to determine which aspect is most relevant in any given matching situation. And we need a lot more metadata attached to both journalistic articles and our own marketplace entities.

This is another instance of where human effort is better applied. We can even crowdsource this kind of “tagging.”
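
One way to picture the “aspects” idea (a hypothetical data shape, not our actual schema) is to carry every taxonomy as a separate facet on the same marketplace entity and decide at match time which facet matters:

```python
# Hypothetical shape: one marketplace entity carries labels from several
# taxonomies at once; the matcher chooses whichever facet is most relevant
# to the article at hand rather than forcing a single "true" category.

from dataclasses import dataclass, field

@dataclass
class OrganizationRecord:
    name: str
    irs_ntee_codes: list = field(default_factory=list)   # IRS-era taxonomy
    news_topics: list = field(default_factory=list)      # newsroom vocabulary
    internal_causes: list = field(default_factory=list)  # our marketplace causes
    service_areas: list = field(default_factory=list)    # where it works, not where its offices are

org = OrganizationRecord(
    name="Example Relief Org",
    irs_ntee_codes=["M20"],                 # disaster preparedness & relief
    news_topics=["hurricanes", "flooding"],
    internal_causes=["Disaster Relief"],
    service_areas=["Gulf Coast"],
)
```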

Slide 012: Machine Learning Models

Subtracting out any items that have been hand-curated by the media outlet, half of the articles encountered in the current PGM corpus do not have enough information for the Business Rules Engine to deterministically match content to a cause. This number is shrinking, but still a lot. This is where we start bringing in the machine learning algorithms.

Starting with a naive Bayes classifier and a limited number of similar categories, we could match with 80% accuracy. Applying the same technique to a larger number of broader categories resulted in unacceptable drops in accuracy. Eventually, we landed on a 37-category taxonomy that we could match at better than 50% accuracy after training the model with 6000 human-categorized articles. Good, but not great.
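
For the curious, that baseline looked roughly like the pipeline below; scikit-learn and TF-IDF features are assumptions for illustration, and the real feature pipeline and taxonomy are described in the paper.

```python
# Minimal sketch of a naive Bayes baseline for a 37-category taxonomy.
# scikit-learn and TF-IDF features are assumptions for illustration;
# the production feature pipeline is more involved.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def build_baseline():
    return make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
        MultinomialNB(),
    )

# In practice, `texts` would be the ~6,000 human-categorized articles and
# `labels` their causes from the 37-category taxonomy; placeholders here.
texts = [
    "flood damage across the gulf coast displaces thousands of families",
    "a new after-school arts program brings theater to city teens",
]
labels = ["Disaster Relief", "Arts Education"]

model = build_baseline()
model.fit(texts, labels)
print(model.predict(["volunteers rebuild homes after the hurricane"]))
```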

Testing with other algorithms, and training on data sets built both for individual media outlets and across multiple outlets, we can routinely gain another 10 to 15 percentage points of accuracy. With continuous training on larger data sets and tweaking of our taxonomies, we think there is a ceiling of about 80–85% accuracy that we can reach with current models.

Thumbnail sketch: for 100 articles, we can hope to match 50 deterministically and 40 with machine learning, leaving only 10 that have to be manually categorized by a human. And of those 10, we think we can get our audience to categorize 8 or 9 of them for us. Considering where we started, which was far less than 50% matching, this appears to be a workable solution.

Slide 013: What We Learned

Lesson Learned: Machine Learning is not magic pixie dust.

There will always be a desire for human curation.

There will always be a need for some human analysis of publications.

The best strategy is to use human analysis where it does the most good in the least time.

Publishers should invest in the “semantic web” if they want automated discovery of their content.

The nonprofit world needs more useful categorization schemes that reflect real-world concerns and causes.

We’re planning on presenting further refinements and updates to this experience paper at future conferences. In coming weeks, we also plan to expand on a few of the final lessons learned here on the PGS blog.

I’d like to thank the Bloomberg Data for Good Exchange organizers for including Public Good’s experience paper among the many great presentations and talks. We came away from the conference impressed by the depth and breadth of the work being done by data scientists and other folks on many hard social problems, and we collected quite a few ideas we plan to mix into our own work. I’d also like to thank the paper’s other authors, Dan Ratner and Eric Kingery, and Q. McCallum for his shepherding of the company through its first conference presentation.
