Text processing highlights from NICAR 2018

Barbara Maseda
Published in Text Data Stories
Apr 24, 2018

Takeaways of interest for journalists working with textual sources

Image: Pexels

Of the many possible ways to navigate NICAR, the annual conference of the National Institute for Computer-Assisted Reporting, held in Chicago last month, I chose the “text path”: a personal itinerary through panels and hands-on sessions focused on stories, issues and tools involving some kind of text manipulation.

This topic is the focus of my John S. Knight Fellowship project at Stanford University, so, naturally, I was interested in seeing how much of it would be covered at one of the most important journalism events in the world.

What follows is simply a summary of the talks I was able to attend, not a comprehensive review of all the relevant sessions, but I hope these notes will be helpful to people with similar interests:

The eternal PDF dilemma

Accessing information trapped in PDF files is a common problem in journalism, so it was no surprise to find more than one session offering solutions.

Image: Pixabay

Unleash the data: Tools and tricks for taming PDFs

The Associated Press’ Chad Day taught a hands-on session (scheduled twice in the program) titled “Unleash the data: Tools and tricks for taming PDFs,” where he introduced participants to tools that are commonly used in CAR to convert PDF files into machine-readable formats (a short Python sketch follows the list):

  • Cometdocs (free with the IRE membership)
  • Tabula
  • Pdftk
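For readers who prefer scripting the same step, here is a minimal sketch using tabula-py, the Python wrapper around the Tabula engine mentioned above. It requires a local Java install, and the filename is a placeholder:

# Minimal sketch: extract tables from a text-based PDF with tabula-py,
# the Python wrapper around Tabula (requires Java on the system).
# "report.pdf" is a placeholder filename.
import tabula

# In recent tabula-py versions, read_pdf returns a list of pandas
# DataFrames, one per detected table (older versions may return a
# single DataFrame).
tables = tabula.read_pdf("report.pdf", pages="all")

for i, df in enumerate(tables):
    df.to_csv(f"table_{i}.csv", index=False)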

He also covered a series of options for dealing with image-based PDF files, which require Optical Character Recognition (OCR) tools (see the sketch after the list):

  • ABBYY FineReader
  • Adobe Acrobat Pro
  • DocumentCloud
  • Cometdocs
  • PypdfOCR
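None of the tools above require code, but to give a sense of what the OCR step looks like in a script, here is a small sketch using pdf2image and pytesseract, two open-source Python packages that were not covered in the session. Both need the poppler and Tesseract binaries installed, and the filename is a placeholder:

# OCR sketch with pdf2image + pytesseract (not among the tools listed
# above, but a common open-source route for the same step).
# Requires the poppler and tesseract binaries on the system.
from pdf2image import convert_from_path
import pytesseract

# Render each page of the scanned PDF to an image, then OCR it
pages = convert_from_path("scanned.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(page) for page in pages)

with open("scanned.txt", "w") as out:
    out.write(text)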

Day also recommended pdfinfo as a useful tool for inspecting a PDF file’s metadata.
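pdfinfo ships with the poppler utilities; a quick way to see what it returns is to call it from a script (the filename is again a placeholder):

# Read a PDF's metadata with pdfinfo (part of poppler-utils).
# "report.pdf" is a placeholder filename.
import subprocess

result = subprocess.run(["pdfinfo", "report.pdf"],
                        capture_output=True, text=True, check=True)
print(result.stdout)  # title, author, creation date, page count, etc.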

Using OCR to extract data from PDFs

A similar problem was covered in “Using OCR to extract data from PDFs,” a session taught by Miguel Barbosa from CitizenAudit.

Barbosa’s presentation does a great job of helping journalists decide how best to approach their file processing, taking into consideration file formats, volume, expected accuracy, and the skills and budget available. Make sure you check out his “PDF processing checklist” and “PDF tools landscape”.

Fuzzy matching

Are “Theresa May” and “Rt Hon Theresa May MP” the same person? Can we join two different data sets based on that information?

Slide from Harlow’s presentation

csvmatch is a tool that helps answer the second question: although establishing a person’s identity is a much more complex process (even when the names are exactly the same), being able to match names registered with different formats, spellings, word order, etc., can be useful in certain situations.

This command-line tool compares columns across CSV files and determines whether the strings in their cells are similar.

According to its creator, the FT’s Max Harlow, it works best with medium-sized datasets, which makes it useful mainly for working with names of people and companies.
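As a toy illustration of the idea (this is not csvmatch’s actual algorithm, and the honorific list is made up for the example), fuzzy name matching can be approximated with nothing more than the Python standard library:

# Toy fuzzy-matching sketch (not csvmatch's algorithm): normalise names,
# then combine a token-overlap check with a similarity ratio.
import difflib
import re

HONORIFICS = {"rt", "hon", "mp", "mr", "mrs", "ms", "dr"}  # illustrative list

def normalise(name):
    tokens = re.findall(r"[a-z]+", name.lower())
    return [t for t in tokens if t not in HONORIFICS]

def looks_like_same_person(a, b, threshold=0.8):
    ta, tb = normalise(a), normalise(b)
    # exact token subset, e.g. "Theresa May" within "Rt Hon Theresa May MP"
    if set(ta) <= set(tb) or set(tb) <= set(ta):
        return True
    ratio = difflib.SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()
    return ratio >= threshold

print(looks_like_same_person("Theresa May", "Rt Hon Theresa May MP"))  # True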

IRSx: Python library for accessing non-profit tax returns

Originally built for ProPublica’s Nonprofit Explorer by former JSK fellow Jacob Fenton, IRSx is a Python library and command-line tool that provides easier access to the nonprofit tax returns released by the IRS in XML format, dating back to 2013.

Following the same structure as the “paper” 990 forms, the library can render filings as standardized Python objects, JSON, or human-readable text with the original line numbers and descriptions.
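A sketch of what that looks like in practice, adapted from my reading of the project’s documentation (the import path, method names and object ID below should all be checked against the current README and cookbook):

# Sketch of IRSx usage; names adapted from the project's docs and worth
# double-checking against the current README. Each e-filed 990 is
# identified by an IRS "object id"; the one below is a placeholder.
from irsx.xmlrunner import XMLRunner

xml_runner = XMLRunner()
parsed_filing = xml_runner.run_filing("201533089349301428")  # placeholder ID

# The parsed filing exposes its schedules as standardized Python dicts;
# the "schedule_name" key is an assumption based on the docs.
for schedule in parsed_filing.get_result():
    print(schedule["schedule_name"])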

The GitHub repository for the tool includes extensive documentation, including a useful cookbook.

Story examples

Some speakers shared examples of how processing large amounts of documents or text had played roles of varying importance in their reporting: from surfacing the trends or outliers that became the centerpiece of a story, to providing the contextual or initial information needed to produce the story later on.

Janet Roberts explained how in 2014 Reuters analyzed 14,400 U.S. Supreme Court petitions using a series of computational tools, including machine learning (LDA) to identify the topics in the petitions, and Open Calais, a Thomson Reuters-owned document-analysis program, to identify the companies that petitioned the court. Their work revealed how a cadre of well-connected attorneys had honed the art of getting the Supreme Court to take up cases.
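For readers curious about the LDA step, here is a generic topic-modelling sketch with scikit-learn. It is not Reuters’ actual pipeline, and the tiny corpus only stands in for the real petition texts:

# Generic LDA topic-modelling sketch with scikit-learn (not Reuters'
# pipeline). The tiny corpus stands in for real petition texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

petitions = [
    "patent dispute over smartphone technology licensing",
    "securities fraud class action against investment bank",
    "patent infringement claim on wireless chip design",
    "antitrust suit over bank merger approval",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(petitions)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Print the top words for each discovered topic
# (get_feature_names_out requires scikit-learn >= 1.0)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")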

Mike Tigas shared a couple of examples from ProPublica in which text data helped the team get partway to the story:

  • How U.S. Commanders Spent $2 Billion of Petty Cash in Afghanistan used data assembled from several different Department of Defense databases by the Special Inspector General for Afghanistan Reconstruction and provided to ProPublica under a FOIA request.
  • Dollars for Docs was created by processing payment reports released by the Centers for Medicare and Medicaid Services, which ProPublica compiled into a single, comprehensive database.

Steven Rich from The Washington Post talked about the fatal police shootings database that his organization has maintained since 2015, and how scraping and analyzing news reports has helped surface information that is then fact-checked by reporters.

Fatal police shootings database kept by the Washington Post. Screen capture of one of the visualizations included in the 2018 update

Anthony Pesce from the L.A. Times spoke about the production of a story based on the automated re-evaluation of LAPD crime reports, which led to the discovery of a long history of underreporting of serious assaults in that city.

L.A. Times

Jeremy Merrill explained how ProPublica had used hundreds of thousands of press releases from 2015 to the present to train a computer model that identifies the topics distinctive to each member of Congress. The tool contributes content to Represent, the organization’s news app tracking different aspects of the work of the U.S. Congress.
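ProPublica describes its actual approach in the “Chamber of Secrets” post listed below. As a much simpler illustration of the general idea, distinctive vocabulary per member can be surfaced with plain TF-IDF (the data here is invented):

# Simplified sketch of surfacing distinctive terms per member with
# TF-IDF (not ProPublica's actual model). Each "document" is the
# concatenation of one member's press releases; the data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer

press_releases = {
    "Member A": "farm subsidies rural broadband crop insurance farm bill",
    "Member B": "veterans affairs military bases defense spending veterans care",
}

names = list(press_releases)
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(press_releases.values())
terms = vectorizer.get_feature_names_out()

for name, row in zip(names, tfidf.toarray()):
    top = [terms[j] for j in row.argsort()[-5:][::-1]]
    print(f"{name}: {', '.join(top)}")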


Talks, slides, data, speakers

title: Unleash the data: Tools and tricks for taming PDFs
speaker: Chad Day
materials: slides

title: Using OCR to extract data from PDFs
speaker: Miguel Barbosa
materials: slides

title: Finding needles in haystacks with fuzzy matching
speaker: Max Harlow
materials: slides

title: Turning your documents into data
speakers: Steven Rich, Janet Roberts, Mike Tigas
materials: N/A

title: How to find reporting leads and publishable facts in the text data you already have
speakers: Jeremy Merrill, Jeff Ernsthausen, Youyou Zhou
materials: slides

title: Getting started with machine learning for reporting
speakers: Peter Aldhous, Chase Davis, Rachel Shorey, Anthony Pesce
materials: slides

Tools

  • Cometdocs
  • Tabula
  • Pdftk
  • ABBYY FineReader
  • Adobe Acrobat Pro or DC
  • DocumentCloud
  • PypdfOCR
  • pdfinfo
  • csvmatch
  • Open Calais
  • IRSx

Stories

Chamber of Secrets: Teaching a Machine What Congress Cares About (ProPublica)

Analysis of 141 hours of cable news reveals how mass killers are really portrayed (Quartz)

LAPD underreported serious assaults, skewing crime stats for 8 years (L.A. Times)

At America’s court of last resort, a handful of lawyers now dominates the docket (Reuters)

How U.S. Commanders Spent $2 Billion of Petty Cash in Afghanistan (ProPublica)

Doctors and Sex Abuse (The Atlanta Journal-Constitution)

Acknowledgements: I want to thank OpenNews for giving me one of their scholarships to cover part of my expenses for this conference, as well as the OpenNews members at the conference for making me feel welcome at my first NICAR.
