Text processing highlights from NICAR 2018
Takeaways of interest for journalists working with textual sources
Of the many possible ways to navigate NICAR, the annual conference of the National Institute for Computer-Assisted Reporting, held in Chicago last month, I chose the “text path”: a personal itinerary through panels and hands-on sessions focused on stories, issues and tools involving some kind of text manipulation.
This topic is the focus of my John S. Knight Fellowship project at Stanford University, so, naturally, I was interested in seeing how much of it would be covered at one of the most important journalism events in the world.
What follows is simply a summary of the talks I was able to attend, not a comprehensive review of all the relevant sessions, but I hope these notes can be of help to people with similar interests:
The eternal PDF dilemma
Accessing information trapped in PDF files is a common problem in journalism, so it was no surprise to find more than one session offering solutions.
Unleash the data: Tools and tricks for taming PDFs
The Associated Press’ Chad Day taught a hands-on session (scheduled twice in the program) titled “Unleash the data: Tools and tricks for taming PDFs” where he introduced participants to tools that are commonly used in CAR to convert PDF files to machine-readable formats:
- Cometdocs (free with the IRE membership)
He also covered a series of options for dealing with PDF files based on images, which require the use of Optical Character Recognition (OCR) tools:
- ABBYY FineReader
- Adobe Acrobat Pro
- DocumentCloud
Day also recommended pdfinfo as a useful command-line tool for accessing the metadata embedded in a PDF file.
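pdfinfo (part of the Poppler utilities) is the right tool for this job; purely to illustrate where that metadata lives, here is a rough standard-library Python sketch that pulls string fields out of a simple, uncompressed file’s Info dictionary. The embedded PDF fragment and its field values are invented for the example:

```python
import re

# A minimal, uncompressed PDF fragment containing an Info dictionary,
# embedded here so the sketch is self-contained (real files vary widely
# and are often compressed, which is why pdfinfo exists).
pdf_bytes = b"""%PDF-1.4
1 0 obj
<< /Title (Quarterly report) /Author (Jane Doe) /CreationDate (D:20180308120000) >>
endobj
trailer
<< /Info 1 0 R >>
%%EOF"""

def pdf_metadata(raw: bytes) -> dict:
    """Pull string-valued entries like /Title (...) out of a PDF's Info dict.

    Only works on simple, uncompressed files; pdfinfo handles the general case.
    """
    return {key.decode(): value.decode()
            for key, value in re.findall(rb"/(\w+)\s*\(([^)]*)\)", raw)}

meta = pdf_metadata(pdf_bytes)
print(meta["Title"])   # the title stored in the Info dictionary
print(meta["Author"])
```

In practice you would simply run `pdfinfo file.pdf` and read the same fields from its output.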
Using OCR to extract data from PDFs
Barbosa’s presentation does a great job of helping journalists decide how best to proceed with their file processing, taking into consideration file format, volume, expected accuracy, and the skills and budget available. Make sure you check out his “PDF processing checklist” and “PDF tools landscape”.
CSV Match: comparing names across data sets
Are “Theresa May” and “Rt Hon Theresa May MP” the same person? Can we join two different data sets based on that information?
CSV Match is a tool that helps answer the second question: although establishing a person’s identity is a much more complex process — even in cases when the names are exactly the same — being able to match names recorded with different formats, spellings or word orders can be useful in certain situations.
This command-line tool compares columns in CSV files and determines whether the strings in their respective cells are similar.
According to its creator, the FT’s Max Harlow, it works best with medium-sized data sets, which makes it useful mainly for names of people and companies.
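CSV Match itself is a command-line tool, but the underlying idea can be sketched in a few lines of standard-library Python: normalize the names, score each pair for similarity, and keep the pairs above a threshold. The column name, the honorific list and the 0.6 cutoff below are illustrative choices, not the tool’s defaults:

```python
import csv
import io
from difflib import SequenceMatcher

# Two tiny "files" with names recorded in different formats.
left = io.StringIO("name\nTheresa May\nJeremy Corbyn\n")
right = io.StringIO("name\nRt Hon Theresa May MP\nBoris Johnson\n")

def normalize(name: str) -> str:
    """Lowercase and drop common honorifics so different formats align."""
    stop = {"rt", "hon", "mp", "mr", "mrs", "ms", "dr"}
    return " ".join(w for w in name.lower().replace(".", "").split()
                    if w not in stop)

def matches(file_a, file_b, field="name", threshold=0.6):
    """Return (name_a, name_b, score) for every pair above the threshold."""
    rows_a = list(csv.DictReader(file_a))
    rows_b = list(csv.DictReader(file_b))
    pairs = []
    for a in rows_a:
        for b in rows_b:
            score = SequenceMatcher(None, normalize(a[field]),
                                    normalize(b[field])).ratio()
            if score >= threshold:
                pairs.append((a[field], b[field], round(score, 2)))
    return pairs

for pair in matches(left, right):
    print(pair)
```

Real fuzzy matchers offer several algorithms (edit distance, phonetic matching, and so on); a plain similarity ratio is enough to show the shape of the problem.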
IRSx: Python library for accessing non-profit tax returns
Originally built for ProPublica’s NonProfit Explorer by former JSK fellow Jacob Fenton, IRSx is a Python library and command line tool that allows for easier access to nonprofit tax returns released by the IRS in XML format, dating from 2013.
Following the same structure as the “paper” 990 forms, the library can render standardized Python objects, JSON, or human-readable text with the original line numbers and descriptions.
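IRSx knows every version of the IRS schema and does far more than this; the toy sketch below shows only the core idea of mapping version-specific XML paths onto standardized field names. The XML snippet, the paths and the field names are simplified stand-ins, not the actual IRS schema:

```python
import xml.etree.ElementTree as ET

# A made-up, radically simplified stand-in for an e-filed 990 return.
sample_990 = """
<Return returnVersion="2015v2.1">
  <ReturnHeader>
    <Filer><BusinessName>Example Charity Inc</BusinessName></Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990><TotalRevenueAmt>1234567</TotalRevenueAmt></IRS990>
  </ReturnData>
</Return>
"""

# For each schema version, record where the fields of interest live.
FIELD_PATHS = {
    "2015v2.1": {
        "organization_name": "./ReturnHeader/Filer/BusinessName",
        "total_revenue": "./ReturnData/IRS990/TotalRevenueAmt",
    },
}

def standardize(xml_text: str) -> dict:
    """Map a versioned return onto one standardized dictionary of fields."""
    root = ET.fromstring(xml_text)
    paths = FIELD_PATHS[root.get("returnVersion")]
    return {field: root.findtext(path) for field, path in paths.items()}

print(standardize(sample_990))
```

Keeping the version-to-path mapping in data, rather than code, is what lets a library like this absorb new schema versions without rewriting its logic.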
Some speakers shared examples of how processing large amounts of documents and text had played roles of varying importance in their reporting: from surfacing the trends or outliers that became the centerpiece of the story, to providing contextual or initial information that was necessary to produce the story later on.
Janet Roberts explained how in 2014 Reuters analyzed 14,400 U.S. Supreme Court petitions using a series of computational tools, including machine learning (LDA topic modeling) to identify the topics in the petitions, and Open Calais, a Thomson Reuters-owned document-analysis service, to identify the companies that petitioned the court. Their work revealed how a cadre of well-connected attorneys had honed the art of getting the Supreme Court to take up cases.
Mike Tigas shared a couple of examples from ProPublica, in which text data helped the team get halfway in their search for the story:
- How U.S. Commanders Spent $2 Billion of Petty Cash in Afghanistan used data assembled from several different Department of Defense databases by the Special Inspector General for Afghanistan Reconstruction and provided to ProPublica under a FOIA request.
- Dollars for Docs was created by processing payment reports released by the Centers for Medicare and Medicaid Services, which ProPublica compiled into a single, comprehensive database.
Steven Rich from The Washington Post talked about the fatal police shootings database that his organization has been keeping since 2015, and how scraping and analyzing news reports has helped surface information that reporters then fact-check.
Anthony Pesce from the L.A. Times spoke about the production of a story based on the automated re-evaluation of LAPD crime reports, which led to the discovery of a long history of serious assault underreporting in that city.
Jeremy Merrill explained how ProPublica had used hundreds of thousands of press releases issued from 2015 to the present to train a computer model to identify the topics that are distinctive to each member of Congress. The tool contributes content to Represent, the organization’s news app tracking different aspects of the work of the U.S. Congress.
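ProPublica’s model is more sophisticated than this, but the idea of surfacing terms distinctive to one member relative to the rest can be sketched with a tf-idf-style score in standard-library Python. The press-release snippets below are invented examples:

```python
import math
from collections import Counter

# Invented stand-ins for each member's concatenated press releases.
releases = {
    "Member A": "farm bill farm subsidies rural broadband farm",
    "Member B": "veterans healthcare veterans benefits rural clinics",
    "Member C": "tax reform small business tax cuts tax",
}

def distinctive_terms(docs: dict, top_n: int = 2) -> dict:
    """Rank each member's words by term frequency times inverse
    document frequency, so words everyone uses score near zero."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    n_docs = len(docs)
    df = Counter()                      # in how many members' corpora each word appears
    for c in counts.values():
        df.update(c.keys())
    result = {}
    for name, c in counts.items():
        scores = {w: tf * math.log(n_docs / df[w]) for w, tf in c.items()}
        result[name] = [w for w, _ in
                        sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
    return result

for member, terms in distinctive_terms(releases).items():
    print(member, terms)
```

With real press releases you would also strip stopwords and boilerplate, but the weighting idea is the same: a word matters for a member when they use it often and few others do.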