Technology for (investigative) journalism

Krzysztof Madejski
TransparenCEE network
Sep 21, 2016

Journalism is based on gathering evidence. Evidence is data. Whether it is public data, official statements, off-the-record talks or plain rumors, this data has to be stored, linked, analyzed, filtered and eventually visualized to create the groundwork for stories. This is even more true of investigative journalism, where stories are bigger in scale and involve many people, organizations, documents and the relationships among them. Stories are, and always will be, the key to journalism; technology can help build their base or augment them with data storytelling.

I will present some of the tools that journalists can use for such work. The first section is devoted to building the evidence base: how to organize information, what sources to start with, and how to collaborate with others. The second section presents tools that can be used to augment stories with data visualisations. This information is intended for both professional journalists and amateurs. Finally, I will mention sites dedicated to communities of amateur journalists that help form the growing trend of “citizen journalism.” And for dessert, I invite you to join me in imagining the future.

For all the tools and initiatives connected with investigative journalism in the CEE region please visit transparencee.org.

http://panamapapers.org/

How to build the evidence database?

The recent #PanamaPapers leak was one of the biggest in history, both in terms of document count (11.5 million files) and file size (2.6 TB in total). After a year-long analysis, in addition to publishing the stories behind the documents, the International Consortium of Investigative Journalists (ICIJ) shared some insights into the technologies they used to process such a large amount of data.

The first step was to extract data from a dozen document file types. That is an obligatory step in every process of opening up data, and a subject for a separate article. Let’s just mention that they used Apache Tika to extract content and metadata from all of the documents, and Tesseract to OCR the images (optical character recognition, that is, image-to-text).

More important are the tools used to make sense of the data: a database of public databases, tools to “google” through them, and tools for visualizing relationships. It seems that a common set of tools for journalists has just emerged: two big investigative journalists’ networks, the ICIJ and the OCCRP (Organized Crime and Corruption Reporting Project), have been using tools with the same functionality from different vendors. ICIJ has been using Project Blacklight (open source) to search through documents and Linkurious (commercial license) to visualize connections. OCCRP, on the other hand, has developed an independent solution called Investigative Dashboard (the search functionality is open sourced) that incorporates both features. Let’s discuss the two functions separately.

Searching through documents

There are two common types of searches supported by popular platforms:

Faceted search: you can narrow (facet) search results by choosing a specific value from a category associated with the items, e.g. documents mentioning Putin in a given language or, to give a commercial example, laptops with a specific screen size. This kind of search is only possible if we have metadata (such as language or date of publication) associated with the items we are searching through.

Full-text search (FTS) is another approach: it searches through the documents’ content rather than their metadata. Properly configured, FTS solutions take into account stemming and the declension of words, which is a huge help when searching across many languages. For example, in Polish one noun can have 48 declined forms instead of the two in modern English; imagine searching for all 48 using exact matching.
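To make the difference between the two approaches concrete, here is a minimal sketch in plain Python. It is not how Solr or Elasticsearch work internally; the sample documents, the field names (`lang`, `year`) and the toy `naive_stem` function are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: each document has text plus metadata fields.
DOCS = [
    {"id": 1, "lang": "en", "year": 2015, "text": "Offshore companies registered by the bank"},
    {"id": 2, "lang": "en", "year": 2016, "text": "The company was registering new accounts"},
    {"id": 3, "lang": "pl", "year": 2016, "text": "Rejestr firm i bankow"},
]


def facet_counts(docs, field):
    """Faceting: count how many documents carry each value of a metadata field."""
    return Counter(doc[field] for doc in docs)


def facet_filter(docs, **wanted):
    """Keep only documents whose metadata matches every requested value."""
    return [d for d in docs if all(d.get(k) == v for k, v in wanted.items())]


def naive_stem(word):
    """Toy English stemmer (an assumption, not a real algorithm such as Porter):
    lowercases the word and strips one common suffix."""
    word = word.lower()
    for suffix in ("ies", "ing", "ed", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Full-text index: map each stemmed token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc["text"].split():
            index[naive_stem(token)].add(doc["id"])
    return index


def search(index, query):
    """Return ids of documents that contain every (stemmed) query term."""
    hits = [index.get(naive_stem(term), set()) for term in query.split()]
    return set.intersection(*hits) if hits else set()


if __name__ == "__main__":
    print(facet_counts(DOCS, "lang"))
    print([d["id"] for d in facet_filter(DOCS, lang="en", year=2016)])
    index = build_index(DOCS)
    print(search(index, "company"))
```

Because both “company” and “companies” are reduced to the same stem, a query for one finds documents containing the other. The facet functions never look at the text, only at the metadata, which is why faceting depends on metadata being present.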

There are also some more advanced types of searches:

Regular expressions allow users to search for certain patterns of letters and numbers, for example passport or telephone numbers.

Search using word proximity allows you to run queries like “I want John and Doe within proximity 2”, which would return “John Doe”, “John whatever Doe” and “Doe, John” (example source).

Search using the so-called Levenshtein distance between words allows you to find words regardless of the spelling variants that may occur due to the OCR process, human error or purposeful actions intended to make automatic analysis harder. When used, such a technique will find “Vladmir”, “Vladymir” and “Vlaadimir” when searching for “Vladimir”.
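The three advanced search types above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the implementation used by any search engine: the `PHONE` pattern only covers one plausible international phone format, and `within_proximity` uses a simplified definition of proximity (at most `max_gap` words between the two terms, in either order), which is not identical to how Lucene counts it.

```python
import re

# Assumed pattern: international numbers such as "+48 22 123 45 67".
PHONE = re.compile(r"\+\d{1,3}(?:[ -]?\d{2,4}){2,5}")


def find_phones(text):
    """Regular-expression search: return every substring that looks like a phone number."""
    return PHONE.findall(text)


def within_proximity(text, first, second, max_gap):
    """Proximity search (simplified): True if both words occur, in either order,
    with at most `max_gap` other words between them."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == first.lower()]
    pos_b = [i for i, w in enumerate(words) if w == second.lower()]
    return any(0 < abs(i - j) <= max_gap + 1 for i in pos_a for j in pos_b)


def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character from b
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_find(text, target, max_distance=1):
    """Return the words in `text` within `max_distance` edits of `target`."""
    return [
        w for w in re.findall(r"\w+", text)
        if levenshtein(w.lower(), target.lower()) <= max_distance
    ]


if __name__ == "__main__":
    print(find_phones("Call +48 22 123 45 67 or +1 202-555-0143 today."))
    print(within_proximity("Doe, John", "John", "Doe", max_gap=2))
    print(fuzzy_find("Vladmir met Vladymir and Vlaadimir", "Vladimir"))
```

Each of the three misspellings from the example is exactly one edit away from “Vladimir”, so a maximum distance of 1 is enough to catch them. A larger threshold catches more variants but also more false positives.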


There are only a few open source platforms that support advanced search. For a long time, the Apache Lucene project, consisting of Lucene Core (an indexing and search technology) and Solr (a search server based on Lucene Core), was one of the few options. A few years ago, Elasticsearch arrived on the scene. It is a Solr competitor that is also based on Lucene Core.

The tools used by the ICIJ and OCCRP are also based on the above-mentioned search engines. Project Blacklight is a user interface for Solr, whereas the Aleph-based Investigative Dashboard uses Elasticsearch on the backend.

Navigating through relationships

The complexity of connections between people through companies, organizations, proxies and informal networks is vast. We need tools to easily navigate through these networks to find the relevant data. Complex networks of companies can be legitimately created to act on local markets, but can also be used to dodge taxes (legally or not) or launder money.
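Underneath, navigating such a network is a graph problem. The sketch below is only an illustration: the entities and relationships in `EDGES` are invented, and the breadth-first search is a generic textbook technique, not the algorithm of any of the tools mentioned in this article. It finds the shortest chain of connections between two subjects.

```python
from collections import defaultdict, deque

# Hypothetical relationships: (entity, relation, entity). All names are invented.
EDGES = [
    ("Person A", "director of", "Shell Co 1"),
    ("Shell Co 1", "shareholder of", "Shell Co 2"),
    ("Person B", "proxy for", "Shell Co 2"),
    ("Person C", "director of", "Local Firm"),
]


def build_graph(edges):
    """Build an undirected adjacency map: a connection can be followed both ways."""
    graph = defaultdict(set)
    for source, _relation, target in edges:
        graph[source].add(target)
        graph[target].add(source)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of entities linking
    `start` to `goal`, or None if they are not connected."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(shortest_path(graph, "Person A", "Person B"))
    print(shortest_path(graph, "Person A", "Person C"))
```

In this made-up data, Person A and Person B never appear in the same record, yet the search links them through two shell companies. That is the kind of indirect connection that is hard to spot by reading documents one at a time.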

Tools to visualize connections are used both to create stories and to present them.

The Proxy Platform by OCCRP presents stories about “billions of Euros [that have] circulated in the region in an illegal, parallel system that enriched organized crime figures and corrupt politicians. The system is built on hundreds, maybe thousands, of ever-morphing phantom companies. They exist on paper only and appear to be run by scores of common people, who are, in fact, simply proxies.” Viewers can navigate the networks and jump into stories in which particular subjects are involved.

The MojePanstwo open data portal by the ePanstwo Foundation presents connections between companies registered in Poland and their beneficial owners, boards of directors, etc.

ICIJ has used the Linkurious platform to analyze the data gathered from leaks and official documents.

Admin panel of Linkurious: https://source.opennews.org/en-US/articles/people-and-tech-behind-panama-papers/

Others create their own tools, like Open Corporates did to visualize corporate networks.

Map of the global Goldman Sachs corporate network assembled and visualized by Open Corporates — https://opencorporates.com/viz/financial/index.html

OCCRP has created its own tool, VIS (Visual Investigative Scenarios), to analyze the data and to share parts of it with its audience.

Publishing mass content

When your news site is viewed by only a few people, it can be hosted on nearly anything. When it is visited by millions, you need a devops/server admin department.

slide from the presentation of OCCRP representatives at Point 5.0 — “Technology, Journalism, and some Papers from Panama — Stories Published, Lessons Learned

I recently attended a presentation by Smári McCarthy and Michał Woźniak (both from OCCRP) at POINT 5.0, where they gave a glimpse into that process. They illustrated the need for static pages instead of WordPress, and for proper infrastructure on the backend, as they prepared to publish the Panama Papers stories. They expected far more traffic than even OCCRP’s recent big hit “Pussy for Putin” had drawn, but what came next went well beyond their estimates. I’m hoping that they will soon publish a technical guide on how to set up infrastructure to handle such loads.
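The idea behind static pages is that each article is rendered to an HTML file once, ahead of time, so the web server only hands out files instead of running application code and database queries on every visit. The sketch below shows that idea in its simplest form. It is not OCCRP's actual setup, which the talk did not spell out here; the `PAGE` template, the article fields (`slug`, `title`, `paragraphs`) and the function names are all assumptions for illustration.

```python
from html import escape
from pathlib import Path
from string import Template

# Assumed, deliberately tiny page template.
PAGE = Template(
    "<!doctype html>\n"
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1>\n$body\n</body></html>\n"
)


def render_article(article):
    """Render one article dict to a complete HTML page, escaping all text."""
    paragraphs = "\n".join(f"<p>{escape(p)}</p>" for p in article["paragraphs"])
    return PAGE.substitute(title=escape(article["title"]), body=paragraphs)


def build_site(articles, out_dir):
    """Write one static .html file per article and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for article in articles:
        path = out / f"{article['slug']}.html"
        path.write_text(render_article(article), encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    # Hypothetical article used only to demonstrate the build step.
    demo = [{"slug": "story", "title": "A story", "paragraphs": ["First paragraph."]}]
    for path in build_site(demo, "public"):
        print("wrote", path)
```

The resulting directory can be served by any plain web server or cached by a content delivery network, which is what makes this approach attractive when traffic is hard to predict.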


How to visualize data?

There are a number of tools that journalists can use to visualize data and embed it in the articles they create. There are catalogs of such tools, so I’m providing links to them instead of listing the tools here:

Some of the examples fall more into rich media content than data visualization; Story Builder by Georgia’s JumpStart is a good example of such a tool.

The future

How will investigative journalism and fact checking develop in the future? It’s hard to tell, but some people are giving it their best shot. One example is the science-fiction writer David Brin in his story “The Smartest Mob”. Give yourself a break and read the excerpt below. The full version is available under the CC BY-NC-ND 4.0 license as part of the Pwning Tomorrow compilation by the Electronic Frontier Foundation.

Disinformation, a curse with ancient roots, had been updated with ultramodern ways of lying. Machoists and other bastards might plant sleeper AIs in a million virtual locales, programmed to pop out at a pre-set time and spam every network with autogenerated “plausibles”… randomly generated combinations of word and tone that were drawn from recent news, each variant sure to rouse the paranoic fears of someone.

Mutate this ten million times (easy enough to do in virtual space) and you’ll find a nerve to tweak in anyone.

Citizens could fight back, combatting lies with light. Sophisticated programs compared eyewitness accounts from many sources, weighted by credibility, offering average folk tools to re-forge Consensus Reality, while discarding the dross. Only that took time. And during an emergency, time was the scarcest commodity of all.

This is Tor Pleiades, investigative reporter for MediaCorp — credibility rating seven-hundred and fifty-two — aboard the passenger zep Spirit of Chula Vista. We are approaching the DC Beltway defense zone. That may put me at a right place-time to examine one of the reffer rumors.

I request a smart mob coalescence. Feedme!

Calling up a smart mob was tricky. People might already be too scattered and distracted by the rumor storm. The number to respond might not reach critical mass — in which case all you’d get is a smattering of critics, kibbitzers and loudmouths, doing more harm than good. A negative-sum rabble — or bloggle — its collective IQ dropping, rather than climbing, with every new volunteer to join. Above all, you needed to attract a core group — the seed cell — of online know-it-alls, constructive cranks and correlation junkies, armed with the latest coalescence software, who were smart and savvy enough to serve as prefrontals… coordinating a smart mob without dominating. Providing focus without quashing the creativity of a group mind.

[…]

“I’m here,” she murmured, breathlessly, toward any fellow citizen whose correlation-attention AIs would listen.

We recognize you, Tor Pleiades, intoned a low voice [..]. We have lit a wiki. Can you help us check out one of these rumors? One that might possibly be a whistle-blow?

The conjoined mob-voice sounded strong, authoritative. Tor’s personal interface found good credibility scores as it coalesced. An index-marker in her left peripheral showed two-hundred and thirty members and climbing — generally sufficient to wash out individual ego.

Sources

[EDITED] Other leads/initiatives to check out

Originally published at transparencee.org.

Krzysztof Madejski
TransparenCEE network

Postgrowth, civic engagement, transparency, tech. Working at @epforgpl as @codeforall Coordinator; @Transparen_CEE. Po polsku na https://bit.ly/2EGWxSF