Credit: Hans-Peter Gauster on Unsplash.

We cracked the Panama Papers with 400 human brains. Can AI help us next time?

A new partnership between journalists and Stanford machine learning scientists aims to enhance the investigative reporting process. Here’s what we’ve learned so far.

As we approach the third anniversary of the Panama Papers, the gigantic financial leak that brought down two governments and drilled the biggest hole yet in tax haven secrecy, I often wonder what stories we missed.

Panama Papers provided an inspiring example of media collaboration across borders and using open-source technology at the service of reporting. As one of my colleagues put it: “You basically had a gargantuan and messy amount of data in your hands and you used technology to distribute your problem — to make it everybody’s problem.” He was referring to the 400 journalists, including himself, who for more than a year worked together in a virtual newsroom to unravel the mysteries hidden in the trove of documents from the Panamanian law firm Mossack Fonseca.

Those reporters used open-source data mining technology and graph databases to wrestle 11.5 million documents in dozens of different formats to the ground. Still, the ones doing the great majority of the thinking in that equation were the journalists. Technology helped us organize, index, filter and make the data searchable. Everything else came down to what those 400 brains collectively knew and understood about the characters and the schemes, the straw men, the front companies and the banks that were involved in the secret offshore world.

If you think about it, it was still a highly manual and time-consuming process. Reporters had to type their searches one by one in a Google-like platform based on what they knew.

What about what they didn’t know?

Panama Papers reporters work together at the German newspaper Süddeutsche Zeitung, which first obtained the data.

Fast-forward three years to the booming world of machine learning algorithms that are changing the way humans work, from farming to medicine to the business of war. Computers learn what we know and then help us find unforeseen patterns and anticipate events in ways that would be impossible for us to do on our own.

What would our research look like if we were to deploy machine learning algorithms on the Panama Papers? Can we teach computers to recognize money laundering? Can an algorithm differentiate a legitimate loan from a fake one designed to shuffle money among entities? Could we use facial recognition to more easily pinpoint which of the thousands of passport copies in the trove belong to elected politicians or known criminals?

The answer to all of that is yes. The bigger question is how we might democratize those AI technologies, today largely controlled by Google, Facebook, IBM and a handful of other large companies and governments, and fully integrate them into the investigative reporting process in newsrooms of all sizes.

One way is through partnerships with universities. I came to Stanford last fall on a John S. Knight Journalism Fellowship to study how artificial intelligence can enhance investigative reporting so we can uncover wrongdoing and corruption more efficiently.

Democratizing AI

My research led me to Stanford’s Artificial Intelligence Laboratory and more specifically to the lab of Prof. Chris Ré, a MacArthur genius grant recipient whose team has been producing cutting-edge research on a subset of machine learning techniques called “weak supervision.” The lab’s goal is to “make it faster and easier to inject what a human knows about the world into a machine learning model,” explains Alex Ratner, a PhD student who leads the lab’s open source weak supervision project, called Snorkel.

The predominant machine learning approach today is supervised learning, in which humans spend months or years hand-labeling millions of data points individually so computers can learn to predict events. For example, to train a machine learning model to predict whether a chest X-ray is abnormal or not, a radiologist might hand-label tens of thousands of radiographs as “normal” or “abnormal.”

The goal of Snorkel, and weak supervision techniques more broadly, is to let ‘domain experts’ (in our case, journalists) train machine learning models using functions or rules that automatically label data instead of the tedious and costly process of labeling by hand. Something along the lines of: “If you encounter problem x, tackle it this way.” (Here’s a technical description of Snorkel).
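To make the idea of labeling rules concrete, here is a minimal sketch in plain Python. It does not use Snorkel's actual API, and the three rules, the loan fields and the sample records are all hypothetical, invented only for illustration. It also combines the rules with a simple majority vote, which is a simplification: Snorkel itself estimates how reliable each rule is before combining them.

```python
from collections import Counter

# Label values (hypothetical; chosen for this sketch).
SUSPICIOUS, LEGITIMATE, ABSTAIN = 1, 0, -1


# Each "labeling function" encodes one rule of thumb a reporter might hold.
# A rule either votes for a label or abstains when it has nothing to say.
def lf_zero_interest(loan):
    return SUSPICIOUS if loan["interest_rate"] == 0 else ABSTAIN


def lf_no_repayment_schedule(loan):
    return SUSPICIOUS if not loan["has_repayment_schedule"] else ABSTAIN


def lf_regulated_lender(loan):
    return LEGITIMATE if loan["lender_is_regulated_bank"] else ABSTAIN


LABELING_FUNCTIONS = [
    lf_zero_interest,
    lf_no_repayment_schedule,
    lf_regulated_lender,
]


def weak_label(loan):
    """Combine the rules by majority vote; abstain on ties or no votes."""
    votes = [lf(loan) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the rules disagree evenly
    return ranked[0][0]


# Made-up loan records, not taken from any real dataset.
loans = [
    {"interest_rate": 0.0, "has_repayment_schedule": False,
     "lender_is_regulated_bank": False},
    {"interest_rate": 4.5, "has_repayment_schedule": True,
     "lender_is_regulated_bank": True},
    {"interest_rate": 0.0, "has_repayment_schedule": True,
     "lender_is_regulated_bank": True},
    {"interest_rate": 3.0, "has_repayment_schedule": True,
     "lender_is_regulated_bank": False},
]

labels = [weak_label(loan) for loan in loans]
print(labels)
```

The labels produced this way are noisy, but they can be generated for millions of records at once and then used to train a model, which is the trade-off weak supervision makes against hand-labeling.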

“We aim to democratize and speed up machine learning,” Ratner said when we first met last fall, which immediately got me thinking about the possible applications to investigative reporting. If Snorkel can help doctors quickly extract knowledge from troves of X-rays and CT scans to triage patients in a way that makes sense, rather than leaving patients languishing in a queue, it can likely also help journalists find leads and prioritize stories in Panama Papers-like situations.

Ratner also told me that he wasn’t interested in “needlessly fancy” solutions. He aims for the fastest and simplest way to solve each problem.

In early January, my newsroom, the International Consortium of Investigative Journalists, and Ré’s Stanford lab launched a collaboration that seeks to enhance the investigative reporting process. To honor the “nothing needlessly fancy” principle, we call it Machine Learning for Investigations.

ICIJ reporters and Stanford research scientists use machine learning in a journalistic investigation.

For journalists, the appeal of collaborating with academics is twofold: access to tools and techniques that can aid our reporting, and the absence of commercial purpose in the university setting. For academics, the appeal is the “real world” problems and datasets journalists bring to the table and, potentially, new technical challenges.

Here are the lessons we have learned so far in our partnership:

So yes, you will be hearing from us either way!

There’s a ton of serendipity that can happen when two different worlds come together to tackle a problem. ICIJ’s data team has now started to collaborate with another part of Ré’s lab that specializes in extracting meaning and relationships from text that is “trapped” in tables and other strange formats (think SEC documents or head-spinning charts from ICIJ’s Luxembourg Leaks project).

The lab is also working on other, more futuristic applications, such as capturing natural language explanations from domain experts that can be used to train AI models (it’s appropriately called Babble Labble) or tracing radiologists’ eyes when they read a study to see if those signals can also help train algorithms.

Perhaps one day, not too far in the future, my ICIJ colleague Will Fitzgibbon will use Babble Labble to talk the computer’s ear off about his knowledge of money laundering. And we will trace my colleague Simon Bowers’ eyes when he interprets those impossible, multi-step charts that, when unlocked, reveal the schemes multinational companies use to avoid paying taxes.

In the meantime, we stay real. Nothing needlessly fancy.

Are you a journalist with a story idea or data that would benefit from machine learning? Are you a machine learning expert interested in collaborating with journalists? Reach out to me and let’s chat about ways to collaborate: @MarinaWalkerG

The following people are part of the Machine Learning for Investigations partnership between ICIJ and Stanford: Alex Ratner, Jason Fries, Jared Dunnmon, Mandy Lu, Alison Callahan, Emilia Diaz-Struck, Rigoberto Carvajal, Sen Wu.

JSK Class of 2019

Insights and updates from members of the John S. Knight Journalism Fellowships Class of 2019 at Stanford University

Written by Marina Walker Guevara

JSK Fellow ’19 @ Stanford. International Consortium of Investigative Journalists. Artificial intelligence for investigations. #PanamaPapers #ParadisePapers