Coordinating leak ops for Panama Papers
Edit: I wrote this last night, before the Wired article below was published. It now seems we won’t have access to the full dataset unless it’s leaked to the public by one of the parties who has received it. HT https://twitter.com/kissane for the correction.
What we’re going to get instead appears to be a big graph, which may total only a few hundred megabytes or less.
http://www.wired.com/2016/04/reporters-pulled-off-panama-papers-biggest-leak-whistleblower-history


** OLD CONTENT BEGINS HERE
The Panama Papers full dataset will be released in early May, according to ICIJ. [edit: the specific language is: “ICIJ will release the full list of companies and people linked to them in early May”, which may or may not be the entire dataset].
At 2.6TB, the Panama Papers leak provides an interesting opportunity to consider how the news / OSS / civic tech / hacktivist communities might coordinate their efforts.
Those with expertise in distributed computing and in working with datasets this large can get to work building data pipelines for personal use. But is there a way to make hacking on the dataset more accessible to the broader developer community? If there is, it seems like a great test case:
- Previous ICIJ datasets led to real-world consequences / arrests / policy changes, so there’s a high potential for impact / strong public interest component
- Scanned documents already OCR’d
- ‘Many eyes’ helpful as there are millions of documents to sift through
- The data is rich in time-series, graph, and object/entity structure, which is great for visualization and should attract developers and students for a long time to come
- While it’s a big dataset by OSS community standards, it’s not big at all by industry standards
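To make the graph / entity / time-series point concrete, here is a minimal sketch of what working with that kind of release might look like. The record shapes and field names below are purely illustrative (loosely inspired by ICIJ's earlier Offshore Leaks release of entities, officers, and edges), not the actual Panama Papers schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records in the rough shape of an entities/officers/edges
# release. Field names are illustrative, not the real schema.
entities = [
    {"id": "E1", "name": "Example Holdings Ltd",
     "incorporated": date(2001, 3, 14), "jurisdiction": "PAN"},
    {"id": "E2", "name": "Sample Trust SA",
     "incorporated": date(2008, 7, 2), "jurisdiction": "BVI"},
]
officers = [{"id": "O1", "name": "J. Doe"}]
edges = [
    {"from": "O1", "to": "E1", "rel": "director of"},
    {"from": "O1", "to": "E2", "rel": "shareholder of"},
]

# Build an adjacency list, so "many eyes" questions like
# "which companies is this officer linked to?" become one lookup.
links = defaultdict(list)
for e in edges:
    links[e["from"]].append((e["rel"], e["to"]))

names = {r["id"]: r["name"] for r in entities + officers}
for rel, target in links["O1"]:
    print(f"{names['O1']} is {rel} {names[target]}")
```

Even this toy version shows why the dataset should be fun to play with: the same records support graph traversal, entity resolution, and incorporation-date timelines.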
A fun idea that came to mind was a SETI@home-style setup with https://github.com/syzer/JS-Spark (they market the package as a kind of ‘distributed lodash’, though I haven’t played with it), where people could contribute browser tabs’ compute to the effort.
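The @home-style idea boils down to a partition/map/merge shape: split the corpus into work units, farm each unit out to a volunteer node (a browser tab, in the JS-Spark case), and merge the partial results. I haven't used JS-Spark and its actual API surely differs; the toy below only sketches that shape, with threads standing in for volunteer tabs and a made-up term-counting task standing in for real analysis.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# Toy corpus standing in for document text; in the real setup each
# work unit would be a slice of the leaked documents.
documents = [
    "shell company registered in panama",
    "panama trust linked to shell company",
    "offshore account in panama",
]

def work_unit(doc):
    # Each volunteer node counts terms in its slice.
    return Counter(doc.split())

# Threads stand in for contributed browser tabs.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(work_unit, documents))

# The coordinator merges the partial results.
totals = Counter()
for p in partials:
    totals += p
print(totals["panama"])  # -> 3
```

The appeal is that the merge step is cheap and order-independent, so nodes can join and drop out freely, which is exactly the property a volunteer browser-tab swarm would need.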
How might we help each other out? Some open questions:
- What role might news organizations play, individually or collectively, in ops for leaks?
- What infrastructure do news organizations already have after a year of exploring? Will they OSS it along with the data (please)?
- What arrangement of storage, processing and hosting would be most beneficial for encouraging play in the broader community?
- Can infrastructure and processes be preserved for future leaks of this magnitude?
- What further questions need to be addressed?