Using Aleph: The Times data journalist on why he prefers the platform
George Greenwood, a data journalist with the UK-based newspaper The Times, is one of many journalists outside the OCCRP network who use Aleph in their investigations.
Major investigations are almost always based on documents, lots of documents, that reach journalists as a hodgepodge of Xerox-stained PDFs, images, and other formats that aren’t even digital. To make sense of messy data sets, journalists can upload them onto Aleph, OCCRP’s investigative data platform, which supports optical character recognition (OCR).
George Greenwood, a data journalist with The Times, used Aleph when he joined our #29Leaks investigative series, which was based on leaked records obtained from the transparency collective Distributed Denial of Secrets in 2019.
In this Q&A, Greenwood explains why he continues to upload documents to Aleph, and how he thinks the platform could help other journalists. This interview has been edited for clarity and brevity.
OCCRP: How would you describe Aleph to other journalists?
George Greenwood: I’d describe Aleph as another library of content that journalists should check when doing background research on a person or a company. I have a list of 10 data sources I run down when I’m backgrounding a target, and Aleph is top of the list, followed by ICIJ’s Offshore Leaks, LexisNexis, PACER, and a few others.
That’s what I’d say to a non-tech person. For my colleagues in data journalism, I’d focus on its strengths as an archival tool to host leaks.
When I was first introduced to Aleph, I immediately noticed the platform’s OCR (optical character recognition) and natural language processing, which make it easy to find keywords and entities in large data sets.
After reading the documentation behind it — all open sourced — I figured I might as well use it for other investigations, especially when I have big sets of PDF flat files.
In the past, I used Google Drive’s R interface to upload datasets. The problem was it only works for smaller files. Anything over a couple of pages wouldn’t OCR at all in Drive. With Aleph, I can type one line of code and it will OCR an entire folder. I can then share it where I need to.
I prefer Aleph to DocumentCloud mainly because of the interface, but also because of the ability to safely search through databases that aren’t open to the public for confidentiality or legal reasons.
OCCRP: Have you used Aleph for other reporting projects besides our #29Leaks investigation?
George Greenwood: Yes. A good example is Prevention of Future Deaths Reports. In the UK, when a coroner investigates a death under suspicious circumstances and finds failings by a public sector body that could have prevented the death, the coroner can issue one of these reports. But they’re horribly formatted.
You’ll find them on the government website usually as picture PDFs. Some are missing headlines and other descriptive copy. Aleph is great at collating these poorly formatted records. This story was a direct result of my work with these documents.
The platform has also been useful for going through government contracts, something I’m still working on. In the UK, we have a system called “Contracts Finder,” but it’s a nightmare. Contracts are sometimes attached as PDFs, sometimes as pictures, and sometimes not attached at all, in a variety of formats.
It makes it impossible to search unless you know exactly what you’re looking for. Now, I make sure to strip all the contracts I may need using a scraper, then upload them to Aleph so I can search through them quickly.
I’m also using Aleph for our work on Aquind, an energy project co-run by former Russian arms company executive Alexander Temerko, though some of this is yet to go out. I put all the project planning documents into Aleph, which has helped us find concerns filed by local residents about the project.
OCCRP: What do you think about the concept of public-facing databases? Should there be limits to what financial documents the average person can have access to?
George Greenwood: I think if there’s no good reason to not release this stuff, we should. If we’re telling the public what’s true and what’s not, they will want to know why.
I think journalists often withhold documents out of fear they got things wrong. So I’m very in favor of publishing anything where there are no legal concerns. Obviously, you have to be careful when documents come from confidential sources, but loads of material doesn’t have that issue.
I also just think it’s fascinating to look at the original documents, and I believe there’s a real interest from the public for this type of access.
OCCRP: Do you see data tools like Aleph as reducing the need for researchers and archivists in newsrooms?
George Greenwood: Not really. I think these platforms free up journalists to be more investigative. Now, reporters don’t need to spend a week at a local library looking at clippings; instead, they can make calls and form a story rather than worry about the administrative prep work.
I think there’s a sea change in journalism right now. I didn’t start as a data journalist; I started out as a specialist in FOIA (Freedom of Information Act) requests. But knowing these tools were out there drove me to learn to code, to learn how to use these resources, and to become the best investigative journalist I could be.
I hope newsrooms push young journalists to do the same, because there are free stories out there just waiting to be told.