How to eat an elephant

The OCCRP Team
Published in OCCRP: Unreported · Sep 30, 2020

Data journalism meets digital forensics

The brutal assassination of Martina Kušnírová and Ján Kuciak in 2018 shook Slovak politics. Kuciak, a 27-year-old investigative reporter, had uncovered multiple cases of corruption and fraud. A Slovak court has since found two men guilty of committing the crime, even as businessman Marian Kočner, accused of masterminding the attack, was found not guilty.

In late 2019, reporters at the Ján Kuciak Investigative Center in Bratislava and the Czech investigative reporting platform Investigace.cz, both OCCRP member centers, were given access to the complete police case file on the two murders. Slovak police and Europol had collected a total of 53 TB of material. The full archive included digital copies of seized computers and phones, footage from security cameras and other evidence. This is when OCCRP’s data team got involved, helping to transfer and analyze the information.

The volume and nature of this data forced us to study and apply multiple digital forensics techniques in order to make the material accessible to reporters. It has also given us a more systematic understanding of how experts in related fields — especially law enforcement — handle evidence. We’re writing some of these lessons up here, in the hope that other projects may benefit from them.

1. Data has gravity

The first thing we realized was that even in 2020, large volumes of data move slowly. Copying or verifying 53 TB of data takes days, if not weeks: at a sustained gigabit per second, a single pass over the full archive takes roughly five days. Even a plain listing of filenames and checksums would come out at 5 GB — a DVD’s worth of data.

After multiple rounds of traveling from Bratislava and Prague to Berlin with backpacks full of data, we decided to rent the largest server we could find and use it as a central staging ground. (Special thanks to CZ.NIC for letting us upload the source drives directly from their internet backbone!)

Instead of thinking of data as immaterial and global, the logistics of this project forced us to accept that data has gravity. In fact, sometimes it’s easier to move a few people than a lot of files.

2. Getting an overview

We received data on eight source drives, to which we assigned radio codes: Alpha, Bravo, Charlie, and so on. To get an overview of what was on them, we ran `hashdeep`, a command-line tool that will traverse a directory of files and generate file sizes and checksums for each of them.
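For reference, an invocation along these lines (the mount point and output file name are illustrative) walks a drive recursively and records file sizes, checksums, and relative paths:

hashdeep -c md5,sha1 -r -l /mnt/alpha > alpha.hashdeep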

The resulting list of files showed that the two largest chunks of data were security camera footage and imaged copies of computers and phones (as well as Xboxes, drones, and night-vision cameras) seized by the police. A smaller portion was data provided by external parties, such as bank account details or phone logs.

3. Working structure

Before proceeding to extract this data, we decided to put an overall strategy in place. After some discussion, we discarded the option of loading the whole dataset into Aleph. We also rejected a plan to build a custom web application to let reporters search the file metadata. This was mainly in the interest of time: A simpler solution was needed.

Instead, we ended up creating a folder structure with three main sections: `original/` would contain a read-only copy of the files as received, while `work/` would contain a full copy of the data that we incrementally unpacked and cleaned.

A third folder, `desktop/`, would provide a landing page for reporters accessing the data. Here, our data reporter Ada Homolova would generate file system symlinks with descriptive names (for example, `work/bravo/011/048/IMG.E01.unpacked` might be linked to `desktop/Suspect 1’s laptop`).
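A minimal sketch of that layout, using the example paths above (the symlink target is relative to the `desktop/` folder):

mkdir -p original work desktop
chmod -R a-w original                  # keep the as-received copy read-only
ln -s "../work/bravo/011/048/IMG.E01.unpacked" "desktop/Suspect 1's laptop"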

4. Where’s the meat?

The reporters in the project let us know that the contents of phones and computers were crucial to their investigation. So we set out to unpack the images so they could be browsed without additional tools.

The images were in two main formats: Cellebrite UFDR and EnCase Expert Witness files (E01).

The images from Cellebrite, a tool used to extract data from cell phones, proved a simple target: they were renamed .zip files. Each UFDR file also contains an XML manifest and a SQLite database holding contacts, messages, and other metadata from the phone. We couldn’t resist making a conversion tool that turns Cellebrite data into Aleph’s FollowTheMoney format.
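Since a UFDR container is just a renamed ZIP, a quick first look needs nothing beyond standard tools (the file names below are illustrative):

unzip -l phone01.ufdr                          # list the container contents
unzip phone01.ufdr -d phone01.unpacked         # unpack everything
sqlite3 phone01.unpacked/report.db ".tables"   # peek into the bundled SQLite database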

But a larger part of the evidence consisted of about a hundred binary disk images in the EnCase Expert Witness format. We ended up with a manual process for pulling the files out of each partition in these images, using the command-line `ewf-tools`:

mkdir extracted mounted partitions
ewfmount *.E01 partitions/            # exposes the E01 image as a raw file, partitions/ewf1
partx -av partitions/ewf1
mount /dev/loop{N}p{N} mounted        # fill in the loop device and partition numbers
rsync -r mounted/ extracted

We also found a significant number of .ISO files in the evidence, representing CD or DVD media. This script extracted all of them at once:

#!/bin/bash
BASEPATH=$PWD
TMPDIR="$BASEPATH/__uniso";
find . -type f -name "*.iso" -print0 |
while IFS= read -r -d '' FILE; do
  rm -rf "$TMPDIR";
  mkdir -p "$TMPDIR";
  DESTFILE="${FILE%.*}";
  7z x -o"$TMPDIR" "$FILE" && rm "$FILE" && mv "$TMPDIR" "$DESTFILE";
done

We repeated this process for UFDR and ZIP files.
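The same loop generalizes to any container format 7z can open; a sketch, parameterized by extension:

#!/bin/bash
# Sketch: unpack every archive with the given extension in place, then remove the original.
unpack_all() {
  find . -type f -name "*.$1" -print0 |
  while IFS= read -r -d '' FILE; do
    DEST="${FILE%.*}"
    TMP="$DEST.__unpacking"
    rm -rf "$TMP" && mkdir -p "$TMP"
    7z x -o"$TMP" "$FILE" && rm "$FILE" && mv "$TMP" "$DEST"
  done
}
unpack_all zip
unpack_all ufdr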

5. Cleaning house

At this point we had extracted most of the package formats in the investigation, but we were running out of disk space.

We decided to conduct another run of `hashdeep` to identify files that could be removed. This included all files with no content, very small image files (e.g., icons), and a large collection of pirated ’90s Hungarian made-for-TV movies (impressive mustaches).
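The empty files and tiny images can also be caught directly on the file system; a sketch (the size threshold and extensions are arbitrary):

find . -type f -empty > empty_files.list
find . -type f -size -5k \( -iname '*.png' -o -iname '*.gif' -o -iname '*.ico' \) > small_images.list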

We also decided to manually delete some specific files by name and extension:

find . -name desktop.ini -exec rm {} \;
find . -name '*.dll' -exec rm {} \;

Taking this one step further, we decided to use the NIST Reference Data Set (RDS) hash list: a public list, published by the US National Institute of Standards and Technology, of files known to be included in commercial software and operating systems.

While there is software available to apply such hash lists, we decided to use a PostgreSQL database to keep things simple:

# Download the hashes and load them all into a single table:
pip install pgcsv
pgcsv --encoding latin-1 --db postgresql://localhost/project hashes android/NSRLFile.txt
pgcsv --encoding latin-1 --db postgresql://localhost/project hashes ios/NSRLFile.txt
pgcsv --encoding latin-1 --db postgresql://localhost/project hashes rds/NSRLFile.txt

# Load the hashdeep output after manually removing the hashdeep header and putting in CSV column headers:
pgcsv --encoding latin-1 --delimiter "|" --db postgresql://localhost/project files hashdeep.out.psv

Some data prep allowed us to find all the deletable files using a simple JOIN statement:

-- De-dupe hashes:
ALTER TABLE hashes ADD COLUMN id SERIAL PRIMARY KEY;
DELETE FROM hashes a USING hashes b WHERE a.id < b.id AND a.sha_1 = b.sha_1;

-- Convert the hashes from hashdeep to uppercase:
ALTER TABLE files ADD COLUMN sha1up TEXT;
UPDATE files SET sha1up = UPPER("sha1");

-- And then find the overlap:
SELECT f.filepath FROM files f JOIN hashes h ON h.sha_1 = f.sha1up;

That list could finally be turned into deletion commands like this:

while IFS= read -r file; do rm -- "$file"; done < delete.list

We also considered using `fdupes`, a command-line tool that finds (and can delete) duplicate files across folders. Eventually we decided against using `fdupes` on the police file because deduplication would have made information harder to find by navigating the folders. The following command has served us well with other document sets, though:

fdupes -r -m -S SOME/ OTHER/

Finally, the deletion process had left us with a number of empty directories that we could delete by running the following command a few times:

find . -type d -empty -delete
[Charts: the composition of the data before cleaning, and after unpacking and cleaning]

6. Mapping the structure

The data was organized in a folder structure that didn’t reveal anything about the content. So as we were cleaning and unpacking, we were also going through the data, folder by folder, to identify the content. Many of the folders contained data from one of the suspects’ devices, so we noted their names and device type in a spreadsheet. Using the sheet, we finally created a new folder structure that identified the owner of each device. This allowed journalists to dig deeper into the aspects of the evidence they found most interesting.
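A sketch of how such a sheet can be turned into the new layout, assuming a CSV export with columns for source path, owner, and device (the column layout and file name are illustrative):

while IFS=, read -r SRC OWNER DEVICE; do
  mkdir -p "desktop/$OWNER"
  ln -s "../../$SRC" "desktop/$OWNER/$DEVICE"
done < device_index.csv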

7. Guessing a lot

Another line of work concerned a set of password-protected files found in the evidence. Details of how we reverse-engineered some of these passwords exceed the scope of this post, but two things must be said: It’s nice to have GPUs; and the US prosecutors’ strategy of charging Julian Assange for attempting to crack a password will have chilling effects for journalists around the world.

8. Access room

While the work of cleaning the data and creating a meaningful index of the material was ongoing, we also needed to address another question: How are reporters going to access this material?

This discussion was dominated by concern that uncontrolled release of the material could infringe the privacy of those involved in the ongoing court case, and also threaten the case itself.

Eventually, we settled on the idea of a data access room (later dubbed “Kočner’s Library”). We discussed operating the room in Vienna, Prague, or on diplomatic grounds, but finally found a secure location in Bratislava.

Inside the room, we set up a group of journalist viewing stations based on Silverblue, an immutable Fedora-based Linux distribution. The stations are laptops that are physically stripped of their Wi-Fi, Bluetooth, and GSM cards, and have their USB ports disabled. A router connects them to a data server via a `wireguard`-based VPN, but does not allow access to the public internet. The viewing stations mount the evidence into their local file system using `sshfs`, which required a fair bit of tuning:

sshfs -o follow_symlinks -o idmap=user -o cache=yes,cache_timeout=3600 -o kernel_cache -o Compression=no -o ServerAliveCountMax=3 -o ServerAliveInterval=15 -o reconnect -f user@server:/ ~/Desktop/evidence

9. Guidance and take-out

Access to this room was offered to a group of select media organizations from Slovakia. Their reporters were welcomed by a “librarian,” a member of the OCCRP or ICJK teams who provided guidance on how to use the setup and allowed the journalists to take data out of the system for offline study.

Reporters could select a set of files into a “shopping cart” and then check them out via a special laptop held by the attendant that would copy them onto an encrypted USB drive.

This proved to be one of the most challenging aspects of the project to implement in practice, not least because of bugs in the VeraCrypt disk encryption utility. VeraCrypt also proved to be a bad home for large volumes of data in other parts of the project. Future OCCRP projects are going to favor operating system mechanisms like Linux’s LUKS or Apple’s APFS for larger datasets.
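For reference, setting up a LUKS-encrypted take-out drive only takes a few commands (the device name is illustrative, and the formatting step destroys whatever is on the drive):

cryptsetup luksFormat /dev/sdX1          # create the encrypted container
cryptsetup open /dev/sdX1 takeout        # unlock it as /dev/mapper/takeout
mkfs.ext4 /dev/mapper/takeout            # put a file system on it
mkdir -p /mnt/takeout
mount /dev/mapper/takeout /mnt/takeout   # mount it for copying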

10. Collaboration

Alongside this access room, we also operated a secure Wiki that let people share their findings from the evidence, and we uploaded some curated subsets of the material into Aleph to make them searchable from any location.

A lot of the interesting data was in the form of instant messages from many sources: WhatsApp, Viber, Threema, SMS, and iMessage. To emulate the most natural way of reading a chat conversation, we exported these chats from the phones, and our member center investigace.cz created an offline chat viewer for browsing them.

The smartphones also produced a lot of location data, but there was no way to view it without uploading it to a cloud service. To prevent a data breach, we built a simple, (nearly) offline map viewer: the data is stored locally, but the underlying map tiles come from an online source. The source code is on GitHub.

People take pictures with their phones constantly: on some of the devices in the leaked data, we found folders with as many as 50,000 pictures. To filter out the screenshots and photos of documents, we used a neural network that does a great job of separating them from the ordinary photos in a folder.

Conclusions

Pulitzer Prize winner Stephen Doig once said, “Data journalism is social science on a deadline.” Many of the decisions we made in this project were driven by the need to make the most of a massive dataset in a short amount of time (ca. 6–8 weeks).

What allowed us to make good progress in that time frame was a commitment by OCCRP leadership to provide the necessary resources (a single set of the source drives required 4,500 euros’ worth of hardware). We’re not aware of any broadly available emergency funds that would provide support for projects like this.

In retrospect, we should have invested more time in identifying the most relevant sections of the dataset at the very beginning. The vastness of the material demanded a depth-first approach over a breadth-first processing of the whole archive.

However, this wasn’t easy: We didn’t start off with an index of what was in each device, and we had no idea what organizing principle police had applied to the drives.

Of course, the description of the project in this post is extremely limited: it doesn’t speak to any of the findings from the data, and it ignores the incredible ingenuity that the reporters on the project have shown in tackling parts of the data on their own. In an effort like this, everybody becomes a data wrangler.

What it makes clear, however, is that journalists — especially those involved in large-scale leaks — can learn a lot from the digital forensics community in law enforcement and the private sector about how to tap into corpora of valuable but chaotic evidence.
