How to verify leak data?

As a journalist or researcher, you may be handed leak data. But how can you be sure it's the real deal? This is a leak data verification checklist for investigators.

Techjournalist
19 min read · Nov 20, 2023

I come into the office every morning. I use the espresso machine. I make myself a double espresso with a bit of milk. The coffee comes from a local coffee roastery. The milk from the fridge. When I check the fridge, I often find a plain milk carton. It could be from anyone. People use my coffee. I use their milk. That’s how it works for me.

A few times now, I chanced it and, instead of milk, all my espresso received was chunky bits. A smell of old eggs. It happened again. The milk was bad and I poured it. It ruins my morning. How can you tell whether you have bad milk in the fridge? You can smell it, taste it, read the best-before date, or shake it for chunky bits. Checking whether your leak data is trustworthy is similar. At SZ we get a lot of leaked data. Here are my thoughts on how to spot whether the leak data is “off”.

Journalists are handed leak data all the time. Leaks have produced some of the most consequential news stories of the past decade. For newspapers like the Süddeutsche Zeitung, it is certainly a golden era. But bad actors, especially since the Russian attack on Ukraine, have increased the risk of faulty leak data. Wrong information, misleading intelligence or simply malware-infected data can ruin the integrity of an investigation.

Files from Russia (and other leak data projects)

On Vulkan Files: Special thanks to the Paper Trail Media team, who performed a huge part of the work, including Hannes Munzinger and all the wonderful colleagues there.

One of the recent projects involved an anonymous source. He sent thousands of leaked documents from a company with alleged ties to the delicate Russian security and military sector. The documents, all in Russian, came in the form of emails (txt files), word-processing documents (Word files), PDFs, tabular data (xls files), folder structures and so on.

We will cite some of those instances without going into more detail. More important is to get across which checks we applied and which methodologies we used to be sure upfront that we are dealing with the real deal.

Nothing is more harmful to an investigative journalism career these days than publishing a story based on fake or tampered data leaks.

Always good for a checklist

If you have followed some of my other guides, you know I am a sucker for a good checklist. It prevents you from forgetting anything. You can always skip items. And it’s simple and straightforward.

  1. Metadata: Date it!
  2. Check content: Graphs/Graphics/photos
  3. Sense-check any names and organizational links
  4. Examine emails in the data — leak data with Breachdata
  5. Physical Addresses and other companies
  6. Social media to confirm leak content
  7. Sense of purpose: Why do we get documents
  8. “Selection Bias”: What is missing in the data?
  9. Final Verdict: Are the data real and risk-free?

1. Metadata: Date it!

To test roughly when the documents were created or updated, we examine the files' metadata. The caveat: metadata can be manipulated. I can right-click on a PDF and change the attributes to my liking. But doing that for thousands and thousands of documents is cumbersome. That leads to the first big methodological question when testing leak data: how likely is it that someone did that? It would take heaps of people to manually alter the metadata of that many documents in a way that keeps the fraud undetected.

Now you might respond to that and say, perhaps it's AI generated. Some sort of algorithm that randomly assigned documents slightly different metadata creation dates. That is possible. So sense-check that.

The creation date contains a timestamp. Check whether it falls within the normal working hours of the timezone the leak came from, or whether the file was created at night or on the weekend. All these details can be faked, of course. But it's laborious work to organize. Keep in mind how much work it would be, and what the pay-off would be, for “the other side” to feed you false data.
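The working-hours check is easy to sketch in Python. This is a minimal sketch, not the code used in the investigation: the function name `outside_working_hours`, the 08:00 to 20:00 window and the fixed UTC+3 offset (Moscow time) are my assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def outside_working_hours(ts_utc, utc_offset_hours=3, start_hour=8, end_hour=20):
    """Return True if a timestamp falls on a weekend or outside office hours.

    utc_offset_hours=3 assumes Moscow time; adjust it for other regions.
    """
    local = ts_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    is_weekend = local.weekday() >= 5          # 5 = Saturday, 6 = Sunday
    is_off_hours = not (start_hour <= local.hour < end_hour)
    return is_weekend or is_off_hours

# Hypothetical creation timestamps, stored in UTC
stamps = [
    datetime(2023, 3, 1, 9, 0, tzinfo=timezone.utc),   # Wednesday, 12:00 local
    datetime(2023, 3, 4, 9, 0, tzinfo=timezone.utc),   # Saturday
    datetime(2023, 3, 1, 23, 0, tzinfo=timezone.utc),  # 02:00 local on Thursday
]
flagged = [ts for ts in stamps if outside_working_hours(ts)]
print(f"{len(flagged)} of {len(stamps)} timestamps fall outside working hours")
```

A handful of night-time files means little. A corpus where most files were "created" at 3 a.m. local time deserves a second look.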

Consider the following tools for checking metadata.

Tools of choice (some cheekily sourced from the wonderful OSINTessentials resource page):

Extract metadata from multiple images and store it in a list

Pdfminer

metadata2go.com: Online tool for viewing metadata, including image Exif data (metadata embedded within images) such as the camera settings used when taking a photograph, date and location information, and thumbnails. It can also be used for video, and tends to do better on this than other tools.

Jimpl: Exif viewer that can also be used to remove metadata

VerEXIF: Exif viewer that can also be used to remove metadata

Metadata Interrogator: Desktop metadata viewer that works offline and extracts metadata from multiple filetypes

If you want to automate the collection of metadata, a Python 3 script can collect the creation dates and store them in chronological order for analysis.
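A minimal sketch of such a collector, using only the standard library. It assumes the PDFs store their creation date as a plain `/CreationDate (D:YYYYMMDDHHmmSS...)` entry. PDFs that compress or encrypt that entry will be missed, so a dedicated library such as pdfminer is the more robust route. The function names are mine.

```python
import re
from datetime import datetime
from pathlib import Path

# PDF dates look like D:YYYYMMDDHHmmSS, optionally followed by a UTC offset
PDF_DATE = re.compile(rb"/CreationDate\s*\(D:(\d{14})")

def pdf_creation_date(raw):
    """Return the embedded creation date of a PDF (given as bytes), or None."""
    match = PDF_DATE.search(raw)
    if not match:
        return None
    try:
        # The UTC offset is ignored, so this is the local time as written
        return datetime.strptime(match.group(1).decode("ascii"), "%Y%m%d%H%M%S")
    except ValueError:
        return None  # malformed date string

def collect_creation_dates(folder):
    """Scan all PDFs below folder; return (date, path) pairs, oldest first."""
    rows = []
    for path in Path(folder).rglob("*.pdf"):
        created = pdf_creation_date(path.read_bytes())
        if created is not None:
            rows.append((created, str(path)))
    return sorted(rows)
```

Calling `collect_creation_dates("leak_dump")` on a (hypothetical) folder returns a list you can print or load into a spreadsheet. Run it on the isolated machine, not on your work laptop.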

Virus scans are always useful when working with leaked data, especially when journalists are sent data with the promise of a big scoop. The best advice here: set up a separate machine, one that perhaps connects via the Tor Project and runs two different antivirus packages. If the data arrives on an external hard drive, scan that data dump as a whole for viruses and malware.

If you find lots of malware, consider it a dealbreaker. No investigative journalist, let alone anyone with a keen interest in uncovering leaks, likes malware. It is a red flag. If you are good, you might be able to analyze the malware itself, and perhaps it will lead you to an adversary hacking group from Russia. Either way, at this point we would seriously consider how to proceed.

“This_document_is_a_real_russian_leak.pdf”

The last thing to consider in this part is the filenames. Renaming files sensibly is cumbersome. If you receive a big stack of documents named after a numbering ID system rather than intuitively, it is worth asking why. Someone spent time doing that for you. That might be another red flag.

The name should describe what it is about. Think about how you name your files. Sense-check that.

How I do it: I write a script that extracts all the filenames of a leak corpus by type, which makes the analysis handier.
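A minimal sketch of such a filename script. The function names and the "digits only" heuristic for machine-generated names are assumptions for illustration.

```python
from collections import defaultdict
from pathlib import Path

def filenames_by_type(folder):
    """Group all filenames below folder by lower-cased extension."""
    groups = defaultdict(list)
    for path in Path(folder).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(no extension)"
            groups[ext].append(path.name)
    return {ext: sorted(names) for ext, names in groups.items()}

def looks_machine_named(name):
    """Rough heuristic: the stem consists only of digits, e.g. 000123.pdf."""
    return Path(name).stem.isdigit()
```

Counting how many names per type pass `looks_machine_named` gives a quick feel for whether the corpus was renamed in bulk.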

2. Check content: Graphs/Graphics/photos

#OSINT isn't #OSINT without reverse image search. In the cited project on leaked Russian company files, there weren't really any great images involved. Instead, there were a bunch of illustrations of a technical nature, mainly of system architecture, modules and so on. Reverse image search in combination with Google Dorking (image search plus a search term) for the module names can turn out to be fruitful.

Though Yandex reverse image search often struggles with mind-maps or graphical representation, color palette and structures may make it stand out enough to get some hits on the open Russian web.

Sometimes, similar illustrations are present on open presentation platforms. However, a clever Google Dork search may offer better results than a reverse image search (you might be surprised how much Russian intelligence leaks onto platforms such as Slideshare).

Reverse image search that (source)

Reverse image searches of the company logo or those of partner organizations helped us early on to sense-check the data. It gave us the confidence to proceed.

Sometimes you might be lucky, and a PDF may actually present you with maps. Coordinates and places can be fact-checked. Be very alert there! Any kind of map could help understand context.

In the case of the Russian files cited above, we assume the location shown (the geo-location of an atomic power plant in Switzerland) was a mere mockup, an example that referenced one location. We are not sure. I remember it made me cross because we lacked the context to know what it meant. For the value of a leak, this kind of cross-checking could be a telling factor.

A few years ago, I sense-checked data from a leaked document by a government agency of a country in Europe. The photos of the document (we did not have the original) showed satellite images with illustrations drawn on them. There were no location names or descriptions of where this air base was. They showed a number of potential war targets where bombs were to be dropped. It was a document for a planning exercise. However, as we figured out, it was for a merely fictitious defense scenario. The satellite image showed an airport runway, as shown in the satellite pic below.

I told myself that if I could work out where in the world this runway was (the exact geo-coordinates), it would have some meaning for the investigation into the leaked documents. In the end, the runway was in Asia, far from any meaningful war targets for that country (thanks at this point to @obretix, who advised in this case and pointed to https://cryptome.org/eyeball/ as a great source for geo-OSINT).

I used some sort of reference database of international airport runways. I compared and found the right one. Perhaps you can find it, too. (I won't tell, as it would spoil the fun for you to find it. Try it yourself and let me know in the comments).

A satellite shot leaked in one of the graphics in a secret leaked defense document

The point, however, was that from the secret military planning document, it became clear that this satellite shot was “taken” from Google Earth or Google Maps at the time when the document was produced. It offered a chance to roughly “chronolocate” creation date.

The satellite image was easily matchable to those on Google Earth Pro (click the historical imagery button and compare). There aren't that many images available, but color composition and appearance usually make it pretty easy to nail down the right one.

For the leak data investigation, it meant that the satellite graphic could not be produced before the image appeared on Google Earth. It is a smoking gun argument if someone wants to make you believe that the documents were produced long before the satellite image was taken.

Background on this case: An attack on an EFTA nation is apparently imminent, at least this is what the leaked classified defence document suggests. To prevent the attack, heavily armed fighter jets of that country rise into the air to bomb opposing targets abroad, for example a bridge and an airfield. These fictitious war mission scenarios are described in internal documents from the country's armaments authority. They are tasks that the manufacturers of several fighter jets involved in the bidding process for the defence procurement had to present to the government. According to the documents, the objective was to test the gun systems and check how fit the jets were for their missions.

3. Sense-check any names and organizational links

To verify whether a leak features any truth and validity, the simplest thing is to cross-reference names.

In the leaked files on the Russian cyber start-up, there were names of members of the management and of government organizations. There was a hunch that these were, in fact, Russian security agencies and military units that acted as clients of that cybersec company.

The first sense check involved sampling names by copying them manually out of the PDFs or Excel sheets. We then threw those names, with search operators, into Yandex.

Important for Russia is the confirmation of a person’s ИНН (tax) number. It makes it a traceable ID and also links to entities.

Once you have names confirmed, Russian open-source platforms allow in-platform searches to cross-check company and relevant tax information:

List-org.com, Companies.rbc.ru and sbis.ru, among others

The investigation found out that military units 33949 (SVR unit) and 64829 (FSB Information Security Center) were major clients of the company in question.

By knowing some names of clients, addresses of research institutes or regions where people are registered, we could sense-check whether the names mentioned in the leaked documents carry any weight.

Remember that faking names is an option. But also ask yourself how much trouble someone who intends to trick you would go through to make up fake identities.

On a side note: it is worth considering a script that automatically pulls all the Russian names out of .pdf, .xls and .txt documents. I approached the subject like a busy journalist with little time, and tasked ChatGPT 3 with drafting the start of a simplified Python 3 script that runs this search on a number of PDFs. This way, Russian names mentioned in the PDF documents can be listed and then used in Google/Yandex Dorking scripts.

It ignores, however, Russian or English language. This is one way to get a sizeable collection of names out of the documents to cross-check them.

Pulling Russian names out of a PDF file, an example of how to collect evidence on whether a leak dataset is trustworthy or not. It is worth checking out what else the Python library PyMuPDF can do to help analyze leak data.
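A minimal sketch of such a name search, assuming the text has already been extracted from the PDFs (for example with PyMuPDF). The pattern is my own heuristic: it looks for surname, first name and patronymic in the nominative case, so declined names ("Ивановичем") and two-part names are missed. Treat the output as candidates to check, not as a clean list.

```python
import re

# Surname + first name + patronymic, all capitalized Cyrillic words.
# The patronymic endings are a heuristic and only cover the nominative case.
NAME = re.compile(
    r"\b[А-ЯЁ][а-яё]+\s+[А-ЯЁ][а-яё]+\s+[А-ЯЁ][а-яё]*(?:вич|вна|ична|ьич)\b"
)

def extract_names(text):
    """Return unique full-name candidates in order of first appearance."""
    seen = {}
    for match in NAME.finditer(text):
        # Collapse line breaks and repeated spaces inside a match
        seen.setdefault(" ".join(match.group().split()), None)
    return list(seen)

# Made-up sample sentence, not taken from any leak
sample = "Директор Иванов Иван Иванович и бухгалтер Петрова Анна Сергеевна подписали акт."
print(extract_names(sample))
```

Each candidate can then be wrapped in quotes and fed to a Yandex or Google search by hand or by script.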

Generally, open-source information on Russian military units remains scarce. Russia does not want it to leak. However, there is some stuff out there. Take gostevushka.ru.

Cross-checking some leading figures and organizational structure in Russian military and security entities, might help. There are a number of reports that present names and organizational links —notably the RAND report on the Russian general staff (2023).

Ask yourself whether what you find in the leak roughly matches the organizational structure. If not, add it to the red flags!

Linking names that might appear in leak documents with organizational structures of the Russian military complex

Signatures

Another attempt to verify the identities of who signed secret documents is to analyze their signatures. Signatures are also on passport and ID documents. If you have access to Russian Telegram bots that allow accessing such data, cross-checking such documents may add weight to the validity.

One example in which a signature in a document, in this case an academic thesis, could be successfully forensically matched with one on a Russian passport

Like with facial comparisons, there are tools to compare signatures, such as on fileproinfo — alternatively, some systems can do this already in real time (such as signotec “Biometrics API”).

Word-Doc analysis

We have done some PDF metadata analytics. But what many people forget is that Word documents in particular can be a rich source of intelligence. For dating (finding out when a document was worked on) and for checking who contributed changes or made comments, Word documents can be of great value in leak data analytics.

I am by no means an expert on #OSINT with Word docs. However, it helps to draw on guidance from OSINT fellows such as Neil Smith, an investigative researcher and OSINT trainer (see his guide on the hidden forensic secrets of documents). Here is what Smith says Microsoft files reveal…

Neil Smith on the value of Microsoft Word OSINT

In fact, Microsoft is so painfully aware of how much data lingers in its documents that it preventatively warns its users on its website.

For evaluating the authenticity of leak data, Word Docs therefore are useful. In the case of one investigation, one user left a name and comments in the documents. It was a staff member. He could be found on social media and later tracked down. He could be consulted whether the rest of the data was genuine.
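Author names and dates can be read from a .docx file with nothing but the Python standard library, because a .docx is a ZIP container with its core properties stored in `docProps/core.xml`. This is a minimal sketch under that assumption. It does not cover legacy .doc files, comments or tracked changes, and the function name is mine.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml inside a .docx container
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}

def docx_core_properties(docx):
    """Read author and date fields from a .docx file (path or file object).

    Missing fields come back as None; a file without docProps/core.xml
    raises KeyError.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("docProps/core.xml"))
    return {key: root.findtext(tag, default=None, namespaces=NS)
            for key, tag in FIELDS.items()}
```

Running this over every .docx in the corpus and counting the distinct `author` and `last_modified_by` values quickly shows which usernames keep turning up.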

Tools to remember: FOCA

FOCA is a tool used mainly to find metadata and hidden information in the documents it scans. These documents may be on web pages, and can be downloaded and analysed with FOCA. It is capable of analysing a wide variety of documents, with the most common being Microsoft Office, Open Office, or PDF files, although it also analyses Adobe InDesign or SVG files, for instance. (Github)

4. Examine emails in the data — leak data with Breachdata

Similar to the example before, we can write Python regex code to pull email addresses out of data such as PDFs. We can then cross-check whether these emails have been “pwned”, that is, whether they show up in a data breach.

If they do, we can check the authenticity of these actors and perhaps get more context where they work and how they might fit into the picture of the leak. There might also be phone numbers and social media profiles that help us to contact them as journalists, later. An invaluable source for journalists, if you ask me.
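A minimal sketch of the email extraction step, assuming the text has already been pulled out of the documents. The pattern is a pragmatic one rather than a full validator, and the sample addresses are made up.

```python
import re
from collections import defaultdict

# Pragmatic pattern, not a full RFC 5322 validator
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return unique, lower-cased email addresses grouped by domain."""
    by_domain = defaultdict(set)
    for address in EMAIL.findall(text):
        address = address.lower()
        by_domain[address.split("@", 1)[1]].add(address)
    return {domain: sorted(addresses) for domain, addresses in by_domain.items()}

# Made-up sample text
sample = "Contact support@example.ru or Ivan.Petrov@example.ru. Press: info@example.org"
print(extract_emails(sample))
```

Grouping by domain helps to separate company addresses from private mailboxes before checking them against breach data.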

Let us consider a simple example of a company email domain. The support email of a company in focus and the email of its founder and CEO both appear in public data breaches.

We use the Hunter tool of the company Constella Intelligence, an external threat intelligence service.

A view on the Constella Hunter platform — the company domain and email addresses show up in several other breaches

A check on the now sanctioned company NTC Vulkan shows there are a number of members of staff listed in breach data. A number of details were disclosed.

For the initial check, details found in other breach data (on names, numbers, IP addresses, social media usernames, or domains) can help to validate the content in the leak data.

If what you find here on Hunter, amidst the 15 billion breach data points that Constella hosts, does not match the leak data, then add that to the red flags.

5. Physical Addresses and other companies

Any mention of addresses in a leak document corpus can add context and perhaps more certainty about authenticity.

While we might not be able to reference and fact-check every name, Russian military unit or even every company, a street address can only rarely be faked convincingly.

Yes, addresses can also just be fake and completely unrelated to the target. But we can check that. A Russian address has a certain structure. Governments and businesses usually obey these rules in official documents.

If we search for that on Yandex Maps, we will also get a listing for all the other entities based on this address. Let’s try this for the following address: Ibragimova Street, 31, Moscow, 105318. We find different businesses in the building and even a photo someone took of the entrance signs, showing various companies at this address.

Now, if we manage to extract all the relevant-seeming addresses from the documents, we could map them. Perhaps they cluster on proximity or overlap with business addresses that we can check on.

Perhaps it's a partner firm or a business branch. In the case of the Vulkan Files, one address of a business the founder had formed showed up, not far from where the firm is located today.

We got a feel for where the company operated, where it maintained relationships and where employees travelled to.

Example how to map out various addresses of leak documents

Let us take it one step further and try to automate.

The Russian Federation postal service recognizes the following structure for a postal address, as this guide explains. We can write a script that obeys these rules and recovers addresses from all the Russian documents.

To illustrate this point, we fed Hunter, the DDoSecrets platform, the address of the FSB (Lubyanskaya Square, 2).

We then recovered one PDF in Russian. It happens to be quite an interesting one, titled “Report on the holding of Navalny on April 21, 2021”.

Left, the address of the Russian FSB, right where it was mentioned in an example of a PDF with cyrillic content: Protests around Navalny’s incarceration

The Python 3 script helps to pull out all Russian street addresses, squares and regions mentioned in the PDF documents.

It spits out a list. If we run the script, we receive 7. Lubyanskaya Square (at the) FSB buildings (which we know was in there), but also Pushkinskaya Square, Arbatskaya station, Manezhnaya Square, the Siberian Federal District, the Volga Federal District, Palace Square and so on.

That can all be helpful in verifying whether the material is real (subsequently, it at least helped to verify that this material was real).

Script that pulls out of PDFs Russian place names and addresses (It is merely a starting point and can be further optimized)
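A minimal sketch of such an address search, assuming the text has already been extracted from the PDFs. The list of street-type words is my own selection and covers only the nominative case, so declined forms ("на Лубянской площади") and regions are not caught. The example addresses are the two mentioned in this article.

```python
import re

# Common street-type words and abbreviations (nominative case only; an assumption)
TYPES = (r"(?:улица|ул\.|площадь|пл\.|проспект|пр-т|переулок|пер\.|"
         r"набережная|наб\.|шоссе|бульвар)")
NAME = r"[А-ЯЁ][а-яё-]+"
# Optional house number such as ", д. 31" or ", 2"
HOUSE = r"(?:,?\s*(?:д\.\s*|дом\s+)?\d+[А-Яа-я]?)?"

# Matches both word orders: "ул. Ибрагимова, д. 31" and "Лубянская площадь, 2"
ADDRESS = re.compile(rf"\b(?:{TYPES}\s*{NAME}|{NAME}\s+{TYPES}){HOUSE}")

def extract_addresses(text):
    """Return unique address candidates in order of first appearance."""
    return list(dict.fromkeys(ADDRESS.findall(text)))

sample = "Адрес: ул. Ибрагимова, д. 31, Москва. ФСБ: Лубянская площадь, 2."
print(extract_addresses(sample))
```

A capitalized word at the start of a sentence followed by "улица" will produce false positives, so the list still needs a human eye before anything goes onto a map.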

We have already spoken about how addresses of companies mentioned in leak data can link to other businesses via Yandex Maps. There are more links to draw on, among them business partnerships, sponsorships and client-business relationships. All of these can be assessed with open-source techniques.

If those partnerships are NOT verifiable on the websites and press releases of other entities, add it to the red flags!

In the case of the Russian company we looked at, there were a good dozen connections with other entities, charities and firms that are either on a red list or sanctioned by Western nations (back when we drew up the map, the company itself was not yet sanctioned, but many of the entities it linked to were).

Various sanctioned entities connected to a Russian company that leaked data. In June, 2023, the company itself got sanctioned by the EU, a huge success for the journalist involved in the investigation

In Russia, there are more ways to draw links between businesses and other entities, including the government. The most prominent are court cases. These are usually openly accessible.

A record of a 2018 court case was very helpful. A Russian database showed that the founder of the dubious Russian firm maintained business ties to a Russian entity (a research institute with links to the armed forces), the Russian military complex.

It was a previous client. The company took legal action against it to recover the outstanding payments it alleged it was owed. Such information is freely available on the internet and can further substantiate the validity of names and entities in Russian leaks. Other useful Russian platforms are:

casebook.ru, in combination with list-org.com and websites such as Gostevushka.ru, to tell a bit more about the nature of the entity (if it has military links).

It did help to identify and confirm the nature of an entity called 18 TSNII (Central Research Facility), or military unit 11135, “which has been involved in SIGINT/ELINT research, including “developing the equipment for conducting and coding satellite reconnaissance activities,” — and is now involved in information security under the auspices of the Main [Intelligence] Directorate, as Andrei Soldatov and Irina Borogan wrote for CEPA (Center for European Policy Analysis).

“НТЦ Вулкан” site:https://casebook.ru — search for “FSUE “18 Central Research Institute” of the RF Ministry of Defense” — on https://casebook.ru/

6. Social media to confirm leak content

Even in the most secretive circles in the world, social media intelligence analysis is no longer a hopeless tool for verification. Any reference to social media profiles, accounts, hashtags or companies might help to lend the leak credibility.

In the case of that Russian leak, the technical documentation led us to assume it described a system for spreading disinformation, but we weren't sure whether the systems described were indeed applicable and already in use.

But with time, the team of investigators came across a hashtag in a document that hinted that disinformation campaigns had been tested, at least in part.

A Twitter X hashtag (#htagreact, story here) in a screenshot led to the unearthing of hundreds of Twitter accounts, with pretty odd — potentially automated — behavior.

7. Sense of purpose: Why do we receive documents?

Question the personality of the data source: What’s his motive? How does the source behave?

We usually look for a motive, that fits the bill. It can be easily faked, especially if the source doesn’t want to unveil his identity. But they often do offer something, perhaps a username, or an acronym, that might lead to a username on social media (an extensive guide on usernames here).

For a background search on the source, there are then ways to proceed. Important is that through the search, the source remains protected. In the end, it's important that those who work on the leak can trust the source. And that the source can trust the journalists involved.

If anything breaks that trust, add it to the red flags.

8. “Selection Bias”: What is missing in the data?

Something so trivial as the structure of the folders that contain the documents, may, in some instances, help verify the leak’s authenticity.

Consider this. If a company or an individual collates data on their computers, it is seldom in a single folder, and seldom cleanly organized with a consecutive file structure. There are directories and subdirectories. It is usually complicated. Messy.

If it is a personal computer the data comes from, there is a “desktop folder”, perhaps a “documents folder” and there is data on programs and so on. Suffice to say, it’s chaotic. That messiness is difficult to fake. Add it to the red flags if it appears too clean!

And what if there is anything missing? Is there an empty folder, for instance? Is there a void, where there should be something? Chart the folder structure and check.
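Charting the folder structure takes only a few lines. This is a minimal sketch using the standard library. The function name and the output format are assumptions for illustration.

```python
import os

def chart_folders(root):
    """Return (tree_lines, empty_dirs) describing the folder structure below root."""
    lines, empty = [], []
    for current, dirs, files in os.walk(root):
        dirs.sort()  # visit subfolders in a stable, alphabetical order
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        label = str(root) if rel == "." else os.path.basename(current)
        lines.append(f"{'  ' * depth}{label}/ ({len(files)} files)")
        if not dirs and not files:
            empty.append(rel)  # a folder with nothing in it at all
    return lines, empty
```

Printing `"\n".join(lines)` gives an indented tree with file counts, and `empty` lists the folders where something may have been removed.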

Examples of typical folder file structure of a personal computer. Within businesses, it may look different, also in government.

Regarding missing data, we arrive back at metadata analysis. There, we checked the creation dates of the leak data. We would then usually build some sort of timeline from the creation dates of the files. If there are huge “jumps” or gaps in the creation dates, it is worth considering whether these gaps are intentional. Add it to the red flags.

I couldn't resist the temptation to write a Python script. It takes all the files, runs a metadata check on them, stores the creation-date timestamps in a data frame, sorts them by date and outputs the result.

This way, it's pretty easy to spot any oddities, gaps or duplicates. Once you know when data was created, OSINT research on what happened in the world parallel to it, can help in understanding the context the data might have been created in. (we built here a timeline of when the documents were created).
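The gap and duplicate check on such a timeline can be sketched without any data-frame library. This is a minimal sketch: the 30-day threshold, the function names and the sample timestamps are assumptions, and a plain list of `datetime` values stands in for the data frame.

```python
from collections import Counter
from datetime import datetime, timedelta

def find_gaps(timestamps, min_gap=timedelta(days=30)):
    """Return (start, end) pairs where consecutive timestamps lie min_gap or more apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= min_gap]

def find_duplicates(timestamps):
    """Return timestamps that occur more than once, with their counts."""
    return {ts: n for ts, n in Counter(timestamps).items() if n > 1}

# Hypothetical creation dates
stamps = [
    datetime(2021, 1, 11, 10, 5),
    datetime(2021, 1, 12, 14, 30),
    datetime(2021, 1, 12, 14, 30),   # identical timestamp
    datetime(2021, 6, 1, 9, 0),      # after a long silence
]
for start, end in find_gaps(stamps):
    print(f"gap between {start:%Y-%m-%d} and {end:%Y-%m-%d}")
print(find_duplicates(stamps))
```

Many files sharing the exact same second usually point to a bulk copy or export rather than to real authoring, which is worth knowing before reading too much into the dates.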

Example of a timeline

9. Final Verdict: Are the data real and risk-free?

There is no such thing as risk-free. All we can do is follow the list stringently and stay alert to oddities. But a bit of structure in finalizing the analysis of the authenticity of leak data can help. Here is a rating system: from minus 1 (failed the test or shows red flags), to 0 (not applicable or not relevant), to 1 (passed the test). A final score brings some method to the madness. In this test case, we could probably proceed with a more profound investigation.
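The rating system is simple enough to keep in a spreadsheet, but it can also be sketched in Python. The ratings below are hypothetical and only illustrate the arithmetic. Where to draw the line for "proceed" remains a judgment call for the team.

```python
def final_score(ratings):
    """Sum the ratings and list every check rated -1 as a red flag."""
    for check, value in ratings.items():
        if value not in (-1, 0, 1):
            raise ValueError(f"{check}: rating must be -1, 0 or 1, got {value}")
    red_flags = [check for check, value in ratings.items() if value == -1]
    return sum(ratings.values()), red_flags

# Hypothetical ratings for the eight checks of the checklist
ratings = {
    "Metadata": 1,
    "Content (graphs, photos)": 1,
    "Names and organizational links": 1,
    "Emails vs. breach data": 1,
    "Physical addresses": 1,
    "Social media": 0,
    "Sense of purpose": 0,
    "Selection bias": -1,
}
total, red_flags = final_score(ratings)
print(f"score {total} of {len(ratings)}, red flags: {red_flags}")
```

Keeping the red flags as a separate list matters: a decent total should not hide a single failed check that deserves a closer look.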

An example of how to score various aspects of a data leak

On Russian Translation of PDFs:

While this is not directly related to verifying the authenticity of leaks, it is worth considering how valuable PDF translation is on the fly to leak data analysis.

I recommend the translation to run locally — on your local machine. You might purchase a local DeepL account. Or consider running the plugin system below.

For images, if they are of sensitive nature, there are python libraries that run locally. For not so sensitive material, there is Google Images OCR and translation.

One way to run a translation service on really, really long PDFs


Techjournalist

Investigative journalist with a technical edge, interested in open source investigations, satellite imgs, R, python, AI, data journalism and injustice