Using AI and machine learning to find clues for journalists

Rickard Andersson
The SVT Tech Blog
5 min read · Sep 29, 2020


With the recently published FinCEN Files, the journalists working with the leaked documents realized that they needed help finding and structuring information. The International Consortium of Investigative Journalists (ICIJ) turned to Sveriges Television (SVT) and its data journalism team for help.

Background

The leak contains suspicious activity reports, SARs, filed by banks to FinCEN, the US Treasury’s Financial Crimes Enforcement Network. The documents were obtained by BuzzFeed News and reviewed by the ICIJ. The reports follow a pattern, but different bank officials write their reports in different ways.

It was of great interest to the members of the project to create a database of all the transactions mentioned in the roughly 2,100 reports included in the leak. Our mission became to extract data from free text, identifying entities such as companies, banks, account numbers, sums of money and dates. Besides this, it was also interesting to connect the information to the people, organizations and countries mentioned in the reports.

The database could then be used to look up a country, a person or a company and get links to all the documents worth examining further. Our intention was never to create a perfect result; instead we were aiming for a simple way into the leak.

There was also a time aspect — we needed to solve this within a couple of weeks — since the journalists needed as much time as possible to search the database and find interesting clues.

Documents

The reports are written in different ways depending on the author. This variation makes automated analysis difficult. Let’s look at a fictitious example:

Imagine we have a company called X Ltd that makes one or more suspicious transactions to company Y Ltd through bank Z Banking. This could be written in any of the following ways:

“At 12/18/2012 X Ltd sent three (3) transactions, totaling $700,000, through Z Banking to a single beneficiary Y Ltd.”

“X Ltd sent three (3) transactions, totaling $700,000, through Z Banking to a single beneficiary Y Ltd between 12/18/2012 and 12/19/2012.“

“From 12/18/2012 to 12/19/2012 X Ltd sent three (3) transactions, totaling $700,000, through Z Banking to a single beneficiary Y Ltd.”

“At 18/12/12 Y Ltd received three (3) transactions, totaling $700,000, through Z Banking from a single originator X Ltd.”

“At 12/18/2012 X Ltd transferred $700,000 in three separate transactions, in range from $5,000 to $570,000, with account #ABC123 at Z Banking via Q Banking with the final beneficiary Y Ltd.”

A common way to handle this kind of problem would be to use a number of regular expressions to find a transaction and its parts. But since there is a virtually unlimited number of ways to describe a transaction, we couldn’t go down that route. Instead we chose to use machine learning and a Named Entity Recognition (NER) model. The model learns from examples what a bank, a date or a sum is.
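As an illustration (not the project’s actual code), this is roughly what applying a spaCy NER model to one of the example sentences looks like. The stock English pipeline is used here for simplicity; the project trained its own model, as described below.

```
import spacy

# Assumes the pretrained English pipeline `en_core_web_sm` is installed;
# the project used a custom-trained model, so labels and accuracy differ.
nlp = spacy.load("en_core_web_sm")

text = ("At 12/18/2012 X Ltd sent three (3) transactions, totaling $700,000, "
        "through Z Banking to a single beneficiary Y Ltd.")

doc = nlp(text)
for ent in doc.ents:
    # Expected labels include DATE, MONEY and ORG, but the exact output
    # depends on the model.
    print(ent.text, ent.label_)
```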

Working with real data

Anyone who has worked with real data knows that no data is perfect, but these reports were something out of the ordinary. Besides being written in several different ways, the descriptions of the transactions were also buried in lots of other text. The structure and content of the text varied extensively between reports. Sometimes the text was written entirely in upper case and without punctuation. All this made it very hard to identify where a sentence, or a paragraph, starts or ends. The following example shows that it can be difficult even for a human to tell what is what:

“Z BANKING CREATE A REPORT AT 12/18/2017 INCLUDED FOUR HUNDRED (400) TRANSACTIONS OF A TOTAL OF $13 020,021 OF THEM X LTD TRANSFERED $700,000 IN THREE SEPARATE TRANSACTIONS IN RANGE FROM $5,000 TO $570,000, WITH ACCOUNT #ABC123 AT Z BANKING VIA Q BANKING WITH THE FINAL BENEFICIARY Y LTD TWO (2) TRANSACTIONS WERE SENT FROM COMPANY AS AT 2/20/2012 WITH A TOTAL OF $35.000 THROUGH ACCOUNT #CBA321 AT Z BANKING WITH A SINGLE BENEFICIARY Y LTD”

We had to identify a number of key words and phrases to be able to break the text into manageable parts or sentences. For example, phrases like “also please note”, “however,” or “this search identified” were used to split sentences apart. Our goal was not to find grammatically correct sentences; instead we wanted to break the text into pieces, each containing information about just one transaction. We found new phrases and new ways to divide the information into entities as we processed more and more documents with huge amounts of text.
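A minimal sketch of what that splitting step could look like, using the phrases mentioned above as split points (the helper name and the regular-expression approach are illustrative):

```
import re

# Illustrative list; the project collected many more phrases over time.
SPLIT_PHRASES = ["also please note", "however,", "this search identified"]

def split_into_chunks(text):
    """Break a block of report text into smaller chunks at known key phrases."""
    pattern = "|".join(re.escape(phrase) for phrase in SPLIT_PHRASES)
    parts = re.split(pattern, text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]
```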

How did we process the documents?

The leak contained a number of PDF files, each with one or more reports. Each report was converted to text. Some of the contents were formatted as tables, and we could extract some information directly from those.
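The article doesn’t say which tools converted the PDFs; a minimal sketch using the pdfminer.six library could look like this (the folder layout and function name are assumptions):

```
from pathlib import Path

from pdfminer.high_level import extract_text  # assumes pdfminer.six is installed

def pdfs_to_text(folder):
    """Convert every PDF in a folder to plain text, keyed by file name."""
    texts = {}
    for pdf_path in Path(folder).glob("*.pdf"):
        texts[pdf_path.name] = extract_text(str(pdf_path))
    return texts
```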

All the text found in a report was split into smaller parts and analyzed with the NER model. We classified a piece of text as a transaction if it contained at least one date, one sum and one organization or person. We also looked at other parts of the text; for example, the words “send” and “receive” determined which way the money went.
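A sketch of that classification rule, assuming spaCy’s standard entity labels; the helper names and the exact send/receive keywords are illustrative:

```
def looks_like_transaction(doc):
    """A chunk counts as a transaction if it has at least one date,
    one sum and one organization or person (the rule described above)."""
    labels = {ent.label_ for ent in doc.ents}
    return ("DATE" in labels
            and "MONEY" in labels
            and bool({"ORG", "PERSON"} & labels))

def transaction_direction(text):
    """Guess which way the money went from send/receive wording."""
    lowered = text.lower()
    if "receive" in lowered:
        return "incoming"
    if "send" in lowered or "sent" in lowered:
        return "outgoing"
    return "unknown"
```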

Training the NER model

To train the model we had to develop a tool for processing phrases and tagging them correctly.

The tagged sentences could then be used to train our NER model with spaCy. Our first model made a lot of errors, but by examining the errors we could add more examples, giving the model a better chance to learn and improve.
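The tagging tool and training pipeline aren’t shown in the article; a minimal sketch of training a NER model from tagged sentences, written against spaCy 3’s training API and with a made-up training example, could look like this:

```
import random

import spacy
from spacy.training import Example

def make_entry(text, tagged_spans):
    """Build a (text, annotations) pair; offsets are found from the substrings."""
    entities = []
    for substring, label in tagged_spans:
        start = text.index(substring)
        entities.append((start, start + len(substring), label))
    return text, {"entities": entities}

# Hypothetical tagged sentence; the real training data came from the reports.
TRAIN_DATA = [
    make_entry(
        "X Ltd sent three transactions totaling $700,000 through Z Banking on 12/18/2012",
        [("X Ltd", "ORG"), ("$700,000", "MONEY"), ("Z Banking", "ORG"), ("12/18/2012", "DATE")],
    ),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.initialize()
for _ in range(30):  # a few passes over the tagged sentences
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
```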

Normalizing names

Another challenge was the different ways a name can be written, both across reports and sometimes within the same report. For example, “Company Name Limited” could be written as “Company Name Ltd” or just “Company Name”.

We kept lists of company names to keep track of the entities we’d found and tried to connect the names describing the same company. We used the same process for names of people.
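A sketch of what that name normalization could look like; the suffix list and the exact rules are assumptions:

```
import re

# Common company suffixes to strip; an illustrative, not exhaustive, list.
SUFFIXES = r"\b(limited|ltd\.?|inc\.?|llc|corp\.?|plc)\b"

def normalize_name(name):
    """Reduce variants like 'Company Name Limited' and 'Company Name Ltd'
    to the same key so they can be connected."""
    cleaned = re.sub(SUFFIXES, "", name.lower())
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    return re.sub(r"\s+", " ", cleaned).strip()

# All three spellings end up under the same normalized key.
assert (normalize_name("Company Name Limited")
        == normalize_name("Company Name Ltd")
        == normalize_name("Company Name"))
```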

Our database was created as a tool for journalists, so it was crucial for us to include the source of each transaction. That made it easy to look up the original document and judge whether the extraction was correct or not. It was important to us to improve the data without destroying it.
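To make that concrete, a hypothetical extracted row could look like the one below; the field names and values are invented, but the point is that every transaction carries a reference back to its source document:

```
transaction = {
    "originator": "X Ltd",
    "beneficiary": "Y Ltd",
    "bank": "Z Banking",
    "amount_usd": 700000,
    "date": "2012-12-18",
    "source_document": "report_0042.pdf",  # hypothetical file name
    "source_text": "At 12/18/2012 X Ltd sent three (3) transactions ...",
}
```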

In the end we found 13,500 transactions and were able to identify 19,000 banks, companies and organizations in the reports.

Once we were done, we gave the results to all the journalists working on the project. They used the database to find news and stories in the leak. We don’t believe that machine learning can replace the work a journalist does, but we wanted to show that it can be a helpful tool for creating order in the chaos of a large pile of data.

At SVT this resulted in “SEB and the Fincen Files” (in Swedish).

Rickard Andersson
Helena Bengtsson
Fredrik Stålnacke
