Text Mining and eDiscovery for Big Data Audits

European Court of Auditors · #ECAjournal
Published 6 March 2020 · 14 min read

With quantitative data becoming more widely available, the step towards data mining appears to be a small one. However, in their performance audit work, auditors from public audit institutions are spending more and more time assessing policy efficiency and impact and reviewing policy documents and policy programmes, which are often formulated in qualitative terms, in text. Given the quantity of text and the diversity of sources available, more and more text mining techniques have been developed that can assist the auditor. Professor Jan Scholtes holds the Extraordinary Chair in Text Mining at the Department of Data Science of the Faculty of Science and Engineering of the University of Maastricht. He is also Chairman and Chief Strategy Officer of ZyLAB, a company developing software for eDiscovery and information risk management. He has been involved in deploying in-house eDiscovery software with organisations such as the UN War-Crimes Tribunals, the FBI-ENRON investigations, the EOP (White House) and the fraud investigators of EC-OLAF. Below he explores the different techniques of text mining and eDiscovery, and how they can help to find complex relations in electronic data sets.

By Professor Jan Scholtes, University of Maastricht

Source: Shutterstock.com / Jirsak

Recent developments in deep learning

To auditors, the field of data mining is undoubtedly better known than that of text mining. A good example of data mining is the analysis of financial transactions. A wealth of algorithms and analytical methods are available to find patterns of interest or fraudulent behaviour in such data sets.

However, 90% of all information is unstructured, in the form of text documents, e-mails, social media or multimedia files (speech, video and photos). Searching within or analysing this information with database or data mining techniques is not possible, as data mining tools only work on structured data organised in columns and rows, as used in databases.

In addition, fraudsters are increasingly knowledgeable about the inner workings of audit and compliance algorithms, so they tend to make sure that the transactional aspects of their actions do not appear as anomalies to such algorithms. The details of what is really going on can often only be found in auxiliary information such as e-mail, text messages, WhatsApp, agreements, side letters, voice mails or discussions in a forum, such as a chat box (1), or on the dark web.

Meta-data extraction (normal and forensic), machine translation, optical character recognition (OCR), audio transcription, and image and video tagging have reached highly reliable levels of quality thanks to recent developments in deep learning. Text can therefore be used as a good common denominator describing the content of all electronic data, regardless of format. The next step is to use techniques from the world of text mining and eDiscovery to assist today’s auditors. This is what we will discuss in more detail in this article.

ECA Journal Short Read

Text mining — The process of deriving normalized, structured data from large quantities of text, using AI techniques such as deep learning, with the aim of using the data for analysis and identifying patterns.

Text mining and audit — Besides quantitative data mining, text mining is increasingly being viewed as a useful asset in the audit process. Ninety percent of all information takes the form of textual documents, and tools such as meta-data extraction, machine translation, audio transcription, and image and video tagging are proving increasingly reliable in presenting this information in a structured manner, also to the auditor.

eDiscovery — Stands for legal fact-finding missions dealing with large amounts of electronically stored information, most of it dynamic and unstructured. eDiscovery is proving to be a standard approach when dealing with cases pertaining to regulatory requests, the General Data Protection Regulation, Freedom of Information Act requests, compliance investigations and the preparation of mergers and acquisitions.

eDiscovery and text mining — This combination is the perfect tool for early case assessment, offering many methods to understand the structure and content of large data sets at an early phase and to answer basic questions such as who, where, when, why, what, how and how much. Different text mining techniques can help here, such as Topic Modelling, Community Detection, Topic Rivers, Emotion Mining and Event Detection.

New techniques, also for audit — Combining the real-world data of eDiscovery technology with the isolation of patterns of interest in text mining will aid our understanding of complex relationships between variables in big data. These tools will help auditors and investigators to get to the essence of a case quickly and efficiently.

Text mining

The study of text mining is concerned with the development of various mathematical, statistical, linguistic and deep-learning techniques that allow the automatic analysis of unstructured information, the extraction of high-quality and relevant data, and making the complete text better searchable. High quality refers here, in particular, to the combination of relevance (i.e. finding a needle in a haystack) and the acquisition of new and interesting insights.

A textual document contains characters that together form words, which can be combined to form phrases. These are all syntactic properties that together represent defined categories, concepts, senses or meanings. Text mining must recognize, extract and use all this information. Using text mining, instead of searching for words, we do in fact search for syntactic, semantic and higher-level linguistic word patterns.

‘Text Mining: The next step in Search Technology — Finding without knowing exactly what you’re looking for and Finding what apparently isn’t there’ was the title of my inaugural lecture for the Extraordinary Chair in Text Mining at the University of Maastricht in 2009. There I explained the difference between consumer search engines such as Google, which are highly optimized for precision, and other search engines that are optimized more for recall. The latter are obviously better suited to investigators. High-precision search engines only find the best results, not all of the relevant ones. High-recall engines do find all of the potentially relevant ones, whilst preserving a reasonable level of precision. Precision — also called positive predictive value — is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of the total amount of relevant instances that were actually retrieved (2).
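To make these two measures concrete, here is a minimal sketch in Python that computes precision and recall for a set of retrieved document identifiers (the document sets are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one retrieval result.

    retrieved: set of document ids returned by the search engine
    relevant:  set of document ids that are truly relevant (the ground truth)
    """
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: the engine returns 4 documents, 3 of which are relevant,
# out of 6 relevant documents in the whole collection.
retrieved = {"doc1", "doc2", "doc3", "doc7"}
relevant = {"doc1", "doc2", "doc3", "doc4", "doc5", "doc6"}
print(precision_recall(retrieved, relevant))  # -> (0.75, 0.5)
```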

Internet search engines such as Google, but also Lucene, are fine-tuned to give the best or the most popular answer. Fraud investigators or lawyers do not only want the best or most popular answers; they want all possibly relevant documents. Furthermore, on the internet everyone does his or her best to get to the top of a search engine’s results list. Criminals and fraudsters do not want to be at the top of the results list in a search engine; they actively try to hide what they do. With text mining algorithms, we aim to find someone or something that does not want to be found.

Fraud investigators have another common problem: at the beginning of an investigation they do not know exactly what they must search for. Because encrypting their communication would raise a red flag to an auditor, fraudsters often communicate in plain, open text, using code words. Investigators do not know these specific code names, or they do not know exactly which companies, persons, account numbers or amounts they must search for. Using text mining, it is possible to identify all these types of entities or properties from their linguistic role, and then to classify them in a structured manner and present them to the auditor.

For instance, one can look for patterns such as ‘who paid whom,’ ‘who talked to whom,’ or ‘who travelled where’ by searching for linguistic matches. The actual sentences and words matching such patterns can then be extracted from the text of the auxiliary documentation and presented to the investigator. By using frequency analysis or simple anomaly detection methods (3), one can then quickly separate legitimate transactions from the suspicious ones, or identify code words.
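As an illustration, the following is a minimal sketch of a ‘who paid whom’ pattern search, using the open-source spaCy library for dependency parsing; the verb list and the example sentence are invented, and a production eDiscovery pipeline would be considerably more elaborate:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small pretrained English model, assumed to be installed

PAYMENT_VERBS = {"pay", "transfer", "wire", "reimburse"}  # illustrative verb list

def extract_payment_triples(text):
    """Return (subject, verb, object, sentence) tuples for payment-like verbs."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB" and token.lemma_ in PAYMENT_VERBS:
                subjects = [t for t in token.children if t.dep_ in ("nsubj", "nsubjpass")]
                objects = [t for t in token.children if t.dep_ in ("dobj", "dative", "attr")]
                for subj in subjects:
                    for obj in objects:
                        triples.append((subj.text, token.lemma_, obj.text, sent.text))
    return triples

# Invented example sentence
print(extract_payment_triples("Smith wired Jones 50 000 euro via the offshore account."))
```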

eDiscovery

Another set of interesting developments for auditors can be found in the field of eDiscovery. When discussing big data in relation to law, we may confidently state that legal fact-finding missions, also known as eDiscovery, deal with the biggest legal data collections of all. Today, an average eDiscovery case easily involves several terabytes of electronic data, holding hundreds of millions of documents with highly dynamic and completely unstructured information. These data sets consist of a variety of languages and distributed sources in many different electronic formats and shapes (including legacy and corrupted files); to put it bluntly: eDiscovery data is truly tough big data to deal with.

eDiscovery (also called electronic discovery) originated in the United States. It refers to the process of discovering facts in legal proceedings. Discovery is part of a pre-trial procedure under which one party can request evidence from the opposing party. Discovery is all about fact-finding and, ultimately, truth-finding. Over the years, eDiscovery (US style) has become the standard approach in Europe for use in cases such as arbitration, answering regulatory requests, (internal, government and criminal) investigations, Freedom of Information Act (FOIA) requests, public records requests, compliance investigations, preparation of mergers and acquisitions (M&A) and, recently, also Right to be Forgotten requests under the General Data Protection Regulation (GDPR) (4).

eDiscovery and text mining for Early Case Assessment

The combination of eDiscovery and text mining is the perfect tool for the more strategic application known as early case assessment. When an organization is confronted with litigation, a regulatory request, or an (internal) investigation, the initial eDiscovery can generate terabytes of electronic data. It is not easy to start comprehending what a case is about, let alone to make well-informed strategic decisions. This is where early case assessment can help. Early case assessment is an umbrella term for many different methods of understanding the structure and content of large unstructured data sets in order to make better decisions in an early phase of eDiscovery, without the need to review all documents in great detail in advance.

Depending on the type of eDiscovery case, there are different dimensions that may be interesting for an early case assessment: custodians, data volumes, location, time series, events, modus operandi, motivations, etc. As described by Attfield and Blandford in 2010 (5), traditional investigation methods can provide guidance for the relevant dimensions of such assessments: Who, Where, When, Why, What, How and How Much are the basic elements for analysis.

Who, Where and When can be determined using Named Entity Recognition (NER) methods. Why is harder, but my personal experience in law enforcement investigations shows that data locations with high emotion and sentiment values also provide a good indication of the motivation, or insights into the modus operandi.
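A minimal sketch of how the Who, Where, When and How Much entities can be extracted with an off-the-shelf NER model, again using the spaCy library (the sentence is invented for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained English model, assumed to be installed

# Invented example sentence
doc = nlp("On 14 March 2019, John Smith met a supplier in Rotterdam "
          "and approved an invoice of 250 000 euro.")

# PERSON answers Who, GPE/LOC answer Where, DATE answers When, MONEY answers How Much
for ent in doc.ents:
    print(f"{ent.label_:8s} {ent.text}")
```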

Below I provide a few examples of such analyses, derived using methods from the field of text mining.

Answering the What question with Topic Modeling

An example of a visualization of the What question can be found in Figure 1, in which Non-negative Matrix Factorisation (NMF) topic modeling is combined with clustering and basic visualization. This visualization allows users to dynamically browse eDiscovery document sets based on the automatically derived topic hierarchy.

Figure 1 — Visualizing the What question in eDiscovery

Source: ZyLAB Technologies BV, Amsterdam, the Netherlands.

Figure 1 shows two adjacent visualizations of the What question: on the left a traditional hierarchical tree view of the topics as text, and on the right a so-called Word-Wheel representation. Both can be navigated interactively: the text view by clicking on a line, and the graph by clicking on an area, which one can enlarge, shrink, or use to navigate to the documents in which a specific topic is most dominant. As an example of navigating the text view, clicking on the red entry on the left, ‘golf hole woods holes round’, will show the documents describing Tiger Woods’ successes in the 1996 world golf tournaments. All these topics and corresponding labels have been recognized by the topic modeling algorithm using unsupervised machine learning; the algorithm does not need any labeled or other initial information to build such topic models. This method is also language and domain independent.
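As a rough sketch of how such topic labels can be derived, the following uses the scikit-learn implementation of NMF on a TF-IDF matrix and prints the top terms per topic; the document list is a placeholder and the number of topics is an illustrative assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

documents = ["..."]  # placeholder: the extracted text of the eDiscovery documents

# Build a TF-IDF term-document matrix
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
tfidf = vectorizer.fit_transform(documents)

# Factorise into 20 topics (an illustrative choice)
nmf = NMF(n_components=20, random_state=0)
doc_topics = nmf.fit_transform(tfidf)   # document-topic weights
topic_terms = nmf.components_           # topic-term weights

# Label each topic with its five highest-weighted terms, e.g. 'golf hole woods holes round'
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(topic_terms):
    top_terms = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f"Topic {i}: {' '.join(top_terms)}")
```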

Answering the Who question with Community Detection

Once the Whos have been identified using Named Entity Recognition, methods from social network analysis can be used to identify relevant groups and communities, allowing reviewers to prioritize the review by focusing on the data of individuals whose communities are in close vicinity to the main suspects.

An example of such Community Detection, applied to the correspondence of the Stedelijk Museum of Modern Art in Amsterdam, leads to the automatic derivation of communities (6). See Figure 2.

Figure 2 — Community Detection in Communication from the Stedelijk Museum of Modern Art Amsterdam, the Netherlands

Source: ZyLAB Technologies BV, Amsterdam, the Netherlands.
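A rough sketch of how such communities can be detected in an e-mail corpus, using the open-source networkx library and its greedy modularity algorithm (the e-mail meta-data below is invented for illustration):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Invented (sender, recipient, number of messages) tuples extracted from e-mail meta-data
emails = [("alice", "bob", 42), ("bob", "carol", 17), ("carol", "alice", 5),
          ("dave", "erin", 30), ("erin", "frank", 25)]

# Build a weighted, undirected communication graph
G = nx.Graph()
for sender, recipient, count in emails:
    G.add_edge(sender, recipient, weight=count)

# Detect communities by greedy modularity maximisation
communities = greedy_modularity_communities(G, weight="weight")
for i, community in enumerate(communities):
    print(f"Community {i}: {sorted(community)}")
```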

Answering the What-When questions with Topic Rivers

Another dimension of early case assessment can be obtained by combining more complex overviews such as What-When, a form of dynamic topic modeling also referred to as Topic Rivers (7). Figure 3 displays the visualization of such Topic Rivers in 8 months of Reuters news from 2014. For each week, the system determines (in this case) the 20 most dominant topics. Next, for each period, the number of new, growing or declining topics is determined and connected to the corresponding topics in the previous and next periods. In the resulting graph, the invasion of Ukraine can clearly be observed in March 2014, pushing aside all other news. Other topics, such as the Israel-Palestine conflict, can be seen to be present in the news throughout the entire period. Another anomaly is the blue one at the bottom right of the graph, representing the news when Malaysia Airlines flight MH17 was shot down over Ukraine.

Figure 3 — Answering the What-When question: Topic Rivers in 8 months of Reuters News from 2014

Source: ZyLAB Technologies BV, Amsterdam, the Netherlands.
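A minimal sketch of the underlying idea, under the assumption that the documents are already grouped into weekly batches: fit a separate NMF model per week over a shared vocabulary and link each topic to its most similar topic in the next week using cosine similarity (scikit-learn; all parameter choices are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

def weekly_topics(docs_by_week, n_topics=20):
    """Fit a separate NMF topic model per weekly batch, over a shared vocabulary."""
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit([doc for week in docs_by_week for doc in week])
    topic_matrices = []
    for docs in docs_by_week:
        tfidf = vectorizer.transform(docs)
        nmf = NMF(n_components=n_topics, random_state=0).fit(tfidf)
        topic_matrices.append(nmf.components_)  # one topic-term matrix per week
    return topic_matrices

def link_topics(topic_matrices, threshold=0.3):
    """Link each topic to its most similar topic in the next week: one 'river' segment."""
    links = []
    for week, (current, nxt) in enumerate(zip(topic_matrices, topic_matrices[1:])):
        similarities = cosine_similarity(current, nxt)
        for i, row in enumerate(similarities):
            j = int(np.argmax(row))
            if row[j] >= threshold:  # the topic continues into the next week
                links.append((week, i, week + 1, j, float(row[j])))
    return links
```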

Answering the Why question with Emotion Mining

Emotions and sentiments often lead to interesting e-mails in eDiscovery. Both can be measured. Combining emotions with custodians (persons) can lead to the discovery of relevant issues or insights into the modus operandi, and sometimes even code words that could be the starting point of further investigations.

For this reason, part of the Why question can often be answered by looking at the communication with the highest levels of emotion or negative sentiment. By identifying these messages and linking them to the persons expressing them, one can obtain an answer to the Why-Who question.
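A minimal sketch of this Why-Who analysis, scoring messages with the VADER sentiment analyser from NLTK and ranking custodians by their most negative messages; the messages are invented, and the richer emotion categories shown in Figure 4 would require an emotion lexicon rather than a single polarity score:

```python
from collections import defaultdict
from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

# Invented (custodian, message) pairs extracted from e-mail bodies
messages = [
    ("alice", "Great, thanks for sorting out the contract so quickly!"),
    ("bob",   "This is a disaster, if the auditors see these numbers we are finished."),
    ("bob",   "Delete that spreadsheet before Friday, I am serious."),
]

analyzer = SentimentIntensityAnalyzer()
scores_by_custodian = defaultdict(list)
for custodian, text in messages:
    compound = analyzer.polarity_scores(text)["compound"]  # -1 (very negative) .. +1 (very positive)
    scores_by_custodian[custodian].append((compound, text))

# Rank custodians by their single most negative message (the Why-Who view)
for custodian, scored in sorted(scores_by_custodian.items(),
                                key=lambda item: min(score for score, _ in item[1])):
    score, text = min(scored)
    print(f"{custodian}: most negative score {score:.2f} -> {text}")
```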

An example of how well one can determine emotions from text is provided in Figure 4, showing the analysis of the lyrics of 220 000 pop songs for the basic emotions Trust, Anticipation, Joy, Anger, Fear and Sadness. The name of each pop artist is displayed on the lines connecting the most dominant emotions in their songs. As we can observe, rappers are in the bottom left corner, around Anger and Fear, while Elvis, the Beatles and David Bowie are more towards the top right corner, around Joy, Trust and Anticipation. Similar analyses have been made of movies, books and other content, with similarly satisfying results. These techniques have also been used in court cases.

Figure 4 — Answering the Who-Why question: Emotion mining in 220 000 song lyrics and the corresponding artists

Source: ZyLAB Technologies BV, Amsterdam, the Netherlands.

Many other combinations of analysis, clustering and visualization methods can be created, and we will most likely see more of these in future research.

Using Text Mining for Event Detection

In a number of recent research projects, the above-mentioned ideas have been taken one step further: by using open information extraction, text collections can be converted into graphs of linguistic patterns. This is done by first extracting patterns of objects, predicates and subjects. Next, as similar words are used both as objects and as subjects, these patterns can be represented as graph structures. Extending this idea with a temporal component, it is then possible to create dynamic graphs that change over time, where changes to the objects, predicates and subjects result in a time-lapsed representation. It can be observed that major changes in such networks appear to relate to major real-world events. Early experiments on news messages and on e-mails from the Enron data set have shown interesting early results, which have led to a number of ongoing research projects (8).

Instead of investigating changes in graphs derived from objects, predicates and subjects, one can also investigate graphs representing communication patterns between individuals, without taking the actual content of the messages into consideration. Taking the individuals as nodes and the number of e-mails (or text messages) between them as edge weights, we can create a graphical representation as well. Such a graph can also be presented in time, resulting in a time-lapsed representation of all communication.

Major changes in such a network can then be linked to major events during an investigation. The challenge is to determine which changes are relevant. To do this, a variety of algorithms can be used. The most common ones are NetSimile (first graph in Figure 5) and DeltaCon (second graph), or one could even just measure the edge difference between consecutive snapshots (third graph). The fourth graph represents the ground truth of the interesting events in the Enron investigation. One can observe that DeltaCon finds almost 65% of all events with a relatively low number of false positives, while NetSimile finds almost all events, at the price of a larger number of false positives (9). This field, too, is subject to additional research, currently underway at ZyLAB in collaboration with a number of universities in the Netherlands.

Figure 5 — Events in Enron between June 1999 and February 2002

Source: ZyLAB Technologies BV, Amsterdam, the Netherlands.

In this figure, three different methods are compared for detecting important events in the Enron investigation. Note that the lowest graph in the figure illustrates the ground-truth events, shown as green lines. For the other three measures, a green line indicates that one of the ground-truth events was correctly flagged as an anomaly, while a red line indicates an event that was incorrectly flagged as an anomaly (i.e. a false positive) (10).
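Of the three measures, the edge difference is the simplest to illustrate. The following minimal sketch, using networkx, flags weeks in which the communication graph changes by more than a threshold; the weekly snapshots and the threshold are invented for illustration:

```python
import networkx as nx

def edge_difference(g1, g2):
    """Number of edges present in one weekly snapshot but not in the other."""
    e1 = {frozenset(edge) for edge in g1.edges()}
    e2 = {frozenset(edge) for edge in g2.edges()}
    return len(e1 ^ e2)  # symmetric difference

def flag_events(snapshots, threshold=10):
    """Flag the weeks in which the communication graph changes more than the threshold."""
    flagged = []
    for week, (previous, current) in enumerate(zip(snapshots, snapshots[1:]), start=1):
        if edge_difference(previous, current) > threshold:
            flagged.append(week)
    return flagged

# Invented weekly snapshots built from (sender, recipient) pairs
week1 = nx.Graph([("alice", "bob"), ("bob", "carol")])
week2 = nx.Graph([("alice", "bob"), ("bob", "carol"), ("carol", "dave")])
week3 = nx.Graph([(f"person{i}", "lawyer") for i in range(20)])  # sudden burst of communication

print(flag_events([week1, week2, week3], threshold=10))  # -> [2]
```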

New techniques… for the same purpose

eDiscovery technology taught us how to deal with real-world big data. Text mining taught us how to find specific patterns of interest in textual data. The combination of eDiscovery and text mining will teach us how to find even more complex (temporal) relations in big data and, ultimately, to train our algorithms to provide better decision support and assist investigators in detecting anomalies and moments of incidents in our ever-growing electronic data sets.

This is a rapidly evolving field, where new methods to understand the structure, meaning and complexity of natural language are being introduced at an ever-accelerating speed. These developments will result in tools that will be essential for auditors and internal investigators to keep up with ever-growing electronic data sets and to get to the essence of a case as quickly and efficiently as possible.

(1) Monitoring or analysis of Bloomberg chat-boxes is standard procedure in the USA as part of investigations involving financial traders.

(2) See https://en.wikipedia.org/wiki/Precision_and_recall for a detailed explanation of the terms precision and recall.

(3) For instance, by comparing the extracted distributions of the components of such patterns to those of a verified language model of normal language.

(4) See https://www.zylab.com/en/corporations/resources/big-data-analytics-for-legal-fact-finding for a detailed overview and the history of eDiscovery.
