Leveraging LLMs with Kineviz SightXR for visual graph exploration of the Enron Corpus

Diénert Vieira
Kineviz

--

Two decades later, the Enron bankruptcy case still offers lessons in how mismanagement and wrongdoing can destroy a large and seemingly robust corporation. After the litigation was resolved, the company’s enormous email dataset was made public, giving researchers a window into the company practices that led to its failure.

Although many researchers have analyzed the network topology available in these emails, the dataset’s size, density, and complexity often result in “hairball” graphs with so many nodes and relationships that visual evaluation becomes very difficult.

In this case study, we aimed to sift through the mass of entities and relationships to focus on possible early-stage “red flag” patterns that could help signal such a massive catastrophe.

We can easily visualize such patterns with Kineviz GraphXR. But to build the graphs, we must first identify and map entities of interest in the unstructured emails. For this we use SightXR, a new Kineviz product that extends GraphXR and lets us process unstructured emails and other unstructured files such as PDF, DOCX, and plain text.

Initially, we extracted some nodes from the Enron files (Figure 1).

Figure 1 — Initial nodes from the graph.

Each file is represented by a Source node with a unique URI (e.g., “file:///enron/lay-k/notes_inbox/140.”). Email nodes carry the subject, date, from, to, cc, bcc, and content properties. We combined the date and from properties to identify each email uniquely. This avoids duplication and reveals when multiple source files refer to the same email.
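As a rough illustration of that deduplication idea, the sketch below builds a key from the sender and the send time and groups source URIs under it. It is not SightXR's implementation: the function names, the example URIs, and the choice to normalize timestamps to UTC are all assumptions made for this example, using only Python's standard `email` package.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def email_key(raw: str) -> tuple[str, str]:
    """Build a (sender, UTC timestamp) key from a raw RFC 822 message.

    Assumption: sender address plus send time is unique enough to
    identify one email, as described in the text above.
    """
    msg = message_from_string(raw)
    # Keep only the address part and lowercase it so "A@x.com" == "a@x.com".
    sender = parseaddr(msg.get("From", ""))[1].lower()
    # Normalize to UTC so the same instant in two time zones compares equal.
    sent = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
    return sender, sent.isoformat()


def group_sources_by_email(sources: dict[str, str]) -> dict[tuple[str, str], list[str]]:
    """Map each email key to every source URI that contains that email."""
    groups: dict[tuple[str, str], list[str]] = {}
    for uri, raw in sources.items():
        groups.setdefault(email_key(raw), []).append(uri)
    return groups
```

Any key that ends up with more than one URI marks an email stored in several places, for example in one person's sent folder and another person's inbox.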

We started our analysis by searching for “off-balance sheet”, an expression that might signal one of the activities that led to the financial collapse. We then displayed the matching emails and their senders in GraphXR’s ring layout (Figure 2). At the center is Kay Mann, the person whose emails contain the phrase most frequently.

Figure 2 — Stand-out result of searching for “off-balance sheet”.
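The search-and-rank step can be approximated outside GraphXR with a few lines of Python. This is only a sketch under stated assumptions: emails are plain dictionaries with hypothetical `from` and `content` keys, the match tolerates case and hyphen/space variants, and senders are ranked by the number of matching emails rather than by the number of occurrences.

```python
import re
from collections import Counter


def find_phrase(emails: list[dict], phrase: str) -> list[dict]:
    """Return the emails whose content mentions the phrase.

    Words may be separated by any mix of whitespace and hyphens, so
    "off-balance sheet" also matches "Off balance\nsheet".
    """
    words = [w for w in re.split(r"[\s-]+", phrase.strip()) if w]
    pattern = re.compile(r"[\s-]+".join(re.escape(w) for w in words), re.IGNORECASE)
    return [e for e in emails if pattern.search(e["content"])]


def rank_senders(matches: list[dict]) -> list[tuple[str, int]]:
    """Count matching emails per sender, most frequent first."""
    return Counter(e["from"] for e in matches).most_common()
```

The sender at the top of the ranking corresponds to the node that ends up at the center of a layout like the one in Figure 2.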

We can also visually organize the emails in a timeline format (Figure 3).

Figure 3 — “Off-balance sheet” timeline.

In addition, we searched for “special purpose” entities, another expression associated with the frauds being perpetrated (Figure 4).

Figure 4 — Stand-out results for mentions of “special purpose” entities.

I am not a litigation expert, so deciding which entities to pull out of the emails might seem daunting. Fortunately, SightXR lets me set a schema that defines the entities of interest (Figure 5).

Figure 5 — Setting the entities of interest.

By default, entities are extracted using the well-known POLE model (Person, Organization, Location, and Event), but we can add as many entity types as we want. With the schema in place, our Large Language Models (LLMs) build a Knowledge Map that captures the entities of interest and their relationships (Figure 6).

Figure 6 — Knowledge mapping tasks available in SightXR.
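To make the schema idea concrete, here is a minimal sketch of how a POLE-style schema could be represented, extended with a custom entity type, and turned into an instruction for an LLM. None of this is SightXR's API: `EntityType`, `extend_schema`, `extraction_prompt`, the entity descriptions, and the `FinancialTerm` type are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityType:
    name: str
    description: str  # illustrative wording, used only to build the prompt


# The POLE defaults as listed in the text; descriptions are assumptions.
POLE = (
    EntityType("Person", "A named individual"),
    EntityType("Organization", "A company, agency, or other institution"),
    EntityType("Location", "A place such as a city, country, or site"),
    EntityType("Event", "Something that happened at a point in time"),
)


def extend_schema(base, *extra):
    """Return base plus any extra types whose names are not already present."""
    seen = {t.name.lower() for t in base}
    result = list(base)
    for t in extra:
        if t.name.lower() not in seen:
            seen.add(t.name.lower())
            result.append(t)
    return tuple(result)


def extraction_prompt(schema, text):
    """Render a plain-text instruction that could be sent to an LLM."""
    lines = [f"- {t.name}: {t.description}" for t in schema]
    return (
        "Extract every entity of the following types, and the relationships "
        "between them, from the text below.\n"
        + "\n".join(lines)
        + "\n\nText:\n"
        + text
    )


# Example: add one custom type on top of the defaults.
schema = extend_schema(POLE, EntityType("FinancialTerm", "An accounting or finance expression"))
```

Keeping the schema as data rather than hard-coding it in the prompt makes it easy to add or remove entity types between runs.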

Knowledge mapping runs in parallel processes, but on a dataset this large some routines can still take a while. That’s why we narrow down our search first. After building the Knowledge Map, we can show the entities detected in the email texts (Figure 7).

Figure 7 — Stand-out emails from a search for “special purpose entities”, shown in the graph together with other suspicious fraud terminology.

We can also highlight all detected entities in the body of each email for rapid evaluation (Figure 8).

Figure 8 — Suspicious fraud terminology is highlighted in the email text.
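The highlighting in Figure 8 can be imitated in plain text with a single regular expression. The sketch below is an assumption-laden stand-in, not how SightXR renders highlights: the `highlight` function, the `[[ ]]` markers, and the term list in the usage example are all made up for illustration.

```python
import re


def highlight(text: str, terms: list[str], open_mark: str = "[[", close_mark: str = "]]") -> str:
    """Wrap every case-insensitive occurrence of a term in marker strings."""
    # Longest terms first, so "special purpose entity" wins over "special purpose".
    ordered = sorted({t for t in terms if t}, key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    # Keep the original casing of the matched text inside the markers.
    return pattern.sub(lambda m: f"{open_mark}{m.group(0)}{close_mark}", text)


marked = highlight(
    "The special purpose entity was kept off-balance sheet.",
    ["special purpose", "special purpose entity", "Off-Balance Sheet"],
)
print(marked)
```

In a real interface the markers would be replaced by styling, but the matching logic (longest term first, case-insensitive) stays the same.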

In this initial engagement with the Enron data, we found that combining GraphXR’s visualization platform with SightXR’s data processing and LLM-enabled knowledge mapping offers a potent investigative framework for fast, human-centered visualization and analysis. With the Enron corpus as a test bed, we showed that this approach helps identify and contextualize potential red flags, which could strengthen future risk assessment and regulatory scrutiny in corporate environments.
