Investigating Cybersecurity in Public Companies

Extract and visualize incidents reported in SEC 8K and 6K filings

Weidong Yang
Kineviz
Mar 15, 2024

Introduction

The financial statements and background information that publicly traded companies must provide in Securities and Exchange Commission (SEC) filings are an important source of information about the operational stability of a company and its dealings with owners, partners, customers, and regulatory bodies. In 2023, new rules introduced mandatory reporting of cybersecurity-related incidents for all companies listed in the US. Domestic issuers must disclose material cybersecurity incidents in a Form 8-K filing, and foreign private issuers must disclose such incidents in a Form 6-K filing. These SEC filings can be used to construct a history of incidents and to evaluate a company’s cybersecurity vulnerabilities.

SEC filings are often lengthy. Asking a human investigator to read through many filings can be an overwhelming prospect. Furthermore, cybersecurity incidents are just one of the many kinds of material events that trigger a requirement to file. Manually extracting connected information on that single topic of interest is far too slow and unreliable.

With SightXR we have a platform that delivers rapid access to the information in unstructured text documents. We can use it to:

  • select and ingest relevant filing documents,
  • extract and connect specific entities and relationships related to cybersecurity incidents using AI-powered language models,
  • store and display the results as a connected knowledge map,
  • expand, explore, organize, and simplify the knowledge map, and
  • answer further questions about the reported incidents using Q&A chat.

In this article we’ll show how to use SightXR for a fast, basic investigation into reported cybersecurity incidents.

Ingest filing documents

SEC filings are available online through the SEC's Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. We created a Streamlit utility to search EDGAR and ingest selected filings into SightXR. It shows the main properties of the filings and lets us select documents by topic and date range. Ingesting a filing retrieves the complete document and submits it to SightXR, together with all the properties available in the search results table.

For an initial demonstration, a search for filings with cybersecurity incidents over the last three months returned 15 filings that we then ingested into SightXR.
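The selection step can be pictured as a simple filter over the rows of a search result. The following is a minimal sketch, not the Streamlit utility itself: the Filing fields, the select_filings function, and the sample rows are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Filing:
    """One row of a search result table (hypothetical fields)."""
    company: str
    form_type: str    # e.g. "8-K" or "6-K"
    filed_on: date
    description: str  # free-text summary of the filing


def select_filings(filings, start, end, keyword="cybersecurity",
                   form_types=("8-K", "6-K")):
    """Return filings of the given form types, filed within [start, end],
    whose description mentions the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [
        f for f in filings
        if f.form_type in form_types
        and start <= f.filed_on <= end
        and keyword in f.description.lower()
    ]


# Made-up rows for illustration only; these are not real filings.
rows = [
    Filing("Example Corp", "8-K", date(2024, 2, 1), "Material cybersecurity incident"),
    Filing("Example Corp", "10-K", date(2024, 2, 20), "Annual report"),
    Filing("Sample Ltd", "6-K", date(2023, 9, 5), "Cybersecurity incident update"),
]
selected = select_filings(rows, date(2023, 12, 15), date(2024, 3, 15))
print([f.company for f in selected])
```

Only the first row passes all three conditions: the second has the wrong form type and the third falls outside the date range.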

Extract entities and build the knowledge map

In SightXR, we choose the provider of the large language model (LLM) that extracts the entities and relationships used to build the resulting knowledge map. The same LLM also powers a Q&A-style chat, which adds a new dimension to iterative exploration of the filing documents. For this example, we’re using GPT-3.5 Turbo.

Extraction is based on a data model that specifies the entities of interest. We use the default POLE model (Persons, Organizations, Locations, and Events) popularized by Neo4j, since it’s a standard approach in investigative and security use cases.
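To make the idea of a data model concrete, here is a minimal sketch of POLE expressed as Python types. It is not SightXR's internal schema: the class names, the validate helper, and the INVOLVED_IN relationship label are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The four POLE categories."""
    PERSON = "Person"
    ORGANIZATION = "Organization"
    LOCATION = "Location"
    EVENT = "Event"


@dataclass(frozen=True)
class Entity:
    category: Category
    name: str


@dataclass(frozen=True)
class Relationship:
    source: Entity
    label: str  # hypothetical relationship label
    target: Entity


def validate(category_name, name):
    """Accept an extracted entity only if its category is in the model."""
    try:
        category = Category(category_name)
    except ValueError:
        return None  # category is not part of the POLE model
    return Entity(category, name.strip())


org = validate("Organization", " Example Corp ")
event = validate("Event", "Data breach")
rejected = validate("Product", "Widget")  # not a POLE category
link = Relationship(org, "INVOLVED_IN", event)
print(org, rejected, link.label)
```

Constraining extraction to a small, fixed set of categories like this is what keeps the resulting map navigable: anything the language model proposes outside the model can be dropped or reviewed.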

Store and visualize extracted results as a connected knowledge map

Extraction detects entities (nodes and relationships) according to the POLE model. Results are stored in a graph database, in this case Neo4j. Entities are also connected to the source documents through Chunk, Observation, and Source categories. Collectively, these form a knowledge map, a loose form of knowledge graph.
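The value of the Chunk, Observation, and Source categories is provenance: any entity can be traced back to the documents it came from. The sketch below illustrates that idea with an in-memory structure rather than Neo4j. The node ids and the assumed chain (entity, then Observation, then Chunk, then Source) are hypothetical, since the exact relationships SightXR stores are not described here.

```python
from collections import defaultdict

# (child, parent) pairs: each node points to the node it was derived from.
# All node ids are made up for illustration.
derived_from = [
    ("chunk:1", "source:filing-A"),
    ("chunk:2", "source:filing-B"),
    ("obs:1", "chunk:1"),
    ("obs:2", "chunk:2"),
    ("org:Example Corp", "obs:1"),
    ("org:Example Corp", "obs:2"),
    ("event:Incident", "obs:1"),
]

parents = defaultdict(set)
for child, parent in derived_from:
    parents[child].add(parent)


def sources_of(node):
    """Walk the derived-from links upward and collect Source nodes."""
    found, seen, stack = set(), set(), [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current.startswith("source:"):
            found.add(current)
        stack.extend(parents.get(current, ()))
    return found


print(sorted(sources_of("org:Example Corp")))
print(sorted(sources_of("event:Incident")))
```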

That done, we go to the Visualize step and click Everything to see the POLE entities that have been created and their relationships to Observation nodes. In the workspace, entities are displayed as nodes connected by edges, with category and relationship labels assigned as specified in the model.

With all the nodes and edges present, we can get a visual sense of the data we have to work with, and evaluate the results of our extraction process.

Reorganize and simplify the knowledge map

A useful first step is to see that categories have been assigned as we expect.

For example, when we click Organization and search for nodes with a label that includes SEC, we see a few nodes that should be selected and removed, as they do not actually refer to an organization.

We also see that due to spelling variations or errors in the filing documents, a few of the categories should be merged into one, and we can do this immediately.
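In SightXR the merge is done interactively, but the underlying idea of spotting variants of the same label can be sketched in a few lines. The example below is an illustration under stated assumptions: the normalize and merge_candidates helpers, the suffix list, and the sample labels are all hypothetical. It catches differences in case, punctuation, and company suffixes only; genuine misspellings would need fuzzy matching.

```python
import re
from collections import defaultdict

# Common company suffixes to ignore when comparing labels (illustrative list).
SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "co"}


def normalize(name):
    """Lowercase, drop punctuation, and strip trailing company suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def merge_candidates(names):
    """Group labels that share a normalized key; keep groups of two or more."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


labels = ["Example Corp.", "example corp", "EXAMPLE Corporation", "Sample Ltd"]
candidates = merge_candidates(labels)
print(candidates)
```

Each group returned is a set of labels that probably refer to the same thing and can be reviewed before merging.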

Explore and organize starting with source documents

We’ve seen how to view all the extracted entities. Now we’ll start an investigation by working with the source documents. First we clear the workspace by selecting all the entities and deleting them. They still exist in the graph database, as we’ll soon see. To display the source documents within the workspace, we click the Source category and then the Pull All button. A table is displayed, and a separate node for each source document appears in the workspace.

In the graph workspace we can also work with tables for any node or edge label, perform data cleaning and transformation, run graph analytics such as path finding and centrality, and apply a variety of layout and caption options to help focus an investigation.

For example, Source nodes have a display_name property that’s a bit complicated. We can use an f(x) transform to create a readable company_name property and apply it as the caption.

To connect all the filings from a given company, we use an Extract transform to create a new Company category based on a companyName property.
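Conceptually, an extract-style transform creates one new node per distinct property value and links every original node to it. The sketch below shows that logic on plain dictionaries. It is not the SightXR implementation: the extract_category function, the node layout, and the sample data are assumptions for illustration.

```python
def extract_category(nodes, prop, new_category):
    """Create one new node per distinct value of `prop` and link each
    original node to it. Returns (new_nodes, edges)."""
    new_nodes = {}
    edges = []
    for node in nodes:
        value = node.get(prop)
        if value is None:
            continue  # nodes without the property are left unlinked
        if value not in new_nodes:
            new_nodes[value] = {
                "id": f"{new_category}:{value}",
                "category": new_category,
                "name": value,
            }
        edges.append((node["id"], new_nodes[value]["id"]))
    return list(new_nodes.values()), edges


# Made-up Source nodes for illustration.
sources = [
    {"id": "s1", "category": "Source", "companyName": "Example Corp"},
    {"id": "s2", "category": "Source", "companyName": "Example Corp"},
    {"id": "s3", "category": "Source", "companyName": "Sample Ltd"},
]
companies, links = extract_category(sources, "companyName", "Company")
print(len(companies), "companies,", len(links), "links")
```

Here three filings collapse onto two Company nodes, so all filings from the same company become neighbors of a single hub.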

What organizations are involved in the filings?

We can pull Organization nodes from the database to see the organizations that are connected to the companies and their incidents.

The SEC organization node can be removed, since it’s obviously connected to all the source documents (again, it remains in the Neo4j database).

Expand with Observations

We can Expand along relationships to show an organization’s connected observations, and then arrange the observations over time to see how a cybersecurity incident was handled as it progressed from discovery to investigation to resolution.
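Arranging observations over time amounts to ordering them by a date property. As a minimal sketch, assuming each observation carries an ISO-formatted date string (a hypothetical property; the sample observations are made up), the ordering could look like this:

```python
from datetime import date

# Made-up observations for illustration; "date" is an assumed property.
observations = [
    {"date": "2024-02-10", "text": "Forensic investigation under way"},
    {"date": "2024-01-28", "text": "Unauthorized access discovered"},
    {"date": None, "text": "Undated remark"},
    {"date": "2024-03-01", "text": "Systems restored"},
]


def timeline(obs):
    """Return observations in chronological order, undated ones last."""
    dated = [o for o in obs if o["date"]]
    undated = [o for o in obs if not o["date"]]
    dated.sort(key=lambda o: date.fromisoformat(o["date"]))
    return dated + undated


for o in timeline(observations):
    print(o["date"], o["text"])
```

Keeping undated observations at the end, rather than dropping them, preserves context that may still matter to the investigation.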

Answer questions using Q&A chat

Now that we have the basic story for the company and incident, we can continue to investigate by entering questions in the Q&A chat. Answers are returned with an explanation and a summary. Such conversational responses are likely to spark further questions and insights, which can in turn be quickly explored.
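One common way to ground chat answers in extracted text is retrieval augmented generation: select the observations most relevant to a question and pass them to the model as context. The sketch below illustrates only the retrieval and prompt-assembly steps, using a crude word-overlap score. It is an assumption about how such a step might work, not a description of SightXR's internals; the function names, the prompt wording, and the sample observations are all hypothetical, and no model is called.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, observations, k=2):
    """Rank observations by word overlap with the question; return the top k."""
    q = tokens(question)
    scored = [(len(q & tokens(o)), o) for o in observations]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [o for _, o in scored[:k]]


def build_prompt(question, observations, k=2):
    """Assemble a prompt that restricts the answer to retrieved context."""
    context = "\n".join(f"- {o}" for o in retrieve(question, observations, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up observations for illustration.
notes = [
    "Unauthorized access to customer data was discovered in January.",
    "The company restored affected systems in March.",
    "The board approved a dividend.",
]
question = "When was the unauthorized access discovered?"
print(build_prompt(question, notes, k=1))
```

A production system would typically replace the word-overlap score with embeddings or graph traversal, but the shape of the step is the same: narrow the text first, then ask the model.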

Conclusion

The SightXR platform provides seamless and rapid access to unstructured information in ways that have recently become possible with the advent of large language models (LLMs) and GenAI. The key role of the LLM here is to prepare the documents for investigation by humans. You can start from an entity, immediately pull all the observations associated with it, and gain a comprehensive understanding of that entity in the context of the provided documents.

Conceptually, we’re building on retrieval augmented generation (RAG), an architecture that improves the performance of LLM applications by leveraging custom data. By combining observations and knowledge maps, we allow a language model to transform documents into a more open-ended investigative environment that retains essential context and nuance. The resulting knowledge maps provide visual cues that help humans gain quick insights, and at the same time improve the quality of RAG. We find that this approach makes it dramatically faster and easier to obtain insights from masses of unstructured text.

Weidong is an entrepreneur, scientist, programmer, and artist. He founded Kineviz and Kinetech Arts.