Bootstrapping Datasets with Machine Aggregation and Human Annotation: Cyber Risk Case Study
Why Build a New Cyber Risk Dataset
After we introduced a new, more complete cyber risk dataset for cyber risk quantification and underwriting, a number of people in the community asked for more detail on how we built the annotation process and tooling behind the data we collect.
As mentioned in the previous post, cyber risk quantification currently lacks both the volume of data and the rich feature set needed for cyber risk modeling. Most datasets contain only basic information about the attack and say nothing about the state of the organization at the time of the incident. We often know when the incident occurred and/or when it was made public, but not the financial state of the company, the state of its security, or how the attacker exploited it. Occasionally there is some information about the attack itself, but the labels are poorly defined and don't follow an established standard framework.
Another challenge arises when cross-referencing organizations with other data sources, because company names can be written in several different ways.
To get a more complete view of security incidents, we built a dataset that shows which security controls were affected in the attack, which losses the company incurred as a result of the incident, what the financial state of the company was, and finally which record types were affected.
Scalable Aggregation with an Automated Pipeline
We take security incident information from 16 different sources: 12 state attorney general listings and 4 public aggregators. For financial and organizational information about the companies, we use public data sources such as Yahoo Finance, as well as 14 additional proprietary sources.
We uniquely identify an organization using Named Entity Recognition (NER), which extracts the identifiable portion of the organization's name (e.g. 'Apple' in 'Apple Computers' or 'Apple, Inc.').
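As a rough illustration, the effect of this normalization can be approximated with a rule-based sketch that strips legal suffixes and generic descriptors from a name. This is a simplified stand-in for the NER step, not our production pipeline; the suffix list and function name are illustrative.

```python
import re

# Legal suffixes and generic descriptors to drop (illustrative, not exhaustive).
SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "co", "company", "computers"}

def core_name(raw: str) -> str:
    """Reduce an organization name to its identifiable core, so that
    'Apple, Inc.' and 'Apple Computers' both map to 'apple'."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)
```

With a canonical core name in hand, records from different sources can be cross-referenced despite the spelling variations mentioned above.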
Given the records from different sources, we then deduplicate them and merge records describing the same event using a model over the announcement date, incident date, and other features: different records are unlikely to share the same dates and other properties unless they refer to the same event.
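A heavily simplified sketch of this merge step, using only a normalized company name and announcement-date proximity. The `Record` shape, the 3-day window, and the greedy single-pass strategy are assumptions for illustration; the production model weighs more features than dates alone.

```python
from datetime import date
from typing import NamedTuple

class Record(NamedTuple):
    company: str     # already-normalized organization name
    announced: date  # date the incident was made public

def merge_events(records, window_days=3):
    """Greedy merge: records for the same company whose announcement
    dates fall within `window_days` of each other become one event."""
    events = []
    for rec in sorted(records, key=lambda r: (r.company, r.announced)):
        for ev in events:
            last = ev[-1]
            if (last.company == rec.company
                    and (rec.announced - last.announced).days <= window_days):
                ev.append(rec)
                break
        else:
            events.append([rec])
    return events
```

Records for the same company a month apart stay separate events, while near-simultaneous reports from different sources collapse into one.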
Quality Control with Manually Deduplicated Records
The aggregation engine still produces some errors and incomplete data. Often different events occur within short intervals, or dates from the sources get mixed up, producing records that are erroneously merged. We needed to improve this if we wanted a dataset robust enough for modeling and for taking principal risk. We also wanted additional features providing more information about the incidents so we could test hypotheses from the team and the security advisory board, so we decided to start annotating manually.
We could not find any tools or services that made this task efficient enough, so we built a custom toolchain and methodology to handle the annotation, and we built an annotator workforce and training program.
Annotators are given rows of information about one company drawn from different sources. Each row includes the original source of the information, plus any external source attached to it. Annotators read through this data and identify incidents that are duplicates. The UI supports dragging and dropping rows into buckets, each representing a single incident.
For each deduplicated event, the annotator first identifies the essential information: the date it was published, when the incident occurred, and the total number of affected records and their type. When this information is already present in the original data source, the annotator's job is to find inconsistencies in the data and add missing information.
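The consistency checks an annotator performs on these essential fields can be sketched as a few simple rules. The field names and the specific rules below are illustrative assumptions, not our actual annotation schema.

```python
from datetime import date

def find_inconsistencies(event: dict) -> list:
    """Flag basic problems in an event's essential fields for annotator
    review (field names are illustrative)."""
    issues = []
    occurred, published = event.get("occurred"), event.get("published")
    if occurred and published and occurred > published:
        issues.append("incident date is after the publication date")
    if event.get("records_affected") is None:
        issues.append("total number of affected records is missing")
    if not event.get("record_types"):
        issues.append("affected record types are missing")
    return issues
```

In practice such checks would only surface candidates; the annotator still resolves each flag against the original sources.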
Affected Security Controls Annotation using the CIS Framework
In our modeling, we use the CIS framework to establish the security profile of a company. Even though the detailed security state of a company at the time of an incident doesn't have to be disclosed publicly, it is often possible to infer which CIS controls were affected, since some details about the attack are made public. We therefore ask annotators to identify the CIS controls that could have prevented the incident or significantly mitigated its impact. This serves as a proxy for which controls were missing or not working properly at the target organization.
We’ll publish the detailed methodology for annotating each of the CIS controls soon. In this post, we walk through an example of our scenario-based methodology, in which the annotator identifies the affected controls by understanding the scenario of the attack.
Let’s say a phishing attack exploiting human error resulted in a theft of credentials: a user entered his credentials on a phishing site. The attacker used the credentials to log in to a remote machine and later discovered that point-of-sale (POS) terminals were located on the same network, then used this access to steal credit card information from the terminals.
In this scenario, we can reason that the employees of this company were not trained sufficiently (CIS 17: Implement a Security Awareness and Training Program). The ability of the attacker to log in remotely with stolen credentials implies that multi-factor authentication was missing (CIS 16: Account Monitoring and Control), and finally, the missing separation of sensitive applications on the local network implies CIS 14: Controlled Access Based on the Need to Know was either missing or not implemented correctly.
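This scenario-to-control reasoning can be sketched as a lookup from observed attack facts to the CIS v7 controls that could have prevented or mitigated them. The observation strings and the mapping below are hypothetical examples for this one scenario, not our annotation guide.

```python
# Hypothetical mapping from observed attack facts to CIS v7 controls.
SCENARIO_TO_CIS = {
    "user entered credentials on phishing site": 17,  # Security Awareness and Training Program
    "stolen credentials used for remote login": 16,   # Account Monitoring and Control
    "lateral movement to POS network segment": 14,    # Controlled Access Based on the Need to Know
}

def affected_controls(observations):
    """Return the sorted set of CIS controls implicated by the scenario."""
    return sorted({SCENARIO_TO_CIS[o] for o in observations if o in SCENARIO_TO_CIS})
```

A real annotation is of course a judgment call rather than a table lookup; the point is that each inferred control is tied to a concrete, citable fact about the attack.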
To research the losses incurred due to the incident, analysts read the original sources available about the attack and research the costs using specific, detailed queries we’ve trained them on. For publicly listed companies, we also review financial statements to look for concrete evidence of these costs. We annotate only concrete numbers we can find mentioned; annotators may not project estimates, only record directly citable figures.
During our research, we defined a Loss Graph containing generalized loss events such as Customer Credit Monitoring and Notification, and Third Party Litigation. Annotators look for evidence of a total cost and also try to link losses to each specific category in the Loss Graph.
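A minimal sketch of how such a loss annotation could be structured in code, assuming a small illustrative subset of Loss Graph categories and enforcing the rule that every figure carries a citable source. The class and field names are assumptions, not our internal schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative subset of Loss Graph categories mentioned in the text.
LOSS_CATEGORIES = {"customer_credit_monitoring_and_notification",
                   "third_party_litigation"}

@dataclass
class LossAnnotation:
    total_cost_usd: Optional[int] = None  # only set when directly cited
    by_category: dict = field(default_factory=dict)  # category -> [(amount, source)]

    def add(self, category, amount_usd, source):
        """Record a loss figure; every figure must cite a source, never an estimate."""
        if category not in LOSS_CATEGORIES:
            raise ValueError("unknown Loss Graph category: " + category)
        self.by_category.setdefault(category, []).append((amount_usd, source))
```

Requiring a source string on every `add` call mirrors the annotation rule above: no projected estimates, only figures an auditor could trace back to a document.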
Financial and Organizational Company Profile
The dataset has several financial indicators available at the time of the publicly disclosed incident. These values are taken directly from the automated pipeline aggregating structured information, so they are not annotated manually.
These values include (all taken as close as possible to the time of the announcement) stock price, revenue, net income, number of employees, ICB industry sector, and operating expenses, among many others. In the initial batch of annotated events, these values are available for approximately 60% of the events.
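That availability figure corresponds to a simple completeness check over the event records. A sketch, assuming events are dictionaries with possibly missing fields (the field names are illustrative):

```python
def field_coverage(events, fields):
    """Fraction of events for which every listed field is present."""
    if not events:
        return 0.0
    complete = sum(1 for e in events
                   if all(e.get(f) is not None for f in fields))
    return complete / len(events)
```

Tracking coverage per field, rather than only overall, also shows which sources are worth adding to raise the 60% figure.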
If you’re interested in more details about the data, see the original intro post on the cyber risk dataset. We welcome experts who can help scrutinize the data, suggest new sources to aggregate, improve its quality, or recommend better methods for aggregation and annotation. We are running a limited-release private beta, so please include a bit about your background and use case: email@example.com