The World’s Most Complete Breach Dataset for Cybersecurity Risk Models

Jun 19, 2019

The pain of historical security incident data

When we conducted our technical due diligence on the available historical security incident datasets, we found them incomplete: they miss many incidents and omit much of the interesting data about each one.

Breach Level Index has records from 2013, and it has redacted organization names. Identity Theft Resource Center has a truncated set of incidents in 2012 and 2013. The public Verizon dataset (VERIS) has been reducing operations since 2013. None of these projects is staffed or supported well enough by the community to be comprehensive.

All US states now have privacy laws that force companies to disclose breaches of consumer data to the attorney general of the state where each affected consumer resides. Many of those attorney general offices publish the notifications, which gives some visibility into the security incidents happening in the US.

There are many other interesting sources of public data one would also like to aggregate and merge with the incident data, so we built a machine aggregation system for security incidents, and designed a human annotation layer on top for a final pass of incident metadata enrichment.

Show me the data

Number of incidents throughout the years in our automated pipeline (‘Cyber Risk Dataset’) compared against public aggregators

We currently have around 37,500 records automatically identified as unique incidents. Of those, we have manually annotated approximately 2,750 records, which cover all security incidents related to publicly traded companies and all companies that have suffered a data breach of over 1 million records.
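The selection rule for manual annotation described above can be written as a small predicate. This is only a sketch of the stated criteria: the function name, the argument names, and the choice to treat an unknown record count as "not over the threshold" are assumptions, not details from our pipeline.

```python
from typing import Optional

# Threshold taken from the text: breaches of over 1 million records.
RECORD_THRESHOLD = 1_000_000


def selected_for_manual_annotation(
    is_publicly_traded: bool, records_breached: Optional[int]
) -> bool:
    """Return True if an incident falls in the manually annotated subset.

    Assumption: an unknown record count (None) does not qualify on size alone.
    """
    if is_publicly_traded:
        return True
    return records_breached is not None and records_breached > RECORD_THRESHOLD
```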

We are releasing the manually annotated data in a small private beta with top academic and industry security risk researchers, to understand how useful this can be to the community, and how we can improve it further.

The annotated data includes basic information about the incident (organization, publication and incident dates, and the number of records breached), and whether different record types (PII, PHI, PCI, PFI, SSNs, passwords) have been affected in the incident.
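To give a feel for the shape of one annotated record, here is a minimal Python sketch. The field names, the types, and the validation step are illustrative assumptions; only the list of fields and record types comes from the description above, and the actual beta schema may differ.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set

# Record types named in the post; representing them as strings is an assumption.
RECORD_TYPES = {"PII", "PHI", "PCI", "PFI", "SSN", "PASSWORD"}


@dataclass
class AnnotatedIncident:
    """Hypothetical shape of one manually annotated incident."""

    organization: str
    publication_date: date
    incident_date: Optional[date]      # not always known
    records_breached: Optional[int]    # not always disclosed
    record_types: Set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Reject labels outside the known record types.
        unknown = self.record_types - RECORD_TYPES
        if unknown:
            raise ValueError(f"unknown record types: {sorted(unknown)}")

    def affects(self, record_type: str) -> bool:
        """Whether the given record type was affected in this incident."""
        return record_type in self.record_types
```

A record for a made-up organization could then be built with `AnnotatedIncident("Example Corp", date(2019, 1, 15), None, 1200, {"PII", "SSN"})`.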

Number of records breached per record type (vertical axis is on logarithmic scale)

Since our models are aligned with the CIS framework and we want to estimate losses, we are also annotating which CIS controls were affected in the attack and the losses incurred in the aftermath of the incident. We have also tried to divide the losses into different loss categories in order to obtain a more meaningful categorization.

The aggregation process and architecture

We aggregate data from attorney general websites from 14 different states, 4 different data aggregators, and our team of manual annotators.

Although the incident data itself is small, we built our pipeline to measure cyber risk globally and have aggregated data on over 1.4 million companies. The underlying pipeline therefore operates at a larger scale than the private beta incident dataset might imply.

This pipeline was built in Scala using Spark, with Python for modeling. To find duplicates in the data (deduplication), we use Named Entity Recognition (NER) to identify the same company entity when different sources refer to it differently. For example, the company Apple can be referred to as ‘Apple’, ‘Apple, Inc.’, ‘Apple Computers’, etc.
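To illustrate the deduplication problem, here is a deliberately simple Python sketch that groups name variants under one key. It is a rule-based stand-in, not the NER approach our pipeline uses: the suffix list, the alias table, and both function names are hypothetical and chosen only to cover the Apple example above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical list of legal suffixes to drop from the end of a name.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "co", "company", "llc", "ltd", "limited"}

# Hypothetical alias table for variants that suffix stripping cannot resolve.
ALIASES = {"apple computers": "apple", "apple computer": "apple"}


def canonical_name(raw: str) -> str:
    """Lowercase, drop punctuation and trailing legal suffixes, apply aliases."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    tokens = cleaned.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    name = " ".join(tokens)
    return ALIASES.get(name, name)


def group_by_company(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group raw organization strings by their canonical name."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for raw in names:
        groups[canonical_name(raw)].append(raw)
    return dict(groups)
```

With this sketch, `group_by_company(["Apple", "Apple, Inc.", "Apple Computers"])` places all three strings under the key `"apple"`. Rules like these break down quickly on real data, which is part of why an entity recognition step is attractive.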

The annotation process and toolchain

Annotating incidents, which means deduplicating records and then enriching each attack with metadata, is quite complex. Annotators first need to select which rows of information refer to the same event, and then add information to all of those rows. Since this type of annotation is quite specific, and annotation work done on raw data in spreadsheets is borderline impossible to track, we developed an annotation toolchain to speed up the process. Here’s a screenshot:

Crowdsourcing the annotations didn’t prove successful, which is often the case for complex, deeply domain-specific tasks. Even after iterating on the annotation guidelines a few times, we weren’t able to get the accuracy above 25%.

So we built our own team of annotators, which got us to the level of quality we were hoping for. Being able to directly instruct handpicked annotators, combined with developing an intuitive and efficient annotation tool, was the sweet spot that allowed us to annotate the data more effectively. Since we have our own team, we were able to personally train them to annotate data that takes expertise to understand, such as the CIS controls affected in each incident and the categories of losses incurred.
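The two-step annotation flow described earlier in this section (select the rows that refer to the same event, then add information to all of them) can be sketched in Python as follows. The `SourceRow` class, its fields, and `annotate_event` are hypothetical names for illustration and do not describe the internals of our tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class SourceRow:
    """Hypothetical raw row as collected from one source."""

    row_id: int
    source: str
    organization: str
    event_id: Optional[int] = None
    annotations: Dict[str, object] = field(default_factory=dict)


def annotate_event(
    rows: List[SourceRow],
    selected_row_ids: Iterable[int],
    event_id: int,
    annotations: Dict[str, object],
) -> List[SourceRow]:
    """Link the selected rows to one event and copy the annotations to each."""
    selected = set(selected_row_ids)
    missing = selected - {row.row_id for row in rows}
    if missing:
        raise KeyError(f"unknown row ids: {sorted(missing)}")

    touched = []
    for row in rows:
        if row.row_id in selected:
            row.event_id = event_id              # step 1: same event
            row.annotations.update(annotations)  # step 2: shared metadata
            touched.append(row)
    return touched
```

Applying the same annotations to every linked row in one call is the kind of bookkeeping that is hard to keep consistent by hand in a spreadsheet.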

Private beta results so far

We carefully selected a set of the world’s best security researchers to participate in the private beta of our dataset and give us feedback on the quality and completeness of the data, as well as on our strategy for engaging the community.

The responses so far have been positive. Many researchers noted the diminishing quality of existing breach datasets and a clear need for a more actively maintained dataset. We also received requests for additional metadata such as attack types and actor types, and we’re keen to incorporate these into a revision of the annotation methodology.

We welcome any experts who can help scrutinize the data, improve its quality, or point out better methods for the aggregation or annotation processes described above.

The release is limited to a private beta. To take part, email your background and use case to: miguel.pinheiro@towerstreet.co


Written by Tower Street