Challenges and Strategies for Data Leak Investigations (Part 2/2)

Hari · Published in Borneo · Feb 8, 2022

Data Investigation picture from https://aceds.org

As we discussed in Part 1 of our story, Why Is Data Investigation Critically Important in the Age of Data Breaches?, data investigations are performed by security teams to understand the sensitive data stored in their infrastructure and the potential risk of it being exposed internally or externally, to ensure compliance with the law of the land, and to safeguard customer data.

Data investigations are usually carried out after a data breach or data leak to understand the extent and severity of the incident. But performing structured, continuous data investigations preemptively helps organisations eliminate or minimize the risks associated with data leaks, data breaches, and customer data privacy.

With data volumes ever increasing, more and more countries are putting Data Protection and Privacy Legislation in place to ensure the protection of data and privacy, so it is going to be important for every company to understand its level of compliance and the associated risk.

However, data investigation within any company is not a simple affair. Without the right tool in place, it costs a lot of time, resources, and money. With that said, let us look at the challenges involved in carrying out a data investigation the traditional way.

Unstructured Data

The approaches involved in examining unstructured data differ completely from those for structured data, and each has its own challenges in identifying sensitive data.

Usually, we consider storage services like AWS S3, Azure Blob Storage, Google Cloud Storage, Google Drive, Dropbox, etc. as the sources of unstructured data, but many other services like Email, Slack, Jira, and Confluence contribute a major share. Connecting to each of these services is a challenge in itself, as it involves getting familiar with their SDKs/APIs, keeping track of their updates, and processing the data fetched from them.

Investigating unstructured data requires context from the source and the neighboring text. For instance, consider a log line that contains two 16-digit numbers: #1, a random 16-digit number, and #2, a valid credit card number. If we don't rely on the neighborhood context, both #1 and #2 would be flagged as credit card numbers.
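A minimal sketch of the idea: only flag a 16-digit value when its immediate context (here, the log key it is attached to) looks payment-related. The key names, hint list, and log format are illustrative assumptions, not part of any real detection engine.

```python
import re

# Hypothetical list of context words that suggest a payment field.
PAYMENT_HINTS = ("card", "cc", "pan")

def contextual_card_candidates(log_line: str) -> list[str]:
    """Flag a 16-digit value only when its key (the immediate
    neighborhood context) looks payment-related."""
    hits = []
    for key, value in re.findall(r"(\w+)=(\d{16})", log_line):
        if any(h in key.lower() for h in PAYMENT_HINTS):
            hits.append(value)
    return hits

# #1 is a random 16-digit order id; #2 sits next to a card-related key.
line = "order_id=1234567890123456 card_no=4111111111111111 status=ok"
print(contextual_card_candidates(line))  # ['4111111111111111']
```

Without the key-name check, a bare `\d{16}` regex would flag both numbers.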

Structured data

By convention, structured data is data stored in databases or in structured files like CSVs, Excel spreadsheets, XML, and JSON. Though these can easily be understood by machines and can be queried directly in some sources like databases, understanding the data is not that easy, as it involves consuming a lot of meta-information.

For instance, consider two CSV files: one named “Data.csv” with a column `ssn_id`, and another named “SSN.csv” with a column `id`.

In the file “Data.csv”, it is relatively easy to tell that the column `ssn_id` contains SSN numbers.

In the case of “SSN.csv”, you wouldn't know the file contains SSN numbers without considering the file name. And even if we predict that the file contains SSN numbers based on its name, we wouldn't know with certainty which column contains them unless we look at every token in the file or build a good machine learning model.
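The file-name and column-name signals can be combined with a value-pattern check, as in this sketch. The hint list, function name, and the rule of when each signal suffices are illustrative assumptions.

```python
import csv
import io
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
SSN_HINTS = ("ssn", "social_security")

def likely_ssn_columns(filename: str, csv_text: str) -> list[str]:
    """Guess which columns hold SSNs by combining metadata hints
    (file name, column name) with a value-pattern check."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    flagged = []
    for col in reader.fieldnames or []:
        name_hint = any(h in col.lower() for h in SSN_HINTS)
        file_hint = any(h in filename.lower() for h in SSN_HINTS)
        values = [r[col] for r in rows if r[col]]
        pattern_hit = bool(values) and all(SSN_PATTERN.match(v) for v in values)
        # A column-name match alone is a strong signal; with only a
        # file-name hint we also require the values to look like SSNs.
        if name_hint or (file_hint and pattern_hit):
            flagged.append(col)
    return flagged

# "Data.csv": the column name itself reveals the SSN field.
print(likely_ssn_columns("Data.csv", "ssn_id,name\n078-05-1120,Alice\n"))  # ['ssn_id']
# "SSN.csv": only the file name plus the value pattern reveal it.
print(likely_ssn_columns("SSN.csv", "id,name\n078-05-1120,Alice\n"))  # ['id']
```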

Challenges of unstructured data within structured data, and vice versa

Another interesting dimension to the unstructured and structured data is one containing the other.

E.g., JSON/XML inside a Word doc, or a chat conversation stored in a blob column of a database.

(Image: a Word file containing structured data)
(Image: database query results showing a blob column)

Things get much trickier here, as you cannot isolate and deal with one type of data anymore.

File Types

Understanding the sensitive data inside various file types (zip, Excel, PDF, images, Parquet, ORC, etc.) is a huge challenge.

Identifying sensitive data in each of these involves understanding their metadata, extracting content, and keeping track of the location of tokens.
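For archive formats, even enumerating the content while keeping track of where each token lives requires recursive extraction. A minimal sketch for zip files, using Python's standard `zipfile` module (the function name and path convention are assumptions):

```python
import io
import zipfile

def walk_zip(data: bytes, path: str = "") -> list[str]:
    """Recursively list file entries inside a zip (including nested
    zips), keeping the full path so findings can be located later."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            full = f"{path}/{info.filename}".lstrip("/")
            if info.filename.lower().endswith(".zip"):
                entries.extend(walk_zip(zf.read(info), full))
            else:
                entries.append(full)
    return entries

# Build a nested zip in memory to demonstrate.
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as zf:
    zf.writestr("users.csv", "id,ssn\n1,078-05-1120\n")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as zf:
    zf.writestr("archive.zip", inner.getvalue())
    zf.writestr("notes.txt", "hello")

print(walk_zip(outer.getvalue()))
# ['archive.zip/users.csv', 'notes.txt']
```

Each entry's full path (e.g. `archive.zip/users.csv`) is exactly the location metadata a scanner must retain to report findings.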

Data Hierarchy

While processing the data, we encounter different hierarchies in different sources: in Excel we have sheets and columns; in a database we have databases, schemas, tables, and columns; whereas JSON/XML has a dynamic hierarchy. As we saw above in the file-name example, this hierarchy metadata plays a critical role in identifying and classifying sensitive data.
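One way to keep that hierarchy metadata attached to every token is to flatten a document into (path, value) pairs, as in this sketch for JSON's dynamic hierarchy (the path syntax is an assumption):

```python
import json

def flatten_paths(node, prefix=""):
    """Walk a JSON document and yield (path, value) pairs so the
    hierarchy metadata travels with every token."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield from flatten_paths(v, f"{prefix}.{k}" if prefix else k)
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from flatten_paths(v, f"{prefix}[{i}]")
    else:
        yield prefix, node

doc = json.loads('{"customer": {"ssn": "078-05-1120", "orders": [{"id": 42}]}}')
print(dict(flatten_paths(doc)))
# {'customer.ssn': '078-05-1120', 'customer.orders[0].id': 42}
```

A path like `customer.ssn` carries the same kind of classification signal as a column name in a CSV or a table name in a database.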

Also, some compressed formats can expand into huge volumes on extraction, and non-text file types cannot be streamed for processing.

Normal Regular Expressions wouldn’t work

An initial attempt to identify sensitive data in any data source would be to use regexes to match tokens. But this approach misses the context from the neighborhood or metadata, which plays a critical role in improving accuracy; bank account numbers, for example, may not follow a specific pattern, so we need to rely on context. On top of that, each identified token needs multiple validations; most credit card numbers, for instance, have to be validated with the Luhn algorithm.

Identifying the Data Source

Even after finding the sensitive data, remediation always needs a way to identify the source of the data. For example, when we notice credit card numbers in log files, tracing them to the code or service that writes the log is important. Another example: if SSN numbers are identified in a Slack chat, we would like to know their source and whether they map to any customer data.

The challenges identified above are complex problems in their own right. Attempting to solve them the traditional way would require a huge engineering effort, shifting focus from the business to data investigation.

At Borneo, we are on a mission to help companies solve the problems of customer data privacy and data-law compliance. Our Data Investigation Platform has helped customers investigate their data by providing initial insights in less than a day, minimizing their data privacy risk, and expediting their data compliance process by 12x.

Want to try our Data Investigation Platform? Request a demo with us!
