S3Insights: Derive insights about your S3 environment at scale

Ashish Kurmi
6 min read · Sep 8, 2020


Introduction

Amazon Simple Storage Service (S3) is a cloud storage service that offers a durable, highly available, and scalable data storage infrastructure at a very low cost.

For large enterprise S3 customers, as service usage grows across the organization in a decentralized manner, it becomes increasingly difficult for security and privacy teams to keep track of the types of data in S3 and maintain a good understanding of their cloud storage environment. It’s not uncommon for enterprise S3 environments to contain a large volume of heterogeneous data. A while ago, I started looking into building practical solutions to analyze S3 data at scale in my free time. I created a platform named S3Insights to capture and analyze S3 metadata to derive security insights. This article covers the background and high-level details about the platform.

Why not simply analyze all the S3 content?

This is difficult to do, and is neither economical nor sustainable at scale. There are several commercial solutions that help discover specific data types and provide visibility into S3 by leveraging content analysis. Most of these solutions download S3 objects to analyze them locally using various techniques. Unfortunately, this methodology requires significant network, compute, and memory resources and cannot scale effectively. For S3 environments with a large number of objects or high data volume, these solutions become impractical. Some of these solutions support sampling, but random sampling does not effectively prioritize the riskiest objects for more in-depth analysis. For these reasons, most cloud users end up building custom solutions using vendor technologies to analyze a subset of their data.

Can we start with metadata analysis instead?

Instead of using content analysis as the go-to method, it might be more effective to first examine object metadata. For example, if we come across a 50 GB object in S3 with the object key suffix ‘.sql’, we can make an educated guess that it’s probably a SQL database backup without looking at the object content. User-provided object tags on top of such system attributes may provide additional details such as the specific enterprise service that generated the database backup.
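This kind of metadata-only reasoning is easy to sketch in code. The suffix-to-type mapping and size threshold below are illustrative assumptions, not part of S3Insights itself:

```python
# Hypothetical sketch: guess an object's data type from its key suffix
# and size alone, without downloading any content. The mapping and the
# 1 GiB confidence threshold are illustrative assumptions.

SUFFIX_HINTS = {
    ".sql": "SQL database dump",
    ".bak": "database backup",
    ".pem": "private key / certificate",
    ".csv": "tabular data export",
}

def guess_data_type(key: str, size_bytes: int) -> str:
    """Return a best-effort label for an object based on metadata alone."""
    for suffix, label in SUFFIX_HINTS.items():
        if key.lower().endswith(suffix):
            # Large objects with a matching suffix are higher-confidence guesses.
            confidence = "likely" if size_bytes > 1 << 30 else "possibly"
            return f"{confidence} {label}"
    return "unknown"

print(guess_data_type("backups/prod-users.sql", 50 * (1 << 30)))
# -> likely SQL database dump
```

In a real deployment, user-provided object tags could feed into the same heuristic to attribute objects to specific services.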

To use available resources effectively, I came up with a hierarchical approach consisting of three layers. The steps become more expensive as we go up the hierarchy. Each step can provide insights on its own as well as prioritize S3 objects for the next step up.

S3 Inventory Analysis

At this step, we analyze S3 inventory information for all objects. This step could be performed at scale because S3 provides a built-in feature to generate and store inventory details in an easily queryable format.
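As a minimal sketch of this step, the snippet below filters parsed inventory rows to prioritize objects for deeper analysis. The field names mirror the S3 inventory schema (key, size), but the suffix list and size threshold are assumptions for demonstration:

```python
# Illustrative sketch: prioritize objects from parsed S3 inventory rows.
# The risky suffixes and 100 MiB threshold are demonstration assumptions.

RISKY_SUFFIXES = (".sql", ".bak", ".dump")
MIN_SIZE = 100 * 1024 * 1024  # 100 MiB

def prioritize(inventory_rows):
    """Return rows worth deeper analysis, largest first."""
    candidates = [
        row for row in inventory_rows
        if row["key"].lower().endswith(RISKY_SUFFIXES) or row["size"] >= MIN_SIZE
    ]
    return sorted(candidates, key=lambda r: r["size"], reverse=True)

rows = [
    {"key": "logs/app.log", "size": 2_048},
    {"key": "backups/db.sql", "size": 5 * 1024**3},
    {"key": "exports/huge.parquet", "size": 300 * 1024**2},
]
for row in prioritize(rows):
    print(row["key"])
```

In practice, this filtering happens as SQL queries over the consolidated inventory database rather than in application code, which is what lets the step run at scale.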

User-Defined Metadata Analysis

The objects that get prioritized by the previous step would be analyzed for user-provided metadata. This step is more expensive than the last step, as it entails one S3 API call per object. The API response, however, would be small as it would only contain object metadata.
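A sketch of this step, assuming boto3's S3 HeadObject API: one call per prioritized object, returning user-defined metadata under the response's "Metadata" field without touching the object body. The client is passed in so a stub can stand in for boto3 here; in a real deployment it would be `boto3.client("s3")`:

```python
# Sketch of the user-defined metadata step: one HeadObject call per
# prioritized object. FakeS3Client is a demonstration stand-in for boto3.

def fetch_user_metadata(s3_client, bucket, keys):
    """Return {key: user-defined metadata dict} via HeadObject calls."""
    results = {}
    for key in keys:
        response = s3_client.head_object(Bucket=bucket, Key=key)
        # HeadObject responses carry user-defined metadata under "Metadata";
        # the object body is never downloaded, so each call stays small.
        results[key] = response.get("Metadata", {})
    return results

class FakeS3Client:
    """Stand-in for boto3's S3 client, for demonstration only."""
    def head_object(self, Bucket, Key):
        return {"Metadata": {"service": "billing", "owner": "data-team"}}

meta = fetch_user_metadata(FakeS3Client(), "example-bucket", ["backups/db.sql"])
print(meta["backups/db.sql"]["service"])
```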

Content Analysis

The objects that get filtered through the previous two steps would go through content analysis, which is the most expensive step in the pipeline.

In the above workflow, the User-Defined Metadata analysis step is optional. For example, an organization that doesn’t use user-defined metadata for objects can go directly from Inventory Analysis to Content Analysis.

S3Insights

When I started looking for practical solutions, S3 inventory appeared very promising for analyzing system metadata at scale. Unfortunately, I couldn’t find much guidance on how to leverage this feature for deriving security insights. I ended up building a solution called S3Insights to perform the first step of this approach. I have some ideas about the next two steps, but I don’t have any concrete implementations yet. Today, I am open sourcing S3Insights so that anyone can use the platform in their AWS environment and contribute to the project. S3Insights is a platform for efficiently deriving security insights about S3 data through system metadata analysis. Rather than analyzing the content of individual objects, S3Insights harvests S3 inventory data from multiple buckets in a multi-account environment to help discover and manage sensitive data.

The goals of S3Insights are as follows:

  1. Provide an easy-to-use, modular, and extensible platform to harvest S3 inventory from multiple AWS accounts and build a queryable database on top of the consolidated inventory. This would allow users to execute analysis queries in an automated or manual manner.
  2. No interference with existing AWS workloads. Cloud users should be able to deploy the platform in their environment without requiring any changes to their AWS workloads.
  3. Build a knowledge base of analysis queries for interesting S3 scenarios. This is one of the motivations behind open-sourcing this project.
  4. Help users derive insights about S3 data at scale in an efficient way.
  5. Help users prioritize S3 objects for content/user tag analysis related initiatives. For example, the platform should help with performing malware scans in an enterprise S3 environment efficiently.

Bridging AWS feature gaps

S3Insights uses an existing S3 feature called S3 Inventory along with several serverless AWS technologies to provide a scalable platform. The platform overcomes several limitations of built-in AWS features:

  1. There is no central way to manage S3 inventory at the AWS organization or account level. One can only manage S3 inventory configurations for each bucket individually.
  2. It’s not possible to centralize S3 inventory collection. S3 inventory details can only be stored in an S3 bucket in the same region as the source bucket.
  3. There is no easy way to generate a one-time S3 inventory. Once the inventory feature is enabled, it keeps producing inventory reports at the scheduled cadence.
  4. There is no built-in way to track inventory completion. The inventory generation pipeline can take up to 24 hours, and the built-in feature does not provide a good way to track progress. This limitation becomes more severe in a multi-account environment.

AWS Wishlist

If S3 inventory included user-defined metadata, a platform like S3Insights could be repurposed to analyze it. Implementing this feature request would shorten the hierarchical methodology to two steps and let us analyze user-provided metadata at scale without requiring explicit S3 API calls.

Call to action

To get started, follow the docs in the GitHub repo to install S3Insights in your AWS environment. The repository also contains more technical details about the platform. A practical approach would be to focus on inventory insight scenarios in the short term, such as discovering database dumps and validating assumptions. This way, you can deliver security value just by using the platform itself. In the long term, consider leveraging S3Insights to drive other S3-related initiatives such as watermarking sensitive S3 objects.

Future Directions

Machine Learning

My current focus has been on building Presto SQL analysis queries. I have not looked at leveraging Machine Learning for the consolidated inventory data. The inventory data can also be analyzed with tools such as AWS EMR, Apache Spark, and GCP BigQuery. There might be opportunities to build interesting machine learning models on top of this data.

Visualization

I have not explored this area in detail other than building a couple of basic AWS QuickSight dashboards using AWS Athena as described in the GitHub documentation. We might be able to create other interesting dashboards using AWS QuickSight, Google Data Studio, and other visualization solutions.
