Adventures in S3 Scraping — Part 1

Adam Myatt
8 min read · Nov 29, 2022

You wouldn’t believe the weird files people share in S3 buckets. Things they probably shouldn’t. Things they definitely shouldn’t.

Just this month I’ve seen files such as:

  • License keys to a company’s software.
  • Excel doc from a media-related conference containing names, emails, and phone numbers of local media employees in 9 different states.
  • PowerPoint doc for a Latin American division of a major tech company marked “Proprietary and Confidential” (my Spanish is a little rusty) that I believe listed costs/fees/margins for proposed services.
  • 45,000–50,000 PDF documents containing customer PII (full names, email addresses, phone numbers, and locations) for a company that went out of business in the last 2 years. (I’ve reported the bucket to AWS Support and will discuss the issue in more detail IF they take action and protect the data in the near future.)
  • High-resolution scan of someone’s driver’s license, insurance card, and vehicle registration from 2015–2016. (In an attempt at decency I tried contacting this person by looking him up on LinkedIn and Google, only to find that someone with the same name, date of birth, and address was convicted of a fairly serious crime in 2021 and is now serving time, so I won’t lose sleep over his 7-year-old documents being out there on the Internet.)

In this series of articles I will cover the steps I took to identify, validate, and enumerate publicly viewable content in S3 buckets. I performed this same activity more than 10 years ago and, much to my surprise, found an astonishing amount of content that should not have been public. It felt like an odd kind of voyeurism: seeing things I shouldn’t. Watching out the window as a neighbor struggles to walk down an icy driveway without falling over for the 4th time. ‘People watching’ at the mall. Watching people on security cameras when they don’t know they are being surveilled. In this case it was seeing files meant to be shared with a small group of people and not the entire world.

I presumed that over the last 10 years people would have become better equipped, with better tooling, to lock down their content and share only what should be shared. I was wrong. So why bother repeating this activity 10 years later? For fun and awareness. It has also been a few years since I’ve written any substantial amount of code, so I wanted to revisit and sharpen some skills.

How Can You View a Bucket’s Files?

First, I wanted to identify existing S3 buckets that may be out there in a specific AWS region (us-east-1, us-west-2, etc.). If I had a valid name for a bucket I could then try it in a web browser and see if it listed the contents of the bucket. That doesn’t mean the files themselves can be viewed, but it would at least let me see the file paths for future investigation of those files.

For example, if the bucket name “MyObviouslyFakeBucket” were publicly viewable and located in the AWS “US East 1” region you could potentially see the contents in your web browser by visiting https://myobviouslyfakebucket.s3.us-east-1.amazonaws.com/

This would return a listing similar to the following partially redacted image.

Files contained in the myobviouslyfakebucket S3 bucket.
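
Since the screenshot above is partially redacted, here is a rough approximation of what that listing XML looks like for the fake bucket. The overall element layout follows S3’s standard list response; the single “Contents” entry and its values are illustrative, not copied from real output.

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>myobviouslyfakebucket</Name>
  <Prefix></Prefix>
  <Marker></Marker>
  <MaxKeys>1000</MaxKeys>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>interesting-text-file.txt</Key>
    <LastModified>2022-11-28T15:30:00.000Z</LastModified>
    <Size>1024</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>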

In the XML document that is displayed as a result you can see a file entry under each “Contents” tag. Within each “Contents” node there is a “Key” node that holds the path and name of that file. Thus, for the file “interesting-text-file.txt” you could test access to the file by appending the path onto the end of the bucket URL, such as the following:

https://myobviouslyfakebucket.s3.us-east-1.amazonaws.com/interesting-text-file.txt

If the file could be viewed it would open in the browser or trigger an automatic download (depending on the file type and your browser). If you didn’t have access you would see an XML result essentially showing an “Access Denied” message.

Despite the best efforts of AWS, there are still people who set content to be publicly viewable when they shouldn’t. I haven’t worked with S3 in a few years so I may be missing something, but there seem to be MULTIPLE steps involved before your bucket can be easily enumerated and viewed in a web browser.

When creating a new bucket, the “Block all public access” setting is checked by default. You have to un-check it and then select the “I acknowledge” checkbox further down. See below.

Privacy settings for new S3 bucket.

Even with my bucket configured to not ‘block all public access’, when I added two sample text files I still could not list the bucket contents in my web browser until I added a JSON bucket policy explicitly granting public access.

{
  "Version": "2012-10-17",
  "Id": "Policy1669653712601",
  "Statement": [
    {
      "Sid": "Stmt1669653708988",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::myobviouslyfakebucket"
    }
  ]
}

After applying the bucket policy I was able to view the bucket and its contents. How other people go through these steps and STILL end up sharing content they should not is beyond me.

Overall Approach

So how do you list bucket contents without guessing randomly? My two main options for going beyond a single bucket name of my own invention were to use a dictionary file paired with either the AWS SDK or simple HTTP GETs.

I chose to write Java code using simple HTTP GETs for a first version. This avoided the need to learn the updated AWS Java SDK v2. The last time I used the AWS Java SDK it was v1, and enough of it has changed that I didn’t want a learning curve to delay my progress. It also let me avoid setting up AWS credentials for the SDK and dealing with any SDK-specific errors or oddities during testing. Keep it simple.

I started with an existing dictionary file I had from a previous personal project. This was a flat text file with one word per line. At one point I broke it up into 8 or 10 separate files with each file containing the entries for 1–3 letters depending on the number of entries. This allowed me to process a smaller number of entries at a time more easily. You can search online for a dictionary file as there are plenty available.
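
If you want to do a similar split of a dictionary file, here is a minimal sketch of one approach, simplified to one output file per starting letter rather than the 1–3 letter grouping I used; the DictionarySplitter class name and the file paths are just placeholders.

import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class DictionarySplitter {

    public static void main(String[] args) throws IOException {
        // placeholder input path; one word per line
        Path dictionary = Paths.get("c:/projects/s3fun/dictionary.txt");
        Map<Character, PrintWriter> writers = new HashMap<>();
        try {
            for (String word : Files.readAllLines(dictionary)) {
                if (word.isEmpty()) {
                    continue;
                }
                char first = Character.toLowerCase(word.charAt(0));
                // one output file per starting letter, e.g. dictionary_a.txt
                PrintWriter out = writers.computeIfAbsent(first, c -> {
                    try {
                        return new PrintWriter(Files.newBufferedWriter(
                                dictionary.resolveSibling("dictionary_" + c + ".txt")));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                out.println(word);
            }
        } finally {
            writers.values().forEach(PrintWriter::close);
        }
    }
}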

I sketched out the steps I needed to program on 2 post-it notes. They were as follows:

  • Parse the dictionary file to retrieve each word entry.
  • For each word in the list construct the URL to check using the word as a bucket name and the AWS region (hard-coded for now to “US East 1”).
  • Try connecting to the URL to perform a GET operation.
  • Retrieve the response code sent from the server.
  • If the response code indicated success (the bucket exists) add the word to a data structure.
  • Save the successful words in the data structure to a flat text file for later investigation.

There are additional steps I would need to perform in future iterations, but these were my initial goals just to find a list of existing buckets. Simple and straightforward.

Parse the dictionary file

private void populateList(List<String> words, String dictionaryFile) {

    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(new File(dictionaryFile)));
        String line;
        while ((line = br.readLine()) != null) {
            words.add(line);
        }

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (br != null) {
                br.close();
            }
        } catch (Exception e) { }
    }
}

In the code above I called the populateList method with an empty ArrayList and a String containing the full path to my first dictionary file, such as “c:\\projects\\s3fun\\dictionary_abc.txt”. The code parsed the file and put the entry on each line into the ArrayList.
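
In code, that call looks roughly like this (using the example path above):

List<String> words = new ArrayList<>();
populateList(words, "c:\\projects\\s3fun\\dictionary_abc.txt");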

Construct the URL

String currentRegion = "us-east-1";
int wordSize = words.size();

for (int i = 0; i < wordSize; i++) {

    String bucketName = words.get(i);

    String sUrl = "https://" + bucketName + ".s3." + currentRegion + ".amazonaws.com";
    URL url = new URL(sUrl);

    // do something with the URL
}

In this step I looped through the ArrayList of words and created the URL using the word as a bucket name and a hard-coded region.

Perform a GET operation

String sUrl = "https://" + bucketName + ".s3." + currentRegion + ".amazonaws.com/";
URL url = new URL(sUrl);

HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.connect();

This step is pretty easy. I used the URL to open a connection, set the request method to “GET”, and connected. That is roughly equivalent to pasting a URL into your web browser’s address bar and pressing Enter.
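
One tweak worth considering when probing thousands of names (not shown in the snippet above): set connect and read timeouts on the connection so a single slow or unresponsive host doesn’t stall the whole run.

// optional addition, set before connect(): bound how long each probe can take
connection.setConnectTimeout(5000); // up to 5 seconds to establish the connection
connection.setReadTimeout(5000);    // up to 5 seconds to wait for a response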

Retrieve the response code and store it

int respCode = connection.getResponseCode();

if (respCode == 200) {
    code200s.add(bucketName + "," + currentRegion);
}

I retrieved the response code from the server and checked whether it was an HTTP 200, indicating the request succeeded. You can look up the full list of HTTP status codes if interested.

Store the valid bucket names

private void writeCode200s(List<String> validBuckets, String parentDirectory) {

    if (validBuckets == null || validBuckets.isEmpty()) {
        return;
    }

    BufferedWriter bw = null;

    try {
        // make sure the output directory exists before writing into it
        File parentDir = new File(parentDirectory);
        if (!parentDir.exists()) {
            parentDir.mkdirs();
        }

        FileWriter writer = new FileWriter(new File(parentDir, "valid_buckets_" + System.currentTimeMillis() + ".txt"));
        bw = new BufferedWriter(writer);

        for (int i = 0; i < validBuckets.size(); i++) {
            String bucketName = validBuckets.get(i);

            bw.write(bucketName);
            bw.newLine();
        }

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (bw != null) {
                bw.close();
            }
        } catch (Exception e) { }
    }
}

In the code above I made sure a valid parent directory exists and then wrote a file into it, using a timestamp in the file name to ensure uniqueness so earlier results don’t get overwritten. In the file I wrote each entry, one per line.
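
As a rough sketch of how the pieces fit together, a driver method could look something like the following. The checkBuckets name is just a placeholder; populateList and writeCode200s are the methods shown above, and the java.net / java.util imports are omitted as in the other snippets.

private void checkBuckets(String dictionaryFile, String outputDirectory) {

    // Step 1: load the dictionary words.
    List<String> words = new ArrayList<>();
    populateList(words, dictionaryFile);

    String currentRegion = "us-east-1";
    List<String> code200s = new ArrayList<>();

    // Steps 2-5: build each URL, issue a GET, and keep names that return 200.
    for (String bucketName : words) {
        try {
            String sUrl = "https://" + bucketName + ".s3." + currentRegion + ".amazonaws.com/";
            HttpURLConnection connection = (HttpURLConnection) new URL(sUrl).openConnection();
            connection.setRequestMethod("GET");
            connection.connect();

            if (connection.getResponseCode() == 200) {
                code200s.add(bucketName + "," + currentRegion);
            }
            connection.disconnect();
        } catch (Exception e) {
            // a bad name or network hiccup shouldn't stop the rest of the run
        }
    }

    // Step 6: persist the hits for later investigation.
    writeCode200s(code200s, outputDirectory);
}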

That’s it. Simple and straightforward. It might not be the most elegant solution, but it worked and gave me a base to expand and improve on. I identified thousands of valid S3 buckets just from scraping several letters of the alphabet. From just my first few attempts I found catalogues of MP3 files, millions of images, countless log files, and more.

I have since deleted my test bucket “myobviouslyfakebucket” so feel free to claim the name if you want it. In the next few parts of this article series I will highlight additional steps to enhance this solution such as:

  • Handling and storing response codes besides 200 (OK), what they mean, and what you can do with that info.
  • Using the list of valid bucket names to see if you can enumerate the files in each bucket.
  • Parsing the list of bucket files to capture the path and name of individual files.
  • Filtering files by file extension to ignore unwanted noise.
  • Paginating file results for S3 buckets with more than 1,000 files.
  • Checking lists of files to see if individual files can be downloaded.

And of course I will also highlight any unusual or interesting files or situations I find.

Part 2: https://medium.com/@adamlmyatt/adventures-in-s3-scraping-part-2-da48106c041a
