Adventures in S3 Scraping — Part 2

Adam Myatt
13 min read · Dec 11, 2022


Among the countless files in S3 buckets live scores of product catalog images, reams of documentation, endless computer backups, legal adult content, and a myriad of personal files people have shared without realizing they are publicly accessible.

In Part 1 of this article series I explored the basics of S3 buckets, how to find them in a web browser, and how to programmatically search for and store them. In this article I will expand on this concept, share some additional odd content, and explore next steps for processing and searching through hundreds of millions of files.

The most common type of file I have found thus far in S3 buckets is the log file. Endless log files. There are multiple use cases for writing log files into S3 buckets for deeper analysis by a variety of tools for data security, analytics, usage tracking, etc. People have been generating log files like this for over a decade.

In one bucket I found a history of a specific named log file being generated every 5 seconds starting in early 2017. I scrolled through pages and pages of file names with timestamps confirming that this bucket contained a 414KB log file for every 5 seconds from January 1, 2017 through today, and still going. Let’s do some math (hopefully I get this right):

1 Log file every 5 seconds

12 per minute x 60 minutes x 24 hours = 17,280 log files per day

2,158 days since January 1, 2017 (as of the day I wrote this).

17,280 x 2158 = 37,290,240 log files.

Each log file is ~414KB, so 414KB x 37,290,240 = 15,438,159,360,000 bytes, or roughly 15 TB. The current monthly S3 storage cost alone is $0.023 per GB in that region for “S3 Standard” storage. Thus this person or company is spending about $355 USD per month to store their log files. That might not sound like a lot to some people, but that bill keeps getting higher every 5 seconds.
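Here are the same back-of-the-envelope numbers as a quick sanity check in code, using only the figures above:

// Back-of-the-envelope check of the storage cost, using the figures above.
long filesPerDay = 12L * 60 * 24;          // one file every 5 seconds = 17,280 per day
long days = 2158;                          // days since January 1, 2017
long totalFiles = filesPerDay * days;      // 37,290,240 files
long bytesPerFile = 414_000L;              // ~414KB per file
double totalGb = totalFiles * bytesPerFile / 1_000_000_000.0;  // ~15,438 GB
double monthlyCost = totalGb * 0.023;      // S3 Standard at $0.023 per GB, about $355 per month
System.out.printf("%,.0f GB, about $%,.2f per month%n", totalGb, monthlyCost);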

Let me dig into how I was able to find that information.

First, let’s briefly revisit a snippet of code from Part 1 of this article series.

int respCode = connection.getResponseCode();

if (respCode == 200) {
    code200s.add(bucketName + "," + currentRegion);
}

In the snippet of Java code above I retrieved an HTTP status code after having opened a connection to a URL and performed a simple GET request. I checked to see if the response code was 200 (OK) and, if so, added a comma-delimited string of the bucket name and the region as an entry to be written out to a flat file. When I tested hundreds of S3 bucket name URLs I quickly found that not all of them could be “viewed” or accessed. A large number returned response codes other than 200.
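For context, the connection in that snippet was opened against a region-specific bucket URL, roughly like this condensed sketch of the Part 1 setup:

// Condensed sketch of the setup the snippet assumes: build the region-specific
// bucket URL and issue a plain GET request.
String sUrl = "https://" + bucketName + ".s3." + currentRegion + ".amazonaws.com";

URL url = new URL(sUrl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.connect();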

I discovered the most common HTTP response codes returned when enumerating bucket names included:

  • 200 (OK) — The bucket name existed in that region and the URL should let me see a list of bucket entries (files). However, each file may or may not be secured individually.
  • 301 (Moved Permanently) — I found that in the world of S3 the 301 response code indicated that the bucket name DID exist, but not in the region I was attempting. More on this later.
  • 400 (Bad Request) — Represents an invalid bucket name. The “rules” of S3 bucket naming were violated.
  • 403 (Forbidden) — For S3 bucket URLs this essentially means the bucket exists, but access is denied.
  • 404 (Not Found) — The S3 bucket URL doesn’t exist. Nothing to see here.

Once I figured out the additional response codes I decided that, for now, I just wanted to capture them and write them to files. I could obviously have written code to act on them inline right then and there, but I wanted to keep my methods simple and clean. I also wanted processing to be quick so I could see results as fast as possible, knowing I could iterate later on.

In addition to the ArrayList for storing 200 response codes I added specific lists for response codes 301, 403, and any others I called “unknowns”.

List<String> code200s = new ArrayList<>(500);
List<String> code301s = new ArrayList<>(500);
List<String> code403s = new ArrayList<>(500);
List<String> codeUnknowns = new ArrayList<>(500);

In my looping code where I checked the status code I modified the check for 200s to include the following.

int respCode = connection.getResponseCode();

if (respCode == 200) {
    code200s.add(bucketName + "," + currentRegion);

} else if (respCode == 301) {
    // moved permanently
    String newRegion = connection.getHeaderField("x-amz-bucket-region");
    code301s.add(bucketName + "," + newRegion);

} else if (respCode == 403) {
    code403s.add(bucketName + "," + currentRegion);

} else if (respCode == 404) {
    // ignore it, it doesn't exist

} else {
    codeUnknowns.add(respCode + "," + bucketName + "," + currentRegion);
}

For 301 response codes I noticed in the response headers (thank you Google Chrome developer tools) that requests for buckets that existed (but were in a different region) returned an HTTP header named “x-amz-bucket-region” that identified the correct region for the bucket. This value could be captured and stored with the bucket name for later processing. I could have re-attempted to connect to the bucket in the correct region, but wanted to keep the code minimalist.
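For anyone curious, a minimal sketch of what that re-attempt could look like, reusing the same URL format as the rest of this article (probeBucketInRegion is a hypothetical helper name, not part of my actual code):

// Sketch only: re-probe a bucket in the region reported by the 301 response.
// Assumes the same java.net.URL and HttpURLConnection imports used elsewhere in this article.
private int probeBucketInRegion(String bucketName, String region) throws IOException {
    URL url = new URL("https://" + bucketName + ".s3." + region + ".amazonaws.com");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("GET");
    connection.connect();
    return connection.getResponseCode();
}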

Note that I intentionally ignore 404 response codes and do not store them anywhere. These buckets do not exist and I do not want to keep a record of them as many of the names that were tested did not exist. I also realize my if-elseif-else statements above would be much more elegant using a switch statement, but let’s ignore that for now.

After the loop finished processing all the possible bucket names I wrote the records in each data structure out to a separate file: one for the 200s, one for the 301s, one for the 403s, and one for all other response codes (excluding 404s). I could now choose to process these different files separately.
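Writing each of those lists out was straightforward; here is a minimal sketch of one such write (writeBucketList and outputDirectory are hypothetical names; the "valid_buckets_" prefix matches the convention described further below):

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

// Sketch only: write one list of "bucketName,region" entries to a timestamped file.
private void writeBucketList(List<String> entries, File outputDirectory, String prefix) throws IOException {
    File outFile = new File(outputDirectory, prefix + System.currentTimeMillis() + ".txt");
    try (BufferedWriter bw = new BufferedWriter(new FileWriter(outFile))) {
        for (String entry : entries) {
            bw.write(entry);
            bw.newLine();
        }
    }
}

// e.g. writeBucketList(code200s, outputDirectory, "valid_buckets_");

However, now that I had a list of hundreds of valid bucket names I wanted to explore a bit and see what might be out there and viewable simply by trying a bucket name in my web browser.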

The majority of what I found was fairly mundane and boring:

  • PDF mockups for packaging changes coming soon to a brand of adult washcloths.
  • Rehearsal videos of 2 salesmen practicing together on how to sell high-end cars in China and how to respect common Chinese business customs/manners.
  • Audio files for a sales pitch enticing people to buy the “secret to a better relationship” for only $499 USD.
  • Video replay files from Fortnite games

I also found things that people probably didn’t intend to be fully public:

  • Private videos of family events, vacation photos, etc.
  • Whole libraries of MP3 backups.
  • Excel docs with price estimates for everything from custom web development to landscaping.
  • Full-copy PDFs of tech books on JavaScript, Node, and other software topics.
  • A small Internet shop’s “Shopify XML feed files”, generated for some kind of API integration, with each file containing an ordered product along with the name, email, and shipping address of the customer.
  • Thousands of “Residential property reports” from 2019–2022 for properties in Brockton, Mass. and the surrounding towns. Most of the data was similar to what you’d find for a property on Zillow, but some of the notes and data fields seemed possibly private to the company that generated the reports.

Iterating the bucket contents

Trying every single valid bucket name in the web browser would take forever. Thus I wanted to program a way to iterate over each bucket and write out a list of its files that I could scroll through in a text file.

private void loadValidBuckets(List<S3Bucket> validBuckets, String validParentFolder) {

    File parentDirectory = new File(validParentFolder);
    File[] validFiles = parentDirectory.listFiles();

    if (validFiles == null) {
        return; // folder does not exist or is not a directory
    }

    for (int i = 0; i < validFiles.length; i++) {

        File validFile = validFiles[i];

        if (!validFile.getName().startsWith("valid_buckets_")) {
            continue; // skip to next file
        }

        // try-with-resources makes sure each reader is closed
        try (BufferedReader br = new BufferedReader(new FileReader(validFile))) {

            String line;
            while ((line = br.readLine()) != null) {

                S3Bucket buck = new S3Bucket();

                // each line is "bucketName,region"
                String[] lineParts = line.split(",");
                buck.setBucketName(lineParts[0].toLowerCase());
                buck.setRegion(lineParts[1]);

                validBuckets.add(buck);
            }

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

As referenced in the code above the class S3Bucket was created as a quick time-saving utility object to hold a few values:

public class S3Bucket {

    private String bucketName;
    private String region;

    public String getBucketName() {
        return bucketName;
    }

    public void setBucketName(String bucketName) {
        this.bucketName = bucketName;
    }

    public String getRegion() {
        return region;
    }

    public void setRegion(String region) {
        this.region = region;
    }
}

The “valid” buckets are stored in timestamped files such as “valid_buckets_#####.txt”. Each line in a file has a bucket name followed by a comma followed by the region name. Each line is split on the comma and the resulting values are stored in an S3Bucket object.
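A line in one of those files looks something like this (bucket name invented for illustration):

some-example-bucket,us-east-1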

I could then loop through the ArrayList of S3Bucket objects, construct URLs, connect, and if successful attempt to download and parse the list of file contents from the XML.

List<String> interestUrlsToCheck = new ArrayList<>(500000);

I created a new ArrayList to store the URLs I retrieved from each valid bucket’s file listing for later processing.

for (int i = 0; i < validBuckets.size(); i++) {

    S3Bucket bucket = validBuckets.get(i);

    String sUrl = "https://" + bucket.getBucketName() + ".s3." + bucket.getRegion() + ".amazonaws.com";

    // do something fun with the URL and its content

} // end-loop

The snippet above shows an overview of what is done. I looped through the list of “validBuckets”. For each bucket, the name and region were used to create a URL. The line with “do something fun” represents the slightly harder part.

In part 1 of this article series I showed an example XML response of what bucket entries looked like.

XML response for an S3 bucket request

In the XML file above the top-level node is named “ListBucketResult”. Each file in the bucket is contained inside a node named “Contents” with the file name specified by the “Key” node.
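An abbreviated example of that structure, with bucket and file names invented for illustration, looks roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <Name>some-example-bucket</Name>
    <Contents>
        <Key>reports/example-report.pdf</Key>
        <LastModified>2022-11-30T12:00:00.000Z</LastModified>
        <Size>102400</Size>
    </Contents>
    <Contents>
        <Key>logs/app-2022-11-30.log</Key>
        <LastModified>2022-11-30T12:00:05.000Z</LastModified>
        <Size>414000</Size>
    </Contents>
</ListBucketResult>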

To connect to the bucket:

URL url = new URL(sUrl);

HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.connect();

//make sure it is still valid
int respCode = connection.getResponseCode();

I haven’t written any XML parsing code in forever so I deferred to a quick online search and found the code below. There are multiple ways to approach this, but I wanted to parse the XML to get the URLs, not figure out “the best way” to parse XML for now.

The libraries used include:

import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

I first checked to make sure the connection resulted in a 200 (OK) response code.

if (respCode == 200) {

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

    dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

    DocumentBuilder db = dbf.newDocumentBuilder();

    Document doc = db.parse(connection.getInputStream());

    doc.getDocumentElement().normalize();

A useful aspect of HttpURLConnection is that it can return an InputStream, which most Java developers will be familiar with if they’ve done any file or network I/O. In this case the InputStream from the connection can be passed to the DocumentBuilder.parse method. We can then retrieve the list of “Contents” nodes.

NodeList list = doc.getElementsByTagName("Contents");

for (int temp = 0; temp < list.getLength(); temp++) {

    Node node = list.item(temp);

    if (node.getNodeType() == Node.ELEMENT_NODE) {

        Element element = (Element) node;
        String bucketKey = element.getElementsByTagName("Key").item(0).getTextContent();
        interestUrlsToCheck.add(sUrl + "/" + bucketKey);
    }
}
The code above iterates through the list of “Contents” nodes and, for each one, retrieves the “Key” node inside it containing the file name, held in the string variable “bucketKey”. This string can be added to a list and stored in a file for further review or processing. Note that I needed to append the file name to the end of the original bucket URL for future retrieval, either manually in my web browser or later through HTTP code.

I then wrote out the full URLs to each file into a text file for visual analysis.

FileWriter writer =
    new FileWriter(new File(parentDirectory,
        "interesting_urls_" + System.currentTimeMillis() + ".txt"));

BufferedWriter bw = new BufferedWriter(writer);

for (int i = 0; i < interestUrlsToCheck.size(); i++) {

    String urlToWrite = interestUrlsToCheck.get(i);

    bw.write(urlToWrite);
    bw.newLine();
}

bw.close();

I ran this overnight to scan a few hundred “valid” buckets that had returned an HTTP 200. To my surprise it generated several billion entries in a massively large text file I couldn’t even open. I then started over and ran only a few bucket names at a time, ending up with 4–5 million rows in a text file. After opening the file and scrolling for a while I realized there were a lot of files I simply didn’t care about. For example I came across web content such as CSS, JavaScript, HTML, ASP, PHP, etc. I also didn’t care about seeing image files such as PNG, GIF, JPG, JPEG, SVG, etc.

It became obvious I needed to add a check against the bucket key names to separate the files I found boring from those I found interesting.

I wasn’t looking for an elegant solution at this point, just one that worked.

private boolean isInterestingFileExtension(String fileName) {

    // quick-and-dirty exclusion list of extensions I don't care about
    Map<String, String> boringExtensions = new HashMap<>();

    boringExtensions.put("jpg", null);
    boringExtensions.put("jpe", null);
    boringExtensions.put("jpeg", null);
    boringExtensions.put("gif", null);
    // ...other extensions here

    if (fileName.contains(".")) {

        // grab whatever follows the last period as the extension
        String fileExt = fileName.substring(fileName.lastIndexOf(".") + 1);

        if (boringExtensions.containsKey(fileExt.toLowerCase())) {
            return false;
        } else {
            return true;
        }
    } else {
        // no extension at all: keep these unusual files so they can be reviewed
        return true;
    }
}

The isInterestingFileExtension method above is quick and dirty. It let me copy and paste lines to add to it quickly. The extension check assumes that a file name has at least one period with some kind of extension after it. Files may also have multiple periods, so it grabs the end of the file name starting at the last instance of a period. I also found there are file names stored in S3 buckets without a proper “.abc”-style extension at all, hence the ‘else’ clause in the code above returning true when a file name contains no period. This way it would flag and store unusual files without an extension as well. I modified the key iteration as follows:

String bucketKey = element.getElementsByTagName("Key").item(0).getTextContent();

if (isInterestingFileExtension(bucketKey)) {
    interestUrlsToCheck.add(sUrl + "/" + bucketKey);
}

If other people utilize this method, it is easy to add or remove whichever file extensions you want to exclude (or make sure are not excluded so you can find them). I excluded movie files such as MOV, QT, AVI, WMV, MPG, MPG4. I excluded various log and temp files such as LOG, TMP, TEMP. I also added executable and packaged/zipped files such as EXE, BIN, PS, CMD, GZIP, ZIP, GZ, TAR, RAR, YAR, SH, BAT. You can customize this as you see fit, as sketched below.
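As a rough sketch, the exclusion list inside the method above could be extended with those categories like so; the exact set is a matter of taste:

// Movie files
boringExtensions.put("mov", null);
boringExtensions.put("qt", null);
boringExtensions.put("avi", null);
boringExtensions.put("wmv", null);
boringExtensions.put("mpg", null);

// Log and temp files
boringExtensions.put("log", null);
boringExtensions.put("tmp", null);
boringExtensions.put("temp", null);

// Executables and packaged/zipped files
boringExtensions.put("exe", null);
boringExtensions.put("bin", null);
boringExtensions.put("zip", null);
boringExtensions.put("gz", null);
boringExtensions.put("tar", null);
boringExtensions.put("rar", null);
boringExtensions.put("sh", null);
boringExtensions.put("bat", null);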

For example if you want to scan all buckets for only one file type it is easy to modify the entire isInterestingFileExtension method down to just one line:

return fileName != null && fileName.toLowerCase().endsWith(".pdf");

I was mostly interested in finding documents such as PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, TXT, and others. Somewhat surprisingly, PDF documents seemed to contain some of the most interesting information. Any time I scrolled through countless URLs and found something that probably shouldn’t be shared it seemed to be in a PDF document.

For example I happened upon a bucket that had approximately 2500 scanned documents from a notary or bank in South America. It contained high-resolution scans of personal identification cards, personal bank statements, property tax records, and various other financial and government documents that people use in the course of day-to-day life. In the example below there was a scan of the front and back of a Colombian citizen’s ID. I blacked out anything that seemed personally identifying such as the photo, signature, numerical data, fingerprint, and barcode. I would think customers would be pretty upset if this financial institution couldn’t properly secure their personal data.

Looking for PDF documents in S3 buckets also resulted in sensitive (or interesting) finds such as:

  • A PDF document of a U.S. court filing of a defamation lawsuit against several individuals. Each individual’s name and address are listed, and an attempt was made to redact each party’s home address with a black marker. However, when the physical document was scanned to an image, the marker used to redact it was still wet or not dark enough. The original text can be seen behind the black marker, clearly showing the addresses of the individuals named in the lawsuit. In the lower left image below I performed actual redaction on the name, street, city, county, state, and zip, but left the poor attempt at the street number visible, partially circled in red to highlight it. In the lower right image you can clearly see “at 773” through the black marker.
Defamation lawsuit images
  • An application to Internet Banking Services with Barclays from a South American company including signatures, account numbers, and security questions of the primary account holder.
  • File dump of meeting minutes and committee reports from the United Nations from the 1960s and 1970s.
  • Someone’s laptop backup with a variety of files including a dozen private SSH keys.
  • A 1300+ page PDF of “The Complete Calvin and Hobbes” cartoons.
  • Countless resumes with work history, addresses, email addresses, and phone numbers.
  • Several sets of personal and company tax returns, W9s, 1099s, and miscellaneous other tax docs. An example I have redacted is below:
  • 2022 Budget forecasts for a small-to-mid-size company.
  • A 2014 Cornell University “Disorientation Guide” for new students that appeared to be a newsletter highly critical of Cornell policies, history, and administration.
Cornell Disorientation Guide
  • An entertaining letter from a political group lobbying the FBI and Department of Justice to drop charges against people arrested during the January 6th insurrection at the U.S. Capitol.

In the next installment of this article series I will cover how to process S3 buckets that returned a status of 301, how to paginate buckets and retrieve more than a single list of the first 1000 files, and how to download all the files in the bucket instead of manually opening them in your web browser. I will also discuss some more interesting content found in S3 buckets that should NOT be shared publicly and some that is just odd or entertaining.
