A Better Browsing Experience for S3

Andrew Gross
YipitData Engineering
Jul 24, 2020

At YipitData, our Analysts often need to view S3 buckets to check 3rd party data deliveries, as well as our own data exports. This can be pretty painful! Our Analysts work inside Databricks Notebooks, which set a decently high bar for what they expect from visualizations.

Your options are not great:

  1. AWS Console S3 Browser
  2. aws-cli using %sh in Notebooks
  3. boto3 in Python

The AWS Console has a decent interface, but presents a few problems. Primarily, you now need to give console logins to everyone who needs to use it. This falls outside of our desired security model and creates a lot of friction for Analysts who need to check data deliveries.

The aws-cli is a decent interface for small buckets. However, it can be a bit painful to constantly navigate deeper into nested directories and to interpret the command line output when working with large numbers of files. Your eyes will glaze over pretty quickly.

boto3 is the wildcard: it is extremely flexible, since you can write arbitrary Python. However, it puts the burden of the interface on the user, which isn’t a great outcome for Analysts who are focused on data, not on writing Python wrappers.

What is the better option? We wrote browse_s3 to give a dataframe interface to browsing S3. This eliminates the need for AWS Console logins while providing a clean and familiar interface.

Browsing Folders

The goal is to provide a consistent interface for browsing. We don’t have full interactivity, but we can get close enough. While S3 doesn’t actually have folders, it is helpful to present information this way when browsing. Helpfully, if you stick to some basic conventions, boto3 easily supports this experience by splitting results between Contents and CommonPrefixes when using list_objects_v2.
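As a rough sketch of the idea (not the exact code from the gist), listing with a "/" delimiter returns the "folders" at the current level as CommonPrefixes and the files as Contents. The bucket and prefix names below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

def list_level(bucket, prefix=""):
    """List one 'folder' level of an S3 bucket, splitting the results
    into folders (CommonPrefixes) and files (Contents)."""
    paginator = s3.get_paginator("list_objects_v2")
    folders, files = [], []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        folders.extend(p["Prefix"] for p in page.get("CommonPrefixes", []))
        files.extend(page.get("Contents", []))
    return folders, files

# e.g. folders, files = list_level("example-bucket", "deliveries/2020/07/")
```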

Browsing Files

While not on display here, this interface seamlessly displays folders and files together, keeping folders at the top per convention. In addition to the normal S3 metadata (ETag, LastModified, and Size), we also include extra columns with the full prefix path and the S3 URL. Finally, with this information stuffed in a dataframe, you can easily use the sorting UI elements and join against it in SQL and PySpark.
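A minimal sketch of how those rows might be stuffed into a dataframe, building on the list_level helper above; it assumes the ambient spark session in a Databricks notebook, and the real browse_s3 implementation lives in the gist:

```python
def browse_s3_sketch(bucket, prefix=""):
    """Return a Spark DataFrame of one S3 'folder' level, folders first."""
    folders, files = list_level(bucket, prefix)
    rows = [
        # Folders come first, per convention; they carry no object metadata.
        (f.rstrip("/").split("/")[-1] + "/", None, None, None, f, f"s3://{bucket}/{f}")
        for f in folders
    ] + [
        (obj["Key"].split("/")[-1], obj["Size"], str(obj["LastModified"]), obj["ETag"],
         obj["Key"], f"s3://{bucket}/{obj['Key']}")
        for obj in files
    ]
    schema = "name string, size long, last_modified string, etag string, prefix string, s3_url string"
    return spark.createDataFrame(rows, schema)

# display(browse_s3_sketch("example-bucket", "deliveries/2020/07/"))
# browse_s3_sketch("example-bucket").createOrReplaceTempView("s3_listing")  # join against it in SQL
```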

Check out the code gist and feel free to leave comments and suggestions!
