There’s a cool Python module called s3fs which can “mount” S3, so you can use POSIX-style file operations on it.
Why would you care about POSIX operations at all? Because Python implements them too. So if you currently run a Python app and write things to a local file via:
with open(path, "w") as f:
you can write to S3 instead simply by replacing it with:
with s3.open(bucket + path, "w") as f:
Of course S3 already has good Python integration through boto3, so why wrap a POSIX-like module around it?
For me, the question is what kind of mental model you want to have. s3fs supports a simpler one: you usually move from storing locally to storing in the cloud, and with s3fs this move doesn’t require you to change how you think about your app and your data.
If you happen to hit the limits of s3fs, it of course makes sense to either sync data to S3 in batches or use a different way of loading your persistent data.
My prime use case for this is machine learning. Loading and saving models in evaluation stages, and loading and saving data in various stages of processing, are typically done first on a single machine, and then in the cloud, possibly on more than one machine with the need for some form of persistent storage.
The pros of the two approaches. Pro s3fs:
- less space needed on device, instance, host, container
- simpler mental model
Pro load via boto3:
- speed in execution: data is stored locally, so no bandwidth is spent on every load into memory
- possibly more immutable: data only changes on sync, not on every read/write.
Installing and Starting Out
s3fs is pip-installable, so just run
pip install s3fs
, import s3fs into your script and you’re ready to go. All actions require you to “mount” the S3 filesystem, which you can do via
fs = s3fs.S3FileSystem(anon=False) # accessing all buckets you have access to with your credentials
fs = s3fs.S3FileSystem(anon=True) # accessing all public buckets.
You can test things with a simple listing. Remember s3fs follows the POSIX model, so the usual Unix commands like ls, cat and touch are available as methods:
fs.ls("…") # displays the contents of a bucket
will work, as does
fs.touch("…/test.txt") # should put a 0-byte file into your bucket.
Now let’s do something useful with this.
Example 1: A CLI to Upload a Local Folder
This CLI uses fire, a super-slim CLI generator, and s3fs. It recursively syncs all data in some directory tree to a bucket.
In the console you can now run
python filename.py to_s3 local_folder s3://bucket
to start the CLI. Note this assumes you have your credentials stored somewhere boto3 looks for them. Boto3 resolves credentials in this order:
1. credentials passed directly to S3FileSystem, which are then handed on to the boto3 client,
2. environment variables, which I usually use,
3./4. shared credentials files, config files, etc.
(see the boto3 documentation for more information)
Example 2: Writing, Loading Machine Learning Models
Another good use case is to save and load machine learning models. They are usually too big to store in version control, but you do want to save them regularly, because, well, sometimes things crash. One thing you could do, for instance, is to save:
1. the model object itself, pickled
2. parameters in some dict used to create it, describe it, or whatever you want to remember, possibly in JSON so you can read it as plain text
3. results, scores etc.
Here’s a small example.
Example 3: Writing a Pandas DataFrame to S3
Another common use case is to write data to S3 after preprocessing. Suppose we just did a bunch of word magic on a DataFrame with texts, like converting it to bag-of-words or tf-idf representations. We then want to save this DataFrame, and possibly the Tokenizer, to S3.
That’s it! Start playing around with this module. You can find the module and the three gists here: