AWS S3: All you need to know about S3 as data scientists (Part 1)

Sareh Fotuhi
Nov 6 · 2 min read

Simple Storage Service (S3) was one of the first AWS services offered to the general public. It is an object storage service, meaning it manages data as objects. Leveraging this architecture, S3 provides strong analytic capability and quick data retrieval, along with the high reliability and scalability you would expect from the cloud.

S3 enables running big data analytics using AWS query-in-place services such as Athena and Redshift Spectrum, which I will cover in another post.

In S3, files/data are saved in buckets, and every bucket is associated with a unique URL that can be accessed on the web. Hence, a bucket name has to be unique across all existing bucket names in Amazon S3 so that these URLs stay unique.
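As a quick sketch of how this looks in code (assuming the boto3 SDK is installed and AWS credentials are configured locally; the bucket name, region, and file below are made up for illustration), creating a bucket and uploading an object might look like:

```python
import re


def is_valid_bucket_name(name: str) -> bool:
    """Rough check of S3 bucket naming rules: 3-63 characters,
    lowercase letters, digits, and hyphens, starting and ending
    with a letter or digit. (Dots are also allowed in bucket
    names but are omitted here for simplicity.)"""
    return bool(re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name))


if __name__ == "__main__":
    # boto3 is the official AWS SDK for Python; create_bucket only
    # succeeds if the name is globally unique across all of S3.
    import boto3

    bucket = "my-unique-analytics-bucket-2019"  # hypothetical name
    assert is_valid_bucket_name(bucket)

    s3 = boto3.client("s3")
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
    )
    # Objects inside the bucket are addressed by a key.
    s3.upload_file("data.csv", bucket, "raw/data.csv")
```

The pure name check runs anywhere; the boto3 calls require valid credentials and a genuinely unused bucket name.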

Currently AWS provides both virtual-hosted-style and path-style URLs to access a bucket:

virtual-hosted-style: http://bucket.s3.aws-region.amazonaws.com

path-style: http://s3.aws-region.amazonaws.com/bucket
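The two styles can be sketched as simple string templates (the bucket name, region, and key below are made up):

```python
def virtual_hosted_url(bucket: str, region: str, key: str = "") -> str:
    # Virtual-hosted-style: the bucket name is part of the hostname.
    return f"http://{bucket}.s3.{region}.amazonaws.com/{key}"


def path_style_url(bucket: str, region: str, key: str = "") -> str:
    # Path-style: the bucket name is the first path segment.
    return f"http://s3.{region}.amazonaws.com/{bucket}/{key}"


print(virtual_hosted_url("my-bucket", "eu-west-1", "data.csv"))
# http://my-bucket.s3.eu-west-1.amazonaws.com/data.csv
print(path_style_url("my-bucket", "eu-west-1", "data.csv"))
# http://s3.eu-west-1.amazonaws.com/my-bucket/data.csv
```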

By default, for security, public access to a bucket is blocked when you first create it. As with other AWS services there are account limits: by default you can create 100 buckets per account, raisable to 1,000 via a service quota increase. It is good practice to delete buckets you no longer use so that their names become available again. The size limit for a single S3 object is 5 TB; a single PUT can upload at most 5 GB, and for files larger than about 100 MB it is recommended to use the multipart upload functionality.
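With boto3, multipart uploads are handled transparently by `upload_file` once an object crosses a configurable size threshold. Below is a sketch (the file name and bucket are hypothetical), plus a small helper showing why the part size matters, since a multipart upload is capped at 10,000 parts:

```python
import math

# S3 multipart uploads allow at most 10,000 parts per object,
# so the chunk size bounds the maximum uploadable object size.
MAX_PARTS = 10_000


def part_count(object_size: int, chunk_size: int) -> int:
    """Number of parts a multipart upload would be split into."""
    return math.ceil(object_size / chunk_size)


if __name__ == "__main__":
    import boto3
    from boto3.s3.transfer import TransferConfig

    mib = 1024 ** 2
    # Switch to multipart above 100 MiB, uploading 100 MiB chunks.
    config = TransferConfig(multipart_threshold=100 * mib,
                            multipart_chunksize=100 * mib)

    s3 = boto3.client("s3")
    s3.upload_file("big_dataset.parquet", "my-bucket",
                   "raw/big_dataset.parquet", Config=config)
```

Note that a maximum-size 5 TB object cannot use 100 MiB chunks, since that would require more than 10,000 parts; larger chunks are needed at that scale.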

S3 also enables versioning and multi region replication which I will explain in the next story.

Thanks for taking the time to read this post :)

