AWS S3: All you need to know about S3 as a data scientist (Part 1)
Simple Storage Service (S3) was one of the first AWS services made available to the general public. It is an object storage service, which means it manages data as objects rather than as files in a hierarchy or blocks on a disk. Thanks to this architecture, S3 offers strong analytic capability and fast data retrieval, along with the high durability and scalability you would expect from the cloud.
S3 also enables running big data analytics through AWS query-in-place services such as Athena and Redshift Spectrum, which I will cover in another post.
In S3, files/data are saved in buckets, and every bucket is associated with a unique URL that can be accessed on the web. The bucket name therefore has to be unique across all existing bucket names in Amazon S3 so that each URL is unique.
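Because names are global, AWS also enforces a set of naming rules so that every bucket name can appear in a URL. A minimal sketch of the core rules (3-63 characters, lowercase letters, digits, hyphens, and dots, starting and ending with a letter or digit, and not shaped like an IP address) as a local validator; the bucket names are made up for illustration:

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Check the core S3 bucket naming rules locally."""
    # Must be 3-63 characters long.
    if not 3 <= len(name) <= 63:
        return False
    # Must not be formatted like an IP address (e.g. 192.168.0.1).
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", name):
        return False
    # Lowercase letters, digits, hyphens, dots; must start and
    # end with a letter or digit.
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name) is not None

print(is_valid_bucket_name("my-data-bucket"))  # True
print(is_valid_bucket_name("My_Bucket"))       # False: uppercase/underscore
```

Note that this only checks the format; actual uniqueness can only be confirmed by attempting to create the bucket.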
Currently AWS provides both virtual-hosted-style and path-style URLs to access a bucket:
virtual-hosted-style: https://bucket-name.s3.aws-region.amazonaws.com/key-name
path-style: https://s3.aws-region.amazonaws.com/bucket-name/key-name
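The two styles differ only in where the bucket name sits. A small sketch that builds both URL forms for a hypothetical bucket, region, and object key:

```python
def bucket_urls(bucket: str, region: str, key: str) -> dict:
    """Build the two S3 URL styles for one object."""
    return {
        # Bucket name is part of the hostname.
        "virtual-hosted": f"https://{bucket}.s3.{region}.amazonaws.com/{key}",
        # Bucket name is the first segment of the path.
        "path": f"https://s3.{region}.amazonaws.com/{bucket}/{key}",
    }

# Hypothetical bucket/key for illustration:
urls = bucket_urls("my-data-bucket", "eu-west-1", "data/train.csv")
print(urls["virtual-hosted"])
# https://my-data-bucket.s3.eu-west-1.amazonaws.com/data/train.csv
print(urls["path"])
# https://s3.eu-west-1.amazonaws.com/my-data-bucket/data/train.csv
```

Note that AWS has been steering new buckets toward the virtual-hosted style, so it is the safer default in new code.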
By default, for security reasons, public access to a bucket is blocked when you first create it. As with most AWS services there are limits: an account can have 100 buckets by default (a soft limit that can be raised via a service quota request). It is good practice to delete buckets you no longer use so that their names become available again. The maximum size of a single S3 object is 5 TB, and a single PUT request can upload at most 5 GB; for bigger files it is recommended to use the multipart upload functionality.
S3 also supports versioning and multi-region replication, which I will explain in the next story.
Thanks for taking the time to read the post :)