Using S3 is becoming popular for big data applications, in particular Apache Spark. In this post, I’ve collected a few recommendations from several years of our experience with it.
Not a filesystem, but consistent at last
Obviously, S3 is not a filesystem. It operates only on complete objects, the rename operation actually copies data, it has high latency and charges for every operation. On the other hand, since the end of 2020, it offers strong consistency — a completed operation is immediately visible for other clients. Quite a lot of documentation and blog posts still say otherwise, but you no longer need S3Guard, EMRFS consistent mode, or anything like that.
Use Terraform from the start
It’s best to use Terraform to manage your S3 buckets right away. It has quite good documentation, and it will take no more than 30 minutes to create the first bucket. In return, you will be able to consistently reproduce your setup, and not rely on careful clicking in the AWS console. Of course, you can use any other IaC tool if you prefer.
Get your permissions and owners in order
At the minimum, use a separate bucket for analytics data and a separate IAM role that can only access that bucket. You certainly don’t want analytics to have unchecked access to production data. A better approach is having a separate AWS account for analytics. For a perfect solution, create separate buckets per purpose (e.g., bronze/landing and gold/mart tables) and per team.
Sadly, permissions in AWS and S3 are extremely confusing and would take a lot of reading, but here are a couple of recommendations:
- For access within the account, create IAM roles and assign policies to them. This way, permissions of a role are in one place and easy to see.
- Avoid access across accounts.
- If access across accounts is necessary, that is, an IAM role from account A writes to a bucket in account B, policies on the IAM role alone are not enough, because account B must also grant the access. You then have two options: either assume a role in account B or use a bucket policy to give write permission. Sadly, when using Spark you can’t dynamically assume roles based on write destination, so a bucket policy is the only option.
If implemented naively, bucket policies that allow account A to write to a bucket in account B will keep objects owned by account A. This results in complete confusion: code in account B cannot read objects in its own bucket, and you’ll try to fix that with bucket policies, which don’t work in this case. Therefore, please force ownership change, so that the bucket owner owns every object written to the bucket.
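As a minimal sketch of the cross-account setup above, the snippet below builds a bucket policy that lets account A write, together with an Object Ownership setting that makes the bucket owner own every object. The post does not say how to force the ownership change; using `BucketOwnerEnforced` is my assumption about one way to do it. The function names, bucket name, and account ID are hypothetical, and `apply_cross_account_write` expects a boto3 S3 client that you create yourself.

```python
import json


def cross_account_write_policy(bucket: str, writer_account_id: str) -> dict:
    """Bucket policy that lets principals from another account put objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountWrite",
                "Effect": "Allow",
                # Hypothetical writer account; its own IAM policies must allow the write too.
                "Principal": {"AWS": f"arn:aws:iam::{writer_account_id}:root"},
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# Assumption: this Object Ownership setting is one way to "force ownership change".
OWNERSHIP_CONTROLS = {"Rules": [{"ObjectOwnership": "BucketOwnerEnforced"}]}


def apply_cross_account_write(s3_client, bucket: str, writer_account_id: str) -> None:
    """Apply both settings; s3_client is assumed to be boto3.client("s3")."""
    s3_client.put_bucket_ownership_controls(
        Bucket=bucket, OwnershipControls=OWNERSHIP_CONTROLS
    )
    s3_client.put_bucket_policy(
        Bucket=bucket,
        Policy=json.dumps(cross_account_write_policy(bucket, writer_account_id)),
    )
```

If you manage buckets with an IaC tool, the same two settings would live there instead; the Python form is only meant to show the shape of the configuration.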
If object versioning is enabled, S3 keeps all data that was ever stored under a particular key. If you replace an object, a new version is created, while the previous version becomes “noncurrent” but retains all the data. Deleting an object creates a “delete marker”, while the previous data is also kept in a “noncurrent” version. There is an API to access the content of a noncurrent version, and you can write it back to revert to an earlier state.
This functionality is a lifesaver. We used it several times, from restoring a single data science notebook to restoring a dozen large tables. Sadly, S3 does not provide an easy tool for such point-in-time recovery. In some cases, you can use the open-source s3-pit-restore tool. In other cases, you might have to use the S3 API directly. Still, it is a very good idea to have this option at hand.
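To make the "use the S3 API directly" case more concrete, here is a rough sketch of restoring a single key to its state at a given time. `version_at` is a hypothetical helper that works on the `Versions` and `DeleteMarkers` lists returned by boto3’s `list_object_versions`; `restore_key` copies the chosen version back on top. It is a simplification: it looks only at the first page of results and relies on `copy_object`, which does not handle very large objects.

```python
from datetime import datetime, timezone
from typing import Optional


def version_at(versions, delete_markers, key: str, when: datetime) -> Optional[str]:
    """Return the VersionId that was current for `key` at `when`.

    Returns None if the key did not exist yet or was deleted at that time.
    """
    events = [(v["LastModified"], v["VersionId"]) for v in versions if v["Key"] == key]
    # A delete marker means "no object" from that moment on.
    events += [(m["LastModified"], None) for m in delete_markers if m["Key"] == key]
    past = [e for e in events if e[0] <= when]
    if not past:
        return None
    return max(past, key=lambda e: e[0])[1]


def restore_key(s3_client, bucket: str, key: str, when: datetime) -> Optional[str]:
    """Write the version that was current at `when` back as the newest version."""
    page = s3_client.list_object_versions(Bucket=bucket, Prefix=key)  # first page only
    version_id = version_at(
        page.get("Versions", []), page.get("DeleteMarkers", []), key, when
    )
    if version_id is None:
        return None
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key, "VersionId": version_id},
    )
    return version_id
```

Restoring a whole table means running the same logic for every key under its prefix, which is where a dedicated tool becomes more convenient.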
Set up logs and inventory
For all buckets, you should configure server access logging. It will log every request, including the object key, data size, and timing. Then, also enable bucket inventory, an automatic process that creates a daily list of all objects in your bucket. Together, these instruments allow you to drill down into usage and costs, debug performance and permission issues, and help with recovery should you delete something.
It is best to set up a couple of Spark queries that read this data, to be ready when you need them.
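As an illustration of the kind of question such a query answers, here is a small plain-Python stand-in (not Spark) that sums object count and bytes per first-level prefix. It assumes you have already parsed the inventory into `(key, size_in_bytes)` pairs; the function name and the sample keys are made up for the example.

```python
from collections import defaultdict


def usage_by_prefix(rows):
    """Aggregate (key, size_in_bytes) pairs by the first-level prefix of the key."""
    totals = defaultdict(lambda: {"objects": 0, "bytes": 0})
    for key, size in rows:
        # Keys without a "/" are grouped under the empty prefix.
        prefix = key.split("/", 1)[0] if "/" in key else ""
        totals[prefix]["objects"] += 1
        totals[prefix]["bytes"] += size
    return dict(totals)


# Hypothetical inventory rows.
sample = [
    ("events/2024/a.parquet", 100),
    ("events/2024/b.parquet", 50),
    ("mart/t.parquet", 10),
    ("readme.txt", 1),
]
summary = usage_by_prefix(sample)
```

The same grouping translates directly to a Spark query over the real inventory files once the data is too large for one machine.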
Learn about storage classes
When you add an object to S3, it uses the so-called standard storage class by default: you can read it whenever you want, no questions. For historic data that you rarely access, there are several cost-saving options:
- Intelligent-Tiering keeps track of accesses to an object, and after 30 days with no accesses reduces the cost by half. If you need to access an object, it moves back to the full price tier. S3 charges a per-object fee for this monitoring, and you are advised to do the math yourself, but in many cases, this fee is insignificant compared to savings. For example, we use intelligent tiering to store events. Should an event fall out of use, the cost of storing it reduces by half without any work on our side.
- Glacier storage class is suitable for archival data. It is several times cheaper than standard storage, but the objects are not directly accessible. If you need to access data, you first have to restore a temporary copy from Glacier and read it; after a while the copy expires and the data is frozen again. We use it for raw data: something that is processed once and no longer needed, unless you find a bug and need to reprocess the past one more time.
- Glacier Deep Archive is even cheaper than Glacier, but pulling data from it takes considerably longer. You might consider using it for archival data you basically never use but need to keep for legal reasons.
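For the "do the math yourself" part of Intelligent-Tiering, a back-of-the-envelope sketch could look like this. It assumes a simple model where an idle object pays the cheaper storage price plus a flat monitoring fee per object. All prices in the example are placeholders chosen to match the "half price" description, not AWS list prices, so plug in the current numbers for your region.

```python
def monthly_saving_per_object(
    size_gb: float,
    standard_price_per_gb: float,
    infrequent_price_per_gb: float,
    monitoring_fee_per_object: float,
) -> float:
    """Saving for one idle object per month; negative means tiering costs more."""
    storage_saving = size_gb * (standard_price_per_gb - infrequent_price_per_gb)
    return storage_saving - monitoring_fee_per_object


def break_even_size_gb(
    standard_price_per_gb: float,
    infrequent_price_per_gb: float,
    monitoring_fee_per_object: float,
) -> float:
    """Object size at which the storage saving equals the monitoring fee."""
    return monitoring_fee_per_object / (standard_price_per_gb - infrequent_price_per_gb)


# Placeholder prices (assumptions): the cheaper tier is half the standard price.
STANDARD = 0.02       # per GB-month
INFREQUENT = 0.01     # per GB-month
MONITORING = 0.000002 # per object-month

threshold = break_even_size_gb(STANDARD, INFREQUENT, MONITORING)  # about 0.0002 GB
```

With these placeholder prices the break-even is around 200 KB, so the fee only matters for buckets dominated by tiny objects.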
Configure lifecycle rules
The storage classes above can reduce the costs, but you need to transition objects to these classes. And of course, the cheapest storage option is always “delete the data”. S3 can automate both with lifecycle rules.
The most important recommendation here is to agree on policies up-front. Otherwise, one might find that important business processes depend on raw JSON data because it happened to be available. Also, prefer to set lifecycle rules in broad strokes, ideally on the level of buckets or first-level prefixes. While lifecycle rules can be set on any prefix, the behavior is pretty confusing, and in particular, you cannot make a subprefix keep data longer than the parent prefix.
For example, you can have a separate bucket for raw (landing, bronze) data. After a week, move data to Glacier. After a further 90 days, remove the data.
For another example, the final mart tables can be stored in another bucket. The current versions are kept forever, while non-current versions are removed after 14 days. This gives enough time to recover from bugs in jobs while keeping usage under control.
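The two examples above could be written roughly as follows, in the dictionary shape that boto3’s `put_bucket_lifecycle_configuration` accepts. The rule IDs and constant names are made up, and the rules apply to the whole bucket through an empty prefix. Note that both the transition and the expiration are counted from object creation, so "a further 90 days" after a 7-day transition becomes 97 days.

```python
DAYS_BEFORE_GLACIER = 7
DAYS_IN_GLACIER = 90

# Raw (landing, bronze) bucket: move to Glacier after a week, delete 90 days later.
RAW_BUCKET_LIFECYCLE = {
    "Rules": [
        {
            "ID": "raw-to-glacier-then-expire",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # whole bucket
            "Transitions": [{"Days": DAYS_BEFORE_GLACIER, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": DAYS_BEFORE_GLACIER + DAYS_IN_GLACIER},
        }
    ]
}

# Mart bucket: keep current versions forever, drop noncurrent ones after 14 days.
MART_BUCKET_LIFECYCLE = {
    "Rules": [
        {
            "ID": "expire-noncurrent-versions",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "NoncurrentVersionExpiration": {"NoncurrentDays": 14},
        }
    ]
}


def apply_lifecycle(s3_client, bucket: str, config: dict) -> None:
    """Apply a lifecycle configuration; s3_client is assumed to be boto3.client("s3")."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```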
Avoid small objects
Most S3 operations are charged per object. That includes PUT and GET, but it also includes the lifecycle transitions we discussed earlier and Intelligent-Tiering management fees. Therefore, the larger your objects, the better. In the context of big data using the Parquet file format, the perfect size is around 1GB. It is comfortably processed by Spark, and sufficiently big for S3 operation costs to not matter. For real-time ingestion pipelines, 50–100MB is OK. Anything less than 1MB is a cause for worry.
You can use S3 Storage Lens to get an overview of object sizes, or you can use inventory (see above) and write queries on top of it to drill down into problems.
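To see why object size matters for per-object charges, a quick calculation helps. The sketch below counts how many objects, and therefore how many PUT requests, lifecycle transitions, or monitoring fees, a dataset of a given size turns into at a few object sizes. The price argument is a parameter you supply; no real AWS price is assumed here.

```python
def object_count(dataset_bytes: int, object_bytes: int) -> int:
    """Number of objects needed to store the dataset (rounded up)."""
    return -(-dataset_bytes // object_bytes)  # integer ceiling division


def per_object_cost(dataset_bytes: int, object_bytes: int, price_per_1000: float) -> float:
    """Cost of one per-object operation applied to every object in the dataset."""
    return object_count(dataset_bytes, object_bytes) / 1000 * price_per_1000


ONE_TB = 10**12
counts = {
    "1 MB objects": object_count(ONE_TB, 10**6),    # 1,000,000 objects
    "100 MB objects": object_count(ONE_TB, 10**8),  # 10,000 objects
    "1 GB objects": object_count(ONE_TB, 10**9),    # 1,000 objects
}
```

Whatever the actual price per thousand operations is, the 1 MB layout pays it a thousand times more often than the 1 GB layout.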
Beware of the latency
The final recommendation is to be aware and keep track of the S3 latency. Suppose you’re reading a lot of objects from a medium-size instance such as r5d.2xlarge. You can expect to reach 200MB per second. However, the latency — the time from request start to the first byte of data — can reach 100ms. In other words, if you are reading 20MB of data, latency can be 50% of your total request time.
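The arithmetic behind that 50% figure is simple enough to write down. This small sketch assumes each request pays the full latency once and then transfers at a steady throughput; the function name is mine, and the numbers are the ones from the paragraph above.

```python
def latency_share(object_bytes: float, throughput_bytes_per_s: float, latency_s: float) -> float:
    """Fraction of total request time spent waiting for the first byte."""
    transfer_s = object_bytes / throughput_bytes_per_s
    return latency_s / (latency_s + transfer_s)


MB = 10**6
# 20MB at 200MB/s takes 100ms to transfer, the same as the 100ms latency.
share_20mb = latency_share(20 * MB, 200 * MB, 0.1)    # 0.5
# For a much larger read the same latency almost disappears.
share_1gb = latency_share(1000 * MB, 200 * MB, 0.1)   # about 0.02
```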
There are many cases where latency can bite you:
- Obviously, if you read a table that has a thousand files, each of them 20MB in size. That is the classic case of table fragmentation and you should compact such a table.
- It is possible to have internal fragmentation. Parquet files are composed of row groups, and these row groups are read independently. By default, the writing process starts a new group after collecting 128MB of data in memory, so the size on disk, after compression, can be 10–20MB, and we can have a 1GB file with small row groups inside, no better than before. I’d recommend setting the `parquet.block.size` Spark option to at least 512MB.
- You can have too many columns. Not only are row groups in a Parquet file read independently, but each column within a row group can be read independently too. Worse, the current Spark code reads each column serially. Here, I don’t have any good suggestions other than increasing the row group size, not reading columns you don’t use, and waiting for Spark to improve.
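To put rough numbers on the cases above, here is a sketch with two parts. The first expresses the 512MB row group setting as a Spark configuration entry; passing it through the `spark.hadoop.` prefix is my assumption about one way to get `parquet.block.size` to the Parquet writer, so check it against your Spark version. The second is a worst-case estimate that assumes one S3 request per column per row group, all paying full latency one after another.

```python
ROW_GROUP_BYTES = 512 * 1024 * 1024

# Assumption: Hadoop-level Parquet settings can be passed via the spark.hadoop. prefix.
SPARK_CONF = {"spark.hadoop.parquet.block.size": str(ROW_GROUP_BYTES)}


def estimated_reads(files: int, row_groups_per_file: int, columns_read: int) -> int:
    """Worst case: one request per column per row group per file."""
    return files * row_groups_per_file * columns_read


def serial_latency_s(reads: int, latency_s: float) -> float:
    """Total time spent on latency alone if the reads happen one after another."""
    return reads * latency_s


# Hypothetical table: 10 files, 50 columns read, 100ms latency.
small_groups = serial_latency_s(estimated_reads(10, 8, 50), 0.1)  # 8 row groups per file
large_groups = serial_latency_s(estimated_reads(10, 2, 50), 0.1)  # 2 row groups per file
```

In practice reads are spread over many tasks, so the real cost is lower, but the ratio between the two layouts shows why fewer, larger row groups and fewer columns help.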
We went through a few recommendations for the production use of S3 for big data. I hope they were useful, but if I missed anything, please let me know in the comments.