Reflections on S3's architectural flaws
On 28th February 2017, Amazon Web Service’s S3 storage service experienced a period of unavailability. While we are awaiting a postmortem and explanation of root-cause, I thought I would share my thoughts on the storage guarantees provided by S3, the implications of S3’s closed model for metadata, and a prediction of the re-emergence of cloud-scale hierarchical filesystems as a replacement for object stores (for some workloads).
It appears that the S3 outage originated in US-East, but affected any S3 clients around the world that did not use region/AZ-specific hostnames [s3outage]. At one stage even the AWS status page was down, presumably because it had a dependency on S3. Even Heroku app developers learnt the hard way the true definition of a distributed system: it is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.
The Metadata Problem of S3
Aside from the outage, there are many limitations of working with S3 that make it a less than ideal long term storage technology, and most of its problems relate to S3 object replication and metadata. S3 is a an eventually consistent key-value store for objects. However, eventual consistency tells us nothing about what guarantees S3 provides. AWS will not give you answers to these reasonable questions:
- how long will it take until the object is replicated ?
- what is the current status of object replication (how much longer should i wait until trying to read the latest version of an object) ?
For instance, if you list objects in an Amazon S3 bucket after putting new objects, Amazon S3 does not provide a guarantee to return the current objects in that bucket. The same is true for even for single objects that have been updated or deleted (see the figure below) — a subsequent read may not return the latest version (the object may still exist even if you deleted it!).
Netflix does not trust the metadata provided by S3. They have replaced it with their own metadata service, s3mper, which is essentially an eventually consistent key-value store that stores a copy of the metadata in S3 [s3mper]. Netflix rewrote their applications to account for s3mper. In the diagram below, you can see that application programming becomes more complex. Creating an object in S3 becomes a write to DynamoDB and a create operation in S3. This is not done transactionally. All S3 read/list operations need to be re-written to query DynamoDB and S3 and compare the results. All mutation of S3 objects involve updates to both DynamoDB and S3. The result is that DynamoDB stores an (eventually consistent) replica of the metadata from S3 (which itself is eventually consistent). So your applications rely on an eventually consistent replica of eventually consistent metadata. Whoa! Yes that’s as crazy as it sounds. But, it works so well for Netflix that Amazon have now released a port of s3mper called the Elastic MapReduce Filesystem. The problem is not specific to S3 — Spotify have had to develop a similar metadata keystore for Google’s GCS object store.
The s3mper type of solution is now the lamentable state-of-the-art for object storage on public clouds, but there is hope for better systems for the future.
Objects Stores going the way of NoSQL Databases?
Object stores have much in common with NoSQL databases. Both arose from scalability limitations in hierarchical filesystems and relational databases, respectively. However, many NoSQL databases are now being replaced by NewSQL systems that overcome the scalability limitations of single-node relational databases, but maintain the strong consistency guarantees of relational databases [newsql]. Sometimes, you can have your cake and eat it.
Given this, I make the following prediction:
Object stores systems grew from the need to scale-out filesystems.
A new generation of hierarchical filesystem will appear that bring both
scalability and hierarchy, and many companies will move back to
scalable hierarchical filesystems for their stronger guarantees.
Object stores were designed to scale up a storage service to tens of thousands of servers, providing effectively unlimited storage capacity, but at the cost of weakening the guarantees provided to application developers, in the form of eventually consistent semantics. Eventual consistency implies that data read by applications may be stale (not the latest version written) and, as such, it introduces problems for application developers, as it is challenging to write correct applications due to the weak guarantees they provide which version of an object or file will be read.
Amazon has not published any details on how metadata works in S3 or what guarantees it provides (other than don’t expect to read what you wrote). Companies have reacted by building their own metadata stores for the metadata in S3 — they developed key-value stores that are eventually consistent replicas of the eventually consistent metadata in S3. Even Amazon doesn’t trust S3’s metadata and recommends users to use EmrFS that stores replicas of S3 metadata in DynamoDB.
On the same day as the S3 crash, at USENIX FAST 2017, our research group presented a next-generation distribution of Apache Hadoop File System, called HopsFS (in joint work with Spotify and Oracle). In the paper, HopsFS is shown to scale to 1.2 million ops/sec on Spotify’s Hadoop workload (16X the throughput of HDFS). For more write-intensive workloads, HopsFS throughput is 37X the times of HDFS. HopsFS also supports at least 35X the metadata capacity of HDFS (HopsFS can have up to at least 12TB of metadata vs ~300GB limit for HDFS).
We are currently working on making HopsFS highly available in a multi-data-center environment, such as over two AWS availability zones. We believe we can build such a cloud-scale hierarchical filesystem with performance similar to HopsFS. This would enable companies to replace S3 with a high-performance distributed hierarchical filesystem for many workloads. Not only should we have much higher write/read throughput at lower latency than S3, but we should also provide stronger consistency semantics than S3. We want you to have your cake and eat it :)
Our approach is not without hurdles, though. One of the challenges for distributed hierarchical filesystems in the cloud is the economics of local storage versus remote storage. AWS and many other public clouds provide compute nodes with small amounts of local storage and larger amounts of remote storage (in Elastic Block Storage). There is no economic reason that public clouds could not provide more storage dense compute nodes at reasonable cost. Most on-premise Hadoop installations have storage-dense compute nodes with typically 8–24 hard disks (32–172TB of local storage). Criteo (who manage Europe’s largest Hadoop cluster) recently estimated it would cost them 8X their current on-prem system to move to AWS.
- [s3mper] http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html
- [Spanner] http://www.cubrid.org/blog/dev-platform/spanner-globally-distributed-database-by-google/
- [emrfs] http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
- [hopsfs] https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi
- [newsql] http://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016.pdf
- [s3outage] https://news.ycombinator.com/item?id=13755673