Adobe I/O Events: Building a Distributed Linked List on S3 (Part II)
In our previous post, we introduced the concept of a Distributed Linked List. We also talked about how Adobe I/O Events implemented it and why those implementations failed.
After pushing out our second implementation, we were keen to rearchitect our Journaling subsystem to be more performant, cost-effective, and, most importantly, scalable.
From the shortcomings of the first two implementations, we had already learned that utilizing an object store (such as AWS S3) to store the event payloads was the right idea, as long as we could batch our events together in a way that facilitated event reads.
We also learned that performing a DB query in the critical path of reading events would remain the bottleneck. Hence, to achieve a truly scalable system, we needed to eliminate all database interactions from the event-read path.
Putting both those learnings together, and accounting for the fact that our API to a great extent resembled traversing a linked list, we started to explore the possibility of building a linked list of S3 objects — one where each S3 object would contain not only a batch of events (the data), but also the location of the next S3 object in the list.
And, if we could build such a list, reading events would be a simple list traversal that eliminates any kind of database lookups. Not only that, but the event reads would have the potential to scale as much as AWS S3 itself. And if we played our cards right, the cost of reading events could be as low as S3 GET requests ($4 per 10 million requests). Ka-ching!
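To make the read path concrete, here is a minimal sketch of such a traversal, with an in-memory dict standing in for S3 and made-up object keys. The layout and names are illustrative assumptions, not the actual Adobe implementation:

```python
# Sketch of reading events by traversing a linked list of S3 objects.
# The dict below stands in for S3: keys are object keys, values are the
# stored documents. All names here are illustrative, not Adobe's.

def read_events(store, head_key):
    """Traverse the list from head_key, yielding every event in order."""
    key = head_key
    while key is not None:
        obj = store[key]              # one S3 GET per node in production
        yield from obj["events"]      # the batch of events in this node
        key = obj["next"]             # location of the next S3 object

# A tiny three-node list: each object holds a batch plus the next key.
store = {
    "a1": {"events": ["e1", "e2"], "next": "b2"},
    "b2": {"events": ["e3"], "next": "c3"},
    "c3": {"events": ["e4", "e5"], "next": None},
}

events = list(read_events(store, "a1"))  # ['e1', 'e2', 'e3', 'e4', 'e5']
```

Note that the traversal touches no database at all: every hop is a single object fetch, which is what lets the read path scale with S3 itself.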
The hard problem
Reading the events from the list was never the hard problem; the hard problem was writing those events into the list in the first place, in a manner that is scalable and works in near real-time.
For true scalability and availability, we were sure that we would ultimately need multiple compute instances (containers) writing to a list concurrently, because a single container can only be scaled up so far.
And thus, the problem we really needed to solve was writing concurrently to a shared resource in a distributed environment. We also needed to ensure that our solution was “mathematically” correct and guaranteed to never fail — otherwise we might lose or corrupt customers’ data.
At first glance, the problem of writing concurrently could easily be solved through a lock that guarantees mutual exclusion. But once we went further down this path, first with a Redis-based distributed lock (Redlock) and then with ZooKeeper, we learned —
- Correctly locking between two threads in a JVM, or between two processes on the same machine, is hard — but it is a solved problem.
- Achieving the same mutual-exclusion guarantee in a distributed environment, however, is not. We could guarantee either mutual exclusion or liveness, but not both — and we needed both. Martin Kleppmann has written an enlightening article analyzing Redlock on exactly this point.
Luckily, we also discussed our approach with experienced folks internally —
The difference between ‘a distributed lock’ and ‘a distributed lock that works in practice and is on the critical path for high throughput writes and magically is not the bottleneck’ is ~50 person years. [sic] — Michael Marth (Director, Engineering)
Lastly, there was yet another consideration to take into account — we could not adopt an approach that was operationally heavy or that needed much human intervention.
No major cloud provider provides a managed ZooKeeper service. If you want us to run ZooKeeper ourselves, we will need another full-time DevOps engineer — Sathyajith Bhat (Senior DevOps Engineer & AWS Community Hero)
All in all, the best thing we did with distributed locking was to eliminate it as an approach early-on.
A Lock-free solution
After weeks of whiteboarding, we finally had a breakthrough. Re-evaluating the problem at hand, we realized that instead of acquiring a lock and writing to the linked list inside the critical section, we could break the whole process down into two steps —
- Order the writes by simply assigning each “write-task” a monotonically increasing number. And then,
- Perform the actual writes into the list in the determined order.
Both of the above steps could run concurrently on any number of containers without affecting the correctness of the algorithm and without corrupting any data. The only thing we had to ensure was that the second step — writing the events into the list — was idempotent.
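The two steps above can be sketched roughly as follows, with a simple in-process counter standing in for the sequencing mechanism. Everything here is an illustrative assumption, not the production code:

```python
import itertools

# A minimal sketch of the two-step, lock-free write.
# Step 1: each "write-task" claims a monotonically increasing sequence
# number (in production, a database auto-increment id plays this role).
# Step 2: the write is applied under that sequence number, and the apply
# is idempotent, so a retried or replayed task can never corrupt data.

sequencer = itertools.count(1)   # stands in for the DB auto-increment
applied = {}                     # seq -> batch; the list being built

def claim_sequence():
    """Step 1: order the write."""
    return next(sequencer)

def apply_write(seq, batch):
    """Step 2: idempotent write; re-applying the same seq is a no-op."""
    applied.setdefault(seq, batch)

seq = claim_sequence()
apply_write(seq, ["e1", "e2"])
apply_write(seq, ["e1", "e2"])   # a safe retry: nothing changes
```

Because step 2 is idempotent, any container can retry any write-task after a crash without coordination, which is what makes the scheme safe without a lock.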
Implementation: Technology choices
Ideas are a dime a dozen; the true test is always to implement and execute the vision. The new journaling system took two months to go from conceptualization to production — which would, quite frankly, have been impossible without the technologies available to us out of the box.
This section calls out some of the technology choices that we made and how those technology choices affected the design of our system, the API, and the algorithm itself.
AWS RDS MySQL
MySQL was our technology of choice. Not only did we heavily depend on the auto-increment id feature to order the event writes, but we were also only able to guarantee the correctness of our algorithm because we used database transactions in MySQL.
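As a rough illustration of that ordering idea, the snippet below uses SQLite from the standard library purely as a stand-in for MySQL: an insert inside a transaction hands each write-task a monotonically increasing id that fixes its position. The table and function names are hypothetical:

```python
import sqlite3

# SQLite stands in for AWS RDS MySQL here, only to illustrate the idea:
# claiming a row inside a transaction gives each write-task a
# monotonically increasing id, which determines its place in the list.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE write_tasks "
    "(id INTEGER PRIMARY KEY AUTOINCREMENT, batch TEXT)"
)

def claim_position(batch_key):
    """Atomically claim the next position for a batch of events."""
    with conn:  # the transaction makes claim-and-commit atomic
        cur = conn.execute(
            "INSERT INTO write_tasks (batch) VALUES (?)", (batch_key,)
        )
        return cur.lastrowid

first = claim_position("batch-a")    # id 1
second = claim_position("batch-b")   # id 2
```

The ids come out strictly increasing no matter how many writers insert concurrently, which is exactly the ordering property step one of the algorithm needs.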
Secondly, AWS RDS is a managed service, and there is no denying the fact that we love managed services! RDS provided us with high durability, lossless replication, and excellent uptime right out of the box — thus reducing our operational load tremendously.
Lastly, we knew that when writing events to the list at scale, the database could again become the bottleneck. Hence, we introduced time-based batching of event writes, which ensured that no matter the ingestion load on our system, there was a practical cap on the number of database interactions needed to process it. This critical piece of the puzzle made sure that our database could always handle the incoming load.
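A time-based batcher of the kind described might look roughly like this; the window length, interfaces, and names are assumptions for illustration, and a real batcher would also flush on a timer when traffic goes quiet:

```python
# Sketch of time-based batching. Events accumulate in memory and are
# flushed at most once per window, so the number of database interactions
# is bounded by elapsed time rather than by the ingestion rate.

class TimeBatcher:
    def __init__(self, window_seconds, flush):
        self.window = window_seconds
        self.flush = flush           # e.g. one DB transaction + one S3 PUT
        self.pending = []
        self.last_flush = 0.0

    def ingest(self, event, now):
        """Buffer an event; flush only if the window has elapsed."""
        self.pending.append(event)
        if now - self.last_flush >= self.window:
            self.flush(list(self.pending))
            self.pending.clear()
            self.last_flush = now

flushes = []
batcher = TimeBatcher(1.0, flushes.append)
for event, t in [("e1", 0.0), ("e2", 0.5), ("e3", 1.0), ("e4", 1.5)]:
    batcher.ingest(event, t)
# Three events arrive inside the first window yet cause a single flush.
```

With a one-second window, even a tenfold ingestion spike produces the same number of database writes, only larger batches.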
AWS S3
The only thing we were sure of when we started on this re-architecture journey was that we needed to use an object store such as S3. To simply say that S3 influenced us would be a gross understatement; the object store, quite literally, was and is the centerpiece of the journaling subsystem. Here are some of the ways it influenced the design —
- In a traditional linked list, a new node is appended by updating the next pointer on the tail node. However, because an object in S3 cannot be partially updated or appended to, our distributed linked list could not be constructed in the traditional way. Instead, each S3 object in our list dictates where the next S3 object has to be.
- In the previous implementation, we used to fetch multiple S3 objects to serve a single read request. Not only did this consume more resources, but, more importantly, it was very costly. This time we made a subtle change to our API semantics which ensured that we never have to fetch more than a single S3 object per read request. Thus, the 20M read requests that we serve in production every day cost us only $8. I like the fact that our production system spends less money serving requests daily than I spent on coffee while whiteboarding it.
- To maximize S3’s performance we followed its recommendation on object key prefixes and used completely random strings as object keys, even though it meant that we could not list S3 objects meaningfully. Furthermore, we even implemented the capability to add more S3 buckets on the fly, enabling us to add more capacity to our system on demand.
- Lastly, to talk about S3 without talking about eventual consistency would be almost cheating. Anticlimactically, however, we vigorously exploited S3’s read-after-write consistency to transfer and process large amounts of event data. The only place where eventual consistency played a role was in the API that pulled the events, and there the eventual-consistency window was easily absorbed by our near real-time guarantee.
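Putting the points above together, here is a sketch of how appending could work when an object can never be modified and must therefore already name its successor when written. The key scheme, helper names, and in-memory store are assumptions, not the actual implementation:

```python
import secrets

# Because an S3 object cannot be updated after the fact, the writer
# pre-generates a random key for the NEXT object and embeds it in the
# current one. Random keys also follow S3's guidance on spreading load
# across key prefixes. The dict stands in for S3; names are illustrative.

store = {}

def append_batch(events, current_key):
    next_key = secrets.token_hex(16)   # random, well-distributed key
    store[current_key] = {"events": events, "next": next_key}
    return next_key                    # where the following batch will live

head = secrets.token_hex(16)
key = head
for batch in [["e1", "e2"], ["e3"]]:
    key = append_batch(batch, key)

# The tail object points at a key that does not exist yet; a reader that
# reaches it simply waits for the next batch to appear, which fits the
# system's near real-time guarantee.
```

A pleasant side effect of this scheme is that adding a brand-new bucket only requires the next pre-generated pointer to name an object in that bucket, which is what makes on-the-fly capacity expansion possible.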
What did we learn?
- Always understand the problem you’re trying to solve. First, find out how the system is intended to be used and only then design it. Also, find out the aspects in which the system is allowed to be less than perfect — for us, it was being near real-time. (Batching, eventual consistency)
- Do not design a system first and then try to fit various technologies in it, rather design the architecture in the stride of the underlying components you use. (S3, MySQL)
- Listen to experienced folks to cut losses early. (Distributed Locks, ZooKeeper)