Post-mortem on 3Box service degradation

Joel Thorstensson
3Box Labs

--

During the past two weeks, 3Box core services suffered varying levels of service degradation and disruption as traffic and usage increased sharply over the previous month. Issues were fully resolved on June 27.

The goal of 3Box has always been to provide a fully decentralized user data system that is easy and reliable for developers, while at the same time respecting users’ sovereignty. These past two weeks we did not live up to these goals. The risk of developers having to rely on 3Box, and our centralized infrastructure, as the primary data backup service is one of the main drivers for us to build out and migrate 3Box to the Ceramic Network. This will give developers and users a wider choice over which service providers they trust, be they decentralized or centralized.

In the meantime, our 3Box infrastructure needs to be reliable and resilient. Our architecture is designed to provide more resiliency than traditional client/server models. The 3Box client library is a full p2p node that caches users’ data locally. This means that even when our infrastructure is degraded and data is not being backed up, users can still write and access their data. When they are able to connect to our infrastructure again, their data is backed up retroactively.
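
As a rough illustration of this local-first behaviour, the sketch below uses the public 3box-js interface (the address and Ethereum provider are placeholders). Writes resolve against the local store right away and are replicated once the pinning infrastructure is reachable again:

    const Box = require('3box')

    async function demo (address, ethereumProvider) {
      // Opening a 3Box spins up a local IPFS/OrbitDB node in the client;
      // reads are served from the local cache first.
      const box = await Box.openBox(address, ethereumProvider)

      // Writes land in the local store immediately, so they keep working
      // even while the remote pinning infrastructure is degraded.
      await box.public.set('status', 'written while the backup service was down')

      // syncDone resolves once the client has finished syncing with the
      // pinning node; queued updates are backed up when it is reachable again.
      await box.syncDone
      console.log(await box.public.get('status'))
    }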

During this recent stretch, we had an unacceptably long period where our infra was unavailable, causing disruption to several application experiences that rely on 3Box. We strive to be a reliable piece of infrastructure for Web3 apps and services, and we let down some of our partners. We are taking steps to ensure this does not happen in the future.

What happened

Root cause

The root cause of the service degradation was a combination of our use of S3 as a shared data layer, an update to IPFS that dramatically increased traffic to S3, and insufficient tooling, which made it take too long to pinpoint exactly what was causing the issues.

3Box runs a cluster of IPFS and OrbitDB nodes which pin users’ data in the network and store pointers to users’ database logs. This cluster looks like a single IPFS node to clients in the network, but allows us to spin up instances on demand. These nodes operate on a few shared data layers, one of them being a shared Amazon S3 bucket which stores IPFS blocks and data for other IPFS services. Concurrently sharing the “blockstore” is of little concern, as all blocks are content-addressed and keyed as such in S3. The IPFS “datastore”, on the other hand, includes shared objects, like the pinset, which may be read and written by multiple instances, as well as provider records for the DHT.
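
To make that distinction concrete, here is a rough sketch (the bucket name and key paths are illustrative, not our actual repo layout) of why sharing the two stores behaves so differently across nodes:

    const AWS = require('aws-sdk')

    const s3 = new AWS.S3()
    const Bucket = 'example-ipfs-repo' // illustrative bucket name

    // Blockstore: the key is derived from the content (its CID), so two nodes
    // writing the same block write identical bytes to the same key. Concurrent
    // writes are idempotent and harmless.
    async function putBlock (cid, bytes) {
      await s3.putObject({ Bucket, Key: 'blocks/' + cid, Body: bytes }).promise()
    }

    // Datastore: keys such as the pinset root point at mutable shared state.
    // Multiple nodes doing read-modify-write cycles against the same key can
    // race each other, and every node reads it over and over.
    async function updatePinsetRoot (newRootCid) {
      await s3.putObject({ Bucket, Key: 'datastore/pins', Body: newRootCid }).promise()
    }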

When we first implemented S3 as a shared data layer, there was little shared data besides the blocks. Long term we only planned to store data like blocks in S3, as other shared data layers would be better suited to high throughput or concurrent reads and writes. With a recent IPFS update from 0.41.0 to 0.44.0, IPFS enabled new pinset data layers and new services like the DHT (in JS). These changes resulted in many data operations against the “datastore” in S3, and the overall increase in service traffic further increased the number of operations there. Multiple nodes in our cluster were making similar requests to S3, at times resulting in thousands of requests to S3 per second.

Unknown to us at the time, we began to hit S3 limits and our requests were throttled, with delays ranging from tens of seconds to minutes, leading to multiple retries and ultimately node failure. This was not obvious from just looking at the logs from the pinning nodes, and it was not clear that updating js-ipfs would cause such a change in behaviour. To ultimately resolve the issue we moved the “datastore” out of S3 and also fully disabled the DHT (while the DHT is disabled in js-ipfs on the network level by default, it was still making a large number of operations on the data layer for provider records). These changes massively decreased the requests made to S3, which restored our infrastructure to normal operational capacity.
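
For illustration, the kind of S3-client instrumentation that would have surfaced this sooner looks something like the sketch below (aws-sdk v2; the bucket and key are placeholders). The SDK retries throttled SlowDown responses silently unless you hook its retry event:

    const AWS = require('aws-sdk')

    // Log every S3 call (operation, params, latency) so slow requests stand
    // out in the pinning node logs.
    const s3 = new AWS.S3({ logger: console, maxRetries: 3 })

    function getBlock (bucket, key) {
      const request = s3.getObject({ Bucket: bucket, Key: key })

      // Make throttling visible instead of letting the SDK retry quietly.
      request.on('retry', (response) => {
        if (response.error) {
          console.warn('S3 retry for ' + key + ': ' + response.error.code +
            ' (attempt ' + (response.retryCount + 1) + ')')
        }
      })

      return request.promise()
    }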

Timeline and Impact

Starting June 15th, we began to receive infrastructure alerts that some of our IPFS/OrbitDB pinning nodes were periodically in unhealthy states during traffic spikes. We immediately began to investigate why nodes were consuming excessive amounts of CPU and memory. To start, we made more resources available for each node and scaled out horizontally.

For over a year now we have been running a cluster of IPFS and OrbitDB nodes. This cluster of nodes appears as a single peer to clients in the 3Box network and runs over a shared data layer. This configuration allowed us to keep the service up even in the presence of repeated node failures. During this time users would have seen degraded service in the form of slow initial connections, slow syncing times, and dropped connections.

Through that week and weekend we worked to collect more data and insight into what was happening and made a number of improvements, all of which helped but did not fix the root issue:

  • Upgraded js-ipfs to 0.46.0
  • Backported a recent fix in libp2p-pubsub to work with this version of IPFS
  • Aligned versions of dependencies in multiple sub dependencies (including in orbit-db-access-controllers)
  • Made multiple configuration changes to our nodes, our cluster and how we manage connections to internal AWS services
  • Made a fix to our internal orbit-db caching layer that was making excessive calls to an internal Redis instance

Many of these fixes did improve our services, but traffic continued to grow and we saw continued service degradation.

The peak of the service degradation occurred between June 23 and June 24, when, during periods of increased traffic, clients were not able to connect to the pinning node to sync, write, and load data. This coincided with a further increase in usage. During this time we temporarily suspended some of our internal supporting services to make sure they were not contributing to the issue and so that core services could remain up in some capacity. In parallel, we worked to simulate the same type of load and traffic patterns in a test environment so we could test hypotheses about what was causing the excessive resource usage and failures.

Finally on June 26 we made the fix that allowed us to return our service to normal. Over June 27 and 28 we restored internal supporting services, reaching full operational capacity.

How we are improving

We have a number of learnings we are taking away from this event to ensure more stable performance going forward.

  • Immediately, we are sharding the S3 key space for IPFS blocks, a best practice suggested by AWS (see the sketch after this list). This will prevent us from hitting S3 request limits and may speed up many normal operations.
  • We will build out a much more robust system for load testing our infrastructure in a development environment. This will allow us to catch these and other types of issues before they occur by simulating expected traffic patterns.
  • We’re adding more granular logging to enable greater visibility into our internal services (e.g. S3 and Redis), which will enable more rapid discovery of underlying limitations.
  • We are optimizing our auto scaling configuration. Even though we had auto scaling enabled, it did not perform as well as it could have.
  • We are adding a backend engineer to drive infrastructure improvements (JD here)
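
As a rough sketch of the sharding idea from the first point above (the two-character hash prefix is an illustrative scheme, not our exact key layout), spreading block keys across many prefixes keeps any single prefix well under S3’s request limits:

    const crypto = require('crypto')

    // Derive a short, evenly distributed prefix from the CID and use it to
    // spread block objects across many S3 key prefixes.
    function shardedBlockKey (cid) {
      const shard = crypto.createHash('sha256').update(cid).digest('hex').slice(0, 2)
      return 'blocks/' + shard + '/' + cid
    }

    // Produces keys like blocks/<two hex chars>/bafybei...
    console.log(shardedBlockKey('bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi'))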

The biggest improvement to our dependability will come as we migrate the core of 3Box to the Ceramic Network. This will enable fully decentralized and verifiable identity and data access in 3Box, with a network of service providers and no specific dependence upon 3Box services. We are working towards this as fast as possible, with an anticipated fall release.

At the same time, we are fully committed to supporting a high performance and reliable experience for users and developers building on 3Box. We are operating in a fast-changing environment and stack, and it’s our responsibility to ensure the technologies we are using, and our implementation of them, are delivered in a stable, reliable manner to our customers. Enabling developers to easily build applications that give users control over their data is core to our mission. Nobody should have to trust us at 3Box to build in this model, but it is our responsibility to create trust through our products and infrastructure as we build towards a fully decentralized network.

Want to join us in creating the foundation for Decentralized Identity?
Come chat with us in Discord
Have a look at our docs
Read more about Ceramic
