Are you a software engineer who misses math? Good news! You’re going to see some fractions and even characters like λ and μ. We’re going to learn how to estimate the durability guarantees of a storage system. You’ll be able to say things like “we have 9 nines of durability” with a somewhat-straight face.
This post applies equally well to object storage systems or databases or any situation where you store data in multiple places for redundancy.
There’s going to be a plot twist at the end. Fun times.
This post is based on a chapter I wrote for the book Seeking SRE.
Back in 2013 I was helping design a block storage system to store a few exabytes of data and was trying to figure out how many disks to replicate data on. This may seem like a pretty obvious question but in the real world most people just make an educated guess and move on. I have certainly been guilty when working on databases of just assuming that three replicas sounds like enough and leaving it at that.
This time we’re going to use math.
For the sake of this exercise pretend we have a storage system that stores a copy of each file on a bunch of disks. In practice at Dropbox we use various fancy erasure coding schemes for storage efficiency, but the logic is the same.
Assume that each disk has a given mean time to failure (MTTF) and that we build some automated systems to replace and re-replicate a failed disk within a given mean time to recovery (MTTR), both measured in hours. We’ll represent these as failure and recovery rates of λ=(1/MTTF) and μ=(1/MTTR) respectively. This means that each disk fails at an average rate of λ failures per hour and each failed disk is replaced at an average rate of μ recoveries per hour.
Let’s say we store each file on n disks and that we lose data if we lose more than m disks, e.g., for synchronous 3-way replication we have n=3 and m=2 because we can survive losing 2 disks. If we were using Reed-Solomon erasure coding to convert 6 input blocks to 9 output blocks (RS(9,6)) then we have n=9 and m=3 because we can reconstruct all input blocks from any six output blocks.
You can think of disks failing within a replication group as a state machine, where each state represents the number of disks we have left in the group and the transitions between states represent a disk failing or a disk being recovered. The standard technique for estimating the probabilities of moving through this state machine is a Markov chain:
In this model the flow from the first to second state is equal to nλ because there are n disks that each fail at a rate of λ failures/hour. The flow from the second state back to the first is equal to 1μ because we have only one failed disk, which is recovered at a rate of μ recoveries/hour. The other transitions in the model follow the same convention.
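To make that convention concrete, here is a minimal Python sketch that lists every transition rate for a replication group. The function name `transition_rates` and the choice to index states by the number of failed disks are illustrative assumptions, not something taken from the original diagram.

```python
def transition_rates(n, m, lam, mu):
    """Return (failure_rates, recovery_rates) for one replication group.

    States are indexed by the number of failed disks, 0..m, plus a final
    data loss state reached from state m.

    failure_rates[k]  : rate of moving from k failed disks to k+1 (k = 0..m)
    recovery_rates[k] : rate of moving from k+1 failed disks back to k (k = 0..m-1)
    """
    # With k disks already failed, n-k healthy disks can each fail at rate lam.
    failure_rates = [(n - k) * lam for k in range(m + 1)]
    # With k disks failed, each is being recovered independently at rate mu.
    recovery_rates = [k * mu for k in range(1, m + 1)]
    return failure_rates, recovery_rates


# 3-way replication with lam and mu set to 1 so the multipliers are visible.
failures, recoveries = transition_rates(n=3, m=2, lam=1.0, mu=1.0)
print("failure rates: ", failures)
print("recovery rates:", recoveries)
```

The last entry of `failure_rates` is the transition into the data loss state, which has no recovery arrow coming back out.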
In this model we lose a file if we move from the second-last state to the data loss state. The data loss rate in “replication groups per hour” is equivalent to the rate of flow along this transition, which can be computed as:
loss rate = nλ × (n-1)λ/μ × … × (n-m)λ/(mμ) = n!/(m!(n-m-1)!) × λᵐ⁺¹/μᵐ
Damn, that’s kinda hard to read. Here it is as an image:
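For reference, the same formula typeset in LaTeX, with the factorial in the denominator read as (n-m-1)! and the symbols exactly as defined above:

```latex
\[
\text{loss rate}
  = n\lambda \times \frac{(n-1)\lambda}{\mu} \times \cdots \times \frac{(n-m)\lambda}{m\mu}
  = \frac{n!}{m!\,(n-m-1)!} \cdot \frac{\lambda^{m+1}}{\mu^{m}}
\]
```

The numerator collects the m+1 failure transitions, n(n-1)⋯(n-m) = n!/(n-m-1)!, and the denominator collects the m recovery transitions, 1·2⋯m = m!.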
Bam! Looks like we have a formula for durability.
Ok time to put some numbers in this durability formula.
Let’s say our disks have an Annualized Failure Rate (AFR) of 3%. We can compute MTTF from AFR using MTTF=-8766/ln(1-AFR), where 8766 is the number of hours in a year; for small AFR this is approximately equal to 8766/AFR. In this case MTTF=287,795 hours.
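Here is a small Python sketch of that conversion, showing both the exact form and the approximation. The function names are illustrative, and the sketch assumes disk lifetimes follow the constant-failure-rate (exponential) model that the MTTF formula implies.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours


def mttf_from_afr(afr):
    """Exact conversion: solve AFR = 1 - exp(-HOURS_PER_YEAR / MTTF) for MTTF."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


def mttf_from_afr_approx(afr):
    """First-order approximation, reasonable when AFR is small."""
    return HOURS_PER_YEAR / afr


afr = 0.03
print(f"exact MTTF:  {mttf_from_afr(afr):,.0f} hours")
print(f"approx MTTF: {mttf_from_afr_approx(afr):,.0f} hours")
```

At a 3% AFR the two differ by roughly 1.5%, which is well within the noise of any real-world failure rate estimate.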
Let’s also say we have some reasonably good operational tooling that can replace and re-replicate a disk within 24 hours after failure, so MTTR=24.
If we adopt 3-way data replication we have n=3, m=2, λ=1/287,795 and μ=1/24.
Plugging these into our new favorite equation we get a data loss rate of 7.25×10⁻¹⁴ incidents per hour or 6.35×10⁻¹⁰ incidents per year. This means that a given replication group is safe in a given year with probability 0.999,999,999,4. That’s your 9 nines right there!
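If you want to check the arithmetic yourself, here is a minimal Python sketch of the durability formula. The function name `loss_rate_per_hour` and the `HOURS_PER_YEAR` constant are illustrative; the formula and the inputs are the ones from this post.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours


def loss_rate_per_hour(n, m, mttf_hours, mttr_hours):
    """Data loss rate for one replication group, in incidents per hour.

    n: disks each file is stored on
    m: number of disk losses the group can survive
    """
    lam = 1.0 / mttf_hours  # failures per hour, per disk
    mu = 1.0 / mttr_hours   # recoveries per hour, per failed disk
    combinations = math.factorial(n) / (
        math.factorial(m) * math.factorial(n - m - 1)
    )
    return combinations * lam ** (m + 1) / mu ** m


per_hour = loss_rate_per_hour(n=3, m=2, mttf_hours=287_795, mttr_hours=24)
per_year = per_hour * HOURS_PER_YEAR

print(f"loss rate: {per_hour:.3g} incidents/hour")
print(f"loss rate: {per_year:.3g} incidents/year")
print(f"nines of durability: {-math.log10(per_year):.2f}")
```

Treating the yearly rate as a yearly probability is only a good approximation because the rate is so small.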
The plot twist
Whoops, I almost forgot the plot twist… this is the important part.
9 nines sounds like a lot, right? Does that mean you did good? Well, maybe.
9 nines isn’t that high when you’re at scale. That’s the durability of a given individual replication group, but if you have a huge number of replication groups (i.e., a lot of data) then data loss becomes more likely. Each individual group has 9 nines of durability but there are more of them to fail.
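A rough Python sketch of how this plays out across a fleet. Two assumptions here are mine and not from the analysis above: that replication groups fail independently, and the group count of one billion, which is a purely hypothetical number picked for illustration.

```python
import math


def fleet_loss_probability(per_group_annual_loss, num_groups):
    """Probability that at least one group loses data in a year.

    Computes 1 - (1 - p)^N, assuming groups fail independently.
    log1p/expm1 keep precision when p is tiny.
    """
    return -math.expm1(num_groups * math.log1p(-per_group_annual_loss))


p_group = 6.35e-10             # per-group annual loss probability from above
hypothetical_groups = 10 ** 9  # illustrative only, not a real fleet size

p_fleet = fleet_loss_probability(p_group, hypothetical_groups)
print(f"chance of losing at least one group this year: {p_fleet:.0%}")
```

With numbers in this range the fleet-wide risk is nowhere near 9 nines, even though every individual group still is.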
If you’re storing exabytes you’ll likely want to do better than this. Increasing replication factor or buying better disks or reducing recovery delay can yield much higher durability numbers. While Dropbox publicly claims durability of 12 nines, the numbers we compute via that durability formula are well over 24 nines! This is thanks to replicating data on much higher numbers of disks, spread across multiple geographical regions.
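To get a feel for how much each lever helps, here is a sketch that feeds a few configurations through the same durability formula. The 3-way and RS(9,6) parameters come from earlier in this post; the 12-hour MTTR and 4-way replication rows are hypothetical variations added for comparison, and the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours


def annual_loss_rate(n, m, mttf_hours, mttr_hours):
    """Per-group data loss rate in incidents per year."""
    lam, mu = 1.0 / mttf_hours, 1.0 / mttr_hours
    combinations = math.factorial(n) / (
        math.factorial(m) * math.factorial(n - m - 1)
    )
    return combinations * lam ** (m + 1) / mu ** m * HOURS_PER_YEAR


def nines(annual_loss):
    """Number of nines of durability for a given annual loss probability."""
    return -math.log10(annual_loss)


scenarios = {
    "3-way, MTTR 24h": (3, 2, 287_795, 24),
    "3-way, MTTR 12h (hypothetical)": (3, 2, 287_795, 12),
    "4-way, MTTR 24h (hypothetical)": (4, 3, 287_795, 24),
    "RS(9,6), MTTR 24h": (9, 3, 287_795, 24),
}
for name, args in scenarios.items():
    print(f"{name}: about {nines(annual_loss_rate(*args)):.1f} nines")
```

Because the loss rate scales with MTTRᵐ, halving the recovery time on a 3-way group cuts the loss rate by a factor of four, while each extra disk of redundancy multiplies it by something on the order of MTTR/MTTF.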
Ok cool, 24 nines sounds like a really big number. Does that mean we get to pat ourselves on the back and go home?
Your durability estimate is only an upper bound.
It’s fairly easy to design a system with astronomically high durability numbers. 24 nines is a mean time to failure of 1,000,000,000,000,000,000,000,000 years. When your MTTF dwarfs the age of the universe then it might be time to reevaluate your priorities.
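The conversion between nines and MTTF is simple enough to sketch in a few lines of Python. The function names are illustrative, the age of the universe is an approximate figure I am supplying for comparison, and the MTTF conversion assumes the annual loss probability is small enough to treat as a rate.

```python
import math

AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate, for comparison only


def nines_from_annual_loss(annual_loss):
    """Nines of durability for a given annual loss probability."""
    return -math.log10(annual_loss)


def mttf_years_from_nines(nines):
    """Mean time to data loss in years, taking annual loss = 10^-nines."""
    return 10.0 ** nines


mttf_years = mttf_years_from_nines(24)
print(f"24 nines is an MTTF of about {mttf_years:.0e} years")
print(f"that is about {mttf_years / AGE_OF_UNIVERSE_YEARS:.0e} universe lifetimes")
```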
Should we trust these numbers though? Of course not, because the secret truth is that adherence to theoretical durability estimates is missing the point. They tell you how likely you are to lose data due to routine disk failure, but routine disk failure is easy to model and protect against. If you lose data due to routine disk failure you’re probably doing something wrong.
What sounds more likely? Losing a specific set of half a dozen different disks across multiple geographical regions within a narrow window of time? Or maybe an operator accidentally running a script that deletes all your files, or a firmware bug that causes half your storage nodes to fail simultaneously, or a software bug that silently corrupts 1% of your files? The challenge in designing a storage system is making these other risks so unlikely that you have a chance to get in the vicinity of your theoretical estimates.
Now what do I do?
Alright alright, how do you actually avoid losing data due to catastrophic unforeseen events? I’ll probably write a post about it sometime. In the meantime you can check out the full chapter in Seeking SRE or learn about verification systems on the Dropbox Tech Blog.