Is your database too big to fail?

Anthony Hobbs
3 min read · Aug 18, 2018


DevOps is all about automation, and CodeOps takes that automation to its extreme. The hard part that makes extreme automation work is fault tolerance.

Netflix made fault tolerance popular again with its Simian Army tools. In a similar vein, I practice the Sith Lord school of system administration: if an automated playbook can’t resolve an alert without human help, I terminate the server. This has serious architecture ramifications, because every service must be able to lose a server or two without visible impact to customers. I have limited time to troubleshoot every glitch in the cloud, so I need to architect the platform to let me pick and choose my battles. I run a report at the end of the week telling me how many times the playbooks had to run and how often they failed, so I can choose to investigate possible chronic issues.
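
In case it helps, here is a minimal sketch of that resolve-or-terminate loop. The names run_playbook, terminate_and_replace, and record_run are hypothetical stand-ins for whatever orchestration and reporting you actually use; the point is the shape of the workflow, not a specific tool.

```python
# Hypothetical sketch: try the playbook first, terminate the server if
# automation can't fix it, and record every run for the weekly report.
import logging

log = logging.getLogger("codeops")

def handle_alert(alert, run_playbook, terminate_and_replace, record_run):
    """Resolve an alert with automation, or destroy the offending server.

    run_playbook, terminate_and_replace and record_run are stand-ins for
    your real orchestration (Ansible, SSM, etc.) and reporting layer.
    """
    resolved = run_playbook(alert)            # attempt automated remediation
    record_run(alert, resolved)               # feeds the end-of-week report
    if not resolved:
        log.warning("Playbook failed for %s; terminating %s",
                    alert.name, alert.server)
        terminate_and_replace(alert.server)   # Sith Lord school: no mercy
```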

Your data stores do not get a free pass on the above requirements.

I allow chaos engineering to destroy databases in production on a whim. If a database fails at 3 a.m., traffic is automatically shifted to the backup, the failed DB is destroyed, and a new backup is provisioned. This is a huge architecture requirement with far-reaching implications: your databases must be replaceable routinely via automation without impacting your service. Your business may have a forgiving RTO bar, but the automation needs to cope with replacing servers on a daily basis. This means three very important things:

  1. If your data must be accurate (like payments), you have to enforce data consistency at write time rather than rely on eventual consistency configurations. This means your writes will be slower, because each write has to wait for quorum confirmation (see the first sketch after this list).
  2. You cannot have huge data stores; the data must be sharded so any database server can be replaced in a timely fashion. My rule of thumb: if you cannot copy the data between two disks in under 30 minutes, your database is too big to fail (see the back-of-the-envelope sketch after this list).
  3. Sharded databases can’t easily do things like table joins, and few relational databases are fully featured when used in a sharded configuration. You either write a middleman service that does joins for the app, or the app has to do them itself (see the last sketch after this list).
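
To make the first point concrete, here is a toy sketch of a quorum write: the write only counts as successful once a majority of replicas acknowledge it, which is exactly the latency you pay for consistency. The replica objects and their write() method are assumptions standing in for your real driver.

```python
# Hypothetical sketch: a write only succeeds once a majority (quorum)
# of replicas acknowledge it. Slower than fire-and-forget, but consistent.
def quorum_write(replicas, key, value):
    """Return True only if a majority of replicas confirm the write."""
    quorum = len(replicas) // 2 + 1
    acks = 0
    for replica in replicas:
        try:
            if replica.write(key, value):   # replica.write() stands in for your driver call
                acks += 1
        except ConnectionError:
            continue                        # a dead replica simply doesn't count toward quorum
    return acks >= quorum
```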
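
The 30-minute rule is just arithmetic: the sustained copy throughput between two disks caps how big any one shard can be. The throughput figure below is illustrative, not a benchmark; plug in your own measured copy speed.

```python
# Back-of-the-envelope check for the 30-minute rule. Measure your own
# disk-to-disk (or volume-to-volume) copy speed and substitute it here.
COPY_THROUGHPUT_MB_S = 250          # illustrative mid-range SSD-to-SSD figure
TIME_BUDGET_S = 30 * 60             # 30 minutes

max_shard_gb = COPY_THROUGHPUT_MB_S * TIME_BUDGET_S / 1024
print(f"Max shard size at {COPY_THROUGHPUT_MB_S} MB/s: ~{max_shard_gb:.0f} GB")
# At 250 MB/s that's roughly 440 GB; anything bigger is "too big to fail".
```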
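
And for the third point, an app-side “join” across shards usually boils down to fanning the query out and merging in application code. Again, the shard connections and their query() method are hypothetical placeholders for your actual client library.

```python
# Hypothetical sketch: the app fans a query out to every shard and joins
# the results itself, because the sharded database can't do it server-side.
def orders_with_customers(shards):
    """Merge orders and customers that may live on different shards."""
    customers = {}
    orders = []
    for shard in shards:              # shard.query() stands in for your real client
        customers.update({c["id"]: c for c in shard.query("SELECT * FROM customers")})
        orders.extend(shard.query("SELECT * FROM orders"))
    # The "join" happens here, in application code.
    return [{**o, "customer": customers.get(o["customer_id"])} for o in orders]
```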

Setting up CodeOps is a big upfront investment in time, and it places significant requirements on the way you operate your infrastructure. Historically, most organizations had to rush to get application and database servers up so the product could be developed, so they designed their platforms with short-term speed as the most important factor.

When you are practicing CodeOps, automation is everything. Automation is not free: it costs upfront development, redundant resources, and workflows that aren’t always optimized for speed. In return, your engineers can focus on new features instead of keeping the lights on, they can make large changes with more confidence, your customers get a better experience thanks to better stability, and your disaster recovery plan is not vaporware. The long-term agility benefits are significant; you have to weigh them against the financial cost of redundant servers. Remember that time is money.

If you found any of this interesting, please like the article. You have the opportunity to make my day 😍
