Engineering Operations — Our First Bad Surprise
Recently, Kiwi Dials hit one of those unpleasant milestones that every tech company hits at some point in its growth: we had an extended service outage. For about 5 hours, nobody was able to use Kiwi Dials to vote. When they tried, the app displayed the message: “Oops! Sometimes things don’t go according to plan; unfortunately this is one of those times. We’re sorry to keep you away from Kiwi Dials, but we will let you in as soon as possible. You could try exiting and restarting the app. If that doesn’t work, please send us an email at email@example.com to let us know what’s going on.” That message meant the back-end service was offline.
This happens to everybody once in a while, usually for very short periods at a time, from a few seconds to as much as a minute or two, often due to factors outside our control, such as Internet congestion or service transitions. Most of the time, the issue resolves itself, because the systems are designed to be resilient in the event of trouble. This time it was different.
Kiwi Dials is a modern, cloud-based app, hosted on Microsoft’s Azure infrastructure. It’s designed to be resilient and expandable. When demand spikes, the system scales easily and automatically. If there is a problem with any of the components (like the database, servers, messaging transports, etc.) built-in monitoring alerts us. Those monitors are distributed geographically around the country so that we can detect performance problems for anyone, no matter where they live and work, and respond immediately.
We’re proud of the engineering effort that went into delivering the Kiwi Dials service, so when one of us pulled out a phone for a demo, we were surprised to see an error, but figured it was just a problem with that one phone. We tried other phones, only to see the same error on each of them. We concluded it was a problem with the service, but figured it would be back up in a minute. After all, we hadn’t received any service alerts, so maybe the problem was still contained. A few minutes later, Kiwi Dials was still not working and we still had no notifications. Now we knew something was up.
It turns out that our Azure usage had hit our billing cap, which disabled the service. Once an Azure service is disabled for more than a few minutes, key parts of the deployment get deleted, so simply turning the service on again wasn’t enough to get things running. We had to rebuild the service and redeploy it, which took far longer than we expected.
The experience left us feeling pretty sheepish. The engineering team has decades of experience building, deploying, and managing software for some of the largest companies in the world, so it felt like a bucket of cold water to realize the lights had gone out right under our noses for so long.
Once everything was back to normal, we put some thought into how things ran off the rails in such a big way, and how we could avoid a similar event in the future. Here are a few lessons we took from the event:
- Use billing caps on Azure carefully. If the cost of a subscription is a key concern, then a better option might be to establish alerts when the cost exceeds a specified threshold.
- Establish monitoring functions under an independent subscription.
- Don’t break the build. Make sure the build always works and is automated all the way into production.
- Test the infrastructure. Keep a list of everything that could go wrong, and add to it over time. Plan how to respond to each situation, and test those responses to make sure they actually work.
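The first lesson, alerting on spend rather than capping it, can be sketched in a few lines. This is a minimal, hypothetical illustration (the budget figure, thresholds, and `current_spend` source are all made up for the example), not Azure’s actual alerting API, which lets you configure the same idea through budgets and action groups:

```python
def thresholds_crossed(current_spend, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the fraction-of-budget thresholds that current spend has crossed.

    Crossing a threshold would trigger a notification to the team, instead of
    a hard cap silently shutting the service down.
    """
    return [t for t in thresholds if current_spend >= t * budget]

# Example: $90 spent against a $100 monthly budget crosses the 50% and 80%
# marks but not 100%, so the team gets warned well before any shutoff.
print(thresholds_crossed(90, 100))  # → [0.5, 0.8]
```

The key design point is that every outcome is a notification to a human; nothing in the alert path can take the service offline on its own.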
Rationally, we know we’re better for the experience, and that we won’t make those same mistakes again. But the truth is, it’s a little embarrassing. Thanks for sticking it out with us.
Originally published at kiwidials.com.