Day Two Problems

Excellent retro-clipart from geralt.

The idea of “day two problems” refers to all of the critical but often forgotten problems an organization faces once an application is in production. The primacy of developers in our industry today focuses much of our attention on everything that happens before production: coming up with feature ideas, coding, testing, and provisioning “servers” in seconds instead of weeks.

While these are all valuable “first-day” problems, once the application is deployed, a new kind of excitement ensues on the second day. “Day two” issues are not new. What has changed, though, are the patterns of behavior and the expectations of developers and operators in an increasingly software-defined, cloud native world.

Beyond uptime, the goals in production are to anticipate failure before users notice and minimize losses due to failure. This implies monitoring and remediation tools that will detect, help you diagnose, and then fix the problem. This may mean rolling back changes, redirecting traffic, shutting down offending features with feature flags, or scaling up resources for performance problems.
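
To make the feature-flag idea concrete, here is a minimal sketch of a “kill switch” that turns a feature off once its recent error rate gets too high. Everything in it is hypothetical and chosen for illustration: the `FeatureKillSwitch` class, the `handle_request` helper, the window size, and the threshold are all assumptions. A real system would more likely lean on a dedicated feature-flag service and its monitoring stack.

```python
from collections import deque


class FeatureKillSwitch:
    """Hypothetical flag that disables a feature when recent errors pile up."""

    def __init__(self, window=20, max_error_rate=0.25):
        # Rolling record of recent outcomes: True = success, False = failure.
        self.results = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.enabled = True

    def record(self, success):
        self.results.append(success)
        # Judge only once the window is full, so one early error cannot trip it.
        if len(self.results) == self.results.maxlen:
            failures = self.results.count(False)
            if failures / len(self.results) > self.max_error_rate:
                # Remediation: shut the offending feature off.
                # It stays off until someone re-enables it by hand.
                self.enabled = False


def handle_request(switch, new_feature, fallback):
    """Serve the new feature while the flag is on; otherwise use the old path."""
    if not switch.enabled:
        return fallback()
    try:
        result = new_feature()
        switch.record(True)
        return result
    except Exception:
        switch.record(False)
        return fallback()
```

The switch deliberately stays off after it trips. Deciding when the feature is safe to turn back on is left to a person, which is exactly the kind of call where operators and developers need to talk to each other.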

While these may seem like operational concerns, developers must be involved. At the very least, developers need to ensure that their applications and backing services support cloud native operations tasks. Developers often need to consult on diagnosing and fixing application problems. This illustrates one of the many reasons that “DevOps” (combining development and operations into a single team) is a best practice for cloud native organizations.

Another “day two” problem is upgrading smoothly. In my opinion, the most valuable benefit of a cloud native approach is the ability to constantly improve your software by frequently deploying new code to production. You then observe how those changes improve or hinder your business goals, and make new changes accordingly. Of course, the ability to smoothly upgrade the cloud platform itself, often with zero or near-zero downtime, is another “day two” task that’s often overlooked.

Keep “day two” in mind the next time the discussion focuses on just getting up and running quickly. The real cloud native challenge is what happens starting on day two, and how you respond to failure, hopefully not at 1 a.m.

This article is part of Built to Adapt, a publication by Pivotal that shares stories and insights on how software is changing the way businesses are built.