Scheduled Maintenance is not a Thing

Joe Crobak
Mar 26, 2018 · 8 min read

Inspired by Jeff Dean’s Numbers Every Engineer Should Know, in 2015 a crew at the United States Digital Service (USDS) set out to build a similar list of “IT numbers everyone should know” aimed at a federal government audience. Among its conservative estimates: hosting a static website costs ~$500/month, and storing 5TB of data costs ~$150/month. At the top of the sublist of government-specific numbers: acceptable hours of scheduled maintenance = 0.

If you’ve ever interacted with a government system, whether it’s the IRS, your state’s DMV, healthcare.gov, or the FAA, you’ve probably experienced a notification of “system unavailable” or “down for scheduled maintenance.” This, unfortunately, is a well-accepted “feature” of federal IT systems.

Occasionally, there are legitimate technical reasons for scheduled maintenance. For example, many systems in government are built on legacy mainframes and haven’t received the money necessary to upgrade. Time on the mainframe is limited, and thus some systems require downtime over the weekend (or nightly) to perform batch processing.

Incentives and Acceptance

The legitimate technical reasons are the exception, though. More often, scheduled maintenance is the result of public services embracing risky, big-bang software deployments. The culture doesn’t value, and in many ways makes impossible, the small, low-risk deployments that most private internet services have adopted.

Both vendors and the government bear responsibility for this. In my two years working with federal government software systems, I heard over and over that government software systems are more complex than those in private industry. At one point, I overheard a member of the healthcare.gov team compare their website with Amazon.com—they said something along the lines of “healthcare.gov is a much more complex website than Amazon.com.”

The myth that federal government systems are more complex, with more challenging requirements, than private sector software systems leads to lots of compromise and sacrifice. For example, a business owner justifies a 12-hour deployment because US citizens span only a handful of time zones, making it OK to shut down the website from 8pm-8am over the weekend. The system architect designs the system to require scheduled maintenance, because it’s easier to assume that a system is up than to design the product to keep working in the face of downtime in an integrated legacy system. And when compliance adds complexity to a software architecture, rather than questioning the red tape, the business deems a ten-second response time for an HTTP endpoint acceptable.

Further, vendors aren’t incentivized to bring new software practices to a project. Industry best practices like continuous deployment and even configuration automation (how will you keep segregation of duties?!) are seen as risky because they don’t fit neatly into the strictly defined software development lifecycle. While modern practices are discouraged, contracting structures can actively encourage bad ones. For example, a vendor can in some cases charge the government overtime for manual, time-consuming deployments done outside of business hours.

Zero-downtime deploys in the federal government

Fixing this problem is possible, though. Jon Booth’s Web and New Media Group at the Centers for Medicare and Medicaid Services (CMS), with help from startup-inspired vendors and other agency leaders, has been using modern practices since the early days of healthcare.gov.

So when USDS joined the Quality Payment Program (QPP) at CMS, we sought to build on their work in several ways, including performing zero-downtime deploys during regular business hours. These teams (there are a handful of vendors working on QPP) deploy across several codebases and software architectures, including some that integrate with legacy systems at CMS.

Below are the key challenges and strategies we used to overcome them. While none of these concepts are novel, hopefully other folks working in government and other large companies find them useful.

Automation

First, and most important, is automation. Strive to automate everything from deployments to unit tests to integration tests. If a bug-fix-test-deploy cycle takes 24 hours, then you can’t deploy early and often with high confidence. Early on in QPP, one of the development teams demonstrated a deploy that took 10 minutes and the ability to completely rebuild an environment in under 4 hours. They also had Selenium tests that breezed through key user flows to validate functionality in a matter of seconds.
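As a rough illustration of what one of those automated checks can look like (a minimal sketch, not the QPP team’s actual test suite; the URL and element ID are hypothetical), here’s a headless Selenium smoke test that fails fast if a deploy breaks a core page:

```python
# Minimal smoke test sketch (hypothetical URL and element ID, not the
# actual QPP test suite): load the login page and verify a key element
# renders, failing the build if a deploy broke a core user flow.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def test_login_page_renders():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run without a display, e.g. in CI
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.gov/login")  # hypothetical endpoint
        # Wait at most 10 seconds for the login form to appear.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "login-form"))
        )
    finally:
        driver.quit()
```

A handful of tests like this, run on every deploy, is what lets a team say “the site still works” in seconds rather than after an overnight manual regression pass.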

The IT staff at CMS were amazed (standing up an environment was previously a multi-month saga), and new parts of the project were held to this higher standard.

Cloud

Most of the government is stuck in the world of legacy data centers. While these contracts are often nominally aligned with the federal government’s cloud-first strategy from 2010, the beltway bandits win the bids, not the likes of Amazon Web Services, Microsoft Azure, or Google Cloud Platform. The resulting nine-plus-figure contracts “require” as much spend on labor as on hardware (that is, if there’s $1M in hardware, there’s likely at least $1M/yr in labor). Hourly work is reimbursable, which means there’s no incentive to automate (and hey, if a config is fat-fingered, the fix is another opportunity to bill the government!).

In addition to common data center tasks (racking servers and switches, swapping hard drives, patching switches), the data center vendor is usually responsible for manually provisioning environments (including the base OS install), installing/configuring applications, and patching the operating system when vulnerabilities are disclosed. Since AWS/Google/Microsoft don’t typically bid on government contracts, when a project has access to these cloud services they are purchased through an intermediary. Arrangements vary, but it’s common that this intermediary operates AWS like a traditional data center too—manually configuring Virtual Private Clouds, building Amazon Machine Images, and more.

One painful experience from the QPP project demonstrates the drawbacks of these contracts. A process that should have taken a matter of minutes with proper automation—establishing connectivity between two services—involved several vendors, phone calls, and an escalation to IT leadership. Because one of the services ran in Amazon Web Services (AWS) and the other in a CMS data center, more than five vendors were involved in setting up new routes, firewall rules, and database-level credentials. The requests for configuration changes were shuffled around in Word documents, and every step required a separate approval (sometimes just to answer a simple question like “what is the IP address of your server?”).

Compare this to a modern cloud—everything can be automated, from network and firewall rules to instance provisioning/scaling to application deployment. Approvals are done via code review/pull requests. We automated as much as we could for QPP, and we were able to attain reliable and fast deployments. Rather than paying for expensive hardware, database licenses, and manual backups, we made use of managed services like Amazon RDS, DynamoDB, and Amazon S3.
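To make “firewall rules as code” concrete, here’s a minimal sketch using boto3 with entirely hypothetical IDs and CIDR ranges (not QPP’s actual configuration). In practice the same change would more likely be expressed declaratively in CloudFormation or Terraform, but either way the rule lives in a repository and is applied after a pull-request review rather than circulated in a Word document:

```python
# Sketch of a firewall change as code (hypothetical security group ID and
# CIDR range): the rule is reviewed like any other code change and applied
# in seconds instead of weeks.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allow the application tier (10.0.1.0/24, hypothetical) to reach the
# database security group on the PostgreSQL port.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical security group ID
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "IpRanges": [{"CidrIp": "10.0.1.0/24", "Description": "app tier"}],
    }],
)
```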

With that said, the situation wasn’t perfect—some tasks still required working with a “cloud vendor” that wasn’t AWS, and we were limited in which AWS services we could use (more on this in the section on governance and compliance below).

Governance and Compliance

Each federal government department has defined a Software Development Lifecycle, which includes a number of steps that are required to build and deploy a software project. At Health and Human Services, there are over 60 different artifacts that need to be produced during the various phases of a project’s lifetime. This process is designed for waterfall software development, in which releases occur infrequently and these artifacts need only be prepared every 12 months or so.

CMS recognized this and created the Expedited Life Cycle (XLC). The XLC, though, still includes 13 reviews with the CMS Technical Review Board (and other oversight bodies) and 65 artifacts. Fortunately, we were able to piggyback on work that Jon Booth’s team had done and slim down this list by delegating several of the reviews to internal teams and automatically generating most of the reports (such as the defect log and product requirements document) using JIRA and Confluence. In most cases, our CMS colleagues who had worked in IT for years were able to help us navigate the bureaucracy and adapt the governance and compliance to fit our desired development cycle (this was often just as much about using the right words as it was about doing the right things).
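To give a flavor of what generating those reports automatically can mean, here’s a rough sketch that exports a defect log from JIRA’s REST search API. The URL, project key, and credentials are placeholders, and this is not the actual tooling used on QPP:

```python
# Rough sketch of generating a defect log from JIRA's REST API
# (placeholder URL, project key, and credentials; not CMS's actual tooling).
import csv
import requests

JIRA_URL = "https://jira.example.gov"   # placeholder
AUTH = ("svc-account", "api-token")     # placeholder credentials

def export_defect_log(project_key="QPP", out_path="defect_log.csv"):
    resp = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={
            "jql": f"project = {project_key} AND issuetype = Bug",
            "fields": "summary,status,priority",
            "maxResults": 1000,
        },
        auth=AUTH,
    )
    resp.raise_for_status()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Key", "Summary", "Status", "Priority"])
        for issue in resp.json()["issues"]:
            fields = issue["fields"]
            writer.writerow([
                issue["key"],
                fields["summary"],
                fields["status"]["name"],
                fields["priority"]["name"],
            ])
```

A script like this, run on a schedule, means the artifact is always current and nobody spends a week before each review assembling it by hand.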

We also leveraged a strong relationship with the Chief Information Security Officer’s office to bring in new SaaS products such as Okta and Amazon KMS. This, though, was often a multi-month (if not year-long) process due to the stringent FedRAMP certification process and the weight given to compliance documentation. I’m not a fan of FedRAMP—in my experience it proves little about the actual information security of a system and is mostly a costly paperwork exercise. Further, the multi-million dollar cost of achieving FedRAMP is likely either to keep vendors from selling to government or to be passed on to the government as a rate hike.

Legacy integrations

Most of QPP’s integration with legacy systems is for offline, batch processing. But in cases where there was an online dependency (by this I mean the website reached out to the service as part of an HTTP request), we made heavy use of caching. This way, an end-user could still interact with our website even if the legacy system was down for scheduled or unscheduled maintenance.
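Here’s a minimal sketch of that caching pattern (hypothetical endpoint and names, not QPP’s actual code): serve fresh data when the legacy system responds, and fall back to the last cached value when it doesn’t.

```python
# Minimal read-through cache with a stale fallback (hypothetical endpoint
# and names): when the legacy service is down, serve the last known good
# response instead of failing the user's request.
import time
import requests

LEGACY_URL = "https://legacy.example.gov/provider"  # hypothetical
_cache = {}  # provider_id -> (fetched_at, payload)
TTL_SECONDS = 15 * 60

def get_provider(provider_id):
    now = time.time()
    cached = _cache.get(provider_id)
    if cached and now - cached[0] < TTL_SECONDS:
        return cached[1]  # fresh enough, skip the legacy call entirely
    try:
        resp = requests.get(f"{LEGACY_URL}/{provider_id}", timeout=2)
        resp.raise_for_status()
        payload = resp.json()
        _cache[provider_id] = (now, payload)
        return payload
    except requests.RequestException:
        if cached:
            return cached[1]  # legacy system is down: serve stale data
        raise  # nothing cached yet; surface the error
```

The design choice is deliberate: slightly stale data during a legacy outage is almost always better for the end-user than a “system unavailable” page.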

Unfortunately, the integration isn’t always straightforward. You’re lucky to find a SOAP web service for a government system. If you do, it’s likely broken (because you’re the first to use it), slow and unreliable (as mentioned above, 10- or 20-second response times are fairly common), and reaching the endpoint at all is a months-long journey through the right firewall requests. Far more common are old-school enterprise service buses (think SOAP over MQ without delivery guarantees) or “enterprise” FTP.

Contracts

Many IT contracts in the federal government are structured with the expectation of scheduled maintenance. It’s not uncommon for an application development contract to say that deployments will be performed between 2am and 8am on Sunday mornings. That outage window also goes into the contractual calculation of the Service-Level Agreement (SLA). That is, a vendor can claim 100% uptime even if they’re down for 6 hours a week, as long as that downtime is scheduled. Since the SLA excludes scheduled maintenance, there’s a loophole in which a vendor can notice a “service degradation” and “schedule” a maintenance window to start in 5 minutes. If they didn’t eat up all of their 6 hours/week of scheduled downtime doing a deployment, they can get off scot-free.
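To put rough numbers on that loophole (illustrative arithmetic, not figures from an actual contract):

```python
# Six hours a week of "scheduled" downtime, excluded from the SLA
# (illustrative numbers, not from an actual contract).
hours_per_week = 7 * 24                      # 168
scheduled_downtime_hours = 6

actual_availability = 1 - scheduled_downtime_hours / hours_per_week
print(f"actual availability:   {actual_availability:.1%}")  # ~96.4%
print("reported availability: 100.0% (scheduled maintenance excluded)")
```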

Oftentimes, the entire data center and all applications in it shut down simultaneously during that outage window. This leads to simplifying assumptions in software architecture (particularly when it comes to dependencies — it’s much easier to assume that all dependent systems will be upgraded simultaneously) that often make the system brittle in the face of failure.

QPP took the approach of requiring automation and incentivizing zero-downtime deployments. Some vendors like these practices, and others don’t. Since sometimes a carrot isn’t enough, another option is to have one of the more advanced vendors deploy a PaaS solution that’s as easy to use as Heroku for everyone else.

Conclusions

The Digital Services Playbook outlines these and other best practices for building modern services. As I’ve described above, without these strategies a project is likely to fall into the old habit of big-bang deploys requiring scheduled maintenance. Projects like those from the Web and New Media Group and the Quality Payment Program have shown that the federal government can do better. It’s time to expect more from our publicly-funded software systems and to hold the government and its vendors accountable.

If working on these types of technical (and bureaucratic) challenges seems interesting to you, then there are lots of opportunities to get involved and help. The United States Digital Service is hiring—you can learn more about what it’s like to work there in my post reflecting on my two years there: The Best and Hardest Job You’ll Ever Have.


Joe Crobak

Distributed and complex systems, healthcare and gov tech. Prev @USDS @Foursquare & some defunct startups. I run dataengweekly.com