High Availability

“The System Will Be Available 99.9% of the Time,” Said No One Ever.

Malina Tran
Tech and the City
2 min read · Dec 23, 2016


A service-level agreement (SLA) is a contract between a provider or contractor and a client.

As Michael T. Nygard states: “SLA definitions are like the details in your medical insurance plan. Nobody reads them too closely until something awful happens.” Nygard advises specificity over ambiguity. In other words, don’t vaguely promise availability 99.9% of the time, and don’t refer to “the system.” The system, after all, will more than likely include calls to other systems beyond the enterprise application. You most likely have no control over those, so you should commit only to the availability of a specific feature or function. Not every system requires that much availability, either, and the cost increases radically at each level.
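To make those percentages concrete, here is a minimal sketch of the arithmetic behind them. The function names are my own, and the second function rests on two assumptions that are not from Nygard: the feature needs every dependency to be up, and the dependencies fail independently of one another.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def serial_availability(*availabilities_pct: float) -> float:
    """Combined availability of a feature that needs ALL of its parts.

    Assumption: failures are independent, so the individual
    availabilities simply multiply.
    """
    combined = 1.0
    for pct in availabilities_pct:
        combined *= pct / 100
    return combined * 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_minutes_per_year(target) / 60
        print(f"{target}% allows about {hours:.2f} hours of downtime per year")

    # Our own feature at 99.9%, calling an outside service that is also 99.9%.
    print(f"combined: {serial_availability(99.9, 99.9):.4f}%")
```

A 99.9% target works out to roughly 8.76 hours of downtime a year. And if a feature at 99.9% depends on an outside service that is also at 99.9%, the pair can only offer about 99.8% together under these assumptions, which is one reason to avoid promising a number for “the system” as a whole.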

Since managing expectations is a key aspect of working with clients (or really, anyone or anything in life), I’ll outline key aspects of availability requirements.

  1. Isolate each feature or business process. Nygard’s example is a hotel chain whose website may have several key functions, including online reservations and event bookings. The features that generate revenue will sensibly have the highest SLA because, well, duh, we all hate to wait.
  2. Define “availability” and the required level of availability, as well as how the feature will be checked (preferably through an automated system that executes synthetic transactions against it). Also:
    - Define which device or devices will monitor availability, how often they will run, and where they are located.
    - Determine how the monitor will report problems.
    - Determine how the percentage of availability is computed, whether it is based on time or sample size.
  3. Define what a good response looks like. You don’t want a transaction that responds quickly yet returns errors. Think about the following:
    - Maximum acceptable response time for each step of the transaction
    - Response codes and text patterns to indicate a successful transaction
    - Response codes and text patterns to indicate a failed transaction
  4. Plan for load balancing and clustering, two prerequisites for high availability. There are various techniques at a wide range of costs, and making early decisions about high-availability architecture can make deployment and identifying solutions much easier.
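Items 2 and 3 above can be sketched in a few lines. This is only an illustration under my own assumptions: the `Probe` record, the 2-second limit, the `"Reservation confirmed"` text pattern, and the sample data are all hypothetical, not anything Nygard prescribes. It shows how a “good response” might be defined, and how the same probe results yield different availability figures depending on whether you count samples or time.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One synthetic transaction result (hypothetical record layout)."""
    timestamp: float   # seconds since the start of the reporting window
    status_code: int   # HTTP-style response code
    body: str          # response text
    elapsed_s: float   # response time in seconds


def is_good(probe: Probe, max_elapsed_s: float = 2.0,
            success_pattern: str = "Reservation confirmed") -> bool:
    """Good means: right code, fast enough, AND the expected text is present."""
    return (probe.status_code == 200
            and probe.elapsed_s <= max_elapsed_s
            and success_pattern in probe.body)


def availability_by_samples(probes: list[Probe]) -> float:
    """Percentage of probes that were good."""
    if not probes:
        raise ValueError("no probes recorded")
    return 100 * sum(is_good(p) for p in probes) / len(probes)


def availability_by_time(probes: list[Probe], window_end: float) -> float:
    """Percentage of time covered by good probes.

    Assumption: each probe's verdict holds until the next probe runs.
    """
    if not probes:
        raise ValueError("no probes recorded")
    ordered = sorted(probes, key=lambda p: p.timestamp)
    next_times = [p.timestamp for p in ordered[1:]] + [window_end]
    up = sum(end - p.timestamp
             for p, end in zip(ordered, next_times) if is_good(p))
    return 100 * up / (window_end - ordered[0].timestamp)


# Made-up results from an unevenly scheduled monitor over a 20-minute window.
probes = [
    Probe(0.0, 200, "Reservation confirmed", 0.4),
    Probe(60.0, 200, "Error: please try again", 0.1),   # fast, but failed
    Probe(120.0, 200, "Reservation confirmed", 5.0),    # correct, but too slow
    Probe(600.0, 200, "Reservation confirmed", 0.5),
]

if __name__ == "__main__":
    print(availability_by_samples(probes))       # 50.0
    print(availability_by_time(probes, 1200.0))  # 55.0
```

With this made-up data, two of four probes pass, so the sample-based figure is 50%, while the time-based figure is 55% because the passing probes happen to cover more of the window. That gap is why the SLA should spell out which calculation applies.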

While I may not have to worry about SLAs right now (or ever, for that matter), I think it’s still worthwhile to learn about them. It also seems like a useful exercise to break down a company’s software system into core features and functions. Doing so can help envision the various moving parts of its services and determine which are essential and revenue-generating. Such is the software engineering field: breaking down larger, complex applications or systems into smaller components. It is oftentimes how I start on looming projects or TDD’ing. Being able to do so from a technical standpoint can clearly have implications from a business perspective.
