Why Service Acceptance is Important to Hippo

Mike Gordon
Hippo Engineering Blog
Jul 7, 2022

Many engineers like to write software, but operating that software is often pushed aside or not thought about until the last minute.

At Hippo Insurance, our engineering team has decided to move to a service-oriented architecture. Some would call them “microservices,” but we aim not to write too many services, in order to keep our platform manageable. We do build other services for supporting purposes, but the majority of our business logic resides in a small number of core services.

When we build and deploy a service at Hippo, we want to be assured that we can successfully operate that service. We want to know how it performs, be able to monitor it, and be able to fix issues quickly. We’d like to deploy the highest quality services that take the least amount of effort to maintain. This article describes how we established service acceptance criteria and how we think about operating the service from inception to shutdown. It describes different tools, techniques, and principles we use to successfully run our distributed system.

I like to use the term MOPS for service acceptance. A service should be Measurable, Operable, Predictable, and Scalable before it can be accepted into our production environment. Here’s a quick explanation of what each of these means:

Measurable

We should be able to measure low-level health (e.g. CPU utilization, heartbeat), high-level health (does listing an entity through the service return a non-empty set? what is the success rate of HTTP requests to that service?), and application metrics (e.g. how many times a new implementation of a feature is executed). Examples of metrics that could be included in a “measurable” requirement:

  • Inbound HTTP request rate
  • Inbound HTTP request error rate
  • Inbound HTTP p50/p95 latency
  • CPU and memory utilization average for each replica of a service
  • Database CPU utilization, memory, available space for a service
  • Queue size and dead letter queue size for a system that relies on a queue to process requests

I like to set up a dashboard for each service in our dashboard system of choice, Grafana, with a tile for each of these items, drawn from metrics in our real-time monitoring system (we use Prometheus).
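As a concrete illustration, here is a minimal sketch of Prometheus recording rules that could back those dashboard tiles. The metric and label names (http_requests_total, http_request_duration_seconds_bucket, the service label) are assumptions about what a service exports, not necessarily what we use at Hippo; substitute whatever your services actually emit.

groups:
  - name: service-dashboard-rules
    rules:
      # Inbound HTTP request rate, averaged over 5 minutes
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total{service="my-service"}[5m]))
      # Inbound HTTP error rate (5xx responses)
      - record: service:http_request_errors:rate5m
        expr: sum(rate(http_requests_total{service="my-service", status=~"5.."}[5m]))
      # Inbound HTTP p95 latency, computed from a histogram
      - record: service:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="my-service"}[5m])))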

Operable

We should be able to operate a service or system successfully. Operation requires an operator’s manual or run-book. For example, when we expose an HTTP interface to kick off a bulk process, the run-book should include an example request, a link to an OpenAPI spec, and a description of parameters and behavior. It’s also a really good idea to have a Postman collection (or equivalent) set up for all of the different endpoints the service handles, so an operator can test them instantly with the click of a button.
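To make that concrete, here is a rough sketch of the kind of OpenAPI fragment a run-book might link to for a bulk endpoint. The path, parameters, and behavior are hypothetical, purely to show the level of detail an operator needs.

paths:
  /internal/bulk-reprocess:
    post:
      summary: Kick off a bulk reprocess of records
      description: Starts an asynchronous bulk job; poll the returned job ID for status.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                startDate:
                  type: string
                  format: date
                  description: Only process records created on or after this date
                dryRun:
                  type: boolean
                  description: When true, report what would be processed without writing anything
      responses:
        "202":
          description: Bulk job accepted; response body contains a job ID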

Predictable

A service/system should be predictable and reliable. We shouldn’t see unexpected spikes in HTTP requests, queue depth, or CPU. When we measure reliability, our service level objective (SLO) for services is 99.8%. Most tech companies operating back-end infrastructure aim for reliability in the range of 99% to 99.99%, depending on their business and how much they need to invest in a given level of reliability.
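One way to make a 99.8% target actionable is to alert when the error rate eats into the SLO. The sketch below reuses the assumed http_requests_total metric from the Measurable section; the window and threshold are illustrative rather than a tuned burn-rate policy.

groups:
  - name: slo-alerts
    rules:
      - alert: ServiceErrorBudgetBurn
        # 0.002 is the error budget implied by a 99.8% availability SLO
        expr: |
          sum(rate(http_requests_total{service="my-service", status=~"5.."}[30m]))
            /
          sum(rate(http_requests_total{service="my-service"}[30m]))
          > 0.002
        for: 10m
        labels:
          severity: page
        annotations:
          summary: my-service error rate is burning through its 99.8% SLO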

A batch or bulk system should have a predictable ramp-up, a maximum rate of data processing, and a predictable ramp-down.

An example graph of traffic on a predictable system

Here is an example of a system that is less predictable. It has a huge spike of requests initially, then a long tail of unpredictable activity. It’s difficult to determine when the process will finish, and why the system isn’t processing at maximum throughput to finish more quickly:

An unpredictable service, number of items processed

And here’s an example of a system that has a more predictable behavior pattern:

A predictable service, number of items processed

Reliable

A sub-category of “Predictable”: reliable means we can define, and stay within, a specific rate of unexpected or system errors from our system. The simplest way to establish an error rate is to load test, simulating the level of requests we expect in production. Sometimes test versions of our systems can’t replicate the behavior of our production systems, so it’s also possible to measure the reliability of a system by running a canary in production before release.
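As a sketch of the canary approach (the post doesn’t prescribe a specific tool), a progressive rollout with Argo Rollouts might look like the following; the service name, image, traffic weights, and pause durations are illustrative.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: analytics-relay
spec:
  replicas: 4
  selector:
    matchLabels:
      app: analytics-relay
  template:
    metadata:
      labels:
        app: analytics-relay
    spec:
      containers:
        - name: analytics-relay
          image: registry.example.com/analytics-relay:candidate
  strategy:
    canary:
      steps:
        - setWeight: 10            # route 10% of production traffic to the candidate
        - pause: {duration: 30m}   # watch error rate and latency before continuing
        - setWeight: 50
        - pause: {duration: 30m}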

Scalable

A service/system should be able to scale appropriately for business needs. If we expect a bulk process to finish in 10 minutes but the maximum scaling can only deliver a 2-hour time frame, then the scaling is not correct for the business requirements.

Below is an example of a service that has predictable scaling, but doesn’t scale quickly enough to meet the needs of the business. Let’s assume we have a Node JS service running in production that has some basic requirements:

  • It relays data in real time from our system to a third-party analytics solution
  • It will have traffic in bursts and will need to scale quickly

We have a minimum of 2 replicas configured for this service in Kubernetes:

replicas: 2

But the auto-scaling settings use CPU- and memory-based scaling and are limited to the 2 replicas. Since this service handles simple HTTP requests and is single-threaded Node JS, it is unlikely to ever reach these CPU or memory limits and scale up:

resources:
  limits:
    cpu: "2"
    memory: 2Gi
  requests:
    cpu: "1"
    memory: 1000Mi

In this case, the service will only support the throughput of 2 parallel instances. We know the downstream system where we send analytics data can handle much higher throughput, so the service will run much more slowly than we need because its scaling settings don’t fit the situation. This is an example where scaling for the sake of scale isn’t necessary; what matters is scaling in the right way to meet the needs of the system.
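One possible fix, sketched below, is to autoscale on a traffic-based metric rather than CPU. This assumes a metrics adapter (for example, prometheus-adapter) exposes a per-pod http_requests_per_second metric; the names, target value, and replica counts are illustrative, not our production settings.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: analytics-relay
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-relay
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # Scale out when each replica handles more than ~100 requests per second
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"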

Acceptance Checklist

To accept a service into our production environment, we have a simple checklist to cover. In addition to the MOPS guidelines we also want to make sure we agree that the service should exist, that it’s not duplicating another service, and that it doesn’t introduce new infrastructure that we’re not capable of operating. With those guidelines our checklist is:

  • A new service must have approval from our architecture review process.
  • If new infrastructure is required for a service, it must have a design document, and that document must have been reviewed by our Cloud Platform team beforehand. For example, we don’t want to introduce MongoDB into our system if we’re already using a NoSQL database or don’t have a need for one.
  • If a service is adding a new external API dependency or significantly changing an existing one, it must also have an approval through our architecture review process.
  • Services and external (third-party) API dependencies must have metrics/monitoring in place in accordance with the measurable section of this doc.
  • Each new service or subsystem should have an operator’s manual/playbook upon release. That operator’s manual should describe how to investigate issues, what to do to fix problems, and who to call if a problem can’t be fixed.
  • Each service and subsystem should have appropriate inbound and outbound (dependency) request timeouts set. Those should be addressed in a design document and review; this helps make a service predictable. (A minimal example follows this list.)
  • All services, workers, and subsystems must be capable of being deployed to all environments managed by Hippo.
  • Each new service will have a go-live review before the staging environment release. This will be a 30-minute live meeting for our Ops/Infra teams to review this list.
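For the timeout item above, an inbound timeout can be enforced at the edge before the service itself is ever involved. The sketch below uses ingress-nginx annotations on a hypothetical Ingress; outbound (dependency) timeouts would be set in the service’s own HTTP client configuration.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: analytics-relay
  annotations:
    # Fail inbound requests that take longer than 10 seconds
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "10"
spec:
  rules:
    - host: analytics-relay.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: analytics-relay
                port:
                  number: 80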

Wrap it With a Bow

Even if your organization doesn’t want to be strict enough to have a formal service acceptance process, writing down what you think makes for the right acceptance criteria can help guide an engineering team to operate services successfully. A list of requirements influences the culture of an organization toward that goal. At Hippo, this list was just a suggestion for a long time, but teams started taking it seriously and were already doing these things before we decided to formally accept services into production.

I believe an organization can define its own service acceptance requirements, and doing so will benefit the development of those services. There are few worse problems for an engineering org than building a service and then finding out that you can’t operate it successfully in production, or that it doesn’t meet the reliability your organization needs. Using some of these principles can help you decide what your services need to do in production.

To try out our reliable production systems, go to hippo.com to get a quote for insurance.
