The “What” and “Why” Behind Performance Metrics

Agustin Sacco · Published in SSENSE-TECH · Sep 25, 2020 · 5 min read

Part II: Commitments to Performance & Reliability

This is Part II of a three-part series focusing on the what and why behind performance metrics. Access Part I of the series here.

At SSENSE, we have adopted a microservice architecture that allows teams to design and implement a multitude of services that solve different types of problems. For example, a REST API that serves tax information to customer-facing channels will require a very different set of indicators (metrics) to track performance and reliability than, say, a cron job that captures payments on newly created orders every 10 minutes.

Providing tooling for teams to track the performance and reliability of such a large variety of services can become a challenge. To overcome it, our Architecture team had to provide a standard set of SLIs (Service Level Indicators) for each tier of our stack.

Each system tier is made up of services that expose different capabilities. For example, the UI Tier consists of user interfaces, which makes it possible to track page load time (PLT) or time to interactive (TTI). The API Tier does not expose that capability, so a service in this tier cannot logically use PLT or TTI.
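
As a rough illustration of a UI-tier indicator, here is a minimal sketch of how page load time could be captured with the browser's standard Navigation Timing API. The reporting endpoint is hypothetical, and in practice TTI usually comes from a dedicated library rather than a hand-rolled measurement.

```typescript
// Minimal sketch: capture page load time (PLT) in the browser using the
// standard Navigation Timing API. The /metrics/plt endpoint is hypothetical.
function reportPageLoadTime(): void {
  const [nav] = performance.getEntriesByType('navigation') as PerformanceNavigationTiming[];
  if (!nav) return;

  // Time from the start of navigation until the load event completes.
  const pageLoadTimeMs = nav.loadEventEnd - nav.startTime;

  // Ship the metric to a logging/metrics platform (endpoint is illustrative).
  navigator.sendBeacon('/metrics/plt', JSON.stringify({ pageLoadTimeMs }));
}

window.addEventListener('load', () => {
  // loadEventEnd is only populated after the load event finishes,
  // so defer the measurement to the next task.
  setTimeout(reportPageLoadTime, 0);
});
```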

In many cases, however, these capabilities overlap, which allows us to reuse similar SLIs across tiers. Each distinct capability requires its own SLI to ensure that its performance and reliability are tracked.

System Tiers

Having covered how we split our stack and generate SLIs that match the capabilities of each service, we can now focus on a couple of system tiers. This will illustrate the kinds of standard SLIs we can use to track the performance and reliability of services within these tiers.

Below are examples of two services, a REST API and a data ingestion worker, which are part of the API tier and the Data Ingestion tier respectively.

Let’s imagine that the service on the left is a tax service which exposes an endpoint that accepts an array of order line items and returns tax information. The traffic this endpoint receives is logged to our logging platform, and we use the responses of this endpoint to calculate our first SLI: yield. Yield provides a request-based availability metric, calculated by dividing successful requests (200–499 response codes) by the total number of requests, where unsuccessful requests are those returning 500–599 response codes. This SLI illustrates the health of the service by providing a high-level availability metric, where 100% is total availability and 0% is no availability at all.
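
As a sketch (not our production code), here is how yield could be computed from aggregated response-code counts; the function and type names are illustrative.

```typescript
// Minimal sketch: compute the yield SLI from aggregated response-code counts.
// Status codes 200-499 count as successful, 500-599 as unsuccessful.
type ResponseCodeCounts = Record<string, number>;

function computeYield(counts: ResponseCodeCounts): number {
  let successful = 0;
  let unsuccessful = 0;

  for (const [code, count] of Object.entries(counts)) {
    const status = Number(code);
    if (status >= 200 && status <= 499) successful += count;
    else if (status >= 500 && status <= 599) unsuccessful += count;
  }

  const total = successful + unsuccessful;
  // 100% means every request succeeded; 0% means none did.
  return total === 0 ? 100 : (successful / total) * 100;
}

// Example: 9,990 successful requests and 10 server errors -> 99.9% yield.
console.log(computeYield({ 200: 9900, 404: 90, 500: 10 })); // 99.9
```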

The Yield indicator works very nicely for services that expose REST endpoints, but it is not very applicable to a worker service that exposes no such functionality. Instead, for a data ingestion worker that pulls messages from a queue, we came up with another SLI: age of the oldest message. This indicator tracks how old the oldest message in your queue is, ensuring that your ingestion rate keeps up with incoming traffic. If your queue receives at most 100,000 messages and can consistently ingest them within an hour, this indicator will not trigger any thresholds. If, for example, the queue suddenly receives a million messages, the oldest message will eventually exceed the one-hour threshold and trigger an alert.
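
Here is a minimal sketch of how this indicator could be derived from the enqueue timestamps of the messages in a queue; the message shape and threshold wiring are illustrative, and managed queues such as AWS SQS expose an equivalent metric (ApproximateAgeOfOldestMessage) out of the box.

```typescript
// Minimal sketch: derive the "age of the oldest message" SLI from enqueue
// timestamps. The QueueMessage shape and threshold are illustrative.
interface QueueMessage {
  enqueuedAt: Date;
  body: string;
}

const MAX_AGE_SECONDS = 60 * 60; // alert threshold: one hour

function oldestMessageAgeSeconds(messages: QueueMessage[]): number {
  if (messages.length === 0) return 0;
  const oldest = Math.min(...messages.map((m) => m.enqueuedAt.getTime()));
  return (Date.now() - oldest) / 1000;
}

function shouldAlert(messages: QueueMessage[]): boolean {
  // If ingestion falls behind, the oldest message keeps aging until it
  // crosses the threshold and an alert fires.
  return oldestMessageAgeSeconds(messages) > MAX_AGE_SECONDS;
}
```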

Objectives

At this point, we have covered why certain SLIs apply to one system tier and not another, as well as how these SLIs can help a team track performance and reliability. We can now dive into the realm of creating and meeting objectives, or SLOs (Service Level Objectives), for the performance and reliability we track.

At SSENSE, system reliability objectives are typically linked upwards to top-level company objectives. This means that both product managers and developers have a stake in ensuring the services they build are robust and reliable.

Other objectives based on performance or data ingestion, as covered above, may trickle down to the engineering level, where they help teams make informed decisions and increase accountability. The UI and API tiers are held to a strict 99.9% (three nines) availability objective, tightened to 99.95% during sale periods. Other SLIs for these tiers, such as path error rate or path response time, exist simply to give the team visibility and to ensure that code or infrastructure changes do not negatively impact the performance of these endpoints.
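
As a simple illustration of how such an objective can be checked against a measured SLI, here is a sketch assuming yield has already been aggregated over the evaluation window; the sale-period flag is illustrative.

```typescript
// Minimal sketch: check a measured yield (availability) against the tier's SLO.
// How the sale-period flag is sourced will vary; it is illustrative here.
const STANDARD_SLO = 99.9;     // three nines
const SALE_PERIOD_SLO = 99.95; // tightened objective during sale periods

function meetsAvailabilityObjective(measuredYield: number, isSalePeriod: boolean): boolean {
  const objective = isSalePeriod ? SALE_PERIOD_SLO : STANDARD_SLO;
  return measuredYield >= objective;
}

// Example: 99.92% availability passes the standard objective but not the sale-period one.
console.log(meetsAvailabilityObjective(99.92, false)); // true
console.log(meetsAvailabilityObjective(99.92, true));  // false
```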

Continuous Monitoring

At this point we have covered the theoretical portions of measuring and creating objectives, but we have not touched much on how we technically create and enforce these objectives at the code level.

At SSENSE, we strive to provide developers with all the tooling necessary to build, test, deploy, and monitor applications. Unlike the first three, the last one, monitoring, was traditionally done manually and heavily governed to ensure that issues are quickly surfaced and that alert thresholds are not altered by just anyone. Although this pattern protects against unapproved changes and unilateral decisions, it also reduces the monitoring visibility developers have into the services they manage.

Continuous monitoring, much like Continuous Integration and Deployment, gives developers the ability to create monitors through code. These monitors are continuously applied by the integration layer when pull requests are merged to the trunk branch, and version control provides both visibility into and a history of every change. Since code changes are governed by code review, and potentially code owner approval, a single developer cannot push through unapproved changes.
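
As a sketch of what monitoring as code can look like, here is a monitor expressed as a plain definition that a CI step could apply on merge to trunk; the shape, query, and applyMonitor helper are all illustrative and not tied to any specific monitoring platform.

```typescript
// Minimal sketch: a monitor defined as code and kept in version control.
// The MonitorDefinition shape and applyMonitor() helper are illustrative.
interface MonitorDefinition {
  name: string;
  query: string;     // metric or log query the monitor evaluates
  threshold: number; // value that triggers the alert
  notify: string[];  // channels or teams to page
}

const oldestMessageAgeMonitor: MonitorDefinition = {
  name: 'data-ingestion-worker: age of oldest message',
  query: 'max:queue.oldest_message_age_seconds{service:data-ingestion-worker}',
  threshold: 3600, // one hour, matching the SLI threshold described above
  notify: ['#team-alerts'],
};

// In CI, a step would read definitions like this from the repository and
// create or update the corresponding monitors (hypothetical helper).
declare function applyMonitor(definition: MonitorDefinition): Promise<void>;
```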

With the adoption of monitoring as code, developers now benefit from pushing features to their application in one atomic commit. This includes application implementation, automated testing, infrastructure alterations, and monitoring, all within the same pull request.

As continuous monitoring adoption increases throughout our organisation, we are seeing increased levels of visibility in places that were previously lacking. This visibility has had a direct influence on how teams plan and roll out new features, and is helping to shape how developers approach feature development.

Part III of this SSENSE-TECH series will elaborate on how these performance metrics affect technical accountability, as well as how they align with company objectives. Stay tuned!

Editorial reviews by Deanna Chow, Liela Touré, & Mario Bittencourt.

Want to work with us? Click here to see all open positions at SSENSE!
