SRE: Debugging: Heuristics: Drivers of Increased Latency

dm03514 · Published in Dm03514 Tech Blog · May 11, 2019

Increased latency is a common symptom of system and resource contention. There are a few common drivers of increased latency which I have found help shortcut debugging and get to the source of contention: an increase in the rate of work being applied, a change in the type of work being done, or a change in the amount of work being performed per transaction. This post walks through each of these scenarios to illustrate what it looks like in practice, and should help you quickly answer: what has changed to drive an increase in latency? The methodology should help you quickly arrive at a source of load in order to inform mitigation (e.g. scale out the service or restrict traffic) and to give a concrete place to start investigating.

An alert fires indicating that a service's latency has increased. How do you start to understand the problem and get to the source of the latency? Latency is a way for us to describe work: it is an observation of how long (the duration) an operation takes. In order to observe the latency of an operation there has to be an operation, making the operation foundational to understanding latency.

An increase in latency indicates that there is a constraint on the system. System performance is a well understood and studied field and can usually be modeled using queuing theory. Since traffic is what drives the load on a system, it's an intuitive place to start looking. This methodology uses traffic to understand the driver of increased latency in a system.
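To make the queuing-theory intuition concrete, here is a minimal sketch (not from the original post) of the expected time a request spends in a simple M/M/1 queue as the arrival rate approaches the service rate; the service rate of 100 req/s is an assumption chosen only to show the shape of the curve.

```python
# Hypothetical illustration: expected time in an M/M/1 system,
# W = 1 / (mu - lambda), as utilization approaches 1.

def mm1_time_in_system(arrival_rate: float, service_rate: float) -> float:
    """Expected time a request spends in an M/M/1 system: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable system: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

service_rate = 100.0  # requests/sec the system can serve (assumed)

for arrival_rate in (50.0, 75.0, 90.0, 95.0, 99.0):
    latency_ms = mm1_time_in_system(arrival_rate, service_rate) * 1000
    print(f"utilization={arrival_rate / service_rate:.2f} "
          f"expected time in system={latency_ms:.1f}ms")
```

The curve is non-linear near capacity, which is why a modest increase in traffic can produce a disproportionate increase in latency.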

Increase in the Amount of Work Being Done

The first and most likely cause of increased latency is an increase in the amount of work being applied to the system. When the rate of work increases, it puts additional load on the system. When the system is at or near capacity, work begins to queue, resulting in longer latencies. If the constrained resource is scalable, then one solution is to scale out.

This is classic up-and-to-the-right growth and indicates more work is being applied to the system. The graph above shows a 50% increase in requests within a couple of seconds. If there is a correlated increase in latency, it suggests that the system is constrained somewhere.
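A minimal sketch of this check, assuming per-request timestamps are available; the data below is synthetic and the 60-second windows are arbitrary:

```python
from bisect import bisect_left

def rate_per_second(timestamps, start, end):
    """Requests per second within [start, end); timestamps must be sorted."""
    lo = bisect_left(timestamps, start)
    hi = bisect_left(timestamps, end)
    return (hi - lo) / (end - start)

# Synthetic request timestamps (seconds): ~100 req/s for 60s, then ~150 req/s.
timestamps = [i / 100 for i in range(6000)] + [60 + i / 150 for i in range(9000)]

baseline = rate_per_second(timestamps, 0, 60)
current = rate_per_second(timestamps, 60, 120)
print(f"baseline={baseline:.0f} req/s current={current:.0f} req/s "
      f"change={(current - baseline) / baseline:+.0%}")
```

If the jump in request rate lines up with the jump in latency, you have a strong candidate for the driver.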

Increase in the Type of Work Being Done

The next most common driver of increased latency is a shift in the type of work being requested of the system. There may or may not be a corresponding change in the overall rate of work. The first graph shows the “baseline” system taking requests at 100 req/s:

The next graph shows the type of work shifting:

In the graph above the overall rate of requests does not increase, but the distribution of the types of requests has shifted. The system is being asked to perform 30% more slow operations (compared to the first image).
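One way to spot this is to compare the share of each operation type between a baseline window and the current window; the operation names and durations below are hypothetical:

```python
from collections import Counter

def type_share(requests):
    """Fraction of requests by operation type."""
    counts = Counter(op for op, _ in requests)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}

# Hypothetical samples of (operation, duration_ms). The overall rate is the
# same in both windows, but the share of the slow operation grows.
baseline = [("fast_read", 5)] * 90 + [("slow_scan", 250)] * 10
current = [("fast_read", 5)] * 60 + [("slow_scan", 250)] * 40

for name, window in (("baseline", baseline), ("current", current)):
    share = type_share(window)
    print(name, {op: f"{pct:.0%}" for op, pct in share.items()})
```

A flat overall request rate with a growing share of expensive operations will raise aggregate latency even though “traffic” looks unchanged.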

Change in the Amount of Work Being Done in Each Transaction

If the rate of work hasn't increased and the type of work hasn't changed, it's likely that the amount of work being done per transaction has increased. This often manifests as larger payloads (which require more parsing and longer loops), more queries being performed, more data being scanned in the database, etc.

This is the most ambiguous case because many systems have poor coverage of payload size, database operations, and other metrics that help characterize the amount of work a transaction performs. It is also the most insidious: because these stats often aren't measured, the change flies under the radar.

As mentioned above, common causes are an increase in payload size (imagine parsing larger payloads) or an increase in the work being requested, e.g. more database queries are issued, an index changed, or more results are being pulled back.
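A small sketch of recording how much work each transaction performs, so this case stops flying under the radar; the handler, field names, and metrics below are hypothetical:

```python
import json

def handle_request(raw_payload: bytes, metrics: dict) -> None:
    """Hypothetical handler that records how much work a transaction does."""
    metrics["payload_bytes"] = len(raw_payload)

    payload = json.loads(raw_payload)
    metrics["items_processed"] = len(payload.get("items", []))

    # A real service would also count database queries issued, rows
    # scanned, cache hits/misses, etc. for each transaction.

metrics = {}
handle_request(b'{"items": [1, 2, 3, 4]}', metrics)
print(metrics)  # {'payload_bytes': 23, 'items_processed': 4}
```

Emitting these as per-request stats makes a shift in payload size or query count visible the same way a shift in request rate is.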

The graph below shows an example of this in practice:

The rate and type of work haven't changed, but operational latency has. At this point it's helpful to check the latency between the service and each of its dependencies in order to pinpoint which (if any) is the root of the latency. If none of the dependencies shows a corresponding increase in latency, the debug space has been partitioned and it's safe to assume that the latency originates somewhere in the service itself (probably for one of the reasons listed above).
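A minimal sketch of that dependency check, comparing per-dependency median latency between a baseline window and the incident window; the dependency names, numbers, and 1.5x threshold are all made up:

```python
from statistics import median

baseline_ms = {
    "postgres": [4, 5, 5, 6],
    "redis": [1, 1, 2, 1],
    "billing-api": [20, 22, 25, 21],
}
incident_ms = {
    "postgres": [5, 5, 6, 5],
    "redis": [1, 2, 1, 1],
    "billing-api": [22, 21, 24, 23],
}

for dep in baseline_ms:
    before, after = median(baseline_ms[dep]), median(incident_ms[dep])
    flag = "  <-- investigate" if after > before * 1.5 else ""
    print(f"{dep}: {before:.1f}ms -> {after:.1f}ms{flag}")

# If no dependency shows a corresponding increase, the extra latency is
# likely originating inside the service itself.
```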

Conclusion

Anecdotally, whenever I respond to latency issues these are the first things I check (in this order), and they almost always uncover the driver of the latency:

  • Is there an increase in the amount of work?
  • Is there a change in the type of work being done?
  • Has the amount of work being done per transaction changed?
