Deep dive API: a statistical approach to API performance

Bad times come when your APIs start to scale. How do you assess your weaknesses?

This article is all about scaling, and here is the story of your API over time. First come the Good times: people start using your API, and yeah… that’s awesome! It brings in new, sticky customers with little effort, and your revenue starts climbing as your adoption rate rises. As more people use your endpoints, you reach a point where you no longer really know how your clients have implemented them. Then people start telling you that your API is sometimes not as performant as it should be, and sometimes it even crashes your back-end. These are the Bad times.

When Bad times hit, your APIs deserve a performance audit.

“People are doing stuff with our API we never imagined.”

Data.

Is my API healthy? How well is it performing, and how do I identify my weaknesses? Am I compliant with my company’s SLA? We all know that the only true way to improve towards a goal is to carefully pick key metrics, then iteratively measure and tweak your system until the stated goals are met. In this article, we will adopt a statistical approach to API performance, based on the analysis of a massive amount of data: all the requests made on your endpoints over time. As a precondition, you must be able to easily retrieve the following data for each call:

  • http_request: gives you the HTTP method, URI and path used for each request. It also lets you correlate weak performance with bad usage of your endpoints.
  • http_status: gives you information on Time-Outs (TO). When an endpoint takes too long to answer, you may return an HTTP status 504 (Gateway Timeout). By analysing how TOs behave, you might be able to tweak and improve your system.
  • response_time (in milliseconds): the main input of our statistical analysis. You must gather the response time of every request over at least one business week. As this may represent a lot of data, make sure you use the right tools to retrieve and analyse it.
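
As a minimal sketch of what one such record could look like, the snippet below parses a log line into the three fields listed above. The line format ("METHOD URI STATUS MILLISECONDS"), the ApiCall type and the parse_call helper are all assumptions made for illustration; your gateway or access log will have its own layout.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ApiCall:
    method: str              # HTTP method, e.g. "GET"
    path: str                # URI path without the query string
    query: str               # raw query string (filters, sorting, ...)
    status: int              # HTTP status code, 504 marks a time-out
    response_time_ms: float  # response time in milliseconds


def parse_call(line: str) -> ApiCall:
    """Parse one line of the hypothetical 'METHOD URI STATUS MILLISECONDS' format."""
    method, uri, status, elapsed = line.split()
    parts = urlsplit(uri)
    return ApiCall(method, parts.path, parts.query, int(status), float(elapsed))


# Hypothetical sample lines, not real traffic.
sample_lines = [
    "GET /orders?sort=date 200 95",
    "GET /orders?sort=date&limit=5000 504 10000",
    "POST /orders 201 140",
]
calls = [parse_call(line) for line in sample_lines]
```

Keeping the path and the query string apart makes it easier, later on, to group calls by endpoint and still see which filtering or sorting options were used.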

The Statistical approach — Analyzing chaos

When you really start getting into scale, you lose track of how your API is being used: you handle millions of requests at the same time, and that brings a lot of complexity, along with the noise and chaos inherent in any system at scale.

Systems never behave precisely. Nobody in the world has ever built a system that constantly responds in exactly 100 milliseconds. It is always going to be a range, and the nice thing is that statistics gives you all the tools you need to assess this chaotic behaviour.

Normal distribution.

The normal distribution curve is one of them. In a normal curve, the tails at either end bend down toward zero in a very predictable fashion: with every standard deviation you move away from the middle, an event becomes exponentially less likely.
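
To get a feel for how quickly the tail of a normal curve thins out, here is a small sketch using Python's standard statistics.NormalDist. The tail_probability helper is a name introduced here for illustration; it returns the probability of landing more than k standard deviations above the mean.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard = NormalDist(mu=0.0, sigma=1.0)


def tail_probability(k: float) -> float:
    """Probability of landing more than k standard deviations above the mean."""
    return 1.0 - standard.cdf(k)


for k in (1, 2, 3, 4):
    print(f"more than {k} sigma above the mean: {tail_probability(k):.6f}")
```

Roughly 16% of events fall more than one standard deviation above the mean, about 2.3% fall more than two above it, and about 0.13% more than three.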

Applied to response time, that’s fantastic: with a hundred-millisecond baseline, you will get some responses that are faster and some that are slower. Taken as a benchmark, the normal distribution can be used to your advantage, as a trigger to ask: am I getting into trouble, and what changes do I need to make?

Credits @AlexVoyer_fisheye

Scuba-diver KPI 1: Your Average Response Time (ART)

The Average Response Time (ART) is visually represented by the peak of the normal distribution curve of your API. *insert schema here*

So ART is the first indicator you might want to retrieve, as it tells you about the general health of your endpoints. As a rule of thumb, your ART should not be higher than 2 seconds. If it is, then either your endpoints should be redesigned (maybe using an asynchronous process) or you have a serious performance issue.

You can then use ART as a Key Performance Indicator (KPI) and tweak your system to improve it continuously. The http_request may reveal filtering and sorting options that have a big impact on the response time of your API, so do not neglect granularity in your analysis.
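
A minimal sketch of this KPI, assuming the calls are available as (http_request, response_time) pairs: it computes the ART per distinct request and flags those above the 2-second rule of thumb. The sample rows, the art_by_request helper and the ART_THRESHOLD_MS constant are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample: (http_request, response_time in ms).
samples = [
    ("GET /orders", 90.0),
    ("GET /orders", 110.0),
    ("GET /orders?sort=date", 2400.0),
    ("GET /orders?sort=date", 2600.0),
    ("POST /orders", 150.0),
]

ART_THRESHOLD_MS = 2000.0  # the 2-second rule of thumb from the text


def art_by_request(rows):
    """Return the Average Response Time (ms) for each distinct http_request."""
    grouped = defaultdict(list)
    for request, elapsed_ms in rows:
        grouped[request].append(elapsed_ms)
    return {request: mean(times) for request, times in grouped.items()}


art = art_by_request(samples)
# Requests whose ART exceeds the threshold deserve a closer look.
too_slow = {req: value for req, value in art.items() if value > ART_THRESHOLD_MS}
```

Grouping on the full request, query string included, is what exposes a sorting option that is slow while the plain endpoint stays healthy.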

Scuba-diver KPI 2: Your Time-Out (TO) rate

At the very end of your distribution curve, you might find a peak which represents your Time-Out (TO) probability. If your API returns a TO error after 10 seconds, then your curve will show a probability peak at t = 10,000 ms, because every call that would have taken longer than 10 seconds ends at this precise moment instead. *insert schema here*
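
The peak can be illustrated with a tiny sketch: every call that would have run past the time-out is recorded at exactly the time-out value, so those calls pile up on a single point. The durations below are invented, and observed_time is a hypothetical helper.

```python
TIMEOUT_MS = 10_000  # assumed time-out of 10 seconds, as in the example above


def observed_time(true_duration_ms, timeout_ms=TIMEOUT_MS):
    """Response time seen by the client: the call is cut off at the time-out."""
    return min(true_duration_ms, timeout_ms)


# Hypothetical durations (ms) the back-end would have needed without a time-out.
durations = [120, 950, 9_800, 10_400, 12_000, 30_000]
observed = [observed_time(d) for d in durations]

# Number of calls stacked on the time-out value.
calls_at_timeout = observed.count(TIMEOUT_MS)
```

Here three very different durations (10.4 s, 12 s and 30 s) all end up recorded as 10,000 ms.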

The TO rate is also a good KPI of your system’s health, although time-outs should only happen under rare and exceptional conditions. If that is not the case, and the http_request analysis of your calls reveals a TO pattern, then either contact the clients misusing your APIs or limit that usage by design.

Analysing the TO rate (i.e. number of TOs vs. total number of hits) lets you check that you do not breach your Service Level Agreement (SLA). For example, with a typical SLA of 99.99%, your TO rate should be lower than 1 TO per 10,000 hits (99.90% => 1 TO per 1,000 hits).
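
The arithmetic above can be sketched as follows, assuming that a TO is reported as HTTP status 504 and that the SLA is read, as in this article, as the share of hits that must not time out. Both helper names and the sample traffic are hypothetical.

```python
def timeout_rate(statuses, timeout_status=504):
    """Share of calls that ended in a time-out (0.0 when there are no calls)."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s == timeout_status) / len(statuses)


def allowed_timeout_rate(sla_percent):
    """Largest TO rate compatible with an SLA expressed in percent."""
    return (100.0 - sla_percent) / 100.0


# Hypothetical week of traffic: 3 time-outs out of 20,000 hits.
statuses = [504] * 3 + [200] * 19_997
rate = timeout_rate(statuses)                # 3 / 20,000 = 0.00015
breach = rate > allowed_timeout_rate(99.99)  # allowed: about 1 per 10,000
```

With 3 TOs per 20,000 hits the rate is 1.5 per 10,000, so a 99.99% SLA would be breached while a 99.90% SLA would not.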

Scuba-diver KPI 3: Analysing the right tail of your normal curve in detail

Now that you know that your ART or your TO rate is too high for some endpoints, you might already have identified patterns to ban and design flaws in your API. But as you dive deeper into your performance audit, you might want more granularity: in a normal distribution, half of the calls respond slower than your ART without reaching the TO, and you would like to know why. In other words, you want to know what happens between ART and TO. That’s where comparing the observed probability of each response time with the Normal distribution curve comes into play.

So, we are analysing the right tail of your Normal distribution curve, from the ART point to the TO point. As explained earlier, the probability of calls being slower than expected should bend down toward zero in a very predictable fashion: with every standard deviation you move away from the ART, one of those slower responses becomes exponentially less likely.

From the graph in section 2 of this post, you can easily identify the points where your Response Time distribution underperforms a Normal distribution (your benchmark): the points where the red curve is above the blue one. Visually, this tool is really useful for detecting choke points where your system behaves slower than expected. By analysing the http_request of the calls made at those response times, you might be able to identify the behaviour under which your system does not perform well. Ultimately, you will be able to tweak your system and improve it under these very specific conditions.

The Normal distribution can be calculated from the mean and the standard deviation of your response times. Once you have calculated those values, it becomes easy to build the Normal distribution curve as a benchmark against your real data.
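
One possible way to build that benchmark, sketched with Python's standard statistics module: fit a normal curve on the mean and standard deviation of the completed calls, then walk the right tail in fixed-width buckets and keep those where more calls were observed than the curve predicts. The right_tail_excess helper, the bucket width and the sample data are assumptions, and so is the choice of leaving timed-out calls out of the fit.

```python
from statistics import NormalDist, mean, stdev


def right_tail_excess(times_ms, timeout_ms, bucket_ms):
    """Compare observed bucket frequencies with a normal benchmark between ART and TO.

    Returns (bucket_start, observed_share, expected_share) for every bucket
    where more calls were observed than the normal curve predicts.
    """
    completed = [t for t in times_ms if t < timeout_ms]  # leave the TO peak out
    art = mean(completed)
    benchmark = NormalDist(mu=art, sigma=stdev(completed))

    flagged = []
    start = art
    while start < timeout_ms:
        end = min(start + bucket_ms, timeout_ms)
        observed = sum(1 for t in completed if start <= t < end) / len(completed)
        expected = benchmark.cdf(end) - benchmark.cdf(start)
        if observed > expected:
            flagged.append((start, observed, expected))
        start = end
    return flagged


# Hypothetical data: most calls near 100 ms, a slow cluster at 400 ms, two TOs.
times = [80, 90, 95, 100, 100, 105, 110, 120] * 10 + [400] * 8 + [10_000] * 2
choke_points = right_tail_excess(times, timeout_ms=10_000, bucket_ms=100)
```

On this sample only the bucket containing the 400 ms cluster is flagged; the http_request of the calls falling in that bucket is where to look next.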

Conclusion.

At large scale, statistics turn out to be your best mate when running a performance audit on your APIs. But statistics can also be really effective in other circumstances: during development, for example, or when you run performance tests on your API, such as stress or load tests.