Back-end Performance: The Metrics We Should Care About
This article is translated from tmq.qq.com (Tencent Mobile Quality Center)
One dark night, an O2O taxi service provider, ‘X’, launched a new discount campaign online. The campaign was a bit too successful for the back-end servers, and the deluge of requests crashed the query suggestion service (QSS). After the incident, the dev and ops teams held a joint meeting and decided to investigate the capacity of the service thoroughly, ASAP!
To take on this task, we first have to answer the following questions: 1) for this particular type of service (QSS), what are the vital performance metrics? 2) what are their implications? and 3) what are the pass criteria for the performance test?
In this case study, we first discuss the performance metrics we care about in general. Then we benchmark QSS for real, and close the loop by providing a practical solution based on the analysis.
Like Hamlet, who appears differently to every reader, performance looks different to two groups of people: the API users (i.e., ‘X’, the O2O taxi service platform) and the resource owner (i.e., us). ‘X’ can only see external metrics such as throughput, latency and error rate. As the resource owner, we also have to take into account internal metrics and factors, including but not limited to CPU and memory usage, I/O and network activity, system load, server hardware and software configuration, server architecture, and service deployment strategy.
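To make the external view concrete, here is a minimal sketch of how the three metrics an API user can observe might be derived from a request log. The `Request` record, its field names, and the `external_metrics` helper are hypothetical and not taken from the QSS code base; the sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One logged request (hypothetical record layout)."""
    latency_ms: float  # time from request sent to response received
    ok: bool           # False if the request failed or timed out


def external_metrics(requests, window_s):
    """Compute throughput, mean latency and error rate over an observation window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "throughput_rps": total / window_s,  # requests per second
        "mean_latency_ms": sum(r.latency_ms for r in requests) / total,
        "error_rate": errors / total,        # fraction of failed requests
    }


# Invented sample: four requests observed over two seconds, one of them failed.
log = [Request(10, True), Request(20, True), Request(30, False), Request(40, True)]
metrics = external_metrics(log, window_s=2.0)
```

The internal metrics (CPU, memory, I/O and so on) would be collected on the server side and are deliberately left out of this sketch.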
In practice, the latency requirement should match the specific service type. In our case, it is <100ms, since QSS users expect real-time feedback after every single character they type. On the other hand, users of a map & navigation service would find 2~5 seconds acceptable for a route suggestion.
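One way to express such per-service requirements is a small table of latency budgets that a test can check against. The sketch below is only an illustration: the service names and the `within_budget` helper are hypothetical, and the budgets simply restate the two figures above (taking the upper end, 5 seconds, for route suggestions).

```python
# Hypothetical latency budgets in milliseconds, restating the figures in the text.
LATENCY_BUDGET_MS = {
    "query_suggestion": 100,    # QSS: feedback after every typed character
    "route_suggestion": 5000,   # map & navigation: 2~5 seconds is acceptable
}


def within_budget(service, observed_ms):
    """Return True if the observed latency stays under the budget for this service type."""
    return observed_ms < LATENCY_BUDGET_MS[service]


# The same 250 ms response is too slow for one service and fine for the other.
qss_ok = within_budget("query_suggestion", 250)
route_ok = within_budget("route_suggestion", 250)
```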
The mean value alone is not sufficient to reflect the performance of an application. We therefore normally collect the mean, the 90th and 99th percentiles (.90 and .99), and the distribution of the data (a sample is given in the following figures).
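The following sketch shows why the mean can mislead. It computes the mean, the 90th and 99th percentiles, and a coarse histogram from a list of latency samples. The nearest-rank percentile method, the 20 ms bucket width, and the sample data are assumptions made for illustration; they are not the measurements or tooling used in the original benchmark.

```python
import math
from collections import Counter


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, bucket_ms=20):
    """Return the mean, p90, p99 and a histogram keyed by bucket lower bound."""
    histogram = Counter((int(v) // bucket_ms) * bucket_ms for v in latencies_ms)
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p90": percentile(latencies_ms, 90),
        "p99": percentile(latencies_ms, 99),
        "histogram": dict(sorted(histogram.items())),
    }


# Invented samples: 98 fast requests (20 ms) and 2 very slow ones (900 ms).
samples = [20] * 98 + [900] * 2
stats = summarize(samples)
```

With these invented samples the mean is 37.6 ms, comfortably under a 100 ms target, while the 99th percentile is 900 ms. Only the percentiles and the histogram reveal that some users wait almost a second.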