We are going to explore a variety of aspects of the performance of software systems. This article is a checklist from which you can mix and match what you find useful in your reality — kind of like a book of ingredients and recipes.
We are not going to discuss performance from any single point of view such as the web or microservices or scaling. Google web vitals, optimizing Big O, bundle sizes, and messaging protocols are out of the scope of this article.
The focus is on foundational knowledge encompassing the entire system end-to-end. Examples will be given here and there, but keep in mind that each point could be a series of articles on its own.
At the end of the series, you will find a list of tools that could be used as starting points when evaluating performance.
Here we will establish what performance is and the main terms used to discuss it in the context of computer/software systems.
What is performance?
The textbook word definition is as follows:
Performance — accomplishment of a given task measured against preset known standards of accuracy, quality, completeness, cost, and speed.
Please note that speed is only one of the aspects used to measure performance. Performance improvement could be defined as:
- Measuring the output of a particular process or procedure
- then, modifying it to:
- improve accuracy
- improve quality
- increase the output
- increase efficiency
- reduce resource allocation
- reduce operational cost
Computer performance is the amount of useful work accomplished by a computer system.
Outside of specific contexts, computer performance is estimated in terms of:
- Speed of executing computer program instructions
When it comes to high computer performance, one or more of the following factors might be involved:
- Short response time for a given piece of work
- High throughput (rate of processing work)
- Low utilization of computing resource(s)
- High availability of the computing system or application
- Fast (or highly compact) data compression and decompression
- Efficient bandwidth usage
- Short data transmission time
To give a simple example with a web server: fast execution + short response time = short time to first byte.
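Time to first byte can be measured directly. Below is a minimal sketch using only Python's standard library; the hostname in the commented example is a placeholder, and note that a single measurement bundles connection setup, service time, wait time, and transmission time together, which is exactly what the user perceives:

```python
import http.client
import time

def measure_ttfb(host: str, path: str = "/", port: int = 443) -> float:
    """Return time to first byte, in seconds, for a single HTTPS request."""
    conn = http.client.HTTPSConnection(host, port, timeout=10)
    start = time.perf_counter()
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read(1)  # block until the first byte of the body arrives
    ttfb = time.perf_counter() - start
    conn.close()
    return ttfb

# Example (requires network access):
# print(f"TTFB: {measure_ttfb('example.com'):.3f}s")
```

Repeating the measurement many times and aggregating is what turns this one-off probe into a metric.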
Performance is an aspect of software quality.
What is measured?
- Response time
- Channel capacity
What is availability?
Availability is the probability that a system is operational at a given time, i.e., the proportion of time a service is operating out of the total time it should be operating. Another way to measure availability is the ratio of successful attempts to failures over a period of time.
High-availability systems may report availability in terms of minutes or hours of downtime per year. You mostly see availability expressed as a percentage, such as 99.9999%.
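To make the percentage form concrete, here is a minimal sketch of both ways of measuring availability mentioned above; the numbers are illustrative:

```python
def availability_from_downtime(downtime_minutes_per_year: float) -> float:
    """Availability as a percentage, given total yearly downtime."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return 100.0 * (1 - downtime_minutes_per_year / minutes_per_year)

def availability_from_requests(successes: int, failures: int) -> float:
    """Availability measured as the share of successful attempts."""
    total = successes + failures
    return 100.0 * successes / total if total else 100.0

# "Three nines" (99.9%) allows roughly 525 minutes of downtime per year:
print(round(availability_from_downtime(525.6), 1))     # 99.9
print(round(availability_from_requests(9990, 10), 1))  # 99.9
```

Note that the two definitions can disagree: a service can be "up" the whole year yet fail a meaningful share of requests.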
Availability should be measured based on business-valuable scenarios (the About Us page is probably less critical than the Checkout page):
- Store data
Reduced availability could be a problem with any shared resource, such as storage or a service. Your queries could be waiting in a queue, or, if the queue has overflowed, they could have been dropped.
Any component on the use-case path can cause a problem with the availability of the whole system. The most effective way to deal with this is to minimize the dependencies between components during the design of the system. Architecting the system to keep coupling and dependencies to the necessary minimum reduces the risk of a domino effect, where one failing component brings down the entire system.
Solutions to availability include but are not limited to:
- Adding a load balancer and more application servers
- Adding caching between server and services and/or storages
- Adding micro caching for specific requests on the LB level
- Buffering heavy inserts through an additional service and a fast-write DB
- Splitting the writing from reading flows
- Adding more of a scarce resource
Response time
This is the amount of time it takes to generate a response to a request. It could be an input/output operation such as reading cached assets from a disk, or hitting a web service that in turn hits tens of services sitting behind it. Usually we measure two main types of response times: average response time and peak response time.
Average response time — Average response time is a good metric to use to start gathering data on your performance tests. Tracking this allows you to see how your app fluctuates under more or less load. It gives you an idea of the average user experience and provides insight into regressions if something changes. Note that on a high-throughput system, spiking requests are going to be statistically hidden in the data. Measuring only averages is a hidden pitfall of monitoring.
Peak response time — Peak response time allows you to see the performance of the slowest requests, generally by taking the 90th percentile of all response times. This creates a different view than an overall average. With peak response times, you can find more specific queries that may be problematic and know the worst response times users are experiencing.
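A small sketch of both views over the same data, using Python's standard `statistics` module; the latency numbers are made up to include one slow outlier:

```python
import statistics

# Illustrative response times in milliseconds, with one slow outlier
latencies_ms = [12, 14, 15, 13, 16, 14, 15, 13, 14, 950]

mean_ms = statistics.mean(latencies_ms)
p90_ms = statistics.quantiles(latencies_ms, n=10)[-1]  # 90th percentile

print(f"mean = {mean_ms:.1f} ms")  # 107.6 — the single outlier inflates the average
print(f"p90  = {p90_ms:.1f} ms")   # the slow tail stands out clearly
```

Nine out of ten requests finished in under 16 ms, yet the mean suggests a ~100 ms service; the percentile exposes what the average hides.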
Response time is an aggregate
It has three main components:
- service time
- wait time
- transmission time
Service time
This is how long it takes to do the work requested. When our application server gets a request, this is the time it needs to understand the request, prepare the response, and start sending it to the requester. Let’s say 20 seconds of your transaction is spent extracting dynamic content out of a big data set because a developer used a default search method that is not optimized for the use-case. It takes too much time. Once you identify this as the source of the slowness, you should put a sprinkle of computer science on the problem and solve the service’s own time issues.
Wait time
This is how long a request must sit in the queue, waiting for the requests ahead of it to be processed, before it can run. What is the duration if we have 100, or 1000, or 1 million requests ahead of ours?
Even if processing 1 request takes only 1 millisecond, when you have 20,000 requests ahead of you, this means yours is going to wait for 20 seconds just to start processing.
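The arithmetic above generalizes into a simple back-of-the-envelope model. This is a naive FIFO estimate that ignores queueing-theory effects such as arrival variance, but it is useful for quick capacity intuition:

```python
def wait_time_ms(queue_length: int, service_time_ms: float, workers: int = 1) -> float:
    """Naive FIFO wait estimate: requests ahead of ours, each taking
    service_time_ms, spread across the available workers."""
    return queue_length * service_time_ms / workers

# 20,000 requests ahead, 1 ms each, single worker -> 20 seconds of waiting
print(wait_time_ms(20_000, 1.0))              # 20000.0 (ms)

# Ten parallel workers cut the wait to 2 seconds
print(wait_time_ms(20_000, 1.0, workers=10))  # 2000.0 (ms)
```

This also shows why adding instances (scaling out) reduces wait time even when per-request service time stays the same.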
Transmission time
This is the total of how long it takes to move the request to the computer doing the work and to return the response back to the requestor, not including service time. Transmission time is measured to the last downloaded byte of the response.
Note that response time in complex applications could include the aggregation of 100s of services. Keeping it low in a large system is generally a feat of engineering. Solutions to transmission time are varied based on the types of problem causing the delays.
Some of the transmission time could also be due to an issue with channel capacity.
Channel capacity
Channel capacity is the maximum rate at which information can be transmitted reliably over a communication channel. In areas like video streaming, depending on the available channel capacity, the streaming service can switch between a variety of pre-defined quality settings and continue playback at a lower resolution in a way that doesn’t hinder the user’s experience.
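For communication channels, the theoretical ceiling is given by the Shannon-Hartley theorem: capacity grows linearly with bandwidth but only logarithmically with signal-to-noise ratio. A quick sketch with illustrative numbers:

```python
import math

def shannon_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Shannon-Hartley theorem: C = B * log2(1 + S/N), the maximum
    error-free bit rate of a channel with the given bandwidth and SNR."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# A 20 MHz channel with an SNR of 1000 (30 dB):
capacity = shannon_capacity_bps(20e6, 1000)
print(f"{capacity / 1e6:.1f} Mbit/s")  # ~199.3 Mbit/s
```

The logarithm explains why doubling transmit power yields far less extra capacity than doubling bandwidth does.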
Your channel may not be the Internet. An easy-to-understand problem with channel capacity is the HDMI video standard.
The HDMI 2.1 standard defines channel capacity as 48 gigabits/sec. This could be:
- 1 screen of 8K 60 fps
- 1 screen of 4K 120 fps
- 1 screen of 4K HDR and 60+ fps
- 2 screens of 4K 60 fps
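You can sanity-check these figures with simple arithmetic: raw pixel bandwidth is width × height × frame rate × bits per pixel. The sketch below assumes 10-bit RGB (30 bits/pixel) and ignores blanking intervals and link encoding overhead, so real link requirements differ somewhat:

```python
def raw_video_bandwidth_gbps(width: int, height: int, fps: int,
                             bits_per_pixel: int = 30) -> float:
    """Raw (uncompressed) pixel bandwidth in Gbit/s. Ignores blanking
    intervals, encoding overhead, and chroma subsampling."""
    return width * height * fps * bits_per_pixel / 1e9

print(round(raw_video_bandwidth_gbps(3840, 2160, 60), 1))  # 14.9 — 4K 60 fps
print(round(raw_video_bandwidth_gbps(7680, 4320, 60), 1))  # 59.7 — 8K 60 fps
```

Note that raw 8K 60 fps lands near 59.7 Gbit/s, above the 48 Gbit/s link rate, which is why such modes typically rely on chroma subsampling or compression to fit.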
Like many video, audio, and data cables, HDMI cords can suffer from signal degradation at longer lengths — 15 meters is generally considered the maximum reliable length. It’s rare to see an HDMI cable longer than 5 meters in a store.
When you are designing software, you should consider the channel capacities that could be limiting. As an example, when designing software for mobile devices you should consider the channel capacity of the various mobile networks in the target market and understand the lowest band that you should support.
Channel capacity could also be a source of availability issues.
We could have issues with channel capacity in a variety of places such as:
- A third-party connection could be a bottleneck
- I/O is also a channel; the ability of our DB to write to drive/RAM is limited by the technology (moving from SATA SSD to PCIe NVMe, etc.)
- The maximum number of requests in the queue
- TCP/IP limitations and congestion control over the network, CWND and RWND
- Capacity of the ISPs between app server and client
- The data center’s internet connection could be a bottleneck; highly unlikely but in my career, “I’ve seen things you people wouldn’t believe!”
- Network topology
- Screen refresh rate
- Peak power consumption
Latency
When we are discussing the performance of an application, we need to be aware of a few varieties of latency.
Latency is the time between cause and effect, and is everywhere. Even this article is viewed with some latency right now, each time you scroll:
- The OS polls the mouse to see if something changed, every XXX Hz
- The scroll is being processed
- The event is sent to the active window (web browser in this case)
- The browser calculates the pixels to scroll and the direction
- Without going into browser implementation specifics, content is recalculated, rasterized, and painted at the refresh rate of the screen for the duration of the scroll
- Light emitted from the screen hits your eyes
The definition of latency depends on the system in question.
In a web-based service, network latency is a component of the response time related to the transmission time.
Depending on the type of latency, you can approach the problem the same way as availability or implement a better UX through research, finding where users suffer from it. This could take many forms, but one simple example is a loading indicator whenever the user does something that requires us to request data over the wire.
Components of Latency
There are many components that affect network latency.
- Transmission medium — the physical path between the start point and the end point. Copper cable, fiber optics, WiFi, etc.
- Propagation — the farther apart the endpoints, the more latency; even light travels at a finite speed (and slower in fiber than in a vacuum)
- Hops on the network — any active or passive equipment on the network increases latency. Most of the internet infrastructure has been there since the beginning, and many routers and hops you are going to encounter are 10+ years old. This brings many challenges to the advancement of the network protocols we are using
- Storage delays — accessing stored data can increase latency as the storage network may take time to process and return information
- Last mile delays — when talking about the internet, there is an ISP and their network between you and your client, and that segment is unpredictable
- UX — in terms of UX, the perceived latency from the user’s interactions is one of the most critical things to optimize to improve your customers’ satisfaction. In the case where you click the submit button of the login form to get logged in, the immediate visual response of the user interface could mitigate the perception of slowness
- Network latency is the time it takes for a data packet to travel from the sender to the receiver and back to the sender
- High latency can bottleneck a network, reducing its performance
- You can make your internet applications less latent by using a content delivery network (CDN) and a private network backbone to transfer data
In our example system, we have two big potentials for latency that we can’t control: the third party used by one of our services, and the client’s internet connection. Everything else is owned by us, so we can measure and improve it.
Some quick ways to offset the effect latency has on our end users are:
- On the client’s behalf, we can improve perceived latency by employing UX patterns for cause and effect, such as immediately reacting to clicks on calls to action rather than waiting for a response from our back-end. We could diminish the sensation of latency by streaming data as soon as it’s ready rather than waiting for all of it; however, that comes with its own complex set of caveats.
- CDNs and PoPs* could be utilized to bring content closer to the clients to reduce the time it takes to request resources from their origin
- An edge-computing approach could be a solution to make your client more in charge and reduce some of the need for communication with a back-end through its distributed nature. This reduces latency by housing applications, data, and compute resources at locations geographically closer to end users (even on the clients themselves).
*PoP is a demarcation point, access point, or physical location at which two or more networks or communication devices share a connection.
Throughput
Throughput is the rate of production or the rate at which something is processed.
You may think of throughput as digital bandwidth consumption: you may have 1 teraflop of computing power on a cloud CPU; whether that translates into 10, 10,000, or 1,000,000 requests depends on your application and how heavy it is on the CPU.
If you need extra bandwidth, you pay extra.
For example, let’s compare two systems with the same throughput:
- System A is sequential and can process 1000 requests, one after the other, in 1 second (1 request per millisecond)
- System B takes 1 second to process each of the same requests; however, it is able to process 1000 requests in parallel (1 request per second per worker).
While they both accomplish the same amount of work in the same amount of time, you can hardly call System B a performant system. System A uses a total of 1 second of CPU time, while the aggregation of System B’s parallel execution results in 1000 seconds of CPU time.
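The comparison can be spelled out in a few lines of arithmetic (assuming, for simplicity, that each request is fully CPU-bound):

```python
requests = 1000

# System A: sequential, 1 ms of CPU per request
system_a_cpu_seconds = requests * 0.001  # 1.0 s of CPU time

# System B: 1000 parallel workers, 1 s of CPU per request
system_b_cpu_seconds = requests * 1.0    # 1000.0 s of CPU time

# Both deliver 1000 requests per second of wall-clock throughput...
# ...but System B burns 1000x the compute to do it.
print(system_b_cpu_seconds / system_a_cpu_seconds)  # 1000.0
```

Equal throughput, wildly different efficiency, and in a cloud setting, a wildly different bill.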
Throughput is something that should be defined by the business and, as such, it can be measured in a variety of ways. Note that Cloud providers generally measure throughput relative to network bandwidth.
What could we establish as throughput for a system?
- For a web application or service, that could be the rate of requests/responses, or network traffic consumed, or CPU time that’s being used
- Error rates
- CPU/memory utilization
- Successful requests/responses
- Number of business transactions per unit of time
Controlling throughput is possible via scaling out, where we increase the number of instances of the scarce resource.
Latency vs Bandwidth vs Throughput
For an application that works over the internet or any network utilizing TCP/IP, there are a few important things to know about how latency, bandwidth, and throughput differ.
Latency is usually measured as a round-trip delay. A round trip is the time a packet (think of it as a car) takes to get from point A to point B and back.
Bandwidth is like a road with a strictly enforced speed limit. All the cars must travel at the same speed. Every participant in the network gets the same slice and priority to move about.
The only way to get more cars on the road is to make the road wider.
Data throughput is a practical measure of actual packet delivery, while bandwidth is a theoretical measure of packet delivery. Bandwidth can only be increased by a finite amount, as latency will eventually create a bottleneck and limit the amount of data that can be transferred over time. Throughput is often a more important indicator of network performance than bandwidth because it tells you whether your network is actually slow or just theoretically slow. An efficient network connection combines low latency and high bandwidth, allowing for maximum throughput. Packet loss, latency, and jitter all contribute to slow throughput.
Significant throughput gains can be had with proper use of compression and projections of the data you need to send. When transmitting data as JSON, simple Gzip compression could yield 75% savings, i.e., 1 MB shrinks to about 250 KB. With more aggressive compression settings and a proper structure of the data, the savings could be even greater. Note that compression comes at additional CPU cost on both sides: compression and decompression. While your servers may be powerful enough and scale with demand, the client devices might not have the necessary horsepower to deal with every kind of compression. Different protocols compress differently. Data serialization and deserialization costs should also be considered when choosing one for your application.
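A quick way to see the effect is to gzip a repetitive JSON payload with Python's standard library. The payload here is synthetic, and real savings depend heavily on how repetitive the data is:

```python
import gzip
import json

# A repetitive JSON payload, typical of list-style API responses
records = [{"id": i, "status": "active", "country": "BG"} for i in range(1000)]
payload = json.dumps(records).encode("utf-8")

compressed = gzip.compress(payload)  # default compression level

ratio = len(compressed) / len(payload)
print(f"{len(payload)} -> {len(compressed)} bytes ({ratio:.0%} of original)")

# Round-trip: decompression restores the exact bytes (at CPU cost on the client)
assert gzip.decompress(compressed) == payload
```

Trying different compression levels (`gzip.compress(payload, compresslevel=9)`) demonstrates the CPU-vs-size trade-off mentioned above.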
The pillars of observability
“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” — Wikipedia
In software systems, observability is the ability to collect telemetry data about program execution, internal states, and communication between components. If your service is not observable, it’s hardly possible to measure or troubleshoot its performance. Telemetry data is usually divided into three categories: metrics, traces, and logs.
Do I have a problem?
What is the difference between logs and metrics except where you send them?
While logs provide you with troubleshooting information for when you know you have a problem, metrics tell you if you actually do have a problem:
- How many requests does my service process per minute? Is it unusually low?
- How many of them are failing? Is the percentage of failures too high?
- How fast does my service respond? Is the processing duration high enough to cause a poor user experience?
“There’s one wolf in Alaska. How do you find it? First build a fence down the middle of the state, wait for the wolf to howl, and determine which side of the fence it is on. Repeat the process on that side only, until you get to the point where you can see the wolf.” source.
Services must report metrics not just about their own vitals but also for every external dependency they communicate with.
This way you can quickly find out which parts of the system misbehave, narrowing down the search many times over.
Metrics and detection
The most value from metrics comes when you configure automatic alerts based on them.
An alert is a pre-configured condition that is automatically triggered when the metric value exceeds certain thresholds.
Examples of good alerts:
- High request processing duration
- High % of failed requests
- High CPU usage
- An unusually low number of requests
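Conceptually, an alert is just a comparison evaluated against the latest metric values. A minimal sketch, where the metric names and thresholds are invented for illustration:

```python
def check_alerts(metrics: dict, thresholds: dict) -> list:
    """Return the names of metrics whose values cross their thresholds.
    Each threshold is (comparison, limit): 'above' fires when value > limit,
    'below' fires when value < limit (e.g. an unusually low request rate)."""
    fired = []
    for name, (comparison, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            continue
        if comparison == "above" and value > limit:
            fired.append(name)
        elif comparison == "below" and value < limit:
            fired.append(name)
    return fired

thresholds = {
    "p99_latency_ms":   ("above", 1000),
    "error_rate_pct":   ("above", 1.0),
    "cpu_usage_pct":    ("above", 90),
    "requests_per_min": ("below", 100),  # unusually low traffic
}

snapshot = {"p99_latency_ms": 1450, "error_rate_pct": 0.2,
            "cpu_usage_pct": 55, "requests_per_min": 20}
print(check_alerts(snapshot, thresholds))  # ['p99_latency_ms', 'requests_per_min']
```

Real monitoring systems add debouncing, alert severities, and notification routing on top of this core idea.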
Metrics and aggregation
Another difference between logs and metrics is that a metric is not a specific value but a trend representing general system behavior.
Therefore, you work with aggregated information. This is an additional layer of complexity coming with its own gotchas.
In essence, a metric is just a number you measure over time. It has to represent information about thousands of events. So what should this number or rather numbers be?
Aggregation to measure throughput
On the following graph, each individual request is represented by a green dot on the timeline. The yellow line represents the count of requests per minute. This is an example of the simplest aggregation.
Devil is in the details or why a simple average is not enough
Your service may process thousands of requests and some of them may be much slower than the rest. How can you describe their generalized speed? You could just use a simple average (mean) but it’s not enough. Two very different situations will produce the same average: for example, if all requests have a similar speed or if half of the requests are very slow, and the other half of the requests are very fast.
Percentiles come to the rescue. You can monitor the maximum latency for a certain percentage of the requests. For example, you may monitor the 99th percentile to see if it takes more than one second to trigger an alert. Such an alert still tolerates 1% of the requests to be slower, so you don’t get spammed with alerts caused by the nature of distributed networks. Depending on the requirements, you may raise the bar as high as the 99.99th percentile or as low as the 50th percentile.
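The mean-hiding effect is easy to demonstrate: two synthetic latency distributions with identical means look completely different at the 99th percentile:

```python
import statistics

# Two services with the same mean latency but very different behavior
steady  = [100] * 100             # every request takes ~100 ms
bimodal = [10] * 50 + [190] * 50  # half very fast, half very slow

assert statistics.mean(steady) == statistics.mean(bimodal) == 100

# The 99th percentile tells them apart immediately
def p99(xs):
    return statistics.quantiles(xs, n=100)[98]

print(p99(steady), p99(bimodal))  # 100.0 190.0
```

An average-based alert would treat both services as identical; a p99-based alert catches the slow half of the bimodal one.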
In the example below, the service was processing requests at a normal speed. At the 5th minute, around 1–10% of the requests began to slow down.
What we can see is that:
- A simple average (mean), median, and even 75th percentile cannot detect the problem.
- 99th percentile detects the problem as soon as it starts occurring.
- 95th percentile detects the problem as well, but a minute later, only after more requests start to be affected.
If you had only monitored averages, your users would tell you about the problem rather than your monitoring. This is why it’s vital to monitor both: averages and percentiles.
Often, raw data is aggregated multiple times across various dimensions. It can lead to unexpected results, so you should be aware of the common gotchas.
For example, your service tracks request latencies. Every 5 seconds, it aggregates and flushes them to a monitoring system. At the same time, visualization and alerting systems could aggregate this already pre-aggregated data again, this time using a different time interval.
After a couple of aggregations, mean is 205 milliseconds, while in reality, it’s 140 milliseconds.
After an aggregation on top of another aggregation, median is 780 milliseconds, while in reality, it’s 10 milliseconds. This is a huge difference.
Strictly speaking, the median for an even number of elements is the average of the middle pair. For simplicity, we just take the bigger value in this example.
Distributed systems mean distributed metrics
In distributed systems, your service may have a lot of instances that are running on different machines. This means you will have to aggregate metrics not just by the time interval but also across different instances.
Let’s take a look at the following example containing two instances:
- the first one processed 1000 requests with a mean latency of 100 milliseconds
- the second one processed only 10 requests but with a mean latency of 10 milliseconds
Suppose you want to visualize a mean request latency of all the requests no matter which instance processed them. We have a couple of options on how to calculate it:
- as an average of the averages: (100 + 10) / 2 = 55 milliseconds
- as a weighted average: (1000 * 100 + 10 * 10) / (1000 + 10) =~ 99 milliseconds
The second approach gives us the latency, which is much closer to reality because it considers the significance of each instance.
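The same calculation in code, showing both options side by side:

```python
# Two instances reporting pre-aggregated means
instances = [
    {"count": 1000, "mean_latency_ms": 100},
    {"count": 10,   "mean_latency_ms": 10},
]

# Naive: average of averages ignores how much traffic each instance saw
naive = sum(i["mean_latency_ms"] for i in instances) / len(instances)

# Weighted: weight each mean by its request count
weighted = (sum(i["count"] * i["mean_latency_ms"] for i in instances)
            / sum(i["count"] for i in instances))

print(naive)               # 55.0
print(round(weighted, 1))  # 99.1
```

This is why metric pipelines often ship counts (or sums and counts) alongside means: without the counts, a correct cross-instance aggregation is impossible.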
Don’t miss spikes
Metrics are aggregated over regular time intervals. If the aggregation time interval is too big, you can miss valuable details.
For example, you may decide to monitor the number of processed requests per minute. One day your service handled 1000 requests per minute without any issues, while the same number of requests overloaded it the next day. Why could this have happened? One of the reasons could be a sudden traffic spike. Handling roughly 16 requests per second uniformly, or handling 1000 requests in the first 10 seconds and then almost idling for the next 50 seconds, both mean 1000 requests per minute. However, the second case is a much heavier blow to the system.
Choosing an optimal aggregation time interval is a trade-off. Smaller time intervals may provide more insights but, at the same time, may require more computing and storage resources.
Don’t be afraid to dig deeper
Don’t be caught off-guard when the numbers on your graph seem odd. In order to trust your metrics, you need to understand the whole aggregation pipeline, starting from the service runtime to appearing on your screen.
Where is the problem?
Distributed tracing gives us the ability to understand how a request went through the system and how long it took. You can quickly detect what parts of the flow are the slowest.
Based on collected traces, you can also see a real-time dependency graph. It is an invaluable asset in complex systems consisting of hundreds of services.
Out of the box, tracing frameworks transparently track network calls, database queries, and other types of interactions. The number of such interactions is huge; recording every one of them would make tracing unviable resource-wise.
This is why traces are sampled. For example, if you record just 1% of the traces, you transfer and store a hundred times less data. At the same time, 1% of the traces is enough to build an accurate dependency graph or to identify the slowest components in the chain. The downside of sampling is that you cannot rely on tracing for troubleshooting particular occurrences of the problem, as they may not be recorded. For this purpose, you should use logging.
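One common way to implement such sampling is to hash the trace id, so every service in the call chain independently makes the same keep/drop decision for a given trace. This is a simplified sketch; real tracing frameworks typically propagate the sampling decision in request headers instead:

```python
import hashlib

def should_sample(trace_id: str, sample_rate_pct: float = 1.0) -> bool:
    """Deterministic head sampling: the same trace id always produces the
    same decision, so a trace is either recorded everywhere or nowhere."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sample_rate_pct * 100

kept = sum(should_sample(f"trace-{i}", sample_rate_pct=1.0)
           for i in range(100_000))
print(kept)  # close to 1000, i.e. ~1% of traces
```

Because the hash is uniform, the kept traces remain a statistically representative slice, which is what makes dependency graphs and latency breakdowns accurate despite sampling.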
A log is an immutable record of discrete events that happened at a specific point in time.
Logs could be in a variety of formats:
- Plain text → common logging format, usually human-readable.
- Structured → JSON
- Binary → DB, binlogs, systemd journal, etc.
Logs can be used for debugging and informational purposes.
What is causing the problem?
Once metrics signal to you that there is a problem, logs will help you pinpoint the exact root cause.
In the following screenshot, we can see that each request is logged, along with its status code, and other important details. If some of the requests fail, there would be stack traces and other helpful debugging information.
Observability and performance impact
Excessive logging, tracing, and nonoptimal metrics can have negative consequences.
Telemetry data must be serialized, and eventually sent over the network. It’s not free and may consume even more resources than the actual business logic.
- Logging — each log statement consumes resources. Reducing the log level to a minimum partially solves the problem. Yet your service could still be very busy executing each statement, and serializing the data, even if this data is discarded and not sent over the network. Look for techniques to reduce this overhead (.NET example — High-performance logging).
- Tracing — the performance impact pretty much depends on the sampling rate. If the sampling rate is high, the overhead is similar to logging every service network call, database query, etc.
- Metrics — their performance impact is different from logging and tracing. The number of times the metric statement is executed does not matter much. Under the hood, it’s a counter or a histogram that is regularly flushed in the background. What matters though, is how frequently you flush it, and the number of unique tags, also known as cardinality (DataDog example — Counting custom metrics). The more unique tags you have, the bigger the performance impact so try to keep it low.
If telemetry data is sent to the Cloud, you may blow through your entire budget. As for on-prem infrastructure, a lot of data could simply overwhelm it. Once the infrastructure can no longer handle the incoming data, you start losing observability not just for your service but for the entire system. Hence, it’s important to have the ability to rate-limit or cut off telemetry for certain services to save the entire system.
How to measure
The performance of a computer system is not constant. It depends on many variables and, once measured, it is accurate only in the context of those variables. Once the parameters change, we need to test again. This is a key thing to keep in mind. Blindly adding hardware may seem like it improves performance, but often it just exposes the next bottleneck.
Performance is about balance — your system is as slow as its slowest component.
Establish a baseline
If you can’t measure it, you can’t improve it… — P. Drucker
Establish a baseline. This baseline is going to change over time as your knowledge of the system and what/how to measure increases. The most important property of the results of any performance testing is repeatability! If you can’t reproduce the results they cannot be considered a baseline.
Start by asking the right questions! At a minimum we should establish:
- What is the test scenario, is it representative of an entire business flow, one or a few components?
- What is the performance test scope?
- Which systems and subsystems, interfaces, components, etc. are in and out of scope for this test?
- Whenever UI is involved, how many concurrent users are expected: peak vs. nominal?
- What does the target system hardware, and configuration look like? Servers, routers, load balancers, etc.
- What is the Application Workload Mix of each system component? (for example: 20% log-in, 40% search, 30% item select, 10% checkout).
- What is the System Workload Mix? For example: 20% in the morning, 70% in the afternoon, 10% in the evening.
- What are the time requirements for any/all back-end batch processes, again peak vs. nominal? Any Service Level Agreement (SLA)?
Define your methodology:
- Identify the Test Environment
- Identify Performance Acceptance Criteria
- Plan and Design Tests
- Implement the Test Design
- Execute the Test
- Analyze Results
- Isolation testing
- Load testing
- Stress testing
- Soak testing
- Spike testing
- Breakpoint testing
- Configuration testing
- Internet testing
In theory they sound similar, and that is because for the most part the methodology of execution is similar. However, the devil is in the details. Let’s examine each of them: how they are performed and what their purpose is.
Isolation testing is not unique to performance testing but involves repeating a test execution that resulted in a system problem. Such testing can often isolate and confirm the fault domain.
It is important for efficient troubleshooting. You need the ability to replicate a faulty behavior in isolation so that you don’t need to spin up hundreds of components to verify the results of fixes to the defective one.
In these kinds of tests, we can make sure we have improvement, or at least we don’t have degradation of the component’s performance. Such tests should be done regularly and be a part of the CI/CD process.
Load testing is the simplest form of performance testing and is usually conducted to understand the behavior of the system under a specific expected load. This load can be the expected concurrent number of users on the application performing a specific number of transactions within the set duration. The test is going to give out the response times of all the important business critical transactions. The database, application server, etc. are also monitored during the test, to assist in identifying bottlenecks in the application software and the hardware that the software is installed on.
The system is as performant as the least performing link on a path.
Loading one service up to its capacity could overwhelm another dependency. The point of the test is not to kill the system, but to confirm it performs up to spec. Load testing can indicate where we lack resources and where we are using more than we need.
Stress testing is normally used to understand the upper capacity limits within the system. The green line on the figure above represents the expected normal system load.
This kind of test is done to determine the system’s robustness in terms of extreme load and helps application administrators to determine if the system will perform sufficiently when the current load goes well above the expected maximum. The results of such testing are key to understanding mitigation strategies in real life situations:
- strategy for cutting traffic or rate limiting
- policies for identification of malicious traffic
- deciding at which point, and how many, new instances to create based on their spin-up time
Stress testing in isolation is also an option and could be used to verify a variety of KPIs based on pre-defined conditions.
Soak testing, also known as endurance testing, is usually done to determine if the system can sustain the continuous expected load.
During soak tests, memory utilization is monitored to detect potential leaks. Queues are monitored to see if requests/messages get dropped.
Also important, but often overlooked is performance degradation, i.e., to ensure that the throughput and/or response times after some long period of sustained activity are as good as or better than at the beginning of the test.
It essentially involves applying a significant load to a system for an extended period.
The goal is to discover how the system behaves under prolonged sustained use.
Spike testing is done by suddenly increasing or decreasing the load generated by a very large number of users and observing the behavior of the system.
The goal is to determine whether performance will suffer, whether the system will fail, or whether it will handle dramatic changes in load, and to observe how resources are utilized and then released.
Spike testing is interesting because it can be used to test multiple components of the infrastructure: load balancer, auto scaling, resource utilization bottlenecks (example: RAM is not enough for CPU to be fully loaded).
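A spike test needs a load profile rather than a constant rate. A minimal sketch of such a profile generator, where all the numbers are illustrative:

```python
def spike_profile(duration_s, baseline_rps, spike_rps, spike_start, spike_len):
    """Target requests-per-second for each second of the test:
    a steady baseline with a sudden spike in the middle."""
    profile = []
    for second in range(duration_s):
        in_spike = spike_start <= second < spike_start + spike_len
        profile.append(spike_rps if in_spike else baseline_rps)
    return profile

# 60-second test: 100 rps baseline, a 5-second burst of 2000 rps.
profile = spike_profile(duration_s=60, baseline_rps=100,
                        spike_rps=2000, spike_start=30, spike_len=5)
```

Feeding such a profile to a load generator lets you watch how the load balancer, auto scaling, and resource utilization react to the sudden change, and how quickly resources are released afterwards.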
Breakpoint testing is similar to stress testing. An incremental load is applied over time while the system is monitored for predetermined failure conditions.
Breakpoint testing is sometimes referred to as capacity testing, because it determines the maximum capacity below which the system will perform to its required specifications or Service Level Agreements.
The results of breakpoint analysis applied to a fixed environment can be used to determine the optimal scaling strategy in terms of required hardware or conditions that should trigger scaling-out events in a cloud environment.
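The search for a breaking point can be sketched as an incremental loop: raise the load step by step and stop at the predetermined failure condition. `fake_step` below is a hypothetical stand-in for actually driving the system and measuring its error rate.

```python
def find_breakpoint(run_step, start_load, step, max_load, error_budget=0.01):
    """Increase load incrementally until the failure condition
    (error rate above budget) is hit; return the last healthy load
    and the load at which the system broke."""
    last_ok = None
    load = start_load
    while load <= max_load:
        error_rate = run_step(load)  # drive the system at `load`, measure errors
        if error_rate > error_budget:
            return last_ok, load     # (capacity, breaking point)
        last_ok = load
        load += step
    return last_ok, None             # never broke within the tested range

# Hypothetical system that starts failing above 500 rps:
def fake_step(load):
    return 0.0 if load <= 500 else 0.05

capacity, breaking_point = find_breakpoint(fake_step,
                                           start_load=100, step=100,
                                           max_load=1000)
```

The `capacity` number feeds sizing decisions, while the behavior around `breaking_point` is what teaches you the symptoms and side effects discussed below.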
Examples of breaking points are:
- Increased load on the application server with login requests could potentially overload a shared component like a DB, or third-party KYC service
- Increased load and querying of dynamic content could overload some queues
- An increased amount of messages from Kafka or RabbitMQ could exhaust the resources we have to process them
The idea behind breakpoint testing is to reduce the surprise element of system failure. Furthermore, it allows us to learn the symptoms and side-effects of a failure scenario:
- What other systems are going to be affected and how?
- How does that failure cascade through the system? For example, users who are logged in will be fine, while users who attempt to log in won’t be able to do so
- Prepare and test a recovery plan for the system
There are various ways to do this kind of test, such as:
- Loading the system through its client and business flows
- Chaos testing
- Introducing load on specific components while executing business flows
At some point our system is expected to break and then recover. If the system is well designed, recovery will be automatic, and recovery time is yet another KPI to track.
A quick side by side of the different graphs to see how they compare visually:
Rather than testing for performance from a load perspective, tests are created to determine the effects of configuration changes to the system’s components on the system’s performance and behavior.
A common example would be experimenting with different methods of load-balancing:
- Round Robin
- Weighted Round Robin
- Least connection
- Weighted least connection
- Resource based (Adaptive/SDN adaptive)
- Fixed weighting
- Response time
- Source IP hash
- URL Hash
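To make the first two families of strategies concrete, here is a minimal sketch of round-robin and least-connection selection in Python. Real load balancers layer weights, health checks, and connection lifetimes on top of this.

```python
import itertools

class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring their load."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Hand out the server currently holding the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the connection to `server` closes.
        self.active[server] -= 1

rr = RoundRobin(["a", "b", "c"])
order = [rr.pick() for _ in range(6)]  # strict rotation: a, b, c, a, b, c

lc = LeastConnections(["a", "b"])
first, second = lc.pick(), lc.pick()   # spreads across both servers
```

Configuration testing would run the same load against each strategy and compare the resulting response-time distributions.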
If demand for a service is becoming too high an example system could be re-configured like this…
…with additional application servers behind a load balancer.
Configuration testing can help us find a good balance between the variety of services we have. This balance is sometimes hard to find. In the example above we are going to have multiple services talking to the same DB which could lead to a variety of problems.
This form of performance testing is when global applications such as Facebook, Google and Wikipedia, are performance tested from load generators that are placed on the actual target continent whether physical machines or cloud VMs. These tests usually require an immense amount of preparation and monitoring to be executed successfully.
Alternatively, so-called Real User Monitoring (RUM) could provide a lot of meaningful, albeit varied, data on the subject.
Let’s examine this situation. A system is hosted at the red dot, while our services are used in the countries all around.
Depending on where we want to grow our business, we need to identify how our SLAs hold in the markets we target.
Internet testing is expensive. Free tools will give you only vague and unreliable results.
Services like webpagetest and lighthouse are a good starting point, but their results should be taken with a grain of salt.
First of all, their servers are located somewhere specific, so there is a penalty for accessing your services or web pages.
Second, these tools cannot represent your actual users in terms of internet speed, hardware, frequency of access, etc.
Good internet testing means having access to a machine on location, where you can perform the tests.
A real-world example I’ve personally dealt with was a very interesting slowness between a specific ISP and GCP. DNS resolution was taking about 2 seconds, which made our services appear very slow to our customers.
RUM to the rescue! Real User Monitoring tools allow you to gather metrics on how your services load and how they are perceived by customers. While RUM is not going to sample 100% of your customers, you can still get a pretty good understanding of how your services are consumed and whether they stay within parameters.
RUM testing is not perfect. There are a lot of knobs that need to be adjusted until you get trustworthy results and establish a baseline for the variety of your users’ locations.
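A minimal sketch of what RUM aggregation looks like: group samples by location and derive a per-market baseline to compare SLAs against. The sample data below is invented for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical RUM samples: (country, page_load_ms) beacons from real users.
samples = [
    ("DE", 320), ("DE", 350), ("DE", 900),
    ("BR", 1200), ("BR", 1500), ("BR", 1100),
]

def per_location_baseline(samples):
    """Group samples by location and report the median load time:
    the per-market baseline you check your SLAs against."""
    by_loc = defaultdict(list)
    for loc, ms in samples:
        by_loc[loc].append(ms)
    return {loc: statistics.median(ms) for loc, ms in by_loc.items()}

baseline = per_location_baseline(samples)
```

With enough samples per location, the same grouping extended to p95/p99 tells you exactly which target markets are outside your SLAs.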
Performance sometimes hides in unexpected places.
If you focus only on code execution you are going to miss the big improvements that could be made to a system. There are cases where all services reply within SLAs of 50ms, yet a business flow utilizing dozens of services still takes a few seconds to execute. There are the individual services SLAs and then there are the business SLAs.
Sometimes you need to step back and see the system in context, from above. Your investigation should move from top to bottom. You should let go of the assumptions that:
- the network is reliable and secure
- latency is zero
- bandwidth is infinite
- topology doesn’t change
- transport cost is zero
- the network is homogeneous
But also try to be more open-minded. If you look at the system only as a programmer, your vision will be limited to for loops, SQL queries, and other implementation details.
Try to change your viewpoint!
Then, once a new understanding of the system is acquired, one can see the system in a context previously unknown but important:
- How are the customers using your software?
- How is it delivered to them?
When are the customers doing specific things, and in what sequence?
- Depositing and then checking their balance?
- Finishing a checkout and then going to their checkout history?
- Logging in and then immediately going to check some transaction status?
What is the context in terms of environment?
- ISPs slowness
- Access to PoPs
- Client devices specifics
- User demographics and their preferences: maybe the system was designed for mobile phones, but your users prefer desktops, or vice versa.
- How is it configured?
- What is the optimal configuration?
- How often is it verified that the current system requirements are met by the CDN?
Variety of cases
And many more system specific examples…
Only then, once you zoom back in, can this knowledge be reflected in the implementation details.
Testing the wrong things → testing for loops and branches in the code very rarely produces meaningful results (except when it does). These tests are needed, but they lean toward micro-optimizations and tweaking rather than massive improvements. Sometimes such optimizations (loop tiling, for example) require intimate knowledge of the hardware, which in a cloud environment is not guaranteed to be the same every time.
Expensive tests → while soak testing sounds great, how could you perform it? Does it mean your software is going to be inaccessible for the time being? Are you going to mirror your entire system to run a soak? Could you be soaking your entire system constantly? Tests should be designed to be cheap and fast to execute. While there is a place and time for more expensive and time-consuming tests, the majority of your tests should be easy to execute → even on a developer’s machine as a pre-push requirement.
Time To Market (TTM) → sometimes performance troubleshooting takes time, while the business needs a much faster reaction, so one should know all the aspects of performance and the variety of gains. Going back to the example with the slow ISP and DNS resolution times: it was expensive to set up the mirror and re-route the traffic, but cheap compared to the business that could be lost to faster competitors. Another example is deciding what type of testing should be in CI versus what could be safely and reliably tested in production. The best solution is not always the best solution! We should find the balance between cost, time, and gains.
Don’t forget about performance → once a good result is obtained it is very easy to stop worrying about performance. But it degrades over time, and it requires constant effort to keep up to spec. Maybe some hardware specific optimizations were lost when we moved to the cloud? Maybe over time the added features increased the load and the SLAs are no longer attainable?
Impossible standards → while your software grows its SLAs would have to adjust to meet the new requirements. If a transaction used to take 1ms and it triples in size and complexity this number could no longer be feasible or even achievable. Infinitely fast = infinitely expensive!
Performance hides in unexpected places → parsing speed vs downloading speed, searching vs indexing, data structures (array vs hash table vs map, etc.). Expand your field of view; it is very hard to improve performance if your understanding of the system is limited to only one aspect of it.
Expectation vs Reality → Expectations and assumptions are dangerous when testing a system. Expecting the user would do A flow and they end up never doing it but do the much more compute intensive B flow on a regular basis. One example of this is testing hardware versus end user hardware. While testing the software on the latest hardware is important to ensure it is future proof, it could turn out the majority of your demographic is using 5 year old mid-range phones to access your services.
Caching is not everything, but it helps, unless it doesn’t. Staleness of data can be hard to define, identify, and sometimes even debug. Cache invalidation is one of the classic hard problems of computer science. And dynamic memory is still much more expensive than storage.
Packing and distributing the content → packages of 14KB and 1KB are transferred at effectively the same speed, since both fit in TCP’s initial congestion window. But one sprite with 7 images is going to be transmitted faster than 7 separate 2KB images. Keep that in mind when bundling and splitting your content into packages.
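The 14KB figure comes from TCP’s initial congestion window. A rough back-of-the-envelope model of round trips under slow start, assuming an initial window of about 14 KB that doubles each round trip and ignoring per-request overhead:

```python
def round_trips(size_kb, init_window_kb=14):
    """Rough round-trip count to deliver `size_kb` under TCP slow start:
    the congestion window starts at ~14 KB and roughly doubles each RTT.
    This is a simplification that ignores loss, headers, and pacing."""
    window, trips, sent = init_window_kb, 0, 0
    while sent < size_kb:
        sent += window   # data delivered this round trip
        window *= 2      # slow start: window doubles
        trips += 1
    return trips

# A 14 KB sprite and a 1 KB file each cost one round trip; a larger
# payload pays extra round trips as the window grows.
```

This is why shaving a bundle from 15 KB to 14 KB can matter more than shaving it from 100 KB to 99 KB.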
Custom tools are of course the best*. You know your software best, and you know how to test it best. While they are expensive to develop in the first place, the payoff is eventually larger. People who are passionate about software should develop their own tools!
For a quick reaction, however, using one or more of the tools listed below could prove helpful and give you insight into what is going on:
- Webpage test
- New Relic
- Custom tools
* “Best” is always from a certain point of view. A tool could be best in terms of system visibility, but worst in terms of cost at the same time.
What to measure:
- Response time
- Channel capacity
How to measure:
- Gather metrics, logs, traces
- Establish a baseline → the current picture and starting point
- Define the numbers → where you want to go
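Turning raw measurements into a baseline and targets can be as simple as reducing response times to percentiles and comparing them against the numbers you defined. The latencies and target values below are invented for illustration.

```python
import statistics

def summarize(latencies_ms):
    """Reduce raw response times to the numbers worth tracking."""
    qs = statistics.quantiles(latencies_ms, n=100)
    return {"p50": statistics.median(latencies_ms),
            "p95": qs[94],
            "p99": qs[98]}

# Baseline: the current picture, measured from the running system.
baseline = summarize([12, 14, 15, 15, 16, 18, 20, 25, 40, 120] * 20)

# The numbers: where you want to go (hypothetical SLA targets, ms).
target = {"p50": 20, "p95": 50, "p99": 150}

within_sla = all(baseline[k] <= target[k] for k in target)
```

Note how a healthy median can coexist with a tail (p95/p99) far outside the target, which is exactly why averages alone are not enough.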
- Identify the Test Environment.
- Identify Performance Acceptance Criteria.
- Plan and Design Tests.
- Implement the Test Design.
- Execute the Test.
- Analyze Results.
Ask the right questions:
- In detail, what is the performance test scope? What subsystems, interfaces, components, etc. are in and out of scope for this test?
- For the user interfaces (UIs) involved, how many concurrent users are expected for each (specify peak vs. nominal)?
- What does the target system (hardware) look like (specify all server and network appliance configurations)?
- What is the Application Workload Mix of each system component? (for example: 20% log-in, 40% search, 30% item select, 10% checkout).
- What is the System Workload Mix? [Multiple workloads may be simulated in a single performance test] (for example: 30% Workload A, 20% Workload B, 50% Workload C).
- What are the time requirements for any/all back-end batch processes (specify peak vs. nominal)?
Some curated resources touching on a variety of the topics we covered: