Performance testing in the cloud

Marc van Esseveld
Published in NS-Techblog
Nov 2, 2023

At NS travel information, we are transitioning from server-based to cloud-based. We use AKS and CEAP, a public and a private cloud. Is performance testing still relevant here?

I am a DevOps engineer in Team Distribution. My experience is that chain performance testing in the cloud differs in important ways from performance testing against servers. This has led to improvements that I would now apply in any situation.

Starting with the why

But why would there be a difference between performance testing a physical server and a cloud infrastructure? Surely both are physical servers in the end, and a microservice runs just like a service on a physical server connected to the Internet?

Well, yes and no. The abstraction of the cloud and the larger number of connected systems give us a whole new landscape. Parts of the chain may have a speed mismatch, causing the entire chain to have a lower throughput than expected. The chain can also break because a single component is overloaded. This blog is therefore about putting load on a chain of applications and/or application (micro)services.

What else

In my experience, a microservice uses very little CPU and memory. Only when things go catastrophically wrong is there anything to see. Often it is no more than a CPU suddenly jumping to 30% and back to 5%, a RabbitMQ queue filling up, or memory not changing at all. This is very different from a server that is overloaded in CPU or memory under a large load. In the cloud, in a chain of microservices, component tuning plays a big role. Often the CPU, memory, disk I/O and bandwidth metrics of the components do not reveal which component or setting in the chain was the cause.

That’s why I started looking at the throughput time per message.

How

I generate load with my own performance microservice, based on the Java JMeter DSL, which I deploy to our cloud. This can be on the CEAP private cloud or the AKS public cloud. I control it by calling an API endpoint, telling it which service I want to generate load on:

POST https://url_performance-api-testomgeving-rabbitmq

{
  "duration": 1800,
  "rps": 10,
  "dataStream": "DAS",
  "exchange": "travelinformation.rit.topic",
  "routingkey": "travelinformationfactory.das"
}
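Internally, the performance service translates such a request into a JMeter DSL test plan. Below is a minimal sketch of what that could look like with jmeter-java-dsl; the thread cap, the HTTP sampler and the target URL are placeholders (our real service publishes messages to the given RabbitMQ exchange), so read it as an illustration of the idea rather than our exact implementation.

import static us.abstracta.jmeter.javadsl.JmeterDsl.*;

import java.time.Duration;
import us.abstracta.jmeter.javadsl.core.TestPlanStats;

public class LoadRunner {

  // Sketch: generate roughly 'rps' requests per second for 'durationSeconds' seconds,
  // mirroring the "duration" and "rps" fields of the API request above.
  public static TestPlanStats run(int durationSeconds, int rps) throws Exception {
    return testPlan(
        rpsThreadGroup()
            .maxThreads(50)                               // safety cap on concurrency (placeholder)
            .rampTo(rps, Duration.ofSeconds(10))          // ramp up to the requested rate
            .holdFor(Duration.ofSeconds(durationSeconds)) // hold that rate for the test duration
            .children(
                httpSampler("https://service-under-test") // placeholder; we publish to RabbitMQ instead
            )
    ).run();
  }
}

The duration and rps from the request body map directly onto the hold time and the target request rate of the thread group; the sampler is the part that is swapped out per data stream.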

At NS, the logging of travel information systems is in Elastic. I worked with developers from our team to log the time a message is received and the time it is sent by each microservice (a sketch of such a log line follows the chain overview below). Here I have, for example, the following chain:

1 INCOMING microservice
→ 2 message reference
→ 3 object storage with reference
→ 4 OUTGOING microservice retrieving the message based on the reference and sending it out →
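To make this measurable, every service in the chain writes a small, structured log line to Elastic at the moment a message comes in and goes out. A minimal sketch of what such a logging call could look like (the field names and the SLF4J setup are illustrative, not our exact log format):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MessageTimingLogger {

  private static final Logger LOG = LoggerFactory.getLogger(MessageTimingLogger.class);

  // Log a timing event that can later be correlated in Elastic via the message reference.
  // 'component' is e.g. "incoming-service", 'event' is "RECEIVED" or "SENT".
  public static void logEvent(String messageReference, String component, String event) {
    LOG.info("messageReference={} component={} event={} timestamp={}",
        messageReference, component, event, java.time.Instant.now());
  }
}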

By assigning attributes to messages in the message metadata, the performance application can retrieve all related log messages from Elastic. The application then reads the timestamp of each log message and calculates the processing time per part of the chain. The result is an overview of the lead times per component.
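The calculation itself is simple once the timestamps are grouped per message. A simplified sketch, assuming the timing events have already been fetched from Elastic (the record fields and the first-to-last calculation are illustrative; in practice the application also breaks the total down per chain step):

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LeadTimeCalculator {

  // One timing event as logged by a service and retrieved from Elastic (illustrative fields).
  public record TimingEvent(String messageReference, String component, String event, Instant timestamp) {}

  // Lead time per message: the time between the first event (received by the first
  // component) and the last event (sent by the last component) in the chain.
  public static Map<String, Duration> leadTimePerMessage(List<TimingEvent> events) {
    return events.stream()
        .collect(Collectors.groupingBy(TimingEvent::messageReference))
        .entrySet().stream()
        .collect(Collectors.toMap(
            Map.Entry::getKey,
            e -> {
              Instant first = e.getValue().stream().map(TimingEvent::timestamp).min(Instant::compareTo).orElseThrow();
              Instant last = e.getValue().stream().map(TimingEvent::timestamp).max(Instant::compareTo).orElseThrow();
              return Duration.between(first, last);
            }));
  }
}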

In this example, the cause of the overload in the chain turned out to be the storage manager on the object storage. It could handle large messages well, but not many small ones. In addition, the RabbitMQ configuration of the incoming service was set to auto-acknowledge. This pulled all messages straight from the queue into memory, and they could not be returned to the queue. When the service crashed or stopped, all of those messages were lost.
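The RabbitMQ side of the fix is to switch the consumer to manual acknowledgement with a bounded prefetch, so that unprocessed messages stay on the queue if the service dies. A minimal sketch with the plain Java AMQP client (the host, queue name and prefetch value are illustrative, not our actual configuration):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class ManualAckConsumer {

  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost("localhost"); // illustrative host

    Connection connection = factory.newConnection();
    Channel channel = connection.createChannel();

    // Limit how many unacknowledged messages are pushed to this consumer at once,
    // instead of pulling the whole queue into memory.
    channel.basicQos(50);

    DeliverCallback deliverCallback = (consumerTag, delivery) -> {
      try {
        // ... process the message ...
        channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
      } catch (Exception e) {
        // Put the message back on the queue so it is not lost when processing fails.
        channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
      }
    };

    // autoAck = false: a message is only removed from the queue after an explicit ack.
    channel.basicConsume("incoming-queue", false, deliverCallback, consumerTag -> { });
  }
}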

Results

After adjusting the RabbitMQ settings and changing the storage manager to Ceph, this was the result:

From the graph, it is easy to see when the service performs correctly. The lead times are almost the same from start to finish under equal load (a horizontal line). After peaks, the service quickly recovers to the same lead times. We now have a chain of microservices that we can reuse for new services, because we know the strengths and limits of our setup.

Cons?

Are yet to be found. In practice, even when testing an existing chain of servers, throughput proved a better metric than CPU, memory, disk I/O and bandwidth. With the new approach, we found a bottleneck in the design of the existing “enterprise service bus and pull_application” chain that had not been found in previous years.

The performance API pod takes resources from the shared cloud platform. However, experience shows that this is never enough to affect the performance results. In addition, a CPU and memory limit is set for the performance pod, which provides an extra safety net should things unexpectedly go wrong, for example through misuse.

Future

My original goal for performance testing in the cloud was to integrate it completely into the development process. Whenever changes go through the pipeline (code or configuration), the performance test would run as part of the deployment process. This is called Shift Left. However, this is not (yet) possible within the NS Azure pipelines, because NS endpoints cannot be called from the pipeline. Should this become possible in the test and acceptance environments (production is not required), then we can integrate performance testing into our pipeline and work fully automated. The modus operandi now is that we call the endpoints of the performance API over VPN, from the IDE or a tool like Postman. The results already save us a tremendous amount of time and are valuable to us as a team, but complete integration would save my team and me some more time every sprint and would make it one step easier for all team members to use.

Contact

I like to share what I have built with other teams within NS.

If you would like to see if the same approach and code can help your team, check out our repo and contact me!

Every Monday I sit with team Distribution on the 14th floor of the main building.

https://dev.azure.com/ns-topaas/EBT/_git/dri-performance-api
