Load Testing and Optimizing a Business-Critical Application

Lessons learned from the StembureauApp

Published in Elements blog, Jun 7, 2019

In this article I will explain the challenges we faced and solutions we implemented during the development and production release of a project we have been working on for about one and a half years: the StembureauApp (“PollingStationApp”). This is a software tool to support the voting process in Dutch polling stations on election days.

The Elements StembureauApp team doing a final load test in Rotterdam

The app helps with scanning votes, counting, filling in validation checklists and managing the presence of staff members. It also enables the control room to monitor the whole process, so the city hall knows exactly which step each polling station is at, at any given time. Together, these features make the process, logistics and communication a lot easier and faster for the municipalities.

You can read more about the functional details of the StembureauApp in Wouter’s blog post.

Currently, ten Dutch municipalities, including major cities such as Rotterdam, Eindhoven and Utrecht, are using the StembureauApp. Combined, these municipalities have a total of 1,500 polling stations and serve about two million Dutch voters (about 15% of all voters nationwide).

One important thing to emphasize is that on election day the application really needs to perform well. We cannot afford any performance issues, as they could disrupt the democratic process.

To keep this article from getting too long, I will focus mainly on performance testing and optimization techniques.

Let’s dive into the technical details. The back-end of the StembureauApp is written in Python/Django and the front-end in Angular. We used Docker to build microservices that run in a Kubernetes cluster. Kubernetes gave us two important advantages: resource allocation based on municipality size (the number of polling stations on an election day), and namespace isolation, which prevents data from leaking across municipalities.

Testing the performance

We used a number of tools to get valuable insights into the performance of the infrastructure and our backend. I will go into more detail on the most important ones:

Locust
Bombardier
Iperf3
Nginx
Kibana

Locust

We used the load testing tool Locust to simulate the different steps a polling station goes through on election day, such as opening, voting and counting. We defined different user behaviors to match what we expected to happen during an election day, covering both staff members and back office users. We also simulated users logging in and out during the day, and even generating heavy data exports in between.

Locust generates charts and statistics out of the box. We also logged everything in Kibana and extended the Nginx logs, so we could compare the results and determine whether an issue was in the backend, in the network or in the client.

Example of Locust charts:

Bombardier

We used the benchmarking tool Bombardier to measure how the application handles concurrency at several levels. It gave us more specific end-to-end insight into the number of HTTPS requests per second, the latency of the connections and the throughput.

Iperf3

To analyze the network performance, rather than the back-end performance, Iperf3 proved to be a very useful tool to provide us with information on pure network throughput. We were able to run Iperf3 at several layers in the network to validate each individual level.

Nginx logging

We enabled extended logging in Nginx, which contained the following metrics:

log_format json_combined escape=json
'{'
  '"time_local":"$time_local",'
  '"remote_addr":"$remote_addr",'
  '"remote_user":"$remote_user",'
  '"request":"$request",'
  '"status": "$status",'
  '"body_bytes_sent":"$body_bytes_sent",'
  '"request_time":"$request_time",'
  '"upstream_addr":"$upstream_addr",'
  '"upstream_connect_time":"$upstream_connect_time",'
  '"upstream_response_time":"$upstream_response_time",'
  '"http_referrer":"$http_referer",'
  '"http_user_agent":"$http_user_agent"'
'}';

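The request_time and upstream_response_time fields in particular were the most useful for identifying where a delay was coming from. request_time covers the whole request as seen by Nginx, including sending the response to the client, while upstream_response_time covers only the time spent receiving the response from the backend.

At this point you may wonder why we considered the client a potential bottleneck. The short explanation: the iPads running the application connected to a private network via a special SIM card, and this link had to be tested as well.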
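To make the comparison concrete, here is a minimal sketch of how the difference between the two timings could be computed from log lines in the json_combined format above. The function name, the one-second threshold and the sample lines (including the endpoint paths) are made up for illustration; in the project itself this analysis was done in Kibana.

```python
import json


def network_deltas(log_lines, min_delta=1.0):
    """Return (request, delta_seconds) for entries where nginx's total
    request_time exceeds upstream_response_time by at least min_delta."""
    slow = []
    for line in log_lines:
        entry = json.loads(line)
        try:
            total = float(entry["request_time"])
            upstream = float(entry["upstream_response_time"])
        except (KeyError, ValueError):
            # Skip entries without a single numeric upstream time, e.g. when
            # nginx lists several upstream attempts separated by commas.
            continue
        delta = total - upstream
        if delta >= min_delta:
            slow.append((entry["request"], round(delta, 3)))
    return slow


# Hypothetical sample lines that reuse the field names from json_combined.
sample = [
    '{"request": "POST /stembureauapp/rest/count/ HTTP/1.1", '
    '"request_time": "3.012", "upstream_response_time": "0.301"}',
    '{"request": "GET /stembureauapp/rest/status/ HTTP/1.1", '
    '"request_time": "0.120", "upstream_response_time": "0.118"}',
    '{"request": "GET /stembureauapp/rest/info/ HTTP/1.1", '
    '"request_time": "0.400", "upstream_response_time": "0.100, 0.200"}',
]

for request, delta in network_deltas(sample):
    print(f"{delta:.3f}s spent outside the backend: {request}")
```

A large difference means the time was spent outside the backend, which is the signal that separates a network or client problem from a slow endpoint.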

Kibana

We were logging a lot of data into Elasticsearch, so Kibana proved to be an invaluable tool for visualizing our metrics.

Example dashboard during one of our load tests:

Kibana also revealed the “deltas”, the time differences between the request time and the upstream response time:

Time differences between the request time and the upstream response time in Kibana

Looking at the last screenshot, the first two rows show quite a big difference between the request time and the upstream response time. Since the export endpoint is expensive, you might think this difference arises because the client had to download a large response payload. The delay could not have come from the hop between Nginx and Gunicorn, because there was no network delay there.

Now look at the two bottom rows. The back-end processed the POST request in about 0.3 seconds, but the user perceived the endpoint as slow, taking about 3 seconds. In this case the response was not big at all: it was a very small JSON payload.

This pointed us directly to the network layer between the client and the Kubernetes cluster (a bandwidth issue). When the bandwidth reached its limit, such deltas occurred much more often, and users perceived a general “slowness”.

With these specific metrics, we were able to detect whether a bottleneck originated in the network or in the backend itself. If the issue was in the backend, we could invest time in optimizing the most frequently used endpoints and test them over and over until the results were satisfactory.

Optimization techniques

Now that we had more metrics and more insight into potential bottlenecks, we could start optimizing the code and the configuration.

I will discuss the following techniques and tools we used to optimize and improve the application and make it more resilient:

Scaling
Django cache
Django Nginx cache
Endpoint isolation
Database replication
ORM profiling

Scaling

Autoscaling did not work as expected. In practice, this is a project where the load is predictable, and by the time scaling is required, it is already too late.

Spinning up a new pod takes about ten to twenty seconds, and in the meantime new requests are already queuing and causing more load. Since we had all the hardware allocated, we simply calculated what we needed at the peaks and scaled up beforehand.
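The article does not spell out how the peak capacity was calculated. One common back-of-the-envelope approach, shown here purely as an assumption, is Little's law: the number of busy workers equals the request rate multiplied by the average response time. The peak rate, response time and target utilization below are hypothetical numbers; the four workers per pod match the Gunicorn setup described further down.

```python
import math

WORKERS_PER_POD = 4  # one pod runs four Gunicorn workers


def pods_needed(peak_rps, avg_response_s, target_utilization=0.5):
    """Estimate the pod count for a predictable peak.

    busy workers = arrival rate * service time (Little's law), then
    leave headroom so workers are only busy target_utilization of the time.
    """
    busy_workers = peak_rps * avg_response_s
    workers = busy_workers / target_utilization
    return math.ceil(workers / WORKERS_PER_POD)


# Hypothetical peak: 200 requests/s at 0.25 s average response time.
print(pods_needed(peak_rps=200, avg_response_s=0.25))
```

The headroom factor matters more than the exact formula: with a predictable load it is cheaper to over-provision for one day than to let requests queue while new pods start.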

Django cache

Caching is a common topic for most projects, and of course we did apply it here as well, although we needed some tweaks, since the basic Django cache_page turned out to be insufficient.

Endpoints that could benefit from caching were wrapped with our own cache manager, which uses a semaphore (mutex) to allow only one request at a time to fill the cache. All other requests are simply served the previous data from Memcached. We did this to prevent multiple concurrent requests from doing the same work. During the load tests it helped us get rid of request queuing (response times growing over time).
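The cache manager itself is not shown in this article, so the following is only a sketch of the idea under stated assumptions: a plain dictionary stands in for Memcached, a threading.Lock plays the role of the mutex, and the class and method names are invented for illustration.

```python
import threading
import time


class SingleFillCache:
    """Let one caller refill a key while the others get the previous value."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._store = {}  # key -> (value, stored_at); stand-in for Memcached
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _is_fresh(self, entry):
        return entry is not None and time.monotonic() - entry[1] < self._ttl

    def get(self, key, compute):
        entry = self._store.get(key)
        if self._is_fresh(entry):
            return entry[0]
        lock = self._lock_for(key)
        # With previous data available we never wait for the lock; on a cold
        # cache there is nothing to serve yet, so we block until it is filled.
        if not lock.acquire(blocking=(entry is None)):
            return entry[0]  # someone else is refilling: serve previous data
        try:
            entry = self._store.get(key)  # may have been refilled meanwhile
            if not self._is_fresh(entry):
                entry = (compute(), time.monotonic())
                self._store[key] = entry
            return entry[0]
        finally:
            lock.release()


cache = SingleFillCache(ttl_seconds=30)
print(cache.get("status", lambda: {"open": 12}))  # computed once
print(cache.get("status", lambda: {"open": 99}))  # served from the cache
```

In a multi-pod deployment an in-process lock is not enough, since each pod has its own memory; the lock would then have to live in a shared store as well.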

Django Nginx cache

There was one specific GET endpoint that was called so often that it ended up occupying too many Gunicorn workers. The first, simple solution was to increase the number of pods to have more workers available (one pod runs four Gunicorn workers). However, there was a more efficient option: serving the endpoint from Nginx directly.

A cron pod updated the values in the background and stored them in Memcached. Nginx then obtained the response directly from Memcached instead of hitting the backend, which was much faster and reduced the load tremendously.

Of course there were some security considerations here, but we handled them gracefully.

StembureauApp in use in Groningen in November 2017

Endpoint isolation

Endpoint isolation is a really cool feature of Kubernetes. You can deploy the same pod but with a different label, for instance “backend-replica”. Then you can define a list of endpoints that hit this new service, while the rest go to the default backend pods.

The advantage of this is that an endpoint that could end up draining too many resources can be isolated, so it does not affect the important ones. Moreover, this feature combines perfectly with the next optimization (database replication).

This is how it looks in an Nginx configuration:

location ~ (/stembureauapp/rest/export/) {
    proxy_pass http://stembureau-backend-service-replica.elements.svc.cluster.local;
}

location /stembureauapp/rest/ {
    proxy_pass http://stembureau-backend-service.elements.svc.cluster.local;
}

Another advantage is that you can make this tweak without modifying a single line of code in the back-end. The pods behind both services are identical; the only difference is that specific endpoints are relayed to one or the other.

Database replication

Multiple databases can be used in almost any Django project, since you can just define a database router that sends write operations to the master and read operations to the replica.

However, this can be tricky, since the synchronization between master and replica may take some time. For some processes such a delay may not be acceptable, or even worse, it may cause race conditions.

With the previous improvement (endpoint isolation) you can configure only those isolated endpoints to use the replica database. So you isolate not only the pods for specific endpoints, but also the database!

Generating the exports in the application is perhaps the best example of this optimization. Exports are read-only operations that are not critical to the core functionality, but they may be very computationally expensive. You don’t want them to impact the core of the application, neither the backend nor the database.
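As an illustration, a router of this kind could look roughly like the sketch below. It uses the four methods Django calls on a database router; the class name and the "default" and "replica" aliases are assumptions and would have to match the DATABASES setting of the actual project.

```python
class ReadReplicaRouter:
    """Send reads to the replica and writes to the master.

    Assumes settings.DATABASES defines the aliases "default" (master)
    and "replica"; both names are hypothetical.
    """

    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases hold the same data, so relations are always fine.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Only migrate the master; the replica follows through replication.
        return db == "default"


router = ReadReplicaRouter()
print(router.db_for_read(None), router.db_for_write(None))
```

A router is activated by listing its dotted path in the DATABASE_ROUTERS setting, so it can be enabled per deployment without touching the views.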

ORM profiling

ORM profiling is a generic performance task for any project, but it is worth mentioning here anyway. Since the backend is an API built with the Django REST framework, we used Django Silk to inspect endpoints, see how much time they take and how many database queries they fire, and find ways to optimize them.

The majority of the optimizations came down to using prefetch_related or select_related to drastically reduce the number of SQL queries.
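To show what this saves at the SQL level, the sketch below deliberately uses plain sqlite3 instead of the Django ORM, so it runs on its own. The tables and data are invented for illustration. The first function follows the pattern an ORM produces when a foreign key is accessed inside a loop (one extra query per row); the second uses a single JOIN, which is the kind of query select_related generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE municipality (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE station (
        id INTEGER PRIMARY KEY,
        name TEXT,
        municipality_id INTEGER REFERENCES municipality(id)
    );
    INSERT INTO municipality VALUES (1, 'Rotterdam'), (2, 'Utrecht');
    INSERT INTO station VALUES
        (1, 'Station A', 1), (2, 'Station B', 1), (3, 'Station C', 2);
""")

executed = []  # every SQL statement sent through run()


def run(sql, params=()):
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


def stations_naive():
    """One query for the stations, then one more per station (N+1)."""
    result = []
    for name, municipality_id in run(
        "SELECT name, municipality_id FROM station ORDER BY id"
    ):
        (row,) = run(
            "SELECT name FROM municipality WHERE id = ?", (municipality_id,)
        )
        result.append((name, row[0]))
    return result


def stations_joined():
    """The same data fetched with a single JOIN."""
    return run(
        "SELECT s.name, m.name FROM station s "
        "JOIN municipality m ON m.id = s.municipality_id ORDER BY s.id"
    )


executed.clear()
naive_rows = stations_naive()
naive_queries = len(executed)

executed.clear()
joined_rows = stations_joined()
joined_queries = len(executed)

print(naive_queries, joined_queries, naive_rows == joined_rows)
```

With three stations the difference is small, but the naive variant grows by one query per row, which is exactly what a profiler like Silk makes visible per endpoint.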

That is mainly it!

EP19

On the day of the 2019 European Parliament elections, 23 May 2019, we saw the application run flawlessly across all municipalities. There were no errors and all endpoints responded faster than predicted.

It was a major success and we were all very happy with the results!