Building High Performance Applications — Lessons Learnt, Part 1

Vikas Sood
Webmagic
7 min read · May 12, 2018

Over the last couple of years I have been building highly scalable, high performance, fault tolerant systems: systems that are highly concurrent in nature and designed to handle a very large number of network requests per second. Along the way I learnt some lessons, and I want to share them with my fellow community members.

Some people reading this may disagree with some of the suggestions I am going to make, and some readers will suggest that they have a better way, which is absolutely fine with me. I am definitely not trying to be the voice of every Solution Architect or Application Architect. The suggestions I am going to make helped me not only in terms of performance, but also in ensuring that the system was easy to debug and, most importantly, extendable to new business requirements.

Let me take you a couple of years back in time, when the word “Server” used to be the alpha and the omega of your business application. You would install an HTTP server, code your business logic, and be done with it. Scale was not an important question.

But today, applications are designed for web scale, and the word “Server” sounds like a very weak term. I am not going to talk about mildly parallel applications such as the web browser you are using to multitask while reading this post; that kind of parallelism does not introduce many interesting challenges. The challenge I am talking about is the one where the infrastructure itself is the limiting factor in handling several thousand requests per second, where how you design your application really matters because you are operating on the very edge of what the infrastructure can do.

Of course, I cannot cover everything in one article, so you are going to have to be patient with me if you decide to stay and read the entire series. Please do not ask me how many parts I am going to write; the short answer is “I don’t know”. I can only tell you for now that I am going to cover the basics in this Part 1 and decompose some truly great architectures in future parts, including SEDA (Staged Event Driven Architecture) and Microservices, which should answer your questions if you have just started out, or are just about to begin your journey into developing a massively scalable application.

In order to obtain performance from your system, you first need to understand what kills performance. If you have heard someone say that “the best solution to a problem is to pretend there is no problem”, then you are in some deep s**t. Well, pretend for now that you haven’t heard that phrase and believe me that there is a problem, and we are going to fix it.

1. Do Not Keep Data Copies

I am sure every one of us has heard the term “zero copy”, either somewhere in your code comments or in the presentation slides of a marketing person. It’s a buzzword, people. Come on!!

OK. So data copies are bad and you should avoid them. They will be there all over your code, disguised and hidden. And guess what: by keeping data copies you are not only wasting expensive memory, you are also making room for more cache misses, degrading your system’s performance by requiring data to be fetched from main memory instead of the cache.

The solution is to start identifying where in your code you are keeping data copies. I would advise you to focus only on large objects such as your data blocks or container objects. A good technique is then to implement reference counting. Trust me, it may seem difficult initially, but it will save you a lot of pain. It ensures that instead of copying data, you are simply incrementing a reference count on the object’s buffer descriptor.

There are other techniques, such as keeping buffer pointers along with a length and an offset, however I would advise against them. They can quickly become buggy.
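To make the idea concrete, here is a minimal sketch of a reference-counted buffer descriptor in C++. The names (BufferDescriptor, buffer_acquire, buffer_release) are illustrative, not from any particular library: consumers bump the count instead of copying the payload, and the last release frees the single underlying copy.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical buffer descriptor: the payload is allocated once and shared;
// consumers increment the reference count instead of copying the bytes.
struct BufferDescriptor {
    uint8_t*         data;      // the single underlying copy of the data
    std::size_t      length;
    std::atomic<int> refcount;  // how many owners currently reference it
};

// Hand the same buffer to another component: no memcpy, just a counter bump.
inline BufferDescriptor* buffer_acquire(BufferDescriptor* buf) {
    buf->refcount.fetch_add(1, std::memory_order_relaxed);
    return buf;
}

// Release a reference; the last owner frees the underlying memory.
inline void buffer_release(BufferDescriptor* buf) {
    if (buf->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete[] buf->data;
        delete buf;
    }
}
```

Every consumer that would previously have copied the block now calls buffer_acquire and, when it is done, buffer_release; the data itself never moves.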

2. Minimize Context Switches

Now this is the real badass!! For those who do not understand what a context switch is: it is the very mechanism that enables multitasking; it allows multiple threads to share a given number of CPUs. Time to refer to your engineering books again? Well, read on.

As the number of threads increases, the CPU starts to spend more time switching from one thread to another than it actually spends inside any thread doing useful work. So the equation comes down to this: as the ratio of active threads to processors increases, so does the number of context switches.

The only approach that works here, if you want to create a high throughput system, is to limit the number of active threads to less than or equal to the number of processors. “Then why can’t we use just one thread? That would completely avoid context switching.” Because we also don’t want to stay underutilized, silly. You must also make yourself familiar with I/O multiplexing techniques, which allow one thread to handle multiple concurrent connections.
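As an illustration, here is a minimal sketch of I/O multiplexing with Linux epoll, assuming a Linux target and omitting error handling and non-blocking socket setup for brevity: one thread waits on many sockets and only touches the ones that are ready.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <array>

// One thread, many sockets: epoll tells us which connections are readable,
// so we never block on a single client and never need a thread per socket.
void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);

    epoll_event ev{};
    ev.events  = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    std::array<epoll_event, 64> events;
    for (;;) {
        int n = epoll_wait(epfd, events.data(), static_cast<int>(events.size()), -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                // New connection: register it with the same epoll instance.
                int client = accept(listen_fd, nullptr, nullptr);
                epoll_event cev{};
                cev.events  = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                // Socket is ready, so this read will not block the thread.
                char buf[4096];
                ssize_t got = read(fd, buf, sizeof buf);
                if (got <= 0) { close(fd); continue; }
                // ... hand 'buf' to the request handler ...
            }
        }
    }
}
```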

A good model for minimizing context switches emphasizes the use of non-blocking queues. It divides the entire system into a set of discrete stages, each with its own thread and its own queue, such that events in one queue are independent of those in another. A very popular adaptation of this model is called SEDA, the staged event driven architecture. I will cover it in detail in a future part of this series.
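Here is a rough sketch of one such stage, with its own queue and worker thread. For brevity I guard the queue with a mutex and condition variable, whereas a production SEDA-style system would typically use a non-blocking (lock-free) queue, and shutdown/join handling is omitted.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One stage of the pipeline: a private queue plus a single worker thread.
// Stages only talk to each other by enqueueing events, never by sharing state.
template <typename Event>
class Stage {
public:
    explicit Stage(std::function<void(Event)> handler)
        : handler_(std::move(handler)), worker_([this] { run(); }) {}

    void enqueue(Event e) {
        { std::lock_guard<std::mutex> lk(mu_); queue_.push(std::move(e)); }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(mu_);
            cv_.wait(lk, [this] { return !queue_.empty(); });
            Event e = std::move(queue_.front());
            queue_.pop();
            lk.unlock();
            handler_(std::move(e));  // useful work happens outside the lock
        }
    }

    std::function<void(Event)> handler_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<Event> queue_;
    std::thread worker_;  // declared last so the queue exists before it starts
};
```

A request then flows through the system as a chain of enqueue calls, one per stage, with each stage’s single thread pinned to its own work.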

3. Avoid Calling System Memory Allocator Repeatedly

Making a call to the system memory allocator takes a full round trip into the OS. Since allocating and freeing memory are among the most frequent operations an application performs, this is extremely inefficient: more and more time is spent inside the OS. There are a couple of suggestions here that I would like to make.

Preallocation
When we can be sure that there will never be more than N items in use at once, or when we know roughly how much memory a request handler needs per request, it makes sense to preallocate memory. After all, one round trip into the OS is better than several, even if it wastes a small amount of memory.
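Here is a sketch of what preallocation can look like, assuming we know the number of buffers and the per-request buffer size up front; the class and method names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Preallocate N fixed-size buffers in one shot; request handlers check them
// out and return them, so the steady state makes no calls into the allocator.
class RequestBufferPool {
public:
    RequestBufferPool(std::size_t count, std::size_t buffer_size)
        : storage_(count * buffer_size) {
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(storage_.data() + i * buffer_size);
    }

    char* acquire() {                 // returns nullptr if the pool is exhausted
        if (free_.empty()) return nullptr;
        char* p = free_.back();
        free_.pop_back();
        return p;
    }

    void release(char* p) { free_.push_back(p); }

private:
    std::vector<char>  storage_;      // one big allocation made at startup
    std::vector<char*> free_;         // buffers currently available
};
```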

Lookaside Lists
Certain objects in your program will be allocated and freed very frequently. The idea is to put recently freed objects onto a list instead of freeing them right away, anticipating that they will be needed again. If they are requested again, hand them out from the lookaside list instead of allocating them afresh.

Now, you will not want your lookaside list to grow infinitely, so you will need a garbage collector thread to free the unneeded objects from time to time. This additional thread inevitably introduces locking and contends with the other threads. The best scheme requires two lists, “new” and “old”.

Allocation is done preferentially from the “new” list, then from the “old” list, and from the system only as a last resort; objects are always freed onto the “new” list. The garbage collector works as follows:

  1. Lock both the “new” and “old” lists.
  2. Save the head of the “old” list.
  3. Make the (previously) “new” list into the “old” list by assigning the list heads.
  4. Unlock.
  5. Free everything on the saved “old” list.

The benefit is that the garbage collector holds the lock only for the pointer swap, so it barely contends with the other threads, saving the overhead you would otherwise expect from the additional thread. The best approach, however, is still to keep a separate lookaside list for each thread. A sketch of the two-list scheme follows.
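In this sketch, the object type, the use of vectors as the lists, and the single mutex guarding both lists are simplifications of mine, but the collect() steps mirror the algorithm above.

```cpp
#include <mutex>
#include <vector>

// Lookaside list with "new" and "old" lists. Recently freed objects go onto
// the "new" list; allocation prefers "new", then "old", then the system.
// The collector swaps the lists under the lock and frees the saved batch
// after unlocking, so the lock is held only for the pointer swap.
struct Object { /* frequently allocated and freed payload */ };

class LookasideList {
public:
    Object* allocate() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (!new_list_.empty()) { Object* o = new_list_.back(); new_list_.pop_back(); return o; }
            if (!old_list_.empty()) { Object* o = old_list_.back(); old_list_.pop_back(); return o; }
        }
        return new Object();                       // last resort: the system allocator
    }

    void free(Object* o) {
        std::lock_guard<std::mutex> lk(mu_);
        new_list_.push_back(o);                    // always freed onto the "new" list
    }

    // The garbage collector thread calls this periodically.
    void collect() {
        std::vector<Object*> doomed;
        {
            std::lock_guard<std::mutex> lk(mu_);   // 1. lock both lists
            doomed.swap(old_list_);                // 2. save the "old" list
            old_list_.swap(new_list_);             // 3. "new" becomes "old"
        }                                          // 4. unlock
        for (Object* o : doomed) delete o;         // 5. free the saved list, lock not held
    }

private:
    std::mutex mu_;
    std::vector<Object*> new_list_;
    std::vector<Object*> old_list_;
};
```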

4. Minimize Lock Contention

For those who do not understand thread contention: it is simply when two threads try to access the same resource in such a way that at least one of the contending threads runs more slowly than it would if the other thread(s) were not running.

The most obvious place for thread contention is on a lock, which is why I refer to it as lock contention. This condition occurs because the contending threads are trying to access the same resource. If there were no lock and the threads modified the shared resource simultaneously, it would probably result in a horrific race condition and memory corruption, with the process eventually getting terminated.

Here’s the ugly truth: there is no scheme that makes this simple. The only way to minimize lock contention, or even reduce it to zero, is to ensure that there is never a situation where one or more threads are racing for a shared resource. The first thing to do is to make sure you are not using locks too liberally: check whether the data and the corresponding operation are already guaranteed to be atomic on your target platform. Second, adopt the model I mentioned earlier: divide your problem into a set of discrete stages such that each stage has its own thread and queue. That way each thread works on its own data set without ever needing to contend for locks on shared resources.
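To illustrate the “check whether the operation is atomic” point, here is a small comparison of a mutex-protected counter and an atomic counter. Whether the atomic version is actually lock-free depends on the target platform, which is worth verifying with is_lock_free().

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// A shared request counter, two ways. Under heavy thread counts the mutex
// version serializes every increment and becomes a contention point; the
// atomic version lets the hardware do the update without taking a lock
// (assuming std::atomic<uint64_t> is lock-free on the target platform).
struct LockedCounter {
    std::mutex mu;
    uint64_t   value = 0;
    void increment() { std::lock_guard<std::mutex> lk(mu); ++value; }
};

struct AtomicCounter {
    std::atomic<uint64_t> value{0};
    void increment() { value.fetch_add(1, std::memory_order_relaxed); }
};
```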

Conclusion

There may be other issues specific to your server architecture, but these are the most notorious performance bottlenecks, and the ones I chose to cover first.

Feel free to connect with me on LinkedIn for any specific issues you are stuck with while designing your server architecture. I’ll be happy to help. Connect with me @ https://webmagic.co.in/


Vikas Sood
Webmagic

15+ years in the IT industry; currently Architect and CTO at PushFYI Inc.