Threads vs lightweight threads

Hielke de Vries
6 min read · Sep 27, 2024


Our computers run many different programs at the same time. If we open a process monitoring tool, we'll see tens or hundreds of programs running. For a program's code to run, it needs to be executed on the CPU. However, code does not execute directly on the CPU; it needs a thread to do that.

A thread is the entity that carries the code that needs to be executed. To visualize, we can say that a program’s code runs on a thread, and a thread runs on a CPU:

Threads exist in any modern operating system, such as Linux and Windows. A thread is an abstraction layer between the CPU and the program you want to run. Before a program’s code is executed, it is assigned a thread. The thread contains information about exactly which part of the program’s code is running: for instance, the memory address of the instruction that is currently executing, the registers, and the call stack.

Scheduling threads

A program can start many threads, executing different parts of the same program. The whole magic of threads is that they can be stopped and later resumed. When a thread is stopped, all of its state, such as the registers and the call stack, is stored in memory. When it resumes, this state is loaded from memory so the thread can continue executing.
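
As a minimal sketch, here is a Kotlin program (Kotlin, since we'll get to Kotlin Coroutines later) that starts two threads, each executing a different part of the program:

```kotlin
import kotlin.concurrent.thread

fun main() {
    // Start two OS threads running different parts of the same program.
    val worker1 = thread { println("worker 1 on ${Thread.currentThread().name}") }
    val worker2 = thread { println("worker 2 on ${Thread.currentThread().name}") }

    // Wait for both threads to finish before the program exits.
    worker1.join()
    worker2.join()
}
```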

Modern computers can keep hundreds or thousands of threads in memory while only a few of them actually execute on the CPU. An operating system component called the scheduler makes sure that all threads get a fair amount of time to run on the CPU. A thread needs about 1MB of memory. Switching threads on and off the CPU is called context switching. Context switching costs consist of the actual switch, which takes some CPU time but is generally fast, and the cost of cache invalidation. Caches play a big part in the overall performance of running code. When a thread is running code, the caches fill up to speed up that code. A context switch invalidates this, and the new thread starts with a cold cache.

A scheduler decides which threads get execution time on the CPU

When you use your computer normally, it has hundreds or thousands of threads that are either running or sitting in memory (not running). With regular use, the overhead of context switching these threads is not that high. A back-of-the-envelope (and not entirely correct) calculation:

  • 1000 threads: 1MB * 1000 = 1GB of memory
  • context-switch CPU time: 10µs * 1000 switches = 10ms per second -> ~1% of CPU time
  • cache invalidation costs (very hard to measure, but for the sake of the argument): 10µs * 1000 = 10ms per second -> ~1%

For an average laptop with enough RAM, a fast SSD and a fast multi-core CPU, context switching costs should not be a problem.

Lots of threads

Context switching costs do become a problem when we are dealing with a lot more threads. Highly concurrent software, such as a web server, might create a new thread for each incoming request. When this web server has to handle, say, tens of thousands of requests per second, the amount of memory and CPU time needed can become so high that the server becomes unresponsive or crashes.
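
As a sketch of this thread-per-request model (the port and response here are arbitrary):

```kotlin
import java.net.ServerSocket
import kotlin.concurrent.thread

// Thread-per-request: every incoming connection gets its own OS thread.
// At tens of thousands of requests per second, the ~1MB per thread and
// the context-switching overhead add up quickly.
fun main() {
    val server = ServerSocket(8080)
    while (true) {
        val connection = server.accept() // blocks until a client connects
        thread {
            connection.use { socket ->
                socket.getOutputStream()
                    .write("HTTP/1.1 200 OK\r\n\r\nhello\r\n".toByteArray())
            }
        }
    }
}
```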

Lightweight threads

Using a different model of execution, lightweight threads (LTs) can help overcome the problem of highly concurrent execution. Lightweight threads take much less memory and have lower context-switching costs. Let’s have a look at why.

lightweight threads have many other names, such as fibers, coroutines, green threads or goroutines

An LT is not an abstraction between the CPU and a thread, but is generally executed on a thread. We can visualize it like this:

As we can see, LTs are generally¹ constructs that are defined in the program’s code and compiled into the program. With LTs, instead of letting the operating system deal with concurrency, the program’s code itself handles concurrency. This means we have to explicitly write LT-compatible code in our program. Whereas threads are scheduled (preempted) by the operating system’s scheduler, LTs can have built-in suspension or yield points where they give up execution. When such a suspension point is reached, the next LT in the queue gets execution time. This means that LTs are themselves responsible for their own scheduling. This game of suspending and handing execution to the next LT is often referred to as cooperative scheduling. LTs are generally executed on a single thread, waiting in a queue to be picked up for execution.

Lightweight thread scheduling works by having each LT wait in line until the LT ahead of it suspends execution
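
Here is a minimal sketch of this cooperative scheduling with Kotlin Coroutines (assuming the kotlinx.coroutines library is on the classpath). Each yield() is a suspension point where the coroutine hands execution to the next LT in the queue:

```kotlin
import kotlinx.coroutines.*

fun main() = runBlocking {
    // Two coroutines cooperatively sharing a single thread.
    launch {
        repeat(3) { i ->
            println("coroutine A, step $i")
            yield() // suspension point: hand execution to the next LT in the queue
        }
    }
    launch {
        repeat(3) { i ->
            println("coroutine B, step $i")
            yield()
        }
    }
}
```

The output interleaves A and B even though everything runs on one thread, because each coroutine voluntarily gives up execution at its yield() points.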

Suspension points in an LT are only needed if it actually has to suspend. This is often the case when code is waiting for IO. For instance, when we have code that executes an HTTP request, we will see a suspension point right after it. When the response has arrived, the LT continues to execute the code right after this suspension point. This is how non-blocking IO works. Because LTs are responsible for their own scheduling, when an LT runs long-running code and never suspends, the other LTs in the queue do not get any execution time. This is why LTs mostly offer a performance increase for IO-bound computation and not for CPU-bound computation.
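
A sketch of this with Kotlin Coroutines; delay() stands in here for a real non-blocking HTTP call (a client like Ktor would suspend in the same way while waiting for the response):

```kotlin
import kotlinx.coroutines.*

suspend fun fetchUser(id: Int): String {
    delay(100) // suspension point: the thread is free to run other LTs
    return "user-$id" // execution resumes here once the "response" arrives
}

fun main() = runBlocking {
    // All three "requests" wait concurrently on one thread: total time is
    // roughly 100ms rather than 300ms, because suspended coroutines
    // don't block the thread.
    val users = (1..3).map { id -> async { fetchUser(id) } }.awaitAll()
    println(users)
}
```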

LTs are smaller because they don’t keep a full set of registers, a full call stack and other state information in memory. However, an LT does need some way to keep track of which suspension point it was at the last time it suspended. For instance, Kotlin Coroutines compile suspension point information into the bytecode. This is why LTs use far less memory than the ~1MB that threads use; Kotlin Coroutines take up only about 600 bytes.
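
As a rough conceptual sketch (not the exact code the Kotlin compiler generates), a coroutine with one suspension point can be lowered into a small state machine: a label plus the live local variables, instead of a full call stack:

```kotlin
// Conceptual sketch: the coroutine's "call stack" shrinks to one label
// plus the locals that must survive suspension, which is why a suspended
// LT can fit in a few hundred bytes.
class FetchStateMachine {
    var label = 0   // which suspension point we are at
    var result = "" // the only local that must survive suspension

    // Returns true when finished, false when suspended.
    fun resume(response: String?): Boolean = when (label) {
        0 -> { label = 1; startRequest(); false } // suspend: wait for IO
        1 -> { result = response ?: ""; true }    // resume after the response
        else -> true
    }

    private fun startRequest() { /* kick off non-blocking IO */ }
}
```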

LTs get their name from the fact that they need less memory and CPU time, which allows for higher and mostly faster concurrent execution. In return, they have to manage their own execution scheduling. If you run a simple application that does not have to deal with high concurrency, letting threads handle concurrency is the easiest way. If you need high performance in highly concurrent systems that are heavily IO-bound, it’s likely beneficial to use LTs. Many web applications and frameworks already have support for LTs by default.
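
A classic way to see the memory difference is to launch far more LTs than would ever be feasible with OS threads; a sketch with Kotlin Coroutines:

```kotlin
import kotlinx.coroutines.*

fun main() = runBlocking {
    // 100,000 coroutines at ~600 bytes each is roughly 60MB. 100,000 OS
    // threads at ~1MB of stack each would need on the order of 100GB.
    val jobs = List(100_000) {
        launch {
            delay(1000) // each coroutine "waits on IO" without holding a thread
        }
    }
    jobs.forEach { it.join() }
    println("all 100,000 coroutines completed")
}
```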

In summary

Generally, LTs are more performant because they don’t suffer from the expensive context-switching costs that regular threads have. Context switching costs consist of:

  • memory to store thread state information
  • CPU time for actually switching threads
  • cache invalidation when the cache needs to be refilled for the new thread

If your application is highly concurrent and IO-heavy, it’s very likely you will see a performance boost when you use LTs in your code. If your application is not highly concurrent, it might not be worth it to add the extra code needed to enable support for LTs.

Footnotes

[1]: We say generally because this is not always the case and can depend on the runtime used. For instance, Java Virtual Threads are implemented in the JVM.

Thanks to Jeroen, Julien and JM for reviewing this text.
