I developed a small game using WebSocket: omokgame.com. WebSocket is a bidirectional communication protocol on the web. Being bidirectional, it is stateful, unlike HTTP, which is stateless. When my project was almost finished, I wondered how many WebSocket connections a single server could handle. Then I found this YouTube video.
The video introduces a way to optimize WebSocket connections on a single server instance. It demonstrates this by opening over one million WebSocket connections to one server. I will summarize the video briefly:
- Raise ulimit so the server can hold more file descriptors
- Use epoll to optimize I/O multiplexing
- Allocate buffers only to the file descriptors that the epoll instance reports as ready for I/O
Using the methods above, the video reduces memory usage by almost 90%. People clapped 👏 But I couldn’t, because I couldn’t understand how the methods work, especially epoll. What is epoll anyway? I had never heard of it before. To me, epoll seemed like magic.
When I searched for what epoll is, I found keywords like select, poll, kqueue, and I/O multiplexing. On top of those: async, sync, non-blocking, and blocking I/O. Honestly, I was overwhelmed.
But I believe we can at least understand the concept of epoll and what problem it solves. So let’s talk about I/O multiplexing first.
What is I/O multiplexing? If you search for multiplexing, you can easily find many images like the one below.
Multiplexing is about handling multiple things through a single channel. For example, a server instance usually serves numerous connected clients. Imagine how many web requests one medium-sized server handles.
The above is a visualization of how clients and a server communicate. Usually, one server instance handles multiple client requests, as shown above. Why? Because it is far more efficient than running one server instance per client.
I said that a server instance handles multiple requests. The how is what we want to cover first, and that is I/O multiplexing.
A server connection in the Linux operating system is a file. Linux manages many things as files. If a process wants to work with files, it gets file descriptors. So handling multiple connections inside a server instance means holding multiple file descriptors. If you are not familiar with file descriptors, I recommend watching this video.
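To make the “everything is a file” idea concrete, here is a tiny sketch: opening anything, even a device file, hands the process a small non-negative integer, and that integer is the file descriptor it uses for all later I/O.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open /dev/null and return the file descriptor the kernel hands us.
   Descriptors 0, 1, and 2 are stdin, stdout, and stderr, so the first
   file a process opens typically gets fd 3. */
int open_dev_null(void) {
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0) return -1;   /* open failed */
    close(fd);               /* always release descriptors when done */
    return fd;
}
```

Calling `open_dev_null()` from a fresh process normally returns 3, since the three standard descriptors are already taken.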
select, poll, and epoll are tools we can use to achieve I/O multiplexing. Below are the C function definitions of select and poll.
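For reference, the two prototypes, as given in the Linux man pages, are:

```c
#include <sys/select.h>   /* select(), fd_set */
#include <poll.h>         /* poll(), struct pollfd */

/* select(2): the caller builds fixed-size fd_set bitmaps of the
   descriptors to watch (readable, writable, exceptional). */
int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);

/* poll(2): the caller passes an array of pollfd structs instead,
   one entry per descriptor of interest. */
int poll(struct pollfd *fds, nfds_t nfds, int timeout);
```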
select and poll look different, but the concept is almost identical. A process passes a list of file descriptors to the kernel, saying, “Hey kernel, I want to know about events on these file descriptors.” The kernel then scans all the file descriptors to check whether an event has happened on each one. Finally, the kernel returns to the process the count and details of the file descriptors that changed.
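That flow can be sketched with select and a pipe standing in for a socket (my own minimal example, not from the video):

```c
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

/* Watch the read end of a pipe with select(2). Returns 1 if the
   kernel reported the descriptor as readable, 0 otherwise. */
int demo_select(void) {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    write(fds[1], "hi", 2);               /* make the read end ready */

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fds[0], &readfds);             /* "kernel, watch this fd" */

    struct timeval timeout = {1, 0};      /* give up after 1 second */
    int ready = select(fds[0] + 1, &readfds, NULL, NULL, &timeout);

    int readable = (ready == 1 && FD_ISSET(fds[0], &readfds));
    close(fds[0]);
    close(fds[1]);
    return readable;
}
```

Since two bytes are already sitting in the pipe, `demo_select()` returns 1: the kernel scanned the set we passed in and reported the descriptor as readable.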
If you search for the select system call, you might find the image below. It shows the system call and the read I/O operations performed using the information returned from the kernel.
If the kernel tells us which file descriptors have changed, all the process has to do is wait for the kernel and then process those file descriptors! This is how I/O multiplexing works. Very simple. Maybe I oversimplified, but the concept is clear.
However, select and poll are not scalable. Managing 1,000 file descriptors and 1M file descriptors cost very differently under these models. Think about what the kernel does when a process calls select or poll: it must look at every file descriptor the process passed in and check the status of each. This scan happens on every call. Hence their time complexity is O(n).
How about epoll? Conceptually, it is the same as select and poll. First, the process asks the kernel, “Are there any file descriptors I can work with now?” Then the kernel returns the file descriptors of interest. Internally, however, it is quite different, and it solves the scalability issue!
First of all, epoll is an instance in kernel space. You create an epoll instance by calling the epoll_create or epoll_create1 system call.
Then, the process enrolls the file descriptors it is interested in. This is done with epoll_ctl and the add flag.
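These two steps look like this in code (again a pipe stands in for a real socket; this sketch is mine, not the video’s):

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Create an epoll instance and enroll one descriptor with the
   EPOLL_CTL_ADD flag. Returns 0 on success. */
int demo_epoll_register(void) {
    int pipefd[2];
    if (pipe(pipefd) != 0) return -1;

    int epfd = epoll_create1(0);          /* the epoll instance */
    if (epfd < 0) return -1;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;                  /* interested in "readable" */
    ev.data.fd = pipefd[0];

    /* Enroll the fd; from now on the kernel watches it for us. */
    int rc = epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return rc;
}
```

Note that enrollment happens once, not on every wait; that difference is the whole point of the next section.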
After the process tells the epoll instance which file descriptors it wants to watch, the epoll instance watches them inside kernel space. If any file descriptor changes its status, the instance marks it, as shown below.
Since the epoll instance already knows which file descriptors have changed, it can return them immediately when the process asks the kernel via the epoll_wait function.
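Putting create, enroll, and wait together (my own sketch, with a pipe in place of a socket):

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Enroll a pipe's read end, make it ready, and collect it with
   epoll_wait. Returns 1 if the enrolled fd came back, 0 otherwise. */
int demo_epoll_wait(void) {
    int pipefd[2];
    if (pipe(pipefd) != 0) return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev);

    write(pipefd[1], "ping", 4);          /* make the read end ready */

    struct epoll_event events[8];
    /* The kernel answers from its ready list; it does not rescan
       every enrolled descriptor. */
    int n = epoll_wait(epfd, events, 8, 1000);

    int ok = (n == 1 && events[0].data.fd == pipefd[0]);
    close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return ok;
}
```

A real server would run `epoll_wait` in a loop and serve each returned descriptor before waiting again.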
So let’s summarize the difference between select, poll, and epoll for I/O multiplexing. With select and poll, the process must pass all the file descriptors to the kernel on every call to learn which ones it can handle. With epoll, the process just queries the epoll instance without passing the full descriptor list, because it enrolled the descriptors beforehand. So the kernel does not need to scan every file descriptor each time a request arrives through an epoll instance.
Honestly, I have over-simplified I/O multiplexing, especially epoll. To use epoll, I think you need to understand at least the following, and this part was challenging for me.
- Level-triggered events (default)
- Edge-triggered events
When the process tells the epoll instance which file descriptors it is interested in, it also describes what kind of events it wants to handle. For example, the process might want to read data from a file descriptor, or write data to it. Additionally, the process can choose a level-triggered or edge-triggered event type for each file descriptor. These terms, I believe, come from flip-flops in electronics.
Level-triggered means the event holds as long as the file descriptor is in a ready state. Let’s say the process adds file descriptor 5 to the epoll instance, saying, “I want to know when this fd can be read.” With a level-triggered event, it will be returned by epoll_wait whenever the process can read from it.
For instance, the process calls epoll_wait and gets file descriptor 5, which means it can read some data from that descriptor. But unfortunately, the process’s buffer is only 1KB while the readable data is 2KB. So it reads only 1KB and calls the epoll_wait function again. With a level-triggered event, epoll_wait returns file descriptor 5 again.
On the other hand, an edge-triggered epoll instance does not return file descriptor 5 on that second epoll_wait call. Why? Because file descriptor 5’s state has not changed, so no edge has occurred.
Internally, when the process calls the epoll_wait function, the epoll instance first scans the level-triggered file descriptors to see whether any can be returned. This means that if all the enrolled file descriptors use level-triggered events, epoll is only a little faster than poll or select. Edge-triggered events, by contrast, are returned only if they are already in the ready list inside the epoll instance, so the kernel does not need to scan at all. This is how epoll can turn an O(n) lookup into O(1).
You should be careful with the edge-triggered flag, though, because it can deadlock your process. TL;DR: edge-triggered events should be used with non-blocking file descriptors. So let’s now discuss what a non-blocking file descriptor is.
There are two kinds of file descriptors: blocking and non-blocking. A blocking descriptor blocks the process when it is not ready. A non-blocking file descriptor, on the other hand, never blocks the process.
Let’s see why a blocking file descriptor in epoll’s edge-triggered mode is dangerous. Do you remember the file descriptor 5 example above?
When the process calls epoll_wait, the epoll instance returns fd5 when it is ready to be read. But again, say fd5 has 2KB of data and the process’s buffer is 1KB. The process can read only 1KB, and since fd5 is a blocking file descriptor, it must call epoll_wait again rather than risk a read that blocks.
Has fd5’s state changed after the first epoll_wait call? No. It can still be read by the process, so the situation looks like the one below.
The kernel does not scan edge-triggered events when the process calls epoll_wait. Since fd5 is not in the ready list, it cannot be returned to the process this time.
On the other hand, if fd5 were a non-blocking file descriptor, the process could consume all the data even with a small buffer: it just calls read until read returns EWOULDBLOCK. Then the epoll instance can notify the process when fd5 is ready to read again!
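That drain-until-EWOULDBLOCK loop is the standard companion to edge-triggered epoll. A minimal sketch of it (mine, again with a pipe):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put 2 KB into a pipe, mark the read end non-blocking, and drain it
   with a 1 KB buffer until read() reports EWOULDBLOCK/EAGAIN.
   Returns the total number of bytes read. */
long drain_nonblocking(void) {
    int pipefd[2];
    if (pipe(pipefd) != 0) return -1;

    fcntl(pipefd[0], F_SETFL, O_NONBLOCK);  /* never block on read */

    char twokb[2048] = {0};
    write(pipefd[1], twokb, sizeof twokb);

    char buf[1024];                         /* buffer smaller than data */
    long total = 0;
    for (;;) {
        ssize_t n = read(pipefd[0], buf, sizeof buf);
        if (n > 0) { total += n; continue; }
        if (n == -1 && (errno == EWOULDBLOCK || errno == EAGAIN))
            break;    /* fully drained; safe to go back to epoll_wait */
        break;        /* EOF or a real error */
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

Because the descriptor is fully drained before the next epoll_wait, the next edge (new data arriving) is guaranteed to be reported, and the process never stalls on a partial read.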
I recommend reading this post if you want a more detailed explanation of non-blocking I/O and epoll multiplexing.
I think I over-simplified I/O multiplexing, and honestly, I am still somewhat confused by these concepts. So if you are like me, I recommend reading these articles.
A detailed explanation of non-blocking I/O is here. It describes file descriptors and how they work inside the kernel.
The man page of epoll is mandatory reading before using epoll!
Do you want to understand the implementation of epoll? These posts cover what you need (parts 1, 2, 3, 4).
I think I should implement a simple server program using epoll to learn more. Thanks for reading my post!