How node actually works, part 1: the operating system

Igor Atakhanov
10 min read · Jan 13, 2020


When talking about how node works, people talk about a single threaded, non-blocking, asynchronous event loop. But every program has to talk to the operating system through system calls. Servers can be described as abstractions on top of operating system mechanisms, because it’s the operating system that tells the hardware what to do. After reading this, you’ll be able to walk into a Javascript interview with a bunch of new buzzwords that will make managers go wild. Buzzwords like epoll, event stream, and multiplexer. Remember, it’s more important to have large wooden platform energy than an actual large wooden platform. Even if you don’t completely get it, a manager might think you actually read source code, which he can’t test, because he sure as hell didn’t. So let’s add some inches to your wooden platform energy with an explanation of how node talks to the Linux operating system.

Let’s explain things one buzzword at a time.

Demultiplexer

The big picture of what’s happening with a server is that data is multiplexed, then demultiplexed. A multiplexer takes many input channels and creates a singular output channel. What is it multiplexed into? Let’s call it an event stream, since it’s a stream of data, and that data can be literally anything, so let’s just call each data point an event. In node, when a request from a client is received, and a handler is called on it such as

(req, res, next) => {
  if (isValid(req.body.password, req.body.username)) {
    return res.json({ authenticated: true })
  }
  return res.json({ authenticated: false })
}

an event is taken off the event stream, by deconstructing the singular stream into a few queues, and taking events off those queues to send to handler functions for processing. By now, this should be your mental model:

[Image: mux and demux model]

And the next question should be, why do we multiplex? Why not call file handlers directly on incoming network events? Simple, because they would block.

Non-blocking

Blocking is brought up frequently, and tasks are talked about as blocking or non-blocking. To explain blocking, imagine that you have to do three things. You’re going to cook, then laundry, then clean. When are you going to do them? When you are done with one, you start the other. In other words, cooking blocks on laundry until you’re done cooking, then cleaning is blocked until you are done with laundry. This is quite simple, because you did not have to worry about scheduling tasks, the timing was a matter of workloads being completed. However, what if you’re roasting chicken? You mean I have to wait an hour and a half?

This is where concurrency comes in. To roast a chicken and do laundry at the same time, you need to allocate resources to both, and make sure one doesn’t interfere with the other. In Linux, what divides resources among programs are processes. Not only do they separate programs from each other for safety, they create independent threads of execution that can run concurrently.

Doing laundry while the chicken’s roasting is an example of parallelism and concurrency, but we don’t actually need parallelism. Here’s how one single core of a processor leverages concurrent processes to help us not to block.

Back in the 1980s, when people first began ripping their jeans on purpose and orange juice was for rich people, things were a little simpler. If you wanted to write to the terminal, and be able to read the output, two things had to happen. You had to call write() and then read() on a file. Why a file? Because everything in Unix is a file. An internet connection, an actual file on disk, or stdout in this case. That means that read() and write() were how you sent and received data.

When you write 10kb of data to a file, what happens is that write will send 1kb of the data, and before it sends the rest, it waits for a signal that the data was read. In the meantime, it blocks. So, how do we ever get to the read part? We can’t call read() until write() returns, but write() won’t return until something reads what it wrote. We’re stuck.

The way that people figured out how to deal with this back in the day is much stupider than you think. Seriously, it is ignorant. You fork the process, which copies the reference to the same file, and switch back and forth between the two processes without even asking if one is done before switching to the other. One process calls write(), the other process calls read(), switching many times a second, making it look like it’s happening all at once.
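Here’s a minimal C sketch of that idea. I’m using a pipe as the shared file, so both forked processes hold references to the same thing; the message and buffer size are just illustrative:

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];                 /* fds[0] is the read end, fds[1] the write end */
    pipe(fds);

    if (fork() == 0) {          /* child process: does the reading */
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof(buf) - 1);  /* blocks until data arrives */
        if (n > 0) {
            buf[n] = '\0';
            printf("child read: %s\n", buf);
        }
        return 0;
    }

    /* parent process: does the writing; if the pipe were full,
       this call would block until the child drained it */
    const char *msg = "hello from the parent";
    write(fds[1], msg, strlen(msg));
    wait(NULL);                 /* wait for the child to finish */
    return 0;
}

The kernel context-switches between the two processes, so the blocked read in the child and the write in the parent appear to happen at once.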

Switching between processes is called a context switch. Most operating systems use preemptive multitasking, which means they hardly prioritize between processes, but simply decide how much time a process gets for execution based on how many total processes there are. The OS does not even ask if a process is ‘done’ doing anything before it switches to the next. This is what our model looks like right now.

The first innovation on this model was non-blocking I/O. Let’s say that we have a 100 byte write buffer, and 200 bytes in total to write. We empty our buffer, but we cannot perform the next write because the first 100 bytes haven’t been read. With blocking I/O, we would block indefinitely. But what if, when the kernel knows a call would block, it returned an error instead of blocking? Let’s call that error EWOULDBLOCK. Our non-blocking code would look something like this:

const response = write()
if (response === EWOULDBLOCK) {
  read()
}

This way we can have both a write and a read going in the same process, each waiting for the other to return this error in order to run again. Problem solved, right? Not quite: if you need to read and write on potentially thousands of stream sockets, how would you orchestrate that in a single process? If we need to read on ten different things in the same process, we can loop through them constantly to see which ones would block. If they have nothing to read, this constant looping is a waste of system resources. We can put in a delay, such as looping through them every second, but then a read could be late by up to a second, increasing latency.
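As a hedged sketch of what that busy-polling looks like in C: flip the fd into non-blocking mode with fcntl(), then spin on read(), treating EWOULDBLOCK/EAGAIN as “nothing yet”. The sock parameter and the one-second sleep are assumptions for illustration:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

void busy_poll(int sock) {
    /* switch the fd into non-blocking mode */
    int flags = fcntl(sock, F_GETFL, 0);
    fcntl(sock, F_SETFL, flags | O_NONBLOCK);

    char buf[1024];
    for (;;) {
        ssize_t n = read(sock, buf, sizeof(buf));
        if (n >= 0)
            break;              /* got data (or EOF when n == 0) */
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            sleep(1);           /* the one-second delay: cheaper on the CPU,
                                   but now a read can be a second late */
            continue;
        }
        break;                  /* a real error */
    }
}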

So, you are thinking, we are back to context switching? Servers have to take care of thousands of input/output operations, do we make that many processes? Context switching has a cost. Every time you switch to another process, the data in the CPU’s caches is replaced with new data. Each process also takes up memory, and all those processes take up a lot of RAM.

With network data, we are always using something called a socket to talk to other computers. Sockets were introduced in 1983, and they are also represented as a file in the operating system. We talked about reading and writing to files, but we didn’t say how we do that. We reference a file through a file descriptor. This is an integer, used as an index into a per-process table of file descriptions. Each entry holds two values: a flags field (essentially just the close-on-exec flag, which decides whether the descriptor survives an exec()) and a reference to the underlying file struct.

When we call read() or write(), we are calling them on file descriptors (fds), which then use the file reference to figure out the actual implementation of those functions — writing to a stream socket is obviously different than writing to the terminal, though Linux will call both ‘files’. You might ask: if fds are just integers, how does our operating system know what they represent? We are always working in the same process, which has just one table of file descriptions, and the fd, being an index into it, always maps exactly 1:1.
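You can see the integers for yourself. A tiny sketch (the file path is arbitrary, any readable file works):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* 0, 1, and 2 are already taken by stdin, stdout, and stderr,
       so the first file this process opens usually lands at index 3 */
    int fd = open("/etc/hostname", O_RDONLY);
    printf("got fd %d\n", fd);  /* typically prints: got fd 3 */
    close(fd);
    return 0;
}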

So, how do we stay in one process and handle thousands of fds? With a little friend, also born in 1983: select(). Let’s look at how the system call select() treats 50 read fds. Select() loops through them and asks if each one is ready to be read. Along with which operation (read) and which fds we want to monitor, we also pass select() a timeout. This is the amount of time that select() will block for until it sends us the currently ready fds. That’s right, we are still blocking, but we block less.

So, instead of us waiting on individual file descriptors to stop blocking, we ask select() to take all those inputs, and output a stream of data, which is just the fds that are ready for us to talk to. Not only that, everything is happening in user space, keeping things memory safe. Oh, we’re doing spaces now.
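Here’s what that call looks like in a minimal C sketch, assuming fds is an array of sockets we already opened and a one-second timeout:

#include <sys/select.h>

/* returns how many of the nfds sockets in fds are ready to read,
   blocking for at most one second */
int wait_for_readable(int *fds, int nfds) {
    fd_set readfds;
    FD_ZERO(&readfds);

    int maxfd = -1;
    for (int i = 0; i < nfds; i++) {
        FD_SET(fds[i], &readfds);   /* add each fd to the watch set */
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    struct timeval timeout = { .tv_sec = 1, .tv_usec = 0 };

    /* select() overwrites readfds so that afterwards it contains
       only the fds that are ready to read */
    return select(maxfd + 1, &readfds, NULL, NULL, &timeout);
}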

User Space

When programs share the same resources, and they have direct access to memory, they could access one another’s data. That means if you have a bank app, and you download a Pokemon game, Pikachu can steal your money. This problem was solved by creating a layer of abstraction between programs and memory called virtual memory. Instead of interacting with memory directly, they interact with this layer, behind which lies the kernel, which decides what to do. The kernel is the most important part of an operating system, and has direct access to everything.

The file struct that represents network sockets is in kernel space, select() and node operate in user space. As you can imagine, there’s a latency associated with going from kernel to user land, but exploiting user land code will not give you access to nearly as much data. There are servers which stay in kernel space completely, called accelerators, but they are stupid.

So, we have a memory safe user land multiplexer, which sends events to node in a way that doesn’t block a whole lot. However, there’s a problem. Because select() has to loop through all of the fds we sent to it, it is running in O(n) time, or in other words, we can scale to a couple hundred connections at a time, but not ten thousand. We need a better multiplexer.

Epoll

Let’s go back to the cooking, laundry, cleaning example. Let’s say we need a particular chemical to speed up cleaning, and we sent your little brother to go buy it. How do we know when he has it? We could ask him every 10 seconds until he says yes. That is pretty annoying, and it is what select() is doing. Instead, we can relax and do our thing, and he taps us on the shoulder when he has it. That’s what epoll does.

Epoll, short for event poll, is a structure in the kernel which keeps track of currently ready file descriptors. We talk to it using three system calls: epoll_create(), which initializes the event_poll struct and returns a fd for it; epoll_ctl(), which adds fds to the interest list; and epoll_wait(), which blocks until fds become available. Notice we return a fd from epoll_create() — that’s right, we can call select() on epoll because epoll can be treated as a file.
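Those three calls, in a hedged C sketch (sock is assumed to be a socket we already opened; epoll_create1() is the modern spelling of epoll_create()):

#include <sys/epoll.h>

void epoll_demo(int sock) {
    /* 1. create the kernel event_poll structure; we get back an fd for it */
    int epfd = epoll_create1(0);

    /* 2. add sock to the interest list, asking to be told about reads */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

    /* 3. block until something on the interest list becomes ready */
    struct epoll_event ready[64];
    int n = epoll_wait(epfd, ready, 64, -1);   /* -1 means no timeout */
    for (int i = 0; i < n; i++) {
        int fd = ready[i].data.fd;             /* this fd is ready to read */
        (void)fd;                              /* hand it to a handler here */
    }
}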

Epoll works in one of two ways: level triggered, and edge triggered. Level triggered behaves like select(): as long as an fd stays ready — say, there is still unread data in its buffer — every call to epoll_wait() will report it. Edge triggered only reports an fd when its state changes, for example when new data arrives, even if older data is still sitting there unread.
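In code, the only difference is one flag when registering the fd (a sketch):

#include <sys/epoll.h>

/* register fd edge triggered: epoll reports it once per state change,
   instead of for as long as data remains (level triggered, the default) */
void add_edge_triggered(int epfd, int fd) {
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}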

It is important to remember that epoll does not ask an fd if it is ready. Instead, I/O events making changes to files associated with fds in epoll’s interest list dynamically populate a ready list. When data from a connection to some remote computer comes down the pipe in your network card, the network module notifies your CPU using a hardware interrupt. Interrupt context is outside of process context, and can interrupt any process that is not in the uninterruptible state, a state used only briefly by certain system calls like mkdir. Being able to interrupt processes to send data is obviously very useful, and is implemented at the CPU architecture level. How does your CPU implement this?

Don’t ever ask me that again.

Because network sockets are mostly blocked and not ready to read or write to, epoll is usually pretty fast with servers. We went from O(all fds we are interested in) to O(of the fds we are interested in, only the ready ones) time complexity, which is basically O(1).

What we have going right now is data coming down network inputs, being multiplexed so our server doesn’t have to worry about file descriptors not being ready to operate on, then demultiplexed so different parts of our server can process different types of data. However, the data we are getting is not the data sent to us from remote computers, but actually events that tell us a file descriptor is ready. Get ready for another buzzword.

Reactor

Because we are just getting a stream of ready fds, our node handlers have to make the system calls responsible for reading and writing themselves. This is called a reactor pattern server. We could have the operating system do the reading and the writing, and send the data to the handler functions. That is a proactor, and it requires specific operating system mechanisms that Linux lacks.
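A hedged sketch of a reactor core loop in C, with the dispatch left as a comment (the epfd parameter and buffer size are assumptions):

#include <sys/epoll.h>
#include <unistd.h>

void reactor_loop(int epfd) {
    struct epoll_event ready[64];
    char buf[4096];

    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);      /* the event stream */
        for (int i = 0; i < n; i++) {                 /* demultiplex it */
            int fd = ready[i].data.fd;
            ssize_t len = read(fd, buf, sizeof(buf)); /* we do the I/O ourselves */
            if (len > 0) {
                /* dispatch buf[0..len) to the handler registered for fd */
            } else {
                close(fd);                            /* EOF or error: drop it */
            }
        }
    }
}

The point of the pattern is that epoll only tells us which fds are ready; the read() belongs to us, not the operating system.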

One thing to mention is how we define a fd as being ready or not. Data coming in on a network socket arrives at a different rate than our functions will process it. We use something called a buffer to help us out. One of the things abstracted away from us in express is user provided buffers. When we call read(), it blocks until data comes down a socket. As the data arrives, it fills up a buffer of predetermined size. When the buffer is full, it is “ready”. The data is processed by a handler function, the buffer is emptied, and we wait for it to be filled up again. A writer sends a full buffer, which is emptied as it is consumed on the other end, and when the buffer is empty, the write operation stops blocking.
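With a non-blocking fd, that buffer dance looks something like this sketch (the buffer size is arbitrary):

#include <errno.h>
#include <unistd.h>

/* drain a non-blocking fd: keep filling our user-provided buffer
   until the kernel has nothing left for us right now */
void drain(int fd) {
    char buf[1024];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;           /* hand buf[0..n) to a handler, then reuse it */
        if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
            break;              /* the kernel-side buffer is empty for now */
        break;                  /* EOF (n == 0) or a real error */
    }
}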

To recap, we’ve defined our multiplexer, epoll. It sends us ready file descriptors. This is the event stream our demultiplexer deconstructs so that our handler functions can process each event. We can call this stream non-blocking because our server isn’t bothered by read or write operations waiting on data to be written or read on the other side, and we can do other tasks until that happens. Two buzzwords we did not touch are asynchronous and event loop. Those are covered in part two, where we go into detail about what it means to be asynchronous, and how node implements its demultiplexer in detail. Until then, keep your deck clean!

