Kernel Panic with the Anonymous Pipe in the Linked Library

A few months ago, while listening to an argument about the merits of cat filename | grep X versus grep X filename, I began to think about how pipes actually work. The pipe (|) is one of the most ubiquitous and useful shell operators available from the command line, and one that most users employ on a daily basis. I wondered what happens at the Kernel level when a shell uses a pipe to redirect the output of one process to the input of another. In other words, how is the Kernel involved when I actually execute the following:

ls -l | wc -l

After thinking about it for a little while I was unhappy with my lack of a thorough understanding of the process ;). It seemed like such an obvious thing, and I probably use pipe redirection every single time I work with Linux, but I had never thought specifically about what was happening. My understanding got really murky once I began to think in terms of what was happening at the “Kernel level”.

Since then, I have spent a little time working directly with the Linux Kernel, so I wanted to try to answer that old pipe question in as much detail as I could.

Before digging into the Kernel code directly, though, maybe we can gain a bit more insight into what is happening at the user to Kernel interface. I have been a fan of strace since I read a number of nice blog posts on its usage. Strace is a remarkably powerful tool for checking which system calls a process executes, so it will be our primary tool for understanding pipes. We can use strace to look at the system calls made by the shell when we execute our two commands using a pipe. To make the tracing a bit clearer, we can run strace on a spawned shell that will execute our two piped commands, wait for them to complete and then exit. The pipeline has to be quoted so that it is handed to the spawned shell as a single argument; otherwise the outer shell would set up the pipe itself and strace would only ever see sh. The trace below was captured with plain ls rather than ls -l, which makes no difference to how the pipe is set up. The command looks as follows:

strace -f sh -c 'ls | wc -l'

Intuitively, we can anticipate the usual fork/clone and exec calls, followed by some manipulation of the file descriptors of the two processes. Since each cloned process inherits the file descriptors of its parent, we can expect the cloned processes to be able to communicate quite easily using the inherited file descriptors.
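To make the inheritance point concrete, here is a minimal sketch in Python, whose os module wraps the relevant system calls fairly directly. The function name child_inherits_pipe is made up for this illustration, and the sketch uses fork rather than the clone call a shell would typically make; the way descriptors are inherited is the same idea.

```python
import os


def child_inherits_pipe(message: bytes) -> bytes:
    """Create a pipe, fork, and let the child write through the inherited fd."""
    # The pipe is created before the fork, so both processes hold both ends.
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: the same fd numbers refer to the same pipe as in the parent.
        try:
            os.close(read_fd)
            os.write(write_fd, message)
        finally:
            os._exit(0)  # leave immediately, skipping Python-level cleanup

    # Parent: close the unused write end, then read what the child sent.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:  # empty read means every write end has been closed
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)  # reap the child
    return b"".join(chunks)


if __name__ == "__main__":
    print(child_inherits_pipe(b"hello from the child\n"))
```

Nothing is passed to the child explicitly here; it can write to the pipe only because it inherited a copy of the parent's file descriptor table.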

In the following strace outputs I’ve removed some strace noise; the PIDs for the processes are as follows: the initial shell has PID 5464, ls has PID 5465 and wc has PID 5466.

execve("/bin/sh", ["sh", "-c", "ls | wc -l"], [/* 23 vars */]) = 0
stat("/usr/local/bin/ls", 0x7fffd8dd4140) = -1 ENOENT (No such file or directory)
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=110080, ...}) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdfffc60a10) = 5465
close(4) = 0
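The pipe([3, 4]) call in the trace above can be tried in isolation. The following is a small sketch, assuming Python on a Unix-like system; pipe_basics is a hypothetical name, and lseek is used here only as a convenient way to show that a pipe cannot seek.

```python
import errno
import os


def pipe_basics():
    """Return (data read back, name of the error raised by seeking the pipe)."""
    # os.pipe() returns the read end first and the write end second,
    # matching entries zero and one of the array filled in by pipe(2).
    read_fd, write_fd = os.pipe()

    os.write(write_fd, b"hello\n")
    data = os.read(read_fd, 16384)

    try:
        os.lseek(read_fd, 0, os.SEEK_SET)
        seek_error = None
    except OSError as exc:
        # Translate the numeric errno into its symbolic name.
        seek_error = errno.errorcode.get(exc.errno)

    os.close(read_fd)
    os.close(write_fd)
    return data, seek_error


if __name__ == "__main__":
    print(pipe_basics())
```

The two descriptors really are two handles on one kernel object: bytes written to the second come out of the first, and attempting to seek fails with ESPIPE.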

The first line of the strace output shows the expected call to execve to execute the subshell (the clone call that precedes it has been clipped from the output). From there we see repeated calls to stat as the shell attempts to locate the ls program in the directories listed in $PATH. Once ls has been located, the shell invokes the pipe system call. pipe returns a pair of file descriptors for one-way communication; we can think of them as two descriptors for a single file that doesn’t allow seeking, one for reading and one for writing. In the returned file descriptor array, entry zero is the read fd and entry one is the write fd. In our simple example these are entries three and four in the file descriptor table (our initial shell process seems to be using only entries zero, one and two, for stdin, stdout and stderr respectively).

Once pipe returns successfully, our simple shell program calls clone, which returns PID 5465 (the PID of our upcoming ls process), and then calls close on fd four. This makes intuitive sense: once the child process (which will eventually become ls) has been created, the parent process (/bin/sh, PID 5464) no longer needs its copy of fd four in its file descriptor table, so it can safely call close. Closing it also matters later, because a reader only sees end-of-file once every copy of the write end has been closed. Now that the shell has completed the setup for the invocation of ls, the strace output continues:

stat("/usr/local/bin/wc", 0x7fffd8dd4140) = -1 ENOENT (No such file or directory)
stat("/usr/bin/wc", {st_mode=S_IFREG|0755, st_size=39648, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdfffc60a10) = 5466
close(3) = 0
wait4(-1, Process 5466 attached

We see the same pattern as before, but this time for the wc process. There are multiple calls to stat along the $PATH entries to find wc, a call to clone, and then a call to close, this time on fd three, the shell’s copy of the read end. Once our simple shell process has finished setting up wc, it calls wait4, which will then wait…for the child processes to exit. The next block of strace output deals exclusively with process 5466 (wc -l):

[pid 5466] dup2(3, 0) = 0
[pid 5466] close(3) = 0
[pid 5466] execve("/usr/bin/wc", ["wc", "-l"], [/* 23 vars */]) = 0
[pid 5466] fadvise64(0, 0, 0, POSIX_FADV_SEQUENTIAL) = -1 ESPIPE (Illegal seek)
[pid 5466] read(0, Process 5465 attached
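The dup2(3, 0) and close(3) pair at the top of the trace above is the whole trick behind redirecting stdin. Here is a hedged sketch of the same two steps in Python; child_reads_stdin_from_pipe is a hypothetical helper, the work is done in a forked child so the caller's own stdin is left alone, and the child reports success through its exit status instead of exec'ing wc.

```python
import os
import stat


def child_reads_stdin_from_pipe(payload: bytes) -> int:
    """Fork a child whose fd 0 is the read end of a pipe; return its exit status."""
    read_fd, write_fd = os.pipe()

    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            os.close(write_fd)          # the reader must not keep a write end
            if read_fd != 0:            # guard for the odd case where fd 0 was free
                os.dup2(read_fd, 0)     # fd 0 now refers to the pipe
                os.close(read_fd)       # the original number is no longer needed
            is_fifo = stat.S_ISFIFO(os.fstat(0).st_mode)
            data = b""
            while True:
                chunk = os.read(0, 16384)
                if not chunk:
                    break
                data += chunk
            status = 0 if (is_fifo and data == payload) else 1
        finally:
            os._exit(status)

    # Parent: feed the pipe, then close the write end so the child sees EOF.
    os.close(read_fd)
    os.write(write_fd, payload)
    os.close(write_fd)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)


if __name__ == "__main__":
    print(child_reads_stdin_from_pipe(b"hello\nstrace\n"))
```

A program started after this point simply reads fd 0 as usual and never needs to know that a pipe is behind it.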

The child first duplicates fd three, inherited from the parent shell, onto fd zero (stdin) and then closes fd three. It never had fd four, because the shell closed that descriptor before this clone. Now that stdin refers to the read end of our anonymous pipe, we see execve invoked to execute wc. There follows a call to fadvise64 on stdin, which fails with ESPIPE because a pipe cannot seek, and then a blocking call to read (there’s nothing to read just yet). Strace then continues tracing our other process, 5465 (ls):

[pid 5465] close(3) = 0
[pid 5465] dup2(4, 1) = 1
[pid 5465] close(4) = 0
[pid 5465] execve("/bin/ls", ["ls"], [/* 23 vars */]) = 0
[pid 5465] getdents(3, /* 4 entries */, 32768) = 112
[pid 5465] getdents(3, /* 0 entries */, 32768) = 0
[pid 5465] close(3) = 0
[pid 5465] fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
[pid 5465] write(1, "hello\nstrace\n", 13) = 13

Our ls process does much the same as wc, with just a few minor differences. First we see a call to close on fd three, followed by our familiar dup2 call, but this time from fd four to fd one (stdout), and, as expected, another call to close, this time on fd four. We then see the call to execve to actually execute ls. At this point both sides of our pipe have been “connected” using some minor fd table manipulation and we now have simple inter-process communication! What remains are the calls ls needs to read the current directory entries, reusing the previously closed fd three (the call that opens the directory has been clipped from the output), and to write its output to fd one, the write end of our anonymous pipe. All that is left is for wc to resume its blocked read call, count the lines it has read, output the result and perform exit cleanup:

[pid 5466] <... read resumed> "hello\nstrace\n", 16384) = 13
[pid 5466] read(0, <unfinished ...>
[pid 5465] close(1) = 0
[pid 5466] <... read resumed> "", 16384) = 0
[pid 5466] fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
[pid 5466] write(1, "2\n", 2) = 2
[pid 5466] close(0) = 0
[pid 5466] close(1) = 0
[pid 5466] close(2) = 0
[pid 5466] exit_group(0) = ?
[pid 5464] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 5466
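In the trace above, wc's second read only returns zero bytes after ls has closed fd one. The sketch below, again in Python and with a made-up function name, illustrates why every copy of the write end has to be closed before a reader sees end-of-file. The extra descriptor stands in for a leftover copy such as the shell's own fd four, and the read end is made non-blocking purely so the demonstration cannot hang.

```python
import os


def eof_needs_all_writers_closed():
    """Return (first read, whether the second read would block, final read)."""
    read_fd, write_fd = os.pipe()
    extra_writer = os.dup(write_fd)    # a second copy of the write end
    os.set_blocking(read_fd, False)    # demo only: fail instead of blocking

    os.write(write_fd, b"hello\nstrace\n")
    os.close(write_fd)                 # one writer gone, one still open

    first = os.read(read_fd, 16384)    # the buffered data

    try:
        os.read(read_fd, 16384)
        would_block = False
    except BlockingIOError:
        # No data, but a write end is still open, so this is not EOF.
        would_block = True

    os.close(extra_writer)             # now the last write end is closed
    last = os.read(read_fd, 16384)     # an empty result signals end-of-file

    os.close(read_fd)
    return first, would_block, last


if __name__ == "__main__":
    print(eof_needs_all_writers_closed())
```

With a blocking descriptor, as wc uses, the middle read would simply wait, which is what would happen forever if the shell forgot to close its own copy of the write end.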

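We also see our two child processes clean up their ends of the anonymous pipe and then begin exiting. Once ls closes fd one, the last open write end is gone, which is why wc’s final read returns zero bytes. Our parent shell process, still blocked on the call to wait4, then resumes and will ultimately exit too.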
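Putting the pieces of the trace together, the sequence of pipe, clone, close, dup2 and execve can be sketched as a toy two-command pipeline in Python. This is only an illustration under a few assumptions: run_pipeline is a hypothetical name, os.fork stands in for clone, fds zero to two are assumed to be open in the caller, and a second pipe (not present in the trace) is added so the output of the right-hand command can be returned to the caller. A real shell handles many more cases.

```python
import os


def run_pipeline(left, right) -> bytes:
    """Run `left | right` and return what `right` wrote to its stdout."""
    pipe_r, pipe_w = os.pipe()   # joins left's stdout to right's stdin
    out_r, out_w = os.pipe()     # extra pipe, only to capture right's output

    left_pid = os.fork()
    if left_pid == 0:
        try:
            os.dup2(pipe_w, 1)                       # stdout -> write end
            for fd in (pipe_r, pipe_w, out_r, out_w):
                os.close(fd)                         # drop every other copy
            os.execvp(left[0], left)
        finally:
            os._exit(127)                            # only reached if exec failed
    os.close(pipe_w)             # parent: like close(4) in the trace

    right_pid = os.fork()
    if right_pid == 0:
        try:
            os.dup2(pipe_r, 0)                       # stdin <- read end
            os.dup2(out_w, 1)                        # stdout -> capture pipe
            for fd in (pipe_r, out_r, out_w):
                os.close(fd)
            os.execvp(right[0], right)
        finally:
            os._exit(127)
    os.close(pipe_r)             # parent: like close(3) in the trace
    os.close(out_w)

    output = b""
    while True:
        chunk = os.read(out_r, 16384)
        if not chunk:
            break
        output += chunk
    os.close(out_r)

    os.waitpid(left_pid, 0)      # the shell in the trace uses wait4 here
    os.waitpid(right_pid, 0)
    return output


if __name__ == "__main__":
    print(run_pipeline(["ls"], ["wc", "-l"]))
```

The ordering mirrors the trace: the parent closes the write end before forking the second child, so the reader never inherits a write end it would have to remember to close.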

With some basic strace usage, and reading some man pages, we can get a much better understanding of what’s actually going on under the hood when we use the pipe operator from a shell. We get a pretty clear idea about what’s actually happening without digging into any Kernel, or shell, source. Strace output can be a little noisy (and daunting at first) but it allows us to come right up to the edge of Kernel space and peer inside; what a great program!

Like what you read? Give chuckleberryfinn a round of applause.
