Down the epoll rabbit hole

This is a deep dive into the observed behavior and interaction of epoll_*(), dup(), shutdown(), and close(). Because there are already plenty of permutations of these, I’m only covering one socket type (IPv4/AF_INET, TCP/SOCK_STREAM), one platform (x86_64), and one kernel version (4.2.0). YMMV.

I’ll be showing observed behavior through strace and tcpdump output.

Setup

Our test environment starts with two sockets connected to each other. There’s also a listening socket, only used to accept the initial connection, and an epoll fd. Both of the connected sockets are added to the epoll watch set, with most possible level-triggered flags enabled.

On the wire, this is just a typical 3-way handshake.

We now have two file descriptors, 5 and 6, that are opposite ends of the same TCP connection. They’re both in the epoll set of epoll file descriptor 3. They’re both signaling writability (EPOLLOUT), and nothing else. All is as expected.
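The setup described above can be approximated in a few lines of Python. This is a minimal sketch rather than the original test program: tcp_pair, flag_names, poll_state, and setup_demo are hypothetical helper names, the descriptor numbers will not match the 5 and 6 seen here, the listener is closed right after accept() instead of being kept around, and it assumes Linux, where Python exposes select.epoll.

```python
import select
import socket

# EPOLLRDHUP is normally exposed by the select module on Linux; 0x2000 is
# its value in the Linux headers and is used only as a fallback.
EPOLLRDHUP = getattr(select, "EPOLLRDHUP", 0x2000)
FLAGS = {
    "EPOLLIN": select.EPOLLIN,
    "EPOLLPRI": select.EPOLLPRI,
    "EPOLLOUT": select.EPOLLOUT,
    "EPOLLERR": select.EPOLLERR,
    "EPOLLHUP": select.EPOLLHUP,
    "EPOLLRDHUP": EPOLLRDHUP,
}
# Level-triggered interest set (EPOLLERR and EPOLLHUP are always reported).
WATCH = select.EPOLLIN | select.EPOLLPRI | select.EPOLLOUT | EPOLLRDHUP


def tcp_pair():
    """Return both ends of one loopback TCP connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    accepted, _ = listener.accept()
    listener.close()  # only needed for the initial accept
    return client, accepted


def flag_names(mask):
    """Translate an epoll event mask into a set of flag names."""
    return {name for name, bit in FLAGS.items() if mask & bit}


def poll_state(ep):
    """Non-blocking snapshot of the epoll set: {fd: set of flag names}."""
    return {fd: flag_names(mask) for fd, mask in ep.poll(0)}


def setup_demo():
    a, b = tcp_pair()
    ep = select.epoll()
    ep.register(a.fileno(), WATCH)
    ep.register(b.fileno(), WATCH)
    state = poll_state(ep)
    result = (state[a.fileno()], state[b.fileno()])
    ep.close()
    a.close()
    b.close()
    return result


if __name__ == "__main__":
    print(setup_demo())
```

Calling setup_demo() returns the flag sets for the two ends of the connection, which makes it easy to compare against the state described above.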

shutdown(SHUT_RD)

Now let’s call shutdown(5, SHUT_RD).

After the call, fd 5 begins signaling EPOLLIN (incoming data, in this case EOF) and EPOLLRDHUP (peer shut down or stopped writing). read(5) returns 0, signaling EOF. All of those make sense.

However, this doesn’t do anything on the wire, so the other side of the connection has no way to know that anything has changed. epoll doesn’t show any change on fd 6, and write(6) works fine. In fact, despite having previously signaled EOF, read(5) still returns the data that write(6) sent.
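The SHUT_RD sequence above can be replayed with a short Python sketch. The helper names (tcp_pair, poll_state, shut_rd_demo) are hypothetical, and the retry loop is an assumption of mine: recv() on a SHUT_RD socket does not block, so the sketch retries briefly while the data crosses loopback. The behavior it reports is Linux-specific and may differ on other kernels.

```python
import select
import socket
import time

EPOLLRDHUP = getattr(select, "EPOLLRDHUP", 0x2000)  # fallback: Linux value
FLAGS = {"EPOLLIN": select.EPOLLIN, "EPOLLPRI": select.EPOLLPRI,
         "EPOLLOUT": select.EPOLLOUT, "EPOLLERR": select.EPOLLERR,
         "EPOLLHUP": select.EPOLLHUP, "EPOLLRDHUP": EPOLLRDHUP}
WATCH = select.EPOLLIN | select.EPOLLPRI | select.EPOLLOUT | EPOLLRDHUP


def tcp_pair():
    """Return both ends of one loopback TCP connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    accepted, _ = listener.accept()
    listener.close()
    return client, accepted


def poll_state(ep):
    """Non-blocking snapshot of the epoll set: {fd: set of flag names}."""
    return {fd: {n for n, bit in FLAGS.items() if mask & bit}
            for fd, mask in ep.poll(0)}


def shut_rd_demo():
    a, b = tcp_pair()  # a plays the role of fd 5, b the role of fd 6
    ep = select.epoll()
    ep.register(a.fileno(), WATCH)
    ep.register(b.fileno(), WATCH)
    seen = {}

    a.shutdown(socket.SHUT_RD)
    state = poll_state(ep)
    seen["a_after_shutdown"] = state[a.fileno()]
    seen["b_after_shutdown"] = state[b.fileno()]
    seen["first_read"] = a.recv(16)  # EOF shows up as b""

    b.sendall(b"hello")  # the peer has no idea; the write succeeds

    # Retry for up to two seconds until the late data becomes readable.
    late = b""
    deadline = time.monotonic() + 2.0
    while not late and time.monotonic() < deadline:
        late = a.recv(16)
        if not late:
            time.sleep(0.01)
    seen["late_read"] = late

    ep.close()
    a.close()
    b.close()
    return seen


if __name__ == "__main__":
    for key, value in shut_rd_demo().items():
        print(key, value)
```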

Side note: notice that close(5) causes automatic removal of that socket from the epoll set. This is handy, but see dup() below.

shutdown(SHUT_WR)

Let’s rewind and test with SHUT_WR (write).

After shutdown(5, SHUT_WR), write(5) fails with EPIPE and raises SIGPIPE. The shutdown traverses the wire this time (as a FIN), and fd 6 on the other end of the connection signals EPOLLIN and EPOLLRDHUP. Data can still flow from fd 6 to fd 5 normally.

The only oddity here is that calling close(5) doesn’t change any of the epoll status flags for fd 6. Once you attempt to write to fd 6, however, every flag on the planet starts firing, including EPOLLERR and EPOLLHUP.
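Here is a rough Python replay of the SHUT_WR sequence, including the close() oddity. The helpers (tcp_pair, poll_state, wait_for, shut_wr_demo) are hypothetical names. Two things differ from a C test: Python ignores SIGPIPE by default, so the failed write surfaces only as BrokenPipeError (EPIPE), and wait_for is a polling loop I added because the FIN and the later reset take a moment to arrive. The results are what I would expect on Linux, not a guarantee for other kernels.

```python
import select
import socket
import time

EPOLLRDHUP = getattr(select, "EPOLLRDHUP", 0x2000)  # fallback: Linux value
FLAGS = {"EPOLLIN": select.EPOLLIN, "EPOLLPRI": select.EPOLLPRI,
         "EPOLLOUT": select.EPOLLOUT, "EPOLLERR": select.EPOLLERR,
         "EPOLLHUP": select.EPOLLHUP, "EPOLLRDHUP": EPOLLRDHUP}
WATCH = select.EPOLLIN | select.EPOLLPRI | select.EPOLLOUT | EPOLLRDHUP


def tcp_pair():
    """Return both ends of one loopback TCP connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    accepted, _ = listener.accept()
    listener.close()
    return client, accepted


def poll_state(ep):
    """Non-blocking snapshot of the epoll set: {fd: set of flag names}."""
    return {fd: {n for n, bit in FLAGS.items() if mask & bit}
            for fd, mask in ep.poll(0)}


def wait_for(ep, fd, wanted, timeout=2.0):
    """Poll until fd reports every flag in wanted, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        flags = poll_state(ep).get(fd, set())
        if wanted <= flags or time.monotonic() >= deadline:
            return flags
        time.sleep(0.01)


def shut_wr_demo():
    a, b = tcp_pair()  # a plays the role of fd 5, b the role of fd 6
    ep = select.epoll()
    ep.register(a.fileno(), WATCH)
    ep.register(b.fileno(), WATCH)
    b_fd = b.fileno()
    seen = {}

    a.shutdown(socket.SHUT_WR)  # this one sends a FIN
    try:
        a.send(b"x")
        seen["write_error"] = None
    except BrokenPipeError:  # EPIPE
        seen["write_error"] = "EPIPE"

    seen["b_after_fin"] = wait_for(ep, b_fd, {"EPOLLIN", "EPOLLRDHUP"})

    b.sendall(b"still open")  # the b -> a direction still works
    a.settimeout(2.0)
    seen["a_read"] = a.recv(64)

    a.close()  # the FIN already went out, so b sees nothing new
    seen["b_after_close"] = poll_state(ep)[b_fd]

    b.send(b"poke")  # writing toward the closed end provokes a reset
    seen["b_after_write"] = wait_for(ep, b_fd, {"EPOLLERR", "EPOLLHUP"})

    ep.close()
    b.close()
    return seen


if __name__ == "__main__":
    for key, value in shut_wr_demo().items():
        print(key, value)
```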

dup()

Rewinding to our setup state again, let’s look at dup().

Nothing interesting on the wire this time; it’s all internal state.

dup(5) gives us the new fd 7. We add it to the set, and all looks normal. We write(6), and both 5 and 7 begin signaling EPOLLIN. read(5) also clears EPOLLIN from 7. Calling close(5) doesn’t cause 6 to signal EPOLLIN or EPOLLRDHUP, since all fd copies have to close to trigger EOF on the socket. All sane.

Here’s crazy town, though. close(5) doesn’t remove it from the epoll set. epoll tracks the underlying open socket rather than the fd number, and fd 7’s existence is keeping that socket alive. Trying to remove fd 5 from the epoll set also fails, since fd 5 is no longer a valid descriptor. The only way to get rid of it seems to be to close(7), which removes both from the set and causes fd 6 to signal EPOLLIN and EPOLLRDHUP.
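The stale-entry behavior above is easy to reproduce. In this sketch, socket.dup() stands in for dup(2), and the helper names (tcp_pair, poll_state, dup_demo) are hypothetical. One Python-specific caveat: epoll.unregister() silently swallowed EBADF before Python 3.9, so the sketch records whatever happens and then checks directly whether the stale fd is still being reported.

```python
import select
import socket

EPOLLRDHUP = getattr(select, "EPOLLRDHUP", 0x2000)  # fallback: Linux value
FLAGS = {"EPOLLIN": select.EPOLLIN, "EPOLLPRI": select.EPOLLPRI,
         "EPOLLOUT": select.EPOLLOUT, "EPOLLERR": select.EPOLLERR,
         "EPOLLHUP": select.EPOLLHUP, "EPOLLRDHUP": EPOLLRDHUP}
WATCH = select.EPOLLIN | select.EPOLLPRI | select.EPOLLOUT | EPOLLRDHUP


def tcp_pair():
    """Return both ends of one loopback TCP connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    accepted, _ = listener.accept()
    listener.close()
    return client, accepted


def poll_state(ep):
    """Non-blocking snapshot of the epoll set: {fd: set of flag names}."""
    return {fd: {n for n, bit in FLAGS.items() if mask & bit}
            for fd, mask in ep.poll(0)}


def dup_demo():
    a, b = tcp_pair()  # a plays the role of fd 5, b the role of fd 6
    a2 = a.dup()       # the role of fd 7: second fd, same underlying socket
    ep = select.epoll()
    for sock in (a, a2, b):
        ep.register(sock.fileno(), WATCH)
    a_fd, a2_fd, b_fd = a.fileno(), a2.fileno(), b.fileno()
    seen = {}

    b.sendall(b"x")
    a.recv(1, socket.MSG_PEEK)  # block until the byte arrives; do not consume
    state = poll_state(ep)
    seen["both_readable"] = ("EPOLLIN" in state[a_fd]
                             and "EPOLLIN" in state[a2_fd])
    a.recv(1)  # consuming through one fd clears EPOLLIN on both
    state = poll_state(ep)
    seen["both_cleared"] = ("EPOLLIN" not in state[a_fd]
                            and "EPOLLIN" not in state[a2_fd])

    a.close()  # the dup keeps the socket, and the epoll entry, alive
    state = poll_state(ep)
    seen["stale_fd_reported"] = a_fd in state
    seen["b_after_first_close"] = state[b_fd]

    try:
        ep.unregister(a_fd)  # EPOLL_CTL_DEL on an fd that is already closed
        seen["del_errno"] = None  # Python < 3.9 hides the EBADF
    except OSError as exc:
        seen["del_errno"] = exc.errno
    seen["stale_after_del"] = a_fd in poll_state(ep)

    a2.close()  # last descriptor: now the socket really closes
    b.settimeout(2.0)
    b.recv(1)   # returns b"" once the FIN has arrived
    state = poll_state(ep)
    seen["only_b_left"] = set(state) == {b_fd}
    seen["b_after_last_close"] = state[b_fd]

    ep.close()
    b.close()
    return seen


if __name__ == "__main__":
    for key, value in dup_demo().items():
        print(key, value)
```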

shutdown(SHUT_RD) + dup()

Nothing extra traverses the wire.

The takeaway here is that shutdown() operates on the underlying socket endpoint, not the file descriptor. Calling shutdown(7, SHUT_RD) causes both fd 5 and 7 to signal EPOLLIN and EPOLLRDHUP.

shutdown(SHUT_WR) + dup()

As expected, shutdown(7, SHUT_WR) causes fd 6 to signal EPOLLIN and EPOLLRDHUP.
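Both shutdown() + dup() cases can be checked with one small sketch that calls shutdown() through the duplicate and then looks at every descriptor. As before, socket.dup() stands in for dup(2), and tcp_pair, poll_state, and shutdown_via_dup are hypothetical helper names. Each call builds a fresh connection, which mirrors the "rewind" between experiments.

```python
import select
import socket

EPOLLRDHUP = getattr(select, "EPOLLRDHUP", 0x2000)  # fallback: Linux value
FLAGS = {"EPOLLIN": select.EPOLLIN, "EPOLLPRI": select.EPOLLPRI,
         "EPOLLOUT": select.EPOLLOUT, "EPOLLERR": select.EPOLLERR,
         "EPOLLHUP": select.EPOLLHUP, "EPOLLRDHUP": EPOLLRDHUP}
WATCH = select.EPOLLIN | select.EPOLLPRI | select.EPOLLOUT | EPOLLRDHUP


def tcp_pair():
    """Return both ends of one loopback TCP connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    accepted, _ = listener.accept()
    listener.close()
    return client, accepted


def poll_state(ep):
    """Non-blocking snapshot of the epoll set: {fd: set of flag names}."""
    return {fd: {n for n, bit in FLAGS.items() if mask & bit}
            for fd, mask in ep.poll(0)}


def shutdown_via_dup(how):
    """Call shutdown(how) on the duplicate and report every fd's flags."""
    a, b = tcp_pair()  # a plays the role of fd 5, b the role of fd 6
    a2 = a.dup()       # the role of fd 7
    ep = select.epoll()
    for sock in (a, a2, b):
        ep.register(sock.fileno(), WATCH)

    a2.shutdown(how)
    if how == socket.SHUT_WR:
        # SHUT_WR sends a FIN; block until the peer has actually seen EOF.
        b.settimeout(2.0)
        b.recv(1)

    state = poll_state(ep)
    result = {"a": state[a.fileno()],
              "a2": state[a2.fileno()],
              "b": state[b.fileno()]}
    ep.close()
    for sock in (a, a2, b):
        sock.close()
    return result


if __name__ == "__main__":
    print("SHUT_RD", shutdown_via_dup(socket.SHUT_RD))
    print("SHUT_WR", shutdown_via_dup(socket.SHUT_WR))
```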

Conclusions

  • If you’re using dup() and epoll, you need to call epoll_ctl(EPOLL_CTL_DEL) before calling close(). It’s hard to imagine getting sane behavior any other way. If you never use dup(), you can just call close().
  • If you’re using dup() and epoll and want to signal all fds on a socket to close, you can call shutdown(SHUT_RD) before calling close(), causing EPOLLRDHUP to signal for other fds attached to the same socket.
  • Remember that read() == 0 and EPOLLRDHUP mean that you can’t read, but not that you can’t write. EPIPE, EPOLLERR, and EPOLLHUP mean that you can’t write.
  • shutdown(SHUT_WR) is a useful way to signal EOF on a receive-only socket and to protect yourself from accidental socket misuse (including two read-only sockets connected to each other, which would otherwise wait forever). It will require the other end of your connection to be doing everything right; it’s likely that many systems will treat EOF as a signal that they’re unable to write.
  • shutdown(SHUT_RD) is nice window dressing, but the only mistake it’s going to save you from is misuse of EPOLLIN/EPOLLRDHUP. The other side can still write() and you can still read().
  • An alternative to shutdown(SHUT_RD) is to request notification of EPOLLIN on write-only sockets. This allows you to detect two write-only sockets connected to each other.
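The first conclusion, removing the descriptor from the set before closing it, might look like this in Python. close_watched is a hypothetical helper name, not an established API; the point is only the ordering of EPOLL_CTL_DEL and close().

```python
import select
import socket

EPOLLRDHUP = getattr(select, "EPOLLRDHUP", 0x2000)  # fallback: Linux value
WATCH = select.EPOLLIN | select.EPOLLPRI | select.EPOLLOUT | EPOLLRDHUP


def tcp_pair():
    """Return both ends of one loopback TCP connection."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    accepted, _ = listener.accept()
    listener.close()
    return client, accepted


def close_watched(ep, sock):
    """Remove sock's fd from the epoll set, then close it."""
    ep.unregister(sock.fileno())  # EPOLL_CTL_DEL while the fd is still valid
    sock.close()


def close_order_demo():
    a, b = tcp_pair()
    a2 = a.dup()  # a second descriptor for the same socket
    ep = select.epoll()
    ep.register(a.fileno(), WATCH)
    ep.register(a2.fileno(), WATCH)
    a_fd, a2_fd = a.fileno(), a2.fileno()

    close_watched(ep, a)
    reported = {fd for fd, _ in ep.poll(0)}
    result = {"stale_entry": a_fd in reported,
              "dup_still_watched": a2_fd in reported}

    close_watched(ep, a2)
    ep.close()
    b.close()
    return result


if __name__ == "__main__":
    print(close_order_demo())
```

With this ordering, the closed descriptor no longer shows up in poll results, while the duplicate stays registered until it is removed and closed in turn.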