The design of lock_sock() in the Linux kernel

Among the various kinds of locks in the Linux kernel code base, lock_sock() is probably the weirdest one (unless you count RCU as even weirder).

As we all know, there are basically two categories of locks in the Linux kernel: blocking (sleeping) ones like a mutex or a semaphore, and non-blocking (spinning) ones like a spinlock or a read-write spinlock. Which one you pick largely depends on the context you plan to use it in. The weird part of the sock lock is that it is both blocking and non-blocking, depending on the context.

There are two contexts for the software part of the networking stack. Bottom-half (BH) context is where packets are received and transmitted; it is often called the “data path” or the fast path. Process context is where the “control path” runs; this is the slow path. Of course, I am simplifying a lot here: on the transmit side, for example, packets are also sent in process context until they reach the Qdisc layer or the driver layer.

For a socket, the “data path” is how packets destined for it are queued; this part is not directly driven by user-space. The “control path” is how we configure a socket, for example via setsockopt(), and how we change its state, for example via bind() and close(); this part is driven completely and directly by user-space.

[Figure: socket contexts]

Generally speaking, the locking rule is clear: if a data structure is shared between the two contexts, it must be locked in both. This is why many locks have an X_lock_bh() variant alongside X_lock(). For a socket, locking it in both contexts means that a packet being queued in BH context won’t race with a user-space close() of the same socket.

Why is lock_sock() not just a regular spinlock? For performance!

If lock_sock() were a regular spinlock, then whenever process context held it for setsockopt(), the packet receive path in BH context would have to busy-wait until setsockopt() finished. This is very bad, because packet receiving is the fast path and we certainly don’t want to slow it down.

This is why the sock lock behaves as two different locks for process context and BH context:

  1. For process context, it is perfectly fine to block, so lock_sock() sleeps when there is contention. Of course, two parallel lock_sock() calls are serialized as usual. In this respect we can simply think of lock_sock() as a mutex. In fact, the kernel already marks it as a mutex for lockdep.
  2. For BH context, it is a regular spinlock, since callers are expected to call bh_lock_sock() instead of lock_sock(). Of course, BH contexts on different CPUs have to be serialized too.
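
One way to picture these two faces of the lock is the user-space sketch below. It is not kernel code: every model_ name is made up for illustration, a C11 atomic_flag stands in for the spinlock, and where the kernel would put the caller to sleep the model simply drops the spinlock and retries. The field names slock and owned are borrowed from the kernel's socket_lock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of the sock lock; not kernel code. */
struct model_sock_lock {
	atomic_flag slock;	/* stand-in for the spinlock, held only briefly */
	bool owned;		/* "a process owns the socket"; protected by slock */
};

static void model_spin_lock(struct model_sock_lock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->slock, memory_order_acquire))
		;	/* busy-wait, like a spinlock */
}

static void model_spin_unlock(struct model_sock_lock *l)
{
	atomic_flag_clear_explicit(&l->slock, memory_order_release);
}

/* Process-context side: mutex-like ownership built on top of the spinlock. */
static void model_lock_sock(struct model_sock_lock *l)
{
	model_spin_lock(l);
	while (l->owned) {
		/* Simplification: the kernel would sleep here; the model
		 * just drops the spinlock and tries again. */
		model_spin_unlock(l);
		model_spin_lock(l);
	}
	l->owned = true;
	model_spin_unlock(l);	/* the spinlock is free while we "hold" the socket */
}

static void model_release_sock(struct model_sock_lock *l)
{
	model_spin_lock(l);
	l->owned = false;	/* the kernel would also wake up sleepers here */
	model_spin_unlock(l);
}

/* BH-context side: nothing but the spinlock. */
static void model_bh_lock_sock(struct model_sock_lock *l)
{
	model_spin_lock(l);
}

static void model_bh_unlock_sock(struct model_sock_lock *l)
{
	model_spin_unlock(l);
}
```

The point of the sketch is that the process side holds the spinlock only long enough to flip the owned flag, so the BH side is never kept spinning for the whole duration of a process-context critical section.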

When process context begins to contend with BH context, it becomes complicated:

  1. While BH context holds this lock, process context cannot acquire it and has to busy-wait until the lock is released. This is okay: BH context is not supposed to hold it for long, and process context is the slow path, so busy-waiting won’t hurt much.
  2. While process context holds this lock, BH context can still acquire it through bh_lock_sock(). This is the whole point of this lock. So, is this safe?

Without additional logic, it is clearly not safe. To make it safe, the sock lock imposes the following protocol on its callers:

  1. In BH context, right after calling bh_lock_sock(), we have to check whether the socket is already owned by user-space. If it is, there is not much we can do except queue the packet into the socket’s backlog. If it is not, we are free to do as much processing as we need.
  2. In process context, acquiring the lock sets a flag saying that user-space now owns it, and releasing it with release_sock() checks whether any packets were queued in the socket backlog while we held the lock. If there were, they are consumed right there, in process context!
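
The two rules above can be sketched as a small user-space model. Again, this is only an illustration under stated assumptions: the model_ names are hypothetical, packets are plain ints, the backlog is a fixed-size array, and "processing" a packet just bumps a counter that records in which context the work was done.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_BACKLOG_MAX 16

/* Hypothetical user-space model of a socket; not kernel code. */
struct model_sock {
	atomic_flag slock;		/* stand-in for the spinlock */
	bool owned;			/* set while a process owns the socket */
	int backlog[MODEL_BACKLOG_MAX];	/* packets deferred by "BH" */
	int backlog_len;
	int done_in_bh;			/* packets fully handled in "BH" */
	int done_in_process;		/* packets handled at release time */
};

static void model_spin_lock(struct model_sock *sk)
{
	while (atomic_flag_test_and_set_explicit(&sk->slock, memory_order_acquire))
		;
}

static void model_spin_unlock(struct model_sock *sk)
{
	atomic_flag_clear_explicit(&sk->slock, memory_order_release);
}

/* Process side. Simplification: assumes no other process-context owner,
 * so the model never has to wait for the owned flag to clear. */
static void model_lock_sock(struct model_sock *sk)
{
	model_spin_lock(sk);
	sk->owned = true;
	model_spin_unlock(sk);
}

/* BH side (rule 1): returns false if the packet had to be dropped. */
static bool model_bh_receive(struct model_sock *sk, int pkt)
{
	bool accepted = true;

	model_spin_lock(sk);
	if (!sk->owned)
		sk->done_in_bh++;		/* not owned: do all the work now */
	else if (sk->backlog_len < MODEL_BACKLOG_MAX)
		sk->backlog[sk->backlog_len++] = pkt;	/* owned: just queue it */
	else
		accepted = false;		/* backlog full: drop */
	model_spin_unlock(sk);
	return accepted;
}

/* Process side (rule 2): consume whatever "BH" deferred, then give up ownership. */
static void model_release_sock(struct model_sock *sk)
{
	model_spin_lock(sk);
	sk->done_in_process += sk->backlog_len;
	sk->backlog_len = 0;
	sk->owned = false;
	model_spin_unlock(sk);
}
```

In this model a packet that arrives while the socket is unowned is counted in done_in_bh, while packets that arrive between model_lock_sock() and model_release_sock() sit in the backlog and are only counted in done_in_process at release time.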

Take a look at the TCP receive path in BH context as an example:

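    tcp_segs_in(tcp_sk(sk), skb);
    ret = 0;
    if (!sock_owned_by_user(sk)) {
        /* Not owned by a process: handle the segment right here in BH. */
        ret = tcp_v4_do_rcv(sk, skb);
    } else if (tcp_add_backlog(sk, skb)) {
        /* Owned by a process and queueing to the backlog failed: drop. */
        goto discard_and_relse;
    }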

See the difference between when the lock is owned by user-space and when it is not? Clearly, tcp_v4_do_rcv() does far more work than tcp_add_backlog(). So where does the “missing” work go when we only call tcp_add_backlog()? It is moved into release_sock(), which runs it when the process releases the lock:

    void release_sock(struct sock *sk)
        /* ... */
        if (sk->sk_backlog.tail)
            __release_sock(sk);    /* drain the backlog in process context */

where __release_sock() runs the callback sk->sk_backlog_rcv() to continue processing the packets queued in the backlog. For TCP, this callback is exactly tcp_v4_do_rcv(). Bingo!
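
To make the drain step more concrete, here is a rough user-space sketch of a backlog being consumed through a per-protocol callback. It is a model under assumptions, not the kernel's implementation: the model_ names, the model_pkt type and the demo handler are all hypothetical. What it tries to show is that the queue is detached and processed with the spinlock dropped, and that the outer loop runs again if more packets were queued in the meantime. The demo handler simulates one such late arrival.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_sock;

/* Hypothetical stand-in for a queued packet. */
struct model_pkt {
	struct model_pkt *next;
	int id;
};

/* Hypothetical user-space model of a socket; not kernel code. */
struct model_sock {
	atomic_flag slock;
	bool owned;
	struct model_pkt *backlog_head, *backlog_tail;
	/* per-protocol handler, playing the role of sk->sk_backlog_rcv() */
	void (*backlog_rcv)(struct model_sock *sk, struct model_pkt *pkt);
	int seen[8];	/* packet ids, in the order they were processed */
	int nseen;
};

static void model_spin_lock(struct model_sock *sk)
{
	while (atomic_flag_test_and_set_explicit(&sk->slock, memory_order_acquire))
		;
}

static void model_spin_unlock(struct model_sock *sk)
{
	atomic_flag_clear_explicit(&sk->slock, memory_order_release);
}

/* Caller must hold slock, as "BH" would while the socket is owned. */
static void model_add_backlog(struct model_sock *sk, struct model_pkt *pkt)
{
	pkt->next = NULL;
	if (sk->backlog_tail)
		sk->backlog_tail->next = pkt;
	else
		sk->backlog_head = pkt;
	sk->backlog_tail = pkt;
}

/* Called with slock held; returns with slock held. */
static void model_drain_backlog(struct model_sock *sk)
{
	struct model_pkt *pkt, *next;

	while ((pkt = sk->backlog_head) != NULL) {
		/* Detach the whole list, then process it without the
		 * spinlock so "BH" can keep queueing meanwhile. */
		sk->backlog_head = sk->backlog_tail = NULL;
		model_spin_unlock(sk);
		do {
			next = pkt->next;
			pkt->next = NULL;
			sk->backlog_rcv(sk, pkt);
			pkt = next;
		} while (pkt != NULL);
		model_spin_lock(sk);
	}
}

static void model_release_sock(struct model_sock *sk)
{
	model_spin_lock(sk);
	if (sk->backlog_tail)
		model_drain_backlog(sk);
	sk->owned = false;
	model_spin_unlock(sk);
}

/* Demo handler: records the id and, once, simulates a packet that
 * arrives in "BH" while the drain is still in progress. */
static struct model_pkt model_late = { NULL, 99 };
static bool model_late_sent;

static void model_demo_rcv(struct model_sock *sk, struct model_pkt *pkt)
{
	if (sk->nseen < 8)
		sk->seen[sk->nseen++] = pkt->id;
	if (!model_late_sent) {
		model_late_sent = true;
		model_spin_lock(sk);
		model_add_backlog(sk, &model_late);
		model_spin_unlock(sk);
	}
}
```

Because the socket is still marked as owned during the drain, the late packet lands in the backlog too, and the outer loop picks it up before ownership is finally given up.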

As you can see, packet receive processing is not always finished in BH context. For TCP, tcp_v4_do_rcv() may be called either in BH context as usual, or in process context when there is contention on the sock lock.

But the rule is still simple: always call lock_sock() and release_sock() in process context, always call bh_lock_sock() and bh_unlock_sock() in BH context, and always check sock_owned_by_user() right after bh_lock_sock().

Hopefully this clears up any confusion about this weird lock when you run into it.

