On Linux Netlink

Published in THG Tech Blog, Dec 2, 2021

When you are writing a Linux application that needs kernel-to-userspace or userspace-to-kernel communication, the typical answer is to use ioctl and sockets.

This is a simple mechanism for sending information down from userspace into the kernel to make requests for info, or to direct the kernel to perform an operation on behalf of the userspace application.

A good example of this type of communication between a userspace application and the kernel can be found in the venerable ethtool configuration tool. The tool itself is a userspace application that communicates with the kernel via sockets. The kernel defines the API that the application uses to communicate.

Example: Setting a NIC’s channels

Let’s look at an example use case of ethtool with a modern multi-queue network interface (NIC). Modern NICs have the hardware and ability to use multiple channels for sending and receiving packets. These take advantage of multi-core CPUs to balance the load of transmitting (Tx) and receiving (Rx) traffic. Historically, all the traffic (and associated interrupts) was handled by a single core; spreading the workload across multiple cores can significantly improve performance.

How would we set the combined channel number on a NIC that supports the feature using ioctl?

Here we can see the standard method of working with ioctl to communicate to the kernel and set the required channels on a NIC. The important parts of the code to take note of are:

#include <linux/ethtool.h>
fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
ETHTOOL_SCHANNELS

It’s clear from these that ethtool is using a standard socket to communicate with the kernel, and that the exact command being sent from the userspace application to the kernel is defined in the kernel headers.

It’s a very simple programming model: open a socket, fill a structure with the appropriate info and command, and send it down the socket. However, this simplicity comes at a very high cost: the application is necessarily tightly coupled to the kernel (the exact command you want to send must already be defined in the kernel headers).

So ioctl and syscalls (the traditional methods of communication between userspace and kernel) are simple to use, but they have the major pain point of requiring kernel-level changes to implement the protocol. This makes any request to add them onerous for the kernel development community, and there is no guarantee that such additions will be completed in a timely manner (timely in the sense that a userspace application is blocked from adding functionality until the kernel has been modified).

The proposed solution to this issue has its roots in the Linux networking space: netlink.

Netlink

In contrast to the previous communication options between application and kernel, adding a new protocol with netlink requires only the addition of a constant to netlink.h; the kernel and application can then communicate immediately via a sockets-based API.

The original goal of netlink was to provide a better way of modifying network-related settings and transferring network-related information between userspace and kernel. Importantly, communication between userspace and kernel is bi-directional; in other words, the netlink socket is a duplex socket.

With this new means of communication both to and from the kernel, there is now a great way of developing applications that, by design, need frequent update events directly from the kernel. What started as a more effective means to relay and modify network-related information has become a generic kernel and userspace communications fabric via NETLINK_GENERIC.

The downsides

All of the advantages of netlink over syscalls or ioctl sound fantastic; however, there is a catch. The simplicity of sending and receiving a message using ioctl is gone: netlink itself is a more complex messaging system, particularly in terms of constructing the messages themselves.

Netlink message format

A netlink message is a byte stream consisting of one or more headers plus payloads of data. Each message is aligned to a 4-byte boundary and requires padding if the payload isn’t an exact fit. A message can contain multiple headers.

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Length                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Type              |           Flags              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Process ID (PID)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The message header contains the message length, type, flags, sequence number and process id:

struct nlmsghdr {
 __u32  nlmsg_len; /* Length of message including header */
 __u16  nlmsg_type; /* Message content */
 __u16  nlmsg_flags; /* Additional flags */
 __u32  nlmsg_seq; /* Sequence number */
 __u32  nlmsg_pid; /* Sending process port ID */
};

Following this header is the payload, padded where necessary so that the message ends on a 4-byte boundary.

A message can also contain a second, protocol-specific header. Which one applies depends on the netlink protocol family selected when the socket is opened; the most common families are:

NETLINK_ROUTE for modifying routing tables, queuing, traffic classifiers etc.
NETLINK_NETFILTER for netfilter related information
NETLINK_KOBJECT_UEVENT for communications from kernel to userspace (for an application to subscribe to kernel events)
NETLINK_GENERIC for users to develop application specific messages

The payload of a message follows the header and consists of data expressed in a TLV (Type, Length, Value) format (although in the case of netlink, it’s actually Length, Type, Value):

|<-- 2 bytes -->|<-- 2 bytes -->|<-- variable -->|
--------------------------------------------------
|     length    |      type     |      value     |
--------------------------------------------------
|<--------- header ------------>|<-- payload --->|

A simple message will consist of a single header followed by one or more TLV formatted attributes. However many messages have an optional additional header.

All netlink messages are aligned to a 4-byte boundary (NLMSG_ALIGNTO) using the macro NLMSG_ALIGN; any unused space must be padded with 0 to complete the message.
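To make the alignment arithmetic concrete, here is a small sketch using the macros from linux/netlink.h. The three helper names are hypothetical wrappers added for illustration; the macros themselves are the kernel's.

```c
#include <stddef.h>
#include <linux/netlink.h>

/* Value to store in nlmsg_len: header plus payload, without trailing padding. */
size_t msg_len_for_payload(size_t payload)
{
    return NLMSG_LENGTH(payload);
}

/* Bytes the message occupies in the buffer once padded to 4 bytes. */
size_t msg_space_for_payload(size_t payload)
{
    return NLMSG_SPACE(payload);
}

/* Bytes one attribute occupies: 4-byte attribute header, value, padding. */
size_t attr_space(size_t value_len)
{
    return NLA_ALIGN(NLA_HDRLEN + value_len);
}
```

For a 5-byte payload such as the string "eth0" with its terminating NUL, the message length is 16 + 5 = 21 bytes, but it occupies 24 bytes in the buffer; as an attribute, the same value occupies 4 + 5 = 9 bytes padded to 12.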

Now we know the makeup of a netlink message, how do we put it all together and actually communicate with the kernel to achieve something useful?

Example: Setting NIC state (up or down)

A good place to start is to use netlink for its original intent: communicating with the kernel to modify the settings of a network interface.

At the simplest level, we can use NETLINK_ROUTE to switch a NIC from state UP to DOWN (or vice versa).

This follows a straight-forward pattern:

Open a socket with AF_NETLINK as the family
Set the address id to 0 (for kernel)
Create the standard message header
Attach the correct payload (including the ifinfomsg and the NIC name)
Call sendmsg

With this compiled to a binary, ifd, we can see that it alters the NIC state from UP to DOWN.

[admin@dev-002 ~]$ ip a show eth0
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether b8:59:9f:bf:55:7e brd ff:ff:ff:ff:ff:ff
    inet xxx.xxx.xxx.xxx/24 brd xxx.xxx.xxx.xxx scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a8c0:8adc:61a1:d4e7/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
[admin@dev-002 ~]$ sudo ./ifd
result of send: 48
[admin@dev-002 ~]$ ip a show eth0
4: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether b8:59:9f:bf:55:7e brd ff:ff:ff:ff:ff:ff
    inet xxx.xxx.xxx.xxx/24 brd xxx.xxx.xxx.xxx scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
[admin@dev-002 ~]$

One thing to note about this example is that it uses NETLINK_ROUTE, which is one of the initial implementations of netlink. Similar to ioctl, this type of netlink socket is defined as part of the netlink headers, which makes communication with the kernel relatively simple.

However, the message construction is essentially the same regardless of the type of netlink socket in use. So the next step is moving from NETLINK_ROUTE messages to the newer NETLINK_GENERIC type. This involves the additional complexity of setting up the connection so that the kernel registers it and returns a ‘family id’ prior to sending a command.

Example: Setting channels with Netlink

Unlike setting an interface up (or indeed down), changing the number of Rx/Tx/Combined/Other channels on a NIC with netlink doesn’t use a specific type of netlink message.

For changing the state of an interface we could use NETLINK_ROUTE, which is a defined type in the Linux kernel; for setting channels we need to use NETLINK_GENERIC messages. This leads to a small complication that the previous example avoids: sending an initial message to register with the kernel and receiving the correct family id from the kernel before we can send our ‘update channel numbers’ message!

With our previous netlink example code, the application only had to send a single command to the kernel; now, however, the application needs to send an initial command and interpret the response from the kernel prior to sending a subsequent command to perform the desired action.

Sending the initial message is similar to the example above. As we will be setting many nlattr values in this code, and getting the padding of the attributes correct is error-prone, we used the libmnl library as a source for some helper functions, including nl_attr_put:

A function to add an nlattr to a netlink message
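A sketch of what such a helper can look like is shown below. It is modelled on libmnl's mnl_attr_put(); the name nl_attr_put comes from the text above, but the signature and body here are assumptions rather than the original code. The caller is assumed to have checked that the buffer behind nlh has room for the attribute.

```c
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>

/* Append one attribute (length, type, value) to the end of the message
 * and grow nlmsg_len to cover it, including any padding. */
void nl_attr_put(struct nlmsghdr *nlh, uint16_t type, size_t len, const void *data)
{
    /* The new attribute starts at the current, aligned end of the message. */
    struct nlattr *attr =
        (struct nlattr *)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len));
    size_t pad = NLA_ALIGN(len) - len;

    attr->nla_type = type;
    attr->nla_len  = (uint16_t)(NLA_HDRLEN + len);  /* excludes the padding */
    memcpy((char *)attr + NLA_HDRLEN, data, len);

    /* Padding bytes must be zero. */
    if (pad)
        memset((char *)attr + NLA_HDRLEN + len, 0, pad);

    nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + NLA_ALIGN(attr->nla_len);
}
```

Note the two different lengths: nla_len records the unpadded size of the attribute, while nlmsg_len advances by the padded size so that the next attribute starts on a 4-byte boundary.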

The initial message is sent to the kernel (nladdr.nl_pid = 0) with the command CTRL_CMD_GETFAMILY and the name “ethtool”. With this initial message sent, the kernel will respond with a netlink message of its own, including the family id it wants subsequent messages to contain.
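As a rough illustration of that first message, the sketch below builds a CTRL_CMD_GETFAMILY request for a named family. struct family_req and build_getfamily are hypothetical names, and the version value of 1 is an assumption; sending works the same way as in the NETLINK_ROUTE example, but on a socket opened with NETLINK_GENERIC.

```c
#include <string.h>
#include <linux/genetlink.h>
#include <linux/netlink.h>

/* Hypothetical request layout: netlink header, generic netlink header,
 * then room for a single family-name attribute. */
struct family_req {
    struct nlmsghdr   nh;
    struct genlmsghdr gh;
    char              attrs[32];
};

/* Build a "which id does family `name` have?" request.
 * Returns 0 on success, -1 if the name is too long. */
int build_getfamily(struct family_req *req, const char *name, unsigned int seq)
{
    struct nlattr *attr = (struct nlattr *)req->attrs;
    size_t len = strlen(name) + 1;          /* include the trailing NUL */

    if (len > GENL_NAMSIZ)
        return -1;

    memset(req, 0, sizeof(*req));           /* also zeroes all padding */
    req->nh.nlmsg_type  = GENL_ID_CTRL;     /* addressed to the genl controller */
    req->nh.nlmsg_flags = NLM_F_REQUEST;
    req->nh.nlmsg_seq   = seq;

    req->gh.cmd     = CTRL_CMD_GETFAMILY;
    req->gh.version = 1;                    /* assumption, see lead-in */

    attr->nla_type = CTRL_ATTR_FAMILY_NAME;
    attr->nla_len  = (unsigned short)(NLA_HDRLEN + len);
    memcpy(req->attrs + NLA_HDRLEN, name, len);

    req->nh.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN) + NLA_ALIGN(attr->nla_len);
    return 0;
}
```

For the name "ethtool" the finished message is 32 bytes: 16 for the nlmsghdr, 4 for the genlmsghdr and 12 for the attribute.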

There’s an awful lot to unpack in this code fragment. The starting point is read_socket — this listens to messages coming back from the kernel and then determines how to actually call recvmsg. This is important as we cannot be sure just how large the message will actually be, so there’s an initial static buffer that we hope can fit the message but to be sure we check the length (using MSG_PEEK) and dynamically allocate a new buffer if required.
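The size check can be sketched as below. This is a simplification of the approach just described: instead of trying a static buffer first, it always peeks at the pending message and allocates exactly what is needed. The function name read_datagram is hypothetical.

```c
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Receive one whole message from `fd`. On success *out points to a
 * malloc'd buffer (the caller frees it) and the length is returned;
 * on failure -1 is returned. */
ssize_t read_datagram(int fd, char **out)
{
    char probe;
    ssize_t need, got;
    char *buf;

    /* MSG_PEEK leaves the message queued; MSG_TRUNC makes recv() report
     * the real length even though the probe buffer is a single byte. */
    need = recv(fd, &probe, sizeof(probe), MSG_PEEK | MSG_TRUNC);
    if (need <= 0)
        return -1;

    buf = malloc((size_t)need);
    if (buf == NULL)
        return -1;

    /* Second call without MSG_PEEK actually consumes the message. */
    got = recv(fd, buf, (size_t)need, 0);
    if (got < 0) {
        free(buf);
        return -1;
    }

    *out = buf;
    return got;
}
```

The cost of this simplification is two system calls and one allocation per message, which is what the static first-try buffer avoids in the common case.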

Next we actually parse the message in netlink_parse_nlmsg. This function works through the (potentially multiple) nlmsghdr headers, handling each type of message in turn. For our example of listening for a response from the kernel, we only really care about NLMSG_MIN_TYPE, which here signals a NETLINK_GENERIC message (the generic netlink controller’s id, GENL_ID_CTRL, is defined as this value).

The control passes through a series of helper functions to extract the information needed (again, in our simplified use case, we only care about the family id; however, there is much more that can be passed back in this message).
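Condensed into a single function, the parsing might look like the sketch below. It is not the original netlink_parse_nlmsg: parse_family_id is a hypothetical name, error messages (NLMSG_ERROR) are simply skipped, and every attribute other than CTRL_ATTR_FAMILY_ID is ignored.

```c
#include <stdint.h>
#include <string.h>
#include <linux/genetlink.h>
#include <linux/netlink.h>

/* Walk the messages in `buf` and return the family id from the first
 * controller reply that carries one, or -1 if none is found. */
int parse_family_id(const void *buf, size_t len)
{
    const struct nlmsghdr *nh = buf;
    int remaining = (int)len;

    for (; NLMSG_OK(nh, remaining); nh = NLMSG_NEXT(nh, remaining)) {
        const char *p, *end;

        if (nh->nlmsg_type != GENL_ID_CTRL)     /* same value as NLMSG_MIN_TYPE */
            continue;
        if (nh->nlmsg_len < NLMSG_LENGTH(GENL_HDRLEN))
            continue;

        /* Attributes start after the nlmsghdr and the genlmsghdr. */
        p   = (const char *)nh + NLMSG_LENGTH(GENL_HDRLEN);
        end = (const char *)nh + nh->nlmsg_len;

        while (end - p >= NLA_HDRLEN) {
            struct nlattr attr;
            size_t step;

            memcpy(&attr, p, sizeof(attr));
            if (attr.nla_len < NLA_HDRLEN || attr.nla_len > end - p)
                break;                          /* malformed attribute */

            if ((attr.nla_type & NLA_TYPE_MASK) == CTRL_ATTR_FAMILY_ID &&
                attr.nla_len >= NLA_HDRLEN + sizeof(uint16_t)) {
                uint16_t id;
                memcpy(&id, p + NLA_HDRLEN, sizeof(id));
                return id;
            }

            step = NLA_ALIGN(attr.nla_len);     /* skip value and padding */
            if (step > (size_t)(end - p))
                break;
            p += step;
        }
    }
    return -1;
}
```

The returned value is what later goes into nlmsg_type of every message sent to that family.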

Finally we store the id retrieved from parsing the message in our netlink_sock struct so we can access it later as we send our control message:

Like the first message, we create a standard header nlmsghdr and an additional genlmsghdr (as this is a NETLINK_GENERIC message). In nlmsghdr we actually set the family id we retrieved by parsing the kernel response:

/* where msg_type is the family id the kernel provided */ 
nlmsg->nlmsg_type = nls->fi->id;

In the genlmsghdr we set the actual command we need to execute (in our case ETHTOOL_MSG_CHANNELS_SET). To set the interface name "eth0" we need to create a ‘nested attribute’; lines 43–48 handle this nesting. Finally, we set the number of channels (8) and the type (combined) in a further nlattr, then we construct the message, as before, and send it.
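The shape of that control message, including the nested attribute, can be sketched as follows. The helper names put_attr and build_channels_set are hypothetical, requesting an acknowledgement with NLM_F_ACK is an assumption, and the sketch needs kernel headers recent enough to ship linux/ethtool_netlink.h.

```c
#include <stdint.h>
#include <string.h>
#include <linux/ethtool_netlink.h>
#include <linux/genetlink.h>
#include <linux/netlink.h>

/* Append one attribute at the end of the message and return a pointer to it. */
static struct nlattr *put_attr(struct nlmsghdr *nh, uint16_t type,
                               const void *data, size_t len)
{
    struct nlattr *a =
        (struct nlattr *)((char *)nh + NLMSG_ALIGN(nh->nlmsg_len));

    a->nla_type = type;
    a->nla_len  = (uint16_t)(NLA_HDRLEN + len);
    if (len)
        memcpy((char *)a + NLA_HDRLEN, data, len);
    nh->nlmsg_len = NLMSG_ALIGN(nh->nlmsg_len) + NLA_ALIGN(a->nla_len);
    return a;
}

/* Build an ETHTOOL_MSG_CHANNELS_SET request in `buf`, which must be
 * zeroed, 4-byte aligned and at least 64 bytes long. */
struct nlmsghdr *build_channels_set(void *buf, uint16_t family_id,
                                    const char *ifname, uint32_t combined)
{
    struct nlmsghdr *nh = buf;
    struct genlmsghdr *gh = (struct genlmsghdr *)((char *)buf + NLMSG_HDRLEN);
    struct nlattr *nest;

    nh->nlmsg_len   = NLMSG_LENGTH(GENL_HDRLEN);
    nh->nlmsg_type  = family_id;            /* id returned by the kernel */
    nh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    gh->cmd         = ETHTOOL_MSG_CHANNELS_SET;
    gh->version     = ETHTOOL_GENL_VERSION;

    /* Open the nested header attribute; its length is fixed up below. */
    nest = put_attr(nh, ETHTOOL_A_CHANNELS_HEADER | NLA_F_NESTED, NULL, 0);
    put_attr(nh, ETHTOOL_A_HEADER_DEV_NAME, ifname, strlen(ifname) + 1);
    /* Close the nest: it spans everything added since it was opened. */
    nest->nla_len = (uint16_t)((char *)nh + nh->nlmsg_len - (char *)nest);

    /* The new channel count is a plain (non-nested) u32 attribute. */
    put_attr(nh, ETHTOOL_A_CHANNELS_COMBINED_COUNT, &combined, sizeof(combined));
    return nh;
}
```

The nesting is the part that is easy to miss: the device name is not a top-level attribute but lives inside ETHTOOL_A_CHANNELS_HEADER, whose nla_len has to be patched after its children have been written.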

Conclusion

It’s fairly easy to see how much more complex the netlink version of the code is, in comparison to the legacy ioctl method presented at first. This stems from the fact that the NETLINK_GENERIC type is used for many different applications and as such there is an initial handshake with the kernel before you can send the control message. In isolation sending the control message to set the number of combined channels is not really that much more complex (although the nested attributes are easy to miss when first working this out).

As working with netlink directly is quite laborious, a couple of libraries have sprung up to make life easier for netlink application developers: libnl and libmnl.

There are pros and cons to using either of these (or using neither library and working directly at a low level). Libnl is the standard netlink application library and as such has good coverage in terms of examples and documentation, along with separate modules for generic, route and nf (netfilter). The big drawback of libnl is the size of the library.

Libmnl addresses this size issue at the cost of providing far less. Libmnl is “minimal” and focuses on things that are easy to get wrong: parsing messages, handling nlattr etc.

For our experiments, we decided against the fully-featured libnl as it’s another large dependency to keep up to date & audit with respect to CVEs etc. Libmnl is closer to what we need, however as we don’t need all of it, we simply took the parts we found most useful and included them directly in our sources.

The other source of code that was useful when learning how to use generic netlink was the ethtool source code itself. Interestingly, ethtool relies on libmnl to deal with attr handling and message construction, which provides a set of example code for how libmnl can be used to build applications.