A Systems Engineering Attempt to Analyze the Linux Block Layer with BPF Compiler Collection (BCC) Toolkit and Static Tracepoints

Burak Seydioglu
Jan 17, 2023


Systems Engineering

NASA defines Systems Engineering as follows:

At NASA, “systems engineering” is defined as a methodical, multi-disciplinary approach for the design, realization, technical management, operations, and retirement of a system. A “system” is the combination of elements that function together to produce the capability required to meet a need. The elements include all hardware, software, equipment, facilities, personnel, processes, and procedures needed for this purpose; that is, all things required to produce system-level results. The results include system-level qualities, properties, characteristics, functions, behavior, and performance. The value added by the system as a whole, beyond that contributed independently by the parts, is primarily created by the relationship among the parts; that is, how they are interconnected. It is a way of looking at the “big picture” when making technical decisions. It is a way of achieving stakeholder functional, physical, and operational performance requirements in the intended use environment over the planned life of the system within cost, schedule, and other constraints. It is a methodology that supports the containment of the life cycle cost of a system. In other words, systems engineering is a logical way of thinking.

My key takeaways from the block of text above can be reduced to the phrases “combination of elements” and “interconnection of parts”. In fact, if I were asked to describe systems engineering, my short attempt would be:

understanding how components of a system interact with each other and affect the system as a whole.

Abstract definitions do not really impress me, so I am going to continue with a real-life example to set the tone of this article. In the 1960s, during the second test flight of the TSR-2, British Aircraft Corporation’s Cold War era interdictor aircraft, oscillations from an engine fuel pump traveled through the fuselage and caused temporary loss of vision for the pilot because the vibration matched the resonant frequency of the human eyeball. The pilot had to adjust the throttle to control these vibrations and maintain vision during flight.

As an individual with a mechanical engineering degree and a bias toward all things mechanical, I find this historical anecdote about as impressive an example as one can get of the interconnectivity between seemingly distant parts of a system. Hoping that I have explained the mode of thinking that influences my technical decisions, I can now focus on the actual topic of this article.

Goal

As far as software systems are concerned, the Linux kernel is as complex, and its components as interconnected, as it gets. In this huge ecosystem, the block layer, with its components stacked tightly on top of each other, is a prime target for systems engineering. To illustrate my point, here is a simple question: how does a stripe size configuration change in dm-stripe ripple up or down the block stack? To be able to answer a question like this, we need to start with a more fundamental question: can we track an IO request all the way through the stack, across multiple component types, and monitor and collect performance statistics along the way? In essence, can we peel back each layer individually to have a peek inside the system?

The first obvious response that would follow is whether an approach like this is actually needed. One may argue that there is already a wide variety of tools that provide the same visibility into the kernel. The iostat tool makes accounting data available for each device in the block stack. There is blktrace, which is specifically intended for tracing the block layer. BPF/BCC tools provide visibility into disk operations, bio processing, and more. There are indeed many alternatives out there.

The counter-argument is that, while these tools are very effective, they represent approaches on opposing ends of a spectrum. Some generic performance tools, such as iostat, do not differentiate between IO scheduler and driver (scsi, nvme, etc.) performance numbers for request processing (the sar toolset does). On the other end of the spectrum, we have access to tools such as SystemTap and BPF that allow us to delve deep into the kernel. However, tools in the latter category tend to be narrowly focused, targeting specific layers and functionality. The end result is that the deeper we go, the harder it gets to maintain a view of the overall picture.

The above arguments are not intended to portray these tools as limited or flawed in any way; their uses almost always align with their intended goals. However, there is room for a solution that is generic enough to provide coverage across distinct components in a stack, yet focused enough to provide detailed information on each distinct layer, so that proper systems engineering can be performed on that stack. As a systems engineer who has to evaluate and propose system designs, I think a flexible method to observe the interactions of each layer and understand the behavior of the system as a whole would be useful. Consequently, I am convinced there is value in targeting the middle of the spectrum in this specific case. My initial goal, therefore, is to implement a prototype that demonstrates this approach and shows how effectively BPF and static tracepoints can be leveraged against a complex problem like this.

The second part of the answer to the original question is whether this proposed approach is actually feasible. The tracing infrastructure built into the kernel is impressive, and there is a variety of tools built on top of this ecosystem to extract the required data. Among these tools, BPF Compiler Collection (BCC), with its Python front end, offers all the features and flexibility required to implement this design. The BPF backend provides the necessary functions to collect internal data, while the Python frontend provides the dynamic capabilities needed to create a generic tool. I can argue, then, that this design seems feasible, at least in theory, albeit with certain limitations that I will elaborate on later.

Design

This section will explain the evaluation process behind the proposed design.

Input

The first decision that has to be made is choosing a suitable input mechanism for collecting data. In this context, input mechanisms refer to the different kernel tracing technologies available for use. There are multiple alternatives, such as kprobes, kfuncs, and static tracepoints, that can be utilized for this purpose, and the primary factor in selecting one depends on the desired level of exposure to internal kernel structures.

The Linux kernel is under constant development. Tracing technologies such as kprobes or kfuncs expose external clients to internal structures that are subject to change. This may not necessarily be an issue for clients that target specific components; change can still be managed, even if that means a somewhat tightly coupled implementation for the client. This reality clearly manifests itself in some BCC scripts, where support for a specific feature is checked through a reflection mechanism (if I may use that terminology in this context) before the respective feature is enabled. However, for a design that is intended to be generic across a number of components, such as the device mapper implementations currently available, maintenance is a major concern.
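As a hypothetical illustration of that reflection pattern (a sketch of my own, not lifted from any particular BCC tool), a Python front end can probe what the running kernel actually exposes before deciding how to attach:

#!/usr/bin/env python3
# Hedged sketch: check what the running kernel exposes before enabling a feature.
from bcc import BPF

# Prefer a static tracepoint if the kernel provides it...
if BPF.tracepoint_exists("block", "block_bio_queue"):
    attach_mode = "tracepoint"
# ...otherwise fall back to a kprobe on a known function, if it is present.
elif BPF.get_kprobe_functions(b"submit_bio"):
    attach_mode = "kprobe"
else:
    raise SystemExit("no usable instrumentation point found")

print("attaching via", attach_mode)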

There is no escape from change, but the intention here is to make it as manageable as possible. This is where static tracepoints shine. They are not immune to change, but judging by their implementation, naming conventions, and so on, they constitute a somewhat standardized instrumentation interface into the kernel across different components. This, by itself, makes static tracepoints an appropriate mechanism, at least in theory, to keep the tight coupling tendencies of our client code under control.

Exclusive use of static tracepoints does imply certain limitations for clients, however. The first problem is the placement of these tracepoints in the kernel code. They are called static for a reason, and there is a side effect of that inherent characteristic: client code is bound by these placement decisions. Unlike kprobes or kfuncs, arbitrary function entry and return points cannot be targeted. To achieve the same effect, multiple static tracepoints must be used in combination. This means that their placement in the code is critical for accurate resolution and may not always be ideal. On the other hand, these decisions are made by kernel contributors who know the component in question best, so, while this is a limiting factor, it is not a major concern.
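To make the “combination of tracepoints” idea concrete, here is a minimal sketch of my own (not part of sysbpf) that pairs the block_bio_queue and block_bio_complete tracepoints to approximate what a kprobe/kretprobe pair on a single function would otherwise measure, namely a per-bio latency histogram:

#!/usr/bin/env python3
# Minimal illustration: pair two static tracepoints to measure bio latency,
# roughly what an entry/return kprobe pair would provide for one function.
from time import sleep
from bcc import BPF

bpf_text = r"""
struct key_t {
    u64 dev;
    u64 sector;
};

BPF_HASH(start, struct key_t, u64);
BPF_HISTOGRAM(usecs);

TRACEPOINT_PROBE(block, block_bio_queue) {
    struct key_t key = {};
    key.dev = args->dev;
    key.sector = args->sector;
    u64 ts = bpf_ktime_get_ns();
    start.update(&key, &ts);
    return 0;
}

TRACEPOINT_PROBE(block, block_bio_complete) {
    struct key_t key = {};
    key.dev = args->dev;
    key.sector = args->sector;
    u64 *tsp = start.lookup(&key);
    if (tsp) {
        usecs.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
        start.delete(&key);
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
print("Tracing bios... hit Ctrl-C to print a latency histogram.")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass
b["usecs"].print_log2_hist("usecs")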

Second, the amount of information that we can extract is restricted to the predefined set of values (i.e., the event format) defined for a static tracepoint. This is a double-edged sword. It provides the standardization required across different components, and different implementations of those components, to support a generic tool. This is an advantage. The disadvantage, however, is that the client is limited to the information provided; that decision has already been made and cannot be changed. This can be a major issue requiring additional complexity to solve, if it can be solved at all. I will return to this issue later.
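The predefined set of values for a given tracepoint can be inspected directly through its format file in tracefs. A trivial example (the mount point is an assumption; /sys/kernel/debug/tracing is a common alternative):

#!/usr/bin/env python3
# Print the fixed field list (i.e. the format) of the block_bio_remap tracepoint.
# tracefs may also be mounted at /sys/kernel/debug/tracing on some systems.
FORMAT = "/sys/kernel/tracing/events/block/block_bio_remap/format"

with open(FORMAT) as f:
    print(f.read())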

The third limitation is code maturity. Tracing output may not be accurate because static tracepoints may not be properly utilized across different components (see the dm-crypt related changes committed earlier in 2022). The requirement to use static tracepoints in combination implies that they depend on each other and that events must happen in a sequential order. This is not guaranteed. In that case, there is not much one can do: either a combination of static tracepoints and kprobes/kfuncs needs to be implemented, or a recent kernel version that addresses these issues is needed. It did not take long for me to understand that I had to pick my battles. As a result, a recent kernel version is used for prototyping, with the exclusive use of static tracepoints.

Output

At this point, we have an idea about the type of inputs we are going to work with, so the question of outputs naturally follows. As is common in many block stack tools, throughput-related statistics will constitute the integral part of the output. These will include average latency, average size, and throughput numbers, along with merge-related data. Split statistics will also be provided to complement the merge data. In addition, as is common, throughput and latency data will be further grouped under read, write, and discard sub-categories. Similar to fio output, submission and completion pipelines will be treated as separate sub-categories.
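As a rough sketch of how that grouping can be laid out in the front end (the field names below are my own, not the actual sysbpf output schema):

# Hypothetical shape of the per-layer, per-interval statistics described above.
OPS = ("read", "write", "discard")
PIPELINES = ("submission", "completion")

def new_layer_stats():
    return {op: {pipe: {"requests": 0,
                        "avg_size_kib": 0.0,
                        "avg_latency_us": 0.0,
                        "bandwidth_mib_s": 0.0,
                        "merged_mib_s": 0.0,
                        "split_mib_s": 0.0}
                 for pipe in PIPELINES}
            for op in OPS}

# One entry per layer in the block stack, refreshed every reporting interval.
stats = {"dm-crypt": new_layer_stats(), "dm-thin": new_layer_stats()}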

Test System

In order to prototype this design quickly, I wanted to eliminate as many external constraints and variables as possible from the beginning. Code maturity issues constituted the easiest target so I booted up a Fedora 36 VM running a cutting edge 6.x kernel.

The VM is configured with a /test mount point that lives on a stack that consists of dm-crypt, dm-thin, dm-stripe, and two emulated SATA drives with a single partition on each. This test stack is illustrated in a diagram below (note that the latency axis is not to scale):

Block Stack Diagram
lsblk output

The VM’s block stack was deliberately designed to be kludgy in order to put the ideas presented in this document to the test across different components. For example, the single partition on each drive is clearly not needed, but it was added to demonstrate some of the difficulties encountered when dealing with unexpected tracing output.

Note that this diagram is not intended to be fully accurate. For example, the dm-stripe component maps IO requests to multiple devices, but the relationship is reduced from one-to-many to one-to-one for simplification purposes in the diagram. In essence, only a single branch of the block stack hierarchy is illustrated.

Code

The code is available under the sysbpf project on GitHub. Note that the provided code is a prototype implementation, and the quality of the BPF code can definitely be improved. For other, non-superficial, issues and limitations related to the implementation, please refer to the subsequent sections.

Deliverable

The tool periodically outputs columnar data at an interval controlled by a command line parameter. Data is categorized into read, write, and discard groups. Under each group, performance stats are displayed, including request count, average request size, average service time, and average bandwidth. In addition, submission and completion pipelines are treated separately.

While columnar data is a good starting point, it is not really ideal to work with directly. The appropriate action is to plot that data for analysis.

sysbpf generated read statistics
sysbpf generated write statistics

Above are two images that group read and write related stats, respectively. Discard stats can also be graphed but are not included for the sake of brevity. Each image includes two columns: except for the last row, where only merge and split bandwidths are displayed, the left column contains bandwidth graphs and the right one contains latency graphs.

In terms of bandwidth stats, there is not a lot of interesting information apart from the merge and split bandwidth graphs. The slowest layer dictates the bandwidth limit, so, as long as the numbers are consistent across different layers (and add up if there is a striped layer), there is not much to glean besides the maximum bandwidth limit. The latency numbers, on the other hand, which are provided for both submission and completion pipelines, are interesting: in each pipeline, the component that contributes most to overall latency can be individually identified and targeted for optimization.

Limitations

Filesystem Coverage

Filesystems are one of the most complicated layers in the block stack. I remember listening to an engineer explaining to his peers during a presentation that it takes ten years to fully mature a filesystem. Therefore, if our goal is to perform proper systems engineering, a complex component like a filesystem sitting on top of the block stack should not be ignored. The question that follows then is about feasibility.

One major problem is how to connect distinct layers for instrumentation. How do you connect a filesystem that works with file maps to lower layers that perform bio processing? This may be possible as long as we manage to map read and write requests for a file to bios and back. In fact, a similar issue manifests itself when we have to instrument across layers that perform bio processing and layers that perform request processing. Admirably, static tracepoints are implemented in a generic way across the bio and request processing layers, which handles this issue without too much hassle. Filesystems, on the other hand, at least the common ones such as xfs, ext4, and btrfs, have specific, namespaced static tracing infrastructures as opposed to generic ones. These filesystems come bundled with a large list of their own dedicated static tracepoints. This is expected given their complexity and the work required to support them. However, it is a problem in the context of this design. Considering the number of filesystems supported in the Linux kernel, dedicated implementations run contrary to the goals stated in this article: they require individual targeting of each implementation, and that, by definition, comes with the cost of tight coupling and the elimination of the benefits provided by static tracepoints.
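The scale of the problem is easy to see in tracefs, where each of these filesystems ships its own event namespace instead of plugging into a shared, block-style one. A quick, hedged way to count them (the tracefs path is an assumption; /sys/kernel/debug/tracing is a common alternative, and only loaded filesystems show up):

#!/usr/bin/env python3
# Count the dedicated static tracepoints exposed by a few common filesystems.
from collections import Counter

counts = Counter()
with open("/sys/kernel/tracing/available_events") as f:
    for line in f:
        subsystem = line.split(":", 1)[0].strip()
        if subsystem in ("xfs", "ext4", "btrfs", "f2fs"):
            counts[subsystem] += 1

for fs, n in counts.most_common():
    print(f"{fs}: {n} tracepoints")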

Next, a curious question follows. Device mapper layers, each with its own unique internal implementation, provide standard external APIs to the kernel in order to function as part of the block stack. Block drivers follow the same design pattern; they function under the generic block device layer. Filesystems are not much different in this regard; they are exposed via VFS. Why, then, is there no generic tracing mechanism to provide coverage for filesystems through the VFS layer?

Targeting specific filesystems seemed to me like a deviation from the stated design goals. The lack of filesystem coverage is, therefore, a serious limitation of this design. This may change in the future, but, for now, such coverage does not exist.

BPF

A complex approach such as this one, built on a relatively limited framework such as BPF, acts as a negative feedback mechanism on the size of a program. As a function grows larger and larger, droplets of cold sweat start forming on your forehead as you ineluctably hurtle toward that thick wall of the stack size limit (512 bytes per BPF program).

Complex BPF programs are not easy to debug. In this case, sequential dependencies between parts of the code necessitated a debugging mechanism to sort out bugs. The end result is the sad and utterly bad logging mechanism in the source code. It should serve as an adequate reminder of the debugging problems faced during the implementation of this prototype, and a warning to those who are planning to jump into a similar rabbit hole.

Lastly, as complexity increases and the limitations mentioned above become more obvious, questioning the current approach to the problem becomes part of the game. Is a monolithic approach a proper solution to the problem? Can the same goals be achieved with multiple, simpler programs? The cost-benefit analysis of complex monolithic approaches should be part of the discussion. In this specific case, while some complexity stems from the nature of the approach itself, a considerable chunk of it is also contributed by the underlying tracing-related limitations that I will describe in the next section.

Tracing

This section will primarily focus on tracing event format and behavior and their impact on the design in terms of accuracy and complexity.

The first limitation is the lack of a unique identification mechanism for each request. Each request can only be identified by its device, sector, and associated flags. This means that repetitive requests queued in the submission and completion hash maps may overwrite each other, which has an adverse effect on the accuracy of the stats.
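The effect is easy to illustrate with a toy model (illustrative only, not the sysbpf code): once the key is reduced to (device, sector, flags), a second identical in-flight request silently clobbers the first.

# Illustrative only: without a unique request ID, a bio can only be keyed by
# (device, sector, flags), so identical in-flight requests overwrite each other.
inflight = {}

def on_queue(dev, sector, rwbs, ts_ns):
    inflight[(dev, sector, rwbs)] = ts_ns          # a repeat request clobbers the first

def on_complete(dev, sector, rwbs, ts_ns):
    start = inflight.pop((dev, sector, rwbs), None)
    return None if start is None else ts_ns - start   # latency, or None if lost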

Second, tracking requests crossing partition layers is involved. The tracing output appears to contain a bug: the disk that owns the partition is reported as the target device for the remap operation.

dmcrypt_write/2 3748 [010] 68.318853: block:block_bio_remap: 8,16 WS 29824 + 128 <- (253,1) 55680
dmcrypt_write/2 3748 [010] 68.318854: block:block_bio_remap: 8,16 WS 33920 + 128 <- (8,17) 29824

As can be seen above, there is no subsequent bio complete event for the first remap event; it is instead followed directly by a new remap event. This divergent behavior for partitions requires additional logic to identify the correct target partition device for the initial remap event. Needless to say, significant complexity could be eliminated if block layer components behaved as uniformly as possible, at least from a tracing perspective.

Third, static tracepoint formats are not always adequate and require additional logic to work around their limitations. Backmerge events constitute the best example of this limitation.

dmcrypt_write/2 3748 [010] 68.318893: block:block_bio_backmerge: 8,16 WS 34048 + 128 [dmcrypt_write/2]

Merge target information is not provided in the backmerge event format. This requires an additional hash map that tracks previous requests in the pipeline for the sole purpose of identifying the target. Again, this complexity could be completely eliminated if the event format were improved. I believe a static tracepoint event format should, at a minimum, provide enough information to describe the respective event in its entirety, so that redundant external constructs for tracking other entities can be eliminated.
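One possible shape of that workaround, sketched very loosely below (assumed logic, not the actual sysbpf implementation): keep a per-device map of recently observed bios and treat the one whose end sector equals the merged bio’s start sector as the back-merge target.

# Rough illustration: for a back merge, the merged bio starts where some earlier
# bio on the same device ended, so the target can be inferred from recent history.
recent = {}   # dev -> {start_sector: nr_sectors} of recently queued bios

def on_bio_queue(dev, sector, nr_sectors):
    recent.setdefault(dev, {})[sector] = nr_sectors

def on_backmerge(dev, sector, nr_sectors):
    bios = recent.get(dev, {})
    for start, length in bios.items():
        if start + length == sector:           # earlier bio ends where this one begins
            bios[start] = length + nr_sectors  # grow the tracked request, attribute stats
            return start
    return None                                # target unknown; stats cannot be attributed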

Fourth, the behavior of a request crossing up and down a stack of block layers is unfortunately not standard. The sequence of events may change due to a variety of factors. As usual, the issues are exacerbated by the additional client-side complexity required to compensate for missing information. The end result is, again, accuracy issues that compound as requests make their way down and then back up the block stack. One can argue that dependency on the order of operations is inherently a requirement dictated by the design itself, and that is a correct statement. On the other hand, I believe deviation from standard behavior should at least merit an investigation into its underlying causes and a call for potential improvements where feasible.

Device mapper split events manifest one of the most problematic behaviors under this category.

# Linux test 6.0.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Dec 31 16:47:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
dmcrypt_write/2 3748 [010] 68.318825: block:block_bio_remap: 253,4 WS 33024 + 512 <- (253,5) 256
dmcrypt_write/2 3748 [010] 68.318828: block:block_bio_queue: 253,4 WS 33024 + 512 [dmcrypt_write/2]
dmcrypt_write/2 3748 [010] 68.318843: block:block_bio_remap: 253,2 WS 55680 + 128 <- (253,4) 33024
dmcrypt_write/2 3748 [010] 68.318846: block:block_split: 253,4 WS 33152 / 33152 [dmcrypt_write/2]
dmcrypt_write/2 3748 [010] 68.318869: block:block_bio_remap: 253,2 WS 55808 + 128 <- (253,4) 33152
dmcrypt_write/2 3748 [010] 68.318871: block:block_split: 253,4 WS 33280 / 33280 [dmcrypt_write/2]
dmcrypt_write/2 3748 [010] 68.318884: block:block_bio_remap: 253,2 WS 55936 + 128 <- (253,4) 33280
dmcrypt_write/2 3748 [010] 68.318886: block:block_split: 253,4 WS 33408 / 33408 [dmcrypt_write/2]
dmcrypt_write/2 3748 [010] 68.318896: block:block_bio_remap: 253,2 WS 56064 + 128 <- (253,4) 33408
swapper 0 [010] 68.319264: block:block_bio_complete: 253,4 WS 33408 + 128 [0]

Note the matching sector and new_sector fields in the split event traces. This means that the split request size cannot be calculated from their difference. The end result is, again, more complexity to make up for the missing information. In the above case, split events are triggered after the first remap, so the upcoming splits have to be inferred from the size difference of request [(253,4) 33024] when it gets remapped to the next layer. The size information from this remap operation is then carried over to the subsequent split events in order to track the new requests and collect statistics.
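A rough sketch of that inference (assumed logic, reconstructed from the description above rather than copied from sysbpf): the first remap reveals the chunk size, and the sectors and sizes of the upcoming splits follow from it.

# Infer the split bios implied by the first remap when block_split reports
# sector == new_sector. Using the trace above: a 512-sector bio at 33024 on
# (253,4) is remapped down in a 128-sector chunk, so the remaining splits
# must start at 33152, 33280, and 33408.
def expected_splits(queued_sector, queued_size, remapped_size):
    splits = []
    sector = queued_sector + remapped_size
    remaining = queued_size - remapped_size
    while remaining > 0:
        size = min(remapped_size, remaining)
        splits.append((sector, size))
        sector += size
        remaining -= size
    return splits

print(expected_splits(33024, 512, 128))
# [(33152, 128), (33280, 128), (33408, 128)]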

Problems related to device mapper splits do not end there. The first bio never receives a bio complete event; instead, it is the last split request that receives one. The end result is, once again, accuracy issues, as completion stats are skewed.

Fifth, the bio complete event format does not provide any information about the next layer, and there is no static tracepoint on the completion pipeline analogous to the remap event. This requires yet more complexity in the form of a stack mechanism to track layers: as a request travels down, components are pushed onto the stack; as it travels back up, components are popped off. It is not easy to get this right, particularly in cases like splits, and this compounds the accuracy problems.
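A minimal sketch of that stack mechanism, assuming a simple per-request key (the real logic lives in BPF maps and also has to survive splits and merges):

# Push a layer when a request is remapped downward, pop it when a completion
# bubbles back up; the popped entry tells us which upper layer the completion
# statistics belong to.
layer_stack = {}   # request key -> list of (major, minor) layers traversed

def on_remap(key, from_dev):
    layer_stack.setdefault(key, []).append(from_dev)   # remember the layer we came from

def on_bio_complete(key):
    stack = layer_stack.get(key)
    if not stack:
        return None            # nothing tracked (e.g. a split we failed to follow)
    upper = stack.pop()        # this completion belongs to the popped upper layer
    if not stack:
        del layer_stack[key]
    return upper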

Sixth, there is no bio complete event triggered for the block device itself. The rq complete event, which actually belongs to the driver layer, is available instead. It is subsequently followed by a bio complete event for the upper layer.

Lastly, request flags can be dynamic and may influence behavior. In the case of a request with a flush flag, the original request is split into two distinct ones: the first with the flush flag only, and the second with all flags except the flush flag. In the case of a request with a fua flag, once the flush action takes place, the flag is removed from the request’s list of flags. For these reasons, the flush and fua flags have to be excluded when identifying a request.
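In practice this means normalizing the flag portion of the request key. A hedged sketch, assuming the flags are carried in the rwbs string emitted by the block tracepoints (where, as far as I can tell, both the preflush and FUA flags are rendered as an 'F' character):

# Drop flush/FUA markers from the flag string so that the flush-split halves of
# a request still resolve to the same tracked entry.
def normalize_rwbs(rwbs: str) -> str:
    return rwbs.replace("F", "")

def request_key(dev, sector, rwbs):
    return (dev, sector, normalize_rwbs(rwbs))

assert request_key((253, 4), 33024, "FWS") == request_key((253, 4), 33024, "WS")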

Implementation

The test system presented in this article showcases only a subset of device mapper layers available. This lack of testing coverage may expose users to bugs. Devices may behave differently due to a variety of reasons and this tool may require additional work and/or fixes to function properly.

On a similar note, loop device coverage is not provided. Even if support is added for loop devices, without filesystem coverage, full instrumentation of the block stack is not possible at this time.

Certain events may not be properly covered as they are hard to trigger and, therefore, test. Frontmerge events constitute a good example in this category. This, coupled with the fact that debugging is hard, is another reminder that users are likely to hit bugs.

While the goal of this exercise was an attempt at systems engineering, it was obviously also an attempt at complexity and, by extension, overhead. Although I was not able to discern much by comparing its output with that of other tools, I do wonder how much overhead is actually introduced simply by tracing with a design as heavy as this one.

Improvements

Here is a list of improvements that I would like to see as I continue to iterate on this design:

  • Ability to target a specific block stack branch hierarchy.
  • Ability to target a specific cgroup.
  • Support for filesystem layer tracking.
  • Collection of flush statistics.
  • Integration of the page cache layer and its stats for non-direct operations, if possible.

Feedback

Any feedback from anyone who can test this prototype on different block stack configurations is welcome. I am also interested in hearing about potential improvements to the BPF code in terms of quality and performance.


Burak Seydioglu

I am a systems engineer with an interest in open source technologies.