Exploring the Nuances of PCI and PCIe
Here at Google Cloud, we recently began offering GCP customers the ability to use GPUs. But before we could do that, we needed to be 100% certain that these devices, which connect via PCI Express (PCIe), could not be compromised. That led our team to explore the nuances of the PCIe spec and those of its predecessor, PCI, in preparation for building a software fuzzer to test the proposed devices. Here’s a sampling of what we found, with a focus on security.
If you were installing a new sound, network, or video card in your computer in the ’90s, it was probably built to the old PCI standard, which described how to link these devices via physical wires to other parts of your computer.
PCI was exciting for its time. PCI devices were easier to install than devices built to the previous ISA standard. When an ISA device was plugged into the I/O bus, a computer could access it by communicating over the matching address on the bus. But it was hard to know in advance which devices would respond to a request, where the devices were located in I/O space, or whether the computer had the correct drivers to interact with the ISA card, and there was no reliable way to keep devices from conflicting with each other.
PCI changed this by introducing the notion of “configuration space,” a set of registers on each device that lets the system ask the card for information about itself and respond accordingly.
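To make the idea concrete, here is a minimal sketch of how software might decode the first 16 bytes of a device's configuration space. The field layout follows the standard PCI header; the `ConfigHeader` class, the `parse_config_header` helper, and the sample vendor and device IDs are assumptions made up for illustration, not taken from any real device.

```python
import struct
from dataclasses import dataclass


@dataclass
class ConfigHeader:
    vendor_id: int
    device_id: int
    command: int
    status: int
    revision_id: int
    class_code: int  # 24 bits: base class, subclass, programming interface
    header_type: int


def parse_config_header(raw: bytes) -> ConfigHeader:
    """Decode the first 16 bytes of a PCI configuration space image."""
    if len(raw) < 16:
        raise ValueError("need at least 16 bytes of configuration space")
    # All fields are little-endian: four 16-bit registers, then one 32-bit
    # word holding the revision ID (low byte) and class code (upper 24 bits),
    # then cache line size, latency timer, header type, and BIST (8 bits each).
    (vendor, device, command, status, rev_and_class,
     _cache_line, _latency, header_type, _bist) = struct.unpack_from("<HHHHIBBBB", raw)
    return ConfigHeader(
        vendor_id=vendor,
        device_id=device,
        command=command,
        status=status,
        revision_id=rev_and_class & 0xFF,
        class_code=rev_and_class >> 8,
        header_type=header_type,
    )


# Hypothetical device image: vendor 0x1234 and device 0x5678 are made up.
# Base class 0x03 marks a display controller.
sample = struct.pack("<HHHHIBBBB", 0x1234, 0x5678, 0x0006, 0x0010,
                     (0x030000 << 8) | 0xA1, 0, 0, 0x00, 0)
header = parse_config_header(sample)
print(f"vendor={header.vendor_id:04x} device={header.device_id:04x} "
      f"class={header.class_code:06x}")
```

On Linux, the same bytes can typically be read from the `config` file under `/sys/bus/pci/devices/<address>/`, so a helper like this could be pointed at a real device instead of the packed sample.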
PCI also had greater bandwidth than its predecessors, and thus it quickly became ubiquitous. However, as time went on, PCI’s limitations became more apparent:
- The speed of a set of PCI devices was limited to that of the slowest device on the bus. Connecting one outdated peripheral would slow down all devices.
- At the same time, demand for speed was at an all-time high, as gigabit ethernet became widespread, and PCI devices couldn’t keep up.
A new standard was needed.
To take on the limitations of PCI, PCIe needed to tackle issues with bus sharing and bus contention.
When multiple devices were on the same PCI bus, PCI was forced to “clock down” and match the speed of the slowest device on the bus. PCIe uses some electrical cleverness to address this. By encoding data using the 8b/10b line code, PCIe is able to encode both data and clock information in a single signal, removing the need for an external clock and greatly increasing potential bandwidth. (Newer versions of PCIe have continued to improve on this; PCIe 3.0 and onward encode with 128b/130b and enable even faster transfer rates.)
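The cost of each line code is easy to work out. The sketch below assumes the commonly published per-lane signaling rates of 2.5 GT/s for PCIe 1.0 and 8 GT/s for PCIe 3.0 (figures that are not stated in this post) and computes how much of each lane's raw rate is left for data once the encoding overhead is paid. The function name is a hypothetical helper.

```python
def effective_gbps(transfer_rate_gt: float, payload_bits: int, total_bits: int) -> float:
    """Usable Gbit/s per lane after line-code overhead."""
    return transfer_rate_gt * payload_bits / total_bits


# 8b/10b: every 8 data bits travel as 10 bits on the wire (20% overhead).
gen1 = effective_gbps(2.5, 8, 10)

# 128b/130b: every 128 data bits travel as 130 bits (about 1.5% overhead).
gen3 = effective_gbps(8.0, 128, 130)

print(f"PCIe 1.0 lane: {gen1:.3f} Gbit/s usable of 2.5 GT/s")
print(f"PCIe 3.0 lane: {gen3:.3f} Gbit/s usable of 8.0 GT/s")
```

Under these assumptions, a PCIe 1.0 lane carries 2.0 Gbit/s of data, while a PCIe 3.0 lane carries a little under 7.9 Gbit/s, so most of the gain comes from the faster signaling rate and the rest from the leaner encoding.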
Additionally, PCIe needed a way to handle high-speed data streaming. If a device didn’t have enough buffering to contain all the data it wanted — as is the case with, say, high-definition video — then naturally you’d like the device to continuously stream the data. However, while that device was busily streaming a huge amount of data, it would put a stranglehold on the bus, preventing all the other devices from doing anything. PCIe resolves this issue by allowing for packet fragmentation, breaking up the data stream into smaller packets that can be transmitted via the transaction layer protocol that underlies the PCIe fabric.
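A toy model shows why breaking a stream into packets helps. The sketch below splits each device's data into chunks no larger than a maximum payload size and then takes turns between devices. The round-robin ordering, the 256-byte limit, and the device names are assumptions chosen for illustration; real PCIe arbitration and flow control are considerably more involved.

```python
from itertools import zip_longest
from typing import Dict, Iterator, List, Tuple


def split_payload(data: bytes, max_payload: int = 256) -> Iterator[bytes]:
    """Yield consecutive chunks of at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    for offset in range(0, len(data), max_payload):
        yield data[offset:offset + max_payload]


def interleave(streams: Dict[str, bytes], max_payload: int = 256) -> List[Tuple[str, bytes]]:
    """Round-robin between devices, one packet-sized chunk per turn."""
    queues = [
        [(name, chunk) for chunk in split_payload(data, max_payload)]
        for name, data in streams.items()
    ]
    schedule: List[Tuple[str, bytes]] = []
    for turn in zip_longest(*queues):
        # Devices that have run out of data contribute None for this turn.
        schedule.extend(item for item in turn if item is not None)
    return schedule


# A large video stream no longer blocks a small network transfer.
schedule = interleave({"video": bytes(1000), "nic": bytes(300)})
for name, chunk in schedule:
    print(f"{name}: {len(chunk)} bytes")
```

In this model the network card gets its first turn after a single 256-byte video packet, rather than waiting for the whole video stream to finish.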
Thus, it’s easier to think of PCIe as a network, rather than a physical bus. Each device has an address, and the spec describes functionality for flow control, error detection, and retransmissions, none of which existed in PCI.
This is all good news for performance-sensitive devices like GPUs. However, it also opens up a broad set of security concerns.
PCIe in focus
It’s been noted in the security community that standards and specs often contain low-hanging fruit in the form of provable design vulnerabilities. Our team spent a substantial amount of time combing the spec for edge cases and gaps to defend against.
One example, in an otherwise-innocuous section of the spec, is this note on Multicast TLPs:
“With the exception of ACS Source Validation, ACS access controls are not applicable to Multicast TLPs (see Section 6.14), and have no effect on them.”
It’s unintuitive that these TLPs should necessarily be exempt from ACS controls. Fortunately, there’s a separate mechanism for securing multicast TLPs, by assigning devices to “multicast groups” that are allowed to multicast to each other. However, this requires additional planning and effort by system designers and administrators.
A careful reading of the PCIe spec reveals multiple edge cases like this, and is vital to informing efforts to secure these devices.
The PCIe spec spans more than one thousand pages, and covers error handling, data and transaction layer design, the physical link fabric, dozens of capability registers, and much more — and that’s just the spec. Implementations of the spec can vary and introduce additional complexity. To understand the details of how the spec is implemented in a particular device — which registers correspond to which PCIe settings, which registers are accessed by Base Address Registers, and so on — you have to crack open the manual for that device. These manuals often contain hundreds of additional pages that a security team needs to refer to when testing a device.
Errata and updates
PCIe is now an old standard. Originally created in 2003, it’s been revised, updated, and extended continuously for more than a decade, with the most recent update in December 2016. Additionally, there are numerous errata for each revision. Most of these changes are innocuous from a security standpoint, but it’s still important to check for revisions that introduce vulnerabilities as unintended consequences of other changes.
PCIe operates like a network, complete with its own protocols in the form of TLPs (Transaction Layer Packets) and DLLPs (Data Link Layer Packets). There are some security benefits to this approach. If PCIe devices instead used traditional hardware buses, there would be no effective way to deal with an untrusted device — there would be no way to know where a data access came from. The network architecture, by contrast, allows us to view source and destination ID for each packet, and allows us to manage, at each switch, whether we want packets to proceed onward to their destination. This provides a mechanism for mitigating the risk from malicious device behavior. However, it also leads to additional complexity, which must be carefully managed to limit security risk.
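As a rough illustration of that idea, the sketch below decodes a 16-bit PCIe ID into its bus, device, and function numbers (assuming the traditional layout of 8, 5, and 3 bits) and checks a packet against an allow-list. The `may_forward` policy function and the example IDs are hypothetical. The model also treats every packet as carrying a destination ID for simplicity, whereas real memory requests are routed by address rather than by ID.

```python
from typing import NamedTuple, Set, Tuple


class BDF(NamedTuple):
    bus: int
    device: int
    function: int

    def __str__(self) -> str:
        return f"{self.bus:02x}:{self.device:02x}.{self.function}"


def decode_id(raw: int) -> BDF:
    """Split a 16-bit ID into bus (8 bits), device (5 bits), function (3 bits)."""
    if not 0 <= raw <= 0xFFFF:
        raise ValueError("ID must fit in 16 bits")
    return BDF(bus=raw >> 8, device=(raw >> 3) & 0x1F, function=raw & 0x7)


def may_forward(source: int, destination: int,
                allowed: Set[Tuple[BDF, BDF]]) -> bool:
    """Hypothetical switch policy: forward only explicitly allowed pairs."""
    return (decode_id(source), decode_id(destination)) in allowed


# Made-up topology: the device at 01:00.0 may talk to 00:00.0 and nothing else.
allowed_pairs = {(decode_id(0x0100), decode_id(0x0000))}

print(decode_id(0x020B))                           # 02:01.3
print(may_forward(0x0100, 0x0000, allowed_pairs))  # True
print(may_forward(0x0100, 0x020B, allowed_pairs))  # False
```

A default-deny list like this is only one possible policy; the point is that per-packet IDs give a switch something to base a decision on.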
Limited integrated security
PCIe has a set of features known as Access Control Services (ACS), a name which suggests it may offer a convenient, definitive solution for securing PCIe devices. However, there is no official set of guidelines from the PCIe working group on which subset of ACS settings is necessary to securely run devices in the cloud. Furthermore, looking at the initial ACS proposal, we see that device security is only a subset of ACS’s purpose. Some of the items described are security-oriented, such as permissions validation for requests between downstream components, but others are focused on gracefully handling broken or malfunctioning devices, like mitigating the effects of packet corruption. A malfunctioning device can cause harm, but a malicious device is a serious concern as well. We focused on exploring the efficacy of ACS as a safeguard against motivated attackers.
With that as the backdrop, we set forth to build our fuzzer. For more information about how we did that, check out “Fuzzing PCI Express: Security in plaintext” on the Google Cloud Platform blog.