What is RoCEv2?

Ravi Kishore Chitakani
4 min read · Jan 5, 2023

RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. RoCEv2 is the second version of the protocol: it encapsulates the RDMA transport headers in UDP/IP (UDP destination port 4791), which makes the traffic routable across Layer 3 boundaries and provides improvements in performance and functionality over the original RoCE.

RoCEv2 allows low-latency, high-bandwidth communication between servers or other devices connected over an Ethernet network. It achieves this by bypassing the kernel network stack and letting the network adapter read and write application memory directly over the network. This can significantly reduce the overhead of network communication, improving performance in applications such as high-performance computing, storage, and virtualization.

RoCEv2 also supports Quality of Service (QoS) and congestion control, allowing it to share bandwidth and resources fairly with other traffic on the network. It also supports multicast, allowing efficient communication with multiple recipients.

Overall, RoCEv2 is a useful tool for applications that require low-latency, high-bandwidth communication over Ethernet networks, and can provide significant performance improvements in certain scenarios.
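
To make the encapsulation concrete, here is a minimal Python sketch (an illustration, not production code) that packs the 12-byte InfiniBand Base Transport Header (BTH) that a RoCEv2 packet carries inside a UDP datagram addressed to destination port 4791. The opcode constants come from the InfiniBand RC transport; the queue pair number and sequence number below are made-up example values.

```python
import struct

ROCEV2_UDP_DPORT = 4791          # IANA-assigned UDP destination port for RoCEv2

# A few RC (Reliable Connection) transport opcodes.
RC_RDMA_READ_REQUEST = 0x0C
RC_RDMA_READ_RESPONSE_ONLY = 0x10
RC_ACKNOWLEDGE = 0x11

def build_bth(opcode: int, dest_qp: int, psn: int,
              pkey: int = 0xFFFF, ack_req: bool = False) -> bytes:
    """Pack the 12-byte InfiniBand Base Transport Header (BTH)."""
    flags = 0x00  # SE | MigReq | PadCnt | TVer, all zero in this example
    return struct.pack(
        "!BBHB3sB3s",
        opcode,                       # OpCode
        flags,                        # SE | M | PadCnt | TVer
        pkey,                         # Partition Key
        0,                            # reserved
        dest_qp.to_bytes(3, "big"),   # Destination QP (24 bits)
        0x80 if ack_req else 0x00,    # AckReq bit | reserved
        psn.to_bytes(3, "big"),       # Packet Sequence Number (24 bits)
    )

if __name__ == "__main__":
    # Example: a BTH for an RDMA READ request to QP 0x000123, PSN 0.
    bth = build_bth(RC_RDMA_READ_REQUEST, dest_qp=0x123, psn=0)
    print(len(bth), bth.hex())   # 12 bytes, carried inside UDP dport 4791
```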

Here is a high-level overview of the RoCEv2 packet flow:

  1. A client device sends a memory access request to a server device, specifying the location and size of the data it wants to access in the server’s memory.
  2. The server device receives the request and prepares to send the data. It also sends an acknowledgement (ACK) message back to the client to confirm that it received the request.
  3. The client device receives the ACK and sends a Remote Direct Memory Access (RDMA) Read request to the server. This request carries the remote memory key (R_Key, often referred to as an MKey) that authorizes the client to access the specified memory location on the server; the wire header that carries these fields is sketched just after this list.
  4. The server device receives the RDMA Read request and retrieves the data from its memory. It then sends the data back to the client in one or more RDMA Read Response packets; the read response itself acknowledges the request, so no separate ACK is needed.
  5. The client device receives the data; arrival of the last read response completes the operation, and the client’s adapter reports the completion to the application.
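
The "location, size, and key" in step 3 travel on the wire in the RDMA Extended Transport Header (RETH) that follows the BTH in an RDMA Read request. Here is a minimal sketch of packing it; the virtual address, remote key, and length are made-up example values.

```python
import struct

def build_reth(virtual_addr: int, r_key: int, dma_length: int) -> bytes:
    """Pack the 16-byte RDMA Extended Transport Header (RETH).

    The RETH follows the BTH in an RDMA Read request and tells the responder
    where to read, how much to read, and which registered memory region the
    remote key authorizes.
    """
    return struct.pack(
        "!QII",
        virtual_addr,   # 64-bit virtual address in the server's registered buffer
        r_key,          # 32-bit remote key (R_Key) for that memory region
        dma_length,     # 32-bit length of the transfer in bytes
    )

# Example: read 4 KiB starting at a (hypothetical) registered address.
reth = build_reth(virtual_addr=0x7F0000001000, r_key=0x1234ABCD, dma_length=4096)
assert len(reth) == 16
```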

[Diagram: RDMA Read request/response exchange between client and server]

Here is a step-by-step walkthrough of this RoCEv2 packet flow in more detail:

  1. The client device sends a memory access request to the server, specifying the location and size of the data it wants to access in the server’s memory. This request may be part of a larger application protocol, such as a storage protocol or the Message Passing Interface (MPI).
  2. The server device receives the request and prepares to send the data. It also sends an ACK message back to the client to confirm that it received the request.
  3. The client device receives the ACK and sends an RDMA Read request to the server. This request includes the R_Key that authorizes the client to access the specified memory location on the server, along with the starting virtual address and the size of the data being requested; any QoS or congestion-control markings (such as DSCP and ECN bits) are carried in the IP header of the packet.
  4. The server device receives the RDMA Read request and retrieves the data from its memory. It then sends the data back to the client in one or more RDMA Read Response packets, each of which also acknowledges the request (see the acknowledgement-header sketch after this list).
  5. The client device receives the data; when the last read response arrives, the client’s adapter generates a completion queue entry (CQE) to signal completion of the request to the application, which may then process the data and issue further requests.
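
The acknowledgements and read responses mentioned in steps 4 and 5 carry a 4-byte ACK Extended Transport Header (AETH) after the BTH. It holds a syndrome byte (ACK or NAK plus credit information) and the 24-bit Message Sequence Number of the request being acknowledged. A sketch with example values:

```python
import struct

def build_aeth(syndrome: int, msn: int) -> bytes:
    """Pack the 4-byte ACK Extended Transport Header (AETH).

    The AETH rides in acknowledgement and RDMA Read Response packets. The
    syndrome byte encodes whether this is an ACK or a NAK (plus credit or
    error details), and the MSN is the Message Sequence Number of the
    request being acknowledged.
    """
    return struct.pack("!B3s", syndrome & 0xFF, msn.to_bytes(3, "big"))

# Example: an ACK-style syndrome for message sequence number 7 (example values).
aeth = build_aeth(syndrome=0x00, msn=7)
assert len(aeth) == 4
```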

What is PFC?

PFC (Priority-based Flow Control), defined in IEEE 802.1Qbb, is a link-level flow-control mechanism that allows devices to pause and resume the transmission of data per traffic priority. PFC is used to prevent packet loss under congestion and to ensure that high-priority traffic is not delayed by lower-priority traffic; RoCEv2 deployments typically rely on it to make the Ethernet fabric lossless.

PFC works by letting a device send special per-priority pause frames to its link neighbor, indicating that it is temporarily unable to receive more traffic in a given class. Each frame carries a class-enable vector and a pause time (measured in quanta of 512 bit times) for each of eight priority levels; the neighbor stops sending traffic in the paused classes until the timer expires or a follow-up frame with a pause time of zero resumes them. Because there are eight priorities, different types of traffic can be paused independently of one another.
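
To make the mechanism concrete, here is a minimal Python sketch that packs an IEEE 802.1Qbb PFC pause frame: a MAC Control frame (EtherType 0x8808, opcode 0x0101) carrying a class-enable vector and one 16-bit pause time per priority. The MAC addresses and pause values are example numbers; the priority-to-traffic mapping is an assumption of the example.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180C2000001")  # MAC Control reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101                            # 0x0001 is the classic (all-traffic) PAUSE

def build_pfc_frame(src_mac: bytes, pause_quanta: list[int]) -> bytes:
    """Build an IEEE 802.1Qbb Priority-based Flow Control frame.

    pause_quanta is a list of eight 16-bit values, one per priority (0-7).
    A non-zero value pauses that priority for that many 512-bit-time quanta;
    zero means "do not pause" (or "resume" if previously paused).
    """
    assert len(pause_quanta) == 8
    # The class-enable vector has bit n set if priority n's timer field is valid.
    enable_vector = 0
    for prio, quanta in enumerate(pause_quanta):
        if quanta:
            enable_vector |= 1 << prio

    header = PFC_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    body = struct.pack("!HH8H", PFC_OPCODE, enable_vector, *pause_quanta)
    # Pad to the 60-byte minimum Ethernet frame size (before the 4-byte FCS).
    return (header + body).ljust(60, b"\x00")

# Example: pause priority 3 (a class RoCE traffic is often mapped to) for
# 0xFFFF quanta, and leave the other priorities untouched.
frame = build_pfc_frame(bytes.fromhex("001122334455"),
                        [0, 0, 0, 0xFFFF, 0, 0, 0, 0])
print(len(frame), frame.hex())
```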

PFC is often used in conjunction with Quality of Service (QoS) to ensure that important traffic is given precedence over less important traffic. It is commonly used in data centers and other high-bandwidth networks where it is important to ensure that critical applications and services are not impacted by network congestion.

What is DCQCN?

Data Center Quantized Congestion Notification (DCQCN) is a congestion-control scheme that aims to improve the performance of RDMA traffic in congested networks. Switches mark packets with Explicit Congestion Notification (ECN) when their queues start to build up; when the receiving NIC sees ECN-marked packets, it sends Congestion Notification Packets (CNPs) back to the sender, and the sender’s NIC reduces its transmission rate, recovering gradually once the CNPs stop. Based on this feedback, the sender keeps its rate close to the available capacity of the network, helping to prevent congestion and improve overall network performance.
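
The sender-side reaction to CNPs can be illustrated with a simplified, DCQCN-style rate controller: each CNP triggers a multiplicative rate cut driven by a running congestion estimate (alpha), and while no CNPs arrive the rate climbs back toward a remembered target. The constants and the recovery scheme below are illustrative simplifications, not the exact algorithm or default parameters of any particular NIC.

```python
class DcqcnLikeRateController:
    """Simplified, DCQCN-style sender rate control (illustrative only)."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16):
        self.target_rate = line_rate_gbps   # rate to recover toward
        self.current_rate = line_rate_gbps  # rate actually used for sending
        self.alpha = 1.0                    # congestion estimate in [0, 1]
        self.g = g                          # gain for the alpha moving average

    def on_cnp(self) -> None:
        """Receiver reported ECN-marked traffic: cut the sending rate."""
        self.target_rate = self.current_rate
        self.current_rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_timer(self) -> None:
        """No CNP in this interval: decay alpha and recover the rate."""
        self.alpha = (1 - self.g) * self.alpha
        self.current_rate = (self.current_rate + self.target_rate) / 2

# Example: a 100 Gb/s sender receives three CNPs, then recovers for a while.
rc = DcqcnLikeRateController(line_rate_gbps=100.0)
for _ in range(3):
    rc.on_cnp()
for _ in range(5):
    rc.on_timer()
print(f"current rate ~ {rc.current_rate:.1f} Gb/s")
```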

DCQCN is specifically designed for use in data centers, where it can help to improve the efficiency and performance of networked applications and services. It is particularly useful in situations where there are a large number of flows competing for limited bandwidth, as it allows the network to allocate resources more efficiently and fairly.

DCQCN is not itself a queue-management algorithm; it relies on the switches’ Active Queue Management (AQM), typically RED/WRED-style marking thresholds, to decide when to set ECN on packets, and it is normally deployed together with PFC. DCQCN keeps switch queues short so that PFC pauses are rare, while PFC remains the last-resort safety net that keeps the fabric lossless.
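
The congestion signal itself lives in the two ECN bits of the IP header: senders mark RoCEv2 packets as ECN-capable (ECT), a congested switch rewrites the field to CE instead of dropping the packet, and the receiver turns CE-marked arrivals into CNPs. A small sketch of reading and setting those bits in the IPv4 TOS/traffic-class byte (example DSCP value only):

```python
# The two low-order bits of the IPv4 TOS byte (DSCP:6 | ECN:2) carry ECN.
NOT_ECT = 0b00   # not ECN-capable
ECT_1   = 0b01   # ECN-capable transport (1)
ECT_0   = 0b10   # ECN-capable transport (0), commonly used by RoCEv2 senders
CE      = 0b11   # congestion experienced (set by a congested switch)

def get_ecn(tos: int) -> int:
    """Extract the 2-bit ECN field from a TOS/traffic-class byte."""
    return tos & 0b11

def set_ecn(tos: int, codepoint: int) -> int:
    """Return the TOS byte with its ECN field replaced by `codepoint`."""
    return (tos & 0b11111100) | (codepoint & 0b11)

# Example: a packet sent with DSCP 26 and ECT(0) gets re-marked CE by a switch.
tos_sent = set_ecn(26 << 2, ECT_0)
tos_marked = set_ecn(tos_sent, CE)
assert get_ecn(tos_marked) == CE
```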

Ravi Kishore Chitakani

Hyperscale Data Center Networking Expert | RDMA | Network Automation | Cisco SDN | Cloud Security | SONiC | VXLAN EVPN