Moses Reuben
Aug 21, 2023

The future of memory architectures in datacenters


Toga Networks Research Center (HUAWEI)

Former Tech Lead at Cisco Systems & Marvell

OS Expert, Linux Kernel

Introduction

As resource usage in data centers increases due to new AI and HPC workloads that demand higher memory capacity and bandwidth, it is becoming harder to keep memory latency low, in the 70–80 ns range typical of accesses within a NUMA boundary.

Currently, the growth in computing far exceeds the growth in memory capacity and bandwidth, as we can see in [1], for several reasons:

1. Hardware challenges in increasing the number of memory channels limit memory capacity

2. Advances in CPU and GPU technologies have dramatically increased the number of compute cores per compute node (core count approximately doubles every two years)

New architectures are emerging to close this gap between computing and memory and prevent memory from becoming a bottleneck for increasing performance.

CXL

Even with the major improvements made in the Linux kernel for handling hot pages and evicting cold pages [4], the main bottleneck in HPC remains the same, namely low CPU utilization caused by memory pressure:

1. The CPU spends a large portion of its time waiting for data to be fetched from main memory into a cache line

2. The penalty paid on a TLB miss, which requires a walk of the page tables, and on minor page faults, which the kernel must resolve

3. The penalty paid on a major page fault, when the page must be brought in from storage

One way to address this is to expand the memory accessible to the system via a CXL connection to a remote memory pool. Although CXL memory has higher access latency, it is the only memory expansion option once the system's maximum local memory capacity has been reached, and with proper management (e.g., hierarchical memory tiering) it can provide a cost-effective solution.

The CXL protocol, which is built on top of PCI Express 5.0, allows an external device (memory or storage) to be connected via PCIe and presented to the OS as local memory through a load/store interface. The challenge many companies are trying to address is reducing CXL latency to 120–130 ns, which would allow the OS to use the remote memory node with a latency comparable to that of a remote NUMA node. CXL also provides dynamic memory allocation from a shared memory device to multiple CXL-attached servers via a PCI hot-plug mechanism. This presents another challenge, as the process must be fast enough to allow a cluster-wide memory orchestration mechanism to optimize memory usage across nodes.

[figure 1]

[figure 2]
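
To the operating system, CXL-attached expansion memory typically shows up as a memory-only NUMA node, i.e. a node with capacity but no CPUs. As a rough illustration, here is a minimal sketch (assuming a Linux host where the CXL region has already been bound and onlined; the CPU-less heuristic is an assumption for illustration, not something mandated by the CXL specification) that lists NUMA nodes and flags the memory-only ones:

```python
# List NUMA nodes and flag memory-only (CPU-less) nodes, which is how
# CXL-attached memory is typically exposed on Linux once it is onlined.
# Uses the standard sysfs node interface; assumes a Linux host.
import glob
import os
import re

def numa_nodes():
    paths = glob.glob("/sys/devices/system/node/node[0-9]*")
    for path in sorted(paths, key=lambda p: int(re.search(r"node(\d+)$", p).group(1))):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        # 'cpulist' is empty for nodes that contain no CPUs.
        with open(os.path.join(path, "cpulist")) as f:
            cpulist = f.read().strip()
        # Per-node MemTotal, reported in kB.
        mem_kb = 0
        with open(os.path.join(path, "meminfo")) as f:
            for line in f:
                if "MemTotal" in line:
                    mem_kb = int(line.split()[-2])
        yield node_id, cpulist, mem_kb

if __name__ == "__main__":
    for node_id, cpulist, mem_kb in numa_nodes():
        kind = "memory-only (e.g. CXL expander)" if not cpulist else "regular"
        print(f"node{node_id}: cpus=[{cpulist or '-'}] mem={mem_kb // 1024} MiB -> {kind}")
```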

CXL Linux kernel support

Currently, work is being done in the Linux kernel to enable memory hot-plug and hot-unplug, allowing memory to be allocated from and released back to a CXL memory device on demand [17].

[figure 3]
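
For reference, the administrative side of this is the memory hot(un)plug interface documented in [17]: each memory block under /sys/devices/system/memory exposes a state file that can be read or written. The sketch below onlines any offline blocks as movable memory; the policy is only an example under that assumption, not a recommendation.

```python
# Online any offline memory blocks via the sysfs memory-hotplug interface
# described in the kernel admin guide [17]. Requires root privileges.
import glob
import os

def online_offline_blocks(target_state="online_movable"):
    for block in sorted(glob.glob("/sys/devices/system/memory/memory[0-9]*")):
        state_path = os.path.join(block, "state")
        with open(state_path) as f:
            state = f.read().strip()
        if state == "offline":
            # "online_movable" places the block in ZONE_MOVABLE, which keeps
            # it easier to offline (hot-unplug) again later.
            with open(state_path, "w") as f:
                f.write(target_state)
            print(f"{os.path.basename(block)}: offline -> {target_state}")

if __name__ == "__main__":
    online_offline_blocks()
```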

Distributed training

Another requirement that is growing as a result of AI and machine-learning applications is the need to transfer memory pages across machines as soon as the data has been processed. RDMA technologies are insufficient here: their latency is in the 300–400 ns range, which is too slow for the demands of AI applications, and it is not keeping pace as systems scale [18], [19], so the answer will not come from RDMA.

Linux Kernel Swap Compression

A recent technology that improves memory performance under pressure is the frontswap interface [13] in the Linux kernel. It compresses pages targeted for eviction and keeps them in DRAM instead of writing them to disk (as the traditional Linux swap path does).

Since there is an increasing number of compute cores available for compression and decompression, and DRAM latency is significantly lower than that of HDDs/SSDs, the overall swap latency is reduced [10].

[figure 4,5]
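
In practice the compressed pool described above is managed by zswap, which builds on the frontswap hooks and is controlled through module parameters in sysfs. A small sketch, assuming a kernel built with zswap support (the parameter values used in enable_zswap() are examples, not tuned recommendations):

```python
# Inspect (and optionally enable) zswap, the compressed in-DRAM swap cache
# built on the frontswap interface [13]. Assumes a kernel with zswap support.
import os

ZSWAP_PARAMS = "/sys/module/zswap/parameters"

def show_zswap():
    # Print the current value of every zswap module parameter.
    for name in sorted(os.listdir(ZSWAP_PARAMS)):
        with open(os.path.join(ZSWAP_PARAMS, name)) as f:
            print(f"{name} = {f.read().strip()}")

def enable_zswap(compressor="lzo", max_pool_percent="20"):
    # Parameters are writable at runtime; requires root.
    for name, value in (("enabled", "Y"),
                        ("compressor", compressor),
                        ("max_pool_percent", max_pool_percent)):
        with open(os.path.join(ZSWAP_PARAMS, name), "w") as f:
            f.write(value)

if __name__ == "__main__":
    show_zswap()
```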

Open Memory Interface (OMI)

The Open Memory Interface [2] is a relatively new alternative to traditional parallel direct-attached DRAM, based on an emerging JEDEC standard promoted by IBM. OMI is an open, standardized version of IBM's Centaur memory buffer [12]. It allows for a significant increase in memory bandwidth and capacity on IBM's POWER CPUs, and its memory-agnostic design allows for tiering and more flexibility when designing systems around that CPU.

[figure 6]

INFINISWAP

The INFINISWAP project, developed at the University of Michigan [15], [16], is an interesting use case that uses the RDMA protocol to implement a distributed remote-memory paging system, replacing traditional, slow disk swap devices with a faster memory-based backend. It allows processes under memory pressure to relieve it by swapping pages to other cluster nodes with available memory instead of swapping to disk.

[figure 7]
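
The placement idea can be summarized with a toy model: pages evicted under memory pressure go to whichever peer currently advertises free memory, and disk is used only as a fallback. This is a conceptual sketch of remote-memory swapping in general, not the actual INFINISWAP implementation (which operates at the block layer over RDMA), and the peer names are made up.

```python
# Toy model of remote-memory swapping: evicted pages are placed on a peer
# node with spare DRAM and fall back to disk only when no peer has room.
# Conceptual sketch only; not the INFINISWAP block-device/RDMA design.
from dataclasses import dataclass, field

@dataclass
class Peer:
    name: str
    free_pages: int
    stored: dict = field(default_factory=dict)  # page_id -> page data

class RemoteSwap:
    def __init__(self, peers):
        self.peers = peers
        self.disk = {}        # slow fallback backing store
        self.location = {}    # page_id -> Peer or "disk"

    def swap_out(self, page_id, data):
        # Prefer the peer advertising the most free memory.
        candidates = [p for p in self.peers if p.free_pages > 0]
        if candidates:
            peer = max(candidates, key=lambda p: p.free_pages)
            peer.stored[page_id] = data
            peer.free_pages -= 1
            self.location[page_id] = peer
        else:
            self.disk[page_id] = data
            self.location[page_id] = "disk"

    def swap_in(self, page_id):
        where = self.location.pop(page_id)
        if where == "disk":
            return self.disk.pop(page_id)
        where.free_pages += 1
        return where.stored.pop(page_id)

if __name__ == "__main__":
    rs = RemoteSwap([Peer("nodeB", free_pages=2), Peer("nodeC", free_pages=1)])
    rs.swap_out("pg1", b"dirty page contents")
    print(rs.swap_in("pg1"))  # served from nodeB's DRAM, not from disk
```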

TPP

TPP [14] is an OS-level transparent page placement mechanism that uses remote memory over CXL as a secondary memory tier. It identifies hot and cold pages so that, ideally, cold pages are placed in CXL-attached memory and hot pages are kept in main memory. This allows systems to replace part of their local DRAM with remote CXL memory while maintaining, and in some cases even improving, overall performance.

[figure 8]
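
Parts of this tiering behavior have since landed in mainline Linux, where demotion of cold pages to a lower tier and promotion of hot pages can be toggled through kernel knobs. The read-only sketch below reports them; the knob paths and their semantics depend on kernel version and are assumptions about a reasonably recent kernel rather than part of the TPP paper itself.

```python
# Report kernel knobs related to transparent page placement / memory tiering.
# Knob availability varies by kernel version; this script only reads them.
import os

KNOBS = {
    # Allow reclaim to demote cold pages to a lower (e.g. CXL) memory tier
    # instead of swapping or discarding them.
    "demotion": "/sys/kernel/mm/numa/demotion_enabled",
    # NUMA balancing mode; on recent kernels a memory-tiering mode promotes
    # hot pages back to the top tier.
    "numa_balancing": "/proc/sys/kernel/numa_balancing",
}

def report():
    for name, path in KNOBS.items():
        if os.path.exists(path):
            with open(path) as f:
                print(f"{name}: {path} = {f.read().strip()}")
        else:
            print(f"{name}: {path} not present on this kernel")

if __name__ == "__main__":
    report()
```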

Processing in memory (PIM)

Another method to increase memory locality is software and hardware that allows computation to run on the memory device itself, without the need to transfer data to and from the CPU. The generic term used to describe this method is Processing In Memory (PIM). For example, it can be applied to common memory operations such as memcpy, memset, and bitwise operations.

An engine implemented in hardware [7], [8], [9] can significantly reduce fetch/store data movement to and from the CPU caches (L1/L2/L3), thus reducing cache-line copies and contention.
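
A back-of-envelope model makes the data-movement argument concrete. The sketch below compares the bytes that cross the memory bus for a plain CPU copy against an in-memory copy that only needs a small command descriptor; the descriptor size and the 2x factor are simplifying assumptions for illustration, not measurements from [7], [8], [9].

```python
# Illustrative model of why in-memory copy offload reduces bus traffic:
# a CPU memcpy pulls source lines through the cache hierarchy and writes
# destination lines back, while a PIM engine only receives a command
# descriptor. The "PIM" side is a modeling assumption, not a real API.

CACHE_LINE = 64  # bytes

def cpu_copy_traffic(n_bytes):
    # Roughly 2x the payload crosses the bus (source reads + destination
    # write-backs); write-allocate caches can make it closer to 3x.
    return 2 * n_bytes

def pim_copy_traffic(descriptor_bytes=CACHE_LINE):
    # The copy happens inside the memory device; only the descriptor
    # (source address, destination address, length) travels over the bus.
    return descriptor_bytes

if __name__ == "__main__":
    for size in (4 << 10, 1 << 20, 64 << 20):
        cpu = cpu_copy_traffic(size)
        pim = pim_copy_traffic()
        print(f"{size >> 10:>8} KiB copy: ~{cpu >> 10} KiB on the bus (CPU) "
              f"vs ~{pim} B command descriptor (PIM)")
```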

To identify code execution blocks that can benefit from this optimization, both compile-time and run-time approaches can be used.

While this technology has been discussed for some time, technical and practical challenges have kept it from reaching the market so far. One major issue is that adding a hardware unit on the memory bus may slow down non-offloaded fetch/store instructions as well; another is contention and synchronization between the CPU and the on-memory hardware unit.

Summary

DRAM bandwidth, capacity, and latency have always been critical for system performance, and they become even more so as HPC and AI applications grow more prevalent [20], [21]. We can expect CXL and RDMA technologies to play a large part in emerging memory architectures that address memory performance.

CXL will need to be dynamic, with high capacity and low latency (much lower than today’s latency), giving the ability to manage cluster-wide memory resources with minimal to no interference to system availability.

Mechanisms such as TPP and INFINISWAP will also play a part in improving memory usage on large clusters, leveraging the Linux kernel's advanced swap subsystem [22].

PIM looks promising, but it might take some time before PIM-enabled devices reach the market.

References

[1] — https://www.nextplatform.com/2022/06/16/meta-platforms-hacks-cxl-memory-tier-into-linux/

[2] — https://fuse.wikichip.org/news/2893/ibm-adds-power9-aio-pushes-for-an-open-memory-agnostic-interface/

[3] — https://lwn.net/Articles/454795/

[4] — https://lwn.net/Articles/851184/

[5] — https://dl.acm.org/doi/10.1145/3296957.3173177

[6] — https://users.ece.cmu.edu/~omutlu/pub/pim-enabled-instructons-for-low-overhead-pim_isca15.pdf

[7] — https://dl.acm.org/doi/abs/10.1145/3007787.3001159

[8] — https://dl.acm.org/doi/pdf/10.1145/3087556.3087582

[9] — https://ieeexplore.ieee.org/document/8686556

[10] — https://www.researchgate.net/figure/Interaction-among-swap-front-end-zswap-and-zpool_fig2_335941803

[11] — https://link.springer.com/chapter/10.1007/978-981-16-7487-7_7

[12] — https://en.wikichip.org/wiki/ibm/centaur

[13] — https://lwn.net/Articles/454795/

[14] — https://arxiv.org/abs/2206.02878

[15] — https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/gu

[16] — https://www.nextplatform.com/2017/06/12/clever-rdma-technique-delivers-distributed-memory-pooling/

[17] — https://docs.kernel.org/admin-guide/mm/memory-hotplug.html

[18] — https://dl.acm.org/doi/pdf/10.1145/2785956.2787484

[19] — https://ieeexplore.ieee.org/abstract/document/7840611

[20] — https://u.cs.biu.ac.il/~wisemay/mso2005.pdf

[21] — https://academic.oup.com/comjnl/article-abstract/49/3/297/565058

[22] — https://dl.acm.org/doi/10.1145/3296957.3173177

List of figures

1. https://www.nextplatform.com/2022/06/16/meta-platforms-hacks-cxl-memory-tier-into-linux/

2. https://www.nextplatform.com/2022/06/16/meta-platforms-hacks-cxl-memory-tier-into-linux/

3. https://www.nextplatform.com/2022/06/16/meta-platforms-hacks-cxl-memory-tier-into-linux/

4. https://www.researchgate.net/figure/Interaction-among-swap-front-end-zswap-and-zpool_fig2_335941803

5. https://creativecommons.org/licenses/by/4.0/

6. https://fuse.wikichip.org/news/2893/ibm-adds-power9-aio-pushes-for-an-open-memory-agnostic-interface/

7. https://www.nextplatform.com/2017/06/12/clever-rdma-technique-delivers-distributed-memory-pooling/

8. https://www.nextplatform.com/2022/06/16/meta-platforms-hacks-cxl-memory-tier-into-linux/

Keywords

MEMORY, DRAM, CXL, ZSWAP, OMI, INFINISWAP, TPP, PIM.