Exploring NUMA

Amer Ather
12 min read · Dec 30, 2022


In Symmetric Multiprocessor (SMP) systems, all physical memory is seen as a single pool. All cpus share hardware resources and the same address space, since a single instance of the kernel runs across them.

When physical memory and IO devices are equidistant, in terms of latency, from all physical cpus (sockets), the system is called UMA (Uniform Memory Access).

In a UMA configuration, all physical cpus access memory through the same memory controller and share the same bus. Building a large SMP server this way is difficult due to the physical limitations of a shared bus and the higher bus contention as the number of cpus increases.

NUMA Topology

NUMA (Non-Uniform Memory Access) architecture allows building larger system configurations, but at the cost of varying memory latencies.

A single cpu socket (a socket may have multiple cores, each with two hyper-threads) is UMA. However, a system designed with two or more cpu sockets is considered NUMA. NUMA is a design trade-off, as it introduces higher memory latencies.

The primary benefit of NUMA servers is higher aggregate throughput in all resource dimensions (compute, memory, network, storage), which is sometimes not possible with UMA servers.

The Linux kernel is NUMA aware and takes processor affinity and data locality into account to keep memory latency low, if possible. The libnuma library and the “numactl” command allow an application to offer hints to the Linux kernel on how it wants its memory managed.

NUMA Nodes

Typically, NUMA systems are made up of a number of nodes, each with its own cpus, memory and IO devices. Nodes are connected via a high-speed interconnect, which is used to access memory and IO devices on remote nodes.

A numa node has its own local memory and devices and accesses another node's memory and devices via the interconnect

Each NUMA node acts as a UMA SMP system, with fast access to local memory and IO devices but relatively slower access to remote nodes' memory.

Providing each node with its own local memory reduces the contention associated with a shared memory bus, found in UMA servers, and thus allows the system to achieve higher aggregate memory bandwidth. In general, the cost of accessing memory increases with the distance from the cpu. Thus, the more data fails to be local to the node that accesses it, the more a memory-intensive workload will suffer from the architecture.

Linux Kernel is System Topology Aware

ACPI (Advanced Configuration and Power Interface) firmware (BIOS) builds a System Resource Affinity Table (SRAT) that associates each cpu and block of memory with a proximity domain, called a numa node.

The SRAT describes which cpu(s) and memory ranges belong to a particular NUMA node. This information is used to map the memory of each node into a single sequential block of the memory address space.

The Linux kernel uses the SRAT to understand which memory bank is local to a physical cpu and attempts to allocate local memory for each cpu. The System Locality Information Table (SLIT), also built by ACPI, provides a matrix that describes the relative distance (i.e. memory latency) between these proximity domains.

In the past, more nodes meant overall higher latencies due to the distance between nodes. Modern point-to-point architectures have moved away from a ring topology to a full mesh topology, keeping hop counts fixed no matter how large the numa configuration.
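These distances are visible from user space. Below is a minimal libnuma sketch (an illustration, assuming the libnuma development package is installed and the program is linked with -lnuma) that prints the SLIT-style distance matrix reported by the kernel:

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {                /* kernel/hardware is not NUMA capable */
        fprintf(stderr, "NUMA is not available on this system\n");
        return EXIT_FAILURE;
    }

    int max_node = numa_max_node();            /* highest numa node number */
    printf("numa nodes: %d\n", max_node + 1);

    /* numa_distance() returns the relative distance: 10 means local, larger means farther */
    for (int from = 0; from <= max_node; from++) {
        for (int to = 0; to <= max_node; to++)
            printf("%4d", numa_distance(from, to));
        printf("\n");
    }
    return EXIT_SUCCESS;
}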

Cache Coherent NUMA (ccNUMA)

Since each cpu core (a physical cpu is made up of multiple cores) maintains its own set of caches, multiple copies of the same data may exist, which introduces cache-coherency issues.

Cache coherency means any variable that is to be used by a cpu must have a consistent value: a load instruction (read of a memory location) must retrieve the last store (write to that memory location).

Without cache coherency, data modified in one cpu's cache may not be visible to other cpus, which can result in data corruption. SMP requires the platform to support cache coherency in hardware. This is achieved by a hardware protocol called cache snooping.

Snooping guarantees that all processors have the same view of the current state of data or a lock in physical memory. Cache snooping monitors cpu caches for modified data.

Cpu caches are divided into equal-size storage slots, called cache lines. Data is fetched from physical memory in 64-byte chunks and placed into cache lines for quick access.

When one cpu modifies a cache line, all other cpu caches are checked for the same cache line. If found, the cache line is invalidated or discarded from those caches.

When a cpu tries to access data not found in its cache, cache snooping first checks the other cpus' cache lines. If found, the data is accessed from the other cpu's cache instead of being fetched from memory. The write-invalidate snooping feature erases all copies of the data in all other cpu caches before writing to the local cpu cache. This results in a cache miss for the invalidated cache line in the other cpus' caches; the data is then served from the cache of the cpu containing the most recently modified copy.
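One practical consequence of this invalidation traffic: when per-cpu or per-thread data shares a 64-byte cache line, a write by one cpu invalidates the line in every other cache holding it, even though the other cpus use different variables. A minimal sketch of padding data to cache-line boundaries (the 64-byte line size matches the fetch size described above; the array size is an arbitrary example):

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

/* Give each cpu's counter its own 64-byte cache line, so a write by one cpu
 * does not invalidate the line holding another cpu's counter. */
struct per_cpu_counter {
    alignas(64) uint64_t count;   /* alignment pads the struct out to a full cache line */
};

static struct per_cpu_counter counters[64];   /* one slot per cpu (arbitrary example) */

int main(void)
{
    counters[0].count++;          /* in real code, each thread updates only its own slot */
    printf("sizeof(struct per_cpu_counter) = %zu bytes\n", sizeof(struct per_cpu_counter));
    return 0;
}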

Server Memory Hierarchy

There is a hierarchy of caches. Each cpu core has private L1 and L2 caches and shares an L3 cache with the other cores on the socket, which helps reduce the latency of fetching data from physical memory. Physical memory access latency is lower on the local numa node than on a remote node. Cache latency is measured in cycles, while physical memory latency is measured in nanoseconds.

Latency increases when accessing a remote cpu's cache and memory

If a virtual-to-physical memory translation (v->p) is not cached in the TLB (Translation Lookaside Buffer), one or two page table lookups are required to find the v->p translation. That means an additional one or two accesses to DRAM, each with its own latency (60–120 ns), before the data itself can be fetched from physical memory.

True DRAM latency should also account for queuing behind other cache misses. When an application thread experiences a cache miss, the DRAM access request goes into a 16-entry-deep queue, so the latency depends on how many other memory requests are still pending. A worst-case request could see a latency of up to (3+16) * DRAM latency; at 100 ns per DRAM access, that is roughly 1.9 µs.

To reduce memory access latencies, one should:

  • Reduce TLB misses by using large pages (2 MB, 1 GB), or make sure memory accesses occur at adjacent memory addresses (a huge page allocation sketch follows this list).
  • Use temporary variables that can be kept in cpu registers or optimized into registers.
  • Coordinate thread/process activities by running them close to each other (on the same core or socket). This allows threads to share caches.
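A minimal sketch of allocating an explicit huge page with mmap() (an illustration, assuming huge pages have already been reserved, e.g. via the vm.nr_hugepages sysctl, and that the default huge page size is 2 MB):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)    /* assumes a 2 MB default huge page size */

int main(void)
{
    /* MAP_HUGETLB asks for memory backed by pre-reserved huge pages, so a
     * single TLB entry covers 2 MB instead of 4 KB. */
    void *buf = mmap(NULL, HUGE_PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");          /* fails if no huge pages are reserved */
        return EXIT_FAILURE;
    }

    memset(buf, 0, HUGE_PAGE_SIZE);           /* touch the memory to fault the page in */
    munmap(buf, HUGE_PAGE_SIZE);
    return EXIT_SUCCESS;
}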

FileSystem Cache Latency

When an application requests memory, Linux uses the default memory allocation policy and attempts to allocate memory from the local numa node. To reduce filesystem read and write IO latency, page cache memory is also allocated from the local numa node. At some point, the local numa node's memory may fill up with application and filesystem pages. When an application running on the local numa node requests more memory, the kernel has two options:

1. Free filesystem page cache memory on the local node, considering page cache memory is counted as reclaimable (effectively free) memory.

2. Allocate application memory from the remote node.

The decision is influenced by the kernel tunable vm.zone_reclaim_mode (default: 0). The default behavior is to keep the filesystem page cache intact on the local node and satisfy the application's memory request from a remote node when no free memory is available locally. That keeps filesystem page cache (read/write) latency low, but the application may see higher memory latency due to remote memory access.

Setting the kernel tunable vm.zone_reclaim_mode=1 enables aggressive page reclamation, evicting filesystem pages from the local node in favor of application memory pages. This may lower application memory latency, but at the cost of higher filesystem cache latency.

Linux NUMA POLICIES

Processor affinity and data placement play an important role in application performance.

Processor affinity refers to associating a thread/process or a task to a particular processor.

Linux, by default, attempts to keep a process on the same core to take advantage of a warm cpu cache.

A task is considered “cache hot” if it wakes up within half a millisecond (kernel tunable: sched_migration_cost_ns). Otherwise, it is a candidate for migration to other cores in the physical socket. If all cores in the socket are busy, the task may get pushed to a core in a remote socket. Running the task on another socket may induce higher memory access latency due to remote memory access.

Linux offers various tools (taskset, numactl, cgroups) and the libnuma library to override the default scheduling behavior and memory allocation policies by binding a task to a subset of cpu cores and by allocating memory from a particular numa node.

The numa-aware “numactl” command allows task placement and memory allocation on a particular numa node. It lets a system admin change the default memory allocation policy and specify one that matches application requirements:

Bind: Memory allocation should come from the supplied numa nodes.

Preferred: Memory allocation preference, specified as a list of numa nodes. If the first node is full, memory is allocated from the next node.

Interleave: Memory allocation is interleaved among the specified numa nodes. This is useful when the allocation cannot fit into a single numa node and the application is multithreaded or multiprocess. This mode spreads memory across multiple numa nodes and balances memory latency across all application threads.
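The interleave policy can also be requested programmatically. A minimal libnuma sketch (an illustration; link with -lnuma, and the 4 GB pool size is an assumption standing in for a large shared cache):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available\n");
        return EXIT_FAILURE;
    }

    size_t size = 4UL * 1024 * 1024 * 1024;    /* 4 GB shared pool, larger than one node (assumption, 64-bit build) */

    /* Pages are placed round-robin across all allowed nodes, balancing
     * memory latency for threads running on different nodes. */
    void *pool = numa_alloc_interleaved(size);
    if (pool == NULL) {
        fprintf(stderr, "numa_alloc_interleaved failed\n");
        return EXIT_FAILURE;
    }

    memset(pool, 0, size);                     /* touch pages so they are actually placed */
    numa_free(pool, size);
    return EXIT_SUCCESS;
}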

Resource controllers, part of cgroups, are typically used for controlling the resource usage of Linux containers (Docker).

Resource controllers are kernel subsystems that offer finer-grained control over system resources and thus allow the resources of a larger server to be logically partitioned.

For example, the “cpuset” resource controller can help partition numa nodes among workloads and allows dedicated use of the cpus and memory of a numa node.

# Taskset and numactl examples:
# Start an application with a cpu affinity mask. This pins its threads to the set of cpus specified by the mask.
taskset <cpu mask> <application>
# One can also specify a cpu number instead. This pins the application to cpu 8.
taskset -c 8 <application>
# Change the affinity mask of a running task. The process is then restricted to the cpus specified by the mask.
taskset -p <cpu mask> <pid>

# numactl
# Run myapp on cpus 2,4,6,8 and allocate memory only from the local node where the process runs.
numactl -l --physcpubind=2,4,6,8 myapp

# Run multithreaded application myapp with its memory interleaved across all numa nodes to achieve balanced latency across all application threads.
numactl --interleave=all myapp

# Run the process on cpus that are part of node 0, with memory allocated on nodes 0 and 1.
numactl --cpunodebind=0 --membind=0,1 myapp

Setting processor affinity manually may sometimes affect application performance negatively due to:

  1. The scheduler being restricted from assigning waiting threads to idle or underutilized cores.
  2. Memory access time increasing when additional allocations cannot be satisfied from the local node.

Previously allocated memory is not automatically migrated unless the Linux kernel tunable “numa_balancing” is enabled.

The numa_balancing feature periodically unmaps application pages and uses the resulting faults to keep application memory local, if possible. Linux also offers the migratepages utility and API routines to move application memory pages manually from one node to another. See sample program.
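A minimal sketch of the underlying move_pages(2) interface (an illustration, not the linked sample program; link with -lnuma, and the target node 1 is an arbitrary example):

#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf = aligned_alloc(page_size, page_size);   /* one page of application memory */
    if (buf == NULL)
        return EXIT_FAILURE;
    memset(buf, 0, page_size);                          /* touch it so it is physically allocated */

    void *pages[1]  = { buf };
    int   nodes[1]  = { 1 };                            /* migrate to node 1 (arbitrary example) */
    int   status[1] = { -1 };

    /* pid 0 means the calling process; status[] reports the node each page
     * landed on, or a negative errno value on failure. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0) {
        perror("move_pages");
        return EXIT_FAILURE;
    }
    printf("page now on node %d\n", status[0]);

    free(buf);
    return EXIT_SUCCESS;
}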

Application Design Consideration

NUMA should be taken into account in application design, as NUMA servers are becoming more commonplace due to economies of scale and the need to control server sprawl in data centers.

Even Docker containers are typically hosted on large numa servers, since large servers allow packing hundreds or even thousands of containers onto a single host.

Large servers also have better RAS (Reliability, Availability, Serviceability) and can be configured to achieve higher availability. A few numa considerations are listed below:

  • Reduce TLB misses by using the Linux HugePages feature, which allocates large pages (2 MB, 1 GB) instead of 4 KB pages. Oracle, MySQL and others offer tunables to use HugePages.
  • Databases (Oracle, MySQL) and the JVM support “numa” features, which should be enabled when hosted on NUMA servers to achieve balanced performance.
  • If not using HugePages, memory accesses should occur at adjacent memory addresses to trigger the Intel cpu hardware prefetch logic.
  • Use temporary variables that can be kept in cpu registers or optimized into registers.
  • Coordinate thread/process activities by running them close to each other (on the same core or socket). This also allows threads to share cpu caches.
  • For applications using a master/slave model where slave threads work on independent sets of data, the programmer should ensure that memory allocations are made by the thread that will subsequently access the data, not by initialization code. The Linux kernel places memory pages local to the allocating thread, and thus local to the worker thread that will access the data (see the libnuma sketch after this list).
  • For a multi-threaded or multi-process application sharing a common but very large pool of data (MySQL shared buffer cache, Oracle SGA) that may not fit into a single node's memory, it is recommended to use the interleave policy (“--interleave=all”, set via numactl or libnuma) to spread the memory across multiple numa nodes and achieve balanced memory latency across all application threads or processes.
  • When the application's memory allocation fits into one memory node (“numactl --hardware” shows the memory configured in each node) and multiple threads/processes access the same data, better performance is achieved by co-locating the threads on the same node. This can be done by selecting the “bind” policy (“--cpunodebind=0 --membind=0”) via the “numactl” command or the NUMA API.
  • Some long-lived, memory-intensive applications regularly create and destroy threads in a thread pool to deal with changes in load. Since newly created threads may not run local to the data, they may see higher latencies when accessing shared data. One should monitor memory access latencies periodically and, if needed, migrate pages close to the threads that show remote memory accesses. In general, migrating memory pages from one node to another is an expensive operation, but it may be worthwhile for memory-intensive applications.

To migrate application pages manually from one node to another, consider using the “migratepages” command or the numa API. Automated system-level balancing (the “numa_balancing” feature) is also supported by Linux, but it may result in higher kernel cpu overhead (depending on the workload). A better option can be to move pages manually using the “migratepages” command or the numa API, as demonstrated in this sample program.

  • An application that wants Linux to enforce its memory policy (bind, preferred, interleave) early, at allocation time, should use the “numactl --touch” option. The default is to apply the policy at page fault time, when the application first accesses a page.
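For the allocate-where-you-access guideline above, a minimal libnuma sketch of node-local placement for a worker's data (an illustration; link with -lnuma, and the node number and buffer size are arbitrary examples):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical worker setup: run the calling thread on a given node and
 * allocate its working set from that node's local memory. */
static void *setup_worker(int node, size_t size)
{
    numa_run_on_node(node);                      /* restrict this thread to the node's cpus */
    void *buf = numa_alloc_onnode(size, node);   /* pages come from the node's local memory */
    if (buf != NULL)
        memset(buf, 0, size);                    /* touch pages so placement actually happens */
    return buf;
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available\n");
        return EXIT_FAILURE;
    }

    size_t size = 256UL * 1024 * 1024;           /* 256 MB per worker (arbitrary example) */
    void *buf = setup_worker(0, size);           /* worker bound to node 0 (arbitrary example) */
    if (buf == NULL)
        return EXIT_FAILURE;

    numa_free(buf, size);
    return EXIT_SUCCESS;
}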

NUMA MONITORING

“numactl” and “numastat” can be used to find:

  • number of numa nodes and distance between nodes
  • cpu and memory association with the node
  • Linux policy (default, bind, preferred, interleave) of a process
  • numa node statistics: numa_hit, numa_miss, numa_foreign, interleave_hit, local_node, other_node..
  • process-level numa node memory allocation statistics, similar to the information reported in /proc/<pid>/numa_maps

Numa API:

getcpu: Determine the cpu and NUMA node the calling thread is running on.

get_mempolicy: Retrieve the numa memory policy of a process, or tell which node holds a given address.

set_mempolicy: Set the numa memory policy of the process (default, bind, preferred, interleave).

move_pages: Move individual pages of a process to another node.

mbind: Set a memory policy for a particular memory range.
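A minimal sketch combining a couple of these calls (an illustration: getcpu() is available as a glibc wrapper since glibc 2.29, mbind() is declared in <numaif.h> and needs -lnuma, and binding a 64 MB region to node 0 is an arbitrary example):

#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    unsigned int cpu, node;
    if (getcpu(&cpu, &node) == 0)                /* where is this thread running right now? */
        printf("running on cpu %u, numa node %u\n", cpu, node);

    size_t len = 64UL * 1024 * 1024;             /* 64 MB region (arbitrary example) */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Bind this range to node 0: pages faulted in later must come from node 0. */
    unsigned long nodemask = 1UL << 0;
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}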

Intel PMU (Performance Monitoring Unit):

Each core in an Intel processor has its own PMU (Performance Monitoring Unit) that gives a wealth of statistics about cache and memory latency and throughput.

Linux perf can be used to capture events such as cpu-clock, cycles, and task-clock.

PMU statistics like CPI (Cycles Per Instruction) or IPC (Instructions Per Cycle) can be used to estimate physical memory fetch latencies.

The Intel PCM library and tools (pcm.x, pcm-numa.x, pcm-memory.x) can provide additional off-core (uncore) statistics, such as shared L3 cache stats, memory channel and QPI utilization and throughput, and numa memory latency and bandwidth. The uncore has its own PMU for monitoring these activities.

Hardware Features

Memory interleaving

Memory interleaving is a hardware feature that increases cpu to memory bandwidth by allowing parallel access to multiple memory banks.

Memory interleaving refers to how physical memory is interleaved across the physical DIMMs.

This feature is highly effective for reading large volumes of data in succession. Memory interleaving increases bandwidth by allowing simultaneous access to more than one bank of memory.

Interleaving works by dividing system memory into two or four blocks, called 2-way or 4-way interleaving. Each block of memory is accessed by a different set of control lines, merged together on the memory bus, so reads and writes to different blocks can be overlapped. Consecutive memory addresses are spread over the different blocks to exploit the interleaving.

A balanced system provides the best interleaving. A server is considered balanced when all memory channels on a socket have an equal amount of memory. The goal should be to populate memory evenly across all sockets to improve throughput and reduce latency.

Dual or Quad Rank Memory Modules

The term “rank” refers to a 64-bit-wide chunk of data. The number of ranks is the number of independent DRAM sets in a DIMM that can be accessed across the 64 data bits, the width of the DIMM. A memory DIMM with a single 64-bit chunk of data is called a single-rank module. Nowadays, quad-rank modules are common, since higher rank counts allow denser DIMMs.

Clocking memory at a higher frequency also helps improve memory throughput. For example, the performance gain of using 1066 MHz memory versus 800 MHz memory is 28%, and of 1333 MHz versus 1066 MHz is 9%.

The DDR4 memory specification supports higher frequencies than DDR3. DDR4 can support transfer rates of 2133–4266 MT/s (million transfers per second), compared to DDR3, which is limited to 800–2133 MT/s. DDR4 also uses less power than DDR3.

Originally published at http://techblog.cloudperf.net.

