Linux Beyond the Basics: NUMA

The Closer, the Faster

Dagang Wei
6 min read · Jun 17, 2024

This blog post is part of the series Linux Beyond the Basics.

Introduction

Linux has long been the powerhouse behind servers, workstations, and even supercomputers. But as technology scales, so do the challenges. One such challenge is optimizing memory access for increasingly complex and demanding workloads. Enter NUMA, a memory architecture that can supercharge your Linux systems.

What is NUMA?

NUMA, short for Non-Uniform Memory Access, is a departure from traditional computer architectures. In a UMA (Uniform Memory Access) system, every processor reaches all of memory with the same latency. NUMA shakes things up by grouping processors and memory together into nodes. Each node has its own local memory, and while processors can access memory from other nodes, that remote access is slower than accessing their own.

Imagine a multi-floor office building where each floor represents a NUMA node with its own processors (people) and memory (meeting rooms). Ideally, you want all meetings (data access) to happen on the same floor (local memory) as your assigned desk for faster access. But sometimes, meetings (data access) are scheduled on other floors (remote memory), forcing you to use the elevator (interconnect), which takes longer. This analogy illustrates how NUMA systems aim to optimize performance by balancing the need for large amounts of memory with the desire for fast access.

Why NUMA Matters

As servers become more powerful, with dozens of processors and terabytes of memory, UMA can become a bottleneck. NUMA helps to alleviate this by:

  • Scalability: NUMA allows systems to expand with more processors and memory without sacrificing performance.
  • Performance: By prioritizing local memory access, NUMA can boost the speed of applications, especially those designed with NUMA in mind.
  • Resource Optimization: NUMA enables finer-grained control over how memory is allocated and used, leading to more efficient resource management.

NUMA and Cloud Computing

Cloud computing has taken the world by storm, and NUMA plays a vital role behind the scenes. Cloud providers leverage NUMA to maximize the efficiency of their infrastructure.

  • Virtual Machines (VMs): Cloud providers often run multiple VMs on a single physical server. By allocating VMs to specific NUMA nodes and aligning their virtual CPUs (vCPUs) with the corresponding memory, they can significantly enhance VM performance.
  • Scaling: NUMA helps cloud hosts scale. Adding sockets (and the memory attached to them) increases capacity without funneling every memory access through one shared bus, making it easier to meet growing demand.

Is NUMA Right for You?

NUMA is particularly beneficial for:

  • Large-Scale Systems: Servers with many processors and substantial memory.
  • High-Performance Computing: Workloads that demand maximum processing power and memory bandwidth.
  • Data-Intensive Applications: Applications that process large datasets and require fast memory access.

Core NUMA System Calls

These system calls provide the fundamental building blocks for NUMA management:

  • get_mempolicy(): Retrieves the current memory policy for a process or a specific memory region. The memory policy dictates which NUMA nodes a process should preferably use for memory allocation.
  • set_mempolicy(): Sets the default memory policy for the calling thread, letting you specify whether future allocations should prefer a node, be strictly bound to a set of nodes, or be interleaved across them. (Per-region policies are set with mbind().)
  • mbind(): Controls the binding of memory pages to specific NUMA nodes. You can specify which nodes a set of pages should be allocated on or migrated to.
  • migrate_pages(): Moves all pages of a process from one set of NUMA nodes to another, for example to keep a process's memory close after it has been rescheduled onto a different node.
  • move_pages(): Moves individual pages, specified by address, to given NUMA nodes, and reports the node each page currently resides on, which also makes it useful for diagnostics.

Higher-Level NUMA Libraries

While you can use the core system calls directly, it’s often easier and more convenient to use higher-level libraries that provide a simplified interface for NUMA management:

  • libnuma: This library offers a set of functions that wrap the core NUMA system calls, making it easier to work with memory policies, node masks, and other NUMA-related concepts.
  • numactl: This command-line tool provides a user-friendly way to interact with NUMA. It can be used to set memory policies, run processes on specific nodes, and gather NUMA statistics.

Additional NUMA Interfaces

Besides system calls and libraries, Linux offers other ways to access NUMA information and control NUMA behavior:

  • /proc/PID/numa_maps: This file in the proc filesystem provides information about the NUMA node mapping of a process's virtual memory.
  • hwloc: This library provides a more general way to discover and represent the topology of a system's hardware resources, including NUMA nodes, processors, and caches.

Important Considerations:

  • Kernel Configuration: NUMA support needs to be enabled in your kernel configuration (CONFIG_NUMA) for these system calls and tools to be available.
  • Privileges: A process can change its own memory policy without special privileges, but moving or migrating the pages of another process (e.g., with migrate_pages() or move_pages() on a different PID) requires root or the CAP_SYS_NICE capability.
  • Performance Impact: Using NUMA features can have a significant impact on system performance, both positive and negative. Careful tuning and experimentation may be required to achieve optimal results.

NUMA’s Toolkit

Linux provides a powerful arsenal of tools to help you understand, manage, and optimize NUMA on your systems. These tools can be invaluable for administrators, developers, and anyone looking to squeeze the most performance out of NUMA-enabled hardware:

  • numactl: This versatile command-line tool is your NUMA Swiss Army knife. It allows you to control memory allocation policies, run processes on specific NUMA nodes, gather statistics about NUMA usage, and much more.
  • numastat: This handy tool provides a snapshot of NUMA node memory usage, showing how much memory is allocated on each node and how often remote nodes are being accessed. It can help identify potential bottlenecks and areas for optimization.
  • numatop: This powerful interactive tool provides a real-time view of NUMA behavior. It can show you which processes are using which NUMA nodes, how much remote memory access is occurring, and other key metrics. This information is invaluable for identifying and troubleshooting NUMA-related performance issues.
  • lstopo: This graphical tool generates a visual representation of your system's hardware topology, including NUMA nodes, processors, and memory. It can help you understand the layout of your NUMA system and how resources are interconnected.
  • Other Tools: Linux offers other NUMA-related utilities, including numad (a daemon that automatically balances processes and memory across nodes), perf (for profiling NUMA behavior, e.g. perf c2c for shared-cache-line analysis), and libraries like libnuma for developers.

Choosing the Right Tool for the Job

The tool you choose will depend on your specific needs and goals. For example:

  • If you need to quickly control memory allocation or process placement, numactl is a great option.
  • To get a high-level overview of NUMA usage, numastat can be helpful.
  • For detailed analysis and troubleshooting of NUMA performance issues, numatop and perf are excellent choices.
  • To visualize your system’s NUMA topology, lstopo is a handy tool.

By using these tools in conjunction with your knowledge of NUMA, you can gain valuable insights into how your system is using its NUMA resources and make informed decisions about how to optimize them.

Example

Let’s explore some practical examples of using numactl in your Bash scripts:

1. Displaying NUMA Hardware Information:

numactl --hardware

This command shows the NUMA nodes in your system, their CPUs, and their associated memory.

Output (example):

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10
node 0 size: 16364 MB
node 0 free: 14030 MB
node 1 cpus: 1 3 5 7 9 11
node 1 size: 16384 MB
node 1 free: 15153 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

2. Running a Program on a Specific NUMA Node:

numactl --cpunodebind=0 --membind=0 my_program

This runs my_program with its CPUs bound to NUMA node 0 and its memory allocated from node 0.

3. Interleaving Memory Allocation Across All NUMA Nodes:

numactl --interleave=all my_program

This distributes my_program's memory allocations evenly across all available NUMA nodes.

4. Setting Preferred NUMA Node for Memory Allocation:

numactl --preferred=1 my_program

This tells the kernel to preferably allocate memory for my_program from NUMA node 1, but it can still use other nodes if necessary.

5. Showing the Current NUMA Policy:

numactl --show

This displays the NUMA policy (e.g., memory binding, CPU binding, preferred node) of the current process. numactl cannot query an arbitrary PID; to inspect a running process, read /proc/<PID>/numa_maps instead.

6. Setting NUMA Policy for Shared Memory:

numactl --membind=0,1 --shm /my_shared_memory_segment

This sets a NUMA policy for the SysV shared memory segment identified by the key file /my_shared_memory_segment, binding its pages to nodes 0 and 1. (numactl also offers --file for policies on tmpfs or hugetlbfs files and --shmid for raw shared memory IDs.)

Advanced Tips:

  • You can combine multiple options (e.g., --cpunodebind and --membind).
  • The -C or --physcpubind option lets you bind to specific physical CPU cores.
  • numactl --show (or -s) prints the NUMA policy of your current shell, i.e. the policy child processes will inherit.
  • For more advanced usage, refer to the numactl man page (man numactl).

Important Considerations:

  • NUMA optimization is most effective for programs with good data locality (data that is used together is located together).
  • Overly aggressive NUMA policies can sometimes hurt performance if not used carefully.
  • Experimentation and profiling are key to finding the optimal NUMA configuration for your specific workloads.

Conclusion

NUMA is a powerful tool for unlocking the full potential of modern Linux systems. By understanding how NUMA works and how to leverage its benefits, you can build more scalable, efficient, and high-performing applications, whether on your own servers or in the cloud.
