How Linux Kernel Manages Application Memory

Amer Ather
Dec 29, 2022


Linux uses its virtual memory subsystem as a logical layer between application memory requests and physical memory (RAM). This abstraction hides the complexity of the platform-specific physical memory implementation from applications.

When an application accesses a virtual address exported by the Linux virtual memory subsystem that has no physical memory mapped to it, the hardware MMU raises an event to notify the kernel of the access. This event results in an exception, called a page fault, which the Linux kernel services by mapping the faulted virtual address to a physical memory page.

Virtual Address to Physical Memory Mapping

Virtual addresses are transparently mapped to physical memory through the collaboration of hardware (the MMU, or Memory Management Unit) and software (page tables). Virtual-to-physical mappings are also cached in a hardware cache, the TLB (Translation Lookaside Buffer), so that later references can be resolved to physical memory locations quickly.

A page is simply a group of contiguous linear addresses in physical memory. The page size is 4 KB on the x86 platform.
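
The base page size can be confirmed programmatically; a minimal C sketch using sysconf(3):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* _SC_PAGESIZE reports the kernel's base page size (4096 bytes on x86). */
        long page_size = sysconf(_SC_PAGESIZE);
        printf("page size: %ld bytes\n", page_size);
        return 0;
    }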

VM abstraction offers several benefits:

  • Programmers do not need to know the physical memory architecture of the platform. VM hides it and allows writing architecture-independent code.
  • A process always sees a contiguous, linear range of bytes in its address space, regardless of how fragmented the physical memory is.

For example, when an application allocates 10 MB of memory, the Linux kernel reserves a 10 MB contiguous virtual address range in the process address space. The physical memory locations this virtual address range maps to may not be contiguous; only a single page (4 KB) is guaranteed to be contiguous in physical memory.

  • Faster startup due to partial loading. Demand paging loads pages only as they are referenced.
  • Memory sharing. A single copy of a library or program in physical memory can be mapped into multiple process address spaces, which allows efficient use of physical memory.

"pmap -X <pid>" can be used to find out which parts of a process's resident memory are shared with other processes and which are private.

  • Several programs with memory footprints bigger than physical memory can run concurrently. Behind the scenes, the kernel transparently relocates the least recently used pages to disk (swap).
  • Each process is isolated in its own virtual address space and thus cannot affect or corrupt another process's memory.

Two processes may use the same virtual addresses, but those addresses are mapped to different physical memory locations. Processes that attach to the same shared memory (SHM) segment, however, have their virtual addresses mapped to the same physical memory location.
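
To illustrate the shared memory case, here is a minimal sketch (the object name /demo_shm and the one-page size are made up for the example) that creates and maps a POSIX shared memory object; any other process that opens and maps the same object gets virtual addresses backed by the same physical pages:

    /* Sketch: create and map a POSIX shared memory segment.
     * Compile with: gcc shm_demo.c -o shm_demo (add -lrt on older glibc). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;                       /* one page */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, len) < 0) { perror("ftruncate"); return 1; }

        /* MAP_SHARED: updates are visible to every process mapping /demo_shm. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        strcpy(p, "hello from shared memory");
        printf("mapped at %p (the virtual address differs per process)\n", (void *)p);

        munmap(p, len);
        close(fd);
        shm_unlink("/demo_shm");                        /* clean up the object */
        return 0;
    }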

A 32-bit address space is limited to 4 GB, compared to hundreds of terabytes for a 64-bit address space. The size of the process address space limits the amount of physical memory an application can use.

A process virtual address space is composed of memory segments of various types: text, data, heap, stack, shared memory (SHM) and mmap. The process address space is defined as the range of virtual memory addresses exported to the process as its environment.

Each memory segment is a linear virtual address range with a starting and ending address, and is backed by a backing store such as a file system or swap.

A page fault is serviced by filling a physical memory page from the backing store. During memory shortages, data cached in physical memory pages is migrated back to its backing store. A process's text segment is backed by the executable file on the file system. Stack, heap, COW (copy-on-write) and shared memory pages are called anonymous (anon) pages and are backed by swap (a disk partition or file).

When swap is not configured, anonymous pages cannot be freed; they are effectively wired into physical memory because there is no place to migrate their data during memory shortages.

When a process calls malloc() or sbrk(), the kernel creates or extends the heap segment in the process address space and reserves the range of virtual addresses that can be accessed legally. Any reference to a virtual address outside the reserved ranges results in a segmentation violation, which kills the process. Physical memory allocation is delayed until the process accesses virtual addresses within the newly created memory segment.

An application performing a large 50 GB malloc() but touching (page faulting) only a 10 MB range of virtual addresses will consume only about 10 MB of physical memory.

Virtual and physical memory usage can be viewed using "ps", "pidstat" or "top". The SIZE (or VSZ/VIRT) column represents the size of the virtual address space and the RSS (or RES) column shows allocated physical memory.
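
A minimal sketch of this lazy allocation behavior: allocate a large buffer, touch only a small prefix, and read VmSize/VmRSS from /proc/self/status before and after (the 1 GB and 10 MB sizes are just illustrative):

    /* Sketch: demonstrate that physical memory is allocated only on first touch. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void show_mem(const char *tag)
    {
        /* VmSize = virtual size, VmRSS = resident (physical) size. */
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) return;
        while (fgets(line, sizeof line, f))
            if (!strncmp(line, "VmSize", 6) || !strncmp(line, "VmRSS", 5))
                printf("%s: %s", tag, line);
        fclose(f);
    }

    int main(void)
    {
        size_t big = 1UL << 30;           /* reserve 1 GB of virtual address space */
        char *buf = malloc(big);
        if (!buf) { perror("malloc"); return 1; }
        show_mem("after malloc");         /* VmSize grows, VmRSS barely changes */

        memset(buf, 1, 10UL << 20);       /* touch only the first 10 MB */
        show_mem("after touching 10 MB"); /* VmRSS grows by roughly 10 MB */

        free(buf);
        return 0;
    }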

Physical memory pages used for program text and for caching file system data (the page cache) can be freed quickly during memory shortages, since the data can always be retrieved from the backing store (the file system). To free anonymous pages, however, the data needs to be written to the swap device first.

Linux Memory Allocation Policy

Process memory allocation is controlled by the Linux memory allocation policy. Linux offers three different memory allocation modes depending on the value of the tunable vm.overcommit_memory.

  1. Heuristic overcommit (vm.overcommit_memory=0): The Linux default mode allows processes to overcommit a "reasonable" amount of memory as determined by internal heuristics that take into account free memory and free swap. In addition, memory that can be freed by shrinking the file system cache and kernel slab caches (used by kernel drivers and subsystems) is also taken into consideration.

Pros: Uses relaxed accounting rules and is useful for programs that typically request more memory than they actually use. As long as there is sufficient free memory and/or swap available to meet the request, the process continues to function.

Cons: The Linux kernel makes no attempt to reserve physical memory on behalf of the process; memory is committed only when the process touches (accesses) the virtual addresses in the segment, so failure can come later, at access time.

2. Always overcommit (vm.overcommit_memory=1): Allows a process to overcommit as much memory as it wants, and the allocation always succeeds.

Pros: Wild allocations are permitted, since there are no restrictions based on free memory or swap.

Cons: Same as heuristic overcommit. An application can malloc() terabytes on a system with a few GB of physical memory. Nothing fails until the pages are actually touched, which eventually triggers the OOM killer.

3. Strict overcommit (vm.overcommit_memory=2): Prevents overcommit by accounting for both the reserved virtual memory range and the physical memory behind it. No overcommit means no OOM killer.

With strict overcommit, the kernel keeps track of the amount of memory already reserved or committed.

Since the strict overcommit policy bases its accounting on the commit limit rather than on current free memory and swap usage, one should not rely on the free memory or swap metrics (reported by free, vmstat) to discover available memory. Instead, use the CommitLimit and Committed_AS metrics from "cat /proc/meminfo" to estimate the memory available for allocation.

To calculate the current allocation headroom, use: CommitLimit - Committed_AS.
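
A rough C sketch of that calculation, parsing the CommitLimit and Committed_AS fields of /proc/meminfo (both reported in kB):

    /* Sketch: estimate memory still available for allocation under strict
     * overcommit as CommitLimit - Committed_AS. */
    #include <stdio.h>

    int main(void)
    {
        char line[256];
        long commit_limit = -1, committed_as = -1;
        FILE *f = fopen("/proc/meminfo", "r");
        if (!f) { perror("/proc/meminfo"); return 1; }

        while (fgets(line, sizeof line, f)) {
            sscanf(line, "CommitLimit: %ld kB", &commit_limit);
            sscanf(line, "Committed_AS: %ld kB", &committed_as);
        }
        fclose(f);

        if (commit_limit < 0 || committed_as < 0) {
            fprintf(stderr, "could not parse /proc/meminfo\n");
            return 1;
        }
        printf("CommitLimit:  %ld kB\n", commit_limit);
        printf("Committed_AS: %ld kB\n", committed_as);
        printf("available to commit: %ld kB\n", commit_limit - committed_as);
        return 0;
    }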

The kernel tunable vm.overcommit_ratio sets the overcommit limit for this mode: CommitLimit = physical memory x (overcommit_ratio / 100) + swap. For example, on a host with 64 GB of RAM, 8 GB of swap and the default ratio of 50, the commit limit is 32 GB + 8 GB = 40 GB. The limit can be raised by setting vm.overcommit_ratio to a larger value (the default is 50, i.e. 50% of physical memory).

Pros: Disables the OOM killer. A memory allocation failure at startup has lower production impact than being killed by the OOM killer while serving production load. Solaris offers only this mode. Strict overcommit does not base its limit on the amount of memory currently free.

Cons: No overcommit is allowed. Memory that is reserved but not used cannot be used by other applications. A new program may fail to allocate memory even when the system reports plenty of free memory, because that memory has already been reserved on behalf of existing processes.

Monitoring for free memory becomes tricky with the strict overcommit policy. Applications on Linux typically do not handle memory allocation failures, and failing to check for them may result in corrupted memory and random, hard-to-debug failures.

NOTE: For both heuristic and strict overcommit, the kernel reserves a certain amount of memory for root: in heuristic mode, 1/32nd of free physical memory; in strict overcommit mode, 1/32nd of the percentage of real memory that you set. This is hard-coded in the kernel and cannot be tuned, which means a system with 64 GB of memory will reserve 2 GB for the root user.

OOM Killer

When a system-level memory shortage reaches the extreme point where the file system cache has been shrunk and all reclaimable memory pages have been reclaimed, sustained memory demand will eventually exhaust all available memory. To deal with this situation, the kernel selects processes that can be killed to meet the memory allocation demand. This desperate kernel action is called the OOM (out-of-memory) killer.
The criteria used to find a candidate process sometimes kill the most critical process. There are a few options available to deal with the OOM killer:

Disable the OOM killer by changing the kernel memory allocation policy to strict overcommit:

$sudo sysctl vm.overcommit_memory=2

$sudo sysctl vm.overcommit_ratio=80

Opt critical processes out of OOM killer consideration (for example, by lowering their oom_score_adj). Opting out a critical server process may sometimes not be enough to keep the system functioning, since the kernel still has to kill other processes in order to free memory. In some cases, an automated reboot of the server may be the only way to deal with the OOM killer:

$sudo sysctl vm.panic_on_oom=1

$sudo sysctl kernel.panic="number_of_seconds_to_wait_before_reboot"
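
To opt a specific process out, as mentioned above, one can write -1000 to its oom_score_adj file; a minimal sketch for the calling process (lowering the value requires root or CAP_SYS_RESOURCE):

    /* Sketch: opt the current process out of OOM killer selection. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/self/oom_score_adj", "w");
        if (!f) { perror("oom_score_adj"); return 1; }
        fputs("-1000\n", f);   /* -1000 disables OOM killing for this process */
        fclose(f);
        return 0;
    }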

File System Cache Benefits

Linux uses memory that is not otherwise in use to cache file system pages and disk blocks.

Memory used by the file system cache is reclaimable and is made available to applications when needed. The Linux "free" tool counts it under buff/cache and includes it in the available column.

The benefit of a file system cache is improved application read and write performance:

Read: When an application reads from a file, the kernel performs physical I/O to read data blocks from the disk. The data is cached in the file system cache for later use to avoid repeating the physical read. When the application requests the same block again, only a logical I/O (a read from the file system page cache) is required, which improves application performance. Also, when a sequential I/O pattern is detected, the file system prefetches (reads ahead) blocks in anticipation that the application will request the next adjacent blocks. This also helps reduce I/O latencies.

Write: When an application writes to a file, the kernel caches the data in the page cache and acknowledges completion (a buffered write). File data sitting in the file system cache can also be updated multiple times in memory (write cancelling) before the kernel schedules the dirty pages to be written to disk.
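
A small sketch of buffered writes: write() returns once the data is in the page cache, and fsync() forces the dirty pages out to the storage device (the file name is illustrative):

    /* Sketch: buffered write vs. forcing dirty pages to disk with fsync(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/pagecache_demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *msg = "buffered write lands in the page cache first\n";
        /* write() typically returns as soon as the data is copied into the
         * page cache; the flusher threads write it back later. */
        if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }

        /* fsync() blocks until the dirty pages for this file reach stable storage. */
        if (fsync(fd) < 0) { perror("fsync"); return 1; }

        close(fd);
        return 0;
    }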

Dirty pages in the file system cache are written back by the "flusher" kernel threads (formerly pdflush). Dirty pages are flushed periodically, and also when the proportion of dirty pages in memory exceeds virtual memory thresholds (kernel tunables such as vm.dirty_background_ratio and vm.dirty_ratio).

The file system cache improves application I/O performance by hiding storage latencies.

HugeTLB or HugePages Benefits

The Linux HugeTLB feature allows applications to use huge (large) pages of 2 MB or 1 GB instead of the default 4 KB size. The TLB (Translation Lookaside Buffer) is a hardware component that caches virtual-to-physical translations. When a translation is not found in the TLB (a TLB miss), the result is an expensive walk through the memory-resident page tables to find the virtual-to-physical translation.

TLB cache hits are becoming more important due to the increasing disparity between CPU and memory speed and growing memory density. Frequent TLB misses may negatively impact application performance.

The TLB is a scarce resource on the CPU chip, and the Linux kernel tries to make the best use of its limited cache entries. Each TLB entry can be programmed to provide access to a contiguous physical memory range of various sizes: 4 KB, 2 MB or 1 GB.

As an example, consider a CPU with a 64-entry first-level TLB plus a 1024-entry second-level TLB for 4 KB pages, 32 first-level entries (sharing the same second level) for 2 MB pages, and 4 entries for 1 GB pages. Programming the slots for bigger pages increases the amount of physical memory reachable without a TLB miss:

4 KB pages: (64 + 1024) x 4 KB ≈ 4 MB

2 MB pages: (32 + 1024) x 2 MB ≈ 2 GB

1 GB pages: 4 x 1 GB = 4 GB

Pros:

  • HugeTLB may help reduce TLB misses by covering a bigger portion of the process address space with the same number of TLB entries.
  • Larger pages require fewer page table entries and shallower page table walks (for example, three levels instead of four for a 2 MB page). This reduces memory latency on a TLB miss and the physical memory used for page table translation.
  • Huge pages are locked in memory and thus are not candidates for page-out during memory shortages.
  • Reduces page fault rates. Each page fault fills 2 MB or 1 GB of physical memory instead of 4 KB, which lets the application warm up much faster.
  • If the application's access pattern has data locality, HugeTLB will help. Workloads that read from random locations or touch only a few bytes of each page (a large hash table lookup, for example) may be better served by 4 KB pages than by large pages.
  • Large pages improve memory prefetch operations by eliminating the need to restart prefetching at 4 KB boundaries.

Cons:

  • Huge pages require upfront reservation. The system administrator has to set the kernel tunable vm.nr_hugepages=<number_of_pages> to the desired number of HugePages (see the C sketch at the end of this section for how an application maps them).

The Linux Transparent Huge Pages (THP) feature does not have this upfront cost.

  • The application should be HugePage aware.

To take advantage of HugePages, a Java application should be started with the "-XX:+UseLargePages" option so that large pages are used for the Java heap. Otherwise, the reserved pages may not be used for any purpose.

To monitor HugePages usage: "cat /proc/meminfo | grep Huge"

  • HugePages require contiguous physical memory in 2 MB or 1 GB chunks. A request for large pages may fail if the system has been up for a long period and most of physical memory has become fragmented into 4 KB chunks.
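
A minimal sketch (assuming 2 MB huge pages have been reserved beforehand, for example with sysctl vm.nr_hugepages=16) of how an application can explicitly map a huge page with mmap(MAP_HUGETLB):

    /* Sketch: map one 2 MB huge page with mmap(MAP_HUGETLB).
     * The mapping fails with ENOMEM if no huge pages are reserved. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGETLB
    #define MAP_HUGETLB 0x40000       /* fallback for older headers */
    #endif

    int main(void)
    {
        size_t len = 2 * 1024 * 1024; /* one 2 MB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)"); /* likely no huge pages reserved */
            return 1;
        }
        memset(p, 0, len);            /* touch it: a single fault fills 2 MB */
        printf("huge page mapped at %p\n", p);
        munmap(p, len);
        return 0;
    }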

Originally published at http://techblog.cloudperf.net.
