DevOps Tutorial [Linux Series: Part 1 Memory Management Basics (Theory)]
Before we embark on our journey into the world of Linux memory management, we need to understand the concept of virtual memory, or virtual address spaces.
Physical memory, or RAM, is expensive and hence limited. It is also not necessarily contiguous and may be accessible as a set of distinct memory address ranges. On top of that, there are scenarios where you want to run a program whose size is larger than the RAM installed on your computer. Virtual memory was introduced to address these kinds of issues.
When a program is compiled, the compiler converts the program code into machine code, and during this process address spaces are laid out in the form of virtual addresses. Suppose there is an application named XYZ. The machine code which makes up the XYZ application is 20000 bytes in size. It also needs another 10000 bytes for data storage and I/O buffers. This means that in order to run this application, 30000 bytes of address space must be available. This 30000-byte requirement is known as the application’s virtual address space. The virtual address space is the maximum amount of address space available to an application. The word “virtual” means that this is the total number of uniquely addressable memory locations available to the application, not the amount of physical memory either installed in the system or dedicated to the application at any given time. In the case of our XYZ application, its virtual address space is 30000 bytes.
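To see the difference between a process’s virtual address space and the physical memory it actually occupies, here is a minimal sketch of my own (not part of the original example) that reads the VmSize and VmRSS fields from /proc/self/status on Linux:

/* Print this process's virtual address space size (VmSize) versus the
 * physical memory it currently occupies (VmRSS), from /proc/self/status. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];

    if (!f)
        return 1;

    while (fgets(line, sizeof(line), f)) {
        /* VmSize = total virtual address space, VmRSS = resident (physical) pages */
        if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}

On most systems VmSize is considerably larger than VmRSS, which is exactly the gap between an application’s virtual address space and the physical memory it actually uses.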
Virtual memory addresses must be converted into physical addresses in order to access RAM. To make this conversion easier, physical and virtual memory are divided into equal-sized chunks called pages. Intel x86 systems use 4 Kbyte pages.
Each of these pages is given a unique number: the page frame number (PFN). In this paged model, a virtual address is composed of two parts: an offset and a virtual page frame number. If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above form the virtual page frame number. Each time the processor encounters a virtual address it must extract the offset and the virtual page frame number, translate the virtual page frame number into a physical one, and then access the location at the correct offset into that physical page. To do this the processor uses page tables.
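To make the split concrete, here is a small user-space sketch (my own illustration, not kernel code) that extracts the offset and the virtual page frame number from an example address, assuming 4 Kbyte pages:

/* Split a virtual address into its virtual page frame number and page offset,
 * assuming 4 KB (2^12 byte) pages as described above. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)   /* 4096 bytes */
#define PAGE_MASK  (PAGE_SIZE - 1)       /* bits 11:0  */

int main(void)
{
    uintptr_t vaddr  = 0x2A4F3;                 /* an arbitrary example address */
    uintptr_t vpfn   = vaddr >> PAGE_SHIFT;     /* bits 12 and above            */
    uintptr_t offset = vaddr & PAGE_MASK;       /* bits 11:0                    */

    printf("vaddr 0x%lx -> virtual page frame %lu, offset 0x%lx\n",
           (unsigned long)vaddr, (unsigned long)vpfn, (unsigned long)offset);
    return 0;
}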
The above picture shows the virtual address spaces of two processes. Processes A and B each have their own page table, and these page tables map each process’s virtual pages onto physical pages in memory. We can see that process A’s virtual page frame number 3 is mapped to physical page frame number 9, and that process B’s virtual page frame number 1 is mapped to physical page frame number 7. A page table entry contains the following information:
1. A flag showing whether the page table entry is valid or not.
2. The physical page frame number that the entry maps to.
3. Access control information, such as whether the page is writable and whether it contains executable code.
The page table is accessed using the virtual page frame number as an offset, so virtual page frame 4 would be the 5th element of the table (0-based indexing). In order to convert a virtual address into a physical address, the processor must work out the virtual address’s page frame number and the offset within that virtual page.
Linux maintains the page table of each process in physical memory and accesses it through the identity-mapped kernel segment. Because page tables are stored in physical memory, they cannot themselves be swapped out to disk. This means that a process with a huge virtual address space could exhaust memory simply because its page tables alone use up all the available memory.
The processor uses the virtual page frame number as an index into the process’s page table to retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the physical page frame number from the entry. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case, the processor cannot resolve the address and must pass control to the operating system so that it can fix things up. Exactly how the processor notifies the operating system that the current process has attempted to access a virtual address for which there is no valid translation is processor specific. However the processor delivers it, this is known as a page fault, and the operating system is notified of the faulting virtual address and the reason for the page fault. Assuming the page table entry is valid, the processor takes the physical page frame number, multiplies it by the page size to get the base address of the page in physical memory, and then adds the offset to reach the exact location.
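The following toy model (purely illustrative, with a hypothetical eight-entry page table; real page tables are multi-level structures maintained by the kernel and walked by the MMU) shows the same sequence of steps: index the table with the virtual page frame number, check the valid flag, and combine the physical page frame number with the offset:

/* A toy, user-space model of the translation described above. Not kernel code. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct pte {
    unsigned valid    : 1;   /* is this a valid mapping?    */
    unsigned writable : 1;   /* access control information  */
    uint64_t pfn;            /* physical page frame number  */
};

/* Hypothetical page table: virtual page 1 -> physical page 7,
 * virtual page 3 -> physical page 9, as in the example above. */
static struct pte page_table[8] = {
    [1] = { .valid = 1, .writable = 1, .pfn = 7 },
    [3] = { .valid = 1, .writable = 0, .pfn = 9 },
};

static int translate(uint64_t vaddr, uint64_t *paddr)
{
    uint64_t vpfn   = vaddr >> PAGE_SHIFT;          /* index into the table   */
    uint64_t offset = vaddr & (PAGE_SIZE - 1);

    if (vpfn >= 8 || !page_table[vpfn].valid)
        return -1;                                  /* page fault on real HW  */

    *paddr = page_table[vpfn].pfn * PAGE_SIZE + offset;
    return 0;
}

int main(void)
{
    uint64_t vaddr = 3 * PAGE_SIZE + 0x10;          /* inside virtual page 3  */
    uint64_t paddr;

    if (translate(vaddr, &paddr) == 0)
        printf("0x%llx -> 0x%llx\n",
               (unsigned long long)vaddr, (unsigned long long)paddr);
    else
        printf("page fault at 0x%llx\n", (unsigned long long)vaddr);
    return 0;
}

Running it prints 0x3010 -> 0x9010, which matches process A’s mapping of virtual page 3 onto physical page 9 described above.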
A page fault, although it sounds like some kind of error, is actually quite a useful mechanism for requesting that physical memory be allocated to a process. It occurs when a program tries to execute code (or access data) that is not currently held in physical pages of memory. Linux responds by allocating more pages to that program so that it can continue its execution.
There are two types of page faults: minor and major.
A minor page fault occurs when the page is already present in memory but is not currently mapped into the process’s address space; the kernel can fix it up without touching the disk. A major page fault occurs when the page is not in memory at all and has to be read in from disk. We can see how frequently page faults happen on a Linux system by executing the command below:
#ps -eo min_flt,maj_flt,cmd
The MINFL column shows the number of minor faults and the MAJFL column shows the number of major faults.
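To watch minor faults being generated, here is a small sketch of my own: it maps an anonymous region, touches every page, and reads the fault counters with getrusage() before and after. The first write to each page triggers a minor fault that makes the kernel back it with a physical page:

/* Touch a freshly mmap'd anonymous region and watch the minor fault counter grow. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
    const size_t len = 64 * 4096;               /* 64 pages of 4 KB */
    struct rusage before, after;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    getrusage(RUSAGE_SELF, &before);
    memset(buf, 0xAB, len);                     /* first touch of every page */
    getrusage(RUSAGE_SELF, &after);

    printf("minor faults: %ld -> %ld, major faults: %ld -> %ld\n",
           before.ru_minflt, after.ru_minflt,
           before.ru_majflt, after.ru_majflt);

    munmap(buf, len);
    return 0;
}

On a typical system the minor fault counter jumps by roughly the number of pages touched, while the major fault counter stays unchanged because nothing has to be read from disk.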
Demand Paging
As the amount of physical memory is much smaller than the amount of virtual memory, the kernel must use physical memory efficiently. The kernel saves physical memory by only loading the virtual pages that are currently being used by the executing program. It makes little sense for the kernel to load the entire program into RAM at once, since the whole program will not be used at any given time. So the kernel only loads those pages which are actually accessed, and the rest remain on disk. This technique of loading virtual pages into memory only as they are accessed is known as demand paging. Linux uses demand paging to load executable images into a process’s virtual memory. Whenever a command is executed, the file containing it is opened and its contents are mapped into the process’s virtual memory. This is done by modifying the data structures describing the process’s memory map and is known as memory mapping. However, only the first part of the image is actually brought into physical memory; the rest of the image is left on disk. As the image executes, it generates page faults and Linux uses the process’s memory map to determine which parts of the image to bring into memory for execution.
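Demand paging can also be observed from user space. The sketch below (my own illustration; /etc/hostname is just a placeholder for any small readable file) maps a file and uses mincore() to check which of its pages are resident before and after the first access. Note that if the file is already in the page cache, both counts may be the same:

/* Map a file and use mincore() to see that its pages are only brought into
 * physical memory once they are actually accessed (demand paging). */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

static unsigned count_resident(void *addr, size_t len)
{
    size_t pages = (len + 4095) / 4096;
    unsigned char vec[pages];
    unsigned resident = 0;

    if (mincore(addr, len, vec) == 0)
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;     /* bit 0 set = page is resident */
    return resident;
}

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* placeholder file */
    struct stat st;

    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    printf("resident pages before access: %u\n", count_resident(map, st.st_size));
    volatile char c = map[0];                   /* fault the first page in */
    (void)c;
    printf("resident pages after access:  %u\n", count_resident(map, st.st_size));

    munmap(map, st.st_size);
    close(fd);
    return 0;
}

If the file was not already cached, the first count is 0 and the second is 1, showing that only the page actually accessed was brought into memory.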
Swapping
When a process needs to bring a virtual page into RAM and there are no free physical pages available, the OS must make space for it by discarding another page from physical memory. If the page to be discarded came from an image or data file and has not been written to, then it does not need to be saved; it can simply be discarded, and if the process needs that page again it can be brought back into memory from the image or data file. However, if the page has been modified, the operating system must preserve its contents so that it can be accessed at a later time. Such a page is known as a dirty page, and when it is removed from memory it is saved in a special sort of file called the swap file. Accesses to the swap file are very slow relative to the speed of the processor and RAM, so the kernel must juggle the need to write pages to disk with the need to retain them in memory to be used again. Linux uses a Least Recently Used (LRU) algorithm to choose pages which might be removed from the system. Every page in the system has an age which changes as the page is accessed: the more a page is accessed, the younger it is; the less it is accessed, the older and more stale it becomes. Old pages are good candidates for swapping.
There is a common misconception regarding swapping: many people think swapping only happens when physical memory is running low. In fact, swapping is part of how the kernel uses system resources efficiently. Suppose a program started during boot but is not being used; swapping its pages out frees up expensive physical memory.
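A quick way to see how much swap the system has and how much is in use is the sysinfo(2) call, which reports the same numbers as free and /proc/meminfo. A minimal sketch:

/* Report total and free swap (and RAM) as the kernel sees them, via sysinfo(2). */
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    struct sysinfo si;

    if (sysinfo(&si) != 0)
        return 1;

    /* Sizes are reported in units of si.mem_unit bytes. */
    printf("RAM : %lu MB total, %lu MB free\n",
           si.totalram  * si.mem_unit / (1024 * 1024),
           si.freeram   * si.mem_unit / (1024 * 1024));
    printf("Swap: %lu MB total, %lu MB free\n",
           si.totalswap * si.mem_unit / (1024 * 1024),
           si.freeswap  * si.mem_unit / (1024 * 1024));
    return 0;
}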
Swappiness is a property of the kernel that changes the balance between swapping out the memory of running processes and dropping pages from the page cache. Swappiness can be set to values between 0 and 100. A low value means the kernel will try to avoid swapping as much as possible, whereas a higher value means the kernel will use swap space more aggressively. There is another value which is taken into account, known as the distress value. Distress is a measure of how much trouble the kernel is having while trying to free up memory: the first time the kernel starts reclaiming pages, distress is zero, and if more attempts are needed the value goes up, gradually approaching a maximum of 100. There is also the mapped_ratio value, which is an approximate percentage of how much of the system’s total memory is mapped (i.e. part of some process’s address space) within a memory zone. And last but not least, there is the kernel parameter vm.swappiness, the swappiness setting itself, which defaults to 60. The lower the value, the less swapping is used and the more memory pages are kept in physical memory.
With these values, the OS calculates its “swap tendency”:
swap_tendency = mapped_ratio/2 + distress + vm_swappiness;
If swap_tendency is below 100, the kernel will only reclaim page cache pages. Once it reaches 100, pages which are part of some process’s address space will also be considered for reclaim. So, if the system’s distress value is low and swappiness is set to 60, the system will not swap process memory until about 80% of the total RAM in the system is mapped by processes. People who don’t want application memory to be swapped out can set swappiness to a low value, like 5 or 10, causing the kernel to leave process memory alone until the distress value gets quite high. Overall, increasing this value makes the system more inclined to use swap space, leaving more memory free for caches; decreasing it makes the system less inclined to swap, and may improve application responsiveness.
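As a small illustration (the distress and mapped_ratio values below are hypothetical, chosen only to exercise the formula), this sketch reads the live vm.swappiness value from /proc/sys/vm/swappiness and plugs it into swap_tendency:

/* Evaluate the swap_tendency formula with the system's current swappiness. */
#include <stdio.h>

int main(void)
{
    int swappiness   = 60;   /* assume the default if the file can't be read */
    int distress     = 0;    /* hypothetical: kernel is not struggling yet   */
    int mapped_ratio = 80;   /* hypothetical: 80% of memory is mapped        */

    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    if (f) {
        if (fscanf(f, "%d", &swappiness) != 1)
            swappiness = 60;
        fclose(f);
    }

    int swap_tendency = mapped_ratio / 2 + distress + swappiness;

    printf("swap_tendency = %d/2 + %d + %d = %d -> %s\n",
           mapped_ratio, distress, swappiness, swap_tendency,
           swap_tendency >= 100 ? "process pages become candidates for reclaim"
                                : "only page cache pages are reclaimed");
    return 0;
}

With the hypothetical values above and the default swappiness of 60, the result is exactly 100, the threshold at which process memory starts being considered for reclaim.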
Now that we have discussed the basics of Linux memory management, let us look at a high-level diagram of the different components which together make memory management work.
User-space applications such as the Mozilla browser and the bash shell use the system call interfaces provided by glibc to interact with the kernel subsystem, which in turn interacts with the VM subsystem to access physical memory.
A few important components of the VM subsystem:
Zoned Buddy Allocator — The zoned buddy allocator is responsible for managing page allocations for the entire system; it allocates the physical pages. This code manages lists of physically contiguous pages and maps them into the MMU page tables, so as to provide other kernel subsystems with valid physical address ranges when they request memory. The buddy allocator also manages memory zones, which define pools of memory with different purposes. The main zones the buddy allocator manages accesses for are:
- DMA — This zone consists of the first 16 MB of RAM, from which legacy devices allocate buffers in order to perform direct memory access (DMA) operations.
- DMA32 — This zone exists only on 64-bit Linux and covers, more or less, the low 4 GB of memory. It is needed because the transition to large-memory 64-bit machines created a class of hardware that can only do DMA to the low 4 GB of memory.
- NORMAL — This zone encompasses memory addresses from 16 MB to 1 GB and is used by the kernel for internal data structures, as well as other system and user space allocations.
- HIGHMEM — This zone includes all memory above 1 GB and is used for file system buffers, user space allocations and similar allocations.
The Linux kernel doesn’t consider RAM to be one big undifferentiated pool of memory. It divides the RAM into nodes (one per NUMA memory bank; on a typical non-NUMA machine there is just one node) and further divides each node into a number of different memory regions known as zones. Memory zones are required because there is hardware, such as ISA devices, which cannot address memory beyond a certain limit. Dividing the memory into zones makes it possible to give such devices memory they can actually access.
In short, we can say that the kernel maintains its free pages using the buddy system, which keeps memory areas contiguous and hands out pages in response to page allocation requests.
There is a file in the /proc filesystem named buddyinfo which shows the number of free blocks of each size, per zone and per node.
In the screenshot above, 1179 is the number of free 4 KB blocks, 1884 the number of free 8 KB blocks, 1224 the number of 16 KB blocks, 416 the number of 32 KB blocks, and so on.
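If a screenshot is not handy, a short sketch of my own that reads /proc/buddyinfo directly and labels each column with its block size (column n holds free blocks of 2^n contiguous 4 KB pages) looks like this:

/* Read /proc/buddyinfo and print, for each zone, how many free blocks of each
 * size the buddy allocator currently holds. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/buddyinfo", "r");
    char line[512];

    if (!f)
        return 1;

    while (fgets(line, sizeof(line), f)) {
        /* Each line looks like: "Node 0, zone   Normal   1179 1884 1224 ..." */
        char *p = strstr(line, "zone");
        if (!p)
            continue;

        char *tok  = strtok(p, " \t\n");        /* the word "zone"           */
        char *zone = strtok(NULL, " \t\n");     /* zone name, e.g. "Normal"  */
        printf("zone %s:\n", zone ? zone : "?");

        int order = 0;                          /* column n = order-n blocks */
        while ((tok = strtok(NULL, " \t\n")) != NULL) {
            printf("  %6s free blocks of %4lu KB\n", tok, 4UL << order);
            order++;
        }
    }
    fclose(f);
    return 0;
}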
You can also run the command #echo m > /proc/sysrq-trigger and then check /var/log/messages to see a dump of the current memory state, including the free pages of each size.
In the above picture, we can see the different sizes of free pages available. We can also see that the free memory available is 61 MB, which is more than the minimum of 44 MB. In OOM scenarios, the available free memory drops below this minimum.
Slab Allocator — The Slab Allocator provides a more usable front end to the Buddy Allocator for those sections of the kernel which require memory in sizes that are more flexible than the standard 4 KB page.
Kswapd — The kswapd daemon is responsible for swapping pages out; it reclaims pages when memory runs low. It usually stays in a sleeping state and only wakes up when the number of free pages in a particular zone falls to its limit.
In the next tutorial, we will dive into the practical side and learn about the different commands, tools and tuning techniques which will help us improve the memory-related performance of Linux systems.
Please click here to go to the 2nd part of this tutorial.