[How it Works] Memory and Performance. Part 2.


This part will focus mostly on the OS (Operating System) layer of memory management. It doesn’t matter which language your program is written in; from the OS's perspective it is a process. Let’s see what stands between our programming language abstraction and the hardware. Once we have a good grasp of it, we will explore how to take advantage of this knowledge. But before that, a quick recap.

In the previous article, we covered memory basics (please read part 1 in case you haven’t). We learned about the main types of memory and their most important characteristics from a performance point of view. I also introduced the linear memory model and briefly touched on memory segments, the call stack, and stack frames. We also explored access time, the fragmentation problem, and how memory managers and garbage collectors attempt to solve it.

When all segments of your program are loaded into RAM …

Hold on! I have a question! You told me that when my program runs into a “high fragmentation state”, the memory manager will use… will use… ah yes, compaction! It will magically copy chunks of memory and fix everything. But I don’t get one point: if a memory chunk was moved, that means its memory address changed too, right?

Well… It depends. I oversimplified a few things there. You see, your program lives in ̶M̶a̶t̶r̶i̶x̶ a virtual memory space. Each process has a layout, as we discussed, something like this:

On the left: memory addresses. On the right: the actual data, divided into a few logical segments.

You may have 10 processes, and each one will think its static segment starts at 0x1000. All of them can run in parallel without any issues. More than that, a process will never be able to access the memory of another process.

Okay, you are not helping. It is even more confusing now. So a memory address isn’t real? What about the CPU memory address lookup you were telling me about?

Relax, both kinds of memory addresses exist, physical and virtual. The CPU uses physical addresses, processes use virtual addresses, that’s it. When a process wants to access something, it provides a virtual address; the OS intercepts that request and translates it into a physical address (with some help from the hardware). The OS creates and maintains a virtual-to-physical memory mapping.

Ok, now it’s a bit clearer, but I still have questions… If we want to map each virtual address to a physical address, we’re going to need a lot of memory. One memory address refers to 1 byte of data, right? Now we need to store somewhere the relation between two memory addresses, virtual <-> physical, which still refers to a single byte of data. If an address takes 32 bits (4 bytes), two addresses take 32+32=64 bits (4+4=8 bytes), and all this just to point to a single byte of data! We have 8 bytes of mapping information for a single byte of actual data. Houston, we have a problem!

You are absolutely right! This is not efficient at all. That’s why we split memory into chunks. These chunks are called memory pages. Instead of keeping track of the virtual-to-physical mapping for each byte, we can do it for larger blocks.

Let’s assume we have a page of 4 kB (4096 bytes) and a memory address that takes 32 bits (4 bytes). If we spend 8 bytes on two addresses for each chunk of 4096 bytes, that is roughly 0.2% of memory spent on mapping (8 / 4096). Not too shabby! By the way, this mapping is called a page table, and each pair of virtual and physical addresses is called a page table entry. The OS is in charge of all this and creates one page table per process. Pay attention here:

Visualization of Virtual and Physical memory. Page 1 from virtual memory space is associated with Page 1 from physical memory space, and so on…

Each page of virtual memory is associated with a page in physical memory (RAM). The order and location of pages in physical memory may differ from the virtual layout. Both virtual and physical pages have a fixed size. From the process's perspective, its memory will always appear contiguous. This also means your process can’t reserve a single byte of memory: it reserves at least a whole page, which often leads to wasted memory because memory consumption is usually not an exact multiple of the page size.
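To make the translation step less abstract, here is a minimal sketch in C of the address arithmetic described above. The page table contents are made up for illustration, and a real MMU does this in hardware with multi-level tables, but the split of a virtual address into a page number and an offset works the same way.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u                 /* assumed 4 kB pages            */
#define PAGE_BITS 12u                   /* log2(PAGE_SIZE)               */

/* Toy page table: index = virtual page number, value = physical frame.
   The frame numbers here are invented for the example. */
static uint32_t page_table[16] = { [0] = 7, [1] = 3, [2] = 12 };

/* Translate a virtual address into a physical one (sketch only). */
static uint32_t translate(uint32_t virt)
{
    uint32_t vpn    = virt >> PAGE_BITS;        /* virtual page number  */
    uint32_t offset = virt & (PAGE_SIZE - 1);   /* offset inside page   */
    uint32_t frame  = page_table[vpn];          /* page table lookup    */
    return frame * PAGE_SIZE + offset;          /* physical address     */
}

int main(void)
{
    uint32_t virt = 0x1000 + 42;                /* byte 42 of page 1    */
    printf("virtual 0x%x -> physical 0x%x\n",
           (unsigned)virt, (unsigned)translate(virt));
    return 0;
}
```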

A snapshot of 3 different memory pages.

In the image above, you can see that in Page 1 the program is using only 3072 bytes (3 blocks of 1024 bytes); the last block is not used and can’t be reserved by other programs. Page 3 is used at 100% of its capacity, so no memory is wasted there. The worst case is Page 2, where only ¼ of the page is effectively used by the program.
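If you want to see this page granularity on your own machine, here is a small sketch that asks the OS for the real page size via sysconf and computes how much memory the three hypothetical allocations from the figure would actually reserve. The 3072 / 1024 / 4096 byte sizes are just the values from the example above; the wasted amounts assume a 4096-byte page.

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);     /* page size on this machine */

    /* Hypothetical allocation sizes, matching Pages 1, 2 and 3 above. */
    long sizes[] = { 3072, 1024, 4096 };

    for (int i = 0; i < 3; i++) {
        long pages  = (sizes[i] + page - 1) / page;  /* round up to whole pages */
        long wasted = pages * page - sizes[i];
        printf("need %5ld bytes -> reserve %ld page(s), %ld bytes unused\n",
               sizes[i], pages, wasted);
    }
    return 0;
}
```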

Ok, I got it. If we have bigger memory pages, we can potentially waste more memory, but at the same time we can’t have single-byte pages because the page table would be huge. Why do we need all this “virtuality” then?

The primary benefit is that developers don’t have to manage a shared memory space. For example, we don’t need to figure out where in memory to place a variable without breaking other running processes. Without virtual memory, writing a simple “hello world” would be a serious challenge, requiring you to know all kinds of memory management details. Besides this, we also get increased security (memory isolation) and a few other things like swapping and caching, which we will touch on in a moment.

Before going to the next benefit of virtual memory, let’s talk a bit about performance. We now know that every memory access involves an additional operation: a virtual-to-physical mapping lookup. Searching the page table for the right entry every time can be very costly. To solve this problem, modern computers maintain a very fast cache for these translations. And by very fast, I don’t mean RAM of course, because from a CPU perspective RAM is too slow. This cache is called the TLB (translation lookaside buffer), and for best performance it typically sits right inside the MMU (Memory Management Unit) in the CPU. The TLB is very fast and can return the right address in a single clock tick or less (Read More Here). The problem with the TLB is that it has a limited size, so it can’t hold many page table entries; if an entry is not found in the TLB, it has to be looked up “manually”, which is expensive (roughly 10x to 100x the cost).
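To make the hit/miss logic concrete, here is a toy model of a TLB in C. Real TLBs are hardware structures with their own sizes, associativity and replacement policies; the slot count, the direct-mapped layout and the slow_page_table_walk placeholder below are all invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_SLOTS 8   /* real TLBs hold from a few dozen to a few thousand entries */

/* One cached translation: virtual page number -> physical frame. */
struct tlb_entry { uint32_t vpn, frame; bool valid; };
static struct tlb_entry tlb[TLB_SLOTS];

static uint32_t slow_page_table_walk(uint32_t vpn)
{
    /* Placeholder for the expensive "manual" page table lookup. */
    return vpn + 100;                 /* made-up frame number for the sketch */
}

static uint32_t lookup_frame(uint32_t vpn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_SLOTS];     /* direct-mapped slot  */
    if (e->valid && e->vpn == vpn)
        return e->frame;                             /* TLB hit: very cheap */

    uint32_t frame = slow_page_table_walk(vpn);      /* TLB miss: expensive */
    *e = (struct tlb_entry){ vpn, frame, true };     /* cache the result    */
    return frame;
}

int main(void)
{
    printf("frame for page 5: %u (miss)\n", (unsigned)lookup_frame(5));
    printf("frame for page 5: %u (hit)\n",  (unsigned)lookup_frame(5));
    return 0;
}
```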

Going back to page sizes: a larger page means fewer entries in the page table, and fewer entries means we get more out of the TLB. A smaller page means more virtual-to-physical mappings to store, which means more entries in the page table. Many entries can exceed the TLB's capacity, leading to a higher probability of the more expensive “manual” lookups.

But virtual memory doesn’t just simplify software development, it can also help you improve performance.

Wow! This is good, because so far I see only convenience at the price of performance. What’s next?

Next is the page cache.

Let’s imagine we have an application that periodically needs to read some information from disk. We already know what access time is and how it affects performance, so wouldn’t it be cool to have a smart way of caching that data in RAM instead of hitting the HDD/SSD each time? Don’t answer, of course it would! Your OS constantly tracks memory demand and can cache files for you when you have a lot of unused memory. Isn’t that cool? And you don’t even have to write any code for it.

When your program accesses a file, the page caching mechanism loads that file into the page cache (RAM) and returns the data to your program from there. The next attempt to read the same data is served from RAM instead of the disk, improving performance at the price of higher RAM usage.

The same happens when you write files. If a program frequently writes small chunks of data (e.g. logging), the data goes to the page cache first. The cached pages of the file are marked as “dirty” after the change, which means the cached version and the actual file on disk now differ, and the changes need to be flushed to disk. The page cache can batch several changes before flushing them; once flushed, the pages are marked as “clean” again (the disk version and the cached version are the same).
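Here is a minimal sketch of how this looks from user space on a POSIX system: write() normally just copies data into the page cache, dirtying some pages, and returns immediately, while fsync() forces those dirty pages to be flushed to disk. The file name is just an example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* "app.log" is just an example file name. */
    int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *line = "something happened\n";

    /* write() usually just copies the data into the page cache and returns;
       the affected pages are now "dirty" and the kernel flushes them later. */
    if (write(fd, line, strlen(line)) < 0) perror("write");

    /* fsync() asks the kernel to flush those dirty pages to disk right now,
       trading throughput for durability. */
    if (fsync(fd) < 0) perror("fsync");

    close(fd);
    return 0;
}
```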

And what if I read only the first part of a file? Loading the whole file into RAM would not only hurt performance, it could also consume an excessive amount of memory for big files.

Hold on, your OS is smart and can cache chunks of files too. In the image below, only the first part of the file was requested by the process. If the process needs to read that chunk again later, it will be served from the page cache instead of the disk, which results in a significant performance gain.

A chunk of a file on disk is loaded into the page cache (RAM) and then served to the process directly from RAM.

What is even better is that by combining the virtual-to-physical memory mapping with a file-to-memory mapping, we can let our programs read the page cache directly, eliminating the RAM-to-RAM copy as well.
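On POSIX systems the usual way to get this is mmap(): the file's pages in the page cache are mapped straight into the process's virtual address space, so reading them needs no extra copy into a user-space buffer, as a plain read() would. A minimal sketch, with the file name being just an example:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "data.bin" is just an example file name; it must exist and be non-empty. */
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the file into our virtual address space. Reads through `data`
       are served straight from the page cache, with no extra copy into
       a user-space buffer. MAP_SHARED also lets other processes map the
       same cached pages. */
    const unsigned char *data =
        mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte: 0x%02x\n", data[0]);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```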

Wow! This is cool!

Yes, it is! You know what is even cooler?

What?

Processes can share the same file. If your program requests a read from a file that is already in the cache, it will be served directly from the cache, which means, again, a significant performance gain.

2 different processes are reading the same file using the same page cache.

And in case you don’t have enough RAM for your programs, the OS can start moving memory to disk (the swap file) instead of keeping it in RAM, though this obviously has a negative impact on performance.

Hey, how can knowing this stuff help me with my programs if I don’t have control over it?

Good point. These techniques will work in your favor 99% of the time. But if you are designing I/O-heavy programs, they can become extremely annoying. And no, you do have control over this!

For database systems, keeping too much of the cache dirty is a concern because RAM, as we remember, is volatile: in case of a power loss, the changes sitting in the cache are lost forever. There are other reasons as well; here you can find a tuning recommendation for Oracle Database (Tuning the Page Cache).

Sometimes page caching and write batching can cause disk I/O jitter. What happens is that the OS accumulates a large portion of the write operations in RAM and then flushes them all at once. If the accumulated writes exceed the disk I/O capacity, the flush blocks all other disk I/O, degrading the performance of the whole system. This can be easily solved with some simple OS tuning. Easily, if you know these things :).
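On Linux this is usually tuned on the kernel side via the vm.dirty_* sysctls (for example vm.dirty_background_ratio and vm.dirty_ratio), which control how much dirty data may accumulate before the kernel starts writing it back; the articles linked further down go into detail. As a complementary, application-side illustration, here is a hedged sketch: instead of letting dirty pages pile up, the program flushes its own writes in small batches with fdatasync(). The chunk sizes and the file name are made-up values for the example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK       (64 * 1024)    /* 64 kB per write, made-up numbers   */
#define FLUSH_EVERY 16             /* flush after roughly 1 MB of writes */

int main(void)
{
    /* "big.out" is just an example output file. */
    int fd = open("big.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[CHUNK];
    memset(buf, 'x', sizeof buf);

    for (int i = 0; i < 256; i++) {            /* about 16 MB in total   */
        if (write(fd, buf, sizeof buf) < 0) { perror("write"); break; }

        /* Flush in small batches instead of letting the kernel
           accumulate a huge pile of dirty pages and dump it at once. */
        if (i % FLUSH_EVERY == FLUSH_EVERY - 1)
            fdatasync(fd);
    }

    close(fd);
    return 0;
}
```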

Here is an oversimplified graph to help you understand it better. The average load is the same, but tuning can give you a smoother I/O load profile.

On the red line, during the periods when disk I/O is fully saturated (100%), the system can’t handle anything additional and typically suffers from latency spikes. The smooth blue line behaves much better and may even allow your application to handle more load.

Simplified visualization of suboptimal page cache configuration (red line) and tuned page cache configuration (blue line).

If you are not convinced by these theoretical situations, there are real-life examples. Some excellent articles cover, for instance, a server running Kafka that was struggling with I/O jitter and sporadic latencies: Linux page cache tuning for Kafka. If that is not enough, take a quick look at the Aerospike guide for tuning the Linux kernel, which addresses the same problem. This should help you recognize and understand this kind of problem in the future. Even if you are a Java developer, there are libraries that can help you bypass OS caching and get maximum performance for your specific case.
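On Linux, one common way for such libraries (and for C programs directly) to bypass the page cache is the O_DIRECT flag: reads and writes then go straight to the device, at the cost of strict alignment requirements and losing all the caching benefits discussed above. A minimal, Linux-specific sketch, with the file name being just an example:

```c
#define _GNU_SOURCE                 /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Bypass the page cache entirely: writes go straight to disk.
       "raw.dat" is just an example file name. */
    int fd = open("raw.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* O_DIRECT requires the buffer, offset and size to be aligned
       (typically to 512 bytes or the 4 kB block size). */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 'x', 4096);

    if (write(fd, buf, 4096) < 0) perror("write");

    free(buf);
    close(fd);
    return 0;
}
```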

That’s it! Thank you for reading, I hope you found it useful. If you have any suggestions for future articles, feel free to leave a comment.
