Memory Mapped Files

Aniruddha
i0exception
3 min read · Feb 3, 2018

Memory mapping of files is a powerful abstraction that many operating systems support out of the box. Linux does this via the mmap system call. In most cases where an application reads from (or writes to) a file at arbitrary positions, mmap is a solid alternative to the more traditional read/write system calls. We’ve used it in the analytics database at Mixpanel to improve performance or make code more readable, and I wanted to spend some time figuring out what actually happens under the hood.

At a high level, the mmap system call lets you read and write to a file as if you were accessing an array in memory. There are two main modes in which files can be mapped: MAP_PRIVATE and MAP_SHARED. In MAP_PRIVATE, any changes that you make through the mapping stay in memory, in private copy-on-write pages, and are never written back to the file. In MAP_SHARED, changes are visible to other memory mappings of that file and are eventually committed to disk.
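To make the difference between the two modes concrete, here is a minimal sketch using Python's mmap module, which wraps the same system call on Linux. The function name and the temporary file are my own choices for illustration, not anything from the Mixpanel code.

```python
import mmap
import os
import tempfile


def demo_private_vs_shared():
    """Write through a MAP_SHARED and a MAP_PRIVATE mapping of one file."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")
        prot = mmap.PROT_READ | mmap.PROT_WRITE

        # MAP_SHARED: the write goes to the page cache and reaches the file.
        shared = mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=prot)
        shared[0:5] = b"HELLO"  # array-style slice assignment
        after_shared = os.pread(fd, 11, 0)

        # MAP_PRIVATE: the write lands on a private copy-on-write page.
        private = mmap.mmap(fd, 0, flags=mmap.MAP_PRIVATE, prot=prot)
        private[6:11] = b"WORLD"
        in_private = private[:]
        after_private = os.pread(fd, 11, 0)

        shared.close()
        private.close()
        return after_shared, in_private, after_private
    finally:
        os.close(fd)
        os.remove(path)
```

The first value shows the shared write as seen through a regular read of the file. The second shows the private mapping's own view, and the third shows that the file itself never received the private write.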

To understand what happens on calling mmap, it’s important to understand two things: how Linux handles files and how memory addressing works.

You can open a file for reading or writing using the open system call. This returns a file descriptor, which is an index into the calling process’s file descriptor table. The entry at that index points to a file struct that the kernel creates to represent the open file, including state such as the current offset. Internally, Linux uses the inode struct to represent the file itself. The file struct has a pointer to the inode, and Linux ensures that all file descriptors that refer to the same file point to the same inode so that their changes are visible to each other. The i_mapping field on the inode struct is what’s used to get the right set of pages from the page cache for an offset in the file.
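The kernel structs are not visible from user space, but their effect is. This rough sketch opens the same file twice and checks that both descriptors report the same inode number and see each other's writes. The helper name is hypothetical.

```python
import os
import tempfile


def demo_two_descriptors():
    """Open one file twice and show both descriptors reach the same inode."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd_a = os.open(path, os.O_RDWR)
    fd_b = os.open(path, os.O_RDWR)
    try:
        same_inode = os.fstat(fd_a).st_ino == os.fstat(fd_b).st_ino
        os.pwrite(fd_a, b"written via a", 0)
        # fd_b comes from a separate open() call with its own file offset,
        # but it reaches the same inode and so the same page cache pages.
        seen_by_b = os.pread(fd_b, 13, 0)
        return fd_a != fd_b, same_inode, seen_by_b
    finally:
        os.close(fd_a)
        os.close(fd_b)
        os.remove(path)
```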

In Linux, processes have a virtual memory address space that’s, well, virtual. This memory is not usually backed by physical memory unless you’re actually reading or writing to some part of it. Linux further divides the address space into equal-sized pages, and a page is the unit of access as far as the kernel is concerned. So, when a process calls mmap, the short answer is that nothing really happens. The kernel simply reserves some part of this virtual address space and returns the address. The do_mmap function is what eventually gets called after some bookkeeping, and it does most of the work of allocating this virtual memory in the process’s address space. This function stores a pointer to the file struct in the vm_area_struct that represents the mapped address range.
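One way to get a feel for this is to map a large sparse file and touch a single byte. The sketch below assumes the temporary directory sits on a filesystem that supports sparse files. The 256 MiB size and the function name are arbitrary choices of mine.

```python
import mmap
import os
import tempfile


def demo_large_mapping(size=256 * 1024 * 1024):
    """Map a large sparse file and touch only its last byte."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # sets the length without writing any data
        m = mmap.mmap(fd, size, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
        mapped_len = len(m)     # the whole range is addressable
        last_byte = m[size - 1]  # only the page holding this byte is touched
        m.close()
        return mapped_len, last_byte, mmap.PAGESIZE
    finally:
        os.close(fd)
        os.remove(path)
```

The mapping covers the full 256 MiB of address space, yet the process has only asked for one page of it. mmap.PAGESIZE reports the page size that the kernel works in.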

When the process accesses the address, a page fault occurs. The page fault handler locates the vm_area_struct in the process’s address space and eventually finds the pages in the page cache that map to the file offsets being accessed. These pages are marked as dirty if there’s a write and are mapped directly into user space, so there is no need to copy data from kernel to user space.
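Page faults can be observed, at least roughly, from user space. The sketch below reads the process's fault counters with getrusage before and after touching one byte in each page of a mapping. The function name and the page count are hypothetical. The number it reports will vary: the kernel may map several neighbouring pages per fault, the interpreter takes faults of its own, and some sandboxes do not report these counters at all.

```python
import mmap
import os
import resource
import tempfile


def count_faults_while_touching(pages=64):
    """Return (fault-count delta, sum of the bytes read) for a fresh mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * (pages * mmap.PAGESIZE))
        m = mmap.mmap(fd, 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)

        def faults():
            usage = resource.getrusage(resource.RUSAGE_SELF)
            return usage.ru_minflt + usage.ru_majflt

        before = faults()
        total = 0
        for page in range(pages):
            total += m[page * mmap.PAGESIZE]  # first touch of each page
        after = faults()
        m.close()
        return after - before, total
    finally:
        os.close(fd)
        os.remove(path)
```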

Once you’re done using the memory mapped area, the munmap system call can be used to remove the mapping. Any data written to the page cache is periodically committed to disk, although you can force it with msync.

While mmap is useful, it definitely has drawbacks. Misses in the page cache always result in the page being read into the cache, even if a write is going to overwrite the contents. The file offset passed to mmap must be a multiple of the page size. I/O errors are reported via signals (typically SIGBUS), because a plain memory access has no return value to carry an error code. And finally, you can’t mmap all types of file descriptors (pipes, for example). As usual, conditions apply, so make sure you don’t use mmap indiscriminately.
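A few of these points can be checked from Python, where flush() wraps msync and close() unmaps the region. This is a sketch under the assumption of a Linux system. The function name and the result keys are mine, and the exact exception type may differ between Python versions, so the code catches both OSError and ValueError.

```python
import mmap
import os
import tempfile


def demo_msync_and_limits():
    """Exercise msync, offset alignment, and mapping a pipe."""
    gran = mmap.ALLOCATIONGRANULARITY  # equals the page size on Linux
    fd, path = tempfile.mkstemp()
    results = {}
    try:
        os.ftruncate(fd, 2 * gran)
        prot = mmap.PROT_READ | mmap.PROT_WRITE

        # An aligned offset works: this maps the second half of the file.
        m = mmap.mmap(fd, gran, flags=mmap.MAP_SHARED, prot=prot, offset=gran)
        m[0:4] = b"data"
        m.flush()  # msync: force the dirty page out now
        m.close()  # munmap
        results["on_disk"] = os.pread(fd, 4, gran)

        # An offset that is not a multiple of the granularity is rejected.
        try:
            mmap.mmap(fd, gran, flags=mmap.MAP_SHARED, prot=prot, offset=1)
            results["unaligned_ok"] = True
        except (OSError, ValueError):
            results["unaligned_ok"] = False

        # A pipe has no backing pages to map, so mmap refuses it.
        read_end, write_end = os.pipe()
        try:
            mmap.mmap(read_end, gran, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
            results["pipe_ok"] = True
        except (OSError, ValueError):
            results["pipe_ok"] = False
        finally:
            os.close(read_end)
            os.close(write_end)
        return results
    finally:
        os.close(fd)
        os.remove(path)
```

I left the signal-based error case out on purpose. Provoking a SIGBUS, for example by truncating the file under a live mapping, would kill the interpreter rather than raise an exception.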

Aniruddha
i0exception

Currently, eng @mixpanel. Previously @twitter, @google