Memory Mapped Files
Memory mapping of files is a very powerful abstraction that many operating systems support out of the box. Linux does this via the mmap system call. In most cases where an application reads (or writes) to a file at arbitrary positions, using mmap is a solid alternative to the more traditional read/write system calls. We’ve used it in the analytics database at Mixpanel to improve performance or make code more readable and I wanted to spend some time figuring out what actually happens under the hood.
At a high level, the mmap system call lets you read and write to a file as if you were accessing an array in memory. There are two main modes in which files can be mapped: MAP_PRIVATE and MAP_SHARED. In MAP_PRIVATE, any changes that you make go to a private, copy-on-write copy of the affected pages in memory; they are not visible to other mappings and are never written back to the file. In MAP_SHARED, changes are visible to other memory mappings of that file and are eventually committed to disk.
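The difference between the two modes is easy to observe from user space. The following is a minimal sketch using Python's standard mmap module on a Unix system; the helper name write_through_mapping and the temporary file contents are made up for illustration.

```python
import mmap
import os
import tempfile


def write_through_mapping(flags):
    """Map a small temp file with the given flags, overwrite its first
    five bytes through the mapping, and return what the file contains."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello world")
        f.flush()
        # Length 0 maps the whole file.
        m = mmap.mmap(f.fileno(), 0, flags=flags,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            m[0:5] = b"HELLO"            # slice assignment, like an array
            assert m[0:5] == b"HELLO"    # the mapping itself sees the change
        finally:
            m.close()
        # Read the file through an ordinary system call, not the mapping.
        return os.pread(f.fileno(), 11, 0)
```

Calling it with mmap.MAP_SHARED returns the modified bytes, while mmap.MAP_PRIVATE leaves the file as it was, because the write only touched a private copy of the page.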
To understand what happens on calling mmap, it’s important to understand two things: how Linux handles files and how memory addressing works.
You can open a file for reading or writing using the open system call. This returns a file descriptor. Linux maintains a file descriptor table for each process and adds an entry to it for the opened file. The entry points to a file struct, which represents this particular open of the file and tracks state such as the current offset. Internally, Linux uses the inode struct to represent the file itself. The file struct has a pointer to the inode, and Linux ensures that multiple file descriptors that touch the same file point to the same inode so that their changes are visible to each other. The i_mapping field on the inode struct is what’s used to get the right set of pages from the page cache for an offset in the file.
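The kernel structs are not visible from user space, but the effect of the shared inode is. This sketch opens the same file twice and compares the inode numbers reported by fstat; the function name two_descriptors_one_inode is hypothetical, and the inode number is only the user-visible identifier, not the in-kernel inode struct itself.

```python
import os
import tempfile


def two_descriptors_one_inode():
    """Open one file twice and check that both descriptors share an inode."""
    fd, path = tempfile.mkstemp()              # first open, read-write
    try:
        other = os.open(path, os.O_RDONLY)     # second, independent open
        try:
            os.write(fd, b"shared")
            same_inode = os.fstat(fd).st_ino == os.fstat(other).st_ino
            # Data written through one descriptor is readable via the other.
            data = os.pread(other, 6, 0)
            return fd != other, same_inode, data
        finally:
            os.close(other)
    finally:
        os.close(fd)
        os.unlink(path)
```

The two descriptors are different numbers, yet they report the same inode and see the same data.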
In Linux, processes have a virtual memory address space that’s, well, virtual. This memory is not usually backed by physical memory unless you’re actually reading or writing to some part of it. Linux further divides the address space into equal-sized pages, and a page is the unit of access as far as the kernel is concerned. So, when a process calls mmap, the short answer is that nothing really happens. The kernel simply reserves some part of this virtual address space and returns the address. The do_mmap function is what eventually gets called after some bookkeeping and does most of the work of allocating this virtual memory in the process’s address space. This function stores a pointer to the file struct in the vm_area_struct struct that represents the mapped address range.
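One rough way to see that the mapping itself is cheap is to count the process's page faults around the mmap call and again after touching the mapped memory. The sketch below assumes a Unix system where getrusage reports fault counters; the helper names and the 64 MiB / 1 MiB sizes are arbitrary choices for illustration. On a typical Linux kernel the first count should be close to zero and the second at least the number of pages touched, but the exact numbers depend on the kernel (readahead, fault-around) and are not guaranteed.

```python
import mmap
import os
import resource
import tempfile


def _faults():
    """Total page faults (minor + major) taken by this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt + usage.ru_majflt


def faults_for_map_and_touch(size=64 * 1024 * 1024, stride=1024 * 1024):
    """Return (faults during mmap, faults while touching, pages touched)."""
    with tempfile.TemporaryFile() as f:
        os.ftruncate(f.fileno(), size)     # sparse file, no data written
        before_map = _faults()
        m = mmap.mmap(f.fileno(), size, flags=mmap.MAP_SHARED,
                      prot=mmap.PROT_READ)
        after_map = _faults()
        try:
            touched = 0
            for offset in range(0, size, stride):
                _ = m[offset]              # first access to a far-apart page
                touched += 1
            after_touch = _faults()
        finally:
            m.close()
    return after_map - before_map, after_touch - after_map, touched
```

Touching pages that are far apart keeps one fault from quietly mapping its neighbours, which makes the difference between the two counts easier to see.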
When the process accesses the address, a page fault occurs. The page fault handler locates the vm_area_struct struct in the process’s address space and eventually finds the pages in the page cache that map to the file offsets being accessed, reading them from disk first if they are not already cached. These pages are mapped directly into user space, and marked as dirty if there’s a write, so there is no need to copy data from kernel to user space.
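Because a shared mapping points at the same page cache pages that read and write use, a change made with an ordinary write system call should show up through a mapping that already exists. The sketch below checks this on a Unix system; the function name is hypothetical, and it demonstrates the observable behaviour rather than the kernel code path.

```python
import mmap
import os
import tempfile


def write_then_read_through_mapping():
    """Map a file, change it with pwrite(), and read it back through
    the mapping that was created before the write."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.pwrite(fd, b"aaaa", 0)
        m = mmap.mmap(fd, 4, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
        try:
            before = m[0:4]
            os.pwrite(fd, b"bbbb", 0)      # ordinary write system call
            after = m[0:4]                 # no remap, no explicit read()
        finally:
            m.close()
    return before, after
```

The second read through the mapping sees the new bytes without any explicit copy from kernel to user space.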
Once you’re done using the memory mapped area, the munmap system call can be used to free up the memory. Any data written to the page cache is periodically committed to disk, although you can force it with msync.
While mmap is useful, it definitely has drawbacks. Misses in the page cache always result in the page being read into the cache even if a write is going to overwrite the contents. File offsets passed to mmap need to be aligned to page boundaries. I/O errors on a mapped page are delivered as signals (SIGBUS on Linux) because a plain memory access has no return value that could carry an error code. And finally, you can’t mmap all types of file descriptors (pipes, for example). As usual, conditions apply, so make sure you don’t use mmap indiscriminately.
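Here is a small sketch that ties the cleanup calls and two of these limitations together, again using Python's mmap module on a Unix system. In that module, flush() wraps msync and close() wraps munmap. The names patch_file, is_rejected and demo are hypothetical helpers, not part of any library.

```python
import mmap
import os
import tempfile


def patch_file(path, offset, data):
    """Overwrite len(data) bytes at an arbitrary offset through a mapping."""
    granularity = mmap.ALLOCATIONGRANULARITY     # the page size on Unix
    aligned = offset - (offset % granularity)    # round down to a boundary
    delta = offset - aligned
    fd = os.open(path, os.O_RDWR)
    try:
        m = mmap.mmap(fd, delta + len(data), flags=mmap.MAP_SHARED,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE, offset=aligned)
        try:
            m[delta:delta + len(data)] = data
            m.flush()        # msync: push the dirty pages to disk now
        finally:
            m.close()        # munmap: release the mapping
    finally:
        os.close(fd)


def is_rejected(fd, length, offset=0):
    """Return True if creating the mapping fails."""
    try:
        mmap.mmap(fd, length, offset=offset).close()
    except (OSError, ValueError):
        return True
    return False


def demo():
    page = mmap.ALLOCATIONGRANULARITY
    fd, path = tempfile.mkstemp()
    read_end, write_end = os.pipe()
    try:
        os.write(fd, b"x" * (2 * page))
        patch_file(path, page + 3, b"abc")       # unaligned logical offset
        return {
            "patched": os.pread(fd, 3, page + 3),
            "unaligned_rejected": is_rejected(fd, 16, offset=1),
            "aligned_rejected": is_rejected(fd, 16, offset=0),
            "pipe_rejected": is_rejected(read_end, 16),
        }
    finally:
        for descriptor in (fd, read_end, write_end):
            os.close(descriptor)
        os.unlink(path)
```

Rounding the offset down and indexing from the remainder is one common way to work around the alignment rule. The sketch does not cover the signal-based error reporting, which is much harder to demonstrate safely.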