A primer on Linux filesystems

Shashank Jain
5 min read · Aug 11, 2018

In this blog we try to understand the Linux filesystem mechanism, give an overview of some of the data structures involved, and focus mainly on block-device-based filesystems.

The Linux philosophy is to treat everything as a file. As an example, sockets, pipes, and block devices are all represented as files.

Filesystems in Linux act as containers that abstract the underlying storage in the case of block devices. For non-block devices like sockets and pipes, there are in-memory filesystems whose operations can be invoked using the standard filesystem API.
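As a quick illustration of the file abstraction, here is a minimal sketch: a pipe is created, fstat(2) confirms the kernel sees it as a FIFO, and the same read/write calls used for regular files move data through it:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) == -1) { perror("pipe"); return 1; }

    /* fstat works on any file descriptor; for a pipe, st_mode says FIFO. */
    struct stat st;
    if (fstat(fds[0], &st) == -1) { perror("fstat"); return 1; }
    printf("read end is a FIFO: %s\n", S_ISFIFO(st.st_mode) ? "yes" : "no");

    /* The same read/write API used for regular files works on the pipe. */
    const char msg[] = "hello through a pipe\n";
    if (write(fds[1], msg, strlen(msg)) == -1) { perror("write"); return 1; }

    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
    if (n == -1) { perror("read"); return 1; }
    buf[n] = '\0';
    printf("read back: %s", buf);

    close(fds[0]);
    close(fds[1]);
    return 0;
}
```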

Linux abstracts all filesystems behind a layer called the Virtual File System (VFS). Every filesystem registers itself with the VFS. The VFS has the following important data structures:

1. File — This represents an open file and captures information like the current offset. Userspace holds a handle to an opened file via a structure called a file descriptor, which is the handle used to interface with the filesystem.

2. Inode — This maps 1:1 to a file. The inode is one of the most critical structures and holds the metadata about the file: for example, which data blocks the file's data is stored on and what access permissions the file has. Inodes are also stored on disk by the specific filesystem, but there is an in-memory representation that is part of the VFS layer, and the filesystem is responsible for populating the VFS inode structure.

3. Dentry — This is the mapping between a filename and an inode. It is an in-memory structure, not stored on disk, and is mainly relevant for lookup and path traversal.

4. Superblock — This structure holds information about the filesystem as a whole: how many blocks there are, the device name, and so on. It is read from disk and brought into memory during a mount operation.

Each of the above data structures holds pointers to its specific operations. As an example, the file has file_ops for reading and writing, and the superblock has operations via super_ops for mount, unmount, and so on. Some of the inode- and superblock-level information is visible from userspace, as the sketch below shows.
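Here is a minimal userspace sketch (assuming Linux and a C compiler) that reads inode metadata with stat(2) and superblock-level filesystem info with statvfs(3):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <path>\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* Inode-level metadata: the kernel fills this from the VFS inode. */
    struct stat st;
    if (stat(argv[1], &st) == -1) { perror("stat"); return EXIT_FAILURE; }
    printf("inode number : %ju\n", (uintmax_t)st.st_ino);
    printf("permissions  : %o\n", st.st_mode & 0777);
    printf("size (bytes) : %jd\n", (intmax_t)st.st_size);
    printf("512B blocks  : %jd\n", (intmax_t)st.st_blocks);

    /* Superblock-level info for the filesystem holding the path. */
    struct statvfs vfs;
    if (statvfs(argv[1], &vfs) == -1) { perror("statvfs"); return EXIT_FAILURE; }
    printf("block size   : %lu\n", vfs.f_bsize);
    printf("total blocks : %ju\n", (uintmax_t)vfs.f_blocks);
    printf("free blocks  : %ju\n", (uintmax_t)vfs.f_bfree);
    return EXIT_SUCCESS;
}
```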

To give a basic understanding of what happens during a mount operation:

The mount operation creates a vfsmount data structure, which holds a reference to a new superblock structure created from the filesystem being mounted on the disk. The dentry for the mount point holds a reference to the vfsmount, and this is how the VFS distinguishes a plain directory from a mount point. During a traversal, if a vfsmount is found in a dentry, the root inode of the mounted device is used (on ext-style filesystems, inode 2 is reserved for the root directory).
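From userspace, all of this hides behind a single system call. A hedged sketch (the device /dev/sdb1 and mount point /mnt/data are hypothetical, and this needs root):

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Hypothetical device and mount point; adjust for your system.
     * Must be run as root (CAP_SYS_ADMIN). */
    if (mount("/dev/sdb1", "/mnt/data", "ext4", 0, NULL) == -1) {
        perror("mount");
        return 1;
    }
    printf("mounted /dev/sdb1 on /mnt/data\n");

    /* Tear it down again. */
    if (umount("/mnt/data") == -1) {
        perror("umount");
        return 1;
    }
    return 0;
}
```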

So how does this all fit together in the case of, say, a block device?

A userspace process makes a call to, say, read a file.

The system call goes to the kernel. The VFS checks the path and looks for cached dentries, starting from, say, the root. As it traverses and finds the right dentry, it locates the inode for the file to be opened. Once the inode is located, permissions are checked and the data blocks are loaded from disk into the OS page cache. The data is then copied into the userspace buffer of the process.
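In code, this whole path hides behind two system calls. A minimal sketch (the file name here is just an example; any readable file works):

```c
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    /* open(2) walks dentries, finds the inode, checks permissions,
     * and hands back a file descriptor. */
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    /* read(2) is served from the page cache; the kernel fills the
     * cache from disk first if the pages are not already resident. */
    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n == -1) { perror("read"); return 1; }
    buf[n] = '\0';
    printf("read %zd bytes: %s", n, buf);

    close(fd);
    return 0;
}
```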

The page cache is an interesting optimization in the OS. All reads and writes (except direct I/O) go through the page cache. The page cache itself is represented by a data structure called the address_space. The address_space holds a tree of memory pages, and the file's inode holds a reference to that address_space data structure.

This address_space mapping is how a file's pages make it into the page cache. It is also key to understanding how operations like mmap for memory-mapped files work. We will cover that when we cover filesystems like tmpfs and shared-memory IPC primitives.

So whenever a file read request comes in, if the data is in the page cache (which the kernel determines via the address_space structure of the file's inode), it is served from there.
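Page-cache residency can actually be observed from userspace with mincore(2). A minimal sketch that maps a file and counts how many of its pages are currently resident in the page cache:

```c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }
    if (st.st_size == 0) { fprintf(stderr, "empty file\n"); return 1; }

    /* Map the file; the mapping is backed by the inode's address_space. */
    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + page - 1) / page;
    unsigned char *vec = malloc(pages);
    if (!vec) { perror("malloc"); return 1; }

    /* mincore(2) reports, per page, whether it is resident in memory. */
    if (mincore(addr, st.st_size, vec) == -1) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        if (vec[i] & 1)
            resident++;
    printf("%zu of %zu pages resident in page cache\n", resident, pages);

    free(vec);
    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```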

Whenever a write call is made on a file via its file descriptor, the write first lands in the page cache. The memory pages are marked dirty, and the Linux kernel then uses a write-back cache mechanism: background threads (historically called pdflush; newer kernels use per-device flusher threads) drain the page cache and, via the block driver, write to the physical disk. Dirty tracking is not done only at the page level, since a page is, for example, 4 KB in size, and even a minimal change would then cause a full page write. To avoid that, there are finer-grained structures that each represent a disk block in memory, called buffer heads. So, for example, if the block size is 512 bytes, there are 8 buffer heads per page of the page cache.

Now individual blocks can be marked dirty and included in the writes. We will probably cover how vectored I/O works and what optimizations the kernel applies in another blog.
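The dirty-page accounting is visible system-wide in /proc/meminfo. A rough sketch (the temp file path is arbitrary, and the Dirty counter is system-wide, so other activity adds noise):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Print the system-wide Dirty: line from /proc/meminfo. */
static void show_dirty(const char *label)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    if (!f) { perror("fopen"); exit(1); }
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "Dirty:", 6) == 0)
            printf("%s %s", label, line);
    fclose(f);
}

int main(void)
{
    show_dirty("before:");

    /* Write 8 MB; it lands in the page cache and the pages are marked dirty. */
    int fd = open("/tmp/dirty-demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }
    char buf[4096];
    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < 2048; i++)
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); return 1; }
    close(fd);

    show_dirty("after: ");  /* Dirty typically jumps until writeback drains it. */
    unlink("/tmp/dirty-demo");
    return 0;
}
```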

The buffers can be explicitly flushed to disk via these system calls (a usage sketch follows the list):

1. sync() — flushes all dirty buffers to disk.

2. fsync(fd) — flushes only the file's dirty buffers to disk, including changes to the inode.

3. fdatasync(fd) — flushes only the file's dirty data buffers to disk. It does not flush inode metadata such as timestamps (though metadata needed to retrieve the data, like file size, is still written).
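A minimal usage sketch of all three (the file path is arbitrary):

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/flush-demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }

    const char msg[] = "durable bytes\n";
    if (write(fd, msg, strlen(msg)) == -1) { perror("write"); return 1; }

    /* fdatasync: push the data blocks (and metadata needed to read them,
     * like file size) out of the page cache to disk. */
    if (fdatasync(fd) == -1) { perror("fdatasync"); return 1; }

    /* fsync: additionally flushes inode metadata such as timestamps. */
    if (fsync(fd) == -1) { perror("fsync"); return 1; }

    close(fd);

    /* sync: kicks off writeback of all dirty buffers system-wide. */
    sync();
    return 0;
}
```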

So, as an example, this is roughly how sync works:

1. Check if the superblock is dirty

2. Write back the superblock

3. Iterate over each inode in the inode list

a. If the inode is dirty, write it back

b. If the page cache of the inode is dirty, write it back

c. Clear the dirty flags

In the next blog we will cover memory-mapped files and also the non-block-device-based filesystems like procfs. Stay tuned.

Follow-up article on pseudo filesystems: https://medium.com/@jain.sm/pseudo-file-systems-in-linux-5bf67eb6e450

Disclaimer: The views expressed above are personal and not those of the company I work for.
