Linux Beyond the Basics: How Linux Reads and Writes Files
Page Cache, Clean Pages, Dirty Pages
This blog post is part of the series Linux Beyond the Basics.
Introduction
Linux, being the powerhouse operating system it is, handles file operations with an intricate blend of efficiency and robustness. If you’ve ever wondered how Linux reads and writes files, and what’s happening behind the scenes, let’s dive in and explore the fascinating world of Page Cache, Clean and Dirty File Pages, and some handy tools for peeking into the dynamics.
Page Cache: The Speed Demon
At the heart of Linux’s file I/O magic lies the Page Cache. Imagine it as a high-speed buffer zone residing in memory, where frequently accessed file data is stored.
For Reads: When a program requests data from a file, the kernel first checks the page cache to see if the data is already available in memory. If it is, the data is returned directly from the cache, which is much faster than reading it from the hard disk.
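For Writes: When a program writes data to a file, the data is first written to the page cache. The modified pages in memory are marked as “dirty” and are flushed to disk later by the kernel. Because the data already lives in kernel memory, it survives a crash of the program that wrote it. Dirty pages that have not yet been written back can still be lost if the whole system crashes or loses power, so programs that need durability call fsync() (or open the file with O_SYNC) to force the data to disk.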
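The gap between a write reaching the page cache and reaching the disk can be closed explicitly with fsync. A minimal sketch, assuming Python on a POSIX system; durable_write and the file name are hypothetical names chosen for illustration:

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and return only after the kernel has been asked to persist it."""
    with open(path, "wb") as f:
        f.write(data)          # bytes sit in Python's userspace buffer
        f.flush()              # hand them to the kernel: pages are now dirty in the page cache
        os.fsync(f.fileno())   # ask the kernel to write those dirty pages to disk


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.dat")
        durable_write(target, b"hello page cache\n")
        with open(target, "rb") as f:
            print(f.read())
```

Without the fsync call the write would still succeed and be visible to other processes right away; it would simply stay dirty in memory until the kernel's own writeback got to it.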
The operating system manages the page cache by caching, re-using, and evicting pages as needed, depending on the workload. The kernel is free to use otherwise-idle physical memory for the page cache, and it gives that memory back when applications need it.
Using the page cache for both reads and writes improves data processing and system performance. For example, if you read a large file twice in a row, the second access will be much faster because the data is already in the page cache.
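The read-twice effect can be tried out with a short script. This is a sketch under assumptions: the helper names timed_read and drop_from_cache are made up for this post, the 16 MiB size is arbitrary, and posix_fadvise with POSIX_FADV_DONTNEED is only a hint that the kernel may ignore, so the first read is not guaranteed to come from disk.

```python
import os
import tempfile
import time


def timed_read(path: str) -> tuple[int, float]:
    """Read the whole file in 1 MiB chunks; return (bytes read, seconds taken)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            total += len(chunk)
    return total, time.perf_counter() - start


def drop_from_cache(path: str) -> None:
    """Hint that this file's cached pages are no longer needed (advisory only)."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped, so flush them first
        if hasattr(os, "posix_fadvise"):  # not available on every platform
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "big.bin")
        with open(path, "wb") as f:
            f.write(b"\0" * (16 << 20))
        drop_from_cache(path)
        size, first = timed_read(path)
        _, second = timed_read(path)
        print(f"{size} bytes: first read {first:.4f}s, second read {second:.4f}s")
```

On a temporary directory backed by tmpfs the file never touches a disk at all, so the two timings will look alike; pointing the script at a file on a real filesystem gives a more telling comparison.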
Clean vs Dirty: The Page’s Tale
- Clean Pages: These are pages in the Page Cache whose content matches what’s on disk. They are eligible for eviction if memory pressure arises.
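- Dirty Pages: These pages have been modified in memory but haven’t yet been written back to disk. Kernel flusher threads periodically write these dirty pages to disk, which limits how much data can be lost in a crash. Older kernels used the pdflush daemon for this; since kernel 2.6.32 it has been replaced by per-device flusher threads (flush-x, for example flush-8:0), which on current kernels run as kworker writeback threads.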
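The amount of dirty data waiting for writeback is visible in /proc/meminfo, in the Dirty and Writeback fields. A small sketch of reading them, assuming a Linux system; parse_meminfo is a hypothetical helper name, and it takes the file's text as an argument so it can also be tried on a pasted sample:

```python
import os


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


if __name__ == "__main__":
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        for key in ("Cached", "Dirty", "Writeback"):
            print(f"{key:>10}: {info.get(key, 0)} kB")
```

Running it in a loop while copying a large file should show Dirty climbing and then draining as writeback catches up.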
Reclaiming Pages Under Memory Pressure
When the system is running low on memory, Linux employs various strategies to reclaim pages from the Page Cache, making room for new allocations.
- Clean Page Eviction: Clean pages, being readily available on disk, are the first candidates for eviction. The kernel simply discards them from the Page Cache.
- Dirty Page Writeback: Dirty pages need to be written back to disk before they can be evicted. The kernel triggers writeback operations to free up these pages.
- Least Recently Used (LRU): The kernel keeps Page Cache pages on LRU-style lists (split into active and inactive lists rather than one strict LRU queue). When reclaiming pages, it tends to evict the least recently used ones, assuming they are less likely to be needed in the near future.
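The three ideas above can be combined in a toy model. To be clear, this is not the kernel's algorithm: it is a deliberately simplified sketch with a single strict LRU order, and ToyPageCache and its methods are hypothetical names. It prefers to drop the least recently used clean page, and only when every page is dirty does it "write back" the oldest one before evicting it.

```python
from collections import OrderedDict


class ToyPageCache:
    """A toy cache of page numbers with clean/dirty state and LRU eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pages = OrderedDict()  # page number -> dirty flag, oldest first
        self.writebacks = []        # pages that had to be written back, in order

    def read(self, page: int) -> None:
        self._touch(page, dirty=False)

    def write(self, page: int) -> None:
        self._touch(page, dirty=True)

    def _touch(self, page: int, dirty: bool) -> None:
        was_dirty = self.pages.pop(page, False)
        self.pages[page] = was_dirty or dirty  # re-insert as most recently used
        while len(self.pages) > self.capacity:
            self._evict_one()

    def _evict_one(self) -> None:
        # Cheapest option: discard the least recently used clean page.
        victim = next((p for p, d in self.pages.items() if not d), None)
        if victim is not None:
            del self.pages[victim]
            return
        # Everything is dirty: write back the oldest page, then drop it.
        victim, _ = self.pages.popitem(last=False)
        self.writebacks.append(victim)


if __name__ == "__main__":
    cache = ToyPageCache(capacity=2)
    cache.write(1)
    cache.read(2)
    cache.read(3)   # over capacity: clean page 2 is dropped, dirty page 1 stays
    print(list(cache.pages.items()), cache.writebacks)
```

Note how the dirty page survives longer than an older LRU position alone would suggest; evicting it costs a disk write, while dropping a clean page costs nothing.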
The Metrics: A Peek Under the Hood
Linux offers a treasure trove of tools to monitor the Page Cache and file I/O dynamics at both the process and system levels. Here are a few essentials:
System-Level Metrics
- free -m: This command provides a quick overview of system memory usage in MiB, including the memory used by the page cache (the buff/cache column).
- /proc/meminfo: This file provides a wealth of information about memory usage, including Page Cache details such as the Cached, Dirty, and Writeback fields.
- vmstat: This command displays virtual memory statistics, giving you insights into paging activity, including blocks read in (bi) and written out (bo).
- iostat: Use this command to get per-device disk I/O statistics, helping you track read and write operations.
Process-Level Metrics
- /proc/[pid]/pagemap: For a specific process, this file holds one 64-bit entry per virtual page. Each entry tells you whether the page is present in RAM or swapped out, which physical frame it maps to (visible only to privileged users on modern kernels), and whether it is a file-backed or shared-anonymous page, which is how Page Cache pages show up.
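The pagemap file is binary, so a little decoding is needed. The sketch below follows the bit layout in the kernel's pagemap documentation (bit 63 present, bit 62 swapped, bit 61 file-page or shared-anon, bits 0-54 the page frame number when present). The function names are hypothetical, and the demo at the bottom assumes a Linux system where the process may read its own pagemap.

```python
import ctypes
import os
import sys

PFN_MASK = (1 << 55) - 1  # bits 0-54 hold the page frame number when present


def decode_pagemap_entry(entry: int) -> dict:
    """Split one 64-bit pagemap entry into its main flags."""
    present = bool((entry >> 63) & 1)
    return {
        "present": present,
        "swapped": bool((entry >> 62) & 1),
        "file_or_shared_anon": bool((entry >> 61) & 1),
        # Unprivileged readers see a PFN of 0 on modern kernels.
        "pfn": (entry & PFN_MASK) if present else None,
    }


def read_pagemap_entry(pid: str, vaddr: int) -> int:
    """Read the raw entry for the page containing virtual address vaddr."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    offset = (vaddr // page_size) * 8  # one 8-byte entry per virtual page
    with open(f"/proc/{pid}/pagemap", "rb", buffering=0) as f:
        f.seek(offset)
        raw = f.read(8)
    if raw is None or len(raw) != 8:
        raise OSError("short read from pagemap")
    return int.from_bytes(raw, sys.byteorder)


if __name__ == "__main__":
    if os.path.exists("/proc/self/pagemap"):
        buf = ctypes.create_string_buffer(4096)  # some memory we own
        try:
            entry = read_pagemap_entry("self", ctypes.addressof(buf))
            print(decode_pagemap_entry(entry))
        except OSError as exc:
            print("could not read pagemap:", exc)
```

The buffer in the demo is ordinary anonymous memory, so the file_or_shared_anon flag should come back false; pointing the reader at an address inside an mmap'd file is the case where it would be set.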
Example: Witnessing the Dynamics
Let’s say you’re running a text editor and open a large file. Here’s what might happen:
- Initial Read: The first time you access parts of the file, Linux reads them from disk and stores them in the Page Cache. You might observe some disk read activity using iostat.
- Subsequent Reads: If you access those same parts again, the data is readily available in the Page Cache, resulting in lightning-fast reads. vmstat should show little or no block-in (bi) activity.
- Editing: As you make changes and the editor writes them to the file, the corresponding pages in the Page Cache become dirty. The Dirty field in /proc/meminfo might show an increase.
- Background Writeback: Periodically, the kernel’s flusher threads (pdflush on older kernels) write the dirty pages back to disk, making your changes persistent. iostat might reveal disk write activity.
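The same sequence can be followed from a script by sampling kernel counters before and after each step. The sketch below reads /proc/vmstat, where nr_dirty and nr_writeback are page counts and pgpgin/pgpgout are cumulative paging counters. The helper names and the 8 MiB test write are assumptions made for illustration, and on a busy machine other processes will add noise to the numbers.

```python
import os
import tempfile

WATCHED = ("nr_dirty", "nr_writeback", "pgpgin", "pgpgout")


def parse_vmstat(text: str) -> dict[str, int]:
    """Parse /proc/vmstat-style 'name value' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def delta(before: dict, after: dict, keys=WATCHED) -> dict[str, int]:
    """Change in each watched counter between two snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


def snapshot() -> dict[str, int]:
    with open("/proc/vmstat") as f:
        return parse_vmstat(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/vmstat"):
        with tempfile.NamedTemporaryFile() as tmp:
            start = snapshot()
            tmp.write(b"\0" * (8 << 20))
            tmp.flush()                  # data is now dirty in the page cache
            after_write = snapshot()
            os.fsync(tmp.fileno())       # force writeback
            after_sync = snapshot()
        print("after write:", delta(start, after_write))
        print("after fsync:", delta(after_write, after_sync))
```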
Takeaway
Understanding how Linux reads and writes files, along with the role of the Page Cache and Clean/Dirty pages, empowers you to optimize your system’s performance and troubleshoot potential issues. Armed with the knowledge of essential monitoring tools, you can delve deeper into the inner workings of Linux and gain a newfound appreciation for its elegant design. Happy exploring!