Does NFS I/O count towards iowait% on linux?

or: distrust everything you see on a performance graph.

Jon Philpott
Ticketmaster Technology

--

tl;dr: Yes.

Oftentimes I am required to look at system performance graphs in order to diagnose system performance problems. One of the biggest challenges in interpreting these graphs is knowing exactly what each metric means — what is load_one, exactly? What is CPU idle? In this case we’re looking at iowait%, and the question: does iowait% include NFS operations?

Let the Journey Begin.

A healthy distrust of what we read on the internet leads us to check the kernel source (RTFKS.)

Caveat lector: this is kernel source 2.6.32.

Working backwards, how is iowait time accounted?

(from kernel/sched.c)

void account_idle_time(cputime_t cputime)
{
        struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
        cputime64_t cputime64 = cputime_to_cputime64(cputime);
        struct rq *rq = this_rq();

        if (atomic_read(&rq->nr_iowait) > 0)
                cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
        else
                cpustat->idle = cputime64_add(cpustat->idle, cputime64);
}
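The cpustat->iowait counter being bumped here is what eventually surfaces to userspace as the fifth field of the cpu line in /proc/stat (in USER_HZ ticks), which is where tools like vmstat, sar and your graphing agent get iowait% from. As a quick illustration, here is a minimal userspace sketch that reads it (error handling deliberately skimpy):

/* Minimal sketch: read the aggregate iowait counter from /proc/stat.
 * The "cpu" line fields are, in order: user, nice, system, idle,
 * iowait, irq, softirq, ... (all in USER_HZ ticks). */
#include <stdio.h>

int main(void)
{
        unsigned long long user, nice, sys, idle, iowait;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
                return 1;
        if (fscanf(f, "cpu %llu %llu %llu %llu %llu",
                   &user, &nice, &sys, &idle, &iowait) == 5)
                printf("iowait ticks: %llu\n", iowait);
        fclose(f);
        return 0;
}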

And how is nr_iowait increased? From io_schedule:

(also from kernel/sched.c)

/*
 * This task is about to go to sleep on IO. Increment rq->nr_iowait so
 * that process accounting knows that this is a task in IO wait state.
 */
void __sched io_schedule(void)
{
        struct rq *rq = raw_rq();

        delayacct_blkio_start();
        atomic_inc(&rq->nr_iowait);
        current->in_iowait = 1;
        schedule();
        current->in_iowait = 0;
        atomic_dec(&rq->nr_iowait);
        delayacct_blkio_end();
}

From the code comment and the code itself it is quite clear how this function is used: I need to wait on some I/O, so I account for it and then call schedule() to bring in a new task while I wait.
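Note that io_schedule() itself only does the accounting around schedule(); the caller supplies the actual wait condition. A hypothetical sketch of the usual pattern, where my_request, its done flag and waitq are illustrative names rather than anything from the kernel tree:

#include <linux/wait.h>
#include <linux/sched.h>

/* Illustrative structure: some I/O request with a completion flag and
 * a wait queue (initialised elsewhere with init_waitqueue_head()). */
struct my_request {
        int done;                   /* set by the completion handler   */
        wait_queue_head_t waitq;    /* wake_up() target on completion  */
};

/* Sleep until the request completes; time asleep counts as iowait. */
static void my_wait_for_request(struct my_request *req)
{
        DEFINE_WAIT(wait);

        for (;;) {
                prepare_to_wait(&req->waitq, &wait, TASK_UNINTERRUPTIBLE);
                if (req->done)
                        break;
                io_schedule();      /* bumps rq->nr_iowait while we sleep */
        }
        finish_wait(&req->waitq, &wait);
}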

NFS.

So how is io_schedule() used by NFS? Let’s take a look at fs/nfs/pagelist.c which describes itself as:

/*
 * linux/fs/nfs/pagelist.c
 *
 * A set of helper functions for managing NFS read and write requests.
 * The main purpose of these routines is to provide support for the
 * coalescing of several requests into a single RPC call.
 *
 * Copyright 2000, 2001 (c) Trond Myklebust <trond.myklebust@fys.uio.no>
 *
 */

Yep — seems like this is the right place to be. Just a quick note: I’m not trying to show off my amazing knowledge of the kernel, but rather my awesome usage of grep, which brought me here:

static int nfs_wait_bit_uninterruptible(void *word)
{
        io_schedule();
        return 0;
}

And how is our new-found friend called?

/**
 * nfs_wait_on_request - Wait for a request to complete.
 * @req: request to wait upon.
 *
 * Interruptible by fatal signals only.
 * The user is responsible for holding a count on the request.
 */
int
nfs_wait_on_request(struct nfs_page *req)
{
        return wait_on_bit(&req->wb_flags, PG_BUSY,
                        nfs_wait_bit_uninterruptible,
                        TASK_UNINTERRUPTIBLE);
}

From just looking at the comments and the code it’s fairly plain to see that this function is used to wait for some kind of NFS request to complete, that nfs_wait_bit_uninterruptible() is somehow involved in doing that waiting, and we know from the previous snippet that this will involve calling schedule() and increasing iowait.

So what does wait_on_bit() do, exactly? And when is nfs_wait_on_request() called?

Wait a Bit-Twiddlin’ Minute.

From the Linux device drivers documentation we can see that wait_on_bit() is used to wait for a bit to be cleared in a word. The function signature is:

int wait_on_bit(void *word, int bit, int (*action)(void *), unsigned mode);

The documentation states that the function will wait for the bit-th bit in the word pointed to by word to clear. If the bit is not yet clear it will call the function pointed to by action in order to sleep. The mode parameter tells the kernel what state to put the process in while it waits — in the case of nfs_wait_on_request() it is TASK_UNINTERRUPTIBLE (the dreaded ‘D’ process state.)
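To make the pairing concrete, here is a hypothetical sketch of both sides of the handshake, using the 2.6.32-era API. The flags word, the MY_FLAG_BUSY bit and the my_sleep() action are illustrative names, not anything from the kernel tree:

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/bitops.h>

#define MY_FLAG_BUSY 0          /* bit number within the flags word */

static unsigned long my_flags;  /* the word holding the busy bit */

static int my_sleep(void *word)
{
        io_schedule();          /* sleep; charged to iowait */
        return 0;
}

/* Waiting side: block until MY_FLAG_BUSY clears. */
static void my_wait(void)
{
        wait_on_bit(&my_flags, MY_FLAG_BUSY, my_sleep,
                    TASK_UNINTERRUPTIBLE);
}

/* Completing side: clear the bit, then wake anyone waiting on it. */
static void my_complete(void)
{
        clear_bit(MY_FLAG_BUSY, &my_flags);
        smp_mb__after_clear_bit();      /* order the clear before the wakeup */
        wake_up_bit(&my_flags, MY_FLAG_BUSY);
}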

Taking this back to our specific function call: we’re waiting for the PG_BUSY bit in the nfs_page struct member wb_flags to clear, and the sleep function will eventually bump the iowait counter and schedule a new process.

NFS Survey.

So in what situations will the NFS code attempt to wait on a bit? A quick grep shows us that nfs_wait_on_request() is only called from write.c, in the following functions:

  1. static struct nfs_page *nfs_find_and_lock_request(struct page *page)
  2. static int nfs_wait_on_requests_locked(struct inode *inode, pgoff_t idx_start, unsigned int npages)
  3. static struct nfs_page *nfs_try_to_update_request(struct inode *inode, struct page *page, unsigned int offset, unsigned int bytes)

Tracing the ways these functions are called leads us to find they are used when the kernel wants to flush pages back to the NFS server, for example in nfs_page_async_flush(), which is called by nfs_do_writepage(), which is eventually called by nfs_writepage(), which is registered with the VFS layer as the mechanism to write dirty cache pages back to the source:

(from fs/nfs/file.c)

const struct address_space_operations nfs_file_aops = {
        .readpage = nfs_readpage,
        .readpages = nfs_readpages,
        .set_page_dirty = __set_page_dirty_nobuffers,
        .writepage = nfs_writepage,
        .writepages = nfs_writepages,
        .write_begin = nfs_write_begin,
        .write_end = nfs_write_end,
        .invalidatepage = nfs_invalidate_page,
        .releasepage = nfs_release_page,
        .direct_IO = nfs_direct_IO,
        .migratepage = nfs_migrate_page,
        .launder_page = nfs_launder_page,
        .error_remove_page = generic_error_remove_page,
};

And the VFS documentation (Documentation/filesystems/vfs.txt):

writepage: called by the VM to write a dirty page to backing store. This may happen for data integrity reasons (i.e. ‘sync’), or to free up memory (flush). The difference can be seen in wbc->sync_mode. The PG_Dirty flag has been cleared and PageLocked is true. writepage should start writeout, should set PG_Writeback, and should make sure the page is unlocked, either synchronously or asynchronously when the write operation completes.
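To make that contract concrete, a hypothetical minimal writepage might look like the following. my_writepage is an illustrative name; set_page_writeback(), unlock_page() and end_page_writeback() are the 2.6.32-era helpers the contract refers to:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Hypothetical synchronous writepage: honours the vfs.txt contract by
 * setting PG_Writeback, unlocking the page, and clearing PG_Writeback
 * once the write has completed. */
static int my_writepage(struct page *page, struct writeback_control *wbc)
{
        set_page_writeback(page);       /* sets PG_Writeback */
        unlock_page(page);              /* writepage must unlock the page */

        /* ... write the page contents to backing store here ... */

        end_page_writeback(page);       /* clears PG_Writeback, wakes waiters */
        return 0;
}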

Conclusion.

A quick survey with our trusty sharpened grep tool leads us to the conclusion that writing to NFS will result in an increase of iowait%. I was not able to find in my short adventure whether reading from NFS also counts towards iowait%; given that I can only find a path from write.c, I am going to assume that it doesn’t.

My next search will be to confirm that all NFS writes result in the process being moved into TASK_UNINTERRUPTIBLE — which the source seems to indicate, and which would result in a parallel increase in iowait% and load average during NFS write operations, since Linux counts tasks in uninterruptible sleep towards the load average.
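That last point is worth spelling out: on Linux the load average samples uninterruptible sleepers as well as runnable tasks. Paraphrased and abridged from calc_load_account_active() in kernel/sched.c (2.6.32); treat this as a sketch, not a verbatim quote:

/* Paraphrased/abridged from kernel/sched.c (2.6.32), not verbatim:
 * the per-runqueue sample that feeds the load average counts
 * uninterruptible sleepers alongside runnable tasks. */
static void calc_load_account_active(struct rq *this_rq)
{
        long nr_active;

        nr_active = this_rq->nr_running;
        nr_active += (long) this_rq->nr_uninterruptible;

        /* ... the delta since the last sample is folded into the
         * global calc_load_tasks count, from which avenrun[] (the
         * 1/5/15 minute load averages) is computed ... */
}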

--