The bytes and events data for NFS mounts in /proc/self/mountstats

October 4, 2013

The per-NFS-mount mountstats performance stats (see here for an introduction) have two sets of high-level statistics, reported in the bytes: and events: lines. Both of these come from counters that are described in comments in include/linux/nfs_iostat.h in the kernel source. Of the two, the simpler is bytes:.

A typical bytes: line looks like:

bytes:  2320629391 2297630544 0 0 2298347151 2297630544 718354 717816

In order, let's call these fields nread, nwrite, dread, dwrite, nfsread, nfswrite, pageread, and pagewrite. These count bytes read and written to the server with simple read() and write(), with read() and write() calls in O_DIRECT mode, the actual number of bytes read and written from the NFS server (regardless of how), and the number of pages (not bytes) read or written via directly mmap()'d files. I believe that the page size is basically always 4 KB (at least on x86). It's routine for the O_DIRECT numbers to be zero. The most useful of these numbers for performance are what I've called nfsread and nfswrite, the fifth and sixth fields, because these represent the actual IO to the server.
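Since everything in the bytes: line is positional, here is a minimal Python sketch of pulling the fields out by name. The field names are just my own labels from above, not anything official, and this assumes the field order described here:

    # A minimal sketch, assuming the bytes: field order described above.
    # The field names are my own labels, not official kernel names.
    BYTES_FIELDS = ("nread", "nwrite", "dread", "dwrite",
                    "nfsread", "nfswrite", "pageread", "pagewrite")

    def parse_bytes_line(line):
        # line looks like 'bytes:  2320629391 2297630544 0 0 ...'
        return dict(zip(BYTES_FIELDS, (int(f) for f in line.split()[1:])))

    with open("/proc/self/mountstats") as fp:
        for line in fp:
            line = line.strip()
            if line.startswith("bytes:"):
                st = parse_bytes_line(line)
                print("server IO:", st["nfsread"], st["nfswrite"])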

A typical events: line looks like this:

events: 3717478 126331741 28393 1036981 3355459 1099901 133724160 1975168 3589 2878751 1405861 5669601 720939 96113 3235157 225053 30643 3026061 0 23184 1675425 24 0 0 0 0 0

The events: line tracks various sorts of high-level NFS events. There are a lot of them, so I am just going to list them in order (with field numbers and some commentary); a small parsing sketch follows the list:

  1. inode revalidate: How many times cached inode attributes have to be re-validated from the server.
  2. dnode revalidate: How many times cached dentry nodes (ie, name to inode mappings) have to be re-validated. I suspect that this spawns inode revalidations as well.
  3. data invalidate: How many times an inode had its cached data thrown out.
  4. attribute invalidate: How many times an inode has had cached inode attributes invalidated.

  5. vfs open: How many times files or directories have been open()'d.
  6. vfs lookup: How many name lookups in directories there have been.
  7. vfs access: How many times permissions have been checked via the internal equivalent of access().
  8. vfs update page: Count of updates (and potential writes) to pages.
  9. vfs read page: This is the same as what I called pageread in the bytes: field. (Quite literally. The counters are incremented next to each other in the source.)
  10. vfs read pages: Count of how many times a group of (mapped?) pages have been read. I believe it spawns 'vfs read page' events too but I'm not sure.
  11. vfs write page: Same as pagewrite in bytes:.
  12. vfs write pages: Count of grouped page writes. Probably spawns 'vfs write page' events too.
  13. vfs getdents: How many times directory entries have been read with getdents(). These reads can be served from cache and don't necessarily imply actual NFS requests.
  14. vfs setattr: How many times we've set attributes on inodes.
  15. vfs flush: How many times pending writes have been forcefully flushed to the server (which can happen for various reasons).
  16. vfs fsync: How many times fsync() has been called on directories (which is a no-op for NFS) and files. Sadly you can't tell which is which.
  17. vfs lock: How many times people have tried to lock (parts of) a file, including in ways that are basic errors and will never succeed.
  18. vfs file release: Basically a count of how many times files have been closed and released.

  19. congestion wait: Not used for anything as far as I can tell. There doesn't seem to be anything in the current kernel source that actually increments the counter.

  20. truncation: How many times files have had their size truncated.
  21. write extension: How many times a file has been grown because you're writing beyond the existing end of the file.
  22. silly rename: How many times you removed a file while it was still open by some process, forcing the kernel to instead rename it to '.nfsXXXXXX' and delete it later.
  23. short read: The NFS server gave us less data than we asked for when we tried to read something.
  24. short write: The NFS server wrote less data than we asked it to.
  25. jukebox delay: How many times the NFS server told us EJUKEBOX, which is theoretically for when the server is slowly retrieving something from offline storage. I doubt that you will ever see this from normal servers.

  26. pnfs read: A count of NFS v4.1+ pNFS reads.
  27. pnfs write: A count of NFS v4.1+ pNFS writes.
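For reference, here is the equivalent minimal Python sketch for the events: line, using short names of my own for the fields in the order of the list above:

    # A minimal sketch for the events: line; the names are my own
    # shorthand for the fields in the order of the list above.
    EVENTS_FIELDS = (
        "inoderevalidate", "dentryrevalidate", "datainvalidate",
        "attrinvalidate", "vfsopen", "vfslookup", "vfsaccess",
        "vfsupdatepage", "vfsreadpage", "vfsreadpages", "vfswritepage",
        "vfswritepages", "vfsgetdents", "vfssetattr", "vfsflush",
        "vfsfsync", "vfslock", "vfsrelease", "congestionwait",
        "truncation", "extendwrite", "sillyrename", "shortread",
        "shortwrite", "jukeboxdelay", "pnfsread", "pnfswrite",
    )

    def parse_events_line(line):
        # line looks like 'events: 3717478 126331741 ...'
        return dict(zip(EVENTS_FIELDS, (int(f) for f in line.split()[1:])))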

All of the VFS operations are for VFS level file and address space operations. Fully understanding what these counters mean requires understanding when those operations are used, for what, and why. I don't have anywhere near this level of understanding of the Linux VFS layer, so my information here should be taken with some salt.

As you can see from my example events: line, some events are common, some are rare (eg #22, silly renames, of which there have been 24 over the lifetime of this NFS mount), and some basically never happen (eg everything from #23 onwards). Looking at our own collection of quite a lot of NFS v3 filesystem mounts, the only one of these that we've seen even a handful of (on three filesystems) is short writes. I suspect that those happen when a filesystem runs out of space on the fileserver.

Disclaimer: I'm somewhat fuzzy on what exactly a number of the events counted here really represent because I haven't traced backwards from the kernel code that increments the counters to figure out just what calls it and what it does and so on.

(This is one reason why the lack of good documentation on mountstats is really frustrating. Decoding a lot of this really needs someone who actively knows the kernel's internals for the best, most trustworthy results.)

Comments on this page:

By Anonymous at 2014-03-13 06:19:23:

"These count bytes read and written to the server with simple read() and write(), with read() and write() calls in O_DIRECT mode, the actual number of bytes read and written from the NFS server (regardless of how), and the number of pages (not bytes) read or written via directly mmap()'d files."

I'm not sure the description for nfsread/nfswrite is correct, since this would mean that field1 + field3 = field5, and I have seen cases where field5 < field1 (quite notably, actually, while field3 == 0). Unfortunately, I can't explain what is causing this.

Otherwise, great stuff!

By cks at 2014-03-13 14:19:51:

I believe that nfsread and nfswrite are not just the sum of the two sorts of read() or write() calls. There are also at least mmap()'d page reads and writes (what I called pageread and pagewrite), and there may be other IO paths that are not fully accounted for in bytes:.

(Note that I don't know how page reads and writes for Linux's special huge pages are accounted for, if you can even use huge pages to mmap() files.)

Hi, here is how I understood the difference among NORMAL vs SERVER vs DIRECT.

First of all, thanks for your reference link and the encouragement to go through the kernel code. The following is from the link you shared:

    These counters can also help characterize which access methods
    are in use.  DIRECT by itself shows whether there is any O_DIRECT
    traffic.  NORMAL + DIRECT shows how much data is going through
    the system call interface.  A large amount of SERVER traffic
    without much NORMAL or DIRECT traffic shows that applications
    are using mapped files.

SERVER: the actual requests made on the NFS share. Some of the data flow happens through the system call interface and comes from the NFS server (i.e., NORMAL), while the rest uses the cache; hence we see smaller values in NORMAL. (NORMAL + CACHE = SERVER + LOCALDISKIO)

DIRECT: it shows data transferred using the O_DIRECT method, which doesn't use the cache at all.
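Following the logic of that kernel comment, a minimal sketch of characterizing the access methods from the bytes: fields might look like this. It reuses the parse_bytes_line() helper and my field names from the sketch earlier in the entry, and the 2x threshold for "a large amount" is an arbitrary assumption:

    # A minimal sketch following the kernel comment's logic. It assumes
    # the parse_bytes_line() helper and field names from the earlier
    # sketch; the 2x threshold for 'a large amount' is arbitrary.
    def characterize(st):
        normal = st["nread"] + st["nwrite"]
        direct = st["dread"] + st["dwrite"]
        server = st["nfsread"] + st["nfswrite"]
        if direct:
            print("there is O_DIRECT traffic:", direct, "bytes")
        print("system call interface traffic:", normal + direct, "bytes")
        if server > 2 * (normal + direct):
            print("much SERVER traffic without much NORMAL or DIRECT;")
            print("applications are probably using mapped files")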
