Wandering Thoughts archives

2013-10-24

How to force a disk write cache flush operation on Linux

Suppose, not entirely hypothetically, that you want to test how well some new storage hardware and disks stand up to a lot of write cache flush operations (eg); you don't care about high level filesystem operations, you just want to hammer on the disks and the disk interconnects.

I will cut to the chase: the simplest and most direct way of doing this on Linux is to call fsync() on a (or the) disk block device. This appears to always generate a SYNCHRONIZE_CACHE operation (or the SATA equivalent), although this will be a no-op in the disk if there are no cached writes.
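
Here is a minimal sketch of what I mean, in Python. The /dev/sdb name is just a placeholder for whatever disk you're testing (you need enough privileges to open it), and the flush count is arbitrary:

    #!/usr/bin/env python
    # Hammer a disk with write cache flush operations by calling fsync()
    # on its block device. /dev/sdb is a hypothetical test disk.
    import os, sys, time

    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"
    count = 1000

    # Opened read-only so the test itself can't damage anything; fsync()
    # on a block device still goes through blkdev_fsync() and issues a flush.
    fd = os.open(dev, os.O_RDONLY)
    start = time.time()
    for _ in range(count):
        os.fsync(fd)
    os.close(fd)

    elapsed = time.time() - start
    print("%d flushes in %.2f seconds (%.2f ms per flush)" %
          (count, elapsed, elapsed * 1000.0 / count))

Bear in mind that if nothing has written to the disk recently, the drive may treat each flush as close to a no-op; mixing in some writes gives the cache something to actually flush.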

(The one exception is that the Linux SCSI layer does not issue cache synchronization operations unless it thinks that the disk's write cache is enabled. Since SATA disks go through the SCSI layer these days, this applies to SATA drives too.)

Calling fsync() on files on filesystems can issue write cache flush operations under at least some circumstances, depending on the exact filesystem. However I can't follow the ext3 and ext4 code clearly enough to be completely sure that they always flush the write cache on fsync() one way or another, although I suspect that they do. In any case I generally prefer testing low-level disk performance using the raw block devices.

(It appears that under at least some circumstances, calling fsync() on things on extN filesystems will not directly flush the disk's write cache but will instead simply issue writes that are intended to bypass it. These may then get translated into write flushes on disks that don't support such a write bypass.)

Sidebar: Where this lives in the current 3.12.0-rc6 code

Since I went digging for this in the kernel source and would hate to have not written it down if I ever need it again:

  • blkdev_fsync() in fs/block_dev.c calls blkdev_issue_flush() and is what handles fsync() on block devices.
  • blkdev_issue_flush() in block/blk-flush.c issues a BIO operation with WRITE_FLUSH.
  • WRITE_FLUSH is a bitmask of BIO flags that includes REQ_FLUSH.
  • sd_prep_fn() in drivers/scsi/sd.c catches REQ_FLUSH operations and calls scsi_setup_flush_cmnd(), which sets the SCSI command to SYNCHRONIZE_CACHE.
  • SATA disks translate SYNCHRONIZE_CACHE into either ATA_CMD_FLUSH or ATA_CMD_FLUSH_EXT in ata_get_xlat_func() and ata_scsi_flush_xlat() in drivers/ata/libata-scsi.c.

Whether or not the disk has write cache enabled comes into this through the SCSI layer:

  • sd_revalidate_disk() in drivers/scsi/sd.c configures the general block layer flush settings for the particular device based on whether the device has WCE enabled and whether it supports FUA for bypassing the write cache (see FUA in here). This is done by calling blk_queue_flush().
  • blk_queue_flush() in block/blk-settings.c sets the request queue's flush_flags field to the value passed in.
  • generic_make_request_checks() in block/blk-core.c filters out flush flags and flush requests for queues that did not advertise them, ie any 'SCSI' drive that didn't advertise at least WCE.

For a given SCSI drive, the state of WCE is reported in the sysfs attribute cache_type and the state of FUA in, surprise, FUA. This includes SATA drives (which are handled as SCSI drives, more or less).
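
Here is a quick sketch of checking this from a script rather than poking around sysfs by hand; it assumes nothing more than the usual /sys/class/scsi_disk layout:

    #!/usr/bin/env python
    # Report the write cache (cache_type) and FUA state that the SCSI layer
    # sees for each disk, from the sysfs attributes mentioned above.
    import glob, os

    for d in sorted(glob.glob("/sys/class/scsi_disk/*")):
        with open(os.path.join(d, "cache_type")) as f:
            ctype = f.read().strip()      # e.g. 'write back' when WCE is on
        with open(os.path.join(d, "FUA")) as f:
            fua = f.read().strip()        # '1' if the drive supports FUA
        print("%s: cache_type=%s, FUA=%s" % (os.path.basename(d), ctype, fua))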

For more kernel internal details, see this bit of kernel documentation. This may be useful to understand the interaction of various bits and pieces in the source code I've inventoried above.

By the way, git grep turns out to be really handy for this sort of thing.

PS: I don't know if there's a straightforward way to force FUA writes. You'd expect that O_SYNC writes would do it but I can't prove it from my reading of kernel source code so far, although I haven't dug deeply on this.

ForceDiskFlushes written at 00:31:00

2013-10-06

The per-NFS-operation data for NFS mounts in /proc/self/mountstats

The last big chunk of statistics in mountstats is a bunch of separate information for each NFS operation. Since the exact operations are version-specific, the list of operations depends on the NFS version in use; however, the statistics are the same for all protocols (and all operations).

A typical example (for NFS v3) looks like this:

per-op statistics
        NULL: 0 0 0 0 0 0 0 0
     GETATTR: 192686 192688 0 22726724 21580832 2800 79242 83598
     SETATTR: 10782 10783 0 1699192 1552608 1139 18919 20152
[....]

Operations are actually identified by numerical codes and here they're printed in numerical order, although with their standard names instead of their numbers. For each operation the actual numbers are, in order:

  1. operations: How many requests we've done for this operation.
  2. transmissions: How many times we've actually transmitted a RPC request for this operation. As you might have gathered from the last entry, this can exceed the operation count due to timeouts and retries.
  3. major timeouts: How many times a request has had a major timeout. Major timeouts produce 'nfs: server X not responding, still trying' messages. You can have timeouts and retries without having major timeouts (and my example lines do).
  4. bytes sent: This includes not just the RPC payload but also the RPC headers and so on. It closely matches the on-the-wire size.
  5. bytes received: As with bytes sent, this is the full size.
  6. cumulative queue time: How long (in milliseconds) all requests spent queued for transmission before they were sent.
  7. cumulative response time: How long (in milliseconds) it took to get a reply back after the request was transmitted. The kernel comments call this the RPC RTT.
  8. cumulative total request time: How long (in milliseconds) all requests took from when they were initially queued to when they were completely handled. The kernel calls this the RPC execution time.

At a first guess I think you'll want to divide bytes sent by transmissions but bytes received by operations if you're computing the average request size. Doing it differently will only matter in the face of retransmissions due to retries.

As you can see from my example lines, the total request time is a bit larger than the sum of the queue time and the response time. I don't know enough about the Linux NFS RPC code to know where this extra time may come from.
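
To make this concrete, here is a sketch of turning the per-op lines into per-operation averages. The field names are my own labels for the eight numbers, and dividing the cumulative times by the operation count is the 'divide by the number of requests' approach:

    #!/usr/bin/env python
    # Feed this the per-op lines (everything after 'per-op statistics')
    # for a single mount on standard input.
    import sys

    FIELDS = ("ops", "trans", "major_timeouts", "bytes_sent", "bytes_recv",
              "queue_ms", "rtt_ms", "total_ms")

    for line in sys.stdin:
        if ":" not in line:
            continue
        op, rest = line.split(":", 1)
        st = dict(zip(FIELDS, (int(x) for x in rest.split())))
        if st["ops"] == 0:
            continue
        # Average request size: bytes sent per transmission, bytes received
        # per operation (per the discussion above).
        print("%s: %d bytes/xmit sent, %d bytes/op received, "
              "avg RTT %.1f ms, avg total %.1f ms" %
              (op.strip(), st["bytes_sent"] // max(st["trans"], 1),
               st["bytes_recv"] // st["ops"],
               st["rtt_ms"] / float(st["ops"]),
               st["total_ms"] / float(st["ops"])))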

All of this is extremely useful for performance monitoring, especially if you sum it up across all operations to get a global view of how RPC for a given mount is performing. Most NFS RPC operations have relatively obvious names but it's helpful to know that COMMIT is part of NFS v3 writes. The other issue is that in NFS v3 it's possible to set different flags on WRITE operations (some can be made synchronous) and the per-operation statistics don't break these out separately. You may thus see wildly different WRITE performance numbers at different times depending on whether they are synchronous (and generally slow) or asynchronous.

(In fact I don't think anything in the overall mountstats data lets you tell these two sorts of WRITEs apart, even to the extent of finding out how much data is being written with synchronous WRITEs and how much with asynchronous ones. This is an unfortunate omission since asynchronous writes are likely to be much faster.)

PS: there may be other issues waiting with NFS v4 operations. We don't use NFS v4 here so I've never looked at it.

Sidebar: where to find things in the kernel

The code that prints the data is the rpc_print_iostats() routine in net/sunrpc/stats.c. The data is explained in comments in include/linux/sunrpc/metrics.h.

NFSMountstatsNFSOps written at 01:18:24

2013-10-05

The xprt: data for NFS mounts in /proc/self/mountstats

The xprt: line in mountstats reports various bits of overall NFS RPC information for the particular NFS mount in question. The information reported differs somewhat depending on the protocol in question, so I'm going to start by discussing the TCP stats. These look like (for the 1.0 format):

xprt:   tcp 695 1 1 0 16 96099368 96091328 6383 341933213458 1504192

The number fields are, in order:

  1. port: The local port used for this particular NFS mount. Probably not particularly useful, since on our NFS clients all NFS mounts from the same fileserver use the same port (and thus the same underlying TCP connection).
  2. bind count: I believe that this is basically how many times this mount has had to start talking to the NFS server from scratch. Normally 1. I don't know if it goes up if the NFS server reboots.
  3. connect count: How many times the client has made a TCP connection to the NFS server.
  4. connect 'idle' time: How long (in jiffies, an internal kernel measure of time, not seconds) this NFS mount has spent waiting for its connection(s) to the server to be established.
  5. idle time: How long (in seconds) since the NFS mount saw any RPC traffic.
  6. sends: How many RPC requests this mount has sent to the server.
  7. receives: How many RPC replies this mount has received from the server. Not every request necessarily gets a reply, due to timeouts and retries.
  8. bad XIDs: Every RPC request and reply has an XID. This counts how many times the NFS server has sent us a reply with an XID that didn't correspond to any outstanding request we knew about.
  9. cumulative 'active' request count: Every time we send a request, we add the current difference between sends and receives to this number. Since we've just sent a request, this is always going to be at least one. This number is not as useful as you think it should be.
  10. cumulative backlog count: Every time we send a request, we add the current backlog queue size to this counter.

Recent versions of the kernel report a format version of 1.1 instead of 1.0. This adds three more counters:

  11. maximum RPC slots ever used: The maximum number of simultaneously active RPC requests that this mount has ever had.
  12. cumulative sending queue count: Every time we send a request, we add the current size of the sending queue to this counter.
  13. cumulative pending queue count: Every time we send a request, we add the current size of the pending queue to this counter.

(For all of the 'cumulative' counters you must divide by the number of requests to get the actual average, either for all time or over a particular time interval.)

What we would like with sends and receives is that the difference between them is the number of currently outstanding requests. Unfortunately it doesn't work that way, because requests that time out and are retried only increment sends, not receives. If you have any NFS server retries at all, the nominal 'currently outstanding requests' figure will drift steadily upwards due to this. In turn this makes the cumulative active request count not useful except when measured over time intervals in which no timeouts or retries happen (which implies that you cannot use it to determine an 'average outstanding requests to date' figure; on some of our client machines for some filesystems, this can wind up claiming that the average is over a thousand).

(Note that the absence of 'nfs: server X not responding, still trying' messages doesn't mean that no requests were retried over that time. The kernel only starts logging those messages if requests are significantly delayed and retried.)

Now let's talk about RPC slots. Every NFS mount can have only so many active RPC requests at once, regardless of their state; each request is said to take up a slot. If something tries to submit an RPC request and there is no slot, the task goes to sleep on the backlog queue (how often this happens has changed between kernel versions). Once a request has been given a slot, it gets put on the queue to be sent (since many processes can try to send things at once, sending must be serialized). As far as I can tell, RPC requests wind up on the pending queue when they are waiting for one of a number of things, apparently including buffer space and a reply from the server (for requests that expect a reply, although I don't understand the logic in the code). It's possible that exactly what winds up in the pending queue when has changed over different kernel versions.

At least in our kernels, all NFS mounts from a single fileserver use the same local port number. This means that they all use the same underlying TCP connection (since the TCP connection is identified by the quad of local port, remote port, local IP, and remote IP and the last three of those are necessarily the same for all mounts). In turn this implies that all NFS RPC requests for all mounts from a single fileserver must serialize to be sent to the network, making the per mount cumulative sending queue sizes not exactly independent from each other.

For the local RPC transport there is no local port (field #1); all other fields are the same. For the udp RPC transport there is no connection count, connect idle time, or idle time (fields #3, #4, and #5); all other fields are the same. This omission of fields is quite annoying since it means you get to shuffle your code around instead of just ignoring certain fields for certain transports (they could perfectly well report the particular statistic as zero).
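
If you're writing your own parser for this, here is one sketch of coping with the per-transport differences. The field names are my own, and zero-filling the counters that 1.0 format lines lack is my choice (the kernel simply doesn't print them):

    #!/usr/bin/env python
    # Parse an xprt: line into a dict of named fields.
    TCP_FIELDS = ["port", "bind_count", "connect_count", "connect_time",
                  "idle_time", "sends", "recvs", "bad_xids",
                  "req_u", "backlog_u", "max_slots", "sending_u", "pending_u"]

    def parse_xprt(line):
        parts = line.split()
        proto, nums = parts[1], [int(x) for x in parts[2:]]
        names = list(TCP_FIELDS)
        if proto == "local":
            names.remove("port")
        elif proto == "udp":
            for f in ("connect_count", "connect_time", "idle_time"):
                names.remove(f)
        stats = dict(zip(names, nums))
        for f in names[len(nums):]:      # 1.0 format lacks the last three
            stats[f] = 0
        return proto, stats

    print(parse_xprt("xprt:   tcp 695 1 1 0 16 96099368 96091328 6383"
                     " 341933213458 1504192"))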

Sidebar: Where this is in the kernel code

As mentioned in the introduction to this series, the code that prints this information is in net/sunrpc/xprtsock.c and the stat structure it takes information from is in include/linux/sunrpc/xprt.h. Note that the elements from this structure are not printed in order; you need to look at the code to see how they all emerge.

NFSMountstatsXprt written at 00:32:26

2013-10-04

The bytes and events data for NFS mounts in /proc/self/mountstats

The per NFS mount mountstats performance stats (see here for an introduction) have two sets of high level statistics, reported in the bytes: and events: lines. Both of these come from counters that are described in comments in include/linux/nfs_iostat.h in the kernel source. Of the two, the simpler is bytes:.

A typical bytes: line looks like:

bytes:  2320629391 2297630544 0 0 2298347151 2297630544 718354 717816

In order, let's call these fields nread, nwrite, dread, dwrite, nfsread, nfswrite, pageread, and pagewrite. These count bytes read and written to the server with simple read() and write(), with read() and write() calls in O_DIRECT mode, the actual number of bytes read and written from the NFS server (regardless of how), and the number of pages (not bytes) read or written via directly mmap()'d files. I believe that the page size is basically always 4 KB (at least on x86). It's routine for the O_DIRECT numbers to be zero. The most useful numbers of these for performance are what I've called nfsread and nfswrite, the fifth and sixth fields, because these represent the actual IO to the server.
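
A sketch of pulling these apart, using my field names (the example line is the one shown above):

    #!/usr/bin/env python
    # Parse a bytes: line into named fields.
    def parse_bytes(line):
        fields = ("nread", "nwrite", "dread", "dwrite",
                  "nfsread", "nfswrite", "pageread", "pagewrite")
        return dict(zip(fields, (int(x) for x in line.split()[1:])))

    b = parse_bytes("bytes:  2320629391 2297630544 0 0"
                    " 2298347151 2297630544 718354 717816")
    # nfsread and nfswrite are the actual IO to the server.
    print("server read: %d bytes, server write: %d bytes" %
          (b["nfsread"], b["nfswrite"]))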

A typical events: line looks like this:

events: 3717478 126331741 28393 1036981 3355459 1099901 133724160 1975168 3589 2878751 1405861 5669601 720939 96113 3235157 225053 30643 3026061 0 23184 1675425 24 0 0 0 0 0

The events: line tracks various sorts of high level NFS events. There are a lot of them, so I am just going to list them in order (with field numbers and some commentary):

  1. inode revalidate: How many times cached inode attributes have to be re-validated from the server.
  2. dnode revalidate: How many times cached dentry nodes (ie, name to inode mappings) have to be re-validated. I suspect that this spawns inode revalidations as well.
  3. data invalidate: How many times an inode had its cached data thrown out.
  4. attribute invalidate: How many times an inode has had cached inode attributes invalidated.

  5. vfs open: How many times files or directories have been open()'d.
  6. vfs lookup: How many name lookups in directories there have been.
  7. vfs access: How many times permissions have been checked via the internal equivalent of access().
  8. vfs update page: Count of updates (and potential writes) to pages.
  9. vfs read page: This is the same as what I called pageread in the bytes: field. (Quite literally. The counters are incremented next to each other in the source.)
  10. vfs read pages: Count of how many times a group of (mapped?) pages has been read. I believe it spawns 'vfs read page' events too but I'm not sure.
  11. vfs write page: Same as pagewrite in bytes:.
  12. vfs write pages: Count of grouped page writes. Probably spawns 'vfs write page' events too.
  13. vfs getdents: How many times directory entries have been read with getdents(). These reads can be served from cache and don't necessarily imply actual NFS requests.
  14. vfs setattr: How many times we've set attributes on inodes.
  15. vfs flush: How many times pending writes have been forcefully flushed to the server (which can happen for various reasons).
  16. vfs fsync: How many times fsync() has been called on directories (which is a no-op for NFS) and files. Sadly you can't tell which is which.
  17. vfs lock: How many times people have tried to lock (parts of) a file, including in ways that are basic errors and will never succeed.
  18. vfs file release: Basically a count of how many times files have been closed and released.

  19. congestion wait: Not used for anything as far as I can tell. There doesn't seem to be anything in the current kernel source that actually increments the counter.

  20. truncation: How many times files have had their size truncated.
  21. write extension: How many times a file has been grown because you're writing beyond the existing end of the file.
  22. silly rename: How many times you removed a file while it was still open by some process, forcing the kernel to instead rename it to '.nfsXXXXXX' and delete it later.
  23. short read: The NFS server gave us less data than we asked for when we tried to read something.
  24. short write: The NFS server wrote less data than we asked it to.
  25. jukebox delay: How many times the NFS server told us EJUKEBOX, which is theoretically for when the server is slowly retrieving something from offline storage. I doubt that you will ever see this from normal servers.

  26. pnfs read: A count of NFS v4.1+ pNFS reads.
  27. pnfs write: A count of NFS v4.1+ pNFS writes.

All of the VFS operations are for VFS level file and address space operations. Fully understanding what these counters mean requires understanding when those operations are used, for what, and why. I don't have anywhere near this level of understanding of the Linux VFS layer, so my information here should be taken with some salt.

As you can see from my example events: line, some events are common, some are rare (eg #22, silly renames, of which there have been 24 over the lifetime of this NFS mount), and some basically never happen (eg everything from #23 onwards). Looking at our own collection of quite a lot of NFS v3 filesystem mounts, the only thing we've seen even a handful of (on three filesystems) is short writes. I suspect that those happen when a filesystem runs out of space on the fileserver.
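
If you want to check your own systems for these rare events, here's a small sketch that scans every mount's events: line for nonzero counts of fields #22 through #25; the names and indexes follow the numbering in this entry:

    #!/usr/bin/env python
    # Report nonzero counts of the rare events for every NFS mount.
    RARE = {21: "silly rename", 22: "short read",
            23: "short write", 24: "jukebox delay"}

    mount = None
    with open("/proc/self/mountstats") as f:
        for line in f:
            if line.startswith("device "):
                mount = line.split()[4]          # the 'mounted on' path
            elif line.strip().startswith("events:"):
                nums = [int(x) for x in line.split()[1:]]
                for idx in sorted(RARE):
                    if idx < len(nums) and nums[idx]:
                        print("%s: %s = %d" % (mount, RARE[idx], nums[idx]))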

Disclaimer: I'm somewhat fuzzy on what exactly a number of the events counted here really represent because I haven't traced backwards from the kernel code that increments the counters to figure out just what calls it and what it does and so on.

(This is one reason why the lack of good documentation on mountstats is really frustrating. Decoding a lot of this really needs someone who actively knows the kernel's internals for the best, most trustworthy results.)

NFSMountstatsBytesEvents written at 01:08:07

2013-10-03

What is in /proc/self/mountstats for NFS mounts: an introduction

As I discovered recently, for several years Linux kernels have made a huge amount of per-mount NFS performance statistics visible in /proc/self/mountstats. Unfortunately none of this is documented. Because I have a use for the information and I'm a glutton for punishment, I'm going to write up what I've found out. Mountstats contains so much information that this is going to take several entries.

To start with, let's talk about the overall format of a filesystem entry. In a relatively recent kernel, this looks like the following for an NFS v3 mount over TCP (probably a common case):

device fs1:/cs/mail mounted on /var/mail with fstype nfs statvers=1.0
        opts:   rw,[... many ...]
        age:    11343100
        caps:   caps=0x3fc7,wtmult=512,dtsize=8192,bsize=0,namlen=255
        sec:    flavor=1,pseudoflavor=1
        events: [... numbers ...]
        bytes:  [... numbers ...]
        RPC iostats version: 1.0  p/v: 100003/3 (nfs)
        xprt:   tcp [... numbers ...]
        per-op statistics
                NULL: 0 0 0 0 0 0 0 0
                [more operations]

(Some kernel configurations may produce an additional line with 'fsc:'. NFS v4 mounts will have a bunch of additional information that I haven't looked into because we don't have any.)

The entry ends with a blank line (not shown). There are no fewer than four different sets of statistics in this; events: for high-level NFS events, bytes: for actual data being read and written, xprt: for low level NFS RPC activity, and then relatively detailed statistics for each NFS protocol operation. If you're hunting performance issues you may wind up looking at all of them. As is usual for kernel stats, all of the numbers are 'from the beginning of time' ones and just count up as things happen; if you want to get per-second statistics or the like you need to read the file more than once and then work out the difference between your two readings.
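
As a rough sketch of that 'read it twice and diff' approach, here is an example that turns the bytes: numbers for one mount into per-second rates; the /var/mail mount point and the ten second interval are just placeholders, and the fifth and sixth numbers are the actual read and write IO to the server (as covered in the bytes: entry):

    #!/usr/bin/env python
    # Print per-second NFS read/write rates for one mount over an interval.
    import time

    MOUNT = "/var/mail"      # substitute the mount you care about
    INTERVAL = 10            # seconds

    def bytes_for(mount):
        where = None
        with open("/proc/self/mountstats") as f:
            for line in f:
                if line.startswith("device "):
                    where = line.split()[4]
                elif where == mount and line.strip().startswith("bytes:"):
                    return [int(x) for x in line.split()[1:]]
        return None

    before = bytes_for(MOUNT)
    time.sleep(INTERVAL)
    after = bytes_for(MOUNT)
    print("read: %.1f KB/s, write: %.1f KB/s" %
          ((after[4] - before[4]) / 1024.0 / INTERVAL,
           (after[5] - before[5]) / 1024.0 / INTERVAL))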

Describing most of the fields in mountstats is sufficiently complicated that it needs separate entries: bytes: and high-level NFS events:, xprt: NFS RPC information, and the per-operation statistics.

The age: field is how many seconds this particular NFS mount has been in existence. You can use this to compute overall per-interval stats from all of the counters if you're so inclined, although I don't usually find overall stats very useful (since our activity is invariably very bursty).

There are currently a number of programs that try to do something with this firehose of information. Unfortunately I have yet to stumble over one that gives what I consider useful reports.

PS: You might be wondering why this is /proc/<PID>/mountstats instead of a general file in /proc or whatever. My understanding is that the reason for this is that modern Linux systems can have multiple filesystem namespaces and hence you have to talk about a specific process's view of mounts. It's traditional to use /proc/self/mountstats because you can always read your own version of it.

Sidebar: where in the kernel this stuff lives

Since there is no documentation, understanding what is being reported here requires reading the kernel source. In current 3.12.0-rc3 code, the overall report is produced by nfs_show_stats() in fs/nfs/super.c. This is responsible for events: and bytes:; the information that appears in them is discussed in the comments for include/linux/nfs_iostat.h.

The RPC information is produced by rpc_print_iostats() in net/sunrpc/stats.c. The xprt: line is produced by several different functions in net/sunrpc/xprtsock.c (which one is used depends on the connection protocol) using information in a structure that is described in include/linux/sunrpc/xprt.h (note that these struct fields are not printed in order, so you really do need to read the code). The per-op statistics are produced using information described in include/linux/sunrpc/metrics.h but again you'll need to read the source for the order and details.

NFSMountstatsIndex written at 00:42:23

