2013-10-24
How to force a disk write cache flush operation on Linux
Suppose, not entirely hypothetically, that you want to test how well some new storage hardware and disks stand up to a lot of write cache flush operations (eg); you don't care about high level filesystem operations, you just want to hammer on the disks and the disk interconnects.
I will cut to the chase: the simplest and most direct way of doing
this on Linux is to call fsync() on a (or the) disk block device.
This appears to always generate a SYNCHRONIZE_CACHE operation
(or the SATA equivalent), although this will be a no-op in the disk
if there are no cached writes.
(The one exception is that the Linux SCSI layer does not issue cache synchronization operations unless it thinks that the disk's write cache is enabled. Since SATA disks go through the SCSI layer these days, this applies to SATA drives too.)
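To make this concrete, here is a minimal sketch of doing it from Python; the device path is an arbitrary example (substitute whatever disk you're testing) and you need enough privileges to open the device at all:
    import os, sys

    # Example device path only; substitute the disk you're actually testing.
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sdb"

    fd = os.open(dev, os.O_RDONLY)   # read-only access is enough for fsync()
    try:
        # On a block device this winds up in blkdev_fsync() and thus
        # blkdev_issue_flush(), ie a cache flush command to the disk.
        os.fsync(fd)
    finally:
        os.close(fd)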
Calling fsync() on files on filesystems can issue write cache flush
operations under at least some circumstances, depending on the exact
filesystem. However I can't follow the ext3 and ext4 code clearly
enough to be completely sure that they always flush the write cache on
fsync() one way or another, although I suspect that they do. In any
case I generally prefer testing low-level disk performance using the
raw block devices.
(It appears that under at least some circumstances, calling fsync()
on things on extN filesystems will not directly flush the disk's write
cache but will instead simply issue writes that are intended to bypass
it. These may then get translated into write flushes on disks that don't
support such a write bypass.)
Sidebar: Where this lives in the current 3.12.0-rc6 code
Since I went digging for this in the kernel source and would hate to have not written it down if I ever need it again:
- blkdev_fsync() in fs/block_dev.c calls blkdev_issue_flush() and is
  what handles fsync() on block devices.
- blkdev_issue_flush() in block/blk-flush.c issues a BIO operation with
  WRITE_FLUSH. WRITE_FLUSH is a bitmap of BIO flags, including REQ_FLUSH.
- sd_prep_fn() in drivers/scsi/sd.c catches REQ_FLUSH operations and
  calls scsi_setup_flush_cmnd(), which sets the SCSI command to
  SYNCHRONIZE_CACHE.
- SATA disks translate SYNCHRONIZE_CACHE into either ATA_CMD_FLUSH or
  ATA_CMD_FLUSH_EXT in ata_get_xlat_func() and ata_scsi_flush_xlat() in
  drivers/ata/libata-scsi.c.
Whether or not the disk has write cache enabled comes into this through the SCSI layer:
- sd_revalidate_disk() in drivers/scsi/sd.c configures the general block
  layer flush settings for the particular device based on whether the
  device has WCE and possibly supports FUA to support bypassing the
  write cache (see FUA in here). This is done by calling blk_queue_flush().
- blk_queue_flush() in block/blk-settings.c sets the request queue's
  flush_flags field to the value passed in.
- generic_make_request_checks() in block/blk-core.c filters out flush
  flags and flush requests for queues that did not advertise them, ie
  any 'SCSI' drive that didn't advertise at least WCE.
For a given SCSI drive, the state of WCE is reported in the sysfs
attribute cache_type and the state of FUA in, surprise, FUA.
This includes SATA drives (which are handled as SCSI drives, more or
less).
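For illustration, here is a small sketch that reads both attributes for every disk the sd driver knows about; it assumes the usual /sys/class/scsi_disk layout:
    import glob, os

    # Each sd-handled disk appears under /sys/class/scsi_disk/<h:c:t:l>/.
    for d in sorted(glob.glob("/sys/class/scsi_disk/*")):
        def attr(name):
            with open(os.path.join(d, name)) as f:
                return f.read().strip()
        print(os.path.basename(d), "cache_type:", attr("cache_type"),
              "FUA:", attr("FUA"))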
For more kernel internal details, see this bit of kernel documentation. This may be useful to understand the interaction of various bits and pieces in the source code I've inventoried above.
By the way, git grep turns out to be really handy for this sort
of thing.
PS: I don't know if there's a straightforward way to force FUA writes.
You'd expect that O_SYNC writes would do it but I can't prove it
from my reading of kernel source code so far, although I haven't dug
deeply on this.
2013-10-06
The per-NFS-operation data for NFS mounts in /proc/self/mountstats
The last big chunk of statistics in mountstats
is a bunch of separate information for each NFS operation. Since the
exact operations are version-specific, the list of operations depends
on the NFS version in use; however, the statistics are the same for
all protocols (and all operations).
A typical example (for NFS v3) looks like this:
per-op statistics
NULL: 0 0 0 0 0 0 0 0
GETATTR: 192686 192688 0 22726724 21580832 2800 79242 83598
SETATTR: 10782 10783 0 1699192 1552608 1139 18919 20152
[....]
Operations are actually numerical codes and here they're printed in that numerical order, although with their standard names instead of their numbers. For each operation the actual numbers are, in order:
- operations: How many requests we've done for this operation.
- transmissions: How many times we've actually transmitted a RPC request for this operation. As you might have gathered from the last entry, this can exceed the operation count due to timeouts and retries.
- major timeouts: How many times a request has had a major timeout. Major timeouts produce 'nfs: server X not responding, still trying' messages. You can have timeouts and retries without having major timeouts (and my example lines do).
- bytes sent: This includes not just the RPC payload but also the RPC headers and so on. It closely matches the on-the-wire size.
- bytes received: As with bytes sent, this is the full size.
- cumulative queue time: How long (in milliseconds) all requests spent queued for transmission before they were sent.
- cumulative response time: How long (in milliseconds) it took to get a reply back after the request was transmitted. The kernel comments call this the RPC RTT.
- cumulative total request time: How long (in milliseconds) all requests took from when they were initially queued to when they were completely handled. The kernel calls this the RPC execution time.
At a first guess I think you'll want to divide bytes sent by transmissions but bytes received by operations if you're computing the average request size. Doing it differently will only matter in the face of retransmissions due to retries.
As you can see from my example lines, the total request time is a bit larger than the sum of the queue time and the response time. I don't know enough about the Linux NFS RPC code to know where this extra time may come from.
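To make the arithmetic concrete, here is a sketch that takes one per-op line (the GETATTR example above) and computes these averages; the variable names are simply my labels for the fields, in order:
    line = "GETATTR: 192686 192688 0 22726724 21580832 2800 79242 83598"
    op, rest = line.split(":")
    (ops, xmits, major_timeouts, bytes_sent, bytes_recv,
     queue_ms, rtt_ms, exec_ms) = (int(x) for x in rest.split())
    if ops:
        print(op, "average sent size: ", bytes_sent / xmits)
        print(op, "average recv size: ", bytes_recv / ops)
        print(op, "average RPC RTT ms:", rtt_ms / ops)
        print(op, "average exec ms:   ", exec_ms / ops)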
All of this is extremely useful for performance monitoring, especially
if you sum it up across all operations to get a global view of how RPC
for a given mount is performing. Most NFS RPC operations have relatively
obvious names but it's helpful to know that COMMIT is part of NFS v3
writes. The other issue is that in NFS
v3 it's possible to set different flags on WRITE operations (some can
be made synchronous) and the per-operation statistics don't break these
out separately. You may thus see wildly different WRITE performance
numbers at different times depending on whether they are synchronous
(and generally slow) or asynchronous.
(In fact I don't think anything in the overall mountstats data lets
you tell these two sorts of WRITEs apart, even to the extent of
finding out how much data is being written with synchronous WRITEs and
how much with asynchronous ones. This is an unfortunate omission since
asynchronous writes are likely to be much faster.)
PS: there may be other issues waiting with NFS v4 operations. We don't use NFS v4 here so I've never looked at it.
Sidebar: where to find things in the kernel
The code that prints the data is the rpc_print_iostats() routine
in net/sunrpc/stats.c. The data is explained in comments in
include/linux/sunrpc/metrics.h .
2013-10-05
The xprt: data for NFS mounts in /proc/self/mountstats
The xprt: line in mountstats reports various
bits of overall NFS RPC information for the particular NFS mount in
question. The information reported differs somewhat depending on the
protocol in question, so I'm going to start by discussing the TCP
stats. These look like (for the 1.0 format):
xprt: tcp 695 1 1 0 16 96099368 96091328 6383 341933213458 1504192
The number fields are, in order:
- port: The local port used for this particular NFS mount. Probably not particularly useful, since on our NFS clients all NFS mounts from the same fileserver use the same port (and thus the same underlying TCP connection).
- bind count: I believe that this is basically how many times this mount has had to start talking to the NFS server from scratch. Normally 1. I don't know if it goes up if the NFS server reboots.
- connect count: How many times the client has made a TCP connection to the NFS server.
- connect 'idle' time: How long (in jiffies, an internal kernel measure of time, not seconds) this NFS mount has spent waiting for its connection(s) to the server to be established.
- idle time: How long (in seconds) since the NFS mount saw any RPC traffic.
- sends: How many RPC requests this mount has sent to the server.
- receives: How many RPC replies this mount has received from the server. Not every request necessarily gets a reply, due to timeouts and retries.
- bad XIDs: Every RPC request and reply has an XID. This counts how many times the NFS server has sent us a reply with an XID that didn't correspond to any outstanding request we knew about.
- cumulative 'active' request count: Every time we send a request, we add the current difference between sends and receives to this number. Since we've just sent a request, this is always going to be at least one. This number is not as useful as you think it should be.
- cumulative backlog count: Every time we send a request, we add the
current backlog queue size to this counter.
Recent versions of the kernel report a format version of 1.1 instead of 1.0. This adds three more counters:
- maximum RPC slots ever used: The maximum number of simultaneously active RPC requests that this mount has ever had.
- cumulative sending queue count: Every time we send a request, we add the current size of the sending queue to this counter.
- cumulative pending queue count: Every time we send a request, we add the current size of the pending queue to this counter.
(For all of the 'cumulative' counters you must divide by the number of requests to get the actual average, either for all time or over a particular time interval.)
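As an illustration, here is a sketch that parses the statvers=1.0 tcp example line above and turns two of the cumulative counters into averages; the field names are my own labels, in the order just described:
    FIELDS = ["port", "bind_count", "connect_count", "connect_time",
              "idle_time", "sends", "receives", "bad_xids",
              "cum_active_reqs", "cum_backlog"]
    line = "xprt: tcp 695 1 1 0 16 96099368 96091328 6383 341933213458 1504192"
    stats = dict(zip(FIELDS, (int(x) for x in line.split()[2:])))
    if stats["sends"]:
        print("average backlog queue size:", stats["cum_backlog"] / stats["sends"])
        print("average 'active' requests: ", stats["cum_active_reqs"] / stats["sends"])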
What we would like with sends and receives is that the difference between them is the number of currently outstanding requests. Unfortunately it doesn't work that way, because requests that time out and are retried only increment sends, not receives. If you have any NFS server retries at all, the nominal 'currently outstanding requests' figure will drift steadily upwards due to this. In turn this makes the cumulative active request count not useful except when measured over time intervals in which no timeouts and retries happen (which implies that you cannot use it to determine an 'average outstanding requests to date' figure; on some of our client machines for some filesystems, this can wind up claiming that the average is over a thousand).
(Note that the absence of 'nfs: server X not responding, still trying' messages doesn't mean that no requests were retried over that time. The kernel only starts logging those messages if requests are significantly delayed and retried.)
Now let's talk about RPC slots. Every NFS mount can have only so many active RPC requests at once, regardless of their state; each request is said to take up a slot. If something tries to submit an RPC request and there is no slot, the task goes to sleep on the backlog queue (how often this happens has changed between kernel versions). Once a request has been given a slot, it gets put on the queue to be sent (since many processes can try to send things at once, sending must be serialized). As far as I can tell, RPC requests wind up on the pending queue when they are waiting for one of a number of things, apparently including buffer space and a reply from the server (for requests that expect a reply, although I don't understand the logic in the code). It's possible that exactly what winds up in the pending queue when has changed over different kernel versions.
At least in our kernels, all NFS mounts from a single fileserver use the same local port number. This means that they all use the same underlying TCP connection (since the TCP connection is identified by the quad of local port, remote port, local IP, and remote IP and the last three of those are necessarily the same for all mounts). In turn this implies that all NFS RPC requests for all mounts from a single fileserver must serialize to be sent to the network, making the per mount cumulative sending queue sizes not exactly independent from each other.
For the local RPC transport there is no local port (field #1); all
other fields are the same. For the udp RPC transport there is no
connection count, connect idle time, or idle time (fields #3, #4, and
#5); all other fields are the same. This omission of fields is quite
annoying since it means you get to shuffle your code around instead
of just ignoring certain fields for certain transports (they could
perfectly well report the particular statistic as zero).
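For what it's worth, here is one way of dealing with this in code, a sketch of per-transport field name lists for the 1.0 format (using my field names from above):
    TCP = ["port", "bind_count", "connect_count", "connect_time",
           "idle_time", "sends", "receives", "bad_xids",
           "cum_active_reqs", "cum_backlog"]
    LOCAL = TCP[1:]            # no local port
    UDP = TCP[:2] + TCP[5:]    # no connect count, connect time, or idle time

    def parse_xprt(line):
        parts = line.split()
        names = {"tcp": TCP, "local": LOCAL, "udp": UDP}[parts[1]]
        return parts[1], dict(zip(names, (int(x) for x in parts[2:])))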
Sidebar: Where this is in the kernel code
As mentioned in the introduction to this series,
the code that prints this information is in net/sunrpc/xprtsock.c
and the stat structure it takes information from is in
include/linux/sunrpc/xprt.h.
Note that the elements from this structure are not printed in order;
you need to look at the code to see how they all emerge.
2013-10-04
The bytes and events data for NFS mounts in /proc/self/mountstats
The per NFS mount mountstats performance stats (see here for
an introduction) have two sets of high level
statistics, reported in the bytes: and events: lines. Both of
these come from counters that are described in comments in
include/linux/nfs_iostat.h
in the kernel source.
Of the two, the simpler is bytes:.
A typical bytes: line looks like:
bytes: 2320629391 2297630544 0 0 2298347151 2297630544 718354 717816
In order, let's call these fields nread, nwrite, dread, dwrite, nfsread,
nfswrite, pageread, and pagewrite. These count bytes read and written
to the server with simple read() and write(), with read() and
write() calls in O_DIRECT mode, the actual number of bytes read
and written from the NFS server (regardless of how), and the number of
pages (not bytes) read or written via directly mmap()'d files. I
believe that the page size is basically always 4 KB (at least on x86).
It's routine for the O_DIRECT numbers to be zero.
The most useful numbers of these for performance are what I've
called nfsread and nfswrite, the fifth and sixth fields, because
these represent the actual IO to the server.
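As a concrete illustration, here is a sketch that pulls these two fields out of the example bytes: line, using the names I gave the fields above:
    line = "bytes: 2320629391 2297630544 0 0 2298347151 2297630544 718354 717816"
    (nread, nwrite, dread, dwrite,
     nfsread, nfswrite, pageread, pagewrite) = (int(x) for x in line.split()[1:])
    print("bytes actually read from the server: ", nfsread)
    print("bytes actually written to the server:", nfswrite)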
A typical events: line looks like this:
events: 3717478 126331741 28393 1036981 3355459 1099901 133724160 1975168 3589 2878751 1405861 5669601 720939 96113 3235157 225053 30643 3026061 0 23184 1675425 24 0 0 0 0 0
The events: line tracks various sorts of high level NFS events. There
are a lot of them, so I am just going to list them in order (with field
numbers and some commentary):
1. inode revalidate: How many times cached inode attributes have to be re-validated from the server.
2. dnode revalidate: How many times cached dentry nodes (ie, name to inode mappings) have to be re-validated. I suspect that this spawns inode revalidations as well.
3. data invalidate: How many times an inode had its cached data thrown out.
4. attribute invalidate: How many times an inode has had cached inode attributes invalidated.
5. vfs open: How many times files or directories have been open()'d.
6. vfs lookup: How many name lookups in directories there have been.
7. vfs access: How many times permissions have been checked via the internal equivalent of access().
8. vfs update page: Count of updates (and potential writes) to pages.
9. vfs read page: This is the same as what I called pageread in the bytes: field. (Quite literally. The counters are incremented next to each other in the source.)
10. vfs read pages: Count of how many times a group of (mapped?) pages have been read. I believe it spawns 'vfs read page' events too but I'm not sure.
11. vfs write page: Same as pagewrite in bytes:.
12. vfs write pages: Count of grouped page writes. Probably spawns 'vfs write page' events too.
13. vfs getdents: How many times directory entries have been read with getdents(). These reads can be served from cache and don't necessarily imply actual NFS requests.
14. vfs setattr: How many times we've set attributes on inodes.
15. vfs flush: How many times pending writes have been forcefully flushed to the server (which can happen for various reasons).
16. vfs fsync: How many times fsync() has been called on directories (which is a no-op for NFS) and files. Sadly you can't tell which is which.
17. vfs lock: How many times people have tried to lock (parts of) a file, including in ways that are basic errors and will never succeed.
18. vfs file release: Basically a count of how many times files have been closed and released.
19. congestion wait: Not used for anything as far as I can tell. There doesn't seem to be anything in the current kernel source that actually increments the counter.
20. truncation: How many times files have had their size truncated.
21. write extension: How many times a file has been grown because you're writing beyond the existing end of the file.
22. silly rename: How many times you removed a file while it was still open by some process, forcing the kernel to instead rename it to '.nfsXXXXXX' and delete it later.
23. short read: The NFS server gave us less data than we asked for when we tried to read something.
24. short write: The NFS server wrote less data than we asked it to.
25. jukebox delay: How many times the NFS server told us EJUKEBOX, which is theoretically for when the server is slowly retrieving something from offline storage. I doubt that you will ever see this from normal servers.
26. pnfs read: A count of NFS v4.1+ pNFS reads.
27. pnfs write: A count of NFS v4.1+ pNFS writes.
All of the VFS operations are for VFS level file and address space operations. Fully understanding what these counters mean requires understanding when those operations are used, for what, and why. I don't have anywhere near this level of understanding of the Linux VFS layer, so my information here should be taken with some salt.
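If you want to look at specific counters without counting fields by hand, here is a sketch that pairs the example events: line with short labels for the fields, in the order listed above (the labels are my own, not the kernel's names):
    EVENT_NAMES = [
        "inode_revalidate", "dnode_revalidate", "data_invalidate",
        "attr_invalidate", "vfs_open", "vfs_lookup", "vfs_access",
        "vfs_updatepage", "vfs_readpage", "vfs_readpages", "vfs_writepage",
        "vfs_writepages", "vfs_getdents", "vfs_setattr", "vfs_flush",
        "vfs_fsync", "vfs_lock", "vfs_release", "congestion_wait",
        "truncation", "write_extension", "silly_rename", "short_read",
        "short_write", "jukebox_delay", "pnfs_read", "pnfs_write",
    ]
    line = ("events: 3717478 126331741 28393 1036981 3355459 1099901 "
            "133724160 1975168 3589 2878751 1405861 5669601 720939 96113 "
            "3235157 225053 30643 3026061 0 23184 1675425 24 0 0 0 0 0")
    events = dict(zip(EVENT_NAMES, (int(x) for x in line.split()[1:])))
    print("silly renames:", events["silly_rename"])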
As you can see from my example events: line, some events are common,
some are rare (eg #22, silly renames, of which there have been 24 over
the lifetime of this NFS mount), and some basically never happen (eg
everything from #23 onwards). Looking at our own collection of quite
a lot of NFS v3 filesystem mounts, the only ones we've seen even a
handful of (on three filesystems) are short writes. I suspect that those
happen when a filesystem runs out of space on the fileserver.
Disclaimer: I'm somewhat fuzzy on what exactly a number of the events counted here really represent because I haven't traced backwards from the kernel code that increments the counters to figure out just what calls it and what it does and so on.
(This is one reason why the lack of good documentation on mountstats
is really frustrating. Decoding a lot of this really needs someone who
actively knows the kernel's internals for the best, most trustworthy
results.)
2013-10-03
What is in /proc/self/mountstats for NFS mounts: an introduction
As I discovered recently, for several years
Linux kernels have made a huge amount of per-mount NFS performance
statistics visible in /proc/self/mountstats. Unfortunately none of
this is documented. Because I have a use for the information and I'm a
glutton for punishment, I'm going to write up what I've found out.
Mountstats contains so much information that this is going to take
several entries.
To start with, let's talk about the overall format of a filesystem entry. In a relatively recent kernel, this looks like the following for an NFS v3 mount over TCP (probably a common case):
device fs1:/cs/mail mounted on /var/mail with fstype nfs statvers=1.0
opts: rw,[... many ...]
age: 11343100
caps: caps=0x3fc7,wtmult=512,dtsize=8192,bsize=0,namlen=255
sec: flavor=1,pseudoflavor=1
events: [... numbers ...]
bytes: [... numbers ...]
RPC iostats version: 1.0 p/v: 100003/3 (nfs)
xprt: tcp [... numbers ...]
per-op statistics
NULL: 0 0 0 0 0 0 0 0
[more operations]
(Some kernel configurations may produce an additional line with
'fsc:'. NFS v4 mounts will have a bunch of additional information
that I haven't looked into because we don't have any.)
The entry ends with a blank line (not shown). There are no fewer than
four different sets of statistics in this; events: for high-level NFS
events, bytes: for actual data being read and written, xprt: for
low level NFS RPC activity, and then relatively detailed statistics for
each NFS protocol operation. If you're hunting performance issues you
may wind up looking at all of them. As is usual for kernel stats, all
of the numbers are 'from the beginning of time' ones and just count up
as things happen; if you want to get per-second statistics or the like
you need to read the file more than once then work out the difference
between your two readings.
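As a starting point for that sort of thing, here is a sketch of reading the file and splitting it into per-mount chunks of lines; computing interval statistics is then a matter of parsing the lines you care about from two successive readings and diffing the numbers:
    def read_mountstats(path="/proc/self/mountstats"):
        # Return a dict mapping mount points to their raw stats lines.
        mounts = {}
        cur = None
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.startswith("device "):
                    # eg 'device fs1:/cs/mail mounted on /var/mail with fstype nfs ...'
                    cur = mounts.setdefault(line.split()[4], [])
                elif line and cur is not None:
                    cur.append(line)
        return mounts

    # Only NFS mounts carry the detailed statistics lines.
    for mnt, lines in read_mountstats().items():
        if any(l.startswith("xprt:") for l in lines):
            print(mnt, len(lines), "stat lines")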
Describing most of the fields in mountstats is sufficiently
complicated that it needs separate entries: bytes: and high-level
NFS events:, xprt: NFS RPC
information, and the per-operation statistics.
The age: field is how many seconds this particular NFS mount has been
in existence. You can use this to compute overall per-interval stats
from all of the counters if you're so inclined, although I don't usually
find overall stats very useful (since our activity is invariably very
bursty).
There are currently a number of programs that try to do something with this firehose of information. Unfortunately I have yet to stumble over one that gives what I consider useful reports.
PS: You might be wondering why this is /proc/<PID>/mountstats
instead of a general file in /proc or whatever. My understanding
is that the reason for this is that modern Linux systems can have
multiple filesystem namespaces and hence you have to talk about a
specific process's view of mounts. It's traditional to use
/proc/self/mountstats because you can always read your own version
of it.
Sidebar: where in the kernel this stuff lives
Since there is no documentation, understanding what is being reported
here requires reading the kernel source. In current 3.12.0-rc3 code,
the overall report is produced by nfs_show_stats() in
fs/nfs/super.c. This is responsible for events: and bytes:;
the information that appears in them is discussed in the comments
for include/linux/nfs_iostat.h.
The RPC information is produced by rpc_print_iostats() in
net/sunrpc/stats.c. The xprt: line is produced by several
different functions in net/sunrpc/xprtsock.c (which one is used
depends on the connection protocol) using information in a structure
that is described in include/linux/sunrpc/xprt.h (note that these
struct fields are not printed in order, so you really do need to
read the code). The per-op statistics are produced using information
described in include/linux/sunrpc/metrics.h but again you'll need
to read the source for the order and details.