Some things on ZFS's per-pool performance statistics

January 5, 2019

Somewhat to my surprise, I recently found out that ZFS has had basic per-pool activity and performance statistics for a while (they're old enough that they're in our version of OmniOS, which is not exactly current these days). On sufficiently modern versions of ZFS (currently only the development version of ZFS on Linux), you can even get a small subset of these per-pool stats for each separate dataset, which may be useful for tracking activity. To be clear, these are not the stats that are made visible through 'zpool iostat'; these are a separate set of stats that are visible through (k)stats, and which can be picked up and tracked by at least some performance metrics systems on some platforms.

(Specifically, I know that Prometheus's host agent can collect them from ZFS on Linux. In theory you could add collecting them on OmniOS to the agent, perhaps using my Go kstat package, but someone would have to do that programming work and I don't know if the Prometheus people would accept it. I haven't looked at other metrics host agents to see if they can collect this information on OmniOS or other Illumos systems. I suspect that no host agent can collect the more detailed 'zpool iostat' statistics because as far as I know there's no documented API to obtain them programmatically.)
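As a rough illustration of what collecting these looks like on ZFS on Linux, here is a minimal Python sketch that turns the text of a per-pool 'io' kstat into a dictionary. It assumes a three-line layout (a kstat header line, a line of field names, and a line of values); that layout, the function name, and the sample numbers are assumptions for illustration, not something taken from the ZFS documentation.

```python
def parse_pool_io_kstat(text):
    """Parse the text of a per-pool 'io' kstat into a {field: int} dict.

    Assumed layout: line 1 is a kstat header, line 2 holds the field
    names, and line 3 holds the matching values.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 3:
        raise ValueError("expected a header, field names, and values")
    names = lines[1].split()
    values = lines[2].split()
    if len(names) != len(values):
        raise ValueError("field names and values do not line up")
    return dict(zip(names, (int(v) for v in values)))


# Made-up sample data, only to show the shape of the input.
SAMPLE = """\
12 3 0x00 1 80 100 200
nread nwritten reads writes wtime wlentime wupdate rtime rlentime rupdate wcnt rcnt
1000 2000 10 20 300 400 500 600 700 800 0 1
"""

stats = parse_pool_io_kstat(SAMPLE)
print(stats["nwritten"], stats["writes"])
```

On a real system you would feed this the contents of the pool's kstat file (I believe this is /proc/spl/kstat/zfs/<pool>/io on ZFS on Linux, but check your version) instead of the sample text.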

The best description of the available stats is in the kstat(3kstat) manpage, in the description of 'I/O Statistics' (ZFS on Linux has its own separate description). Most of the stats are relatively obvious, but there are two important things to note about them. First, as far as I know (and can tell), these are for IO at the disk level, not for user level IO. This shows up in particular for writes if you have any level of vdev redundancy. Since we use mirrors, the amount written is basically twice the user level write rate; if a user process is writing at 100 MB/sec, we see 200 MB/sec of writes. This is honest but a little bit confusing for keeping track of user activity.
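To make the mirror arithmetic concrete, here is a small Python sketch that turns two readings of the pool's nwritten counter into a disk-level write rate and a rough user-level estimate. The function name and the mirror_ways parameter are made up for illustration, and the estimate assumes a pool built entirely from mirrors of the same width and ignores metadata and other overhead.

```python
def write_rates(prev_nwritten, cur_nwritten, interval_secs, mirror_ways=2):
    """Return (disk-level bytes/sec, estimated user-level bytes/sec).

    Assumes every vdev is a mirror with 'mirror_ways' sides, so each
    user-level byte shows up 'mirror_ways' times in the pool stats.
    """
    disk_rate = (cur_nwritten - prev_nwritten) / interval_secs
    return disk_rate, disk_rate / mirror_ways


# 2 GB written at the disk level over 10 seconds on two-way mirrors.
disk, user = write_rates(0, 2_000_000_000, 10)
print(disk / 1e6, "MB/sec at the disks,", user / 1e6, "MB/sec estimated user level")
```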

The second is that the rtime, wtime, rlentime, and wlentime stats do not distinguish between read and write IO, but between 'run' and 'wait' IO. This is best explained in a comment from the Illumos kstat manpage:

A large number of I/O subsystems have at least two basic "lists" of transactions they manage: one for transactions that have been accepted for processing but for which processing has yet to begin, and one for transactions which are actively being processed (but not done). For this reason, two cumulative time statistics are defined here: pre-service (wait) time, and service (run) time.

I don't know enough about the ZFS IO queue code to be able to tell you if a ZFS IO being in the 'run' state only happens when it's actively submitted to the disk (or other device), or if ZFS has some internal division. The ZFS code does appear to consider IO 'active' at the same point as it makes it 'running', and based on things I've read about the ZIO scheduler I think this is probably at least close to 'it was issued to the device'.

(On Linux, 'issued to the device' really means 'put in the block IO system'. This may or may not result in it being immediately issued to the device, depending on various factors, including how much IO you're allowing ZFS to push to the device.)

If you have only moderate activity, it's routine to have little or no 'wait' (w) activity or time, with most of the overall time for request handling being in the 'run' time. You will see 'wait' time (and queue sizes) rise as your ZFS pool IO level rises, but before then you can have an interesting and pretty pattern where your average 'run' time is a couple of milliseconds or higher but your average 'wait' time is in the microseconds.
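One way to get average 'wait' and 'run' times like these is to take two snapshots of the stats and divide the change in wlentime and rlentime by the number of IOs completed in between. This is a sketch under assumptions: that the *lentime values are in nanoseconds, that reads plus writes counts completed IOs, and that this division is a fair approximation of the average time per IO. The function name and sample numbers are invented.

```python
def average_times(prev, cur):
    """Return (average wait ns, average run ns) per IO between snapshots.

    'prev' and 'cur' are dicts with reads, writes, wlentime and rlentime.
    Units are assumed to be nanoseconds.
    """
    ops = (cur["reads"] + cur["writes"]) - (prev["reads"] + prev["writes"])
    if ops <= 0:
        return 0.0, 0.0  # no IO completed in the interval
    avg_wait = (cur["wlentime"] - prev["wlentime"]) / ops
    avg_run = (cur["rlentime"] - prev["rlentime"]) / ops
    return avg_wait, avg_run


# Made-up snapshots: 1000 IOs, with little waiting and a few ms of run time.
before = {"reads": 0, "writes": 0, "wlentime": 0, "rlentime": 0}
after = {"reads": 600, "writes": 400,
         "wlentime": 20_000_000, "rlentime": 3_000_000_000}

wait_ns, run_ns = average_times(before, after)
print(wait_ns / 1e3, "usec wait,", run_ns / 1e6, "msec run")
```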

In terms of Linux disk IO stats, the *time stats are the equivalent of the use stat, and the *lentime stats are the equivalent of the aveq field. There is no equivalent of the Linux ruse or wuse fields, i.e. no field that gives you the total time taken by all completed 'wait' or 'run' IO. I think that there are ways to calculate much of the same information you can get for Linux disk IO from what ZFS (k)stats give you, but that's another entry.
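As a rough sketch of how the *time and *lentime stats can be used this way, the following Python turns the change between two snapshots into a utilization fraction and an average queue length for each of 'wait' and 'run'. It assumes all four stats and the elapsed time are in nanoseconds; the function name and the numbers are hypothetical.

```python
def queue_summary(prev, cur, elapsed_ns):
    """Summarize 'wait' and 'run' queue behavior between two snapshots.

    The *time deltas over elapsed time give the fraction of time the
    queue was non-empty; the *lentime deltas over elapsed time give the
    average queue length. All values are assumed to be nanoseconds.
    """
    def delta(field):
        return cur[field] - prev[field]

    return {
        "wait_busy": delta("wtime") / elapsed_ns,
        "wait_avg_len": delta("wlentime") / elapsed_ns,
        "run_busy": delta("rtime") / elapsed_ns,
        "run_avg_len": delta("rlentime") / elapsed_ns,
    }


# Made-up snapshots taken one second apart.
s0 = {"wtime": 0, "wlentime": 0, "rtime": 0, "rlentime": 0}
s1 = {"wtime": 0, "wlentime": 250_000_000,
      "rtime": 500_000_000, "rlentime": 2_000_000_000}

summary = queue_summary(s0, s1, 1_000_000_000)
print(summary)
```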

For ZFS datasets, you currently get only reads, writes, nread, and nwritten. For datasets, the writes appear to be user-level writes, not low level disk IO writes, so they will track closely with the amount of data written at the user level (or at the level of an NFS server). As I write this here in early 2019, these per-dataset stats aren't in any released version of even ZFS on Linux, but I expect to see them start showing up in various places (such as FreeBSD) before too many years go by.

PS: I regret not knowing that these stats existed some time ago, because I probably would have hacked together something to look at them on our OmniOS machines, even though we never used 'zpool iostat' very much for troubleshooting for various reasons. In general, if you have multiple ZFS pools it's always useful to be able to see what the active things are at the moment.

Written on 05 January 2019.
