Some things on ZFS's per-pool performance statistics
Somewhat to my surprise, I recently found out that ZFS has had basic
per-pool activity and performance statistics for a while (they're
old enough that they're in our version of OmniOS, which is not
exactly current these days). On sufficiently modern versions of ZFS
(currently only the development version of ZFS on Linux), you can even get a small subset of these
per-pool stats for each separate dataset, which may be useful for
tracking activity. To be clear, these are not the stats that are
made visible through 'zpool iostat'; these are a separate set of
stats that are visible through (k)stats, and which can be picked
up and tracked by at least some performance metrics systems on some
platforms.
(Specifically, I know that Prometheus's host agent can collect them from
ZFS on Linux. In theory you could add
collecting them on OmniOS to the agent, perhaps using my Go kstat
package, but someone would
have to do that programming work and I don't know if the Prometheus
people would accept it. I haven't looked at other metrics host
agents to see if they can collect this information on OmniOS or
other Illumos systems. I suspect that no host agent can collect the
more detailed 'zpool iostat' statistics because as far as I know
there's no documented API to obtain them programmatically.)
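If you want to look at the raw per-pool stats by hand on ZFS on Linux, here's a minimal Go sketch of reading them. It assumes (but doesn't guarantee) that they're exposed as /proc/spl/kstat/zfs/&lt;pool&gt;/io, with a kstat header line, then a line of field names, then a line of values:

  // pooliostat.go: print the per-pool ZFS IO kstats on ZFS on Linux.
  // Assumes (but does not guarantee) that they are exposed as
  // /proc/spl/kstat/zfs/<pool>/io with a kstat header line, a line of
  // field names, and then a line of values.
  package main

  import (
      "fmt"
      "os"
      "strconv"
      "strings"
  )

  func main() {
      if len(os.Args) != 2 {
          fmt.Fprintf(os.Stderr, "usage: %s POOL\n", os.Args[0])
          os.Exit(1)
      }
      data, err := os.ReadFile("/proc/spl/kstat/zfs/" + os.Args[1] + "/io")
      if err != nil {
          fmt.Fprintln(os.Stderr, err)
          os.Exit(1)
      }
      lines := strings.Split(strings.TrimSpace(string(data)), "\n")
      if len(lines) < 3 {
          fmt.Fprintln(os.Stderr, "unexpected kstat file layout")
          os.Exit(1)
      }
      names := strings.Fields(lines[1])  // e.g. nread nwritten reads writes ...
      values := strings.Fields(lines[2])
      for i, name := range names {
          if i >= len(values) {
              break
          }
          v, _ := strconv.ParseUint(values[i], 10, 64)
          fmt.Printf("%-10s %d\n", name, v)
      }
  }

The stats that matter for what follows are reads, writes, nread, nwritten, rtime, wtime, rlentime, and wlentime.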
The best description of the available stats is in the kstat(3kstat) manpage, in the description of 'I/O Statistics' (for ZFS on Linux, you want to see here). Most of the stats are relatively obvious, but there are two important things to note about them. First, as far as I know (and can tell), these are for direct IO to disk, not for user level IO. This shows up in particular for writes if you have any level of vdev redundancy. Since we use mirrors, the amount written is basically twice the user level write rate; if a user process is writing at 100 MB/sec, we see 200 MB/sec of writes. This is honest but a little bit confusing for keeping track of user activity.
The second is that the rtime, wtime, rlentime, and wlentime stats do not distinguish between read and write IO, but between 'run' and 'wait' IO. This is best explained in a comment from the Illumos kstat manpage:
A large number of I/O subsystems have at least two basic "lists" of transactions they manage: one for transactions that have been accepted for processing but for which processing has yet to begin, and one for transactions which are actively being processed (but not done). For this reason, two cumulative time statistics are defined here: pre-service (wait) time, and service (run) time.
I don't know enough about the ZFS IO queue code to be able to tell you if a ZFS IO being in the 'run' state only happens when it's actively submitted to the disk (or other device), or if ZFS has some internal division. The ZFS code does appear to consider IO 'active' at the same point as it makes it 'running', and based on things I've read about the ZIO scheduler I think this is probably at least close to 'it was issued to the device'.
(On Linux, 'issued to the device' really means 'put in the block IO system'. This may or may not result in it being immediately issued to the device, depending on various factors, including how much IO you're allowing ZFS to push to the device.)
If you have only moderate activity, it's routine to have little or no 'wait' (w) activity or time, with most of the overall time for request handling being in the 'run' time. You will see 'wait' time (and queue sizes) rise as your ZFS pool IO level rises, but before then you can have an interesting and pretty pattern where your average 'run' time is a couple of milliseconds or higher but your average 'wait' time is in the microseconds.
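To make that concrete, here is a hedged sketch of how I'd turn two snapshots of these stats into average per-IO 'run' and 'wait' times. It assumes that rlentime and wlentime are cumulative (queue length * time) integrals in nanoseconds and that reads plus writes counts completed IOs; that's my reading of the kstat documentation, not something the ZFS code promises. The numbers in main() are made up purely to show the arithmetic.

  // A sketch of turning two snapshots of a pool's IO kstats into average
  // per-IO 'run' and 'wait' times. Assumes rlentime and wlentime are
  // cumulative (queue length * time) integrals in nanoseconds and that
  // every IO passes through both the wait and run states.
  package main

  import "fmt"

  // Snapshot holds the fields we need from one reading of a pool's IO kstat.
  type Snapshot struct {
      Reads, Writes      uint64 // completed IO counts
      Rlentime, Wlentime uint64 // run/wait length*time integrals, nanoseconds
  }

  // avgTimes returns the average time an IO spent in the 'run' and 'wait'
  // states, in milliseconds, over the interval between the two snapshots.
  func avgTimes(prev, cur Snapshot) (runMS, waitMS float64) {
      ios := float64((cur.Reads + cur.Writes) - (prev.Reads + prev.Writes))
      if ios == 0 {
          return 0, 0
      }
      runMS = float64(cur.Rlentime-prev.Rlentime) / ios / 1e6
      waitMS = float64(cur.Wlentime-prev.Wlentime) / ios / 1e6
      return runMS, waitMS
  }

  func main() {
      // Made-up numbers purely to show the arithmetic.
      prev := Snapshot{Reads: 1000, Writes: 4000, Rlentime: 9_000_000_000, Wlentime: 20_000_000}
      cur := Snapshot{Reads: 1200, Writes: 4600, Rlentime: 10_600_000_000, Wlentime: 24_000_000}
      run, wait := avgTimes(prev, cur)
      fmt.Printf("avg run %.2f ms, avg wait %.4f ms\n", run, wait)
  }

With these made-up deltas you get an average 'run' time of 2 ms and an average 'wait' time of 5 microseconds, which is exactly the sort of pattern I described above.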
In terms of Linux disk IO stats, the *time stats are the equivalent of the use stat, and the *lentime stats are the equivalent of the aveq field. There is no equivalent of the Linux ruse or wuse fields, i.e. no field that gives you the total time taken by all completed 'wait' or 'run' IO. I think there are ways to calculate much of the same information you can get for Linux disk IO from what ZFS (k)stats give you, but that's another entry.
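To make the equivalence concrete, here's a tiny hedged sketch of deriving an iostat-style utilization and average queue length from rtime and rlentime deltas, assuming both are cumulative nanosecond counters; the deltas here are made-up numbers.

  // A sketch of deriving Linux-style 'use' (utilization) and 'aveq'
  // (average queue length) style numbers from rtime and rlentime deltas,
  // assuming both are cumulative nanosecond counters.
  package main

  import "fmt"

  func main() {
      // Made-up deltas over a 10 second sampling interval.
      const elapsedNS = 10_000_000_000
      const rtimeDelta = 1_500_000_000    // 'run' (busy) time accumulated
      const rlentimeDelta = 4_200_000_000 // run queue length * time accumulated

      util := 100 * float64(rtimeDelta) / float64(elapsedNS)  // like iostat's %util
      avgQueue := float64(rlentimeDelta) / float64(elapsedNS) // like avgqu-sz
      fmt.Printf("utilization %.1f%%, average run queue %.2f\n", util, avgQueue)
  }

With those numbers you'd report the pool as 15% utilized with an average run queue of 0.42 over the interval.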
For ZFS datasets, you currently get only reads, writes, nread, and nwritten. For datasets, the writes appear to be user-level writes, not low level disk IO writes, so they will track closely with the amount of data written at the user level (or at the level of an NFS server). As I write this here in early 2019, these per-dataset stats aren't in any released version of even ZFS on Linux, but I expect to see them start showing up in various places (such as FreeBSD) before too many years go by.
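If you want to poke at the per-dataset stats by hand, my understanding is that the development ZFS on Linux exposes them as named kstats in objset-* files under /proc/spl/kstat/zfs/&lt;pool&gt;/, with one 'name type data' line per stat; the path and layout here are assumptions on my part, not something I can promise. A minimal sketch to dump them:

  // datasetstat.go: dump per-dataset ZFS kstats for a pool, assuming the
  // development ZFS on Linux exposes them as named kstats in
  // /proc/spl/kstat/zfs/<pool>/objset-* files with 'name type data' lines.
  package main

  import (
      "fmt"
      "os"
      "path/filepath"
      "strings"
  )

  func main() {
      if len(os.Args) != 2 {
          fmt.Fprintf(os.Stderr, "usage: %s POOL\n", os.Args[0])
          os.Exit(1)
      }
      files, _ := filepath.Glob("/proc/spl/kstat/zfs/" + os.Args[1] + "/objset-*")
      if len(files) == 0 {
          fmt.Fprintln(os.Stderr, "no per-dataset kstats found")
          os.Exit(1)
      }
      for _, file := range files {
          data, err := os.ReadFile(file)
          if err != nil {
              continue
          }
          lines := strings.Split(string(data), "\n")
          if len(lines) < 3 {
              continue
          }
          // Skip the kstat header line and the 'name type data' column line.
          for _, line := range lines[2:] {
              fields := strings.Fields(line)
              if len(fields) == 3 {
                  fmt.Printf("%s: %s = %s\n", filepath.Base(file), fields[0], fields[2])
              }
          }
      }
  }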
PS: I regret not knowing that these stats existed some time ago,
because I probably would have hacked together something to look at
them on our OmniOS machines, even though
we never used 'zpool iostat' very much for troubleshooting for
various reasons. In general, if you have multiple ZFS pools it's
always useful to be able to see what the active things are at the
moment.