2022-07-25
ZFS pool IO statistics (and vdev statistics) are based on physical disk IO
Today I wound up re-learning something that I sort of already knew about the IO statistics for pools and vdevs that you can get through things such as 'zpool iostat'. Namely, at least for bytes read and written and the number of IO operations, these IO statistics are what I call physical IO statistics; they aggregate and sum up the underlying physical disk IO information.
Whenever you're generating IO statistics for a logical entity with redundancy, you have a choice of how to present information about the volume and number of IOs. One option is to present the logical view of IO, where something asked you to write 1 GB of data so you report that. The other option is to present the physical view of IO, where although you were given 1 GB to write, you wrote 2 GB to disk because you wrote it to both sides of a two-way mirror.
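As a minimal sketch of the difference, here is the two-way mirror arithmetic in Python (the helper and the numbers are just for illustration):

    GB = 1024 ** 3

    def physical_write_bytes(logical_bytes, mirror_ways):
        # Physical view: every side of the mirror gets a full copy of the data.
        return logical_bytes * mirror_ways

    logical = 1 * GB
    # Logical view: you asked to write 1 GB, so 1 GB is what gets reported.
    print("logical view :", logical / GB, "GB")
    # Physical view: a two-way mirror turns that into 2 GB of writes to disk.
    print("physical view:", physical_write_bytes(logical, 2) / GB, "GB")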
The logical view is often how people think of doing IO to entities with redundancy, and it's what things such as Linux's software RAID normally report. If you write 1 GB to a two-way Linux software RAID mirror, your disk IO statistics will tell you that there was 1 GB of writes to 'mdX' and 1 GB of writes to each of the two disks (correlating this is up to you). If you do the same thing to a ZFS filesystem in a pool using mirrored vdevs, 'zpool iostat' will report that the pool did 2 GB of write IO.
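If you want to do that correlation on Linux yourself, the raw numbers are in /proc/diskstats. Here's a sketch that pulls out the cumulative write volume for an md array and its members; the device names are placeholders for whatever your setup actually uses, and /proc/diskstats counts in 512-byte sectors regardless of the disks' real sector size.

    #!/usr/bin/env python3
    # Sketch: report cumulative write volume for a Linux software RAID mirror
    # and its member disks from /proc/diskstats. The device names here are
    # placeholders; adjust them for your own array.

    DEVICES = {"md0", "sda", "sdb"}

    def written_bytes():
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] in DEVICES:
                    # Field 10 is sectors written; sectors are 512-byte units.
                    stats[fields[2]] = int(fields[9]) * 512
        return stats

    for dev, nbytes in sorted(written_bytes().items()):
        print(f"{dev}: {nbytes / 1024**3:.2f} GiB written")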
(Well, 'zpool iostat' itself will report this as a bandwidth number. But the underlying information that ZFS provides is the cumulative volume in bytes.)
Presenting pool and vdev IO volume statistics the way ZFS does has the useful property that the numbers all add up. If you sum up all of the IO to devices in a vdev, you get the IO volume for the vdev; if you sum up IO volume across all vdevs, you get the IO volume for the pool (ignoring for the moment the prospect of devices being removed from a vdev). However, it makes it somewhat harder to know what logical write volume (and sometimes read volume) you're actually seeing, because you have to know how your vdevs multiply logical write IO and take that into account. A pool reporting 1 GB of write IO with two-way mirrors is seeing much more logical IO than a pool with four-way mirrors would be.
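To make the 'numbers add up' property and the mirror correction concrete, here's a sketch with invented per-disk byte counts for a hypothetical pool of two two-way mirror vdevs; in real life the per-disk figures would come from 'zpool iostat' or the underlying kstats.

    GiB = 1024 ** 3
    MIRROR_WAYS = 2

    # Invented physical write volume for each disk, grouped by vdev.
    vdevs = {
        "mirror-0": {"sda": 3 * GiB, "sdb": 3 * GiB},
        "mirror-1": {"sdc": 2 * GiB, "sdd": 2 * GiB},
    }

    # Per-vdev volume is the sum over its disks; pool volume is the sum over vdevs.
    vdev_totals = {name: sum(disks.values()) for name, disks in vdevs.items()}
    pool_total = sum(vdev_totals.values())

    print("pool physical writes    :", pool_total / GiB, "GiB")
    # Rough logical estimate: divide out the mirror multiplication.
    print("estimated logical writes:", pool_total / MIRROR_WAYS / GiB, "GiB")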
(Of course, IO volume and load is already not really comparable across pools because different pools may have different numbers of vdevs even if they have the same type of vdev. A pool with three mirror vdevs can handle much more write volume than a pool with only one mirror vdev, assuming they're using the same disks and so on.)
One view of logical IO volume for a pool can be gotten by adding up all of the per-dataset IO statistics (assuming that you can get them). However, this will give you a genuinely logical view of things, including read IO that was satisfied from the ARC and never went to disk at all. For some purposes this will be what you want; for others, it may be awkward.
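On Linux, OpenZFS exposes per-dataset IO statistics as kstats under /proc/spl/kstat/zfs/<pool>/objset-*, with (as far as I know) 'nread' and 'nwritten' byte counters. Assuming that layout and those field names, a sketch of summing them into a pool-wide logical view looks like this:

    #!/usr/bin/env python3
    # Sketch: sum per-dataset IO kstats into a pool-wide logical view on
    # Linux OpenZFS. The /proc path and the 'nread'/'nwritten' field names
    # are assumptions about current OpenZFS; check your version.

    import glob

    def logical_pool_io(pool):
        total_read = total_written = 0
        for path in glob.glob(f"/proc/spl/kstat/zfs/{pool}/objset-*"):
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    if len(fields) >= 3 and fields[0] == "nread":
                        total_read += int(fields[2])
                    elif len(fields) >= 3 and fields[0] == "nwritten":
                        total_written += int(fields[2])
        return total_read, total_written

    # 'tank' is a placeholder pool name.
    nread, nwritten = logical_pool_io("tank")
    print(f"logical reads : {nread / 1024**3:.2f} GiB (includes ARC hits)")
    print(f"logical writes: {nwritten / 1024**3:.2f} GiB")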
The old simple per-pool IO statistics worked this way, so in a sense I should have known this already, but those are a different and now obsolete system from the statistics that 'zpool iostat' uses. ZFS is also somewhat inconsistent on this; for example, pool scrub progress on mirrored vdevs is reported in logical bytes, not physical ones.
PS: As you'd expect, the 'zpool iostat' level IO statistics report scrub IO in physical terms, not the logical terms that scrub progress is reported in. In modern sequential scrubs on mirrored vdevs, the physical IO from 'zpool iostat' can add up to more than the total amount claimed to be scrubbed times the mirror level. I assume that this extra IO comes from the IO needed for the initial metadata scan.
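As a rough sanity check with invented numbers: if a scrub of a two-way mirror pool reports 500 GiB scrubbed, the read counters from 'zpool iostat' should add up to at least 1000 GiB, and whatever is left over is presumably the metadata scan (plus any other overhead).

    GiB = 1024 ** 3
    MIRROR_WAYS = 2

    scrubbed_logical = 500 * GiB    # scrub progress, reported logically (invented)
    physical_reads = 1040 * GiB     # sum of 'zpool iostat' read volume (invented)

    expected_minimum = scrubbed_logical * MIRROR_WAYS
    print("extra physical read IO:", (physical_reads - expected_minimum) / GiB, "GiB")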