Wandering Thoughts archives

2022-07-27

What ZFS 'individual' and 'aggregated' IO size statistics mean

If you consult the zpool iostat manual page, one of the things it will tell you about is request size histograms, which come in two variants for each type of ZFS IO:

Print request size histograms [...]. This includes histograms of individual I/O (ind) and aggregate I/O (agg). These stats can be useful for observing how well I/O aggregation is working. [...]

This leaves many things unexplained. Let's start with what aggregated IO is and where it comes from. In the kernel, the primary source of aggregated IO in normal operation is vdev_queue_aggregate() in vdev_queue.c, where there are some comments:

Sufficiently adjacent io_offset's in ZIOs will be aggregated. [...]

We can aggregate I/Os that are sufficiently adjacent and of the same flavor, as expressed by the AGG_INHERIT flags. The latter requirement is necessary so that certain attributes of the I/O, such as whether it's a normal I/O or a scrub/resilver, can be preserved in the aggregate.

In other words, if ZFS is processing a bunch of independent IOs that are adjacent or sufficiently close together, it will aggregate them together. The 'flavor' of ZFS IO here is more than read versus write; it covers the IO priority as well (eg, synchronous versus asynchronous reads). There's also a limit on how large a span of data this aggregation can cover, and the limit depends on whether your ZFS pool is on SSDs or on HDDs. On SSDs the limit is the ZFS parameter zfs_vdev_aggregation_limit_non_rotating, which defaults to 128 KBytes, and on HDDs the limit is zfs_vdev_aggregation_limit, which defaults to 1 MByte.
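
On Linux, both of these are ZFS module parameters, so you can check what limits are actually in effect by reading them out of sysfs. Here's a minimal sketch of doing that, assuming the /sys/module/zfs/parameters paths that OpenZFS on Linux uses:

  #!/usr/bin/python3
  # Minimal sketch: report the ZFS IO aggregation limits in effect,
  # assuming OpenZFS on Linux exposes them as module parameters
  # under /sys/module/zfs/parameters.
  from pathlib import Path

  PARAMS = Path("/sys/module/zfs/parameters")

  for name in ("zfs_vdev_aggregation_limit",
               "zfs_vdev_aggregation_limit_non_rotating"):
      try:
          value = int((PARAMS / name).read_text())
      except OSError:
          print(f"{name}: not available (no ZFS module loaded?)")
          continue
      print(f"{name} = {value} bytes ({value // 1024} KiB)")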

An eventual ZFS IO is either 'aggregated' or 'individual'; the two are exclusive of each other (see the histogram update code in vdev_stat_update() in vdev.c). The sum of aggregated and individual IO is how much IO you've done. Similarly, the size distribution of the joint histogram is the size distribution of your IO. This is experimentally verified for counts, although I can't completely prove it to myself in the OpenZFS code. I'm also not entirely sure what's going on with the IO sizes.

(An aggregated IO swallows other, existing ZIOs. The size and count of the aggregated IO could be either just the 'head' IO and its size, or it could wind up including all of the swallowed IOs. I can't figure out from the OpenZFS code which it is or convince myself of one or the other, although I certainly would expect the aggregated IO's statistics to include and cover all of the swallowed ZIOs, and for the swallowed ZIOs to not show up in the 'individual' numbers.)
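
To make the arithmetic of the joint view concrete, here's a tiny sketch of combining the two histograms; the bucket sizes and counts here are made up for illustration, not real 'zpool iostat -r' output, and the byte total is only approximate because it treats every IO in a bucket as being exactly the bucket's size:

  # Tiny sketch: combine made-up 'individual' and 'aggregated' request
  # size histograms into the joint histogram that describes all IO.
  # Keys are bucket sizes in bytes, values are counts of IOs.
  ind = {4096: 120, 8192: 40, 131072: 300}
  agg = {131072: 25, 262144: 10, 1048576: 5}

  joint = {}
  for hist in (ind, agg):
      for size, count in hist.items():
          joint[size] = joint.get(size, 0) + count

  total_ops = sum(joint.values())
  # Approximate: real IOs fall somewhere within each bucket's range.
  approx_bytes = sum(size * count for size, count in joint.items())
  print(f"total ops: {total_ops}, approximate bytes: {approx_bytes}")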

On all of the machines that I can readily get statistics for, individual IO seems more common by count and to be responsible for more IO volume than aggregated IO. However, aggregated IO is often large enough in both count and especially size to matter; it's clear to me that you can't just ignore it. Asynchronous writes seem to especially benefit from aggregation. All of this is on SSD-based pools, often using large files, and I don't understand how the non-rotating aggregation limit interacts with the 128 KB default ZFS recordsize.

I don't know if you should expect I/O aggregation to happen (and be alarmed if your statistics suggest that it's not) or just be pleased when it does. Presumably some of this has to do with what size individual IOs you see, especially on SSDs.

PS: The current types of ZFS IO that you're likely to see are synchronous and asynchronous reads and writes, scrub, trim, and 'rebuild'; all of these actually come from the priority that ZFS assigns to IOs. The non-histogram per-type counters for total bytes and total operations are somewhat different, but, to confuse you, trim IO is counted as 'ioctl' there, for reasons beyond the scope of this entry.

ZFSIndividualVsAggregatedIOs written at 23:09:51

2022-07-25

ZFS pool IO statistics (and vdev statistics) are based on physical disk IO

Today I wound up re-learning something that I sort of already knew about the IO statistics for pools and vdevs that you can get through things such as zpool iostat. Namely, that at least for bytes read and written and the number of IO operations, these are what I call physical IO statistics; they aggregate and sum up the underlying physical disk IO information.

Whenever you're generating IO statistics for a logical entity with redundancy, you have a choice about how to present the volume and number of IOs. One option is to present the logical view of IO, where something asked you to write 1 GByte of data, so you report that. The other option is to present the physical view of IO, where although you were given 1 GB to write, you wrote 2 GB to disk because you wrote it to both sides of a two-way mirror.

The logical view is often how people think of doing IO to entities with redundancy, and it's what things such as Linux's software RAID normally report. If you write 1 GB to a two-way Linux software RAID mirror, your disk IO statistics will tell you that there was 1 GB of writes to 'mdX' and 1 GB of writes to each of the two disks (correlating this is up to you). If you do the same thing to a ZFS filesystem in a pool using mirrored vdevs, 'zpool iostat' will report that the pool did 2 GB of write IO.

(Well, 'zpool iostat' itself will report this as a bandwidth number. But the underlying information that ZFS provides is the cumulative volume in bytes.)

Presenting pool and vdev IO volume statistics the way ZFS does has the useful property that the numbers all add up. If you sum up all of the IO to devices in a vdev, you get the IO volume for the vdev; if you sum up IO volume across all vdevs, you get the IO volume for the pool (ignoring for the moment the prospect of devices being removed from a vdev). However, it makes it somewhat harder to know what logical write volume (and sometimes read volume) you're actually seeing, because you have to know how your vdevs multiply logical write IO and take that into account. A pool with two-way mirrors that reports 1 GB of write IO is seeing twice as much logical write IO as a pool with four-way mirrors reporting the same amount.

(Of course, IO volume and load is already not really comparable across pools because different pools may have different numbers of vdevs even if they have the same type of vdev. A pool with three mirror vdevs can handle much more write volume than a pool with only one mirror vdev, assuming they're using the same disks and so on.)
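
To make the mirror multiplication concrete, here's a small sketch (with a made-up helper) of converting the reported physical write volume back into an estimated logical write volume; the pool layouts and byte counts are hypothetical:

  # Small sketch: estimate logical write volume from the physical write
  # volume that pool-level statistics report, for a pool made entirely
  # of N-way mirror vdevs. All numbers here are hypothetical.
  def estimated_logical_writes(physical_bytes, mirror_ways):
      # Each logical write goes to every side of its mirror vdev, so
      # the physical volume is roughly mirror_ways times larger.
      return physical_bytes / mirror_ways

  GB = 1024 ** 3
  reported = 2 * GB   # what the pool-level statistics say was written

  print(estimated_logical_writes(reported, 2) / GB)  # two-way mirrors: 1.0 GB logical
  print(estimated_logical_writes(reported, 4) / GB)  # four-way mirrors: 0.5 GB logical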

One view of logical IO volume for a pool can be gotten by adding up all of the per-dataset IO statistics (assuming that you can get them). However, this will give you a genuine logical view of things, including read IO that was satisfied from the ARC and never went to disk at all. For some purposes this will be what you want; for others, it may be awkward.
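
On Linux, one source of per-dataset IO statistics is the per-objset kstat files that OpenZFS exposes. Here's a rough sketch of adding them up for one pool; it assumes the /proc/spl/kstat/zfs/<pool>/objset-* layout with cumulative 'nread' and 'nwritten' byte counters, and 'tank' is a hypothetical pool name:

  # Rough sketch: sum per-dataset logical IO for one pool, assuming the
  # per-dataset kstat files that OpenZFS on Linux puts under
  # /proc/spl/kstat/zfs/<pool>/objset-*, each with cumulative 'nread'
  # and 'nwritten' byte counters. 'tank' is a hypothetical pool name.
  from pathlib import Path

  POOL = "tank"
  total_read = total_written = 0

  for objset in Path("/proc/spl/kstat/zfs", POOL).glob("objset-*"):
      for line in objset.read_text().splitlines():
          fields = line.split()
          if len(fields) >= 3 and fields[0] == "nread":
              total_read += int(fields[-1])
          elif len(fields) >= 3 and fields[0] == "nwritten":
              total_written += int(fields[-1])

  print(f"logical bytes read:    {total_read}")
  print(f"logical bytes written: {total_written}")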

The old simple per-pool IO statistics worked this way, so in a sense I should have known this already, but those come from a different (and now obsolete) system than the statistics that 'zpool iostat' uses. ZFS is also somewhat inconsistent on this; for example, pool scrub progress on mirrored vdevs is reported in logical bytes, not physical ones.

PS: As you'd expect, the 'zpool iostat' level IO statistics report scrub IO in physical terms, not the logical numbers that 'zpool status' reports for scrub progress. In modern sequential scrubs on mirrored vdevs, the physical IO from 'zpool iostat' can add up to more than the total amount claimed to be scrubbed times the mirror level. I assume that this extra IO is due to the IO needed for the initial metadata scan.

ZFSPoolIostatsPhysical written at 23:16:16
