What ZFS 'individual' and 'aggregated' IO size statistics mean

July 27, 2022

If you consult the zpool iostat manual page, one of the things it will tell you about is request size histograms, which come in two variants for each type of ZFS IO:

Print request size histograms [...]. This includes histograms of individual I/O (ind) and aggregate I/O (agg). These stats can be useful for observing how well I/O aggregation is working. [...]
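(In practice you get these histograms with 'zpool iostat -r'. For example, something like 'zpool iostat -r tank 10' will print per-vdev request size histograms for a hypothetical pool 'tank' every ten seconds, with separate 'ind' and 'agg' columns for each type of IO.)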

This leaves many things unexplained. Let's start with what aggregated IO is and where it comes from. In the kernel, the primary source of aggregated IO in normal operation is vdev_queue_aggregate() in vdev_queue.c, where there are some comments:

Sufficiently adjacent io_offset's in ZIOs will be aggregated. [...]

We can aggregate I/Os that are sufficiently adjacent and of the same flavor, as expressed by the AGG_INHERIT flags. The latter requirement is necessary so that certain attributes of the I/O, such as whether it's a normal I/O or a scrub/resilver, can be preserved in the aggregate.

In other words, if ZFS is processing a bunch of independent IOs that are adjacent or sufficiently close together, it will aggregate them together. The 'flavor' of ZFS IO here is more than read versus write; it covers the IO priority as well (eg, synchronous versus asynchronous reads). There's also a limit on how large a span of data this aggregation can cover, and the limit depends on whether your ZFS pool is on SSDs or on HDDs. On SSDs the limit is the ZFS parameter zfs_vdev_aggregation_limit_non_rotating, which defaults to 128 KBytes, and on HDDs the limit is zfs_vdev_aggregation_limit, which defaults to 1 MByte.
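(On Linux, both of these are OpenZFS module parameters, so as a quick check you can read them from /sys/module/zfs/parameters; for instance, 'cat /sys/module/zfs/parameters/zfs_vdev_aggregation_limit_non_rotating' should report the current SSD limit in bytes.)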

An eventual ZFS IO is either 'aggregated' or 'individual'; the two are exclusive of each other (see the histogram update code in vdev_stat_update() in vdev.c). The sum of aggregated and individual IO is how much IO you've done. Similarly, the size distribution of the joint histogram is the size distribution of your IO. This is experimentally verified for counts, although I can't completely prove it to myself in the OpenZFS code. I'm also not entirely sure what's going on with the IO sizes.
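As an illustration of combining the two histograms, here's a minimal Python sketch; the bucket sizes and counts are invented for the example, standing in for the 'ind' and 'agg' columns of one IO type from 'zpool iostat -r':

    # Hypothetical per-bucket counts for one IO type, keyed by request
    # size in bytes (the histograms use power of two buckets).
    ind = {4096: 1200, 8192: 300, 131072: 4000}
    agg = {8192: 50, 131072: 900}

    # Since an eventual IO is counted as either individual or aggregated
    # but never both, the overall size distribution is just the
    # per-bucket sum of the two histograms.
    total = {size: ind.get(size, 0) + agg.get(size, 0)
             for size in sorted(set(ind) | set(agg))}
    for size, count in total.items():
        print(f"{size:>8} bytes: {count} IOs")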

(An aggregated IO swallows other, existing ZIOs. The recorded size of an aggregated IO could be just that of the 'head' IO, or it could wind up covering all of the swallowed IOs as well. I can't figure out from the OpenZFS code which it is or convince myself of one or the other, although I would certainly expect the aggregated IO's statistics to include and cover all of the swallowed ZIOs, and for the swallowed ZIOs to not show up in the 'individual' numbers.)

On all of the machines that I can readily get statistics for, individual IO seems to be both more common by count and responsible for more total IO volume than aggregated IO. However, aggregated IO is often large enough in both count and especially size to matter; it's clear to me that you can't just ignore aggregated IO. Asynchronous writes seem to especially benefit from aggregation. All of this is on SSD-based pools, often using large files, and I don't understand how the non-rotating aggregation limit interacts with the 128 KB default ZFS recordsize.

I don't know if you should expect I/O aggregation to happen (and be alarmed if your statistics suggest that it's not) or just be pleased when it does. Presumably some of this has to do with what size individual IOs you see, especially on SSDs.

PS: The current types of ZFS IO that you're likely to see are synchronous and asynchronous reads and writes, scrub, trim, and 'rebuild'; all of these actually come from the priority that ZFS assigns to IOs. The non-histogram per-type counters for total bytes and total operations are somewhat different, but, to confuse you, trim IO is counted as 'ioctl' there, for reasons beyond the scope of this entry.
