2016-01-20
Illumos's ZFS prefetching has recently become less superintelligent than it used to be
Several years ago (in 2012) I wrote "How ZFS file prefetching seems to work", which discussed how ZFS prefetching worked at the time. As you may have guessed from the title of this entry, things have recently changed, at least in Illumos and other things built on the open source ZFS code (which includes the very latest ZFS on Linux). The basic change is Illumos issue 5987, "zfs prefetch code needs work", which landed in mainstream Illumos in early September of 2015, appears to have made it into FreeBSD trunk shortly afterwards, and made it into ZFS on Linux only in late December.
The old code detected up to 8 streams (by default) of forward and reverse reads that were either straight sequential or strided (eg 'read every fourth block'). The new code still has 8 streams, but each stream now only matches sequential forward reads. This makes ZFS prefetching much easier to avoid and makes the code much easier to follow. I suspect that it won't have much effect on real workloads, although you never know; maybe there's real code that does strided forward reads or the like.
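To make the difference concrete, here is a toy model of what "each stream only matches sequential forward reads" means in practice. This is a sketch of the idea only, not the actual Illumos code; the names (ToyPrefetcher, Stream, read) are made up for illustration, and evicting the oldest stream when all eight are in use is my assumption rather than something I've checked.

```python
from dataclasses import dataclass, field

MAX_STREAMS = 8  # the default number of streams mentioned above


@dataclass
class Stream:
    next_blkid: int  # the block this stream expects to be read next


@dataclass
class ToyPrefetcher:
    # Hypothetical stand-in for the per-file prefetch state.
    streams: list = field(default_factory=list)

    def read(self, blkid: int, nblks: int = 1) -> bool:
        """Record a read; return True if it continued an existing stream."""
        for s in self.streams:
            # Only an exact "next block" match counts: no strides, no reverse.
            if s.next_blkid == blkid:
                s.next_blkid = blkid + nblks
                return True
        # No stream matched, so start a new one.
        # Assumption: when full, throw away the oldest stream.
        if len(self.streams) >= MAX_STREAMS:
            self.streams.pop(0)
        self.streams.append(Stream(blkid + nblks))
        return False


def pattern_hits(blocks):
    """Feed a sequence of block numbers to a fresh prefetcher."""
    p = ToyPrefetcher()
    return [p.read(b) for b in blocks]
```

In this model, pattern_hits([0, 1, 2, 3]) matches a stream on every read after the first, while a strided pattern such as [0, 4, 8, 12] or a reverse pattern such as [10, 9, 8, 7] never matches anything and just keeps creating new streams.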
(There is also a tunable change; zfetch_max_distance replaces zfetch_block_cap as the limit on the amount of data that will be prefetched for a single stream. It's in bytes and defaults to 8 MBytes.)
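Since the new limit is in bytes rather than blocks, how many blocks it covers depends on the file's block size. A bit of arithmetic shows the scale, assuming the limit is applied as a simple byte cap and using the usual 128 KB default recordsize as the example; the function name and constant here are mine, not anything from the ZFS source.

```python
# The default from above: 8 MBytes per stream.
ZFETCH_MAX_DISTANCE = 8 * 1024 * 1024


def max_prefetch_blocks(block_size: int,
                        max_distance: int = ZFETCH_MAX_DISTANCE) -> int:
    """How many whole blocks fit under the per-stream byte limit.

    Hypothetical helper for illustration; assumes the limit is a plain
    byte cap on how far ahead a single stream may prefetch.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return max_distance // block_size
```

With 128 KB blocks this comes out to at most 64 blocks per stream, and with 8 KB blocks it is 1024 blocks for the same amount of data.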
Unfortunately the largest single drawback of ZFS prefetching still remains: prefetching (still) doesn't notice if the data it read in gets discarded from the ARC before it could be used. Just as before, as long as you're reading sequentially from the file, it will keep prefetching more and more data. Nor do streams time out if the file hasn't been touched at all in a while; each ZFS dnode may have up to eight of them hanging around basically forever, waiting patiently to match against the next read and restart prefetching (perhaps very large prefetching, as the amount of data to be prefetched never shrinks as far as I can see).
(That streams are per dnode instead of per open file handle does help explain why ZFS wants up to eight of them, since the dnode is shared across everyone who has the file open. If multiple people have the same file open and are reading from it sequentially (perhaps in different spots), it's good if they all get prefetched.)