How ZFS file prefetching seems to work
Since I was just digging into this, I want to write down what I've learned before I forget it. I will give you the punchline right up front: if you want to do IO that avoids ZFS prefetching as much as possible, you need to use randomized IO. It also turns out that ZFS prefetching has what I consider some flaws that can cause IO heartburn.
(You should use a fixed seed for your randomization function and watch ZFS statistics to make sure that you really are avoiding the prefetcher. Change the seed as necessary.)
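As a minimal sketch of that advice, here is one way to generate a repeatable randomized read order in Python. The helper names and the 128 KB block size are my assumptions for illustration; nothing here is ZFS-specific, and whether a given seed really dodges the prefetcher still has to be checked against ZFS statistics.

```python
import os
import random

BLOCK_SIZE = 128 * 1024  # assumed block size; adjust to match the filesystem


def random_block_order(file_size, seed, block_size=BLOCK_SIZE):
    """Return every block number of the file exactly once, in a seeded random order."""
    nblocks = (file_size + block_size - 1) // block_size
    order = list(range(nblocks))
    # A private Random instance keeps the order repeatable for a given seed.
    random.Random(seed).shuffle(order)
    return order


def read_randomly(path, seed, block_size=BLOCK_SIZE):
    """Read the file one block at a time in seeded random order; return bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for blk in random_block_order(size, seed, block_size):
            total += len(os.pread(fd, block_size, blk * block_size))
    finally:
        os.close(fd)
    return total
```

A random permutation can still contain short runs that happen to look sequential or strided, which is one reason to keep the seed fixed, look at the statistics, and change the seed if the prefetcher is still getting hits.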
So here is what I've deduced from reading the ZFS code in OpenSolaris
and from some experimentation with DTrace instrumentation. First off,
ZFS read prefetching operates on ZFS blocks, which are generally 128 KB
(this is set by the recordsize ZFS property), and uses what it calls 'streams'.
Each stream can recognize one pattern of sequential IO: linear forward
reads, linear backwards reads, and forward or backwards reads with a
'stride' (where you skip forward or backwards N blocks every time you do
a read). Each file can have up to eight separate streams associated with
it. How streams are created and deleted is complex and I don't entirely
understand it, but I believe the basic approach is that ZFS attempts to
match your read IO with an existing stream and if it can't it tries
to create a new stream (there is some tricky code to discover strided
patterns).
When a stream matches an IO pattern, it triggers a prefetch read; this fetches some amount of reads ahead of what you're reading now. The more times the stream matches, the more it reads ahead; how much it reads ahead starts at one read and more or less doubles every time the stream is used, up to a maximum size. The prefetch is for the size of read you're normally doing (so if you read one block, the prefetch is for one block), but it may stride forward to cover multiple expected future reads. For example, if you are reading one block every 10 blocks, after a while it will be fetching one block 10 blocks ahead, one block 20 blocks ahead, one block 30 blocks ahead, and so on.
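To make the growth and striding concrete, here is a toy model of which blocks a matched stream would ask for. It is a sketch of the behaviour as I've described it, not a translation of the ZFS code: the function name and parameters are made up, the doubling is exact here while the real thing only 'more or less' doubles, and the cap is the 256-block maximum from the sidebar.

```python
def prefetch_targets(current_block, stride, read_blocks, hits, cap=256):
    """Blocks a matched stream might prefetch after `hits` consecutive matches.

    current_block: first block of the read that just matched
    stride:        blocks between the starts of successive reads (negative = backwards)
    read_blocks:   size of each read, in blocks
    hits:          how many times this stream has matched so far (>= 1)
    cap:           assumed maximum number of blocks prefetched per trigger
    """
    depth = 2 ** (hits - 1)                      # 1, 2, 4, ... reads ahead
    depth = min(depth, max(1, cap // read_blocks))
    targets = []
    for i in range(1, depth + 1):
        start = current_block + i * stride
        if start < 0:                            # ran off the front of the file
            break
        targets.extend(range(start, start + read_blocks))
    return targets
```

With a one-block read every 10 blocks, the third match gives `prefetch_targets(0, 10, 1, 3)`, which is blocks 10, 20, 30 and 40. A plain sequential reader is the same thing with a stride equal to the read size.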
What makes it basically impossible to avoid this prefetching with a cleverly chosen and unpredictable pattern is how streams get recycled. Normally you might think that new IOs and new (attempted) streams would push out old streams, so if you just flood ZFS with a sufficiently long sequence of unpatterned IO you would be fine. It turns out that this is not how ZFS does it; streams only get recycled based on time since last use. A stream must be unused for at least two seconds before it will be tossed in favour of a new attempt to create a stream. So if you are doing a bunch of unpredictable IO on a file, your first eight or so IOs will create eight initial streams, which will then sit there for two seconds attempting to match themselves up with some IO you're doing. Only after those two seconds will they start to be tossed in favour of new ones (and then this cycle repeats for another two seconds or so, assuming that you can do completely unpatterned IO).
Two seconds covers a significant amount of IO and, worse, an unpredictable amount, which is why I say that ZFS prefetching can only really be defeated by randomized IO.
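The recycling rule is easier to see in a toy simulation. This sketch only models forward sequential streams and uses constant and class names I made up; the eight-stream limit and the two-second idle time are the defaults described in this entry. It is an illustration of the policy as I understand it, not the actual stream management code.

```python
MAX_STREAMS = 8      # per-file stream limit (illustrative name)
MIN_SEC_REAP = 2.0   # idle seconds before a stream may be reclaimed (illustrative name)


class StreamTable:
    """Toy per-file stream table; each stream only tracks the next block it expects."""

    def __init__(self):
        self.streams = []  # each entry is [expected_block, last_used_time]

    def read(self, block, now):
        """Record a one-block read at time `now` and report what happened to it."""
        for s in self.streams:
            if s[0] == block:                  # continues a forward sequential stream
                s[0], s[1] = block + 1, now
                return "matched"
        if len(self.streams) < MAX_STREAMS:
            self.streams.append([block + 1, now])
            return "created"
        # Table is full: only a stream idle for at least MIN_SEC_REAP can be replaced.
        oldest = min(self.streams, key=lambda s: s[1])
        if now - oldest[1] >= MIN_SEC_REAP:
            oldest[0], oldest[1] = block + 1, now
            return "recycled"
        return "no stream"
```

Feed this eight scattered reads at time 0 and all eight slots fill up. Every unmatched read for the next two seconds gets "no stream", while any read that happens to land on a block one of those eight streams expects counts as a match and triggers prefetch.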
This prefetching approach has what I consider to be several flaws. The largest single flaw is that ZFS prefetching does not check whether its work was useful. What it cares about is that it matched your IO pattern; it doesn't notice if the data it prefetched for you expired from the ARC unread. Under memory pressure this combines explosively with two other ZFS prefetching features. First, prefetch streams seem to be valid forever as long as the znode for their file stays in memory; they are more or less never declared too old to be useful. Second, as previously mentioned, a successful match against your IO pattern triggers more and more prefetching, even if this prefetched data will just expire unread from the ARC. Now that I understand the ZFS prefetching mechanisms, I can see why our prefetching problems happened; sequential reads from a lot of files at once are a worst case for ZFS prefetching.
This shows how vital it is for any prefetch mechanism to have end-to-end feedback. Prefetch should not be considered successful until the user-level code itself has read the prefetched data from cache. Merely predicting the user-level IO pattern is not good enough except in favorable situations.
(For example, as far as I can see ZFS will happily prefetch the maximum amount for a file that you are reading through sequentially at a rate of one read every five minutes even in the presence of a huge IO load that evicts the prefetched data from the ARC long before you issue the next read.)
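To illustrate what I mean by end-to-end feedback, here is a sketch of a read-ahead depth controller that only grows when prefetched data is actually read and shrinks when it is evicted unread. This is emphatically not what ZFS does; the class, its method names, and the halve-on-waste policy are all my own assumptions, chosen only to show the shape of the idea.

```python
class FeedbackPrefetcher:
    """Toy read-ahead depth controller driven by whether prefetched blocks get used."""

    def __init__(self, max_depth=256):
        self.depth = 1             # how many reads ahead to prefetch next time
        self.max_depth = max_depth
        self.pending = set()       # prefetched blocks the application has not read yet

    def prefetch(self, blocks):
        """Note that these blocks were brought into cache by prefetch."""
        self.pending.update(blocks)

    def on_read(self, block):
        """Application read a block; depth grows only if prefetch supplied it."""
        if block in self.pending:
            self.pending.discard(block)
            self.depth = min(self.depth * 2, self.max_depth)
            return True
        return False

    def on_evict(self, block):
        """Cache evicted a block; if it was never read, that prefetch was wasted."""
        if block in self.pending:
            self.pending.discard(block)
            self.depth = max(1, self.depth // 2)
```

The important part is `on_evict`: a slow sequential reader under heavy memory pressure would see its prefetched blocks thrown away unread and its depth pushed back down, instead of staying at the maximum forever.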
Sidebar: the tunables and code involved in this
All of the prefetching code is in uts/common/fs/zfs/dmu_zfetch.c in your handy OpenSolaris or Illumos source repository. In OpenSolaris, the available tunables are:
||Maximum number of streams per file; defaults to 8.|
||The maximum number of blocks that we can ever prefetch when a stream triggers prefetch. These blocks may be split up between multiple IO positions. Defaults to 256 blocks.|
||Do not do prefetching for reads larger than this size (in bytes). Defaults to 1 MB.|
||Minimum seconds before an inactive stream can be reclaimed; defaults to 2 seconds.|
All of these tunables are global ones; they affect all ZFS filesystems and all ZFS pools.