Some basic ZFS ARC statistics and prefetching

December 5, 2018

I've recently been trying to understand some basis ZFS ARC statistics, partly because of our new shiny thing and partly because of a simple motivating question: how do you know how effective ZFS's prefetching is for your workload?

(Given that ZFS prefetching can still run away with useless IO, this is something that I definitely want to keep an eye on.)

If you read the arcstat manpage or look at the raw ARC kstats, you'll very soon notice things like 'prefetch hits percentage' arcstat field and a prefetch_data_hits kstat. Unfortunately these prefetch-related kstats will not give us what we want, because they mean something different that what you might expect.

The ARC divides up all incoming read requests into four different categories, based on the attributes of the read. First, the read can be for data or for metadata. Second, the read can be a generally synchronous demand read, where something actively needs the data, or it can be a generally asynchronous prefetch read, where ZFS is just reading some things to prefetch them. These prefetch reads can find what they're looking for in the ARC, and this is what the 'prefetch hits percentage' and so on mean. They're not how often the prefetched data was used for regular demand reads, they're how often an attempt to prefetch things found them already in the ARC instead of having to read them from disk.

(If you repeatedly sequentially re-read the same file and it fits into the ARC, ZFS will fire up its smart prefetching every time but every set of prefetching after the first will find that all the data is still in the ARC. That will give you prefetch hits (for both data and metadata), and then later demand hits for the same data as your program reaches that point in the file.)

All of this gives us four combinations of reads; demand data, demand metadata, prefetch data, and prefetch metadata. Some things can be calculated from this and from the related *_miss kstats. In no particular order:

  • The ARC demand hit rate (for data and metadata together) is probably the most important thing for whether the ARC is giving you good results, although this partly depends on the absolute volume of demand reads and a few other things. Demand misses mean that programs are waiting on disk IO.

  • The breakdown of *_miss kstats will tell you why ZFS is reading things from disk. You would generally like this to be prefetch reads instead of demand reads, because at least things aren't waiting on prefetch reads.

  • The combined hit and miss kstats for each of the four types (compared to the overall ARC hit and miss counts) will tell you what sorts of read IO ZFS is doing in general. Sometimes there may be surprises there, such as a surprisingly high level of metadata reads.

One limitation of all of these kstats is that they count read requests, not the amount of data being read. I believe that you can generally assume that data reads are for larger sizes than metadata reads, and prefetch data reads may be larger than regular data reads, but you don't know for sure.

Unfortunately, none of this answers our question about the effectiveness of prefetching. Before we give up entirely, in modern versions of ZFS there are two additional kstats of interest:

  • demand_hit_predictive_prefetch counts demand reads that found data in the ARC from prefetch reads. This sounds exactly like what we want, but experimentally it doesn't seem to come anywhere near fully accounting for hits on prefetched data; I see low rates of it when I am also seeing a 100% demand hit rate for sequentially read data that was not previously in the ARC.

  • sync_wait_for_async counts synchronous reads (usually or always demand reads) that found an asynchronous read in progress for the data they wanted. In some versions this may be called async_upgrade_sync instead. Experimentally, this count is also (too) low.

My ultimate conclusion is that there are two answers to my question about prefetching's effectiveness. If you want to know if prefetching is working to bring in data before you need it, you need to run your workload in a situation where it's not already in the ARC and watch the demand hit percent. If the demand hit percent is low and you're seeing a significant number of demand reads that go to disk, prefetching is not working. If the demand hit rate is high (especially if it is essentially 100%), prefetching is working even if you can't see exactly how in the kstats.

If you want to know if ZFS is over-prefetching and having to throw out prefetched data that has never been touched, unfortunately as far as I can see there is no kstat that will give us the answer. ZFS could keep a count of how many prefetched but never read buffers it has discarded, but currently it doesn't, and without that information we have no idea. Enterprising people can perhaps write DTrace scripts to extract this from the kernel internals, but otherwise the best we can do today is to measure this indirectly by observing the difference in read data rate between reads issued to the disks and reads returned to user level. If you see a major difference, and there is any significant level of prefetch disk reads, you have a relatively smoking gun.

If you want to see how well ZFS thinks it can predict your reads, you want to turn to the zfetchstats kstats, particularly zfetchstats:hits and zfetchstats:misses. These are kstats exposed by dmu_zfetch.c, the core DMU prefetcher. A zfetchstats 'hit' is a read that falls into one of the streams of reads that the DMU prefetcher was predicting, and it causes the DMU prefetcher to issue more prefetches for the stream. A 'miss' is a read that doesn't fall into any current stream, for whatever reason. Zfetchstat hits are a necessary prerequisite for prefetches but they don't guarantee that the prefetches are effective or guard against over-fetching.

One useful metric here is that the zfetchstat hit percentage is how sequential the DMU prefetcher thinks the overall IO pattern on the system is. If the hit percent is low, the DMU prefetcher thinks it has a lot of random or at least unpredictable IO on its hands, and it's certainly not trying to do much prefetching; if the hit percent is high, it's all predictable sequential IO or at least close enough to it for the prefetcher's purposes.

(For more on ZFS prefetching, see here and the important update on the state of modern ZFS here. As far as I can tell, the prefetching code hasn't changed substantially since it was made significantly more straightforward in late 2015.)

Written on 05 December 2018.
« The brute force cron-based way of flexibly timed repeated alerts
Modern Bourne shell arithmetic is pretty pleasant »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Wed Dec 5 23:17:12 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.