2018-12-05
Some basic ZFS ARC statistics and prefetching
I've recently been trying to understand some basic ZFS ARC statistics, partly because of our new shiny thing and partly because of a simple motivating question: how do you know how effective ZFS's prefetching is for your workload?
(Given that ZFS prefetching can still run away with useless IO, this is something that I definitely want to keep an eye on.)
If you read the arcstat manpage or look at the raw ARC kstats, you'll very soon notice things like a 'prefetch hits percentage' arcstat field and a prefetch_data_hits kstat. Unfortunately these prefetch-related kstats will not give us what we want, because they mean something different than what you might expect.
The ARC divides up all incoming read requests into four different categories, based on the attributes of the read. First, the read can be for data or for metadata. Second, the read can be a generally synchronous demand read, where something actively needs the data, or it can be a generally asynchronous prefetch read, where ZFS is just reading some things to prefetch them. These prefetch reads can find what they're looking for in the ARC, and this is what the 'prefetch hits percentage' and so on mean. They're not how often the prefetched data was used for regular demand reads, they're how often an attempt to prefetch things found them already in the ARC instead of having to read them from disk.
(If you repeatedly sequentially re-read the same file and it fits into the ARC, ZFS will fire up its smart prefetching every time but every set of prefetching after the first will find that all the data is still in the ARC. That will give you prefetch hits (for both data and metadata), and then later demand hits for the same data as your program reaches that point in the file.)
All of this gives us four combinations of reads: demand data, demand metadata, prefetch data, and prefetch metadata. Some things can be calculated from this and from the related *_miss kstats. In no particular order:
- The ARC demand hit rate (for data and metadata together) is probably
the most important thing for whether the ARC is giving you good
results, although this partly depends on the absolute volume of
demand reads and a few other things. Demand misses mean that
programs are waiting on disk IO.
- The breakdown of *_miss kstats will tell you why ZFS is
reading things from disk. You would generally like this to
be prefetch reads instead of demand reads, because at least
things aren't waiting on prefetch reads.
- The combined hit and miss kstats for each of the four types (compared to the overall ARC hit and miss counts) will tell you what sorts of read IO ZFS is doing in general. Sometimes there may be surprises there, such as a surprisingly high level of metadata reads.
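As a rough illustration of the three calculations above, here is a minimal Python sketch. It assumes the Linux layout where each kstat is a 'name type data' line (as in /proc/spl/kstat/zfs/arcstats) and that the eight counters are named demand_data_hits, demand_data_misses, and so on for the other three types. The function names are made up for this example.

```python
READ_TYPES = ("demand_data", "demand_metadata", "prefetch_data", "prefetch_metadata")


def parse_kstat_text(text):
    """Parse 'name type data' kstat lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data lines have exactly three fields with a numeric value;
        # this skips the two header lines at the top of the file.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def pct(part, whole):
    """Percentage of part in whole, or 0.0 if there were no events at all."""
    return 100.0 * part / whole if whole else 0.0


def arc_read_summary(stats):
    """Compute the demand hit rate, the miss breakdown, and the read mix."""
    hits = {t: stats.get(t + "_hits", 0) for t in READ_TYPES}
    misses = {t: stats.get(t + "_misses", 0) for t in READ_TYPES}

    demand_hits = hits["demand_data"] + hits["demand_metadata"]
    demand_total = demand_hits + misses["demand_data"] + misses["demand_metadata"]
    total_misses = sum(misses.values())
    total_reads = sum(hits.values()) + total_misses

    return {
        # How often demand reads (data plus metadata) were satisfied from the ARC.
        "demand_hit_pct": pct(demand_hits, demand_total),
        # Why ZFS is going to disk: each type's share of all misses.
        "miss_share_pct": {t: pct(misses[t], total_misses) for t in READ_TYPES},
        # What sort of read requests ZFS is seeing overall.
        "read_share_pct": {t: pct(hits[t] + misses[t], total_reads) for t in READ_TYPES},
    }
```

The counters are totals since the ZFS module was loaded, so a single reading gives you long-term averages, not what is happening right now.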
One limitation of all of these kstats is that they count read requests, not the amount of data being read. I believe that you can generally assume that data reads are for larger sizes than metadata reads, and prefetch data reads may be larger than regular data reads, but you don't know for sure.
Unfortunately, none of this answers our question about the effectiveness of prefetching. Before we give up entirely, in modern versions of ZFS there are two additional kstats of interest:
- demand_hit_predictive_prefetch counts demand reads that found
data in the ARC from prefetch reads. This sounds exactly like what
we want, but experimentally it doesn't seem to come anywhere near
fully accounting for hits on prefetched data; I see low rates of
it when I am also seeing a 100% demand hit rate for sequentially
read data that was not previously in the ARC.
- sync_wait_for_async counts synchronous reads (usually or always demand reads) that found an asynchronous read in progress for the data they wanted. In some versions this may be called async_upgrade_sync instead. Experimentally, this count is also (too) low.
My ultimate conclusion is that there are two answers to my question about prefetching's effectiveness. If you want to know if prefetching is working to bring in data before you need it, you need to run your workload in a situation where it's not already in the ARC and watch the demand hit percent. If the demand hit percent is low and you're seeing a significant number of demand reads that go to disk, prefetching is not working. If the demand hit rate is high (especially if it is essentially 100%), prefetching is working even if you can't see exactly how in the kstats.
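Because the ARC kstats are cumulative counters, watching the demand hit percent for one run of a workload means comparing a snapshot taken before the run with one taken after. A minimal sketch, assuming each snapshot is a dict of counter values (however you collected them) and using a made-up function name:

```python
def demand_hit_pct_between(before, after):
    """Demand hit percentage for the interval between two kstat snapshots.

    Returns None if there were no demand reads in the interval.
    """
    def delta(name):
        return after.get(name, 0) - before.get(name, 0)

    hits = delta("demand_data_hits") + delta("demand_metadata_hits")
    misses = delta("demand_data_misses") + delta("demand_metadata_misses")
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total
```

This only tells you about prefetching if the data was not already in the ARC at the start of the run; otherwise a high number just means the ARC is doing its job.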
If you want to know if ZFS is over-prefetching and having to throw out prefetched data that has never been touched, unfortunately as far as I can see there is no kstat that will give us the answer. ZFS could keep a count of how many prefetched but never read buffers it has discarded, but currently it doesn't, and without that information we have no idea. Enterprising people can perhaps write DTrace scripts to extract this from the kernel internals, but otherwise the best we can do today is to measure this indirectly by observing the difference in read data rate between reads issued to the disks and reads returned to user level. If you see a major difference, and there is any significant level of prefetch disk reads, you have a relatively smoking gun.
If you want to see how well ZFS thinks it can predict your reads, you want to turn to the zfetchstats kstats, particularly zfetchstats:hits and zfetchstats:misses. These are kstats exposed by dmu_zfetch.c, the core DMU prefetcher. A zfetchstats 'hit' is a read that falls into one of the streams of reads that the DMU prefetcher was predicting, and it causes the DMU prefetcher to issue more prefetches for the stream. A 'miss' is a read that doesn't fall into any current stream, for whatever reason. Zfetchstat hits are a necessary prerequisite for prefetches but they don't guarantee that the prefetches are effective or guard against over-fetching.
One useful metric here is that the zfetchstat hit percentage is how sequential the DMU prefetcher thinks the overall IO pattern on the system is. If the hit percent is low, the DMU prefetcher thinks it has a lot of random or at least unpredictable IO on its hands, and it's certainly not trying to do much prefetching; if the hit percent is high, it's all predictable sequential IO or at least close enough to it for the prefetcher's purposes.
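The zfetchstat hit percentage itself is simple arithmetic. A small sketch, assuming you already have the zfetchstats counters in a dict with 'hits' and 'misses' keys (the function name is hypothetical):

```python
def zfetch_hit_pct(zfetchstats):
    """Percentage of reads that fell into a stream the DMU prefetcher was tracking.

    Returns None if the prefetcher has seen no reads at all.
    """
    hits = zfetchstats.get("hits", 0)
    misses = zfetchstats.get("misses", 0)
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total
```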
(For more on ZFS prefetching, see here and the important update on the state of modern ZFS here. As far as I can tell, the prefetching code hasn't changed substantially since it was made significantly more straightforward in late 2015.)
The brute force cron-based way of flexibly timed repeated alerts
Suppose, not hypothetically, that you have a cron job that monitors something important. You want to be notified relatively fast if your Prometheus server is down, so you run your cron job frequently, say once every ten minutes. However, now we have the problem that cron is stateless, so if our Prometheus server goes down and our cron job starts alerting us, it will re-alert us every ten minutes. This is too much noise (at least for us).
There's a standard pattern for dealing with this in cron jobs that send alerts; once the alert happens, you create a state file somewhere and as long as your current state is the same as the state file, you don't produce any output or send out your warning or whatever. But this leads to the next problem, which is that you alert once and are then silent forever afterward, leaving it to people to remember that the problem (still) exists. It would be better to re-alert periodically, say once every hour or so. This isn't too hard to do; you can check to see if the state file is more than an hour old and just re-send the alert if it is.
(One way to do this is with 'find <file> -mmin +... -print'. Although it may not be Unixy, I do rather wish for olderthan and newerthan utilities as a standard and widely available thing. I know I can write them in a variety of ways, but it's not the same.)
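The state file pattern with an age-based re-alert can be sketched in a few lines of Python. This is only an illustration of the pattern, not our actual script; the function name, the state_file argument, and the one hour default are all assumptions.

```python
import time
from pathlib import Path


def should_send_alert(problem_exists, state_file, realert_after=3600, now=None):
    """Decide whether this cron run should send an alert.

    The state file's existence means 'we have already alerted', and its
    modification time records when we last did so.
    """
    path = Path(state_file)
    now = time.time() if now is None else now

    if not problem_exists:
        # All is well: drop the state so the next failure alerts immediately.
        path.unlink(missing_ok=True)
        return False

    try:
        age = now - path.stat().st_mtime
    except FileNotFoundError:
        age = None  # no state file, so this is a new problem

    if age is not None and age <= realert_after:
        # We alerted recently enough; stay quiet.
        return False

    # New problem, or the last alert is stale: record the time and alert.
    path.touch()
    return True
```

The cron job would call this on every run and only produce output when it returns True.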
But this isn't really what we want, because we aren't around all of the time. Re-sending the alert once an hour in the middle of the night or the middle of the weekend will just give us a big pile of junk email to go through when we get back in to the office; instead we want repeats only once every hour or two during weekdays.
When I was writing our checker script, I got to this point and started planning out how I was going to compare against the current hour and day of week in the script to know when I should clear out the state file and so on. Then I had a flash of the obvious and realized that I already had a perfectly good tool for flexibly specifying various times and combinations of time conditions, namely cron itself. The simple way to reset the state file and cause re-alerts at whatever flexible set of times and time patterns I want is to do it through crontab entries.
So now I have one cron entry that runs every ten minutes for the main script, and another cron entry that clears the state file (if it exists) several times a day during the weekday. If we decide we want to be re-notified once a day during the weekend, that'll be easy to add as another cron entry. As a bonus, everyone here understands cron entries, so it will be immediately obvious when things run and what they do in a way that it wouldn't be if all of this was embedded in a script.
(It's also easy for anyone to change. We don't have to reach into a script; we just change crontab lines, something we're already completely familiar with.)
As it stands this is slightly too simplistic, because it clears the state file without caring how old it is. In theory we could generate an alert shortly before the state file is due to be cleared, clear the state file, and then immediately re-alert. To deal with that I decided to go the extra distance and only clear the state file if it was at least a minimum age (using find to see if it was old enough, because we make do with the tools Unix gives us).
(In my actual implementation, the main script takes a special argument that makes it just clear the state file. This way only the script has to know where the state file is or even just what to do to clear the 'do not re-alert' state; the crontab entry just runs 'check-promserver --clear'.)
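For illustration, here is a minimal Python sketch of the 'only clear it if it is old enough' check, done with the file's modification time instead of find. The function name and the 30 minute default are assumptions for this example and are not taken from the real script.

```python
import os
import time


def clear_state_if_old(state_file, min_age=1800, now=None):
    """Remove the state file only if it is at least min_age seconds old.

    Returns True if the file was removed, which means the next failing
    check will re-alert.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.stat(state_file).st_mtime
    except FileNotFoundError:
        # Nothing to clear: there is no outstanding alert.
        return False
    if age < min_age:
        # The alert is too recent; clearing now would cause an immediate repeat.
        return False
    os.remove(state_file)
    return True
```

A clearing mode like the one described above could call something like this and then exit, which keeps the knowledge of where the state file lives inside the script.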