2019-04-07
Why selecting times is still useful even for dashboards that are about right now
In the aftermath of our power outage, one of the things that I did was put together a Grafana dashboard that was specifically focused on dealing with large scale issues, specifically a lot of machines being down or having problems. In this sort of situation, we don't need to see elaborate status displays and state information; basically we want a list of down machines and a list of other alerts, and very little else to get in the way.
(We have an existing overview dashboard, but it's designed with the tacit assumption that only a few or no machines are down and we want to see a lot of other state information. This is true in our normal situation, but not if we're going through a power shutdown or other large scale event.)
This dashboard will likely only ever be used in production displaying the current time, because 'what is (still) wrong right now' is its entire purpose. Yet when I built it, I found that I not only wanted to leave in the normal Grafana time setting options but specifically build in a panel that would let me easily narrow in on a specific (end) time. This is because setting the time to a specific point is extremely useful for development, testing, and demos of your dashboard. In my case, I could set my in-development dashboard back to a point during our large scale power outage issues and ask myself whether what I was seeing was useful and complete, or whether it was annoying and missing things we'd want to know.
(And also test that the queries and Grafana panel configurations and so on were producing the results that I expected and needed.)
This is obviously especially useful for dashboards that are only interesting in exceptional conditions, conditions that you hopefully don't see all the time and can't find on demand. We don't have large scale issues all that often, so if I want to see and test my dashboard during one before the next issue happens I need to rewind time and set it at a point where the last crisis was happening.
(Now that I've written this down it all feels obvious, but it initially wasn't when I was staring at my dashboard at the current time, showing nothing because nothing was down, and wondering how I was going to test it.)
Sidebar: My best time-selection option in Grafana
In my experience, the best way to select a time range or a time endpoint in Grafana is through a graph panel that shows something over time. What you show doesn't matter, although you might as well try to make it useful; what you really care about is the time scale at the bottom that lets you swipe and drag to pick the end and start points of the time range. The Grafana time selector at the top right is good for the times that it gives fast access to, but it is slow and annoying if you want, say, '8:30 am yesterday'. It is much faster to use the time selector to get your graph so that it includes the time point you care about, then select it off the graph.
A ZFS resilver can be almost as good as a scrub, but not quite
We do periodic scrubs of our pools, roughly every four weeks on a revolving schedule (we only scrub one pool per fileserver at once, and only over the weekend, so we can't scrub all pools on one of our HD based fileservers in one weekend). However, this weekend scrubbing doesn't happen if there's something else more important happening on the fileserver. Normally there isn't, but one of our iSCSI backends didn't come back up after our power outage this Thursday night. We have spare backends, so we added one in to the affected fileserver and started the process of resilvering everything onto the new backend's disks to restore redundancy to all of our mirrored vdevs.
I've written before about the difference between scrubs and resilvers, which is that a resilver potentially reads and validates less than a scrub does. However, we only have two way mirrors and we lost one side of all of them in the backend failure, so resilvering all mirrors has to read all of the metadata and data on every remaining device of every pool. At first, I thought that this was fully equivalent to a scrub and thus we had effectively scrubbed all of our pools on that fileserver, putting us ahead of our scrub schedule instead of behind it. Then I realized that it isn't, because resilvering doesn't verify that the newly written data on the new devices is good.
ZFS doesn't have any explicit 'read after write' checks, although it will naturally do some amount of reads from your new devices just as part of balancing reads. So although you know that everything on your old disks is good, you can't have full confidence that your new disks have correct copies of everything. If something got corrupted on the way to the disk or the disk has a bad spot that wasn't spotted by its electronics, you won't know until it's read back, and the only way to force that is with an explicit scrub.
For our purposes this is still reasonably good. We've at least checked half of every pool, so right now we definitely have one good copy of all of our data. But it's not quite the same as scrubbing the pools and we definitely don't want to reset all of the 'last scrubbed at time X' markers for the pools to right now.
(If you have three or four way mirrors, as we have had in the past, a resilver doesn't even give you this because it only needs to read each piece of data or metadata from one of your remaining N copies.)