How having a metrics system centralized information and got me to check it

January 30, 2019

We recently had a little incident with one of the SSDs on our new fileservers. To condense the story, ZFS started detecting a few checksum errors over time (starting January 13th), we assumed they were a one-time thing, we arranged to scrub the pool this past weekend, and during the scrub ZFS discovered dozens more checksum errors (and faulted the SSD). With the Linux kernel reporting no drive errors, we turned to smartctl to see if there was anything in the SMART attributes. And indeed there was; when I looked on Friday, before the scrub, I noticed right away that the drive had a 'reallocated event count' and 'reallocate(d) NAND block count' of 1.

A while back, we arranged to periodically collect all SMART attributes for all of our fileserver drives and dump them into our Prometheus system as local metrics (we had reasons to want to keep an eye on these). Shortly after that, I built a very brute force Grafana dashboard that let me see the state of those metrics as of some point in time and what had changed in them from an earlier point in time. On Friday, I used the historical log of metrics in Prometheus to see that the reallocated counts had both gone from 0 to 1 early on January 13th, not long before the first checksum errors were reported.

(There is probably a clever way in Prometheus to extract the time when this happened. I did it the brute force way of graphing the attributes and then backing up in time until the graph changed.)
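One less manual way to find that moment is to pull the attribute's samples over a time range and scan for the first value that differs from the starting one. The following is only a minimal sketch of that idea, not what was actually done here. It assumes the samples are (timestamp, string value) pairs, which is the shape the Prometheus HTTP API uses for range query results; the function name and the sample data are made up for illustration.

```python
from typing import Optional, Sequence, Tuple

# One sample: (unix timestamp, value as a string). This mirrors the
# [timestamp, "value"] pairs in Prometheus range query results.
Sample = Tuple[float, str]


def first_change_time(samples: Sequence[Sample]) -> Optional[float]:
    """Return the timestamp of the first sample whose value differs
    from the first sample's value, or None if the value never changes."""
    if not samples:
        return None
    baseline = float(samples[0][1])
    for timestamp, raw_value in samples[1:]:
        if float(raw_value) != baseline:
            return timestamp
    return None


# Made-up samples for a hypothetical reallocated-count series,
# one sample per hour, going from 0 to 1.
samples = [
    (1547337600.0, "0"),
    (1547341200.0, "0"),
    (1547344800.0, "1"),
    (1547348400.0, "1"),
]

print(first_change_time(samples))
```

The answer is only as precise as the interval between samples, so narrowing the time range and re-querying at a finer step would pin the change down further.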

After the scrub had run into all of those problems, I took a second look at smartctl output for the drive to see if anything new had appeared, and it didn't look like anything had; certainly both reallocated counts were still the '1' that they had been on Friday. Then I also checked my brute force Grafana dashboard and it promptly showed me another difference: the 'Raw Read Error Rate' had gone up by a decent amount. There were two reasons I hadn't spotted this earlier. First, it was a reasonably large number that looked like the other reasonably large numbers in other SMART attributes, and second, it had also been non-zero on Friday, before the scrub.

(smartctl output is dozens of lines with a bunch of fields for each attribute; there is a lot of noise, and it's easy to glaze over yet another set of things. The Grafana dashboard made things jump out by only presenting the attributes that had changed, along with the changes in their raw values, which cut things down to five or six attributes that were much easier to read.)
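The "show only what changed" idea can be sketched outside of Grafana too. This is a rough illustration, not the dashboard's actual queries: it assumes each snapshot is a plain mapping from SMART attribute name to raw value, and the attribute values shown are invented.

```python
from typing import Dict, Tuple


def changed_attributes(
    earlier: Dict[str, int], later: Dict[str, int]
) -> Dict[str, Tuple[int, int, int]]:
    """Compare two snapshots of SMART raw values for one drive.

    Returns only the attributes present in both snapshots whose raw
    value differs, mapped to (old value, new value, difference).
    """
    changes = {}
    for name, new_value in later.items():
        if name not in earlier:
            continue
        old_value = earlier[name]
        if new_value != old_value:
            changes[name] = (old_value, new_value, new_value - old_value)
    return changes


# Invented raw values for illustration only.
friday = {
    "Raw_Read_Error_Rate": 120,
    "Reallocated_Event_Count": 1,
    "Power_On_Hours": 2500,
    "Temperature_Celsius": 30,
}
after_scrub = {
    "Raw_Read_Error_Rate": 480,
    "Reallocated_Event_Count": 1,
    "Power_On_Hours": 2570,
    "Temperature_Celsius": 30,
}

for name, (old, new, delta) in sorted(changed_attributes(friday, after_scrub).items()):
    print(f"{name}: {old} -> {new} ({delta:+d})")
```

Unchanged attributes drop out entirely, so a large number that merely sat still no longer hides a large number that moved.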

Some quick graphing later and I could see that the raw read error rate had been zero before January 13th and had been steadily counting up ever since then (with a sudden jump during the scrub). This didn't look like an SSD that had one NAND block that had gone bad, taking out some random collection of sectors and ZFS blocks; this looked like an SSD that was dying. Or maybe not, because SSDs are notoriously magical at a low level so perhaps it was routine for our Crucial SSDs to see some raw read errors and to count up that SMART attribute. And if it was actually a problem indicator, were we seeing it on any other fileserver drives?

Since I had all of our SMART metrics for all of our fileserver drives in one place, in Prometheus, I could easily do some quick PromQL checks and see that basically no other Crucial SSD had a non-zero 'Raw Read Error Rate'. We had a real problem but it was only with this one drive.
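In PromQL itself, comparing an instant vector against a scalar (for example with `!= 0`) filters it down to the series that match, which is what makes this sort of fleet-wide check quick. The same check can be sketched over the result of an instant query. This is an illustration under assumptions: the "metric" and "value" structure follows the Prometheus HTTP API's vector results, but the label names (`host`, `disk`, `model`) and all of the data are hypothetical, since the real metric and label names used here aren't given.

```python
from typing import Any, Dict, List, Tuple


def nonzero_drives(
    result: List[Dict[str, Any]], model_prefix: str
) -> List[Tuple[str, str, float]]:
    """From instant-query style results for one SMART attribute, return
    (host, disk, value) for each drive whose model starts with
    model_prefix and whose value is non-zero."""
    hits = []
    for series in result:
        labels = series["metric"]
        _timestamp, raw_value = series["value"]
        value = float(raw_value)
        if labels.get("model", "").startswith(model_prefix) and value != 0:
            hits.append((labels.get("host", ""), labels.get("disk", ""), value))
    return hits


# Hypothetical results for a 'Raw Read Error Rate' metric.
result = [
    {"metric": {"host": "fs1", "disk": "sda", "model": "Crucial_X"},
     "value": [1548800000.0, "0"]},
    {"metric": {"host": "fs1", "disk": "sdb", "model": "Crucial_X"},
     "value": [1548800000.0, "480"]},
    {"metric": {"host": "fs2", "disk": "sda", "model": "Other_Y"},
     "value": [1548800000.0, "12"]},
]

print(nonzero_drives(result, "Crucial"))
```

Restricting by model matters because different vendors use the raw value of the same SMART attribute differently, so a non-zero value on another make of drive may mean nothing.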

There's nothing in this story that I couldn't have done without Prometheus, at least in theory. We could have created a historical log of SMART metrics in some other way, I could have paid more attention to smartctl output and noticed the non-zero 'Raw Read Error Rate' (either on Friday or after the scrub explosion), I could have manually diff'd or otherwise compared two smartctl reports to see the increased raw read error rate, and I could have gone around to all of the fileservers to check that attribute on all of their disks. But in practice very little of this would have happened without Prometheus or something equivalent to it.

Prometheus did two things for us here. First, it centralized all of this information in one place and thus gave us an easy global view, just like a central syslog server. Second, it reduced the friction of looking at all of these things (over and above centralizing them in one place). Reducing friction is always a huge but subtle advantage, one that you shouldn't underrate; over and over, reducing friction past a crucial point has caused a sea change in how I do things.

(This is related to how our metrics system gets me to explore casual questions. This wasn't a casual question, but the low friction caused me to go further than I otherwise would have tried to.)


Last modified: Wed Jan 30 00:27:29 2019