How having a metrics system centralized information and got me to check it
We recently had a little incident with one of the SSDs on our new fileservers. To condense the story, ZFS started detecting a few checksum errors over time (starting January 13th), we assumed they were a one-time thing, we arranged to scrub the pool this weekend, and during the scrub ZFS discovered dozens more checksum errors (and faulted the SSD). With the Linux kernel reporting no drive errors, we turned to smartctl to see if there was anything in the SMART attributes. And indeed there was; when I looked on Friday, before the scrub, I noticed right away that the drive had a 'reallocated event count' and 'reallocate(d) NAND block count' of 1.
A while back, we arranged to periodically collect all SMART attributes for all of our fileserver drives and dump them into our Prometheus system as local metrics (we had reasons to want to keep an eye on these). Shortly after that, I built a very brute force Grafana dashboard that let me see the state of those metrics as of some point in time and what had changed in them from an earlier point in time. On Friday, I used the historical log of metrics in Prometheus to see that the reallocated counts had both gone from 0 to 1 early on January 13th, not long before the first checksum errors were reported.
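As a concrete sketch of what that dashboard does (with made-up names; the real metric and label names depend on how you collect the SMART data), suppose the raw attribute values land in Prometheus as a gauge called smartctl_attribute_raw_value with 'host', 'disk', and 'attribute' labels. A Grafana table panel can then show only the attributes that changed, and by how much, with a PromQL expression along these lines:

    # raw attribute values that differ from what they were three days
    # ago, together with the size of the change
    smartctl_attribute_raw_value - smartctl_attribute_raw_value offset 3d != 0

(In a real dashboard you'd probably drive the offset from a dashboard variable instead of hard-coding three days.)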
(There is probably a clever way in Prometheus to extract the time when this happened. I did it the brute force way of graphing the attributes and then backing up in time until the graph changed.)
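For what it's worth, one way to do it directly in PromQL is a subquery wrapped in timestamp(), sketched here with the same made-up metric name as above and 'sdX' standing in for the real disk label:

    # the earliest time in the last 30 days (at one hour resolution)
    # that the reallocated event count was above zero
    min_over_time(
      timestamp(
        smartctl_attribute_raw_value{attribute="Reallocated_Event_Count", disk="sdX"} > 0
      )[30d:1h]
    )

This gives you a Unix timestamp, so you still have to turn it into a human-readable date yourself.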
After the scrub had run into all of those problems, I took a second look at smartctl output for the drive to see if anything new had appeared, and it didn't look like it; certainly both reallocated counts were still the '1' that they had been on Friday. Then I also checked my brute force Grafana dashboard, and it promptly showed me another difference: the 'Raw Read Error Rate' had gone up by a decent amount. There were two reasons I hadn't spotted this earlier: first, it was a reasonably large number that looked like the other reasonably large numbers in other SMART attributes, and second, it had also been non-zero on Friday, before the scrub.
(smartctl output is dozens of lines, with a bunch of fields for each attribute; there is a lot of noise, and it's easy for your eyes to glaze over when confronted with yet another set of numbers. The Grafana dashboard made things jump out by presenting only the attributes that had changed, along with the changes in their raw values, which reduced everything to about five or six attributes that were much easier to read.)
Some quick graphing later, I could see that the raw read error rate had been zero before January 13th and had been steadily counting up ever since then (with a sudden jump during the scrub). This didn't look like an SSD that had one NAND block go bad, taking out some random collection of sectors and ZFS blocks; this looked like an SSD that was dying. Or maybe not, because SSDs are notoriously magical at a low level, so perhaps it was routine for our Crucial SSDs to see some raw read errors and to count up that SMART attribute. And if it was actually a problem indicator, were we seeing it on any other fileserver drives?
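(The graphing itself is nothing special; with the same made-up metric name as before, it's basically a graph of

    # the raw error count for the suspect disk; graphed over a month
    # or two, a steady upward slope stands out immediately
    smartctl_attribute_raw_value{attribute="Raw_Read_Error_Rate", disk="sdX"}

possibly along with increase() of it over a day to see how fast it's growing.)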
Since I had all of our SMART metrics for all of our fileserver drives in one place, in Prometheus, I could easily do some quick PromQL checks and see that basically no other Crucial SSD had a non-zero 'Raw Read Error Rate'. We had a real problem but it was only with this one drive.
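The check itself doesn't have to be anything clever; with the same made-up metric name as before, an instant query of something like this is enough:

    # every drive in the fleet with a non-zero raw read error rate
    smartctl_attribute_raw_value{attribute="Raw_Read_Error_Rate"} > 0

    # or simply how many such drives there are
    count(smartctl_attribute_raw_value{attribute="Raw_Read_Error_Rate"} > 0)

(If you have a model or vendor label on the metrics you can narrow this to just the Crucial SSDs, but with a modest fleet the unfiltered list is short enough to eyeball.)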
There's nothing in this story that I couldn't have done without Prometheus, at least in theory. We could have created a historical log of SMART metrics in some other way, I could have paid more attention to smartctl output and noticed the non-zero 'Raw Read Error Rate' (either on Friday or after the scrub explosion), I could have manually diff'd or otherwise compared two smartctl reports to see the increased RRER, and I could have gone around to all of the fileservers to check all of the disks for their RRER. But in practice very little of this would have happened without Prometheus or something equivalent to it.
What Prometheus did for us here was two things. First, it centralized all of this information in one place and thus gave us an easy global view, just as a central syslog server does. Second, it reduced the friction of looking at all of these things (over and above centralizing them in one place). Reducing friction is always and evermore a huge, subtle advantage, one that you shouldn't underrate; over and over, reducing friction past a crucial point has caused a sea change in how I do things.
(This is related to how our metrics system gets me to explore casual questions. This wasn't a casual question, but the low friction caused me to go further than I otherwise would have tried to.)