ZFS On Linux's kernel module issues are not like NVidia's
In the Hacker News discussion of my entry on the general risk to ZFS from the shift in its userbase towards ZFS on Linux, a number of people suggested that the risks to ZFS on Linux were actually low since proprietary kernel modules such as NVidia's GPU drivers don't seem to have any problems dealing with things like the kernel making various functions inaccessible or GPL-only. I have two views on this, because I think that the risks are a bit different and less obvious than what they initially look like.
On a purely technical level, ZFS on Linux probably has it easier and is at less risk than NVidia's GPU drivers. Advanced GPU drivers deal with hardware that only they can work with and may need to do various weird sorts of operations for memory mapping, DMA control, and so on that aren't needed or used by existing kernel modules. It's at least possible that the Linux kernel could stop supporting access to this sort of stuff from any kernel module (in tree or out, regardless of license), and someday leave a GPU driver up the creek.
By contrast, most in-tree filesystems are built as loadable modules (for good reasons) so the kernel already provides to modules everything necessary to support a filesystem, and it's likely that ZFS on Linux could survive with just this. What ZFS on Linux needs from the kernel is likely to be especially close to what BTRFS needs, since both are dealing with very similar core issues like talking to multiple disks at once, checksum computation, and so on, and there is very little prospect that BTRFS will either be removed from the kernel tree or only be supported if built into the kernel itself.
But on the political and social level it's another thing entirely. NVidia and other vendors of proprietary kernel modules have already decided that they basically don't care about anything except what they can implement. Licenses and people's views of their actions are irrelevant; if they can do it technically and they need to in order to make their driver work, they will. GPL shim modules to get access to GPL-only kernel symbols are just the starting point.
Most of the people involved in ZFS on Linux are probably not going to feel this way. Sure, ZFS on Linux could implement shim modules and other workarounds if the kernel cuts off full access to necessary things, but I don't think they're going to. ZFS on Linux developers are open source developers in a way that NVidia's driver programmers are not, and if the Linux kernel people yell at them hard enough they will likely go away, not resort to technical hacks to get around the technical barriers.
In other words, the concern with ZFS on Linux is not that it will become technically unviable, because that's unlikely. The concern is that it will become socially unviable, that to continue on a technical level its developers and users would have to become just as indifferent to the social norms of the kernel license as NVidia is.
(And if that did happen, which it might, I think it would make ZFS on Linux much more precarious than it currently is, because ZoL would be relying on its ability to find and keep both good kernel developers and sources of development funding that are willing to flout social norms in a way that they don't have to today.)
How having a metrics system centralized information and got me to check it
We recently had a little incident with one of the SSDs on our new
fileservers. To condense the story,
ZFS started detecting a few checksum errors over time (starting January
13th), we assumed they were a one-time thing, we arranged to scrub the
pool this weekend, and during the scrub ZFS discovered dozens more
checksum errors (and faulted the SSD). With the Linux kernel reporting
no drive errors, we turned to
smartctl to see if there was anything
in the SMART attributes. And indeed there was; when I looked on Friday,
before the scrub, I noticed right away that the drive had a 'reallocated
event count' and 'reallocate(d) NAND block count' of 1.
A while back, we arranged to periodically collect all SMART attributes for all of our fileserver drives and dump them into our Prometheus system as local metrics (we had reasons to want to keep an eye on these). Shortly after that, I built a very brute force Grafana dashboard that let me see the state of those metrics as of some point in time and what had changed in them from an earlier point in time. On Friday, I used the historical log of metrics in Prometheus to see that the reallocated counts had both gone from 0 to 1 early on January 13th, not long before the first checksum errors were reported.
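As a rough illustration of what such a collection step can look like, here is a minimal Python sketch that turns the attribute table printed by `smartctl -A` into lines in the Prometheus text format. This is not our actual collector; the metric name `smart_raw_value`, the label names, and the sample table (including its values) are all made up for the example.

```python
from typing import Dict, List

# Illustrative sample in the shape of `smartctl -A` output; values are made up.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   065   051   000    Old_age   Always       -       35 (Min/Max 20/49)
"""


def parse_smart_attributes(text: str) -> Dict[str, int]:
    """Map attribute name -> raw value for every attribute row in the table."""
    attrs: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # this skips the header row and any other non-table lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        # Keep only raw values whose first token is a plain integer.
        if raw.isdigit():
            attrs[name] = int(raw)
    return attrs


def to_prometheus(attrs: Dict[str, int], host: str, disk: str) -> List[str]:
    """Render attributes as text-format samples (hypothetical metric and labels)."""
    return [
        f'smart_raw_value{{host="{host}",disk="{disk}",attribute="{name}"}} {value}'
        for name, value in sorted(attrs.items())
    ]


for sample_line in to_prometheus(parse_smart_attributes(SAMPLE), "fs1", "sda"):
    print(sample_line)
```

Only the first token of the raw value column is kept, so a value with trailing annotations such as a temperature range still produces a single number.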
(There is probably a clever way in Prometheus to extract the time when this happened. I did it the brute force way of graphing the attributes and then backing up in time until the graph changed.)
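One less manual approach is to pull the samples for the attribute over a time range and scan for the first point where the value changes. The sketch below assumes the samples are already in hand as (timestamp, value string) pairs, which is the shape a Prometheus range query returns them in; the function name and the sample data are invented for illustration.

```python
from typing import Optional, Sequence, Tuple

# (unix timestamp, value as a string), as in a range query result
Sample = Tuple[float, str]


def first_change(samples: Sequence[Sample]) -> Optional[float]:
    """Return the timestamp of the first sample whose value differs from
    the sample before it, or None if the value never changes."""
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if float(cur) != float(prev):
            return ts
    return None


# Made-up samples: the counter goes from 0 to 1 at the third point.
samples = [(1000.0, "0"), (1300.0, "0"), (1600.0, "1"), (1900.0, "1")]
print(first_change(samples))
```

The precision of the answer is limited by the step between samples, so a coarse first pass followed by a finer query around the change would narrow it down.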
After the scrub had run into all of those problems, I took a second
smartctl output for the drive to see if anything new had
appeared, and it didn't look like it; certainly both reallocated
counts were still the '1' that they had been on Friday. Then I also
checked my brute force Grafana dashboard and it promptly showed me
another difference; the 'Raw Read Error Rate' had gone up by a
decent amount. There were two reasons I hadn't spotted this earlier;
first, it was a reasonably large number that looked like the other
reasonably large numbers in other SMART attributes, and second, it
had also been non-zero on Friday, before the scrub.
(smartctl output is dozens of lines with a bunch of fields for
each attribute; there is a lot of noise, and it's easy to glaze
over yet another set of things. The Grafana dashboard made things
jump out by only presenting changed attributes and the changes in
their raw values, which reduced it to about five or six much easier
to read attributes.)
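The idea of showing only what changed is simple enough to sketch outside of Grafana. The following Python is a minimal illustration, not the dashboard's actual query; it assumes two snapshots of raw attribute values held as dictionaries, and the attribute names and numbers are made up.

```python
from typing import Dict, Tuple


def changed_attributes(
    before: Dict[str, int], after: Dict[str, int]
) -> Dict[str, Tuple[int, int, int]]:
    """Return {attribute: (old, new, delta)} for attributes present in both
    snapshots whose raw value differs."""
    changes: Dict[str, Tuple[int, int, int]] = {}
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name], after[name]
        if old != new:
            changes[name] = (old, new, new - old)
    return changes


# Made-up snapshots of raw values at two points in time.
friday = {"Raw_Read_Error_Rate": 120, "Reallocated_Event_Count": 1, "Power_On_Hours": 9000}
sunday = {"Raw_Read_Error_Rate": 4182, "Reallocated_Event_Count": 1, "Power_On_Hours": 9048}

for name, (old, new, delta) in changed_attributes(friday, sunday).items():
    print(f"{name}: {old} -> {new} ({delta:+d})")
```

Attributes that did not change drop out of the result entirely, which is what cuts the noise.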
Some quick graphing later and I could see that the raw read error rate had been zero before January 13th and had been steadily counting up ever since then (with a sudden jump during the scrub). This didn't look like an SSD that had one NAND block that had gone bad, taking out some random collection of sectors and ZFS blocks; this looked like an SSD that was dying. Or maybe not, because SSDs are notoriously magical at a low level so perhaps it was routine for our Crucial SSDs to see some raw read errors and to count up that SMART attribute. And if it was actually a problem indicator, were we seeing it on any other fileserver drives?
Since I had all of our SMART metrics for all of our fileserver drives in one place, in Prometheus, I could easily do some quick PromQL checks and see that basically no other Crucial SSD had a non-zero 'Raw Read Error Rate'. We had a real problem but it was only with this one drive.
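With a hypothetical metric name, a check like this could be a PromQL filter along the lines of `smart_raw_value{attribute="Raw_Read_Error_Rate"} != 0`. The same filtering can be sketched in Python over results shaped like a Prometheus instant query response (a label set plus a `[timestamp, "value"]` pair per series). The label names and data here are assumptions made for the example.

```python
from typing import Dict, List


def nonzero_series(results: List[dict]) -> List[Dict[str, str]]:
    """Return the label sets of all series whose current value is non-zero."""
    return [r["metric"] for r in results if float(r["value"][1]) != 0.0]


# Made-up instant query results for one SMART attribute across drives.
results = [
    {"metric": {"host": "fs1", "disk": "sda"}, "value": [1579500000.0, "0"]},
    {"metric": {"host": "fs1", "disk": "sdb"}, "value": [1579500000.0, "4182"]},
    {"metric": {"host": "fs2", "disk": "sda"}, "value": [1579500000.0, "0"]},
]

for labels in nonzero_series(results):
    print(labels["host"], labels["disk"])
```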
There's nothing in this story that I couldn't have done without
Prometheus, at least in theory. We could have created a historical
log of SMART metrics in some other way, I could have paid more
attention to the smartctl output and noticed the non-zero 'Raw Read
Error Rate' (either on Friday or after the scrub explosion), I could
have manually diff'd or otherwise compared two smartctl outputs
to see the increased RRER, and I could have gone around to all of
the fileservers to check all of the disks for their RRER. But in
practice very little of this would have happened without Prometheus
or something equivalent to it.
What Prometheus did for us here is two things. First, it centralized all of this information in one place and thus gave us an easy global view, just like a central syslog server. The second is that it reduced the friction of looking at all of these things (over and above centralizing them in one place). Reducing friction is always and evermore a huge subtle advantage, one that you shouldn't underrate; over and over, reducing friction past a crucial point has caused a sea change in how I do things.
(This is related to how our metrics system gets me to explore casual questions. This wasn't a casual question, but the low friction caused me to go further than I otherwise would have tried to.)