A problem with strict memory overcommit in practice
We've used Linux's strict overcommit mode on our compute servers for years to limit memory usage to the (large) amount of RAM on the machines, on the grounds that compute jobs that allocate tens or hundreds of GB of RAM generally intend to use it. Recently we had a series of incidents where compute machines were run out of memory, and unfortunately these incidents illustrated a real problem with relying on strict overcommit handling. You see, in practice, strict overcommit kills random processes. Oh, not literally, but the effect is generally the same.
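(For concreteness, strict overcommit is Linux's 'overcommit mode 2'. A minimal sketch of the sysctls involved, assuming you want the commit limit to be roughly the machine's RAM; the file name here is just illustrative:)

# /etc/sysctl.d/10-overcommit.conf
# Mode 2 means strict accounting; the commit limit is swap plus
# overcommit_ratio percent of RAM, so 100 is roughly 'all of RAM'.
vm.overcommit_memory = 2
vm.overcommit_ratio = 100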
Quite a lot of programs need memory as they run, and when they can't get it there is often not very much they can do except exit. This is especially the case for shells and shell scripts; even if the shell can get enough memory to do its own internal work, the moment it tries to exec() some external program, it's going to fail, and there goes your shell script. All sorts of things can start failing, including things that shouldn't fail. Do you have a 'trap "rm $lockfile" EXIT' in your shell script? Well, rm isn't a builtin, so your lock file is probably not going away.
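(To illustrate, here is a sketch of the usual fragile pattern; the lock file path is hypothetical. Removing the lock file requires the shell to fork() and exec() /bin/rm, both of which can fail when no more memory can be committed:)

#!/bin/sh
lockfile=/var/run/ourjob.lock    # hypothetical lock file
echo $$ > "$lockfile"
# rm is an external command; under strict overcommit pressure the
# shell may be unable to fork and exec it, so this cleanup can fail.
trap "rm -f $lockfile" EXIT
# ... the actual work of the script goes here ...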
Strict overcommit is blind here; it denies more memory to any and all processes that want it after the limit is hit, no matter what they are, how big they are, or what their allocation pattern has been. And in practice the amount of available memory doesn't just go to zero and stay there; instead, some things will try to allocate memory, fail, and exit, releasing their current memory, which allows other programs to get more memory for a bit and maybe finish and exit, and then the whole cycle bounces around. This oscillation is what creates the randomness, where if you ask at the right time you get memory and get to survive but if you ask at the wrong time you fail and die.
In our environment, processes failing unpredictably turns out to be surprisingly disruptive to routine ongoing maintenance tasks, like password propagation and managing NFS mounts. I'd say that our scripts weren't designed for things exploding half-way through, but I'm not sure there's any way to design scripts to survive in a setting where any command and any internal Bourne shell operation might fail at any time.
(We are taking steps to mitigate some problems.)
Despite our recent experience with this, we're probably going to stick with strict overcommit on our compute servers; it has saved us in other situations, and there really isn't a better alternative. The OOM killer has its own problems and is probably the wrong answer for our compute servers, especially in its current form.
(There is an argument that can be made that the OOM killer's apparent current approach of 'kill the biggest process' is actually a fair one on a shared compute server, but I think it's at least questionable.)
PS: We saw this last fall, but at the time we didn't really fully appreciate the potential problems and how hard they may be to deal with.
Hand-building an updated upstream kernel module for your (Fedora) kernel
Suppose, not hypothetically, that you've run into an issue where a distribution kernel update has a new bug in some functionality that's important to you, for instance Fedora's 4.20.x kernels fail to see one fan sensor on your Asus Prime Z370-A motherboard. After some work, you identify the issue, report it, and the fix is made upstream. Now you would like to test this upstream fix in your kernel.
The proper way to do this would be to extract some form of the
change that fixes things as a patch, get the source RPM for the
latest Fedora kernel, add your patch to the specfile as another RPM
patch, rebuild your new kernel source RPM to get the binary kernel
RPMs, and install them (let's assume you don't have to worry about
signed kernels). However, this is a lot of work and I'm lazy, so I
took the easier way of rebuilding this changed code as an out of
kernel module and then just dumping it into
/lib/modules as a
replacement for the regular Fedora kernel module from the kernel
RPM. And rather than trying to create the proper control files for this, we'll see if we can just build from the actual kernel source directory involved, which it turns out that we can in this case (but perhaps not always).
(As a disclaimer, my process here probably only works under the right circumstances because it's a hack.)
That's confusing, so let me put it more simply: we have some kernel source tree, let's call it groeck/linux-staging, and we're going to build a module from that source tree but build it for our currently running Fedora kernel. This tree has the fix we want, but it's not actually a kernel tree for the right version of the kernel, which is part of what makes this a hack.
(The current Fedora kernel is based on 4.20.5, while groeck/linux-staging is currently based on 5.0-rc3, because it's fixes that are intended to be pulled into the main Linux kernel. Eventually these fixes will hopefully make their way back into the 4.20.x series and all of this will be unnecessary.)
In theory, building a kernel module out of the tree is very easy once you have a Makefile in the module's directory. The kbuild modules.txt documentation describes the important magic. When you're in the directory where the source of the module you want to build is, you run:

make -C /lib/modules/$(uname -r)/build M=$(pwd)

If all goes well, this will build your module to a <whatever>.ko, here nct6775.ko. You probably want to compress this with xz, because that turns out to be the standard for kernel modules nowadays. You can check that things look good with 'modinfo nct6775.ko.xz'. My actual process wound up somewhat more involved than this, partly because I'm a cautious person and partly because I tried some things that exploded.
- Clone the Git repo with the fix to somewhere, say /var/tmp/linux-staging.
- Clone the stable kernel Git repo to
/var/tmp/linux-stable. This isn't absolutely required, but we're going to use it to make sure we're not picking up unexpected changes.
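For example, something like this (the kernel.org URLs are my assumption about where these trees live):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git /var/tmp/linux-staging
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git /var/tmp/linux-stable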
- In the stable repo, check out the branch for your particular kernel
line. Fedora is based on 4.20.x currently, so:
git checkout linux-4.20.y
- In the staging repo, switch to the appropriate branch. Don't
accidentally build from the wrong branch, giving you a version of
the module without the fix, the way I did the first time.
git checkout hwmon-next
You may have to hunt around with 'git branch -r' and so on to find out where your fix is hiding. If you know the git commit ID, this stackoverflow answer led me to 'git branch --contains <commit-id>', which would have saved me having to guess and be somewhat lucky.
- Compare the module source in the stable branch and in the
development version. There is probably a really clever way to put
both in the same Git repo and compare with a git command, but I
took the simple brute force way of two repos:
cd /var/tmp
diff -u linux-stable/drivers/hwmon/nct6775.c linux-staging/drivers/hwmon/nct6775.c
You want to see little or no change other than the bugfix for your issue. If there are significant changes, the module from the upstream may not compile for the stable kernel you want to build it against, or work properly in that kernel, and you have more work to do. I was lucky, and the only change in nct6775.c that I cared about was the bugfix for my issue.
- Change into the directory holding the updated module you want and explicitly tell the module build process what module to build, instead of letting it build everything in the directory. Here we want to build the nct6775 module:

cd /var/tmp/linux-staging/drivers/hwmon
make -C /lib/modules/$(uname -r)/build M=$(pwd) nct6775.ko
If you leave out the specific module you want to build, the build system will try to build all of them and you may find that some of the modules in your upstream tree aren't compatible with the stable kernel you're trying to build against. This is the case here and is why I wound up comparing the nct6775.c code in the stable tree against the upstream; even though it built against stable, I wanted some assurance that there hadn't been changes that would mean I was shooting myself in the foot.
(I may still be shooting myself in the foot, but at least it's not obvious.)
- With nct6775.ko built, you can compress it with xz, then copy it to /lib/modules/$(uname -r)/extra/, save the original module (located in kernel/drivers/hwmon) to some other place, and run 'depmod $(uname -r)' to re-apply magic module processing. If you reboot, you should be using your new hand-built module, and in my case my case fan was back to being reported.
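(Putting those steps together, the whole installation is roughly the following command sequence; where you save the original module is up to you, and /root/saved-modules is just my hypothetical choice:)

xz nct6775.ko
mkdir -p /root/saved-modules
mv /lib/modules/$(uname -r)/kernel/drivers/hwmon/nct6775.ko.xz /root/saved-modules/
cp nct6775.ko.xz /lib/modules/$(uname -r)/extra/
depmod $(uname -r)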
You can pre-build the module against a kernel other than what you're currently running by replacing the '$(uname -r)' bits with the actual name of the other kernel. This is handy if you're installing an updated kernel and don't want to reboot twice.
(Clever people who know what they're doing can proceed to make this a DKMS module. It's apparently not hard, but I'm in a brute force mood here.)
ZFS On Linux's kernel modules issues are not like NVidia's
In the Hacker News discussion of my entry on the general risk to ZFS from the shift in its userbase towards ZFS on Linux, a number of people suggested that the risks to ZFS on Linux were actually low since proprietary kernel modules such as NVidia's GPU drivers don't seem to have any problems dealing with things like the kernel making various functions inaccessible or GPL-only. I have two views on this, because I think that the risks are a bit different and less obvious than what they initially look like.
On a purely technical level, ZFS on Linux probably has it easier and is at less risk than NVidia's GPU drivers. Advanced GPU drivers deal with hardware that only they can work with and may need to do various weird sorts of operations for memory mapping, DMA control, and so on that aren't needed or used by existing kernel modules. It's at least possible that the Linux kernel could stop supporting access to this sort of stuff for any kernel module (in tree or out, regardless of license), and someday leave a GPU driver up the creek.
By contrast, most in-tree filesystems are built as loadable modules (for good reasons) so the kernel already provides to modules everything necessary to support a filesystem, and it's likely that ZFS on Linux could survive with just this. What ZFS on Linux needs from the kernel is likely to be especially close to what BTRFS needs, since both are dealing with very similar core issues like talking to multiple disks at once, checksum computation, and so on, and there is very little prospect that BTRFS will either be removed from the kernel tree or only be supported if built into the kernel itself.
But on the political and social level it's another thing entirely. NVidia and other vendors of proprietary kernel modules have already decided that they basically don't care about anything except what they can implement. Licenses and people's views of their actions are irrelevant; if they can do it technically and they need to in order to make their driver work, they will. GPL shim modules to get access to GPL-only kernel symbols are just the starting point.
Most of the people involved in ZFS on Linux are probably not going to feel this way. Sure, ZFS on Linux could implement shim modules and other workarounds if the kernel cuts off full access to necessary things, but I don't think they're going to. ZFS on Linux developers are open source developers in a way that NVidia's driver programmers are not, and if the Linux kernel people yell at them hard enough they will likely go away, not resort to technical hacks to get around the technical barriers.
In other words, the concern with ZFS on Linux is not that it will become technically unviable, because that's unlikely. The concern is that it will become socially unviable, that to continue on a technical level its developers and users would have to become just as indifferent to the social norms of the kernel license as NVidia is.
(And if that did happen, which it might, I think it would make ZFS on Linux much more precarious than it currently is, because ZoL would be relying on its ability to find and keep both good kernel developers and sources of development funding that are willing to flout social norms in a way that they don't have to today.)
The Linux kernel's pstore error log capturing system, and ACPI ERST
In response to my entry yesterday on enabling reboot on panic on your servers, a commentator left the succinct suggestion of 'setup pstore'. I had never heard of pstore before, so this sent me searching and what I found is actually quite interesting and surprising, with direct relevance to quite a few of our servers.
Pstore itself is a kernel feature that dates to 2011. It provides
a generic interface to storage that persists across reboots and
gets used to save kernel messages during a crash, as covered in
LWN's Persistent storage for a kernel's "dying breath" and the kernel documentation. Your
kernel very likely has pstore built in and your Linux probably mounts the pstore filesystem at /sys/fs/pstore.
(The Ubuntu 16.04 and 18.04 kernels, the CentOS 7 kernel, and the Fedora kernel all have it built in. If in doubt, check your kernel's configuration, which is often found in /boot/config-*; you're looking for CONFIG_PSTORE and associated things.)
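For a quick check, assuming your distribution puts kernel configs in the usual place:

grep PSTORE /boot/config-$(uname -r)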
By itself, pstore does nothing for you because it needs a chunk of storage that persists across reboots, and that's up to your system to provide in some way. One such source of this storage is in an optional part of ACPI called the Error Record Serialization Table (ERST). Not all machines have an ERST (it's apparently most common in servers), but if you do have one, pstore will probably automatically use it. If you have ERST at all, it will normally show up in the kernel's boot time messages about ACPI:
ACPI: ERST 0x00000000BF7D6000 000230 (v01 DELL PE_SC3 00000000 DELL 00040000)
If pstore is using ERST, you will get some additional kernel messages:
ERST: Error Record Serialization Table (ERST) support is initialized.
pstore: using zlib compression
pstore: Registered erst as persistent store backend
Some of our servers have ACPI ERST and some of them have crashed,
so out of idle curiosity I went and looked at
all of them. This led to a big surprise, which is that there may
be nothing in your Linux distribution that checks
to see if there are captured kernel crash logs. Pstore is
persistent storage, and so it does what it says on the can; if
you don't move things out of
/sys/fs/pstore, they stay there,
possibly for a very long time (one of our servers turned out to
have pstore ERST captures from a year ago). This is especially
important because things like ERST only have so much space, so
lingering old crash logs may keep you from saving new ones, ones
that you may discover you very much would like records of.
(The year-old pstore ERST captures are especially ironic because the machine's current incarnation was reinstalled this September, so they are from its previous life as something else entirely, making them completely useless to us.)
Another pstore backend that you may have on some machines is one that uses UEFI variables. Unfortunately, you need to have booted your system using UEFI in order to have access to UEFI services, including UEFI variables (as I found out the hard way once), so even on a UEFI-capable system you may not be able to use this backend because you're still using MBR booting. It's possible that using UEFI variables for pstore is disabled by some Linux distributions, since actually using UEFI variables has caused UEFI BIOS problems in the past.
(This makes it somewhat more of a pity that I failed to migrate to UEFI booting, since I would actually potentially get something out of it on my workstations. Also, although many of our servers are probably UEFI capable, they all use MBR booting today.)
Given that nothing in our Ubuntu 18.04 server installs seems to look at /sys/fs/pstore and we have some machines with things in it, we're probably going to put together some shell scripting of our own to at least email us if something shows up.
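(A minimal sketch of such a script, suitable for running from cron; it assumes a working local mail setup, and the recipient and subject line are our choices:)

#!/bin/sh
# Email a note if anything has shown up in pstore.
if [ -n "$(ls -A /sys/fs/pstore 2>/dev/null)" ]; then
    ls -l /sys/fs/pstore | mail -s "pstore records on $(hostname)" root
fi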
(Additional references: Matthew Garrett's A use for EFI, CoreOS's Collecting Crash Logs, which mentions the need to clear out /sys/fs/pstore, and abrt's pstore oops wiki page, which includes a list of pstore backends.)
PS: The awkward, brute force way to get pstore space is with the ramoops backend, which requires fencing off some section of your RAM from your kernel (it should be RAM that your BIOS won't clear on reboot for whatever reason). This is beyond my enthusiasm level on my machines, despite some recent problems, and I have the impression that ramoops is usually used on embedded ARM hardware where you have little or no other options.
Consider setting your Linux servers to reboot on kernel problems
As I sort of mentioned when I wrote about things you can do to make your Linux servers reboot on kernel problems, the Linux kernel normally doesn't reboot if it hits kernel problems. Problems like OOPSes and RCU stalls generally kill some processes and try to continue on; more serious issues cause panics, which freeze the machine entirely.
If your goal is to debug kernel problems, this is great because it preserves as much of the evidence as possible (although you probably also want things like a serial console or at least netconsole, to capture those kernel crash messages). If your goal is to have your servers running, it is perhaps not as attractive; you may quite reasonably care more about returning them to service as soon as possible than trying to collect evidence for a bug report to your distribution.
(Even if you do care about collecting information for a bug report,
there are probably better ways than letting the machine sit there.
Future kernels will have a kernel sysctl called
panic_print to let you dump out as much information in the
initial report as possible, which you can preserve through your
console server system, and in
general there is Kdump (also).
In theory netconsole might also let you capture the initial messages,
but I don't trust it half as much as I do a serial console.)
My view is that most people today are in the second situation, where there's very little you're going to do with a crashed server except reboot or power cycle it to get it back into service. If this is so, you might as well cut out the manual work by configuring your servers to reboot on kernel problems, at least as their initial default settings. You do want to wait just a little bit after an OOPS to reboot, in the hopes that maybe the kernel OOPS message will be successfully written to disk or transmitted off to your central syslog server, but that's it; after at most 60 seconds or so, you should reboot.
(If you find that you have a machine that is regularly OOPSing and you want to diagnose in a more-hands on way, you can change the settings on it as needed.)
We have traditionally not thought about this and so left our servers
in the standard default 'lock up on kernel problems' configuration,
which has gone okay because kernel problems are very rare in the
first place. Leaving things as they are would still be the least
effort approach, but changing our standard system setup to enable reboots on panics would not be much effort (it's three sysctls in an /etc/sysctl.d file), and it's probably worth it, just in case.
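Concretely, the entire change is something like the following (the file name is our choice; the next entry covers what these sysctls do):

# /etc/sysctl.d/99-panic-reboot.conf
kernel.panic = 10
kernel.panic_on_oops = 1
kernel.panic_on_rcu_stall = 1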
(This is the kind of change that you hope not to need, but if you do wind up needing it, you may be extremely thankful that you put it into place.)
PS: Not automatically rebooting on kernel panics is pretty harmless for Linux machines that are used interactively, because if the machine has problems there's a person right there to immediately force a reboot. It's only unattended machines such as servers where this really comes up. For desktop and laptop focused distributions it probably makes troubleshooting somewhat easier, because at least you can ask someone who's having crash problems to take a picture of the kernel errors with their phone.
Things you can do to make your Linux servers reboot on kernel problems
One of the Linux kernel's unusual behaviors is that it often doesn't reboot after it hits an internal problem, what is normally called a kernel panic. Sometimes this is a reasonable thing and sometimes this is not what you want and you'd like to change it. Fortunately Linux lets you more or less control this through kernel sysctl settings.
(The Linux kernel differentiates between things like OOPSes and RCU stalls, which it thinks it can maybe continue on from, and kernel panics, which immediately freeze the machine.)
What you need to do is twofold. First, you need to make it so that the kernel reboots when it considers itself to have paniced. This is set through the kernel.panic sysctl, which is a number of seconds. Some sources recommend setting this to 60 seconds under various circumstances, but in limited experience we haven't found that to do anything for us except delay reboots, so we now use 10 seconds. Setting kernel.panic to 0 restores the default state, where panics simply hang the machine.
Second, you need to arrange for various kernel problems to trigger
panics. The most important thing here is usually for kernel OOPS
messages or BUG messages to trigger panics; the kernel considers
these nominally recoverable, except that they mostly aren't and
will often leave your machine effectively hung. Panicing on OOPS
is turned on by setting
kernel.panic_on_oops to 1.
Another likely important sign of trouble is RCU stalls; you can
panic on these with
kernel.panic_on_rcu_stall. Note that I'm
biased about RCU stalls. The kernel
documentation in sysctl/kernel.txt mentions
some other ones as well, currently including panic_on_warn. Of these, I would definitely be wary about turning on panic_on_warn; our systems appear to see a certain number
of them in reasonably routine operation.
(You can detect these warnings by searching your kernel logs for the text 'WARNING: CPU: <..> PID: <...>'. One of our WARNs was
for a network device transmit queue timeout, which recovered almost
immediately. Rebooting the server due to this would have been
entirely the wrong reaction in practice.)
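On a systemd based machine, one quick way to look for them is a sketch like:

journalctl -k | grep -F 'WARNING: CPU:'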
Note that you can turn on any or all of the various
settings while still having
kernel.panic set to 0. If you do
this, you convert OOPSes, RCU stalls, or whatever into things that
are guaranteed to hang the whole machine when they happen, instead
of perhaps having it continue on in partial operating order. There
are systems where this may be desirable behavior.
PS: If you want to be as sure as possible that the machine reboots
after hitting problems, you probably want to enable a hardware
watchdog as well if you can. The kernel
panic() function tries
hard to reboot the machine, but things can probably go wrong.
Unfortunately not all machines have hardware watchdogs available,
although many Intel ones do.
Sidebar: The problem with kernel OOPSes
When a kernel oops happens, the kernel kills one or more processes. These processes were generally in kernel code at the time (that's usually what generated the oops), and they may have been holding locks or have been in the middle of modifying data structures, submitting IO operations, or doing other kernel things. However, the kernel has no idea what exactly needs to be done to safely release these locks, revert the data structure modifications, and so on; instead it just drops everything on the floor and hopes for the best.
Sometimes this works out, or at least the damage done is relatively contained (perhaps only access to one mounted filesystem starts hanging because of a lock held by the now-dead process that will never be unlocked). Often it is not and more or less everything grinds to a more or less immediate halt. If you're lucky, enough of the system survives long enough for the kernel oops message to be written to disk or sent out to your central syslog server.
A surprise potential gotcha with sharenfs in ZFS on Linux
In Solaris and Illumos, the standard and well supported way to set
and update NFS sharing options for ZFS filesystems is through the
sharenfs ZFS filesystem property. ZFS on Linux sort of supports
sharenfs, but it
attempts to be compatible with Solaris
and in practice that doesn't work well, partly because there are
Solaris options that cannot be easily translated to Linux. When we faced this issue for our Linux ZFS
fileservers, we decided that we would build
an entirely separate system to handle NFS exports that directly invokes exportfs, which has worked well. This turns out to have been lucky, because there is an additional and somewhat subtle problem with how sharenfs is currently implemented in ZFS on Linux.
On both Illumos and Linux, ZFS actually implements sharenfs by calling the existing normal command to manipulate NFS exports; on Illumos this uses share, and on Linux, exportfs.
By itself this is not a problem and actually makes a lot of sense
(especially since there's no official public API for this on either
Linux or Illumos). On Linux, the specific functions involved are
found in lib/libshare/nfs.c.
When you initially share a NFS filesystem, ZFS will wind up running
the following command for each client:
exportfs -i -o <options> <client>:<path>
When you entirely unshare a NFS filesystem, ZFS will wind up running:
exportfs -u <client>:<path>
The potential problem comes in when you change an existing sharenfs setting, either to modify what clients the filesystem is exported
to or to alter what options you're exporting it with. ZFS on Linux
implements this by entirely unexporting the filesystem to all
clients, then re-exporting it with whatever options and to whatever
clients your new
sharenfs settings call for.
(The code for this is in lib/libshare/nfs.c.)
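In concrete terms, changing sharenfs on a filesystem shared to two clients has roughly the effect of the following command sequence (the client names and options here are made up); the danger window is between the unexports and the re-exports:

exportfs -u client1:/pool/fs
exportfs -u client2:/pool/fs
exportfs -i -o rw client1:/pool/fs
exportfs -i -o rw client2:/pool/fs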
On the one hand this is a sensible if brute force implementation,
and computing the difference in sharing (for both clients and
options) and how to transform one to the other is not an easy
problem. On the other hand, this means that clients that are actually
doing NFS traffic during the time when you change sharenfs may be unlucky enough to try a NFS operation in the window of time
between when the filesystem was unshared (to them) and when it was
reshared (to them). If they hit this window, they'll get various
forms of NFS permission denied messages, and with some clients this
may produce highly undesirable consequences, such as libvirt
guests having their root filesystems go read-only.
(The zfs-discuss re-query from Todd Pfaff today is what got several people to go digging and figure out this issue. I was one of them, but only because I rushed into exploring the code before reading the entire email thread.)
I would like to say that our system for ZFS NFS export permissions avoids this issue, but it has exactly
the same problem. Rather than try to reconcile the current NFS
export settings and the desired new ones, it just does a brute force 'exportfs -u' for all current clients and then reshares things.
Fortunately we only very rarely change the NFS exports for a
filesystem because we export to netgroups instead of individual
clients, so adding and removing individual clients is almost entirely
done by changing netgroup membership. The actual NFS exports only have to change if we add or remove entire netgroups.
(Exportfs has a tempting '-r' option to just resynchronize everything, but our current system doesn't use it and I don't know why. I know that
but our current system doesn't use it and I don't know why. I know that
I poked around with
exportfs when I was developing it but I don't
seem to have written down notes about my exploration, so I don't know
if I ran into problems with
-r, didn't notice it, or had some other
reason I rejected it. If I didn't overlook it, this is definitely a case
where I should have documented why I wasn't doing an attractive thing.)
Linux CPU numbers are not necessarily contiguous
In Linux, the kernel gives all CPUs a number; you can see this number in, for example, /proc/stat:

cpu0 [...]
cpu1 [...]
cpu2 [...]
cpu3 [...]
Under normal circumstances, Linux has contiguous CPU numbers that
start at 0 and go up to however many CPUs the system has. However,
this is not guaranteed and is not always the case on certain live
configurations. It's perfectly possible to have a configuration
where, for example, you have sixteen CPUs that are numbered 0 to 7
and 16 to 23, with 8 to 15 missing. In this situation, /proc/stat will match the kernel's numbering, with lines for cpu0 through cpu7 and cpu16 through cpu23. If your code sees this and
decides to fill in the missing CPUs 8 through 15, it will be wrong.
You might think that no code could possibly make this mistake, but it's not quite that simple. If, for example, you make a straightforward array to hold CPU status, read in information from various sources, and then print out your accumulated data for CPUs 0 through the highest CPU you saw, you will invent those missing CPUs 8 through 15 (possibly with random unset data for them). In situations like this, you need to actively keep track of what CPUs in your array are valid and what ones aren't, or you need a more sophisticated data structure.
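Rather than guessing, you can ask the kernel directly which CPUs are online through /sys/devices/system/cpu/online (and its sibling offline). On a machine with the example numbering above, you would see something like:

cat /sys/devices/system/cpu/online
0-7,16-23
cat /sys/devices/system/cpu/offline
8-15,24-31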
(If you've created an API that says 'I return an array of CPU information for CPUs 0 through N', well, you have a problem. You're probably going to need an API change; if this is in a structure, at least an API addition of a new field to tell people which CPUs are valid.)
I can see why people make this mistake. It's tempting to have simple code, displays, and so on, and almost all Linux machines have contiguous CPU numbering, so your code will work almost everywhere (we only wound up with non-contiguous numbering through bad luck). But, sadly, it is a mistake and sooner or later it will bite either you or someone who uses your code.
(It's unfortunate that doing this right is more complicated. Life certainly would be simpler if Linux guaranteed that CPU numbers were always contiguous, but given that CPUs can come and go, that could cause CPU numbers to not always refer to the same actual CPU over time, which is worse.)
Sidebar: How we have non-contiguous CPU numbers
We have one dual-socket machine with hyperthreading where one socket has cooling problems and we've shut it down by offlining the CPUs. Each socket has eight cores, and Linux enumerated one side of the HT pairs for both sockets before starting on the other side of the HT pairs. CPUs 0 through 7 and 16 through 23 are the two HTs for the eight cores on the first socket; CPUs 8-15 would be the first set of CPUs for the second socket, if they were online, and then CPUs 24-31 would be the other side of its HT pairs.
In general, HT pairing is unpredictable. Some machines will pair adjacent CPU numbers (so CPU 0 and CPU 1 are a HT pair) and some machines will enumerate all of one side before they enumerate all of the other. My Ryzen-based office workstation enumerates HT pairs as adjacent CPU numbers, so CPU 0 and 1 are a pair, while my Intel-based home machine enumerates all of one HT side before flipping over to enumerate all of the other, so CPU 0 and CPU 6 are a pair.
(I prefer the Ryzen ordering because it makes life simpler.)
It's possible that we should be doing something less or other than offlining all of the CPUs for the socket with the cooling problem (perhaps the BIOS has an option to disable one socket entirely). But offlining them all seemed like the most thorough and sure option, and it certainly was simple.
Two views of ZFS's GPL-incompatibility and the Linux kernel
As part of a thread on linux-kernel where ZFS on Linux's problem with a recent Linux kernel change in exported symbols was brought up, Greg Kroah-Hartman wrote in part in this message:
My tolerance for ZFS is pretty non-existant. Sun explicitly did not want their code to work on Linux, so why would we do extra work to get their code to work properly?
If one frames the issue this way, my answer would be that in today's world, Sun (now Oracle) is no longer at all involved in what is affected here. It stopped being 'Sun's code' years ago, when Oracle Solaris and OpenSolaris split apart, and it's now in practice the code of the people who use ZFS on Linux, with a side digression into FreeBSD and Illumos. The people affected by ZoL not working are completely disconnected from Oracle, and anything the Linux kernel does to make ZoL work will not help Oracle more than a tiny fraction.
In short, the reason to do extra work here is that the people affected are Linux users who are using their best option for a good modern filesystem, not giant corporations taking advantage of Linux.
(I suspect that the kernel developers are not happy that people would much rather use ZFS on Linux than Btrfs, but I assure them that it is still true. I am not at all interested in participating in a great experiment to make Btrfs sufficiently stable, reliable, and featureful, and I am especially not interested in having work participate in this for our new fileservers.)
However, there is a different way to frame this issue. If you take it as given that Sun did not want their code to be used with Linux (and Oracle has given no sign of feeling otherwise), then fundamental social respect for the original copyright holder and license means respecting their choice. If Sun didn't want ZFS to work on Linux, it's hostile to them for the kernel community to go to extra work to enable it to work on Linux. If people outside the kernel community hack it up so that it works anyway, that's one thing. But if the kernel community goes out of its way to enable these hacks, well, then the kernel community becomes involved and is violating the golden rule as applied to software licenses.
As a result, I can reluctantly and unhappily support or at least accept 'no extra work for ZFS' as a matter of principle for Linux kernel development. But if your concern is not principle but practical effects, then I think you are mistaken.
(And if Oracle actually wanted to take advantage of the Linux kernel for ZFS, they could easily do so. Whether they ever will or not is something I have no idea about, although I can speculate wildly and their relicensing of DTrace is potentially suggestive.)
The risk that comes from ZFS on Linux not being GPL-compatible
A couple of years ago I wrote about the harm of ZFS not being GPL-compatible, which was that this kept ZFS from being bundled into most Linux distributions. License compatibility is both a legal and a social thing, and the social side is quite clear; most people who matter consider ZFS's CDDL license to be incompatible with the kernel. However, it turns out that there is another issue and another side of this that I didn't realize back at the time. This issue surfaced recently with the 5.0 kernel release candidates, as I first saw in Phoronix's ZFS On Linux Runs Into A Snag With Linux 5.0.
The Linux kernel doesn't allow kernel modules to use just any internal kernel symbols; instead they must be officially exported symbols. Some symbols (often although not entirely old ones) are exported to all kernel modules, regardless of the module's license, while others are exported in a way that marks them as restricted to GPL'd kernel modules. At the same time, the kernel does not have a stable API of these exported symbols, and previously exported ones can be removed as code is revised. Removed symbols may have no replacement at all, or the replacement may be a GPL-only one when the previous symbol was generally available.
Modules that are part of the Linux kernel source are always going to work, so the kernel always exports enough symbols for them (although possibly as GPL-only symbols, since in-tree kernel modules are all GPL'd). Out of kernel modules that do the same sort of thing as in-kernel ones are also always going to work, at least if they're GPL'd; you're always going to be able to have out kernel modules for device drivers in general, for example. But out of kernel modules for less common things are more or less at the mercy of what symbols the kernel exports, especially if they're not GPL'd modules. If you're an out of kernel module with a GPL-compatible license, you might get the kernel developers to export some symbols you needed. If your module has a license that is seen as not GPL-compatible, well, the kernel developers may not be very sympathetic.
This is what has happened with ZFS on Linux as of the 5.0 pre-release, as covered in the Phoronix story and ZoL issue #8259. This specific problem will probably be worked around, but it shows a systemic risk for ZFS on Linux (and for any unusual non-GPL'd module), which is that you are at the mercy of the Linux kernel people to keep working in some vaguely legal way. If the Linux kernel people ever decide to be hostile they can systematically start making your life hard, and they may well make your life hard just as a side effect.
Is it likely that ZFS on Linux will someday be unable to work at all with new kernels, because crucial symbols it needs are not available at all? I think it's unlikely, but it's certainly possible and that makes it a risk for long term usage of ZFS on Linux. If it happened (hopefully far in the future), at work our answer would be to replace our current Linux-based ZFS fileservers with FreeBSD ones. On my own machines, well, I'd have to figure out some way of migrating all of my data around and what I'd put it on, and it would definitely be a pain and make me unhappy.
(It wouldn't be BTRFS, unless things change a lot by that point.)