Limiting the Nouveau kernel driver's messages via removal
Over on the Fediverse, I said:
Current status: solving software problems triggered by hardware problems with 'rmmod <module>'. It even worked.
(Modules cannot incessantly log kernel messages when they are unloaded. I was just glad the module did unload, given likely broken hardware that it was complaining about.)
Naturally there is a story here.
We have a collection of hand-built AMD Threadripper based compute servers (we bought all the parts, including 4U cases, and assembled them). In order to boot, these machines need video cards, since they don't have on-CPU GPUs and the standard AMD Threadripper motherboards we bought don't come with an onboard GPU the way server motherboards do. So we dug around in the department's spare parts collection and came up with an assortment of old NVidia cards to stick in these machines.
(Where by 'old' I mean things like Quadro FX 570s, Quadro K420s, a Quadro NVS 285, and even one GeForce 8400 GS, as identified by lspci.)
This morning, after rebooting one of these machines to bring it into service, it began logging hundreds of kernel messages a second from the nouveau driver, along the lines of:
nouveau 0000:41:00.0: fifo: PBDMA0: 80000000 [SIGNATURE] ch 1 [007fcf3000 DRM] subc 0 mthd 0000 data 00000000
nouveau 0000:41:00.0: fifo: PBDMA2: 80006000 [GPFIFO GPPTR SIGNATURE] ch 0 [007fcf4000 DRM] subc 0 mthd 0000 data 00000000
This completely overwhelmed the machine (and ran it out of disk space), and didn't do great things to our central syslog server (which got quite busy handling these).
At first I thought that this was yet another case of not ratelimiting kernel messages when you should. It is, but after I was able to reboot the machine through trickery and examine the early kernel messages from the nouveau driver, it turns out to probably also be broken hardware:
nouveau 0000:41:00.0: DRM: DCB conn 00: 00001030
nouveau 0000:41:00.0: DRM: DCB conn 01: 00002146
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
nouveau 0000:41:00.0: disp: chid 0 mthd 0088 data f0000000 00007088 00000000
nouveau 0000:41:00.0: fifo: write fault at 0000001000 engine 04 [BAR1] client 08 [HOST_CPU_NB] reason 00 [PDE] on channel -1 [007fd25000 unknown]
nouveau 0000:41:00.0: fifo: write fault at 0000040000 engine 05 [BAR2] client 08 [HOST_CPU_NB] reason 02 [PTE] on channel -1 [007fd76000 unknown]
nouveau 0000:41:00.0: fifo: DROPPED_MMU_FAULT 00000000
nouveau 0000:41:00.0: fifo: PBDMA2: 80000000 [SIGNATURE] ch 0 [007fcf4000 DRM] subc 7 mthd 1ffc data ffeff7f7
nouveau 0000:41:00.0: fifo: read fault at 0000009000 engine 04 [BAR1] client 07 [HOST_CPU] reason 00 [PDE] on channel -1 [007fd25000 unknown]
That's not looking too healthy in general, and this is old hardware (the machine has one of the Quadro K420s).
I attempted various things to get the nouveau driver to shut off these messages, without success, and then while I was flailing around I had a crazy idea: perhaps I could just rmmod the entire driver. It might leave the machine without much of a video console, but all of these compute servers are effectively headless (they don't normally have a screen plugged in). Somewhat to my surprise, this worked, and with the driver unloaded the messages naturally stopped.
(I was worried about the nouveau driver hanging on unload because it was unable to cleanly shut down the hardware it was talking to, since it was clearly having trouble talking to the hardware in general. And before I checked lsmod I was worried about the driver having a non-zero usage count due to the generic console system or something. It seems a little bit alarming that the kernel driver for your console can have a zero usage count.)
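As an illustration of the usage count check, the number that lsmod reports can also be read from /proc/modules, where the third whitespace-separated field of each line is the module's usage count. The following is a minimal sketch of that check, not what was actually run on the machine. The function names are hypothetical, and a zero count only means the kernel will let you attempt the unload; it says nothing about whether the driver will hang while shutting down its hardware.

```python
from pathlib import Path


def module_refcount(name, modules_text):
    """Return the usage count of `name` from /proc/modules-style text.

    Returns None if the module is not listed, or if the count is not a
    number (kernels built without module unloading report '-').
    """
    for line in modules_text.splitlines():
        fields = line.split()
        # Fields: name, size, usage count, users, state, address.
        if len(fields) >= 3 and fields[0] == name:
            try:
                return int(fields[2])
            except ValueError:
                return None
    return None


def can_attempt_rmmod(name, modules_text):
    """True only if the module is loaded and nothing holds a reference."""
    return module_refcount(name, modules_text) == 0


def read_proc_modules():
    """Read the live module list (Linux only)."""
    return Path("/proc/modules").read_text()
```

On a live Linux machine, `can_attempt_rmmod("nouveau", read_proc_modules())` returning True corresponds to lsmod showing a zero in the "Used by" column for nouveau.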