Limiting the Nouveau kernel driver's messages via removal

December 17, 2020

Over on the Fediverse, I said:

Current status: solving software problems triggered by hardware problems with 'rmmod <module>'. It even worked.

(Modules cannot incessantly log kernel messages when they are unloaded. I was just glad the module did unload, given likely broken hardware that it was complaining about.)

Naturally there is a story here.

We have a collection of hand-built AMD Threadripper based compute servers (we bought all the parts, including 4U cases, and assembled them). In order to boot, these machines need video cards, since they don't have on-CPU GPUs and the standard AMD Threadripper motherboards we bought don't come with an onboard GPU the way server motherboards do. So we dug around in the department's spare parts collection and came up with an assortment of old NVidia cards to stick in these machines.

(Where by 'old' I mean things like Quadro FX 570s, Quadro K420s, a Quadro NVS 285, and even one GeForce 8400 GS, as identified by lspci.)

This morning, after rebooting one of these machines to bring it into service, it began logging hundreds of kernel messages a second from the nouveau driver, to the effect of things like:

nouveau 0000:41:00.0: fifo: PBDMA0: 80000000 [SIGNATURE] ch 1 [007fcf3000 DRM] subc 0 mthd 0000 data 00000000
nouveau 0000:41:00.0: fifo: PBDMA2: 80006000 [GPFIFO GPPTR SIGNATURE] ch 0 [007fcf4000 DRM] subc 0 mthd 0000 data 00000000

This completely overwhelmed the machine (and ran it out of disk space), and didn't do great things to our central syslog server (which got quite busy handling these).

At first I thought that this was yet another case of not ratelimiting kernel messages when you should. It is, but after I was able to reboot the machine through trickery and examine the early kernel messages from the nouveau driver, it turns out to probably also be broken hardware:

nouveau 0000:41:00.0: DRM: DCB conn 00: 00001030
nouveau 0000:41:00.0: DRM: DCB conn 01: 00002146
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
nouveau 0000:41:00.0: disp: chid 0 mthd 0088 data f0000000 00007088 00000000
nouveau 0000:41:00.0: fifo: write fault at 0000001000 engine 04 [BAR1] client 08 [HOST_CPU_NB] reason 00 [PDE] on channel -1 [007fd25000 unknown]
nouveau 0000:41:00.0: fifo: write fault at 0000040000 engine 05 [BAR2] client 08 [HOST_CPU_NB] reason 02 [PTE] on channel -1 [007fd76000 unknown]
nouveau 0000:41:00.0: fifo: DROPPED_MMU_FAULT 00000000
nouveau 0000:41:00.0: fifo: PBDMA2: 80000000 [SIGNATURE] ch 0 [007fcf4000 DRM] subc 7 mthd 1ffc data ffeff7f7
nouveau 0000:41:00.0: fifo: read fault at 0000009000 engine 04 [BAR1] client 07 [HOST_CPU] reason 00 [PDE] on channel -1 [007fd25000 unknown]

That's not looking too healthy in general, and this is old hardware (the machine has one of the Quadro K420s).
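(On the ratelimiting side of things, the general idea is simple enough to sketch. What follows is a minimal Python illustration of interval-plus-burst limiting, where at most a fixed number of messages get through in each time window and the rest are counted and dropped. It is not the kernel's code and has nothing to do with nouveau's internals; the class name, the defaults, and the injectable clock are all my own assumptions for illustration.)

```python
import time


class IntervalBurstLimiter:
    """Allow at most `burst` messages per `interval` seconds (illustrative only)."""

    def __init__(self, interval=5.0, burst=10, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock          # injectable so the behaviour can be tested
        self.window_start = None    # start time of the current window
        self.emitted = 0            # messages let through in this window
        self.suppressed = 0         # total messages dropped so far

    def allow(self):
        """Return True if a message may be logged right now."""
        now = self.clock()
        # Start a fresh window on first use or once the old one has expired.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.emitted = 0
        if self.emitted < self.burst:
            self.emitted += 1
            return True
        self.suppressed += 1
        return False
```

With something like this in front of a message source, hundreds of identical complaints a second turn into a handful per window plus a count of how many were dropped, which is a lot kinder to both the local disk and a central syslog server.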

I attempted various things to get the nouveau driver to shut off these messages, without success, and then while I was flailing around I had a crazy idea: perhaps I could just rmmod the entire driver. It might leave the machine without much of a video console, but all of these compute servers are effectively headless (they don't normally have a screen plugged in). Somewhat to my surprise, this worked and with the driver unloaded, the messages naturally stopped.

(I was worried about the nouveau driver hanging on unload because it was unable to cleanly shut down the hardware it was talking to, since it was clearly having trouble talking to the hardware in general. And before I checked lsmod I was worried about the driver having a non-zero usage count due to the generic console system or something. It seems a little bit alarming that the kernel driver for your console can have a zero usage count.)
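(The usage count check that I did by eye with lsmod can also be sketched in code. lsmod gets its information from /proc/modules, where each line has the module name, its size, its usage count, and a comma-separated list of the modules using it, or '-' if there are none. The Python below is only a sketch under those assumptions; the function names are made up for illustration, and the sample text is invented to show the format, not copied from the actual machine.)

```python
def parse_proc_modules(text):
    """Map module name -> (usage count, [modules using it]) from /proc/modules text."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        name, _size, refcount, users = fields[:4]
        # The usage count is '-' on kernels built without module unloading.
        count = None if refcount == "-" else int(refcount)
        # The users field is '-' or a list with a trailing comma, eg 'nouveau,'.
        used_by = [] if users == "-" else [u for u in users.split(",") if u]
        modules[name] = (count, used_by)
    return modules


def can_try_unload(modules, name):
    """True if the module is loaded with a zero usage count and no users."""
    if name not in modules:
        return False
    count, used_by = modules[name]
    return count == 0 and not used_by


# Invented sample in /proc/modules format (not from the real server).
SAMPLE = """\
nouveau 1949696 0 - Live 0x0000000000000000
ttm 106496 1 nouveau, Live 0x0000000000000000
drm_kms_helper 184320 1 nouveau, Live 0x0000000000000000
"""
```

On a real machine you would feed this the contents of /proc/modules instead of the sample. A zero usage count only says that rmmod is allowed to try; it doesn't promise that the driver will manage to shut the hardware down cleanly, which was my other worry.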

Comments on this page:

By george at 2020-12-18 17:01:37:

You don't need a gpu to boot a server. Headless linux is totally a thing. If you need to have hands on with it, usb dongle for serial port would do it.

By cks at 2020-12-18 17:53:28:

These are built with consumer Threadripper motherboards, so it's not clear to us if they'll even boot without a video card for the BIOS to talk to. Even if they do, it's much more convenient to deal with a video console for both the BIOS and Linux itself. Of course now that the machine is installed and running that is much less of an issue; hopefully we won't need to touch it again until we reinstall it with a later Ubuntu version.

Written on 17 December 2020.

Last modified: Thu Dec 17 23:05:28 2020