All kernel messages should be usefully ratelimited. No exceptions.

November 30, 2012

I've written about this before, but here's today's version. On some of our new Dell servers, the Ubuntu 12.04 kernel will produce messages like these two:

[Firmware Warn]: GHES: Failed to read error status block address for hardware error source: 49378.
ghes_read_estatus: 2 callbacks suppressed

(If you're coming here from a web search on this message text and want to know how to fix it, see the sidebar at the bottom of this entry.)

See that number at the end of the first message? It changes. That makes the total message different, which makes all of syslog's duplicate suppression stuff fail. The net result is that this flooded our logs to the tune of two or three messages every two seconds. Yesterday this was good for 151,000 messages. Good luck seeing any important kernel messages in that flood.
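To make the failure concrete, here is a minimal Python sketch of exact-match duplicate suppression. It is a simplified model of the "last message repeated N times" idea, not syslog's actual implementation. The `normalize` helper is hypothetical, and every source number other than 49378 is invented for illustration.

```python
import re

def suppress_duplicates(messages):
    """Collapse runs of identical consecutive messages into one line plus a count."""
    out = []
    prev, repeats = None, 0
    for msg in messages:
        if msg == prev:
            repeats += 1
            continue
        if repeats:
            out.append(f"last message repeated {repeats} times")
        out.append(msg)
        prev, repeats = msg, 0
    if repeats:
        out.append(f"last message repeated {repeats} times")
    return out

def normalize(msg):
    """Hypothetical helper: replace every run of digits with a placeholder."""
    return re.sub(r"\d+", "N", msg)

TEMPLATE = ("GHES: Failed to read error status block address "
            "for hardware error source: {}.")

# Illustrative flood: only 49378 comes from the real log, the rest are made up.
flood = [TEMPLATE.format(49378 + i) for i in range(5)]

# Exact matching sees five different messages, so nothing is suppressed.
unsuppressed = suppress_duplicates(flood)

# With the varying number masked out, the flood collapses to two lines.
collapsed = suppress_duplicates([normalize(m) for m in flood])
```

Masking digits is only there to show why the changing number matters. I am not suggesting that a real syslog daemon should rewrite messages this way.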

(Today we noticed and disabled this entire subsystem, which took a system reboot. Fortunately this machine is not in production.)

There are three failures here. The first is the customary epic failure to consider scale. The people responsible for the code that dumped out this message presumably thought that it would trigger only very occasionally, but instead on some systems it sticks on and boom, so much for your logs. When you write kernel messages (or any code that prints messages), you need to consider what happens if the cause is more common than you expect. Do people actually need to see that many messages? The answer is generally no.

(What makes this worse is that the code responsible for this actually tries to ratelimit it a bit but doesn't do it anywhere near well enough.)

The second failure is that this message is, in the jargon, not actionable. There is nothing most people running a Linux machine can really do about this message except ignore it. The kernel code has some problem interacting with my hardware? I can't fix either part of this and if the machine is not malfunctioning as a result of this I don't care about the situation. This is essentially a useless kernel message.

The third failure is a failure of the Linux kernel infrastructure. These cases keep happening over and over again because the default message reporting interface does not try to make people think about these issues. The more messages the Linux kernel dumps in my lap, the more I think that its printk() interfaces should be (usefully) ratelimited by default and you should have to go out of your way to print something at all frequently. When good ratelimiting is hard, people don't do it. When it's the default, people will.
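As a rough illustration of what ratelimiting by default could look like, here is a Python sketch of an interval-and-burst limiter that also reports how many messages it dropped, in the spirit of the 'callbacks suppressed' line above. The class, the function, and the parameters are all made up for this example; this is not the kernel's code or its API.

```python
import time

class RateLimiter:
    """Allow at most `burst` messages per `interval` seconds (illustrative only)."""

    def __init__(self, interval, burst, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock          # injectable so the limiter can be tested
        self.window_start = None
        self.emitted = 0
        self.suppressed = 0

    def allow(self):
        """Return (allowed, count suppressed in the window that just ended)."""
        now = self.clock()
        report = 0
        if self.window_start is None or now - self.window_start >= self.interval:
            # A new window starts: hand back the old suppressed count and reset.
            report = self.suppressed
            self.window_start = now
            self.emitted = 0
            self.suppressed = 0
        if self.emitted < self.burst:
            self.emitted += 1
            return True, report
        self.suppressed += 1
        return False, report

def log_limited(limiter, msg, sink):
    """Append msg to sink only if the limiter allows it."""
    allowed, missed = limiter.allow()
    if missed:
        sink.append(f"{missed} messages suppressed")
    if allowed:
        sink.append(msg)
```

The point of the sketch is the calling convention: the only logging function on offer takes a limiter, so the author of a message has to pick an interval and a burst (or accept a conservative default) before anything gets printed.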

(Another thing that would help is a separate message reporting infrastructure for bulk warnings like this, one where the messages do not appear in the kernel message stream but instead show up in, say, per-module message logs in sysfs or something. Then code that felt compelled to report every instance of this sort of thing could do so while not contaminating the useful kernel messages.)

Sidebar: what this means and how to fix it

The simple way to fix this: add 'ghes.disable=1' to the kernel command line in whatever way is appropriate for your distribution (these days, usually changing /etc/default/grub and running update-grub) and reboot your machine. This will turn off the entire subsystem responsible for this message, which is unfortunately the only good way to do it.
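If you want to script the edit, something like the following Python sketch would do it. It assumes the file has a `GRUB_CMDLINE_LINUX_DEFAULT="..."` line with double quotes, which is how /etc/default/grub usually looks on Ubuntu but is not guaranteed everywhere. The function name is made up, and the sketch only transforms text; it does not touch the real file.

```python
import re

def add_kernel_param(grub_text, param, var="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with param appended to the var="..." line, if missing.

    Assumes a double-quoted value on a single line; raises ValueError otherwise.
    """
    pattern = re.compile(rf'^({re.escape(var)}=")([^"]*)(")', re.MULTILINE)

    def repl(match):
        args = match.group(2).split()
        if param not in args:       # keep the edit idempotent
            args.append(param)
        return match.group(1) + " ".join(args) + match.group(3)

    new_text, count = pattern.subn(repl, grub_text)
    if count == 0:
        raise ValueError(f'no {var}="..." line found')
    return new_text

# Example input, not taken from a real machine.
sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
updated = add_kernel_param(sample, "ghes.disable=1")
```

After writing the result back to /etc/default/grub you would still need to run update-grub and reboot, as described above.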

This is apparently a standing bug in some kernels (allegedly only 32-bit ones, which matches our experiences) on at least some recent Dell servers; see Ubuntu bug #881164 and Fedora #746755. However, it's erratic; it doesn't happen on all of our recent Dells, even Dells of the same model that are all running 32-bit kernels. Since this has happened on two of our servers already, I suspect that we're going to wind up just automatically disabling GHES in our standard Ubuntu 12.04 install. The very vague potential gains of GHES reports are not worth the clear downsides.

The message itself comes from drivers/acpi/apei/ghes.c. To quote the comments in that file:

APEI Generic Hardware Error Source support

Generic Hardware Error Source provides a way to report platform hardware errors (such as that from chipset).

'APEI' is the ACPI Platform Error Interface. If you are cringing at the mention of ACPI here, well, yeah, that was my reaction too.

Written on 30 November 2012.

Last modified: Fri Nov 30 17:53:07 2012