Wandering Thoughts archives

2012-11-30

All kernel messages should be usefully ratelimited. No exceptions.

I've written about this before, but here's today's version. On some of our new Dell servers, the Ubuntu 12.04 kernel will produce messages like these two:

[Firmware Warn]: GHES: Failed to read error status block address for hardware error source: 49378.
ghes_read_estatus: 2 callbacks suppressed

(If you're coming here from a web search on this message text and want to know how to fix it, see the sidebar at the bottom of this entry.)

See that number at the end of the first message? It changes. That makes the total message different, which makes all of syslog's duplicate suppression stuff fail. The net result is that this flooded our logs to the tune of two or three messages every two seconds. Yesterday this was good for 151,000 messages. Good luck seeing any important kernel messages in that flood.

(Today we noticed and disabled this entire subsystem, which took a system reboot. Fortunately this machine is not in production.)

There are three failures here. The first is the customary epic failure to consider scale. The people responsible for the code that dumped out this message presumably thought that it would trigger only very occasionally, but instead on some systems it sticks on and boom, so much for your logs. When you write kernel messages (or any code that prints messages), you need to consider what happens if the cause is more common than you expect. Do people actually need to see that many messages? The answer is generally no.

(What makes this worse is that the code responsible for this actually tries to ratelimit it a bit but doesn't do it anywhere near well enough.)

The second failure is that this message is, in the jargon, not actionable. There is nothing most people running a Linux machine can really do about this message except ignore it. The kernel code has some problem interacting with my hardware? I can't fix either part of this and if the machine is not malfunctioning as a result of this I don't care about the situation. This is essentially a useless kernel message.

The third failure is a failure of the Linux kernel infrastructure. These cases keep happening over and over again because the default message reporting interface does not try to make people think about these issues. The more messages the Linux kernel dumps in my lap, the more I think that its printk() interfaces should be (usefully) ratelimited by default and you should have to go out of your way to print something at all frequently. When good ratelimiting is hard, people don't do it. When it's the default, people will.

(Another thing that would help is a separate message reporting infrastructure for bulk warnings like this, one where the messages do not appear in the kernel message stream but instead show up in, say, per-module message logs in sysfs or something. Then code that felt compelled to report every instance of this sort of thing could do so while not contaminating the useful kernel messages.)

Sidebar: what this means and how to fix it

The simple way to fix this: add 'ghes.disable=1' to the kernel command line in whatever way is appropriate for your distribution (these days, usually changing /etc/default/grub and running update-grub) and reboot your machine. This will turn off the entire subsystem responsible for this message, which is unfortunately the only good way to do it.

This is apparently a standing bug in some kernels (allegedly only 32-bit ones, which matches our experiences) on at least (some) recent Dell servers; see Ubuntu bug #881164 and Fedora #746755. However, it's erratic; it doesn't happen on all of our recent Dells, even Dells of the same model that are all running 32-bit kernels. Since this has happened on two of our servers already, I suspect that we're going to wind up just automatically disable GHES in our standard Ubuntu 12.04 install. The very vague potential gains of GHES reports are not worth the clear downsides.

The message itself comes from drivers/acpi/apei/ghes.c. To quote the comments in that file:

APEI Generic Hardware Error Source support

Generic Hardware Error Source provides a way to report platform hardware errors (such as that from chipset).

'APEI' is the ACPI Platform Error Interface (cf). If you are cringing at the mention of ACPI here, well, yeah, that was my reaction too.

linux/RatelimitKernelMessages written at 17:53:07; Add Comment

My new view on why you need to profile code

The usual and traditional story of why you want to profile your code when you're optimizing it is that people are terrible at predicting what their code is actually doing. It's not just that people don't predict where the time is going; often they don't even predict things like what functions are called a lot, how many times loops run, and so on. This view offers what I'll call a fantasy of perfect knowledge, the idea that if we really understood what our code was doing and how it was being used, if we already knew the code flow information that profiling uncovers, we could predict all of this performance stuff in advance. The job of profiling is to fill in for our imperfect and mistaken information.

Several years ago I read a Malcolm Gladwell article where he drew a distinction between puzzles and mysteries that has stuck with me ever since. To use his terms, in a puzzle you don't have enough information and your challenge is to get more while in a mystery you have plenty of information and your challenge is to make sense of it all. In these terms, the traditional story of profiling and optimization is that it is a puzzle; you resort to profiling because you have incomplete and wrong information about what your program is actually doing.

At one level this is undeniably true, but it's also incomplete. I now believe that profiling is actually a mystery inside a puzzle. Even if you 'solved' the puzzle, gaining perfect knowledge of your code's real behavior, you still would not be able to accurately predict the code's performance because our programs and our systems are too complex for all of their interactions to be fully predicted in advance (at least by real people in any useful way). The fundamental reason to profile is because it cuts straight through this, removing the need to solve the mystery.

This is a much more pessimistic and strong take on profiling than usual, but I've come around to the view that it's the real situation. We just don't usually think about it because we generally feel we don't have sufficiently good information about what our program is doing (and we usually blame failures of prediction on lack of information).

(Although this is not quite how Aristotle Pagaltzis put it in his comment on this entry, I suspect this matches up with his opinion on the situation.)

programming/ProfilingReason written at 00:58:39; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.