Wandering Thoughts archives

2014-12-10

What good kernel messages should be about and be like

Linux is unfortunately a haven of terrible kernel messages and terrible kernel message handling, as I have brought up before. In a spirit of shouting at the sea, today I feel like writing down my principles of good kernel messages.

The first and most important rule of kernel messages is that any kernel message that is emitted by default should be aimed at system administrators, not kernel developers. There are very few kernel developers and they do not look at very many systems, so it's pretty much guaranteed that most kernel messages are read by sysadmins. If a kernel message is for developers, it's useless for almost everyone reading it (and potentially confusing). Ergo it should not be generated by default settings; developers who need it for debugging can turn it on in various ways (including kernel command line parameters). This core rule guides basically all of the rest of my rules.

The direct consequence of this is that all messages should be clear, without in-jokes or cleverness that is only really comprehensible to kernel developers (especially only subsystem developers). In other words, no yama-style messages. If sysadmins looking at your message have no idea what it might refer to, no lead on what kernel subsystem it came from, and no clue where to look for further information, your message is bad.

Comprehensible messages are only half of the issue, though; the other half is only emitting useful messages. To be useful, my view is that a kernel message should be one of two things: it should either be what they call actionable or it should be necessary in order to reconstruct system state (one example is hardware appearing or disappearing, another is log messages that explain why memory allocations failed). An actionable message should cause sysadmins to do something and really it should mean that sysadmins need to do something.

It follows that generally other systems should not be able to cause the kernel to log messages by throwing outside traffic at it (these days that generally means network traffic), because outsiders should not be able to harm your kernel to the degree where you need to do anything; if this is the case, they are not actionable for the sysadmin of the local machine. And yes, I bang on this particular drum a fair bit; that's because it keeps happening.

Finally, almost all messages should be strongly ratelimited. Unfortunately I've come around to the view that this is essentially impossible to do at a purely textual level (at least with acceptable impact for kernel code), so it needs to be considered everywhere kernel code can generate a message. This very definitely includes things like messages about hardware coming and going, because sooner or later someone is going to have a flaky USB adapter or SATA HD that starts disappearing and then reappearing once or twice a second.

To say this more compactly, everything in your kernel messages should be important to you. Kernel messages should not be a random swamp that you go wading in after problems happen in order to see if you can spot any clues amidst the mud; they should be something that you can watch live to see if there are problems emerging.

linux/GoodKernelMessages written at 22:52:03; Add Comment

How to delay your fileserver replacement project by six months or so

This is not exactly an embarrassing confession, because I think we made the right decisions for the long term, but it is at least an illustration of how a project can get significantly delayed one little bit at a time. The story starts back in early January, where we had basically finalized the broad details of our new fileserver environment; we had the hardware picked out and we knew we'd run OmniOS on the fileservers and our current iSCSI target software on some distribution of Linux. But what Linux?

At first the obvious answer was CentOS 6, since that would get us a nice long support period and RHEL 5 had been trouble-free on our existing iSCSI backends. Then I really didn't like RHEL/CentOS 6 and didn't want to use it here for something we'd have to deal with for four or five years to come (especially since it was already long in the tooth). So we switched our plans to Ubuntu, since we already run it everywhere else, and in relatively short order I had a version of our iSCSI backend setup running on Ubuntu 12.04. This was probably complete some time in late February, based on circumstantial evidence.

Eliding some rationale, Ubuntu 12.04 was an awkward thing to settle on in March or so of this year because Ubuntu 14.04 was just around the corner. Given that we hadn't built and fully tested the production installation, we might actually have wound up in the position of deploying 12.04 iSCSI backends after 14.04 had actually come out. Since we didn't feel in a big rush at the time, we decided it was worthwhile to wait for 14.04 to be released and for us to spin up the 14.04 version of our local install system, which we expected to have done by not too long after the 14.04 release. As it happened it was June before I picked the new fileserver project up again and I turned out to dislike Ubuntu 14.04 too.

By the time we knew we didn't really want to use Ubuntu 14.04, RHEL 7 was out (it came out June 10th). While we couldn't use it directly for local reasons, we though that CentOS 7 was probably going to be released soon and that we could at least wait a few weeks to see. CentOS 7 was released on July 7th and I immediately got to work, finally getting us back on track to where we probably could have been at the end of January if we'd stuck with CentOS 6.

(Part of the reason that we were willing to wait for CentOS 7 was that I actually built a RHEL 7 test install and everything worked. That not only proved that CentOS 7 was viable, it meant that we had an emergency fallback if CentOS 7 was delayed too long; we could go into at least initial production with RHEL 7 instead. I believe I did builds with CentOS 7 beta spins as well.)

Each of these decisions was locally sensible and delayed things only a moderate bit, but the cumulative effects delayed us by five or six months. I don't have any great lesson to point out here, but I do think I'm going to try to remember this in the future.

sysadmin/FileserverSixMonthDelay written at 00:13:51; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.