Make sure that (system) email works on every machine

January 4, 2017

We have a central log server, which as you might imagine is a machine we care about a fair bit. Today we discovered that one of the disks in its software RAID mirror for /var had failed. Perhaps you are thinking that it failed over the university's just-ended winter break, so let me be more honest and precise here: it had failed in late October. And we didn't so much 'find' that the disk had failed as stumble over the fact more or less through luck.

The reason we missed the news wasn't that the machine failed to send notifications. The machine's mdadm setup had dutifully sent out email about the failure several times. It's just that the email hadn't gone anywhere except to the local /var/spool/mail/root, because we hadn't set up a null-client Postfix configuration on it. There are multiple causes for that, but I'm sure that one of them is that it simply slipped our minds that the machine might generate important local email.

(The central log server is a deliberately isolated one-off CentOS 7 machine, instead of one of our standard Ubuntu installs. Our standard Ubuntu machines automatically get a null-client Postfix configuration that sends all locally generated email to our mail submission machine and to us, but there's nothing that automatically sets that up for one-off CentOS 7 machines so it dropped through the cracks.)

There are a number of lessons here. The most immediately useful one is to make sure that the mail system is configured on all your machines. All of them. If something generates email on a machine, however unlikely that may seem to you, that email should not effectively vanish into the void; it should go somewhere where you'll at least have a record of it.
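One way to find machines where mail is quietly piling up is to look for non-empty local mailboxes. What follows is only a minimal sketch, not something we run: it assumes the spool directory holds mbox-format files (as /var/spool/mail does on CentOS 7), and the function name stranded_mail is made up for illustration.

```python
import mailbox
import os


def stranded_mail(spool_dir="/var/spool/mail"):
    """Return {mailbox_name: [subject, ...]} for every non-empty mbox file.

    Assumes each regular file in spool_dir is an mbox-format mailbox.
    """
    found = {}
    if not os.path.isdir(spool_dir):
        return found
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            continue
        try:
            box = mailbox.mbox(path, create=False)
        except OSError:
            # Unreadable, for example when not running as root; skip it.
            continue
        try:
            subjects = [str(msg.get("Subject", "(no subject)")) for msg in box]
        finally:
            box.close()
        if subjects:
            found[name] = subjects
    return found
```

Calling stranded_mail() as root and printing the result gives a quick picture of what has been delivered locally and never read. It is a stopgap for auditing, not a substitute for actually forwarding the mail.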

(There is an argument that you should have a better monitoring system for problems like this than reading email. Sure, in an ideal world, but systems come out of the box set up to send email to root right now. And even with a better monitoring system there are still unusual problems that will be reported by email, such as cron jobs exploding spectacularly. Handling email as a backup is just the simplest way.)
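For the specific case of a degraded software RAID array, a direct check does not have to be elaborate. The following is a rough sketch that assumes the usual /proc/mdstat layout, where each array's status line ends in something like "[2/2] [UU]" and a missing device shows up as "_". The function name degraded_arrays is hypothetical, and this is an illustration rather than a replacement for mdadm's own monitoring.

```python
import re

# An array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]".
ARRAY_LINE = re.compile(r"^(md\S*)\s*:")
# A status line fragment, e.g. "[2/1] [U_]".
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of arrays whose status shows a missing device."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current is not None:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded
```

On a real machine you would feed it the contents of /proc/mdstat, for instance from a cron job. Of course, a cron job reports its findings by email, which brings us right back to needing working mail.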

We aren't perfect on this, but at least now our central syslog server (and a couple of other similar systems) will have their mail get through to us.

(There are some tricky parts about doing this really well that we aren't currently doing. To do it perfectly you need a separate submission configuration from your regular machines, but that's sufficiently complicated that it's another entry.)

Sidebar: How we found this

We were applying the recent CentOS 7 updates to the machine, and after the 'yum update' finished, the shell gave us that classic note:

You have new mail in /var/spool/mail/root

We wondered what on the machine would be sending email, so we took a look at root's mailbox. It turned out that one of the updated packages was mdadm and updating it had restarted its monitoring, which had caused it to send out another email about the degraded array.

A lot of things had to go right in order for us to be lucky here. One moral to draw is to take a look at oddities, like surprise notices about new mail. They may be the surface manifestation of something quite important.

Written on 04 January 2017.


Last modified: Wed Jan 4 01:37:59 2017