2017-01-04
Make sure that (system) email works on every machine
We have a central log server, which as you
might imagine is a machine we care about a fair bit. Today we
discovered that one of the disks in its software RAID mirror for /var
had failed. Perhaps you are thinking that it failed over the
university's just-ended winter break, so let me be more honest and
precise here: it had failed in late October. And we didn't so much
'find' that the disk had failed as stumble over the fact more or
less through luck.
We didn't miss the news because the machine wasn't sending out
notifications of it. The machine's mdadm setup had dutifully sent
out email about it several times. It's just that the email hadn't
gone anywhere except to the local /var/spool/mail/root, because
we hadn't set up a null-client Postfix configuration on it. There
are multiple causes for that, but I'm sure that one of them is that
it simply slipped our minds that the machine might generate important
local email.
(The central log server is a deliberately isolated one-off CentOS 7 machine, instead of one of our standard Ubuntu installs. Our standard Ubuntu machines automatically get a null-client Postfix configuration that sends all locally generated email to our mail submission machine and to us, but there's nothing that automatically sets that up for one-off CentOS 7 machines so it dropped through the cracks.)
There are a number of lessons here. The most immediately useful is to make sure that the mail system is configured on all your machines. All of them. If something generates email on a machine, however unlikely that may seem to you, that email should not effectively vanish into the void; it should go somewhere where you'll at least have a record of it.
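As a rough illustration of how you might notice email that has piled up locally instead of being forwarded, here is a minimal Python sketch. It is not something we run; the function name stranded_mail is made up for this example, and it assumes the local spool is a directory of mbox files (such as /var/spool/mail on CentOS 7) that the script has permission to read.

```python
import mailbox
from pathlib import Path


def stranded_mail(spool_dir="/var/spool/mail"):
    """Return {mailbox name: [subjects]} for each non-empty mbox in spool_dir.

    stranded_mail is a hypothetical helper for illustration. It assumes the
    spool directory holds one mbox-format file per local user.
    """
    found = {}
    spool = Path(spool_dir)
    if not spool.is_dir():
        # No local spool directory at all, so there is nothing to report.
        return found
    for path in sorted(spool.iterdir()):
        try:
            if not path.is_file() or path.stat().st_size == 0:
                continue
            box = mailbox.mbox(str(path), create=False)
        except OSError:
            # Unreadable mailbox (for example, not running as root); skip it.
            continue
        try:
            subjects = [str(msg.get("Subject", "(no subject)")) for msg in box]
        finally:
            box.close()
        if subjects:
            found[path.name] = subjects
    return found


if __name__ == "__main__":
    for user, subjects in stranded_mail().items():
        print(f"{user}: {len(subjects)} message(s) sitting in the local spool")
        for subject in subjects:
            print(f"    {subject}")
```

Run as root on a machine without a working mail configuration, this would list whatever has accumulated in mailboxes like root's. It is only a stopgap for finding stranded mail; the real fix is still to configure the mail system so that the messages are delivered somewhere you will see them.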
(There is an argument that you should have a better monitoring
system for problems like this than reading email. Sure, in an ideal
world, but systems come out of the box set up to send email to root
right now. And even with a better monitoring system there
are still unusual problems that will be reported by email, such as
cron jobs exploding spectacularly. Handling email as a backup is
just the simplest way.)
We aren't perfect on this, but at least now our central syslog server (and a couple of other similar systems) will have their mail get through to us.
(There are some tricky parts about doing this really well that we aren't currently doing. To do it perfectly you need a separate submission configuration from your regular machines, but that's sufficiently complicated that it's another entry.)
Sidebar: How we found this
We were applying the recent CentOS 7 updates to the machine, and
after the 'yum update' finished, the shell gave us that classic
note:
You have new mail in /var/spool/mail/root
We wondered what on the machine would be sending email, so we took
a look at root's mailbox. It turned out that one of the updated
packages was mdadm, and updating it had restarted its monitoring,
which had caused it to send out another email about the degraded
array.
A lot of things had to go right in order for us to be lucky here. One moral to draw is to take a look at oddities, like surprise notices about new mail. They may be the surface manifestation of something quite important.