Wandering Thoughts

2017-05-15

How we failed at making all our servers have SSD system disks

Several years ago I wrote an entry about why we're switching to SSDs for system disks, yet the other day there I was writing about how we recycle old disks into system disks and about maybe switching to fixed-size root filesystems to deal with some issues there. A reasonable person might wonder what happened between point A and point B. What happened was not any of the problems that I thought might come up; instead it is a story of good intentions meeting rational but unfortunate decisions.

The first thing that happened was that we didn't commit whole-heartedly to this idea. Instead we decided that even inexpensive SSDs were still costly enough that we wouldn't use them on 'less important' machines; instead we'd reuse older hard drives on some machines. This opened a straightforward wedge in our plans, because now we had to decide if a machine was important enough for SSDs and we could always persuade ourselves that the answer was 'no'.

(It would have been one thing if we'd said 'experimental scratch machines get old HDs', but we opened it up to 'less important production machines'.)

Our next step was that we didn't buy (and keep buying) enough SSDs to always clearly have plenty of them in stock. The problem here is straightforward; if you want to make something pervasive in the servers that you set up, you need to make it pervasive on your stock shelf, and you need to establish the principle that you're always going to have more. This holds just as true for SSDs for us as it does for RAM; once we had a limited supply, we had an extra reason to ration it, and we'd already created our initial excuse when we decided that some servers could get HDs instead of SSDs.

Then as SSD stocks dwindled below a critical point, we had the obvious reaction of deciding that more and more machines weren't important enough to get SSDs as their system disks. This was never actively planned and decided on (and if it had been planned, we might have ordered more SSDs). Instead it happened bit by bit; if I was setting up a server and we had only (say) four SSDs left, I had to decide on the spot whether my server was that important. It was easy to talk myself into saying 'I guess not, this can live with HDs', because I had to make a decision right then in order to keep moving forward on putting the server together.

(Had we sat down to plan out, say, our next four or five servers that we were going to build and talked about which ones were and weren't important, we might well have ordered more SSDs because the global situation would have been clearer and we would have been doing this further in advance. On-the-spot decision making tends to focus on the short term and the immediate perspective instead of the long term and the global picture.)

At this point we have probably flipped over to a view that HDs are the default on new or replacement servers and a server has to strike us as relatively special to get SSDs. This is pretty much the inverse of where we started out, although arguably it's a rational and possibly even correct response to budget pressures and so on. In other words, maybe our initial plan was always over-ambitious for the realities of our environment. The original push did help, though; we got SSDs into some important servers and thus probably made them less likely to have disk failures.

A contributing factor is that it turned out to be surprisingly annoying to put SSDs in the 3.5" drive bays in a number of our servers, especially Dell R310s, because they have strict alignment requirements for the SATA and power connectors, and garden-variety 2.5" to 3.5" SSD adaptors don't put the SSD in the right place for this. Getting SSDs into such machines required special extra hardware; this added extra hassle, extra parts to keep in stock, and extra cost.

(This entry is partly in the spirit of retrospectives.)

SSDSystemDisksFailure written at 01:23:56; Add Comment

2017-05-08

Some things I've decided to do to improve my breaks and vacations

About a year ago I wrote about my email mistake of mingling personal and work email together and how it made taking breaks from work rather harder. It may not surprise you to hear that I have done nothing to remedy that situation since then. Splitting out email is a slog and I'm probably never going to get around to it. However, there are a couple of cheap tricks that I've decided to do for breaks and vacations (in fact I decided on them last year, but never got around to either writing about them or properly implementing them).

There are a number of public mailing lists for things like Exim and OmniOS that I participate in. I broadly like reading them, learning from them, and perhaps even helping people on them, but at the same time they cover software that I don't have a deep personal interest in; I'm primarily on those mailing lists because we use all of these things at work. What I've found in the past is that these mailing lists feed me a constant drip of email traffic that I'm just not that interested in during breaks; after a while it becomes an annoyance to slog through. So now I am going to procmail away all of the traffic from those mailing lists for the duration of any break. Maybe I'll miss the opportunity to help someone, but it's worth it to stop distracting myself. All of that stuff can wait until I'm back in the office.
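
As a concrete sketch of the sort of procmail rule I have in mind (the list names and the holding mailbox here are made-up examples, not my actual setup):

# during a break, divert mailing list traffic based on its List-Id header
:0:
* ^List-Id:.*(exim-users|omnios-discuss)
held-lists

When the break is over, removing (or commenting out) the recipe turns the flow back on, and the held-lists mailbox is there to skim through if I feel like it.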

(I may also do this for some mailing lists for software I use personally. For example, I'm not sure that I need to be keeping up on the latest mail about my RAW processor if I'm trying to take a break from things.)

The other cheap trick is simple. I have a $HOME/adm directory full of various scripts I use to monitor and check in on things about our systems, and one of my fidgets is to run some of them just because. So I'm going to make that directory inaccessible when I'm taking a break by just doing 'chmod 055 $HOME/adm' (055 so that my co-workers can keep using these scripts if they want to). This isn't exactly a big obstacle I've put in my way; I can un-chmod the directory if I want to. But it's enough of a roadblock to hopefully break my habit of reflexively checking things, which is both a distraction and a temptation to get pulled into looking more deeply at anything I spot.
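
For what it's worth, the toggle reduces to a pair of trivial shell aliases (a sketch; the 755 used to restore access is an assumption about the directory's normal permissions):

alias start-break='chmod 055 $HOME/adm'
alias end-break='chmod 755 $HOME/adm'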

It's going to feel oddly quiet and empty to not have these email messages coming in and these fidgets around, but I think it's going to be good for me. If nothing else, it's going to be different and that's not a bad thing.

(Completely disconnecting from work would be ideal but it's not possible while my email remains entangled and, as mentioned, I still don't feel energetic enough to tackle everything involved in starting to fix that.)

HacksForBetterBreaks written at 21:43:35; Add Comment

2017-04-14

Sometimes laziness doesn't pay off

My office workstation has been throwing out complaints about some of its disks for some time, which I have been quietly clearing up rather than replace the disks. This isn't because these are generally good disks; in fact they're some Seagate 1TB drives which we haven't had the best of luck with. I was just some combination of too lazy to tear my office machine apart to replace a drive and too parsimonious with hardware to replace a disk drive before it failed.

(Working in a place with essentially no hardware budget is a great way to pick up this reflex of hardware parsimony. Note that I do not necessarily claim that it's a good reflex, and in some ways it's a quite inefficient one.)

Recently things got a bit more extreme, when one disk went from nothing to suddenly reporting 92 new 'Offline uncorrectable sector' errors (along with 'Currently unreadable (pending) sectors', which seems to travel with offline uncorrectable sectors). I looked at this, thought that maybe this was a sign that I should replace the disk, but then decided to be lazy; rather than go through the hassle of a disk replacement, I cleared all the errors in the usual way. Sure, the disk was probably going to fail, but it was in a mirror, and when it actually did fail I could swap it out right away.

(I actually have a pair of disks sitting on my desk just waiting to be swapped in in place of the current pair. I think I've had them set aside for this for about a year.)
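
As an aside, if you want to keep an eye on the raw counters involved yourself, they're easy to check directly with smartctl; something like this, where the device name is obviously just an example:

$ smartctl -A /dev/sdb | egrep 'Current_Pending_Sector|Offline_Uncorrectable'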

Well, speaking of that plan to just swap the disk when it failed, let's go to Twitter:

@thatcks: I guess I really should have just replaced that drive in my office workstation when it reported 92 Offline uncorrectable sectors.

@thatcks: At 5:15pm, I'm just going to hope that the other side of the mirrored pair survives until Monday. (And insure I have pretty full backups.)

Yeah, I had kind of been assuming that the disk would fail at some convenient time, like during a workday when I wasn't doing anything important. There are probably worse times for my drive to fail than right in front of me at 5:15 pm immediately before a long weekend, especially when I have a bike ride that evening that I want to go to, but I can't think of many that are more annoying.

(The annoyance is in knowing that I could swap the drive on the spot, if I was willing to miss the bike ride. I picked the bike ride, and a long weekend is just short enough that I'm not going to come in in the middle of it to swap the drive.)

I have somewhat of a habit of being lazy about this sort of thing. Usually I get away with it, which of course only encourages me to keep on being lazy and do it more. Then some day things blow up in my face, because laziness doesn't always pay off. I need to be better about getting myself to do those annoying tasks sooner rather than later, instead of putting them off until I have no choice.

(At the same time, strategic laziness is important, so important that it can be called 'prioritization'. You can't possibly give everything complete attention, time, and resources, so you need to know when to cut some nominal corners. This shows up especially in security, because there are usually an infinite number of things that you could be doing to make your environment just a bit more secure. You have to stop somewhere.)

LazinessSometimesBackfires written at 00:40:00; Add Comment

2017-04-12

Generating good modern self-signed TLS certificates in today's world

Once upon a time, generating decently good self-signed certificates for a host with OpenSSL was reasonably straightforward, especially if you didn't know about some relevant nominal standards. The certificate's Subject name field is a standard field with standard components, so OpenSSL would prompt you for all of them, including the Common Name (CN) that you'd put the hostname in. Then things changed, and in modern TLS you really want to put the hostname in the Subject Alternative Name field. SubjectAltName is an extension, and because it's an extension 'openssl req' will not prompt you to fill it in.

(The other thing is that you need to remember to specify -sha256 as one of the arguments; otherwise 'openssl req' will use SHA1 and various things will be unhappy with your certificate. Not all examples you can find on the Internet use '-sha256', so watch out.)

You can get 'openssl req' to create a self-signed certificate with a SAN, but since OpenSSL won't prompt for this you must use an OpenSSL configuration file to specify everything about the certificate, including the hostname(s). This is somewhat intricate, even if it turns out to be possible to do it more or less through the command line with suitably complicated incantations. I particularly admire the use of the shell's usually obscure '<(...)' idiom.
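
To give the flavour of those incantations, here is a sketch of the general shape (untested as written here, and the path to the stock openssl.cnf varies from system to system):

$ openssl req -x509 -newkey rsa:2048 -nodes -sha256 \
    -keyout key.pem -out cert.pem -days 730 \
    -subj '/CN=www.example.com' \
    -extensions san \
    -config <(cat /etc/ssl/openssl.cnf \
              <(printf '[san]\nsubjectAltName=DNS:www.example.com'))

The '<(...)' process substitution is what lets you glue a temporary [san] section onto the stock configuration file without having to write an actual scratch file.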

Given how painful this is, what we really need is a better tool to create self-signed certificates and fortunately for me, it turns out that there is just what I need sitting around in the Go source code as generate_cert.go. Grab this file, copy it to a directory, then:

$ go build generate_cert.go
$ ./generate_cert --host www.example.com --duration 17520h
2017/04/11 23:51:21 written cert.pem
2017/04/11 23:51:21 written key.pem

This generates exactly the sort of modern self-signed certificate that I want; it uses SHA256, it has a 2048-bit RSA key (by default), and it's got SubjectAltName(s). You can use it to generate ECDSA-based certificates if you're feeling bold.
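
If I'm reading its flags correctly, the ECDSA version is just a matter of asking for a curve (I haven't actually tried this myself, so consider it a guess at the invocation):

$ ./generate_cert --host www.example.com --duration 17520h --ecdsa-curve P256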

Note that this generates a certificate without a CN. Since there are real CN-less certificates out there in the wild issued by real Certificate Authorities (including the one for this site), not having a CN should work fine with web browsers and most software, but you may run into some software that is unhappy with this. If so, it's only a small modification to add a CN value.

(You could make a rather more elaborate version of generate_cert.go with various additional useful options, and perhaps someone has already done so. I have so far resisted the temptation to start changing it myself.)

A rather more elaborate but more complete-looking alternative is Cloudflare's CFSSL toolkit. CFSSL can generate self-signed certificates, good modern CSRs, and sign certificates with your own private CA certificate, which covers everything I can think of. But it has the drawback that you need to feed it JSON (even if you generate the JSON on the fly) and then turn its JSON output into regular .pem files with one of its included programs.
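
To illustrate, generating a self-signed certificate with CFSSL goes roughly like this; this is a from-memory sketch, so treat the exact JSON fields and options as approximate rather than as gospel:

$ echo '{"CN": "www.example.com", "hosts": ["www.example.com"],
         "key": {"algo": "rsa", "size": 2048}}' |
    cfssl selfsign www.example.com - | cfssljson -bare selfsigned

(The cfssljson step is what turns CFSSL's JSON output into ordinary selfsigned.pem and selfsigned-key.pem files.)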

For basic, more or less junk self-signed certificates, generate_cert is the simple way to go. For instance my sinkhole SMTP server now uses one of these certs; SMTP senders don't care about details like good O values in your SMTP certificates, and even if they did in general, spammers probably don't. If I was generating more proper self-signed certificates, ones where people might see them in a browser or something, I would probably use CFSSL.

(Although if I only needed certificates with a constant Subject name, the lazy way to go would be to hardcode everything in a version of generate_cert and then just crank out a mass of self-signed certificates without having to deal with JSON.)

PS: We might someday want self-signed certificates with relatively proper O values and so on, for purely internal hosts that live in our own internal DNS zones. Updated TLS certificates for IPMI web interfaces are one potential case that comes to mind.

PPS: It's entirely possible that there's a better command line tool for this out there that I haven't stumbled over yet. Certainly this feels like a wheel that people must have reinvented several times; I almost started writing something myself before finding generate_cert.

MakingModernSelfSignedSSLCerts written at 00:23:59; Add Comment

2017-04-08

Doing things the clever way in Exim ACLs by exploiting ACL message variables

Someone recently brought a problem to the Exim mailing list where, as we originally understood it, they wanted to reject messages at SMTP time if they had a certain sender, went to certain recipients, and had a specific message in their Subject:. This is actually a little bit difficult to do straightforwardly in Exim because of the recipients condition.

In order to check the Subject: header, your ACL condition must run in the DATA phase (which is the earliest that the message headers are available). If you don't need to check the recipients, this is straightforward and you get something like this:

deny
   senders = <address list>
   condition = ${if match{$h_subject:}{Commit}}
   message = Prohibited commit message

The problem is in matching against the recipients. By the DATA phase there may be multiple recipients, so Exim doesn't offer any simple condition to match against them (the recipients ACL condition is valid only in the RCPT TO ACL, although Exim's current documentation doesn't make this clear). Exim exposes the entire accepted recipients list as $recipients, but you have to write a matching expression for this yourself and it's not completely trivial.

Fortunately there is a straightforward way around this: we can do our matching in stages and then accumulate and communicate our match results through ACL message variables. So if we want to match recipient addresses, we do that in the RCPT TO ACL in a warn ACL stanza whose only purpose is providing us a place to set an ACL variable:

warn
  recipients = <address list>
  set acl_m0_nocommit = 1

(After all, it's easy to match the recipient address against things in the RCPT TO ACL, because that's a large part of its purpose.)

Then in our DATA phase ACL we can easily match against $acl_m0_nocommit being set to 1. If we're being extra-careful we'll explicitly set $acl_m0_nocommit to 0 in our MAIL FROM ACL, although in practice you'll probably never run into a case where this matters.
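
Putting the pieces together, the DATA phase check from before becomes something like this (a sketch assembled from the earlier stanzas, not our literal configuration):

deny
   senders = <address list>
   condition = ${if eq{$acl_m0_nocommit}{1}}
   condition = ${if match{$h_subject:}{Commit}}
   message = Prohibited commit message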

Another example of communicating things from RCPT TO to DATA ACLs is in how we do milter-based spam rejection. Because DATA time rejection applies to all recipients and not all of our users have opted in to the same level of server side spam filtering, we accumulate a list of everyone's spam rejection level in the RCPT TO ACLs, then work out the minimum level in the DATA ACLs. This is discussed in somewhat more detail in the sidebar here.

In general ACL message variables can be used for all sorts of communication across ACL stanzas, both between different ACLs and even within the same ACL. As I sort of mentioned in how we do MIME attachment type logging with Exim, our rejection of certain sorts of attachments is done by recording the attachment type information into an ACL message variable and then reusing it repeatedly in later stanzas. So we have something like this:

warn
  # exists just to set our ACL variable
  [...]
  set acl_m1_astatus = ${run [...]}

deny
  condition = ${if match{$acl_m1_astatus} {\N (zip|rar) exts:.* .(exe|js|wsf)\N} }
  message = ....

deny
  condition = ${if match{$acl_m1_astatus} {\N MIME file ext: .(exe|js|bat|com)\N} }
  message = ....

deny
  condition = ${if match{$acl_m1_astatus} {\N; zip exts: .zip; inner zip exts: .doc\N} }
  message = ....

[...]

(Note that these conditions are simplified and shortened from our real versions.)

None of this is surprising. Exim's ACL message variables are variables, and so you can use them for communicating between different chunks of code just as you do in any other programming language. You just have to think of Exim ACLs and ACL stanzas as being a programming language and thus being something that you can write code in. Admittedly it's a peculiar programming language, but then much of Exim is this way.

EximMultiStageACLMatching written at 23:41:04; Add Comment

2017-03-31

I quite like the simplification of having OpenSSH canonicalize hostnames

Some time ago I wrote up some notes on OpenSSH's optional hostname canonicalization. At the time I had just cautiously switched over to having my OpenSSH setup on my workstation canonicalize my hostnames, and I half expected it to go wrong somehow. It's been just over a year since then and not only has nothing blown up, but I now actively prefer having OpenSSH canonicalize the hostnames that I use. Just today I switched my OpenSSH setup on our login servers over to do this as well, and cleared out my ~/.ssh/known_hosts file to start it over from scratch.

That latter bit is a big part of why I've come to like hostname canonicalization. We have a long history of using multiple forms of shortened hostnames for convenience, plus sometimes we wind up using a host's fully qualified name. When I didn't canonicalize names, my known_hosts file wound up increasingly cluttered with multiple entries for what was actually the same host, some of them with the IP address and some without. After canonicalization, all of this goes away; every host has one entry and that's it. Since we already maintain a system-wide set of SSH known hosts (partly for our custom NFS mount authentication system), my own known_hosts file now doesn't even accumulate very many entries.

(I should probably install our global SSH known hosts file even on my workstation, which is deliberately independent from our overall infrastructure; this would let me drastically reduce my known_hosts file there too.)

The other significant reason to like hostname canonicalization is the reason I mentioned in my original entry, which is that it allows me to use much simpler Host matching rules in my ~/.ssh/config while only offering my SSH keys to hosts that should actually accept them (instead of to everyone, which can have various consequences). This seems to have become especially relevant lately, as some of our recently deployed hosts seem to have reduced the number of authentication attempts they'll accept (and each keypair you offer counts as one attempt). And in general I just like having my SSH client configuration say what I actually want, instead of having to flail around with 'Host *' matches and so on because there was no simple way to say 'all of our hosts'. With canonical hostnames, now there is.
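
As an illustration of the sort of configuration this enables, here's a sketch of the relevant ~/.ssh/config pieces (the domain and the identity file are stand-ins, not my real setup):

# turn short names into fully qualified ones before Host matching happens
CanonicalizeHostname yes
CanonicalDomains example.utoronto.ca

# now one simple pattern really does mean 'all of our hosts'
Host *.example.utoronto.ca
    IdentityFile ~/.ssh/ids/ourhosts
    IdentitiesOnly yes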

As far as DNS reliability for resolving CNAMEs goes, we haven't had any DNS problems in the past year (or if we have, I failed to notice them amidst greater problems). We might someday, but in general DNS issues are going to cause me problems no matter what, since my ssh has to look up at least IP addresses in DNS. If it happens I'll do something, but in the mean time I've stopped worrying about the possibility.

SSHCanonHostnamesWin written at 22:07:59; Add Comment

2017-03-29

The work of safely raising our local /etc/group line length limit

My department has now been running our Unix computing environment for a very long time (which has some interesting consequences). When you run a Unix environment over the long term, old historical practices slowly build up and get carried forward from generation to generation of the overall system, because you've probably never restarted everything from complete scratch. All of this is an elaborate route to say that as part of our local password propagation infrastructure, we have a program that checks /etc/passwd and /etc/group to make sure they look good, and this program puts a 512 byte limit on the size of lines in /etc/group. If it finds a group line longer than that, it complains and aborts and you get to fix it.

(Don't ask what our workaround is for groups with large memberships. I'll just say that it raises some philosophical and practical questions about what group membership means.)
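
The core of such a check is nothing sophisticated; it amounts to something like the following bit of awk, although the real program checks a number of other things about /etc/passwd and /etc/group as well:

$ awk 'length($0) > 512 {
         printf "%s line %d is %d bytes long\n", FILENAME, NR, length($0)
         bad = 1
       }
       END { exit bad }' /etc/group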

We would like to remove this limit; it makes our life more complicated in a number of ways, causes problems periodically, and we're pretty sure that it's no longer needed and probably hasn't been needed for years. So we should just take that bit of code out, or at least change the '> 512' to '> 4096', right?

Not so fast, please. We're pretty sure that doing so is harmless, but we're not certain. And we would like to not blow up some part of our local environment by mistake if it turns out that actually there is still something around here that has heartburn on long /etc/group lines. So in order to remove the limit we need to test to make sure everything still works, and one of the things that this has meant is sitting down and trying to think of all of the places in our environment where something could go wrong with a long group line. It's turned out that there were a number of these places:

  • Linux could fail to properly recognize group membership for people in long groups. I rated this as unlikely, since the glibc people are good at avoiding small limits and relatively long group lines are an obvious thing to think about.

  • OmniOS on our fileservers could fail to recognize group membership. Probably unlikely too; the days when people put 512-byte buffers or the like into getgrent() and friends are likely to be long over by now.

    (Hopefully those days were long over by, say, 2000.)

  • Our Samba server might do something special with group handling and so fail to properly deal with a long group, causing it to think that someone wasn't a member or denying them access to a group-protected file.

  • The tools we use to build an Apache format group file from our /etc/group could blow up on long lines. I thought that this was unlikely too; awk and sed and so on generally don't have line length limitations these days.

    (They did in the past, in one form or another, which is probably part of why we had this /etc/group line length check in the first place.)

  • Apache's own group authorization checking could fail on long lines, either completely or just for logins at the end of the line.

  • Even if they handled regular group membership fine, perhaps our OmniOS fileservers would have a problem with NFS permission checks if you were in more than 16 groups and one of your extra groups was a long group, because this case causes the NFS server to do some additional group handling. I thought this was unlikely, since the code should be using standard OmniOS C library routines and I would have verified that those worked already, but given how important NFS permissions are for our users I felt I had to be sure.

(I was already confident that our local tools that dealt with /etc/group would have no problems; for the most part they're written in Python and so don't have any particular line length or field count limitations.)

It's probably worth explicitly testing Linux tools like useradd and groupadd to make sure that they have no problems manipulating group membership in the presence of long /etc/group lines. I can't imagine them failing (just as I didn't expect the C library to have any problems), but that just means it would be really embarrassing if they turned out to have some issue and I hadn't checked.
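
Manufacturing a test case for this is easy enough on a scratch machine; something like the following, where the user and group names are made up and this is obviously not something to do to a production /etc/group:

$ long=$(printf 'user%03d,' $(seq 1 100) | sed 's/,$//')
$ echo "biggroup:x:60042:$long" >>/etc/group
$ getent group biggroup         # does the C library hand the line back intact?
$ gpasswd -a someuser biggroup  # can the standard tools still edit the long line?
$ id someuser                   # and does the membership actually show up?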

All of this goes to show that getting rid of bits of the past can be much more work and hassle than you'd like. And it's not particularly interesting work, either; it's all dotting i's and crossing t's just in case, testing things that you fully expect to just work (and that have just worked so far). But we've got to do this sometime, or we'll spend another decade with /etc/group lines limited to 512 bytes or less.

(System administration life is often not particularly exciting.)

GroupSizeIncreaseWorries written at 01:57:29; Add Comment

2017-03-26

Your exposure from retaining Let's Encrypt account keys

In a comment on my entry on how I think you have lots of Let's Encrypt accounts, Aristotle Pagaltzis asked a good question:

Taking this logic to its logical conclusion: as long as you can arrange to prove your control of a domain under some ACME challenge at any time, should you not immediately delete an account after obtaining a certificate through it?

(Granted – in practice, there is the small matter that deleting accounts appears unimplemented, as per your other entry…)

Let's take the last bit first: for security purposes, it's sufficient to destroy your account's private key. This leaves dangling registration data on Let's Encrypt's servers, but that's not your problem; with your private key destroyed, no one can use your authorized account to get any further certificates.

(If they can, either you or the entire world of cryptography have much bigger problems.)

For the broader issue: yes, in theory it's somewhat more secure to immediately destroy your private key the moment you have successfully obtained a certificate. However, there is a limit to how much security you get this way, because someone with unrestricted access to your machine can get their own authorization for it with an account of their own. If I have root access to your machine and you normally run a Let's Encrypt authorization process from it, I can just use my own client to do the same and get my own authorized account. I can then take the private key off the machine and later use it to get my own certificates for your machine.

(I can also reuse an account I already have and merely pass the authorization check, but in practice I might as well get a new account to go with it.)

The real exposure for existing authorized accounts is when it's easier to get at the account's private key than it is to get unrestricted access to the machine itself. If you keep the key on the machine and only accessible to root, well, I won't say you have no additional exposure at all, but in practice your exposure is probably fairly low; there are a lot of reasonably sensitive secrets that are protected this way and we don't consider it a problem (machine SSH host keys, for example). So in my opinion your real exposure starts going up when you transport the account key off the machine, for example to reuse the same account on multiple machines or over machine reinstalls.

As a compromise you might want to destroy account keys every so often, say once a year or every six months. This limits your long-term exposure to quietly compromised keys while not filling up Let's Encrypt's database with too many accounts.

As a corollary to this and the available Let's Encrypt challenge methods, someone who has compromised your DNS infrastructure can obtain their own Let's Encrypt authorizations (for any account) for any arbitrary host in your domain. If they issue a certificate for it immediately you can detect this through certificate transparency monitoring, but if they sit on their authorization for a while I don't think you can tell. As far as I know, LE provides no way to report on accounts that are authorized for things in your domain (or any domain), so you can't monitor this in advance of certificates being issued.

For some organizations, compromising your DNS infrastructure is about as difficult as getting general root access (this is roughly the case for us). However, for people who use outside DNS providers, such a compromise may only require gaining access to one of your authorized API keys for their services. And if you have some system that allows people to add arbitrary TXT records to your DNS with relatively little access control, congratulations, you now have a pretty big exposure there.

LetsEncryptAccountExposure written at 01:22:08; Add Comment

2017-03-22

Setting the root login's 'full name' to identify the machine that sent email

Yesterday I wrote about making sure you can identify what machine sent you a status email, and in the comments Sotiris Tsimbonis shared a brilliant yet simple solution to this problem:

We change the gecos info for this purpose.

chfn -f "$HOSTNAME root" root

Take it from me; this is beautiful genius (so much so that both we and another group here immediately adopted it). It's so simple yet still extremely effective, because almost everything that sends email does so using programs like mail that will fill out the From: header using the login's GECOS full name from /etc/passwd. You get email that looks like:

From: root@<your-domain> (hebikera root)

This does exactly what we want by immediately showing the machine that the email is from. In fact many mail clients these days will show you only the 'real name' from the From: header by default, not the actual email address (I'm old-fashioned, so I see the traditional full From: header).

This likely works with any mail-sending program that doesn't require completely filled out email headers. It definitely works in the Postfix sendmail cover program for 'sendmail -t' (as well as the CentOS 6 and 7 mailx, which supplies the standard mail command).

(As an obvious corollary, you can also use this trick for any other machine-specific accounts that send email; just give them an appropriate GECOS 'full name' as well.)
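
For instance, for a hypothetical dedicated 'backups' login that mails out reports, that would be:

chfn -f "$HOSTNAME backups" backups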

There are two perhaps obvious cautions here. First, if you ever rename machines you have to remember to re-chfn the root login and any other such logins to have the correct hostname in them. It's probably worth creating an officially documented procedure for renaming machines, since there are other things you'll want to update as well (you might even script it). Second, if you have some sort of password synchronization system, you need it to leave root's GECOS full name alone (although it can update root's password). Fortunately ours already does this.

IdentifyMachineEmailByRootName written at 23:49:10; Add Comment

Making sure you can identify what machine sent you a status email

I wrote before about making sure that system email works, so that machines can do important things like tell you that their RAID array has lost redundancy and you should do something about that. In a comment on that entry, -dsr- brought up an important point, which is that you want to be able to easily tell which machine sent you email.

In an ideal world, everything on every machine that sends out email reports would put the machine's hostname in, say, the Subject: header. This would give you reports like:

Subject: SMART error (FailedOpenDevice) detected on host: urd

In the real world you also get helpful emails like this:

Subject: Health

Device: /dev/sdn [SAT], FAILED SMART self-check. BACK UP DATA NOW!

The only way for us to tell which machine this came from was to look at either the Received: headers or the Message-ID, which is annoying.

There are at least two ways to achieve this. The first approach is what -dsr- said in the comment, which is to make every machine send its email to a unique alias on your system. This unfortunately has at least two limitations. The first is that it somewhat clashes with a true 'null client' setup, where your machines dump absolutely all of their email on the server. A straightforward null client does no local rewriting of email at all, so to get this you need a smarter local mailer (and then you may need per-machine setup, hopefully automated). The second limitation is that there's no guarantee that all of the machine's email will be sent to root (and thus be subject to simple rewriting). It's at least likely, but machines have been known to send status email to all sorts of addresses.

(I'm going to assume that you can arrange for the unique destination alias to be visible in the To: header.)

You can somewhat get around this by doing some of the rewriting on your central mail handler machine (assuming that you can tell the machine email apart from regular user email, which you probably want to do anyways). This needs a relatively sophisticated configuration, but it probably can be done in something like Exim (which has quite powerful rewrite rules).

However, if you're going to do this sort of magic in your central mail handler machine, you might as well do somewhat different magic and alter the Subject: header of such email to include the host name. For instance, you might just add a general rule to your mailer so that all email from root that's going to root will have its Subject: altered to add the sending machine's hostname, eg 'Subject: [$HOSTNAME] ....'. Your central mail handler already knows what machine it received the email from (the information went into the Received header, for example). You could be more selective, for instance if you know that certain machines are problem sources (like the CentOS 7 machine that generated my second example) while others use software that already puts the hostname in (such as the Ubuntu machine that generated my first example).

I'm actually more attracted to the second approach than the first one. Sure, it's a big hammer and a bit crude, but it creates the easy to see marker of the source machine that I want (and it's a change we only have to make to one central machine). I'd feel differently if we routinely got status emails from various machines that we just filed away (in which case the alias-based approach would give us easy per-machine filing), but in practice our machines only email us occasionally and it's always going to be something that goes to our inboxes and probably needs to be dealt with.

IdentifyingStatusEmailSource written at 01:11:32; Add Comment
