Wandering Thoughts archives

2019-11-17

It's good to make sure you have notifications of things

In the course of writing yesterday's entry on the operational differences between notifications and logs, I wound up having a realization that is obvious in retrospect: not having notifications for things is at the root of a lot of sysadmin horror stories. We've all heard the stories of people who lost hardware RAID arrays because disks failed silently, for example; that's a missing notification (either because there was nothing at all or because the failure information only went to logs). Logs are useful to tell you what's happening, but notifications are critical to tell you that there's something you need to look at and probably deal with.

The corollary of this for me is that when I set up a new system (or upgrade to one, as with Certbot), I should check to make sure that any necessary notifications are being generated for it. Sometimes this is an obvious part of setting up a new service, as it was for Prometheus, but sometimes it's easy to let things drop through the cracks, either because I just assume it's going to work without actually checking or because there's no obvious way to do it. Making this an actual checklist item for setting up new things will hopefully reduce the incidents of surprises.

(We may decide that something doesn't need explicit checks and notifications for various reasons, but if so at least we'll have actively considered it.)

I think of alerts as one form of notification, or alternately one way of generating notifications, but not the only form or source. Email from cron about a cron job failing is a notification, but probably not an alert. Nor do notifications necessarily have to directly go to you and bother you. We have a daily cron job on our Ubuntu machines that sends us email about new pending Ubuntu package updates, but we don't actually read that email; we use the presence of that email from one or more of our machines as a sign that we should run our 'update all of our Ubuntu machines' script in the morning.

(It may be easiest or most useful to generate an alert as your notification, or you may want to generate the notification in another way. For Certbot, we could generate an alert but because of how Prometheus and so on work, the alert would have relatively little information. With an email-based notification that comes directly from the machine, we can include what is hopefully the actual error being reported by Certbot, which hopefully shortens the investigation by a step.)

sysadmin/CheckForNotificationsWorking written at 23:36:38; Add Comment

The operational differences between notifications and logs

In a comment on my entry on how systemd timer units hide errors, rlaager raised an interesting issue:

The emphasis on emails feels like status quo bias, though. Imagine the situation was reversed: that everything was using systemd timers and then someone wrote cron and people started switching to that. In that case, there is a similar operational change. You'd switch from having a centralized status (e.g. systemctl list-units --failed) and centralized logging (the journal, which also defaults to forwarding to syslog) to crond sending emails. Is that an improvement, a step backwards, neither or both?

My answer is 'both', because in their normal state, emails from cron are fundamentally different from systemd journal entries. Emails from cron are notifications, while log entries of all sorts are, well, logs. A switch from notifications to logs or vice versa is a deep switch with real operational impacts because you get different things from each of them.

Logs give you a history. You can look back through your logs to see what happened when, and with merged logs (or just multiple logs) you can try to correlated this with other things happening at the time. Notifications let you know that something happened (or is happening, but cron only sends email when the cron job finishes), but they don't provide history unless you capture and save each separate email (in order).

(You can create one from the other with additional work, of course. With notifications, you save the notifications in a log, and with logs you have something watch the logs and send you notifications. But you have to go to that additional work, and if you don't do it you're going to miss something.)

On an operational level, switching from one to the other is potentially dangerous because in each case you lose something that you were probably counting on. If you move from a system that gives you notifications (such as cron jobs sending email on failure) to one that gives you logs (such as systemd timer units logging their failures to the journal), you lose the notifications that you're expecting and that you're using to discover problems. If you move from logs to notifications, you lose history and you may get spammed with notifications that you don't actually care about. And of course the most dangerous switches are the ones where you don't realize that you're actually switching (or that the software you use has quietly switched for you, for example by moving from cron jobs to systemd timer units).

(You may also have built your systems differently in the first place. In a log-based world, it's perfectly sensible to have things emit a lot of messages (and then to drive notifications from a subset of them, if you do). If you move to a world where emitting messages triggers notifications, suddenly you will be getting a lot of notifications that you don't want.)

sysadmin/NotificationsVersusLogs written at 00:43:44; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.