It's good to make sure you have notifications of things

November 17, 2019

In the course of writing yesterday's entry on the operational differences between notifications and logs, I wound up having a realization that is obvious in retrospect: not having notifications for things is at the root of a lot of sysadmin horror stories. We've all heard the stories of people who lost hardware RAID arrays because disks failed silently, for example; that's a missing notification (either because there was nothing at all or because the failure information only went to logs). Logs are useful to tell you what's happening, but notifications are critical to tell you that there's something you need to look at and probably deal with.

The corollary of this for me is that when I set up a new system (or upgrade to one, as with Certbot), I should check to make sure that any necessary notifications are being generated for it. Sometimes this is an obvious part of setting up a new service, as it was for Prometheus, but sometimes it's easy to let things drop through the cracks, either because I just assume it's going to work without actually checking or because there's no obvious way to do it. Making this an actual checklist item for setting up new things will hopefully reduce the incidence of surprises.

(We may decide that something doesn't need explicit checks and notifications for various reasons, but if so at least we'll have actively considered it.)

I think of alerts as one form of notification, or alternately one way of generating notifications, but not the only form or source. Email from cron about a cron job failing is a notification, but probably not an alert. Nor do notifications necessarily have to directly go to you and bother you. We have a daily cron job on our Ubuntu machines that sends us email about new pending Ubuntu package updates, but we don't actually read that email; we use the presence of that email from one or more of our machines as a sign that we should run our 'update all of our Ubuntu machines' script in the morning.
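(As an illustration of this sort of low-key notification, here is a minimal sketch of a 'pending updates' cron job in Python. It is not our actual job. It assumes that cron mails any output a job produces to the crontab's owner or MAILTO address, and that 'apt-get -s upgrade' marks each package it would upgrade with a line starting with 'Inst'. The names pending_packages and main are made up for the example.)

```python
import subprocess
from typing import List


def pending_packages(simulation_output: str) -> List[str]:
    """Pull package names out of the output of 'apt-get -s upgrade'.

    Assumption: each package that would be upgraded shows up as a line
    of the form 'Inst <package> [old version] (new version ...)'.
    """
    names = []
    for line in simulation_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "Inst":
            names.append(fields[1])
    return names


def main() -> None:
    # '-s' only simulates the upgrade; nothing is installed.
    result = subprocess.run(
        ["apt-get", "-s", "upgrade"],
        capture_output=True,
        text=True,
        check=True,
    )
    packages = pending_packages(result.stdout)
    # Stay silent when there is nothing pending, so that cron sends no
    # email; the mere presence of an email is the notification.
    if packages:
        print(f"{len(packages)} pending package update(s):")
        for name in packages:
            print(f"  {name}")
```

A script run daily from cron would call main(). Whether anyone reads the package list is beside the point; the email showing up at all is the signal.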

(It may be easiest or most useful to generate an alert as your notification, or you may want to generate the notification in another way. For Certbot, we could generate an alert but because of how Prometheus and so on work, the alert would have relatively little information. With an email-based notification that comes directly from the machine, we can include what is hopefully the actual error being reported by Certbot, which should shorten the investigation by a step.)
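(A minimal sketch of the email-based approach, in Python: run a command and, only if it fails, build an email that carries its error output. This is an illustration, not our actual setup. The function names and addresses are made up, and sending assumes a mail server is listening on localhost.)

```python
import smtplib
import subprocess
from email.message import EmailMessage
from typing import Optional, Sequence


def failure_notification(
    cmd: Sequence[str], sender: str, recipient: str
) -> Optional[EmailMessage]:
    """Run cmd; return an email describing the failure, or None on success."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return None

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"FAILED (exit {result.returncode}): {' '.join(cmd)}"
    # Include the command's own output so the reader sees the actual
    # error without having to log in and dig through logs first.
    msg.set_content(
        "Standard error:\n"
        + result.stderr
        + "\nStandard output:\n"
        + result.stdout
    )
    return msg


def send_via_local_mta(msg: EmailMessage) -> None:
    # Assumption: an MTA accepts mail on localhost port 25.
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```

For Certbot, cmd might be something like ["certbot", "renew"], with the result passed to send_via_local_mta() whenever it is not None.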


Last modified: Sun Nov 17 23:36:38 2019