My current editor usage (as of mid 2023)
I use three (Unix) editors on a regular basis, and there's a story or two in that and how my editor usage has shifted over time. For me, the big shift has been that vim has become my default editor, the editor I normally use unless there's some special circumstance. One way to put it is that vim has become my editing path of least resistance. This shift isn't something that I would have predicted years ago (back then I definitely didn't particularly like vim), but looking back from today it feels almost inevitable.
Many years ago I wrote Why
vi has become my sysadmin's editor, where I talked about how I used vi a lot as a
sysadmin because it was always there (at the time it really was
more or less vi, not vim, cf). Why vim
has become my default editor is kind of an extension of that. Because
I was using vi(m) all of the time for sysadmin things, learning new
vim tricks (like windows or
multi-file changes) had a high
payoff since I could use them everywhere, any time I was using vim
(and they pulled me into being a vim user, not a vi user). As I improved my vim skills and used it
more and more, vim usage became more and more reflexive and vim was
generally sufficient for the regular editing I wanted to do and
usually the easiest thing to use. Then of course I also learned new
vim tricks as part of regular editing, improving my sysadmin vim
usage as well, all in a virtuous cycle of steadily increasing usage.
My vim setup is almost completely stock, because I work in too many different environments to try to keep a vim configuration in sync across them. If I customized my own vim environment very much, I would lose the virtuous cycle of going back and forth between my vim environment and the various other standard setups where I'm using vim because it's there and it works. I do customize my vim environment slightly, but it's mostly to turn off irritations.
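To give a sense of what I mean by "turn off irritations", a minimal ~/.vimrc in that spirit might look something like this (these particular settings are illustrative examples, not my actual configuration):

```vim
" stay close to stock vim; just disable a few common irritations
set nohlsearch        " don't leave search matches highlighted afterward
set mouse=            " keep the mouse out of it in terminal windows
filetype indent off   " don't second-guess my indentation per filetype
```

The point of keeping it this small is that nothing here changes how editing fundamentally works, so switching to a completely stock vim on another machine costs almost nothing.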
My second most frequently used editor is my patched version of an X11
version of Rob Pike's sam editor. Sam is the editor that
my exmh environment invokes when I use it to reply
to email, and I still read and reply to much of my email in exmh.
In theory it wouldn't be too hard to make exmh use vim instead (in
an xterm); in practice, I like sam and I like still using it here.
However, when I write or reply to email from the command line with
NMH commands, I edit that email in
vim. I sometimes use sam for other editing, but not very often, and
sometimes I'm sad about this shift. I like sam; I just wish I liked
it enough to stubbornly stick with it against the vim juggernaut.
My third most frequently used editor is GNU Emacs. GNU Emacs is what I use if I'm doing something that benefits from a superintelligent editor, and Magit is extremely compelling all by itself, especially with Magit's excellent support for selective commits. Apart from Magit, my major use for GNU Emacs is working with Go or Python, where I've gone through the effort to set up intelligent LSP-based support for them (Go version, Python version), as well as various additional tweaks and hacks (for example). If I had cause to do significant editing of C code, I'd probably also do it in GNU Emacs because I have an existing C auto-indentation setup that I like (I preserved it after blowing up my ancient Emacs setup). I still consider GNU Emacs to be my editor of choice for serious code editing (more or less regardless of the language), for various reasons, but I don't do very much programming these days. If I had to read and trace my way through Rust code, I might try doing it in GNU Emacs just because I have the Rust LSP server installed and I know how to jump around in lsp-mode.
(Today I mostly use GNU Emacs over X, because all of this LSP intelligence really wants graphics and various other things in order to look nice. GNU Emacs in a black and white xterm terminal window is a shadow of itself, at least in my configuration.)
My use of GNU Emacs stems from history. I used to use GNU Emacs a lot, so I built up a great deal of familiarity with editing in it and customizing it (my vim familiarity is much more recent). I use GNU Emacs enough to keep that familiarity alive, so it keeps being a decent editing environment for me. The same is true of my sam usage; there was a time when I used sam much more than I do now and I still retain a lot of the knowledge from then.
I'm sentimentally fond of sam, even if I don't use it broadly; it still feels right when I edit messages in it. I'm not sure I'm fond of either vim or GNU Emacs (any more than I am of most software), but vim has come to feel completely natural and GNU Emacs is an old friend even if I don't see it all that often. I feel no urge to try to make vim replace GNU Emacs by adding plugins, for various reasons including how I feel about doing that with vim (also).
(This expands on a Fediverse post, which was sparked by Jaana Dogan's mention of Acme, which linked to Russ Cox's video tour of Acme.)
How I set up a server for testing new Grafana versions and other things
I mentioned yesterday that you probably should have a server that you can test Grafana upgrades on, and having one is useful for experiments. There are a couple of ways to set up such a server, and as it happens our environment is built in such a way as to make this especially easy. Although our Prometheus server and Grafana run on the same machine and so Grafana could access Prometheus as 'localhost:9090', when I set this up I decided that Grafana should instead access Prometheus through our reverse proxy Apache setup, using the server's public name.
(When I set this up, I think I had ideas of being able to watch for potential Grafana query problems by looking at the Apache log and seeing the queries it was sending to Prometheus. Then Grafana switched from querying Prometheus with GET and query parameters to using POST and a POST body, a change that's better for Grafana but which does limit what we can now get from Apache logs.)
This makes setting up a testing Grafana server basically trivial.
We (I) install a random (virtual) machine, follow the steps to set
up Apache and Grafana as if it were our production metrics server,
and copy the production metrics server's current grafana.db over
the test server's one (while Grafana isn't running). When I restart
Grafana, it will come up with all of our dashboards, talking to our
production Prometheus data source, since it's using the public name. This
gives me a way to directly compare new versions of Grafana against
the production version, including trying to update old panel types
to new panel types and comparing the results.
(In our environment we have few enough changes to the production
grafana.db that I can just copy the file around more or less any
time I want to; I don't need to shut down the production Grafana
to save a safe copy.)
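As a concrete sketch of the procedure (the paths and hostnames here are assumptions; where Grafana keeps its database depends on how you installed it, though /var/lib/grafana/grafana.db is a common default):

```sh
# on the test server: stop Grafana, swap in the production database,
# fix ownership, and restart
systemctl stop grafana-server
scp prod-metrics:/var/lib/grafana/grafana.db /var/lib/grafana/grafana.db
chown grafana:grafana /var/lib/grafana/grafana.db
systemctl start grafana-server
```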
This would still be relatively simple if I'd used 'localhost:9090' as the URL for our Prometheus data source instead of its public URL, since you can change the URL of a data source. I'd just have to remember to modify the copied database that way every time I copied or re-copied it. If our Prometheus server wasn't accessible at all off the machine (either on its own or through a reverse proxy Apache), I would probably enable my testing by ssh'ing from the test server to the production server to port forward 'localhost:9090'. I couldn't leave this running all the time, but there's generally no reason to leave the test Grafana server running unless I'm specifically interested in something.
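The port forwarding version of this would be a single ssh command on the test server, something like (hostname assumed):

```sh
# forward the test server's localhost:9090 to the production Prometheus;
# -N means 'no remote command, just hold the forward open'
ssh -N -L 9090:localhost:9090 prod-metrics
```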
(This is far easier to do with virtual machines than with physical ones, since these days starting up and shutting them down is as simple as 'virsh start <x>' and 'virsh shutdown <x>'.)
PS: Another trick you can play with a Grafana testing server is to
keep multiple copies of your
grafana.db around and swap them in
and out of being the live one depending on what you want to do. For
example, you might have one with a collection of test dashboards
(to look into things like how to display status over time in the
latest Grafana), and another that's a
current copy of your production Grafana's database.
In practice, Grafana has not been great at backward compatibility
We started our Prometheus and Grafana based metrics setup in late 2018. Although many of our Grafana dashboards weren't created immediately, the majority of them were probably built by the middle of 2019. Based on release history, we probably started somewhere around v6.4.0 and had many dashboards done by the time v7.0.0 came out. We're currently frozen on v8.3.11, having tried v8.4.0 and rejected it and all subsequent versions. The reason for this is fairly straightforward; from v8.4.0 onward, Grafana broke too many of our dashboards. The breakage didn't start in 8.4, to be honest. For us, things started to degrade from the change between the 7.x series and 8.0, but 8.4 was the breaking point where too much was off or not working.
(I've done experiments with Grafana v9.0 and onward, and it had more issues than the latest 8.x releases. In one way this isn't too surprising, since it is a new major release.)
I've encountered issues in several areas in Grafana during upgrades. Grafana's handling of null results from Prometheus queries has regressed more than once while we've been using it. Third party panels that we use have been partially degraded or sometimes completely broken (cf). Old panel types sprouted new bugs; new panel types that were supposed to replace them had new bugs, or sometimes lacked important functionality that the old panel types had. Upgrading (especially automatically) from old panel types to their nominally equivalent new panel types didn't always carry over all of your settings (for settings the new panel type supported, which wasn't always all of them).
Grafana is developed and maintained by competent people. That these backward compatibility issues happen anyway tells me that broad backward compatibility is not a priority in Grafana development. This is a perfectly fair thing; the Grafana team is free to pick their priorities (for example, not preserving compatibility for third party panels if they feel the API being used is sub-par and needs to change). But I'm free to quietly react to them, as I have by freezing on 8.3.x, the last release where things worked well enough.
I personally think that Grafana's periodic lack of good backward compatibility is not a great thing. Dashboards are not programs, and I can't imagine that many places want them to be in constant development. I suspect that there are quite a lot of places that want to design and create their dashboards and then have them just keep working until the metrics they draw on change (forcing the dashboards to change to keep up). Having to spend time on dashboards simply to keep them working as they are is not going to leave people enthused, especially if the new version doesn't work as well as the old version.
The corollary of this is that I think you should maintain a testing Grafana server, kept up to date with your primary server's dashboards, where you can apply Grafana updates to test them to see if anything you care about is broken or sufficiently different to cause you problems. You should probably also think about what might happen if you have to either freeze your version of Grafana or significantly rebuild your dashboards to cope with a new version. If you allow lots of people to build their own dashboards, perhaps you want to consider how to reach out to them to get them to test their dashboards or let them know of issues you've found and the potential need to update their dashboards.
(I didn't bother filing bug reports about the Grafana issues that I encountered, because my experience with filing other Grafana issues was that doing so didn't produce results. I'm sure that there are many reasons for this, including that Grafana probably gets a lot of issues filed against it.)
Having metrics for something attracts your attention to it
For reasons beyond the scope of this entry, we didn't collect any metrics from our Ubuntu 18.04 ZFS fileservers (trying to do so early on led to kernel panics). When we upgraded all of them to Ubuntu 22.04, we changed this, putting various host agents on them and collecting a horde of metrics that go into our Prometheus metrics system, some of which automatically appear on our dashboards. One of the results of this is that we've started noticing things about what's happening on our fileservers. For example, at various times, we've noticed significant NFS read volume, significant NFS RPC counts, visible load averages, and specific moments when the ZFS ARC has shrunk. Noticing these things has led us to investigate some of them and pushed me to put together tools to make this easier.
What we haven't seen is any indication that these things we're now noticing are causing issues on our NFS clients (ie, our normal Ubuntu servers), or that they're at all unusual. Right now, my best guess is that everything we're seeing now has been quietly going on for some time. Every so often for years, people have run jobs on our SLURM cluster that repeatedly read a lot of data over NFS, and other people have run things that scan directories a lot, and I know our ZFS ARC size has been bouncing around for a long time. Instead, what we're seeing is that metrics attract attention, at least when they're new.
This isn't necessarily a bad thing, as long as we don't over-react. Before we had these metrics we probably had very little idea what was a normal operating state for our fileservers, so if we'd had to look at them during a problem we'd have had much less idea what was normal and what was exceptional. Now we're learning more, and in a while the various things these metrics are telling us probably won't be surprising news (and to a certain extent that's already happening).
This is in theory not a new idea for me, but it's one thing to know it intellectually and another thing to experience it as new metrics appear and I start digging into them and what they expose. It's at least been a while since I went through this experience, and this time around is a useful reminder.
(This is related to the idea that having metrics for something can be dangerous and also that dashboards can be overly attractive. Have I maybe spent a bit too much time fiddling with ZFS ARC metrics when our ARC sizes don't really matter because our ARC hit rates are high? Possibly.)
PS: Technically what attracts attention is being able to readily see those metrics, not the metrics themselves. We collect huge piles of metrics that draw no attention at all because they go straight into the Prometheus database and never get visualized on any dashboards. But that's a detail, so let's pretend that we collect metrics because we're going to use them instead of because they're there by default.
I've mostly stopped reading technical mailing lists
Once upon a time, I subscribed to and read a lot of mailing lists for the software we used, things like Exim, Dovecot, Prometheus, ZFS, OmniOS, and so on. These days I've mostly stopped, which is a shift that makes me a bit sad. Technically I'm still subscribed to some of these mailing lists, but in practice all their incoming messages get filed away to a folder and periodically I throw out the accumulated contents and start again. These days I only read one of these mailing lists on the rare occasion when I'm asking a question on it.
Some of this is because my interests and available time have shifted. It no longer feels so interesting and compelling to keep up with these mailing lists, so I don't; usually I feel I have better things to do than to slog through any of them. But that 'slog through' phrasing is a clue to another reason, where reading some of these lists doesn't feel as productive or engaging as it used to. It's not all of the mailing lists; some of them simply have more volume than I have time and enthusiasm for, with not enough important information. But with some mailing lists, it feels like the character of them has changed over time to one that is significantly less useful to me.
(To some extent I expect all technical mailing lists to get less useful for me over time. In the beginning I'm relatively new to the software and so I can learn a lot from listening in on ongoing discussions and hearing routine questions and answers. As I become more experienced in whatever it is, more and more of the messages are about things I already know and the nuggets of new information are increasingly rare.)
The primary change I've noticed in these mailing lists is that they see a lot more questions that are either basic or very specific, where if I had the question I would have expected to answer it myself by reading through the documentation. In the beginning I had unkind descriptions of these sorts of questions, but I've come to be more sympathetic to them, especially the questions that come from people abroad who may not have English as their first language. The unfortunate fact is that projects aren't necessarily well documented and their documentation is probably dauntingly hard to read for people who aren't fluent in technical English, and people have work to get done (using those projects). Turning to the project user mailing list and asking their questions, if it works, is probably much faster than the alternatives (and their boss may be yelling at them to get it done ASAP).
This is, unfortunately, where the differences between mailing lists and forums become pointedly relevant to my experiences with mailing lists. My current mail reading environment is not well suited to situations where a large amount of the volume is uninteresting to me and I'm picking through it for the small number of interesting discussions of interesting issues. In a forum, I could ignore entire topic areas and most of the threads, focusing only on some specific ones of interest; in my current mailing list environment, I drown and stop reading.
I could likely manage to do better with various alternative ways of following these mailing lists. But although I kind of miss being the sort of person who reads mailing lists for software I use (and tries to write helpful responses), the amount of value I might get out of doing this has so far felt too low to motivate me to do anything. Just not reading things is easier and I'm probably not missing much.
(Pragmatically, some software I use has forums instead of mailing lists and I don't try to follow those either. As much as I would like it to be otherwise, I think my interests and enthusiasms have shifted significantly from the past. Am I burned out on the whole 'mailing list/forum' thing? Maybe.)
A crontab related mistake you can make with internal email ratelimits
Due to past painful experiences, we've given our email system a collection of internal ratelimits on various things, such as how much email a single machine can send at a time. When the ratelimit is hit, Exim will temporarily reject the email with a SMTP 4xx series error, so that (in theory) no email will actually be lost, only delayed (and someone who's caused their machine to suddenly send them thousands of email messages has a chance to fix it before being overwhelmed). When I set up these ratelimits in Exim, I set them to what seemed to be a perfectly reasonable limit of '60 messages in 60 minutes', which averages to one message every minute while allowing for bursts of sending activity (this is a tradeoff you make with the ratelimit duration). Today, I discovered that this is a little bit of a mistake and that we actually want to set our ratelimits a bit higher than one email message a minute.
Suppose, not entirely hypothetically, that something has a crontab job that runs once a minute and that has started to generate output every time it runs, which cron will email to the crontab's owner (or MAILTO setting). This means that this machine is now running right at the edge of its sending ratelimit; if it generates even one more email message (for example from some other cron job that runs once a day to notify you about pending package updates on that machine), it will hit the ratelimit and have an email message stalled. Once even a single message stalls, this machine will never recover and will always have something in its local mail queue. If it sends a second extra email, you'll wind up with the local mail queue always having two things waiting, and so on.
(These may or may not wait for very long, depending on how the machine's local mailer behaves.)
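The arithmetic of this can be sketched with a toy simulation. The model below is a simplification of Exim's actual exponentially-weighted ratelimit algorithm in its default 'leaky' mode (deferred messages don't update the measured rate), not a faithful reimplementation, but it shows the behavior: a machine sending exactly one message a minute plus a single extra message ends up with a permanent one-message backlog at a '60 / 60m' limit, while the same traffic drains fine at '70 / 60m'.

```python
import math

PERIOD = 60.0  # minutes; we're modelling an Exim-style '<limit> / 60m' limit

def simulate(limit, minutes=600, extra_at=300):
    """Toy model of a 'leaky' ratelimit: the smoothed per-period sending
    rate is only updated when a message is accepted, and a message is
    deferred (stays in the local queue) if accepting it would push the
    rate over the limit.  One cron message arrives every minute, plus a
    single extra message at minute extra_at.  Returns the final queue
    length."""
    rate = 0.0
    last = None   # time of the last accepted message
    queue = []
    for t in range(minutes):
        queue.append(t)            # the once-a-minute cron mail
        if t == extra_at:
            queue.append(t)        # one extra message, just once
        while queue:
            interval = t - last if last is not None else PERIOD
            interval = max(interval, 1e-3)  # two messages in the same minute
            a = math.exp(-interval / PERIOD)
            new_rate = a * rate + (1 - a) * (PERIOD / interval)
            if new_rate > limit:
                break              # deferred with a 4xx; retried next minute
            rate, last = new_rate, t
            queue.pop(0)
    return len(queue)
```

In this model, simulate(60) ends with one message permanently stuck in the queue (the steady stream keeps the measured rate pinned just under the limit, so the retried extra message never fits), while simulate(70) ends with an empty queue.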
In our environment, we want our machines to clear their local mail queues unless there's been an explosion. In a relatively 'normal' situation like this, we'd prefer that the email get delivered rather than have potentially random email get delayed on the machine. As a result, we've discovered that we need to raise our ratelimits so that they're a bit above one message a minute on average. For now, we're using 70 messages in 60 minutes ('70 / 60m' in Exim's format for ratelimits).
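In Exim this kind of limit lives in an ACL; a sketch of the shape of the condition (the surrounding ACL and the message wording here are made up, but the ratelimit syntax is Exim's):

```text
# defer (SMTP 4xx) anything over 70 messages per sending host per hour
defer message   = sending too quickly, please retry later
      ratelimit = 70 / 60m / per_mail / $sender_host_address
```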
(This elaborates on a Fediverse post.)
I can't recommend serious use of an all-in-one local Grafana Loki setup
Grafana Loki is often (self-)described as 'Prometheus for logs'. Like Prometheus, it theoretically has a simple all-in-one local installation mode of operation (which is a type of monolithic deployment mode), where you install the Loki server binary, point it at some local disk space, and run Promtail to feed your system logs (ie, the systemd journal) into Loki. This is what we do, to supplement our central syslog server. Although you might wonder why you'd have two different centralized log collection systems, I've found that there are things I like using Grafana Loki for.
However, I can no longer recommend running such an all-in-one Grafana Loki setup for anything serious, including what you might call 'production', and I think you should be wary about attempting to run Grafana Loki yourself in any configuration.
The large scale reason I say this is that most available evidence is that Grafana Inc, the developers of Loki, are simply not very interested in supporting this usage case or possibly any usage case involving other people running Loki. Unlike Prometheus, where the local usage case is considered real and is how many people operate Prometheus (us included), the Loki 'local usage' comes across as a teaser to convince you of Loki's general virtues, and ingesting systemd logs through Promtail is merely the most convenient way to get a bunch of logs (you can even get them in JSON format, although you probably shouldn't in real usage).
If you do try to operate Grafana Loki in this all-in-one configuration (and perhaps in other ones), you'll likely run into an ongoing series of issues. In general I've found the Loki documentation to be frustratingly brief in important areas such as what all of the configuration file settings mean and what the implications of setting them to various values are. The documentation's example configuration for promtail reading systemd logs is actively dangerous due to cardinality issues in systemd labels, and while Loki is called 'Prometheus for logs' it differs from Prometheus in critical ways that can force you to destroy your accumulated log data. The documentation will not tell you about this.
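The cardinality danger comes from promtail turning journal fields into Loki labels. The safer shape is to keep only one or two low-cardinality fields as labels, something like the following (a sketch based on the common relabeling approach, not a vetted production configuration):

```yaml
scrape_configs:
  - job_name: journal
    journal:
      labels:
        job: systemd-journal
    relabel_configs:
      # keep the systemd unit as a label; leave high-cardinality journal
      # fields (PIDs, session IDs, and so on) out of the label set entirely
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
```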
Even if you do everything as right as you can, things may well still go wrong. Grafana Inc shipped a Linux x86 promtail 2.8.0 binary that didn't read the systemd journal, which is one of the (nominal) headline features of promtail on its dominant platform. An attempt to upgrade our Loki 2.7.4 to 2.8.1 failed badly and could not be reverted, forcing us to delete our entire accumulated log data for the second time in a few months (after the first time). Worse, I feel that diagnosing and thus fixing this issue would have been all but impossible within a reasonable time because Loki simply didn't log enough useful information for a system administrator. When the only reported 'error', to the extent that there is one, is 'empty ring', there is both a specific problem (what 'ring' out of several and how do you make it non-empty given that you're running in a monolith and don't have rings as such) and a deep-seated problem.
The deep seated problem is that Loki doesn't feel like it's been built to be operable by people who don't know its code and its internal details. If you are a Loki specialist who understands everything there is about Loki, perhaps you can diagnose '"empty ring" as the response to everything'. But if you're running Loki in the all-in-one filesystem setup as a busy system administrator, you probably aren't such a specialist and never will be. Loki doesn't feel like it's built to be run in production by you and me, not safely and reliably, and I don't expect Grafana Inc to ever change that.
We will probably keep running Grafana Loki, because it's already there, I derive some value from it, it's been integrated a bit into our Grafana dashboards, and since we already have our central syslog server I can live with periodically throwing away the accumulated log data and starting over from scratch, although I don't like it. But if I ever leave I'll advise my co-workers to rip out all of the Loki and Promtail infrastructure, which is also my plan if dealing with it becomes too time-consuming and irritating. If I'd known back when I started to set up Loki what I know now, I'm not sure I'd have bothered.
(This elaborates on some Fediverse posts.)
PS: Loki also has some containerized multi-component run-it-yourself example setups. I don't have any experience with them so I have no idea if they're better supported and more reliable in practice than the all-in-one version (which isn't particularly reliable, as we've seen). A container based setup ingesting custom application logs with low label cardinality and storing the actual logs in the cloud instead of the filesystem may be a much better place to be for using Loki in practice than 'all in one systemd journal ingestion to the filesystem'. Certainly it's closer to how Grafana Inc probably runs Loki in their 'Grafana Cloud' service, and the VC-funded Grafana Inc certainly wants you to use Grafana Cloud instead of wrestling with Loki configuration and operation.
Thinking about our passive exposure to IPv6 issues
Over on the Fediverse, I recently saw a thought-provoking remark by Michael Gebis (via):
If I were an evil threat actor, I'd be learning as much about #ipv6 as possible right now. I'm convinced that many companies that say they "aren't using" IPv6 are in reality just ignoring IPv6, and it would be easy to set up a "shadow network" consisting of IPv6 traffic where you could get away with murder. Nobody at the company is logging IPv6 traffic and events, none of the tools are configured to monitor it, and a large majority of the staff knows nothing about it.
"But we disable IPv6!"
Really? On your users mobile devices? On printers? On random IoT devices? And most of all, on your remote user's networks? Good luck, my guy.
This got me thinking about what I'll call our passive exposure to IPv6. By that I mean whatever exposure we get from having IPv6 enabled on random devices, and perhaps automatically configured by systems or deliberately set up by a threat actor. For example, could a threat actor use IPv6 to bypass firewall protections or pass traffic in ways that would be invisible to our normal monitoring?
So far, we don't have IPv6 set
up on any of our networks, and as a result we have no deliberate
IPv6 network to network routing (or public IPv6 address assignment
through any of DHCP6, SLAAC,
or static assignment). However, as is the way of the world these
days almost all of our servers have IPv6 enabled and they've assigned
themselves fe80::/10 link local IPv6 addresses.
It's likely that people's desktops, laptops, and mobile devices
have done similarly on the networks where they live. When we look
at general network traffic with tools like
tcpdump, we do see a
certain amount of IPv6 traffic trying to happen (using fe80::/10
link local addresses from what I've seen).
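One way to see whether a Linux machine has quietly given itself such addresses is to look at /proc/net/if_inet6, where the scope field value of 0x20 marks link-local addresses. Here's a small sketch of parsing it (the field layout is the documented format of that file, but treat the code as illustrative):

```python
def link_local_interfaces(if_inet6_lines):
    """Given the lines of /proc/net/if_inet6, return (interface, address)
    pairs for link-local (fe80::/10) IPv6 addresses.  Each line has the
    fields: address ifindex prefixlen scope flags interface-name."""
    found = []
    for line in if_inet6_lines:
        fields = line.split()
        if len(fields) != 6:
            continue
        addr, _ifindex, _plen, scope, _flags, ifname = fields
        if int(scope, 16) == 0x20:   # link-local scope
            found.append((ifname, addr))
    return found

# On a real machine you'd feed it the live file:
#   with open("/proc/net/if_inet6") as f:
#       print(link_local_interfaces(f))
```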
All of our firewalls explicitly block IPv6 traffic, and our core router doesn't have any IPv6 routes (because we have no IPv6 networks). A threat actor that wanted to use plain IPv6 across network boundaries would have to compromise one or more firewall rule sets, and getting plain IPv6 routing to the outside world would be quite complex. However, because we don't have IPv6 active on our own networks, we probably wouldn't notice if a threat actor set up a DHCP6 or SLAAC server that advertised itself as an IPv6 gateway to the local network. The attacker would have to tunnel the traffic in and out through IPv4, but there are various options for that that we probably wouldn't notice.
(Our core router might automatically start routing IPv6 traffic between our public subnets if those subnets started advertising appropriate IPv6 bits, but I don't know. However, those subnets can already reach each other with IPv4, and we're not doing traffic monitoring on the core router.)
Machines on internal networks that don't have port isolation can talk to each other with IPv6 (using link local addresses or addresses handed out by a threat actor), but then they can already talk to each other with IPv4. Machines on port isolated internal networks should still be port isolated from each other even if they use IPv6, but I admit we haven't tested the behavior of our switches to be completely sure of that.
(Port isolation is theoretically an Ethernet layer thing, not an IP layer one, but switches are extra smart these days so one is never entirely sure.)
We mostly don't use IP filtering on servers themselves, only on firewalls, so we don't have to worry about server IP filtering not being set up for IPv6 (an issue that I once had on my home machine). Where we do have some IP filtering, it's in the form of a default deny with a restricted positive allowance (for example, systemd's IP access controls or Apache access controls); this means that using an IPv6 address to access the service instead of an IPv4 address wouldn't bypass access restrictions; you specifically have to have the right IPv4 IP address.
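As an illustration of why this shape of filtering holds up, here's the kind of systemd default-deny plus IPv4 allow list I mean (the addresses and the service name are placeholders). Because IPAddressDeny=any covers IPv6 as well as IPv4, an allow list that only names IPv4 ranges doesn't silently leave an IPv6 door open:

```text
# drop-in for some service, eg /etc/systemd/system/foo.service.d/ip.conf
[Service]
IPAddressDeny=any
IPAddressAllow=127.0.0.0/8 192.0.2.0/24
```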
We handle our DNS ourselves, so compromising our DNS infrastructure to add IPv6 addresses for some of our services would already be a pretty bad compromise by itself. We're no more or less likely to notice the addition of IPv6 records than we are to notice any other DNS change that doesn't visibly and clearly degrade services. I can imagine that the addition of IPv6 records might be able to do some damage because it might let an attacker get TLS certificates for some of our names and intercept the traffic of outside people on networks with IPv6 support.
(To avoid people noticing the service not working, the attacker could then send the traffic to the original server via IPv4.)
So on the whole I think we have a low passive exposure to IPv6 issues. A lot of this comes down to the long ago decision to explicitly block IPv6 traffic on all of our firewalls, even though we didn't (and still don't) have any IPv6 traffic.
A Prometheus Alertmanager alert grouping conundrum
We have various host-related alerts in our Prometheus and Alertmanager setup. Some of those are about things on the host not being right (full disks, for example, or network interfaces not being at the right speed), but some of them are alerts that fire if the host is down; for example there are alerts on ping failures, SSH connection failures, and the Prometheus host agent not responding. Unsurprisingly, we reboot our machines every so often and we don't like to get spammed with spurious alerts, so in our Alertmanager configuration we delay those alerts a bit so that they won't send us an alert if the machine is just rebooting. This looks like:
  - match_re:
      alertname: 'NoPing|NoSSH|DownAgent'
    group_wait: 6m
    group_interval: 3m
We don't want to delay when these alerts start firing in Prometheus by giving them a long 'for:' delay; we want the Prometheus version of their state to reflect reality as we consider it. We also have some machines that are sufficiently important and sufficiently rarely rebooted that we don't wait this six minutes but instead alert almost immediately, such as our ZFS fileservers.
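As a sketch of the Prometheus side of this (the exact expression is an assumption; a blackbox exporter style probe is one common way to do it), the rule itself keeps only a short 'for:' so that its state tracks reality, with the reboot grace period living entirely in Alertmanager:

```yaml
groups:
  - name: host-down
    rules:
      - alert: NoPing
        # Assumed metric from a blackbox exporter ICMP probe.
        expr: probe_success{job="ping"} == 0
        # Keep 'for:' short; the six minute reboot grace comes from
        # Alertmanager's group_wait, not from this delay.
        for: 1m
```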
However, this creates a little conundrum where that alert matching up there is actually a lie. We have a number of other alerts that will fire on some hosts if the host is down, for example if the host runs an additional agent. If we don't put these alerts in the alert matching, Alertmanager groups them separately and we get two separate alerts if the host genuinely goes down, one from this grouping of 'host has probably rebooted' alerts and one from the default grouping of other per-host alerts. This is an easy thing to overlook when creating new alerts; generally I find such alerts when such a host goes down and we get more alert messages than we should.
(Once we have more than a few of these 'host could be rebooting' alerts, it might be better to set a special label on all of them and then match on the label in Alertmanager. However, it becomes less immediately visible what all of the alerts are.)
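A sketch of that label approach might look like this (the 'host_down' label name is my invention):

```yaml
# Prometheus rule: mark each 'host may just be rebooting' alert
# with the marker label.
- alert: NoSSH
  expr: probe_success{job="ssh"} == 0
  for: 1m
  labels:
    host_down: "yes"

# Alertmanager route: match on the marker label instead of listing
# every alert name in an ever-growing regular expression.
- match:
    host_down: "yes"
  group_wait: 6m
  group_interval: 3m
```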
However, just adding these extra alerts to the alertname match has a more subtle trap that can still cause us to get extra alerts, and that is alert activation time. If an additional alert is sufficiently slow to trigger (which isn't uncommon for alerts such as ones about additional agents being down), it will miss the six minute group wait interval, not be included in the initial alert sent to us about the host being down, and will be added in the next cycle of alert notices, giving us two alert notifications when a host is down. This too is easy to overlook, although once I realized it I added a comment to the Alertmanager stanza above about it, so I have a better chance of avoiding it in the future.
I could switch to having Alertmanager inhibit various extra alerts if a host is down, but I'm not sure that's the right approach. We do sort of want to know what other things we're missing if a host goes down, although at the same time some things are irrelevant (eg, that additional host-specific exporters are down). One tricky bit about this is that you can't make inhibitions depend on multiple alerts all being active, so I'd probably need to have Prometheus trigger a synthetic 'host is down' alert if all of the conditions are true.
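A sketch of what that could look like (the alert names, the 'alert_class' label, and the exact expressions are all assumptions, including that our probes share a common 'host' label):

```yaml
# Prometheus: a synthetic alert that only fires when all of the
# individual 'down' indicators agree about a host.
- alert: HostDown
  expr: |
    (probe_success{job="ping"} == 0)
      and on(host) (probe_success{job="ssh"} == 0)
      and on(host) (up{job="node"} == 0)
  for: 1m

# Alertmanager: while HostDown is firing for a host, suppress the
# extra per-service alerts carrying the same 'host' label.
inhibit_rules:
  - source_match:
      alertname: 'HostDown'
    target_match:
      alert_class: 'extra-agent'
    equal: ['host']
```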
One way to look at this situation is that it's happening because you can't have Alertmanager conditionally group alerts together; alert groupings are static things. This makes perfect sense and is a lot easier to implement (and it avoids all sorts of corner cases), but sometimes it means that alert grouping gets in the way of alert aggregation.
Automated status tests need to have little or no 'noise'
In a comment on my entry on how you should automate some basic restore testing of your backups, Simon made a perfectly reasonable suggestion:
Another relatively basic, but useful, check is to do some plausibility check on the backup size. For most systems huge jumps in backup size (in either direction) likely mean you are not backing up what you are thinking. This is a bit more complicated than what you mention in your article, since it needs some tuning and can generate false positives. But I think it still can be a valuable check that in most cases won't be too hard to implement.
I've come to believe that automated system status tests, and automated system things in general, need to have little or no 'noise'. By noise I mean errors, warnings, alerts, or messages that happen when there isn't actually any problem. The fundamental reason for this comes down to the problem of false positives with large numbers.
Unless you're in a bad situation, your systems are almost always working; your backups are happening properly, your machines are up, your disks aren't dangerously full, and so on. Actual failures are a small percentage of the time. This means that even with a very low false positive rate, almost every time your tests raise some sort of alert they're giving you noise, a false alert. This gives you the security alert problem; it will be very easy to get habituated to ignoring or downplaying the warning messages. The very rare occasions when they're warning you about a real problem will be drowned out by the noise of non-problems.
As a system administrator, it can feel morally wrong to not send out a warning if we detect something potentially broken. But here again, the perfect is the enemy of the good. It's better to reliably generate warnings that people will notice and heed when something is definitely broken, even if this doesn't send warnings in some situations when you can't be sure.
This isn't a new thing and this isn't unique to system administrators. Programmers have their own version of this for linters, compiler warnings, dependency monitoring, tests (unit and otherwise), and so on. For all of these, programmers have lots of painful experience saying that noisy things are often not worth it because they'll hide actual problems.
(If you get 200 compiler warnings today, you'll probably not spot the ten critical ones, or notice that tomorrow you have 202 compiler warnings.)