2024-11-28
My life has been improved by my quiet Prometheus alert status monitor
I recently created a setup to provide a backup for our email-based Prometheus alerts; the basic result is that if our current Prometheus alerts change, a window with a brief summary of current alerts will appear out of the way on my (X) desktop. Our alerts are delivered through email, and when I set up this system I imagined it as a backup, in case email delivery had problems that stopped me from seeing alerts. I didn't entirely realize that in the process, I'd created a simple, terse alert status monitor and summary display.
(This wasn't entirely a given. I could have done something more clever when the status of alerts changed, like only displaying new alerts or alerts that had been resolved. Redisplaying everything was just the easiest approach that minimized maintaining and checking state.)
After using my new setup for several days, I've ended up feeling that I'm more aware of our general status on an ongoing and global basis than I was before. Being more on top of things this way is a reassuring feeling in general. I know I'm not going to accidentally miss something or overlook something that's still ongoing, and I get early warning of situations before they trigger actual emails. To put it in trendy jargon, I feel like I have more situational awareness. At the same time this is a passive and unintrusive thing that I don't have to pay attention to if I'm busy (or pay much attention to in general, because it's easy to scan).
Part of this comes from how my new setup doesn't require me to do anything or remember to check anything, but does just enough to catch my eye if the alert situation is changing. Part of this comes from how it puts information about all current alerts into one spot, in a terse form that's easy to scan in the usual case. We have Grafana dashboards that present the same information (and a lot more), but it's more spread out (partly because I was able to do some relatively complex transformations and summarizations in my code).
My primary source for real alerts is still our email messages about alerts, which have gone through additional Alertmanager processing and which carry much more information than is in my terse monitor (in several ways, including explicitly noting resolved alerts). But our email is in a sense optimized for notification, not for giving me a clear picture of the current status, especially since we normally group alert notifications on a per-host basis.
(This is part of what makes having this status monitor nice; it's an alternate view of alerts from the email message view.)
2024-11-22
My new solution for quiet monitoring of our Prometheus alerts
Our Prometheus setup delivers all alert messages through email, because we do everything through email (as a first approximation). As we saw yesterday, doing everything through email has problems when your central email server isn't responding; Prometheus raised alerts about the problems but couldn't deliver them via email, because the core system needed to deliver that email was the thing having problems. Today, I built myself a little X-based system to get around that, using the same approach as my non-interrupting notification of new email.
At a high level, what I now have is an xlbiff-based notification of our current Prometheus alerts. If there are no alerts, everything is quiet. If new alerts appear, xlbiff will pop up a text window over in the corner of my screen with a summary of which hosts have which alerts; I can click the window to dismiss it. If the current set of alerts changes, xlbiff will re-display the alerts. I currently have xlbiff set to check the alerts every 45 seconds, and I may lengthen that at some point.
(The current frequent checking is because of what started all of this; if there are problems with our email alert notifications, I want to know about it pretty promptly.)
The work of fetching, checking, and formatting alerts is done by a Python program I wrote. To get the alerts, I directly query our Prometheus server rather than talking to Alertmanager; as a side effect, this lets me see pending alerts as well (although then I have to have the Python program ignore a bunch of pending alerts that are too flaky). I don't try to do the ignoring with clever PromQL queries; instead the Python program gets everything and does the filtering itself.
Pulling the current alerts directly from Prometheus means that I can't readily access the explanatory text we add as annotations (and that then appears in our alert notification emails), but for the purposes of a simple notification that these alerts exist, the name of the alert or other information from the labels is good enough. This isn't intended to give me full details about the alerts, just to let me know what's out there. Most of the time I'll get email about the alert (or alerts) soon anyway, and if not I can directly look at our dashboards and Alertmanager.
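To make the general shape of this concrete, here is a minimal sketch (not our actual program) of querying Prometheus's HTTP API for the current alerts and producing a terse per-host summary; it uses the requests library for brevity. The server URL, the label names it looks at, and the list of flaky pending alerts to ignore are all invented for the example.

  import requests

  PROM_URL = "http://prometheus.example.org:9090"  # hypothetical server URL
  # Hypothetical pending alerts that are too flaky to bother showing.
  IGNORED_PENDING = {"HostPingFlaky", "BriefLoadSpike"}

  def current_alerts():
      # ALERTS has one time series per pending or firing alert, carrying
      # the alert's labels plus 'alertname' and 'alertstate'.
      resp = requests.get(PROM_URL + "/api/v1/query", params={"query": "ALERTS"})
      resp.raise_for_status()
      return resp.json()["data"]["result"]

  def summarize():
      lines = []
      for sample in current_alerts():
          labels = sample["metric"]
          state = labels.get("alertstate", "firing")
          name = labels.get("alertname", "(unknown)")
          host = labels.get("host", labels.get("instance", "(no host)"))
          # Skip flaky pending alerts; they only matter once they fire.
          if state == "pending" and name in IGNORED_PENDING:
              continue
          prefix = "pending: " if state == "pending" else ""
          lines.append(f"{prefix}{host}: {name}")
      return "\n".join(sorted(lines))

  if __name__ == "__main__":
      print(summarize())

The real program does more elaborate filtering, transformation, and summarization, but the flow is the same: one query, some local filtering, and terse text output for xlbiff to display.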
To support this sort of thing, xlbiff has the notion of a 'check' program that can print out a number every time it runs, and will get passed the last invocation's number on the command line (or '0' at the start). Using this requires boiling down the state of the current alerts to a single signed 32-bit number. I could have used something like the count of current alerts, but me being me I decided to be more clever. The program takes the start time of every current alert (from the ALERTS_FOR_STATE Prometheus metric), subtracts a starting epoch to make sure we're not going to overflow, and adds them all up to be the state number (which I call a 'checksum' in my code because I started out thinking about more complex tricks like running my output text through CRC32).
(As a minor wrinkle, I add one second to the start time of every firing alert so that when alerts go from pending to firing the state changes and xlbiff will re-display things. I did this because pending and firing alerts are presented differently in the text output.)
To get both the start time and the alert state, we must use the usual trick for pulling in extra labels:
ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS
I understand why ALERTS_FOR_STATE doesn't include the alert state, but sometimes it does force you to go out of your way.
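For concreteness, here is a sketch of what the xlbiff 'check' side of this can look like (again not the actual program; the server URL and the starting epoch are made up). It runs the join query above, sums up the adjusted start times, and prints a single state number.

  import sys
  import requests

  PROM_URL = "http://prometheus.example.org:9090"  # hypothetical server URL
  # The join from above: ALERTS_FOR_STATE's value is the alert's start time,
  # and group_left() pulls in the alertstate label from ALERTS.
  QUERY = "ALERTS_FOR_STATE * ignoring(alertstate) group_left(alertstate) ALERTS"
  EPOCH = 1_700_000_000  # made-up starting epoch to keep the sum small

  def state_number():
      resp = requests.get(PROM_URL + "/api/v1/query", params={"query": QUERY})
      resp.raise_for_status()
      total = 0
      for sample in resp.json()["data"]["result"]:
          start_time = int(float(sample["value"][1]))
          # Nudge firing alerts by one second so a pending -> firing
          # transition changes the state number and forces a redisplay.
          if sample["metric"].get("alertstate") == "firing":
              start_time += 1
          total += start_time - EPOCH
      return total

  if __name__ == "__main__":
      # xlbiff passes the previous invocation's number on the command line
      # ('0' the first time); this sketch just prints the new state number.
      _previous = sys.argv[1] if len(sys.argv) > 1 else "0"
      print(state_number())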
PS: If we had alerts going off all of the time, this would be far too obtrusive an approach. Instead, our default state is that there are no alerts happening, so this alert notifier spends most of its time displaying nothing (well, having no visible window, which is even better).
2024-11-21
Our Prometheus alerting problem if our central mail server isn't working
Over on the Fediverse, I said something:
Ah yes, the one problem that our Prometheus based alert system can't send us alert email about: when the central mail server explodes. Who rings the bell to tell you that the bell isn't working?
(This is of course an aspect of monitoring your Prometheus setup itself, and also seeing if Alertmanager is truly healthy.)
There is a story here. The short version of the story is that today we wound up with a mail loop that completely swamped our central Exim mail server, briefly running its one minute load average up to a high water mark of 3,132 before a co-worker who'd noticed the problem forcefully power cycled it. Plenty of alerts fired during the incident, but since we do all of our alert notification via email and our central email server wasn't delivering very much email (on account of that load average, among other factors), we didn't receive any.
The first thing to note is that this is a narrow and short term problem for us (which is to say, me and my co-workers). On the short term side, we send and receive enough email that not receiving email for very long during working hours is unusual enough that someone would have noticed before too long; in fact, my co-worker noticed the problems even without an alert actively being triggered. On the narrow side, I failed to notice this as it was going on because the system stayed up; it just wasn't responsive. Once the system was rebooted, I noticed almost immediately, because I was in the office and some of the windows on my office desktop disappeared.
(In that old version of my desktop I would have noticed the issue right away, because an xload for the machine in question was right in the middle of these things. These days it's way off to the right side, out of my routine view, but I could change that back.)
One obvious approach is some additional delivery channel for alerts about our central mail server. Unfortunately, we're entirely email-focused; we don't currently use Slack, Teams, or other online chat systems, so sending selected alerts to any of them is out as a practical option. We do have work smartphones, so in theory we could send SMS messages; in practice, free email-to-SMS gateways have basically vanished, so we'd have to pay for something (either for direct SMS access, with some sort of system we'd build on top, or for a SaaS provider who would take some sort of notification and arrange to deliver it via SMS).
For myself, I could probably build some sort of script or program that regularly polled our Prometheus server to see if there were any relevant alerts. If there were, the program would signal me somehow, either by changing the appearance of a status window in a relatively unobtrusive way (eg turning it red) or popping up some sort of notification (perhaps I could build something around a creative use of xlbiff to display recent alerts, although this isn't as simple as it looks).
(This particular idea is a bit of a trap, because I could spend a lot of time crafting a little X program that, for example, had a row of boxes that were green, yellow, or red depending on the alert state of various really important things.)
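To give a concrete sense of the 'status window' flavor of this idea (purely a sketch of the trap, not something we run), a small Tkinter poller might look like the following; the server URL, the polling interval, and the window details are invented.

  import tkinter as tk
  import requests

  PROM_URL = "http://prometheus.example.org:9090"  # hypothetical server URL
  CHECK_MS = 60 * 1000                             # poll once a minute

  def firing_alert_count():
      # Count currently firing alerts; any query error counts as "unknown".
      try:
          resp = requests.get(PROM_URL + "/api/v1/query",
                              params={"query": 'count(ALERTS{alertstate="firing"})'},
                              timeout=10)
          resp.raise_for_status()
          result = resp.json()["data"]["result"]
          return int(float(result[0]["value"][1])) if result else 0
      except requests.RequestException:
          return None

  def update(root, label):
      count = firing_alert_count()
      if count is None:
          label.configure(text="alerts: ???", background="yellow")
      elif count > 0:
          label.configure(text=f"alerts: {count}", background="red")
      else:
          label.configure(text="no alerts", background="green")
      # Re-arm the poll timer.
      root.after(CHECK_MS, update, root, label)

  def main():
      root = tk.Tk()
      root.title("alert status")
      label = tk.Label(root, text="starting...", width=12)
      label.pack()
      update(root, label)
      root.mainloop()

  if __name__ == "__main__":
      main()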
2024-11-15
IPv6 networks do apparently get probed (and implications for address assignment)
For reasons beyond the scope of this entry, my home ISP recently changed my IPv6 assignment from a /64 to a (completely different) /56. Also for reasons beyond the scope of this entry, they left my old /64 routing to me along with my new /56, and when I noticed I left my old IPv6 address on my old /64 active, because why not. Of course I changed my DNS immediately, and at this point it's been almost two months since my old /64 appeared in DNS. Today I decided to take a look at network traffic to my old /64, because I knew there was some (which is actually another entry), and to my surprise much more appeared than I expected.
On my old /64, I used ::1/64 and ::2/64 for static IP addresses, of which the first was in DNS, and the other IPv6 addresses in it were the usual SLAAC assignments. The first thing I discovered in my tcpdump was a surprisingly large number of cloud-based IPv6 addresses that were pinging my ::1 address. Once I excluded that traffic, I was left with enough volume of port probes that I could easily see them in a casual tcpdump.
The somewhat interesting thing is that these IPv6 port probes were happening at all. Apparently there is enough out there on IPv6 that it's worth scraping IPv6 addresses from DNS and then probing potentially vulnerable ports on them to see if something responds. However, as I kept watching I discovered something else, which is that a significant number of these probes were not to my ::1 address (or to ::2). Instead they were directed to various (very) low-number addresses on my /64. Some went to the ::0 address, but I saw ones to ::3, ::5, ::7, ::a, ::b, ::c, ::f, ::15, and a (small) number of others. Sometimes a sequence of source addresses in the same /64 would probe the same port on a sequence of these addresses in my /64.
(Some of this activity is coming from things with DNS, such as various shadowserver.org hosts.)
As usual, I assume that people out there on the IPv6 Internet are doing this sort of scanning of low-numbered /64 IPv6 addresses because it works. Some number of people put additional machines on such low-numbered addresses and you can discover or probe them this way even if you can't find them in DNS.
One of the things that I take away from this is that I may not want to put servers on these low IPv6 addresses in the future. Certainly one should have firewalls and so on, even on IPv6, but even then you may want to be a little less obvious and easily found. Or at the least, only use these IPv6 addresses for things you're going to put in DNS anyway and don't mind being randomly probed.
PS: This may not be news to anyone who's actually been using IPv6 and paying attention to their traffic. I'm late to this particular party for various reasons.
2024-11-14
Your options for displaying status over time in Grafana 11
A couple of years ago I wrote about your options for displaying status over time in Grafana 9, which discussed the problem of visualizing things like how many (firing) Prometheus alerts there are of each type over time. Since then, some things have changed in the Grafana ecosystem, and especially some answers have recently become clearer to me (due to an old issue report), so I have some updates to that entry.
Generally, the best panel type to use for this is a state timeline panel, with 'merge equal consecutive values' turned on. State timelines are no longer 'beta' in Grafana 11 and they work for this, and I believe they're Grafana's more or less officially recommended solution for this problem. By default a state timeline panel will show all labels, but you can enable pagination. The good news (in some sense) is that Grafana is aware that people want a replacement for the old third party Discrete panel (1, 2, 3) and may at some point do more to move toward this.
You can also use bar graphs and line graphs, as mentioned back then, which continue to have the virtue that you can selectively turn on and off displaying the timelines of some alerts. Both bar graphs and line graphs continue to have their issues for this, although I think they're now different issues than they had in Grafana 9. In particular I think (stacked) line graphs are now clearly less usable and harder to read than stacked bar graphs, which is a pity because they used to work decently well apart from a few issues.
(I've been impressed, not in a good way, at how many different ways Grafana has found to make their new time series panel worse than the old graph panel in a succession of Grafana releases. All I can assume is that everyone using modern Grafana uses time series panels very differently than we do.)
As I found out, you don't want to use the status history panel for this. The status history panel isn't intended for this usage; it has limits on the number of results it can represent and it lacks the 'merge equal consecutive values' option. More broadly, Grafana is apparently moving toward merging all of the functionality of this panel into the Heatmap panel (also). If you do use the status history panel for anything, you want to set a general query limit on the number of results returned, and this limit is probably best set low (although how many points the panel will accept depends on its size in the browser, so life is fun here).
Since the status history panel is basically a variant of heatmaps, you don't really want to use heatmaps either. Using Heatmaps to visualize state over time in Grafana 11 continues to have the issues that I noted in Grafana 9, although some of them may be eliminated at some point in the future as the status history panel is moved further out. Today, if for some reason you have to choose between Heatmaps and Status History for this, I think you should use Status History with a query limit.
If we ever have to upgrade from our frozen Grafana version, I would expect to keep our line graph alert visualizations and replace our Discrete panel usage with State Timeline panels with pagination turned on.
2024-11-12
Finding a good use for keep_firing_for in our Prometheus alerts
A while back (in 2.42.0), Prometheus introduced a feature to artificially keep alerts firing for some amount of time after their alert condition had cleared; this is 'keep_firing_for'. At the time, I said that I didn't really see a use for it for us, but I now have to change that. Not only do we have a use for it, it's one that deals with a small problem in our large scale alerts.
Our 'there is something big going on' alerts exist only to inhibit our regular alerts. They trigger when there seems to be 'too much' wrong, ideally fast enough that their inhibition effect stops the normal alerts from going out. Because normal alerts from big issues being resolved don't necessarily clear out immediately, we want our large scale alerts to linger on for some time after the amount of problems we have drops below their trigger point. Among other things, this avoids a gotcha with inhibitions and resolved alerts. Because we created these alerts before v2.42.0, we implemented the effect of lingering on by using max_over_time() on the alert conditions (this was the old way of giving an alert a minimum duration).
The subtle problem with using max_over_time() this way is that it means you can't usefully use a 'for:' condition to de-bounce your large scale alert trigger conditions. For example, if one of the conditions is 'there are too many ICMP ping probe failures', you'd potentially like to only declare a large scale issue if this persisted for more than one round of pings; otherwise a relatively brief blip of a switch could trigger your large scale alert. But because you're using max_over_time(), no short 'for:' will help; once you briefly hit the trigger number, it's effectively latched for our large scale alert lingering time.
Switching to extending the large scale alert directly with 'keep_firing_for' fixes this issue, and also simplifies the alert rule expression. Once we're no longer using max_over_time(), we can set 'for: 1m' or another useful short number to de-bounce our large scale alert trigger conditions.
(The drawback is that now we have a single de-bounce interval for all of the alert conditions, whereas before we could possibly have a more complex and nuanced set of conditions. For us, this isn't a big deal.)
I suspect that this may be generic to most uses of max_over_time() in alert rule expressions (fortunately, this was our only use of it). Possibly there are reasonable uses for it in sub-expressions, clever hacks, and maybe also using times and durations (eg, also, also).
2024-11-11
Prometheus makes it annoyingly difficult to add more information to alerts
Suppose, not so hypothetically, that you have a special Prometheus meta-alert about large scale issues, that exists to avoid drowning you in alerts about individual hosts or whatever when you have a large scale issue. As part of that alert's notification message, you'd like to include some additional information about things like why you triggered the alert, how many down things you detected, and so on.
While Alertmanager creates the actual notification messages by expanding (Go) templates, it doesn't have direct access to Prometheus or any other source of external information, for relatively straightforward reasons. Instead, you need to pass any additional information from Prometheus to Alertmanager in the form (generally) of alert annotations. Alert annotations (and alert labels) also go through template expansion, and in the templates for alert annotations, you can directly make Prometheus queries with the query function. So on the surface this looks relatively simple, although you're going to want to look carefully at YAML string quoting.
I did some brief experimentation with this today, and it was enough to convince me that there are some issues with doing this in practice. The first issue is that of quoting. Realistic PromQL queries often use " quotes because they involve label values, and the query you're doing has to be a (Go) template string, which probably means using Go raw quotes unless you're unlucky enough to need ` characters, and then there's YAML string quoting. At a minimum this is likely to be verbose.
A somewhat bigger problem is that straightforward use of Prometheus template expansion (using a simple pipeline) is generally going to complain in the error log if your query provides no results. If you're doing the query to generate a value, there are some standard PromQL hacks to get around this. If you want to find a label, I think you need to use a more complex template 'with' operation; on the positive side, this may let you format a message fragment with multiple labels and even the value.
More broadly, if you want to pass multiple pieces of information from a single query into Alertmanager (for example, the query value and some labels), you have a collection of less than ideal approaches. If you create multiple annotations, one for each piece of information, you give your Alertmanager templates the maximum freedom but you have to repeat the query and its handling several times. If you create a text fragment with all of the information that Alertmanager will merely insert somewhere, you basically split writing your alerts between Alertmanager and Prometheus alert rules. And if you encode multiple pieces of information into a single annotation with some scheme, you can use one query in Prometheus and not lock yourself into how the Alertmanager template will use the information, but your Alertmanager template will have to parse that information out again with Go template functions.
What all of this is a symptom of is that there's no particularly good way to pass structured information between Prometheus and Alertmanager. Prometheus has structured information (in the form of query results) and your Alertmanager template would like to use it, but today you have to smuggle that through unstructured text. It would be nice if there was a better way.
(Prometheus doesn't quite pass through structured information from a single query, the alert rule query, but it does make all of the labels and annotations available to Alertmanager. You could imagine a version where this could be done recursively, so some annotations could themselves have labels, and so on.)