Some thoughts on having set up a personal Alertmanager instance

May 27, 2021

I've been running a personal Prometheus instance on my desktop machines for long enough that I don't remember exactly how long (probably since December of 2018). Recently, I extended that by setting up a personal Alertmanager on my work desktop. The direct trigger for this was my concern that the kernel would turn off the GPU fan and the GPU would melt down if I didn't catch it in time, but once I'd set up Alertmanager and written the initial alert rule, I came up with a number of other problems I wanted to know about.

All of this was pretty easy because I could mostly copy the Alertmanager configuration and alerting rules from our production Prometheus setup, with only minor edits. If I'd had to write a full setup from scratch, I might not have bothered (or it would have been a very minimal setup with relatively little customization). This does have the modest downside that the emails about my personal alerts look an awful lot like our production alerts, but I'm relatively confident I can keep them straight.
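As a rough illustration of how small a personal Alertmanager configuration can be, here is a minimal sketch of an alertmanager.yml that sends everything to one email address. This is not my actual configuration; the SMTP host, addresses, receiver name, and the '[personal]' subject prefix are all made up for illustration. The prefix is one possible way to make personal alert emails look different from production ones.

```yaml
# Minimal sketch of an alertmanager.yml; all hosts, addresses and names
# below are placeholders, not taken from any real setup.
global:
  smtp_smarthost: 'localhost:25'            # assumed local mail relay
  smtp_from: 'alertmanager@desktop.example.org'
  smtp_require_tls: false                   # assumes the local relay has no TLS

route:
  receiver: personal-email                  # hypothetical receiver name
  group_by: ['alertname']

receivers:
  - name: personal-email
    email_configs:
      - to: 'me@example.org'
        headers:
          # Prepend a marker to the default subject template so these
          # emails are visibly distinct from production alerts.
          Subject: '[personal] {{ template "email.default.subject" . }}'
```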

On the one hand, all of this is useful, partly because it gives me a test environment. I routinely pilot new versions of many Prometheus components on my own machines, but until now that didn't include Alertmanager; now it does. It's also useful to know about problems on my work machine, especially at the moment when I'm not in the office to notice various issues in person. And some problems I'm checking for (like Prometheus exporters that have failed to start) are ones I've had before on both my home and my work machine.
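To make the 'exporters that have failed to start' case concrete, here is a minimal sketch of the sort of alert rule involved. It is not a copy of my actual rules; the group name, alert name, label, and the five minute delay are assumptions picked for illustration. It relies on the standard 'up' metric that Prometheus records for every scrape target, and it can only catch exporters that are already listed as scrape targets.

```yaml
# Hypothetical rules file, loaded through rule_files in prometheus.yml.
groups:
  - name: personal-exporters        # group name is made up
    rules:
      - alert: ExporterDown         # alert name is an assumption
        # Prometheus sets 'up' to 0 for a target when its scrape fails,
        # which includes an exporter that never started.
        expr: up == 0
        for: 5m                     # arbitrary grace period before firing
        labels:
          severity: personal        # made-up label, e.g. for routing
        annotations:
          summary: "Scrape target {{ $labels.instance }} (job {{ $labels.job }}) is down"
```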

On the other hand, this is a lot of duplication of effort to get monitoring and alerts for my office machine, since I've set up most of a copy of our production Prometheus setup. But I can't easily include my work machine in our production setup without causing a bunch of administrative issues, and this actually points out a general issue with (our) Prometheus.

The core issue is that our Prometheus setup is not designed or set up for multi-tenant operation, and I'm not sure there's any easy path to make a Prometheus setup work that way. My office desktop is not the only additional 'tenant' we might want to support; for example, it's quite possible that there are research groups (supported by other people) who want to monitor and alert about their own machines. Right now, all we can offer to them is to tell them how we set up our own Prometheus server and clients. We can't let them actually use our Prometheus, our Grafana, our Alertmanager, and so on; they have to provision their own machine and run their own instances.

I don't think this is necessarily within Prometheus's scope (although some people do build setups that work this way). But I do sometimes wish it were something Prometheus and Grafana supported, even if it's not clear to me how they could (or whether we would take advantage of it if they did).

(One big problem with a multi-tenant environment is storage space for metrics. Our server's available storage is dedicated to our own metrics and we need all the space there. Even on a bigger server that could take more disk drives we'd have issues of allocating and handling space.)

Written on 27 May 2021.