Wandering Thoughts archives

2019-06-02

Exploring the start time of Prometheus alerts via ALERTS_FOR_STATE

In Prometheus, active alerts are exposed through two metrics, the reasonably documented ALERTS and the newer, under-documented ALERTS_FOR_STATE. Both metrics have all of the labels of the alert (although not its annotations), and also an 'alertname' label; the ALERTS metric also has an additional 'alertstate' label. The value of the ALERTS metric is always '1', while the value of ALERTS_FOR_STATE is the Unix timestamp of when the alert rule expression started being true; for rules with 'for' delays, this means that it is the timestamp when they started being 'pending', not when they became 'firing' (see this rundown of the timeline of an alert).

(The ALERTS_FOR_STATE metric is an internal one added in 2.4.0 to persist the state of alerts so that 'for' delays work over Prometheus restarts. See "Persist 'for' State of Alerts" for more details, and also Prometheus issue #422. Because of this, it's not exported from the local Prometheus and may not be useful to you in clustered or federated setups.)

The ALERTS_FOR_STATE metric is quite useful if you want to know the start time of an alert, because this information is otherwise pretty much unavailable through PromQL. The necessary information is sort of in Prometheus's time series database, but PromQL does not provide any functions to extract it. Also, unfortunately there is no good way to see when an alert ends even with ALERTS_FOR_STATE.

(In both cases the core problem is that alerts that are not active (neither pending nor firing) don't exist as metrics at all. There are some things you can do with missing metrics, but there is no good way to see in general when a metric appears or disappears. In some cases you can look at the results of manually evaluating the underlying alert rule expression, but in other cases even this will have a null value when the alert is not active.)

We can do some nice things with ALERTS_FOR_STATE, though. To start with, we can calculate how long each current alert has been active, which is just the current time minus when it started:

time() - ALERTS_FOR_STATE

If we want to restrict this to alerts that are actually firing at the moment, instead of just being pending, we can write it as:

(time() - ALERTS_FOR_STATE)
  and ignoring(alertstate) ALERTS{alertstate="firing"}

(We must ignore the 'alertstate' label because the ALERTS_FOR_STATE metric doesn't have it.)
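
To make the label matching concrete, here is a minimal Python sketch of what 'and ignoring(alertstate)' does to two instant vectors. It is only a toy model, not how Prometheus implements things; the alert names, hosts, and evaluation time are all made up for illustration, and it leaves out metric names entirely.

```python
# Toy model of PromQL's "A and ignoring(alertstate) B" on instant vectors.
# Each vector maps a frozenset of (label, value) pairs to a sample value.
# All alert names, hosts, and times below are hypothetical.

def labelset(**labels):
    return frozenset(labels.items())

def and_ignoring(left, right, ignored):
    """Keep left-hand samples whose labels (minus the ignored ones)
    match the labels (minus the ignored ones) of some right-hand sample."""
    def strip(ls):
        return frozenset((k, v) for k, v in ls if k not in ignored)
    right_keys = {strip(ls) for ls in right}
    return {ls: val for ls, val in left.items() if strip(ls) in right_keys}

now = 1_559_500_000  # hypothetical evaluation time, Unix seconds

# One alert active for an hour, one active for only a minute.
alerts_for_state = {
    labelset(alertname="DiskFull", host="a"): now - 3600,
    labelset(alertname="HighLoad", host="b"): now - 60,
}
# The matching ALERTS series; only the first alert is firing.
alerts = {
    labelset(alertname="DiskFull", host="a", alertstate="firing"): 1,
    labelset(alertname="HighLoad", host="b", alertstate="pending"): 1,
}

# time() - ALERTS_FOR_STATE
active_for = {ls: now - start for ls, start in alerts_for_state.items()}

# ALERTS{alertstate="firing"}
firing = {ls: v for ls, v in alerts.items() if ("alertstate", "firing") in ls}

# (time() - ALERTS_FOR_STATE) and ignoring(alertstate) ALERTS{alertstate="firing"}
firing_for = and_ignoring(active_for, firing, ignored={"alertstate"})
```

Without stripping 'alertstate', nothing on the left would match anything on the right, because the left-hand label sets never contain that label.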

You might use this in a dashboard where you want to see which alerts are new and which are old.

A more involved query is one to tell us the longest amount of time that a firing alert has been active over some past time interval. The full version of this is:

max_over_time( ( (time() - ALERTS_FOR_STATE)
                  and ignoring(alertstate)
                         ALERTS{alertstate="firing"}
               )[7d:] )

The core of this is the expression we already saw, and we evaluate it over the past 7 days, but until I thought about things it wasn't clear why this gives us the longest amount of time for any particular alert. What is going on is that while an alert is active, ALERTS_FOR_STATE's value stays constant while time() keeps counting up, because time() is re-evaluated at each step of the subquery. The maximum value of 'time() - ALERTS_FOR_STATE' happens right before the alert ceases to be active and its ALERTS_FOR_STATE metric disappears. Using max_over_time captures this maximum value for us.

(If the same alert is active several times over the past seven days, we only get the longest single time. There is no good way to see how long each individual incident lasted.)
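
The reasoning above can be checked with a small Python simulation of the subquery. This is a sketch under simplifying assumptions: it models only the 'time() - ALERTS_FOR_STATE' part for a single alert, the step size and incident times are invented, and the helper names are hypothetical.

```python
# Toy simulation of max_over_time( (time() - ALERTS_FOR_STATE)[range:step] )
# for one alert. All times are hypothetical Unix-style seconds.

def subquery_samples(incidents, t0, t1, step):
    """Return (eval_time, start_timestamp or None) for each subquery step.
    incidents is a list of (begin, end) intervals when the alert was active;
    while active, ALERTS_FOR_STATE holds the constant value 'begin'."""
    samples = []
    for t in range(t0, t1, step):
        start = None
        for begin, end in incidents:
            if begin <= t < end:
                start = begin
        samples.append((t, start))
    return samples

def max_active_duration(samples):
    """Largest value of time() - ALERTS_FOR_STATE over steps where it exists."""
    durations = [t - start for t, start in samples if start is not None]
    return max(durations) if durations else None

# Two incidents of the same alert: one lasting 300s, one lasting 180s.
incidents = [(1000, 1300), (2020, 2200)]
samples = subquery_samples(incidents, t0=880, t1=2400, step=60)

longest = max_active_duration(samples)
```

Here the first incident is last seen at the step at time 1240, so the largest difference is 240 seconds; the shorter second incident contributes nothing to the result, which is why only the longest single activation shows up.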

We can exploit the fact that ALERTS_FOR_STATE has a different value each time an alert activates to count how many times each alert activated over the course of some range. The simplest way to do this is:

changes( ALERTS_FOR_STATE[7d] ) + 1

We have to add one because going from not existing to existing is not counted as a change in value for the purpose of changes(), so an alert that only activated once will be reported as having 0 changes in its ALERTS_FOR_STATE value. I will leave it as an exercise for the reader to extend this to only counting how many times alerts fired, ignoring alerts that only became pending and then went away again (as might happen repeatedly if you have alerts with deliberately long 'for' delays).
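
The off-by-one is easy to see in a toy Python model of changes(). This is only an illustration of the counting logic, with made-up sample values, and assumes that a range vector simply contains the samples that exist, with nothing recorded for the gaps where the alert was inactive.

```python
# Toy model of changes(ALERTS_FOR_STATE[7d]) + 1 for a single alert.
# The sample values are hypothetical activation timestamps.

def changes(values):
    """Count how many times consecutive samples differ in value."""
    return sum(1 for prev, cur in zip(values, values[1:]) if prev != cur)

# An alert that activated once: every sample has the same start timestamp.
once = [1000, 1000, 1000]

# An alert that activated three times: the value differs per activation.
# The gaps while it was inactive leave no samples at all.
thrice = [1000, 1000, 1000, 2020, 2020, 5000]

activations_once = changes(once) + 1
activations_thrice = changes(thrice) + 1
```

The first appearance of the series never counts as a change because there is no earlier sample to compare it with, so the number of activations is always one more than the number of changes.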

(This entry was sparked by a recent prometheus-users thread, especially Julien Pivotto's suggestion.)

sysadmin/PrometheusAlertStartTimeStuff written at 22:15:26

I haven't customized my Vim setup and I'm not sure I should try to (yet)

I was recently reading At least one Vim trick you might not know (via). In passing, the article divides Vim users (and its tips) into purists, who deliberately use Vim with minimal configuration, and exobrains, who "stuff Vim full of plugins, functions, and homebrew mappings". All of this is to say that currently, as a Vim user I am a non-exobrain; I use Vim with minimal customization (although not none).

This is not because I am a deliberate purist. Instead, it's partly because I've so far perceived the universe of Vim customizations as a daunting and complex place that seems like too much work to explore when my Vim (in its current state) works well enough for me. Well, that's not entirely true. I'm also aware that I could improve my Vim experience with more knowledge and use of Vim's own built-in features. Trying to add customizations to Vim when I haven't even mastered its relative basics doesn't seem like a smart idea, and it also seems like I'd make bad decisions about what to customize and how.

(Part of the dauntingness is that in my casual reading, there seem to be several different ways to manage and maintain Vim plugins. I don't know enough to pick the right one, or even evaluate which one is more popular or better.)

There are probably Vim customizations and plugins that could improve various aspects of my Vim experience. But finding them starts with the most difficult part, which is understanding what I actually want from my Vim experience and what sort of additions would clash with it. The way I've traditionally used Vim is that I treat it as a 'transparent' editor, one where my interest is in getting words (and sometimes code) down on the screen. In theory, a good change would be something that increases this transparency, that deals with some aspect of editing that currently breaks me out of the flow and makes me think about mechanics.

(I think that the most obvious candidate for this would be some sort of optional smart indentation for code and annoying things like YAML files. I don't want smart indentation all of the time, but putting the cursor in the right place by default is a great use of a computer, assuming that you can make it work well inside Vim's model.)

Of course the other advantage of mostly avoiding customizing my Vim experience is that it preserves a number of the advantages that make Vim a good sysadmin's editor. I edit files with Vim in a lot of different contexts, and it's useful if these all behave pretty much the same. And of course getting better at core Vim improves things for me in all of these environments, since core Vim is everywhere. Even if I someday start customizing my main personal Vim with extra things to make it nicer, focusing on core Vim until I think I have all of the basics I care about down is more generally useful right now.

(As an illustration of this, one little bit of core Vim that I've been finding more and more convenient as I remember it more is the Ctrl-A and Ctrl-X commands to increment and decrement numbers in the text. This is somewhat a peculiarity of our environment, but it comes up surprisingly often. And this works everywhere.)

PS: Emacs is not entirely simpler than Vim here as far as customization goes, but I have a longer history with customizing Emacs than I do with Vim. And it does seem like Emacs has its package ecology fairly nailed down, based on my investigations from a while back for code editing.

unix/VimMinimalCustomization written at 00:23:58

