Working out how frequently your ICMP pings fail in Prometheus
Suppose, not hypothetically, that your Prometheus setup pings a bunch of machines (through the blackbox exporter) and some of those pings seem to fail some of the time. If they fail continuously for long enough, you'll raise an alert, but beyond that you may want to know how often they've flaked out over some time period for use in a Grafana dashboard. Today, I wanted this both as a failure percentage and then as a count of how many pings had failed.
Our Blackbox setup reports ICMP results as a probe_success metric with a probe="icmp" label (among others); it has a 1 value if the probe was successful and a 0 value if it failed. For such metrics that are either 1 or 0, the classical way to determine the percentage of time they're up (or successful) is to use avg_over_time, as covered in Robust Perception's "What percentage of time is my service down for?". The straightforward PromQL query for 'what percent is this down' as a 0.0 to 1.0 value is thus:
1 - avg_over_time( probe_success{ probe="icmp" }[$__range] )
(This uses the Grafana range variable, covered here in the 'Using interval and range variables' section.)
However, I forgot this when I was initially setting up our new ping status dashboard today and used a different approach. In general, if you're looking for the 0.0 to 1.0 percentage of a subset of your data, you want the subcount divided by the total count. The total count of ICMP ping probes over time is:
count_over_time( probe_success{ probe="icmp" }[$__range] )
Because successful ping probes have a value of 1, we can get the count of them with sum_over_time, making the full expression be:
1 - ( sum_over_time( probe_success{ probe="icmp" }[$__range] ) / count_over_time( probe_success{ probe="icmp" }[$__range] ) )
Of course, this 'sum / count' is just the average and so we can replace this with the more efficient avg_over_time expression.
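As a quick sanity check of that equivalence, here is a small Python sketch. The sample list is made up for illustration and simply stands in for the 0/1 probe_success values that a range selector would return; it is not real data.

```python
# Hypothetical probe_success samples in a range: 1 = ping succeeded, 0 = ping failed.
samples = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

total = len(samples)                    # stands in for count_over_time
successes = sum(samples)                # stands in for sum_over_time
average = sum(samples) / len(samples)   # stands in for avg_over_time

# The 'subcount / total' form and the 'average' form of the failure fraction.
failure_via_ratio = 1 - successes / total
failure_via_avg = 1 - average

print(failure_via_ratio, failure_via_avg)
```

Both forms do the same arithmetic, so they agree; the avg_over_time version just asks Prometheus to do it in one pass instead of two.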
(We don't want to use a subquery to count up how many times probe_success was zero, because a subquery won't necessarily get the same number of metric points as count_over_time will. You might even have different Blackbox ping frequencies for different targets.)
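To make the mismatch concrete, here is a rough Python model. It assumes that a subquery evaluates its inner expression once per resolution step no matter how often the target is actually probed, and that a plain range selector sees every stored sample. The target names, probe intervals, and step are invented numbers, and the model ignores alignment effects at the edges of the range.

```python
def raw_sample_count(range_s: int, probe_interval_s: int) -> int:
    """Samples a range selector would see if probes arrive every
    probe_interval_s seconds and none are missing (simplified)."""
    return range_s // probe_interval_s


def subquery_point_count(range_s: int, step_s: int) -> int:
    """Evaluation points a subquery would produce at a fixed resolution
    step (simplified: ignores alignment at the range edges)."""
    return range_s // step_s


range_s = 3600   # a one hour dashboard range
step_s = 60      # hypothetical subquery resolution

# Hypothetical targets that are pinged at different frequencies.
probe_intervals = {"fast-target": 5, "slow-target": 30}

for target, interval_s in probe_intervals.items():
    raw = raw_sample_count(range_s, interval_s)
    sub = subquery_point_count(range_s, step_s)
    print(f"{target}: {raw} raw samples, {sub} subquery points")
```

Under these assumptions the subquery yields the same number of points for both targets, while the raw sample counts differ by a factor of six, so a count of failures taken from the subquery can't be compared against count_over_time.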
This version of our 'percentage of pings that failed' expression points the way to giving us the total number of failed pings. This is the total number of pings minus the successful pings, which uses the same parts as our complicated percentage expression, flipped around:
count_over_time( probe_success{ probe="icmp" }[$__range] ) - sum_over_time( probe_success{ probe="icmp" }[$__range] )
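As a worked example of 'total minus successful', here is the same arithmetic in Python over a made-up list of 0/1 samples (an assumption for illustration only), checked against counting the zeros directly.

```python
# Hypothetical probe_success samples: 1 = ping succeeded, 0 = ping failed.
samples = [1, 0, 1, 1, 0, 0, 1, 1]

total_pings = len(samples)        # stands in for count_over_time
successful_pings = sum(samples)   # stands in for sum_over_time

# Failed pings are whatever is left after removing the successes.
failed_pings = total_pings - successful_pings

# Counting the zeros directly should give the same answer.
zeros = samples.count(0)

print(failed_pings, zeros)
```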
Note that this is not the amount of time that pings were failing for. In general, it's impossible to work out a completely accurate number for that for various reasons, including that we may have metric points that are missing entirely for whatever reason. If we assume that we have metric points evenly covering the entire time range, the amount of time (in seconds) which pings were failing for is the total range of time in seconds times the 0.0 to 1.0 failure percentage. In Grafana again, this would be:
$__range_s * (1 - avg_over_time( probe_success{ probe="icmp" }[$__range] ))
(There may be a better way to compute this. I haven't thought much about it because I think 'amount of time down' is misleading here in a way that 'percentage of pings that failed' is not.)
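For a sense of scale, here is that estimate in Python. It assumes evenly spaced samples with none missing, and the function name and numbers are hypothetical.

```python
def estimated_failing_seconds(range_s: float, samples: list[int]) -> float:
    """Estimate seconds of failing pings as range length times failure
    fraction. Assumes the 0/1 samples evenly cover the whole range."""
    failure_fraction = 1 - sum(samples) / len(samples)
    return range_s * failure_fraction


# Hypothetical hour of pings every 5 seconds: 720 probes, 36 of them failed.
samples = [1] * 684 + [0] * 36
estimate = estimated_failing_seconds(3600, samples)

print(round(estimate, 3))
```

Note that the estimate is the same whether the 36 failures came in one three minute block or were scattered across the hour, which is part of why 'amount of time down' can mislead here.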

