Working out how frequently your ICMP pings fail in Prometheus

May 22, 2020

Suppose, not hypothetically, that your Prometheus setup pings a bunch of machines (through the blackbox exporter) and some of those pings seem to fail some of the time. If they fail continuously for long enough, you'll raise an alert, but beyond that you may want to know how often they've flaked out over some time period for use in a Grafana dashboard. Today, I wanted this both as a failure percentage and then as a count of how many pings had failed.

Our Blackbox setup reports ICMP results as a probe_success metric with a probe="icmp" label (among others); it has a 1 value if the probe was successful and a 0 value if it failed. For such metrics that are either 1 or 0, the classical way to determine the percentage of time they're up (or successful) is to use avg_over_time, as covered in Robust Perception's "What percentage of time is my service down for?". The straightforward PromQL query for 'what percent is this down' as a 0.0 to 1.0 value is thus:

1 - avg_over_time( probe_success{ probe="icmp" }[$__range] )

(This uses Grafana's $__range variable, which is covered in the 'Using interval and range variables' section of Grafana's Prometheus data source documentation.)
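To see why this works, here is a rough sketch in Python rather than PromQL, with made-up target names and sample values. The average of a series of 1s and 0s is the fraction of 1s, so one minus the average is the fraction of failed probes. Like avg_over_time, the sketch works on one series at a time.

```python
from statistics import mean

# Hypothetical probe_success samples per target over one time range:
# 1 means the ping succeeded, 0 means it failed.
samples_by_target = {
    "hosta": [1, 1, 0, 1],
    "hostb": [1, 1, 1, 1, 1, 1, 1, 1],
}


def failure_fraction(samples):
    """Mimic 1 - avg_over_time(...) for a single series of 0/1 samples."""
    return 1 - mean(samples)


# One result per target, just as PromQL returns one result per series.
failure_by_target = {
    target: failure_fraction(samples)
    for target, samples in samples_by_target.items()
}
```

With these invented samples, hosta failed one ping in four and hostb failed none.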

However, I forgot this when I was initially setting up our new ping status dashboard today and used a different approach. In general, if you're looking for the 0.0 to 1.0 percentage of a subset of your data, you want the subcount divided by the total count. The total count of ICMP ping probes over time is:

count_over_time( probe_success{ probe="icmp" }[$__range] )

Because successful ping probes have a value of 1, we can get the count of them with sum_over_time, making the full expression be:

1 - ( sum_over_time( probe_success{ probe="icmp" }[$__range] ) /
      count_over_time( probe_success{ probe="icmp" }[$__range] ) )

Of course, this 'sum / count' is just the average and so we can replace this with the more efficient avg_over_time expression.
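A quick way to convince yourself of the equivalence is to compute both forms over the same samples. This is only a Python illustration with invented sample values, not anything that runs inside Prometheus.

```python
import math
from statistics import mean

# Made-up probe_success samples: ten probes, two of which failed.
samples = [1, 0, 1, 1, 0, 1, 1, 1, 1, 1]

# The long form: 1 - (sum_over_time / count_over_time).
via_sum_and_count = 1 - sum(samples) / len(samples)

# The short form: 1 - avg_over_time.
via_average = 1 - mean(samples)

# Compare with a tolerance, since both are floating point results.
same_answer = math.isclose(via_sum_and_count, via_average)
```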

(We don't want to use a subquery to count up how many times probe_success was zero, because a subquery won't necessarily get the same number of metric points as count_over_time will. You might even have different Blackbox ping frequencies for different targets.)
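The mismatch can be sketched with a deliberately simplified model in Python. The 5 second ping interval, the 15 second subquery step, and the window boundaries are all assumptions picked for illustration, and the model ignores details of real PromQL evaluation such as staleness handling. The idea is that a subquery evaluates the inner expression once per step and sees whatever sample is current at that moment, while count_over_time sees every stored sample.

```python
def latest_at_or_before(samples, t):
    """Value of the most recent (timestamp, value) sample at or before t."""
    earlier = [value for ts, value in samples if ts <= t]
    return earlier[-1] if earlier else None


# Hypothetical target pinged every 5 seconds; only the ping at t=20 failed.
samples = [(ts, 0 if ts == 20 else 1) for ts in range(0, 60, 5)]

window_start, window_end = 0, 60

# Range-vector style: look at every stored sample in the window.
raw_failures = sum(
    1 for ts, value in samples
    if window_start < ts <= window_end and value == 0
)

# Subquery style (simplified): evaluate once per step and use whatever
# sample is current at that evaluation time.
step = 15
eval_times = list(range(window_start + step, window_end + 1, step))
subquery_failures = sum(
    1 for t in eval_times if latest_at_or_before(samples, t) == 0
)
```

In this toy setup the range-vector view finds the one failed ping, while the stepped view evaluates at 15, 30, 45 and 60 seconds and never lands on it.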

This version of our 'percentage of pings that failed' expression points the way to giving us the total number of failed pings. This is the total number of pings minus the successful pings, which uses the same two parts as our complicated percentage expression but subtracts them instead of dividing:

count_over_time( probe_success{ probe="icmp" }[$__range] ) -
   sum_over_time( probe_success{ probe="icmp" }[$__range] )
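The same subtraction in a small Python sketch, using invented sample values, with a direct count of the zeros as a cross-check:

```python
# Made-up probe_success samples for one target over the time range.
samples = [1, 1, 0, 1, 0, 1]

total_pings = len(samples)        # what count_over_time would report
successful_pings = sum(samples)   # what sum_over_time would report
failed_pings = total_pings - successful_pings

# Cross-check: counting the zeros directly should give the same number.
zeros_counted_directly = samples.count(0)
```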

Note that this is not the amount of time that pings were failing for. In general, it's impossible to work out a completely accurate number for that for various reasons, including that we may have metric points that are missing entirely for whatever reason. If we assume that we have metric points evenly covering the entire time range, the amount of time (in seconds) for which pings were failing is the total range of time in seconds times the 0.0 to 1.0 failure percentage. In Grafana again, this would be:

$__range_s *
  (1 - avg_over_time( probe_success{ probe="icmp" }[$__range] ))

(There may be a better way to compute this. I haven't thought much about it because I think 'amount of time down' is misleading here in a way that 'percentage of pings that failed' is not.)
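As a hedged illustration of that estimate and of why it can mislead, here is the same arithmetic in Python. The function name and the sample values are made up. Note how the result always scales to the full range, no matter how few samples actually exist.

```python
def estimated_failing_seconds(samples, range_seconds):
    """Range length times the failure fraction, as in the Grafana expression.

    Assumes the 0/1 samples evenly cover the whole range, which real
    data may not do.
    """
    failure_fraction = 1 - sum(samples) / len(samples)
    return range_seconds * failure_fraction


# One hour range, four samples, one failure: a quarter of the hour.
evenly_covered = estimated_failing_seconds([1, 1, 0, 1], 3600)

# Same range but only two samples survived: a single failed ping now
# accounts for half an hour of estimated failure.
sparse = estimated_failing_seconds([1, 0], 3600)
```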

Written on 22 May 2020.
Last modified: Fri May 22 00:28:22 2020