2022-12-03
Using Dovecot 2.3's 'events' system to create Prometheus metrics
Last time around I covered using Dovecot 2.3's events to generate log messages. This is actually the less interesting thing (to us) that you can do with them; the more interesting thing is that you can have Dovecot directly expose an OpenMetrics exporter for statistics, which Prometheus can scrape directly (the OpenMetrics metrics format is more or less the Prometheus one, and Prometheus can deal with it these days). However, actually generating useful metrics and understanding what you get is a little bit complicated.
(You'll need a service definition to expose your metrics for scraping, per the basic configuration, which you can just copy as is.)
In Prometheus terms, Dovecot statistics can give you counters or histograms (either exponential or linear). Histograms are the simpler thing so I'll cover them first. A histogram is created by a metric whose group_by sets either an exponential or a linear set of histogram buckets. For example, a duration histogram:
metric imap_command_time {
  filter = event=imap_command_finished AND \
    tagged_reply_state=OK
  group_by = cmd_name user \
    duration:exponential:1:30:2
}
The remaining group_by fields become histogram labels; in other words, we're creating a group of histograms by IMAP command and user (which is potentially a lot of histograms). These histograms are of the command duration, and have thirty buckets starting from 1 microsecond and going up to 1073.7 seconds (about 17.9 minutes), which should be enough of a range. That the duration is in microseconds is covered in Global Fields, but fortunately, for duration Dovecot will convert this to the standard Prometheus unit of seconds for you. These histograms also have the standard Prometheus histogram metrics of *_sum and *_count, which is handy for reasons we'll come back to later.
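The bucket arithmetic above is easy to check. Here is a minimal Python sketch, assuming that exponential:1:30:2 puts the bucket upper bounds at powers of the base from the minimum to the maximum magnitude, measured in microseconds. The function name is mine for illustration, not anything in Dovecot.

```python
def exponential_bounds_seconds(min_mag, max_mag, base):
    """Upper bucket bounds in seconds.

    Assumption: bounds sit at base**k microseconds for k in
    min_mag..max_mag inclusive.
    """
    return [base ** k / 1_000_000 for k in range(min_mag, max_mag + 1)]


bounds = exponential_bounds_seconds(1, 30, 2)
print(len(bounds))                # how many finite bucket bounds there are
print(round(bounds[-1], 1))       # largest bound, in seconds
print(round(bounds[-1] / 60, 1))  # the same bound, in minutes
```

Under this assumption the lowest bound is base**min_mag, so it is worth checking the Dovecot documentation for exactly where the first bucket edge sits.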
The other kind of metric Dovecot will create is counters, which Dovecot calls discrete statistics. However, these metrics have a major limitation, which is that they can only be used to count how many times something happened (ie, Dovecot 'events'), not additional data associated with those Dovecot events. These Dovecot statistics create Prometheus metrics for the count itself and for the 'duration' associated with the event. You cannot use these 'discrete' statistics to count, for example, the number of bytes output for IMAP commands; you can only count how many IMAP commands there were, and along with that the sum of their durations. The group_by clause also behaves peculiarly (from a Prometheus perspective) for counter metrics. So let's start with a counter metric definition and then talk about what happens:
metric imap_command {
  filter = event=imap_command_finished
  group_by = cmd_name user tagged_reply_state
}
This creates two groups of Prometheus metrics, dovecot_imap_command_total (the count of them) and dovecot_imap_command_duration_seconds_total (the total duration in seconds). However, in each you get not just a single set of labels (the way you do with histograms), but a hierarchy, as shown in the example in the exporter documentation. Here, that would create a set of labels that look like this:
{cmd_name="LIST"}
{cmd_name="LIST", user="tstuser"}
{cmd_name="LIST", user="tstuser", tagged_reply_state="OK"}
Prometheus can consume these metrics but the result may be confusing (and voluminous). You may also want to consider carefully the order of group_by, because it will influence which aggregate stats are readily at hand (here, count and duration by command) versus which aren't so easy (count and duration by user).
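To make the cascade concrete, here is a small Python sketch of how the label hierarchy follows the group_by order, and how you might get a per-user total without double counting by summing only the most specific level. The sample values are made up, and the helper names are hypothetical; this models the shape of the labels, not Dovecot's actual exporter output.

```python
def label_levels(group_by):
    """Return the label-name tuple for each level of the cascade."""
    return [tuple(group_by[:n]) for n in range(1, len(group_by) + 1)]


def leaf_samples(samples, group_by):
    """Keep only samples that carry every group_by label (the deepest level)."""
    wanted = set(group_by)
    return [(labels, value) for labels, value in samples if set(labels) == wanted]


group_by = ["cmd_name", "user", "tagged_reply_state"]

# Made-up counter samples as (labels, value) pairs, one per hierarchy entry.
samples = [
    ({"cmd_name": "LIST"}, 3),
    ({"cmd_name": "LIST", "user": "tstuser"}, 3),
    ({"cmd_name": "LIST", "user": "tstuser", "tagged_reply_state": "OK"}, 2),
    ({"cmd_name": "LIST", "user": "tstuser", "tagged_reply_state": "NO"}, 1),
]

print(label_levels(group_by))

# Sum the deepest level by user; the shallower levels are already aggregates,
# so including them would count the same commands more than once.
leaves = leaf_samples(samples, group_by)
per_user = {}
for labels, value in leaves:
    per_user[labels["user"]] = per_user.get(labels["user"], 0) + value
print(per_user)
```

The by-command aggregate is already present as the first level, while the by-user one has to be rebuilt like this, which is why the order of group_by matters.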
Although the Dovecot Statistics documentation talks about using the 'fields' setting to specify "a list of fields that are included in the metrics", as of Dovecot 2.3.16 this doesn't actually do anything for Prometheus metrics (although it does for Dovecot's internal statistics that you can access through the 'doveadm' command). It would be nice if it did at some point in the future, because it would allow us to easily obtain Prometheus metrics of, say, the total bytes output by IMAP commands (broken down as above). Instead we have to reach for a hack to generate such a thing.
If you want a Prometheus counter metric of, for example, bytes output by IMAP command and user, then the solution to this limitation of Dovecot 'discrete' statistics is to use the world's smallest linear histogram:
metric imap_command_out_bytes {
  filter = event=imap_command_finished AND \
    tagged_reply_state=OK
  group_by = cmd_name user \
    bytes_out:linear:1:2:1
}
We don't actually care about the histogram itself (and could have Prometheus drop it from the scrape results); what we care about are the associated *_count and *_sum metrics, which will give us the running sum of bytes out and the count of how many commands we've had.
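As an illustration of what the *_sum and *_count pair buys you, here is a minimal Python sketch that turns two scrapes of them into an average per command over the interval. The numbers are invented and the function is a hypothetical helper; in practice you would probably do the equivalent in a Prometheus query.

```python
def mean_per_event(sum_before, count_before, sum_after, count_after):
    """Average value per event between two scrapes of *_sum and *_count."""
    events = count_after - count_before
    if events <= 0:
        # No new events in the interval, or the counters were reset.
        return None
    return (sum_after - sum_before) / events


# Made-up scrape values for a single (cmd_name, user) label set:
# bytes-out sum and command count at two points in time.
average_bytes = mean_per_event(10_000, 40, 16_000, 50)
print(average_bytes)
```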
You can similarly use the world's smallest histogram to eliminate the usual cascade from counter metrics. Simply make a histogram metric where the tiny histogram is of duration:
metric imap_command {
  filter = event=imap_command_finished
  group_by = cmd_name user tagged_reply_state \
    duration:linear:1:2:1
}
However, this will probably generate more metrics in total than a regular Dovecot discrete metric would, although you can drop all the histogram buckets on ingestion to cut that down.
If you have Dovecot 'metrics' which exist only to log events, you can exclude them from Dovecot's exposed Prometheus metrics by giving them names that are invalid as OpenMetrics metrics, for example by putting one or more '/' in them. Dovecot will complain on startup, but so what.
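If you want to check ahead of time whether a name will be rejected, here is a small Python sketch using the metric name pattern from the Prometheus and OpenMetrics formats (a letter, underscore, or colon, followed by letters, digits, underscores, or colons). That Dovecot applies exactly this check is my assumption, and the metric names are only examples.

```python
import re

# Metric name pattern from the Prometheus / OpenMetrics text formats.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name):
    """True if the whole name matches the metric name pattern."""
    return METRIC_NAME_RE.fullmatch(name) is not None


print(is_valid_metric_name("imap_command"))      # an ordinary metric name
print(is_valid_metric_name("log/imap_command"))  # contains '/', so not a valid name
```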