In Prometheus, don't be afraid of high cardinality metrics if they're valuable enough
We generate any number of custom local metrics that we feed into our local Prometheus metrics and monitoring setup. Most of them are pretty conventional, but one of them is probably something that will raise a lot of eyebrows among people who are familiar with Prometheus and set them to muttering about cardinality explosions. Among our more conventional metrics about how much disk space is free on our fileserver filesystems, we generate a per-user, per-filesystem disk space usage metric. In the abstract, this looks like:
cslab_user_used_bytes{user="cks", filesystem="/h/281"}   59533330944
cslab_user_used_bytes{user="cks", filesystem="/cs/mail"}  512
(Users that are not using any space on a filesystem do not get listed for that filesystem.)
Having a time series per user is generally not recommended, and then having it per filesystem as well makes it worse. This metric generates a lot of distinct time series and I'm sure a lot of people would tell us that maybe we shouldn't have it.
However, it's turned out that we derive a major amount of practical value from having this information and having it in Prometheus (and therefore having not just current data but fine grained historical data going back a long way). Many of our filesystems and ZFS pools perpetually run relatively full and periodically fill up, and when they do this information can immediately tell people what happened, not just in the immediate past but over larger time scales too. Obviously we can easily get various sorts of summed up information, such as per-pool usage by person.
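These summed up views are simple PromQL aggregations over the metric. As a sketch, assuming that the filesystems in a given pool share a path prefix (the `/h/` prefix here comes from the example above; any real pool to filesystem mapping is an assumption):

```promql
# Per-user usage summed across all /h/* filesystems
# (a stand-in for "per-pool usage by person").
sum by (user) (cslab_user_used_bytes{filesystem=~"/h/.*"})

# Who grew the most on /h/281 over the last day, ie who
# probably just filled the filesystem up.
topk(5, delta(cslab_user_used_bytes{filesystem="/h/281"}[24h]))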
(Another use is finding Unix logins using space in filesystems we didn't expect. When I first set this up, I found root-owned stuff littered in all sorts of places, often by accident or by omission.)
Before I set this metric up and we started using it, I was nervous about the cardinality issue; in fact, cardinality worries kept me from doing this for a while, until various things pushed me over the edge. But now it's clear that the metric is very much worth it, despite all of those different time series it creates.
The large scale Prometheus lesson I took from this is that sometimes high cardinality metrics provide enough value that they're worth having anyway. You don't want to create unnecessary cardinality and you don't want to be excessive (or overload your Prometheus), but there's value in detail that isn't there in broad overviews. I should be cautious, but not too afraid.
(Now that the most recent versions of Prometheus will actually tell you about your highest cardinality metric names, I've found out that this metric is nowhere near our highest cardinality metrics. The highest cardinality one by far is node_systemd_unit_state, which is a standard host agent metric, although not one that is enabled in the default configuration.)
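You can also get a similar per-metric cardinality breakdown with a plain PromQL query, on older Prometheus versions too. This is a generic idiom for counting time series per metric name, not anything specific to our setup:

```promql
# The ten metric names with the most time series right now.
topk(10, count by (__name__) ({__name__=~".+"}))
```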