In Prometheus, don't be afraid of high cardinality metrics if they're valuable enough

November 25, 2019

We generate any number of custom metrics that we feed into our local Prometheus monitoring setup. Most of them are pretty conventional, but one of them will probably raise a lot of eyebrows among people who are familiar with Prometheus and set them muttering about cardinality explosions. Alongside our more conventional metrics about how much disk space is free on our fileserver filesystems, we generate a per-user, per-filesystem disk space usage metric. In the abstract, this looks like:

cslab_user_used_bytes{user="cks", filesystem="/h/281"} 59533330944
cslab_user_used_bytes{user="cks", filesystem="/cs/mail"} 512

(Users that are not using any space on a filesystem do not get listed for that filesystem.)

Having a time series per user is generally not recommended, and having it per filesystem as well makes things worse. This metric generates a lot of distinct time series, and I'm sure a lot of people would tell us that maybe we shouldn't have it.
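
As a rough check on what a metric like this costs, PromQL can count the distinct time series directly; these queries just use the metric from the examples above:

# How many time series this one metric creates in total
count(cslab_user_used_bytes)

# The same count broken down by filesystem
count by (filesystem) (cslab_user_used_bytes)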

However, it's turned out that we derive a great deal of practical value from having this information and having it in Prometheus (and therefore having not just current data but fine grained historical data going back a long way). Many of our filesystems and ZFS pools perpetually run relatively full and periodically fill up, and when they do, this information can immediately tell people what happened, not just in the immediate past but over larger time scales too. Obviously we can also easily get various sorts of summed up information, such as per-pool usage by person.
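
As a sketch of the kind of aggregation queries involved, here are two examples; the second hypothetically assumes that a pool's filesystems all share a common name prefix, here /h/ (borrowed from the example above):

# Total space used by each person across all filesystems
sum by (user) (cslab_user_used_bytes)

# Per-person usage for one (hypothetical) pool whose
# filesystems all live under /h/
sum by (user) (cslab_user_used_bytes{filesystem=~"/h/.*"})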

(Another use is finding Unix logins that are using space in filesystems where we didn't expect them. When I first set this up, I found root-owned stuff littered in all sorts of places, often left there by accident or omission.)
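
A sketch of the sort of queries involved in that kind of check; the specific filesystem and user here are purely illustrative:

# Who is using how much space on a filesystem, biggest first
sort_desc(cslab_user_used_bytes{filesystem="/cs/mail"})

# Everywhere that root is using a nonzero amount of space
cslab_user_used_bytes{user="root"}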

Before I set this metric up and we started using it, I was nervous about the cardinality issue; in fact, cardinality worries kept me from doing this for a while, until various things pushed me over the edge. But now it's clear that the metric is very much worth it, despite all of those different time series it creates.

The large scale Prometheus lesson I took from this is that sometimes high cardinality metrics provide enough value that they're worth having anyway. You don't want to create unnecessary cardinality and you don't want to go to excess (or overload your Prometheus), but there's value in detail that isn't there in broad overviews. I should be cautious, but not too afraid.

(Now that the most recent versions of Prometheus will actually tell you about your highest cardinality metric names, through the TSDB status page in the web UI, I've found out that this metric is actually nowhere near our highest cardinality metrics. The highest cardinality one by far is node_systemd_unit_state, which is a standard host agent (node_exporter) metric, although not one that is enabled in the default configuration.)
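
If you want to look at this on an older Prometheus without that status page, one common way is a PromQL query along these lines (note that it can be expensive on a large Prometheus):

# The ten metric names with the most time series
topk(10, count by (__name__) ({__name__=~".+"}))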
