How many Prometheus metrics a typical host here generates

September 11, 2021

When I think about new metrics sources for our Prometheus setup, one of the things often on my mind is the potential for a metrics explosion if I add metrics data that might have high cardinality. Now, sometimes high cardinality data is very much worth it, and sometimes data that might be high cardinality won't actually be, so this doesn't necessarily stop me. But in all of this, I haven't really developed an intuition for what is a lot of metrics (or time series) and what isn't. Recently it struck me that one relative measuring stick for this is how many metrics (ie time series) a typical host generates here.

Currently, most hosts only run the host agent, although its standard metrics are augmented with locally generated data. On machines that have our full set of NFS mounts, a major metrics source is a local set of metrics for Linux's NFS mountstats. A machine generating these metrics has anywhere from 6,000 to 8,000-odd time series. An Ubuntu Linux machine that doesn't generate these metrics generally has around 1,300 time series.
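
If you want to get these sorts of per-host numbers for your own setup, one way is a PromQL query that counts time series per scrape target. This is a quick sketch that assumes your host agent's scrape job is called "node" (yours may be named differently); the generic version counts everything and can be expensive on a large setup:

    # Time series per host from the host agent's scrape job
    # (assuming the job is called "node"):
    count by (instance) ({job="node"})

    # Time series per host across all jobs (potentially expensive):
    count by (instance) ({__name__=~".+"})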

(Our modern OpenBSD machines, which also support the host agent, have around 150 time series.)

Our valuable disk space usage metrics range from around 7,400 time series on NFS fileservers where almost every user has some files, such as the fileserver hosting our /var/mail, down to under 2,000 time series on other fileservers. Some fileservers have significantly fewer, down to just over 300 on our newest and least used fileserver. Having these numbers gives me a new perspective on how "high cardinality" these metrics really are; at most, the metrics from one fileserver are roughly equivalent to adding another Ubuntu server with all our NFS mounts. More often they're equivalent to a standalone Ubuntu server.
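
(To get per-fileserver numbers like this, you can count the series of the disk space usage metric itself. Here "diskspace_used_bytes" is a stand-in name for illustration, not necessarily what your metric is called:)

    # Series per fileserver for a hypothetical per-user
    # disk space usage metric:
    count by (instance) (diskspace_used_bytes)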

This equivalence matters to me for thinking about new metrics because I add monitoring for new Ubuntu servers without thinking about it. If a new metrics source is equivalent to another Ubuntu server, I don't really need to think about it either (unless I'm going to do something like add it to each existing server, effectively doubling their metrics load). However, significantly raising the number of host equivalents that we monitor would be an issue, since currently the host agent is collectively our single largest source of metrics by far.

One interesting metrics source is Cloudflare's Linux eBPF exporter, which can be used to get things like detailed histograms of disk read and write IO times. I happen to be doing this on my office workstation, where it generates about 500 time series covering two SATA SSDs and two NVMe drives. This suggests that it would be entirely feasible to add it to machines of interest, even our NFS fileservers, where at a very rough ballpark I might see about 2,400 new time series per server (each fileserver has 18 disks, and my workstation's numbers work out to roughly 125 time series per drive).
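
The payoff of these histograms is being able to ask for things like per-device IO time percentiles. This is a sketch of such a query; "bio_latency_seconds_bucket" is a stand-in metric name, since what you actually get depends on your ebpf_exporter configuration:

    # 95th percentile block IO time per device over the last
    # five minutes ("bio_latency_seconds_bucket" is a stand-in
    # metric name that depends on your exporter configuration):
    histogram_quantile(0.95,
      sum by (device, le) (rate(bio_latency_seconds_bucket[5m])))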

(For how to calculate this sort of thing for your own Prometheus setup, see my entry on how big our Prometheus setup is.)
