2022-12-21
The Prometheus cardinality issues with systemd unit-related metrics
Over on the Fediverse I said something about cAdvisor's Prometheus metrics:
Unless I'm missing something, cAdvisor's Prometheus metrics export has no way to limit what cgroups it generates metrics for. This is fatal on systems used by users, since it leads to a label cardinality explosion as the cgroups reported on will include 'session-NNN.scope' cgroups, with a constantly increasing and basically never re-used NNN.
This is a general issue for any Prometheus metrics that are related to particular systemd units. You absolutely need to limit what units you report on or ingest metrics for, because some systemd unit names have essentially unlimited cardinality. Any metric that includes their name as a label will thus have label cardinality issues, and in practice this means all per-unit metrics are potentially a problem.
(As a side note, there are two ways of reporting unit names in metric labels; you can report the direct unit name, or you can report the entire path to the unit through systemd's hierarchy of units, which is also the cgroup structure it uses. The full hierarchy is often more dangerous for label cardinality explosions than the unit name alone.)
User session scope units are called 'session-<number>.scope', where the number increments steadily until the system is rebooted. These units live under 'user-<uid>.slice' units, so on many systems the full path to a user session is a label value that at best repeats extremely rarely, even across reboots (the same UID would have to be assigned the same session number again). The bare session unit name can only repeat after a reboot, when the numbering restarts; if reboots are infrequent, you'll accumulate a lot of essentially unique per-host label values.
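As a concrete illustration, here is roughly how a single user session can show up in each style of label. The metric and label names here are merely typical (the bare name style is what node_exporter's systemd collector uses, and the full path style is what cAdvisor's 'id' label holds); your exporters may differ:

    node_systemd_unit_state{name="session-4242.scope",state="active"} 1
    container_cpu_usage_seconds_total{id="/user.slice/user-1000.slice/session-4242.scope"} 17.3

Every new session number mints a brand new time series for every metric that carries it as a label value.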
But user session scope units (or scope and slice units more generally) aren't the only source of unit name problems. For example, services instantiated from a template by an 'Accept=yes' socket unit have unique, non-repeating per-connection names, and those are top level system '.service' units that you will normally report on even if you restrict yourself to system service units (provided that a given instance lives long enough to be scraped). There are probably other variable unit names lurking out there, as well as a profusion of unit names that aren't variable as such but are system-specific (such as the various '.device' units).
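(For example, a per-connection service instantiated from a template by an 'Accept=yes' socket unit gets an instance name along the lines of:

    sshd@3-192.0.2.10:22-198.51.100.7:40814.service

where the connection counter plus the remote address and port pretty much guarantee that the name never repeats.)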
(One can get an idea of the possibilities by looking at the templated systemd units installed on your system, although many of them probably aren't in use.)
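For instance, something like:

    systemctl list-unit-files '*@*'

will list the template unit files themselves (their names end in '@.service', '@.socket', and so on), which gives you a starting inventory of what instantiated units could show up.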
In an ideal world the exporter you're using will have a collection of features to limit what systemd units it reports on, so you can do things like filter out units you're not interested in, include only the unit types you care about, and perhaps limit how deep the exporter traverses (if it does something like walk systemd's cgroup hierarchy, as cAdvisor does). If your exporter doesn't have that, you'll need to use Prometheus's general relabeling features to drop the metrics at scrape time (typically with 'metric_relabel_configs'), or forgo using the exporter entirely.
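As a sketch of the Prometheus side, here is roughly what dropping cAdvisor's per-session metrics at scrape time might look like, assuming cAdvisor's usual 'id' label carrying the full cgroup path (the target address is a placeholder, and you'd adjust the label and regex for other exporters):

    scrape_configs:
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['yourhost:8080']
        metric_relabel_configs:
          # Drop any series whose cgroup path is a user session scope
          # or anything under one. Prometheus anchors the regex at
          # both ends.
          - source_labels: [id]
            regex: '.*/session-\d+\.scope(/.*)?'
            action: drop

Because 'metric_relabel_configs' is applied after the scrape but before ingestion, the offending time series never make it into Prometheus's storage, which is what matters for cardinality.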
More generally, any time an exporter's metrics include labels with systemd unit names, you need to take a closer look. It's potentially quite dangerous to start scraping such exporters without checking and thinking about this (and in fact I missed this issue the first time I looked into cAdvisor for systemd cgroup resource usage metrics).
(I've seen similar cardinality issues from systemd unit names in Grafana Loki, back when I discovered what can go wrong with label cardinality in Loki.)