2021-09-11
How many Prometheus metrics a typical host here generates
When I think about new metrics sources for our Prometheus setup, often one of the things on my mind is the potential for a metrics explosion if I add metrics data that might have high cardinality. Now, sometimes high cardinality data is very worth it and sometimes data that might be high cardinality won't actually be, so this doesn't necessarily stop me. But in all of this, I haven't really developed an intuition for what is a lot of metrics (or time series) and what isn't. Recently it struck me that one relative measuring stick for this is how many metrics (i.e. time series) a typical host generates here.
Currently, most hosts only run the host agent, although its standard metrics are augmented with locally generated data. On machines that have our full set of NFS mounts, a major metrics source is a local set of metrics for Linux's NFS mountstats. A machine generating these metrics has anywhere from 6,000 to 8,000-odd time series. An Ubuntu Linux machine that doesn't generate these metrics generally has around 1,300 time series.
(Our modern OpenBSD machines, which also support the host agent, have around 150 time series.)
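For a rough per-host number without going through Prometheus at all, you can count the samples an exporter exposes, since every non-comment line in the Prometheus text format is one time series. This is only a sketch; the function name `count_series` is made up for illustration, and the URL in the usage comment assumes the host agent listens on port 9100, which this entry doesn't state.

```shell
# Count time series in Prometheus text exposition format read from stdin.
# Every line that is neither a '#' comment (HELP/TYPE) nor blank is one sample.
count_series() {
    grep -v -e '^#' -e '^[[:space:]]*$' | wc -l | tr -d ' '
}

# Hypothetical usage; the host and port are assumptions:
#   curl -s http://localhost:9100/metrics | count_series
```

The result will be slightly lower than what Prometheus stores, because Prometheus adds a handful of its own per-target series (such as `up`) on every scrape.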
Our valuable disk space usage metrics run from 7,400 time series on NFS fileservers where almost every user has some files, such as the fileserver hosting our /var/mail, down to under 2,000 on other fileservers. Some fileservers have significantly fewer, down to just over 300 on our newest and least used fileserver. Having these numbers gives me a new perspective on how "high cardinality" these metrics really are; at most, the metrics from one fileserver are roughly equivalent to adding another Ubuntu server with all our NFS mounts. More often they're equivalent to a standalone Ubuntu server.
This equivalence matters to me for thinking about new metrics because I add monitoring for new Ubuntu servers without thinking about it. If a new metrics source is equivalent to another Ubuntu server, I don't really need to think about it either (unless I'm going to do something like add it to each existing server, effectively doubling their metrics load). However, significantly raising the number of host equivalents that we monitor would be an issue, since currently the host agent is collectively our single largest source of metrics by far.
One interesting metrics source is Cloudflare's Linux eBPF exporter, which can be used to get things like detailed histograms of disk read and write IO times. I happen to be doing this on my office workstation, where it generates about 500 time series that cover two SATA SSDs and two NVMe drives. This suggests that it would be entirely feasible to add it to machines of interest, even our NFS fileservers, where at a very rough ballpark I might see about 2,400 new time series per server (each has 18 disks).
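The ballpark here is just linear scaling by disk count. As a sketch (the function name `estimate_series` is hypothetical, and it assumes every series is per-disk, which is unlikely to be exactly true):

```shell
# Scale an observed time series count linearly to a different number of disks.
# Arguments: observed series, observed disks, target disks.
# Assumes all series are per-disk, so treat the result as a rough guide only.
estimate_series() {
    echo $(( $1 * $3 / $2 ))
}

# 500 series over 4 disks on the workstation, scaled to 18 disks:
estimate_series 500 4 18
```

Strict scaling from the workstation's numbers gives 2,250, which is in the same very rough ballpark as the figure above.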
(For how to calculate this sort of thing for your own Prometheus setup, see my entry on how big our Prometheus setup is.)
Some things to reduce background bandwidth usage on a Fedora machine
Suppose, not entirely hypothetically, that you have a Fedora laptop and you want it to use minimal bandwidth for things that you don't specifically do. Unfortunately there are only a few things that I know of to do, and I'm not sure they're comprehensive. Most of my information comes from this old r/Fedora post.
First, turn off dnf-makecache. DNF's cache updates can apparently download significant amounts of data if you let them. Second, set your connection as metered in NetworkManager, which can be done with nmcli or through some but not all GUIs. In nmcli, it is:
nmcli connection modify <connection> connection.metered yes
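For the first step, turning off dnf-makecache, here is a minimal sketch. It assumes the periodic refresh is driven by the stock systemd unit dnf-makecache.timer; check what your Fedora version actually has. The function names and the DRY_RUN switch are made up for illustration, so you can see what would be run before running it.

```shell
# With DRY_RUN set to any non-empty value, each function prints its command
# instead of running it.

# Stop and disable DNF's periodic metadata refresh.
# Assumes the unit is called dnf-makecache.timer.
disable_dnf_makecache() {
    ${DRY_RUN:+echo} sudo systemctl disable --now dnf-makecache.timer
}

# Show the metered setting of a NetworkManager connection,
# to confirm the 'nmcli connection modify' took effect.
show_metered() {
    ${DRY_RUN:+echo} nmcli -g connection.metered connection show "$1"
}

# Hypothetical usage; the connection name is an assumption:
#   DRY_RUN=1 disable_dnf_makecache
#   show_metered home-wifi
```

Disabling the timer only stops the background refresh; DNF will still fetch metadata when you run it by hand.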
The NetworkManager nm-connection-editor GUI allows you to set the metered state of connections. In Cinnamon's regular network desklet thing (which is not nm-applet but integrated into Cinnamon's shell), there is both a "Network Settings" and a "Network Connections" option. Only the latter runs nm-connection-editor and lets you set the metered option in a GUI.
Finally, according to the Reddit post, you can also disable Gnome PackageKit refreshes with:
gsettings set org.gnome.software download-updates false
PackageKit and DNF may not be the only things in Cinnamon (or Gnome) that are probing for updates and so using up your limited bandwidth, but I haven't pinned down anything else yet. KDE probably has its own equivalent; XFCE is perhaps free of such annoyances.
In an ideal world, both DNF and PackageKit would take their cues from the connection being metered in NetworkManager. In this world, I'm not sure if they do (or if I trust them to really do this right). Turning them off entirely and doing manual refreshes on demand (which probably means "not at all while on a limited connection") is the easier and more definitely reliable way.
Simple network usage information can be extracted with 'ifconfig'. However, I believe those counters reset every time an interface comes and goes, which might happen more than you think if you move a machine around. As far as I know there's nothing in NetworkManager that keeps this information for NM "connections", which is usually the level you care about this on a laptop or other moving machine.
(I don't blame NetworkManager for this, since it's far from clear what stats people would be interested in and over what time ranges. I expect that NetworkManager developers are uninterested in a new sideline in a time series metrics database.)
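If you want the raw interface byte counters in a script-friendly form, Linux also exposes them under /sys/class/net. A minimal sketch follows; the function name `iface_bytes` and its optional second argument (an alternate sysfs root, there only so it can be tried against fake data) are made up for illustration.

```shell
# Print "rx_bytes tx_bytes" for a network interface from sysfs.
# $1 is the interface name; $2 optionally overrides the sysfs root.
iface_bytes() {
    stats="${2:-/sys/class/net}/$1/statistics"
    echo "$(cat "$stats/rx_bytes") $(cat "$stats/tx_bytes")"
}

# Hypothetical usage; the interface name is an assumption:
#   iface_bytes wlp2s0
```

As far as I know these are the same kernel counters that ifconfig reports, so they have the same limitation of going away when the interface does. To track usage across interface resets you would have to sample and record them yourself.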