Our Prometheus host metrics saved us from some painful experiences
A couple of weeks ago, a few days after a kernel upgrade on our servers, we had an Ubuntu 22.04 server basically die under a constant stream of kernel out-of-memory kills of both vital daemons and random bystander processes. There was no obvious clue as to why: no program or cgroup was consuming an unusual amount of memory. In the constant spew of text from the kernel as it repeatedly OOM-killed things, we did notice something:
kernel: [361299.864757] Unreclaimable slab info:
kernel: [361299.864757] Name          Used       Total
[...]
kernel: [361299.864924] kmalloc-2k    6676584KB  6676596KB
[...]
Among our Grafana dashboards is one that provides a relatively detailed look into the state of a particular server, including various bits of memory usage. After we rebooted the failing server I took a look at its dashboard, and immediately noticed that its 'Slab' memory usage was basically a diagonal line going up over time, starting from when it had been updated to the new kernel and rebooted a few days earlier.
This caused me to immediately go look at the Slab memory usage for other servers, and all of our 22.04 servers had the same behavior. All of them had a constantly increasing amount of slab memory usage (and in some digging, 'unreclaimable' slab memory usage); it was just that this particular server had a combination of usage and low(er) RAM that caused it to run out of memory sooner than anything else. It was clear we had a systemic issue that would take down every one of our 22.04 servers sooner or later, with a number of them already being alarmingly close to also running out of memory (including our Prometheus metrics server).
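The kind of query involved here is simple if you're collecting host metrics. As a sketch, assuming the standard Prometheus node_exporter metric names for /proc/meminfo fields (our actual dashboard queries aren't quoted in this entry):

```
# Unreclaimable slab memory on each host, as exported by node_exporter
# from /proc/meminfo:
node_memory_SUnreclaim_bytes

# Its per-second growth trend over the past day, which makes a steady
# leak stand out as a roughly constant positive value across hosts:
deriv(node_memory_SUnreclaim_bytes[1d])
```

Graphing the first expression per host is what produces the telltale diagonal line; the second is one way to turn that line into something you could alert on.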
At first this looked very much like an issue with the new kernel. But it occurred to us that we'd effectively made another kernel change at the same time. Back at the start of August, after discovering that AppArmor profiles had started activating themselves, we'd set a kernel command line option to turn off AppArmor in the kernel. Activating that option requires a reboot (to use the new command line), and we hadn't rebooted most machines until our kernel update. However, a few 22.04 machines had been rebooted earlier with the command line update in place, and some of those machines were even running older kernels. Inspection of Prometheus host metrics showed that their Slab usage had started going up in the same pattern from the moment they were rebooted, including on the machines with older kernels.
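For illustration, the shape of such a change on Ubuntu looks something like the following (the entry doesn't quote the exact option we set; 'apparmor=0' is the kernel's documented parameter for disabling AppArmor at boot):

```
# In /etc/default/grub, add the option to the kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="... apparmor=0"

# Then regenerate the GRUB configuration; the change only takes
# effect on the next reboot, which is why most machines didn't see
# it until the kernel update forced one:
#   update-grub
```

The delayed effect is the important part of the story: the command line change and the kernel update landed on most machines in the same reboot, which is what made them so easy to confuse.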
We immediately reverted this kernel command line change on a few machines that we could readily reboot without affecting people (including the Prometheus metrics server), while leaving them using the current kernel. Within a few hours it was fairly clear that disabling AppArmor on the kernel command line was the trigger for this kernel memory leak, and by the next morning it was basically certain. We reverted the kernel command line change everywhere and started scheduling server reboots for all of our 22.04 machines.
(We also filed Ubuntu bug 1987430.)
Without our Prometheus and Grafana setup, this most likely would have been a rather different and more painful experience. We probably would have written off the first server running out of memory as a one-time piece of weirdness and only started reacting when a second server had the same thing happen to it a day or two later (and there probably would have been a succession of servers hitting limits by then). Then it might have taken longer to realize that we had a steady slab leak over time, and we'd probably have blamed the recent kernel update and spent a bunch of time and effort reverting to a previous 22.04 kernel without actually fixing the problem. As it was, our Grafana dashboards surfaced a big indicator of the problem right away, and then our historical data let us see that it wasn't actually the recent kernel update at fault.
Most of the time our metrics system just seems nice and useful, not a critical thing (alerts are critical, but those don't necessarily require metrics and metric history). This was not one of those times; it's one of the few times where having metrics, both current and historical, clearly saved our bacon. A part of me feels that this incident justifies our metrics systems all by itself.
(This elaborates on a Fediverse post of mine, and also a tweet.)