The case of mysterious load average spikes on our Linux login server
We have a Linux login server that is our primary server basically by default; it's the first one in our numbering and the one that a convenient alias points to, so most people wind up using it. Naturally we monitor its OS level metrics as part of our Prometheus setup, and as part of that a graph of its load average (along with those of all our other interesting servers) appears on our overview Grafana dashboard. For basically as long as we've been doing this, we've noticed that this server experiences periodic and fairly drastic short term load average spikes for no clear reason.
A typical spike will take the 1-minute load average from 0.26 or
so (the typical load average for it) up to 6.5 or 7 in a matter of
seconds, and then immediately start dropping back down. There seems
to often be some correlation with other metrics, such as user and
system CPU time usage, but not necessarily a high one. We capture
ps and top output periodically for reasons beyond the scope of
this entry, and these captures have never shown anything in particular
even when they capture the high load average itself. The spikes
happen at all times, day or night and weekday or weekend, and don't
seem to come in any regular pattern (such as every five minutes).
The obvious theory for what is going on is that there are a bunch
of processes that have some sort of periodic wakeup where they do
a very brief amount of work, and they've wound up more or less in
sync with each other. When the periodic wakeup triggers, a whole
bunch of processes become ready to run and so spike the load average
up, but once they do run they don't do very much so the log-jam
clears almost immediately (and the load average immediately drops).
Since the spikes seem to be correlated with the number of logins, this may
be something in systemd's per-login process infrastructure. Since
all of these logins happen over SSH, it could also partly be because
we've set a ClientAliveInterval in our sshd_config, so sshd
likely wakes up periodically for some connections; however, I'm not
clear how that would wind up in sync for a significant number of
people.
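To make the theory a bit more concrete, here is a minimal Python sketch of how a single burst of runnable processes could produce a spike of this shape. It assumes the usual description of the Linux load average: roughly every five seconds the kernel samples the number of runnable (and uninterruptibly sleeping) tasks and folds that into an exponentially damped average. The burst size of 80 and the function names are made up for illustration; they are not measurements from our server.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between load samples (approximate)
WINDOW = 60.0          # time constant of the 1-minute load average
DECAY = math.exp(-SAMPLE_INTERVAL / WINDOW)  # about 0.92


def next_load(load, active):
    """Apply one update of the exponentially damped average.

    'active' is the number of tasks counted at the sampling instant.
    """
    return load * DECAY + active * (1.0 - DECAY)


def simulate(baseline, burst, steps_after=6):
    """Return the load after one sample that sees 'burst' tasks,
    followed by 'steps_after' samples that see no active tasks."""
    loads = [next_load(baseline, burst)]
    for _ in range(steps_after):
        loads.append(next_load(loads[-1], 0))
    return loads


if __name__ == "__main__":
    # Hypothetical: one sample happens to catch 80 runnable processes.
    for i, load in enumerate(simulate(0.26, 80)):
        print(f"t+{i * SAMPLE_INTERVAL:>4.0f}s  load1 = {load:.2f}")
```

Under these assumptions, one unlucky sample is enough to push the 1-minute figure from 0.26 to the mid 6s, and every later sample pulls it back down. It would also fit with ps and top captures showing nothing unusual, since the processes only need to be runnable at the sampling instant, not when the capture runs.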
I don't know how we'd go about tracking down the source of this without a lot of work, and I'm not sure there's any point in doing that work. The load spikes don't seem to be doing any harm, and I suspect there's nothing we could really do about the causes even if we identified them. I rather expect that having a lot of logins on a single Linux machine is now not a case that people care about very much.
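If we ever did want to start looking, one cheap first step might be to poll /proc/loadavg much more often than our metrics do and record which processes are runnable at the moment the count jumps. The following is only a sketch of that idea: the threshold, the polling interval, and all of the function names are hypothetical, and it only looks at each process's main thread through /proc/<pid>/stat rather than at every thread.

```python
import os
import time


def parse_loadavg(text):
    """Parse /proc/loadavg, e.g. '0.26 0.30 0.28 3/512 12345'.

    Returns (1-minute load, currently runnable entities, total entities).
    """
    fields = text.split()
    running, total = fields[3].split("/")
    return float(fields[0]), int(running), int(total)


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def busy_tasks(proc_root="/proc"):
    """List (pid, comm, state) for processes in state R or D right now."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # the process exited (or was unreadable) mid-scan
        if state in ("R", "D"):
            found.append((pid, comm, state))
    return found


def watch(threshold=5, interval=0.1, proc_root="/proc"):
    """Poll forever; print busy processes whenever many are runnable."""
    while True:
        with open(os.path.join(proc_root, "loadavg")) as f:
            load1, running, _total = parse_loadavg(f.read())
        if running >= threshold:
            stamp = time.strftime("%H:%M:%S")
            print(stamp, load1, running, busy_tasks(proc_root))
        time.sleep(interval)
```

Calling watch() and leaving it running would print a line each time the runnable count reaches the threshold. Scanning all of /proc ten times a second is not free, and very brief wakeups could still fall between polls, so this might well show nothing more than our existing captures do.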