Wandering Thoughts archives

2020-02-17

The uncertainty of an elevated load average on our Linux IMAP server

We have an IMAP server running Dovecot on Ubuntu 18.04, with all of its mail storage on our NFS fileservers. Because of historical decisions (cf), we've periodically had real performance issues with it; these issues have been mitigated partly through various hacks and partly through migrating the IMAP server and our NFS fileservers from 1G Ethernet to 10G (our IMAP server routinely reads very large mailboxes, and the faster that happens, the better). However, the whole experience has left me with a twitch about problem indicators for our IMAP server, especially now that we have a Prometheus metrics system that can feed me lots of graphs to worry about.

For a while after we fixed up most everything (and with our old OmniOS fileservers), the IMAP server was routinely running at a load average of under 1. Since then its routine workday load average has drifted upward, so that a load average of 2 is not unusual and it's routine for it to be over 1. However, there are no obvious problems the way there used to be; 'top' doesn't show constantly busy IMAP processes, for example, indicators such as the percentage of time the system spends in iowait (which on Linux includes waiting for NFS IO) are consistently low, and our IMAP stats monitoring doesn't show any clear slow commands the way it used to. To the extent that I have IMAP performance monitoring, it only shows slow performance for looking at our test account's INBOX, not really other mailboxes.

(All user INBOXes are in our NFS /var/mail filesystem and some of them are very large, so it's a really hot spot and is kind of expected to be slower than other filesystems; there's only really so much we can do about it. Unfortunately we don't currently have Prometheus metrics from our NFS fileservers, so I can't easily tell if there's some obvious performance hotspot on that fileserver.)
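Since we don't currently have metrics from the fileservers themselves, one crude thing we could do from the IMAP server side is simply time reading a file from an NFS filesystem. What follows is only a minimal sketch of the idea, not a finished tool; the path is a placeholder, and NFS client caching means that a repeated read of the same file may never touch the fileserver at all.

    #!/usr/bin/env python3
    # Minimal sketch: time how long it takes to read a file that lives on an
    # NFS mount, as a rough proxy for NFS client read performance. The path
    # below is a placeholder, not a real one of ours.
    import time

    TEST_FILE = "/path/to/nfs/testfile"   # hypothetical path on an NFS mount
    CHUNK = 1024 * 1024                   # read in 1 MiB chunks

    def time_read(path):
        # Caveat: the NFS client cache means a second read of the same file
        # may be served locally; fresh files give more honest numbers.
        start = time.monotonic()
        total = 0
        with open(path, "rb", buffering=0) as f:
            while True:
                data = f.read(CHUNK)
                if not data:
                    break
                total += len(data)
        return total, time.monotonic() - start

    if __name__ == "__main__":
        nbytes, secs = time_read(TEST_FILE)
        rate = nbytes / secs / 1e6 if secs > 0 else 0.0
        print(f"read {nbytes} bytes in {secs:.3f} seconds ({rate:.1f} MB/s)")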

All of this leaves me with two closely related mysteries. First, does this elevated load average actually matter? This might be the sign of some real IMAP performance problem that we should be trying to deal with, or it could be essentially harmless. Second, what is causing the load average to be high? Maybe we frequently have processes that are briefly blocked waiting on IO or something else, or that run in micro-bursts of CPU usage.

(eBPF based tracing might be able to tell us something about all of this, but eBPF tools are not really usable on Ubuntu 18.04 out of the box.)
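Short of eBPF, a crude way to get some information here is to sample the kernel's instantaneous counts of runnable and blocked processes alongside the load average, since that's roughly what feeds the Linux load average. This is only a sketch of the idea; the one-second interval and ten-minute run are arbitrary choices.

    #!/usr/bin/env python3
    # Minimal sketch: sample /proc/loadavg and the procs_running and
    # procs_blocked counters from /proc/stat, to get a rough idea of whether
    # an elevated load average comes from runnable processes (CPU
    # micro-bursts) or from processes blocked in uninterruptible IO waits.
    import time

    def proc_counts():
        running = blocked = 0
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("procs_running"):
                    running = int(line.split()[1])
                elif line.startswith("procs_blocked"):
                    blocked = int(line.split()[1])
        return running, blocked

    def load1():
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])

    if __name__ == "__main__":
        for _ in range(600):       # ten minutes of one-second samples
            running, blocked = proc_counts()
            print(f"{time.strftime('%H:%M:%S')} load1={load1():5.2f} "
                  f"running={running} blocked={blocked}")
            time.sleep(1)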

Probably I should invest in developing some more IMAP performance measurements and also consider doing some measurements of the underlying NFS client disk IO, at least for simple operations like reading a file from a filesystem. We might not wind up with any more useful information than we already have, but at least I'd feel like I was doing something.
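As an illustration of the sort of extra IMAP measurement I mean, here is a minimal sketch using Python's standard imaplib to time SELECTing a few mailboxes from a test account. The host, account, and mailbox names are placeholders, and this isn't necessarily how any real monitoring of ours works.

    #!/usr/bin/env python3
    # Minimal sketch: time IMAP SELECT of a few mailboxes. All names below
    # are hypothetical placeholders.
    import imaplib
    import time

    HOST = "imap.example.com"       # hypothetical server name
    USER = "testaccount"            # hypothetical test account
    PASSWORD = "not-this"           # in reality this comes from somewhere secure
    MAILBOXES = ["INBOX", "some-other-mailbox"]

    def time_select(conn, mailbox):
        start = time.monotonic()
        status, data = conn.select(mailbox, readonly=True)
        elapsed = time.monotonic() - start
        msgs = data[0].decode() if status == "OK" else "?"
        return status, msgs, elapsed

    if __name__ == "__main__":
        conn = imaplib.IMAP4_SSL(HOST)
        conn.login(USER, PASSWORD)
        for mbox in MAILBOXES:
            status, msgs, secs = time_select(conn, mbox)
            print(f"SELECT {mbox}: {status}, {msgs} messages, {secs:.3f} seconds")
        conn.logout()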

linux/LoadAverageIMAPImpactQuestion written at 22:22:22

The case of mysterious load average spikes on our Linux login server

We have a Linux login server that is our primary server basically by default; it's the first one in our numbering and the server that a convenient alias points to, so most people wind up using it. Naturally we monitor its OS-level metrics as part of our Prometheus setup, and as part of that a graph of its load average (along with those of all our other interesting servers) appears on our overview Grafana dashboard. For basically as long as we've been doing this, we've noticed that this server experiences periodic and fairly drastic short-term load average spikes for no clear reason.

A typical spike will take the 1-minute load average from around 0.26 (its typical load average) up to 6.5 or 7 in a matter of seconds, and then immediately start dropping back down. There often seems to be some correlation with other metrics, such as user and system CPU time usage, but not necessarily a strong one. We capture ps and top output periodically for reasons beyond the scope of this entry, and these captures have never shown anything in particular, even when they coincide with a high load average. The spikes happen at all times, day or night and weekday or weekend, and don't seem to come in any regular pattern (such as every five minutes).

The obvious theory for what is going on is that there are a bunch of processes that have some sort of periodic wakeup where they do a very brief amount of work, and they've wound up more or less in sync with each other. When the periodic wakeup triggers, a whole bunch of processes become ready to run and so spike the load average up, but once they do run they don't do very much, so the log-jam clears almost immediately (and the load average immediately drops). Since the effect seems to be correlated with the number of logins, this may be something in systemd's per-login process infrastructure. Since all of these logins happen over SSH, it could also partly be because we've set a ClientAliveInterval in our sshd_config, so sshd likely wakes up periodically for some connections; however, I'm not clear on how that would wind up in sync for a significant number of people.
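If I did want to chase this, one low-effort approach (given that our ps and top captures keep missing the moment) might be to poll the instantaneous count of runnable processes from /proc/loadavg a few times a second and, when it jumps, dump which processes are currently in R or D state. This is only a sketch of the idea; the threshold and poll interval are arbitrary guesses.

    #!/usr/bin/env python3
    # Minimal sketch: watch for bursts of runnable processes and list who is
    # in R (running) or D (uninterruptible sleep) state at that moment.
    import os
    import time

    THRESHOLD = 5      # 'interesting' number of runnable processes; a guess
    INTERVAL = 0.2     # seconds between polls; also a guess

    def runnable_now():
        # the fourth field of /proc/loadavg is 'runnable/total' entities
        with open("/proc/loadavg") as f:
            return int(f.read().split()[3].split("/")[0])

    def busy_processes():
        procs = []
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open(f"/proc/{pid}/stat") as f:
                    stat = f.read()
            except OSError:
                continue   # the process exited between listdir() and open()
            # the state field comes right after the ')' closing the command name
            state = stat.rsplit(")", 1)[1].split()[0]
            if state in ("R", "D"):
                comm = stat.split("(", 1)[1].rsplit(")", 1)[0]
                procs.append((pid, state, comm))
        return procs

    if __name__ == "__main__":
        while True:
            n = runnable_now()
            if n >= THRESHOLD:
                print(f"{time.strftime('%H:%M:%S')}: {n} runnable")
                for pid, state, comm in busy_processes():
                    print(f"  {pid} {state} {comm}")
            time.sleep(INTERVAL)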

I don't know how we'd go about tracking down the source of this without a lot of work, and I'm not sure there's any point in doing that work. The load spikes don't seem to be doing any harm, and I suspect there's nothing we could really do about the causes even if we identified them. I rather expect that having a lot of logins on a single Linux machine is now not a case that people care about very much.

linux/LoadAverageMultiuserSpikes written at 01:19:38

