The uncertainty of an elevated load average on our Linux IMAP server
We have an IMAP server, using Dovecot on Ubuntu 18.04 and with all of its mail storage on our NFS fileservers. Because of historical decisions (cf), we've periodically had real performance issues with it; these issues have been mitigated partly through various hacks and partly through migrating the IMAP server and our NFS fileservers from 1G Ethernet to 10G (our IMAP server routinely reads very large mailboxes, and the faster that happens the better). However, the whole experience has left me with a twitch about problem indicators for our IMAP server, especially now that we have a Prometheus metrics system that can feed me lots of graphs to worry about.
For a while after we fixed up most everything (and with our old OmniOS fileservers), the IMAP server was routinely running at a load average of under 1. Since then its routine workday load average has drifted upward, so that a load average of 2 is not unusual and it's routine for it to be over 1. However, there are no obvious problems the way there used to be; 'top' doesn't show constantly busy IMAP processes, for example, indicators such as the percentage of time the system spends in iowait (which on Linux includes waiting for NFS IO) are consistently low, and our IMAP stats monitoring doesn't show any clear slow commands the way it used to.
To the extent that I have IMAP performance monitoring, it shows slow performance only for looking at our test account's INBOX.
(All user INBOXes are in our NFS /var/mail filesystem and some of them are very large, so it's a real hot spot and is kind of expected to be slower than other filesystems; there's only really so much we can do about it. Unfortunately we don't currently have Prometheus metrics from our NFS fileservers, so I can't easily tell if there's some obvious performance hotspot on that fileserver.)
All of this leaves me with two closely related mysteries. First, does this elevated load average actually matter? This might be the sign of some real IMAP performance problem that we should be trying to deal with, or it could be essentially harmless. Second, what is causing the load average to be high? Maybe we frequently have processes that are blocked waiting on IO or something else, or maybe we have processes that run in micro-bursts of CPU usage.
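One cheap way to poke at the second question is to look at what the processes are doing at a given instant. The Linux load average counts both runnable tasks (state 'R') and tasks in uninterruptible sleep (state 'D', which is where processes waiting on NFS IO usually show up), so a tally of process states from /proc could at least hint at which kind is inflating the number. What follows is only a rough sketch under some assumptions: the function names are made up for illustration, it looks only at each process's main thread, and a single snapshot can easily miss short bursts, so it would have to be sampled repeatedly to mean much.

```python
import os
from collections import Counter


def parse_state(stat_text):
    """Return the one-letter state field from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')' characters, so split on the last ')' before reading the state.
    return stat_text.rsplit(")", 1)[1].split()[0]


def parse_loadavg(text):
    """Parse /proc/loadavg text into (load1, runnable_now, total_entities)."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return float(fields[0]), int(runnable), int(total)


def count_states(proc_root="/proc"):
    """Tally process states (R, D, S, ...) for every PID under proc_root.

    Only the main thread of each process is examined (an assumption made
    to keep the sketch short); per-thread states would need a walk over
    /proc/<pid>/task/ as well.
    """
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, IndexError):
            # The process exited while we were looking, or the file was odd.
            continue
    return counts
```

Calling count_states() every second or so for a few minutes and logging the 'R' and 'D' counts next to the first field of parse_loadavg(open("/proc/loadavg").read()) would show whether the load is mostly runnable processes or mostly blocked ones. It would not say what the blocked ones are blocked on.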
(eBPF-based tracing might be able to tell us something about all of this, but eBPF tools are not really usable on Ubuntu 18.04 out of the box.)
Probably I should invest in developing some more IMAP performance measurements and also consider doing some measurements of the underlying NFS client disk IO, at least for simple operations like reading a file from a filesystem. We might not wind up with any more useful information than we already have, but at least I'd feel like I was doing something.
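As an illustration of the 'reading a file from a filesystem' sort of measurement, here is a minimal sketch in Python. The function name and the chunk size are arbitrary choices for the example, not anything we actually run. One large caveat is that the NFS client may satisfy some or all of the read from its page cache, so a fast result doesn't necessarily mean that the fileserver was fast; the test file would probably need to be rewritten from another machine between measurements, or be read rarely enough to fall out of cache.

```python
import time


def time_read(path, chunk_size=1024 * 1024):
    """Read path sequentially and return (bytes_read, elapsed_seconds)."""
    start = time.monotonic()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # An empty read means end of file.
                break
            total += len(chunk)
    return total, time.monotonic() - start
```

Running this periodically against a fixed test file on /var/mail and on some other NFS filesystem, and recording the elapsed time for each, would give a crude comparison of the two over the course of a day.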