2023-09-26
Some reasons to combine systemd-resolved with your private DNS resolver
Probably like many people, we have some machines that are set up
as local DNS resolvers. Originally we had one set for everyone,
both our own servers and other people's machines on our internal
networks, but after some recent
issues we want to make DNS resolution on our own critical servers
more reliable and are doing that partly by having a dedicated
private DNS resolver for our servers.
Right now all of our servers do DNS in the old-fashioned way, with
a nsswitch.conf
that tells them to use DNS and an /etc/resolv.conf
that points to our two (now three on some servers) DNS resolvers.
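As a sketch of what that looks like (with made-up addresses and a
made-up search domain rather than our real ones):

    # /etc/nsswitch.conf, the relevant line
    hosts: files dns

    # /etc/resolv.conf
    search example.org
    nameserver 192.0.2.10
    nameserver 192.0.2.11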
One of the additional measures I've been considering is whether we
want to use systemd-resolved
on some servers.
Systemd-resolved has two features that make it potentially attractive
for making server DNS more reliable. The obvious one is that it
normally has a small cache of name resolutions (the Cache=
configuration directive). Based on 'resolvectl statistics' on a
few machines I have that are running systemd-resolved, this cache
doesn't seem to get very big and doesn't get a very high hit rate,
even on machines that are just sitting there doing nothing (and so
are only talking to the same few hosts over and over again). I
certainly don't think we can count on this cache to do very much
if our normal DNS resolvers stop responding for some reason.
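For reference, the cache is on by default and is controlled in
resolved.conf; a minimal sketch of setting it explicitly:

    # /etc/systemd/resolved.conf
    [Resolve]
    Cache=yes

You then look at the counters with 'resolvectl statistics'.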
The second feature is much more interesting, and it's that
systemd-resolved will rapidly switch to another DNS resolver if
your initial one stops responding. In situations where you have
multiple DNS servers (for a given network link or global setting, because systemd-resolved thinks in those
terms), systemd-resolved maintains a 'current DNS server' and will
send all traffic to it. If this server stops responding, resolved
will switch over and then latch onto whichever of your DNS servers
is still working. This makes the failure of your 'primary' DNS
server much less damaging than in a pure /etc/resolv.conf situation.
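As a concrete sketch of the global version of this (the resolver
addresses are made up):

    # /etc/systemd/resolved.conf
    [Resolve]
    DNS=192.0.2.10 192.0.2.11 192.0.2.12

Systemd-resolved will pick one of these as its current server and
only move off it when it stops answering.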
In normal resolv.conf
handling, every program has to fail over
itself (and I think some runtime environments may always keep trying
the first listed 'nameserver' and waiting for it to time out).
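The traditional glibc mitigation is to shorten those timeouts with
resolv.conf options, which reduces the pain but doesn't remove it;
the values here are only illustrative:

    # /etc/resolv.conf
    options timeout:1 attempts:2
    nameserver 192.0.2.10
    nameserver 192.0.2.11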
The generally slow switching of nameservers listed in your resolv.conf means that you really want the first DNS resolver to stay responsive (whatever it is). Systemd-resolved makes it much less dangerous to add another DNS resolver alongside your regular ones, as long as you can trust it to not give wrong answers. If it stops working, the systems using it will switch over to other DNS resolvers fast enough that very little will notice.
(Unfortunately getting those systems to switch back may be annoying, but in a sense you don't care whether they're using your special private DNS resolver that's just for them or one of your public DNS resolvers. If your public DNS resolvers get flooded by other people's traffic and stop responding, systemd-resolved will switch the systems back to your private DNS resolver again.)
PS: Of course there are configuration issues with systemd-resolved that you may need to care about, but very little is flawless.
2023-09-24
I wish Linux exposed an 'OOM kills due to cgroup limits' kernel statistic
Under certain circumstances, Linux will trigger the Out-Of-Memory Killer and kill some process. For some time, there have been two general ways for this to happen, either a global OOM kill because the kernel thinks it's totally out of memory, or a per-cgroup based OOM kill where a cgroup has a memory limit. These days the latter is quite easy to set up through systemd memory limits, especially user memory limits.
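As a hedged illustration of what I mean by the latter, a per-user
memory cap can look roughly like this systemd drop-in (the path and
the size here are just examples):

    # /etc/systemd/system/user-.slice.d/50-memory.conf
    [Slice]
    MemoryMax=8G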
The kernel exposes a vmstat statistic for total OOM kills from all
causes, as 'oom_kill' in /proc/vmstat; this is probably being
surfaced in your local metrics collection agent under some name.
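(If your agent doesn't already pick it up, the count is trivial to
read yourself; a minimal sketch in Python:

    # read the total OOM kill count (from all causes) out of /proc/vmstat
    def total_oom_kills():
        with open("/proc/vmstat") as f:
            for line in f:
                name, value = line.split()
                if name == "oom_kill":
                    return int(value)
        return 0  # sufficiently old kernels don't have the field

    print(total_oom_kills())

)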
Unfortunately, as far as I know the kernel doesn't expose a simple
statistic for how many of those OOM kills are global OOM kills
instead of cgroup OOM kills. This difference is of quite some
interest to people monitoring their systems, because a global
OOM kill is probably important while a cgroup OOM kill may be
entirely expected.
Each cgroup does have information about OOM kills in its hierarchy
(or sometimes itself only, if you used the memory_localevents
cgroups v2 mount option, per cgroups(7)). This
information is in the 'memory.events' file, but as covered in
the cgroups v2 documentation, this file is
only present in non-root cgroups, which means that you can't find
a system wide version of this information in one place. If you know
on a specific system that only one top level cgroup can have OOM
kills, you can perhaps monitor that, but otherwise you need something
more sophisticated (and in theory you might miss transient top level
cgroups, although in practice most are persistent).
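If you do want to aggregate this yourself, one approach is to walk
the top-level cgroups and sum their counts; a rough sketch in Python,
assuming a cgroup v2 mount at /sys/fs/cgroup and the default
(non-localevents) event propagation, so that summing only the top
level avoids double counting:

    import os

    CGROUP_ROOT = "/sys/fs/cgroup"

    def cgroup_oom_kills():
        # Sum the 'oom_kill' counts from memory.events of every
        # top-level cgroup; with the default behaviour these counts
        # already include all of their descendants.
        total = 0
        for entry in os.scandir(CGROUP_ROOT):
            if not entry.is_dir(follow_symlinks=False):
                continue
            try:
                with open(os.path.join(entry.path, "memory.events")) as f:
                    for line in f:
                        name, value = line.split()
                        if name == "oom_kill":
                            total += int(value)
            except OSError:
                pass  # the cgroup vanished or has no memory controller
        return total

    print(cgroup_oom_kills())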
The kernel definitely knows this information; the kernel log messages for global OOM kills are distinctly different from the kernel log messages for cgroup OOM kills. So the kernel could expose this information, for example as a new /proc/vmstat field or two; it just doesn't (currently, as of fall 2023).
(Someday we may add a Prometheus cgroups metrics exporter to our host agents in our Prometheus environment and so collect this information, but so far I haven't found a cgroup exporter that I like and that provides the information I want to know.)
2023-09-20
Restarting nfs-server on a Linux NFS (v3) server isn't transparent
A while back I wrote an article on enabling NFS v4 on an Ubuntu
22.04 fileserver (instead of just NFS v3),
where one of the final steps was to restart 'nfsd', the NFS server
daemon (sort of), with 'systemctl restart nfs-server'. In that
article I said that as far as I could tell this entire process was
transparent to NFS v3 clients that were talking to the NFS server.
Unfortunately I have to take that back. Restarting 'nfs-server'
will cause the NFS server to discard locks obtained by NFS v3
clients, without telling the NFS v3 clients anything about this.
This results in the NFS v3 clients thinking that they hold locks
while the NFS server believes that everything is unlocked and so
will allow another client to lock the same files.
(What happens with NFS v4 clients is more uncertain to me; they may more or less ride through things.)
On Linux, the NFS server is in the kernel and runs as kernel
processes, generally visible in process lists as '[nfsd]'. You
might wonder how these processes are started and stopped, and the
answer is through a little user-level shim, rpc.nfsd. What this
program actually does is write to some files in /proc/fs/nfsd that control
the portlist, the NFS versions offered, and the number of kernel
nfsd threads that are running. To restart (kernel) NFS service, the
nfs-server.service unit first stops it with 'rpc.nfsd 0', telling
the kernel to run '0' nfsd threads, and then starts it again by
writing some appropriate number of threads into place, which starts
NFS service. The nfs-server.service systemd unit also does some
other things.
(As a side note, you can see what NFS versions your NFS server is currently supporting by looking at /proc/fs/nfsd/versions. Sadly this can't be changed while there are NFS server threads running.)
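You can look at this state directly; a small sketch in Python (it's
just reading files, so the language hardly matters), assuming the
nfsd filesystem is mounted at /proc/fs/nfsd as usual:

    # peek at the kernel NFS server's state via the nfsd filesystem
    # (depending on permissions you may need to run this as root)
    def nfsd(name):
        with open("/proc/fs/nfsd/" + name) as f:
            return f.read().strip()

    print("threads: ", nfsd("threads"))   # '0' means NFS service is stopped
    print("versions:", nfsd("versions"))  # which NFS versions are on or off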
If you restart the kernel NFS server either with 'systemctl restart
nfs-server' or by hand by writing '0' and then some number to
/proc/fs/nfsd/threads, the kernel will completely drop knowledge
of all locks from NFS v3 clients. Unfortunately running 'sm-notify'
doesn't seem to recover them; they're just gone.
Locks from NFS v4 clients suffer a somewhat less predictable and
certain fate. If the NFS v4 client is actively doing NFS operations
to the server, its locks will generally be preserved over a 'systemctl
restart nfs-server'. If the client isn't actively doing NFS
operations and doesn't do any for a while, I'm not certain that its
locks will be preserved, and certainly they aren't immediately there
(they seem to only come back when the NFS v4 client re-attaches to
the server).
Looked at from the right angle, this makes sense. The kernel has to release locks from NFS clients when it stops being an NFS server, and a sensible signal that it's no longer an NFS server is when it's told to run zero NFS threads. However, it does seem to lead to an unfortunate result for at least NFS v3 clients.
2023-09-13
A user program doing intense IO can manifest as high system CPU time
Recently, our IMAP
server had unusually high CPU usage and was increasingly close to
saturating its CPU. When I investigated with 'top' it was easy to
see the culprit processes, but when I checked what they were doing
with the strace
command, they were all busy madly doing IO, in
fact processing recursive IMAP LIST
commands by walking around in the
filesystem. Processes that intensely do IO like this normally wind
up in "iowait", not in
active CPU usage (whether user or system CPU usage). Except here
these processes were, using huge amounts of system CPU time.
What was happening is that these IMAP processes trying to do recursive
IMAP LISTs of all available 'mail folders' had managed to escape
into '/sys'. The processes were working away more or less endlessly
because Dovecot (the IMAP server
software we use) makes the entirely defensible but less common
decision to follow symbolic links when traversing directory trees, and Linux's /sys
has a
lot of them (and may have ones that form cycles, so a directory
traversal that follows symbolic links may never terminate). Since
/sys
is a virtual filesystem that is handled entirely inside the
Linux kernel, traversing it and reading directories from it does
no actual IO to actual disks. Instead, it's all handled in kernel
code, and all of the work to traverse around it, list directories,
and so on shows up as system time.
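You can reproduce a tame version of this; here's a little Python
sketch (not Dovecot's actual code, of course) that walks /sys with
and without following symlinks, capped so that the symlink-following
version can't run forever:

    import os

    def walk_dirs(top, follow, cap=100000):
        # Count the directories visited by a tree walk.  With
        # followlinks=True the walk can revisit directories through
        # /sys's many symlinks (and potentially loop), so we cap it.
        count = 0
        for dirpath, dirnames, filenames in os.walk(top, followlinks=follow):
            count += 1
            if count >= cap:
                break
        return count

    print("not following symlinks:", walk_dirs("/sys", False))
    print("following symlinks:    ", walk_dirs("/sys", True))

If you watch this run in top, the time should show up as system CPU
rather than iowait.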
Operating on a virtual filesystem isn't the only way that a program
can turn a high IO rate into high system time. You can get the same
effect if you're repeatedly re-reading the same data that the kernel
has cached in memory. Since the kernel can satisfy your IO requests
without going to disk, all of the effort required turns into system
CPU time inside the kernel. This is probably easiest to have happen
with reading data from files, but you can also have programs that
are repeatedly scanning the same directories or calling stat()
(or lstat()) on the same filesystem names. All of those can wind
up as entirely in-kernel activities because the modern Linux kernel
is very good at caching things.
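A minimal demonstration of the same effect with stat(), in Python:

    import os

    # Repeatedly stat() the same name; after the first call the kernel
    # answers from its caches, so the time shows up as system CPU
    # rather than disk IO.
    path = "/etc/passwd"   # any file that exists will do
    before = os.times()
    for _ in range(1_000_000):
        os.stat(path)
    after = os.times()
    print("user CPU:  ", after.user - before.user)
    print("system CPU:", after.system - before.system)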
(Most people's IMAP servers don't have the sort of historical configuration issues we have that create these exciting adventures.)