Wandering Thoughts archives


Some reasons to combine systemd-resolved with your private DNS resolver

Probably like many people, we have some machines that are set up as local DNS resolvers. Originally we had one set of resolvers for everyone, both our own servers and other people's machines on our internal networks, but after some recent issues we want to make DNS resolution on our own critical servers more reliable, partly by giving our servers a dedicated private DNS resolver. Right now all of our servers do DNS in the old-fashioned way, with an nsswitch.conf that tells them to use DNS and an /etc/resolv.conf that points to our two (now three, on some servers) DNS resolvers. One of the additional measures I've been considering is whether we want to use systemd-resolved on some servers.

Systemd-resolved has two features that make it potentially attractive for making server DNS more reliable. The obvious one is that it normally keeps a small cache of name resolutions (controlled by the Cache= configuration directive). Based on 'resolvectl statistics' on a few machines I have that are running systemd-resolved, this cache doesn't seem to get very big and doesn't achieve a very high hit rate, even on machines that are just sitting there doing nothing (and so are only talking to the same few hosts over and over again). I certainly don't think we can count on this cache to do very much if our normal DNS resolvers stop responding for some reason.

The second feature is much more interesting: systemd-resolved will rapidly switch to another DNS resolver if your initial one stops responding. In situations where you have multiple DNS servers (for a given network link or global setting, because systemd-resolved thinks in those terms), systemd-resolved maintains a 'current DNS server' and will send all traffic to it. If this server stops responding, resolved will switch over and then latch on to whichever of your DNS servers is still working. This makes the failure of your 'primary' DNS server much less damaging than in a pure /etc/resolv.conf situation, where every program has to fail over by itself (and I think some runtime environments may always try the first listed 'nameserver' and wait for it to time out).

The generally slow switching between nameservers listed in your resolv.conf means that you really want the first DNS resolver to stay responsive (whatever it is). Systemd-resolved makes it much less dangerous to add another DNS resolver alongside your regular ones, as long as you can trust it not to give wrong answers. If it stops working, the systems using it will switch over to your other DNS resolvers fast enough that very little will notice.
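As a concrete illustration, a resolved.conf for such a server might look something like the following sketch. The DNS= and Cache= directives are real resolved.conf settings; the IP addresses are placeholders (192.0.2.0/24 is a documentation range), with the private servers-only resolver listed first:

```ini
[Resolve]
# The private, servers-only resolver first, then the general ones.
# systemd-resolved will prefer the current server until it stops
# responding, then latch on to whichever of the others works.
DNS=192.0.2.53 192.0.2.10 192.0.2.11
# Cache= defaults to yes; shown here only to be explicit.
Cache=yes
```

You can check which server resolved is currently using with 'resolvectl status'.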

(Unfortunately getting those systems to switch back may be annoying, but in a sense you don't care whether or not they're using your special private DNS resolver that's just for them or one of your public DNS resolvers. If your public DNS resolvers get flooded by other people's traffic and stop responding, systemd-resolved will switch the systems back to your private DNS resolver again.)

PS: Of course there are configuration issues with systemd-resolved that you may need to care about, but very little is flawless.

SystemdResolvedWithDNSResolvers written at 23:19:50


I wish Linux exposed a 'OOM kills due to cgroup limits' kernel statistic

Under certain circumstances, Linux will trigger the Out-Of-Memory Killer and kill some process. For some time, there have been two general ways for this to happen: either a global OOM kill, because the kernel thinks it's totally out of memory, or a per-cgroup OOM kill, where a cgroup has hit a memory limit. These days the latter is quite easy to set up through systemd memory limits, especially user memory limits.

The kernel exposes a vmstat statistic for total OOM kills from all causes, as 'oom_kill' in /proc/vmstat; this is probably being surfaced in your local metrics collection agent under some name. Unfortunately, as far as I know the kernel doesn't expose a simple statistic for how many of those OOM kills are global OOM kills instead of cgroup OOM kills. This difference is of quite some interest to people monitoring their systems, because a global OOM kill is probably important while a cgroup OOM kill may be entirely expected.
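If you want to read this counter yourself rather than rely on a metrics agent, parsing it is simple; /proc/vmstat is a series of 'name value' lines. A minimal sketch:

```python
def read_oom_kills(vmstat_text):
    """Parse /proc/vmstat-style 'name value' lines and return the
    total OOM kill count (global plus cgroup), or None if the
    oom_kill field is absent (it appeared in kernel 4.13)."""
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    return None

# On a real system you would read the file itself:
# with open("/proc/vmstat") as f:
#     total = read_oom_kills(f.read())
```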

Each cgroup does have information about OOM kills in its hierarchy (or sometimes itself only, if you used the memory_localevents cgroups v2 mount option, per cgroups(7)). This information is in the 'memory.events' file, but as covered in the cgroups v2 documentation, this file is only present in non-root cgroups, which means that you can't find a system-wide version of this information in one place. If you know on a specific system that only one top level cgroup can have OOM kills, you can perhaps monitor that, but otherwise you need something more sophisticated (and in theory you might miss transient top level cgroups, although in practice most are persistent).
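One workable approximation is to sum the oom_kill counters from the memory.events files of just the top level cgroups. Without memory_localevents these counters are hierarchical, so the top level cgroups between them cover every cgroup OOM kill (though transient top level cgroups can still slip between samples). A sketch, assuming the usual cgroup v2 mount at /sys/fs/cgroup:

```python
import os

def toplevel_oom_kills(root="/sys/fs/cgroup"):
    """Sum the oom_kill counters from the memory.events files of the
    top level cgroups directly under 'root'.  The root cgroup itself
    has no memory.events file, so we look one level down."""
    total = 0
    for name in os.listdir(root):
        path = os.path.join(root, name, "memory.events")
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            for line in f:
                fields = line.split()
                # memory.events also has 'oom' (OOM conditions), which
                # is distinct from 'oom_kill' (processes killed).
                if len(fields) == 2 and fields[0] == "oom_kill":
                    total += int(fields[1])
    return total
```

Subtracting this from the system-wide oom_kill in /proc/vmstat would give an estimate of global OOM kills, which is the number I actually want.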

The kernel definitely knows this information; the kernel log messages for global OOM kills are distinctly different from the kernel log messages for cgroup OOM kills. So the kernel could expose this information, for example as a new /proc/vmstat field or two; it just doesn't (currently, as of fall 2023).

(Someday we may add a Prometheus cgroups metrics exporter to our host agents in our Prometheus environment and so collect this information, but so far I haven't found a cgroup exporter that I like and that provides the information I want to know.)

OOMFromCgroupStatisticWish written at 23:25:25


Restarting nfs-server on a Linux NFS (v3) server isn't transparent

A while back I wrote an article on enabling NFS v4 on an Ubuntu 22.04 fileserver (instead of just NFS v3), where one of the final steps was to restart 'nfsd', the NFS server daemon (sort of), with 'systemctl restart nfs-server'. In that article I said that as far as I could tell this entire process was transparent to NFS v3 clients that were talking to the NFS server. Unfortunately I have to take that back. Restarting 'nfs-server' will cause the NFS server to discard locks obtained by NFS v3 clients, without telling the NFS v3 clients anything about this. This results in the NFS v3 clients thinking that they hold locks while the NFS server believes that everything is unlocked, and so will let another client lock the same files.

(What happens with NFS v4 clients is more uncertain to me; they may more or less ride through things.)

On Linux, the NFS server is in the kernel and runs as kernel processes, generally visible in process lists as '[nfsd]'. You might wonder how these processes are started and stopped, and the answer is through a little user-level shim, rpc.nfsd. What this program actually does is write to some files in /proc/fs/nfsd that control the portlist, the NFS versions offered, and the number of kernel nfsd threads that are running. To restart (kernel) NFS service, the nfs-server.service unit first stops it with 'rpc.nfsd 0', telling the kernel to run zero nfsd threads, and then starts it again by writing an appropriate number of threads into place, which starts NFS service. The nfs-server.service systemd unit also does some other things.

(As a side note, you can see what NFS versions your NFS server is currently supporting by looking at /proc/fs/nfsd/versions. Sadly this can't be changed while there are NFS server threads running.)
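Roughly, the thread manipulation part of the restart amounts to the following commands (the paths and programs are real, but this is only a sketch of part of what the nfs-server.service unit does; you'd need root, and the thread count of 8 is just an example):

```shell
cat /proc/fs/nfsd/threads     # current number of nfsd threads
rpc.nfsd 0                    # stop: tell the kernel to run 0 threads
rpc.nfsd 8                    # start again with 8 threads
# or write the control file directly:
# echo 8 > /proc/fs/nfsd/threads
cat /proc/fs/nfsd/versions    # which NFS versions are being offered
```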

If you restart the kernel NFS server either with 'systemctl restart nfs-server' or by hand by writing '0' and then some number to /proc/fs/nfsd/threads, the kernel will completely drop knowledge of all locks from NFS v3 clients. Unfortunately running 'sm-notify' doesn't seem to recover them; they're just gone. Locks from NFS v4 clients suffer a somewhat less predictable and certain fate. If the NFS v4 client is actively doing NFS operations to the server, its locks will generally be preserved over a 'systemctl restart nfs-server'. If the client isn't actively doing NFS operations and doesn't do any for a while, I'm not certain that its locks will be preserved, and certainly they aren't immediately there (they seem to only come back when the NFS v4 client re-attaches to the server).

Looked at from the right angle, this makes sense. The kernel has to release locks from NFS clients when it stops being an NFS server, and a sensible signal that it's no longer an NFS server is when it's told to run zero NFS threads. However, it does seem to lead to an unfortunate result for at least NFS v3 clients.

NFSServerRestartLosesNFSv3Locks written at 23:12:58


A user program doing intense IO can manifest as high system CPU time

Recently, our IMAP server had unusually high CPU usage and was increasingly close to saturating its CPU. When I investigated with 'top' it was easy to see the culprit processes, but when I checked what they were doing with strace, they were all busy madly doing IO, in fact processing recursive IMAP LIST commands by walking around in the filesystem. Processes that do IO this intensely normally wind up in "iowait", not in active CPU usage (whether user or system CPU usage). Except here these processes were, using huge amounts of system CPU time.

What was happening is that these IMAP processes trying to do recursive IMAP LISTs of all available 'mail folders' had managed to escape into '/sys'. The processes were working away more or less endlessly because Dovecot (the IMAP server software we use) makes the entirely defensible but less common decision to follow symbolic links when traversing directory trees, and Linux's /sys has a lot of them (and may have ones that form cycles, so a directory traversal that follows symbolic links may never terminate). Since /sys is a virtual filesystem that is handled entirely inside the Linux kernel, traversing it and reading directories from it does no actual IO to actual disks. Instead, it's all handled in kernel code, and all of the work to traverse around it, list directories, and so on shows up as system time.

Operating on a virtual filesystem isn't the only way that a program can turn a high IO rate into high system time. You can get the same effect if you're repeatedly re-reading the same data that the kernel has cached in memory. Since the kernel can satisfy your IO requests without going to disk, all of the effort required turns into system CPU time inside the kernel. This is probably easiest to have happen with reading data from files, but you can also have programs that are repeatedly scanning the same directories or calling stat() (or lstat()) on the same filesystem names. All of those can wind up as entirely in-kernel activities because the modern Linux kernel is very good at caching things.
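A small sketch of the cached-metadata case: repeatedly lstat() the same name and measure process CPU time with os.times(). After the first call the kernel can answer from its dentry and inode caches, so the loop does no disk IO and the work accumulates as system CPU time (the path and iteration count are arbitrary examples):

```python
import os

def hammer_lstat(path, n=100000):
    """Repeatedly lstat() the same name and return the system CPU
    seconds consumed.  After the first call the kernel answers from
    its caches, so this burns system time rather than doing disk IO."""
    before = os.times()
    for _ in range(n):
        os.lstat(path)
    after = os.times()
    return after.system - before.system

# e.g. hammer_lstat("/etc/passwd")
```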

(Most people's IMAP servers don't have the sort of historical configuration issues we have that create these exciting adventures.)

UserIOCanBeSystemTime written at 22:11:06
