The DBus daemon and out of memory conditions (and systemd)
We have a number of systems where, for reasons beyond the scope of
this entry, we enable strict overcommit. In
this mode, when you reach the system's memory limits the Linux
kernel will deny memory allocations but usually not trigger the
OOM killer to terminate processes. It's up
to programs to deal with failed memory allocations as best they
can, which doesn't always go very well. On the most common machines
that we operate this way, we've set the
vm.admin_reserve_kbytes sysctl to reserve enough space for root
so that most or all of our system management scripts keep working
and we at least don't get deluged in email from cron about jobs
failing. This mostly works.
(The sysctl is documented in vm.txt.)
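As a rough illustration of the settings involved, here is a minimal Python sketch that reads the overcommit mode and the admin reserve from /proc/sys/vm. The function names are made up for this example, and it assumes the usual kernel convention that an overcommit_memory value of 2 means strict overcommit.

```python
from pathlib import Path

# Per the kernel's vm sysctl documentation, overcommit_memory is
# 0 (heuristic), 1 (always overcommit) or 2 (strict, never overcommit).
STRICT_OVERCOMMIT = 2


def read_vm_settings(proc_sys_vm="/proc/sys/vm"):
    """Return the overcommit mode and admin reserve as a dict of ints.

    The directory is a parameter only so the function can be pointed
    at a test directory instead of the live /proc.
    """
    base = Path(proc_sys_vm)
    names = ("overcommit_memory", "admin_reserve_kbytes")
    return {name: int((base / name).read_text().strip()) for name in names}


def is_strict_overcommit(settings):
    """True if the settings dict describes strict overcommit mode."""
    return settings["overcommit_memory"] == STRICT_OVERCOMMIT


if __name__ == "__main__":
    settings = read_vm_settings()
    print("strict overcommit:", is_strict_overcommit(settings))
    print("admin reserve (KiB):", settings["admin_reserve_kbytes"])
```

This only reports the current values; changing them is still a job for sysctl and its configuration files.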
Recently several of these machines hit an interesting failure mode
that required rebooting them, even after the memory pressure had
ended. The problem is DBus, or more specifically the DBus daemon.
The direct manifestation of the problem is that dbus-daemon logs
an error message:
dbus-daemon: [system] dbus-daemon transaction failed (OOM), sending error to sender inactive
After this error message is logged, attempts to do certain sorts of systemd-related DBus operations hang until they time out (if the software doing them has a timeout). Logins over SSH take quite a while to give you a shell, for example, as they fail to create sessions:
pam_systemd(sshd:session): Failed to create session: Connection timed out
The most relevant problem for us on these machines is that attempts to query metrics from the Prometheus host agent start hanging, likely because we have it set to pull information from systemd and this is done over DBus. Eventually there are enough hung metric probes that the host agent starts refusing our attempts immediately.
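The two log messages quoted above are distinctive enough to look for mechanically. Here is a minimal Python sketch that classifies log lines by plain substring matching. The marker strings are taken from the messages above, while the function names and labels are invented for this example. It makes no assumption about where your logs live; you feed it lines from whatever source you have.

```python
# Substrings taken from the two log messages quoted earlier.
OOM_MARKER = "dbus-daemon transaction failed (OOM)"
SESSION_MARKER = "Failed to create session: Connection timed out"


def classify_log_line(line):
    """Return a short label for a line showing either symptom, else None."""
    if OOM_MARKER in line:
        return "dbus-oom"
    if SESSION_MARKER in line:
        return "session-timeout"
    return None


def count_symptoms(lines):
    """Count how many lines show each symptom in an iterable of lines."""
    counts = {"dbus-oom": 0, "session-timeout": 0}
    for line in lines:
        label = classify_log_line(line)
        if label is not None:
            counts[label] += 1
    return counts
```

Seeing the first message followed by a run of the second is a reasonable hint that a machine is in this state, although the sketch itself only counts lines and draws no conclusion.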
The DBus daemon is not easy to restart (systemd will normally refuse
to let you do it directly, for example), so I haven't found any
good way of clearing this state. So far my method of recovering a
system in this state is to reboot it, which I generally have to do
with 'reboot -f' because a plain 'reboot' hangs (it's probably
trying to talk to systemd over DBus).
I believe that part of what creates this issue is that the DBus
daemon is not protected by vm.admin_reserve_kbytes. That
sysctl specifically reserves space for UID 0 processes, but
dbus-daemon doesn't run as UID 0; it runs as its own UID (often
messagebus), for good security related reasons. As far as I know,
there's no way to protect an arbitrary UID through
vm.admin_reserve_kbytes; it specifically applies only to processes
that hold a relatively powerful Linux security capability,
cap_sys_admin. And unified cgroups (cgroup v2) don't have
a true guaranteed memory reservation, just a best effort one (and
we're using cgroup v1 anyway, which doesn't have anything here).
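One way to see whether a given process would count for the admin reserve is to look at its effective capability set, which is exposed as the CapEff line in /proc/<pid>/status. The following is a minimal Python sketch. It assumes that cap_sys_admin is capability number 21 (as in linux/capability.h) and that the kernel's check is against the effective set; the function names are made up for this example.

```python
# CAP_SYS_ADMIN is capability number 21 in linux/capability.h.
CAP_SYS_ADMIN = 21


def effective_caps(status_text):
    """Parse the CapEff hex bitmask out of /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap_sys_admin(status_text):
    """True if the effective capability set includes CAP_SYS_ADMIN."""
    return bool(effective_caps(status_text) & (1 << CAP_SYS_ADMIN))


def pid_has_cap_sys_admin(pid):
    """Convenience wrapper that reads the status file for a live PID."""
    with open(f"/proc/{pid}/status") as f:
        return has_cap_sys_admin(f.read())
```

Running pid_has_cap_sys_admin() against the PID of dbus-daemon and against the PID of a root shell should show the difference described above, assuming that dbus-daemon has dropped its privileges in the usual way.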
We're probably making this DBus issue much more likely to happen by having the Prometheus host agent talk to systemd, since this generates DBus traffic every time our Prometheus setup pulls host metrics from the agent (currently, every 15 seconds). At the same time, the systemd information is useful to find services that are dead when they shouldn't be and other problems.
(It would be an improvement if the Prometheus host agent would handle this sort of DBus timeout during queries, but that would only mean we got host metrics back, not that DBus was healthy again.)
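To make the idea of handling such timeouts concrete, here is a minimal Python sketch of a probe that turns a hang into a reported failure instead of blocking forever. This is not the host agent's actual code; the function name and the five second default are invented for this example, and using 'busctl list' as the probe command is only an assumption about what makes a reasonable DBus check.

```python
import subprocess


def probe_with_timeout(cmd, timeout_seconds=5.0):
    """Run cmd and return (ok, detail).

    If the command does not finish in time, subprocess.run() kills it
    and we report a failure rather than hanging along with it.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return (False, "timed out")
    except OSError as exc:
        # For example, the command is not installed on this machine.
        return (False, f"could not run: {exc}")
    output = result.stdout.decode(errors="replace").strip()
    return (result.returncode == 0, output)


if __name__ == "__main__":
    # Assumption: listing the names on the system bus is a cheap way
    # to notice a wedged dbus-daemon.
    ok, detail = probe_with_timeout(["busctl", "list"])
    print("DBus probe ok:", ok)
    if not ok:
        print("detail:", detail)
```

As with the host agent, a probe like this only keeps the caller from getting stuck. It does nothing to make DBus healthy again.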
PS: For us, all of this is happening on Ubuntu 18.04 with their version of systemd 237 and dbus 1.12.2. However, I suspect that this isn't Ubuntu specific. I also doubt that this is systemd specific; I rather suspect that any DBus service using the system bus is potentially affected, and it's just that the most commonly used ones are from systemd and its related services.
(In fact, on our Ubuntu 18.04 servers there doesn't seem to be much on the system bus apart from systemd related things, so if there are DBus problems at all, they're going to be experienced with them.)