In Linux, hitting a strict overcommit limit doesn't trigger the OOM killer
By now, we're kind of used to our Linux machines running out of memory, because people or runaway processes periodically do it (to our primary login server, to our primary web server, or sometimes other service machines). It has a familiar litany of symptoms, starting with massive delays and failures in things like SSH logins and ending with Linux's OOM killer activating to terminate some heavy processes. If we're lucky the OOM killer will get the big process right away; if we're not, it will first pick off a few peripheral ones before getting the big one.
However, every so often recently we've been having some out of memory situations on some of our machines that didn't look like this. We knew the machines had run out of memory because log messages told us:
systemd-networkd: eno1: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
[...]
systemd: user@NNN.service: Failed to fork: Cannot allocate memory
[...]
sshd: fatal: fork of unprivileged child failed
sshd: error: fork: Cannot allocate memory
That's all pretty definitely telling us about a memory failure (note that /run is a tmpfs filesystem, and so 'out of space on device' means 'out of memory'). What we didn't see was any indication that the OOM killer had been triggered. There were no kernel messages about it, for example, and the oom_kill counter in /proc/vmstat stubbornly reported '0'. We spent some time wondering where the used memory was going so that we didn't really see it and, more importantly, why the kernel didn't think it had to invoke the OOM killer. Was the kernel failing to account for memory used in tmpfs somewhere, for example?
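As a concrete illustration (a sketch, not our actual monitoring), checking whether the OOM killer has ever fired can look something like this; modern kernels expose an oom_kill counter in /proc/vmstat:

```python
import os

# Sketch: read the kernel's cumulative OOM-kill count from /proc/vmstat
# (the oom_kill counter, present on modern kernels). Illustrative only.

def vmstat_counter(text, name):
    """Find a 'name value' counter in /proc/vmstat-style text."""
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key == name:
            return int(value)
    return None

if __name__ == "__main__" and os.path.exists("/proc/vmstat"):
    with open("/proc/vmstat") as f:
        count = vmstat_counter(f.read(), "oom_kill")
    # On the machines in question this reported 0 even though memory
    # allocations were visibly failing.
    print("oom_kill:", count)
```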
(In the process of looking into this issue I did learn that memory used by tmpfs shows up in /proc/meminfo's Shmem field. Tmpfs also apparently gets added to Cached, which is a bit misleading since it can't be freed up, unlike a lot of what else gets counted there.)
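To see this accounting for yourself, you can pull the Shmem and Cached fields out of /proc/meminfo; a small parsing sketch (assuming the usual 'Field: N kB' format):

```python
import os

# Sketch: extract Shmem and Cached from /proc/meminfo. Since Cached
# apparently includes tmpfs/Shmem pages, 'Cached - Shmem' is a rougher
# but more honest figure for page cache that can actually be freed.

def meminfo_kb(text, field):
    """Return a /proc/meminfo field's value in kB, or None."""
    for line in text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    return None

if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        text = f.read()
    shmem = meminfo_kb(text, "Shmem")
    cached = meminfo_kb(text, "Cached")
    print("Shmem:", shmem, "kB; freeable-ish cache:", cached - shmem, "kB")
```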
Then last night the penny dropped and I came to a sudden realization about what was happening. These machines were running into strict overcommit limits, and when your machine hits strict overcommit limits, the kernel OOM killer is not triggered (or at least, isn't necessarily triggered). Most of our machines don't use strict overcommit (and this is generally the right answer), but our Linux compute servers do have it turned on, and it was our compute servers that we were experiencing these unusual out of memory situations on. This entirely explains how we could be out of memory without the kernel panicking about it; we had simply run into the limit of how much memory we told the kernel to allow people to allocate.
(Since the OOM killer wasn't invoked, it seems likely that some of this allocated memory space wasn't in active use and may not even have been touched.)
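Under strict overcommit (vm.overcommit_memory set to 2), the kernel enforces the CommitLimit figure shown in /proc/meminfo, and Committed_AS is how much address space is currently committed. A sketch of watching how close a machine is to the cliff:

```python
import os

# Sketch: under strict overcommit, once Committed_AS reaches CommitLimit,
# memory allocations and fork()s start failing with ENOMEM even though
# the OOM killer never runs. Both figures are in /proc/meminfo.

def meminfo_kb(text, field):
    for line in text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    return None

def commit_headroom_kb(text):
    """kB of commit space left before strict overcommit refuses you."""
    return meminfo_kb(text, "CommitLimit") - meminfo_kb(text, "Committed_AS")

if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print("commit headroom:", commit_headroom_kb(f.read()), "kB")
```

(Without strict overcommit enabled, Committed_AS can legitimately exceed CommitLimit, so the headroom figure only means something on machines where it's turned on.)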
In a way, things are working exactly as designed. We said to use strict overcommit on these machines and the kernel is dutifully carrying out what we told it to do. We're enforcing memory limits that ensure these machines don't get paralyzed, and they mostly don't. In another way, how this is happening is a bit unfortunate. If the OOM killer activates, generally you lose a memory hog but other things aren't affected (in our environment the OOM killer seems pretty good at only picking such processes). But if a machine runs into the strict overcommit limit, lots of things can start failing because they suddenly can't allocate memory, can't fork, can't start new processes or daemons, and so on. Sometimes this leaves things in a failed or damaged state, because your average Unix program simply doesn't expect memory allocation or fork or the like to fail. In an environment where we're running various background tasks for system maintenance, this can leave a real mess behind.
(Go programs panic, for instance. We got a lot of stack traces from the Prometheus host agent.)
One of the things that we should likely be doing to deal with this is increasing the vm.admin_reserve_kbytes sysctl (documented in the kernel's documentation for the vm sysctls). This defaults to 8 MB, which is far too low on a modern machine.
Unfortunately it turns out to be hard to find a good value for it,
because it includes existing usage from current processes as well.
In experimentation, I found that a setting as high as 4 GB wasn't
enough to allow a login through ssh (5 GB was enough, at the time,
but I didn't try binary searching from there). This appears to be
partly due to memory surges from logging in, because an idle machine
has under a GB in
/proc/meminfo's Committed_AS field.
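A rough heuristic for picking a candidate value, given that the reserve has to cover existing committed usage plus the surge from an admin login, might look like the following sketch. The 4 GiB margin is an arbitrary guess loosely motivated by the experimentation above (5 GB worked with under a GB committed), not a recommendation:

```python
import os

# Sketch of a heuristic for vm.admin_reserve_kbytes: current
# Committed_AS plus a margin for the memory surge of logging in.
# The margin below is an assumption, not a tested recommendation.

LOGIN_MARGIN_KB = 4 * 1024 * 1024  # 4 GiB, a guess

def meminfo_kb(text, field):
    for line in text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    return None

def candidate_reserve_kb(meminfo_text, margin_kb=LOGIN_MARGIN_KB):
    return meminfo_kb(meminfo_text, "Committed_AS") + margin_kb

if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        kb = candidate_reserve_kb(f.read())
    # You'd then apply this with something like:
    #   sysctl -w vm.admin_reserve_kbytes=<value>
    print("candidate vm.admin_reserve_kbytes:", kb)
```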
(I didn't know about
admin_reserve_kbytes until I started
researching this entry, so once again blogging about something
turns out to be handy.)