In Linux, hitting a strict overcommit limit doesn't trigger the OOM killer

November 1, 2018

By now, we're kind of used to our Linux machines running out of memory, because people or runaway processes periodically do it (to our primary login server, to our primary web server, or sometimes to other service machines). It has a familiar litany of symptoms, starting with massive delays and failures in things like SSH logins and ending with Linux's OOM killer activating to terminate some heavy processes. If we're lucky the OOM killer will get the big process right away; if we're not, it will first pick off a few peripheral ones before getting the big one.

However, every so often recently we've been having some out of memory situations on some of our machines that didn't look like this. We knew the machines had run out of memory because log messages told us:

systemd-networkd[828]: eno1: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
[...]
systemd[1]: user@NNN.service: Failed to fork: Cannot allocate memory
[...]
sshd[29449]: fatal: fork of unprivileged child failed
sshd[936]: error: fork: Cannot allocate memory

That's all pretty definitely telling us about a memory failure (note that /run is a tmpfs filesystem, and so 'no space left on device' really means 'out of memory'). What we didn't see was any indication that the OOM killer had been triggered; there were no kernel messages about it, and the oom_kill counter in /proc/vmstat stubbornly reported '0'. We spent some time wondering where the used memory was going (since we couldn't see it) and, more importantly, why the kernel didn't think it had to invoke the OOM killer. Was the kernel failing to account for memory used in tmpfs somewhere, for example?

(In the process of looking into this issue I did learn that memory used by tmpfs shows up in /proc/meminfo's Shmem field. Tmpfs also apparently gets added to Cached, which is a bit misleading since it can't be freed up, unlike a lot of what else gets counted in Cached.)
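
As an illustration of where to look for these numbers, here's a small sketch (mine, not something we actually run) that pulls the relevant fields out of /proc/meminfo and /proc/vmstat. The oom_kill counter only exists in /proc/vmstat on reasonably recent kernels, so the sketch treats it as optional.

#!/usr/bin/env python3
# Sketch: report the /proc fields discussed here. The /proc/meminfo
# values are in kB for the fields we read.

def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            # lines look like 'MemTotal:  16316412 kB'
            name, rest = line.split(":", 1)
            info[name] = int(rest.split()[0])
    return info

def vmstat():
    stats = {}
    with open("/proc/vmstat") as f:
        for line in f:
            # lines look like 'oom_kill 0'
            name, val = line.split()
            stats[name] = int(val)
    return stats

mi, vs = meminfo(), vmstat()
print("Shmem (tmpfs and friends): ", mi["Shmem"], "kB")
print("Cached (includes tmpfs):   ", mi["Cached"], "kB")
print("Committed_AS / CommitLimit:", mi["Committed_AS"], "/", mi["CommitLimit"], "kB")
print("oom_kill count:", vs.get("oom_kill", "not present"))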

Then last night the penny dropped and I came to a sudden realization about what was happening. These machines were running into strict overcommit limits, and when your machine hits a strict overcommit limit, the kernel OOM killer is not triggered (or at least isn't necessarily triggered). Most of our machines don't use strict overcommit (and this is generally the right answer), but our Linux compute servers do have it turned on, and it was on our compute servers that we were experiencing these unusual out of memory situations. This entirely explains how we could be out of memory without the kernel panicking about it; we had simply run into the limit on how much memory we told the kernel to allow people to allocate.

(Since the OOM killer wasn't invoked, it seems likely that some of this allocated memory space wasn't in active use and may not even have been touched.)
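
For background, with strict overcommit turned on (vm.overcommit_memory set to 2) the commit limit is roughly your swap space plus RAM scaled by vm.overcommit_ratio (or vm.overcommit_kbytes if that's set instead), and the kernel exposes it directly as CommitLimit in /proc/meminfo; allocations that would push Committed_AS past it simply fail. Here is a small sketch (mine, with an arbitrary warning threshold) of checking how much headroom a machine has left:

#!/usr/bin/env python3
# Sketch: how close is this machine to its strict overcommit limit?

def meminfo_kb(field):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

with open("/proc/sys/vm/overcommit_memory") as f:
    mode = int(f.read())

limit = meminfo_kb("CommitLimit")
committed = meminfo_kb("Committed_AS")
print("overcommit mode:", mode, "(2 means strict overcommit)")
print("CommitLimit:  %12d kB" % limit)
print("Committed_AS: %12d kB" % committed)
print("headroom:     %12d kB" % (limit - committed))
# 512 MB is an arbitrary threshold, not a kernel constant.
if mode == 2 and (limit - committed) < 512 * 1024:
    print("warning: close to the strict overcommit limit")

On a machine in the state described above, the headroom is essentially zero even though the OOM killer has never fired.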

In a way, things are working exactly as designed. We said to use strict overcommit on these machines and the kernel is dutifully carrying out what we told it to do. We're enforcing memory limits that ensure these machines don't get paralyzed, and they mostly don't. In another way, how this is happening is a bit unfortunate. If the OOM killer activates, generally you lose a memory hog but other things aren't affected (in our environment the OOM killer seems pretty good at only picking such processes). But if a machine runs into the strict overcommit limit, lots of things can start failing because they suddenly can't allocate memory, can't fork, can't start new processes or daemons, and so on. Sometimes this leaves things in a failed or damaged state, because your average Unix program simply doesn't expect memory allocation or fork or the like to fail. In an environment where we're running various background tasks for system maintenance, this can cause real problems.

(Go programs panic, for instance. We got a lot of stack traces from the Prometheus host agent.)

One of the things that we should likely be doing to deal with this is increasing the vm.admin_reserve_kbytes sysctl (documented here). This defaults to 8 MB, which is far too low on a modern machine. Unfortunately it turns out to be hard to find a good value for it, because whatever value you pick apparently has to cover the existing memory usage of current processes as well as new activity. In experimentation, I found that a setting as high as 4 GB wasn't enough to allow a login through ssh (5 GB was enough at the time, but I didn't try binary searching between the two). This appears to be partly due to memory surges from logging in, because an idle machine has under a GB in /proc/meminfo's Committed_AS field.
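
As a concrete sketch of what adjusting this looks like, you could check and raise the sysctl along these lines (this is my illustration; the 5 GB figure is just the empirical value from the experimentation above, not a general recommendation, and a persistent setting belongs in /etc/sysctl.d/ rather than a one-off write):

#!/usr/bin/env python3
# Sketch: inspect vm.admin_reserve_kbytes and, if run as root, raise it.
import sys

PATH = "/proc/sys/vm/admin_reserve_kbytes"
TARGET_KB = 5 * 1024 * 1024    # 5 GB, the value that happened to work here

with open(PATH) as f:
    current = int(f.read())
print("current vm.admin_reserve_kbytes:", current)

if current < TARGET_KB:
    try:
        # equivalent to 'sysctl vm.admin_reserve_kbytes=...'; lasts until reboot
        with open(PATH, "w") as f:
            f.write(str(TARGET_KB))
        print("raised to", TARGET_KB)
    except PermissionError:
        print("need root to change it", file=sys.stderr)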

(I didn't know about admin_reserve_kbytes until I started researching this entry, so once again blogging about something turns out to be handy.)


Comments on this page:

By KC Marshall at 2018-11-02 12:27:39:

Rather than using strict overcommit, would it be possible to allow overcommit but limit the ratio for any process to only, say, 85% of the RAM and swap? That way, giant runaway processes tend to get killed before they slurp up all the memory in the system, but normal processes can allocate memory as they always have.

By KC Marshall at 2018-11-02 13:45:52:

I'm confused - I'm trying to suggest the thing you are already doing (setting overcommit_memory = 2 and then setting overcommit_ratio appropriately). Is it just too many processes needing memory, or an issue of not enough memory allowed to each?

By cks at 2018-11-02 14:48:02:

Strict overcommit applies to the total memory usage across all processes, not separately to individual processes. Unless restricted in some other way, an individual process can use as much of the allowed global memory as it can grab, up to the commit limit (and on our compute servers this is a feature for us; if the machine has a single user running a single compute process, we want it to get as much memory as possible). Conversely, when the global commit limit runs out, no one can get any more memory.

We can't rely on a per-process or per-user limit, because on compute servers we need to assume that a process that makes a giant memory allocation actually intends to use that allocation for computing (for example to load a giant dataset that it's going to grind over). If we allowed multiple such processes to each claim, say, 85% of the machine's RAM, they would collectively overload the available RAM and thrash the machine to death (well, they'd actually blow up spectacularly and get OOM killed, because we don't give machines that much swap; our current compute servers can have 96 GB and even 256 GB of RAM).
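
To put rough numbers on that (my illustration; the swap figure is hypothetical, since we give these machines very little, and the formula ignores hugetlb pages):

# Rough arithmetic for a 96 GB compute server under strict overcommit.
ram_gb, swap_gb, ratio = 96, 1, 85
commit_limit_gb = swap_gb + ram_gb * ratio / 100
print("global CommitLimit ~= %.1f GB, shared by all processes" % commit_limit_gb)

That limit is for the whole machine, so two processes that each planned on using '85% of RAM' can't both get it.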
