A problem with strict memory overcommit in practice

February 5, 2019

We've used Linux's strict overcommit mode on our compute servers for years to limit memory usage to the (large) amount of RAM on the machines, on the grounds that compute jobs that allocate tens or hundreds of GB of RAM generally intend to use it. Recently we had a series of incidents where compute machines were run out of memory, and unfortunately these incidents illustrated a real problem with relying on strict overcommit handling. You see, in practice, strict overcommit kills random processes. Oh, not literally, but the effect is generally the same.
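As a rough illustration of what "the limit" means here: in strict mode (vm.overcommit_memory set to 2), the kernel reports its commit limit and the amount currently committed as the CommitLimit and Committed_AS fields of /proc/meminfo. The following is a minimal Python sketch of reading the remaining headroom, not something we run; the function names and the sample numbers are invented for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: number}.

    Most fields are reported in kB; a few (such as the HugePages counts)
    are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(text):
    """Return CommitLimit - Committed_AS in kB.

    Under strict overcommit this is roughly how much more memory can be
    allocated before requests start being refused.
    """
    fields = parse_meminfo(text)
    return fields["CommitLimit"] - fields["Committed_AS"]


# Illustrative numbers only; they are not taken from a real machine.
SAMPLE = """\
MemTotal:       131072000 kB
SwapTotal:        4194304 kB
CommitLimit:    135266304 kB
Committed_AS:   135000000 kB
"""

# On a Linux machine you would pass open("/proc/meminfo").read() instead.
headroom = commit_headroom_kb(SAMPLE)
```

A small headroom on a busy machine is the situation described here: the next allocation of any size, by any process, may be the one that gets refused.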

Quite a lot of programs need memory as they run, and when they can't get it there is often not very much they can do except exit. This is especially the case for shells and shell scripts; even if the shell can get enough memory to do its own internal work, the moment it tries to fork() and exec() some external program, it's going to fail, and there goes your shell script. All sorts of things can start failing, including things that shouldn't fail. Do you have a 'trap "rm $lockfile" EXIT' in your shell script? Well, rm isn't a builtin, so your lock file is probably not going away.
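To make the lock file point concrete, here is a hedged sketch in Python rather than shell, where removing the file is a system call made by the process itself (os.unlink) instead of a fork() and exec() of rm. The lock_file helper is hypothetical, not something from our scripts, and it only narrows the problem: the interpreter can still fail to get memory for its own work.

```python
import contextlib
import os


@contextlib.contextmanager
def lock_file(path):
    """Hold an exclusive lock file for the duration of a with-block.

    Cleanup uses os.unlink(), which runs inside this process, so it does
    not depend on being able to fork and exec an external program.
    """
    # O_CREAT | O_EXCL raises FileExistsError if the lock already exists.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        try:
            os.write(fd, f"{os.getpid()}\n".encode())
        finally:
            os.close(fd)
        yield path
    finally:
        # Runs on normal exit and when the body raises an exception.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(path)
```

This doesn't help if the process is killed outright, and it does nothing for the external commands the script runs in between, which can still fail at any point.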

Strict overcommit is blind here; it denies more memory to any and all processes that want it after the limit is hit, no matter what they are, how big they are, or what their allocation pattern has been. And in practice the amount of available memory doesn't just go to zero and stay there; instead, some things will try to allocate memory, fail, and exit, releasing their current memory, which allows other programs to get more memory for a bit and maybe finish and exit, and then the whole cycle bounces around. This oscillation is what creates the randomness, where if you ask at the right time you get memory and get to survive but if you ask at the wrong time you fail and die.
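A toy model may make the blindness and the timing dependence clearer. This is only a sketch of the accounting idea under simplifying assumptions (a single limit, whole megabytes, invented process names), not how the kernel implements it.

```python
class StrictCommit:
    """Toy strict-overcommit accounting: one global limit, no favourites."""

    def __init__(self, limit_mb):
        self.limit_mb = limit_mb
        self.committed = {}  # process name -> MB currently committed

    def allocate(self, name, mb):
        """Grant the request only if the total stays within the limit.

        Who is asking, and how much they already hold, is never considered.
        """
        if sum(self.committed.values()) + mb > self.limit_mb:
            return False
        self.committed[name] = self.committed.get(name, 0) + mb
        return True

    def exit(self, name):
        """A process exiting releases everything it had committed."""
        self.committed.pop(name, None)


acct = StrictCommit(limit_mb=100)
big_ok = acct.allocate("compute-job", 99)     # granted: 99 <= 100
small_first = acct.allocate("cron-script", 2)  # refused: 99 + 2 > 100
acct.exit("compute-job")                       # the big job finishes or dies
small_later = acct.allocate("cron-script", 2)  # granted: memory is free again
```

The same 2 MB request fails or succeeds purely depending on when it is made, which is the oscillation described above.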

In our environment, processes failing unpredictably turns out to be surprisingly disruptive to routine ongoing maintenance tasks, like password propagation and managing NFS mounts. I'd say that our scripts weren't designed for things exploding half-way through, but I'm not sure there's any way to design scripts to survive in a setting where any command and any internal Bourne shell operation might fail at any time.

(We are taking steps to mitigate some problems.)

Despite our recent experience with this, we're probably going to stick with strict overcommit on our compute servers; it has saved us in other situations, and there really isn't a better alternative. The OOM killer has its own problems and is probably the wrong answer for our compute servers, especially in its current form.

(There is an argument that can be made that the OOM killer's apparent current approach of 'kill the biggest process' is actually a fair one on a shared compute server, but I think it's at least questionable.)

PS: We saw this last fall, but at the time we didn't really fully appreciate the potential problems and how hard they may be to deal with.

Comments on this page:

By jaloren@gmail.com at 2019-02-06 07:22:47:

So if a script could die randomly at any time, what if you used a distributed scheduler system like hashicorp nomad or even a message bus? That would allow you to detect when things fail and have the job system automatically reschedule the failed task. While not the best fit, even doing something in Jenkins with a paramiko script might help.

Alternatively maybe you daemonize the script and run it under systemd?

Btw why can’t you leverage cgroups to have saner memory allocation to mitigate the problem?

By George Shuklin at 2019-02-10 16:17:39:

I tried to use strict mode but gave up. Some processes just allocate too much memory for no reason and it's hard to fix that. As sad as it sounds, the current OOM killer approach is more reasonable.

But here is one piece of advice on how to live with strict mode: use swap. If a few KB end up in swap, it's not a big deal. If real swapping happens, that should be treated as an issue. But a big swap gives room to account for useless (unused) allocations. Nothing is swapped, but the allocations succeed under strict overcommit.

By cks at 2019-02-15 14:46:21:

Belatedly: jaloren, the problem is not re-running scripts; that generally happens reliably through cron. The problem is that an abrupt script death (or the abrupt death of any program) may leave the overall state of the system in a not-great state; for example, the script may have been part way through updating something.

To some extent you get this issue when systems reboot, but it's possible to arrange programs so that their temporary state is wiped away and redone from scratch when the system boots (for example, you make sure that it's all on a tmpfs filesystem such as /var/run). This is much harder when only some bits of the system may fall over unpredictably or not work.

Written on 05 February 2019.

Last modified: Tue Feb 5 23:44:58 2019