Our experience with Linux's strict overcommit mode
As a follow-up to 64BitDrawback: after several of our machines crashed from being driven out of memory, dealing with the whole issue suddenly got a lot more urgent, and we opted to try to solve it by turning on Linux's strict overcommit mode for swap allocation. At first we did this only on our compute servers, but after some of our login servers also OOM'd and crashed, we enabled it on them too.
(Strict overcommit has the great advantage that we don't need to pick a somewhat arbitrary number for a per-process size limit, which makes it hard for people to be too displeased with it.)
On the compute servers this has worked great and I consider it a big win. We have seen it choke off runaway jobs that would otherwise have killed the machine, and it has done so without perturbing anything else, so it definitely does what we want it to. No users have complained. (I'm not sure any of them have noticed, since the overcommit ratio we picked allows them to use all of the physical memory.)
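For reference, the kernel's overcommit documentation describes the strict-mode limit (vm.overcommit_memory = 2) as swap plus a percentage of physical RAM, with the percentage set by vm.overcommit_ratio. Here is a rough sketch of that arithmetic. The machine sizes are invented for illustration, the function name is mine, and it ignores huge pages, which the kernel subtracts from RAM first.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio):
    """Approximate strict-mode CommitLimit in kB.

    Mirrors the documented formula: swap + RAM * overcommit_ratio / 100.
    Huge pages are ignored here for simplicity.
    """
    return swap_total_kb + mem_total_kb * overcommit_ratio // 100


# Hypothetical machine: 16 GB of RAM and 4 GB of swap (sizes in kB).
RAM_KB = 16 * 1024 * 1024
SWAP_KB = 4 * 1024 * 1024

# With the kernel's default ratio of 50, only half of RAM counts.
default_limit = commit_limit_kb(RAM_KB, SWAP_KB, 50)

# With a ratio of 100, committed space can cover all of RAM plus swap.
full_limit = commit_limit_kb(RAM_KB, SWAP_KB, 100)
```

On this made-up machine, a ratio of 100 is what it takes for the limit to cover every byte of RAM on top of swap; the default of 50 would leave a quarter of RAM plus swap uncommittable.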
Things are less clear on the login servers. Despite having lots of free memory and no swap usage, their committed address space grows slowly over time and after a while approaches the commit limit. At worst, this could leave us with a system that can't start new processes despite having lots of capacity left.
I can only conclude that modern graphical applications actually do allocate a bunch of address space that they don't wind up using, for whatever reason. Over time, more people log in and run more programs, many of which are idle, many of which are not using all of their committed address space and never will, and the total committed space grows and grows. (Perhaps sometime it will reach a steady state.)
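One way to keep an eye on this growth is to compare the Committed_AS and CommitLimit fields that Linux reports in /proc/meminfo. The following is a minimal sketch of such a check, not something we actually run; the function names and the idea of reporting a simple fraction are my own assumptions.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of name -> integer value.

    Most values are in kB; the unit suffix, if any, is dropped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def commit_usage(fields):
    """Return committed address space as a fraction of the commit limit."""
    return fields["Committed_AS"] / fields["CommitLimit"]


def read_commit_usage(path="/proc/meminfo"):
    """Read the live values from a Linux system."""
    with open(path) as f:
        return commit_usage(parse_meminfo(f.read()))
```

Calling read_commit_usage() periodically (from cron, say) and logging the result would show whether the fraction creeps steadily toward 1.0 or levels off at a steady state.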
It's not at all clear what commit limit is appropriate in this situation, although we can probably defend an overcommit ratio that is very close to 100%. If even that is too small, we might as well turn off strict overcommit on the login servers (and look for another solution to groups of runaway programs; the login servers are 32-bit machines, so no single process can OOM them).