Why per-process (or per-user) memory resource limits are hard
A while back I wrote a little thing about process memory resource limits. Today I feel like elaborating on why this is a hard problem that people haven't solved yet.
Let's consider the simplest case possible: per-process RSS limits. In the abstract, this is simple; you keep track of how many physical pages a process has in its page tables, and when a process that is at its RSS limit wants to add a new page, it must first release its mapping for an existing one. Effectively what you create is a situation where an RSS-limited process pages against itself instead of against the overall system memory use.
(We don't try to make the process immediately free up the dropped page; while that's the ultimate goal, we assume that normal kernel mechanisms will take care of it. Certainly, if the system is under memory pressure in general, the pages that the process frees up will promptly be stolen by other memory users.)
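As a rough illustration of 'paging against itself', here is a toy model of the bookkeeping described above. It is only a sketch and not how any real kernel does it; the RssLimitedProcess class, its LRU eviction choice, and the page numbers are all made up for the example.

```python
from collections import OrderedDict


class RssLimitedProcess:
    """Toy model: a process whose resident set is capped at rss_limit pages."""

    def __init__(self, rss_limit):
        self.rss_limit = rss_limit
        self.resident = OrderedDict()  # page number -> None, least recently used first
        self.faults = 0
        self.self_evictions = 0

    def touch(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)  # mark as most recently used
            return
        self.faults += 1
        if len(self.resident) >= self.rss_limit:
            # At the limit: give up one of our own mappings before adding a new one.
            self.resident.popitem(last=False)
            self.self_evictions += 1
        self.resident[page] = None


proc = RssLimitedProcess(rss_limit=3)
for page in [0, 1, 2, 3, 0, 1]:
    proc.touch(page)
```

With a limit of three pages and a working set of four, every touch in this access pattern faults, and the process never holds more than three pages no matter how much free memory the rest of the (imaginary) system has.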
The complication is virtual memory areas that are shared between processes. Each process has an independent set of page tables and thus an independent RSS limit, but pages in a shared VMA can only be released by the system if no process has them mapped. Now suppose you have two processes, one at its RSS limit and one not, that are both using the same pages in a shared VMA. What happens when the RSS-limited process should release a page in the shared VMA?
(If you skip shared VMAs entirely, you give processes a great way to avoid RSS limits.)
If you only release the page in the RSS-limited process, the page is going to stay in memory. You haven't reduced the effective memory footprint of the process or freed up any memory overall; all you've done is drive up the soft page fault rate and waste CPU. Instead of doing something productive, the RSS-limited process gets to spend its time evicting pages only to soft-fault them back in later.
(If your policy goal is to slow down the RSS-limited process, you are better off just explicitly putting it to sleep instead.)
If you force all processes using the shared VMA to release the page, you penalize the virtuous processes that are staying under their RSS limits, possibly drastically if the system is under enough memory pressure that the page gets stolen entirely and must now be loaded back in from disk. As it happens, a typical system has a lot of widely shared VMAs; consider things like shared libraries.
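The two unappealing options can be put side by side in a small toy model. This is a sketch under made-up assumptions: SharedPage, the two policy functions, and the process names 'limited' and 'virtuous' are hypothetical and only mirror the argument above, not real kernel data structures.

```python
class SharedPage:
    """Toy model of one page in a shared VMA, mapped by several processes."""

    def __init__(self, mappers):
        self.mappers = set(mappers)  # processes that have the page mapped
        self.in_memory = True

    def unmap(self, proc):
        self.mappers.discard(proc)

    def try_reclaim(self):
        # The system can only free the page once no process has it mapped.
        if not self.mappers:
            self.in_memory = False
        return not self.in_memory


def release_only_limited(page, limited):
    """Option 1: only the process at its RSS limit drops its mapping."""
    page.unmap(limited)
    return page.try_reclaim(), set()  # nobody else is disturbed


def release_everywhere(page, limited):
    """Option 2: every process sharing the page is forced to drop it."""
    penalized = page.mappers - {limited}
    for proc in list(page.mappers):
        page.unmap(proc)
    return page.try_reclaim(), penalized


freed1, hurt1 = release_only_limited(SharedPage({"limited", "virtuous"}), "limited")
freed2, hurt2 = release_everywhere(SharedPage({"limited", "virtuous"}), "limited")
```

In the first case nothing is freed and nobody else is hurt (freed1 is False, hurt1 is empty); in the second the page can be freed, but only by taking it away from the process that was behaving itself (freed2 is True, hurt2 contains 'virtuous').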
Thus, on sober second reflection I find it entirely unsurprising that the Linux kernel records a per-process RSS limit (RLIMIT_RSS) but does nothing to enforce it; there is no clear answer about what it actually should do in a reasonably common situation.
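You can see the 'tracked but not acted on' behavior from user space. The following is a minimal sketch using Python's standard resource module; the 16 MiB and 32 MiB figures and the set_soft_rss_limit helper are arbitrary choices for the example, and the claim that the allocation goes through assumes a current Linux kernel (the getrlimit manual page describes RLIMIT_RSS as having an effect only on old 2.4 kernels).

```python
import resource

MIB = 1024 * 1024


def set_soft_rss_limit(nbytes):
    """Set the soft RLIMIT_RSS (clamped to the hard limit) and read it back."""
    _, hard = resource.getrlimit(resource.RLIMIT_RSS)
    if hard != resource.RLIM_INFINITY and nbytes > hard:
        nbytes = hard
    resource.setrlimit(resource.RLIMIT_RSS, (nbytes, hard))
    return resource.getrlimit(resource.RLIMIT_RSS)[0]


soft = set_soft_rss_limit(16 * MIB)

# Fill twice the limit's worth of memory. On current Linux this is expected
# to succeed without complaint: the limit is stored, but nothing acts on it.
buf = bytearray(b"\x01") * (32 * MIB)
```

The kernel happily reports the new soft limit back to you, and then lets the process carry on exactly as before.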
And RSS limits are the simple case, because they are a 'soft' limit; a process that hits them just slows down, it doesn't stop or otherwise get errors. Other limits, such as the one on per-process committed address space that I wanted back then, would be hard limits, which means that you could either be killing a process pointlessly or killing innocent processes. Neither is very appealing.
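For contrast, here is roughly what a hard limit feels like from inside a process. Linux has no per-process limit on committed address space as such, so this sketch uses RLIMIT_AS (total virtual address space) as a stand-in, which is an assumption of the example and not the limit discussed above. It runs in a child process so the limit does not affect the parent; the 1 GiB figure, the CHILD script, and run_limited_child are likewise made up for illustration, and the behavior described assumes Linux.

```python
import subprocess
import sys

# Script run in the child: cap address space, then ask for more than the cap.
CHILD = r"""
import resource

limit = 1 << 30  # 1 GiB of address space, an arbitrary illustrative value
_, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY and hard < limit:
    limit = hard
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
try:
    bytearray(2 * limit)  # needs more address space than the limit allows
    print("allocated")
except MemoryError:
    print("MemoryError")
"""


def run_limited_child():
    """Run the child script and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )
    return result.stdout.strip()


outcome = run_limited_child()
```

On Linux the child is expected to report a MemoryError instead of merely running slowly. Python at least turns the failed allocation into an exception; plenty of programs would simply die at this point.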
(I suspect that there's academic research on this whole area somewhere that might have clever answers to these problems.)