What has to happen with Unix virtual memory when you have no swap space

August 7, 2019

Recently, Artem S. Tashkinov wrote on the Linux kernel mailing list about a Linux problem under memory pressure (via, and threaded here). The specific reproduction instructions involved having low RAM, turning off swap space, and then putting the system under load, and when that happened (emphasis mine):

Once you hit a situation when opening a new tab requires more RAM than is currently available, the system will stall hard. You will barely be able to move the mouse pointer. Your disk LED will be flashing incessantly (I'm not entirely sure why). [...]

I'm afraid I have bad news for the people snickering at Linux here; if you're running without swap space, you can probably get any Unix to behave this way under memory pressure. If you can't on your particular Unix, I'd actually say that your Unix is probably not letting you get full use out of your RAM.

To simplify a bit, we can divide pages of user memory up into anonymous pages and file-backed pages. File-backed pages are what they sound like; they come from some specific file on the filesystem that they can be written out to (if they're dirty) or read back in from. Anonymous pages are not backed by a file, so the only place they can be written out to and read back in from is swap space. Anonymous pages mostly come from dynamic memory allocations and from modifying the program's global variables and data; file-backed pages come mostly from mapping files into memory with mmap() and also, crucially, from the code and read-only data of the program.

(A file-backed page can turn into an anonymous page under some circumstances.)
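As a concrete illustration of the two kinds of pages, here is a minimal C sketch (purely illustrative, and it maps /etc/passwd simply because that's a file that exists on basically every Unix):

    /* Illustrative sketch: anonymous versus file-backed pages via mmap(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4096;

        /* Anonymous pages: no backing file, so with no swap configured they
           can never be evicted once they've been written to. */
        char *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* File-backed pages: these can always be dropped and later re-read
           from the file itself. */
        int fd = open("/etc/passwd", O_RDONLY);
        char *filemap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE, fd, 0);

        if (fd < 0 || anon == MAP_FAILED || filemap == MAP_FAILED) {
            perror("setup");
            return 1;
        }

        anon[0] = 1;                 /* dirties an anonymous page */
        printf("%c\n", filemap[0]);  /* faults in a file-backed page */

        /* Writing to a MAP_PRIVATE file mapping gives the process a private
           copy of that page, which is one way a file-backed page effectively
           turns into an anonymous one. */
        filemap[0] = '#';
        return 0;
    }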

Under normal circumstances, when you have swap space and your system is under memory pressure a Unix kernel will balance evicting anonymous pages out to swap space and evicting file-backed pages back to their source file. However, when you have no swap space, the kernel cannot evict anonymous pages any more; they're stuck in RAM because there's nowhere else to put them. All the kernel can do to reclaim memory is to evict whatever file-backed pages there are, even if these pages are going to be needed again very soon and will just have to be read back in from the filesystem. If RAM keeps getting allocated for anonymous pages, there is less and less RAM left to hold whatever collection of file-backed pages your system needs to do anything useful and your system will spend more and more time thrashing around reading file-backed pages back in (with your disk LED blinking all of the time). Since one of the sources of file-backed pages is the executable code of all of your programs (and most of the shared libraries they use), it's quite possible to get into a situation where your programs can barely run without taking a page fault for another page of code.

(This frantic eviction of file-backed pages can happen even if you have anonymous pages that are being used only very infrequently and so would normally be immediately pushed out to swap space. With no swap space, anonymous pages are stuck in RAM no matter how infrequently they're touched; the only anonymous pages that can be discarded are ones that have never been written to and so are guaranteed to be all zero.)
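One way to see this thrashing from inside a program (a small illustrative sketch, not something the argument depends on) is to look at the major fault count that getrusage() reports; major faults are exactly the page faults that had to go to disk:

    /* Illustrative sketch: report this process's page fault counts. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;

        if (getrusage(RUSAGE_SELF, &ru) != 0) {
            perror("getrusage");
            return 1;
        }
        printf("major faults (needed disk I/O): %ld\n", ru.ru_majflt);
        printf("minor faults (no disk I/O):     %ld\n", ru.ru_minflt);
        return 0;
    }

On a system in the state described above, you would expect even mundane programs to rack up major faults, because their own code pages keep being evicted and then faulted back in from disk.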

In the old days, this usually wasn't much of an issue because system RAM was generally large compared to the size of programs, and thus to the number of file-backed pages that were likely to be in memory. That's no longer the case today; modern large programs such as Firefox and its shared libraries can have significant amounts of file-backed code and data pages (in addition to their often large use of dynamically allocated memory, ie anonymous pages).

In theory, this thrashing can happen in any Unix. To prevent it, your Unix has to decide to deliberately not allow you to allocate more anonymous pages after a certain point, even though it could evict file-backed pages to make room for them. Deciding when to cut your anonymous page allocations off is necessarily a heuristic, and so any Unix that tries to do it is sooner or later going to prevent you from using some of your RAM.

(This is different from the usual issue with overcommitting virtual memory address space, because you're not asking for more memory than could theoretically be satisfied. The kernel has to guess how much file-backed memory programs will need in order to perform decently, and it has to do so at the time when you try to allocate anonymous memory, since it can't take the memory back later.)
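(If you want to see this for yourself, the following C sketch is roughly the kind of load involved; the 64 MB chunk size is an arbitrary choice, and you should only run it on a disposable test machine or VM with swap turned off. Depending on the kernel's overcommit settings, it may eventually be killed by the OOM killer instead of ever seeing malloc() fail.)

    /* Illustrative sketch: steadily allocate and touch anonymous memory
       to create the sort of memory pressure discussed above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t chunk = 64UL * 1024 * 1024;   /* 64 MB at a time */
        size_t total = 0;

        for (;;) {
            char *p = malloc(chunk);
            if (p == NULL) {
                printf("allocation failed after %zu MB\n", total >> 20);
                break;
            }
            memset(p, 1, chunk);   /* touch the pages so they become real */
            total += chunk;
            printf("allocated %zu MB so far\n", total >> 20);
            sleep(1);              /* slow enough to watch things degrade */
        }
        return 0;
    }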


Comments on this page:

On the Linux desktop, can we simply behave like Windows or the Mac and open a dialog to close some programs?

You don't explicitly state that these issues with the awful UNIX model are intractable, but you do seem to imply as much.

Is it so unreasonable to have the system ask an operator how to proceed, or even perform tasks such as adding more memory to the machine? Systems from decades past featured such things, rather than killing a random process.

I've never used BeOS, but it's my understanding that BeOS was an operating system designed from the start to have a graphical interface, and so it had very specialized support for keeping the interface running smoothly and whatnot. This is in contrast to the UNIX model of pretending the user is using a typewriter and having a large program named X11, which the system knows little about, handle graphical matters.

To prevent it, your Unix has to decide to deliberately not allow you to allocate more anonymous pages after a certain point, even though it could evict file-backed pages to make room for them. Deciding when to cut your anonymous page allocations off is necessarily a heuristic, and so any Unix that tries to do it is sooner or later going to prevent you from using some of your RAM.

One could make the argument that this is an example of how giving each process its own address space is a silly idea, as opposed to a system that manages everything within a single address space and so avoids wasting memory. In any case, that could be regarded as tangential, so I'll simply note that this operating system of many millions of lines and decades of work is apparently lacking the heuristics that would help prevent this behavior.

For further details as to how an operating system with millions of lines of code and several decades' worth of work put into it could still fail at basic tasks, see The UNIX-HATERS Handbook.

For those curious about the possibility of improvements, the thread link may be informative.

You can also then click on the "flat" or "nested" link by the word "expand". You can then skim and search the whole thread for keywords. Try "pressure" and "psi", "Facebook"/"oomd", "Android"/"lmkd". Also "patch" (you want to look at the second one :).

The follow-ups to the post you are referencing disagree with this assessment somewhat.

The real "problem" is that the OOM killer is not invoked. The system is clearly out of memory otherwise it wouldn't be so completely unresponsive trying to deal with said memory pressure. Ideally the OOM killer would run and at least recover the system.

And there is existing pressure stall work to do exactly that, used both on Android and in the container fleets of Google and/or Facebook (I forget which, maybe both?). And I think there is some partial work to do the detection side of that inside the kernel, which was merged sometime recently-ish?

So it's worth reading into the situation a little more to understand that side of things.

As a side note, you can also get into this same situation when you do have swap space, for much the same reasons. You just need to use more "RAM" and fill up your swap. And again, in some cases the OOM killer currently won't handle it correctly. I've seen this myself plenty of times.
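(For anyone curious what the pressure stall information mentioned above looks like: kernels from Linux 4.20 on expose it as /proc/pressure/memory. A minimal sketch that just prints it:)

    /* Illustrative sketch: print the kernel's memory pressure stall
       information (needs Linux 4.20+ with PSI enabled; otherwise the
       file simply doesn't exist). */
    #include <stdio.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/pressure/memory", "r");

        if (f == NULL) {
            perror("/proc/pressure/memory");
            return 1;
        }
        /* Typically two lines, "some ..." and "full ...", giving the share
           of recent time that tasks spent stalled waiting on memory. */
        while (fgets(line, sizeof line, f) != NULL)
            fputs(line, stdout);
        fclose(f);
        return 0;
    }

Userspace daemons such as oomd and Android's lmkd watch these numbers and kill things before the whole system bogs down.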

By SomeWan at 2019-08-08 10:39:28:

Why can the kernel not emit a signal to applications that memory pressure is occurring? Many applications - particularly the large consumers on the desktop - keep caches that are not strictly necessary for the application to continue running.

These include languages with a garbage collector - why can it not just start a GC run? Firefox has a manual GC button in about:memory. Hitting that frees over 100MB on a freshly started process with only 3 tabs. Similarly, the GNOME desktop has a GC run that is quite long, and it can be automatically triggered with no loss.

Dropping caches that are not important seems very sensible. You could even grade it (drop priority 3 cache, 2, then 1).

This really should be the first action that the kernel could take - "dear users, please tidy your stuff up, or I'll need to bring out the out-of-memory stick". Paging really should be part of the latter.

By linux user at 2019-08-10 17:30:28:

slightly off-topic but i find it hilarious that copying files to disk requires so much ram! my old laptop with 6gb ram and 2gb swap (yeah i know it's too little) often runs out of both physical memory and swap and becomes unresponsive. i've been considering replacing cp with nocache but it won't solve the problem with programs that write to disk. right now i'm using drop_caches to clear the memory.

what i don't understand is that this started happening only after upgrading to debian 10 (kernel 4.19). in debian 9 (kernel 4.9) i could have two browsers with ~40 tabs open, watch movies in vlc, and read/write large docs in libreoffice writer, with memory usage never exceeding 2gb.

By Luke Leighton at 2019-08-11 00:42:38:

hi chris, thanks for bringing this to peoples' attention.

i have an extremely expensive Aorus X3 v6 gaming laptop, bought at a time when 16GB of 2400MHz DDR4 RAM and 512GB NVMe SSDs just did not exist.

i was shocked to find that, just as with the 2012 macbook pro it replaced (8GB RAM, 256gb SSD), going into swap-space - even by the smallest amount - would result in the loadavg climbing instantly to over 120, leaving about 4 seconds in which to type "killall -9 firefox" before the only remaining option was a hard power cycle.

replacing a USD $2500 laptop is clearly not an option.

after months of research over a 12-15 month period, trying different options, the most successful one was simply "swapoff -a". this is because letting the OOM killer take out a few processes (then restarting them) is infinitely better than losing valuable desktop configuration context involving over 50 xterm windows, 200+ tabs, and often 200+ backgrounded vim editor sessions.

part of the problem might be down to hardware issues with both laptops (just bad luck there). the SSDs in both end up with bus "resets" and/or get reconfigured by power-management daemons. they're normally configured into "low power" mode; however, because of a hardware power-supply fault on the Aorus - most likely related to PCIe power not being adequate on the early-adopter Reference Design that went into this laptop - the entire PCIe bus gets a "reset" if the data transfer rate is too high.

i've since found a way to get the PCIe bus protocol reduced down to version 2.0 (or something like that), which means i no longer have 2500mbytes/sec NVMe transfer speeds, but i get reliability back.

the resets occur precisely because swap-space is too demanding, so just when the system needs the NVMe SSD the most, SMACK, gone.

again: no, replacing this laptop is not a viable option.

the macbook pro also had a similar issue, one that's well-known, where /var/log/syslog is filled up with SATA resets, one every second, until min_power was set. the problem there being that the power management daemon, on a wake-up or other event, would reset the damn thing. i gave up on that machine, as you can probably gather, because 8GB of RAM was just inadequate for the increasingly-demanding tasks i put machines to.

so, thank you for raising this issue! and yes, some people have a genuine need to run with "swapoff -a". yes i did try zram at one point. yes it made things worse - much worse.

By Milan Keršláger at 2019-08-13 02:55:16:

Why the disk flashing? Because you have no buffers/cache left even for reading!

By FooBar at 2019-08-13 08:18:31:

SomeWan at 2019-08-08 10:39:28: Why can the kernel not emit a signal to applications that memory pressure is occurring?

Although not exactly the same as what you desire, in a manner of speaking the kernel does signal processes when memory allocation actually fails: it returns an 'out of memory' error (ENOMEM). But it's up to the applications to deal with such occurrences, which I assume is not very practical in practice.
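(A tiny sketch of what that looks like from a program's point of view; the request size here is deliberately absurd so that the failure path is actually taken:)

    /* Illustrative sketch: the only 'signal' an ordinary allocation gets
       is a failure return, with errno set to ENOMEM. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t huge = (size_t)-1 / 2;   /* an absurdly large request */
        void *p = malloc(huge);

        if (p == NULL) {
            /* For this absurd request malloc() reliably fails with ENOMEM.
               For realistically sized requests under the usual overcommit
               settings, malloc() tends to "succeed" and the program is
               killed later when the memory is actually touched. */
            fprintf(stderr, "malloc failed: %s\n", strerror(errno));
            return 1;
        }
        free(p);
        return 0;
    }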
