My uncertainty about swapping and swap sizing for SSDs and NVMe drives

March 21, 2021

The traditional reason to avoid configuring a lot of swap space on your servers, and to avoid using swap in general, was that lots of swap space made it much easier for your system to thrash itself into total overload. But that's wisdom (and painful experience) from back in the days of system-wide 'global' swapping and your swap being on spinning rust (ie, hard drives). Paging evicted memory back in (whether from swap or from its original files) is mostly random IO, and spinning rust had hard limits on how many IOPS it could sustain, which often had to be shared between swapping and regular IO. And with global swapping, any process could be victimized by having to page things back in, or have its regular IO delayed by swapping IO. In theory, things could be different today.

Modern SSDs and especially NVMe drives are much faster and support many more IOPS, especially for read IO (ie, paging things back in). Paging is still quite slow compared to simply accessing RAM, but it's not anywhere near as terrible as it used to be on spinning rust; various sources suggest that you might see page-in latencies of 100 microseconds or less on good NVMe drives, and perhaps only a few milliseconds on SSDs. Since modern SSDs and especially NVMe drives can reach and sustain very high random IO rates, this paging activity is also far less disruptive to the IO that other programs are doing.

(The figures I've seen for random access to RAM on modern machines are on the order of 100 nanoseconds. If we assume the total delay on a NVMe page-in is on the order of 100 microseconds (including kernel overheads), that means a page-in costs you around 1,000 RAM accesses. This is far better than it used to be, although it's not fast. Every additional microsecond of delay costs another 10 RAM accesses.)
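The arithmetic here is simple enough to sketch out; the latency numbers below are the illustrative figures from above, not measurements:

```python
# Rough cost of a page-in, measured in equivalent RAM accesses.
# The latency figures are the illustrative numbers from the text.
RAM_ACCESS_NS = 100        # ~100 ns per random RAM access
NVME_PAGEIN_NS = 100_000   # ~100 us total for a good NVMe page-in
SSD_PAGEIN_NS = 1_000_000  # ~1 ms for an SSD page-in (optimistic)

def cost_in_ram_accesses(pagein_ns):
    """How many RAM accesses one page-in is worth."""
    return pagein_ns // RAM_ACCESS_NS

print(cost_in_ram_accesses(NVME_PAGEIN_NS))  # 1000
print(cost_in_ram_accesses(SSD_PAGEIN_NS))   # 10000
# Each additional microsecond of delay is another 10 RAM accesses:
print(1_000 // RAM_ACCESS_NS)                # 10
```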

Increasingly, systems also support 'local' swapping in addition to system-wide 'global' swapping, where different processes or groups of processes have different RAM limits and so one group can be pushed into swapping without affecting other groups. The affected group will still pay a real performance penalty for all of the paging it's doing, but other processes should be mostly unaffected. They shouldn't have their pages evicted from RAM any faster than they otherwise would be, so if they weren't paging before they shouldn't be paging afterward. And with SSDs and NVMe drives having high concurrent IO limits, the other processes shouldn't be particularly affected by the paging IO.
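On Linux, this sort of 'local' swapping can be set up through cgroup v2 memory controls. A minimal configuration sketch (the group name and the specific limits here are invented for illustration, and this assumes a unified cgroup v2 hierarchy mounted at /sys/fs/cgroup, plus root privileges):

```shell
# Make a cgroup for a group of processes we want to confine.
mkdir /sys/fs/cgroup/batchjobs

# Start reclaiming (and so potentially swapping) this group's memory
# once it goes above 4 GiB ...
echo 4G > /sys/fs/cgroup/batchjobs/memory.high
# ... hard-cap its RAM usage at 6 GiB ...
echo 6G > /sys/fs/cgroup/batchjobs/memory.max
# ... and allow it to use up to 8 GiB of swap on its own.
echo 8G > /sys/fs/cgroup/batchjobs/memory.swap.max

# Move the current shell (and thus its future children) into the group.
echo $$ > /sys/fs/cgroup/batchjobs/cgroup.procs
```

Processes in this group that exceed their memory limits get pushed into swap on their own, while the rest of the system keeps its RAM.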

If you're using SSDs or NVMe drives with enough IO capacity (and low enough latency), even system-wide swap thrashing might not be as lethal as it used to be. If everything works well with 'local' swapping, a particular group of processes could be pushed into swap thrashing by their excessive memory usage without doing anything much to the rest of the system; of course they might not perform well and perhaps you'd rather have them terminated and restarted. If all of this works, perhaps these days systems should have a decent amount of swap, much more than the minimal swap space that we have tended to configure so far.

(All of this is more true on NVMe drives than SSDs, though, and all of our servers still use SSDs for their system drives.)

However, all of this is theoretical. I don't know if it actually works in practice, especially on SSDs (where even a one millisecond delay for a page-in is the same cost as 10,000 accesses to RAM, and that's probably fast for SSDs). System-wide swap thrashing on SSDs seems like a particularly bad case, and our most likely case on most servers. Per-user RAM limits seem like a better case for using a lot of swap, but even then we may not be doing people any real favours and they might be better off having the offending process just terminated.

(All of this was sparked by a Twitter thread.)

Comments on this page:

It seems that even when using spinning rust, swap is useful:

One interesting observation:

3. Disabling swap does not prevent disk I/O from becoming a problem under memory contention, it simply shifts the disk I/O thrashing from anonymous pages to file pages. Not only may this be less efficient, as we have a smaller pool of pages to select from for reclaim, but it may also contribute to getting into this high contention state in the first place.

I had some good luck with using an Intel Optane 900p NVMe SSD to hold swap files. My workload needed a few hundred gigabytes of RAM that I didn't have (and that wouldn't even fit in this motherboard, which is maxed out at 64 GB), so swap was the cheaper option. Once I started using the Optane drive as swap the difference was amazing. My computer was almost completely unaffected even though it was swapping like mad. I had some occasional dropped frames when I was watching a movie (it took hours to run this program…), but you almost couldn't tell that anything was wrong.

{{IMG: 1920 auto atop stats showing write bandwidth and large queue depth}}

There's a screenshot showing 142.5 MB/s of reads along with 1355.3 MB/s of writes to the Optane drive; all of that is swap traffic. The average queue depth is 304.41 and those queued operations had to wait a whole 10.7µs each. Yes, microseconds. As the queue gets smaller the IO waits for less time, too:
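As a back of the envelope check, that combined bandwidth works out to a very large number of page-sized IOs a second (this assumes the usual 4 KiB Linux page size and treats MB as 10^6 bytes):

```python
# Rough arithmetic from the atop numbers in the screenshot, assuming
# 4 KiB pages and decimal megabytes.
PAGE_SIZE = 4096
read_bps = 142.5 * 1e6    # 142.5 MB/s of reads
write_bps = 1355.3 * 1e6  # 1355.3 MB/s of writes

total_ops = (read_bps + write_bps) / PAGE_SIZE
print(f"{total_ops:,.0f} page-sized IOs a second")  # roughly 366,000
```

That is a random IO rate that no amount of spinning rust could plausibly sustain.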

{{IMG: 1920 auto atop stats showing mixed bandwidth and low queue depth}}

It’s my understanding that Optane drives retain their desirable low access times even with mixed read and write traffic, while most SSDs (and NVMe drives) manage it only with pure reads or pure writes.
