My uncertainty about swapping and swap sizing for SSDs and NVMe drives
The traditional reason to avoid configuring a lot of swap space on your servers, and to avoid using swap in general, was that lots of swap space made it much easier for your system to thrash itself into total overload. But that's wisdom (and painful experience) from the days of system-wide 'global' swapping with your swap on spinning rust (ie, hard drives). Paging evicted memory back in (whether from swap or from its original files) is mostly random IO, and spinning rust had hard limits on how many random IO operations a second it could do, limits that often had to be shared between swapping and regular IO. And with global swapping, any process could be victimized by having to page things back in, or have its regular IO delayed by swapping IO. In theory, things could be different today.
Modern SSDs and especially NVMe drives are much faster and support many more IO operations a second, especially for read IO (ie, paging things back in). Paging is still quite slow compared to simply accessing RAM, but it's nowhere near as terrible as it used to be on spinning rust; various sources suggest that you might see page-in latencies of 100 microseconds or less on good NVMe drives, and perhaps a few milliseconds on SSDs. Since modern SSDs and especially NVMe drives can reach and sustain very high random IO rates, this paging activity is also far less disruptive to whatever other IO other programs are doing.
(The figures I've seen for random access to RAM on modern machines are on the order of 100 nanoseconds. If we assume the total delay for a NVMe page-in is on the order of 100 microseconds (including kernel overheads), a page-in costs you around 1,000 RAM accesses. That's far better than it used to be, although it's not fast. Every additional microsecond of delay costs another 10 RAM accesses.)
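This arithmetic (and the later figure for SSDs) is easy to work out directly. A minimal sketch, taking the latency figures quoted here as assumptions rather than measurements:

```python
# Rough cost of a page-in expressed in equivalent RAM accesses, using the
# ballpark latency figures from the text (assumptions, not measurements).
RAM_ACCESS_NS = 100  # ~100 ns per random RAM access on a modern machine

def pagein_cost_in_ram_accesses(pagein_latency_ns):
    """How many RAM accesses you could have done in one page-in delay."""
    return pagein_latency_ns / RAM_ACCESS_NS

nvme_pagein_ns = 100_000    # ~100 microseconds on a good NVMe drive
ssd_pagein_ns = 1_000_000   # ~1 millisecond on a SATA SSD

print(pagein_cost_in_ram_accesses(nvme_pagein_ns))  # 1000.0
print(pagein_cost_in_ram_accesses(ssd_pagein_ns))   # 10000.0
```

The same function gives the marginal figure too: one extra microsecond of latency (1,000 ns) is another 10 RAM accesses.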
Increasingly, systems also support 'local' swapping in addition to system-wide 'global' swapping: different processes or groups of processes have different RAM limits, so one group can be pushed into swapping without affecting other groups. The affected group still pays a real performance penalty for all of the paging it's doing, but other processes should be mostly unaffected. Their pages shouldn't be evicted from RAM any faster than they otherwise would be, so if they weren't paging before, they shouldn't be paging afterward. And with SSDs and NVMe drives having high concurrent IO limits, the other processes shouldn't be particularly affected by the paging IO either.
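On Linux, this kind of per-group RAM limit is what the cgroup v2 memory controller provides. A sketch of how it might look, assuming a cgroup v2 system mounted at /sys/fs/cgroup and using 'batch' as a hypothetical group name (the sizes and the $SOMEPID variable are placeholders):

```shell
# Create a hypothetical 'batch' cgroup and cap how much RAM it can use,
# so its processes get pushed into reclaim (and swap) before anyone else.
mkdir /sys/fs/cgroup/batch

# memory.high is a soft limit: above it, the group's memory is
# aggressively reclaimed (paged out) but its processes keep running.
echo 4G > /sys/fs/cgroup/batch/memory.high

# memory.max is a hard limit: above it, the group sees OOM kills.
echo 5G > /sys/fs/cgroup/batch/memory.max

# Move a process (by PID) into the group.
echo "$SOMEPID" > /sys/fs/cgroup/batch/cgroup.procs
```

There's also a per-group memory.swap.max knob to cap how much swap the group itself can consume, which is one way to bound how badly such a group can thrash.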
If you're using SSDs or NVMe drives with enough IO capacity (and low enough latency), even system-wide swap thrashing might not be as lethal as it used to be. If everything works well with 'local' swapping, a particular group of processes could be pushed into swap thrashing by its excessive memory usage without doing much to the rest of the system; of course that group might not perform well, and perhaps you'd rather have it terminated and restarted. If all of this works, perhaps systems these days should have a decent amount of swap, much more than the minimal swap space we've tended to configure so far.
(All of this is more true on NVMe drives than SSDs, though, and all of our servers still use SSDs for their system drives.)
However, all of this is theoretical. I don't know if it actually works in practice, especially on SSDs (where even a one millisecond delay for a page-in costs the same as 10,000 RAM accesses, and one millisecond is probably fast for SSDs). System-wide swap thrashing on SSDs seems like a particularly bad case, and it's our most likely case on most servers. Per-user RAM limits seem like a better case for configuring a lot of swap, but even then we may not be doing people any real favours; they might be better off having the offending process simply terminated.
(All of this was sparked by a Twitter thread.)
Written on 21 March 2021.