Understanding Resident Set Size and the RSS problem on modern Unixes
On a modern Unix system with all sorts of memory sharing between processes, Resident Set Size is a hard thing to explain; I resorted to a very technical description in my entry on Linux memory stats. To actually understand RSS, let's back up and imagine a hypothetical old system that has no memory sharing between processes at all; each page of RAM is either free or in use by exactly one process.
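To make this concrete, modern Unixes do still report a per-process RSS number; on Linux, for example, it shows up as the "VmRSS:" line in /proc/&lt;pid&gt;/status. A minimal, Linux-specific sketch of reading it (the helper names are my own invention):

```python
# Linux-specific sketch: a process's RSS is reported in /proc/<pid>/status
# as a "VmRSS:" line, in kilobytes. Helper function names are invented.
def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     12345 kB".
            return int(line.split()[1])
    return None  # kernel threads and the like have no VmRSS line

def current_rss_kb():
    """RSS of the calling process, in kB (Linux only)."""
    with open("/proc/self/status") as f:
        return parse_vmrss_kb(f.read())
```

As we'll see, the single number this hands back papers over a lot of complexity about shared pages.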
(We'll ignore the RAM the operating system itself uses. In old Unixes, this was an acceptable simplification; memory was statically divided between memory used by the OS and memory used by user programs.)
In this system, processes acquire new pages of RAM by trying to access them and then either having them allocated or having them paged (back) in from disk. Meanwhile, the kernel is running around trying to free up memory, generally using some approximation of finding the least recently used page of RAM. How aggressively the operating system tries to reclaim pages depends on how much free memory it has; the less free memory, the faster the OS tries to grab pages back. In this environment, the resident set size of a process is how many pages of RAM it has. If the system is not thrashing, ie if there's enough memory to go around, a process's RSS is how much RAM it actually needs in order to work at its current pace.
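This old model is simple enough to sketch in a few lines of toy code, assuming a "second chance" sweep as the LRU approximation (a classic one; all the classes and names here are invented for illustration):

```python
# Toy model of the no-sharing world: every page is free or owned by exactly
# one process, and the kernel reclaims pages with a "second chance" sweep,
# a common LRU approximation. Everything here is invented for illustration.
class Page:
    def __init__(self, owner=None):
        self.owner = owner       # None means the page is free
        self.referenced = False  # set by hardware when the page is touched

def rss(pages, proc):
    # In this model, a process's RSS is simply the pages it owns.
    return sum(1 for p in pages if p.owner == proc)

def reclaim_one(pages):
    """Sweep physical RAM, clearing reference bits; free the first in-use
    page that was not referenced since the previous sweep."""
    for p in pages:
        if p.owner is None:
            continue
        if p.referenced:
            p.referenced = False  # second chance: spare it this sweep
        else:
            p.owner = None        # approximately least recently used
            return p
    return None
```

In this world a page has exactly one owner, so "remove the page" and "get the RAM back" are the same operation; that equivalence is what sharing breaks.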
(All of this is standard material from an operating system course.)
The problem of RSS on modern Unix systems is how to adapt this model to an environment where processes share significant amounts of memory with each other. In the face of a lot of sharing, what does it mean for a process to have a resident set size and how do you find the right pages to free up?
There are at least two approaches the kernel can take to reclaiming pages, which we can call the 'physical' and 'process' approaches. In the physical approach the kernel continues to scan over physical RAM to identify candidate pages to be freed up; when it finds one, it takes it away from all of the processes using it at once (this is the 'global' removal of my earlier entry). In the process approach the kernel scans each process more or less independently, finding candidate pages and removing them only from the process (a 'local' removal); only once a candidate page has been removed from all processes using it is it actually freed up.
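The essential difference between the two can be sketched as a data structure, assuming each physical page tracks the set of processes mapping it (the structures and names are invented for illustration):

```python
# Sketch of the two reclaim strategies over a shared page. A physical page
# knows which processes map it; 'physical' reclaim is a global removal,
# 'process' reclaim is a local one that only frees the page once the last
# mapping is gone. All structures here are invented for illustration.
class PhysPage:
    def __init__(self, mappers):
        self.mappers = set(mappers)  # processes currently mapping this page
        self.free = False

def reclaim_physical(page):
    """Physical approach: take the page away from everyone at once."""
    page.mappers.clear()
    page.free = True

def reclaim_from_process(page, proc):
    """Process approach: remove only this process's mapping; the page is
    actually freed only when no process maps it any more."""
    page.mappers.discard(proc)
    if not page.mappers:
        page.free = True
```

In the process approach, a local removal is cheap but may not get you any RAM; in the physical approach, every removal gets RAM back but you have to touch every mapping at once.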
(Scanning each 'process' is a simplification. Really the kernel scans each separate set of page tables; there are situations where multiple processes share a single set of page tables.)
The problem with the process approach is that the kernel can spend a great deal of time removing pages from processes when the pages will never actually be reclaimed for real. Imagine two processes with a shared memory area; one process uses it actively and one process only uses it slowly. The kernel can spend all the time it wants removing pages of the shared area from the less active process without ever actually getting any RAM back, because the active process is keeping all of those pages in RAM anyways.
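A toy demonstration of this wasted work, with an 'active' and a 'slow' process sharing one page (invented for illustration):

```python
# Toy demonstration: scanning the slow process removes its mapping of the
# shared page over and over, but the active process always maps the page,
# so the page is never actually freed and no RAM is ever reclaimed.
mappers = {"active", "slow"}
pages_freed = 0
for _ in range(1000):            # many scan passes over the slow process
    mappers.discard("slow")      # local removal from the slow process
    if not mappers:
        pages_freed += 1         # would reclaim the page here...
    mappers.add("slow")          # ...but it just faults the page back in
print(pages_freed)  # 0: all that scanning got no RAM back
```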
So, why doesn't everyone use the physical approach? My understanding is that the physical approach is often a poor fit for how the hardware exposes information about virtual memory activity. Per my earlier entry, every process mapping a shared page of RAM can have a different page table entry for it. To find out if the page of RAM has been accessed recently you may have to find and look at all of those PTEs (with locking), and do so for every page of physical RAM you look at.
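A sketch of that cost, assuming the kernel can somehow get from a physical page to the list of PTEs mapping it (a real kernel needs a reverse-mapping structure and locking for this; everything here is invented for illustration):

```python
# Sketch of the cost: the hardware 'accessed' bit lives in each PTE, so for
# a shared physical page the kernel must find and inspect every PTE that
# maps it. Structures and names here are invented for illustration.
class PTE:
    def __init__(self, accessed=False):
        self.accessed = accessed  # set by hardware on access to the page

def page_recently_used(ptes_for_page):
    """True if any mapping of this physical page was touched recently;
    also clears the bits so the next scan measures the next interval."""
    used = any(pte.accessed for pte in ptes_for_page)
    for pte in ptes_for_page:
        pte.accessed = False
    return used
```

A page shared by N processes thus costs up to N PTE lookups per scan pass, where the per-process approach looks at each PTE only when it scans that process.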
My impression is that most current Unixes normally use per-process scanning, perhaps falling back on physical scanning if memory pressure gets sufficiently severe.
(I suspect and hope that virtual memory management in the face of shared pages has been studied academically, just as the older and simpler model of virtual memory has been, but I'm out of contact with OS research.)