2024-07-18
The Linux Out-Of-Memory killer process list can be misleading
Recently, we had a puzzling incident where the OOM killer was triggered for a cgroup, listed some processes, and then reported that it couldn't kill anything:
acctg_prof invoked oom-killer: gfp_mask=0x1100cca (GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
[...]
memory: usage 16777224kB, limit 16777216kB, failcnt 414319
swap: usage 1040kB, limit 16777216kB, failcnt 0
Memory cgroup stats for /system.slice/slurmstepd.scope/job_31944
[...]
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 252732]     0  252732     1443        0     53248        0             0 sleep
[ 252728]     0  252728    37095     1915     90112       54         -1000 slurmstepd
[ 252740]  NNNN  252740  7108532    17219  39829504        5             0 python3
[ 252735]     0  252735    53827     1886     94208      151         -1000 slurmstepd
Out of memory and no killable processes...
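(Those per-task lines are fixed, whitespace-separated columns, so if you want to post-process an OOM report from the journal, a little script can pull them apart. Here's a minimal sketch in Python, assuming you feed it raw kernel messages on standard input, for example from 'journalctl -k -o cat' or a saved dmesg; the script and its regular expression are mine, not anything standard.)

#!/usr/bin/python3
# Sketch: pull the per-task lines of an OOM "Tasks state" dump (in the
# format shown above) into dictionaries keyed by the header's field names.
# Reads kernel messages on stdin, e.g. 'journalctl -k -o cat | oom-tasks.py'.
import re
import sys

TASK_RE = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\S+)\s+(?P<tgid>\d+)\s+(?P<total_vm>\d+)"
    r"\s+(?P<rss>\d+)\s+(?P<pgtables_bytes>\d+)\s+(?P<swapents>\d+)"
    r"\s+(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)"
)

def parse_tasks(lines):
    # Yield one dict per task line; all other kernel messages are skipped.
    for line in lines:
        m = TASK_RE.search(line)
        if m:
            yield m.groupdict()

if __name__ == "__main__":
    for task in parse_tasks(sys.stdin):
        print(task["pid"], task["name"], task["oom_score_adj"])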
We scratched our heads a lot, especially since something seemed to be killing systemd-journald at the same time, and the messages being logged suggested that it was systemd-journald that had been OOM-killed (although I'm no longer so sure of that). Why was the kernel saying that there were no killable processes when there was a giant Python process right there?
What was actually going on is that the OOM task state list leaves out a critical piece of information, namely whether or not the process in question had already been killed. A surprising number of minutes before this set of OOM messages, the kernel had already gone through a round of OOM killing for this cgroup, which ended with:
oom_reaper: reaped process 252740 (python3), now anon-rss:0kB, file-rss:68876kB, shmem-rss:0kB
So the real problem was that this Python process was doing something that left it stuck there, still using memory, even after it was OOM-killed. The Python process was indeed not killable, for the simple reason that it had already been killed.
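(One thing you can do after the fact is cross-check the OOM task list against earlier oom_reaper messages, which tell you which processes had already been killed and reaped. Here's a minimal sketch in Python, again assuming raw kernel messages on standard input; it ignores complications like PID reuse, and the regular expressions are mine, based only on the message formats shown here.)

#!/usr/bin/python3
# Sketch: remember which PIDs the oom_reaper says it has already reaped,
# and flag them if they later show up in an OOM "Tasks state" dump.
# Reads kernel messages on stdin, e.g. 'journalctl -k -o cat | oom-check.py'.
# (PID reuse and other real-world complications are ignored.)
import re
import sys

REAPED_RE = re.compile(r"oom_reaper: reaped process (\d+) \((\S+)\)")
TASK_RE = re.compile(
    r"\[\s*(\d+)\]\s+\S+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+-?\d+\s+(\S+)"
)

reaped = {}  # pid (as a string) -> process name from the oom_reaper message

for line in sys.stdin:
    m = REAPED_RE.search(line)
    if m:
        reaped[m.group(1)] = m.group(2)
        continue
    m = TASK_RE.search(line)
    if m and m.group(1) in reaped:
        print(f"pid {m.group(1)} ({m.group(2)}) in the OOM task list"
              f" was already reaped earlier")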
The whole series of events is probably sufficiently rare that it's not worth cluttering the tasks state listing with some form of 'task status' that would show if a particular process was already theoretically dead, just not cleaned up. Perhaps it could be done with some clever handling of the adjusted OOM score, for example marking such processes with a blank value or a '-'. This would make the field not parse as a number, but then kernel log messages aren't an API and can change as the kernel developers like.
(This happened on one of the GPU nodes of our SLURM cluster, so our suspicion is that some CUDA operation (or a GPU operation in general) was in progress and the process couldn't be cleaned up and collected until it finished. But there were other anomalies at the time, so something even odder could have been going on.)