
2024-07-18

The Linux Out-Of-Memory killer process list can be misleading

Recently, we had a puzzling incident where the OOM killer was triggered for a cgroup, listed some processes, and then reported that it couldn't kill anything:

acctg_prof invoked oom-killer: gfp_mask=0x1100cca (GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
[...]
memory: usage 16777224kB, limit 16777216kB, failcnt 414319
swap: usage 1040kB, limit 16777216kB, failcnt 0
Memory cgroup stats for /system.slice/slurmstepd.scope/job_31944
[...]
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 252732]     0 252732     1443        0    53248        0             0 sleep
[ 252728]     0 252728    37095     1915    90112       54         -1000 slurmstepd
[ 252740]  NNNN 252740  7108532    17219 39829504        5             0 python3
[ 252735]     0 252735    53827     1886    94208      151         -1000 slurmstepd
Out of memory and no killable processes...

We scratched our heads a lot, especially since something seemed to be killing systemd-journald at the same time and the messages being logged suggested that it had been OOM-killed instead (although I'm no longer so sure of that). Why was the kernel saying that there were no killable processes when there was a giant Python process right there?
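One piece of background for reading these listings is that an oom_score_adj of -1000 (the minimum value) exempts a process from the OOM killer, which is why neither slurmstepd process was ever a candidate here. As a minimal sketch of my own (not anything the kernel ships), here is how you might parse such task state lines and pick out the processes the OOM killer could actually consider:

import re

# Minimal sketch: parse "Tasks state" lines from an OOM report and yield
# the tasks that the OOM killer is allowed to pick. Tasks at the minimum
# oom_score_adj of -1000 are exempt from OOM killing.
OOM_SCORE_ADJ_MIN = -1000

TASK_LINE = re.compile(
    r'\[\s*(?P<pid>\d+)\]\s+(?P<uid>\S+)\s+(?P<tgid>\d+)\s+(?P<total_vm>\d+)'
    r'\s+(?P<rss>\d+)\s+(?P<pgtables>\d+)\s+(?P<swapents>\d+)'
    r'\s+(?P<adj>-?\d+)\s+(?P<name>\S+)')

def killable_candidates(lines):
    """Yield (pid, name, rss in pages) for every non-exempt task."""
    for line in lines:
        m = TASK_LINE.search(line)
        if m and int(m.group('adj')) > OOM_SCORE_ADJ_MIN:
            yield int(m.group('pid')), m.group('name'), int(m.group('rss'))

Run against the listing above, this yields only 'sleep' and the giant python3 process, which is exactly why 'no killable processes' was such a head-scratcher.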

What was actually going on is that the OOM killer's task state list leaves out a critical piece of information, namely whether or not a process has already been killed. A surprising number of minutes before this set of OOM messages, the kernel had gone through another round of OOM killing for this cgroup, and that round had ended with:

oom_reaper: reaped process 252740 (python3), now anon-rss:0kB, file-rss:68876kB, shmem-rss:0kB

So the real problem was that this Python process was doing something that left it stuck there, still using memory, even after it had been OOM-killed. The Python process was indeed not killable, for the simple reason that it had already been killed.
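If you catch a process in this half-dead state while it's still around, you can see the oom_reaper's handiwork from userspace: its anonymous memory is gone even though it still shows up in process listings. Here's a small sketch that reads the RssAnon and RssFile fields from /proc/<pid>/status (those fields are real; treating 'RssAnon at zero but process still present' as 'already killed and reaped' is my own heuristic, not a kernel-defined state):

import sys

# Sketch: report a process's state and its anonymous vs file-backed RSS
# from /proc/<pid>/status. A process that the oom_reaper has already dealt
# with will typically show RssAnon at (or near) 0 kB while RssFile can stay
# nonzero, matching the oom_reaper message above.
def proc_status(pid):
    fields = {}
    with open(f'/proc/{pid}/status') as f:
        for line in f:
            key, _, value = line.partition(':')
            fields[key] = value.strip()
    return fields

if __name__ == '__main__':
    pid = int(sys.argv[1])
    st = proc_status(pid)
    print(f"pid {pid} ({st.get('Name')}): state={st.get('State')} "
          f"RssAnon={st.get('RssAnon')} RssFile={st.get('RssFile')}")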

The whole series of events is probably sufficiently rare that it's not worth cluttering the task state listing with some form of 'task status' that would show whether a particular process was already theoretically dead, just not cleaned up yet. Perhaps it could be done with some clever handling of the adjusted OOM score, for example marking such processes with a blank value or a '-'. This would make the field no longer parse as a number, but then kernel log messages aren't an API and can change as the kernel developers see fit.
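For what it's worth, anything that parses these messages could cope with such a marker cheaply. Here's a tiny, purely hypothetical extension of the parsing sketch above; the '-' convention is my invention, not anything current kernels emit:

# Hypothetical: treat '-' or a blank in the oom_score_adj column as
# "already killed, just not cleaned up yet". No kernel emits this today;
# it only illustrates the marker suggested above.
def parse_adj(field):
    if field in ('-', ''):
        return None
    return int(field)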

(This happened on one of the GPU nodes of our SLURM cluster, so our suspicion is that some CUDA operation (or a GPU operation in general) was in progress and that the process couldn't be cleaned up and collected until it finished. But there were other anomalies at the time, so something even odder could have been going on.)

linux/OOMKillProcessListMisleading written at 23:30:12;

