2021-01-01
An interesting and puzzling bit of Linux server utilization
Today, thanks to looking at dashboards in our Prometheus system,
supplemented with some use of 'perf top', I discovered that a person
here had managed an interesting performance achievement on one of
our SLURM CPU nodes. What initially attracted my attention was that
the node had sustained 100% CPU usage, but it's the details that
make it interesting.
This node has a 1G (gigabit) Ethernet interface, and it was fully saturating that interface with NFS reads from one of our fileservers. However, the 100% CPU usage was not user CPU; instead it was about 97% system time, 2% iowait, 0.2% in softirq, and a tiny fraction in user time (under a tenth of a percent; I've rounded all of these). This was already relatively interesting, since it strongly suggests that almost nothing was actually being done with all of that NFS IO.

As you'd kind of expect, there were a ton of interrupts and context switches (although far more of them than can be explained by plain network IO). The node had a lot of processes running and blocked, running up the load average, so it wasn't just a single process doing all of this.
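As a side note, if you want to see this sort of CPU time split and the
interrupt and context switch rates directly on a machine instead of
through Prometheus dashboards, one sketch of a way to do it (assuming
you have mpstat from the sysstat package; vmstat is almost always
present) is:

  # per-CPU split of %usr, %sys, %iowait, %soft and so on, once a second
  mpstat -P ALL 1
  # 'in' and 'cs' are interrupts and context switches per second;
  # 'us', 'sy', 'id', and 'wa' are the overall CPU time split
  vmstat 1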
The 'perf top' output was interesting in a suggestive way; a sample
looked like this:

  66.44%  [kernel]  [k] native_queued_spin_lock_slowpath
  12.07%  [kernel]  [k] _raw_spin_lock
   4.28%  [kernel]  [k] list_lru_count_one
   3.36%  [kernel]  [k] isolate_lru_pages.isra.59
   2.64%  [kernel]  [k] putback_inactive_pages
   2.02%  [kernel]  [k] __isolate_lru_page
   1.24%  [kernel]  [k] super_cache_count
   1.09%  [kernel]  [k] shrink_page_list
  [...]
For some reason this workload (whatever it was) was lighting the
kernel on fire with spinlock contention. Since there were multiple
processes involved (and 'perf top --sort comm,dso' said that the
kernel activity was distributed across a number of them), the obvious
theory is that they were contending over locks related to one or
perhaps a few virtual memory areas shared between them. Alternately,
perhaps they were all trying to read from the same NFS file and
that was causing massive amounts of kernel contention.
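In theory, one way to have checked the 'same NFS file' idea would have
been to look at what the processes had open on the relevant NFS mount,
for example with lsof (here '/nfs/fs1' is a stand-in for the real
mount point, and lsof on a busy NFS filesystem can be slow or stall):

  # giving lsof a mount point lists everything open on that filesystem
  lsof /nfs/fs1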
(Looking at perf-top(1) suggests that I should have tried 'perf top -g',
and perhaps worked out how to get a general record of a few minutes
to analyze in detail. Brendan Gregg's perf page has some examples I
should study.)
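Based on the perf manual pages, a sketch of how to get such a record is
a system-wide 'perf record' with call graphs for a few minutes, then
'perf report' on the result (the duration and sort keys here are just
what I'd guess at):

  # record all CPUs with call graphs for three minutes
  perf record -a -g -- sleep 180
  # then summarize, grouped roughly the way 'perf top --sort comm,dso' was
  perf report --sort comm,dso,symbol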
Now that I've looked at it in the kernel source, isolate_lru_pages
has a suggestive comment to the effect that:
  For pagecache intensive workloads, this function is the hottest spot
  in the kernel (apart from copy_*_user functions).
I didn't see any of the copy-to-user calls in the 'perf top' results,
but I'm not sure if they're involved on mmap()'d files or if NFS is
otherwise somewhat special.
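If you want to find that comment yourself, it sits just above
isolate_lru_pages, which in the kernel versions I've looked at is in
mm/vmscan.c, so something like this from the top of a kernel source
tree should turn it up:

  # the comment is above the function definition in mm/vmscan.c
  grep -n -B3 'hottest spot' mm/vmscan.c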
One of the things this illustrates once again is that system level
metrics (even detailed ones) don't necessarily tell you underlying
causes. They may give you high level whats ('this system is spending
almost all of its time in the kernel'), but without the all-important
why. It also shows me that I should learn more about Linux performance
troubleshooting. Exploring 'perf top' on our various systems has
been interesting in general, but I've only scratched the surface
of what I can do with perf and I'm probably going to need to know
more sooner or later.
(This load has quietly gone away now so I can't dig into it further.)