Wandering Thoughts archives

2021-01-01

An interesting and puzzling bit of Linux server utilization

Today, due to looking at dashboards in our Prometheus system supplemented with some use of 'perf top', I discovered that a person here had managed an interesting performance achievement on one of our SLURM CPU nodes. What initially attracted my attention was that the node had sustained 100% CPU usage, but it's the details that make it interesting.

This node has a 1 Gb Ethernet interface, and it was fully saturating that interface with NFS reads from one of our fileservers. However, the 100% CPU usage was not from user CPU; instead it was about 97% system time, 2% iowait, 0.2% in softirq, and a tiny fraction in user time (under a tenth of a percent, and I've rounded these figures). This was already relatively interesting, since it strongly suggests that almost nothing was being done with all of that NFS IO. As kind of expected, there were a ton of interrupts and context switches (although far more than can be explained by plain network IO). The node had a lot of processes running and blocked, running up the load average, so it wasn't just a single process doing all of this.
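As an aside, you can watch this sort of breakdown live on the node itself instead of through dashboards. A minimal sketch, assuming sysstat's mpstat and procps's vmstat are installed (and picking an arbitrary five second interval):

  # per-CPU split of user, system, iowait, and softirq time, every 5 seconds
  mpstat -P ALL 5
  # interrupts (in), context switches (cs), and running (r) and blocked (b)
  # process counts, every 5 seconds
  vmstat 5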

The 'perf top' output was interesting in a suggestive way; a sample looked like this:

 66.44%  [kernel]    [k] native_queued_spin_lock_slowpath
 12.07%  [kernel]    [k] _raw_spin_lock
  4.28%  [kernel]    [k] list_lru_count_one
  3.36%  [kernel]    [k] isolate_lru_pages.isra.59
  2.64%  [kernel]    [k] putback_inactive_pages
  2.02%  [kernel]    [k] __isolate_lru_page
  1.24%  [kernel]    [k] super_cache_count
  1.09%  [kernel]    [k] shrink_page_list
 [...]

For some reason this workload (whatever it was) was lighting the kernel on fire with spinlock contention. Since there were multiple processes involved (and 'perf top --sort comm,dso' said that the kernel activity was distributed across a number of them), the obvious theory is that they were contending over locks related to one or perhaps a few virtual memory areas shared between them. Alternately, perhaps they were all trying to read from the same NFS file and that was causing massive amounts of kernel contention.

(Looking at perf-top(1) suggests that I should have tried 'perf top -g', and perhaps worked out how to get a general record of a few minutes to analyze it in detail. Brendan Gregg's perf page has some examples I should study.)
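As far as I can tell from the manual pages (I haven't actually done this yet), getting such a record to analyze later would look something like the following, with the three minute duration being an arbitrary choice:

  # sample all CPUs with call graphs for three minutes, writing perf.data
  perf record -a -g -- sleep 180
  # then summarize it, either flat or grouped by process and DSO
  perf report
  perf report --sort comm,dso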

Now that I've looked at the kernel source, isolate_lru_pages has a suggestive comment to the effect that:

For pagecache intensive workloads, this function is the hottest spot in the kernel (apart from copy_*_user functions).

I didn't see any of the copy_*_user functions in the 'perf top' results, but I'm not sure if they're involved for mmap()'d files or if NFS is otherwise somewhat special.
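Since isolate_lru_pages and several of the other functions in that 'perf top' output are part of the kernel's memory reclaim code, one thing I could have checked at the time (and didn't) is whether the node was busy reclaiming pages. A crude sketch of that is to watch the cumulative reclaim counters in /proc/vmstat:

  # pgscan_* and pgsteal_* only ever count up; if they're climbing fast,
  # the kernel is spending real effort scanning and reclaiming pages
  grep -E '^(pgscan|pgsteal)' /proc/vmstat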

One of the things this illustrates once again is that system level metrics (even detailed ones) don't necessarily tell you the underlying causes. They may give you high level whats ('this system is spending almost all of its time in the kernel'), but without the all-important why. It also shows me that I should learn more about Linux performance troubleshooting. Exploring 'perf top' on our various systems has been interesting in general, but I've only scratched the surface of what I can do with perf, and I'm probably going to need to know more sooner or later.

(This load has quietly gone away now so I can't dig into it further.)

linux/HighKernelTimePuzzle written at 23:13:24

