When using Illumos's lockstat, check the cumulative numbers too

October 3, 2014

Suppose, not entirely hypothetically that you have an Illumos (or OmniOS or etc) system that is experiencing something that looks an awful lot like kernel contention; for example, periodic 'mpstat 1' output where one CPU is spending 100% of its time in kernel code. Perhaps following Brendan Gregg's Solaris USE method, you stumble over lockstat and decide to give it a try. This is a fine thing, as it's a very nice tool and can give you lots of fascinating output.

However, speaking from recent experience, I urge you to at some point run lockstat with the -P option and check that output too. I believe that lockstat normally sorts its output by count, highest first; -P changes this to sort by total time (ie the count times its displayed average time). The very important thing that this does is it very prominently surfaces relatively rare but really long things. In my case, I spent a bunch of time and effort looking at quite frequent and kind of alarming looking adaptive mutex spins, but when I looked at 'lockstat -P' I discovered a lock acquisition that only had 30 instances over 60 seconds but that had an average spin time (not block time) of 55 milliseconds.

(Similarly, when I looked at the adaptive mutex block times I discovered the same lock acquisition, this time blocked 37 times in 60 seconds with an average block time of 1.6 seconds.)

In theory you can spot these things when scanning through the full lockstat output even without -P, but in practice humans don't work that way; we scan the top of the list and then as everything starts to dwindle away into sameness our eyes glaze over. You're going to miss things, so let lockstat do the work for you to surface them.

(If you specifically suspect long things you can use -d to only report on them, but picking a useful -d value probably requires some guesswork and looking at basic lockstat output.)

By the way, there turn out to be a bunch of interesting tricks you can do with lockstat. I recommend reading all the way through the EXAMPLES section and especially paying attention to the discussion of why various flags get used in various situations. Unlike the usual manpage examples, it only gets more interesting as it goes along.

(And if you need really custom tooling you can use the lockstat DTrace provider in your own DTrace scripts. I wound up doing that today as part of getting information on one of our problems.)

Written on 03 October 2014.
« The problem with making bug reports about CentOS bugs
Why people are almost never going to be reporting bugs upstream »

Page tools: View Source, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Oct 3 02:59:08 2014
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.