2014-10-03
Why people are almost never going to be reporting bugs upstream
In comments on my entry about CentOS bug reporting, opk wrote:
If it is essentially an upstream bug and not packaging I tend to think it's far better to wade into the upstream swamps as you call it. I once packaged something for Debian and mainly gave it up because of the volume of bug reports that were basically for upstream but I had to verify, reproduce, and forward them.
Then Pete left a comment that nicely summarizes the problems with opk's suggestion:
[...] But of course you have to commit to running versions with debugging and then of course there's "the latest" even for the 7. Due to the critical nature of my NM use, I had difficulties experimenting with it.
The reality is that upstream bug reports aren't going to work for almost everyone unless the project has a very generous upstream. The problem is simple: almost all Linux distributions both use old versions of packages and patch them. If your version is patched or not current or both, almost every open source project is going to immediately ask 'can you reproduce this with an unmodified current version?'
I won't go so far as to say that this request is a non-starter, because in theory it can be done. For some projects it is good enough to download the current version (perhaps the current development version) and compile it yourself to be installed in an alternate location (or just run from where it was compiled). Other projects can be rebuilt into real distribution packages and then installed on your system without blowing up the world. And of course if this bug is absolutely critical to you, maybe you're willing to blow up the world just to be able to submit a bug report.
What all of this adds up to is too much work, especially for the payoff most people are likely to get. The reality is that you're unlikely to benefit much from reporting any bug, and you're especially unlikely to benefit from upstream bug fixes unless you're willing to permanently run the upstream version (because if you're not, your distribution has to pick up and possibly backport the upstream bug fix if one is made).
(Let's skip the question of how many bug reporters even have the expertise to go through the steps necessary to try out the upstream version.)
Because reporting bugs upstream is so much work, in practice almost no one is going to do it no matter what you ask (or at least they aren't going to file useful ones). The direct corollary is that a policy of 'report bugs upstream' is in practice a policy of 'don't file bug reports'.
The one semi-exception to all of this is when your distribution package is an unmodified upstream version that the upstream (still) supports. At that point it makes sense to put a note in your bug tracker to explain this and say that upstream will take reports without problems. You're still asking bug reporters to do more work (now they have to go deal with the upstream bug reporting system too), but at least it's a pretty small amount of work.
When using Illumos's lockstat, check the cumulative numbers too
Suppose, not entirely hypothetically, that you have an Illumos (or OmniOS or etc) system that is experiencing something that looks an awful lot like kernel contention; for example, periodic 'mpstat 1' output where one CPU is spending 100% of its time in kernel code. Perhaps following Brendan Gregg's Solaris USE method, you stumble over lockstat and decide to give it a try. This is a fine thing, as it's a very nice tool and can give you lots of fascinating output.
However, speaking from recent experience, I urge you to at some point run lockstat with the -P option and check that output too. I believe that lockstat normally sorts its output by count, highest first; -P changes this to sort by total time (ie the count times its displayed average time). The very important thing that this does is it very prominently surfaces relatively rare but really long things. In my case, I spent a bunch of time and effort looking at quite frequent and kind of alarming looking adaptive mutex spins, but when I looked at 'lockstat -P' I discovered a lock acquisition that only had 30 instances over 60 seconds but that had an average spin time (not block time) of 55 milliseconds.
(Similarly, when I looked at the adaptive mutex block times I discovered the same lock acquisition, this time blocked 37 times in 60 seconds with an average block time of 1.6 seconds.)
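To make the difference between the two sort orders concrete, here is a minimal Python sketch. It is not lockstat's own code or output format; the LockEvent class and the two "frequent" entries are invented for illustration, and only the 30 events at an average of 55 milliseconds come from my actual numbers. It simply ranks the same events once by count and once by count times average time.

```python
from dataclasses import dataclass


@dataclass
class LockEvent:
    name: str      # hypothetical label, not a real lock or caller name
    count: int     # number of events seen in the sample window
    avg_nsec: int  # average time per event, in nanoseconds

    @property
    def total_nsec(self) -> int:
        # The cumulative figure: count times average time.
        return self.count * self.avg_nsec


events = [
    LockEvent("frequent_spin_a", 250_000, 1_200),  # invented numbers
    LockEvent("frequent_spin_b", 90_000, 2_500),   # invented numbers
    LockEvent("rare_long_spin", 30, 55_000_000),   # 30 events, 55 ms average
]

# Ranking by count alone (what I believe the default order is).
by_count = sorted(events, key=lambda e: e.count, reverse=True)

# Ranking by count * average time (the idea behind -P).
by_total = sorted(events, key=lambda e: e.total_nsec, reverse=True)

for title, ranking in (("by count", by_count), ("by total time", by_total)):
    print(title)
    for e in ranking:
        print(f"  {e.name:16} count={e.count:>7} total={e.total_nsec / 1e9:.3f}s")
```

With these made-up companions, the rare event is last when ranked by count and first when ranked by total time, since 30 events at 55 milliseconds each is 1.65 seconds of spinning. The same arithmetic on the block times above (37 events at 1.6 seconds each) comes to about 59 seconds of cumulative blocking in a 60 second sample.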
In theory you can spot these things when scanning through the full lockstat output even without -P, but in practice humans don't work that way; we scan the top of the list and then as everything starts to dwindle away into sameness our eyes glaze over. You're going to miss things, so let lockstat do the work for you to surface them.
(If you specifically suspect long things you can use -d to only report on them, but picking a useful -d value probably requires some guesswork and looking at basic lockstat output.)
By the way, there turn out to be a bunch of interesting tricks you can do with lockstat. I recommend reading all the way through the EXAMPLES section and especially paying attention to the discussion of why various flags get used in various situations. Unlike the usual manpage examples, it only gets more interesting as it goes along.
(And if you need really custom tooling you can use the lockstat DTrace provider in your own DTrace scripts. I wound up doing that today as part of getting information on one of our problems.)