Wandering Thoughts archives

2014-10-03

Why people are almost never going to be reporting bugs upstream

In comments on my entry about CentOS bug reporting, opk wrote:

If it is essentially an upstream bug and not packaging I tend to think it's far better to wade into the upstream swamps as you call it. I once packaged something for Debian and mainly gave it up because of the volume of bug reports that were basically for upstream but I had to verify, reproduce, and forward them.

Then Pete left a comment that nicely summarizes the problems with opk's suggestion:

[...] But of course you have to commit to running versions with debugging and then of course there's "the latest" even for the 7. Due to the critical nature of my NM use, I had difficulties experimenting with it.

The reality is that upstream bug reports aren't going to work for almost everyone unless the project has a very generous upstream. The problem is simple: almost all Linux distributions both use old versions of packages and patch them. If your version is patched or not current or both, almost every open source project is going to immediately ask 'can you reproduce this with an unmodified current version?'

I won't go so far as to say that this request is a non-starter, because in theory it can be done. For some projects it is good enough to download the current version (perhaps the current development version) and compile it yourself to be installed in an alternate location (or just run from where it was compiled). Other projects can be rebuilt into real distribution packages and then installed on your system without blowing up the world. And of course if this bug is absolutely critical to you, maybe you're willing to blow up the world just to be able to submit a bug report.

All of this is too much work, especially for the payoff most people are likely to get. The reality is that you're unlikely to benefit much from reporting any bug, and you're especially unlikely to benefit from upstream bug fixes unless you're willing to permanently run the upstream version (because if you're not, your distribution has to pick up and possibly backport the upstream bug fix if one is made).

(Let's skip the question of how many bug reporters even have the expertise to go through the steps necessary to try out the upstream version.)

Because reporting bugs upstream is so much work, in practice almost no one is going to do it no matter what you ask (or at least they aren't going to file useful ones). The direct corollary is that a policy of 'report bugs upstream' is in practice a policy of 'don't file bug reports'.

The one semi-exception to all of this is when your distribution package is an unmodified upstream version that the upstream (still) supports. At that point it makes sense to put a note in your bug tracker to explain this and say that upstream will take reports without problems. You're still asking bug reporters to do more work (now they have to go deal with the upstream bug reporting system too), but at least it's a pretty small amount of work.

linux/NoUpstreamBugReports written at 23:56:29

When using Illumos's lockstat, check the cumulative numbers too

Suppose, not entirely hypothetically, that you have an Illumos (or OmniOS, etc) system that is experiencing something that looks an awful lot like kernel contention; for example, periodic 'mpstat 1' output where one CPU is spending 100% of its time in kernel code. Perhaps following Brendan Gregg's Solaris USE method, you stumble over lockstat and decide to give it a try. This is a fine thing, as it's a very nice tool and can give you lots of fascinating output.

However, speaking from recent experience, I urge you to at some point run lockstat with the -P option and check that output too. I believe that lockstat normally sorts its output by count, highest first; -P changes this to sort by total time (ie the count times its displayed average time). The important thing this does is prominently surface things that are relatively rare but really long. In my case, I spent a bunch of time and effort looking at quite frequent and kind of alarming looking adaptive mutex spins, but when I looked at 'lockstat -P' I discovered a lock acquisition that only had 30 instances over 60 seconds but that had an average spin time (not block time) of 55 milliseconds.

(Similarly, when I looked at the adaptive mutex block times I discovered the same lock acquisition, this time blocked 37 times in 60 seconds with an average block time of 1.6 seconds.)
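To make the difference between the two sort orders concrete, here is a minimal Python sketch. It is not lockstat's code or output format. The two short, frequent spin entries and all of the names are made up for illustration; only the 30-instance, 55 millisecond spin and the 37-instance, 1.6 second block figures come from the numbers above. It also assumes that the -P sort key is simply the count times the average time.

```python
from dataclasses import dataclass


@dataclass
class LockEvent:
    name: str    # hypothetical label, not a real lock or caller name
    count: int   # number of events seen in the sample window
    avg_ns: int  # average time per event, in nanoseconds

    @property
    def total_ns(self) -> int:
        # Assumed -P sort key: count times the displayed average time.
        return self.count * self.avg_ns


# Adaptive mutex spins over a 60 second window.
spins = [
    LockEvent("frequent_short_spin_a (made up)", 50_000, 2_000),
    LockEvent("frequent_short_spin_b (made up)", 20_000, 1_500),
    LockEvent("rare_long_spin (from the text)", 30, 55_000_000),
]

# The same lock acquisition as seen in the block times: 37 blocks at 1.6 s each.
rare_block = LockEvent("rare_long_block (from the text)", 37, 1_600_000_000)

by_count = sorted(spins, key=lambda e: e.count, reverse=True)
by_total = sorted(spins, key=lambda e: e.total_ns, reverse=True)

if __name__ == "__main__":
    print("Sorted by count:")
    for e in by_count:
        print(f"  {e.count:>6}  {e.total_ns / 1e9:8.3f}s  {e.name}")
    print("Sorted by total time:")
    for e in by_total:
        print(f"  {e.count:>6}  {e.total_ns / 1e9:8.3f}s  {e.name}")
    print(f"Total block time: {rare_block.total_ns / 1e9:.1f}s of a 60s window")
```

With these numbers the rare spin comes last when sorted by count and first when sorted by total time, since 30 spins at 55 milliseconds add up to 1.65 seconds. The block figures are even more striking: 37 blocks averaging 1.6 seconds is about 59 seconds of blocking in a 60 second sample.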

In theory you can spot these things when scanning through the full lockstat output even without -P, but in practice humans don't work that way; we scan the top of the list and then as everything starts to dwindle away into sameness our eyes glaze over. You're going to miss things, so let lockstat do the work for you to surface them.

(If you specifically suspect long things you can use -d to only report on them, but picking a useful -d value probably requires some guesswork and looking at basic lockstat output.)

By the way, there turn out to be a bunch of interesting tricks you can do with lockstat. I recommend reading all the way through the EXAMPLES section of the lockstat manpage and especially paying attention to the discussion of why various flags get used in various situations. Unlike the usual manpage examples, it only gets more interesting as it goes along.

(And if you need really custom tooling you can use the lockstat DTrace provider in your own DTrace scripts. I wound up doing that today as part of getting information on one of our problems.)

solaris/LockstatCheckCumulatives written at 02:59:08



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.