Debuggers and two sorts of bugs

December 5, 2011

In the middle of reading Go Isn't C I ran across some remarks about how it was bad that programmers ignore debuggers in favour of print statements. This immediately sparked my standard reaction, which is that debuggers are focused on telling you what happens next, while what I generally want to know is how on earth my code got into its current state. Then I thought about it some more and realized that I was overstating things, because this isn't really accurate. In fact I use debuggers periodically, but only on certain sorts of bugs.

Let us say that there are two sorts of bugs. For lack of better names, I will call them direct bugs and indirect bugs. A direct bug's cause can be determined immediately by looking at the call stack, the local variables, the code, and so on at the time when it happens. You can say 'oh, the caller forgot that this function couldn't be called with a NULL', or see that you forgot to handle a case and something fell through to code that should never have been reached in this situation. A decent debugger works very well on direct bugs, and so do even simpler features like automatic call stack backtraces on uncaught exceptions (as you get in languages like Python).
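As a minimal sketch of a direct bug, here is an invented Python example (the function names and the record layout are hypothetical, made up purely for illustration). The caller forgets that the helper can't be handed None, and the backtrace alone is enough to see it:

```python
import traceback

def normalize(name):
    # Hypothetical helper: assumes name is always a str, never None.
    return name.strip().lower()

def lookup_user(record):
    # The direct bug: this caller forgets that "name" may be missing,
    # so record.get() returns None and that is passed to normalize().
    return normalize(record.get("name"))

def run():
    # Capture the backtrace as text instead of letting the program die,
    # so that it can be printed and inspected.
    try:
        lookup_user({"id": 7})
    except AttributeError:
        return traceback.format_exc()
    return ""

report = run()
print(report)
```

The backtrace names both lookup_user and normalize, so the guilty caller is right there in the current state of the program. No looking back into the past is needed.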

Indirect bugs are data structure corruption bugs (or sometimes flow of control bugs), where you are now in a 'this can't happen' situation (whether caught by an assert() or not). Finding the immediate problem in the code or diagnosing the source of corrupt results is only a starting point; the real challenge is discovering where and how things went off the rails so as to get you to where you are now. Indirect bugs are the bugs where you need to look back into the past to answer 'how did I get here?' questions.
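By contrast, here is a rough sketch of an indirect bug, again in Python and again entirely invented (the Inventory class and its invariant are assumptions for the sake of the example). The code that breaks the data structure has long since returned by the time anything fails, so it appears nowhere in the backtrace:

```python
import traceback

class Inventory:
    # Hypothetical container with an invariant: count == len(items).
    def __init__(self):
        self.items = []
        self.count = 0

    def add(self, item):
        self.items.append(item)
        self.count += 1

    def discard(self, item):
        # The real bug: the item is removed but count is not updated,
        # which silently breaks the invariant.
        self.items.remove(item)

    def last(self):
        # The failure shows up here, later, and far from the culprit.
        return self.items[self.count - 1]

inv = Inventory()
inv.add("a")
inv.add("b")
inv.discard("a")    # corruption happens here, with no visible symptom

try:
    inv.last()      # the 'this can't happen' moment
    report = ""
except IndexError:
    report = traceback.format_exc()

print(report)
```

The backtrace points at last(), and inspecting the object shows that count and items disagree, but nothing in the current state says which earlier call caused the disagreement. That is the 'how did I get here?' question.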

(For practical examples, my recent Liferea issue was a direct bug; if I had read the code carefully, the first stack backtrace would have shown me the problem. My SIGCHLD signal handler race in Python was an indirect bug; I always knew what the direct problem was, but I had no idea how the program got into that state until I did some careful analysis.)

My unconscious bias until now has been that direct bugs are uninteresting because they are easy to solve by basic inspection, so I only really thought about what I wanted for dealing with indirect bugs. But the main reason that direct bugs are easy to deal with is that I already have tools like stack backtraces and inspection of local variables, which make it easy to see what's wrong with the program's current state.

Sidebar: a hazard of dealing with indirect bugs

It's often popular to 'fix' an indirect bug that crashes the program or generates obviously bad results by making the code accept the impossible state; for example, by adding a NULL check to the low-level routine that's crashing on a NULL pointer. This is generally the wrong idea (you're treating the symptoms instead of the disease), but it's tempting as a quick fix and it's an easy trap to fall into if you don't understand the code well enough to know that you're dealing with an indirect bug, not a direct bug.
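To make the hazard concrete, here is a small invented Python sketch (the functions, the rows, and the misspelled key are all hypothetical). The low-level routine has been 'fixed' to skip None, so the crash goes away, but the real bug upstream is still there and the answer is now quietly wrong:

```python
def load_names(rows):
    # Hypothetical upstream code. The real bug lives in the data it is
    # fed: one row has a misspelled key, so .get() returns None for it.
    return [row.get("name") for row in rows]

def total_length(names):
    total = 0
    for name in names:
        if name is None:
            # The tempting 'fix': accept the impossible state and move on.
            # Without this check, len(None) would raise a TypeError.
            continue
        total += len(name)
    return total

rows = [{"name": "alice"}, {"nmae": "bob"}]   # note the misspelled key
names = load_names(rows)
result = total_length(names)
print(result)   # prints 5; the intended answer was 8
```

The program no longer crashes, which makes the underlying problem harder to notice than it was before the 'fix'.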

This is one of the issues that always makes me wary about fixing 'obvious' crash bugs, especially if I want to send the fix upstream. Before I add a NULL pointer check or the like I need to be sure that it's the real bug, and I need to understand what the code should do instead of crashing (which is not always obvious).

Written on 05 December 2011.

Last modified: Mon Dec 5 00:38:56 2011