Wandering Thoughts archives

2011-08-31

An interesting debugging experience (another tale from long ago)

This is a programming war story.

Once, a long time ago, I worked on an Amiga program. The program had a bug. Well, it had many bugs, but it had one particularly frustrating one where the program had erratic crashes that we couldn't reproduce or track down. One day we were starting to give an internal demonstration of the program and bang, it crashed. We tried again, and it crashed again. I immediately said 'don't touch anything' and we sat down with this specific machine and build of the program to finally get to the bottom of the problem. (So much for the demo, but we'd gotten really frustrated with the bug and it was a small company run by a programmer, so the boss understood.)

At least back in those days, the Amiga had no virtual memory or other memory protection (CPUs with those capabilities were far too expensive for personal computers). It was also a preemptive multi-tasking system, with multiple processes (which were more like threads, since they all ran in the same address space with no protection from each other). Now, of course, each process needed its own stack, and without memory protection they were fixed size and not expandable. There was a system setting for how big a stack each new process should get.

You can see where this is going. All of the programmers had Amigas with relatively large amounts of memory and did complex things with them and we'd vaguely heard that some programs needed bigger stack sizes to be reliable, so we'd increased our stack sizes well above the default. Meanwhile, the default stack size had been designed so that an Amiga with the minimum amount of memory could still work and run enough processes and so was rather small.

(This was a long time ago and RAM sizes were much smaller back then. My memory is that the default stack size was 4K and we had generally turned it up to 16K.)

And, of course, one of the programmers had written a function with a relatively large amount of local variables (my memory is that he'd put a decent sized array on the stack), enough that the program easily exceeded the default stack size when this function was called. Run on our machines with their enlarged stack sizes, this function worked fine (or at least usually worked fine). Run on an Amiga with the default setup, it would blow past the end of the stack and overwrite whatever was in memory there. Sometimes you got away with this; sometimes you overwrote something important and crashed.
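As a rough illustration of the mechanism (not the original Amiga code, which I don't have), here is a small Python model of a flat, unprotected address space: a fixed-size stack sits directly above some other data, grows downward, and nothing checks its lower limit. The 4K and 16K stack sizes are the ones I remember; the 8K frame size, the function name `run_with_stack`, and the memory layout are all assumptions made up for the sketch.

```python
# Hypothetical model of a flat address space with no memory protection.
DEFAULT_STACK = 4 * 1024   # default stack size, as recalled in the post
RAISED_STACK = 16 * 1024   # what the developers had raised it to
FRAME_BYTES = 8 * 1024     # assumed size of the big local array (not from the post)


def run_with_stack(stack_size, frame_bytes, neighbour_bytes=8 * 1024):
    """Simulate one call to a function that needs frame_bytes of locals.

    Memory is laid out as [neighbour data][stack]. The stack pointer starts
    at the high end and moves down toward the neighbour data. Returns how
    many bytes of neighbour data the function overwrote.
    """
    memory = bytearray(neighbour_bytes + stack_size)
    stack_top = len(memory)        # stack pointer starts here
    stack_limit = neighbour_bytes  # lowest address that belongs to the stack

    # "Allocate" the frame by moving the stack pointer down.
    # Nothing compares the result against stack_limit.
    sp = max(stack_top - frame_bytes, 0)

    # The function fills its local array, wherever that happens to be.
    memory[sp:stack_top] = b"\xff" * (stack_top - sp)

    # Count the damage done below the stack's lower limit.
    return memory[:stack_limit].count(0xFF)


print(run_with_stack(DEFAULT_STACK, FRAME_BYTES))  # 4096 bytes clobbered
print(run_with_stack(RAISED_STACK, FRAME_BYTES))   # 0 bytes clobbered
```

In this model the same function is harmless with the raised stack and tramples 4K of someone else's memory with the default one. Whether that causes a crash depends entirely on what the neighbouring memory was being used for, which the model doesn't attempt to capture.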

The demo machine crashed all the time because it had the default small stack size, what we were doing in the demo wound up calling the function with too many local variables, and what else was (or wasn't) on the demo machine meant that the stack overwrite reliably hit something important.

(In order to crash, you had to be doing something that called the function, on a machine with a small stack, in a situation where you overwrote something important. No wonder it was hard to reproduce, especially in an environment where most people had lots of memory and reflexively raised the stack size.)

One of the lessons I've taken from this experience is the obvious one: always test your programs on a stock configuration machine. These days I expect that it's standard testing practice, but we were younger and stupider back then (and machines were more precious and expensive).

programming/AmigaStackSizeBug written at 22:53:18

Who is the audience for a trouble ticket update?

One of the things that commentators brought up in response to my entry on why we don't use a trouble ticketing system is that trouble tickets have multiple uses; for example, they can be used later to look up what you did to solve a problem, and the user can use them to see how their issue is progressing. I expect that this is a common thing to say as a virtue of a ticketing system. However, I don't think that this is as easy as it sounds.

First, let's ask an awkward question: when you write trouble ticket updates, who are you writing the update for? Because these praiseworthy goals are generally in conflict; unless you have very unusual users, you cannot write an update that is simultaneously keeping your fellow sysadmins in the loop, documenting the solution to an ongoing problem, and giving the user useful information. Each of these goals calls for a different sort of writing with different contents.

(If you can resolve the problem in a single update you can at least collapse the first two cases together, but not all problems are amenable to this. And once you have a multi-step diagnosis and fix, well, as I've written earlier, lab notebooks are not changelogs, and so a series of progress reports is not the same as real documentation of a solution. You can reconstruct the latter from the former, but it is a reconstruction and it takes work; what you really want is to do the reconstruction once and then write it all down neatly.)

In particular, I think that you need to decide right up front whether trouble ticket updates are for you and your fellow sysadmins or for users. If they are for sysadmins, they will contain deep-dive technical details that may well be opaque to someone who doesn't know your system environment. If the updates are for anyone else, you need to write them so that they can be understood by outsiders (with as much or as little technical detail as you think your users can stand); this is likely to leave out important details that fellow sysadmins need.

(A closely related issue is something that I wrote about back in PrivateTicketing, which is that there are times when you need to have discussions that users should definitely not see. These discussions obviously can't go in a public ticketing system.)

You can resolve all of these issues with a sufficiently complex trouble ticketing system, one where you actually have several different audiences for ticket updates (commentators on PrivateTicketing pointed out ticketing systems that support this in various ways). My personal feeling is that trying to wedge all of these different jobs into a single system is going to create something that's rather ungainly, but seeing as we've never tried to use a trouble ticketing system I have to admit that I have no hard evidence for this.

Sidebar: an example of writing for users versus sysadmins

Consider the user-focused writeup of the incident I described here. As you can imagine, my internal writeup included a great many more details than this (eg, I didn't name the failed backend in the user writeup) and omitted some things (as implicitly known by my fellow sysadmins). It also used more technical terminology, because using technical terminology is generally faster and more precise than more general writing.

sysadmin/TicketingAudience written at 01:25:38


