An interesting debugging experience (another tale from long ago)

August 31, 2011

This is a programming war story.

Once, a long time ago, I worked on an Amiga program. The program had a bug. Well, it had many bugs, but it had one particularly frustrating one where the program had erratic crashes that we couldn't reproduce or track down. One day we were starting to give an internal demonstration of the program and bang, it crashed. We tried again, and it crashed again. I immediately said 'don't touch anything' and we sat down with this specific machine and build of the program to finally get to the bottom of the problem. (So much for the demo, but we'd gotten really frustrated with the bug and it was a small company run by a programmer, so the boss understood.)

At least back in those days, the Amiga had no virtual memory or other memory protection (CPUs with those capabilities were far too expensive for personal computers). It was also a preemptive multi-tasking system, with multiple processes (which were more like threads, since they all ran in the same address space with no protection from each other). Now, of course, each process needed its own stack, and without memory protection these stacks were fixed size and could not grow on demand. There was a system setting for how big a stack each new process should get.

You can see where this is going. All of the programmers had Amigas with relatively large amounts of memory and did complex things with them and we'd vaguely heard that some programs needed bigger stack sizes to be reliable, so we'd increased our stack sizes well above the default. Meanwhile, the default stack size had been designed so that an Amiga with the minimum amount of memory could still work and run enough processes and so was rather small.

(This was a long time ago and RAM sizes were much smaller back then. My memory is that the default stack size was 4K and we had generally turned it up to 16K.)

And, of course, one of the programmers had written a function with a relatively large number of local variables (my memory is that he'd put a decent sized array on the stack), enough that the program easily exceeded the default stack size when this function was called. Run on our machines with their enlarged stack sizes, this function worked fine (or at least usually worked fine). Run on an Amiga with the default setup, it would blow past the end of the stack and overwrite whatever was in memory there. Sometimes you got away with this; sometimes you overwrote something important and crashed.
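
(To make the arithmetic concrete, here is a minimal sketch in C. It is not the original code, which I no longer have. The 4K and 16K stack sizes are the ones I remember; the 8K array size, the 512 bytes of caller overhead, and the function names are all hypothetical numbers and names picked for illustration.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stack sizes as recalled in the post (from memory, not verified). */
#define DEFAULT_STACK_BYTES   (4u * 1024u)
#define RAISED_STACK_BYTES    (16u * 1024u)

/* Hypothetical: the post gives no size for the "decent sized array". */
#define WORK_BUFFER_BYTES     (8u * 1024u)

/* Hypothetical: stack already used by callers when the function runs. */
#define CALLER_OVERHEAD_BYTES 512u

/* Returns 1 if a frame of frame_bytes fits in a fixed stack of
 * stack_bytes of which used_bytes are already taken, else 0. */
int frame_fits(size_t stack_bytes, size_t used_bytes, size_t frame_bytes)
{
    if (used_bytes > stack_bytes)
        return 0;
    return frame_bytes <= stack_bytes - used_bytes;
}

/* Returns how many bytes past the end of the stack the frame would
 * reach, or 0 if it fits. With no memory protection, those bytes land
 * on whatever happens to live next to the stack. */
size_t overrun_bytes(size_t stack_bytes, size_t used_bytes, size_t frame_bytes)
{
    /* Guard the addition below against wrapping around. */
    assert(frame_bytes <= SIZE_MAX - used_bytes);

    size_t needed = used_bytes + frame_bytes;
    return needed > stack_bytes ? needed - stack_bytes : 0;
}
```

(With these made-up numbers, the frame fits comfortably in a 16K stack but runs 4608 bytes past the end of a 4K one. Whether that crashed anything depended entirely on what those 4608 bytes happened to be.)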

The demo machine crashed all the time because it had the default small stack size, what we were doing in the demo wound up calling the function with too many local variables, and what else was (or wasn't) on the demo machine meant that the stack overwrite reliably hit something important.

(In order to crash, you had to be doing something that called the function, on a machine with a small stack, in a situation where you overwrote something important. No wonder it was hard to reproduce, especially in an environment where most people had lots of memory and reflexively raised the stack size.)

One of the lessons I've taken from this experience is the obvious one: always test your programs on a stock configuration machine. These days I expect that it's standard testing practice, but we were younger and stupider back then (and machines were more precious and expensive).

Written on 31 August 2011.