2005-09-24
It's a multi-protocol world after all
I just fixed a wee bug in DWiki's Atom syndication feeds. The bug was that https:// URLs (such as references to Red Hat's Bugzilla) got mangled in Atom feeds, and only in Atom feeds, to be prefixed with the web site's URL.
DWiki normally generates shortened URLs that have full paths but omit the 'http://website/' bit (for various reasons). But when it generates Atom feed entries, DWiki needs to generate only absolute, fully qualified URLs (the Atom spec calls for this, among other reasons). This means that it needs to be able to recognize which URLs are already fully qualified (because they refer to external websites) and which ones aren't. To tell if a URL was already fully qualified, DWiki was just looking for 'http://' at the start of the URL. So DWiki thought https:// URLs weren't fully qualified and 'helpfully' qualified them in the Atom feed entries.
When I wrote that code, I had forgotten that it's a multi-protocol world (technically, a multi-scheme world). And in a multi-scheme world, checking for just one scheme is almost certainly a bug. In this case, I should have been checking to see if the URL had any scheme at all (which takes somewhat more code; DWiki now goes to this effort).
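To make the "any scheme at all" check concrete, here is a minimal sketch in Python. It is not DWiki's actual code; the names has_scheme and qualify and the example.org site root are hypothetical, and the pattern simply follows the generic URI scheme grammar (a letter followed by letters, digits, '+', '-' or '.', then a colon).

```python
import re

# A scheme is a letter followed by letters, digits, '+', '-' or '.',
# terminated by a colon (the generic URI syntax).
_SCHEME_RE = re.compile(r'^[A-Za-z][A-Za-z0-9+.-]*:')

def has_scheme(url):
    """Return True if url starts with any scheme, not just http."""
    return _SCHEME_RE.match(url) is not None

def qualify(url, site_root):
    """Hypothetical helper: prefix site_root only onto scheme-less URLs."""
    if has_scheme(url):
        return url
    return site_root.rstrip('/') + '/' + url.lstrip('/')

if __name__ == '__main__':
    root = 'http://example.org'   # assumed site root, for illustration only
    for u in ('/blog/SomeEntry', 'https://bugzilla.redhat.com/', 'mailto:someone@example.org'):
        print(qualify(u, root))
```

The point of the sketch is that the test is "is there a scheme?" instead of "is the scheme http?", so https, ftp, mailto and the rest are all left alone.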
As a result, I have a new mantra: if my code is looking for
http:// and I'm not about to connect to a web server, I probably
have a bug. (The magnitude of the bug may vary, but as a minimum all
my code should look for https:// too.)
(The wonder of having a blog and talking about my own code bugs is that I can display my stupid programming moments in public. Perhaps it'll goad me into writing higher quality code from the start.)
2005-09-23
Be cautious with numbers in awk
I like awk, and often use it for quick little log aggregation things
(often on the command line, if what I am interested in is a one-off).
But awk has a small problem: it likes printing large numbers in
exponential notation.
The minor problem with this is that I find exponential notation for numbers harder to read than straight decimal notation. '3.18254e+10' is just harder to understand casually than 31825440599.
The major problem with this is that when I do log aggregation, I often
feed the result to 'sort -nr' or the like, so I can see the result
in a clearly sorted order (and perhaps pick out the top N). Numbers in
exponential notation are not sorted 'correctly' by sort, as sort
requires things to be in decimal notation.
Worse, when you are looking for the top N of something this issue can cause the precise entries you're most interested in to drop out. The highest entries are the ones most likely to have numbers large enough that awk starts putting them in exponential notation, which will make them sort very low indeed.
This isn't just a theoretical concern. When writing yesterday's entry, this exact issue almost caused me to miss four of the actual top six URLs by data transferred. (Fortunately I wound up noticing the missing entries when I was looking at detailed log output, and then worked out why it was happening.)
The workaround is relatively simple: awk's '%d' printf format will
print even large numbers in decimal notation. So instead of 'END
{print sum}' or the like, use 'END {printf "%d\n", sum}'.
(Unfortunately I find awk's printf annoying for some reason, so I
don't normally use it unless I have to. I guess I have to a lot more
often now.)
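The effect can be imitated outside awk. The following Python sketch is an approximation under stated assumptions: it formats totals with '%.6g' (awk's default output format) and ranks them with a hypothetical sort_n_key function that, like 'sort -n', only reads the leading decimal prefix of each line. The URLs and the smaller byte counts are made up for illustration.

```python
import re

# Leading blanks, an optional minus sign, digits and an optional decimal
# point: roughly the prefix a plain numeric sort looks at.
_PREFIX = re.compile(r'\s*(-?\d*\.?\d*)')

def sort_n_key(line):
    """Approximate the numeric key 'sort -n' would see for a line."""
    text = _PREFIX.match(line).group(1)
    try:
        return float(text)
    except ValueError:
        return 0.0          # no usable number at the start of the line

# Hypothetical per-URL byte totals.
totals = [(31825440599, '/big.iso'), (912345, '/medium.tar'), (48213, '/small.html')]

as_g = ['%.6g %s' % t for t in totals]   # like 'print sum, url'
as_d = ['%d %s' % t for t in totals]     # like 'printf "%d %s\n", sum, url'

ranked_g = sorted(as_g, key=sort_n_key, reverse=True)
ranked_d = sorted(as_d, key=sort_n_key, reverse=True)

if __name__ == '__main__':
    print('\n'.join(ranked_g))
    print('\n'.join(ranked_d))
```

With the '%.6g' lines, the largest total is read as roughly 3.18 and sinks to the bottom of the ranking; with the '%d' lines it comes out on top where it belongs.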
This isn't the end of the story, because this points to another
caution for dealing with numbers in awk, namely: awk uses floating
point math, not integer math, even for numbers that are entirely
integers. This is most likely to bite you if you are subtracting large
numbers from each other; for example, computing differences between
Unix timestamps. (This actually bit me once, in an assignment, and I
wound up being sufficiently annoyed to use a baroque workaround
involving breaking out of awk to get bc to do that particular
subtraction just so I could submit something that had the numbers
absolutely correct.)
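How much this hurts depends on the floating point precision involved, which the story above doesn't record. As a hedged illustration in Python (the timestamps are hypothetical, and to_float32 is a helper invented here to imitate single precision): double precision represents every integer up to 2**53 exactly, so timestamp differences come out right, while single precision cannot even tell nearby timestamps apart.

```python
import struct

def to_float32(x):
    """Round x to the nearest single-precision float (hypothetical helper)."""
    return struct.unpack('f', struct.pack('f', float(x)))[0]

# Two hypothetical Unix timestamps, 37 seconds apart.
start, end = 1127520000, 1127520037

# Double precision: both values are exact, so the difference is exact.
double_diff = float(end) - float(start)

# Single precision: values this large are only representable in steps
# of 128, so the two timestamps collapse to the same float.
single_diff = to_float32(end) - to_float32(start)

# Even doubles run out of integer precision eventually.
big = float(2 ** 53)
doubles_lose_precision = (big + 1.0 == big)

if __name__ == '__main__':
    print(double_diff, single_diff, doubles_lose_precision)
```

So with double-precision awks, integer arithmetic on things like timestamps and byte counts stays exact until the values get past 2**53; anything using less precision runs into trouble far sooner.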
2005-09-14
Why I really dislike the Singleton design pattern
To quote Glyph Lefkowitz (from here):
Aah, the singleton. Global variables for the new millennium.
I'm not against global variables, but I am against misleading people about them. The Singleton pattern lies about them; it looks like you have a normal object, but you really have a global variable without knowing it.
(And if everyone knows you have a faux global variable, you are just tarting up a global variable in 'object-oriented' clothes. This is like respecting the letter of the law ('globals bad, objects good') without understanding its spirit.)
If your language doesn't have global variables, you have my sympathies and condolences. In such circumstances, Singletons and other things become necessary hacks.
Note that I don't consider 'Singleton' to be the same thing as a class that you only ever instantiate once in the course of your program. With Singleton, your program appears to instantiate multiple objects from the same source, but they are all the same thing.
(I've seen some people use 'Singleton' to describe what I would call a cached lookup. This is where you have descriptors of some sort and turn them into objects, but repeatedly calling the factory function with the same descriptor always gives you the same object back instead of making a new one every time.)
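To make the distinction concrete, here is a small Python sketch of both ideas. All of the names (Config, Host, get_host) are hypothetical and chosen only for illustration. Config is a Singleton in the sense above: callers appear to create new objects but always get the same one. get_host is a cached lookup: the same descriptor always returns the same object, but different descriptors give different objects.

```python
class Config:
    """Singleton: every Config() call hands back the one shared instance."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.settings = {}   # shared state, a global in disguise
        return cls._instance


class Host:
    """An ordinary class; nothing stops you from making many of these."""
    def __init__(self, name):
        self.name = name


_hosts = {}   # cache of Host objects, keyed by descriptor

def get_host(name):
    """Cached lookup: the same name always yields the same Host object."""
    if name not in _hosts:
        _hosts[name] = Host(name)
    return _hosts[name]


if __name__ == '__main__':
    a, b = Config(), Config()
    a.settings['debug'] = True
    print(b.settings)                                 # the change shows up via b too
    print(get_host('alpha') is get_host('alpha'))     # same descriptor, same object
    print(get_host('alpha') is get_host('beta'))      # different descriptors differ
```

The Config case is the one that misleads: the code reads as if a and b were independent objects, but changing one changes "the other".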
(Sort of continued from PointlessClasses.)
2005-09-05
Why print-based debugging is so popular
A recent ONLamp article on the Python debugger included the line:
The only time we reach for a debugger is when something goes wrong or breaks and our standard debugging techniques such as
This sparked an insight about why print-based debugging is so popular.
(It is; a surprising number of people swear by it, and get good results.
The quote got my attention because it made me go 'yeah, the author
understands how I work'.)
In debugging the most important thing is to look backwards in time, because you are trying to answer the question 'how did I get into this pinch?'
Print-based debugging builds up information about the history of your program as it arrives at the fatal point. All the information you print or log creates a trace record that you can then walk backwards, determining the relevant bit of your program's past state.
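As a small illustration of that trace record, here is a hedged Python sketch. The trace helper, the average_sizes function and the input format are all hypothetical; the point is only that each trace line records the state just before the step that might blow up, so the output can be read backwards from the failure.

```python
import sys

def trace(msg):
    """Write a trace line to stderr so it stays out of normal output."""
    print('TRACE:', msg, file=sys.stderr)

def average_sizes(lines):
    """Average the last field of each line, leaving a trail as it goes."""
    total = 0
    count = 0
    for lineno, line in enumerate(lines, 1):
        fields = line.split()
        # Record the state *before* the step that can fail.
        trace('line %d: fields=%r total=%d count=%d' % (lineno, fields, total, count))
        total += int(fields[-1])
        count += 1
    return total / count

if __name__ == '__main__':
    print(average_sizes(['/a.html 10', '/b.html 30']))
```

If a malformed line makes int() raise an exception, the last TRACE line shows exactly which line and which fields were being processed, and the earlier ones show how the totals got there.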
By contrast, debuggers have historically spent a lot of effort on moving forward in time (single-step, continue to next breakpoint, and so on) and relatively little effort on sophisticated ways of examining program state.
If you want to trace your program's evolving state in a debugger, usually what you wind up doing is inserting print commands in a language that's worse and more awkward than the one your program is written in. Is it any wonder so many people skip the middleman and just put the print statements directly in their program?
In hindsight I don't think it's any coincidence that
UPS, my favorite C debugger, lets me
dynamically insert printf()s into C programs and is pretty
good at showing program state at a glance.