2005-09-23
Be cautious with numbers in awk
I like awk, and often use it for quick little log aggregation things
(often on the command line, if what I am interested in is a one-off).
But awk has a small problem: it likes printing large numbers in
exponential notation.
The minor problem with this is that I find exponential notation for numbers harder to read than straight decimal notation. '3.18254e+10' is just harder to understand casually than 31825440599.
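To see the default formatting in isolation, here is a minimal sketch. It assumes a POSIX-style awk, where 'print' formats non-integral numbers using OFMT (default "%.6g"); whether a whole-number total gets the same treatment varies between awk implementations. The function name show_default_format is made up for illustration.

```shell
# Hypothetical helper: print a large non-integral number with plain 'print'.
# With the default OFMT of "%.6g", only six significant digits survive.
show_default_format() {
    awk 'BEGIN { x = 31825440599.5; print x }'
}

show_default_format    # prints 3.18254e+10
```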
The major problem with this is that when I do log aggregation, I often
feed the result to 'sort -nr' or the like, so I can see the result
in a clearly sorted order (and perhaps pick out the top N). Numbers in
exponential notation are not sorted 'correctly' by sort, as 'sort -n'
only reads the leading decimal digits and stops at the 'e'.
Worse, when you are looking for the top N of something, this issue can cause the precise entries you're most interested in to drop out. The highest entries are the ones most likely to have numbers large enough that awk starts putting them in exponential notation, which will make them sort very low indeed.
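A small sketch of the effect, using made-up "bytes URL" lines in which the largest total has already come out in exponential notation. The function name sorted_sample and the data are hypothetical; LC_ALL=C is set only to keep the numeric parsing predictable.

```shell
# Hypothetical sample data: the biggest total is in exponential notation,
# the way a plain awk 'print' can produce it.
sorted_sample() {
    printf '%s\n' '3.18254e+10 /big' '912345 /medium' '4096 /small' |
        LC_ALL=C sort -nr
}

# Ask for the "top two": sort -n reads 3.18254e+10 as 3.18254,
# so /big lands at the bottom and is cut off by head.
sorted_sample | head -n 2
```

Here 'sort -nr' puts /medium first and /big last, so the entry that actually moved the most data never makes the top-N cut.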
This isn't just a theoretical concern. When writing yesterday's entry, this exact issue almost caused me to miss four of the actual top six URLs by data transferred. (Fortunately I wound up noticing the missing entries when I was looking at detailed log output, and then worked out why it was happening.)
The workaround is relatively simple: awk's '%d' printf format will
print even large numbers in decimal notation. So instead of 'END
{print sum}' or the like, use 'END {printf "%d\n", sum}'.
(Unfortunately I find awk's printf annoying for some reason, so I
don't normally use it unless I have to. I guess I have to a lot more
often now.)
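Put together, the workaround looks something like the following sketch. The helper name sum_by_url and the "URL bytes" input layout are assumptions for illustration, not the exact commands I ran. It also assumes an awk whose '%d' copes with values beyond 32 bits; gawk does, but I would not bet on every old awk.

```shell
# Hypothetical helper: sum the second field (bytes) per first field (URL)
# and print each total with %d so that sort -n sees plain decimal digits.
sum_by_url() {
    awk '{ sum[$1] += $2 }
         END { for (u in sum) printf "%d %s\n", sum[u], u }' |
        LC_ALL=C sort -nr
}

# Made-up sample input in "URL bytes" form.
printf '%s\n' '/big 20000000000' '/small 4096' \
              '/big 11825440599' '/medium 912345' |
    sum_by_url
```

With the totals in decimal notation, /big sorts to the top where it belongs instead of dropping below /small.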
This isn't the end of the story, because this points to another
caution for dealing with numbers in awk, namely: awk uses floating
point math, not integer math, even for numbers that are entirely
whole numbers. This is most likely to bite you if you are subtracting large
numbers from each other; for example, computing differences between
Unix timestamps. (This actually bit me once, in an assignment, and I
wound up being sufficiently annoyed to use a baroque workaround
involving breaking out of awk to get bc to do that particular
subtraction just so I could submit something that had the numbers
absolutely correct.)
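The floating point nature of awk numbers can be seen directly with a sketch like this one. It assumes an awk that stores numbers as IEEE double-precision floats (gawk and mawk do by default), where whole numbers are exact only up to 2^53; the function name float_limit_check is made up. It only shows that a limit exists, not where any particular awk's limit lies.

```shell
# Assumes awk numbers are IEEE doubles, so whole numbers are exact
# only up to 2^53 = 9007199254740992.
float_limit_check() {
    awk 'BEGIN {
        x = 9007199254740992      # 2^53
        if (x + 1 == x) print "x + 1 == x"
        else            print "x + 1 != x"
    }'
}

float_limit_check
```

Under that assumption the addition silently rounds back to x, which is the kind of thing true integer math would never do.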