Be cautious with numbers in awk

September 23, 2005

I like awk, and often use it for quick little log aggregation things (often on the command line, if what I am interested in is a one-off). But awk has a small problem: it likes printing large numbers in exponential notation.

The minor problem with this is that I find exponential notation for numbers harder to read than straight decimal notation. '3.18254e+10' is just harder to understand casually than 31825440599.
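As a small illustration of how much gets lost in that form, the sketch below asks awk for the '%.6g' format explicitly. That is the POSIX default for awk's OFMT output format; whether a given awk applies it to integer-valued numbers when you use a plain 'print' varies between implementations, so the explicit printf here is my assumption to make the effect reproducible.

```shell
# Format the example number with %.6g, the POSIX default value of OFMT.
shown=$(awk 'BEGIN { printf "%.6g\n", 31825440599 }')
printf '%s\n' "$shown"
```

Only six significant digits survive, so the exponential form is both harder to read and less precise than the original number.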

The major problem with this is that when I do log aggregation, I often feed the result to 'sort -nr' or the like, so I can see the result in a clearly sorted order (and perhaps pick out the top N). Numbers in exponential notation are not sorted 'correctly' by 'sort -n', which only understands plain decimal notation.

Worse, when you are looking for the top N of something, this issue can cause the precise entries you're most interested in to drop out. The highest entries are the ones most likely to have numbers large enough that awk starts putting them in exponential notation, which will make them sort very low indeed.
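Here is a minimal sketch of the drop-out, using made-up data (the URLs and byte counts are hypothetical, not taken from my logs). The locale is pinned to C so that '.' is treated as the decimal point.

```shell
# Hypothetical aggregated output: bytes transferred, then the URL.
totals='3.18254e+10 /big.iso
52341 /index.html
980 /favicon.ico'

# Ask for the top two entries by bytes transferred.
top2=$(printf '%s\n' "$totals" | LC_ALL=C sort -nr | head -n 2)
printf '%s\n' "$top2"
```

'sort -n' stops reading the first number at the 'e', so /big.iso is ranked as if it had transferred about 3 bytes and never makes the top two.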

This isn't just a theoretical concern. When writing yesterday's entry, this exact issue almost caused me to miss four of the actual top six URLs by data transferred. (Fortunately I wound up noticing the missing entries when I was looking at detailed log output, and then worked out why it was happening.)

The workaround is relatively simple: awk's '%d' printf format will print even large numbers in decimal notation. So instead of 'END {print sum}' or the like, use 'END {printf "%d\n", sum}'. (Unfortunately I find awk's printf annoying for some reason, so I don't normally use it unless I have to. I guess I have to a lot more often now.)
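Put together, a small aggregation using the workaround might look like the sketch below. The log lines, the 'bytes' array and the field layout are all hypothetical; a real log will need different field numbers.

```shell
# Hypothetical log lines: a URL, then the bytes transferred for one request.
log='/big.iso 20000000000
/index.html 52341
/big.iso 11825440599'

# Sum bytes per URL, then print each total with printf "%d" instead of print.
report=$(printf '%s\n' "$log" |
    awk '{ bytes[$1] += $2 }
         END { for (url in bytes) printf "%d %s\n", bytes[url], url }' |
    LC_ALL=C sort -nr)
printf '%s\n' "$report"
```

Because the totals come out as plain digits, 'sort -nr' puts /big.iso first, where it belongs.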

This isn't the end of the story, because it points to another caution for dealing with numbers in awk: awk uses floating point math, not integer math, even for numbers that are whole integers. This is most likely to bite you if you are subtracting large numbers from each other; for example, computing differences between Unix timestamps. (This actually bit me once, in an assignment, and I was annoyed enough to use a baroque workaround that broke out of awk to get bc to do that particular subtraction, just so I could submit something that had the numbers absolutely correct.)
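One way to see the floating point behaviour directly is to look for the point where adding 1 no longer changes a number. The sketch below assumes an awk that stores numbers as IEEE 754 double-precision values, which is the common case but not something I have checked for every awk; with doubles, integers stop being exactly representable past 2^53.

```shell
# Compare 2^53 with 2^53 + 1 using ordinary awk arithmetic.
result=$(awk 'BEGIN {
    big = 9007199254740992    # 2^53, the assumed limit of exact integers
    if (big + 1 == big)
        print "big + 1 compares equal to big"
    else
        print "big + 1 is distinct from big"
}')
printf '%s\n' "$result"
```

Under the double-precision assumption the two values compare equal, which is something true integer math would never do.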

Written on 23 September 2005.

Last modified: Fri Sep 23 17:55:55 2005