2005-09-23
Be cautious with numbers in awk
I like awk, and often use it for quick little log aggregation things
(often on the command line, if what I am interested in is a one-off).
But awk has a small problem: it likes printing large numbers in
exponential notation.
The minor problem with this is that I find exponential notation for numbers harder to read than straight decimal notation. '3.18254e+10' is just harder to understand casually than 31825440599.
The major problem with this is that when I do log aggregation, I often
feed the result to 'sort -nr' or the like, so I can see the result
in a clearly sorted order (and perhaps pick out the top N). Numbers in
exponential notation are not sorted 'correctly' by sort, as sort
requires things to be in decimal notation.
Worse, when you are looking for the top N of something, this issue can cause the precise entries you're most interested in to drop out. The highest entries are the ones most likely to have numbers large enough that awk starts putting them in exponential notation, which will make them sort very low indeed.
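To make this concrete, here is a sketch of the effect; the exact output depends on which awk you have (and bytes.txt is just a hypothetical file of byte counts), so treat it as illustrative:

    # Summing a column of byte counts; some awks print the large total
    # in exponential notation, and 'sort -nr' then ranks it far below
    # much smaller plain-decimal numbers.
    $ awk '{sum += $1} END {print sum}' bytes.txt
    3.18254e+10
    $ printf '3.18254e+10\n999\n' | sort -nr
    999
    3.18254e+10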
This isn't just a theoretical concern. When writing yesterday's entry, this exact issue almost caused me to miss four of the actual top six URLs by data transferred. (Fortunately I wound up noticing the missing entries when I was looking at detailed log output, and then worked out why it was happening.)
The workaround is relatively simple: awk's '%d' printf format will
print even large numbers in decimal notation. So instead of 'END
{print sum}' or the like, use 'END {printf "%d\n", sum}'.
(Unfortunately I find awk's printf annoying for some reason, so I
don't normally use it unless I have to. I guess I have to a lot more
often now.)
This isn't the end of the story, because this points to another
caution for dealing with numbers in awk, namely: awk uses floating
point math, not integer math, even for numbers that are entirely
decimal. This is most likely to bite you if you are subtracting large
numbers from each other; for example, computing differences between
Unix timestamps. (This actually bit me once, in an assignment, and I
wound up being sufficiently annoyed to use a baroque workaround
involving breaking out of awk to get bc to do that particular
subtraction just so I could submit something that had the numbers
absolutely correct.)
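As a sketch of the general hazard, here are some made-up numbers chosen to sit just past 2^53, the limit of exact integer representation in a C double (not the numbers from my assignment):

    # awk does its arithmetic in floating point, so integers beyond 2^53
    # lose their low-order digits and small differences can vanish:
    $ awk 'BEGIN { print 9007199254740993 - 9007199254740992 }'
    0
    # bc works in arbitrary precision and gets the exact answer:
    $ echo '9007199254740993 - 9007199254740992' | bc
    1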
The (probable) importance of accurate Content-Types
As a result of the MSN search spider going crazy, I am actually paying some attention to our web server logs for a change. This led to me looking up which URLs are responsible for the largest amounts of bandwidth used.
To my surprise, the six largest bandwidth sources were some CD-ROM images in ISO format that we happen to have lying around, the oldest one dating back to 2002. In the last week alone, there were eight requests totaling 3.5 gigabytes of transfers. Who could be that interested in some relatively ratty old ISO images?
Search engines, it turned out. All of the off-campus requests for the ISO images over the past 28 days came from MSNbot, Googlebot, and Yahoo! Slurp. I already knew about the crazy MSN spider, but Googlebot is well behaved; what possible reason could it have for fetching the same 600 megabyte image (last changed May 27th) three times between September 17th and September 21st?
I had previously noticed (while researching CrazyMSNCrawler) that our
web server was serving these ISO images with the Content-Type of
text/plain. (I didn't think much about it at the time, except to
become less annoyed at MSNbot repeatedly looking at them.)
Suddenly the penny dropped: Googlebot probably thought the URL was a huge text file, not an ISO image. Worse, the web server was claiming that the 'text file' was in UTF-8, despite it certainly having non-UTF-8 byte sequences.
If my theory is right, no wonder search engines repeatedly fetched the URLs. Each time they were hoping that this time around the text file would have valid UTF-8 that they could use. (Certainly I'd like search engines to re-check web pages that have invalidly encoded content, in the hopes that it gets fixed sooner or later.)
Our web server is now serving files that end in .iso as Content-Type
application/octet-stream. Time will tell if my theory is right and
the search engines lay off the ISO images. (Even if it does no good
with search engines, unwary people who click on the links now won't
have their browser trying to show them the 600 megabyte 'text' of the
ISO file, which is a good thing.)
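One way to do this with Apache is a single mod_mime directive; exactly where it goes (the main configuration, a <Directory> block, or an .htaccess file) depends on your setup, so take this as a sketch:

    # Serve .iso files as opaque binary data instead of text/plain:
    AddType application/octet-stream .iso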
The obvious moral: every so often, take a look at your web server logs. You never know what interesting things you'll find.
(Maybe you'll discover that you're hosting a bulletin board system that averages a couple of hits a second that you hadn't previously noticed. Don't laugh; it happened to us. (It was a legitimate bulletin board system; we just hadn't realized it was quite that active.))
Sidebar: the bonus round of CPU usage
In an effort to speed up transfers to clients by reducing the
amount of data transferred to them, I recently configured the web
server to compress outgoing pages on the fly for various
Content-Types if the client advertised it was cool with this.
(Using the Apache mod_deflate module.)
Of course, one of those Content-Types was text/plain.
So not only were we doing huge pointless transfers, we were probably burning extra CPU to compress them on the fly. And most of the content of an ISO image is likely already compressed, so further compression passes are at best pointless and at worst result in the content expanding.
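For reference, here is a minimal sketch of the sort of mod_deflate configuration involved; the list of Content-Types is illustrative, not necessarily what our server uses. Leaving text/plain off the list, or fixing the Content-Type of the ISO images as above, keeps them out of the compression filter:

    # Compress only text-ish responses on the fly; anything not listed,
    # such as application/octet-stream, is passed through uncompressed.
    AddOutputFilterByType DEFLATE text/html text/css application/xhtml+xml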