Wandering Thoughts archives

2005-09-23

Be cautious with numbers in awk

I like awk, and often use it for quick little log aggregation things (often on the command line, if what I am interested in is a one-off). But awk has a small problem: it likes printing large numbers in exponential notation.
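
To make that concrete, here's the kind of one-liner I mean. The field numbers are just an assumption about an Apache-style access log where the request URL is the seventh field and the response size is the tenth:

    awk '{ bytes[$7] += $10 } END { for (u in bytes) print bytes[u], u }' access_log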

The minor problem with this is that I find exponential notation for numbers harder to read than straight decimal notation. '3.18254e+10' is just harder to understand casually than 31825440599.

The major problem with this is that when I do log aggregation, I often feed the result to 'sort -nr' or the like, so I can see the results in a clearly sorted order (and perhaps pick out the top N). Numbers in exponential notation are not sorted 'correctly' by sort, since 'sort -n' only understands plain decimal notation.
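
You can see the problem directly; with GNU sort, '-n' only reads the leading '3.18254' of the exponential form, so a much smaller plain number wins:

    $ printf '3.18254e+10\n999999\n' | sort -nr
    999999
    3.18254e+10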

Worse, when you are looking for the top N of something, this issue can cause the precise entries you're most interested in to drop out. The highest entries are the ones most likely to have numbers large enough that awk starts putting them in exponential notation, which will make them sort very low indeed.

This isn't just a theoretical concern. When writing yesterday's entry, this exact issue almost caused me to miss four of the actual top six URLs by data transferred. (Fortunately I wound up noticing the missing entries when I was looking at detailed log output, and then worked out why it was happening.)

The workaround is relatively simple: awk's '%d' printf format will print even large numbers in decimal notation. So instead of 'END {print sum}' or the like, use 'END {printf "%d\n", sum}'. (Unfortunately I find awk's printf annoying for some reason, so I don't normally use it unless I have to. I guess I have to a lot more often now.)
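
With that change, the hypothetical one-liner from earlier becomes something like:

    awk '{ bytes[$7] += $10 } END { for (u in bytes) printf "%d %s\n", bytes[u], u }' access_log | sort -nr | head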

This isn't the end of the story, because it points to another caution for dealing with numbers in awk: awk does all of its arithmetic in floating point, not integer math, even for values that are plain integers. This is most likely to bite you if you are subtracting large numbers from each other, for example when computing differences between Unix timestamps. (This actually bit me once, in an assignment, and I wound up sufficiently annoyed to use a baroque workaround: breaking out of awk to have bc do that particular subtraction, just so I could submit something with the numbers absolutely correct.)
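
A minimal illustration of the limit, assuming an awk that stores numbers as IEEE 754 doubles (the usual case): integers above 2^53 can no longer all be represented exactly, so arithmetic on them can be silently off by a little. With such an awk you are likely to see something like:

    $ awk 'BEGIN { printf "%d\n", 9007199254740993 - 1 }'
    9007199254740991

(The correct answer is 9007199254740992; the starting value is 2^53 + 1 and gets rounded before the subtraction even happens.)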

programming/AnAwkCaution written at 17:55:55;

The (probable) importance of accurate Content-Types

As a result of the MSN search spider going crazy, I am actually paying some attention to our web server logs for a change. This led to me looking up which URLs are responsible for the largest amounts of bandwidth used.

To my surprise, the six largest bandwidth sources were some CD-ROM images in ISO format that we happen to have lying around, the oldest one dating back to 2002. In the last week alone, there were eight requests totaling 3.5 gigabytes of transfers. Who could be that interested in some relatively ratty old ISO images?

Search engines, it turned out. All of the off-campus requests for the ISO images over the past 28 days came from MSNbot, Googlebot, and Yahoo! Slurp. I already knew about the crazy MSN spider, but Googlebot is well behaved; what possible reason could it have for fetching the same 600 megabyte image (last changed May 27th) three times between September 17th and September 21st?

I had previously noticed (while researching CrazyMSNCrawler) that our web server was serving these ISO images with the Content-Type of text/plain. (I didn't think much about it at the time, except to become less annoyed at MSNbot repeatedly looking at them.)

Suddenly the penny dropped: Googlebot probably thought the URL was a huge text file, not an ISO image. Worse, the web server was claiming that the 'text file' was in UTF-8, despite it certainly containing non-UTF-8 byte sequences.

If my theory is right, no wonder search engines repeatedly fetched the URLs. Each time they were hoping that this time around the text file would have valid UTF-8 that they could use. (Certainly I'd like search engines to re-check web pages that have invalidly encoded content, in the hopes that it gets fixed sooner or later.)

Our web server is now serving files that end in .iso as Content-Type application/octet-stream. Time will tell if my theory is right and the search engines lay off on the ISO images. (Even if it does no good with search engines, unwary people who click on the links now won't have their browser trying to show them the 600 megabyte 'text' of the ISO file, which is a good thing.)
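
For what it's worth, a minimal sketch of the sort of Apache (mod_mime) change involved; treat the exact directive and its placement as an assumption about the setup rather than our actual configuration:

    # serve .iso files as opaque binary data instead of text/plain
    AddType application/octet-stream .iso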

The obvious moral: every so often, take a look at your web server logs. You never know what interesting things you'll find.

(Maybe you'll discover that you're hosting a bulletin board system that averages a couple of hits a second that you hadn't previously noticed. Don't laugh; it happened to us. (It was a legitimate bulletin board system; we just hadn't realized it was quite that active.))

Sidebar: the bonus round of CPU usage

In an effort to speed up transfers to clients by reducing the amount of data transferred to them, I recently configured the web server to compress outgoing pages on the fly for various Content-Types if the client advertised it was cool with this. (Using the Apache mod_deflate module.)

Of course, one of those Content-Types was text/plain.
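
A hedged sketch of the sort of Apache 2.x mod_deflate configuration I mean; the particular list of types is an assumption, not our exact setup:

    # compress these content types on the fly for clients that accept it
    AddOutputFilterByType DEFLATE text/html text/plain text/css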

So not only were we doing huge pointless transfers, we were probably burning extra CPU to compress them on the fly. And since most of the content of an ISO image is likely already compressed, a further compression pass is at best pointless and at worst makes the content bigger.

web/AccurateContentTypeImportance written at 02:27:46;

