Wandering Thoughts archives

2005-08-17

Parallelizing DNS queries with split

So there I was the other day, with 35,000 IP addresses to look up in the SBL to see if they were there. Looking up 35,000 IP addresses one after the other takes a long time. Too long a time.

The obvious approach was to write an SBL lookup program that internally worked in parallel, perhaps using threads. I was using Python and it has decent thread support, but when I started going down this route it rapidly started looking like too much work.

So instead I decided to use brute force and Unix. I had all of the IP addresses I wanted to look up in a big file, one IP address per line, so:

$ mkdir /tmp/sbl
$ split -l 800 /tmp/ipaddrs /tmp/sbl/in-sbl.
$ for i in /tmp/sbl/in-sbl.*; do \
  o=`echo $i | sed 's/in-/out-/'`; \
  sbllookup <$i >$o & \
  done; wait
$ cat /tmp/sbl/out-sbl.* >/tmp/sbl-out

This takes /tmp/ipaddrs, the file of all of the IP addresses, and splits it up into a whole bunch of smaller chunks of 800 lines each. Once I had it in chunks, I could parallelize my DNS lookups by starting the (serial) SBL lookup program on each separate chunk in the background, letting 44-odd of them run at once. Each wrote its output to a separate file, and once the wait had waited for them all to finish I could glue /tmp/sbl/out-sbl.* back into a single output file.
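The same pattern can be wrapped up as a reusable shell function. This is only a minimal sketch: parallel_chunks is a hypothetical name, not something from the session above, and it assumes that mktemp is available and that the command is a filter reading stdin and writing stdout, the way sbllookup is used here.

```shell
#!/bin/sh
# Sketch: run a serial, line-oriented filter over a big input file in
# parallel by splitting the input into chunks.
# parallel_chunks is a hypothetical helper name.
#
# usage: parallel_chunks INPUT OUTPUT LINES-PER-CHUNK COMMAND [ARGS...]
parallel_chunks() {
    input=$1 output=$2 lines=$3
    shift 3
    workdir=$(mktemp -d) || return 1

    # Chunks are named in.aa, in.ab, ... so glob order matches input order.
    split -l "$lines" "$input" "$workdir/in."

    # Start one background copy of the command per chunk.
    for chunk in "$workdir"/in.*; do
        [ -e "$chunk" ] || continue    # empty input: split made no chunks
        "$@" <"$chunk" >"$workdir/out.${chunk##*/in.}" &
    done
    wait    # block until every background job has finished

    # Glue the per-chunk output back together in the original order.
    cat "$workdir"/out.* >"$output" 2>/dev/null
    rm -rf "$workdir"
}
```

With this, the session above would become 'parallel_chunks /tmp/ipaddrs /tmp/sbl-out 800 sbllookup'. Like the original, it starts one process per chunk with no cap on how many run at once, so the chunk size is the only throttle.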

Parallelized, it took about five or ten minutes the first time around, and then only a minute or so for the second pass. (I did a second pass because the replies from some DNS queries might have been late trickling in the first time; the second time around they were all in our local DNS cache.)

sysadmin/ParallelDNSQueriesWithSplit written at 23:53:48

Remember to think about the scale of things

One of the famous computer programming quotes is 'premature optimization is the root of all evil' (C.A.R. Hoare quoted by Donald Knuth; attribution dammit (tm)).

A related issue is 'think about the scale of what you're planning'. A recent LiveJournal story provides a lovely example of this. To quote from it:

An increasing number of companies (large and small) are really insistent that we ping them with all blog updates, for reasons I won't rant about.

LiveJournal gets 3 or more public posts a second. That's a third of a second per post that has to include all DNS lookups, connection setup, sending the HTTP or SOAP or XML or whatever the ping format is, and connection teardown. (Apparently none of the companies gave LiveJournal a streaming interface, where LJ could open a connection once and then feed in results.)
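To put rough numbers on that, here is the arithmetic as a small shell sketch. The only input is the 3 posts a second figure; since that is 'or more', the results are lower bounds, and the per-ping budget assumes pings are sent strictly one at a time.

```shell
posts_per_second=3      # the LiveJournal figure quoted above
seconds_per_day=$((24 * 60 * 60))

# Pings each subscribing company would receive per day.
pings_per_day=$((posts_per_second * seconds_per_day))

# Time budget per ping in milliseconds if they are sent one at a time
# (integer division, so this rounds down).
budget_ms=$((1000 / posts_per_second))

echo "pings per day: $pings_per_day"
echo "budget per ping: ${budget_ms}ms"
```

That works out to 259,200 pings a day for each company, with about 333 milliseconds for everything involved in each one.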

The LiveJournal people gave a couple of these companies what they asked for. None of them could keep up.

The companies probably don't have bad or buggy software. I'm sure it works fine for the current set of blogs that pings them, and even has room for future growth. They just didn't think about the scale of what they were asking for from LiveJournal, and it probably didn't even occur to them to think about it.

Of course that's part of the problem of scale: it rarely occurs to people to think about it. In particular, people almost never think about radical scale changes, whether up or down. This can lead to perfectly good solutions that don't fit the scale of the problem, or (as in this case) perfectly good solutions that don't quite handle the scale of a new problem.

When I start thinking about a system, I've found it useful to think about the scale of things as well as the problem itself. Sometimes this means I have to do more work; not infrequently it means I can do less, thereby avoiding premature optimization and evil, which brings me back to the quote up at the top.

Sidenote: the full optimization quote

A Google search wound up here, which cites the full quote as:

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

An MSDN web page gives the source as:

Quoted in Donald E. Knuth, Literate Programming (Stanford, California: Center for the Study of Language and Information, 1992), 276.

tech/ThinkAboutScale written at 00:56:50

Annoying RSS Feed Tricks

The RSS feed tricks that are really annoying me right now are all the different ways people have invented to serve partial entry content. Almost all of them are bad, plus the basic idea is bad too.

Serving partial entries implies that the blog authors don't expect their readers to be interested in most of their words (otherwise, why make them go through extra effort to read them?). The only good reasons for this that I can think of offhand are very long entries or entries on a huge variety of topics. (Given the blogs I read, I can discount vulgar commercial motives.)

(My feed reader makes it very easy to skip the rest of an entry if I decide it's not interesting. If yours doesn't, find better software.)

The best excuse for this and the best version of it I've seen is the BBC news site. They at least have the excuse that they cover everything from soccer scores to earthquakes in Japan. They also go to the actual effort of publishing a single-sentence summary of each news story (plus the headline).

Everyone else both has far less excuse and devotes far less effort to it. The result is, unsurprisingly, far less usable and far more annoying. Bad ways include:

  • serving an article abstract for a feed that's only about one thing. If I am interested enough to subscribe to the feed, I am interested enough to read more than your abstracts.
  • truncating the entry after the first sentence or paragraph, which may not serve all that well as a summary and/or teaser.
  • just truncating the entry after a certain number of words. You get bonus points for not explicitly noting the truncation, or marking it with something that can be at the end of your short posts too, like '...'.

The third method produces the worst results and is naturally the most common technique (perhaps because the other two take effort, instead of trivial code). I suppose I should be thankful that I've yet to see anyone truncating entries after so many characters, gleefully slicing words in half with their sharp ginsu code.
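For a feed that insists on word-count truncation anyway, the two specific complaints (silent cuts, and markers that short posts can end with too) are cheap to avoid. A minimal sketch, assuming plain-text entries on standard input; truncate_words is a hypothetical name, and the code collapses whitespace and knows nothing about HTML markup:

```shell
# truncate_words N: read one plain-text entry on stdin and print at most
# N whole words. Words are never sliced, and a cut is marked with text
# that an untruncated post would not end with.
truncate_words() {
    awk -v max="$1" '
        { for (i = 1; i <= NF; i++) words[++n] = $i }
        END {
            limit = (n < max) ? n : max
            out = ""
            for (i = 1; i <= limit; i++)
                out = out (i > 1 ? " " : "") words[i]
            if (n > max)
                out = out " [truncated; " (n - max) " more words]"
            print out
        }'
}
```

A post shorter than the limit passes through without any marker, so a short post that really does end in '...' can't be mistaken for a truncated one.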

If your blog truncates entries in your syndication feed, for the love of the gods please take a look at how the feed looks in a feed reader. Then ask yourself if the result is either appealing or useful.

(I do not object to cutting off what are essentially footnotes from RSS entries; I sometimes do it too.)

Updated, August 25th: Another stupid entry truncation trick is just to have the title/link, with no entry text at all; bonus points are awarded for unhelpful titles. I had mercifully forgotten about this one until the feed in question had a new posting to one of the Planet feed aggregators that I read.

tech/AnnoyingRSSFeedTricks written at 00:22:50



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.