Parallelizing DNS queries with split

August 17, 2005

So there I was the other day with 35,000 IP addresses to look up in the SBL (the Spamhaus Block List) to see if they were listed. Looking up 35,000 IP addresses one after the other takes a long time. Too long a time.

The obvious approach was to write an SBL lookup program that did its queries in parallel internally, perhaps using threads. I was using Python, which has decent thread support, but when I started down this route it rapidly began to look like too much work.
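
(For illustration, here is roughly what that in-process approach could look like in today's Python, using concurrent.futures, which didn't exist in 2005. The reversed-octet DNSBL query in sbl_listed() is a reconstruction of the general technique, not the code of my actual sbllookup program.)

# Sketch of the threaded in-Python approach: reads IP addresses on
# standard input, one per line, and prints the SBL-listed ones.
# (A reconstruction of the general idea, not my actual program.)
import socket
import sys
from concurrent.futures import ThreadPoolExecutor

def sbl_listed(ip):
    # Standard DNSBL query: reverse the octets and look the name up
    # under sbl.spamhaus.org. A listed IP resolves; an unlisted one
    # gets NXDOMAIN, which surfaces as a socket error.
    query = '.'.join(reversed(ip.split('.'))) + '.sbl.spamhaus.org'
    try:
        socket.gethostbyname(query)
        return True
    except socket.error:
        return False

ips = [line.strip() for line in sys.stdin if line.strip()]
# DNS queries are I/O-bound, so threads overlap well despite the GIL.
# 44 workers here to match the 44-odd parallel processes used below.
with ThreadPoolExecutor(max_workers=44) as pool:
    for ip, listed in zip(ips, pool.map(sbl_listed, ips)):
        if listed:
            print(ip)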

So instead I decided to use brute force and Unix. I had all of the IP addresses I wanted to look up in a big file, one IP address per line, so:

$ mkdir /tmp/sbl
$ split -l 800 /tmp/ipaddrs /tmp/sbl/in-sbl.
$ for i in /tmp/sbl/in-sbl.*; do \
  o=`echo $i | sed 's/in-/out-/'`; \
  sbllookup <$i >$o & \
  done; wait
$ cat /tmp/sbl/out-sbl.* >/tmp/sbl-out

What this does is take /tmp/ipaddrs, the file of all the IP addresses, and split it into a whole bunch of 800-line chunks. Once I had it in chunks, I could parallelize my DNS lookups by starting the (serial) SBL lookup program on each separate chunk in the background, letting 44-odd of them run at once. Each one wrote its output to a separate file, and once the wait had waited for all of them to finish, I could glue /tmp/sbl/out-sbl.* back together into a single output file.
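
(For the curious, the serial lookup program itself can be a quite small filter. The following is a sketch of the general idea, using the same reversed-octet DNSBL query against sbl.spamhaus.org as above; it is an illustration, not my actual sbllookup.)

# Sketch of a serial sbllookup-style filter: reads IPs on stdin and
# writes the SBL-listed ones to stdout, one query at a time.
import socket
import sys

for line in sys.stdin:
    ip = line.strip()
    if not ip:
        continue
    query = '.'.join(reversed(ip.split('.'))) + '.sbl.spamhaus.org'
    try:
        socket.gethostbyname(query)   # resolves only if the IP is listed
        print(ip)
    except socket.error:
        pass                          # NXDOMAIN: not in the SBL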

Parallelized, it took about five or ten minutes the first time around, and then only a minute or so for the second pass. (I did a second pass because the replies to some DNS queries might have trickled in too late the first time around; the second time, they were all in our local DNS cache.)
