2005-09-15
Efficiently distributing huge files to lots of workstations
CQUEST's workstations have a piece of software that needs a read-only 5+ gigabyte set of data files. Naturally we put the dataset on the local workstation disks, which works great. (These days it's hard to buy machines with less than 40 gigabyte disks, which leaves about 30 gigabytes free even with a generous Linux installation. We don't put user data on workstations for all the obvious reasons.)
However, every so often the dataset changes and we need to propagate all 5+ gigabytes of the new version to 80+ workstations, spread out across three or four subnets (and different physical labs). At first we simply copied the files from the NFS server to each workstation, but we only have a 100 megabit network; at the best of times this took over half a day, during which the NFS server was unusable because its network interface was completely saturated.
The ideal solution is file distribution via reliable multicast, except for two problems:
- we would have to get multicast traffic across the central backbone to each lab. (Or have a more complicated scheme where we primed one node in each lab and did per-lab multicast.)
- there is no ready-to-use Linux software for reliable multicast (that I'm aware of).
(The second issue was the real killer.)
As it happens, there is an existing piece of software for efficient and reliable large scale distribution of files over ordinary TCP networks: BitTorrent (you may have heard of it). BitTorrent distributes files by assembling an ad-hoc mesh (a 'swarm') of peers that swap pieces of the files around, from peers that have them to peers that don't. Among other things, this results in quite effective usage of the available bandwidth.
The official version of BitTorrent is written in Python and runs on Unix, so it already did almost everything CQUEST needed. All I had to do was adapt the simplest client program into a completely hands-off automated version, and the only tricky bit about this was deciding when the client should quit.
You don't want the client to just quit once it has downloaded all of the files, because other peers could still use its outgoing bandwidth to speed up getting their copies. Nor do you want the client to quit after sending out a certain amount of data; early finishers in the swarm would quit too early, and unpopular peers might never send out that much data.
Solution: quit when there are no more peers needing more data. Each machine in the swarm keeps a count of how many 'seeds' (machines with complete copies) and 'peers' (machines still needing data) it sees; when the number of peers drops to zero, clearly it isn't needed any more and it can quit. And this clearly converges: when everyone is done transferring the file, everyone will see zero peers. If a machine hangs, eventually the machines connected to it will eject it for being non-responsive, and their own peer counts will then drop to zero.
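As a minimal sketch of that quit rule (not the actual CQUEST client code): the `SwarmStatus` class and `should_quit` function below are hypothetical names invented for illustration, and are not part of BitTorrent's own API. The sketch assumes the client can periodically get its own completion state plus the seed and peer counts it currently sees.

```python
from dataclasses import dataclass


@dataclass
class SwarmStatus:
    """Hypothetical snapshot of what one client currently sees."""
    download_complete: bool  # this client has every piece of the dataset
    seeds: int               # connected machines with complete copies
    peers: int               # connected machines still needing data


def should_quit(status: SwarmStatus) -> bool:
    """Quit only once we are done and nobody we see still needs data."""
    if not status.download_complete:
        # Still downloading; zero visible peers just means that
        # everyone we are connected to is a seed.
        return False
    return status.peers == 0


# An early finisher keeps seeding while others still need data...
print(should_quit(SwarmStatus(download_complete=True, seeds=10, peers=3)))
# ...and quits once every machine it sees has a complete copy.
print(should_quit(SwarmStatus(download_complete=True, seeds=79, peers=0)))
```

The rule deliberately ignores how much data the client has uploaded, which is what avoids both of the problems with quitting on download completion or after a fixed amount of upload.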
Because BitTorrent creates an ad-hoc mesh each time, performance varies from run to run. Our best timed run to date saw BitTorrent distribute 5875.3 megabytes from our fileserver to 80 workstations in 43 minutes 52 seconds wall clock time. The fileserver moved 16243.9 MB itself; the 80 workstations got the remaining 443 gigabytes from each other.
Or, to put it another way: we averaged 178.5 megabytes/second of aggregate bandwidth, out of four physical 100 megabit switched networks. Peak bandwidth will have been significantly higher.
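The arithmetic behind those figures can be rechecked with a few lines of Python. This is only a sketch: it assumes 1 gigabyte = 1024 megabytes (which is what makes the 443 gigabyte figure come out), and the comparison with a plain copy from the server assumes an ideal 12.5 megabytes/second on a 100 megabit link, which no real copy would reach.

```python
# Figures from the best timed run.
FILE_MB = 5875.3           # size of the dataset
WORKSTATIONS = 80          # machines receiving a full copy
SERVER_SENT_MB = 16243.9   # data the fileserver itself sent
ELAPSED_S = 43 * 60 + 52   # 43 minutes 52 seconds

total_mb = FILE_MB * WORKSTATIONS       # everything delivered
peer_mb = total_mb - SERVER_SENT_MB     # delivered workstation to workstation
peer_gb = peer_mb / 1024                # assumes 1 GB = 1024 MB
aggregate_mb_s = total_mb / ELAPSED_S   # average aggregate bandwidth
server_copies = SERVER_SENT_MB / FILE_MB

# Assumption: a single 100 megabit link at an ideal 12.5 MB/s.
naive_hours = total_mb / 12.5 / 3600

print(f"workstation-to-workstation: {peer_gb:.0f} GB")
print(f"aggregate bandwidth: {aggregate_mb_s:.1f} MB/s")
print(f"fileserver sent {server_copies:.2f} copies' worth of data")
print(f"ideal server-only copy: {naive_hours:.1f} hours")
```

In other words, the fileserver sent fewer than three full copies of the dataset, and even a perfect copy from the server alone would have needed more than ten hours.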
Results: BitTorrent works smokingly well to distribute our dataset updates. The average propagation time is 50 minutes or so, and all of the machines remain entirely usable for the duration. CQUEST can now casually push updates out, instead of having to extensively plan and sequence things.
(We'd be happy to share copies of the software et al with any interested parties. Get in touch; it's all free software.)