The increasingly surprising limits to the speed of our Amanda backups
When I started dealing with backups, the slowest part of the process was generally writing things out to tape, which is why Amanda was much happier when you gave it a 'holding disk' that it could stage all of the backups to before it had to write them out to tape. Once you had that in place, the speed limit was generally some mix of the network bandwidth to the Amanda server and how fast the machines being backed up could grind through their filesystems to create the backups. When networks moved to 1G, you (and we) usually wound up being limited by the speed of reading through the filesystems to be backed up.
(If you were backing up a lot of separate machines, you might initially be limited by the Amanda server's 1G of incoming bandwidth, but once most machines started finishing their backups you usually wound up with one or two remaining machines that had larger, slower filesystems. This slow tail wound up determining your total backup times. This was certainly our pattern, especially because only our fileservers have much disk space to back up. The same has typically been true of backing up multiple filesystems in parallel from the same machine; sooner or later we wind up stuck with a few big, slow filesystems, usually ones we're doing full dumps of.)
Then we moved our Amanda servers to 10G-T networking and, from my perspective, things started to get weird. When you have 1G networking, it is generally slower than even a single holding disk; unless something's broken, modern HDs will generally do at least 100 Mbytes/sec of streaming writes, which is about enough to keep up with a full speed 1G network (roughly 125 Mbytes/sec at most). However, this is only around 1G data rates, which means that a single HD is vastly outpaced by a 10G network. As long as we had a number of machines backing up at once, the Amanda holding disk was suddenly the limiting factor. However, for a lot of the run time of backups we're only backing up our fileservers, because they're where all the data is, and for that we're currently still limited by how fast the fileservers can do disk IO.
(The fileservers only have 1G network connections for reasons. However, usually it's disk IO that's the limiting factor, likely because scanning through filesystems is seek-limited. Also, I'm ignoring a special case where compression performance is our limit.)
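To put rough numbers on the 1G versus 10G shift, here is a back-of-the-envelope sketch in Python. The rates are illustrative assumptions, not measurements: links are taken at their theoretical ceiling, and the single holding disk is assumed to manage 150 Mbytes/sec of streaming writes. The function names are made up for this example. It only models the case where enough machines are backing up at once to fill the Amanda server's network link, so it ignores the fileservers' own disk IO limits.

```python
def link_mbytes_per_sec(gigabits):
    """Theoretical ceiling of a network link in Mbytes/sec, ignoring protocol overhead."""
    return gigabits * 1000 / 8


def bottleneck(stages):
    """Return (name, rate) for the slowest stage; the slowest stage sets the pace."""
    name = min(stages, key=stages.get)
    return name, stages[name]


# Assumed streaming write rate of one holding disk, in Mbytes/sec.
HOLDING_DISK = 150.0

one_g = {"network": link_mbytes_per_sec(1), "holding disk": HOLDING_DISK}
ten_g = {"network": link_mbytes_per_sec(10), "holding disk": HOLDING_DISK}

for label, stages in (("1G", one_g), ("10G", ten_g)):
    name, rate = bottleneck(stages)
    print(f"{label}: limited by the {name} at {rate:.0f} Mbytes/sec")
```

Under these assumptions the network is the limit at 1G and the holding disk is the limit at 10G, which matches the shift described above.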
All of this is going to change in our next generation of fileservers, which will have both 10G-T networking and SSDs. Assuming that the software doesn't have its own IO rate limits (which is not always a safe assumption), both the aggregate SSDs and all the networking from the fileservers to Amanda will be capable of anywhere from several hundred Mbytes/sec up to as much 10G bandwidth as Linux can deliver. At this point the limit on how fast we can do backups will be down to the disk speeds on the Amanda backup servers themselves. These will probably be significantly slower than the rest of the system, since even striping two HDs together would only get us around 300 Mbytes/sec at most.
(It's not really feasible to use an SSD for the Amanda holding disk, because it would cost too much to get the capacities we need. We currently dump over a TB a day per Amanda server, and things can only be moved off the holding disk at the now-paltry HD speed of 100 to 150 Mbytes/sec.)
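As a rough illustration of what those HD speeds mean for clearing the holding disk, the following Python sketch works out how long it takes to move one TB at a steady 100 or 150 Mbytes/sec. It assumes decimal units (1 TB as a million Mbytes) and a constant transfer rate with no seeks or competing IO, so real numbers would be worse. The function name is made up for this example.

```python
def drain_hours(terabytes, mbytes_per_sec):
    """Hours needed to move `terabytes` of data at a steady rate in Mbytes/sec."""
    total_mbytes = terabytes * 1_000_000  # decimal units: 1 TB = 10**6 Mbytes
    return total_mbytes / mbytes_per_sec / 3600


for rate in (100, 150):
    print(f"{rate} Mbytes/sec: {drain_hours(1, rate):.1f} hours per TB")
# prints:
# 100 Mbytes/sec: 2.8 hours per TB
# 150 Mbytes/sec: 1.9 hours per TB
```

So even in the best case, every TB staged on the holding disk takes roughly two to three hours to move off again at these rates.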
This whole shift feels more than a bit weird to me; it's upended my perception of what I expect to be slow and what I think of as 'sufficiently fast that I can ignore it'. The progress of hardware over time has made it so the one part that I thought of as fast (and that was designed to be fast) is now probably going to be the slowest.
(This sort of upset in my world view of performance happens every so often, for example with IO transfer times. Sometimes it even sticks. It sort of did this time, since I was thinking about this back in 2014. As it turned out, back then our new fileservers did not stick at 10G, so we got to sleep on this issue until now.)