The increasingly surprising limits to the speed of our Amanda backups

April 20, 2018

When I started dealing with backups, the slowest part of the process was generally writing things out to tape, which is why Amanda was much happier when you gave it a 'holding disk' that it could stage all of the backups to before it had to write them out to tape. Once you had that in place, the speed limit was generally some mix of the network bandwidth to the Amanda server and how fast the machines being backed up could grind through their filesystems to create the backups. When networks moved to 1G, you (and we) usually wound up being limited by the speed of reading through the filesystems to be backed up.

(If you were backing up a lot of separate machines, you might initially be limited by the Amanda server's 1G of incoming bandwidth, but once most machines started finishing their backups you usually wound up with one or two remaining machines that had larger, slower filesystems. This slow tail wound up determining your total backup times. This was certainly our pattern, especially because only our fileservers have much disk space to back up. The same has typically been true of backing up multiple filesystems in parallel from the same machine; sooner or later we wind up stuck with a few big, slow filesystems, usually ones we're doing full dumps of.)

Then we moved our Amanda servers to 10G-T networking and, from my perspective, things started to get weird. When you have 1G networking, the network is generally slower than even a single holding disk; unless something's broken, modern HDs will generally do at least 100 Mbytes/sec of streaming writes, which is enough to keep up with a full-speed 1G network. However, that is only just over 1G data rates (roughly 120 Mbytes/sec), which means that a single HD is vastly outpaced by a 10G network. As long as we had a number of machines backing up at once, the Amanda holding disk was suddenly the limiting factor. However, for a lot of the run time of backups we're only backing up our fileservers, because they're where all the data is, and for that we're currently still limited by how fast the fileservers can do disk IO.
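This shifting bottleneck can be sketched as a simple pipeline model: sustained backup throughput is the minimum over the stages involved. The specific Mbytes/sec figures below are rough assumptions for illustration, not measurements from our setup:

```python
# Model the backup path as a chain of stages; sustained throughput is
# set by the slowest stage.  All Mbytes/sec figures are rough guesses.
def bottleneck(stages):
    """Return (name, Mbytes/sec) of the slowest stage in the pipeline."""
    return min(stages.items(), key=lambda kv: kv[1])

# With 1G networking, clients grinding through their filesystems (or
# the 1G link itself) are the limit; a single holding HD keeps up.
one_g = {"filesystem scan": 80, "network": 117, "holding disk": 100}

# With 10G networking and several clients dumping at once, the same
# holding HD is suddenly the slowest stage.
ten_g = {"filesystem scan (aggregate)": 600, "network": 1170, "holding disk": 100}

print(bottleneck(one_g))   # ('filesystem scan', 80)
print(bottleneck(ten_g))   # ('holding disk', 100)
```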

(The fileservers only have 1G network connections for reasons. However, usually it's disk IO that's the limiting factor, likely because scanning through filesystems is seek-limited. Also, I'm ignoring a special case where compression performance is our limit.)

All of this is going to change in our next generation of fileservers, which will have both 10G-T networking and SSDs. Assuming that the software doesn't have its own IO rate limits (which is not always a safe assumption), both the aggregate SSDs and all the networking from the fileservers to Amanda will be capable of anywhere from several hundred Mbytes/sec up to as much of the 10G bandwidth as Linux can deliver. At this point the limit on how fast we can do backups will be down to the disk speeds on the Amanda backup servers themselves. These will probably be significantly slower than the rest of the system, since even striping two HDs together would only get us up to around 300 Mbytes/sec at most.
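The size of that gap is easy to put numbers on (my numbers, not exact figures from our hardware): a simple stripe scales linearly at best, so matching a 10G stream takes a surprising number of spindles:

```python
# Rough sizing of a striped-HD holding disk against a 10G stream.
# Per-HD speed and usable 10G data rate are both assumptions.
import math

HD_MBS = 150       # optimistic streaming write speed for one HD
TEN_G_MBS = 1150   # rough usable data rate of 10G-T after overhead

print(2 * HD_MBS)                     # 300: what a two-HD stripe gets you
print(math.ceil(TEN_G_MBS / HD_MBS))  # 8: HDs needed to keep up with 10G
```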

(It's not really feasible to use an SSD for the Amanda holding disk, because it would cost too much to get the capacities we need. We currently dump over a TB a day per Amanda server, and things can only be moved off the holding disk at the now-paltry HD speed of 100 to 150 Mbytes/sec.)
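From those figures you can work out how long it takes just to drain a day's dumps off the holding disk (the 1 TB/day number is from above; the speeds are the usual HD streaming range):

```python
# Time to move a day's worth of dumps off the holding disk at
# streaming HD speeds; 1 TB/day is the figure from the post.
def drain_hours(tbytes, mbytes_per_sec):
    """Hours to write tbytes (decimal TB) at mbytes_per_sec."""
    return tbytes * 1_000_000 / mbytes_per_sec / 3600

print(f"{drain_hours(1, 100):.1f}")  # 2.8 hours at 100 Mbytes/sec
print(f"{drain_hours(1, 150):.1f}")  # 1.9 hours at 150 Mbytes/sec
```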

This whole shift feels more than a bit weird to me; it's upended my perception of what I expect to be slow and what I think of as 'sufficiently fast that I can ignore it'. The progress of hardware over time has made it so the one part that I thought of as fast (and that was designed to be fast) is now probably going to be the slowest.

(This sort of upset in my world view of performance happens every so often, for example with IO transfer times. Sometimes it even sticks. It sort of did this time, since I was thinking about this back in 2014. As it turned out, back then our new fileservers did not stick at 10G, so we got to sleep on this issue until now.)

Comments on this page:

By Twirrim at 2018-04-21 11:46:10:

In some places where it has been critical to ramp up speed of backups, I've heard of SSDs being used as a buffer or scratch disk on the backup server.

Effectively: Servers -> Backup-server (SSD) -> Backup-server (large HDD array). And sometimes, of course, then -> Backup-server (Tape). I'm extremely rusty on Amanda and can't remember just how complicated that could make things.

Of course that's adding complexity, an extra potential point of failure in the whole process. It's an option I'd only jump to if I really needed it.

From at 2018-04-21 13:46:23:

You used "holding disk" in the singular: is a single HDD actually used? Wouldn't this be a good use case for stripe/RAID of disks? Is it a matter of hardware (space) limitations?

If pure-SSD is not practical, what about hybrid-SSD? Some HDDs (in a RAID configuration?), plus some SSDs for write caching. ZFS has done this for almost a decade, and Linux now has bcache and dm-cache.

By cks at 2018-04-21 16:23:54:

In our old Amanda backup server hardware, we didn't have the space to have more than one holding disk. Our new backup server hardware has more drive bays and so we've currently got a two-HD software RAID stripe. Unfortunately that still only gets us up to the 300 Mbytes/sec range, which is well below what even a single SSD-based fileserver will likely be able to deliver over 10G.

Even a single SSD isn't necessarily fast enough to fully handle incoming backups. I think you can reasonably expect write speeds in the 400-500 Mbytes/sec range under many circumstances, and that's only half of the 10G speed. Our fileservers are feeding us from multiple SSDs at once, so I'm certainly hoping that they can push close to full 10G.

(How big we need and want our Amanda holding disk(s) to be is somewhat complicated and so beyond the scope of this comment, but we want them to be relatively large.)


