== The cause of our slow Amanda backups and our workaround

A while back I wrote about [[the challenges in diagnosing slow (Amanda) backups SlowBackupsChallenge]]. It's time for a followup entry on that, because we found what I can call 'the problem' and along with it a workaround.

To start with, I need to talk about how we had configured our Amanda clients. In order to back up [[our fileservers ../solaris/ZFSFileserverSetupII]] in a sensible amount of time, we run multiple backups on each of them at once. We don't really try to do anything sophisticated to balance the load across multiple disks, both because this is hard in our environment (especially given limited Amanda features) and because we've never seen much evidence that reducing overlaps was useful in speeding things up; instead we just have Amanda run three backups at once on each fileserver ('_maxdumps 3_' in Amanda configuration).

For historical reasons we were also using Amanda's '_auth bsd_' style of authentication and communication. As I kind of mentioned in passing in [[my entry on Amanda data flows AmandaBackupDataFlows]], '_auth bsd_' communication causes all concurrent backup activity to flow through a single master _amandad_ process.

It turned out that this was our bottleneck. When we had a single _amandad_ process handling sending all backups back to the Amanda server and it was running more than one filesystem backup at a time, things slowed down drastically and we experienced our problem. When an _amandad_ process was only handling a single backup, things went fine.

We tested and demonstrated this in two ways. The first was that we dropped one fileserver down to one dump at a time, and then it ran fine. The more convincing test was [[to use _SIGSTOP_ and _SIGCONT_ ../unix/SIGSTOPUsesAndCautions]] to pause and then resume backups on the fly on a server running multiple backups at once.
This demonstrated that network bandwidth usage jumped drastically when we paused two out of the three backups and tanked almost immediately when we allowed more than one to run at once. It was very dramatic. Further work with [[a DTrace script ../solaris/DTraceFDIOVolScript]] provided convincing evidence that it was the _amandad_ process itself that was the locus of the problem and it wasn't that, eg, _tar_ reads slowed down drastically if more than one _tar_ was running at once.

Our workaround was to switch to Amanda's '_auth bsdtcp_' style of communication. Although I initially misunderstood what it does, it turns out that this causes each concurrent backup to use a separate _amandad_ process and this made everything work fine for us; performance is now up to the level where [[we're saturating the backup server disks instead of the network ../tech/HDsVs10GEthernet]].

Well, mostly. It turns out that [[our first-generation ZFS fileservers ../solaris/ZFSFileserverSetup]] probably also have the slow backup problem. Unfortunately they're running a much older Amanda version and I'm not sure we'll try to switch them to '_auth bsdtcp_' since they're on the way out anyways.

I call this a workaround instead of a solution because in theory a single central _amandad_ process handling all backup streams shouldn't be a problem. It clearly is in our environment for some reason, so it sort of would be better to understand why and if it can be fixed.

(As it happens I have a theory for why this is happening, but it's long enough and technical enough that it needs [[another entry ../programming/IOMultiplexingDoneWrong]]. The short version is that I think the _amandad_ code is doing something wrong with its socket handling.)
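
As a footnote to the _SIGSTOP_ and _SIGCONT_ test described earlier, here is a minimal Python sketch of the general idea: stop all but one of a set of processes, watch what happens to network bandwidth, then let them continue. This is not the procedure we actually used. The function names, the PIDs, and the choice of which processes to signal are all hypothetical, and the _send_ parameter exists only so the logic can be exercised without signalling real processes.

```python
import os
import signal


def pause_all_but(pids, keep, send=os.kill):
    """Send SIGSTOP to every PID in pids except keep.

    Returns the list of PIDs that were paused so they can be resumed later.
    'send' defaults to os.kill; it is a parameter only to allow dry runs.
    """
    paused = [pid for pid in pids if pid != keep]
    for pid in paused:
        send(pid, signal.SIGSTOP)
    return paused


def resume(pids, send=os.kill):
    """Send SIGCONT to every PID in pids so the paused processes continue."""
    for pid in pids:
        send(pid, signal.SIGCONT)
```

In use, you would collect the PIDs of the running backup processes by hand (for example from _ps_ output), call _pause_all_but(pids, keep=one_pid)_, watch the network traffic for a while, and then call _resume()_ on the returned list. Signalling processes you don't own needs appropriate privileges, and [[the usual cautions ../unix/SIGSTOPUsesAndCautions]] about stopping processes apply.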