The challenges of diagnosing slow backups
This is not the techblog entry I thought I was going to write. That entry would have been a quietly triumphal one about finding and squelching an interesting and obscure issue that was causing our backups to be unnaturally slow. Unfortunately, while the particular issue is squelched our backups of at least our new fileservers are still (probably) unnaturally slow. Certainly they seem to be slower than we want. So this entry is about the challenge of trying to figure out why your backups are slow (and what you can do about it).
The first problem is that unless the cause is quite simple, you're going to wind up needing to make detailed observations of your systems while the backups are running. In fact you'll probably have to do this repeatedly. By itself this is not a problem as such. What makes it into one is that most backups are run out of hours, often well out of hours. If you need to watch the backups and the backups start happening at 11pm, you're going to be staying up. This has various knock-on consequences, including that human beings are generally not at their sharpest at 11pm.
(General purpose low level metrics collection can give you some immediate data but there are plenty of interesting backup slowdowns that cannot be caught with them, including our recent one. And backups run during the day (whether test runs or real ones) are generally going to behave somewhat differently from nighttime backups, at least if your overall load and activity are higher in the day.)
Beyond that issue, a running backup system is generally a quite complex beast with many moving parts. There are probably multiple individual backups in progress in multiple hosts, data streaming back to backup servers, and any number of processes in communication with each other about all of this. As we've seen, the actual effects of a problem in one piece can manifest far away from that piece. In addition pieces may be interfering with each other; for example, perhaps running enough backups at once on a single machine causes them to contend for a resource (even an inobvious one, since it's pretty easy to spot saturated disks, networks, CPU, et al).
Complex systems create complex failure modes, which means that there are a lot of potential inobvious things that might be going on. That's a lot of things to winnow through for potential issues that pass the smell test, don't contradict any evidence of system behavior that you already have, and ideally that can be tested in some simple way.
(And the really pernicious problems don't have obvious causes, because if they did they would be easy to at least identify.)
What writing this tells me is that this is not unique to backup systems and that I should probably sit down to diagram out the overall backup system and its resources, then apply Brendan Gregg's USE Method to all of the pieces involved in backups in a systematic way. That would at least give me a good collection of data that I could use to rule things out.
(It's nice to form hypotheses and then test them and if you get lucky you can speed everything up nicely. But there are endless possible hypotheses and thus endless possible tests, so at some point you need to do the effort to create mass winnowings.)