Reasoning backwards (a story about what can happen to SATA disks)
Oh dear, one backup server hasn't finished yet despite running all night and all (work) day; in fact, Amanda is still doing three initial full backups of three filesystems on the single fileserver this backup server handles, and hasn't even started anything else. First thought: is there anything obviously wrong on the fileserver or the iSCSI backends? The answer is no: load is low, IO stats are low, the disks do not seem to be at all close to saturation in things like IO operations a second, network usage is low on the fileserver and the backends, and nothing is reported in syslog on any of the machines.
It's time to reason backwards and use brute force.
- `truss` one of the `tar` processes that Amanda is running on the fileserver to see if anything it's doing is slow. Nothing obvious jumps out (in particular it seems to be reading data from the disks fine), but every so often one of the `write()`s it does stalls for two seconds or so.
Well, why? I know that `tar` is writing through a pipe to a local `amandad` process (which then forwards the data over the network to Amanda on the backup server). Time to move a step backwards.
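The tool here was `truss`; on Linux the rough equivalent is `strace -T`, which appends each syscall's elapsed time in angle brackets. Spotting the occasional multi-second `write()` in a long trace by eye is tedious, so a small filter helps. This is a sketch under assumptions (the line format and the one-second threshold are mine, not from the original trace):

```python
import re

# strace -T appends the time spent in each syscall, e.g.:
#   write(5, "data"..., 65536) = 65536 <2.103459>
# This regex pulls out the syscall name and its duration in seconds.
SYSCALL_RE = re.compile(r'^(?P<call>\w+)\(.*\)\s*=\s*-?\d+.*<(?P<secs>[\d.]+)>')

def slow_calls(lines, call="write", threshold=1.0):
    """Yield (duration, line) for traced calls slower than threshold seconds."""
    for line in lines:
        m = SYSCALL_RE.match(line.strip())
        if m and m.group("call") == call and float(m.group("secs")) >= threshold:
            yield float(m.group("secs")), line.strip()

# Hypothetical trace fragment: one fast write, one stalled write, one read.
sample = [
    'write(5, "x"..., 65536) = 65536 <0.000213>',
    'write(5, "x"..., 65536) = 65536 <2.103459>',
    'read(3, "y"..., 65536) = 65536 <0.000108>',
]
for secs, line in slow_calls(sample):
    print(f"{secs:.3f}s  {line}")  # flags only the two-second stall
```

The same filter works unchanged on the `amandad` trace in the next step; only the traced process changes.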
- Truss the `amandad` process to see what's slowing it down; the answer is that one of its `write()`s to the network stalls every so often, just as `tar`'s `write()` does. I know the fileserver's network isn't saturated, but that's only one side of the traffic flow.
- Check the network usage on the backup server. Nope: there's no peculiar source of traffic volume that would be saturating it. For a bit I wondered whether the network link had come up at 100 Mbits instead of a gigabit, but there are periodic usage spikes of over 10 Mbytes/sec, so that can't be it. So it has to be something local on the backup server.
Amanda on the backup server writes all incoming backups to a holding disk before writing them to 'tape' (our tapes are disks, but our version of Amanda is still strongly living in the tape world).
- Check the IO stats on the holding disk: they are through the roof and well into crazy land. The big problem is that the disk is writing data at only around 5.2 Mbytes/sec (and 13 IO operations a second) despite always having over a hundred writes queued. As expected, utilization is at 100% and write service times are immense (over 8 seconds).
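As a sanity check (my arithmetic, not anything the tools printed), those numbers hang together under Little's law: with over a hundred writes queued and the disk completing only about 13 a second, the average wait per write has to be in the same ballpark as the observed service times, and the per-IO size implied by the throughput shows the writes were large, not tiny:

```python
# Little's law: queue length = completion rate x time in system, so on a
# saturated disk, time-in-system ~= queue length / completion rate.
queued_writes = 100   # "over a hundred writes queued"
iops = 13             # completions per second from the IO stats
avg_wait = queued_writes / iops
print(f"{avg_wait:.1f} seconds per write")   # ~7.7s, matching "over 8 seconds"

# Per-IO size implied by the throughput: big writes, yet tiny total bandwidth,
# which points at the disk itself being slow rather than a seek-heavy workload.
throughput_mb = 5.2   # Mbytes/sec actually written
io_size_kb = throughput_mb * 1024 / iops
print(f"{io_size_kb:.0f} KB per IO")         # ~410 KB per write
```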
Ding. We have a winner (well, a loser).
As it happens we've seen SATA disks fail this way before; they get (very) slow but don't report any errors or SMART failures. Replacing the holding disk with a new one made the backup server happy with life again.
What I found most interesting about this problem was how indirect the symptoms were (in a sense). 'Really slow backups' is usually caused by fileserver problems, not by something quietly wrong on the backup server itself. I had to reason backwards through a number of layers to arrive at the real culprit (and it helped a lot that we'd already had experience with clearly slow SATA disks).