Reasoning backwards (a story about what can happen to SATA disks)

August 5, 2012

This is a story, presented as a narrative.

Oh dear, one backup server hasn't finished yet despite running all night and all (work) day; in fact, Amanda is still doing three initial full backups of three filesystems on the single fileserver this backup server handles, and hasn't even started anything else. First thought: is there anything obviously wrong on the fileserver or the iSCSI backends? The answer is no; load is low, IO stats are low, the disks do not seem to be at all close to saturation in things IO operations a second, network usage is low on the fileserver and the backends, and there is nothing reported in syslog for any of the machines.

It's time to reason backwards and use brute force.

  • Use truss on one of the tar processes that Amanda is running on the fileserver to see if anything it's doing is slow. Nothing obvious jumps out (and in particular it seems to be reading data from the disks fine), but every so often one of the write()s it does stalls for two seconds or so.

    Well, why? I know that tar is writing through a pipe to a local amandad process (which then forwards the data over the network to Amanda on the backup server). Time to move a step backwards.

  • Truss the amandad process to see what's slowing it down; the answer is one of its write() to the network stalls every so often, just like tar's write() does. I know the fileserver's network isn't saturated, but that's only one side of the traffic flow.

  • check the network usage on the backup server. Nope, there's no peculiar source of traffic volume that would be saturating it. For a bit I was wondering if the network link had come up at 100 Mbits instead of a gigabit, but there are periodic usage spikes of over 10 Mbytes/sec so that can't be it. So it has to be something local on the backup server.

Amanda on the backup server writes all incoming backups to a holding disk before writing them to 'tape' (our tapes are disks, but our version of Amanda is still strongly living in the tape world).

  • check the IO stats on the holding disk and they are through the roof and well into crazy land. The big problem is that the disk is only writing data at around 5.2 Mbytes/sec (and 13 IO operations a second), despite always having over a hundred writes queued. As expected, utilization is at 100% and write service times are immense (over 8 seconds).

Ding. We have a winner (well, a loser).

As it happens we've seen SATA disks fail this way before; they get (very) slow but don't report any errors or SMART failures. Replacing the holding disk with a new one made the backup server happy with life again.

What I found most interesting about this problem was how indirect the symptoms were (in a sense). 'Really slow backups' is usually something caused by fileserver problems, not something quiet on the backup server itself. I had to reason backwards through a number of layers to arrive at the real culprit (and it helped a lot that we'd already had experience with clearly slow SATA disks).

Written on 05 August 2012.
« Oracle, ZFS, and Linux (and Solaris)
The (possibly apocryphal) story of System V IPC's origins »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Sun Aug 5 00:02:44 2012
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.