sysadmin/ReasoningBackwards written at 00:02:44; Add Comment
Reasoning backwards (a story about what can happen to SATA disks)
This is a story, presented as a narrative.
Oh dear, one backup server hasn't finished yet despite running all night and all (work) day; in fact, Amanda is still doing three initial full backups of three filesystems on the single fileserver this backup server handles, and hasn't even started anything else. First thought: is there anything obviously wrong on the fileserver or the iSCSI backends? The answer is no; load is low, IO stats are low, the disks do not seem to be at all close to saturation in things IO operations a second, network usage is low on the fileserver and the backends, and there is nothing reported in syslog for any of the machines.
It's time to reason backwards and use brute force.
Amanda on the backup server writes all incoming backups to a holding disk before writing them to 'tape' (our tapes are disks, but our version of Amanda is still strongly living in the tape world).
Ding. We have a winner (well, a loser).
As it happens we've seen SATA disks fail this way before; they get (very) slow but don't report any errors or SMART failures. Replacing the holding disk with a new one made the backup server happy with life again.
What I found most interesting about this problem was how indirect the symptoms were (in a sense). 'Really slow backups' is usually something caused by fileserver problems, not something quiet on the backup server itself. I had to reason backwards through a number of layers to arrive at the real culprit (and it helped a lot that we'd already had experience with clearly slow SATA disks).
* * *
Atom feeds are available; see the bottom of most pages.