A surprising effect of RAID-1 resynchronization
Today I ran into an interesting performance effect of having a RAID-1 mirror resync running on a big partition of a live system.
An important system was having performance problems today, so we were poking around on it. When we watched the disk statistics, we noticed that only the first disk was seeing read traffic; the second disk was loafing along with just occasional bursts of writes. Looking more closely, we noticed that a RAID-1 resync of a big partition was in progress; because the system was loaded, the resync's IO bandwidth had been choked and it hadn't gotten very far, only 5% or so into a 100G partition.
Then the light dawned. Normally, reads are distributed over both sides of a RAID-1 mirror. However, at the moment only 5% of the second disk was valid; a read for something in the remaining 95% could only be done by the first disk. No wonder the first disk was running hot and the second disk was seeing virtually no reads.
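To put rough numbers on this, here is a minimal sketch under simplifying assumptions that are not from the actual system: reads are spread uniformly over the partition, and reads that land in the already-resynced region are split evenly between the two disks. The function name `read_shares` is made up for illustration.

```python
def read_shares(synced_fraction):
    """Estimate the share of reads each disk serves during a resync.

    Assumptions (illustrative only): reads are uniform over the
    partition, and reads inside the resynced region are split 50/50.
    Reads outside the resynced region can only go to the first disk.
    """
    if not 0.0 <= synced_fraction <= 1.0:
        raise ValueError("synced_fraction must be between 0 and 1")
    second_disk = synced_fraction / 2
    first_disk = 1.0 - second_disk
    return first_disk, second_disk


# With 5% of the partition resynced, as in the incident above:
first, second = read_shares(0.05)
print(f"first disk: {first:.1%} of reads, second disk: {second:.1%}")
```

Under these assumptions the first disk serves almost all of the reads until the resync is well along, and this is before counting the resync's own reads, which also come from the first disk.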
Like everybody, I already knew about the direct IO impact of a RAID-1 resync. But the choking effect of not being able to read from both disks for most of the filesystem hadn't previously occurred to me.
Sidebar: what's a RAID-1 resync?
A RAID-1 resync is what happens when the two disks in a RAID-1 mirror cease to be identical copies of each other, usually due to some calamity (power loss, system crash, disk failure). When this happens, one of the mirrors is identified as the most up to date and its data gets dumped to the other disk to bring them back into sync.
The obvious effect of a RAID-1 resync is that it adds extra IO to the system: reads on the first disk, writes on the second disk. However, any decent RAID system has mechanisms to limit this IO so that it happens more or less when the disks are idle and doesn't steal IO bandwidth from real work.
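One common shape for this kind of limit is a low guaranteed resync rate that applies while the disks are busy with real work and a much higher ceiling that applies when they are idle. The sketch below is a hypothetical model of that idea, not any particular RAID system's code; the function names and the default rates are assumptions chosen for illustration.

```python
def pick_resync_rate(foreground_busy, min_kb_per_s=1_000, max_kb_per_s=200_000):
    """Return the resync rate cap in KB/s.

    Hypothetical policy: fall back to a low guaranteed rate while real
    work is using the disks, and open up to the ceiling when idle.
    """
    return min_kb_per_s if foreground_busy else max_kb_per_s


def hours_to_finish(remaining_kb, rate_kb_per_s):
    """Estimate the hours left for a resync running at a constant rate."""
    return remaining_kb / rate_kb_per_s / 3600


# Illustrative only: 95% of a 100G partition still to copy.
remaining_kb = 0.95 * 100 * 1024 * 1024
for busy in (True, False):
    rate = pick_resync_rate(busy)
    print(f"busy={busy}: {rate} KB/s, about {hours_to_finish(remaining_kb, rate):.1f} hours left")
```

The point of the comparison is that a resync throttled down by a loaded system can take far longer than one on idle disks, which stretches out the period during which the second disk can't serve most reads.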