My view on software RAID and the RAID write hole

April 12, 2013

The old issue of software RAID versus hardware RAID came up recently on Twitter, prompting Chris Cowley to write Stop the hate on Software RAID, which in turn sparked a small lobste.rs discussion in which people pointed to the RAID 5 write hole as a reason to prefer hardware RAID over software RAID. I've written several entries about how I favour software RAID, but I've never talked about the write hole.

(For now let's ignore some other issues with RAID 5 or pretend that we're talking about RAID 6 instead, which also has this write hole issue.)
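(As a quick illustration of what the write hole actually is: RAID 5 parity is the XOR of the data blocks in a stripe, so a crash that lands between rewriting a data block and rewriting its matching parity block leaves the stripe inconsistent. Here's a minimal Python sketch of this; the tiny three-block 'array' is purely illustrative, not how any real RAID implementation is structured:)

```python
# Hypothetical 3-disk RAID 5 stripe: two data blocks plus their XOR parity.
# After a complete stripe update, any one block can be rebuilt from the
# other two. The write hole is the window where a data block has been
# rewritten but its parity has not.

def xor_blocks(a, b):
    """XOR two equal-length byte blocks, as RAID 5 parity does."""
    return bytes(x ^ y for x, y in zip(a, b))

d0 = bytes([0x11] * 4)
d1 = bytes([0x22] * 4)
parity = xor_blocks(d0, d1)

# Consistent stripe: reconstructing d0 from d1 and parity works.
assert xor_blocks(d1, parity) == d0

# Rewrite d0, then "crash" before the parity update reaches the disk.
d0 = bytes([0x33] * 4)
# parity is now stale -- this is the write hole.

# If the disk holding d1 later fails, reconstruction from d0 and the
# stale parity silently produces the wrong contents for d1.
rebuilt_d1 = xor_blocks(d0, parity)
assert rebuilt_d1 != d1  # silent data corruption, no error reported
```

Note that nothing here returns an error; the array happily hands back wrong data, which is why people describe the write hole as silent corruption.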

I'll start by being honest even if it's painful: hardware RAID has an advantage here. Yes, you can (and should) put your software RAID system on a UPS (or two) and so on, but there are simply more parts that can fail abruptly when you're dealing with a full server than when you're dealing with an on-card battery. This doesn't mean that hardware RAID is risk free (hardware RAID cards fail too) or that software RAID is particularly risky (abrupt crashes of this sort are extreme outliers in most environments), but it does mean that hardware RAID is less risky in this specific respect.

This is where we get into tradeoffs. Hardware RAID has both drawbacks and risks of its own (relative to software RAID). When building any real system you have to assess the relative importance and real-world likelihood of these risks (and how successfully you feel that you can mitigate them), because real systems are almost always a balance between (potential) problems. My personal view is that in general, abrupt system halts are vanishingly rare in properly designed systems. This makes the RAID write hole essentially a non-issue for software RAID.

(Of course there are all sorts of cautions here. For example, if you're operating enough systems the vanishingly rare can start happening more often than you want.)

Thus my overall feeling is (and remains) that most people and most systems are better off with software RAID than with hardware RAID. In practice I think you are much more likely to get bitten by various issues with hardware RAID than you are to blow things up by hitting the software RAID write hole with a system crash or power loss event.

(By the way, if you're seriously worried about the RAID write hole you'll want to carefully verify that your disks actually write data when they tell you that they have. This is probably much less of a risk if you buy expensive 'enterprise' SAS drives, of course.)


Comments on this page:

From 203.97.214.3 at 2013-04-13 20:18:27:

For me the biggest risk with hardware RAID in small to medium sized shops is the worry that if the controller fails, I won't be able to replace it easily with something that understands the on-disk format of the RAID drives, or that there will be a delay in doing so. Software RAID will, on the other hand, work the same way anywhere I can run the operating system.

Rodger

From 86.148.19.117 at 2013-04-23 04:24:00:

The problem with the RAID 5 write hole is that when it happens you can end up with silent data corruption.

Additionally, ZFS RAID-Z[123] do not have the write hole issue and are safe to use.

By cks at 2013-04-24 16:16:59:

My view is that there are plenty of ways to get silent data corruption with modern drives; all of those ways are exactly why ZFS and other modern filesystems have added checksums. The RAID-5 write hole is in practice probably less likely to affect you than those other causes of corruption, assuming that your power and UPSes and so on are reliable (which they are here).

Written on 12 April 2013.
