Solid state disks in mirrors and other RAID setups, and wear lifetimes

October 4, 2020

Writing down my plans to move to all solid state disks on my home machine, where I don't have great backups, has made me start thinking about various potential issues that this shift might create. One of them is specific to how I'm going to be using my drives (and how I'm already using SSDs), which is in mirrored pairs and more generally in a RAID environment.

The theory of using mirrored drives is that it creates redundancy and gives you insurance against single disk drive failures. When you mirror hard drives, one of the things you are tacitly counting on is that most hard drive failures seem to be random mechanical or physical media failures (ie, the drive suffers a motor failure or too many bad spots start cropping up on the platters). Because these are random failures, the odds are very good that they won't happen on both drives at the same time.

Solid state drives are definitely subject to random failures from things like (probable) manufacturing defects. We've had some SSDs die very early in their lifetimes, and there are a reasonable number of reports that SSDs are subject to infant mortality (people might find A Study of SSD Reliability in Large Scale Enterprise Storage Deployments [PDF] to be interesting on this topic, among others). However, solid state drives also have a definite maximum lifetime based on total writes. Drives in a mirrored setup (or more generally in any RAID configuration) are likely to see almost exactly the same amount of writes over time, which means that they will reach their wear lifetimes at almost the same time.
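As a toy illustration of why the write loads are so tightly correlated, here's a small Python stand-in (obviously not how any real RAID implementation works internally): a mirror hands every logical write to every member, so the per-drive lifetime write counters stay in lockstep.

    # Toy stand-in for a mirror: every logical write is duplicated to each
    # member drive, so their lifetime write counters advance identically.
    class FakeDrive:
        def __init__(self, name):
            self.name = name
            self.bytes_written = 0

        def write(self, data):
            self.bytes_written += len(data)

    class Mirror:
        def __init__(self, drives):
            self.drives = drives

        def write(self, data):
            for d in self.drives:
                d.write(data)

    a, b = FakeDrive("sda"), FakeDrive("sdb")
    md0 = Mirror([a, b])
    for _ in range(1000):
        md0.write(b"x" * 4096)          # 1000 writes of 4 KiB each
    for d in (a, b):
        print(d.name, d.bytes_written)  # both report 4096000 bytes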

If your solid state drives reach their wear lifetimes at all in your RAID array (and you put them into the array at the same time, which is quite common), it seems very likely that they will reach that lifetime at about the same time. If you have good monitoring and reporting on wear (and if the drives report wear honestly), this means you'll start wanting to replace them at about the same time. If they don't report wear honestly and just die someday, the odds of nearly simultaneous failures are perhaps uncomfortably high.
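If you want that sort of monitoring, something along the lines of the following sketch is one way to do it: shell out to smartctl and pull out the wear-related fields for each half of the mirror. The device names and field patterns here are assumptions that vary by drive; NVMe drives report things like 'Percentage Used' and 'Data Units Written' in their health log, while SATA SSDs tend to use vendor attributes with names like 'Wear_Leveling_Count' or 'Media_Wearout_Indicator'.

    #!/usr/bin/python3
    # Sketch: print wear-related SMART fields for the drives in a mirror.
    # The device names and attribute patterns are assumptions; adjust them
    # for your hardware (vendors name these attributes differently).
    import re
    import subprocess

    DRIVES = ["/dev/nvme0", "/dev/nvme1"]
    PATTERNS = [r"Percentage Used", r"Data Units Written",
                r"Wear_Leveling_Count", r"Media_Wearout_Indicator"]

    def wear_lines(dev):
        out = subprocess.run(["smartctl", "-A", dev],
                             capture_output=True, text=True).stdout
        return [ln.strip() for ln in out.splitlines()
                if any(re.search(p, ln) for p in PATTERNS)]

    for dev in DRIVES:
        print(dev)
        for ln in wear_lines(dev):
            print("   ", ln)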

There are two reasons this may not be a real worry in practice. The first is that it seems unusual (and hard) in practice to reach even the official nominal wear lifetimes of SSDs, much less the real ones (which historically seem to have been much higher than the datasheet numbers when people have tested to destruction). The second is that A Study of SSD Reliability in Large Scale Enterprise Storage Deployments specifically says that you should worry more about infant mortality taking out multiple drives at once, since its data shows that (enterprise) solid state storage has a significantly extended infant mortality period.
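To put some rough numbers on the first reason (both figures below are assumptions, although they're in the right ballpark for a consumer 1 TB SSD and a home machine):

    # Back of the envelope: time to hit a drive's rated write endurance.
    # Both numbers are assumptions; substitute your own.
    rated_tbw = 600          # rated terabytes written for the drive
    writes_per_day_gb = 50   # a fairly heavy day on a home machine

    days = rated_tbw * 1000 / writes_per_day_gb
    print(f"{days:.0f} days, about {days / 365:.0f} years")  # 12000 days, ~33 years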

(You can also deal with wear concerns by throwing one or some of your RAID drives into a test setup to get written to a lot before you spin up the real RAID array, so that they should reach any wear lifetime a TB or three ahead of your other drives. This might or might not affect infant mortality in any useful way.)
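(If you did want to deliberately pre-age a drive this way, one crude approach is to push a few TB of throwaway writes through a filesystem on it before it joins the array. The path and sizes below are placeholders, and a real tool like fio does this job with far more control; this is just a sketch of the idea.)

    #!/usr/bin/python3
    # Crude pre-wear sketch: repeatedly overwrite one junk file until a target
    # amount of data has been pushed through the drive, then delete it.
    # TARGET, FILE_GIB, and TOTAL_GIB are placeholders.
    import os

    TARGET = "/mnt/newdrive/prewear.junk"   # a path on the drive being pre-worn
    FILE_GIB = 32                           # size of the junk file
    TOTAL_GIB = 2048                        # roughly 2 TiB of extra writes
    CHUNK = 64 * 1024 * 1024                # 64 MiB per write() call

    written = 0
    while written < TOTAL_GIB * 1024**3:
        with open(TARGET, "wb") as f:
            for _ in range(FILE_GIB * 1024**3 // CHUNK):
                f.write(os.urandom(CHUNK))  # incompressible, so it really hits the flash
            f.flush()
            os.fsync(f.fileno())
        written += FILE_GIB * 1024**3

    os.remove(TARGET)
    print("done:", written // 1024**3, "GiB written")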


Comments on this page:

As an example of mirrored drives having highly correlated problems, see HPE's SSD firmware 'uptime bug' from last year:
