SSDs may make ZFS raidz viable for general use

August 16, 2013

The classic problem and surprise with ZFS's version of RAID-5+ (raidz1, raidz2, and so on) is that you get much less read IO from your pool than most people expect. Rather than N disks' worth of read IOPS, you get (more or less) one disk's worth for small random reads. To date this has mostly made raidz unsuitable for general use; you need to be doing relatively little random read IO or have rather low performance requirements to avoid being disappointed.

(Sequential read IO is less affected. Although I haven't tested or measured it, I believe that ZFS raidz will saturate your available disk bandwidth for predictable read patterns.)
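
(To make that concrete, here's a minimal back-of-envelope sketch in Python. It's my own simplified model, not anything from ZFS itself: it assumes that every small raidz read has to touch all of the data disks in the vdev, while each disk in a set of mirrors can answer a different read.)

    # Simplified model (mine, not ZFS code): rough random read IOPS for a
    # raidz vdev versus a set of mirror pairs built from the same disks.
    def raidz_random_read_iops(disks, parity, per_disk_iops):
        assert disks > parity, "need at least one data disk"
        # every small read touches all (disks - parity) data disks at once,
        # so the whole vdev delivers about one disk's worth of random reads
        return per_disk_iops

    def mirror_random_read_iops(disks, per_disk_iops):
        # each disk in a set of mirror pairs can serve a different read
        return disks * per_disk_iops

    # six hard drives at the usual ~100 IOPS each:
    print(raidz_random_read_iops(6, 1, 100))   # raidz1: ~100 IOPS total
    print(mirror_random_read_iops(6, 100))     # three mirror pairs: ~600 IOPS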

Or rather, this has made raidz unsuitable because hard drives have such low IOPS rates (generally assumed to be around 100 a second) that having only one disk's worth is terrible. But SSDs have drastically higher read IOPS; one SSD's worth of reads a second is still generally an impressively high number. While a raidz pool of SSDs will not have as high an IOPS rate as a bunch of mirrored SSDs, you'll get a lot more storage for your money. And a single SSD's worth of IOPS may well be enough to saturate other parts of your system (or at least more than satisfy their performance needs).
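
(The same back-of-envelope arithmetic with SSD-class numbers is what changes the picture. The 100 IOPS hard drive figure is the usual assumption above; the SSD figure here is a ballpark I picked for illustration, not a measurement of any particular drive.)

    # Same arithmetic as before, with an assumed SSD figure.
    hd_iops  = 100     # the usual assumption for a hard drive
    ssd_iops = 50000   # illustrative ballpark for one SSD, not a measurement

    print("HD  raidz1 pool:", hd_iops)        # ~one hard drive's worth
    print("SSD raidz1 pool:", ssd_iops)       # ~one SSD's worth, still large
    print("SSD mirrors (6):", 6 * ssd_iops)   # mirrored SSDs scale further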

(There are other tradeoffs, of course. A raidzN will protect you from any arbitrary N disks dying, unlike mirrors, but can't protect you from a whole controller falling over the way a distributed set of mirrors can.)

This didn't even occur to me until today because I've been conditioned to shy away from raidz; I 'knew' that it performed terribly for random reads and hadn't thought through the special implications of changing raidz from HDs to SSDs. I don't think this will change our general plans (we value immunity from a single iSCSI backend failing), but it's certainly something I'm going to keep in mind just in case.


Comments on this page:

By trs80 at 2013-08-17 00:37:42:

How much storage are you planning on needing? You can get ~1TB SSDs for ~US$1000, although this is of course 10x the cost of 2.5" 1TB drives, and even more compared to 3.5" drives. I'm sure you'll post a blog entry on this, but what are your thoughts on 2.5" vs 3.5" and enterprise vs consumer drives? Production-wise, the industry is almost done moving away from 3.5" performance drives, with that size only left for bulk storage.

By cks at 2013-08-19 18:27:17:

I've written up my answers as an entry, DiskDriveViews2013. The short version: we can't afford to use SSDs for all of our storage and while I like 2.5" drives in theory there currently aren't any that are big enough for us.

From 86.162.240.137 at 2013-08-20 05:49:02:

Depending on your specific workload, the hybrid allocator for RAID-Z in Solaris 11 might help - it essentially uses (N-way) mirroring for metadata on RAID-Z volumes, so if you have lots of small files it should make reading metadata faster with less impact on standard reads. This is only available in Solaris 11 ZFS, AFAIK.

By vlad at 2013-10-15 15:01:55:

Hi Chris,

Vlad here (the guy who was sitting across from your office for a few years ;) ).

Here are the IOZONE results for ZFS RAIDZ-1 on a FreeBSD 9.2 machine. ZFS is installed on 3x 4TB Seagate NAS drives. This server has only 2GB of RAM, and I chose a 16GB file for the tests.

IOZONE runs various tests, and random reads is one of them.

       Run began: Sat Oct 12 07:13:34 2013
       Using minimum file size of 16777216 kilobytes.
       Using maximum file size of 16777216 kilobytes.
       Auto Mode
       Command line used: iozone -n16g -g 16g -a
       Output is in Kbytes/sec
       Time Resolution = 0.000001 seconds.
       Processor cache size set to 1024 Kbytes.
       Processor cache line size set to 32 bytes.
       File stride size set to 17 * record size.
                                                           random  random    bkwd   record   stride                                   
             KB  reclen   write rewrite    read    reread    read   write    read  rewrite     read   fwrite frewrite   fread  freread
       16777216      64  116136   52756   114127   212492    6260    5151   11266  3519361     5256     9692    58950  139560   172149
       16777216     128  128367  110293   127927   154604   10936  111799   13329  5233815    11903   119359   114005  188507   199066
       16777216     256  120470  118875   173513   192828   16756  108468   21217  5555224    19255   109513   121924  151058   242490
       16777216     512  119806  109784   199077   240193   30150  110300   36420  5687205    34656   118471   120858  222436   233393
       16777216    1024  112002  117841   266887   274866   55937  113958   61115  5763623    61207   119008   110966  192922   193810
       16777216    2048  118477  113491   272860   283175   91151  104305   91075  5180215    94369   112214   121025  239028   243185
       16777216    4096  127778  109895   188126   210995  106470  124505  110848  1213090   117973   120814   118854  194869   199480
       16777216    8192  111168  116231   199739   214738  153041  114268  136429   832451   156018   112373   108627  167275   177815
       16777216   16384  113805  110436   237025   246339  195275  120950  189143   912348   200979   119754   116305  195823   198498