Wandering Thoughts archives

2013-11-04

How writes work on ZFS raidzN pools, with implications for 4k disks

There is an important difference between how ZFS handles raidzN pools and traditional RAID-5 and RAID-6 systems, a difference that can have serious ramifications in some environments. While I've mentioned this before I've never made it explicit and clear, so it's time to fix that.

In a traditional RAID-5/6/etc system, all stripes are full width, ie they span all disks (in fact they're statically laid out and you can predict which disks are data disks and which are parity disks for any particular stripe). If you write or rewrite only part of a stripe, the RAID system must do some variant of a read-modify-write cycle, updating at least one data disk and N parity disks. In ZFS, stripes are variable-sized and hence span a variable number of disks (up to the full number of disks for data plus parity). Layout is variable and how big a stripe is depends on how much data you're writing (up to the dataset's recordsize). To determine how many disks a given data block write needs, you basically divide the size of the data by the fundamental sector size of the vdev (set by its ashift) to get a number of sectors, then spread those sectors across the data disks, possibly or likely wrapping around once the write gets big enough. There are no in-place updates of existing stripes.
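
To make that arithmetic concrete, here's a small Python sketch of the model described above. It's a deliberate simplification for illustration; the function name, the 'rows' idea, and the parameters are mine, not ZFS's actual allocation code.

    import math

    def raidz_sectors(write_bytes, ashift, ndisks, nparity):
        # Simplified model: how many data and parity sectors a single block
        # write needs on a raidzN vdev.  Data sectors are spread across the
        # (ndisks - nparity) data disks, and every row of data sectors gets
        # nparity parity sectors of its own.
        sector = 1 << ashift                         # vdev's fundamental sector size
        data_disks = ndisks - nparity
        data_sectors = math.ceil(write_bytes / sector)
        rows = math.ceil(data_sectors / data_disks)  # how many times we wrap around
        parity_sectors = rows * nparity
        return data_sectors, parity_sectors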

(This leads to the usual ZFS sizing suggestion for how many disks should be in a raidzN vdev. Basically you want a full block to be evenly divided over all of the data disks, so with the usual 128kb recordsize you might want 8 disks for the data plus N disks for the parity. This creates even disk usage for full sized writes.)

In the days of disks with 512 byte physical sectors it didn't take much data being written to use all of the vdev's disks; even a 4kb write could be sliced up into eight 512-byte chunks and thus use eight data disks (plus N more for parity). You might still have some unevenness, but probably not much. In the days of 4k sector disks, things can now be significantly different. In particular if you make a 4kb write it takes one 4kb sector on one disk for the data and then N more 4kb sectors on other disks for the parity. If you have a raidz2 vdev and write only 4kb blocks (probably as random writes) you will write twice as many blocks for parity as for data, for a write amplification ratio for your data of 3 to 1 (you've written 4kb at the user level, the disks write 12kb). Even a raidz1 vdev has a 2x write amplification for 4k random writes.
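
Plugging the numbers into the sketch above for a hypothetical 10-disk raidz2 vdev (8 data disks plus 2 parity disks, purely as an example) shows the difference:

    # A single 4kb write to a hypothetical 10-disk raidz2 vdev, with 512b
    # sector disks (ashift=9) versus 4k sector disks (ashift=12).
    for ashift in (9, 12):
        data, parity = raidz_sectors(4096, ashift, ndisks=10, nparity=2)
        written = (data + parity) << ashift
        print(f"ashift={ashift}: {data} data + {parity} parity sectors,"
              f" {written} bytes written for 4096 bytes of data"
              f" ({written / 4096:.2f}x write amplification)")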

(What may make this worse is that I believe that a lot of ZFS metadata is likely to be relatively small. On a raidzN vdev using 4k disks, much of it may not use all disks and thus suffer some degree of write amplification.)

The short way to put this is that in ZFS the parity overhead varies depending on your write blocksize. And on 4k sector disks it may well be higher than you expect.

There are some consequences of this for 4k sector drives. First, the larger your raidzN vdevs are (in terms of disks), the larger the writes you need in order to use all of the disks and reduce the actual overhead of parity. Second, if you want to minimize parity overhead it's important to divide data evenly across all of the data disks; if you roll over and use two 4k sectors for data on even one disk, ZFS needs two 4k sectors for parity on each parity disk. Since in real life your writes are probably going to be of various different sizes (and then there's metadata), 4k sector disks and ashift=12 will likely have higher parity overheads than 512b sector disks, and in general higher overheads than you'd expect from traditional RAID-5/6/etc.
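
To show how much the parity overhead can swing with the write size, here's the earlier sketch run over a range of block sizes on that same hypothetical 10-disk raidz2 vdev with ashift=12 (again, an example layout, not a recommendation):

    # Parity overhead versus write size on a hypothetical 10-disk raidz2
    # vdev with ashift=12 (4k sectors).
    for kb in (4, 8, 16, 32, 64, 128):
        data, parity = raidz_sectors(kb * 1024, ashift=12, ndisks=10, nparity=2)
        print(f"{kb:3d} kb write: {data:2d} data + {parity} parity sectors"
              f" ({parity / data:.0%} parity overhead)")

In this simplified model the overhead only drops to the 'expected' 25% (2 parity disks over 8 data disks) once a write fills all eight data disks evenly.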

I don't know if this makes ZFS raidzN less viable these days. Given the read performance issues, it was probably always best suited to slow(er) bulk data storage outside of special situations.

ZFSRaidzHowWritesWork written at 23:32:54

2013-11-02

Revising our peculiar ZFS L2ARC trick

Here is a very smart question that my coworkers asked me today: if we have an L2ARC that's big enough to cache basically the entire important bit of one pool, is there much of a point to having that pool's regular data storage on SSDs? After all, basically all of the reads should be satisfied out of the L2ARC so the read IO speed of the actual pool storage doesn't really matter.

(Writes can be accelerated with a ZIL SLOG if necessary.)

Our current answer is that there isn't any real point to using SSDs instead of HDs on such a pool, especially in our architecture (where we have plenty of drive bay space for L2ARC SSDs). In current ZFS the L2ARC is lost on reboots (or pool exports and imports) and has to be rebuilt over time as you read from the regular pool vdevs, but for us these are very rare events anyways; most of our current fileservers have uptimes of well over a year. You do need enough RAM to hold the L2ARC index metadata in memory but I think our contemplated fileserver setup will have that.

(The one uncertainty over memory is to what degree other memory pressure (including from the regular ZFS ARC) will push L2ARC metadata out of memory and thus effectively drop things from the L2ARC.)

Since I just looked this up in the Illumos kernel sources, L2ARC header information is considered ARC metadata and ARC metadata is by default limited to one quarter of the ARC (although the ARC can be most of your memory). If you need to change this, you want the tunable arc_meta_limit. To watch how close to the limit you're running, you want to monitor arc_meta_used in the ARC kernel stats. The current size of (in-memory) L2ARC metadata is visible in the l2_hdr_size kstat.
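
If you want a quick way to watch these numbers, here's a small Python sketch that shells out to the kstat reporting command on an Illumos system. It assumes 'kstat -p zfs:0:arcstats' is available and that the statistic names mentioned above exist in your version; adjust for your system.

    import subprocess

    # Minimal sketch: pull a few arcstats values via the kstat command on an
    # Illumos system.  Assumes 'kstat -p zfs:0:arcstats' works and that the
    # statistic names passed in exist in your version (they may vary).
    def arc_kstats(*names):
        out = subprocess.run(["kstat", "-p", "zfs:0:arcstats"],
                             capture_output=True, text=True, check=True).stdout
        stats = {}
        for line in out.splitlines():
            parts = line.split(None, 1)
            if len(parts) == 2:
                # keys look like zfs:0:arcstats:arc_meta_used
                stats[parts[0].rsplit(":", 1)[-1]] = parts[1].strip()
        return {name: stats.get(name) for name in names}

    print(arc_kstats("arc_meta_used", "arc_meta_limit", "l2_hdr_size"))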

(What exactly l2_hdr_size counts depends on the Illumos version. In older versions of Illumos I believe that it counts all L2ARC header data even if the data is currently in the ARC too. In modern Illumos versions it's purely for the headers of data that's only in the L2ARC, which is often the more interesting thing to know.)

ZFSLocalL2ARCTrickII written at 00:43:47

