The effects of losing a ZFS ZIL SLOG device, as I understand them

January 11, 2015

Back when we planned out our new fileservers, our plan for any ZIL SLOG devices we'd maybe eventually put on hot pools was to use mirrored SLOG SSDs, just as we use mirrored disks for the main data storage. At the time when I put together these plans, my general impression was that losing your SLOG was fatal for your pool; of course that meant we had to mirror them to avoid a single device failure destroying a pool. Since then I've learned more about the effects of ZIL SLOG failure and I am starting to reconsider this and related design decisions.
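As a concrete sketch of that mirrored-SLOG design, attaching a mirrored log pair to an existing pool looks something like this (the pool name and device names are illustrative placeholders, not our actual configuration):

```shell
# Add a mirrored log (SLOG) vdev to an existing pool; 'tank' and the
# cXtYdZ device names are placeholders for illustration only.
zpool add tank log mirror c1t0d0 c1t1d0

# Verify: the 'logs' section of the output should show the new mirror.
zpool status tank
```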

As far as I know and have gathered (but have not yet actually tested with our OmniOS version), ZIL SLOG usage goes like this. First, the ZIL is never read from in normal operation; the only time the ZIL is consulted is if the system crashes abruptly and ZFS has to recover IO that was acknowledged (e.g. that was fsync()'d) but not yet committed to regular storage as part of a transaction group. This means that if a pool or system was shut down in an orderly way and then the SLOG is not there on reboot, reimport, or whatever, you've lost nothing, since all of the acknowledged, fsync()'d in-flight IO was committed in a transaction group before the system shut down.
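One cheap way to experiment with this behavior (on a scratch machine, never production) is a throwaway pool built on file vdevs; this is just a sketch, and the paths and sizes are arbitrary test values:

```shell
# Build a disposable pool with a separate log device on file vdevs
# (OmniOS/Illumos; paths and sizes are arbitrary test values).
mkfile 256m /var/tmp/data0 /var/tmp/slog0
zpool create testpool /var/tmp/data0 log /var/tmp/slog0

# Now export the pool, move the slog file aside, and see what an
# orderly import without the SLOG looks like.
zpool export testpool
mv /var/tmp/slog0 /var/tmp/slog0.gone
zpool import -d /var/tmp testpool
```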

If the system crashed and then the pool SLOG turns out to have IO problems when you reboot, the regular pool metadata (and data) is still fully intact and anything that made it into a committed transaction group is on disk in the main pool. However, you have lost whatever was logged in the ZIL (well, the SLOG ZIL) since the last committed transaction group; effectively you've rolled the pool back to that transaction group, which will generally be a rollback of a few seconds. In some circumstances this may be hard to tell apart from the system crashing before applications even had a chance to call fsync() to ensure the data was on disk. In other situations, such as NFS fileservers, the server may have already told clients that the data was safe and they'll be quite put out to have it silently go missing.

Because the main pool metadata and data is intact, ZFS allows you to import pools that have lost their SLOG, even if they were shut down uncleanly and data has been lost (I assume that this may take explicit sysadmin action). Thus loss of a SLOG doesn't mean loss of a pool. Further, as far as I know, if the SLOG dies while the system is running you still don't lose data (or the pool); the system will notice the SLOG loss and just stop writing the ZIL to it. All data recorded in the SLOG will be in the main pool once the next TXG commits.
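The explicit sysadmin action here appears to be zpool import's -m flag, which tells ZFS to import a pool despite missing log devices; a sketch, with 'tank' and the device name as placeholders:

```shell
# Import a pool whose log device is missing or failed; any unreplayed
# ZIL records that lived only on the lost SLOG are discarded.
zpool import -m tank

# Afterwards the dead log vdev can be removed from the pool:
zpool remove tank c1t2d0   # placeholder device name
```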

So the situation where you will lose some data is if you have both a system crash (or power loss) and then a SLOG failure when the pool comes back up (or the SLOG fails and then the system crashes before the next TXG commit). Ordinary SLOG failure while the system is running is okay, as is 'orderly' SLOG loss if the pool goes down normally and then comes back without the SLOG. If you assume that system crashes and SLOG device failures are uncorrelated events, you would have to be very unlucky to have both happen at once. In short, you need a simultaneous loss situation in order to lose data.
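For the 'ordinary SLOG failure while the system is running' case, my understanding is that the failure simply shows up in zpool status and the device can be replaced online while the pool keeps running (again a sketch with placeholder names):

```shell
# The log vdev should show as FAULTED or UNAVAIL while the pool as a
# whole stays ONLINE, with the ZIL falling back to the main pool disks.
zpool status -v tank

# Swap in a replacement SSD for the dead log device:
zpool replace tank c1t2d0 c1t3d0   # old and new devices are placeholders
```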

This brings me to power loss protection for SSDs. Losing power will obviously 'crash' the system before it can commit the next TXG and get acknowledged data safely into the main pool, while many SSDs will lose some amount of recent writes if they lose power abruptly. Thus you can have a simultaneous loss situation if your SLOG SSDs don't have supercaps or some other form of power loss protection that lets them flush data from their onboard caches. It's worth noting that mirroring your SLOG SSDs doesn't help with this; power loss will again create a simultaneous loss situation in both sides of the mirror.

(In theory ZFS issues cache flush commands to the SSDs as part of writing the ZIL out and the SSDs should then commit this data to flash. In practice I've read that a bunch of SSDs just ignore the SATA cache flush commands in the name of turning in really impressive benchmark results.)

PS: This is what I've gathered from reading ZFS mailing lists and so on, and so some of it may be wrong; I welcome corrections or additional information. I'm definitely going to do my own testing to confirm things on our specific version of OmniOS (and in our specific hardware environment and so on) before I fully trust any of this, and I wouldn't be surprised to find corner cases. If nothing else, I need to find out what's involved in bringing up a pool with a missing or failed SLOG.

Comments on this page:

By James (trs80) at 2015-01-11 09:13:14:

"ZFS allows you to import pools that have lost their SLOG, even if they were shut down uncleanly and data has been lost" This is only true for pool versions 19 and above; before then, the loss of a slog did cause pool loss, which is probably the source of your impression.

By Alex at 2015-01-15 05:16:39:

Another aspect to consider is performance: if you depend on the ZIL to speed up synchronous writes, losing your sole, non-mirrored slog device will degrade performance until you have it replaced, which might not be acceptable. (Typically for a VMware NFS backend, where all writes are synchronous.)

By Chip Schweiss at 2015-06-22 12:20:44:

I've found out the hard way that losing a single log device on a running system does at least leave a permanent mark on the pool. There is no data loss, but the ZFS filesystem that had transactions in flight at the time of the log device failure gets flagged as having permanent errors. While running off of my DR pool, a ZeusRAM failed. Now, until I can resync the entire ZFS filesystem, the pool has this scar.

Scrubs show no data errors, and even a full file compare against the primary pool found none. There is a short thread about this on the Illumos mailing list.

root@mir-dr-zfs01:/root# zpool status -v drpool
  pool: drpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
  scan: scrub repaired 0 in 73h55m with 0 errors on Wed Jan 21 09:51:29 2015

        NAME                       STATE     READ WRITE CKSUM
        drpool                     ONLINE       0     0     0
          raidz2-0                 ONLINE       0     0     0
            c1t5000C5006251FFEBd0  ONLINE       0     0     0
            c1t5000C5006252A867d0  ONLINE       0     0     0
.... lots of disks not listed ....
          c2t3d0s0                 ONLINE       0     0     0
          c1t5000C50062532063d0    AVAIL
          c1t5000C50062533883d0    AVAIL
          c1t5000C50062533967d0    AVAIL
          c1t5000C50062520793d0    AVAIL
          c1t5000C50062521153d0    AVAIL   

errors: Permanent errors have been detected in the following files:
