The effects of losing a ZFS ZIL SLOG device, as I understand them

January 11, 2015

Back when we planned out our new fileservers, our plan for any ZIL SLOG devices we might eventually put on hot pools was to use mirrored SLOG SSDs, just as we use mirrored disks for the main data storage. At the time I put those plans together, my general impression was that losing your SLOG was fatal for your pool; that of course meant we had to mirror SLOGs to avoid a single device failure destroying a pool. Since then I've learned more about the effects of ZIL SLOG failure, and I'm starting to reconsider this and related design decisions.

As far as I know and have gathered (but have not yet actually tested with our OmniOS version), ZIL SLOG usage goes like this. First, the ZIL is never read from in normal operation; the only time the ZIL is consulted is if the system crashes abruptly and ZFS has to recover IO that was acknowledged (eg that was fsync()'d) but not yet committed to regular storage as part of a transaction group. This means that if a pool or system was shut down in an orderly way and then the SLOG is not there on reboot, reimport, or whatever, you've lost nothing since all of the acknowledged, fsync()'d in-flight IO was committed in a transaction group before the system shut down.
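(To make 'acknowledged' concrete, here is a minimal sketch of the application side of this in Python; the file path is made up and I'm only assuming ordinary POSIX fsync() semantics. The data at risk in everything below is exactly the data that has been fsync()'d this way but not yet committed as part of a transaction group.)

    import os

    # Open a file on a hypothetical ZFS filesystem and write something
    # the application needs to survive a crash.
    fd = os.open("/tank/fs/important.dat", os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, b"data we must not lose\n")
        # fsync() only returns once ZFS has the write safely in the ZIL
        # (on the SLOG, if the pool has one). This is the 'acknowledged
        # but not yet in a committed transaction group' window.
        os.fsync(fd)
    finally:
        os.close(fd)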

If the system crashed and then the pool's SLOG turns out to have IO problems when you reboot, the regular pool metadata (and data) is still fully intact and anything that made it into a committed transaction group is on disk in the main pool. However, you have lost whatever was logged in the ZIL (well, the SLOG ZIL) since the last committed transaction group; effectively you've rolled the pool back to that last transaction group, which will generally be a rollback of a few seconds of activity. In some circumstances this may be hard to tell apart from the system crashing before applications even had a chance to call fsync() to ensure the data was on disk. In other situations, such as NFS fileservers, the server may have already told clients that the data was safe, and they'll be quite put out to have it silently go missing.

Because the main pool metadata and data are intact, ZFS allows you to import pools that have lost their SLOG, even if they were shut down uncleanly and data has been lost (I assume that this may take explicit sysadmin action). Thus loss of a SLOG doesn't mean loss of a pool. Further, as far as I know, if the SLOG dies while the system is running you still don't lose data (or the pool); the system will notice the SLOG loss and simply stop writing the ZIL to it, falling back to keeping the ZIL in the main pool. All data recorded in the SLOG will be in the main pool once the next TXG commits.
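(My best guess at what that explicit sysadmin action looks like, based on the illumos zpool manpage rather than on testing: 'zpool import' has a -m flag that is documented to allow importing a pool whose log device is missing. Here's a minimal, untested sketch of driving it, with a hypothetical pool name; on a real fileserver you'd just run the command by hand.)

    import subprocess

    def import_pool_with_missing_slog(pool):
        # 'zpool import -m <pool>' is documented to let the import proceed
        # even though a log device is missing, accepting the loss of
        # anything that was only recorded in the (SLOG) ZIL.
        subprocess.run(["zpool", "import", "-m", pool], check=True)

    import_pool_with_missing_slog("tank")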

So the situation where you will lose some data is if you have both a system crash (or power loss) and then a SLOG failure when the pool comes back up (or the SLOG fails and then the system crashes before the next TXG commit). Ordinary SLOG failure while the system is running is okay, as is 'orderly' SLOG loss if the pool goes down normally and then comes back without the SLOG. If you assume that system crashes and SLOG device failures are uncorrelated events, you would have to be very unlucky to have both happen at once. In short, you need a simultaneous loss situation in order to lose data.
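(A back-of-envelope sketch of why 'very unlucky' seems fair if the two events really are uncorrelated; the rates here are made-up illustrative numbers, not measurements from our environment.)

    # Hypothetical rates, purely for illustration.
    crashes_per_year = 2          # abrupt crashes or power losses per year
    p_slog_bad_at_import = 0.01   # chance the SLOG has also died when the pool comes back

    # Chance of at least one crash in a year coinciding with a dead SLOG,
    # treating the two as independent events.
    p_loss_in_a_year = 1 - (1 - p_slog_bad_at_import) ** crashes_per_year
    print("rough chance of hitting the loss window in a year: %.2f%%" % (p_loss_in_a_year * 100))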

This brings me to power loss protection for SSDs. Losing power will obviously 'crash' the system before it can commit the next TXG and get acknowledged data safely into the main pool, while many SSDs will lose some amount of recent writes if they lose power abruptly. Thus you can have a simultaneous loss situation if your SLOG SSDs don't have supercaps or some other form of power loss protection that lets them flush data from their onboard caches. It's worth noting that mirroring your SLOG SSDs doesn't help with this; power loss will again create a simultaneous loss situation on both sides of the mirror.

(In theory ZFS issues cache flush commands to the SSDs as part of writing the ZIL out and the SSDs should then commit this data to flash. In practice I've read that a bunch of SSDs just ignore the SATA cache flush commands in the name of turning in really impressive benchmark results.)

PS: This is what I've gathered from reading ZFS mailing lists and so on, and so some of it may be wrong; I welcome corrections or additional information. I'm definitely going to do my own testing to confirm things on our specific version of OmniOS (and in our specific hardware environment and so on) before I fully trust any of this, and I wouldn't be surprised to find corner cases. If nothing else, I need to find out what's involved in bringing up a pool with a missing or failed SLOG.
