Wandering Thoughts archives

2015-01-15

General ZFS pool shrinking will likely be coming to Illumos

Here is some great news. It started with this tweet from Alex Reece (which I saw via @bdha):

Finally got around to posting the device removal writeup for my first open source talk on #openzfs device removal! <link>

'Device removal' sounded vaguely interesting but I wasn't entirely sure why it called for a talk, since ZFS can already remove devices. Still, I'll read ZFS-related things when I see them go by on Twitter, so I did. And my eyes popped right open.

This is really about being able to remove vdevs from a pool. In its current state I think the code requires all vdevs to be bare disks, which is not too useful for real configurations. But now that the big initial work has been done, I suspect there will be a rush of people improving it to cover more cases once it goes upstream to mainline Illumos (or before). Even being able to remove bare disks from pools with mirrored vdevs would be a big help for the 'I accidentally added a disk as a new vdev instead of as a mirror' situation that comes up periodically.

(This mistake is the difference between 'zpool add POOL DEV1 DEV2' and 'zpool add POOL mirror DEV1 DEV2'. You spotted the one word added to the second command, right?)
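As a concrete sketch of the mistake (pool and device names are invented here), the two commands and what they do to the pool layout look like this:

```shell
# The mistake: each disk becomes its own single-disk top-level vdev,
# striped alongside the existing mirrors with no redundancy.
zpool add tank c1t5d0 c1t6d0

# What was intended: the two disks form one new mirrored vdev.
zpool add tank mirror c1t5d0 c1t6d0

# 'zpool status tank' afterwards shows the difference:
#   mistake:            intended:
#     c1t5d0              mirror-1
#     c1t6d0                c1t5d0
#                           c1t6d0
```

Today the first form is effectively irreversible, which is why vdev removal matters so much.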

While this is not quite the same thing as an in-place reshape of your pool, a fully general version of this would let you move a pool from, say, mirroring to raidz provided that you had enough scratch disks for the transition (either because you are the kind of place that has them around or because you're moving to new disks anyways and you're just arranging them differently).

(While you can do this kind of 'reshaping' today by making a completely new pool and using zfs send and zfs receive, there are some advantages to being able to do it transparently and without interruptions while people are actively using the pool).
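The send/receive version of this reshaping goes roughly as follows (pool and snapshot names are made up, and a real migration needs the filesystems idle for the final incremental pass):

```shell
# Create the new pool with the desired layout, eg raidz on scratch disks.
zpool create newpool raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0

# Replicate the old pool into it while it's still in use.
zfs snapshot -r oldpool@migrate1
zfs send -R oldpool@migrate1 | zfs receive -F newpool

# Later: lock users out, take a final snapshot, and send the increment.
zfs snapshot -r oldpool@migrate2
zfs send -R -i oldpool@migrate1 oldpool@migrate2 | zfs receive -F newpool
# Then rename or re-point mounts and shares at newpool.
```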

This feature has been a wishlist item for ZFS for so long that I'd long since given up on ever seeing it. To have even a preliminary version of it materialize out of the blue like this is simply amazing (and I'm a little bit surprised that this is the first I heard of it; I would have expected an explosion of excitement as the news started going around).

(Note that there may be an important fundamental limitation about this that I'm missing in my initial enthusiasm and reading. But still, it's the best news about this I've heard for, well, years.)

ZFSPoolShrinkingIsComing written at 00:25:11

2015-01-13

Our tradeoffs on ZFS ZIL SLOG devices for pools

As I mentioned in my entry on the effects of losing a SLOG device, our initial plan (or really idea) for SLOGs in our new fileservers was to use a mirrored pair for each pool that we gave a SLOG to, split between iSCSI backends as usual. This is clearly the most resilient choice for a SLOG setup, assuming that you have SSDs with supercaps; it would take a really unusual series of events to lose any committed data in the pool.
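For reference, this mirrored-SLOG setup is just a mirrored log vdev, added to an existing pool with something like the following (pool and device names invented, with one device per iSCSI backend):

```shell
# Add a mirrored pair of supercap SSDs as the pool's separate ZIL (SLOG).
zpool add tank log mirror c3t0d0 c4t0d0

# 'zpool status tank' then lists them under a separate 'logs' section.
```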

On the ZFS mailing lists that I've read, there are plenty of people who think that using mirrored SSDs for your SLOG is overkill given the extremely unlikely event of a simultaneous server and SLOG failure. Going unmirrored would obviously save us one SLOG device (or chunk) per pool, which has its obvious attractions.

If we're willing to drop to one SLOG device per pool and live with the resulting small chance of data loss, a more extreme possibility is to put the SLOG device on the fileserver itself instead of on an iSCSI backend. The potential big win here would be moving from iSCSI to purely local IO, which presumably has lower latency and thus would enable the fileserver to respond to synchronous NFS operations faster. The drawback is that we couldn't fail over pools to another fileserver without either abandoning the SLOG (with potential data loss) or physically moving the SLOG device to the other fileserver. While we've almost never failed over pools, especially remotely, I'm not sure we want to abandon the possibility quite so definitely.

(And before we went down this road we'd definitely want to measure the IO latencies of SLOG writes to a local SSD versus SLOG writes to an iSCSI SSD. It may well be that there's almost no difference, at which point giving up the failover advantages would be relatively crazy.)

Since we aren't yet at the point of trying SLOGs on any pools or even measuring our volume of ZIL writes, all of this is idle planning for now. But I like to think ahead and to some extent it affects things like how many bays we fill in the iSCSI backends (we're currently reserving two bays on each backend for future SLOG SSDs).
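When we do get around to measuring, my understanding is that on Illumos-derived systems something like Richard Elling's zilstat DTrace script is the usual way to see the volume of ZIL writes (exact options depend on the version you have):

```shell
# zilstat reports bytes and operations going to the ZIL per interval;
# a pool showing essentially zero ZIL traffic won't benefit from a SLOG.
zilstat 10 6          # ten-second intervals, six samples, system-wide

# Per-pool reporting, if your version of the script supports it:
zilstat -p tank 10 6
```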

PS: Even if we have a low volume of ZIL writes in general, we may find that we hit the ZIL hard during certain sorts of operations (perhaps eg unpacking tarfiles or doing VCS operations) and it's worth adding SLOGs just so we don't perform terribly when people do them. Of course this is going to be quite affected by the price of appropriate SSDs.

ZFSOurSLOGTradeoffs written at 00:38:50

2015-01-11

The effects of losing a ZFS ZIL SLOG device, as I understand them

Back when we planned out our new fileservers, our plan for any ZIL SLOG devices we'd maybe eventually put on hot pools was to use mirrored SLOG SSDs, just as we use mirrored disks for the main data storage. At the time when I put together these plans, my general impression was that losing your SLOG was fatal for your pool; of course that meant we had to mirror them to avoid a single device failure destroying a pool. Since then I've learned more about the effects of ZIL SLOG failure and I am starting to reconsider this and related design decisions.

As far as I know and have gathered (but have not yet actually tested with our OmniOS version), ZIL SLOG usage goes like this. First, the ZIL is never read from in normal operation; the only time the ZIL is consulted is if the system crashes abruptly and ZFS has to recover IO that was acknowledged (eg that was fsync()'d) but not yet committed to regular storage as part of a transaction group. This means that if a pool or system was shut down in an orderly way and then the SLOG is not there on reboot, reimport, or whatever, you've lost nothing since all of the acknowledged, fsync()'d in-flight IO was committed in a transaction group before the system shut down.

If the system crashed and then the pool's SLOG turns out to have IO problems when you reboot, the regular pool metadata (and data) is still fully intact and anything that made it into a committed transaction group is on disk in the main pool. However, you have lost whatever was logged in the ZIL (well, the SLOG ZIL) since the last committed transaction group; effectively you've rolled back the pool to that transaction group, which will generally be a rollback of a few seconds. In some circumstances this may be hard to tell apart from the system crashing before applications even had a chance to call fsync() to ensure the data was on disk. In other situations, such as NFS fileservers, the server may have already told clients that the data was safe and they'll be quite put out to have it silently go missing.

Because the main pool metadata and data is intact, ZFS allows you to import pools that have lost their SLOG, even if they were shut down uncleanly and data has been lost (I assume that this may take explicit sysadmin action). Thus loss of an SLOG doesn't mean loss of a pool. Further, as far as I know if the SLOG dies while the system is running you still don't lose data (or the pool); the system will notice the SLOG loss and just stop writing the ZIL to it. All data recorded in the SLOG will be in the main pool once the next TXG commits.
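As far as I know, the explicit sysadmin action involved is the -m flag to zpool import, which tells ZFS to go ahead despite a missing log device (pool and device names invented):

```shell
# Import a pool whose SLOG is missing or failed; any ZIL records that
# were only on the lost device are discarded.
zpool import -m tank

# Afterwards the dead log device can be dropped from the pool config
# (log and cache vdevs, unlike data vdevs, have long been removable).
zpool remove tank c3t0d0
```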

So the situation where you will lose some data is if you have both a system crash (or power loss) and then a SLOG failure when the pool comes back up (or the SLOG fails and then the system crashes before the next TXG commit). Ordinary SLOG failure while the system is running is okay, as is 'orderly' SLOG loss if the pool goes down normally and then comes back without the SLOG. If you assume that system crashes and SLOG device failures are uncorrelated events, you would have to be very unlucky to have both happen at once. In short, you need a simultaneous loss situation in order to lose data.

This brings me to power loss protection for SSDs. Losing power will obviously 'crash' the system before it can commit the next TXG and get acknowledged data safely into the main pool, while many SSDs will lose some amount of recent writes if they lose power abruptly. Thus you can have a simultaneous loss situation if your SLOG SSDs don't have supercaps or some other form of power loss protection that lets them flush data from their onboard caches. It's worth noting that mirroring your SLOG SSDs doesn't help with this; power loss will again create a simultaneous loss situation in both sides of the mirror.

(In theory ZFS issues cache flush commands to the SSDs as part of writing the ZIL out and the SSDs should then commit this data to flash. In practice I've read that a bunch of SSDs just ignore the SATA cache flush commands in the name of turning in really impressive benchmark results.)

PS: This is what I've gathered from reading ZFS mailing lists and so on, and so some of it may be wrong; I welcome corrections or additional information. I'm definitely going to do my own testing to confirm things on our specific version of OmniOS (and in our specific hardware environment and so on) before I fully trust any of this, and I wouldn't be surprised to find corner cases. If nothing else, I need to find out what's involved in bringing up a pool with a missing or failed SLOG.

ZFSSLOGLossEffects written at 02:59:03
