Wandering Thoughts archives

2016-09-22

Why we've wound up without ZFS ZILs or L2ARCs on our pools

Back when we designed the current generation of our ZFS fileservers, we expected to wind up putting in at least some ZILs (meaning separate ZIL log devices, mirrored and in the backends) and L2ARCs (in the OmniOS fileservers). This has not happened; all of our plans for it have basically fallen through. There are a number of reasons for this (independent of my thoughts on why an L2ARC probably isn't a good fit for us, which sort of came later).

In one sense, the biggest reason is good news: we haven't felt the need to work on adding them because fileserver performance doesn't obviously suck. The fileservers work well and there's no clear bottleneck to their performance. But this is also kind of bad news. Performance could probably be better with a ZIL or L2ARC, at least for some pools, but there's no simple, easy, and especially always-there way of seeing how much improvement there might be. You can gather ZIL usage information with DTrace scripts, but you have to actually go out and do it (and you have to figure out which metrics are important). As far as I know there are no kstats that track things like ZIL commits, volume written to the ZIL, and so on; without using DTrace, you really don't have any idea how active the ZIL is for a pool.
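As a rough sketch of what 'going out and doing it' might look like: a DTrace one-liner along the lines of `dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }'` should count calls to the kernel's zil_commit() function by process name and print the aggregation as name/count columns when interrupted (this assumes the fbt provider exposes zil_commit on your kernel). The Python below only summarizes text in that two-column shape. The helper name summarize_zil_commits and the sample numbers are invented for illustration, and treating zil_commit() call counts as the interesting metric is itself an assumption, not a validated measure of ZIL load.

```python
def summarize_zil_commits(dtrace_output):
    """Parse 'name  count' lines from a DTrace count() aggregation.

    Returns (total, ranked), where ranked is a list of (name, count)
    pairs sorted by count, largest first. Lines that don't end in an
    integer (blank lines, dtrace's own status messages) are skipped.
    """
    counts = {}
    for line in dtrace_output.splitlines():
        # Split off the last whitespace-separated field as the count.
        parts = line.rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        name = parts[0].strip()
        counts[name] = counts.get(name, 0) + int(parts[1])
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return sum(counts.values()), ranked


# Made-up sample in the shape of DTrace aggregation output;
# these are not real measurements from any fileserver.
SAMPLE = """
dtrace: description 'fbt::zil_commit:entry ' matched 1 probe
  sshd                                                              3
  smbd                                                            120
  nfsd                                                           1877
"""

if __name__ == "__main__":
    total, ranked = summarize_zil_commits(SAMPLE)
    print("total zil_commit() calls:", total)
    for name, count in ranked:
        print(f"{name:<12}{count:>8}")
```

Even with something like this you only learn how often commits happen during the sampling window, not how much a separate log device would actually speed them up.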

The other big reason is that there are a lot of practical questions about what happens when things go wrong, and ZFS doesn't currently document clear answers to them. Before we could add either a ZIL or an L2ARC to a production pool, we'd need to test all of these things, and there's a daunting list of failure scenarios to test (L2ARC goes away during operation, L2ARC not present on reboot, and so on and so forth). Building a test environment and grinding through all of these is a lot of work to undertake when we don't even have a clear need established. And of course we'd also have to test a ZIL (or an L2ARC) in normal usage, just to make sure it didn't have any adverse consequences and actually did deliver the benefits we expected.

(We'd also inevitably need to change and update our management tools and our monitoring systems, including our spares system.)

At a practical level, we've actually dealt with the most important, clearest, and easiest cases of 'we need high performance here' by building out a couple of all-SSD pools. These hold /var/mail and some other core system filesystems, and much of the disk space for the departmental administrative staff (who are the heartbeat of the department and do everything over Samba from managed Windows machines; historically they can really create load).

So after all the dust has settled, it's simply been easier to keep on going without either ZILs or L2ARCs. We don't obviously need them, they might not do us any real good if we actually deployed them in our environment, and they require work to investigate and to deploy. At this point it seems likely that we'll remain without them for the remaining lifetime of this fileserver generation (which I hope starts running out in 2018, but we'll see). It's also my deep hope that the next generation of fileservers will be built around all-SSD storage, which will render many of these issues moot.

solaris/ZFSWhyNoZILOrL2ARCForUse written at 00:04:37