Wandering Thoughts archives


SSDs may make ZFS raidz viable for general use

The classic problem and surprise with ZFS's version of RAID-5+ (raidz1, raidz2, and so on) is that you get much less read IO from your pool than most people expect. Rather than N disks' worth of read IOPS, you get (more or less) one disk's worth for small random reads. To date this has mostly made raidz unsuitable for general use; you need to be doing relatively little random read IO or have rather low performance requirements to avoid being disappointed.

(Sequential read IO is less affected. Although I haven't tested or measured it, I believe that ZFS raidz will saturate your available disk bandwidth for predictable read patterns.)

Or rather, this has made raidz unsuitable because hard drives have such low IOPS rates (generally assumed to be around 100 a second) that having only one disk's worth is terrible. But SSDs have drastically higher read IOPS; one SSD's worth of reads a second is still generally an impressively high number. While a raidz pool of SSDs will not have as high an IOPS rate as a bunch of mirrored SSDs, you'll get a lot more storage for your money. And a single SSD's worth of IOPS may well be enough to saturate other parts of your system (or at least more than satisfy their performance needs).
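To make the difference concrete, here's a back-of-the-envelope sketch. All of the numbers are illustrative assumptions (the traditional ~100 IOPS figure for hard drives and a made-up 50,000 IOPS figure for a SATA SSD), and real raidz read behaviour is more nuanced than 'exactly one disk per vdev', but the rough shape is right:

```python
# Rough model of small random read IOPS for two 12-disk pool layouts.
# Assumption: a raidz vdev serves small random reads at roughly one
# disk's rate, while every disk in a mirror can serve reads, and reads
# spread across all vdevs in the pool.

def raidz_read_iops(vdevs, disk_iops):
    # Each raidz vdev reads like a single disk for small random IO.
    return vdevs * disk_iops

def mirror_read_iops(disks_per_mirror, mirrors, disk_iops):
    # Every side of every mirror can serve reads independently.
    return disks_per_mirror * mirrors * disk_iops

HDD_IOPS = 100      # classic assumption for spinning disks
SSD_IOPS = 50_000   # invented but plausible figure for a SATA SSD

# Twelve disks as two 6-disk raidz2 vdevs versus six 2-way mirrors:
print(raidz_read_iops(2, HDD_IOPS))        # 200    -- dismal on HDDs
print(mirror_read_iops(2, 6, HDD_IOPS))    # 1200
print(raidz_read_iops(2, SSD_IOPS))        # 100000 -- plenty on SSDs
print(mirror_read_iops(2, 6, SSD_IOPS))    # 600000

# The storage tradeoff: the raidz2 layout keeps 4 of 6 disks' worth of
# data per vdev (8 disks total) versus only 6 disks' worth for mirrors.
```

The mirrored SSD pool is still six times faster in this model, but 100,000 small random reads a second from the raidz2 pool is likely far more than the rest of the system can consume anyway.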

(There are other tradeoffs, of course. A raidzN will protect you from any arbitrary N disks dying, which mirrors can't promise, but it can't protect you from a whole controller falling over the way a distributed set of mirrors can.)

This didn't even occur to me until today because I've been conditioned to shy away from raidz; I 'knew' that it performed terribly for random reads and hadn't thought through the implications of moving raidz from HDs to SSDs. I don't think this will change our general plans (we value immunity from a single iSCSI backend failing), but it's certainly something I'm going to keep in mind just in case.

solaris/ZFSViableRaidzWithSSDs written at 22:15:45

A peculiar use of ZFS L2ARC that we're planning

In our SAN-based fileserver infrastructure we have a relatively small but very important and very busy pool. We need to be able to fail over this pool to another physical fileserver, so its data storage has to live on our iSCSI backends. But even with it on SSDs on the backends, going over the network with iSCSI adds latency and probably reduces bandwidth somewhat. We're not willing to move the pool to local storage on a fileserver; it's much more important that the pool stay up than that it be blindingly fast (especially since it's basically fast enough now). Oh, and it's generally much more important that reads be fast than writes.

But there is a way around this, assuming that you're willing to live with failover taking manual work (which we are): a large local L2ARC plus the regular SAN data storage. This particular pool is small enough that we can basically fit all of its data into an affordable L2ARC SSD (and certainly all of the active data). A local L2ARC gives us local (read) IO for speed and effectively reduces the actual backend data storage to a persistence mechanism.
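The sizing question here is simple arithmetic: does the pool's data (plus some headroom) fit on one affordable SSD? A sketch with invented numbers, since our actual pool and SSD sizes aren't given:

```python
# Will a local L2ARC device cover essentially all of a pool's data?
# The figures below are made up for illustration; the point is that a
# small pool can be cached in its entirety, not just its hot subset.

def l2arc_covers_pool(pool_used_gb, l2arc_gb, headroom=0.9):
    # Leave some headroom: you don't want to size the cache at exactly
    # 100% of the data it's supposed to hold.
    return pool_used_gb <= l2arc_gb * headroom

print(l2arc_covers_pool(150, 200))   # True: a 200 GB SSD covers a 150 GB pool
print(l2arc_covers_pool(150, 160))   # False: too tight once headroom counts
```

For a large pool this check fails and you're back to ordinary cache behaviour, where effectiveness depends on how concentrated the active data is.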

What makes this work is that a pool will import and run without its L2ARC device(s). Because L2ARC is only a cache, ZFS is willing to bring up a pool with missing L2ARC devices. If we have to fail over the pool to another fileserver it will come up without L2ARC and be slower, but at least it will come up.

(A local L2ARC plus SAN data storage works for any pool and is what we're planning in general when we renew our fileserver infrastructure (hopefully soon). But it may have limited effectiveness for large pools, based on usage patterns and so on. What makes this particular pool special is that it's small enough that the L2ARC can basically store all of it. And the L2ARC doesn't need to be mirrored or anything expensive.)

PS: given that this pool is already on SSDs, I don't think that there's any point to a separate log device. Unlike an L2ARC, a SLOG holds data the pool can't afford to lose, so it would have to live in the SAN and be mirrored; we couldn't get away with a local SLOG plus the data in the SAN.

solaris/ZFSLocalL2ARCTrick written at 11:52:51

Funding and the size of hardware you want to buy

We need a new core router (among other things) and we'd also like to move towards 10G Ethernet in our machine room. In general there are two broad approaches in this situation: you can buy a (small) router and then various separate switches, or you can buy a single big core router+switch unit with many ports. There are various issues involved in either choice and in the past we've gone back and forth between them in our network design. During discussions about this today I had an obvious-in-retrospect realization about our specific situation.

Our equipment funding is quite erratic and as a result we buy stuff mostly in bursts. In this environment the problem with a single big unit is that significant updates to it are likely to be quite expensive, and you probably simply won't have that money in one lump all at once. To put it another way, you may very well not be able to do a slow rolling upgrade with a single big unit. With separate small pieces of hardware you can do piece-by-piece replacements as you get much smaller chunks of money over time: first you update a switch here, then a switch there, then you replace the (modest) router, and so on.

(Big units like this are often modular but that modularity has limits. After a certain amount of time it's very likely that the vendor is going to stop developing new modules for your chassis; if you want the new whatever, you have to upgrade the chassis and probably a number of other things as well.)

This should really not have surprised me because it's exactly one of the drawbacks of having a single big server to do a lot of things instead of spreading the same things out over a bunch of smaller servers. Sooner or later you're going to have to replace the big server and that's going to be a lot of money at once. Upgrading the smaller servers may cost just as much (or more), but you can spread that cost out much more.
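The funding shape matters more than the total. A toy illustration with invented dollar figures: the two approaches below cost about the same over five years, but one of them requires finding a large lump sum in a single year, which erratic funding may never deliver.

```python
# Toy comparison of spending shapes (all dollar figures invented).
# A single big unit concentrates its refresh cost in one year; a
# collection of small units spreads a similar total over every year.

big_unit_refresh = [0, 0, 0, 0, 120_000]   # one big chassis, replaced in year 5
rolling_refresh  = [25_000] * 5            # replace one small box each year

print(sum(big_unit_refresh), sum(rolling_refresh))  # 120000 125000: similar totals
print(max(big_unit_refresh), max(rolling_refresh))  # 120000 25000: very different peaks
```

With erratic funding, the question isn't "which costs less overall" but "which has a peak-year cost you can actually cover".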

(In a sense this is a sad thing if the big server or the big core network box offers economies of scale or other benefits. But note that the overall organization may be getting important benefits from being able to spend the same amount of money in a steady but moderate stream instead of very bursty large chunks.)

Sidebar: the spares issue

Another issue with a crucial core router is spares and redundancy. With modular units you can stock a sufficient number of spare modules instead of fully duplicating the unit, but chassis can break too and a spare one is probably not cheap. With a big unit (even a modular one) you're effectively paying for more spares and redundancy than you actually need.
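The spares arithmetic can be sketched the same way, again with invented prices (a hypothetical chassis, module, and small-switch cost), just to show how the spare-chassis requirement inflates the bill for the big-unit approach:

```python
# Spares cost sketch (all prices invented for illustration).
# Even a modular big unit needs a spare chassis on the shelf, and the
# chassis tends to dominate the spares bill.

CHASSIS = 30_000   # hypothetical big-unit chassis price
MODULE  = 5_000    # hypothetical line-card/module price
SWITCH  = 4_000    # hypothetical small standalone switch price

big_unit_spares   = CHASSIS + 2 * MODULE   # spare chassis + two spare modules
small_unit_spares = 2 * SWITCH             # two spare switches cover the fleet

print(big_unit_spares, small_unit_spares)  # 35000+5000=40000 versus 8000
```

The exact numbers don't matter; the structural point is that the big unit forces you to keep an expensive, mostly-idle chassis as a spare, while the small-unit fleet shares a couple of cheap interchangeable spares.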

sysadmin/FundingAndHardwareSize written at 01:04:05
