A retrospective on our overall fileserver architecture et al

June 27, 2014

My look back at how our fileserver environment has done over the past six years has focused on the Solaris fileservers (both the stuff that worked nicely and the stuff that didn't quite work out). Today I'm widening my focus to how well the whole idea and some of our overall decisions have worked out.

I think that the safest thing to say about our overall architecture is that it's worked well for us; taking a proper retrospective look back at the constraints and so on involved in the design would require an entire entry by itself. Basing our SAN on iSCSI is still clearly the best option for low cost and wide availability of both frontends and backends. Having a SAN itself and decoupling the front end fileservers from their backend storage has been a significant advantage in practice; we've been able to do things like shuffle around entire backend machines in live production and bring up an entire second hot spare backend over Christmas vacation just in case. Effectively, having a SAN (or in our case two of them) means avoiding a single point of (long-term) failure for any particular fileserver. We like this overall architecture enough that we're replicating it in the second generation environment that we're building out now.

(To be fair, using a SAN did create a single point of performance problems.)

I don't think we've ever actually needed to have two iSCSI SAN networks instead of one, but it was fairly cheap insurance and I have no qualms about it. I believe we've taken advantage of it a few times (eg to deliberately swap out one iSCSI switch while things were live, although I think we did that during a formal downtime just in case).

Mirroring all storage between two separate iSCSI backends (at least) has been a great thing. Mirroring has given us better IO performance and has enabled us to ride out the failure of entire backend units without any significant user-visible issues. It's also made expanding people's ZFS pools much simpler, partly because it means we can do it in relatively small units. A lot of our pools are actually too small to be really viable RAID-5+ pools.
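(To make the mirroring arrangement concrete, here is a minimal sketch in Python of the idea; the disk names and pool layout are entirely hypothetical, not our real configuration. Each mirror vdev pairs one disk from each backend, so a pool keeps a full copy of its data even if a whole backend goes away, and a pool grows one mirrored pair at a time.)

	# A toy model of mirrored pairs across backends; the disk names and
	# pool layout here are hypothetical, not our real configuration.

	# Each mirror vdev pairs one disk (or disk chunk) from backend A with
	# one from backend B, so losing an entire backend still leaves every
	# vdev with a working side.
	pool = [
	    ("backendA-disk0", "backendB-disk0"),
	    ("backendA-disk1", "backendB-disk1"),
	]

	def grow_pool(pool, disk_a, disk_b):
	    """Expand the pool by one mirrored pair; this is the 'relatively
	    small unit' of growth, unlike adding a whole RAID-5+ stripe."""
	    pool.append((disk_a, disk_b))

	def survives_backend_loss(pool, dead_backend):
	    """True if every mirror vdev still has a side outside the dead backend."""
	    return all(any(not disk.startswith(dead_backend) for disk in vdev)
	               for vdev in pool)

	grow_pool(pool, "backendA-disk2", "backendB-disk2")
	print(survives_backend_loss(pool, "backendA"))   # True: backend B holds a full copy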

I'm convinced that using decent consumer-grade SATA disks has been a serious win. Even our 'bad' 1TB Seagate drives have given us decent service out to the end of their warranty lifetimes and beyond, and the price savings over more costly disks are what made our entire design feasible in the first place. With unlimited budget, sure, fill up the racks with 15K RPM SAS enterprise drives (and now SSDs), but within our constraints I don't think we could have done better and our disks have undeniably worked and delivered decent performance.

Using inexpensive external disk enclosure hardware on the backends worked but has caused us moderate problems over the long run, because the disk enclosures just aren't as solid as the server hardware. They are basically PC cases, PC power supplies, and a bunch of PC fans and so on, plus drive bays and wiring. We've had a number of power supply failures, by now a number of the fans have died (and can't really be replaced) with the accompanying increase in disk temperature, and so on. Having only single power supplies leaves the disk enclosures vulnerable to various power feed problems in practice. We're quite looking forward to moving to a better class of hardware in the next generation, with dual power supplies, easily replaced fans, and simply better engineering and construction.

(This means that to some extent the easy failover between backends created by using a SAN has only been necessary because our backends keep falling over due to this inexpensive hardware. We've never migrated backend storage around for other reasons.)

Using eSATA as the way to connect up all of the disks worked but, again, not without moderate problems. The largest of these is that disk resets (due to errors or just pulling a disk to replace it) are whole-channel events, stalling and interrupting IO for up to four other backend disks at once. I will be much happier in the new generation where we're avoiding that. I don't think total eSATA channel bandwidth limits have been an issue on our current hardware, but that's only because an iSCSI backend only has 200 Mbytes/sec of network bandwidth. On modern hardware with dual 10G Ethernet and SATA disks that can do 150+ Mbytes/sec of real disk IO this would probably be an issue.
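(To spell the arithmetic out, here is a rough sketch; the five-disks-per-channel figure is implied by the 'four other disks' above, while the eSATA channel limit is my assumption rather than a measured number.)

	# Back-of-the-envelope bandwidth numbers. The per-channel disk count
	# and the eSATA channel limit are assumptions; the rest come from the
	# entry itself.
	disks_per_esata_channel = 5       # a reset stalls 'up to four other' disks
	disk_streaming_mb_s = 150         # modern SATA disk streaming rate
	esata_channel_limit_mb_s = 300    # assumed: roughly a 3 Gbits/sec eSATA link

	old_network_mb_s = 200            # current backends: ~200 Mbytes/sec of network
	new_network_mb_s = 2 * 1000       # dual 10G Ethernet, roughly

	channel_demand_mb_s = disks_per_esata_channel * disk_streaming_mb_s   # 750

	# Old hardware: the network caps a backend well below the eSATA channel
	# limit, so the channel never becomes the bottleneck.
	print(min(old_network_mb_s, esata_channel_limit_mb_s, channel_demand_mb_s))  # 200

	# With dual 10G the network stops being the cap and the shared eSATA
	# channel becomes the choke point for its five disks.
	print(min(new_network_mb_s, esata_channel_limit_mb_s, channel_demand_mb_s))  # 300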

(We are lucky that our single SSD based pool is not very big and is on SSDs for latency reasons instead of bandwidth ones.)

Our general design forced us into what I'll call 'multi-tenant' use of physical disks, where several ZFS pools can all wind up using the same physical disk. This has clearly had an impact on users, where high IO on one pool has leaked through to affect other people in other pools. At the same time we've also seen some degree of problems simply from shared fileservers and/or shared backends, even when physical disk usage doesn't overlap (and those are inevitable with our costs). I'm not sure we can really avoid multi-tenanting our disks but it is a drawback of our environment and I'll admit it.
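(As an illustration of what multi-tenancy means in practice, here is a minimal sketch with made-up pool and disk names of how two pools can wind up contending for the same physical disk.)

	# A toy illustration of disk multi-tenancy; the pool-to-disk mapping is
	# hypothetical. When two pools have space on the same physical disk,
	# heavy IO in one pool competes with the other at the disk level.
	from collections import defaultdict

	pool_disks = {
	    "pool-alpha": {"backendA-disk0", "backendA-disk1"},
	    "pool-beta":  {"backendA-disk1", "backendA-disk2"},
	}

	disk_tenants = defaultdict(set)
	for pool, disks in pool_disks.items():
	    for disk in disks:
	        disk_tenants[disk].add(pool)

	# Disks with more than one tenant are where one pool's IO load can leak
	# through to the users of another pool.
	shared = {disk: sorted(pools) for disk, pools in disk_tenants.items() if len(pools) > 1}
	print(shared)   # {'backendA-disk1': ['pool-alpha', 'pool-beta']}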

Although I said this in my Solaris fileserver retrospective, it's worth repeating that ZFS has been a great thing for us (both for its flexible space management and for ZFS scrubs). We could have done something similar to our fileserver environment without ZFS but it wouldn't have been half as trustworthy (in my biased opinion) or half as easy to manage and deal with. I also remain convinced that we made the right choice for iSCSI backends and iSCSI target software, partly because our iSCSI target software both works and has been quite easy to manage (the latter is not something I can say about the other Linux iSCSI targets I've looked at).

As I mentioned in my entry on the Solaris stuff that didn't quite work out, effectively losing failover has been quietly painful in a low-level way. It's the one significant downside I can see in our current set of design choices; I think that ZFS is worth it, but it does ache. If we'd had failover over the past six years, we probably would have made significant use of the ability to quickly move a virtual fileserver from one physical server to another. Saying that we don't really miss it now is true only in a simple way; because we don't have it, we undoubtedly haven't even been aware of situations where we'd have used it.

Having management processors and KVMs over IP for all of the fileservers and the backends has worked out well and has turned into something that I think is quite important. Our fileserver environment is crucial infrastructure; being able to look closely at its state remotely is a good and periodically important thing. We lucked into this on both our original generation hardware and on our new generation hardware (we didn't make it an explicit requirement), but as far as I'm concerned it's going to be a hard requirement for the next generation.

(Assuming that I remember this in four years or so, that is.)

PS: If you're interested in how some other aspect of our fileserver environment has worked out, please feel free to ask in comments. I'm probably not covering interesting bits simply because I'm a fish in water when it comes to this stuff (for obvious reasons).
