A retrospective on our overall fileserver architecture et al
My look back at how our fileserver environment has done over the past six years has focused on the Solaris fileservers (both the stuff that worked nicely and the stuff that didn't quite work out). Today I'm widening my focus to how well the whole idea and some of our overall decisions have worked out.
I think the safest thing to say about our overall architecture is that it's worked well for us; taking a proper retrospective look back at the constraints and so on involved in the design would require an entire entry by itself. Basing our SAN on iSCSI is still clearly the best option for low cost and wide availability of both frontends and backends. Having a SAN itself and decoupling the front end fileservers from their backend storage has been a significant advantage in practice; we've been able to do things like shuffle around entire backend machines in live production and bring up an entire second hot spare backend over Christmas vacation just in case. Effectively, having a SAN (or in our case two of them) means avoiding a single point of (long-term) failure for any particular fileserver. We like this overall architecture enough that we're replicating it in the second generation environment that we're building out now.
(To be fair, using a SAN did create a single point of performance problems.)
I don't think we've ever actually needed to have two iSCSI SAN networks instead of one, but it was fairly cheap insurance and I have no qualms about it. I believe we've taken advantage of it a few times (eg to deliberately swap out one iSCSI switch while things were live, although I believe we did that during a formal downtime just in case).
Mirroring all storage between two separate iSCSI backends (at least) has been a great thing. Mirroring has given us better IO performance and has enabled us to ride out the failure of entire backend units without any significant user-visible issues. It's also made expanding people's ZFS pools much simpler, partly because it means we can do it in relatively small units. A lot of our pools are actually too small to be really viable RAID-5+ pools.
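As a concrete sketch of what this looks like at the ZFS level (the pool and device names here are hypothetical, not our actual ones): each mirror vdev pairs one iSCSI disk from one backend against one from the other, so an entire backend can fail and every mirror still has a surviving half, and pools grow in small mirror-pair increments.

```shell
# Hypothetical illustration: c5t*d0 devices come from backend A,
# c6t*d0 devices from backend B, so every mirror spans both backends.
zpool create tank \
  mirror /dev/dsk/c5t1d0 /dev/dsk/c6t1d0 \
  mirror /dev/dsk/c5t2d0 /dev/dsk/c6t2d0

# Expanding a pool later is just adding another cross-backend pair:
zpool add tank mirror /dev/dsk/c5t3d0 /dev/dsk/c6t3d0
```

This small-increment growth is exactly why mirrors beat RAID-5+ here: a pool too small to justify a whole raidz stripe can still grow one mirror pair at a time.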
I'm convinced that using decent consumer-grade SATA disks has been a serious win. Even our 'bad' 1TB Seagate drives have given us decent service out to the end of their warranty lifetimes and beyond, and the price savings over more costly disks are what made our entire design feasible in the first place. With unlimited budget, sure, fill up the racks with 15K RPM SAS enterprise drives (and now SSDs), but within our constraints I don't think we could have done better and our disks have undeniably worked and delivered decent performance.
Using inexpensive external disk enclosure hardware on the backends worked but has caused us moderate problems over the long run, because the disk enclosures just aren't as solid as the server hardware. They are basically PC cases, PC power supplies, and a bunch of PC fans and so on, plus the drive bays and wiring. We've had a number of power supply failures, by now a number of the fans have died (and can't really be replaced) with the accompanying increase in disk temperature, and so on. Having only single power supplies leaves the disk enclosures vulnerable to various power feed problems in practice. We're quite looking forward to moving to a better class of hardware in the next generation, with dual power supplies, easily replaced fans, and simply better engineering and construction.
(This means that to some extent the easy failover between backends created by using a SAN has only been necessary because our backends keep falling over due to this inexpensive hardware. We've never migrated backend storage around for other reasons.)
Using ESATA as the way to connect up all of the disks worked but, again, not without moderate problems. The largest of these is that disk resets (due to errors or just pulling a disk to replace it) are whole channel events, stalling and interrupting IO for up to four other backend disks at once. I will be much happier in the new generation where we're avoiding that. I don't think total ESATA channel bandwidth limits have been an issue on our current hardware, but that's only because an iSCSI backend only has 200 Mbytes/sec of network bandwidth. On modern hardware with dual 10G Ethernet and SATA disks that can do 150+ MBytes/sec of real disk IO this would probably be an issue.
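The bandwidth claim above can be checked with a quick back-of-envelope calculation. The numbers here are illustrative assumptions on my part: a 3 Gbit/s eSATA link delivering roughly 300 MBytes/sec of usable bandwidth, five disks per channel (matching the reset blast radius above), and the 200 MBytes/sec of network bandwidth the current backends have.

```python
# Back-of-envelope arithmetic; all figures are rough assumptions.
channel_bw = 300         # MBytes/sec, approx usable on a 3 Gbit/s eSATA link
disks_per_channel = 5    # matching the up-to-five-disk reset blast radius
per_disk_bw = 150        # MBytes/sec a modern SATA disk can sustain
network_bw = 200         # MBytes/sec, the current backends' network limit

aggregate_disk_bw = disks_per_channel * per_disk_bw
print(aggregate_disk_bw)               # 750 MBytes/sec of potential disk IO
print(aggregate_disk_bw > channel_bw)  # True: the eSATA channel is a bottleneck
print(network_bw < channel_bw)         # True: the old network hid that bottleneck
```

In other words, the channel would saturate long before five modern disks do, but the old 200 MBytes/sec network limit always saturated first, which is why we never noticed.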
(We are lucky that our single SSD based pool is not very big and is on SSDs for latency reasons instead of bandwidth ones.)
Our general design forced us into what I'll call 'multi-tenant' use of physical disks, where several ZFS pools can all wind up using the same physical disk. This has clearly had an impact on users, where high IO on one pool has leaked through to affect other people in other pools. At the same time we've also seen some degree of problems simply from shared fileservers and/or shared backends, even when physical disk usage doesn't overlap (and those are inevitable with our costs). I'm not sure we can really avoid multi-tenanting our disks but it is a drawback of our environment and I'll admit it.
Although I said this in my Solaris fileserver retrospective, it's worth repeating that ZFS has been a great thing for us (both for its flexible space management and for ZFS scrubs). We could have done something similar to our fileserver environment without ZFS but it wouldn't have been half as trustworthy (in my biased opinion) or half as easy to manage and deal with. I also remain convinced that we made the right choice for iSCSI backends and iSCSI target software, partly because our iSCSI target software both works and has been quite easy to manage (the latter is not something I can say about the other Linux iSCSI targets I've looked at).
As I mentioned in my entry on the Solaris stuff that didn't quite work out, effectively losing failover has been quietly painful in a low-level way. It's the one significant downside I can see in our current set of design choices; I think that ZFS is worth it, but it does ache. If we'd had it over the past six years, we probably would have made significant uses of the ability to quickly move a virtual fileserver from one physical server to another. Saying that we don't really miss it now is true only in a simple way; because we don't have it we undoubtedly haven't even been aware of situations where we'd have used it.
Having management processors and KVMs over IP for all of the fileservers and the backends has worked out well and has turned into something that I think is quite important. Our fileserver environment is crucial infrastructure; being able to look closely at its state remotely is a good and periodically important thing. We lucked into this on both our original generation hardware and on our new generation hardware (we didn't make it an explicit requirement), but as far as I'm concerned it's going to be a hard requirement for the next generation.
(Assuming that I remember this in four years or so, that is.)
PS: If you're interested in how some other aspect of our fileserver environment has worked out, please feel free to ask in comments. I'm probably missing covering interesting bits simply because I'm a fish in water when it comes to this stuff (for obvious reasons).
A retrospective on our Solaris ZFS-based NFS fileservers (part 2)
In yesterday's entry I talked about the parts of our Solaris ZFS fileserver environment that worked nicely over the six years we've run them. Today is for the other side, the things about Solaris that didn't go so well. You may have noticed that yesterday I was careful to talk specifically about the basics of ZFS working well. That is because pretty much all of the extra frills we tried failed or outright blew up in our faces.
The largest single thing that didn't work out anywhere near as we planned and wanted is failover. There are contributing factors beyond ZFS (see this for a full overview) but what basically killed even careful manual failover is the problem of very slow zpool imports.
The saving grace of the situation is that we've only really needed failover a relatively small number of times because the fileservers have been generally quite reliable. The downside of losing failover is that the other name for failover is 'easy and rapid migration of NFS service' and there have been any number of situations where we could have used that. For example, we recently rebooted all of the fileservers because they'd been up over 650 days and we had some signs they might have latent problems. With fast, good 'failover' we could have done this effectively live without much user-visible impact (shift all NFS fileservice away from a particular machine, reboot it, shift its NFS fileservice back, repeat). Without that failover? A formal downtime.
The largest advertised ZFS feature that just didn't work was ZFS's support for spare devices. We wound up feeling that this was completely useless and built our own spares system (part 2, part 3). We also had problems with, for example, zpool status hanging in problem situations or just not being honest with us about the truth of the situation.
It turned out to be a significant issue in practice that ZFS has no API, ie no way for outside systems to reliably extract state information from it (a situation that continues to this day). Because we needed this information we were forced to develop ad-hoc and non-portable tools to extract it by force from Solaris, and this in turn caused further problems. One significant reason we never upgraded past Solaris 10 update 8, despite the existence of fixes we were interested in, was that upgrading would have required updating and re-validating all of these tools.
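To give a flavour of what 'no API' means in practice, here is a minimal sketch (not our actual tools) of the kind of ad-hoc scraping you're reduced to: parsing the human-readable output of zpool status with regular expressions, which is exactly the sort of thing that breaks whenever the output format shifts between Solaris updates. The sample output is illustrative only.

```python
import re

# Illustrative 'zpool status' output, not captured from a real system.
sample = """\
  pool: tank
 state: DEGRADED
status: One or more devices has been removed by the administrator.
"""

def pool_states(status_text):
    """Scrape (pool, state) pairs out of 'zpool status' output.

    This is deliberately fragile screen-scraping: it depends on the
    exact 'pool:' and 'state:' labels in the human-readable output.
    """
    pools = re.findall(r"^\s*pool:\s*(\S+)", status_text, re.MULTILINE)
    states = re.findall(r"^\s*state:\s*(\S+)", status_text, re.MULTILINE)
    return list(zip(pools, states))

print(pool_states(sample))  # [('tank', 'DEGRADED')]
```

Anything deeper than this (per-vdev error counters, resilver state, and so on) meant digging the information out of Solaris internals, which is where the source code dependency came from.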
(These tools are also a large part of why we wouldn't take Solaris 11 even if Oracle offered it to us for free. We need these tools and these tools require source code access so we can reverse engineer this information.)
Overall our Solaris experience has left me feeling that we were quite far from the (ZFS) use cases that the Solaris developers expected. A lot of things didn't seem prepared to cope with, for example, how many 'disks' we have. Nothing actually broke significantly (at least once we stopped applying Solaris patches) but the entire environment felt fragile, like a too-tall building swaying as the wind builds up. We also became increasingly dubious about the quality of implementation of the changes that Sun (and then Oracle) was making to Solaris, adding another reason to stop applying patches and to never upgrade past Solaris 10U8.
(Allow me to translate that: Solaris OS developers routinely wrote and released patches and changes with terrible code that broke things for us and didn't work as officially documented. The Sun and Oracle reaction to this was a giant silent shrug.)
While we got away with our 'no patches, no updates, no changes' policy, I'm aware that we were lucky; we simply never hit any of the known S10U8 bugs. I didn't (and don't) like running systems that I feel I can't update because things are sure to break, and we definitely wound up doing that with our Solaris machines. I count that as something that did not go well.
In general, over time I've become increasingly uncomfortable about our default 'no updates on black box appliance style machines' policy, which we've followed on both the Solaris fileservers and the iSCSI backends. I kind of count it as an implicit failure in our current fileserver environment. For the next generation of fileservers and backends I'd really like to figure out a way to apply as many updates as possible in a safe way (I have some ideas but I'll save them for another entry).
None of these things that didn't work so well have been fatal or even painful in day to day usage. Some of them, such as the ZFS spares situation, have forced us to do things that improved the overall environment; having our own spares system has turned out to be a big win because it can be more intelligent and more aggressive than any general ZFS solution could be.