2015-02-20
All of our Solaris 10 machines are now out of production
Technically we did the final migration of filesystems from our old Solaris 10 fileservers to our new OmniOS ones on Wednesday evening, but in our view a migration is never really final until we've run the new thing in production for a while. Everything survived today without a hiccup, so we can now definitively say that all of our old Solaris 10 fileservers are out of production. We haven't actively and fully decommissioned them yet, in that almost everything is actually still running and many of the old ZFS pools are still there, but then we've lagged a bit on that sort of thing before.
These Solaris machines and their iSCSI backends had a very good run. It looks like our first production filesystems were brought up in September of 2008 on what was probably Solaris 10 update 5 (we soon moved to update 6 and then eventually update 8, where we stopped). Running until now means that they lasted almost six and a half years in production without real problems.
(That was six and a half years on the same server hardware; they all stayed on our original SunFire X2200 servers (and X2100s for the iSCSI backends). Some of the iSCSI backends probably still have the original 750 GB SATA disks we started with, and they pretty much all have their original system disks.)
I've already written an overall retrospective on our environment, but I want to say (again) that Solaris 10 was a totally solid operating system over its lifetime here and gave us good service. It may not have been the most modern thing and I have my issues with bits of ZFS and Solaris administration and so on, but it kept running and running and running. We could (and did) leave the machines up for very long periods of time and never touch them and nothing fell over, even under NFS load. I expect good things from OmniOS, but it does have big shoes to fill; Solaris has set a high bar for trouble-free operation.
(OmniOS will probably never be as 'stable' as Solaris in that we'll likely feel a need to update it more often and not totally freeze it the way we wound up doing with Solaris 10 update 8. It helps that OmniOS makes this less painful than Solaris updates and so on were, although we'll have to see.)
Since Solaris 10 is in some ways one of the last artifacts of a Sun that no longer exists, I feel a bit sad to see it go. It was time and past time for us to update things, and OmniOS is a perfectly good successor (our 10G problems notwithstanding), but it doesn't come from the exciting Sun of the mid to late 00s, even if many of the same people are still involved with it.
(Oracle Solaris is a different beast entirely, one that's gone all sharp-elbowed business suit because that's what Oracle is.)
2015-02-11
ZFS can apparently start NFS fileservice before boot finishes
Here's something that I was surprised to discover the other day: ZFS can start serving things over NFS before the system is fully up. Unfortunately this can have bad effects, because in some circumstances this NFS traffic can generate further ZFS IO and apparently stall the rest of the boot-time ZFS activity.
Since this sounds unbelievable, let me report what I saw first. As our problem NFS fileserver rebooted, it stalled reporting 'Reading ZFS config:'. At the same time, our iSCSI backends reported a high ongoing write volume to one pool's set of disks and snoop on the fileserver could see active NFS traffic. ptree reported that what was running at the time was the 'zfs mount -a' that is part of the /system/filesystem/local service.
(I recovered the fileserver from this problem by the simple method of disconnecting its network interface. This caused nlockmgr to fail to start, but at least the system was up. ZFS commands like 'zfs list' stalled during this stage; I didn't think to do a df to capture the actual mounts.)
Although I can't prove it from the source code, I have to assume that 'zfs mount -a' is enabling NFS access to filesystems as it mounts them. An alternate explanation is that /etc/dfs/sharetab had listings for all of the filesystems (ZFS adds them as part of sharing them over NFS) and this activated NFS service for filesystems as they appeared. The net effect is about the same.
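As a sketch of how you could poke at either explanation on a running system, these are the standard commands I'd reach for; 'tank' is just a placeholder pool name and this isn't something I captured at the time.

    # Which filesystems have NFS sharing set in ZFS itself
    # ('tank' is a stand-in for a real pool name).
    zfs get -r -o name,value sharenfs tank

    # What the system currently believes is shared over NFS;
    # ZFS adds entries to sharetab as it shares filesystems.
    cat /etc/dfs/sharetab
    share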
This is obviously a real issue if you want your system to be fully up and running okay before any NFS fileservice starts. Since apparently some sorts of NFS traffic under some circumstances can stall further ZFS activity, well, this is something you may care about; we certainly do now.
In theory the SMF dependencies say that /network/nfs/server depends on /system/filesystem/local, as well as on nlockmgr (which didn't start). In practice, well, how the system actually behaves is the ultimate proof and all I can do is report what I saw. Yes, this is frustrating. That ZFS and SMF together hide so much in black magic is a serious problem, one that has frustrated me before. Among other things it means that when something goes odd or wrong, you need to be a deep expert to understand what's going on.
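If you want to look at the declared dependencies yourself, svcs will show them; these are just the standard SMF inspection commands, and they don't explain the behaviour we actually saw.

    # Services that nfs/server declares it depends on, with their state
    svcs -d svc:/network/nfs/server:default

    # Full details, including dependency groupings
    svcs -l svc:/network/nfs/server:default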
2015-02-10
Our ZFS fileservers have a serious problem when pools hit quota limits
Sometimes not everything goes well with our ZFS fileservers. Today was one of those times and as a result this is an entry where I don't have any solutions, just questions. The short summary is that we've now had a fileserver get very unresponsive and in fact outright lock up when a ZFS pool that's experiencing active write IO runs into a pool quota limit.
Importantly, the pool has not actually run out of disk space; it has only run into the quota limit, which is about 235 GB below the space limit as 'zfs list' reports it (or would, if there was no pool quota). Given things we've seen before with full pools I would not have been surprised to experience these problems if the pool had run itself out of actual disk space. However it didn't; it only ran into an entirely artificial quota limit. And things exploded anyways.
(Specifically, the pool had a quota setting, since refquota on a pool where all the data is in filesystems isn't good for much.)
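To sketch the difference (with an arbitrary pool name and size): quota caps a dataset plus all of its descendants, while refquota only counts data in the dataset itself, which is essentially nothing when everything lives in child filesystems.

    # 'tank' and 500G are purely illustrative values.
    zfs set quota=500G tank      # caps tank and every filesystem under it
    zfs set refquota=500G tank   # caps only data written directly to tank itself

    # See how close a pool-level dataset is to its quota
    zfs get -o name,property,value quota,used,available tank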
Unfortunately we haven't gotten a crash dump. By the time there were serious problem indications the system had locked up, and anyways our past attempts to get crash dumps in the same situation have been ineffective (the system would start to dump but then appear to hang).
To the extent that we can tell anything, the few console messages that get logged sort of vaguely suggest kernel memory issues. Or perhaps I am simply reading too much into messages like 'arl_dlpi_pending unsolicited ack for DL_UNITDATA_REQ on e1000g1'. Since the problem is erratic and usually materializes with little or no warning, I don't think we've captured eg mpstat output during the run-up to a lockup to see things like whether CPU usage is going through the roof.
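One obvious thing to do next time is to leave something like the following running on the fileserver so there's at least some data from the run-up; this is an untested sketch, and the log path and intervals are arbitrary.

    #!/bin/ksh
    # Untested sketch: keep a rolling log of CPU, kernel memory, and
    # per-pool IO activity so there's something to look at after a lockup.
    # Needs to run as root (for mdb -k).
    LOG=/var/tmp/lockup-watch.log
    while :; do
        date >> $LOG
        mpstat 1 5 >> $LOG               # is %sys CPU going through the roof?
        echo ::memstat | mdb -k >> $LOG  # kernel vs user memory breakdown
        zpool iostat 5 3 >> $LOG         # per-pool IO rates
        sleep 60
    done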
I don't think that this happens all the time, as we've had this specific pool go to similar levels of being almost full before and the system hasn't locked up. The specific NFS IO pattern likely has something to do with it, as we've failed to reproduce system lockups in a test setup even with genuinely full pools, but of course we have no real idea what the IO pattern is. Given our multi-tenancy we can't even be confident that IO to the pool itself is the only contributor; we may need a pattern of IO to other pools as well to trigger problems.
(I also suspect that NFS and iSCSI are probably all involved in the problem. Partly this is because I would have expected a mere pool quota issue with ZFS alone to have been encountered before now, or even with ZFS plus NFS since a fair number of people run ZFS based NFS fileservers. I suspect we're one of the few places using ZFS with iSCSI as the backend and then doing NFS on top of it.)
One thing that writing this entry has convinced me of is that I should pre-write a bunch of questions and things to look at in a file, so that I have them on hand the next time things start going south and don't have to rely on my fallible memory to come up with what troubleshooting we want to try. Of course these events are sufficiently infrequent that I may forget where I put the file by the time the next one happens.
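For what it's worth, such a file would probably be mostly a list of standard Solaris/illumos commands along these lines; the pool name is a placeholder and this is only a sketch of a starting point, not a complete list.

    # things-to-check: run as root when a fileserver starts acting up
    mpstat 1 10                                # CPU usage, especially %sys
    echo ::memstat | mdb -k                    # kernel memory breakdown
    zpool list; zpool status -x                # pool space and health
    zfs list -o name,used,avail,quota tank     # how close to the quota are we?
    iostat -xn 5 6                             # are the iSCSI-backed disks saturated?
    nfsstat -s                                 # NFS server operation mix and counts
    savecore -L                                # live crash dump while the system still responds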