2015-02-20
All of our Solaris 10 machines are now out of production
Technically we did the final migration of filesystems from our old Solaris 10 fileservers to our new OmniOS ones on Wednesday evening, but in our view a migration is never really final until we've run the new thing in production for a while. Everything survived today without a hiccup, so we can now definitively say that all of our old Solaris 10 fileservers are out of production. We haven't actively and fully decommissioned them yet, in that almost everything is actually still running and many of the old ZFS pools are still there, but as usual we're lagging a bit on that part.
These Solaris machines and their iSCSI backends had a very good run. It looks like our first production filesystems were brought up in September of 2008 on what was probably Solaris 10 update 5 (we soon moved to update 6 and then eventually update 8, where we stopped). Running until now means that they lasted almost six and a half years in production without real problems.
(That was six and a half years on the same server hardware; they all stayed on our original SunFire X2200 servers (and X2100s for the iSCSI backends). Some of the iSCSI backends probably still have the original 750 GB SATA disks we started with, and they pretty much all have their original system disks.)
I've already written an overall retrospective on our environment, but I want to say (again) that Solaris 10 was a totally solid operating system over its lifetime here and gave us good service. It may not have been the most modern thing and I have my issues with bits of ZFS and Solaris administration and so on, but it kept running and running and running. We could (and did) leave the machines up for very long periods of time and never touch them and nothing fell over, even under NFS load. I expect good things from OmniOS, but it does have big shoes to fill; Solaris has set a high bar for trouble-free operation.
(OmniOS will probably never be as 'stable' as Solaris in that we'll likely feel a need to update it more often and not totally freeze it the way we wound up doing with Solaris 10 update 8. It helps that OmniOS makes this less painful than Solaris updates and so on were, although we'll have to see.)
In that Solaris 10 is in some ways one of the last artifacts of a Sun that no longer exists, I feel a bit sad to see it go. It was time and past time for us to update things and OmniOS is a perfectly good successor (our 10G problems notwithstanding), but it doesn't come from the exciting Sun of the mid to late 00s even if many of the same people are still involved with it.
(Oracle Solaris is a different beast entirely, one that's gone all sharp-elbowed business suit because that's what Oracle is.)
2015-02-11
ZFS can apparently start NFS fileservice before boot finishes
Here's something that I was surprised to discover the other day: ZFS can start serving things over NFS before the system is fully up. Unfortunately this can have a bad effect, because in some circumstances that early NFS traffic can stall further ZFS activity.
Since this sounds unbelievable, let me report what I saw first. As
our problem NFS fileserver rebooted, it
stalled reporting 'Reading ZFS config:'. At the same time, our
iSCSI backends reported a high ongoing write volume to one pool's
set of disks and snoop on the fileserver could see active NFS
traffic. ptree reported that what was running at the time was the
'zfs mount -a' that is part of the /system/filesystem/local SMF service.
(I recovered the fileserver from this problem by the simple method
of disconnecting its network interface. This caused nlockmgr to
fail to start, but at least the system was up. ZFS commands like
'zfs list' stalled during this stage; I didn't think to do a df
to capture the actual mounts.)
Although I can't prove it from the source code, I have to assume
that 'zfs mount -a' is enabling NFS access to filesystems as it
mounts them. An alternate explanation is that /etc/dfs/sharetab
had listings for all of the filesystems (ZFS adds them as part of
sharing them over NFS) and this activated NFS service for filesystems
as they appeared. The net effect is about the same.
This is obviously a real issue if you want your system to be fully up and running okay before any NFS fileservice starts. Since apparently some sorts of NFS traffic under some circumstances can stall further ZFS activity, well, this is something you may care about; we certainly do now.
In theory the SMF dependencies say that /network/nfs/server depends on /system/filesystem/local, as well as nlockmgr (which didn't start). In practice, well, how the system actually behaves is the ultimate proof and all I can do is report what I saw. Yes, this is frustrating. That ZFS and SMF together hide so much in black magic is a serious problem that has made me frustrated before. Among other things it means that when something goes odd or wrong you need to be a deep expert to understand what's going on.
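For completeness, the declared dependencies are easy enough to look at with the standard SMF tools, even if they evidently don't tell the whole story here (these are generic commands, not a diagnosis of the behaviour above):

    # What the NFS server service claims to depend on:
    svcs -d svc:/network/nfs/server:default
    # Detailed state and dependency information for filesystem/local:
    svcs -l svc:/system/filesystem/local:default
    # And which services are in turn waiting on filesystem/local:
    svcs -D svc:/system/filesystem/local:default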
2015-02-10
Our ZFS fileservers have a serious problem when pools hit quota limits
Sometimes not everything goes well with our ZFS fileservers. Today was one of those times and as a result this is an entry where I don't have any solutions, just questions. The short summary is that we've now had a fileserver get very unresponsive and in fact outright lock up when a ZFS pool that's experiencing active write IO runs into a pool quota limit.
Importantly, the pool has not actually run out of actual disk space;
it has only run into the quota limit, which is about 235 GB below
the space limit as 'zfs list' reports it
(or would, if there was no pool quota). Given things we've seen
before with full pools I would not have been
surprised to experience these problems if the pool had run itself
out of actual disk space. However it didn't; it only ran into an
entirely artificial quota limit. And things exploded anyways.
(Specifically, the pool had a quota setting, since refquota
on a pool where all the data is in filesystems isn't good for
much.)
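To make the distinction concrete, here is roughly what this looks like with a stand-in pool name ('tank') and a made-up size:

    # A quota on the pool's top-level dataset caps everything in the pool,
    # including the child filesystems where all of our data actually lives:
    zfs set quota=5T tank
    # refquota would only limit space charged to the top-level dataset
    # itself, which holds essentially nothing in our setup:
    zfs get quota,refquota,used,available tank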
Unfortunately we haven't gotten a crash dump. By the time there were
serious problem indications the system had locked up, and anyways
our past attempts to get crash dumps in the same situation have
been ineffective (the system would start to dump but then appear
to hang).
To the extent that we can tell anything, the few console messages that
get logged sort of vaguely suggest kernel memory issues. Or perhaps
I am simply reading too much into messages like 'arl_dlpi_pending
unsolicited ack for DL_UNITDATA_REQ on e1000g1'. Since the problem
is erratic and usually materializes with little or no warning, I don't
think we've captured eg mpstat output during the run-up to a lockup to
see things like whether CPU usage is going through the roof.
I don't think that this happens all the time, as we've had this specific pool go to similar levels of being almost full before and the system hasn't locked up. The specific NFS IO pattern likely has something to do with it, as we've failed to reproduce system lockups in a test setup even with genuinely full pools, but of course we have no real idea what the IO pattern is. Given our multi-tenancy we can't even be confident that IO to the pool itself is the only contributor; we may need a pattern of IO to other pools as well to trigger problems.
(I also suspect that NFS and iSCSI are probably both involved in the problem. Partly this is because I would have expected a mere pool quota issue with ZFS alone to have been encountered before now, or even with ZFS plus NFS, since a fair number of people run ZFS-based NFS fileservers. I suspect we're one of the few places using ZFS with iSCSI as the backend and then doing NFS on top of it.)
One thing that writing this entry has convinced me is that I should pre-write a bunch of questions and things to look at in a file so I have them on hand the next time things start going south and I don't have to rely on my fallible memory to come up with what troubleshooting we want to try. Of course these events are sufficiently infrequent that I may forget where I put the file by the time the next one happens.
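As a first guess, such a file probably needs nothing more exotic than a handful of standard commands and reminders, something like this (POOL is a placeholder and none of this is battle-tested yet):

    mpstat 1                              # is CPU usage going crazy?
    echo ::memstat | mdb -k               # kernel memory breakdown
    zpool status POOL                     # does ZFS still answer? any errors?
    zfs list -o name,used,available POOL  # how close to the quota are we?
    iostat -xn 5                          # what do the iSCSI disks look like?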
2015-01-15
General ZFS pool shrinking will likely be coming to Illumos
Here is some great news. It started with this tweet from Alex Reece (which I saw via @bdha):
Finally got around to posting the device removal writeup for my first open source talk on #openzfs device removal! <link>
'Device removal' sounded vaguely interesting but I wasn't entirely sure why it called for a talk, since ZFS can already remove devices. Still, I'll read ZFS related things when I see them go by on Twitter, so I did. And my eyes popped right open.
This is really about being able to remove vdevs from a pool. In its current state I think the code requires all vdevs to be bare disks, which is not too useful for real configurations, but now that the big initial work has been done I suspect that there will be a big rush of people to improve it to cover more cases once it goes upstream to mainline Illumos (or before). Even being able to remove bare disks from pools with mirrored vdevs would be a big help for the 'I accidentally added a disk as a new vdev instead of as a mirror' situation that comes up periodically.
(This mistake is the difference between 'zpool add POOL DEV1 DEV2'
and 'zpool add POOL mirror DEV1 DEV2'. You spotted the one word
added to the second command, right?)
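To spell the difference out, with stand-in pool and device names:

    # What you meant: add one new mirrored vdev to the pool:
    zpool add tank mirror c5t0d0 c6t0d0
    # What you typed: add two new independent, unmirrored top-level vdevs:
    zpool add tank c5t0d0 c6t0d0
    # 'zpool add -n' prints the resulting configuration without changing
    # anything, which is a cheap way of catching this before it's too late:
    zpool add -n tank mirror c5t0d0 c6t0d0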
While this is not quite the same thing as an in-place reshape of your pool, a fully general version of this would let you move a pool from, say, mirroring to raidz provided that you had enough scratch disks for the transition (either because you are the kind of place that has them around or because you're moving to new disks anyways and you're just arranging them differently).
(While you can do this kind of 'reshaping' today by making a
completely new pool and using zfs send and zfs receive, there
are some advantages to being able to do it transparently and without
interruptions while people are actively using the pool.)
This feature has been a wishlist item for ZFS for so long that I'd long since given up on ever seeing it. To have even a preliminary version of it materialize out of the blue like this is simply amazing (and I'm a little bit surprised that this is the first I heard of it; I would have expected an explosion of excitement as the news started going around).
(Note that there may be an important fundamental limitation about this that I'm missing in my initial enthusiasm and reading. But still, it's the best news about this I've heard for, well, years.)
2015-01-13
Our tradeoffs on ZFS ZIL SLOG devices for pools
As I mentioned in my entry on the effects of losing a SLOG device, our initial plan (or really idea) for SLOGs in our new fileservers was to use a mirrored pair for each pool that we gave a SLOG to, split between iSCSI backends as usual. This is clearly the most resilient choice for a SLOG setup, assuming that you have SSDs with supercaps; it would take a really unusual series of events to lose any committed data in the pool.
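Concretely, giving a pool such a mirrored SLOG is a one-line operation; with stand-in device names for one SSD chunk from each of two iSCSI backends, it would be something like:

    # One log mirror built from an SSD chunk on each of two backends:
    zpool add tank log mirror c10t1d0 c11t1d0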
On ZFS mailing lists that I've read, there are plenty of people who think that using mirrored SSDs for your SLOG is overkill for the extremely unlikely event of a simultaneous server and SLOG failure. This would save us one SLOG device (or chunk) per pool, which has its obvious attractions.
If we're willing to drop to one SLOG device per pool and live with the resulting small chance of data loss, a more extreme possibility is to put the SLOG device on the fileserver itself instead of on an iSCSI backend. The potential big win here would be moving from iSCSI to purely local IO, which presumably has lower latency and thus would enable the fileserver to respond to synchronous NFS operations faster. The drawback is that we couldn't fail over pools to another fileserver without either abandoning the SLOG (with potential data loss) or physically moving the SLOG device to the other fileserver. While we've almost never failed over pools, especially remotely, I'm not sure we want to abandon the possibility quite so definitely.
(And before we went down this road we'd definitely want to measure the IO latencies of SLOG writes to a local SSD versus SLOG writes to an iSCSI SSD. It may well be that there's almost no difference, at which point giving up the failover advantages would be relatively crazy.)
Since we aren't yet at the point of trying SLOGs on any pools or even measuring our volume of ZIL writes, all of this is idle planning for now. But I like to think ahead and to some extent it affects things like how many bays we fill in the iSCSI backends (we're currently reserving two bays on each backend for future SLOG SSDs).
PS: Even if we have a low volume of ZIL writes in general, we may find that we hit the ZIL hard during certain sorts of operations (perhaps eg unpacking tarfiles or doing VCS operations) and it's worth adding SLOGs just so we don't perform terribly when people do them. Of course this is going to be quite affected by the price of appropriate SSDs.
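When we do get around to measuring our ZIL write volume, my current assumption is that a crude DTrace count of zil_commit() calls will be enough to get a first feel for things; this is untested:

    # Untested: count ZIL commit requests per second on the fileserver.
    # zil_commit() is the kernel entry point for synchronous ZIL writes.
    dtrace -n 'fbt::zil_commit:entry { @commits = count(); }
               tick-1sec { printa(@commits); clear(@commits); }'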
2015-01-11
The effects of losing a ZFS ZIL SLOG device, as I understand them
Back when we planned out our new fileservers, our plan for any ZIL SLOG devices we'd maybe eventually put on hot pools was to use mirrored SLOG SSDs, just as we use mirrored disks for the main data storage. At the time when I put together these plans, my general impression was that losing your SLOG was fatal for your pool; of course that meant we had to mirror them to avoid a single device failure destroying a pool. Since then I've learned more about the effects of ZIL SLOG failure and I am starting to reconsider this and related design decisions.
As far as I know and have gathered (but have not yet actually tested
with our OmniOS version), ZIL SLOG usage goes like this. First,
the ZIL is never read from in normal operation;
the only time the ZIL is consulted is if the system crashes abruptly
and ZFS has to recover IO that was acknowledged (eg that was
fsync()'d) but not yet committed to regular storage as part of a
transaction group. This means that if a pool or system was shut
down in an orderly way and then the SLOG is not there on reboot,
reimport, or whatever, you've lost nothing since all of the
acknowledged, fsync()'d in-flight IO was committed in a transaction
group before the system shut down.
If the system crashed and then the pool SLOG turns out to have IO
problems when you reboot, the regular pool metadata (and data) is
still fully intact and anything that made it into a committed
transaction group is on disk in the main pool. However you have
lost whatever was logged in the ZIL (well, the SLOG ZIL) since the
last committed transaction group; effectively you've rolled the
pool back to that transaction group, which will generally be a
rollback of a few seconds. In some circumstances this may be hard to tell
apart from the system crashing before applications even had a chance
to call fsync() to insure the data was on disk. In other situations,
such as NFS fileservers, the server may have already told clients
that the data was safe and they'll be quite put out to have it
silently go missing.
Because the main pool metadata and data is intact, ZFS allows you to import pools that have lost their SLOG, even if they were shut down uncleanly and data has been lost (I assume that this may take explicit sysadmin action). Thus loss of an SLOG doesn't mean loss of a pool. Further, as far as I know if the SLOG dies while the system is running you still don't lose data (or the pool); the system will notice the SLOG loss and just stop writing the ZIL to it. All data recorded in the SLOG will be in the main pool once the next TXG commits.
So the situation where you will lose some data is if you have both a system crash (or power loss) and then a SLOG failure when the pool comes back up (or the SLOG fails and then the system crashes before the next TXG commit). Ordinary SLOG failure while the system is running is okay, as is 'orderly' SLOG loss if the pool goes down normally and then comes back without the SLOG. If you assume that system crashes and SLOG device failures are uncorrelated events, you would have to be very unlucky to have both happen at once. In short, you need a simultaneous loss situation in order to lose data.
This brings me to power loss protection for SSDs. Losing power will obviously 'crash' the system before it can commit the next TXG and get acknowledged data safely into the main pool, while many SSDs will lose some amount of recent writes if they lose power abruptly. Thus you can have a simultaneous loss situation if your SLOG SSDs don't have supercaps or some other form of power loss protection that lets them flush data from their onboard caches. It's worth noting that mirroring your SLOG SSDs doesn't help with this; power loss will again create a simultaneous loss situation in both sides of the mirror.
(In theory ZFS issues cache flush commands to the SSDs as part of writing the ZIL out and the SSDs should then commit this data to flash. In practice I've read that a bunch of SSDs just ignore the SATA cache flush commands in the name of turning in really impressive benchmark results.)
PS: This is what I've gathered from reading ZFS mailing lists and so on, and so some of it may be wrong; I welcome corrections or additional information. I'm definitely going to do my own testing to confirm things on our specific version of OmniOS (and in our specific hardware environment and so on) before I fully trust any of this, and I wouldn't be surprised to find corner cases. If nothing else, I need to find out what's involved in bringing up a pool with a missing or failed SLOG.
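My current understanding (which is one of the things I want to verify) is that the import itself goes roughly like this:

    # A pool that went down uncleanly and is missing its log device will
    # refuse a plain import; -m imports it anyway, accepting that anything
    # that was only in the ZIL is gone ('tank' is a stand-in pool name):
    zpool import -m tank
    # Afterwards the dead log device (a stand-in device name here) can be
    # dropped from the pool configuration:
    zpool remove tank c9t0d0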
2014-12-22
The future of OmniOS here if we can't get 10G-T working on it
When I wrote about our long road to getting 10G in production on OmniOS after our problems with it, I mentioned in an aside that the pessimistic version of when we might get our new fileserver environment back to 10G was 'never' and that that would have depressing consequences. Today I've decided to talk about them.
From the start, one of my concerns with Illumos has been hardware support. A failure to get our OmniOS fileservers back to 10G-T would almost certainly be a failure of hardware support, where either the ixgbe driver didn't get updated or the update didn't work well enough. It would also be specifically a failure to support 10G. Both have significant impacts on the future.
We can, I think, survive this generation of fileservers without 10G, although it will hurt (partly because it makes 10G much less useful in other parts of our infrastructure and partly because we spent a bunch of money on 10G hardware). I don't think we can survive the next generation without 10G; in four years 10G-T will likely be much more pervasive and I'm certainly hoping that big SSDs will be cheap enough that they'll become our primary storage. SSDs over 1G networking is, well, not really all that attractive; once you have SSD data rates, you really want better than 1G.
That basically means the next generation of fileservers could not be OmniOS (maybe unless we do something really crazy); we would have to move to something we felt would give us 10G and the good hardware support we hadn't gotten from Illumos. The possibility of going to a non-Illumos system in four years obviously drains off some amount of interest in investing lots of time in OmniOS now, because there would be relatively little long term payoff from that time. The more we think OmniOS is not going to be used in the next generation, the more we'd switch to running OmniOS purely in maintenance mode.
To some extent all of this kicks into play even if we can move OmniOS back to 10G but just not very fast. If it takes a year or two for OmniOS to get an ixgbe update, sure, it's nice to be running 10G-T for the remainder of the production lifetime of these fileservers, but it's not a good omen for the next generation because we'd certainly like more timely hardware support than that.
(And on bad omens for general hardware support, well, our version of OmniOS doesn't even seem to support the 1G Broadcom ports on our Dell R210 servers.)
Sidebar: I'm not sure if Illumos needs more development in general
I was going to say that lagging hardware support could also be a bad omen for the pace of Illumos development in general, but I'm not sure Illumos actually needs general development (from our perspective). Right now I'm assuming that the great 'ZFS block pointer rewrite' feature will never happen, and I'm honestly not sure there are many other improvements we'd really care very much about. DTrace, the NFS server, and the iSCSI initiator all seem to work fine, I no longer expect ZFS to get any sort of API, and I don't think ZFS is missing any features that we care about very much (and we haven't particularly tripped over any bugs).
(ZFS is also the most likely thing to get further development attention and bugfixes, because it's currently one of the few big killer features of Illumos for many people.)
2014-12-19
Our likely long road to working 10G-T on OmniOS
I wrote earlier about our problems with Intel 10G-T on our OmniOS fileservers and how we've had to fall back to 1G networking. Obviously we'd like to change that and go back to 10G-T. The obvious option was another sort of 10G-T chipset besides Intel's. Unfortunately, as far as we can see Intel's chipsets are the best supported option; Broadcom, for example, seems even less likely to work well or at all (indeed, we later had problems with even a Broadcom 1G chipset under OmniOS). So we've scratched that idea; at this point it's Intel or bust.
We really want to reproduce our issues outside of production. While we've set up a test environment and put load on it, we've so far been unable to make it fall over in any clearly networking related way (OmniOS did lock up once under extreme load, but that might not be related at all). We're going to have to keep trying in the new year; I don't know what we'll do if we can't reproduce things.
(We also aren't currently trying to reproduce the dual port card issue. We may switch to this at some point.)
As I said in the earlier entry, we no longer feel that we can trust the current OmniOS ixgbe driver in production. That means going back to production needs an updated driver. At the moment I don't think anyone in the Illumos community is actively working on this (which I can't blame them for), although I believe there's some interest in doing a driver update at some point.
It's possible that we could find some money to sponsor work on updating the ixgbe driver to the current upstream Intel version, and so get it done that way (assuming that this sort of work can be sponsored for what we can afford, which may be dubious). Unfortunately our constrained budget situation means that I can't argue very persuasively for exploring this until we have some confidence that the current upstream Intel driver would fix our issues. This is hard to get without at least some sort of reproduction of the problem.
(What this says to me is that I should start trying to match up driver versions and read driver changelogs. My guess is that the current Linux driver is basically what we'd get if the OmniOS driver was resynchronized, so I can also look at it for changes in the areas that I already know are problems, such as the 20msec stall while fondling the X540-AT2 ports.)
While I don't want to call it 'ideal', I would settle for a way to reproduce the dual card issue with simple artificial TCP network traffic. We could then change the server from OmniOS to an up-to-date Linux to see if the current Linux driver avoids the problem under the same load, then use this as evidence that commissioning an OmniOS driver update would get us something worthwhile.
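As a sketch of what I have in mind, the load generation side could be as simple as several clients running iperf against listeners on the test server (iperf is just my assumption here; any bulk TCP traffic generator would do, and the addresses are placeholders):

    # On the OmniOS test server, two listeners on different TCP ports:
    iperf -s -p 5001 &
    iperf -s -p 5002 &
    # On each load-generating client, hammer the addresses of the card's
    # two ports with several parallel streams for an hour:
    iperf -c 10.0.1.5 -p 5001 -P 8 -t 3600 &
    iperf -c 10.0.2.5 -p 5002 -P 8 -t 3600 &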
None of this seems likely to be very fast. At this point, getting 10G-T back in six months seems extremely optimistic.
(The pessimistic view of when we might get our new fileserver environment back to 10G-T is obviously 'never'. That has its own long-term consequences that I don't want to think about right now.)
Sidebar: the crazy option
The crazy option is to try to learn enough about building and working on OmniOS so that I can build new ixgbe driver versions myself and so attempt either spot code modifications or my own hack testing on a larger scale driver resynchronization. While there is a part of me that finds this idea both nifty and attractive, my realistic side argues strongly that it would take far too much of my time for too little reward. Becoming a vaguely competent Illumos kernel coder doesn't seem like it's exactly going to be a small job, among other issues.
(But if there is an easy way to build new OmniOS kernel components, it'd be useful to learn at least that much. I've looked into this a bit but not very much.)
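For my own future reference, my rough and unverified understanding is that once you have a full illumos-gate build done (so the build tools and proto area exist), rebuilding a single kernel module looks roughly like this; every detail here is a guess to be checked:

    # Guesswork, not verified: incremental rebuild of just the ixgbe module
    # inside an already-built illumos-gate workspace.
    ./usr/src/tools/scripts/bldenv.sh illumos.sh   # enter the build environment
    cd usr/src/uts/intel/ixgbe                     # the ixgbe kernel module
    dmake all                                      # rebuild the module
    dmake install                                  # put it into the proto area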
2014-11-16
We've started to decommission our Solaris 10 fileservers
Our migration from our old fileservers to
our new fileservers has been a slow process
that's hit some rough spots. Still, we've
hit a distinct point that I want to mark: this past week we reached
the point where we did 'zpool destroy' on one old fileserver's
old pools and powered down its backend disks.
While we can migrate rapidly when we need to, our decommissionings usually go slowly. Unless we need the rack space or the hardware for something, we mostly leave old servers and so on running until they get annoying for some reason. This makes the whole thing a gradual process, instead of the big bang that I expect some people have. In our case we actually started losing bits of the old fileserver environment almost a year ago, when we started removing bits of our old test environment. More recently we used our hot spare fileserver as a hardware donor during a crisis; we're unlikely to replace it, although we theoretically could.
(In fact we decommissioned this particular fileserver mostly because it was getting annoying. One of the disks in one of its backends failed, causing us to get plaintive email about the situation, so we decided that we wanted to power off all of the remaining disks to preserve them as potential future spares. Destroying the pools instead of just exporting them insures that the disks won't be seen as still in use by a ZFS pool if we reuse them later.)
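For the record, the difference is small in terms of commands but matters for what's left on the disks (the pool name is a stand-in):

    # What we did; the disks no longer look like part of any ZFS pool:
    zpool destroy oldpool
    # What we would do if we wanted to keep the pool around; the pool stays
    # intact on the disks and can be brought back later, which also means
    # the disks still advertise themselves as belonging to it:
    zpool export oldpool
    zpool import oldpool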
On the one hand, this is a trivial step and the fileserver had
already had its last filesystems migrated a week or so before this
(and we've destroyed some migrated filesystems for various reasons,
which is far more irreversible than 'zpool destroy'). On the other
hand, it simply feels significant. In a way, it makes the whole
thing real; we've decisively torn down what used to be a production
fileserver. I can now really believe in a relatively near future
where we have no Solaris 10 machines any more.
(It won't be an immediate future; given recent issues all remaining filesystem migrations to the new fileservers are probably on hold until January. In general we try not to do too much in late November and early December due to the university's long holiday around Christmas and the end of the year.)
2014-11-15
Our current problems with 10G Intel networking on OmniOS
In my writeup on our new OmniOS fileservers I mentioned that we had built them out with 10G-T networking for their iSCSI networking (using onboard Intel X540-AT2 based ports) and their public NFS interface (using one port of a dual-port Intel 82599EB TN card). Since then, well, things have not gone so well and in fact we're in the process of moving all production fileservers to 1G networking until we can understand what's going on and we can fix it.
The initial problems involved more or less total server lockups on our most heavily used fileserver. Due to some warning messages on the console and previous weird issues with onboard ports, we added a second dual-port card and moved the iSCSI networks to them. We also had iSCSI networking issues on two other servers, one of which was also switched to use a second dual-port card for iSCSI networking.
(At this point the tally is two fileservers using the onboard ports for 10G iSCSI and two fileservers using second dual-port cards for it.)
The good news is that the fileservers mostly stopped locking up at this point. The bad news is that both actively used dual-port cards wound up getting themselves into a state where the ixgbe driver couldn't talk properly to the second port, and this had very bad effects, most specifically extremely long kernel spinlock hold times. At first we saw this only with the first card that had been replaced, on our most-used fileserver, so it was possible for me to believe that this was just a hardware fault (after all, the second port was working fine on the less used fileserver). Today we had exactly the same issue appear on the other fileserver, so it seems extremely likely that there is some combination of a driver bug and a hardware bug involved, one that is probably more and more likely to manifest as you pass more traffic through the ports.
(On top of that problem, we also found a consistent once a second 20msec lock hold time and stall in the ixgbe driver when dealing with those onboard X540-AT2 ports. Interested parties are directed to this email to the illumos-developer mailing list for full details about both issues. Note that it was written when I still thought the big stall might be due to faulty hardware on a single card.)
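If you want to watch for this sort of thing on your own systems, lockstat is the standard Illumos tool for it; these particular invocations are generic suggestions rather than exactly what we ran:

    # Top kernel lock hold-time events over ten seconds, with stack traces:
    lockstat -H -s 8 -D 10 sleep 10
    # A kernel profiling view, aggregated by calling function:
    lockstat -kIW -D 10 sleep 10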
My understanding is that the Illumos (and thus OmniOS) ixgbe driver is derived from an upstream general Intel driver through a process that must be done by hand and apparently has not happened for several years. At this point enough bad stuff has shown up that I don't think we can trust the current OmniOS version of the driver and we probably don't want to try Intel 10G-T again until it's updated. Unfortunately I don't have any idea if or when that will happen.
(It also seems unlikely that we'll find any simple or quick reproduction for any of these problems in a test environment. My suspicion is that the dual-port issue is due to some sort of narrow race window involving hardware access, so it may depend not just on total traffic volume but on the sort of traffic you send.)