2010-08-29
How many uberblock copies ZFS writes on transaction group commits
The ZFS uberblock is the root of the tree of ZFS objects in a pool (technically it is more a pointer to the root, but it's how ZFS finds everything). Because it is the root, it has to be updated on every ZFS update. Also, because it is so important ZFS keeps quite a lot of copies of it (and has a clever 'copy on write' scheme for it, so that it is not quite updated in place).
Given the impact of writes on mirrors, it becomes interesting to ask how many copies of the uberblock ZFS actually writes out. First off, the raw numbers: there are four copies of the uberblock on every physical disk in the pool, stored as part of the four ZFS label areas (two at the start of the disk and two at the end).
At the top level, there are two cases for uberblock updates. If the vdev or pool configuration is considered 'dirty', the uberblock is synced to all vdevs. If the pool configuration is unchanged, the uberblock is synced to at most three vdevs, chosen by picking a random start point in the list of top level vdevs. If your pool only has three (or fewer) top level vdevs, the two cases are equivalent.
(My powers of reading the OpenSolaris source code are not quite up to determining exactly what makes the pool configuration dirty. Certainly adding or removing devices does it, and I think that devices changing their state does as well. There are likely other triggers too.)
When ZFS syncs the uberblock to a top level vdev, it writes copies of the new uberblock to all physical devices in the vdev (well, all currently live devices): to every side of a mirrored vdev and to every disk in a raidzN vdev. Syncing the uberblock to a physical device involves four separate writes. But wait, we're not done. To actually activate the new uberblock, the ZFS labels themselves must be updated, which is done separately from writing the uberblocks. This takes another four writes for each physical disk that is having its uberblock(s) updated.
So the simple answer is that if your pool has three or fewer top level vdevs, you will do eight writes per physical device every time a transaction group commits, just to write the uberblocks out. Fortunately transaction groups don't commit very often.
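(To make the arithmetic concrete, here is a back-of-the-envelope sketch in C. This is not ZFS code; the pool layout is a made-up example of three mirrored top level vdevs with two disks each, and the per-disk write counts are simply the figures from above.)

#include <stdio.h>

int
main(void)
{
    /* Hypothetical pool layout: three top level mirror vdevs,
       each with two disks. */
    int toplevel_vdevs = 3;
    int disks_per_vdev = 2;

    /* Per-disk figures from the entry: four uberblock writes plus
       four label writes per transaction group commit. */
    int uberblock_writes = 4;
    int label_writes = 4;

    int disks = toplevel_vdevs * disks_per_vdev;
    int total = disks * (uberblock_writes + label_writes);

    printf("%d physical disks, %d writes per txg commit\n",
           disks, total);
    return 0;
}

For that layout it works out to 48 writes per commit.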
Sidebar: the uberblock update sequence
The best documentation for the logic of the uberblock update sequence
is in vdev_config_sync() in uts/common/fs/zfs/vdev_label.c.
The short version is that it goes:
- update the first two labels. They now do not match any uberblock and are thus invalid, but the last two labels are still good.
- write out all of the new uberblocks. If this works, the first two labels are valid; if it doesn't work, the last two labels are still pointing to valid older uberblocks.
- update the last two labels.
(ZFS labels include what transaction group (aka 'txg') they are valid for. They do not explicitly point to an uberblock. Uberblocks contain a transaction group number and a pointer to that txg's metaobject set (MOS), which is the real root of the ZFS pool.)
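(As an illustration, here is a small self-contained C sketch of that ordering. The helper functions are hypothetical stand-ins that just print what would happen at each step; they are not the real ZFS routines, and error handling is left out entirely.)

#include <stdio.h>

/* Hypothetical stand-in: rewrite one pair of labels on every affected
   disk, then flush the disk write caches. */
static void
write_labels(const char *which, unsigned int txg)
{
    printf("txg %u: rewrite %s labels, then flush\n", txg, which);
}

/* Hypothetical stand-in: write all four uberblock copies on every
   affected disk, then flush. */
static void
write_uberblocks(unsigned int txg)
{
    printf("txg %u: write all uberblock copies, then flush\n", txg);
}

/* The three-phase ordering described above (my paraphrase of
   vdev_config_sync(), not the real function). */
static void
config_sync(unsigned int txg)
{
    /* 1: labels 0 and 1 now claim this txg but match no uberblock. */
    write_labels("the first two (L0, L1)", txg);
    /* 2: after this, labels 0 and 1 are valid; if it fails, labels
       2 and 3 still describe the previous consistent state. */
    write_uberblocks(txg);
    /* 3: bring labels 2 and 3 up to date so all four agree again. */
    write_labels("the last two (L2, L3)", txg);
}

int
main(void)
{
    config_sync(42);
    return 0;
}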
2010-08-23
Why I hate the Solaris 10 version of /bin/sh
Every so often I discover some exceptional piece of braindamage in Solaris. Here is one of them, presented in handy illustrated form:
$ cat cdtest
#!/bin/sh
cd /not/there || echo failed
echo got here
$ sh cdtest
./cdtest: /not/there: does not exist
(No other output is produced.)
Oh yes, Solaris. Killing my script when a cd fails is just what
I want you to do, especially when this behavior is undocumented and
cannot be turned off. I especially like it when you do this despite
me making every attempt to handle the error.
It goes without saying that this behavior is in no way POSIX standard and in no way matches either the historical behavior of the Bourne shell or how the Bourne shell behaves on other systems.
I think that all of our Solaris shell scripts are about to change from
'#!/bin/sh' to '#!/usr/bin/bash'. I'm not a huge fan of Bash, but
at least it doesn't contain helpful landmines that blow up production
scripts.
2010-08-16
What OpenSolaris's death probably means for us
For those of you who haven't heard, OpenSolaris is dead; see the leaked internal memo for details (via here, here, here, or even Slashdot). In a way, I can't say that I'm too surprised, but I am rather unhappy.
We don't run OpenSolaris in production so in one sense we aren't directly affected. But there are at least two areas where this is likely to hurt us, possibly quite badly. The big one is what is happening with OpenSolaris source code. To quote the memo, with the emphasis mine:
We will distribute updates to approved CDDL or other open source-licensed code following full releases of our enterprise Solaris operating system. [...] We will no longer distribute source code for the entirety of the Solaris operating system in real-time while it is developed, on a nightly basis.
At this point it is not clear whether 'Solaris X update Y' will be considered a full release or not. If it is not, well, the last full release of Solaris was Solaris 10 at the start of 2005, which didn't even have ZFS, and the next one is theoretically due sometime in 2011. This would make OpenSolaris source code completely useless for trying to investigate Solaris's behavior, which we have had to do periodically, especially when there are problems. Given our bad support experiences, being unable to take a good shot at diagnosing things ourselves is a serious problem.
(This is especially serious because libzfs is an undocumented interface
and official commands like zpool are unsuitable for a number of the
things we do here. We have a number of internal tools and systems that
could not have been built without access to relatively current ZFS
code and which are in fact rapidly going to get less useful as the ZFS
interfaces change, but we can't find out about the changes because the
source code is no longer accessible.)
The other area is the loss of OpenSolaris binary builds. I care about
this because these builds used to be the best and in fact only real way
of getting access to ZFS debugging and repair tools; the ability to do
any number of things to problematic pools appeared first in builds,
often well in advance of its availability in official Solaris patches.
We once came quite close to temporarily importing production pools into
an OpenSolaris scratch environment in order to fix them up, and there
are plausible future cases where we'd need to do this. We've now lost
access to this emergency fix mechanism; instead, our only recourse is
the tender mercies of Oracle (formerly Sun) support.
2010-08-04
Good Solaris news from Oracle for once
For once, there are two pieces of good Solaris news from Oracle. The first is that Oracle has announced an agreement with Dell and HP that makes Solaris officially supported on their hardware; you will be able to run it legally and you will be able to get support, including patch access. The one cautious proviso is that we don't yet know whether this covers all of HP's and Dell's servers or only some of them.
(The Oracle press release about this is here.)
For me, the important thing about this deal is not so much that we can run Solaris on (some) third-party hardware but that we once again have a source of affordable small servers that we can legally run Solaris on. Oracle's apparent termination of all of Sun's inexpensive 1U and 2U server line left a big hole at the bottom of the Solaris hardware range, a hole that has now (probably) been plugged.
(I'll miss the Sun Fire ILOMs, but I can see why Oracle got out of the market; the small server segment is cut-throat competitive and I don't think anyone is making very much money there.)
Second, a commentator on my recent entry pointed out that Oracle now appears to be letting people using white box hardware buy Solaris support, as covered on this Oracle web page. The listed terms are very expansive and general; any system listed on the Solaris Hardware Compatibility List qualifies, and inspection of the list shows lots of servers from people other than Dell and HP.
(Of course, the actual mileage you get from your Oracle sales person may vary. But people have apparently been able to buy Solaris support on various third party hardware, and Oracle does appear to say that existing support contracts will still be honored if a system is removed from the HCL.)
The Dell and HP deal got wide coverage, but the general change in Oracle's Solaris support policy seems to have been far less publicized (or perhaps it was big news in places I don't read). This does make me a bit nervous that Oracle is going to change it again, given their past behavior, but for now I'm cautiously optimistic. At this rate, Solaris may be back in our long term future.
(Before this news, it seemed likely that we would simply be priced out of the Solaris market because we couldn't run Solaris on anything except very expensive servers since Oracle wasn't making inexpensive ones any more.)
2010-08-01
The consequence of Oracle's Solaris decisions
The title of this is a little bit misleading; ultimately, the really important thing about Oracle's Solaris decisions is not what those decisions are but how they were implemented. I don't like the decisions, but the real problem is their abrupt, no-warning implementation.
Let me be completely clear here: by making these decisions, Oracle screwed people. By implementing them with no advance warning, Oracle screwed people harder.
The consequence of this is that Oracle has lost my trust. I can't really trust companies that blatantly and abruptly screw people, because someday they may screw us in a similarly charming way. (We are mostly but not entirely unaffected by all of the Oracle changes, at least so far. The need to add that proviso is a demonstration of the problem.)
If you can't trust Oracle any more, Solaris stops being a stable platform to build things on top of. Who knows when the next change to patch entitlement will come along, for example, or what it will be? Even if your existing systems are covered by some grandfather clause, being unable to deploy new systems is a crippling limitation in many environments.
(You may still choose to build things on top of Solaris, but now you are clearly taking a risk and I happen to think that it's a not insignificant one. It may still be worth it, or you may be stuck in a situation where you have no real choice, but either way people are not likely to enjoy the experience very much.)
Sidebar: how Oracle's decisions screw people
If you were running Solaris on non-Sun hardware, the new need for a support contract in order to stay secure plus your inability to get a support contract on non-Sun hardware has left you very screwed.
If you were running Solaris on Sun hardware, the surprise need to budget a potentially significant chunk of money as an ongoing yearly cost may have left you screwed. This is especially likely to be the case at universities, where it is much easier to get one-time money (to buy the hardware) than to get ongoing money (to buy the service) and researchers do not necessarily have any extra unallocated money sitting around that they can spend freely.
(If this was not a surprise, you could bundle some years of support into the one-time purchase price of the hardware (this is common with one-time grant funding money). But surprise ongoing money? That's quite hard to come up with, even with grant funding.)