2014-02-09
Why I'm not looking at installing OmniOS via Kayak
As far as I can see, OmniOS's Kayak network install system is the recommended way to both install a bunch of OmniOS machines and to customize the resulting installs (for example to change the default sizes of various things). However, even setting aside my usual issues with automatic installers (which can probably be worked around) I've found myself uninterested in trying to use Kayak. The core problem for me is that Kayak only seems to really be supported on an OmniOS host.
The blunt truth is that we're not going to use OmniOS for very much here. It's going to run our fileservers, but while those are very important machines there are only a handful of them. I don't want to have to set up and manage an additional OmniOS machine (and a bunch of one-off infrastructure on that machine) simply to install a handful of fileservers with custom parameters and some additional conveniences. The cognitive overhead is simply not worth it. Things would be different if I could confidently host a Kayak system on an Ubuntu machine, as we have plenty of Ubuntu machines and lots of systems in place for running them.
I'm aware that there's some documentation for hosting Kayak on a Linux system. Unfortunately 'here's what someone tried once and got working' is not anywhere near as strong as 'we officially try to make this work and here is information on the general things Kayak needs and how it all works'. One of them means that people will take bug reports and the other one implies that if things break I'm basically on my own. I'm not putting crucial fileserver infrastructure into a 'you're on your own' setup; it would be irresponsible.
(Well, it would be irresponsible to do it when we don't have a relatively strong need to do so. We don't here, as manual OmniOS installs are basically as good as a Kayak install and are considerably less complex overall.)
2014-02-07
A surprise with OmniOS disk sizing: the rpool/dump ZVOL
Until recently, all of my work installing and reinstalling OmniOS has been in a virtual machine, which had a modest configuration. Recently I started doing test installs on our real fileserver hardware, which resulted in a disk sizing surprise in the default OmniOS install.
Our fileserver hardware has 64 GB of RAM and an 80 GB SSD system
disk (well, two of them but they're mirrored). On this hardware the
OmniOS installer set up a dedicated 32 GB chunk of space for kernel
dumps (the rpool/dump ZFS volume). On the one hand, this is very
much not what I wanted; I definitely don't want half of a quite
limited amount of disk space to disappear to something which I
expect to never use. On the other hand you can argue that this is
at least half rational as an OmniOS default, since OmniOS itself
needs almost no disk space. If you probably need 32 GB of kernel
dump space for safety and the disk space is there, you might as
well use it for something.
(I call this only half rational because I'm not sure there's enough
space left in /var to write out a kernel dump that actually needs
all of that 32 GB of dump ZVOL, since both OmniOS itself and the
swap ZVOL use some of the remaining space.)
The normal installer image doesn't give you a convenient way to customize this, so my workaround is to simply delete the dump ZVOL after installation:
  dumpadm -d none
  zfs destroy rpool/dump
If our OmniOS systems ever start crashing in a situation where a
crash dump might be useful, we can revisit this and perhaps recreate
a rpool/dump ZVOL of some suitable size.
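If we do need to recreate it, I believe the process is simply the reverse, something like this (the 16 GB size is purely illustrative):

  zfs create -V 16G rpool/dump         # whatever size seems sufficient
  dumpadm -d /dev/zvol/dsk/rpool/dump  # point crash dumps back at the ZVOL

I haven't actually had to do this, so treat it as a sketch.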
(The OmniOS installer also defaulted to a 4 GB swap ZVOL. On the one hand this strikes me as excessive; on the other hand it's only 4 GB, and swapping to and from a SSD is going to be a bunch faster than a traditional hard disk swap so having to go to swap might not be quite as terrible as it used to be. We're leaving the swap ZVOL alone for now.)
2014-01-30
OmniOS (and by extension Illumos) is pretty much Solaris
I've been working on bringing our customized Solaris fileserver environment up on OmniOS, which is going to power the new generation of our fileservers. On our current Solaris fileservers we do things that range from relatively standard through fairly crazy, so one of my concerns with OmniOS was how easy these things would be to port over to it (especially the fairly crazy things).
I've been pleasantly surprised by the answer, which is that for the most part OmniOS is Solaris. In fact it is so much Solaris that for a while I was using our Solaris custom NSS shared library instead of bothering to recompile it on OmniOS (I did eventually because I wanted to be sure that we could still build it and have it work right on OmniOS; it seemed safer). This is what one would hope for, of course, but given my Linux experience and that there are changes from Solaris to Illumos (and OmniOS) it's still pleasant to see it happen.
The majority of the changes I've needed to make have been for two
reasons. First, the paths to various additional bits of software
have changed between Solaris and OmniOS. Partly this is because
OpenCSW doesn't do Illumos builds and we switched to pkgsrc, partly it's because
some things moved into OmniOS's core (eg rsync, which I applaud),
and partly it's because things just moved period (eg a bunch of
things on OmniOS hang out in /usr/gnu/bin).
Second, parts of our spares system fish around in system-dependent areas of ZFS in order to read out detailed information on pool state. There's still no stable public interface for this information and the format of stuff changed between Solaris 10 update 8 and the current version of OmniOS. With full OS source available this wasn't particularly difficult to deal with, and anyways we knew that this was coming (it was a big reason we never moved beyond S10U8).
(A minor change was that a bunch of our management tools used to assume that the only ZFS pools on the system were our pools. This is no longer true on OmniOS, where you also have the root ZFS pool. We want some of our tools to pay attention to the rpool, for example to scrub it regularly and tell us if it has problems, but we want others to ignore it, eg the spares system. We may be fiddling with this for a bit.)
PS: The whole OmniOS build is not yet completely tested and deployed on a test production fileserver, so I may yet find another significant source of changes. And our DTrace scripts will probably need some changes and updates because they play around in kernel data structures, some of which have changed between S10U8 and OmniOS. For that matter, DTrace itself has probably evolved since S10U8 and it may be worth revising some scripts to take advantage of that.
2014-01-14
The problem with OmniOS's 'build your own' view for Perl and Python
A commentator on my entry on Python 2's lifetime mentioned that OmniOS has adopted a principle called KYSTY. To simplify, the core of this principle is that you are not really supposed to use the OmniOS versions of things for your applications. To quote them:
We purposefully ship only what we need to build and run the OS. In many cases, these are not the most recent versions. Except in the most basic circumstance, you should not use these things in your app stack. At best, their versions are stagnant; at worst, they may go away entirely as the OS components they exist for are rewritten or removed.
There are two problems with this when it comes to things like Perl and Python. The first is a variant of the problem with compiling your own version of Python 3; requiring people to build or get their own versions of Perl and Python makes writing small utilities in those more expensive. In practice many people aren't going to do that and will instead use the system versions, with various consequences.
The larger problem is what this does to script portability in
heterogenous environments. Like it or not, there is a de facto Unix
standard about where to find at least Perl and Python and that is at
well known names in /usr/bin. Making those names not work right in
favour of, say, /opt/csw/bin/perl means that scripts that work on
OmniOS won't work on other machines and vice versa; each side's scripts
will have a #! line that points to something that is either wrong or
not there at all. This is not doing people any favours (and yes, some
people still run heterogenous environments with shared scripts).
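To make this concrete, the divergence shows up right in the #! lines (the /opt/csw path is just an illustrative add-on location):

  #!/usr/bin/perl        <- the de facto standard name on most Unixes
  #!/opt/csw/bin/perl    <- only exists on machines with that add-on software

A script written with either line quietly fails on machines that only have the other one.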
(The #!/usr/bin/env hack doesn't help because you can't count on
optional add-on directories being in $PATH in all cases and all
contexts. In fact I tend to think that it is a great recipe for real
problems but that's another rant.)
The OmniOS approach is arguably sensible for large applications with their own deployment infrastructure and workable if OmniOS is by itself in a homogenous environment. It is not appealing if OmniOS is in a heterogenous environment and what you're writing is glue scripts and utilities and so on.
If OmniOS wants to take KYSTY seriously it needs to not claim the
names /usr/bin/perl, /usr/bin/python, and so on. OmniOS Perl and
Python scripts should use a private OmniOS name for the OmniOS versions
of those and leave the official public names to sysadmins to set up
however they want (or not set up at all, so that you know that all of
your scripts have to be using your specific version of whatever).
2013-12-26
How ZFS scrubs routinely save us
A while back I wrote about how ZFS resilvering saved us and mentioned in passing that there are a number of ways that ZFS routinely saves us in the small. Today I want to talk about one of them, namely ZFS scrubs.
Put simply, ZFS scrubs are integrity checks of your ZFS pools. When you scrub a pool it checks and verifies all copies of all pool data to make sure that they're fully intact. When it finds a checksum inconsistency it will repair it; if things are really bad and it's not possible to repair it, it'll tell you what got damaged so you can restore it from backups. If a scrub discovers a read error it generally won't try to rewrite the data but it will at least tell you about it.
We regularly scrub our pools through automation. This periodically turns up transient checksum errors, which it also fixes. So this is the first little save; ZFS has detected and fixed potential data problems for us and it does it on an essentially ongoing basis. As a pragmatic thing the scrubs also check for read errors (although they can't fix them) and so give us early warning on disks we probably want to replace. They also give us a way to check if read errors are transient or permanent; we simply schedule a scrub and see if the scrub gets errors.
(A surprisingly large amount of the time the scrub does not, either because the error was genuinely transient or because whatever object was using the bad sector has been deleted since then.)
As a corollary, forcing an immediate scrub lets us find out if there are any latent problems (which can have many potential causes). It's routine for us to force scrubs after significant outage events, such as an iSCSI backend losing power, to make sure that no data got lost in the chaos.
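Mechanically all of this is just the standard ZFS commands, something like the following with 'tank' standing in for one of our real pool names; our automation mostly adds scheduling and reporting on top:

  zpool scrub tank       # start a scrub of the pool
  zpool status -v tank   # later, check progress and any errors found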
Of course it would be better if we didn't have checksum errors happen in the first place. But given that we have something going wrong, I'd much rather know about it and have it get fixed than not. ZFS does this for us without fuss or hassle, and that routinely saves us in the small.
(Much of this can be done by any RAID system with routine RAID array scans; Linux software RAID can do this, for example, and is often configured to do it. What is different about ZFS is that ZFS can tell which copy of inconsistent data is correct and which isn't. Other RAID systems have to just guess.)
(I've talked about the overall purposes of ZFS scrubs in an aside here.)
2013-12-18
Thinking about what we'll need for reproducible OmniOS installs
We have a thing here where we like all of our fileservers to be exactly identical to each other (apart from configuration information), even if they happen to be installed or reinstalled some time apart. In the old days of basically static systems this was easy to achieve; if you installed everything from the media for OS release X.Y, they all wound up the same. Today, in a world of (online) patches and package updates and security fixes and so on, this is not necessarily all that simple. This is especially so for free software whose distribution mirrors may not have either the disk space or the interest in keeping every package version ever released around in perpetuity (especially if those package versions are perhaps known to have security bugs).
(This generously assumes that the package management system can be used to install a very specific version of a set of packages even when there are newer versions available. Not all package systems are happy about that.)
OmniOS is installed from versioned install media and then updated and supplemented through package updates with IPS. In practice the OmniOS base package repository is not enough and we'll likely supplement it with pkgsrc to get various important things. IPS supports mirroring selected packages to your own local repository (although I don't know if this can easily be done on a non-Solaris platform). I haven't looked at pkgsrc but it can probably do something.
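For the IPS side, my understanding is that the mirroring is done with pkgrepo and pkgrecv, along these lines (the repository URL and package name here are illustrative only):

  pkgrepo create /ips/omnios-mirror
  pkgrecv -s https://pkg.omniti.com/omnios/release/ -d /ips/omnios-mirror rsync

I haven't actually set this up yet, so treat it as a sketch rather than a recipe.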
There are also two somewhat crazier options that I've thought of: building our own install media with current updates and packages rolled in, and ZFS level copies. Building our own install media would cut out a bunch of the update process although not all of it, since pkgsrc probably couldn't be bundled into the install media. As a result I probably won't look much at this. ZFS level copies are potentially much more promising.
OmniOS installs to a ZFS root pool, in which you have a core filesystem.
OmniOS already supports multiple iterations of this core filesystem
through ZFS snapshots, clones, and so on; it uses this to support having
multiple boot environments for rollback purposes. So what would happen
if we installed the base OmniOS and then instead of customizing it,
simply copied over the current root filesystem from a working system
with zfs send and zfs recv, modified or reset things like IP address
information, and then made this copy the current boot environment? This
could blow up spectacularly or it could be a great way to make an exact
duplicate of an existing system.
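A very rough sketch of what I mean, with made up dataset and BE names and glossing over all of the fiddly boot environment property details:

  # on the 'golden' fileserver
  zfs snapshot rpool/ROOT/omnios@golden
  zfs send rpool/ROOT/omnios@golden | ssh newfs zfs recv rpool/ROOT/omnios-golden

  # on the new fileserver, after fixing up host-specific bits
  beadm activate omnios-golden

Whether beadm would actually accept a boot environment created this way is exactly the sort of thing I'd have to find out.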
(One issue is actually finding all of the places where an OmniOS system
stashes things like IP addresses and other host-specific bits. In the
new world of state-manipulating commands like dladm and ipadm, they
are probably not stored in any simple text file. Another issue is making
this process play nice with the boot environment maintenance commands.)
Sadly the whole ZFS idea is probably working too hard at something we can solve in simpler, more brute force ways (at the cost of some more disk space). I'm not sure if it's a good use of my time to investigate it much.
(One excuse is that I should understand boot environment administration and so on much better than I currently do, and that all by itself would tell me a lot.)
2013-11-04
How writes work on ZFS raidzN pools, with implications for 4k disks
There is an important difference between how ZFS handles raidzN pools and traditional RAID-5 and RAID-6 systems, a difference that can have serious ramifications in some environments. While I've mentioned this before I've never made it explicit and clear, so it's time to fix that.
In a traditional RAID-5/6/etc system, all stripes are full width, ie
they span all disks (in fact they're statically laid out and you can
predict which disks are data disks and which are parity disks for any
particular stripe). If you write or rewrite only part of a stripe, the
RAID system must do some variant of a read-modify-write cycle, updating
at least one data disk and N parity disks.
In ZFS stripes are variable size and hence span a variable number of
disks (up to the full number of disks for data plus parity). Layout is
variable and how big a stripe is depends on how much data you're writing
(up to the dataset's recordsize). To determine how many disks a given
data block write needs, you basically divide the size of the data by
the fundamental sector size of the vdev (ie its ashift), possibly or
likely wrapping around once the write gets big. There are no in-place
updates of existing stripes.
(This leads to the usual ZFS sizing suggestion for how many disks should
be in a raidzN vdev. Basically you want a full block to be evenly
divided over all of the data disks, so with the usual 128kb recordsize
you might want 8 disks for the data plus N disks for the parity. This
creates even disk usage for full sized writes.)
In the days of disks with 512 byte physical sectors it didn't take much data being written to use all of the vdev's disks; even a 4kb write could be sliced up into eight 512-byte chunks and thus use eight data disks (plus N more for parity). You might still have some unevenness, but probably not much. In the days of 4k sector disks, things can now be significantly different. In particular if you make a 4kb write it takes one 4kb sector on one disk for the data and then N more 4kb sectors on other disks for the parity. If you have a raidz2 vdev and write only 4kb blocks (probably as random writes) you will write twice as many blocks for parity as for data, for a write amplification ratio for your data of 3 to 1 (you've written 4kb at the user level, the disks write 12kb). Even a raidz1 vdev has a 2x write amplification for 4k random writes.
(What may make this worse is that I believe that a lot of ZFS metadata is likely to be relatively small. On a raidzN vdev using 4k disks, much of it may not use all disks and thus suffer some degree of write amplification.)
The short way to put this is in ZFS the parity overhead varies depending on your write blocksize. And on 4k sector disks it may well be higher than you expect.
There are some consequences of this for 4k sector drives. First, the
larger your raidzN vdevs are (in terms of disks) the larger the writes
you need in order to use them all and reduce the actual overhead of
parity. Second, if you want to minimize parity overhead it's important
to evenly divide data between all disks. If you roll over, using two 4k
sectors for data on even one disk, ZFS needs two 4k sectors for parity
on each parity disk. Since in real life your writes are probably going
to be of various different sizes (and then there's metadata), 4k sector
disks and ashift=12 will likely have higher parity overheads than 512b
sector disks, and in general higher than what you would expect from RAID-5/6/etc.
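To put some rough numbers on this, consider a hypothetical six-disk raidz2 vdev (four data disks plus two parity) with ashift=12, ignoring padding sectors and other allocation details:

  4 KB write:    1 data sector  +  2 parity sectors =  12 KB on disk (3x)
  16 KB write:   4 data sectors +  2 parity sectors =  24 KB on disk (1.5x)
  20 KB write:   5 data sectors +  4 parity sectors =  36 KB on disk (1.8x)
  128 KB write: 32 data sectors + 16 parity sectors = 192 KB on disk (1.5x)

The 20 KB case is the roll-over effect in action: one extra data sector on one disk costs a full extra row of parity.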
I don't know if this makes ZFS raidzN less viable these days. Given the read performance issues, it was probably always only really suitable for slow(er) bulk data storage outside of special situations.
2013-11-02
Revising our peculiar ZFS L2ARC trick
Here is a very smart question that my coworkers asked me today: if we have an L2ARC that's big enough to cache basically the entire important bit of one pool, is there much of a point to having that pool's regular data storage on SSDs? After all, basically all of the reads should be satisfied out of the L2ARC so the read IO speed of the actual pool storage doesn't really matter.
(Writes can be accelerated with a ZIL SLOG if necessary.)
Our current answer is that there isn't any real point to using SSDs instead of HDs on such a pool, especially in our architecture (where we have plenty of drive bay space for L2ARC SSDs). In current ZFS the L2ARC is lost on reboots (or pool exports and imports) and has to be rebuilt over time as you read from the regular pool vdevs, but for us these are very rare events anyways; most of our current fileservers have uptimes of well over a year. You do need enough RAM to hold the L2ARC index metadata in memory but I think our contemplated fileserver setup will have that.
(The one uncertainty over memory is to what degree other memory pressure (including from the regular ZFS ARC) will push L2ARC metadata out of memory and thus effectively drop things from the L2ARC.)
Since I just looked this up in the Illumos kernel sources, L2ARC
header information is considered ARC metadata and ARC metadata is
by default limited to one quarter of the ARC (although the ARC
can be most of your memory). If you need to change this, you want
the tunable arc_meta_limit. To watch how close to the limit
you're running, you want to monitor arc_meta_used in the ARC
kernel stats. The current size of (in-memory) L2ARC metadata is
visible in the l2_hdr_size kstat.
(What exactly l2_hdr_size counts depends on the Illumos version.
In older versions of Illumos I believe that it counts all L2ARC
header data even if the data is currently in the ARC too. In modern
Illumos versions it's purely for the headers of data that's only
in the L2ARC, which is often the more interesting thing to know.)
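(In concrete terms, on OmniOS I believe watching all of this looks something like the following; the value in the mdb line is purely illustrative:

  kstat -p zfs:0:arcstats:arc_meta_used
  kstat -p zfs:0:arcstats:arc_meta_limit
  kstat -p zfs:0:arcstats:l2_hdr_size

  # if you really need to raise the limit on a live system, eg to 8 GB:
  echo "arc_meta_limit/Z 0x200000000" | mdb -kw

Treat the mdb bit especially as a sketch; I haven't actually had to do this yet.)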
2013-10-18
ZFS uberblock rollback and the top level metadata change rate
ZFS keeps lots of copies of a pool's uberblock; on a standard pool on disks with 512 byte sectors, you will have at least 127 old uberblocks. In an emergency ZFS will let you roll back to a previous uberblock. So clearly you have a lot of possibilities for rollback, right? Actually, no. You have far less than you might think. The root problem is a misconception about the rate of change in pool and filesystem metadata.
In a conventional filesystem implementation, top level metadata changes infrequently or rarely for most filesystems; generally things like the contents of the filesystem's root directory are basically static. Even if you know that your filesystem is copy-on-write (as ZFS is) you might expect that since the root directory changes rarely it won't be copied very often. This feeds the idea that most of those 127 uberblocks will be pointing to things that haven't been freed and reused yet, in fact perhaps often the same thing.
This is incorrect. Instead, top level ZFS metadata is the most frequently changing thing in your ZFS pool and as a result old top level metadata gets freed all the time (although it may not get reused immediately, depending on pool free space, allocation patterns, and so on). What causes this metadata churn is block pointers combined with the copy on write nature of ZFS. Every piece of metadata that refers to something else (including all directories and filesystem roots) does so by block address. Because ZFS never updates anything in place, changing one thing (say a data block in a file) changes its block address, which forces a change in the file's metadata to point to the new block address and which in turn changes the block address of the file's metadata, which needs a change in the metadata of the directory the file is in, which forces a change in the parent directory, and so on up the tree. The corollary of this is that any change in a ZFS pool changes the top level metadata.
The result is that every new uberblock written has a new set of top level metadata written with it, the meta-object set (MOS). And the moment a new uberblock is written the previous uberblock's MOS becomes free and its blocks become candidates to be reused (although not right away). When any of the MOS blocks do get reused, the associated uberblock becomes useless. How fast this happens depends on many things, but don't count on it not happening. ZFS snapshots of filesystems below the pool's root definitely don't preserve any particular MOS, although they do preserve a part of the old metadata that MOS(es) point to. I'm not sure that any snapshot operation (even on the pool root) will preserve a MOS itself, although some might.
(It would be an interesting experiment to export a non-test ZFS pool and then attempt to see how many of its uberblocks still had valid MOSes. My suspicion is that on an active pool, a lot would not. For bonus points you could try to determine how intact the metadata below the MOS was too and roughly how much of the resulting pool you'd lose if you imported it with that uberblock.)
PS: I've alluded to this metadata churn before in previous entries but I've never spelled it out explicitly (partly because I assumed it was obvious, which is probably a bad idea).
2013-10-13
Revisiting some bits of ZFS's ZIL with separate log devices
Back in this entry I described how the ZFS ZIL may or may not put large writes into the ZIL itself depending on various factors (how big they are, whether you're using a separate ZIL device, and so on). It turns out that I missed one potentially important factor, in fact one that affects more than large writes.
If you're using a separate log device, ZFS will normally put all write
data into the ZIL (on the presumption that flushing data to the SLOG
is faster than flushing it to the regular pool) and will then put the
ZIL on your separate log device (unless you've turned this off with the
logbias property). However this only applies if the log is not 'too
big'.
What's 'too big'? That's the tunable zil_slog_limit, expressed
in bytes, but how it gets used is a little bit obscure. First, let's
backtrack to the overall ZIL structure. Each on disk
ZIL is made up of some number of ZIL commits; these commits are cleaned out
over time as transaction groups push things into stable storage on the
pool. This gives us two sizes: the size of the current ZIL commit that's
being prepared and the total size of the (active) on disk ZIL at the
moment.
What zil_slog_limit does is turn off use of the SLOG for large ZIL
commits or large total ZIL log sizes. If the current ZIL commit is
over zil_slog_limit or the current total ZIL log size is over twice
zil_slog_limit, the ZIL commit is not written to your SLOG device
but instead is written into the main pool. The default value of this
tunable appears to be only one megabyte, which really startles me.
But wait, things get more fun. In ZFSWritesAndZIL I described how large writes are put directly into the ZIL if you have a separate log device, on the presumption that your SLOG is much faster than your actual disks. That decision is independent from the decision of whether your ZIL commit will be written to the SLOG or to your real disks (really, the code only checks 'does this have a SLOG?'). It appears to be quite possible to have a SLOG, have relatively large writes be put into a ZIL commit, and then have this ZIL commit written (relatively slowly) to your real disks instead of to your SLOG. You probably don't want this.
In a world where SLOG SSDs were tiny and precious, this may have made
some sense. In a world where 60 GB SSDs are common as grass it's my
opinion that this no longer really does in most environments. Most ZFS
environments with SLOG SSDs will never come close to filling the SSD
with active ZIL log entries because almost no one writes and fsync()s
that much data that fast (you can and should measure this for yourself,
of course, but this is the typical result). Raising zil_slog_limit
substantially seems like a good idea to me (we'll probably tune it up to
at least a gigabyte).
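(For the record, I believe that inspecting and changing this on a live system goes something like the following, with the persistent version being a 'set zfs:zil_slog_limit=0x40000000' line in /etc/system:

  echo "zil_slog_limit/J" | mdb -k              # show the current value
  echo "zil_slog_limit/Z 0x40000000" | mdb -kw  # set it to 1 GB

As usual with kernel tunables, treat this as a sketch to verify rather than a tested recipe.)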
(See here for a nice overview of what gets written where and when and also some discussions about what may be faster under various circumstances.)