2016-03-27
The limits of open source with Illumos and OmniOS
I go back and forth on how optimistic I feel about OmniOS and Illumos as a whole. During the up moods, I remember how our fileservers are problem free these days; during the down moods, I remember our outstanding problems. This is an entry written from a down mood perspective.
At this point we have several outstanding problems with OmniOS and Illumos as a whole, such as our ixgbe 10G Ethernet issues and the kernel holding memory. These issues have been officially known for some time, but they remain, and as far as I can tell there's been no visible movement towards fixing them. At the same time, we have seen other problems get dealt with quite rapidly.
What I read into this is that we have hit the limits of Illumos's open source development. The things that I've seen dealt with promptly are either small, already solved somewhere, or a priority of some paying customer of an Illumos-related company. Our open issues are big and gnarly and (apparently) not being pushed along by anyone who can afford to pay for support; after all, neither revising bits of the kernel memory system nor doing a major update of the ixgbe driver is a small project.
In a bigger open source project such as Linux, there is more manpower available and there are more people running into relatively obscure problems like these. As an example, Linux is popular enough that it's extremely unlikely that a major 10G Ethernet driver would be left to rot in an effectively unusable condition for common hardware. But Illumos simply does not have that kind of manpower and usage; what gets developed and fixed for Illumos is clearly much narrower. The people working on Illumos are great and they have been super-helpful to us where they could be, but the limits of where they can be helpful do not extend to doing major unpaid work. And this means that what we can expect from Illumos and OmniOS is limited.
How limited? In my down mood right now, I say that in practice we can expect to get something very close to no support. If something doesn't work, we get to keep all the pieces and (as with our 10G situation) we cannot expect a fix over the lifetime of our fileservers.
(In theory this is also the situation with Linux and FreeBSD unless we, say, pay Red Hat for good RHEL support; in practice it isn't.)
This makes me think that as nice as OmniOS is on our current fileservers, I won't really be able to recommend it as the OS for our next generation of fileservers in a few years. This is beyond the concrete issues I wrote about in the future of OmniOS here without 10G (or when I initially worried about driver support); it's a general issue of how much confidence I can have about being able to get problems fixed.
(I'm sure that if we had the money for support or consulting work we'd get great support from OmniTI and so on, and we'd probably have fixes for our problems. But we don't have that money and are unlikely to ever do so, so we must rely on the charity of the crowd. And the Illumos crowd is thin.)
PS: Some people might say 'just test the 2018 version of OmniOS a lot before you make the final decision'. Unfortunately, our experiences with 10G ixgbe and other issues make it clear that we simply can't do that well enough; we will run into problems in production that we couldn't have found beforehand.
2016-03-14
An additional small detail of how writes work on ZFS raidzN pools
Back in How writes work on ZFS raidzN pools I wrote up how ZFS doesn't always do what's usually called 'full stripe writes', unlike normal RAID-5/6/etc systems. This matters because if you write data in small chunks you can use up more space than you expect, especially on 4k physical sector size disks (apparently zvols with a 4K or 8K record size are especially terrible for this; see eg this ZFS on Linux issue report).
Recently, I was reading Matthew Ahrens' ZFS RAIDZ stripe width, or: How I Learned to Stop Worrying and Love RAIDZ and learned another small but potentially important detail about how ZFS does raidzN writes. It turns out that ZFS requires all allocations to be multiples of N+1 blocks, so it rounds everything up to the nearest N+1 block boundary. This is regardless of how many disks you have in the raidzN vdev; if you have a raidz2 pool, for example, it can allocate 9 blocks or 12 blocks for a single write but never 10 or 11 blocks.
(Note that this is the allocation size including the raidzN parity blocks, not the user level data alone.)
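To make the rounding concrete, here's a rough Python sketch of how I understand the allocation size calculation (the real logic lives in ZFS's vdev_raidz_asize(); the function name and the example numbers here are mine, not anything official):

    import math

    def raidz_allocation_sectors(write_bytes, ndisks, nparity, sector_size=4096):
        # Sectors of actual data in this write.
        data = math.ceil(write_bytes / sector_size)
        # Parity sectors: nparity of them for each 'row' of data spread
        # across the (ndisks - nparity) data disks.
        parity = nparity * math.ceil(data / (ndisks - nparity))
        total = data + parity
        # Round the whole allocation up to a multiple of nparity + 1.
        return math.ceil(total / (nparity + 1)) * (nparity + 1)

    # A 6-disk raidz2 vdev with 4K sectors:
    print(raidz_allocation_sectors(16 * 1024, 6, 2))  # 4 data + 2 parity = 6, already a multiple of 3
    print(raidz_allocation_sectors(8 * 1024, 6, 2))   # 2 data + 2 parity = 4, rounded up to 6

The second case is where the rounding actually costs you something; the first one already comes out to an even multiple of 3 on its own.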
At first this might seem kind of crazy, but as Matthew Ahrens explains, it more or less makes sense. The minimum write size in a raidzN pool is one data block plus N parity blocks, ie N+1 blocks in total. By rounding allocations up to this boundary, ZFS makes life somewhat easier on itself; any chunk of free space is always guaranteed to hold at least one minimum-sized write, no matter how things are allocated and freed, so ZFS will never be left with 'runt' free space that is too small to be used.
(This is free space as ZFS sees it, ie free space in a space map, which is what ZFS scans when it wants to allocate space. There will be some amount of irregular space that is 'free' in the sense that nothing is stored in it; it's just the rounded-up padding. ZFS doesn't have to keep track of that as free space; instead it ignores it entirely, or more exactly marks it as used space.)
As with partial stripe writes, this does interact with 4k sector drives to potentially use more space, especially for higher raidzN settings. However, how much extra space gets used is going to be very dependent on what size your writes are.
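To illustrate, here is what the sketch above gives for a hypothetical 7-disk raidz3 vdev with 4K sectors (again, these are my worked numbers, not measurements):

    # Padding overhead on a 7-disk raidz3 vdev with 4K sectors.
    for kib in (4, 8, 16, 128):
        data = math.ceil(kib * 1024 / 4096)
        raw = data + 3 * math.ceil(data / 4)          # data plus parity, before rounding
        alloc = raidz_allocation_sectors(kib * 1024, 7, 3)
        print(f"{kib:>3} KiB write: {alloc} sectors allocated, {alloc - raw} of them padding")

A 4 KiB write comes out even (1 data sector plus 3 parity sectors is already a multiple of 4), an 8 KiB write picks up 3 sectors of padding, a 16 KiB write picks up 1, and a full 128 KiB record comes out even again.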
(The good news is that minimum-sized objects won't experience any extra space usage as a result of this, since they're already one data block plus N parity blocks.)