Wandering Thoughts

2017-08-16

The three different names ZFS stores for each vdev disk (on Illumos)

I sort of mentioned yesterday that ZFS keeps information on several different ways of identifying disks in pools. To be specific, it keeps three different names or ways of identifying each disk. You can see this with 'zdb -C' on a pool, so here's a representative sample:

# zdb -C rpool
MOS Configuration:
[...]
  children[0]:
    type: 'disk'
    id: 0
    guid: 15557853432972548123
    path: '/dev/dsk/c3t0d0s0'
    devid: 'id1,sd@SATA_____INTEL_SSDSC2BB08__BTWL4114016X080KGN/a'
    phys_path: '/pci@0,0/pci15d9,714@1f,2/disk@0,0:a'
    [...]

The guid is ZFS's internal identifier for the disk, and is stored on the disk itself as part of the disk label. Since you have to find the disk to read it, it's not something that ZFS uses to find disks, although it is part of verifying that ZFS has found the right one. The three actual names for the disk are reported here as path, devid aka 'device id', and phys_path aka 'physical path'.

The path is straightforward; it's the filesystem path to the device, which here is a conventional OmniOS (Illumos, Solaris) cNtNdNsN name typical of a plain, non-multipathed disk. As this is a directly attached SATA disk, the phys_path shows us the PCI information about the controller for the disk in the form of a PCI device name. If we pulled this disk and replaced it with a new one, both of those would stay the same, since with a directly attached disk they're based on physical topology and neither has changed. However, the devid is clearly based on the disk's identity information; it has the vendor name, the 'product id', and the serial number (as returned by the disk itself in response to SATA inquiry commands). This will be more or less the same regardless of where the disk is connected to the system, so ZFS (and anything else) can find the disk wherever it is.

(I believe that the 'id1,sd@' portion of the devid is simply giving us a namespace for the rest of it. See 'prtconf -v' for another representation of all of this information and much more.)
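As an illustration of where devids come from, the same string can be generated programmatically through libdevid, which is what ZFS and friends use under the hood. Here's a minimal user-level sketch; I believe these are the standard libdevid calls (compile with -ldevid), but treat the details as approximate rather than authoritative:

/* Minimal sketch: print the devid string for a disk device via libdevid.
 * Error handling is deliberately minimal; this is illustrative, not robust. */
#include <stdio.h>
#include <fcntl.h>
#include <devid.h>

int
main(int argc, char **argv)
{
	int fd = open(argv[1], O_RDONLY);	/* eg /dev/rdsk/c3t0d0s0 */
	ddi_devid_t devid;
	char *minor = NULL, *str;

	if (fd < 0 || devid_get(fd, &devid) != 0) {
		fprintf(stderr, "no devid available\n");
		return (1);
	}
	(void) devid_get_minor_name(fd, &minor);
	str = devid_str_encode(devid, minor);	/* the 'id1,sd@.../a' form */
	printf("%s\n", str);

	devid_str_free(str);
	devid_str_free(minor);
	devid_free(devid);
	return (0);
}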

Multipathed disks (such as the iSCSI disks on our fileservers) look somewhat different. For them, the filesystem device name (and thus the path) looks like c5t<long identifier>d0s0, and the physical path is /scsi_vhci/disk@g<long identifier>. The devid is not particularly useful for finding the specific physical disk, because our iSCSI targets generate synthetic disk 'serial numbers' based on their slot position (and the target's hostname, which at least lets me see which target a particular OmniOS-level multipathed disk is supposed to be coming from). As it happens, I already know that OmniOS multipathing identifies disks only by their device ids, so all three names are functionally the same thing, just expressed in different forms.

If you remove a disk entirely, all three of these names go away for both plain directly attached disks and multipath disks. If you replace a plain disk with a new or different one, the filesystem path and physical path will normally still work but the devid of the old disk is gone; ZFS can open the disk but will report that it has a missing or corrupt label. If you replace a multipathed disk with a new one and the true disk serial number is visible to OmniOS, all of the old names go away since they're all (partly) based on the disk's serial number, and ZFS will report the disk as missing entirely (often simply reporting it by GUID).

Sidebar: Which disk name ZFS uses when bringing up a pool

Which name or form of device identification ZFS uses when bringing up a pool is a bit complicated. Simplifying the logic of vdev_disk_open (in vdev_disk.c) as best I can, the normal sequence is that ZFS starts out by trying the filesystem path while verifying the devid. If this fails, it tries the devid, then the physical path, and finally the filesystem path again (but without verifying the devid this time).
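In rough pseudo-C, my reading of that order looks like the following. This is a paraphrase, not the real code; the try_open_*() and devid_matches() helpers are made up for clarity and stand in for the LDI calls (ldi_open_by_name(), ldi_open_by_devid(), and so on) that the real vdev_disk_open uses:

/* Paraphrased sketch of the vdev_disk_open() search order; not real code. */
static int
open_vdev_disk_sketch(vdev_t *vd)
{
	/* 1. Try the stored filesystem path, but insist the devid matches. */
	if (try_open_by_path(vd->vdev_path) == 0 &&
	    devid_matches(vd, vd->vdev_devid))
		return (0);

	/* 2. Try opening by devid alone, wherever the disk now lives. */
	if (try_open_by_devid(vd->vdev_devid) == 0)
		return (0);

	/* 3. Try the stored physical path. */
	if (try_open_by_physpath(vd->vdev_physpath) == 0)
		return (0);

	/* 4. Last resort: the filesystem path again, devid unchecked. */
	return (try_open_by_path(vd->vdev_path));
}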

Since ZFS verifies the disk label's GUID and other details after opening the disk, there is no risk that finding a random disk this way (for example by the physical path) will confuse ZFS. It'll just cause ZFS to report things like 'missing or corrupt disk label' instead of 'missing device'.

ZFSDiskNames written at 23:47:46; Add Comment

Things I do and don't know about how ZFS brings pools up during boot

If you import a ZFS pool explicitly, through 'zpool import', the user-mode side of the process normally searches through all of the available disks in order to find the component devices of the pool. Because it does this explicit search, it will find pool devices even if they've been shuffled around in a way that causes them to be renamed, or even (I think) drastically transformed, for example by being dd'd to a new disk. This is pretty much what you'd expect, since ZFS can't really read what the pool thinks its configuration is until it assembles the pool. When it imports such a pool, I believe that ZFS rewrites the information kept about where to find each device so that it's correct for the current state of your system.

This is not what happens when the system boots. To the best of my knowledge, for non-root pools the ZFS kernel module directly reads /etc/zfs/zpool.cache during module initialization and converts it into a series of in-memory pool configurations, all in an unactivated state. At some point, magic things attempt to activate some or all of these pools, which causes the kernel to attempt to open all of the devices listed as part of the pool configuration and verify that they are indeed part of the pool. The process of opening devices uses only the names and other identification stored in the pool configuration; however, one of those identifiers is the 'devid', which for many devices is basically the model and serial number of the disk. So I believe that under at least some circumstances the kernel will still be able to find disks that have been shuffled around, because it will basically seek out that model plus serial number wherever it's (now) connected to the system.
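As a concrete illustration of the first half of this, here's a rough user-level approximation of what 'read zpool.cache and turn it into per-pool configurations' amounts to. The real work is done in the kernel by spa_config_load(); this sketch just uses libnvpair (compile with -lnvpair) and skimps on error handling:

/* Sketch: zpool.cache is a packed nvlist whose top-level pairs are pool
 * names, each holding a nested nvlist with that pool's configuration. */
#include <stdio.h>
#include <stdlib.h>
#include <libnvpair.h>

int
main(void)
{
	FILE *fp = fopen("/etc/zfs/zpool.cache", "r");
	char *buf;
	long size;
	nvlist_t *cache, *config;
	nvpair_t *pair;

	if (fp == NULL)
		return (1);
	fseek(fp, 0, SEEK_END);
	size = ftell(fp);
	rewind(fp);
	buf = malloc(size);
	fread(buf, 1, size, fp);

	if (nvlist_unpack(buf, size, &cache, 0) != 0) {
		fprintf(stderr, "cannot unpack zpool.cache\n");
		return (1);
	}

	for (pair = nvlist_next_nvpair(cache, NULL); pair != NULL;
	    pair = nvlist_next_nvpair(cache, pair)) {
		/* Each pool name maps to its (not yet activated) config. */
		if (nvpair_value_nvlist(pair, &config) == 0)
			printf("pool: %s\n", nvpair_name(pair));
	}
	return (0);
}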

(See vdev_disk_open in vdev_disk.c for the gory details, but you also need to understand Illumos devids. The various device information available for disks in a pool can be seen with 'zdb -C <pool>'.)

To the best of my knowledge, this in-kernel activation makes no attempt to hunt around on other disks to complete the pool's configuration the way that 'zpool import' will. In theory, assuming that finding disks by their devid works, this shouldn't matter most or basically all of the time; if that disk is there at all, it should be reporting its model and serial number and I think the kernel will find it. But I don't know for sure. I also don't know how the kernel acts if some disks take a while to show up, for example iSCSI disks.

(I suspect that the kernel only makes one attempt at pool activation and doesn't retry things if more devices show up later. But this entire area is pretty opaque to me.)

These days you also have your root filesystems on a ZFS pool, the root pool. There are definitely some special code paths that seem to be invoked during boot for a ZFS root pool, but I don't have enough knowledge of the Illumos boot time environment to understand how they work and how they're different from the process of loading and starting non-root pools. I used to hear that root pools were more fragile if devices moved around, and that you might have to boot from alternate media and explicitly 'zpool import' and 'zpool export' the root pool in order to reset its device names, but that may be only folklore and superstition at this point.

ZFSPoolBootUnknowns written at 00:36:46; Add Comment

2017-08-07

There will be no LTS release of the OmniOS Community Edition

At the end of my entry on how I was cautiously optimistic about OmniOS CE, I said:

[...] For a start, it's not clear to me if OmniOS CE r151022 will receive long-term security updates or if users will be expected to move to r151024 when it's released (and I suppose I should ask).

Well, I asked, and the answer is a pretty unambiguous 'no'. The OmniOS CE core team has no interest in maintaining an LTS release; any such extended support would have to come from someone else doing the work. The current OmniOS CE support plans are:

What we intend, is to support the current and previous release with an emphasis on the current release going forward from r151022.

OmniOS CE releases are planned to come out roughly every 26 weeks, ie every six months, so supporting the current and previous release means that you get a nominal year of security updates and so on (in practice less than a year).

I can't blame the OmniOS CE core team for this (and I'm not anything that I'd describe as 'disappointed'; getting not just an OmniOS CE but an OmniOS CE LTS was always a long shot). People work on what interests them, and the CE core team just doesn't use LTS releases or plan to. They're doing enough as it is to keep OmniOS alive. And for most people, upgrading from release to release is probably not a big deal.

In the short term, this means that we are not going to bother to try to upgrade from OmniOS r151014 to either the current or the next version of OmniOS CE, because the payoff of relatively temporary security support doesn't seem worth the effort. We've already been treating our fileservers as sealed appliances, so this is not something we consider a big change.

(The long term is beyond the scope of this entry.)

OmniOSCENoLTSVersion written at 01:09:13; Add Comment

2017-07-24

Trying to understand the ZFS l2arc_noprefetch tunable

If you read the Illumos ZFS source code or perhaps some online guides to ZFS performance tuning, you may run across mention of a tunable called l2arc_noprefetch. There are various explanations of what this tunable means; for example, the current Illumos source code for arc.c says:

boolean_t l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */

As you can see, this defaults to being turned on in Illumos (and in ZFS on Linux). You can find various tuning guides online that suggest turning it off for better L2ARC performance, and when I had an L2ARC in my Linux system I ran this way for a while. One tuning guide I found describes it this way:

This tunable determines whether streaming data is cached or not. The default is not to cache streaming data. [...]

This makes it sound like you should absolutely want streaming data to be cached (ie that you should turn l2arc_noprefetch off), but not so fast. The ZFS on Linux manpage on these things describes it this way:

Do not write buffers to L2ARC if they were prefetched but not used by applications.

That sounds a lot more reasonable, and especially it sounds reasonable to have it turned on by default. ZFS prefetching can still be overly aggressive, and (I believe) it still doesn't slow itself down if the prefetched data is never actually read. If some of your prefetches go unused, under normal circumstances you probably don't want that never-read data taking up L2ARC space; you'd rather have L2ARC space go to things that you actually did read.

As far as I can decode the current Illumos code, this description also seems to match the actual behavior. If an ARC header is flagged as a prefetch, it is marked as not eligible for the L2ARC; however, if a normal read finds it in the ARC and that read is eligible for the L2ARC, the found ARC header is then marked as eligible (in arc_read()). So if you prefetch and then get a read hit, the ARC buffer is initially ineligible but becomes eligible.
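Put as a simplified sketch, the eligibility rules come down to something like this. The types and flag names below are stand-ins I've made up; the real logic is spread through arc.c under different names:

/* Paraphrased sketch of L2ARC eligibility; simplified stand-in names. */
#include <stdbool.h>

#define	FLAG_PREFETCH	0x01	/* buffer was brought in by prefetch */
#define	FLAG_L2CACHE	0x02	/* buffer may be written to the L2ARC */

struct hdr_sketch {
	unsigned flags;
};

bool l2arc_noprefetch = true;		/* the tunable, default on */

void
mark_l2arc_eligibility(struct hdr_sketch *hdr, bool is_prefetch,
    bool wants_l2cache)
{
	if (is_prefetch) {
		/* Prefetch: flag it and (by default) keep it out of the L2ARC. */
		hdr->flags |= FLAG_PREFETCH;
		if (l2arc_noprefetch)
			hdr->flags &= ~FLAG_L2CACHE;
	} else {
		/* Demand read hit: it has earned its way into the L2ARC. */
		hdr->flags &= ~FLAG_PREFETCH;
		if (wants_l2cache)
			hdr->flags |= FLAG_L2CACHE;
	}
}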

If I'm reading the code correctly, l2arc_noprefetch also has a second, somewhat subtle effect on reads. If the L2ARC contains the block but ZFS is performing a prefetch, the prefetch will not read the block from the L2ARC but will instead fall through to doing a regular read from the main pool disks. I'm not sure why this is done; it may be that it simplifies other parts of the code, or it may be a deliberate attempt to preserve L2ARC bandwidth for uses that are considered more likely to be productive. If you set l2arc_noprefetch to off, prefetches will read from the L2ARC and count as L2ARC hits, even if the prefetched data is never used by a real read.
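The read-side half can be sketched the same way; again these names are mine, not arc.c's:

/* Sketch: with l2arc_noprefetch on, a prefetch read bypasses the L2ARC
 * entirely, even if the block is sitting there. */
#include <stdbool.h>

extern bool l2arc_noprefetch;

bool
should_read_from_l2arc(bool block_in_l2arc, bool is_prefetch)
{
	if (!block_in_l2arc)
		return (false);
	if (is_prefetch && l2arc_noprefetch)
		return (false);	/* fall through to the main pool disks */
	return (true);		/* counts as an L2ARC hit */
}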

Note that this second subtle effect makes it hard to evaluate the true effects of turning off l2arc_noprefetch. I don't think you can go by L2ARC hits alone, because L2ARC hits could be inflated by prefetching: ZFS prefetches some data, puts it in the L2ARC, throws it away before it's used for a real read, re-prefetches the same data and gets it from the L2ARC (counting an L2ARC hit), and then throws it away again, still unused.

ZFSL2ARCNoprefetchTunable written at 02:06:57; Add Comment

2017-07-20

I'm cautiously optimistic about the new OmniOS Community Edition

You may recall that back in April, OmniTI suspended active development of OmniOS, leaving its future in some doubt and causing me to wonder what we'd do about our current generation of fileservers. There was a certain amount of back and forth on the OmniOS mailing list, but in general nothing concrete happened about, say, updates to the current OmniOS release, and people started to get nervous. Then just over a week ago, the OmniOS Community Edition was announced, complete with OmniOSCE.org. Since then, they've already released one weekly update (r151022i) with various fixes.

All of this leaves me cautiously optimistic for our moderate term needs, where we basically need a replacement for OmniOS r151014 (the previous LTS release) that gets security updates. I'm optimistic for the obvious reason, which is that things are really happening here; I'm cautious because maintaining a distribution of anything is a bunch of work over time and it's easy to burn out people doing it. I'm hopeful that the initial people behind OmniOS CE will be able to get more people to help and spread the load out, making it more viable over the months to come.

(I won't be one of the people helping, for previously discussed reasons.)

We're probably not in a rush to try to update from r151014 to the OmniOS CE version of r151022. Building out a new version of OmniOS and testing it takes a bunch of effort, the process of deployment is disruptive, and there's probably no point in doing too much of that work until the moderate term situation with OmniOS CE is clearer. For a start, it's not clear to me if OmniOS CE r151022 will receive long-term security updates or if users will be expected to move to r151024 when it's released (and I suppose I should ask).

For our longer term needs, ie the next generation of fileservers, a lot of things are up in the air. If we move to smaller fileservers we will probably move to directly attached disks, which means we now care about SAS driver support, and in general there's been the big question of good Illumos support for 10G-T Ethernet hardware (which I believe is still not there today for Intel 10G-T cards, or at least I haven't really seen any big update to the ixgbe driver). What will happen with OmniOS CE over the longer term is merely one of the issues in play; it may turn out to be important, or it may turn out to be irrelevant because our decision is forced by other things.

OmniOSCECautiousOptimism written at 23:47:29; Add Comment

2017-06-14

The difference between ZFS scrubs and resilvers

At one level, asking what the difference is between scrubs and resilvers sounds silly; resilvering is replacing disks, while scrubbing is checking disks. But if you look at the ZFS code in the kernel things become much less clear, because both scrubs and resilvers use a common set of code to do all their work and it's actually not at all easy to tell what happens differently between them. Since I dug through this section of ZFS code just the other day, I want to write down what the differences are while I remember them.

Both scrubs and resilvers traverse all of the metadata in the pool (in a non-sequential order), and both wind up reading all data. However, scrubs do this more thoroughly for mirrored vdevs; scrubs read all sides of a mirror, while resilvers only read one copy of the data (well, one intact copy). On raidz vdevs there is no difference here, as both scrubs and resilvers read both the data blocks and the parity blocks. This implies that a scrub on mirrored vdevs does more IO and (importantly) more checking than a resilver does. After a resilver of mirrored vdevs, you know that you have at least one intact copy of every piece of the pool, while after an error-free scrub of mirrored vdevs, you know that all ZFS metadata and data on all disks is fully intact.
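A compressed way of putting the mirror-side difference (this paraphrases my reading of vdev_mirror_io_start; the helper names are invented) is:

/* Paraphrased sketch of mirror reads: scrubs read every side, resilvers
 * and normal reads settle for one intact copy. */
extern void issue_child_read(int child);
extern int pick_readable_child(int nchildren);

void
mirror_read_sketch(int nchildren, int is_scrub)
{
	int c;

	if (is_scrub) {
		/* Scrub: read (and thus check) every side of the mirror. */
		for (c = 0; c < nchildren; c++)
			issue_child_read(c);
	} else {
		/* Resilver or normal read: one readable copy is enough. */
		issue_child_read(pick_readable_child(nchildren));
	}
}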

For resilvers but not scrubs (at least normally), ZFS will sort of attempt to write everything back to the disks again, as I covered in the sidebar of the last entry. As far as I can tell from the code, ZFS always skips even trying to write things back to disks that are either known to have good data (for example they were read from) or that are believed to be good because their DTL says that the data is clean on them (I believe that this case only really applies for mirrored vdevs). For scrubs, the only time ZFS submits writes to the disks is if it detects actual errors.

(Although I don't entirely understand how you get into this situation, it appears that a scrub of a pool with a 'replacing' disk behaves a lot like a resilver as far as that disk is concerned. As you would expect and want, the scrub doesn't try to read from the new disk and, like resilvers, it tries to write everything back to the disk.)

Since we only have mirrored vdevs on our fileservers, what really matters to us here is the difference in what gets read between scrubs and resilvers. On the one hand, resilvers put less read load on the disks, which is good for reducing their impact. On the other hand, resilvering isn't quite as thorough a check of the pool's total health as a scrub is.

PS: I'm not sure if either a scrub or a resilver reads the ZIL. Based on some code in zil.c, I suspect that it's checked only during scrubs, which would make sense. Alternatively, zil.c is reusing ZIO_FLAG_SCRUB for non-scrub IO for some of its side effects, but that would be a bit weird.

ZFSResilversVsScrubs written at 00:44:26; Add Comment

2017-06-12

Resilvering multiple disks at once in a ZFS pool adds no real extra overhead

Suppose, not entirely hypothetically, that you have a multi-disk, multi-vdev ZFS pool (in our case using mirrored vdevs) and you need to migrate this pool from one set of disks to another set of disks. If you care about doing relatively little damage to pool performance for the duration of the resilvering, are you better off replacing one disk at a time (with repeated resilvers of the pool), or doing a whole bunch of disk replacements at once in one big resilver?

As far as we can tell from both the Illumos ZFS source code and our experience, the answer is that replacing multiple disks at once in a single pool is basically free (apart from the extra IO implied by writing to multiple new disks at once). In particular, a ZFS resilver on mirrored vdevs appears to always read all of the metadata and data in the pool, regardless of how many replacement drives there are and where they are. This means that replacing (or resilvering) multiple drives at once doesn't add any extra read IO; you do the same amount of reads whether you're replacing one drive or ten.

(This is unlike conventional RAID10, where replacing multiple drives at once will probably add additional read IO, which will affect array performance.)

For metadata, this is not particularly surprising. Since metadata is stored all across the pool, metadata located in one vdev can easily point to things located on another one. Given this, you have to read all metadata in the pool in order to even find out what is on a disk that's being resilvered. In theory I think that ZFS could optimize handling leaf data to skip stuff that's known to be entirely on unaffected vdevs; in practice, I can't find any sign in the code that it attempts this optimization for resilvers, and there are rational reasons to skip it anyway.

(As things are now, after a successful resilver you know that the pool has no permanent damage anywhere. If ZFS optimized resilvers by skipping reading and thus checking data on theoretically unaffected vdevs, you wouldn't have this assurance; you'd have to resilver and then scrub to know for sure. Resilvers are somewhat different from scrubs, but they're close.)

This doesn't mean that replacing multiple disks at once won't have any impact on your overall system, because your system may have overall IO capacity limits that are affected by adding more writes. For example, our ZFS fileservers each have a total write bandwidth of 200 Mbytes/sec across all their ZFS pool disks (since we have two 1G iSCSI networks). At least in theory we could saturate this total limit with resilver write traffic alone, and certainly enough resilver write traffic might delay normal user write traffic (especially at times of high write volume). Of course, moderating this sort of impact is what the ZFS scrub and resilver tunables are for, so you may want to keep an eye on them.
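For reference, the knobs involved are the scan-related tunables declared in dsl_scan.c. I'm quoting these from memory, so the exact names and default values may be slightly off for any particular Illumos version:

/* Scan/resilver throttling tunables from dsl_scan.c (from memory). */
int zfs_top_maxinflight = 32;		/* max scan I/Os in flight per top-level vdev */
int zfs_resilver_delay = 2;		/* ticks to delay each resilver I/O when busy */
int zfs_scrub_delay = 4;		/* ticks to delay each scrub I/O when busy */
int zfs_scan_idle = 50;			/* idle window (ticks) before the delays lapse */
int zfs_scan_min_time_ms = 1000;	/* minimum ms spent scrubbing per txg */
int zfs_resilver_min_time_ms = 3000;	/* minimum ms spent resilvering per txg */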

(This also ignores any potential multi-tenancy issues, which definitely affect us at least some of the time.)

ZFS does optimize resilvering disks that were only temporarily offline, using what ZFS calls a dirty time log (DTL). The DTL can be used to optimize walking the ZFS metadata tree in much the same way as how ZFS bookmarks work.

Sidebar: How ZFS resilvers (seem to) repair data, especially in RAIDZ

If you look at the resilvering related code in vdev_mirror.c and especially vdev_raidz.c, how things actually get repaired seems pretty mysterious. Especially in the RAIDZ case, ZFS appears to just go off and issue a bunch of writes to everything without paying any attention to new disks versus old, existing disks. It turns out that the important magic seems to be in zio_vdev_io_start in zio.c, where ZFS quietly discards resilver write IOs unless the target disk is known to require them. The best detailed explanation is in the code's comments:

/*
 * If this is a repair I/O, and there's no self-healing involved --
 * that is, we're just resilvering what we expect to resilver --
 * then don't do the I/O unless zio's txg is actually in vd's DTL.
 * This prevents spurious resilvering with nested replication.
 * For example, given a mirror of mirrors, (A+B)+(C+D), if only
 * A is out of date, we'll read from C+D, then use the data to
 * resilver A+B -- but we don't actually want to resilver B, just A.
 * The top-level mirror has no way to know this, so instead we just
 * discard unnecessary repairs as we work our way down the vdev tree.
 * The same logic applies to any form of nested replication:
 * ditto + mirror, RAID-Z + replacing, etc.  This covers them all.
 */
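The check that this comment describes is a short guard just below it in zio_vdev_io_start. I'm reconstructing it from memory, so treat it as approximate rather than as a verbatim quote:

/* Reconstructed from memory; the real code may differ in details. */
if ((zio->io_flags & ZIO_FLAG_IO_REPAIR) &&
    !(zio->io_flags & ZIO_FLAG_SELF_HEAL) &&
    zio->io_txg != 0 &&		/* not a delegated i/o */
    !vdev_dtl_contains(vd, DTL_PARTIAL, zio->io_txg, 1)) {
	ASSERT(zio->io_type == ZIO_TYPE_WRITE);
	zio_vdev_io_bypass(zio);
	return (ZIO_PIPELINE_CONTINUE);
}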

It appears that preparing and queueing all of these IOs that will then be discarded does involve some amount of CPU, memory allocation for internal data structures, and so on. Presumably this is not a performance issue in practice, especially if you assume that resilvers are uncommon in the first place.

(And certainly the code is better structured this way, especially in the RAIDZ case.)

ZFSMultidiskResilversFree written at 22:29:35; Add Comment

2017-05-17

Unfortunately I don't feel personally passionate about OmniOS

In light of recent events, OmniOS badly needs people who really care about it, the kind of people who are willing and eager to contribute their own time and efforts to the generally thankless work of keeping a distribution running: integrating upstream changes, backporting security and bug updates, maintaining various boring pieces of infrastructure, and so on. There is an idealistic portion of me that would like me to be such a person, and thus that wants me to step forward to volunteer for things here (on the OmniOS mailing list, some people have recently gotten the ball rolling to line up volunteers for various jobs).

Unfortunately, I have to be realistic here (and it would be irresponsible not to be). Perhaps some past version of me is someone who might have had that sort of passion for OmniOS at the right time in his life, but present-day me does not (for various reasons, including that I have other things going on in my life). Without this passion, if I made the mistake of volunteering to do OmniOS-related things in an excess of foolishness I would probably burn out almost immediately; it would be all slogging work and no particular feeling of reward. And I wouldn't even be getting paid for it.

Can I contribute as part of my job? Unfortunately, this is only a maybe, because of two unfortunate realities here. Reality number one is that while we really like our OmniOS fileservers, OmniOS and even ZFS are not absolutely essential to us; although it would be sort of painful, we could get by without OmniOS or even ZFS. Thus there is only so much money, staff time, or other resource that it's sensible for work to 'spend' on OmniOS. Reality number two is that it's only sensible for work to devote resources to supporting OmniOS if we actually continue to use OmniOS over the long term, and that is definitely up in the air right now. Volunteering long-term ongoing resources for OmniOS now (whether that's some staff time or some servers) is premature. Unfortunately there's somewhat of a chicken and egg problem; we might be more inclined to volunteer resources if it was clear that OmniOS was going to be viable going forward, but volunteering resources now is part of what OmniOS needs in order to be viable.

(Given our 10G problem, it's not clear that even a fully viable OmniOS would be our OS of choice for our next generation of fileservers.)

(This echoes what I said in my sidebar on the original entry at the time, but I've been thinking over this recently in light of volunteers starting to step forward on the OmniOS mailing list. To be honest, it makes me feel a bit bad to be someone who isn't stepping forward here, when there's a clear need for people. So I have to remind myself that stepping forward without passion to sustain me would be a terrible mistake.)

OmniOSNoPersonalPassion written at 02:41:25; Add Comment

2017-05-07

ZFS's zfs receive has no error recovery and what that implies

Generally, 'zfs send' and 'zfs receive' are a great thing about ZFS. They give you easy, efficient, and reliable replication of ZFS filesystems, which is good for both backups and transfers (and testing). The combination is among my favorite ways of shuffling filesystems around, right up there with dump and restore and ahead of rsync. But there is an important caveat about zfs receive that deserves to be more widely known, and that is that zfs receive makes no attempt to recover from a partially damaged stream of data.

Formats and programs like tar, restore, and so on generally have provisions for attempting to skip over damaged sections of the stream of data they're supposed to restore. Sometimes this is explicitly designed into the stream format; other times the programs just use heuristics to try to re-synchronize a damaged stream at the next recognizable object. All of these have the goal of letting you recover at least something from a stream that's been stored and damaged in that storage, whether it's a tarball that's experienced some problems on disk or a dump image written to tape and now the tape has a glitch.

zfs receive does not do any of this. If your stream is damaged in any way, zfs receive aborts completely. In order to recover anything at all from the stream, it must be perfect and completely undamaged. This is generally okay if you're directly feeding a 'zfs send' to 'zfs receive'; an error means something bad is going on, and you can retry the send immediately. This is not good if you are storing the 'zfs send' output in a file before using 'zfs receive'; any damage or problem to the file, and it's completely unrecoverable and the file is useless. If the 'zfs send' file is your only copy of the data, the data is completely lost.

There are some situations where this lack of resilience for saved send streams is merely annoying. If you're writing the 'zfs send' output to a file on media as a way of transporting it around and the file gets damaged, you can always redo the whole process (just as you could redo a 'zfs send | zfs receive' pipeline). But in other situations this is actively dangerous, for example if the file is the only form of backup you have or the only copy of the data. Given this issue I now strongly discourage people from storing 'zfs send' output unless they're sure they know what they're doing and they can recover from restore failures.

(See eg this discussion or this one.)

I doubt that ZFS will ever change this behavior, partly because it would probably require a change in the format of the send/receive data stream. My impression is that ZFS people have never considered it a good idea to store 'zfs send' streams, even if they never said this very strongly in things like documentation.

(You also wouldn't want continuing to be the default behavior for 'zfs receive', at least in normal use. If you have a corrupt stream and you can retry it, you definitely want to.)

ZFSReceiveNoErrorRecovery written at 19:36:26; Add Comment

2017-04-24

What we need from an Illumos-based distribution

It started on Twitter:

@thatcks: Current status: depressing myself by looking at the status of Illumos distributions. There's nothing else like OmniOS out there.

@ptribble: So what are the values of OmniOS that are important to you and you feel aren't present in other illumos distributions?

This is an excellent question from Peter Tribble of Tribblix and it deserves a more elaborate answer than I could really give it on Twitter. So here's my take on what we need from an Illumos distribution.

  • A traditional Unix server environment that will do NFS fileservice with ZFS, because we already have a lot of tooling and expertise in running such an Illumos-based environment and our current environment has a very particular setup due to local needs. If we have to completely change how we design and operate NFS fileservice (for example to move to a more 'storage appliance' environment), the advantages of continuing with Illumos mostly go away. If we're going to have to drastically redesign our environment no matter what, maybe the simpler redesign is 'move to ZFS on FreeBSD doing NFS service as a traditional Unix server'.

  • A distribution that is actively 'developed', by which I mean that it incorporates upstream changes from Illumos and other sources on a regular basis, that its installer is updated as necessary, and so on. Illumos is not a static thing; there's important evolution in ZFS fixes and improvements, hardware drivers, and various other components.

  • Periodic stable releases that are supported with security updates and important bug fixes for a few years but that do not get wholesale rolling updates from upstream Illumos. We need this because testing, qualifying, and updating to a whole new release (with a wide assortment of changes) is a very large amount of work and risk. We can't possibly do it on a regular basis, such as every six months; even once a year is too much. Two years of support is probably our practical minimum.

    We could probably live with 'security updates only', although it'd be nice to be able to get high priority bug fixes as well (here I'm thinking of things that will crash your kernel or cause ZFS corruption, which are very close to 'must fix somehow' issues for people running ZFS fileservers).

    We don't need the ability to upgrade from stable release to stable release. For various reasons we're probably always going to upgrade by installing from scratch on new system disks and then swapping things around in downtimes.

  • An installer that lets us install systems from USB sticks or DVD images and doesn't require complicated backend infrastructure. We're not interested in big complex environments where we have to PXE-boot initial installs of servers, possibly with various additional systems to auto-configure them and tell them what to do.

In a world with pkgsrc and similar sources of pre-built and generally maintained packages, I don't think we need the distribution itself to have very many packages (I'm using the very specific 'we' here, meaning my group and our fileservers). Sure, it would be convenient for us if the distribution shipped with Postfix and Python and a few other important things, but it's not essential the way the core of Illumos is. While it would be ideal if the distribution owned everything important the way that Debian, FreeBSD, and so on do, it doesn't seem like Illumos distributions are going to have that kind of person-hours available, even for a relatively small package collection.

With that said I don't need all packages to come from pkgsrc or whatever; I'm perfectly happy to have a mix of maintained packages from several sources, including the distribution itself, in either its main source or an 'additional packages we use' repository. Since there's probably not going to be a plain-server, NFS-fileserver-focused Illumos distribution, I'd expect any traditional Unix style distribution to have additional interests that lead to them packaging and maintaining some extra packages, whether that's for web service or X desktops or whatever.

(I also don't care about the package format that the distribution uses. If sticking with IPS is the easy way, that's fine. Neither IPS nor pkgsrc are among my favorite package management systems, but I can live with them.)

Out of all of our needs, I expect the 'stable releases' part to be the biggest problem. Stable releases are a hassle for distribution maintainers (or really for maintainers of anything); you have to maintain multiple versions and you may have to backport a steadily increasing amount of things over time. The amount of pain involved in them is why we're probably willing to live with only security updates for a relatively minimal set of packages and not demand backported bugfixes.

(In practice we don't expect to hit new bugs once our fileservers have gone into production and been stable, although it does happen every once in a while.)

Although 10G Ethernet support is very important to us in general, I'm not putting it on this list because I consider it a general Illumos issue, not something that's going to be specific to any particular distribution. If Illumos as a whole has viable 10G Ethernet for us, any reasonably up to date distribution should pick it up, and we don't want to use a distribution that's not picking those kind of things up.

Sidebar: My current short views on other Illumos distributions

Peter Tribble also asked what was missing in existing Illumos distributions. Based on an inspection of the Illumos wiki's options, I can split the available distributions into three sets:

  • OpenIndiana and Tribblix are actively developed, more or less traditional Unix systems, but they don't appear to have stable releases that are maintained for a couple of years; instead there are periodic new releases with all changes included.

  • SmartOS, Nexenta, and napp-it are Illumos based but as far as I can tell aren't in the form of a traditional Unix system. (I'm not sure if napp-it is actively updated, but the other two are.)

  • the remaining distributions don't seem to be actively developed and may not have maintained stable releases either (I didn't look deeply).

Hopefully you can see why OmniOS hit a sweet spot for us; it is (or was) actively maintained, it has 'long term stable' releases that are supported for a few years, and you get a traditional Unix OS environment and a straightforward installation system.

IllumosDistributionNeeds written at 22:18:16; Add Comment
