Wandering Thoughts archives

2016-02-26

Our problem with iSCSI connections at boot on OmniOS

You might perhaps wonder why I recently needed to run a script when our OmniOS machines booted. As it happens, we sometimes have a little problem with our iSCSI networking when we reboot a system, and we would like to know about it right away. First, the high-speed summary of iSCSI on our ZFS fileservers is that fileservers connect to their iSCSI backends over two separate and thus redundant networks. At a mechanical level this is done by statically configuring each iSCSI target disk twice, once over each network, joining them together with standard OmniOS multipathing (set to round-robin), and then telling the OmniOS iSCSI initiator that it should make two connections to each target with 'iscsiadm modify initiator-node -c 2' (here's a longer writeup).
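As an illustrative sketch (not our actual configuration), the per-target setup looks something like this; the IQN and addresses are made up, with 10.1.0.x and 10.2.0.x standing in for the two iSCSI networks, and the multipathing round-robin setting lives separately in scsi_vhci's configuration:

```shell
# Each target disk is statically configured once per iSCSI network
# (IQN and IP addresses here are invented for illustration):
iscsiadm add static-config iqn.1986-03.com.example:disk01,10.1.0.5:3260
iscsiadm add static-config iqn.1986-03.com.example:disk01,10.2.0.5:3260
iscsiadm modify discovery --static enable
# and the initiator is told to make two connections to each target:
iscsiadm modify initiator-node -c 2
```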

What we want and expect is that those two connections to each target should be made over different networks. And most of the time this works. However, some of the time a system will boot up with all of its connections to some or even all of the targets going over only a single network. Usually there will still be two connections but both will be over the same network, which costs us both redundancy and bandwidth.

(It's possible that OmniOS would make a new connection over the other network if the first one died, but this isn't something we exactly want to bet on.)

Because nothing actually breaks when the system is like this (at least when both iSCSI networks are working), it's possible for fileservers to quietly stay in this state for some time. Once we got disturbed enough by this fact, we wrote a script on the backends that checks for this, but only once a day. We decided that we'd like to know faster than that for the most common case, where this unbalanced iSCSI usage happens at boot time and can be detected right after boot. That led to needing a boot time service to run the script and wound up with me deep in SMF for the first and hopefully last time.
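The core of such a check can be sketched like this; it is not our actual script, and the 'IP address' lines are my assumption about what 'iscsiadm list target -v' output looks like, so the patterns would need adjusting on a real system. To keep the sketch self-contained, it runs over canned output:

```shell
# Flag any target whose connections are all on one network. Here a
# "network" is approximated by dropping the last octet of the address;
# the input format is an assumption, not verified iscsiadm output.
check_balance() {
  awk '
    /^Target:/   { target = $2 }
    /IP address/ { split($NF, a, ":")           # eg 10.1.0.5:3260
                   net = a[1]
                   sub(/\.[0-9]+$/, "", net)    # 10.1.0.5 -> 10.1.0
                   seen[target SUBSEP net]++ }
    END {
      for (k in seen) { split(k, p, SUBSEP); nets[p[1]]++ }
      for (t in nets)
        if (nets[t] < 2)
          printf "%s: all connections on one network\n", t
    }'
}

# Canned sample: disk01 is balanced across both networks, disk02 is not.
check_balance <<'EOF'
Target: iqn.1986-03.com.example:disk01
        IP address (Peer): 10.1.0.5:3260
        IP address (Peer): 10.2.0.5:3260
Target: iqn.1986-03.com.example:disk02
        IP address (Peer): 10.1.0.6:3260
        IP address (Peer): 10.1.0.6:3260
EOF
# prints: iqn.1986-03.com.example:disk02: all connections on one network
```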

By the way, this is not directly OmniOS's fault; it's something that's been happening in Solaris for some time. My assumption is that this problem has at least something to do with the tangled way that Solaris has always brought up iSCSI disks at boot time, such that the OmniOS iSCSI initiator is attempting to bring up the two connections we told it to make at a time when only one network is available.

(Perhaps I should file this as an OmniOS and/or Illumos bug, but somehow I doubt it would get much attention.)

Sidebar: How we fix this

In an ideal world, you could fix this simply by telling OmniOS to switch to having only one connection per target, then go back to two connections per target; OmniOS would notice that it had two networks available and that it would be smart to make that second connection over the other network. Sometimes this even works. Often it doesn't, though.

When it fails to work, what has worked for us is to remove entirely the static target configuration for the network that is not being used, drop to one connection per target, re-add all of those removed static target configurations, and go back to two connections per target. Fortunately we have scripts that generate most of the necessary commands.
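Sketched for a single target, the recovery sequence is something like the following; the IQN and address are made up, 10.2.0.5 stands in for the path on the unused network, and in reality this has to be done for every affected target (which is why we have scripts to generate the commands):

```shell
# Remove the static config for the unused network, drop to one
# connection per target, then put it all back:
iscsiadm remove static-config iqn.1986-03.com.example:disk01,10.2.0.5:3260
iscsiadm modify initiator-node -c 1
iscsiadm add static-config iqn.1986-03.com.example:disk01,10.2.0.5:3260
iscsiadm modify initiator-node -c 2
```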

OmniOSISCSIBootProblem written at 01:39:38; Add Comment

2016-02-05

Some notes on SMF manifests (on OmniOS) and what goes in them

Recently, I needed to create a SMF manifest to run a script at boot. In most init systems, this is simple. SMF is not most init systems. SMF requires services (including scripts run at boot) to be defined in XML manifests. Being XML, they are verbose and picky, but fortunately there are some good general guidelines on what goes in them; the one I started from is Ben Rockwood's An SMF Manifest Cheatsheet. But there are a number of things it didn't say explicitly (or at all) that I had to find out the hard way, so here's some notes.

First, on OmniOS you'll find most existing SMF manifests under /lib/svc/manifest, especially /lib/svc/manifest/system. If you get confused or puzzled about how to do something, it's very worth raiding these files for examples.

What both Ben Rockwood's writeup and the documentation neglect to mention is that there is a fixed order of elements in the SMF manifest. The manifest is not just an (XML) bag of properties; the elements need to come in a relatively specific order. You can get all sorts of puzzling and annoying errors from 'svccfg validate' if you don't know this.

(The error messages probably make total sense to people who understand XML DTD validation. I am not such a person.)

For just running a script, everyone seems to set things so there is only a single instance of your SMF service and it's auto-created:

<create_default_instance enabled='false' />
<single_instance/>

(This comes right after the opening <service> tag.)

There is probably an art to picking your SMF dependencies. I went for overkill; in order to get my script run right at the end of boot, I specified /system/filesystem/local, /milestone/multi-user, /milestone/network, and for local reasons /network/iscsi/initiator. 'svcs' defaults to listing services in start order, so you can use that to fish around for likely dependencies. Or you can look at what similar system SMF services use.

(It turns out that you can put multiple FMRIs in a single <dependency> tag, so my SMF manifest is more verbose than it needs to be. They need to have the same grouping, restart_on, and type, but this is probably not uncommon.)

Although you might think otherwise, even a single-shot script needs to have a 'stop' <exec_method> defined, even if it does nothing. The one that services seem to use is:

<exec_method type='method'
             name='stop'
             exec=':true'
             timeout_seconds='3' />

The timeout varies but I suspect it's not important. Omitting this will cause your SMF manifest to fail validation.

If you just want to run a script from your SMF service, you need what is called a 'transient' service. How you specify that your service is a transient one is rather obscure, because it is not something you set in the overall service description or in the 'start' exec_method (where you might expect it to live). Instead it's done this way:

<property_group name='startd' type='framework'>
    <propval name='duration' type='astring' value='transient' />
</property_group>

These are directions for svc.startd, which is responsible for starting and restarting SMF services. You can thus find some documentation for them in the svc.startd manpage, if you already understand enough about SMF XML manifests to know how to write properties.

(Since it is an add-on property, not a fundamental SMF XML attribute, it is not to be found anywhere in the SMF DTD. Isn't it nice that the SMF documentation points you to the SMF DTD for these things? No, not particularly.)

Some documentation will suggest giving your SMF service a name in the /site/ overall namespace. I suggest using an organizational name of some sort instead, because that way you know that a particular service came from you and was not dropped in from who knows where (and it's likely to stand out more in eg 'svcs' output). Other people creating SMF packages are already doing this; for instance, pkgsrc uses /pkgsrc/ names.
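Putting the pieces above together, a minimal manifest for a transient boot-time script might look something like the following. The service name, script path, and the single dependency are made up for illustration, and this is a sketch rather than a known-good manifest, but the element order shown is the sort of thing 'svccfg validate' insists on:

```xml
<?xml version="1.0"?>
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<!-- Sketch only: names and paths here are invented. -->
<service_bundle type='manifest' name='ourorg-bootcheck'>
  <service name='ourorg/bootcheck' type='service' version='1'>
    <create_default_instance enabled='false' />
    <single_instance/>
    <dependency name='multi-user' grouping='require_all'
                restart_on='none' type='service'>
      <service_fmri value='svc:/milestone/multi-user' />
    </dependency>
    <exec_method type='method'
                 name='start'
                 exec='/our/scripts/bootcheck'
                 timeout_seconds='60' />
    <exec_method type='method'
                 name='stop'
                 exec=':true'
                 timeout_seconds='3' />
    <property_group name='startd' type='framework'>
      <propval name='duration' type='astring' value='transient' />
    </property_group>
  </service>
</service_bundle>
```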

(This is the kind of entry that I write because I don't want to have to re-research this. SMF was annoying enough the first time around.)

Sidebar: A quick command cheatsheet

SMF manifests are validated with 'svccfg validate <file>.xml'. Expect to use this often.

Once ready to be used, manifests must be imported into SMF, which is done with 'svccfg import <file>.xml'. If you specified that your service should default to disabled when installed (as I did here), you then need to enable it with the usual 'svcadm enable /you/whatever'.
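In other words, the usual cycle is something like this (with a made-up manifest file and service name):

```shell
svccfg validate bootcheck.xml
svccfg import bootcheck.xml
svcadm enable /ourorg/bootcheck
svcs -l /ourorg/bootcheck    # check its state afterwards
```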

In theory you can re-import manifests to pick up changes. In practice I have no idea what sort of things are picked up; for example, if you delete a <dependency> block, does it go away in the imported version when reimported? I'd have to experiment (or know more about SMF than I currently do).

Your imported SMF manifest can be completely removed with 'svccfg delete /you/whatever'. Normally you'll want to have disabled the service beforehand. The svccfg manpage makes me slightly nervous about this in some circumstances that are probably not going to apply to many people.

(Svccfg has an export operation, but it just dumps out information, it doesn't remove things.)

SMFServiceManifestNotes written at 01:15:00; Add Comment

2016-01-20

Illumos's ZFS prefetching has recently become less superintelligent than it used to be

Several years ago (in 2012) I wrote How ZFS file prefetching seems to work, which discussed how ZFS prefetching worked at the time. As you may have guessed from the title of this entry, things have recently changed, at least in Illumos and other things built on the open source ZFS code (which includes the very latest ZFS on Linux). The basic change is Illumos 5987 - zfs prefetch code needs work, which landed in mainstream Illumos in early September of 2015, appears to have made it into FreeBSD trunk shortly afterwards, and which made it into ZFS on Linux only in late December.

The old code detected up to 8 streams (by default) of forward and reverse reads that were either straight sequential or strided (eg 'read every fourth block'). The new code still has 8 streams, but each stream now only matches sequential forward reads. This makes ZFS prefetching much easier to avoid and makes the code much easier to follow. I suspect that it won't have much effect on real workloads, although you never know; maybe there's real code that does strided forward reads or the like.

(There is also a tunable change; zfetch_max_distance replaces zfetch_block_cap as the limit on the amount of data that will be prefetched for a single stream. It's in bytes and defaults to 8 MBytes.)
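If you want to experiment with this tunable, I believe it can be adjusted the way other ZFS kernel tunables are on Illumos; this is an untested sketch from my understanding, not something from official documentation:

```shell
# Permanently, via a line in /etc/system (the value is in bytes,
# here 16 MBytes):
#   set zfs:zfetch_max_distance = 0x1000000
# Or on a live system with mdb ('/Z' writes an 8-byte value):
echo 'zfetch_max_distance/Z 0x1000000' | mdb -kw
```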

Unfortunately the largest single drawback of ZFS prefetching still remains: prefetching (still) doesn't notice if the data it read in gets discarded from the ARC before it could be used. Just as before, as long as you're reading sequentially from the file, it will keep prefetching more and more data. Nor do streams time out if the file hasn't been touched at all in a while; each ZFS dnode may have up to eight of them hanging around basically forever, waiting patiently to match against the next read and restart prefetching (perhaps very large prefetching, as the amount of data to be prefetched never shrinks as far as I can see).

(That streams are per dnode instead of per open file handle does help explain why ZFS wants up to eight of them, since the dnode is shared across everyone who has the file open. If multiple people have the same file open and are reading from it sequentially (perhaps in different spots), it's good if they all get prefetched.)

ZFSHowPrefetchingII written at 01:12:22; Add Comment

2016-01-11

The drawback of setting an explicit mount point for ZFS filesystems

ZFS has three ways of getting filesystems mounted and deciding where they go in the filesystem hierarchy. As covered in the zfs manpage, you have a choice of automatically putting the filesystem below the pool (so that tank/example is mounted as /tank/example), setting an explicit mount point with mountpoint=/some/where, or marking the filesystem as 'legacy' so that you mount it yourself through whatever means you want (usually /etc/vfstab, the legacy approach to filesystem mounts). With either of the first two options, ZFS will automatically mount and unmount filesystems as you import and export pools or do various other things (and will also automatically share them over NFS if set to do so); with the third, you're on your own to manage things.

The first approach is ZFS's default scheme and what many people follow. However, for what is in large part historical reasons we haven't used it; instead we've explicitly specified our mount points with mountpoint=/some/where on our fileservers. When I set up ZFS on Linux on my office workstation I also set the mount points explicitly, because I was migrating existing filesystems into ZFS and I didn't feel like trying to change their mount points (or add another layer of bind mounts).

For both our fileservers and my workstation, this has turned out to sometimes be awkward. The largest problem comes if you're in the process of moving a filesystem from one pool to another on the same server using zfs send and zfs recv. If mountpoint was unset, both versions of the filesystem could coexist, with one as /oldpool/fsys and the other as /newpool/fsys. But with mountpoint set, they both want to be mounted on the same spot and only one can win. This means we have to be careful to use 'zfs recv -u' and even then we have to worry a bit about reboots.

(You can set 'canmount=off' or clear the 'mountpoint' property on the new-pool version of the filesystem for the time when the filesystem is only part-moved, but then you have a divergence between your received snapshot and the current state of the filesystem and you'll have to force further incremental receives with 'zfs recv -F'. This is less than ideal, although such a divergence can happen anyways for other reasons.)
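For illustration, the mechanics of such a move go roughly like this (the pool and filesystem names are made up):

```shell
# Initial full copy; 'zfs recv -u' keeps the new copy from trying to
# mount on top of the old one, since both have the same mountpoint.
zfs snapshot oldpool/fsys@move1
zfs send oldpool/fsys@move1 | zfs recv -u newpool/fsys

# Later, an incremental catch-up; -F forces it if the received
# filesystem has diverged (for instance because its properties were
# changed by hand in the meantime).
zfs snapshot oldpool/fsys@move2
zfs send -i @move1 oldpool/fsys@move2 | zfs recv -u -F newpool/fsys
```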

On the other hand, there are definite advantages to not having the mount point change and for having mount points be independent of the pool the filesystem is in. There's no particular reason that either users or your backup system need to care which pool a particular filesystem is in (such as whether it's in a HD-based pool or a SSD-based one, or a mirrored pool instead of a slower but more space efficient RAIDZ one); in this world, the filesystem name is basically an abstract identifier, instead of the 'physical location' that normal ZFS provides.

(ZFS does not quite do 'physical location' as such, but the pool plus the position within the pool's filesystem hierarchy may determine a lot about stuff like what storage the data is on and what quotas are enforced. I call this the physical location for lack of a better phrase, because users usually don't care about these details or at least how they're implemented.)

On the third hand, arguably the right way to provide an 'abstract identifier' version of filesystems (if you need it) is to build another layer on top of ZFS. On Solaris, you'd probably do this through the automounter with some tool to automatically generate the mappings between logical filesystem identifiers and their current physical locations.

PS: some versions of 'zfs receive' allow you to set properties on the received filesystem; unfortunately, neither OmniOS nor ZFS on Linux currently support that. I also suspect that doing this creates the same divergence between received snapshot and received filesystem that setting the properties by hand does, and you're back to forcing incremental receives with 'zfs recv -F' (and re-setting the properties and so on).

(It's sort of a pity that canmount is not inherited, because otherwise you could receive filesystems into a special 'newpool/nomount' hierarchy that blocked mounts and then activate them later by using 'zfs rename' to move them out to their final place. But alas, no.)

ZFSMountpointConundrum written at 23:48:55; Add Comment

2016-01-05

Illumos's problem with its VCS commit messages

Quite a number of years ago I wrote an entry on the problem with the OpenSolaris source repository, where I called out Sun for terrible commit practices. At the time I thought that the public OS source repository had to be just a series of code snapshots turned into an external repository, but someone from Sun showed up in the comments to assure me that no, the terrible commit practices really were how they worked. I am glad to say that Illumos has fixed this problem in the Illumos master repository.

Well, mostly. Illumos does not routinely bundle multiple unrelated changes together into one commit the way that Sun used to, and (unlike Sun) their bug reports and so on are clearly visible. But they still have one problem with their commits. To show you what it is, here is a typical commit message:

6434 sa_find_sizes() may compute wrong SA header size
Reviewed-by: Ned Bass <...>
Reviewed-by: Brian Behlendorf <...>
[...]
Approved by: Robert Mustacchi <...>

That is the entire commit message. To know anything more, you must know how to look up the Illumos issue associated with this. Unless you do this, or are sufficiently knowledgeable about Illumos internals, it is probably not obvious that this is a ZFS bug; if you were scanning the commit logs to look for potentially important things for a ZFS fileserver environment, for example, this commit might not jump out at you as something you'd like.

Minimal commit messages like this are not what you'd call best practices. Pretty much everyone else has settled on a style where you at least describe a bit about the issue and the changes you're making. This lets people follow along just from the commit logs alone and provides a point in time snapshot of things; external bug reports may get updated or edited later, for example.

Beyond just the ability of people to follow the commit logs, this means that the Illumos commit history is not complete by itself. Since all the real content is in the Illumos issue tracker, the commit logs are crucially dependent on it. Lose the issue tracker (or just lose access to it) and you will be left to reconstruct scraps of understanding.

And, as far as I know, the Illumos issue tracker is not a distributed, replicated resource. There is one of it, and you cannot clone its data the way you can clone the Illumos repo itself.

(I'm sure it's backed up and there's multiple people involved. But there's still centralization here, and we've had things happen to centralized open source resources before. If nothing else, life on the Internet has taught me that almost everything shuts down sooner or later.)

At one point I thought it would be nice to at least include the URL of the Illumos issue in the commit message. I'm not sure of that any more, although I'm sure it'd help some people. It feels like a half-hearted bandaid, though. On the other hand, ZFS on Linux does put in URL references when porting Illumos changes into ZoL (eg) and I do like it, although it's a somewhat different situation.

(I don't expect this part of Illumos development culture to change. I'm sure the people doing Illumos development have heard all of these arguments before, and since they're doing what they're doing they're clearly happy with doing it their way.)

IllumosCommitMessages written at 01:18:37; Add Comment

2015-12-28

The limits of what ZFS scrubs check

In the ZFS community, there is a widespread view that ZFS scrubs are the equivalent of fsck for ordinary filesystems and so check for and find at least as many error conditions as fsck does. Unfortunately this view of ZFS scrubs is subtly misleading and can lead you to expect them to do things that they simply don't.

The simple version of what a ZFS scrub does is that it verifies the checksum for every copy of every (active) block in the ZFS pool. It also explicitly verifies parity blocks for RAIDZ vdevs (which a normal error-free read does not). In the process of doing this verification, the scrub must walk the entire object tree of the pool from the top downwards, which has the side effect of more or less verifying this hierarchy; certainly if there's something like a directory entry that points to an invalid thing, you will get a checksum error somewhere in the process.

However, this is all that a ZFS scrub verifies. In particular, it does not check the consistency and validity of metadata that isn't necessary to walk the ZFS object tree. This includes things like much of the inode data that is returned by stat() calls, and also internal structural information that is not necessary to walk the tree. Such information is simply tacitly assumed to be correct if its checksum verifies.

What this means at a broad level is that while a ZFS scrub guards against on disk corruption of data that was correct when it was written, it does not protect against internal corruption of data. If RAM errors or ZFS bugs cause corrupt data to be written, a ZFS scrub will not detect it even though it may be obvious in, for example, a ls -l. This is not just a theoretical issue, and has been encountered on multiple platforms.

(I also believe that ZFS scrubs don't try to do full consistency checks on ZFS's tracking of free disk blocks. I'm not sure if they even try to check that all in-use blocks are actually marked that way.)

This means that a ZFS scrub does somewhat different checks than a traditional fsck. Traditional fsck can't verify block integrity except indirectly, unlike scrubs, but fsck does a lot of explicit consistency checks of things like inode modes to make sure they're sane and it does verify that the filesystem's idea of free space is correct.

It would be possible to make ZFS scrubs do additional checks, and this may happen at some point. But that is not the state of affairs today, so you can have a ZFS pool with corruption that nevertheless passes ZFS scrubs with no errors. In extreme cases, you may wind up with a pool that panics the system. You can do a certain amount of verification yourself, for example by writing a program that walks the entire filesystem to verify that there are no inodes with crazy modes. And if you make your backups with a conventional system that works through the filesystem (instead of with ZFS snapshot replication), your backups will do a certain amount of verification themselves just by walking the filesystem and trying to read all of the files (sooner or later).
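As a crude sketch of such a verification pass, here is one way to do it with find; it relies on the fact that a corrupt file type in st_mode will normally fail all of the standard -type tests and so get printed. (On Solaris/OmniOS you might also want to exclude doors with -type D, and a real program would check mode bits more carefully than this.)

```shell
# Walk a filesystem tree and print anything whose file type is not one
# of the standard ones; the mount point argument is a placeholder.
check_modes() {
  find "$1" -xdev ! -type f ! -type d ! -type l \
            ! -type p ! -type s ! -type b ! -type c -print
}
# Usage: check_modes /some/zfs/filesystem
```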

ZFSScrubLimits written at 02:44:19; Add Comment

2015-12-05

The details behind zpool list's new fragmentation percentage

In this entry I explained that zpool list's new FRAG field is a measure of how fragmented the free space in the pool is, but I ignored all of the actual details. Today it's time to fix that, and to throw in the general background on top of it. So first we need to start by talking about free (disk) space.

All filesystems need to keep track of free disk space somehow. ZFS does so using a number of metaslabs, each of which has a space map; simplifying a bunch, spacemaps keep track of segments of contiguous free space in the metaslab (up to 'the whole metaslab'). A couple of years ago, a new ZFS feature called spacemap_histogram was added as part of a spacemap/metaslab rework. Spacemap histograms maintain a powers-of-two histogram of how big the segments of free space in metaslabs are. The motivation for this is, well, let me just quote from the summary of the rework:

The current [pre-histogram] disk format only stores the total amount of free space [in a metaslab], which means that heavily fragmented metaslabs can look appealing, causing us to read them off disk, even though they don't have enough contiguous free space to satisfy large allocations, leading us to continually load the same fragmented space maps over and over again.

(Note that when this talks about 'heavily fragmented metaslabs' it means heavily fragmented free space.)

To simplify slightly, each spacemap histogram bucket is assigned a fragmentation percentage, ranging from '0' for the 16 MB and larger buckets down to '100' for the 512 byte bucket, and then well, once again I'll just quote directly from the source:

This table defines a segment size based fragmentation metric that will allow each metaslab to derive its own fragmentation value. This is done by calculating the space in each bucket of the spacemap histogram and multiplying that by the fragmentation metric in this table. Doing this for all buckets and dividing it by the total amount of free space in this metaslab (i.e. the total free space in all buckets) gives us the fragmentation metric. This means that a high fragmentation metric equates to most of the free space being comprised of small segments. Conversely, if the metric is low, then most of the free space is in large segments. A 10% change in fragmentation equates to approximately double the number of segments.

My first entry summarized the current values in the table, or you can read the actual zfs_frag_table table in the source code. There is one important bit that is not in the table at all, which is that a metaslab with no free space left is considered 0% fragmented.

A pool's fragmentation value is derived in a two step process, because metaslabs are actually grouped together in 'metaslab groups' (I believe each vdev gets one). All metaslabs in a metaslab group are the same size, so the fragmentation for a metaslab group is just the average fragmentation over all metaslabs with valid spacemap histograms. The overall pool fragmentation is then derived from the metaslab group fragmentations, weighted by how much total space each metaslab group contributes (not how much free space).

A sufficiently recent pool will have spacemap histograms for all metaslabs. A pool that was created before this feature was added but then upgraded may not have spacemap histograms created for all of its metaslabs yet (I believe that a spacemap histogram is only added if the metaslab spacemap winds up getting written out with changes). If too many metaslabs in any single metaslab group lack spacemap histograms, the pool is considered to not have an overall fragmentation percentage (zpool list will report this as a FRAG value of '-', even though the spacemap_histogram feature is active).

(Currently 'too many metaslabs' is 'half or more of the metaslabs in a metaslab group', but this may change.)

You can inspect raw metaslab spacemap histograms through zdb, using 'zdb -mm <POOL>'. Note that the on-disk histogram has more buckets than the fragmentation percentage table does (it has 32 entries versus zfs_frag_table's 17). The bucket numbers printed represent raw powers of two, eg a bucket number of 10 is 2^10 bytes or 1 KB; this implies that you'll never see a bucket number smaller than the vdev's ashift. Zdb also reports the calculated fragmentation percentage for each metaslab (as 'fragmentation NN').

(It looks like mdb can also dump this information when it is reporting on appropriate vdevs, via '::vdev -m'. I have not investigated this, just noticed it in the source.)

The metaslab fragmentation number is used for more than just reporting a metric in zpool list. There are a number of bits of ZFS block allocation that pay attention to it when deciding what metaslab to allocate new space from. There are also some ZFS global variables related to this, but since I haven't dug into this area at all I'm not going to say anything about them.

(In the Illumos source, all of this is in uts/common/fs/zfs/metaslab.c; you want to search for all of the things that talk about fragmentation. Note that there's multiple levels of functions involved in this.)

ZFSZpoolFragmentationDetails written at 01:38:04; Add Comment

2015-12-02

What zpool list's new FRAG fragmentation percentage means

Recent versions of 'zpool list' on Illumos (and elsewhere) have added a new field of information called 'FRAG', reported as a percentage, which the zpool manpage will tell you is 'the amount of fragmentation in the pool'. To put it politely, this is very under-documented (and in a misleading way). Based on an expedition into the current Illumos kernel code, as far as I can tell:

zpool list's FRAG value is an abstract measure of how fragmented the free space in the pool is.

A pool with a low FRAG percent has most of its remaining free space in large contiguous segments, while a pool with a high FRAG percentage has most of its free space broken up into small pieces. The FRAG percentage tells you nothing about how fragmented (or not fragmented) your data is, and thus how many seeks it will take to read it back. Instead it is part of how hard ZFS will have to work to find space for large chunks of new data (and how fragmented they may be forced to be when they get written out).

(How hard ZFS has to work to find space is also influenced by how much total free space is left in your pool. There's likely to be some correlation between low free space and higher FRAG numbers, but I wouldn't assume that they're inextricably yoked together.)

FRAG also doesn't tell you how evenly the free space is distributed across your disk(s). As far as I know, adding a new vdev or expanding an existing one will generally result in the new space being seen as essentially unfragmented; this can drop your overall FRAG percent even if your old disk space had very fragmented free space. In practice this probably doesn't matter, since ZFS will generally prefer to write things to that new (and unfragmented) space.

(Such a drop in FRAG is 'fair' in the sense that the chances that ZFS will be able to find a large chunk of free space have gone way up.)

How the percentages relate to the average segment size of free space goes roughly like this. Based on the current Illumos kernel code, if all free space was in segments of the given size, the reported fragmentation would be:

  • 512 B and 1 KB segments are 100% fragmented
  • 2 KB segments are 98% fragmented; 4 KB segments are 95% fragmented.
  • 8 KB to 1 MB segments start out at 90% fragmented and drop 10% for every power of two (eg 16 KB is 80% fragmented and 1 MB is 20%). 128 KB segments are 50% fragmented.
  • 2 MB, 4 MB, and 8 MB segments are 15%, 10%, and 5% fragmented respectively
  • 16 MB and larger segments are 0% fragmented.

Of course the free space is probably not all in segments of one size. ZFS does the obvious thing and weights each segment size bucket by the amount of free space that falls into that range. This makes FRAG essentially an average, which means it has the usual hazards of averages.
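As a made-up illustration of this weighting, suppose the free space is split evenly between 4 KB segments (95% fragmented) and 1 MB segments (20% fragmented); the numbers here are invented, and this just reproduces the weighted average rather than ZFS's actual code:

```shell
# Each input line is: <segment size> <frag %> <bytes of free space>.
# The result is the space-weighted average fragmentation.
awk '{ frag += $2 * $3; total += $3 }
     END { printf "pool FRAG ~ %d%%\n", frag / total }' <<'EOF'
4096    95 1073741824
1048576 20 1073741824
EOF
# prints: pool FRAG ~ 57%
```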

Note that these fragmentation percents are relatively arbitrary, as comments in the Illumos kernel code admit; they are designed to produce what the ZFS developers feel is a useful result, not by following any strict mathematical formula. They may also change in the future. As far as relative values go, according to comments in the source code, 'a 10% change in fragmentation equates to approximately double the number of segments'.

(The source code explicitly calls the fragmentation percentage a 'metric' as opposed to a direct measurement.)

I believe that one interesting consequence of the current OmniOS code is that a pool on 4K sector disks (a pool with ashift=12) can never be reported as more than 95% fragmented, because 4K is the minimum allocation size and thus the minimum free segment size. I would not be surprised if in the future ZFS modifies the fragmentation percents reported for such pools so that 4K segments become '100% fragmented'.

(Technically it would be a per-vdev thing, but in practice I think that very few people mix vdevs with different ashifts and block sizes.)

I was initially planning on writing up the technical details too, but this entry is already long enough as it is so I'm deferring them to another entry.

ZFSZpoolFragmentationMeaning written at 22:46:21; Add Comment

2015-11-14

ZFS pool import needs much better error messages

One of the frustrating things about dealing with sufficiently damaged ZFS pools is that 'zpool import' and friends do not generate very detailed error messages. There are a lot of things that can go wrong with a ZFS pool that will make it not importable, but 'zpool import' has clear explanations for only some of them. For many others all you get is a generic error in 'zpool import' status reporting of, say:

The pool cannot be imported due to damaged devices or data.

(Here I'm talking about the results of just running 'zpool import' to see available pools and their states and configuration, not trying to actually import a pool. Here zpool has lots of room to write explicit and detailed messages about what seems to be wrong with your pool's configuration.)

This isn't just an issue of annoying and frustrating people with opaque, generic error messages. Given that the error messages are generic, it's quite easy for people to focus only on the obvious problems that zpool import reports, even if those problems may not be the reason the pool can't be imported. As it happens I have a great example of this in action, in this SuperUser question. When you read this question, can you figure out what's wrong? Neither the SuperUser ZFS community nor the ZFS on Linux mailing list could.

(I believe that everything you need to figure out what's going on is actually in the information in the question and the code behind 'zpool import' actually knows what the problem is. This assumes that my diagnosis is correct, of course.)

Perhaps zpool import should not be fully verbose by default, as there's a certain amount of information that may only make sense to people who know a fair bit about how ZFS works. But it certainly should be possible to get this information with, e.g., a verbose switch, instead of having to reverse engineer it from zdb output. If nothing else, this means that you can get a verbose report and show it to ZFS experts in the hope that they can tell you what's wrong.

On a purely pragmatic level I think that zpool import should be really verbose and detailed when a pool can't be imported. 'My pool won't import' is one of the most stressful experiences you can have with ZFS; to get unclear, generic errors at this point is extremely frustrating and does not help one's mood in the least. This is exactly the time when large amounts of detail are really, really appreciated, even if they're telling you exactly how far up the creek you are.

(This means that I would very much like a 'zpool import -v <pool>' option that describes exactly what the import is doing or trying to do and then covers all of the problems that it detected with the pool configuration, all the things the kernel said to it, and so on. A report of 'I am asking the kernel to import a pool made up of the following devices in the following vdev structure' is not too verbose.)

PS: while this example is from ZFS on Linux and FreeBSD, I've looked at the current Illumos code for zpool and libzfs, and as far as I can see it would have exactly the same problem here.

(Part of the issue is that zpool import and libzfs have what you could call less than ideal reporting if a pool is marked as active on some other system and also has configuration problems. But even if it reported multiple errors I think that the real problem here would remain obscure; the current 'zpool import' code appears to deliberately suppress printing parts of the necessary information.)

ZFSImportBetterErrors written at 00:35:51

2015-11-13

We killed off our SunSolve email contact address yesterday

Back in the days when Sun was Sun, Sun's patch access and support system was imaginatively called SunSolve. If you had a support contract with Sun (which often was only about the ability to get patches and file bug reports), you had a SunSolve account. We had one, of course (we have been using Solaris for longer than it's been Solaris). In the very beginning we made a classic mistake and had it in the name and email of a specific sysadmin (who then moved on), but in the early days of our Solaris 10 fileservers we switched this to a generic email address, cleverly named sunsolve.

Yesterday, we removed that address.

Our Solaris machines have all been out of commission for a while now, but we left the address in place mostly because of inertia. What pushed me to remove it is the usual reason; we just couldn't get Oracle to stop mailing things to it. I don't think Oracle spammed it (unlike some people), but they did keep sending us information about patch clusters and quarterly updates and this and that, all of which is irrelevant to us these days.

(I managed to get Oracle to mostly knock it off, but the other day they decided that they had an update that was so urgent that they just had to mail it to us. Never mind that we don't have any of the software at issue, that Oracle had our email address was good enough for them.)

At one level this is an unimportant little bit of cleanup that we should have done long ago. With our Solaris machines gone and our grandfathered support contract let run down, the email address had no point; it was just another lingering bit of clutter, and we should get rid of that kind of thing while we remember what it is and why we can remove it.

(If you wait long enough on this sort of thing, you can easily forget whether or not there's some special, inobvious reason that you're keeping these old oddities around. So it's best to strike while everything is fresh in your mind.)

At another level, the sunsolve email address was one of the last lingering traces of what was (after all) a very long association with Sun and Solaris. Just as with other things, letting it go is yet another line drawn under all of that history, even if SunSolve itself stopped existing years ago.

(Oracle decommissioned SunSolve and folded the functionality into their own support system not long after they bought Sun. The conversion was not entirely pleasant for support customers.)

PS: Since I just looked, it warms my heart a little bit that PCA is still trucking along. Oracle may have killed some very useful customer-done things but at least they left PCA alone. If we still had to deal with the mess that is Solaris patches, we'd be very thankful for that.

SunSolveEnding written at 00:27:47

