Wandering Thoughts

2017-03-27

We're probably going to upgrade our OmniOS servers by reinstalling them

We're currently running OmniOS r151014 on our fileservers, which is the current long term support release (although we're behind on updates, because we avoid them for stability reasons). However, per the OmniOS release cycle, there's a new LTS release coming this summer and about six to nine months later, our current r151014 version will stop being supported entirely. Despite what I wrote not quite a year ago about how we might not upgrade at all, we seem to be broadly in support of the idea of upgrading when the next LTS release is out in order to retain at least the option of applying updates for security issues and so on.

This raises the question of how we do it, because there are two possible options: we could reinstall (what we did the last time around), or upgrade the existing systems through the normal process with a new boot environment. Having thought about it, I think that I'm likely to argue for upgrading via full reinstalls (on new system disks). There are two reasons for this, one specific to this particular version change and one more general one.

The specific issue is that OmniOS is in the process of transitioning to a new bootloader; they're moving from an old version of Grub to a version of the BSD bootloader (which OmniOS calls the 'BSD Loader'). While it's apparently going to be possible to stick with Grub or switch bootloaders over the transition, the current OmniOS Bloody directions make this sound pretty intricate. Installing a new OmniOS from scratch on new disks seems to be the cleanest and best way to get the new bootloader for the new OmniOS while preserving Grub for the old OmniOS (on the old disks).

The broader issue is that reinstalling from scratch on new disks every time is more certain for rollbacks (since we can keep the old disks) and means that any hypothetical future systems we install wind up the same as the current ones without making us go through extra work. If we did in-place upgrades, getting identical new installs would actually require installing r151014 and then immediately upgrading it to the new LTS. If we just installed the new LTS directly, various sorts of subtle differences and incompatibilities could sneak in.

(This is of course not specific to OmniOS. It's very hard to make sure that upgraded systems are exactly the same as newly installed systems, especially if you've upgraded the systems over significant version boundaries.)

I like the idea of upgrading between OmniOS versions using boot environments in theory (partly because it's neat if it works), it would probably be faster and less of a hassle, and I may yet change my mind here. But I suspect that we're going to do it the tedious way just because it's easier on us in the long run.
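
For reference, my understanding is that the boot environment path would go something like this; it's a sketch we haven't tried, and the publisher URL is a deliberate placeholder rather than the real one for the new LTS:

pkg set-publisher -G '*' -g <new LTS repository URL> omnios
pkg update --be-name omnios-newlts
beadm list
init 6

The --be-name argument makes pkg put the upgrade into an explicitly named new boot environment, and activating the old boot environment with beadm would be the rollback path if the upgrade went badly.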

OmniOSUpgradesViaReinstalls written at 01:45:33; Add Comment

2017-03-03

Some notes on ZFS per-user quotas and their interactions with NFS

In addition to quotas on filesystems themselves (refquota) and quotas on entire trees (plain quota), ZFS also supports per-filesystem quotas on how much space users (or groups) can use. We haven't previously used these for various reasons, but today we had a situation with an inaccessible runaway user process eating up all the free space in one pool on our fileservers and we decided to (try to) stop it by sticking a quota on the user. The result was reasonably educational and led to some additional educational experimentation, so now it's time for notes.

User quotas for a user on a filesystem are created by setting the userquota@<user> property of the filesystem to some appropriate value. Unlike overall filesystem and tree quotas, you can set a user quota that is below the user's current space usage. To see the user's current space usage, you look at userused@<user> (which will have its disk space number rounded unless you use 'zfs get -p userused@<user> ...'). To clear the user's quota limit once you no longer need it, set it to none instead of a size.
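
Concretely, the commands look like this (with a made-up user and filesystem name):

zfs set userquota@someuser=100G pool/fs
zfs get -p userused@someuser pool/fs
zfs set userquota@someuser=none pool/fs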

(The current Illumos zfs manpage has an annoying mistake, where its section on the userquota@<user> property talks about finding out space by looking at the 'userspace@<user>' property, which is the wrong property name. I suppose I should file a bug report.)

Since user quotas are per-filesystem only (as mentioned), you need to know which filesystem or filesystems your errant user is using space on in your pool in order to block a runaway space consumer. In our case we already have some tools for this and had localized the space growth to a single filesystem; otherwise, you may want to write a script in advance so you can freeze someone's space usage at its current level on a collection of filesystems.

(The mechanics are pretty simple; you set the userquota@<user> value to the value of the userused@<user> property, if it exists. I'd use the precise value unless you're sure no user will ever use enough space on a filesystem to make the rounding errors significant.)
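
A minimal sketch of such a script, with all error checking left out and the details of picking the user and pool up to you:

#!/bin/sh
# Freeze a user's space usage at its current level on every filesystem
# in a pool, by setting their per-filesystem quota to their current usage.
user="$1"
pool="$2"
zfs list -H -o name -r "$pool" | while read fs; do
    used=$(zfs get -Hp -o value "userused@$user" "$fs" 2>/dev/null)
    case "$used" in
        ""|-) continue ;;   # no recorded usage here, so nothing to freeze
    esac
    zfs set "userquota@$user=$used" "$fs"
done

In real life you'd probably also want to log what you set where, so that you can undo it all later.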

Then we have the issue of how firmly and how fast quotas are enforced. The zfs manpage warns you explicitly:

Enforcement of user quotas may be delayed by several seconds. This delay means that a user might exceed their quota before the system notices that they are over quota and begins to refuse additional writes with the EDQUOT error message.

This is especially the case over NFS (at least NFS v3), where NFS clients may not start flushing writes to the NFS server for some time. In my testing, I saw the NFS client's kernel happily accept a couple of GB of writes before it started forcing them out to the fileserver.
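
If you want to reproduce this yourself, something like the following on an NFS client (run as the quota'd user, while you watch 'zfs get userused@<user>' on the fileserver) will do it; the path is made up:

dd if=/dev/zero of=/nfs/mnt/testfile bs=1M count=4096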

The behavior of an OmniOS NFS server here is somewhat variable. On the one hand, we saw space usage for our quota'd user keep increasing over the quota for a certain amount of time after we applied the quota (unfortunately I was too busy to time it or carefully track it). On the other hand, in testing, if I started to write to an existing but empty file (on the NFS client) once I was over quota, the NFS server refused all writes and didn't put any data in the file. My conclusion is that at least for NFS servers, the user may be able to go over your quota limit by a few hundred megabytes under the right circumstances. However, once ZFS knows that you're over the quota limit a lot of things shut down immediately; you can't make new files, for example (and NFS clients helpfully get an immediate error about this).

(I took a quick look at the kernel code but I couldn't spot where ZFS updates the space usage information in order to see what sort of lag there is in the process.)

I haven't tested what happens to fileserver performance if a NFS client keeps trying to write data after it has hit the quota limit and has started getting EDQUOT errors. You'd think that the fileserver should be unaffected, but we've seen issues when pools hit overall quota size limits.

(It's not clear if this came up today when the user hit the quota limit and whatever process(es) they were running started to get those EDQUOT errors.)

ZFSUserQuotaNotes written at 01:01:22; Add Comment

2017-02-24

How ZFS bookmarks can work their magic with reasonable efficiency

My description of ZFS bookmarks covered what they're good for, but it didn't talk about what they are at a mechanical level. It's all very well to say 'bookmarks mark the point in time when [a] snapshot was created', but how does that actually work, and how does it allow you to use them for incremental ZFS send streams?

The succinct version is that a bookmark is basically a transaction group (txg) number. In ZFS, everything is created as part of a transaction group and gets tagged with the TXG of when it was created. Since things in ZFS are also immutable once written, we know that an object created in a given TXG can't have anything under it that was created in a more recent TXG (although it may well point to things created in older transaction groups). If you have an old directory with an old file and you change a block in the old file, the immutability of ZFS means that you need to write a new version of the data block, a new version of the file metadata that points to the new data block, a new version of the directory metadata that points to the new file metadata, and so on all the way up the tree, and all of those new versions will get a new birth TXG.

This means that given a TXG, it's reasonably efficient to walk down an entire ZFS filesystem (or snapshot) to find everything that was changed since that TXG. When you hit an object with a birth TXG before (or at) your target TXG, you know that you don't have to visit the object's children because they can't have been changed more recently than the object itself. If you bundle up all of the changed objects that you find in a suitable order, you have an incremental send stream. Many of the changed objects you're sending will contain references to older unchanged objects that you're not sending, but if your target has your starting TXG, you know it has all of those unchanged objects already.
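
If you want to see birth TXGs for yourself, zdb will show them to you. A file's ZFS object number is its inode number, so a sketch of the process (with made-up names) is:

ls -i /pool/fs/somefile
zdb -ddddd pool/fs <object number from ls -i>

At that verbosity zdb dumps the file's block pointers, and each block pointer includes the birth TXG of the block it points to.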

To put it succinctly, I'll quote a code comment from libzfs_core.c (via):

If "from" is a bookmark, the indirect blocks in the destination snapshot are traversed, looking for blocks with a birth time since the creation TXG of the snapshot this bookmark was created from. This will result in significantly more I/O and be less efficient than a send space estimation on an equivalent snapshot.

(This is a comment about getting a space estimate for incremental sends, not about doing the send itself, but it's a good summary and it describes the actual process of generating the send as far as I can see.)

Yesterday I said that ZFS bookmarks could in theory be used for an imprecise version of 'zfs diff'. What makes this necessarily imprecise is that while scanning forward from a TXG this way can tell you all of the new objects and it can tell you what is the same, it can't explicitly tell you what has disappeared. Suppose we delete a file. This will necessarily create a new version of the directory the file was in and this new version will have a recent TXG, so we'll find the new version of the directory in our tree scan. But without the original version of the directory to compare against we can't tell what changed, just that something did.

(Similarly, we can't entirely tell the difference between 'a new file was added to this directory' and 'an existing file had all its contents changed or rewritten'. Both will create new file metadata that will have a new TXG. We can tell the case of a file being partially updated, because then some of the file's data blocks will have old TXGs.)

Bookmarks specifically don't preserve the original versions of things; that's why they take no space. Snapshots do preserve the original versions, but they take up space to do that. We can't get something for nothing here.

(More useful sources on the details of bookmarks are this reddit ZFS entry and a slide deck by Matthew Ahrens. Illumos issue 4369 is the original ZFS bookmarks issue.)

Sidebar: Space estimates versus actually creating the incremental send

Creating the actual incremental send stream works exactly the same for sends based on snapshots and sends based on bookmarks. If you look at dmu_send in dmu_send.c, you can see that in the case of a snapshot it basically creates a synthetic bookmark from the snapshot's creation information; with a real bookmark, it retrieves the data through dsl_bookmark_lookup. In both cases, the important piece of data is zmb_creation_txg, the TXG to start from.

This means that contrary to what I said yesterday, using bookmarks as the origin for an incremental send stream is just as fast as using snapshots.

What is different is if you ask for something that requires estimating the size of the incremental sends. Space estimates for snapshots are pretty efficient because they can be made using information about space usage in each snapshot. For details, see the comment before dsl_dataset_space_written in dsl_dataset.c. Estimating the space of a bookmark based incremental send requires basically doing the same walk over the ZFS object tree that will be done to generate the send data.

(The walk over the tree will be somewhat faster than the actual send, because in the actual send you have to read the data blocks too; in the tree walk, you only need to read metadata.)

So, you might wonder how you ask for something that requires a space estimate. If you're sending from a snapshot, you use 'zfs send -v ...'. If you're sending from a bookmark or a resume token, well, apparently you just don't; sending from a bookmark doesn't accept -v and -v on resume tokens means something different from what it does on snapshots. So this performance difference is kind of a shaggy dog story right now, since it seems that you can never actually use the slow path of space estimates on bookmarks.
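
For the record, the snapshot version of asking for an estimate looks like this; with -n nothing is actually sent, and the estimated stream size is printed as part of the verbose output:

zfs send -nv -i pool/fs@previous pool/fs@current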

ZFSBookmarksMechanism written at 00:26:44; Add Comment

2017-02-22

ZFS bookmarks and what they're good for

Regular old fashioned ZFS has filesystems and snapshots. Recent versions of ZFS add a third object, called bookmarks. Bookmarks are described like this in the zfs manpage (for the 'zfs bookmark' command):

Creates a bookmark of the given snapshot. Bookmarks mark the point in time when the snapshot was created, and can be used as the incremental source for a zfs send command.

ZFS on Linux has an additional explanation here:

A bookmark is like a snapshot, a read-only copy of a file system or volume. Bookmarks can be created extremely quickly, compared to snapshots, and they consume no additional space within the pool. Bookmarks can also have arbitrary names, much like snapshots.

Unlike snapshots, bookmarks can not be accessed through the filesystem in any way. From a storage standpoint a bookmark just provides a way to reference when a snapshot was created as a distinct object. [...]

The first question is why you would want bookmarks at all. Right now bookmarks have one use, which is saving space on the source of a stream of incremental backups. Suppose that you want to use zfs send and zfs receive to periodically update a backup. At one level, this is no problem:

zfs snapshot pool/fs@current
zfs send -Ri previous pool/fs@current | ...

The problem with this is that you have to keep the previous snapshot around on the source filesystem, pool/fs. If space is tight and there is enough data changing on pool/fs, this can be annoying; it means, for example, that if people delete some files to free up space for other people, they actually haven't done so because the space is being held down by that snapshot.

The purpose of bookmarks is to allow you to do these incremental sends without consuming extra space on the source filesystem. Instead of having to keep the previous snapshot around, you instead make a bookmark based on it, delete the snapshot, and then do the incremental zfs send using the bookmark:

zfs snapshot pool/fs@current
zfs send -i #previous pool/fs@current | ...
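
To spell out the whole cycle with illustrative names (in real life you'd presumably use dated snapshot and bookmark names):

zfs snapshot pool/fs@current
zfs send -i #previous pool/fs@current | ...
zfs bookmark pool/fs@current pool/fs#current
zfs destroy pool/fs@current

Afterward the source filesystem is left holding only the new bookmark, which is what the next incremental send will start from.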

This is apparently not quite as fast as using a snapshot, but if you're using bookmarks here it's because the space saving is worth it, possibly in combination with not having to worry about unpredictable fluctuations in how much space a snapshot is holding down as the amount of churn in the filesystem varies.

(We have a few filesystems that get frequent snapshots for fast recovery of user-deleted files, and we live in a certain amount of concern that someday, someone will dump a bunch of data on the filesystem, wait just long enough for a scheduled snapshot to happen, and then either move the data elsewhere or delete it. Sorting that one out to actually get the space back would require deleting at least some snapshots.)

Using bookmarks does require you to keep the previous snapshot on the destination (aka backup) filesystem, although the manpage only tells you this by implication. I believe that this implies that while you're receiving a new incremental, you may need extra space over and above what the current snapshot requires, since you won't be able to delete previous and recover its space until the incremental receive finishes. The relevant bit from the manpage is:

If an incremental stream is received, then the destination file system must already exist, and its most recent snapshot must match the incremental stream's source. [...]

This means that the destination filesystem must have a snapshot. This snapshot will and must match a bookmark made from it, since otherwise incremental send streams from bookmarks wouldn't work.

(In theory bookmarks could also be used to generate an imprecise 'zfs diff' without having to keep the origin snapshot around. In practice I doubt anyone is going to implement this, and why it's necessarily imprecise requires an explanation of why and how bookmarks work.)

ZFSBookmarksWhatFor written at 23:58:39; Add Comment

2017-01-13

The ZFS pool history log that's used by 'zpool history' has a size limit

I have an awkward confession. Until Aneurin Price mentioned it in his comment on my entry on 'zpool history -i', I had no idea that the internal, per-pool history log that zpool history uses has a size limit. I thought that perhaps the size and volume of events was small enough that ZFS just kept everything, which is silly in retrospect. This unfortunately means that the long-term 'strategic' use of zpool history that I talked about in my first entry has potentially significant limits, because you can only go back so far in history. How far depends on a number of factors, including how many snapshots and so on you take.

(If you're just inspecting the output of 'zpool history', it's easy to overlook that it's gotten truncated, because it always starts with the pool's creation. This is because the ZFS code that maintains the log goes out of its way to make sure that the initial pool creation record is kept forever.)

The ZFS code that creates and maintains the log is in spa_history.c. As far as the log's size goes, let me quote the comment in spa_history_create_obj:

/*
 * Figure out maximum size of history log.  We set it at
 * 0.1% of pool size, with a max of 1G and min of 128KB.
 */

Now, there is a complication, which is that the pool history log is only sized and set up once, at initial pool creation. So that size is not 0.1% of the current pool size, it is 0.1% of the initial pool size, whatever that was. If your pool has been expanded since its creation and started out smaller than 1000 GB, its history log is smaller (possibly much smaller) than it would be if you recreated the pool at 1000 GB or more now. Unfortunately, based on the code, I don't think ZFS can easily resize the history log after creation (and it certainly doesn't attempt to now).

The ZFS code does maintain some information about how many records have been lost and how many total bytes have been written to the log, but these don't seem to be exposed in any way to user-level code; they're simply there in the on-disk and in-memory data structures. You'd have to dig them out of the depths of the kernel with DTrace or the like, or you can use zdb to read them off disk.

(It turns out that our most actively snapshotted pool, which probably has the most records in its log, only has an 11% full history log at the moment.)

Sidebar: Using zdb to see history log information

These are brief notes, in the style of using zdb to see the ZFS delete queue. First we need to find out the object ID of the SPA history information, which is always going to be in the pool's root dataset (as far as I know):

# zdb -dddd rpool 1
Dataset mos [META], [...]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         1    1    16K    16K  24.0K    32K  100.00  object directory
[...]
               history = 32 
[...]

The history log is stored in a ZFS object; here that is object number 32. Since it was object 32 in three pools that I checked, it may almost always be that.

# zdb -dddd rpool 32
Dataset [...]
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        32    1    16K   128K  36.0K   128K  100.00  SPA history
                                         40   bonus  SPA history offsets
        dnode flags: USED_BYTES 
        dnode maxblkid: 0
                pool_create_len = 536
                phys_max_off = 79993765
                bof = 536
                eof = 77080
                records_lost = 0

The bof and eof values are logical byte positions in the ring buffer, and so at least eof will be larger than phys_max_off if you've started losing records. For more details, see the comments in spa_history.c.

ZFSZpoolHistorySizeLimit written at 01:28:05; Add Comment

2017-01-11

ZFS's potentially very useful 'zpool history -i' option

I recently wrote a little thing praising zpool history. At the time I wrote that, I hadn't really read the manpage carefully enough to have noticed an important additional feature, which is zpool history's -i argument (and -l as well, sometimes). To quote the manpage, -i 'displays internally logged ZFS events in addition to user initiated events'. What this means in plain language is that 'zpool history -i' shows you a lot of what happened to your pool no matter how it was done. This may sound irrelevant and abstract, so let me give you a concrete example.

Did you know that you can create and delete snapshots in a filesystem by using mkdir and rmdir in the <filesystem>/.zfs/snapshot directory? If you have sufficient privileges (root is normally required), this works both locally and over NFS to a ZFS fileserver. Snapshots created and deleted this way don't show up in plain 'zpool history' because of course they weren't created with a 'zfs' command, but they do show up in 'zpool history -i'.
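
If you haven't seen this in action, it's exactly as simple as it sounds (with made-up filesystem and snapshot names):

mkdir /pool/fs/.zfs/snapshot/testsnap
ls /pool/fs/.zfs/snapshot
rmdir /pool/fs/.zfs/snapshot/testsnap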

When you're looking at the output at this level, you will typically see three log events for a single command:

<time> [txg:12227245] snapshot fs0-core-01/cs/mail@2017_01_10 (4350)
<time> ioctl snapshot
  input:
    snaps:
      fs0-core-01/cs/mail@2017_01_10
    props:

<time> zfs snapshot fs0-core-01/cs/mail@2017_01_10

The [txg:NNN] first line is the low-level internal log and is apparently the only log entry that's guaranteed to be there, I assume because it's written as part of the transaction; the remaining records can be lost if the machine fails at the right time or the program crashes, and they're written after the TXG record (as we see here). The ioctl entry tells us that this was a snapshot operation initiated from user level through a ZFS ioctl. And the final line tells us that this snapshot creation was done by the zfs command.

(Much of this is from Matthew Ahrens of Delphix in the ZFS developers mailing list, and his message is (indirectly) how I found out about the -i flag.)

If this was a snapshot creation or deletion that had been done through mkdir and rmdir, there would only be the [txg:NNN] log entries (because obviously they use neither user-level ioctls nor the zfs command).

There seem to be any number of interesting internally logged ZFS events, but at this point I haven't really gone looking into this in any depth. I encourage people to look at this themselves for their own pools.

ZFSZpoolHistoryIOption written at 01:47:29; Add Comment

2017-01-02

ZFS may panic your system if you have an exceptionally slow IO

Today, one of our ZFS fileservers panicked. The panic itself is quite straightforward:

genunix: I/O to pool 'fs0-core-01' appears to be hung.
genunix: ffffff007a0c5a20 zfs:vdev_deadman+10b ()
[...]
genunix: ffffff007a0c5af0 zfs:spa_deadman+ad ()
[...]

The spa_deadman function is to be found in spa_misc.c and vdev_deadman is in vdev.c. The latter has the important comment:

/*
 * Look at the head of all the pending queues,
 * if any I/O has been outstanding for longer than
 * the spa_deadman_synctime we panic the system.
 */

The spa_deadman_synctime value comes from zfs_deadman_synctime_ms, in spa_misc.c:

/*
 * [...]
 * Secondly, the value determines if an I/O is considered "hung".
 * Any I/O that has not completed in zfs_deadman_synctime_ms is
 * considered "hung" resulting in a system panic.
 */
uint64_t zfs_deadman_synctime_ms = 1000000ULL;

That's 1000 seconds, or 16 minutes and 40 seconds.

By 'completed' I believe that ZFS includes 'has resulted in an error', including a timeout error from eg the SCSI system. Normally you would expect IO systems to time out IO requests well before 16 minutes, but apparently something in our multipathed iSCSI setup did not do this and so ZFS pushed the big red button of a panic.

(This is a somewhat dangerous assumption under some circumstances. If you have a ZFS pool built from files from an NFS mounted filesystem, for example, NFS will wait endlessly for server IO to complete. And while this is extreme, there are vaguely plausible situations where file-backed ZFS pools make some sense.)

Note that this behavior is completely unrelated to the ZFS pool failmode setting. It can happen before ZFS reports any pool errors, and it can happen when the only problem is a single IO to a single underlying disk (and the pool has retained full redundancy throughout and so on). All it needs is one hung IO to one device used by one pool and your entire system can panic (and then sit there while it slowly writes out a crash dump, if you have those configured).

However, I've decided that I'm not particularly upset by this. The fileserver was in some trouble before the panic (I assume due to IO problems to iSCSI backend disk(s)), and rebooting seems to have fixed things for now. At least some of the time, panicking and retrying from scratch is a better strategy than banging your head against the wall over and over; this time seems to be one of them.

(I might feel differently if we had important user level processes running on these machines, like database servers or the like.)

In the short term we're unlikely to change this deadman timeout or disable it. I'm more interested in trying to find out what our iSCSI IO timeouts actually are and see if we can lower them so that the kernel will spit out timeout errors well before that much time goes by (say a couple of minutes at the outside). Unfortunately there are a lot of levels and moving parts involved here, so things are likely to be complex (and compounding on each other).
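
Should we change our minds, my understanding is that both the timeout and the deadman itself are ordinary kernel tunables, so the approach would be something like this to inspect the current values (the second one is the on/off switch):

echo zfs_deadman_synctime_ms/E | mdb -k
echo zfs_deadman_enabled/D | mdb -k

and then /etc/system lines such as 'set zfs:zfs_deadman_synctime_ms = 600000' or 'set zfs:zfs_deadman_enabled = 0', followed by a reboot. The values here are illustrative and I haven't verified any of this on our fileservers.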

Sidebar: The various levels I think we have in action here

From the top downwards: OmniOS scsi_vhci multipathing, OmniOS generic SCSI, OmniOS iSCSI initiator, our Linux iSCSI target, the generic Linux block and SCSI layers, the Linux mpt2sas driver, and then the physical SSDs involved. Probably some of these levels do some amount of retrying of timed out requests before they pass problems back to higher levels, which of course compounds this sort of issue (and complicates tuning it).

ZFSReallySlowIOPanic written at 01:42:47; Add Comment

2016-12-21

An important little detail of our ZFS spares setup

I've written before about our ZFS spares handling system (2, 3) that we use for our fileservers. In all of that time, I've casually hand-waved a bit of terminology by calling our spares 'disks'. While they are disks from the perspective of the fileservers, our spares are not separate physical disks on the iSCSI backends (well, not usually, and I'll get to that).

We partition the 2TB physical HDs on the iSCSI backends into a number of standard sized chunks (four, in our case). It is these chunks that are exported to the fileservers, that the fileservers see as disks, and that thus form the pool of unused 'disks' that become potential spares. Our spares system knows about the mapping to physical disks and thus normally avoids things like using a spare 'disk' (we call them chunks) that comes from the same HD as a pool is already using.

Where this matters is when we come around to the issue of testing your spare disks. When we started allocating chunks to pools on our new fileservers, we made a deliberate decision not to reserve one or more physical disks purely for spare chunks. Instead we smeared the collection of pools across all of the physical disks, which meant that once a fileserver had at least 14 chunks allocated to pools, all physical disks in the fileserver's backends were receiving IO. We had and have spare chunks, but we don't have any spare physical disks; all disks on all backends are actually active, making the issue of testing them relatively moot.

(Our smallest fileserver has 16 chunks allocated at the moment, which is just a bit over the 14 chunk threshold to get all disks busy.)

Recently we decided that one fileserver had so much space allocated on it that it was running alarmingly low on spare chunks. To deal with this, we added a fifteenth disk to each of its backends and this time, specifically reserved the chunks from these disks as spares. We'll never grow pools onto these disks; they now actually are spare disks, not just spare chunks on active disks. Which means that now we get to think about testing them (as I alluded to in this entry).

(Smearing our pools across all available physical disks and not reserving disks as pure spares is a policy choice, not a technical requirement. By now I can't remember exactly why we decided to do it this way; possibly we just thought it was easier and we might as well. Since it's possible to shuffle around the chunks that a pool uses, we can always change our minds on this later.)

Sidebar: the exception to this picture is our all-SSD pools

We have one fileserver and pair of backends that only has SSDs. The SSDs that we buy aren't big enough (today) to slice up into chunks, so each SSD is only one 'disk' on the fileserver. This means that the spare SSDs in the backends are genuinely unused. We haven't been worrying about this so far, but probably we should.

OurSparesSystemIV written at 01:47:43; Add Comment

2016-12-20

In praise of zpool history

When I started out with ZFS, many years ago, I mostly ignored the existence of 'zpool history' and the information it can give you. Sure, it seemed like a neat side feature and I didn't exactly object to it, but I didn't think I'd ever use it for anything much. As it turns out, I was kind of wrong about that. I still don't use zpool history very often and it is not an essential part of what we do with ZFS, but it's turned out to be quite handy to have its information, especially over the long term. So what's it good for?

At the short term 'tactical' level, 'zpool history' will tell you for sure what you, someone else, or some automated system just did to your pool. Do you need to reconstruct a relatively exact sequence of commands so you can see how those two disks wound up as spares in some pool? You can. Did something go funny and now you're not totally sure what commands you issued to the pool? 'zpool history' will tell you.

At the long term 'strategic' level, 'zpool history' will let you track things you've done to your pool over time. For instance, it will tell you when a pool was created, when you added and removed an L2ARC device or a ZIL device, or when you grew the pool's vdevs. You can also look back to track problems and work you had to do; 'zpool history' will tell you every time you cleared errors on a pool device (or the entire pool), every spare activation, and every disk replacement. Sure, ideally you would keep track of this information outside of ZFS as well, but the world is not necessarily an ideal place. If nothing else, 'zpool history' is a backup to your other record keeping and thus serves as a second source of truth.

These things that 'zpool history' can do for you don't come up right away. You may never need to look back at the immediate past to see just what happened to a pool, and long term pool history only becomes interesting once your pools actually have been around for a fair while. But we now have pools that are several years old and I'm happy that I can look back at pretty much every pool-level thing we've done to them over their lifespan.

Of course 'zpool history' is not perfect because there's plenty of information it doesn't capture. It'll tell you when you added a L2ARC device but by itself it won't tell you how big that device was, and more to the point it'll tell you when you cleared error counts on a pool device but it won't tell you what the error counts were. And it doesn't record fault events and other things like that, just ZFS commands. But all by itself it's definitely useful and I'm glad that ZFS has it.

(I'm not going to complain about 'zpool history' recording all ZFS commands, although it does get a little painful if you have a pool that you snapshot a lot. You can always filter the output, and it's better to have the information than not have it.)
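
The filtering doesn't have to be anything clever; something along these lines (with a made-up pool name) strips out the scheduled snapshot churn, although the exact pattern obviously depends on how your snapshots are named:

zpool history yourpool | egrep -v 'zfs (snapshot|destroy) .*@'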

ZFSZpoolHistoryPraise written at 00:35:05; Add Comment

2016-11-23

We may have seen a ZFS checksum error be an early signal for later disk failure

I recently said some things about our experience with ZFS checksums on Twitter, and it turns out I have to take one part of it back a bit. And in that lies an interesting story about what may be a coincidence and may not be.

A couple of weeks ago, we had our first disk failure in our new fileserver environment; everything went about as smoothly as we expected and our automatic spares system fixed things up in the short term. Specifically, what failed was one of the SSDs in our all-SSD fileserver, and it went off the cliff abruptly, going from all being fine to reporting some problems to having so many issues that ZFS faulted it within a few hours. And that SSD hadn't reported any previous problems, with no one-off read errors or the like.

Well, sort of. Which is where the interesting part comes in. Today, when I was checking our records for another reason, I discovered that a single ZFS checksum error had been reported against that disk back at the end of August. There were no IO errors reported on either the fileserver or the iSCSI backend, and the checksum error didn't repeat on a scrub, so I wrote it off as a weird one-off glitch.

(And I do mean 'one checksum error', as in ZFS's checksum error count was '1'. And ZFS didn't report that any bytes of data had been fixed.)

This could be a complete coincidence. Or it could be that this SSD checksum error was actually an early warning signal that something was going wrong deep in the SSD. I have no answers, just a data point.

(We've now had another disk failure, this time a HD, and it didn't have any checksum errors in advance of the failure. Also, I have to admit that although I would like this to be an early warning signal because it would be quite handy, I suspect it's more likely to be pure happenstance. The checksum error being an early warning signal makes a really attractive story, which is one reason I reflexively distrust it.)

PS: We don't have SMART data from the SSD, either at the time of the checksum error or at the time of its failure. Next time around I'll be recording SMART data from any disk that has checksum errors reported against it, just in case something can be gleaned from it.
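
Recording it doesn't have to be anything fancy; on the Linux iSCSI backends, something like this (with made-up device and file names) captures everything smartctl will report about a disk:

smartctl -a /dev/sdX >/root/smart-sdX-$(date +%Y-%m-%d).txt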

ZFSChecksumErrorMaybeSignal written at 00:29:49; Add Comment
