Wandering Thoughts

2019-07-26

Some things on the GUID checksum in ZFS pool uberblocks

When I talked about how 'zpool import' generates its view of a pool's configuration, I mentioned that the kernel makes an additional check on the pool configuration: ZFS uberblocks carry a simple 'checksum' of all of the GUIDs of the vdev tree. When the kernel is considering a pool configuration, it rejects the configuration if the sum of the GUIDs in the vdev tree doesn't match the GUID sum from the uberblock.

(The documentation of the disk format claims that it's only the checksum of the leaf vdevs, but as far as I can see from the code it's all vdevs.)
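(To make the check concrete, here is a little standalone C sketch of what it boils down to. The structure and all of the names here are simplified inventions of mine, not the real ZFS code; the real vdev_t is far bigger and the real check happens deep in the kernel's pool load path.)

    #include <stdint.h>
    #include <stdio.h>

    /* A made-up, stripped down vdev node; the real vdev_t is far larger. */
    typedef struct vdev {
        uint64_t      vdev_guid;      /* this vdev's GUID */
        struct vdev **vdev_child;     /* child vdevs, if any */
        uint64_t      vdev_children;  /* number of children */
    } vdev_t;

    /* Recursively sum the GUIDs of every vdev in the tree, root included. */
    static uint64_t
    vdev_guid_sum(const vdev_t *vd)
    {
        uint64_t sum = vd->vdev_guid;
        for (uint64_t i = 0; i < vd->vdev_children; i++)
            sum += vdev_guid_sum(vd->vdev_child[i]);
        return (sum);   /* plain 64-bit addition, wrapping on overflow */
    }

    int
    main(void)
    {
        /*
         * A toy two-disk mirror, using the GUIDs from the 'zdb -l' output
         * in the following entry. The kernel's check is in effect comparing
         * this sum (taken over the whole vdev tree) against the uberblock's
         * GUID sum and rejecting the configuration if they differ.
         */
        vdev_t disk0 = { 7328257775812323847ULL, NULL, 0 };
        vdev_t disk1 = { 13307730581331167197ULL, NULL, 0 };
        vdev_t *kids[] = { &disk0, &disk1 };
        vdev_t mirror = { 4603657949260704837ULL, kids, 2 };

        printf("vdev tree GUID sum: %llu\n",
            (unsigned long long)vdev_guid_sum(&mirror));
        return (0);
    }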

I was all set to write about how this interacts with the vdev configurations that are in ZFS labels, but as it turns out this is no longer applicable. In versions of ZFS that have better ZFS pool recovery, the vdev tree that's used is the one that's read from the pool's Meta Object Set (MOS), not the pool configuration that was passed in from user level by 'zpool import'. Any mismatch between the uberblock GUID sum and the vdev tree GUID sum likely indicates a serious consistency problem somewhere.

(For the user level vdev tree, the difference between having a vdev's configuration and having all of its disks available is potentially important. As we saw yesterday, the ZFS label of every device that's part of a vdev has a complete copy of that vdev's configuration, including all of the GUIDs of its elements. Given a single intact ZFS label for a vdev, you can construct a configuration with all of the GUIDs filled in and thus pass the uberblock GUID sum validation, even if you don't have enough disks to actually use the vdev.)

The ZFS uberblock update sequence means that the ZFS disk labels and their embedded vdev configurations should always be up to date with the current uberblock's GUID sum. Now that I know about the embedded uberblock GUID sum, it's pretty clear why the uberblock must be synced on all vdevs when the vdev or pool configuration is considered 'dirty'. The moment that the GUID sum of the current vdev tree changes, you'd better update everything to match it.

(The GUID sum changes if any rearrangement of the vdev tree happens. This includes replacing one disk with another, since each disk has its own unique GUID. In case you're curious, the ZFS disk label always has the full tree for a top level vdev, including the special 'replacing' and 'spare' sub-vdevs that show up during these operations.)

PS: My guess from a not very extensive look through the kernel code is that it's very hard to tell from user level whether you have a genuine uberblock GUID sum mismatch or another problem that returns the same extended error code. The good news is that I think the only other case that returns VDEV_AUX_BAD_GUID_SUM is if you have missing log device(s).

ZFSUberblockGUIDSumNotes written at 22:51:41; Add Comment

How 'zpool import' generates its view of a pool's configuration

Full bore ZFS pool import happens in two stages, where 'zpool import' puts together a vdev configuration for the pool, passes it to the kernel, and then the kernel reads the real pool configuration from ZFS objects in the pool's Meta Object Set. How 'zpool import' does this is outlined at a high level by a comment in zutil_import.c; to summarize the comment, the configuration is created by assembling and merging together information from the ZFS label of each device. There is an important limitation to this process, which is that the ZFS label only contains information on the vdev configuration, not on the overall pool configuration.

To show you what I mean, here are the relevant portions of a ZFS label (as dumped by 'zdb -l') for a device from one of our pools:

   txg: 5059313
   pool_guid: 756813639445667425
   top_guid: 4603657949260704837
   guid: 13307730581331167197
   vdev_children: 5
   vdev_tree:
       type: 'mirror'
       id: 3
       guid: 4603657949260704837
       is_log: 0
       children[0]:
           type: 'disk'
           id: 0
           guid: 7328257775812323847
           path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
       children[1]:
           type: 'disk'
           id: 1
           guid: 13307730581331167197
           path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'

(For much more detail, albeit somewhat out of date, see the ZFS On-Disk Specifications [pdf].)

Based on this label, 'zpool import' knows what the GUID of this vdev is, which disk of the vdev it's dealing with and where the other disk or disks in it are supposed to be found, the pool's GUID, how many vdevs the pool has in total (it has 5) and which specific vdev this is (it's the fourth of five; vdev numbering starts from 0). But it doesn't know anything about the other vdevs, except that they exist (or should exist).

When zpool assembles the pool configuration, it will use the best information it has for each vdev, where the 'best' is taken to be the vdev label with the highest txg (transaction group number). The label with the highest txg for the entire pool is used to determine how many vdevs the pool is supposed to have. Note that there's no check that the best label for a particular vdev has a txg that is anywhere near the pool's (assumed) current txg. This means that if all of the modern devices for a particular vdev disappear and a very old device for it reappears, it's possible for zpool to assemble a (user-level) configuration that claims that the old device is that vdev (or the only component available for that vdev, which might be enough if the vdev is a mirror).
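As a rough C sketch of that selection rule, with a made-up label_t that only carries the fields that matter here (the real code in zutil_import.c works on nvlists and is considerably more involved):

    #include <stddef.h>
    #include <stdint.h>

    /* A made-up, flattened view of one on-disk ZFS label. */
    typedef struct label {
        uint64_t txg;        /* txg this label was last written at */
        uint64_t top_guid;   /* GUID of the top-level vdev it describes */
        /* ... pool_guid, vdev_children, the vdev_tree itself, etc ... */
    } label_t;

    /*
     * Pick the 'best' label for one top-level vdev: simply the one with
     * the highest txg. Note that nothing here checks how close that txg
     * is to the pool's current txg.
     */
    static const label_t *
    best_label_for_vdev(const label_t *labels, size_t n, uint64_t top_guid)
    {
        const label_t *best = NULL;

        for (size_t i = 0; i < n; i++) {
            if (labels[i].top_guid != top_guid)
                continue;
            if (best == NULL || labels[i].txg > best->txg)
                best = &labels[i];
        }
        return (best);
    }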

If zpool can't find any labels for a particular vdev, all it can do in the configuration is fill in an artificial 'there is a vdev missing' marker; it doesn't even know whether it was a raidz or a mirrored vdev, or how much data is on it. When 'zpool import' prints the resulting configuration, it doesn't explicitly show these missing vdevs; if I'm reading the code right, your only clue as to where they are is that the pool configuration will abruptly skip from, eg, 'mirror-0' to 'mirror-2' without reporting 'mirror-1'.

There's an additional requirement for a working pool configuration, although it's only checked by the kernel, not zpool. The pool uberblocks have a ub_guid_sum field, which must match the sum of all GUIDs in the vdev tree. If the GUID sum doesn't match, you'll get one of those frustrating 'a device is missing somewhere' errors on pool import. An entirely missing vdev naturally forces this to happen, since all of its GUIDs are unknown and obviously not contributing what they should be to this sum. I don't know how this interacts with better ZFS pool recovery.

ZFSZpoolImportAssembly written at 01:18:58; Add Comment

2019-07-24

ZFS pool imports happen in two stages of pool configuration processing

The mechanics of how ZFS pools are imported are one of the more obscure areas of ZFS, which is a potential problem given that things can go very wrong (often with quite unhelpful errors). One special thing about ZFS pool importing is that it effectively happens in two stages, first with user-level processing and then again in the kernel, and these two stages use two potentially different pool configurations. My primary source for this is the discussion from Illumos issue #9075:

[...] One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. [...]

Here's my translation of that. In order to tell the kernel to load a pool, 'zpool import' has to come up with a vdev configuration for the pool and then provide it to the kernel. However, this is not the real pool configuration; the real pool configuration is stored in the pool itself (in regular ZFS objects that are part of the MOS), and the kernel reads it again as it imports the pool.

Although not mentioned explicitly, the pool configuration that 'zpool import' comes up with and passes to the kernel is not read from the canonical pool configuration, because reading those ZFS objects from the MOS requires a relatively full implementation of ZFS, which 'zpool import' does not have (the kernel obviously does). One source of the pool configuration for 'zpool import' is the ZFS cache file, /etc/zfs/zpool.cache, which theoretically contains current pool configurations for all active pools. How 'zpool import' generates a pool configuration for exported or deleted pools is sufficiently complicated to need an entry of its own.
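The cache file itself is just a packed nvlist with one top-level entry per pool, mapping the pool's name to its cached configuration. Here is a minimal C illustration of dumping the pool names from it with libnvpair; this is only a sketch of the file format, not how libzfs actually consumes it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <libnvpair.h>

    int
    main(void)
    {
        FILE *fp = fopen("/etc/zfs/zpool.cache", "rb");
        if (fp == NULL)
            return (1);

        /* Slurp the whole file into memory. */
        fseek(fp, 0, SEEK_END);
        long len = ftell(fp);
        rewind(fp);
        char *buf = malloc(len);
        if (buf == NULL || fread(buf, 1, len, fp) != (size_t)len)
            return (1);

        /* The file is a packed nvlist; unpack it. */
        nvlist_t *cache = NULL;
        if (nvlist_unpack(buf, len, &cache, 0) != 0)
            return (1);

        /* Each top-level pair is <pool name> -> that pool's cached config. */
        for (nvpair_t *nvp = nvlist_next_nvpair(cache, NULL); nvp != NULL;
            nvp = nvlist_next_nvpair(cache, nvp))
            printf("cached pool: %s\n", nvpair_name(nvp));

        nvlist_free(cache);
        free(buf);
        fclose(fp);
        return (0);
    }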

This two stage process means that there are at least two different things that can go wrong with a ZFS pool's configuration information. First, 'zpool import' may not be able to put together what it thinks is a valid pool configuration, in which case I believe that it doesn't even try to pass it to the kernel. Second, the kernel may dislike the configuration that it's handed for its own reasons. In older versions of ZFS (before better ZFS pool recovery landed), any mismatch between the actual pool configuration and the claimed configuration from user level was apparently fatal; now, only some problems are fatal.

As far as I know, 'zpool import' doesn't clearly distinguish between these two cases in its error messages when you're actually trying to import a pool. If you're just running it to see what pools are available, I believe that all of what 'zpool import' reports comes purely from its own limited and potentially imperfect configuration assembly, with no kernel involvement.

(When a pool is fully healthy and in good shape, the configuration that 'zpool import' puts together at the user level will completely match the real configuration in the MOS. When it's not is when you run into potential problems.)

ZFSPoolImportTwoStages written at 00:52:07; Add Comment

2019-06-27

Our last OmniOS fileserver is now out of production (and service)

On Twitter, I noted a milestone last evening:

This evening we took our last OmniOS fileserver out of production and powered it off (after a great deal of slow work; all told this took more than a year). They've had a good run, so thank you Illumos/OmniOS/OmniTI/etc for the generally quiet and reliable service.

We still haven't turned any of our iSCSI backends off (they're Linux, not OmniOS), but that will be next, probably Friday (the delay is just in case). Then we'll get around to recycling all of the hardware for some new use, whatever it will turn out to be.

When we blank out the OmniOS system disks as part of recycling the hardware, that really will be the end of the line for the whole second generation of our fileserver infrastructure and the last lingering traces of our long association with Sun will be gone, swallowed by time.

It's been pointed out to me by @oclsc that since we're still using ZFS (now ZFS on Linux), we still have a tie to Sun's lineage. It doesn't really feel the same, though; open source ZFS is sort of a lifeboat pushed out of Sun toward the end, not Sun(ish) itself.

(This is probably about as fast as I should have expected from having almost all of the OmniOS fileservers out of production at the end of May. Things always come up.)

Various people and groups at the department have been buying Sun machines and running Sun OSes (first SunOS and then Solaris) almost from the beginning of Sun. I don't know if we bought any Sun 1s, but I do know that some Sun 2s were bought, and Sun 3s and onward were for many years a big presence (eventually only as servers, although we did have some Sunrays). With OmniOS going out of service, that is the end of our use of that lineage of Unix.

(Of course Sun itself has been gone for some time, consumed by Oracle. But our use of its lineage lived on in OmniOS, since Illumos is more or less Solaris in open source form (and improved from when it was abandoned by its corporate parent).)

I have mixed feelings about OmniOS and I don't have much sentimentality about Solaris itself (it's complicated). But I still end up feeling that there is a weight of history that has shifted here in the department, at the end of a long slow process. Sun is woven through the history of the department's computing, and now all that remains of that is our use of ZFS.

(For all that I continue to think that ZFS is your realistic choice for an advanced filesystem, I also think that we probably wouldn't have wound up using it if we hadn't started with Solaris.)

OmniOSEndOfService written at 22:29:11; Add Comment

2019-06-26

A hazard of our old version of OmniOS: sometimes powering off doesn't

Two weeks ago, I powered down all of our OmniOS fileservers that are now out of production, which is most of them. By that, I mean that I logged in to each of them via SSH and ran 'poweroff'. The machines disappeared from the network and I thought nothing more of it.

This Sunday morning we had a brief power failure. In the aftermath of the power failure, three out of four of the OmniOS fileservers reappeared on the network, which we knew mostly because they sent us some email (there were no bad effects of them coming back). When I noticed them back, I assumed that this had happened because we'd set their BIOSes to 'always power on after a power failure'. This is not too crazy a setting for a production server you want up at all costs because it's a central fileserver, but it's obviously no longer the setting you want once such machines go out of production.

Today, I logged in to the three that had come back, ran 'poweroff' on them again, and then later went down to the machine room to pull out their power cords. To my surprise, when I looked at the physical machines, they had little green power lights that claimed they were powered on. When I plugged in a roving display and keyboard to check their state, I discovered that all three were still powered on and sitting displaying an OmniOS console message to the effect that they were powering off. Well, they might have been trying to power off, but they weren't achieving it.

I rather suspect that this is what happened two weeks ago, and why these machines all sprang back to life after the power failure. If OmniOS never actually powered the machines off, even a BIOS setting of 'resume last power state after a power failure' would have powered the machines on again, which would have booted OmniOS back up again. Two weeks ago, I didn't go look at the physical servers or check their power state through their lights out management interface; it never occurred to me that 'poweroff' on OmniOS sometimes might not actually power the machine off, especially when the machines did drop off the network.

(One out of the four OmniOS servers didn't spring back to life after the power failure, and was powered off when I looked at the hardware. Perhaps its BIOS was set very differently, or perhaps OmniOS managed to actually power it off. They're all the same hardware and the same OmniOS version, but the server that probably managed to power off had no active ZFS pools on our iSCSI backends; the other three did.)

At this point, this is only a curiosity. If all goes well, the last OmniOS fileserver will go out of production tomorrow evening. It's being turned off as part of that, which means that I'm going to have to check that it actually powered off (and I'd better add that to the checklist I've written up).

HazardOfNoPoweroff written at 01:01:17; Add Comment

2019-06-03

Almost all of our OmniOS machines are now out of production

Last Friday, my co-workers migrated the last filesystem from our HD-based OmniOS fileservers to one of our new Linux fileservers. With this, the only OmniOS fileserver left in production is serving a single filesystem, our central administrative filesystem, which is extremely involved to move because everything uses it all the time and knows where it is (and of course it's where our NFS automounter replacement lives, along with its data files). Moving that filesystem is going to take a bunch of planning and a significant downtime, and it will only happen after I come back from vacation.

(Unlike last time around, we haven't destroyed any pools or filesystems yet in the old world, since we didn't run into any need to.)

This migration has been in progress in fits and starts since late last November, so it's taken about seven months to finish. This isn't because we have a lot of data to move (comparatively speaking); instead it's because we have a lot of filesystems with a lot of users. First you have to schedule a time for each filesystem that the users don't object to (and sometimes things come up so your scheduled time has to be abandoned), and then moving each filesystem takes a certain amount of time and boring work (so people generally only want to do so many a day, to avoid spending all of their day on this stuff). Also, our backup system is happier when we don't suddenly give it massive amounts of 'new' data to back up in a single day.

(I think this is roughly comparable to our last migration, which seems to have started at the end of August of 2014 and finished in mid-February of 2015. We've added significantly more filesystems and disk space since then.)

The MVP of the migration is clearly 'zfs send | zfs recv' (as it always has been). Having to do the migrations with something like rsync would likely have been much more painful for various reasons; ZFS snapshots and ZFS send are things that just work, and they come with solid and extremely reassuring guarantees. Part of their importance was that the speed of an incremental ZFS send meant that the user-visible portion of a migration (where we had to take their filesystem away temporarily) could be quite short (short enough to enable opportunistic migrations, if we could see that no one was using some of the filesystems).

At this point we've gotten somewhere around four and a half years of lifetime out of our OmniOS fileservers. This is probably around what we wanted to get, especially since we never replaced the original hard drives and so they're starting to fall out of warranty coverage and hit what we consider their comfortable end of service life. Our first generation Solaris fileservers were stretched much longer, but they had two generations of HDs and even then we were pushing it toward the end of their service life.

(The actual server hardware for both the OmniOS fileservers and the Linux iSCSI backends seems fine, so we expect to reuse it in the future once we migrate the last filesystem and then tear down the entire old environment. We will probably even reuse the data HDs, but only for less important things.)

I think I feel less emotional about this migration away from OmniOS than I did about our earlier migration from Solaris to OmniOS. Moving away from Solaris marked the end of Sun's era here (even if Sun had been consumed by Oracle by that point), but I don't have that sort of feelings about OmniOS. OmniOS was always a tool to me, although unquestionably a useful one.

(I'll write a retrospective on our OmniOS fileservers at some point, probably once the final filesystem has migrated and everything has been shut down for good. I want to have some distance and some more experience with our Linux fileservers first.)

PS: To give praise where it belongs, my co-workers did basically all of the hard, grinding work of this migration, for various reasons. Once things got rolling, I got to mostly sit back and move filesystems when they told me one was scheduled and I should do it. I also cleverly went on vacation during the final push at the end.

OmniOSMostlyEndOfService written at 21:44:16; Add Comment

2019-05-31

Some things on how ZFS dnode object IDs are allocated (which is not sequentially)

One of the core elements of ZFS is the dnode, which defines a DMU object. Within a single filesystem or other object set, dnodes have an object number (aka object id). For dnodes that are files or directories in a filesystem, this is visible as their Unix inode number, but other internal things get dnodes and thus object numbers (for example, the dnode of the filesystem's delete queue). Object ids are 64-bit numbers, and many of them can be relatively small (especially if they are object ids for internal structures, again such as the delete queue). Very large dnode numbers are uncommon, and some files and directories from early in a filesystem's life can have very small object ids.

(For instance, the object ID of my home directory on our ZFS fileservers is '5'. I'm the only user in this filesystem.)

You might reasonably wonder how ZFS object IDs are allocated. Inspection of a ZFS filesystem will show that they are clearly not allocated sequentially, but they're also not allocated randomly. Based on an inspection of the dnode allocation source code in dmu_object.c, there seem to be two things going on to spread dnode object ids around some (but not too much).

The first thing is that dnode allocation is done from per-CPU chunks of the dnode space. The size of each chunk is set by dmu_object_alloc_chunk_shift, which by default creates 128-dnode chunks. The motivation for this is straightforward; if all of the CPUs in the system were allocating dnodes from the same area, they would all have to contend over locks on this area. Spreading out into separate chunks reduces locking contention, which means that parallel or highly parallel workloads that frequently create files on a single filesystem don't bottleneck on a shared lock.

(One reason that you might create a lot of files in a parallel workload is if you're using files on the filesystem as part of a locking strategy. This is still common in things like mail servers, mail clients, and IMAP servers.)
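Here is an illustrative C sketch of the per-CPU chunk idea. All of the names and the little claim_next_chunk() stub are mine, not the real code; the real logic in dmu_object.c naturally has a lot more to worry about, including the actual locking:

    #include <stdint.h>

    /* dmu_object_alloc_chunk_shift defaults to 7, ie 128-dnode chunks. */
    #define DNODE_CHUNK_SHIFT 7
    #define DNODE_CHUNK_SIZE  (1ULL << DNODE_CHUNK_SHIFT)

    /* Made-up per-CPU state: the chunk this CPU is currently handing out. */
    struct percpu_alloc {
        uint64_t next;   /* next object number to give out */
        uint64_t limit;  /* one past the end of this CPU's current chunk */
    };

    /*
     * Stand-in for the shared, locked step that claims a fresh chunk from
     * the metadnode; here it's just a counter (and not thread safe, which
     * is exactly what the real lock is for).
     */
    static uint64_t
    claim_next_chunk(void)
    {
        static uint64_t next_chunk = 0;
        uint64_t start = next_chunk;
        next_chunk += DNODE_CHUNK_SIZE;
        return (start);
    }

    static uint64_t
    alloc_object_number(struct percpu_alloc *pc)
    {
        if (pc->next >= pc->limit) {
            /* Chunk used up: go to shared state for another 128 objects. */
            pc->next = claim_next_chunk();
            pc->limit = pc->next + DNODE_CHUNK_SIZE;
        }
        return (pc->next++);   /* the common case stays CPU-local */
    }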

The second thing is, well, I'm going to quote the comment in the source code to start with:

Each time we polish off a L1 bp worth of dnodes (2^12 objects), move to another L1 bp that's still reasonably sparse (at most 1/4 full). Look from the beginning at most once per txg. If we still can't allocate from that L1 block, search for an empty L0 block, which will quickly skip to the end of the metadnode if no nearby L0 blocks are empty. This fallback avoids a pathology where full dnode blocks containing large dnodes appear sparse because they have a low blk_fill, leading to many failed allocation attempts. [...]

(In reading the code a bit, I think this comment means 'L2 block' instead of 'L0 block'.)

To understand a bit more about this, we need to know about two things. First, we need to know that dnodes themselves are stored in another DMU object, and this DMU object stores data in the same way as all others do, using various levels of indirect blocks. Then we need to know about indirect blocks themselves. L0 blocks directly hold data (in this case the actual dnodes), while L1 blocks hold pointers to L0 blocks and L2 blocks hold pointers to L1 blocks.

(You can see examples of this structure for regular files in the zdb output in this entry and this entry. If I'm doing the math right, for dnodes an L0 block normally holds 32 dnodes and an L<N> block can address up to 128 L<N-1> blocks, through block pointers.)
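(As a sanity check on that math, assuming the classic 512-byte dnode, 16K metadnode data blocks, and 128 block pointers per indirect block, the numbers line up with the 2^12 figure from the source comment. A throwaway C check:)

    #include <stdio.h>

    int
    main(void)
    {
        /* Assumed on-disk sizes: classic 512-byte dnodes, 16K metadnode
         * data blocks, and 128 block pointers per indirect block. */
        int dnode_size    = 512;
        int l0_block_size = 16 * 1024;
        int bps_per_indir = 128;

        int dnodes_per_l0 = l0_block_size / dnode_size;       /* 32 */
        int dnodes_per_l1 = dnodes_per_l0 * bps_per_indir;    /* 4096, ie 2^12 */
        int dnodes_per_l2 = dnodes_per_l1 * bps_per_indir;    /* 524288 */

        printf("dnodes per L0: %d, per L1: %d, per L2: %d\n",
            dnodes_per_l0, dnodes_per_l1, dnodes_per_l2);
        return (0);
    }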

So, what appears to happen is that at first, the per-CPU allocator gets its chunks sequentially (for different CPUs, or the same CPU) from the same L1 indirect block, which covers 4096 dnodes. When we exhaust all of the 128-dnode chunks in a single group of 4096, we don't move to the sequentially next group of 4096; instead we search around for a sufficiently empty group, and switch to it (where a 'sufficiently empty' group is one with at most 1024 dnodes already allocated). If there is no such group, I think that we may wind up skipping to the end of the currently allocated dnodes and getting a completely fresh empty block of 4096.
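Expressed as a very rough C sketch of the policy (a toy model of mine, not the real code; among other things it leaves out the 'look from the beginning at most once per txg' restriction):

    #include <stddef.h>
    #include <stdint.h>

    #define DNODES_PER_L1   4096   /* one L1 block's worth of dnodes */
    #define SPARSE_MAX_USED 1024   /* 'at most 1/4 full' */

    /*
     * A toy model of the metadnode: used[g] is how many dnodes are already
     * allocated in 4096-dnode group g, and ngroups can grow at the end.
     */
    struct meta_model {
        uint64_t *used;
        size_t    ngroups;
        size_t    current;   /* group we're currently carving chunks from */
    };

    /* Pick the 4096-dnode group that the next chunks will come from. */
    static size_t
    pick_group(struct meta_model *m)
    {
        /* Keep filling the current group until it's completely used up. */
        if (m->used[m->current] < DNODES_PER_L1)
            return (m->current);

        /* Otherwise hunt for a group that's still reasonably sparse. */
        for (size_t g = 0; g < m->ngroups; g++)
            if (m->used[g] <= SPARSE_MAX_USED)
                return (g);

        /*
         * Nothing is sparse enough: skip to a brand new group at the end
         * (in the real code, past the end of the currently allocated
         * dnodes in the metadnode). The caller grows the model by one group.
         */
        return (m->ngroups);
    }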

If I'm right, the net effect of this is to smear out dnode allocations and especially reallocations over an increasingly large portion of the lower dnode object number space. As your filesystem gets used and files get deleted, many of the lower 4096-dnode groups will have some or even many free dnodes, but not the 3072 free dnodes that they need in order to be eligible to be selected for further allocation. This can eventually push dnode allocations to relatively high object numbers even though you may not have anywhere near that many dnodes in use on the filesystem. This is not guaranteed, though, and you may still reuse dnode numbers.

(For example, I just created a new file in my home directory. My home directory's filesystem has 1983310 dnodes used right now, but the inode number (and thus dnode object number) that my new test file got was 1804696.)

ZFSDnodeIdsAllocation written at 23:25:17; Add Comment

2019-05-17

One of our costs of using OmniOS was not having 10G networking

OmniOS has generally been pretty good to us over the lifetime of our second generation ZFS fileservers, but as we've migrated various filesystems from our OmniOS fileservers to our new Linux fileservers, it's become clear that one of the costs we paid for using OmniOS was not having 10G networking.

We certainly started out intending to have 10G networking on OmniOS; our hardware was capable of it, with Intel 10G-T chipsets, and OmniOS seemed happy to drive them at decent speeds. But early on we ran into a series of fatal problems with the Intel ixgbe driver which we never saw any fixes for. We moved our OmniOS machines (and our iSCSI backends) back to 1G, and they have stayed there ever since. When we made this move, we did not have detailed system metrics on things like NFS bandwidth usage by clients, and anyway almost all of our filesystems were on HDs, so 1G seemed like it should be fine. And indeed, we mostly didn't see obvious and glaring problems, especially right away.

What setting up a metrics system (even only on our NFS clients) and then later moving some filesystems from OmniOS (at 1G) to Linux (at 10G) made clear was that on some filesystems, we had definitely been hitting the 1G bandwidth limit and doing so had real impacts. The filesystem this was most visible on is the one that holds /var/mail, our central location for people's mailboxes (ie, their IMAP inbox). This was always on SSDs even on OmniOS, and once we started really looking it was clearly bottlenecked at 1G. It was one of the early filesystems we moved to the Linux fileservers, and the improvement was very visible. Our IMAP server, which has 10G itself, now routinely has bursts of over 200 Mbps inbound and sometimes sees brief periods of almost saturated network bandwidth. More importantly, the IMAP server's performance is visibly better; it is less loaded and more responsive, especially at busy times.

(A contributing factor to this is that any number of people have very big inboxes, and periodically our IMAP server winds up having to read through all of such an inbox. This creates a very asymmetric traffic pattern, with huge inbound bandwidth from the /var/mail fileserver to the IMAP server but very little outbound traffic.)

It's less clear how much of a cost we paid for HD-based filesystems, but it seems pretty likely that we paid some cost, especially since our OmniOS fileservers were relatively large (too large, in fact). With lots of filesystems, disks, and pools on each fileserver, it seems likely that there would have been periods where each fileserver could have reached inbound or outbound network bandwidth rates above 1G, if they'd had 10G networking.

(And this excludes backups, where it seems quite likely that 10G would have sped things up somewhat. I don't consider backups as important as regular fileserver NFS traffic because they're less time and latency sensitive.)

At the same time, it's quite possible that this cost was still worth paying in order to use OmniOS back then instead of one of the alternatives. ZFS on Linux was far less mature in 2013 and 2014, and I'm not sure how well FreeBSD would have worked, especially if we insisted on keeping a SAN based design with iSCSI.

(If we had had lots of money, we might have attempted to switch to other 10G networking cards, probably SFP+ ones instead of 10G-T (which would have required switch changes too), or to commission someone to fix up the ixgbe driver, or both. But with no funds for either, it was back to 1G for us and then the whole thing was one part of why we moved away from Illumos.)

OmniOSNo10GCost written at 01:11:05; Add Comment

2019-04-07

A ZFS resilver can be almost as good as a scrub, but not quite

We do periodic scrubs of our pools, roughly every four weeks on a revolving schedule (we only scrub one pool per fileserver at once, and only over the weekend, so we can't scrub all pools on one of our HD based fileservers in one weekend). However, this weekend scrubbing doesn't happen if there's something else more important happening on the fileserver. Normally there isn't, but one of our iSCSI backends didn't come back up after our power outage this Thursday night. We have spare backends, so we added one in to the affected fileserver and started the process of resilvering everything onto the new backend's disks to restore redundancy to all of our mirrored vdevs.

I've written before about the difference between scrubs and resilvers, which is that a resilver potentially reads and validates less than a scrub does. However, we only have two way mirrors and we lost one side of all of them in the backend failure, so resilvering all mirrors has to read all of the metadata and data on every remaining device of every pool. At first, I thought that this was fully equivalent to a scrub and thus we had effectively scrubbed all of our pools on that fileserver, putting us ahead of our scrub schedule instead of behind it. Then I realized that it isn't, because resilvering doesn't verify that the newly written data on the new devices is good.

ZFS doesn't have any explicit 'read after write' checks, although it will naturally do some amount of reads from your new devices just as part of balancing reads. So although you know that everything on your old disks is good, you can't have full confidence that your new disks have correct copies of everything. If something got corrupted on the way to the disk or the disk has a bad spot that wasn't spotted by its electronics, you won't know until it's read back, and the only way to force that is with an explicit scrub.

For our purposes this is still reasonably good. We've at least checked half of every pool, so right now we definitely have one good copy of all of our data. But it's not quite the same as scrubbing the pools and we definitely don't want to reset all of the 'last scrubbed at time X' markers for the pools to right now.

(If you have three or four way mirrors, as we have had in the past, a resilver doesn't even give you this because it only needs to read each piece of data or metadata from one of your remaining N copies.)

ZFSResilverAlmostScrub written at 00:47:37; Add Comment

2019-04-01

Our plan for handling TRIM'ing our ZFS fileserver SSDs

The versions of ZFS that we're running on our fileservers (both the old and the new) don't support using TRIM commands on drives in ZFS pools. Support for TRIM has been in FreeBSD ZFS for a while, but it only just landed in the ZFS on Linux development version and it's not in Illumos. Given our general upgrade plans, we're also not likely to get TRIM support through later OS and ZFS upgrades within the production lifetime of our current ZFS SSDs. So you might wonder what our plans are to deal with how SSD performance can decrease when the SSDs think they're all filled up, if you don't TRIM them or otherwise deallocate blocks every so often.

Honestly, the first part of our plan is to ignore the issue unless we see signs of performance problems. This is not ideal but it is the simplest approach. It's reasonably likely that our ZFS fileservers will be more limited by NFS and networking than by SSD performance, and as far as I understand things, nominally full SSDs mostly suffer from write performance issues, not read performance. Our current view (only somewhat informed by actual data) is that our read volume is significantly higher than our write volume. We certainly aren't currently planning any sort of routine preventative work here, and we wouldn't unless we saw problem signs.

If we do see problem signs and do need to clear SSDs, our plan is to do the obvious brute force thing in a ZFS setup with redundancy. Rather than try to TRIM SSDs in place, we'll entirely spare out a given SSD so that it has no live data on it, and then completely clear it, probably using Linux's blkdiscard. We might do this in place on a production fileserver, or we might go to the extra precaution of pulling the SSD out entirely, swapping in a freshly cleared one, and clearing the old SSD on a separate machine. Doing this swap has the twin advantages that we're not risking accidentally clearing the wrong SSD on the fileserver and we don't have to worry about the effects of an extra-long, extra-slow SATA command on the rest of the system and the other drives.
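(For the record, what blkdiscard essentially does is issue the BLKDISCARD ioctl against the whole device. Here is a stripped-down C sketch of that idea; it's obviously dangerous, since it discards everything on whatever device you point it at, and in practice we'd just use blkdiscard itself.)

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int
    main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
            return (1);
        }

        int fd = open(argv[1], O_WRONLY);
        if (fd < 0) {
            perror("open");
            return (1);
        }

        uint64_t size;
        if (ioctl(fd, BLKGETSIZE64, &size) < 0) {   /* device size in bytes */
            perror("BLKGETSIZE64");
            return (1);
        }

        /* Discard the whole device: {start offset, length in bytes}. */
        uint64_t range[2] = { 0, size };
        if (ioctl(fd, BLKDISCARD, &range) < 0) {
            perror("BLKDISCARD");
            return (1);
        }

        close(fd);
        return (0);
    }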

(This plan, such as it is, is not really new with our current generation Linux fileservers. We've had one OmniOS fileserver that used SSDs for a few special pools, and this was always our plan for dealing with any clear problems from those SSDs slowing down because they were full up. We haven't had to use it, but then we haven't really gone looking for performance problems with its SSDs. They seem to still run fast enough after four or more years, and so far that's good enough for us.)

ZFSOurSSDTrimPlan written at 22:35:23; Add Comment

