Some things on the GUID checksum in ZFS pool uberblocks
When I talked about how 'zpool import' generates its view of a
pool's configuration, I mentioned that an
additional kernel check of the pool configuration is that ZFS
uberblocks have a simple 'checksum' of all of
the GUIDs of the vdev tree. When the kernel is considering
a pool configuration, it rejects it if the sum of the GUIDs in the
vdev tree doesn't match the GUID sum from the uberblock.
(The documentation of the disk format claims that it's only the checksum of the leaf vdevs, but as far as I can see from the code it's all vdevs.)
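To make the check concrete, here's a small Python sketch of the idea (my own illustration, not ZFS's actual C code): walk the vdev tree, sum every GUID with 64-bit wraparound, and compare the result to the uberblock's recorded sum. The GUID values are taken from the zdb label later in this entry series.

```python
# A minimal sketch of the uberblock GUID sum check: sum all GUIDs in
# the vdev tree modulo 2^64 and compare against the uberblock's sum.
# (This is an illustration in Python, not the kernel's actual C code.)

def guid_sum(vdev):
    # Sum this vdev's GUID plus those of all of its children,
    # recursively, wrapping at 64 bits like the kernel's uint64_t math.
    total = vdev["guid"]
    for child in vdev.get("children", []):
        total = (total + guid_sum(child)) & 0xFFFFFFFFFFFFFFFF
    return total

def config_matches_uberblock(root_vdev, ub_guid_sum):
    return guid_sum(root_vdev) == ub_guid_sum

# A toy mirror vdev with two disks (GUIDs borrowed from a real label):
mirror = {
    "guid": 4603657949260704837,
    "children": [
        {"guid": 7328257775812323847},
        {"guid": 13307730581331167197},
    ],
}
```

Note that this is why a configuration built from a single intact label can still pass: the label carries the GUIDs of all of the vdev's elements, so the sum comes out right even if most of the disks are missing.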
I was all set to write about how this interacts with the vdev
configurations that are in ZFS labels, but
as it turns out this is no longer applicable. In versions of ZFS
that have better ZFS pool recovery,
the vdev tree that's used is the one that's read from the pool's
Meta Object Set (MOS), not the pool configuration that was passed
in from user level by 'zpool import'. Any mismatch between the
uberblock GUID sum and the vdev tree GUID sum likely indicates a
serious consistency problem somewhere.
(For the user level vdev tree, the difference between having a vdev's configuration and having all of its disks available is potentially important. As we saw yesterday, the ZFS label of every device that's part of a vdev has a complete copy of that vdev's configuration, including all of the GUIDs of its elements. Given a single intact ZFS label for a vdev, you can construct a configuration with all of the GUIDs filled in and thus pass the uberblock GUID sum validation, even if you don't have enough disks to actually use the vdev.)
The ZFS uberblock update sequence guarantees that the ZFS disk labels and their embedded vdev configurations should always be up to date with the current uberblock's GUID sum. Now that I know about the embedded uberblock GUID sum, it's pretty clear why the uberblock must be synced on all vdevs when the vdev or pool configuration is considered 'dirty'. The moment that the GUID sum of the current vdev tree changes, you'd better update everything to match it.
(The GUID sum changes if any rearrangement of the vdev tree happens.
This includes replacing one disk with another, since each disk has
a unique GUID. In case you're curious, the ZFS disk label always
has the full tree for a top level vdev, including the special
'replacing' and 'spare' sub-vdevs that show up during these
operations.)
PS: My guess from a not very extensive look through the kernel code
is that it's very hard to tell from user level if you have a genuine
uberblock GUID sum mismatch or another problem that returns the
same extended error code to user level. The good news is that I
think the only other case that returns it is if you have missing
log device(s).
How 'zpool import' generates its view of a pool's configuration
Full bore ZFS pool import happens in two stages:
'zpool import' puts together a vdev configuration for the
pool, passes it to the kernel, and then the kernel reads the real
pool configuration from ZFS objects in the pool's Meta Object Set.
How 'zpool import' does this is outlined at a high level by a
comment in the source code;
to summarize the comment, the configuration is created by assembling
and merging together information from the ZFS label of each device.
There is an important limitation to this process, which is that the
ZFS label only contains information on the vdev configuration, not
on the overall pool configuration.
To show you what I mean, here are the relevant portions of a ZFS label
(as dumped by 'zdb -l') for a device from one of our pools:
    txg: 5059313
    pool_guid: 756813639445667425
    top_guid: 4603657949260704837
    guid: 13307730581331167197
    vdev_children: 5
    vdev_tree:
        type: 'mirror'
        id: 3
        guid: 4603657949260704837
        is_log: 0
        children[0]:
            type: 'disk'
            id: 0
            guid: 7328257775812323847
            path: '/dev/disk/by-path/pci-0000:19:00.0-sas-phy3-lun-0-part6'
        children[1]:
            type: 'disk'
            id: 1
            guid: 13307730581331167197
            path: '/dev/disk/by-path/pci-0000:00:17.0-ata-4-part6'
(For much more details that are somewhat out of date, see the ZFS On-Disk Specifications [pdf].)
Based on this label, 'zpool import' knows what the GUID of this
vdev is, which disk of the vdev it's dealing with and where the
other disk or disks in it are supposed to be found, the pool's GUID,
how many vdevs the pool has in total (it has 5) and which specific
vdev this is (it's the fourth of five; vdev numbering starts from
0). But it doesn't know anything about the other vdevs, except
that they exist (or should exist).
When zpool assembles the pool configuration, it will use the best
information it has for each vdev, where the 'best' is taken to be
the vdev label with the highest
txg (transaction group number).
The label with the highest txg for the entire pool is used to
determine how many vdevs the pool is supposed to have. Note that
there's no check that the best label for a particular vdev has a
txg that is anywhere near the pool's (assumed) current txg. This
means that if all of the modern devices for a particular vdev
disappear and a very old device for it reappears, it's possible for
zpool to assemble a (user-level) configuration that claims that the
old device is that vdev (or the only component available for that
vdev, which might be enough if the vdev is a mirror).
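To illustrate this selection process, here's a hedged Python sketch of the merging logic as I understand it: for each vdev id, keep the label with the highest txg, and take the vdev count from the newest label overall. The field names mirror the zdb -l output above; this is not the actual zpool source code.

```python
# A rough model of user-level config assembly: the "best" label for
# each vdev is the one with the highest txg, and the newest label
# overall determines how many vdevs the pool should have.
# (My own sketch, not the real zpool import code.)

def assemble_config(labels):
    best = {}          # vdev id -> best (highest-txg) label seen so far
    newest = None      # newest label overall, used for vdev_children
    for lab in labels:
        vid = lab["vdev_tree"]["id"]
        if vid not in best or lab["txg"] > best[vid]["txg"]:
            best[vid] = lab
        if newest is None or lab["txg"] > newest["txg"]:
            newest = lab
    nvdevs = newest["vdev_children"]
    # Vdevs with no surviving label at all show up only as holes (None).
    return {
        "vdev_children": nvdevs,
        "vdevs": {i: best.get(i) for i in range(nvdevs)},
    }
```

Note that nothing here compares a vdev's best txg to the pool's newest txg, which is exactly how a very old reappearing device can be taken as the current state of its vdev.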
If zpool can't find any labels for a particular vdev, all it can
do in the configuration is fill in an artificial 'there is a vdev
missing' marker; it doesn't even know whether it was a raidz or a
mirrored vdev, or how much data is on it. When 'zpool import'
prints the resulting configuration, it doesn't explicitly show these
missing vdevs; if I'm reading the code right, your only clue as to
where they are is that the pool configuration will abruptly skip
from, eg, 'mirror-0' to 'mirror-2' without reporting 'mirror-1'.
There's an additional requirement for a working pool configuration,
although it's only checked by the kernel, not zpool. The pool
uberblocks have a ub_guid_sum field, which must match the sum
of all GUIDs in the vdev tree. If the GUID sum doesn't match, you'll
get one of those frustrating 'a device is missing somewhere' errors
on pool import. An entirely missing vdev naturally forces this to
happen, since all of its GUIDs are unknown and obviously not
contributing what they should be to this sum. I don't know how this
interacts with better ZFS pool recovery.
ZFS pool imports happen in two stages of pool configuration processing
The mechanics of how ZFS pools are imported is one of the more obscure areas of ZFS, which is a potential problem given that things can go very wrong (often with quite unhelpful errors). One special thing about ZFS pool importing is that it effectively happens in two stages, first with user-level processing and then again in the kernel, and these two stages use two potentially different pool configurations. My primary source for this is the discussion from Illumos issue #9075:
[...] One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. [...]
Here's my translation of that. In order to tell the kernel to load
a pool, 'zpool import' has to come up with a vdev configuration
for the pool and then provide it to the kernel. However, this is
not the real pool configuration; the real pool configuration is
stored in the pool itself (in regular ZFS objects that are part of
the MOS), where the kernel reads it again as the kernel imports the
pool.
Although not mentioned explicitly, the pool configuration that
'zpool import' comes up with and passes to the kernel is not read
from the canonical pool configuration, because reading those ZFS
objects from the MOS requires a relatively full implementation of
ZFS, which 'zpool import' does not have (the kernel obviously
does). One source of the pool configuration for 'zpool import'
is the ZFS cache file, /etc/zfs/zpool.cache, which theoretically
contains current pool configurations for all active pools. How
'zpool import' generates a pool configuration for exported or
deleted pools is sufficiently complicated to need an entry of its
own.
This two stage process means that there are at least two different
things that can go wrong with a ZFS pool's configuration information.
First, 'zpool import' may not be able to put together what it
thinks is a valid pool configuration, in which case I believe that
it doesn't even try to pass it to the kernel. Second, the kernel
may dislike the configuration that it's handed for its own reasons.
In older versions of ZFS (before better ZFS pool recovery landed), any mismatch between the actual pool
configuration and the claimed configuration from user level was
apparently fatal; now, only some problems are fatal.
As far as I know, 'zpool import' doesn't clearly distinguish
between these two cases in its error messages when you're actually
trying to import a pool. If you're just running it to see what pools
are available, I believe that all of what 'zpool import' reports
comes purely from its own limited and potentially imperfect
configuration assembly, with no kernel involvement.
(When a pool is fully healthy and in good shape, the configuration
'zpool import' puts together at the user level will completely
match the real configuration in the MOS. When it's not is when you
run into potential problems.)
Our last OmniOS fileserver is now out of production (and service)
On Twitter, I noted a milestone last evening:
This evening we took our last OmniOS fileserver out of production and powered it off (after a great deal of slow work; all told this took more than a year). They've had a good run, so thank you Illumos/OmniOS/OmniTI/etc for the generally quiet and reliable service.
We still haven't turned any of our iSCSI backends off (they're Linux, not OmniOS), but that will be next, probably Friday (the delay is just in case). Then we'll get around to recycling all of the hardware for some new use, whatever it will turn out to be.
When we blank out the OmniOS system disks as part of recycling the hardware, that really will be the end of the line for the whole second generation of our fileserver infrastructure and the last lingering traces of our long association with Sun will be gone, swallowed by time.
It's been pointed out to me by @oclsc that since we're still using ZFS (now ZFS on Linux), we still have a tie to Sun's lineage. It doesn't really feel the same, though; open source ZFS is sort of a lifeboat pushed out of Sun toward the end, not Sun(ish) itself.
(This is probably about as fast as I should have expected from having almost all of the OmniOS fileservers out of production at the end of May. Things always come up.)
Various people and groups at the department have been buying Sun machines and running Sun OSes (first SunOS and then Solaris) almost from the beginning of Sun. I don't know if we bought any Sun 1s, but I do know that some Sun 2s were, and Sun 3s and onward were for many years a big presence (eventually only as servers, although we did have some Sunrays). With OmniOS going out of service, that is the end of our use of that lineage of Unix.
(Of course Sun itself has been gone for some time, consumed by Oracle. But our use of its lineage lived on in OmniOS, since Illumos is more or less Solaris in open source form (and improved from when it was abandoned by its corporate parent).)
I have mixed feelings about OmniOS and I don't have much sentimentality about Solaris itself (it's complicated). But I still end up feeling that there is a weight of history that has shifted here in the department, at the end of a long slow process. Sun is woven through the history of the department's computing, and now all that remains of that is our use of ZFS.
(For all that I continue to think that ZFS is your realistic choice for an advanced filesystem, I also think that we probably wouldn't have wound up using it if we hadn't started with Solaris.)
A hazard of our old version of OmniOS: sometimes powering off doesn't
Two weeks ago, I powered down all of our OmniOS fileservers that
are now out of production, which is
most of them. By that, I mean that I logged in to each of them via
SSH and ran 'poweroff'. The machines disappeared from the network
and I thought nothing more of it.
This Sunday morning we had a brief power failure. In the aftermath of the power failure, three out of four of the OmniOS fileservers reappeared on the network, which we knew mostly because they sent us some email (there were no bad effects of them coming back). When I noticed them back, I assumed that this had happened because we'd set their BIOSes to 'always power on after a power failure'. This is not too crazy a setting for a production server you want up at all costs because it's a central fileserver, but it's obviously no longer the setting you want once they go out of production.
Today, I logged in to the three that had come back, ran 'poweroff'
on them again, and then later went down to the machine room to pull
out their power cords. To my surprise, when I looked at the physical
machines, they had little green power lights that claimed they were
powered on. When I plugged in a roving display and keyboard to check
their state, I discovered that all three were still powered on and
sitting displaying an OmniOS console message to the effect that they
were powering off. Well, they might have been trying to power off,
but they weren't achieving it.
I rather suspect that this is what happened two weeks ago, and why
these machines all sprang back to life after the power failure. If
OmniOS never actually powered the machines off, even a BIOS setting
of 'resume last power state after a power failure' would have powered
the machines on again, which would have booted OmniOS back up again.
Two weeks ago, I didn't go look at the physical servers or check
their power state through their lights out management interface;
it never occurred to me that 'poweroff' on OmniOS sometimes might
not actually power the machine off, especially when the machines
did drop off the network.
(One out of the four OmniOS servers didn't spring back to life after the power failure, and was powered off when I looked at the hardware. Perhaps its BIOS was set very differently, or perhaps OmniOS managed to actually power it off. They're all the same hardware and the same OmniOS version, but the server that probably managed to power off had no active ZFS pools on our iSCSI backends; the other three did.)
At this point, this is only a curiosity. If all goes well, the last OmniOS fileserver will go out of production tomorrow evening. It's being turned off as part of that, which means that I'm going to have to check that it actually powered off (and I'd better add that to the checklist I've written up).
Almost all of our OmniOS machines are now out of production
Last Friday, my co-workers migrated the last filesystem from our HD-based OmniOS fileservers to one of our new Linux fileservers. With this, the only OmniOS fileserver left in production is serving a single filesystem, our central administrative filesystem, which is extremely involved to move because everything uses it all the time and knows where it is (and of course it's where our NFS automounter replacement lives, along with its data files). Moving that filesystem is going to take a bunch of planning and a significant downtime, and it will only happen after I come back from vacation.
(Unlike last time around, we haven't destroyed any pools or filesystems yet in the old world, since we didn't run into any need to.)
This migration has been in process in fits and starts since late last November, so it's taken about seven months to finish. This isn't because we have a lot of data to move (comparatively speaking); instead it's because we have a lot of filesystems with a lot of users. First you have to schedule a time for each filesystem that the users don't object to (and sometimes things come up so your scheduled time has to be abandoned), and then moving each filesystem takes a certain amount of time and boring work (so often people only want to do so many a day, so they aren't spending all of their day on this stuff). Also, our backup system is happier when we don't suddenly give it massive amounts of 'new' data to back up in a single day.
(I think this is roughly comparable to our last migration, which seems to have started at the end of August of 2014 and finished in mid-February of 2015. We've added significantly more filesystems and disk space since then.)
The MVP of the migration is clearly 'zfs send | zfs recv' (as it
always has been). Having to do the
migrations with something like
rsync would likely have been much
more painful for various reasons; ZFS snapshots and ZFS send are
things that just work, and they come with solid and extremely
reassuring guarantees. Part of their importance was that the speed
of an incremental ZFS send meant that the user-visible portion of
a migration (where we had to take their filesystem away temporarily)
could be quite short (short enough to enable opportunistic migrations,
if we could see that no one was using some of the filesystems).
At this point we've gotten somewhere around four and a half years of lifetime out of our OmniOS fileservers. This is probably around what we wanted to get, especially since we never replaced the original hard drives and so they're starting to fall out of warranty coverage and hit what we consider their comfortable end of service life. Our first generation Solaris fileservers were stretched much longer, but they had two generations of HDs and even then we were pushing it toward the end of their service life.
(The actual server hardware for both the OmniOS fileservers and the Linux iSCSI backends seems fine, so we expect to reuse it in the future once we migrate the last filesystem and then tear down the entire old environment. We will probably even reuse the data HDs, but only for less important things.)
I think I feel less emotional about this migration away from OmniOS than I did about our earlier migration from Solaris to OmniOS. Moving away from Solaris marked the end of Sun's era here (even if Sun had been consumed by Oracle by that point), but I don't have that sort of feelings about OmniOS. OmniOS was always a tool to me, although unquestionably a useful one.
(I'll write a retrospective on our OmniOS fileservers at some point, probably once the final filesystem has migrated and everything has been shut down for good. I want to have some distance and some more experience with our Linux fileservers first.)
PS: To give praise where it belongs, my co-workers did basically all of the hard, grinding work of this migration, for various reasons. Once things got rolling, I got to mostly sit back and move filesystems when they told me one was scheduled and I should do it. I also cleverly went on vacation during the final push at the end.
Some things on how ZFS dnode object IDs are allocated (which is not sequentially)
One of the core elements of ZFS is the dnode, which defines a DMU object. Within a single filesystem or other object set, dnodes have an object number (aka object id). For dnodes that are files or directories in a filesystem, this is visible as their Unix inode number, but other internal things get dnodes and thus object numbers (for example, the dnode of the filesystem's delete queue). Object ids are 64-bit numbers, and many of them can be relatively small (especially if they are object ids for internal structures, again such as the delete queue). Very large dnode numbers are uncommon, and some files and directories from early in a filesystem's life can have very small object IDs.
(For instance, the object ID of my home directory on our ZFS fileservers is '5'. I'm the only user in this filesystem.)
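Since inode numbers on a ZFS filesystem are dnode object ids, you can inspect them with nothing more than a stat() call; for example, in Python:

```python
# On a ZFS filesystem, a file's Unix inode number as reported by
# stat() is its dnode object number, so no zdb is needed to see it.
import os

st = os.stat(os.path.expanduser("~"))
# The actual number depends on when the directory was created and on
# the allocation behavior described below.
print(st.st_ino)
```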
You might reasonably wonder how ZFS object IDs are allocated. Inspection of a ZFS filesystem will show that they are clearly not allocated sequentially, but they're also not allocated randomly. Based on an inspection of the dnode allocation source code in dmu_object.c, there seem to be two things going on to spread dnode object ids around some (but not too much).
The first thing is that dnode allocation is done from per-CPU chunks of the dnode space. The size of each chunk is set by dmu_object_alloc_chunk_shift, which by default creates 128-dnode chunks. The motivation for this is straightforward; if all of the CPUs in the system were allocating dnodes from the same area, they would all have to contend for locks on this area. Spreading out into separate chunks reduces lock contention, which means that parallel or highly parallel workloads that frequently create files on a single filesystem don't bottleneck on a shared lock.
(One reason that you might create files a lot in a parallel workload is if you're using files on the filesystem as part of a locking strategy. This is still common in things like mail servers, mail clients, and IMAP servers.)
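As an illustration of the approach (a toy model in Python, not the dmu_object.c code), here's how per-CPU chunking lets most allocations proceed without touching shared state:

```python
# A simplified model of per-CPU chunked dnode allocation: each CPU
# holds a private chunk of 128 object ids (2**dmu_object_alloc_chunk_shift)
# and hands them out until the chunk runs dry; only then does it touch
# the shared counter. (A toy illustration, not the real ZFS code.)
import itertools

CHUNK_SHIFT = 7                 # dmu_object_alloc_chunk_shift default
CHUNK = 1 << CHUNK_SHIFT        # 128 dnodes per chunk

class DnodeAllocator:
    def __init__(self, ncpus):
        self._next_chunk = itertools.count(0)   # shared state, rarely touched
        self._cursor = {cpu: iter(()) for cpu in range(ncpus)}

    def alloc(self, cpu):
        # Fast path: take the next id from this CPU's private chunk.
        for objid in self._cursor[cpu]:
            return objid
        # Slow path: claim a fresh 128-id chunk from the shared counter.
        base = next(self._next_chunk) * CHUNK
        self._cursor[cpu] = iter(range(base + 1, base + CHUNK))
        return base
```

Two CPUs allocating at the same time draw from different 128-id ranges, which is one reason object ids are visibly not sequential.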
The second thing is, well, I'm going to quote the comment in the source code to start with:
Each time we polish off a L1 bp worth of dnodes (2^12 objects), move to another L1 bp that's still reasonably sparse (at most 1/4 full). Look from the beginning at most once per txg. If we still can't allocate from that L1 block, search for an empty L0 block, which will quickly skip to the end of the metadnode if no nearby L0 blocks are empty. This fallback avoids a pathology where full dnode blocks containing large dnodes appear sparse because they have a low blk_fill, leading to many failed allocation attempts. [...]
(In reading the code a bit, I think this comment means 'L2 block' instead of 'L0 block'.)
To understand a bit more about this, we need to know about two things. First, we need to know that dnodes themselves are stored in another DMU object, and this DMU object stores data in the same way as all others do, using various levels of indirect blocks. Then we need to know about indirect blocks themselves. L0 blocks directly hold data (in this case the actual dnodes), while L1 blocks hold pointers to L0 blocks and L2 blocks hold pointers to L1 blocks.
(You can see examples of this structure for regular files in the
zdb output in this entry and this
entry. If I'm doing the math right,
for dnodes an L0 block normally holds 32 dnodes and an L<N> block
can address up to 128 L<N-1> blocks, through block pointers.)
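That arithmetic can be spelled out directly; one L1 indirect block covers 32 * 128 = 4096 dnodes, which matches the '2^12 objects' in the source comment:

```python
# Checking the math above: 32 dnodes per L0 data block, and up to
# 128 block pointers per indirect block, so one L1 block covers
# 32 * 128 dnodes.
DNODES_PER_L0 = 32
PTRS_PER_INDIRECT = 128
print(DNODES_PER_L0 * PTRS_PER_INDIRECT)  # 4096, i.e. 2^12
```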
So, what appears to happen is that at first, the per-CPU allocator gets its chunks sequentially (for different CPUs, or the same CPU) from the same L1 indirect block, which covers 4096 dnodes. When we exhaust all of the 128-dnode chunks in a single group of 4096, we don't move to the sequentially next group of 4096; instead we search around for a sufficiently empty group, and switch to it (where a 'sufficiently empty' group is one with at most 1024 dnodes already allocated). If there is no such group, I think that we may wind up skipping to the end of the currently allocated dnodes and getting a completely fresh empty block of 4096.
If I'm right, the net effect of this is to smear out dnode allocations and especially reallocations over an increasingly large portion of the lower dnode object number space. As your filesystem gets used and files get deleted, many of the lower 4096-dnode groups will have some or even many free dnodes, but not the 3072 free dnodes that they need to be eligible to be selected for further allocation. This can eventually push dnode allocations to relatively high object numbers even though you may not have anywhere near that many dnodes in use on the filesystem. This is not guaranteed, though, and you may still reuse dnode numbers.
(For example, I just created a new file in my home directory. My home directory's filesystem has 1983310 dnodes used right now, but the inode number (and thus dnode object number) that my new test file got was 1804696.)
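Here's a rough Python model of the group selection described above (my own simplification; among other things it ignores the once-per-txg limit on rescanning from the beginning):

```python
# A rough model of L1-group selection: when the current group of 4096
# object ids is exhausted, scan for a group that is at most 1/4 full
# (<= 1024 allocated) instead of simply taking the next one; failing
# that, skip to a brand new group at the end.
# (A simplification of the behavior described in dmu_object.c.)

L1_DNODES = 4096                # dnodes covered by one L1 indirect block
SPARSE_MAX = L1_DNODES // 4     # "reasonably sparse": at most 1/4 full

def pick_next_group(used_per_group):
    """used_per_group[i] = allocated dnodes in group i; returns a group index."""
    for i, used in enumerate(used_per_group):
        if used <= SPARSE_MAX:
            return i
    # No sufficiently sparse group anywhere: use a fresh one at the end.
    return len(used_per_group)
```

This is what smears allocations around: a group with, say, 2000 dnodes free is still skipped because it is more than a quarter full.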
One of our costs of using OmniOS was not having 10G networking
OmniOS has generally been pretty good to us over the lifetime of our second generation ZFS fileservers, but as we've migrated various filesystems from our OmniOS fileservers to our new Linux fileservers, it's become clear that one of the costs we paid for using OmniOS was not having 10G networking.
We certainly started out intending to have 10G networking on OmniOS; our hardware was capable of it, with Intel 10G-T chipsets, and OmniOS seemed happy to drive them at decent speeds. But early on we ran into a series of fatal problems with the Intel ixgbe driver which we never saw any fixes for. We moved our OmniOS machines (and our iSCSI backends) back to 1G, and they have stayed there ever since. When we made this move, we did not have detailed system metrics on things like NFS bandwidth usage by clients, and anyway almost all of our filesystems were on HDs, so 1G seemed like it should be fine. And indeed, we mostly didn't see obvious and glaring problems, especially right away.
What setting up a metrics system (even only on our NFS clients) and
then later moving some filesystems from OmniOS (at 1G) to Linux (at
10G) made clear was that on some filesystems, we had definitely
been hitting the 1G bandwidth limit and doing so had real impacts.
The filesystem this was most visible on is the one that holds
/var/mail, our central location for people's mailboxes (ie, their
IMAP inbox). This was always on SSDs even on OmniOS, and once we
started really looking it was clearly bottlenecked at 1G. It was
one of the early filesystems we moved to the Linux fileservers, and
the improvement was very visible. Our IMAP server, which has 10G
itself, now routinely has bursts of over 200 Mbps inbound and
sometimes sees brief periods of almost saturated network bandwidth.
More importantly, the IMAP server's performance is visibly better;
it is less loaded and more responsive, especially at busy times.
(A contributing factor to this is that any number of people have
very big inboxes, and periodically our IMAP server winds up having
to read through all of such an inbox. This creates a very asymmetric
traffic pattern, with huge inbound bandwidth from the
fileserver to the IMAP server but very little outbound traffic.)
It's less clear how much of a cost we paid for HD-based filesystems, but it seems pretty likely that we paid some cost, especially since our OmniOS fileservers were relatively large (too large, in fact). With lots of filesystems, disks, and pools on each fileserver, it seems likely that there would have been periods where each fileserver could have reached inbound or outbound network bandwidth rates above 1G, if they'd had 10G networking.
(And this excludes backups, where it seems quite likely that 10G would have sped things up somewhat. I don't consider backups as important as regular fileserver NFS traffic because they're less time and latency sensitive.)
At the same time, it's quite possible that this cost was still worth paying in order to use OmniOS back then instead of one of the alternatives. ZFS on Linux was far less mature in 2013 and 2014, and I'm not sure how well FreeBSD would have worked, especially if we insisted on keeping a SAN based design with iSCSI.
(If we had had lots of money, we might have attempted to switch to other 10G networking cards, probably SFP+ ones instead of 10G-T (which would have required switch changes too), or to commission someone to fix up the ixgbe driver, or both. But with no funds for either, it was back to 1G for us and then the whole thing was one part of why we moved away from Illumos.)
A ZFS resilver can be almost as good as a scrub, but not quite
We do periodic scrubs of our pools, roughly every four weeks on a revolving schedule (we only scrub one pool per fileserver at once, and only over the weekend, so we can't scrub all pools on one of our HD based fileservers in one weekend). However, this weekend scrubbing doesn't happen if there's something else more important happening on the fileserver. Normally there isn't, but one of our iSCSI backends didn't come back up after our power outage this Thursday night. We have spare backends, so we added one in to the affected fileserver and started the process of resilvering everything onto the new backend's disks to restore redundancy to all of our mirrored vdevs.
I've written before about the difference between scrubs and resilvers, which is that a resilver potentially reads and validates less than a scrub does. However, we only have two way mirrors and we lost one side of all of them in the backend failure, so resilvering all mirrors has to read all of the metadata and data on every remaining device of every pool. At first, I thought that this was fully equivalent to a scrub and thus we had effectively scrubbed all of our pools on that fileserver, putting us ahead of our scrub schedule instead of behind it. Then I realized that it isn't, because resilvering doesn't verify that the newly written data on the new devices is good.
ZFS doesn't have any explicit 'read after write' checks, although it will naturally do some amount of reads from your new devices just as part of balancing reads. So although you know that everything on your old disks is good, you can't have full confidence that your new disks have correct copies of everything. If something got corrupted on the way to the disk or the disk has a bad spot that wasn't spotted by its electronics, you won't know until it's read back, and the only way to force that is with an explicit scrub.
For our purposes this is still reasonably good. We've at least checked half of every pool, so right now we definitely have one good copy of all of our data. But it's not quite the same as scrubbing the pools and we definitely don't want to reset all of the 'last scrubbed at time X' markers for the pools to right now.
(If you have three or four way mirrors, as we have had in the past, a resilver doesn't even give you this because it only needs to read each piece of data or metadata from one of your remaining N copies.)
Our plan for handling TRIM'ing our ZFS fileserver SSDs
The versions of ZFS that we're running on our fileservers (both
the old and the new) don't support using TRIM
on drives in ZFS pools. Support for
TRIM has been in FreeBSD ZFS
for a while,
but it only just landed in the ZFS on Linux development version
and it's not in Illumos. Given our general upgrade plans, we're also not likely to get
TRIM support over the likely production lifetime of our current
ZFS SSDs through upgrading the OS and ZFS versions later. So you
might wonder what our plans are to deal with how SSD performance
can decrease when they think they're all filled up, if you don't
TRIM them or otherwise deallocate blocks every so often.
Honestly, the first part of our plan is to ignore the issue unless we see signs of performance problems. This is not ideal but it is the simplest approach. It's reasonably likely that our ZFS fileservers will be more limited by NFS and networking than by SSD performance, and as far as I understand things, nominally full SSDs mostly suffer from write performance issues, not read performance. Our current view (only somewhat informed by actual data) is that our read volume is significantly higher than our write volume. We certainly aren't currently planning any sort of routine preventative work here, and we wouldn't unless we saw problem signs.
If we do see problem signs and do need to clear SSDs, our plan is
to do the obvious brute force thing in a ZFS setup with redundancy.
Rather than try to
TRIM SSDs in place, we'll entirely spare out
a given SSD so that it has no live data on it, and then completely
clear it, probably using Linux's
blkdiscard. We might do this in place on
a production fileserver, or we might go to the extra precaution of
pulling the SSD out entirely, swapping in a freshly cleared one,
and clearing the old SSD on a separate machine. Doing this swap has
the twin advantages that we're not risking accidentally clearing
the wrong SSD on the fileserver and we don't have to worry about
the effects of an extra-long, extra-slow SATA command on the rest
of the system and the other drives.
(This plan, such as it is, is not really new with our current generation Linux fileservers. We've had one OmniOS fileserver that used SSDs for a few special pools, and this was always our plan for dealing with any clear problems due to the SSDs slowing down due to being full up. We haven't had to use it, but then we haven't really gone looking for performance problems with its SSDs. They seem to still run fast enough after four or more years, and so far that's good enough for us.)