Some things on SSDs and their support for explicitly discarding blocks

January 18, 2023

Although things became complicated later, HDDs started out having a specific physical spot for each and every block (and even today most HDDs mostly have such a thing). You could in theory point at a very tiny spot on a HDD and correctly say 'this is block 5,321 and (almost) always will be'. Every time you wrote to block 5,321, that tiny spot would get new data, as an in-place update. SSDs famously don't work like this, because in general you can't immediately rewrite a chunk of flash memory that's been written to the way you can a HDD platter; instead, you need to write to newly erased flash memory. In order for SSDs to pretend that they are rewriting data in place, they need two things: a data structure to map from logical block addresses to wherever the latest version of the block is in physical flash memory, and a pool of ready-to-use erased flash blocks that the SSD can immediately write to.
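As a rough illustration of the mapping plus erased pool idea, here is a minimal Python sketch. Everything in it (the ToyFTL name, tracking single pages instead of erase blocks, the lack of any garbage collection) is a simplifying assumption of mine, not how any real SSD firmware is known to work.

```python
class ToyFTL:
    """Toy flash translation layer: every write goes to a fresh erased page."""

    def __init__(self, physical_pages):
        self.mapping = {}                          # logical block -> physical page
        self.erased = list(range(physical_pages))  # pool of ready-to-write pages
        self.stale = set()                         # pages holding superseded data
        self.flash = {}                            # physical page -> stored data

    def write(self, lba, data):
        if not self.erased:
            # A real SSD would garbage collect here; this toy just gives up.
            raise RuntimeError("no erased pages left")
        page = self.erased.pop(0)
        self.flash[page] = data
        old = self.mapping.get(lba)
        if old is not None:
            self.stale.add(old)    # the old copy can't be rewritten in place
        self.mapping[lba] = page   # the logical block now points somewhere new
        return page

    def read(self, lba):
        return self.flash[self.mapping[lba]]


ftl = ToyFTL(physical_pages=4)
first = ftl.write(5321, b"v1")
second = ftl.write(5321, b"v2")
print(first, second, ftl.read(5321), len(ftl.erased))
```

Writing 'block 5,321' twice lands the data on two different physical pages, and each rewrite uses up one more entry from the erased pool.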

In general the size of this pool of erased blocks is a clear contributor to the long term write performance of SSDs. By now we're all aware that a fresh or newly erased SSD generally has better sustained write performance than one that's been written to for a while. The general assumption is that a large part of the difference is that the pool of immediately ready erased flash blocks keeps shrinking as more and more of the SSD is written to.

(SSDs are very complicated black boxes so we don't really know this for sure; there could be other contributing factors that the manufacturers don't want to tell us about.)

One way that SSDs maintain such a pool (even after they've been written to a lot) is through over-provisioning. If a SSD claims to be 500 GB but really has 512 GB of flash memory, it has an extra 12 GB of flash that it can use for its own purposes, including for a pool of pre-erased flash blocks. Such a pool won't hold up forever if you keep writing to the SSD without pause, but by now we expect that sustained write speed will drop on a SSD at some point. One of the many unpredictable variables in SSD performance is how fast a SSD will be able to refresh its pool given some amount of idle time.
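To put a number on the example above, here is the arithmetic as a tiny Python sketch. The over_provisioning function name is my own, and the 512 GB and 500 GB figures are just the illustrative ones from the paragraph, not measurements of any particular SSD.

```python
def over_provisioning(raw_gb, advertised_gb):
    """Return (spare GB, spare as a fraction of the advertised capacity)."""
    spare_gb = raw_gb - advertised_gb
    return spare_gb, spare_gb / advertised_gb


spare_gb, ratio = over_provisioning(raw_gb=512, advertised_gb=500)
print(f"{spare_gb} GB spare, {ratio:.1%} of advertised capacity")
```

With these figures the spare area is only a couple of percent of the drive, which gives some feel for why it can't absorb sustained writes forever.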

The other way that SSDs can maintain such a pool is for you to tell them that some logical blocks can be thrown away. One way to do this is erasing the drive, which has the drawback that it erases everything. The more modern way is for your filesystem or your block layer to use a SSD 'TRIM' command to tell the SSD that some blocks are unused and so can be entirely discarded (the actual specifics in SATA, SCSI/SAS, and NVMe are impressively varied). Obviously TRIM can be used to implement 'erase drive', although this may not be quite the same inside the SSD as a real erase; this use of TRIM for drive erase is what I believe Linux's blkdiscard does by default.
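Here is a minimal Python sketch of why the SSD needs to be told. The ToySSD class is hypothetical, and it pretends that a page becomes erased the instant its data is no longer needed, which real flash certainly doesn't do; the point is only that deleting a file in the filesystem changes nothing on the drive until a TRIM arrives.

```python
class ToySSD:
    """Tracks which physical pages hold data the drive must still preserve."""

    def __init__(self, physical_pages):
        self.free = set(range(physical_pages))  # erased, ready to write
        self.live = {}                          # logical block -> physical page

    def write(self, lba):
        if not self.free:
            raise RuntimeError("no free pages left")
        page = min(self.free)                   # deterministic pick for the demo
        self.free.remove(page)
        old = self.live.pop(lba, None)
        if old is not None:
            self.free.add(old)                  # assumption: instant erase
        self.live[lba] = page

    def trim(self, lbas):
        # The host says these logical blocks are unused; forget their data.
        for lba in lbas:
            page = self.live.pop(lba, None)
            if page is not None:
                self.free.add(page)


ssd = ToySSD(physical_pages=10)
for lba in range(8):
    ssd.write(lba)

# The filesystem deletes a file that lived in logical blocks 2 to 5.
# Without TRIM the drive has no idea, so its free pool is unchanged.
free_before_trim = len(ssd.free)
ssd.trim(range(2, 6))
free_after_trim = len(ssd.free)
print(free_before_trim, free_after_trim)
```

In this toy the free pool only grows once trim() is called, even though from the filesystem's point of view the blocks were already unused.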

For obvious reasons, correctly implementing TRIM operations in your filesystem and block storage layers is critical. If there are any bugs that send TRIM commands for the wrong blocks (either to the wrong block addresses or mistaking which blocks are unused), you've just created data loss. People also used to worry about SSDs themselves having bugs in their TRIM implementations, since modern SSDs contain fearsome piles of code. By now, my impression is that TRIM has been around long enough and enough things are using it by default that the bugs have been weeded out (but then see the Debian wiki page).

(I believe that modern Linux systems default to TRIM being on for common filesystems. On the other side, OpenZFS still defaults automatic TRIM to off except on FreeBSD, although it's been long enough since my initial caution about TRIM on ZFS that I should try it.)

One of the interesting issues with TRIM is how it interacts with encrypted disks or filesystems, which are increasingly common on laptops and desktops. On the one hand, supporting TRIM is probably good for performance and maybe SSD lifetime; on the other hand, it raises challenges and potentially leaks information about how big the filesystem is and what blocks are actually used. I honestly don't know what various systems do here.

In many Linux environments, filesystems tend to sit on top of various underlying layers, such as LVM and software RAID (and disk encryption). In order for filesystem TRIM support to do any good it must be translated and passed through those various layers, which is something that hasn't always happened. According to the Arch Wiki SSD page, modern versions of LVM support passing TRIM through from the filesystem, and I believe that software RAID has for some time.

A further complication in TRIM support is that if you're using SATA SSDs behind a SAS controller, apparently not all models of (SATA) SSDs will support TRIM in that setup. We have Crucial MX500 2 TB SSDs in some Ubuntu 22.04 LTS fileservers where 'lsblk -dD' says the SATA connected ones will do TRIM operations but the SAS connected ones won't. However, WD Blue 2 TB SSDs say they're happy to do TRIM even when connected to the SAS side of things.

(Also, I believe that TRIM may often not work if you're connecting a SATA SSD to your system through a USB drive dock. This is a pity because it's otherwise a quite convenient way to work through a bunch of SSDs to blank out and reset. I wouldn't be surprised if this depends on both the USB drive dock and the SATA SSD. Now that I've discovered 'lsblk -D' I'm going to do some experimentation.)
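If you want to check this from a script instead of reading 'lsblk -D' output, a minimal Python sketch is to look at the kernel's per-device queue settings in sysfs. The discard_granularity and discard_max_bytes files are real Linux sysfs attributes, and I believe they're where lsblk gets its DISC-GRAN and DISC-MAX columns from; the discard_info function name and the 'non-zero discard_max_bytes means supported' rule are my own assumptions.

```python
from pathlib import Path


def discard_info(device, sysfs_root="/sys/block"):
    """Report what the kernel says about discard (TRIM) support for a device."""
    queue = Path(sysfs_root) / device / "queue"

    def read_int(name):
        try:
            return int((queue / name).read_text().strip())
        except (OSError, ValueError):
            return 0   # a missing or unreadable file is treated as 'no discard'

    granularity = read_int("discard_granularity")
    max_bytes = read_int("discard_max_bytes")
    return {
        "granularity": granularity,
        "max_bytes": max_bytes,
        # Assumption: a zero maximum discard size means no usable TRIM.
        "supported": max_bytes > 0,
    }


if __name__ == "__main__":
    root = Path("/sys/block")
    if root.is_dir():   # only present on Linux
        for dev in sorted(p.name for p in root.iterdir()):
            print(dev, discard_info(dev))
```

This only reports what the kernel believes about the whole path to the device, so a SATA SSD behind a SAS controller or a USB dock may show up as unsupported even if the SSD itself can do TRIM.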

At one point I would have guessed that various SSDs might specially recognize writes of all-zero blocks or similar things and trigger TRIM-like functionality, where the write is just discarded and the logical block is marked as 'trim to <zero or whatever>' (I worried about this in the context of benchmarking). I can't rule out SSDs doing that today, but given widespread support for TRIM, recognizing all-zero writes seems like the kind of thing you'd quietly drop from your SSD firmware to simplify life.

Comments on this page:

By Arnaud Gomes at 2023-01-19 03:46:13:

TRIM improving performance is a common idea, but it may turn out to be very wrong at least on consumer SSDs. The Samsung 870 EVO series, for instance (which I know first hand) can easily block for several minutes when (synchronously) TRIMming a few hundred gigabytes.

There is a trade-off here: TRIM does improve the long-term performance of a SSD, but the short-term cost may be big.

   -- A

Last modified: Wed Jan 18 23:10:24 2023