2023-01-26
Some notes on using TRIM on SSDs with ZFS on Linux
One of the things you can do to keep your SSDs performing well over time is to explicitly discard ('TRIM') disk blocks that are currently unused. ZFS on Linux has had support for TRIM commands for some time; the development version got it in 2019, and it first appeared in ZoL 0.8.0. When it was new, I was a bit nervous about using it immediately, but it's been years since then and recently I did some experimentation with it. Well, with one version of ZoL's TRIM support, the manual one.
ZFS on Linux has two ways to periodically TRIM your pool(s), the
automatic way and the manual way. The automatic way is to set
'autotrim=on' for selected pools; this comes with various cautions
that are mostly covered in zpoolprops(7). The manual way is to
periodically run 'zpool trim'
with suitable arguments. One significant advantage of explicitly
running 'zpool trim' is that you have a lot more control over the
process, and in particular manual trims let you restrict trimming
to a single device, instead of having trimming happen on all of
them at once. If you trim your pools for only one device at a time (or
only one device per vdev) and then scrub your pool afterward, you're
pretty well protected against something going wrong in the TRIM
process and the wrong disk blocks getting erased.
(My current experiments with 'zpool trim' are on Ubuntu 22.04 on some test pools, and scrubs say that nothing has gotten damaged in them afterward.)
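As a rough sketch of the 'one device at a time, then scrub' approach, the following Python builds the sequence of commands and by default only prints them. The pool name 'tank' and the device names 'sda' and 'sdb' are placeholders, and the helper functions are hypothetical. It assumes a version of ZFS where 'zpool trim' and 'zpool scrub' both accept -w to wait for completion, and it takes the conservative reading of scrubbing after every single device rather than once at the end.

```python
import subprocess


def trim_then_scrub_commands(pool, devices):
    """Build commands that trim one device at a time, scrubbing after each."""
    commands = []
    for device in devices:
        # -w makes 'zpool trim' wait until this device's trim has finished,
        # so only one device is ever being trimmed at a time.
        commands.append(["zpool", "trim", "-w", pool, device])
        # Scrub before touching the next device, so that any damage is
        # found while the other devices are still untrimmed.
        commands.append(["zpool", "scrub", "-w", pool])
    return commands


def run_commands(commands, dry_run=True):
    """Print the commands, or actually run them if dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # 'tank', 'sda' and 'sdb' are placeholder names, not real devices.
    run_commands(trim_then_scrub_commands("tank", ["sda", "sdb"]))
```

A finished scrub is not by itself proof that nothing went wrong, so you would still want to look at 'zpool status' for reported errors before moving on to the next device.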
The manual 'zpool trim' supports a -r command line option that controls how fast ZFS asks the disk to TRIM blocks. If you set this to, for example, 100 MBytes (per second), ZoL will only ask your SSD (or SSDs) to TRIM 100 MBytes of blocks every second. Sending TRIM commands to the SSD doesn't use read or write bandwidth as such, but it does ask the SSD to do things and that may affect other things that the SSD is doing. I wouldn't be surprised if some SSDs can TRIM at basically arbitrary rates with little to no impact on IO, while other SSDs get much more visibly distracted. As far as I can tell from some tests, this rate option does work (at least as far as ZFS IO statistics report).
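To make the rate arithmetic concrete, here is a small sketch that builds a rate-limited 'zpool trim' command and estimates how long the trim would take at that rate. It assumes that -r takes a plain number of bytes per second and that 'MBytes' means 1024 * 1024 bytes; the pool and device names are placeholders and the helper functions are hypothetical. The estimate is only a lower bound on the time, since it assumes the SSD keeps up with the requested rate.

```python
MIB = 1024 * 1024


def rate_limited_trim_command(pool, device, mbytes_per_sec):
    """Build a 'zpool trim' command limited to mbytes_per_sec."""
    # Assumption: -r is given as a plain number of bytes per second.
    rate = mbytes_per_sec * MIB
    return ["zpool", "trim", "-r", str(rate), pool, device]


def estimated_trim_seconds(bytes_to_trim, mbytes_per_sec):
    """Estimate how long trimming bytes_to_trim takes at the given rate."""
    return bytes_to_trim / (mbytes_per_sec * MIB)


if __name__ == "__main__":
    # 'tank' and 'sda' are placeholder names.
    print(" ".join(rate_limited_trim_command("tank", "sda", 100)))
    # 500 GiB of space to trim at 100 MBytes a second.
    seconds = estimated_trim_seconds(500 * 1024 * MIB, 100)
    print(f"about {seconds / 60:.0f} minutes")
```

At 100 MBytes a second, 500 GiB of space to discard works out to 5120 seconds, or roughly 85 minutes, which gives a feel for how low you can set the rate before a trim of a mostly empty disk takes hours.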
I'm not sure how much information 'zpool iostat' will report about ongoing TRIMs (either automatic or manual), but various information is available in the underlying statistics exported from the kernel. Your options for getting at this detailed information aren't great. At the moment, the available IO statistics appear to be a per-vdev 'bytes trimmed' number that counts up during TRIM operations (in sys/fs/zfs.h's vdev_stat structure), which only appears to have non-zero values for per-disk IO statistics, and histograms of the 'IO size' of TRIM operations (but 'individual' IO is not necessarily what you think it is, and there are some comments that individual TRIM 'IOs' of larger than 16 MBytes will be counted as 16 MBytes in the histograms, as that's their largest bucket). As with the 'rate' of trimming, all of these numbers are really counting the amount of data that ZFS has told the SSD or SSDs to throw away.
(All of these TRIM IO statistics are exposed by my version of the ZFS exporter for Prometheus.)
I'm not sure you can do very much with these IO statistics except use them to tell when your TRIMs ran and on what vdev. For that there are other IO 'statistics' exposed by ZFS on Linux, such as the vdev trim state, although 'zpool iostat' probably won't tell you about them.
(The 'vdev trim state' is the vdev_trim_state_t enum in sys/fs/zfs.h, where 1 means a trim is active, 2 means it's been canceled, 3 means it's been suspended, and 4 means it has completed. A zero means that a trim hasn't been done on this disk.)
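If you are pulling the raw trim state number out of the kernel statistics or a metrics system, a small lookup along these lines turns it back into something readable. The numeric values are the ones listed above; the Python names and the describe_trim_state() helper are made up for illustration and are not part of any ZFS tool.

```python
from enum import IntEnum


class VdevTrimState(IntEnum):
    """Mirror of the vdev_trim_state_t values described above."""
    NONE = 0       # no trim has been done on this disk
    ACTIVE = 1     # a trim is currently running
    CANCELED = 2   # the trim was canceled
    SUSPENDED = 3  # the trim was suspended
    COMPLETE = 4   # the trim finished


def describe_trim_state(value):
    """Return a lowercase name for a raw trim state number."""
    try:
        return VdevTrimState(value).name.lower()
    except ValueError:
        # A value outside 0-4, perhaps from a newer ZFS version.
        return f"unknown({value})"


if __name__ == "__main__":
    for raw in (0, 1, 4):
        print(raw, describe_trim_state(raw))
```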