Wandering Thoughts


I wouldn't use ZFS for swap (either for swapfiles or with a zvol)

As part of broadly charting how Linux finds where to write and read swap blocks, I recently noted that ZFS on Linux can't be used to hold a swapfile. David Magda noted that you could get around this by creating a zvol and using it for swap. While the Linux kernel will accept this and it works, at least to some extent, I wouldn't rely on it and I wouldn't do it unless I was desperate and had no other choice. Fundamentally, swapping to ZFS is not in accordance with what people (and often Unix kernels) expect from writing pages out to swap.

(On Linux, the Arch wiki has some definite cautions.)

Both the Unix kernel and Unix system administrators expect swapping pages out (and reading them back in) to be low overhead operations, things that are very close to writing a block of memory to some disk blocks or reading disk blocks into (preallocated) memory. This is not how ZFS works, even for writes to zvols. Due to ZFS's fundamental decision to never overwrite data in place, writing out blocks to a zvol requires allocating new space for them in the ZFS pool, collecting all of the relevant changes together, and then writing out a transaction group (or perhaps writing the blocks to a ZFS Intent Log). This more or less intrinsically requires allocating memory for various internal ZFS book-keeping, as well as obtaining various ZFS-related locks (for example, ones that protect the data structures that track free blocks). And it winds up doing a lot more IO than just the direct pages of memory being written to swap.

A lot of the time this will all work. Often you aren't swapping under heavy memory pressure, you're just pushing some unused pages out to swap and it's okay for this to take a while and allocate some extra memory and so on. But, regardless of whether it works, it's much more complicated than swapping is normally supposed to be and that makes it more chancy and less predictable. All of this leaves me feeling that swap to ZFS (and to anything similar to it) is for very unusual situations, not normal operation. If I didn't expect to really ever need swap on a system, I think I'd rather have no swap than swap on a zvol.

(ZFS isn't the only swap environment that has this problem. Swapping to a file on NFS has many of the same issues, and is also something that I can't recommend.)

ZFS is good for many things, but not everything, and one of the things it's not good at is very low overhead direct write IO. This is by design, since you can't combine it with copy on write and ZFS decided the latter was more important (and I agree with it).

ZFSForSwapMyViews written at 21:52:35


I wish ZFS supported per-user reservations, not just per-user quotas

ZFS supports a variety of ways to control space usage in filesystems. You can set a quota or a reservation on a filesystem, and you can set disk space and object count quotas on users, groups, and 'projects' in a filesystem. However, if you look at this list you'll notice an omission; you can't set a reservation for users, groups, or projects in a filesystem. There are some situations (at least in our world) where this would be convenient to have.

The most common case that comes up is that we have a bunch of people in a single filesystem, some of whom may fill up the filesystem by accident in the course of their work and others (such as professors) who we always want to be able to use some additional space so they can keep working. This is the ideal situation for a positive reservation instead of a negative quota, since what we want to put a limit on is the pool of space used by a group of people.

(The real ZFS answer is to put people who need reservations in their own filesystems because filesystems are cheap. But moving people from one filesystem to another is often rather disruptive and not trivial to coordinate, so often it doesn't get seriously contemplated until actual problems happen.)

OpenZFS has supported 'project quotas' since version 0.8.0, as covered in zfs-project(8) and zfs-projectspace(8). Project quotas can be used to give a single person (or group of people) a reservation in a filesystem, by putting their directories into a new project and then putting a project quota limit on the default project. However, you can't use this to give two people each a reservation of their own without putting quotas on each of them too, which is potentially (very) undesirable.

(ZFS project quotas appear to be in the current version of Illumos but I'm not sure when they appeared. It may have been added to the tree in August of 2019, per issue #11479.)

I don't have any personal experience with project quotas. Our Ubuntu ZFS fileservers are still running Ubuntu 18.04, which is too old to support them, and even once we upgrade to 22.04 we probably won't try it because of the various challenges of administering and managing them.

PS: Since ZFS supports project quotas, it also supports tracking space usage by 'project'. Here 'project' is basically 'whatever you want to tag with some unique identifier', which means that you could go through and tag every top level directory in a filesystem with a separate project ID so you could easily get reports on how much space is in use in each of them. Ordinary people probably just use 'du -hs'.

PPS: I think it would be reasonable to require the filesystem to have a reservation that was at least as big as the sum of all of the user reservations in it (or the user, group, and project ones if you wanted to support all of those).

ZFSPerUserReservationWish written at 19:14:10


Why the ZFS ZIL's "in-place" direct writes of large data are safe

I recently read ZFS sync/async + ZIL/SLOG, explained (via), which reminded me that there's a clever but unsafe seeming thing that ZFS does here, that's actually safe because of how ZFS works. Today, I'm going to talk about why ZFS's "in-place" direct writes to main storage for large synchronous writes are safe, despite that perhaps sounding dangerous.

ZFS periodically flushes writes to disk as part of a ZFS transaction group; these days a transaction group commit happens every five seconds by default. However, sometimes programs want data to be sent to disk sooner than that (for example, your editor saving a file; it will sync the file in one way or another at the end, so that you don't lose it if there's a system crash immediately afterward). To do this, ZFS has the ZFS Intent Log (ZIL), which is a log of all write operations since the last transaction group where ZFS promised programs that the writes were durably on disk (to simplify a bit). If the system crashes before the writes can be sent to disk normally as part of a transaction group, ZFS can replay the ZIL to recreate them.

Taken by itself, this means that ZFS does synchronous writes twice, once to the ZIL as part of making them durable and then a second time as part of a regular transaction group. As an optimization, under the right circumstances (which are complicated, especially with a separate log device) ZFS will send those synchronous writes directly to their final destination in your ZFS pool, instead of to the ZIL, and then simply record a pointer to the destination in the ZIL. This sounds dangerous, since you're writing data directly into the filesystem (well, the pool) instead of into a separate log, and in a different filesystem it might be. What makes it safe in ZFS is that in ZFS, all writes go to unused (free) disk space because ZFS is what we generally call a copy-on-write system. Even if you're rewriting bits of an existing file, ZFS writes the new data to free space, not over the existing file contents (and it does this whether or not you're doing a synchronous write).

(ZFS does have to update some metadata in place, but it's a small amount of metadata and it's carefully ordered to make transaction group commits atomic. When doing these direct writes, ZFS also flushes your data to disk before it writes the ZIL that points to your data.)

Obviously, ZFS makes no assumptions about the contents of free disk space. This means that if your system crashes after ZFS has written your synchronous data into its final destination in what was free space until ZFS used it just now, but before it writes out a ZIL entry for it (and tells your editor or database that the data is safely on disk), no harm is done. No live data has been overwritten, and the change to what's in free space is unimportant (well, to ZFS, you may care a lot about the contents of the file that you were just a little bit late to save as power died).

Similarly, if your system crashes after the ZIL is written out but before the regular transaction group commits, the space your new data is written to is still marked as free at the regular ZFS level but the ZIL knows better. When the ZIL is replayed to apply all of the changes it records, your new data will be correctly connected to the overall ZFS pool (meta)data structures, the space will be marked as used, and so on.

(I've mentioned this area in the past when I wrote about the ZIL's optimizations for data writes, but at the time I explained its safety more concisely and somewhat in passing. And the ZFS settings and specific behavior I mentioned in that entry may now be out of date, since it's from almost a decade ago.)

ZFSZILSafeDirectWrites written at 22:27:38


ZFS DVA offsets are in 512-byte blocks on disk but zdb misleads you about them

Yesterday I asserted that ZFS DVA offsets were in bytes, based primarily on using zdb to dump a znode and then read a data block using the offset that zdb printed. Over on Twitter, Matthew Ahrens corrected my misunderstanding:

The offset is stored on disk as a multiple of 512, see the DVA_GET_OFFSET() macro, which passes shift=SPA_MINBLOCKSHIFT=9. For human convenience, the DVA is printed in bytes (e.g. by zdb). So the on-disk format can handle up to 2^72 bytes (4 ZiB) per vdev.

... but the current software doesn't handle more than 2^64 bytes (16 EiB).

That is to say, when zdb prints ZFS DVAs it is not showing you the actual on-disk representation, or a lightly decoded version of it; instead the offset is silently converted from its on-disk form of 512-byte blocks to a version in bytes. I think that this is also true of other pieces of ZFS code that print DVAs as part of diagnostics, kernel messages, and so on. Based on lightly reading the code, I believe that the size of the DVA is also recorded on disk in 512-byte blocks, because zdb and other things use a similar C macro (DVA_GET_ASIZE()) when printing it.

(Both macros are #define'd in include/sys/spa.h.)

So, to summarize: on disk, ZFS DVA offsets are in units of 512-byte blocks, with offset 0 (on each disk) starting after a 4 Mbyte header. In addition, zdb prints offsets (and sizes) in hexadecimal and in units of bytes, not in their on-disk 512-byte blocks, as (probably) do other things. If zdb says that a given DVA is '0:7ea00:400', that is a byte offset of 518656 bytes and a byte size of 1024 bytes. Zdb is decoding these for you from their on disk form. If a kernel message talks about DVA '0:7ea00:400' it's also most likely using byte offsets, as zdb does.
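To make the unit conversion concrete, here is a minimal sketch in shell of going from a zdb-printed DVA to both byte values and 512-byte units. The helper name decode_dva is made up for illustration and is not part of any ZFS tool, and it assumes (as discussed above, from lightly reading the code) that the size is stored on disk in the same 512-byte units as the offset.

```shell
# decode_dva is a hypothetical helper, not part of zdb or OpenZFS.
# It takes a DVA as zdb prints it: "vdev:offset:size", with the offset
# and size in hexadecimal bytes.
decode_dva() (
    vdev=${1%%:*}
    rest=${1#*:}
    off_bytes=$((0x${rest%%:*}))
    size_bytes=$((0x${rest#*:}))
    # The on-disk form counts 512-byte units (a shift of 9 bits), so
    # dividing by 512 recovers what is assumed to be stored in the DVA.
    off_sectors=$(($off_bytes / 512))
    size_sectors=$(($size_bytes / 512))
    echo "vdev=$vdev offset_bytes=$off_bytes size_bytes=$size_bytes offset_sectors=$off_sectors size_sectors=$size_sectors"
)

decode_dva 0:7ea00:400
# prints: vdev=0 offset_bytes=518656 size_bytes=1024 offset_sectors=1013 size_sectors=2
```

Neither set of numbers includes the 4 Mbyte header, which still has to be added to find the data on the actual disk.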

These DVA block offsets are always for 512 byte blocks. The 'block size' of the offset is fixed, and doesn't depend on the physical block size of the disk, the logical block size of the disk, or the ashift of the vdev. Since 512 bytes is the block size for the minimum ashift, ZFS will never have to assign finer grained addresses than that, even if it's somehow dealing with a disk or other storage with smaller sized blocks. This makes using 512 byte 'blocks' completely safe. That the DVA offsets are in blocks is, in a sense, mostly a way of increasing how large a vdev can be (adding nine bits of size).

(This is not as crazy a concern as you might think, since DVA offsets in a raidz vdev cover the entire (raw) disk space of the vdev. If you want to allow, say, 32 disk raidz vdevs without size limits, the disk size limit is 1/32nd of the vdev size limit. That's still a very big disk by today's standards, but if you're building a filesystem format that you expect may be used for (say) 50 years, you want to plan ahead.)

I haven't looked at the OpenZFS code in depth to see how it handles DVA offsets in the current code. The comments in include/sys/spa.h make it sound like all interpretation of the contents of DVAs goes through the macros in the file, including the offset. The only apparent way to get access to the offset is with the DVA_GET_OFFSET() macro, which converts it to a byte offset in the process; this suggests that much or all ZFS code probably passes around DVA offsets as byte offsets, not their on-disk block offset form.

(This is somewhat suggested by what Matthew Ahrens said about how the current software behaves; if it deals primarily or only with byte offsets, it's limited to vdevs with at most 2^64 bytes, although the disk format could accommodate larger ones. If all of the internal ZFS code deals with byte offsets, this might be part of why zdb prints DVAs as byte offsets; if you're going back and forth between a kernel debugger and zdb output, you want them to be in the same units.)

I'm disappointed that I was wrong (both yesterday and in the past), but at least I now have a definitive answer and I understand more about the situation.

ZFSDVAOffsetsInBytesII written at 21:28:41


ZFS DVA offsets are in bytes, not (512-byte) blocks

In ZFS, a DVA (Device Virtual Address) is the equivalent of a block address in a regular filesystem. For our purposes today, the important thing is that a DVA tells you where to find data by a combination of the vdev (as a numeric index) and an offset into the vdev (and also a size). However, this description leaves a question open, which is what units are ZFS DVA offsets in. Back when I looked into the details of DVAs in 2017, the code certainly appeared to be treating the offset as being in bytes; however, various other sources have sometimes asserted that offsets are in units of 512-byte blocks. Faced with this uncertainty, today I decided to answer the question once and for all with some experimentation.

(One of the 'various sources' for the DVA offset being in 512-byte blocks is the "ZFS On-Disk Specification" that you can find copies of floating around on the Internet, eg currently here or this repository with the specification and additional information. See section 2.1.)

Update: This turns out to be wrong (or a misunderstanding). On disk, ZFS DVA offsets are stored as (512-byte) blocks, but tools like zdb print them as byte offsets. See ZFSDVAOffsetsInBytesII.

I'll start with the more or less full version of the experiment on a file-based ZFS pool.

# truncate --size 100m disk01
# zpool create tank disk01
# vi /tank/TESTFILE
[.. enter more than 512 bytes of text ...]
# sync
# zdb -vv -O tank TESTFILE
Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     6    1   128K     1K     1K     512     1K  100.00  ZFS plain file
                                            176   bonus  System attributes
Indirect blocks:
  0 L0 0:7ea00:400 400L/400P F=1 B=49/49 cksum=[...]

This file is 1 KB (two 512-byte blocks) on disk (since it's not compressed; I deliberately didn't turn that on), and starts at offset 0x7ea00 aka 518656 in whatever unit that is. Immediately we have one answer; our 'disk01' data file only has 204800 512-byte blocks, so this cannot be a plain 512-byte block offset. However, if we just read at byte 518656 (block 1013), we won't succeed in finding our file data. Per various sources (eg ZFS Raidz Data Walk), there is a 4 MByte (0x400000 bytes) header that we also have to add in. That's 8192 512-byte blocks for the header, plus 1013 for the DVA offset gives us a block offset from the start of the file of 9205 512-byte blocks, so:

# dd if=disk01 bs=512 skip=9205 | sed 5q
line 2 of zfs test file
line 3 of zfs test file
line 4 of zfs test file
line 5 of zfs test file

I've found my test file exactly where it should be.
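The arithmetic I did by hand here can be written down as a small shell sketch. The function name dva_to_dd_skip is hypothetical, and it assumes a single-disk or mirror vdev where the DVA offset maps directly onto the disk after the 4 MByte header; it takes the hexadecimal byte offset exactly as zdb printed it.

```shell
# dva_to_dd_skip is a hypothetical helper for illustration only.
# Argument: the DVA offset as zdb prints it (hexadecimal, in bytes).
# Output: a 'skip=' value for dd with bs=512.
dva_to_dd_skip() (
    header_bytes=$((4 * 1024 * 1024))   # 4 MByte header before DVA offset 0
    off_bytes=$((0x$1))
    echo $((($header_bytes + $off_bytes) / 512))
)

dva_to_dd_skip 7ea00
# prints: 9205
```

With this, the dd command above could be written as 'dd if=disk01 bs=512 skip="$(dva_to_dd_skip 7ea00)" count=2', where the count of 2 is the file's two 512-byte blocks.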

Just to be sure, I also did this same experiment on our test fileserver, where the ZFS pool uses mirrored disk partitions (instead of a file). The answer is the same; treating the ZFS DVA offset as a byte offset and adding 4 MBytes gets me the right (byte) offset into the disk partition to find the file contents that should be there. Although I haven't verified it in the code, I would be very surprised if raidz or draid DVA offsets are any different (although raidz DVA offsets snake across all disks in the raidz).

(This experiment is obviously much harder if you have a dataset with compression turned on. I don't know if there's any easy way to get zdb to decompress a block from standard input. Modern versions of zdb can read ZFS blocks directly, with the -R option, but while useful this doesn't quite help answer the question here. I guess I could have strace'd zdb to see what offset it read the block from.)

(This is one of my ZFS uncertainties that has quietly nagged at me for years but that I can now finally put to bed.)

ZFSDVAOffsetsInBytes written at 21:48:26


ZFS DVAs and what they cover on raidz vdevs

In ZFS, a DVA (Device Virtual Address) is the equivalent of a block address in a regular filesystem. For our purposes today, the important thing is that a DVA tells you where to find data by a combination of the vdev (as a numeric index) and an offset into the vdev (and also a size). This implies, for example, that in mirrored vdevs, all mirrors of a block are at the same place on each disk, and that in raidz vdevs the offset is striped sequentially across all of your disks.

Recently I got confused about one bit of how DVA offsets work on raidz vdevs. On a raidz vdev, is the offset an offset into the logical data on the vdev (which is to say, ignoring and skipping over the space used by parity), or is it an offset into the physical data on the vdev (including parity space)?

The answer is that on raidz vdevs, DVA offsets cover the entire available disk space and include parity space. If you have a raidz vdev of five 1 GB disks, the vdev DVA offsets go from 0 to 5 GB (and by corollary, the parity level has no effect on the meaning of the offset). This is what is said by, for example, Max Bruning's RAIDZ On-Disk Format.

If you think about how parity works in raidz vdevs, this makes sense. In raidz, unlike conventional RAID5/6/7, the amount of parity and its location isn't set in advance, and you can force ZFS to make 'short writes' (writes of less than the data width of your raidz vdev) that still force it to create full parity. If DVA offsets ignored and skipped over parity, going from an offset to an on disk location would be very complex and the maximum offset limit for a vdev could change. Having the DVA offset cover all disk space on the vdev regardless of what it's used for is much simpler (and it also allows you to address parity with ordinary ZFS IO mechanisms that take DVAs).

I don't know how DVA offsets work on draid vdevs. Since I haven't been successful at understanding the draid code's addressing in the past, I haven't tried delving into it this time around.

(I might have known this at some point when I first looked into ZFS DVAs, but if so I forgot it since then, and recently had a clever ZFS idea that turns out to not work because of this. Now that I've written this down, maybe I'll remember it for the next time around.)

ZFSDVAsAndRaidzOffsets written at 21:52:22


What ZFS 'individual' and 'aggregated' IO size statistics mean

If you consult the zpool iostat manual page, one of the things it will tell you about is request size histograms, which come in two variants for each type of ZFS IO:

Print request size histograms [...]. This includes histograms of individual I/O (ind) and aggregate I/O (agg). These stats can be useful for observing how well I/O aggregation is working. [...]

This leaves many things unexplained. Let's start with what aggregated IO is and where it comes from. In the kernel, the primary source of aggregated IO in normal operation is vdev_queue_aggregate() in vdev_queue.c, where there are some comments:

Sufficiently adjacent io_offset's in ZIOs will be aggregated. [...]

We can aggregate I/Os that are sufficiently adjacent and of the same flavor, as expressed by the AGG_INHERIT flags. The latter requirement is necessary so that certain attributes of the I/O, such as whether it's a normal I/O or a scrub/resilver, can be preserved in the aggregate.

In other words, if ZFS is processing a bunch of independent IOs that are adjacent or sufficiently close together, it will aggregate them together. The 'flavor' of ZFS IO here is more than read versus write; it covers the IO priority as well (eg, synchronous versus asynchronous reads). There's also a limit on how large a span of data this aggregation can cover, and the limit depends on whether your ZFS pool is on SSDs or on HDDs. On SSDs the limit is the ZFS parameter zfs_vdev_aggregation_limit_non_rotating, which defaults to 128 KBytes, and on HDDs the limit is zfs_vdev_aggregation_limit, which defaults to 1 MByte.
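If you want to see which limit applies on a given Linux machine, a rough sketch is to read the module parameter that matches the kind of disk. The function name agg_limit is made up, and /sys/module/zfs/parameters is the usual location for ZFS on Linux module parameters; the directory is an optional second argument so that the function can be tried on a machine without ZFS loaded.

```shell
# agg_limit is a hypothetical helper, not an OpenZFS command.
# $1: "rotational" for HDDs, anything else for SSDs.
# $2: optional parameter directory (defaults to the usual Linux location).
agg_limit() (
    dir=${2:-/sys/module/zfs/parameters}
    if [ "$1" = "rotational" ]; then
        cat "$dir/zfs_vdev_aggregation_limit"
    else
        cat "$dir/zfs_vdev_aggregation_limit_non_rotating"
    fi
)

# On a machine with ZFS loaded you would use it like this:
#   agg_limit rotational
#   agg_limit ssd
```

The request size histograms themselves come from 'zpool iostat -r'.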

An eventual ZFS IO is either 'aggregated' or 'individual'; the two are exclusive of each other (see the histogram update code in vdev_stat_update() in vdev.c). The sum of aggregated and individual IO is how much IO you've done. Similarly, the size distribution of the joint histogram is the size distribution of your IO. This is experimentally verified for counts, although I can't completely prove it to myself in the OpenZFS code. I'm also not entirely sure what's going on with the IO sizes.

(An aggregated IO swallows other, existing ZIOs. The size and count of the aggregated IO could be either just the 'head' IO and its size, or it could wind up including all of the swallowed IOs. I can't figure out from the OpenZFS code which it is or convince myself of one or the other, although I certainly would expect the aggregated IO's statistics to include and cover all of the swallowed ZIOs, and for the swallowed ZIOs to not show up in the 'individual' numbers.)

On all of the machines that I can readily get statistics for, individual IO seems more common by count and to be responsible for more IO size than aggregated IO. However, aggregated IO is often large enough in both count and especially size to matter; it's clear to me that you can't just ignore aggregated IO. Asynchronous writes seem to especially benefit from aggregation. All of this is on SSD based pools, often using large files, and I don't understand how the non-rotating aggregation limit is interacting with the 128 KB default ZFS recordsize.

I don't know if you should expect I/O aggregation to happen (and be alarmed if your statistics suggest that it's not) or just be pleased when it does. Presumably some of this has to do with what size individual IOs you see, especially on SSDs.

PS: The current types of ZFS IO that you're likely to see are synchronous and asynchronous reads and writes, scrub, trim, and 'rebuild'; all of these actually come from the priority that ZFS assigns to IOs. The non-histogram per-type counters for total bytes and total operations are somewhat different, but to confuse you trim IO is counted as 'ioctl' there, for reasons beyond the scope of this entry.

ZFSIndividualVsAggregatedIOs written at 23:09:51


ZFS pool IO statistics (and vdev statistics) are based on physical disk IO

Today I wound up re-learning something that I sort of already knew about the IO statistics about pools and vdevs that you can get through things such as zpool iostat. Namely, that at least for bytes read and written and the number of IO operations, these IO statistics are what I call physical IO statistics; they aggregate and sum up the underlying physical disk IO information.

Whenever you're generating IO statistics for a logical entity with redundancy, you have a choice for how to present information about the volume and number of IOs. One option is to present the logical view of IO, where something asked you to write 1 GByte of data so you report that. The other option is to present the physical view of IO, where although you were given 1 GB to write, you wrote 2 GB to disk because you wrote it to both sides of a two-way mirror.

The logical view is often how people think of doing IO to entities with redundancy, and it's what things such as Linux's software RAID normally report. If you write 1 GB to a two-way Linux software RAID mirror, your disk IO statistics will tell you that there was 1 GB of writes to 'mdX' and 1 GB of writes to two disks (correlating this is up to you). If you do the same thing to a ZFS filesystem in a pool using mirrored vdevs, 'zpool iostat' will report that the pool did 2 GB of write IO.

(Well, 'zpool iostat' itself will report this as a bandwidth number. But the underlying information that ZFS provides is the cumulative volume in bytes.)

Presenting pool and vdev IO volume statistics the way ZFS does has the useful property that the numbers all add up. If you sum up all of the IO to devices in a vdev, you get the IO volume for the vdev; if you sum up IO volume across all vdevs, you get the IO volume for the pool (ignoring for the moment the prospect of devices being removed from a vdev). However, it makes it somewhat harder to know what logical write volume (and sometimes read volume) you're actually seeing, because you have to know how your vdevs multiply logical write IO and take that into account. A pool reporting 1 GB of write IO with two-way mirrors is seeing much more logical IO than a pool with four-way mirrors would be.
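As a back of the envelope illustration of that last point, here is a sketch in shell. The function name logical_write_bytes is hypothetical, and it assumes a pool made up only of N-way mirror vdevs with every write going to all sides of the mirror; raidz vdevs need different arithmetic, and this ignores anything else that might inflate physical IO.

```shell
# logical_write_bytes is a hypothetical helper for illustration.
# $1: physical write bytes as reported for the pool
# $2: mirror width (2 for two-way mirrors, 4 for four-way, and so on)
logical_write_bytes() (
    echo $(($1 / $2))
)

logical_write_bytes 1073741824 2
# prints: 536870912
logical_write_bytes 1073741824 4
# prints: 268435456
```

So the same reported 1 GB of physical writes is 512 MB of logical writes on two-way mirrors but only 256 MB on four-way mirrors.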

(Of course, IO volume and load is already not really comparable across pools because different pools may have different numbers of vdevs even if they have the same type of vdev. A pool with three mirror vdevs can handle much more write volume than a pool with only one mirror vdev, assuming they're using the same disks and so on.)

One view of logical IO volume for a pool can be gotten by adding up all of the per-dataset IO statistics (assuming that you can get them). However this will give you a genuine logical view of things, including read IO that was satisfied from the ARC and never went to disk at all. For some purposes this will be what you want; for others, it may be awkward.

The old simple per-pool IO statistics worked this way, so in a sense I should have known this already, but those are a different and now-obsolete system than the statistics 'zpool iostat' uses. ZFS is also somewhat inconsistent on this; for example, pool scrub progress on mirrored vdevs is reported in logical bytes, not physical ones.

PS: As you'd expect, the 'zpool iostat' level IO statistics report scrub IO in physical terms, not the logical one that 'zpool scrub' reports. In modern sequential scrubs on mirrored vdevs, the physical IO from 'zpool iostat' can add up to more than the total amount claimed to be scrubbed times the mirror level. I assume that this extra IO is due to any IO needed for the initial metadata scan.

ZFSPoolIostatsPhysical written at 23:16:16


I need to remember to check for ZFS filesystems being mounted

Over on the Fediverse I said something:

I keep re-learning the ZFS lesson that you want to check not only for the mount point of ZFS filesystems but also that they're actually mounted, since ZFS can easily have un-mounted datasets due to eg replication in progress.

We have a variety of management scripts on our fileservers that do things on 'all ZFS filesystems on this fileserver' or 'a specific ZFS filesystem if it's hosted on this fileserver'. Generally they get their list of ZFS filesystems and their locations by looking at the mountpoint property (we set an explicit mount location for all of our ZFS filesystems, instead of using the default locations). Most of the time this works fine, but every so often one of the scripts has blown up and we've quietly fixed it to do better.

The problem is that ZFS filesystems can be visible in things like 'zfs list' and have a mountpoint property without actually being mounted. Most of the time all ZFS filesystems with a mountpoint will actually be mounted, so most of the time the simpler version works. However, every so often we're moving a filesystem around with 'zfs send' and 'zfs receive', and either an initial replication of the filesystem sits unmounted on its new home, or the old version of the now migrated filesystem sits unmounted on its old fileserver, retained for a while as a safety measure.

It's not hard to fix our scripts, but we have to find them (and then remember not to make this mistake again when we write new scripts). This time around I did do a sweep over all of our scripts looking for use of 'zfs list' and the 'mountpoint' property and so on, and didn't find anything where we (now) weren't also checking the 'mounted' property. Hopefully it will stay that way, now that I've written this entry to remind myself.
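The general shape of the fix is to ask 'zfs list' for the 'mounted' property alongside the mountpoint and only act on filesystems where it's 'yes'. The following is a minimal sketch of this and not one of our actual scripts; the function name mounted_only and the sample dataset names and paths are made up for illustration.

```shell
# mounted_only is a hypothetical filter, not part of ZFS.
# It reads tab-separated 'name mounted mountpoint' lines, as produced by
#   zfs list -H -t filesystem -o name,mounted,mountpoint
# and prints the mount points of filesystems that are actually mounted.
mounted_only() {
    awk -F '\t' '$2 == "yes" { print $3 }'
}

# Sample input standing in for real 'zfs list' output; the second
# filesystem has a mountpoint property but is not mounted.
printf 'tank/fs1\tyes\t/h/100\ntank/fs2\tno\t/h/101\n' | mounted_only
# prints: /h/100
```

On a real fileserver the printf would be replaced by the 'zfs list' command from the comment.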

Sidebar: Two reasons other filesystems mostly don't have this problem

The obvious reason that other filesystems mostly don't have this problem is that they sort of don't have a state where they're present with a mount point assigned but not actually mounted. The less obvious reason is that most filesystems don't have a separate tool to list them; instead you look at the output of 'mount' or some other way of looking at what filesystems are mounted, and that obviously excludes filesystems that aren't. You can do the same with ZFS, but using 'zfs list' and so on is often more natural.

(With other filesystems, the rough equivalent is to have a 'noauto' filesystem in /etc/fstab that's not currently mounted. If you get your list of filesystems from fstab, you'll see the same sort of issue. Of course in practice you mostly don't look at fstab, since it doesn't reflect the live state of the system. Things in fstab may be unmounted, and things not in fstab may be mounted.)

ZFSCheckForMounted written at 22:46:00


We do see ZFS checksum failures, but only infrequently

One of the questions hovering behind ZFS is how often, in practice, you actually see data corruption issues that are caught by checksums and other measures, especially on modern solid state disks. On our old OmniOS and iSCSI fileserver environment we saw somewhat regular ZFS checksum failures, but that environment had a lot of moving parts, ranging from iSCSI through spinning rust. Our current fileserver environment uses local SSDs, and initially it seemed we were simply not experiencing checksum failures any more. Over time, though, we have experienced some (well, some not associated with SSDs that failed completely minutes later).

Because there's no in-pool persistent count of errors, I have to extract this information from our worklog reports of clearing checksum errors, which means that I may well have missed some. Our current fileserver infrastructure has been running since around September of 2018, so many pools are now coming up on three and a half years old.

  • In early 2019, a SSD experienced an escalating series of checksum failures over multiple days that eventually caused ZFS to fault the disk out. We replaced the SSD. No I/O errors were ever reported for it.

  • In mid 2019, a SSD with no I/O errors had a single checksum failure found in a scrub, which might have come from a NAND block failing and being reallocated (based on SMART data). The disk is still in service as far as I can tell, with no other problems.

  • At the end of August 2019, an otherwise problem-free SSD had one checksum error found in a scrub. Again, SMART data suggests it may have been some sort of NAND block failure that resulted in a reallocation. The disk is still in service with no other problems.

  • In mid 2021, a SSD reported six checksum errors during a scrub. As in all the other cases, SMART data suggests there was a NAND block failure and reallocation, and the disk didn't report any I/O errors. The disk is still in service with no other problems.

(We also had a SSD report a genuine read failure at the end of 2019. ZFS repaired 128 KB and the pool scrubbed fine afterward.)

So we've seen three incidents of checksum failures (two of which were only for a single ZFS block) on disks that have otherwise been completely fine, and one case where checksum failures were an early warning of disk failures. We started out with six fileservers, each with 16 ZFS data disks, and added a seventh fileserver later (none of these SSD checksum reports are from the newest fileserver). Conservatively, this means that our three or four incidents are across 96 disks.

(At the same time, this means four out of 96 or so SSDs had a checksum problem at some point, which is about a 4% rate.)

We have actually had a number of SSD failures on these fileservers. I'm not going to try to count how many, but I'm pretty certain that there have been more than four. This means that in our fileserver environment, SSDs seem to fail outright more often than they experience checksum failures. Having written this entry, I'm actually surprised by how infrequent checksum failures seem to be.

(I'm not going to try to count SSD failures, because that too would require going back through worklog messages.)

ZFSOurRareChecksumFailures written at 22:03:10


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.