Wandering Thoughts archives

2022-03-11

Filesystems can experience at least three different sorts of errors

Yesterday I wrote about how it would be nice if Linux exposed a count of disk errors, and mentioned that some Linux filesystems do expose such a count, but it's not clear what sort of errors they mean. This sounds like a peculiar thing to say, but in fact filesystems can experience at least three different sorts of errors. I will call these I/O errors, integrity errors, and structural errors.

An I/O error happens when the underlying storage device returns an error from a read or a write operation. Some filesystems have some internal redundancy that can hide such errors if they occur in the right sort of places, but most of the time this is a direct, user-visible error that will correspond to an I/O error that's (hopefully) reported by the storage device. Because of this generally direct link between a lower level error and a filesystem error, a filesystem might opt not to track and report these errors, especially when they happen while reading user data instead of filesystem metadata.

An integrity error happens when the filesystem has some form of checksums over (some of) its on-disk data, and the recorded checksum fails to match the checksum computed from the data the filesystem got from the storage device. ZFS is famous for having checksums on both user data and filesystem metadata, although it's not the only filesystem to do this. There are other filesystems that have checksums that only apply to filesystem metadata. Almost all storage devices have some level of undetected bit corruption, and checksums can also detect various other sorts of disk damage (such as misdirected writes).

A structural error happens when the filesystem detects that some of its on-disk metadata is not correct, in any of the many specific ways for any particular sort of metadata to be incorrect. Sometimes this happens because on-disk data has been corrupted, but sometimes it happens because the filesystem code has bugs that caused something incorrect and invalid to be written out to disk (in which case the metadata may have perfectly valid checksums or other integrity checks). A filesystem that counts errors and can recognize integrity errors on metadata might not want to double count such errors as structural errors as well.
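To make the distinction between the three sorts of errors concrete, here is a toy sketch in Python. It is not how any real filesystem works; the block layout (a CRC32 followed by an entry count), the `max_entries` limit, and all of the names are invented for illustration. The point is only that the three errors are detected at three different stages, and that a block can pass the checksum stage and still be structurally wrong.

```python
import struct
import zlib
from enum import Enum


class FsError(Enum):
    IO = "I/O error"
    INTEGRITY = "integrity error"
    STRUCTURAL = "structural error"


def make_block(entry_count):
    """Build a toy metadata block: a 4-byte CRC32, then a 4-byte entry count."""
    payload = struct.pack("<I", entry_count)
    return struct.pack("<I", zlib.crc32(payload)) + payload


def check_block(read_block, max_entries=16):
    """Return None if the block is fine, else the first FsError detected.

    read_block is a hypothetical callable standing in for the storage
    device; it returns the raw bytes or raises OSError.
    """
    try:
        raw = read_block()
    except OSError:
        # The device itself reported a failure.
        return FsError.IO

    (stored_crc,) = struct.unpack("<I", raw[:4])
    payload = raw[4:]
    if zlib.crc32(payload) != stored_crc:
        # The device returned data, but not the data that was checksummed.
        return FsError.INTEGRITY

    (entry_count,) = struct.unpack("<I", payload)
    if entry_count > max_entries:
        # The checksum is valid, but the contents make no sense
        # (for example, because buggy code wrote them out).
        return FsError.STRUCTURAL

    return None
```

A block built with `make_block(1000)` has a perfectly valid checksum but an impossible entry count, so it comes back as a structural error rather than an integrity error. Flipping a bit in the payload of `make_block(3)` gives an integrity error instead.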

Given all of this, you can see that a filesystem that counts 'errors' without being more specific is rather unclear. Is this a count of all errors that the filesystem can detect, including I/O errors? Is this a count of all structural errors regardless of their cause, even if they come from detected (and logged) I/O errors or integrity errors? If a filesystem counts integrity errors somehow, does that count include failed integrity checks which at least implicitly happen when there are I/O errors?

(There are situations where you can experience I/O errors on I/O that's only necessary to verify the integrity of data, not to return the requested data. You might reasonably count this as both an I/O error and an integrity error, as opposed to the situation where you have an I/O error on data that's directly necessary.)
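A small sketch can show how much the answers to these questions change the number you see. Everything here is hypothetical: the event list, the `tally` function, and the two counting policies are made up for illustration and don't describe any particular filesystem. Each event lists the kinds of error it involved, with the root cause first.

```python
from collections import Counter


def tally(events, root_cause_only):
    """Count errors per kind under one of two made-up policies.

    root_cause_only=True counts each event once, under its first
    (root cause) kind. root_cause_only=False counts every kind
    that the event involved.
    """
    counts = Counter()
    for kinds in events:
        if root_cause_only:
            counts[kinds[0]] += 1
        else:
            counts.update(kinds)
    return counts


events = [
    ("io",),                      # failed read of user data
    ("io", "integrity"),          # failed read of data needed only for verification
    ("integrity", "structural"),  # corrupted metadata caught by its checksum
    ("structural",),              # validly checksummed metadata written by buggy code
]

narrow = tally(events, root_cause_only=True)
broad = tally(events, root_cause_only=False)
```

With the same four events, the root-cause policy reports four errors in total and the count-everything policy reports six. Both are defensible, which is the problem with an unqualified 'errors' count.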

Any given filesystem that reports or counts errors is going to have an answer to all of these questions, but there is no single set of obvious and clearly correct answers. It varies on a filesystem by filesystem basis, so if you only hear that a filesystem is reporting 'errors', you don't know as much about what it's reporting as you might think.

tech/FilesystemsThreeErrorTypes written at 22:39:40

