Wandering Thoughts archives


Filesystems and progressive deletion of things

I recently read Taras Glek's Curious Case of Maintaining Sufficient Free Space with ZFS, where Glek noticed that ZFS wasn't immediately updating its space accounting information when things were deleted. This isn't necessarily surprising and I'm not sure it's unique to ZFS. In practice, I believe that many filesystems don't actually perform all steps of deleting a file at once (as we see it from the outside).

There are two conjoined problems for filesystems when deleting things. First, in order to really delete things from a filesystem, you need to know what they are. So to delete a file, the filesystem needs to know specifically what disk blocks the file uses so that it can mark them as free in whatever data structures it uses to track free space. This information about what disk blocks are used is not necessarily in memory; in fact, very little about the file may be in memory. This means that in order to delete the file, the filesystem may need to read a bunch of data about it off the disks and then process it. For large files, there can be several levels of this data in a tree structure of indirect blocks. This isn't necessarily a fast process, especially if the system uses HDDs and is already under IO pressure.

(Generally each indirect block you have to read from disk requires a seek, and HDDs can still only do on the order of 100 seeks a second. SATA and SAS SSDs are much faster, and NVMe SSDs faster still, but there is still some latency and delay for each block.)
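
To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes a classic indirect-block layout with 4 KiB blocks and 8-byte block pointers (so 512 pointers per indirect block), ignores any direct pointers held in the inode, and assumes nothing is cached so that every indirect block costs one seek. Real filesystems, especially extent-based ones and ZFS with its larger block sizes, will differ; the function name and the parameters are made up for illustration.

```python
def indirect_blocks(file_bytes, block_size=4096, pointer_size=8):
    """Count the indirect blocks needed to map a file of file_bytes.

    Simplified model (an assumption, not any particular filesystem):
    a pure tree of indirect blocks with no direct pointers in the inode.
    """
    pointers_per_block = block_size // pointer_size
    # Number of data blocks, rounded up.
    blocks = -(-file_bytes // block_size)
    total = 0
    # Each level of the tree needs one pointer per block in the level below.
    while blocks > 1:
        blocks = -(-blocks // pointers_per_block)
        total += blocks
    return total


def seconds_to_read(n_blocks, seeks_per_second=100):
    """Worst-case time if every indirect block costs one HDD seek."""
    return n_blocks / seeks_per_second


one_tib = 2 ** 40
n = indirect_blocks(one_tib)
print(n, "indirect blocks,", seconds_to_read(n), "seconds at 100 seeks/sec")
```

Under these assumptions a 1 TiB file has a bit over half a million indirect blocks, which is close to an hour and a half of pure seeking on an HDD. In practice the indirect blocks are often laid out near each other and some are cached, so this is an upper bound on the model rather than a prediction.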

The second problem is that this information about what disk blocks are in use may be larger than you want to hold in memory and process at once (especially when combined with all of the metadata for free filesystem blocks that you're about to update). You can reduce memory usage (and perhaps complexity) by freeing the file's disk blocks in conveniently sized batches. In a filesystem with some kind of journaling (including ZFS), this can also reduce the size of the journal record(s) you need to commit in order to make things work.
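
As a minimal sketch of the batching idea, here is a toy model in Python. The free space map is just a set of block numbers and each batch stands in for one small transaction; the function name, the batch size, and the data structures are all hypothetical and not how any real filesystem represents things.

```python
from itertools import islice


def delete_in_batches(file_blocks, free_blocks, batch_size=1024):
    """Move a file's blocks into the free set one batch at a time.

    file_blocks: iterable of block numbers (can be lazy, so the whole
                 list never has to be in memory at once).
    free_blocks: a set standing in for the free space map.
    Yields the number of blocks freed in each batch; in a journaling
    filesystem, each batch would be roughly one bounded-size commit.
    """
    blocks = iter(file_blocks)
    while True:
        batch = list(islice(blocks, batch_size))
        if not batch:
            break
        free_blocks.update(batch)
        yield len(batch)


free_map = set()
for freed in delete_in_batches(range(10), free_map, batch_size=4):
    print("freed", freed, "blocks; free map now has", len(free_map))
```

The point of using a generator here is that only one batch of block numbers is held at a time, and anything watching the free map between batches sees a partially deleted file.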

This progressive deletion is mostly invisible to people, but one place that it can materialize is in filesystem space accounting and space allocation. If you're freeing blocks and updating metadata in batches, it's natural to update the visible information about disk space used and free in batches too, rather than try to do it all at the end (or worse, all at the start). This is probably especially the case if you're committing things in batches too.
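
From the outside, one way to look for this is to sample the filesystem's reported free space a few times after deleting something large. The sketch below uses Python's os.statvfs() (Unix only); the helper names, the sample count, and the interval are arbitrary choices for illustration, and whether you actually see the numbers creep upward depends on the filesystem and how big the file was.

```python
import os
import time


def free_bytes(path):
    """Free space available to unprivileged users on path's filesystem."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def watch_free_space(path, samples=5, interval=0.5):
    """Sample the reported free space several times and return the readings."""
    readings = []
    for _ in range(samples):
        readings.append(free_bytes(path))
        time.sleep(interval)
    return readings


# Example use: delete a large file on the filesystem holding ".", then
# see whether the reported free space keeps changing afterward.
# os.unlink("some-large-file")   # hypothetical file name
print(watch_free_space(".", samples=3, interval=0.1))
```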

(A filesystem does generally know how many blocks of disk space a file takes up, so it can choose to update the accounting information right away at the start of the deletion. But then it creates a situation where not all of the claimed free space is actually usable right now, although there are other workarounds for that.)
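
To make the gap between claimed and usable free space concrete, here is a toy model of the "credit everything at the start" approach. The class, its fields, and the block counts are invented for illustration; no real filesystem is being described.

```python
class SpaceAccounting:
    """Toy accounting that credits a deleted file's space immediately."""

    def __init__(self, total_blocks, used_blocks):
        self.total = total_blocks
        self.used = used_blocks
        # Blocks already counted as free but not yet actually reclaimed.
        self.pending = 0

    def start_delete(self, file_blocks):
        # Update the visible numbers right away, before any real work.
        self.used -= file_blocks
        self.pending += file_blocks

    def reclaim_batch(self, batch_blocks):
        # Background work: actually free one batch of blocks.
        done = min(batch_blocks, self.pending)
        self.pending -= done
        return done

    @property
    def reported_free(self):
        return self.total - self.used

    @property
    def usable_free(self):
        return self.reported_free - self.pending


acct = SpaceAccounting(total_blocks=100, used_blocks=80)
acct.start_delete(50)
print(acct.reported_free, acct.usable_free)   # claimed vs. usable right now
acct.reclaim_batch(20)
print(acct.reported_free, acct.usable_free)
```

Here the reported free space jumps as soon as the delete starts, while the usable free space only catches up as batches are reclaimed. A write that needs more than the usable amount would have to wait or fail even though the reported number says there is room.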

PS: It's also possible to have deletion happen asynchronously from the perspective of user level programs, where their calls to 'unlink()' return almost immediately while the files are actually deleted in the background. But I don't know if any filesystems actually do this.
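
(To illustrate what this would look like from the program's side, here is a user-level approximation in Python that hands the real unlink() off to a background thread. This is not a filesystem doing asynchronous deletion; it only mimics the "return almost immediately" behavior, and the function names are made up.)

```python
import os
import tempfile
import threading


def async_unlink(path):
    """Start deleting path in the background and return right away.

    Returns the worker thread so callers can join() it if they need to
    know when the deletion has actually finished.
    """
    worker = threading.Thread(target=os.unlink, args=(path,))
    worker.start()
    return worker


def demo():
    """Create a scratch file, delete it asynchronously, confirm it is gone."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    worker = async_unlink(path)
    worker.join()   # only here do we know the unlink() has completed
    return not os.path.exists(path)


print(demo())
```

One visible difference from a real unlink() is that until the worker runs, the name is still present in the directory, and any error from the deletion is raised in the background thread rather than reported to the caller.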

tech/FilessytemProgressiveDelete written at 22:40:51

