The pending delete problem for Unix filesystems

May 15, 2015

Unix has a number of somewhat annoying filesystem semantics that tend to irritate designers and implementors of filesystems. One of the famous ones is that you can delete a file without losing access to it. On at least some OSes, if your program open()s a file and then tries to delete it, either the deletion fails with 'file is in use' or you immediately lose access to the file; further attempts to read or write it will fail with some error. On Unix your program retains access to the deleted file and can even pass this access to other processes in various ways. Only when the last process using the file closes it will the file actually get deleted.

This 'use after deletion' presents Unix and filesystem designers with the problem of how you keep track of this in the kernel. The historical and generic kernel approach is to keep both a link count and a reference count for each active inode; an inode is only marked as unused and the filesystem told to free its space when both counts go to zero. Deleting a file via unlink() just lowers the link count (and removes a directory entry); closing open file descriptors is what lowers the reference count. This historical approach ignored the possibility of the system crashing while an inode had become unreachable through the filesystem and was only being kept alive by its reference count; if this happened the inode became a zombie, marked as active on disk but not referred to by anything. To fix it you had to run a filesystem checker, which would find such no-link inodes and actually deallocate them.

(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)

Obviously this is not suitable for any sort of journaling or 'always consistent' filesystem that wants to avoid the need for a fsck after unclean shutdowns. All such filesystems must keep track of such 'deleted but not deallocated' files on disk using some mechanism (and the kernel has to support telling filesystems about such inodes). When the filesystem is unmounted in an orderly way, these deleted files will probably get deallocated. If the system crashes, part of bringing the filesystem up on boot will be to apply all of the pending deallocations.

Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.

(Asynchronous deallocation is popular because it means your process can unlink() a big file without having to stall while the kernel frantically runs around finding all of the file's data blocks and then marking them all as free. Given that finding out what a file's data blocks are often requires reading things from disk, such deallocations can be relatively slow under disk IO load (even if you don't have other issues there).)

PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).

Written on 15 May 2015.
« In Go, you need to always make sure that your goroutines will finish
The ZFS delete queue: ZFS's solution to the pending delete problem »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri May 15 01:02:45 2015
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.