The pending delete problem for Unix filesystems
Unix has a number of somewhat annoying filesystem semantics that
tend to irritate designers and implementors of filesystems. One of
the famous ones is that you can delete a file without losing access
to it. On at least some OSes, if your program
open()s a file and
then tries to delete it, either the deletion fails with 'file is
in use' or you immediately lose access to the file; further attempts
to read or write it will fail with some error. On Unix your program
retains access to the deleted file and can even pass this access
to other processes in various ways. Only when the last process using
the file closes it will the file actually get deleted.
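This behavior is easy to see for yourself. Here is a small sketch in Python (using a temporary file; the variable names are mine, not anything standard):

```python
import os
import tempfile

# Create a scratch file and keep an open file descriptor to it.
fd, path = tempfile.mkstemp()
os.write(fd, b"still here")

# Delete the file while it's still open. On Unix this succeeds and
# only removes the directory entry; the inode itself lives on.
os.unlink(path)
assert not os.path.exists(path)

# The open descriptor still works fine for reads and writes.
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 100)
assert data == b"still here"

# Only this close drops the last reference; now the space is freed.
os.close(fd)
```

On an OS with 'file is in use' semantics, it's the `os.unlink()` call (or the subsequent read) that would fail instead.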
This 'use after deletion' presents Unix and filesystem designers
with the problem of how to keep track of this in the kernel. The
historical and generic kernel approach is to keep both a link count
and a reference count for each active inode; an inode is only marked
as unused and the filesystem told to free its space when both counts
go to zero. Deleting a file via
unlink() just lowers the link
count (and removes a directory entry); closing open file descriptors
is what lowers the reference count. This historical approach ignored
the possibility of the system crashing while an inode had become
unreachable through the filesystem and was only being kept alive
by its reference count; if this happened the inode became a zombie,
marked as active on disk but not referred to by anything. To fix
it you had to run a filesystem checker, which would
find such no-link inodes and actually deallocate them.
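You can watch the link count side of this from user space, since `fstat()` reports an inode's current link count. A minimal sketch (Python again; behavior as on Linux):

```python
import os
import tempfile

# mkstemp gives us both a name and an already-open descriptor.
fd, path = tempfile.mkstemp()

# While the directory entry exists, the inode has one link.
assert os.fstat(fd).st_nlink == 1

# unlink() removes the directory entry and lowers the link count...
os.unlink(path)
st = os.fstat(fd)
assert st.st_nlink == 0

# ...but our descriptor's reference keeps the inode alive until close,
# at which point the kernel sees both counts at zero and frees it.
os.close(fd)
```

The kernel's reference count isn't directly visible this way, but it's what keeps the zero-link inode usable between the `unlink()` and the `close()`.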
(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)
Obviously this is not suitable for any sort of journaling or 'always
consistent' filesystem that wants to avoid the need for a filesystem
check after unclean shutdowns. All such filesystems must keep track of
such 'deleted but not deallocated' files on disk using some mechanism
(and the kernel has to support telling filesystems about such
inodes). When the filesystem is unmounted in an orderly way, these
deleted files will probably get deallocated. If the system crashes,
part of bringing the filesystem up on boot will be to apply all of
the pending deallocations.
Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.
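As a toy illustration of the journal-based approach (the record names and structure here are invented for the example, not any real filesystem's on-disk format), replay after a crash just scans for such 'orphan' records and finishes the deallocations:

```python
# A journal modeled as a list of records. An 'orphan' record marks an
# inode that reached zero links while still held open by a process.
journal = [
    ("write", 12, b"..."),
    ("orphan", 47),          # inode 47: deleted but not yet deallocated
    ("write", 13, b"..."),
]

allocated_inodes = {12, 13, 47}

def replay(journal, allocated):
    # After a crash, every process holding an orphan inode open is
    # gone, so replay can safely deallocate each one now.
    for record in journal:
        if record[0] == "orphan":
            allocated.discard(record[1])
    return allocated

replay(journal, allocated_inodes)
assert 47 not in allocated_inodes
assert 12 in allocated_inodes
```

A real filesystem would of course also free the inode's data blocks here, which is where the cost discussed below comes in.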
(Asynchronous deallocation is popular because it means your process
can unlink() a big file without having to stall while the kernel
frantically runs around finding all of the file's data blocks and
then marking them all as free. Given that finding out what a file's
data blocks are often requires reading things from disk, such deallocations can be relatively
slow under disk IO load (even if you don't have other issues there).)
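The shape of asynchronous deallocation can be sketched with a work queue and a background worker (a deliberately simplified model; real kernels do this with their own deferred-work machinery, not threads like this):

```python
import queue
import threading

# unlink() just queues the block list; a background worker walks it
# later, so the unlinking process never waits for the walk.
free_queue = queue.Queue()
freed_blocks = []

def deallocator():
    while True:
        blocks = free_queue.get()
        if blocks is None:           # shutdown sentinel for this demo
            break
        # Stand-in for the slow part: finding and marking blocks free,
        # which may itself require reading metadata from disk.
        freed_blocks.extend(blocks)

worker = threading.Thread(target=deallocator)
worker.start()

# The 'unlink' side returns immediately after this enqueue.
free_queue.put([101, 102, 103])

free_queue.put(None)
worker.join()
assert freed_blocks == [101, 102, 103]
```

The pending-deallocation records on disk are what make this safe: if the system crashes before the worker gets to a file, replay picks the work back up.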
PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).
Written on 15 May 2015.