The pending delete problem for Unix filesystems

May 15, 2015

Unix has a number of filesystem semantics that tend to irritate the designers and implementors of filesystems. One of the famous ones is that you can delete a file without losing access to it. On at least some OSes, if your program open()s a file and then tries to delete it, either the deletion fails with 'file is in use' or you immediately lose access to the file; further attempts to read or write it will fail with some error. On Unix your program retains access to the deleted file and can even pass this access to other processes in various ways. Only when the last process using the file closes it will the file actually get deleted.
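
To make this concrete, here's a minimal C demonstration of the Unix behaviour (the scratch file name is my own, and error handling is trimmed to the basics):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Create a scratch file and keep a descriptor for it. */
        int fd = open("scratch.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Remove the name; the file lives on because fd references it. */
        if (unlink("scratch.tmp") < 0) { perror("unlink"); return 1; }

        /* Reads and writes through the open descriptor still work. */
        const char *msg = "still here\n";
        write(fd, msg, strlen(msg));
        lseek(fd, 0, SEEK_SET);

        char buf[64];
        ssize_t n = read(fd, buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; fputs(buf, stdout); }

        /* Only this close() drops the last use and frees the space. */
        close(fd);
        return 0;
    }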

This 'use after deletion' behaviour presents Unix and filesystem designers with the problem of how to keep track of it in the kernel. The historical and generic kernel approach is to keep both a link count and a reference count for each active inode; an inode is only marked as unused and the filesystem told to free its space when both counts go to zero. Deleting a file via unlink() just lowers the link count (and removes a directory entry); closing open file descriptors is what lowers the reference count. This historical approach ignored the possibility of the system crashing while an inode had become unreachable through the filesystem and was only being kept alive by its reference count; if this happened, the inode became a zombie, marked as active on disk but not referred to by anything. To fix it you had to run a filesystem checker, which would find such no-link inodes and actually deallocate them.
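
In outline the rule looks something like this. This is a schematic sketch loosely modeled on the traditional kernel's iput() path, not code from any real kernel, and free_inode_and_blocks() is a hypothetical helper:

    /* Schematic only; not any real kernel's code. */
    struct inode {
        int i_nlink;   /* on-disk link count: directory entries pointing here */
        int i_count;   /* in-memory reference count: open descriptors etc. */
        /* ... block pointers, ownership, and so on ... */
    };

    void free_inode_and_blocks(struct inode *ip);   /* hypothetical helper */

    /* Called whenever an in-kernel reference is dropped, e.g. on close(). */
    void iput(struct inode *ip) {
        ip->i_count--;
        if (ip->i_count == 0 && ip->i_nlink == 0) {
            /* No names and no users left: only now is the space freed.
               A crash after i_nlink reached 0 but before we get here
               leaves a zombie inode for the filesystem checker. */
            free_inode_and_blocks(ip);
        }
    }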

(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)

Obviously this is not suitable for any sort of journaling or 'always consistent' filesystem that wants to avoid the need for an fsck after unclean shutdowns. All such filesystems must keep track of these 'deleted but not deallocated' files on disk through some mechanism (and the kernel has to support telling filesystems about such inodes). When the filesystem is unmounted in an orderly way, these deleted files will probably get deallocated. If the system crashes, part of bringing the filesystem up on boot will be to apply all of the pending deallocations.
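
As one concrete example, ext3 and ext4 track these files in an 'orphan list' of inodes chained from the superblock. Here's a rough sketch of what replaying it at mount time could look like; the names are invented and this is not ext4's actual code:

    #include <sys/types.h>

    /* Invented stand-ins for real filesystem internals. */
    struct inode {
        int i_nlink;
        ino_t i_next_orphan;    /* the chain runs through the inodes */
    };
    struct superblock { ino_t s_first_orphan; };

    struct inode *read_inode(struct superblock *sb, ino_t ino);
    void free_inode_and_blocks(struct inode *ip);
    void clear_orphan(struct superblock *sb, struct inode *ip);

    /* Walk the chain of 'unlinked but still open at crash time' inodes
       recorded in the superblock and finish their interrupted deletes. */
    void replay_pending_deletes(struct superblock *sb) {
        ino_t ino = sb->s_first_orphan;
        while (ino != 0) {
            struct inode *ip = read_inode(sb, ino);
            ino = ip->i_next_orphan;
            if (ip->i_nlink == 0)
                free_inode_and_blocks(ip);
            clear_orphan(sb, ip);
        }
    }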

Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.

(Asynchronous deallocation is popular because it means your process can unlink() a big file without having to stall while the kernel frantically runs around finding all of the file's data blocks and then marking them all as free. Given that finding out what a file's data blocks are often requires reading things from disk, such deallocations can be relatively slow under disk IO load (even if you don't have other issues there).)
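A toy sketch of what such delayed asynchronous deallocation can look like, again with invented names; unlink()'s caller pays only for a queue insertion, and a background kernel thread does the slow block walking:

    struct inode;
    struct work_queue;

    /* Hypothetical queue and filesystem helpers. */
    void queue_push(struct work_queue *q, struct inode *ip);
    struct inode *queue_pop(struct work_queue *q);   /* NULL when empty */
    void free_inode_and_blocks(struct inode *ip);

    static struct work_queue *dealloc_queue;

    void schedule_deallocation(struct inode *ip) {
        queue_push(dealloc_queue, ip);    /* cheap; unlink() returns now */
    }

    void dealloc_worker(void) {           /* runs as a kernel thread */
        struct inode *ip;
        while ((ip = queue_pop(dealloc_queue)) != NULL)
            free_inode_and_blocks(ip);    /* may block on disk reads */
    }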

PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).


Comments on this page:

By dozzie at 2015-05-15 03:07:37:

On the other hand, not being able to remove an opened file explodes into a whole bunch of other problems in a different place. Look at how Windows updates work and how annoying they are, precisely because you can't remove a DLL/EXE while it is still open.

By Alan at 2015-05-15 10:22:49:

It's been very convenient for system software. I think most developers won't want to break quick updates on servers & well-understood tech. But the promised future is updates as an atomic transaction... which replaces the entire container hosting your app :). Ref-counting of individual inodes not required. At all.

Note you don't need to remove open executables, which is what causes the filesystem problem above. Just move them & then schedule daily cleanups; easy (using an atomic sys_swap_files(), a.k.a. the Linux renameat2() proposal). Pretty confident Windows doesn't let you move open files either :).
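
For the curious, a rough sketch of that swap using the renameat2() that Linux actually merged (in the kernel since 3.15, with a glibc wrapper since 2.28):

    #define _GNU_SOURCE
    #include <fcntl.h>   /* AT_FDCWD */
    #include <stdio.h>   /* renameat2() and RENAME_EXCHANGE with glibc 2.28+ */

    /* Atomically exchange the staged new binary with the live one.
       Running processes keep their open (old) inode; the displaced old
       file can be cleaned up later. On older glibc you would instead
       use syscall(SYS_renameat2, ...) with RENAME_EXCHANGE from
       <linux/fs.h>. */
    int swap_in_update(const char *live, const char *staged) {
        return renameat2(AT_FDCWD, staged, AT_FDCWD, live, RENAME_EXCHANGE);
    }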

Deleting open files is a hard feature to give up. But in general the path/inode distinction looks more like a historical accident than a key feature. Hardlinks are not only deprecated for UX reasons; their main use today is deprecated by the Linux filesystem layer. They're an implementation detail (of rename()) that didn't need to be exposed (like open() on directories). You can still force deletion using lsof+kill or by implementing sys_revoke() :).

Not that the Windows model got it right. The obvious question is [what if a background service (like file sync) has opened the file you want to delete?](http://stackoverflow.com/questions/3764072/c-win32-how-to-wait-for-a-pending-delete-to-complete#comment29866993_3764072) Answer: your program fails, and it's very annoying to troubleshoot.

Examples: Firefox didn't like online updates; it seemed to lose XUL and break. Google Chrome devs in particular complain about the Linux model being a pain to implement (they integrate with distro update systems). GNOME has implemented offline updates... from what I read, mainly out of concern about interrupting desktop services. And if you try to update your PHP app, ongoing requests will race against, e.g., class definitions being replaced in the wrong order.
