Wandering Thoughts archives

2015-05-15

Your Illumos-based NFS fileserver may be 'leaking' deleted files

By now you may have guessed the punchline of my sudden interest in ZFS delete queues: we had a problem with ZFS leaking space for deleted files that was ultimately traced down to an issue with pending deletes that our fileserver wasn't cleaning up when it should have been.

As a well-debugged filesystem, ZFS should not outright leak pending deletions, where there are no remaining references anywhere yet the files haven't been cleaned up (well, more or less; snapshots come into the picture, as mentioned). However, it's possible for both user-level and kernel-level things to hold references to now-deleted files in the traditional way and thus keep them from being actually removed. User-level things holding files open should be visible in, eg, fuser, and anyway this is a well-known issue that savvy people will immediately ask you about. Kernel-level things may be less visible, and there is at least one in mainline Illumos and thus OmniOS r151014 (the current release as I write this entry).

Per George Wilson on the illumos-zfs mailing list here, Delphix found that the network lock manager (the nlockmgr SMF service) could hold references to (deleted) files under some circumstances (see the comment in their fix). Under the right conditions this can cause significant space lossage over time; we saw loss rates of 5 GB a week. The workaround is to restart nlockmgr; the restart drops the old references and thus allows ZFS to actually remove the files and free up potentially significant amounts of your disk space. Rebooting the whole server will do it too, for obvious reasons, but is somewhat less graceful.

(Restarting nlockmgr is said to be fully transparent to clients, but we have not attempted to test that. When we did our nlockmgr restart we did as much as possible to make any locking failures a non-issue.)
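
(For the record, the restart itself is a single svcadm operation. A minimal sketch, assuming the lock manager's usual Illumos FMRI; the svcs line first confirms the service's full name and current state:)

# svcs -l nlockmgr
# svcadm restart svc:/network/nfs/nlockmgr:default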

As far as I know there is no kernel-level equivalent of fuser, something that would let you list all currently active kernel-level references to files in a particular filesystem (never mind which kernel subsystem is holding them). I'd love to be wrong here; it's an annoying gap in Illumos's observability.

solaris/ZFSDeleteQueueNLMLeak written at 23:29:05

The ZFS delete queue: ZFS's solution to the pending delete problem

Like every other always-consistent filesystem, ZFS needs a solution to the Unix pending delete problem (files that have been deleted on the filesystem but that are still in use). ZFS's solution is implemented with a type of internal ZFS object called the 'ZFS delete queue', which holds a reference to any and all ZFS objects that are pending deletion. You can think of it as a kind of directory (and technically it's implemented with the same underlying storage as directories are, namely a ZAP store).

Each filesystem in a ZFS pool has its own ZFS delete queue object, holding pending deletes for objects that are in (or were originally in) that filesystem. Each snapshot has a ZFS delete queue as well, because the current state of a filesystem's ZFS delete queue is captured as part of making a snapshot. This capture of delete queues in snapshots has some interesting consequences; the short version is that once a delete queue with entries is captured in a snapshot, the space used by those pending deleted objects cannot be released until the snapshot itself is deleted.

(I'm not sure that this space usage is properly accounted for in the 'usedby*' space usage properties, but I haven't tested this specifically.)

There is no simple way to find out how big the ZFS delete queue is for a given filesystem. Instead you have to use the magic zdb command to read it out, using 'zdb -dddd DATASET OBJNUM' to dump details of individual ZFS objects. This lets you see how many ZAP entries a filesystem's 'ZFS delete queue' object has, and the number of current ZAP entries is the number of pending deletions. See the sidebar for full details, because it gets long and tedious.

(In some cases it will be blatantly obvious that you have some sort of problem because df and 'zfs list' and so on report very different space numbers than eg du does, and you don't have any of the usual suspects like snapshots.)
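
(A sketch of that sort of comparison, with a hypothetical mount point of /h/281; 'zfs list' and df report the space the filesystem believes is in use, while du only counts files that are still visible in the directory tree:)

# zfs list fs3-corestaff-01/h/281
# df -h /h/281
# du -sh /h/281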

Things in the ZFS delete queue still count towards per-user and per-group space usage and against quotas, which makes sense because they're not quite deleted yet. If you use 'zfs userspace' or 'zfs groupspace' for space tracking and reporting purposes, this can result in potentially misleading numbers, especially if pending deletions are 'leaking' (which can happen). If you actually have and enforce per-user or per-group quotas, well, you can wind up with users or groups that are hitting quota limits for no readily apparent reason.
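
For illustration, the per-user and per-group accounting for a filesystem (here the dataset used in the sidebar below) is reported like this:

# zfs userspace fs3-corestaff-01/h/281
# zfs groupspace fs3-corestaff-01/h/281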

(Needing to add things to the ZFS delete queue has apparently caused problems on full filesystems at least in the past, per this interesting opensolaris discussion from 2006.)

Sidebar: A full example of finding how large a ZFS delete queue is

To dump the ZFS delete queue for a filesystem, first you need to know what its object number is; this is usually either 2 (for sufficiently old filesystems) or 3 (for newer ones), but the sure way to find out is to look at the ZFS master node for the filesystem (which is always object 1). So to start with, we'll dump the ZFS master node to find out the object number of the delete queue.

# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         1    1    16K     1K     8K     1K  100.00  ZFS master node
[...]
        microzap: 512 bytes, 3 entries

                DELETE_QUEUE = 2 
[...]

The object number of this filesystem's delete queue is 2 (it's an old filesystem, having been originally created on Solaris 10). So we can dump the ZFS delete queue:

# zdb -dddd fs3-corestaff-01/h/281 2
Dataset [...]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         2    2    16K    16K   144K   272K  100.00  ZFS delete queue
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 16
        Fat ZAP stats:
[...]
                ZAP entries: 5
[...]
                3977ca = 3766218 
                3977da = 3766234 
                397a8b = 3766923 
                397a87 = 3766919 
                397840 = 3766336 

(The final list here is the ZAP entries themselves, going from a key on the left (which, as you can see, is just the object number in hexadecimal) to the ZFS object numbers on the right. If we wanted to, we could use these object numbers to inspect (or even read out) the actual things that are pending deletion. This is probably most useful for finding out how large they are and thus how much space they should be consuming.)
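
For example, to inspect the first pending object in the list above you would point zdb at its object number, using the same form of command as before:

# zdb -dddd fs3-corestaff-01/h/281 3766218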

There are two different forms of ZAPs and zdb reports how many entries they have somewhat differently. In the master node we saw a 'microzap', used when the ZAP is and always has been small. Here we see a 'Fat ZAP', which is what a small ZAP turns into if at some point it grows big enough. Once the ZFS delete queue becomes a fat ZAP it stays that way even if it later only has a few entries, as we see here.

In this case the ZFS delete queue for this filesystem holds only five entries, which is not particularly excessive or alarming. Our problem filesystem had over ten thousand entries by the time we resolved the issue.

PS: You can pretty much ignore the summary line with its pretty sizes; as we see here, they have very little to do with how many delete queue entries you have right now. A growing ZFS delete queue size may be a problem indicator, but here the only important thing in the summary is the type field, which confirms that we have the right sort of objects both for the ZFS master node and the ZFS delete queue.

PPS: You can also do this exercise for snapshots of filesystems; just use the full snapshot name instead of the filesystem.
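
In concrete terms that's just the same zdb invocation with the snapshot name appended to the dataset, eg:

# zdb -dddd fs3-corestaff-01/h/281@SNAPNAME 1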

(I'm not going to try to cover zdb usage details at all, partly because I'm just flailing around with it. See Ben Rockwood's zdb: Examining ZFS At Point-Blank Range for one source of more information.)

solaris/ZFSDeleteQueue written at 23:28:11

The pending delete problem for Unix filesystems

Unix has a number of somewhat annoying filesystem semantics that tend to irritate designers and implementors of filesystems. One of the famous ones is that you can delete a file without losing access to it. On at least some OSes, if your program open()s a file and then tries to delete it, either the deletion fails with 'file is in use' or you immediately lose access to the file; further attempts to read or write it will fail with some error. On Unix your program retains access to the deleted file and can even pass this access to other processes in various ways. Only when the last process using the file closes it will the file actually get deleted.
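
As a quick shell illustration of these semantics (a sketch; the scratch file name is arbitrary):

# create a scratch file and hold it open on file descriptor 3
echo "still here" >/tmp/pending-demo
exec 3</tmp/pending-demo
# delete it; the name is gone, but the open descriptor keeps the file alive
rm /tmp/pending-demo
# reading through the descriptor still works and prints 'still here'
cat <&3
# only when the last reference goes away is the file actually deallocated
exec 3<&-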

This 'use after deletion' presents Unix and filesystem designers with the problem of how to keep track of it in the kernel. The historical and generic kernel approach is to keep both a link count and a reference count for each active inode; an inode is only marked as unused and the filesystem told to free its space when both counts go to zero. Deleting a file via unlink() just lowers the link count (and removes a directory entry); closing open file descriptors is what lowers the reference count. This historical approach ignored the possibility of the system crashing while an inode had become unreachable through the filesystem and was only being kept alive by its reference count; if this happened, the inode became a zombie, marked as active on disk but not referred to by anything. To fix it you had to run a filesystem checker, which would find such no-link inodes and actually deallocate them.

(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)

Obviously this is not suitable for any sort of journaling or 'always consistent' filesystem that wants to avoid the need for a fsck after unclean shutdowns. All such filesystems must keep track of such 'deleted but not deallocated' files on disk using some mechanism (and the kernel has to support telling filesystems about such inodes). When the filesystem is unmounted in an orderly way, these deleted files will probably get deallocated. If the system crashes, part of bringing the filesystem up on boot will be to apply all of the pending deallocations.

Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.

(Asynchronous deallocation is popular because it means your process can unlink() a big file without having to stall while the kernel frantically runs around finding all of the file's data blocks and then marking them all as free. Given that finding out what a file's data blocks are often requires reading things from disk, such deallocations can be relatively slow under disk IO load (even if you don't have other issues there).)

PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).

unix/UnixPendingDeleteProblem written at 01:02:45

