2015-05-17
A bit more on the ZFS delete queue and snapshots
In my entry on ZFS delete queues, I mentioned that a filesystem's delete queue is captured in snapshots and so the space used by pending deletes is held by snapshots. A commentator then asked:
So in case someone uses zfs send/receive for backup he accidentially stores items in the delete queue?
This is important enough to say explicitly: YES. Absolutely.
Since it's part of a snapshot, the delete queue and all of the space
it holds will be transferred if you use zfs send to move a
filesystem snapshot elsewhere for whatever reason. Full backups,
incremental backups, migrating a filesystem, they all copy all of
the space held by the delete queue (and then keep it allocated on
the received side).
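To make this concrete, the transfer itself is nothing special; here's a sketch with made-up names ('tank', 'backuphost', and 'backup' are all hypothetical):

# zfs snapshot tank/somefs@transfer
# zfs send tank/somefs@transfer | ssh backuphost zfs receive backup/somefs

Everything that was on the delete queue when @transfer was taken rides along in the stream and stays allocated in backup/somefs on the receiving side.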
This has two important consequences. The first is that if you
transfer a filesystem with a heavy space loss due to things being
held in the delete queue for whatever reason,
you can get a very head-scratching result. If you don't actually
mount the received dataset, you'll wind up with a dataset whose
space is all reported as used by the dataset itself, not by snapshots,
yet if you 'zfs destroy' the transfer snapshot the dataset promptly
shrinks. Having gone through this experience myself, I can report
that it's a very WAT moment.
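If you want to see this for yourself, a sketch (hypothetical names again) is to compare the space breakdown before and after destroying the transfer snapshot:

# zfs list -o space backup/somefs
# zfs destroy backup/somefs@transfer
# zfs list -o space backup/somefs

In the puzzling case the first listing shows the space under USEDDS (used by the dataset) rather than USEDSNAP, and the second listing reports a much smaller USED once the snapshot, and the delete queue captured in it, is gone.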
The second important consequence is that apparently the moment you
mount the received dataset, the current live version will immediately
diverge from the snapshot (because ZFS wakes up, says 'ah, a delete
queue with no live references', and applies all of those pending
deletes). This is a problem if you're doing repeated incremental
receives, because the next incremental receive will tell you
'filesystem has diverged from snapshot, you'll have to tell me to
force a rollback'. On the other hand, if ZFS space accounting is
working right this divergence should transfer a bunch of the space
the filesystem is consuming into the usedbysnapshots category.
Still, this must be another head-scratching moment, as just mounting
a filesystem suddenly causes a (potentially big) swing in space
usage and a divergence from the snapshot.
(I have not verified this mounting behavior myself, but in retrospect
it may be the cause of some unexpected divergences we've experienced
while migrating filesystems. Our approach was always just to use
'zfs recv -F ...', which is perfectly viable if you're really
sure that you're not blowing your own foot off.)
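For what it's worth, the force-rollback version of an incremental transfer looks something like this (hypothetical names once more):

# zfs send -i tank/somefs@transfer1 tank/somefs@transfer2 | ssh backuphost zfs receive -F backup/somefs

The -F makes the receiving side roll backup/somefs back to @transfer1 before applying the incremental stream, silently discarding the divergence from mounting it along with any other changes made to the received copy, which is exactly the foot-shooting risk I mentioned.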
2015-05-15
Your Illumos-based NFS fileserver may be 'leaking' deleted files
By now you may have guessed the punchline of my sudden interest in ZFS delete queues: we had a problem with ZFS leaking space for deleted files that was ultimately traced down to an issue with pending deletes that our fileserver wasn't cleaning up when it should have been.
As a well-debugged filesystem, ZFS should not outright leak pending
deletions, where there are no remaining references anywhere yet the
files haven't been cleaned up (well, more or less; snapshots come
into the picture, as mentioned). However it's
possible for both user-level and kernel-level things to hold
references to now-deleted files in the traditional way and thus
keep them from being actually removed. User-level things holding
open files should be visible in, eg, fuser, and anyway this is
a well-known issue that savvy people will immediately ask you
about. Kernel level things may be less visible, and there is at
least one in mainline Illumos and thus OmniOS r151014 (the current
release as I write this entry).
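(As an aside, the user-level check mentioned above is the usual one. With a hypothetical mountpoint of /h/281, something like:

# fuser -cu /h/281

will list the processes (and their owners) that have files open on that filesystem, and pfiles on any reported process will show which specific files it's holding open.)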
Per George Wilson on the illumos-zfs mailing list here, Delphix
found that the network lock manager (the nlockmgr SMF service)
could hold references to (deleted) files under some circumstances
(see the comment in their fix).
Under the right circumstances this can cause significant space
lossage over time; we saw loss rates of 5 GB a week. This is worked
around by restarting nlockmgr; this restart drops the old references
and thus allows ZFS to actually remove the files and free up
potentially significant amounts of your disk space. Rebooting
the whole server will do it too, for obvious reasons, but is somewhat
less graceful.
(Restarting nlockmgr is said to be fully transparent to clients,
but we have not attempted to test that. When we did our nlockmgr
restart we did as much as possible to make any locking failures a
non-issue.)
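For completeness, the restart itself is just an ordinary SMF service restart:

# svcadm restart svc:/network/nfs/nlockmgr
# svcs nlockmgr

The second command is merely to confirm that the service came back online; once the old references are dropped, ZFS can get on with processing its delete queue.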
As far as I know there is no kernel-level equivalent of fuser,
something that could list even all currently active kernel-level
references to files in a particular filesystem (never mind which
kernel subsystem is holding each reference). I'd love to be wrong
here; it's an annoying gap in Illumos's observability.
The ZFS delete queue: ZFS's solution to the pending delete problem
Like every other always-consistent filesystem, ZFS needs a solution to the Unix pending delete problem (files that have been deleted on the filesystem but that are still in use). ZFS's solution is implemented with a type of internal ZFS object called the 'ZFS delete queue', which holds a reference to any and all ZFS objects that are pending deletion. You can think of it as a kind of directory (and technically it's implemented with the same underlying storage as directories are, namely a ZAP store).
Each filesystem in a ZFS pool has its own ZFS delete queue object, holding pending deletes for objects that are in (or were originally in) that filesystem. Also, each snapshot has a ZFS delete queue as well, because the current state of a filesystem's ZFS delete queue is captured as part of making a snapshot. This capture of delete queues in snapshots has some interesting consequences; the short version is that once a delete queue with entries is captured in a snapshot, the space used by those pending deleted objects cannot be released until the snapshot itself is deleted.
(I'm not sure that this space usage is properly accounted for in
the 'usedby*' space usage properties, but I haven't tested this
specifically.)
There is no simple way to find out how big the ZFS delete queue is
for a given filesystem. Instead you have to use the magic zdb
command to read it out, using 'zdb -dddd DATASET OBJNUM' to dump
details of individual ZFS objects so that you can find out how many
ZAP entries a filesystem's 'ZFS delete queue' object has; the number
of current ZAP entries is the number of pending deletions. See the
sidebar for full details, because it gets long and tedious.
(In some cases it will be blatantly obvious that you have some
sort of problem because df and 'zfs list' and so on report
very different space numbers than eg du does, and you don't
have any of the usual suspects like snapshots.)
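The blunt way to spot this situation is to compare the numbers directly; a sketch with a hypothetical mountpoint, using the dataset from the sidebar:

# df -h /h/281
# du -sh /h/281
# zfs list -r -t snapshot fs3-corestaff-01/h/281

If df and 'zfs list' report far more space in use than du can account for, and the last command shows no snapshots, the delete queue becomes a prime suspect.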
Things in the ZFS delete queue still count in and against per-user
and per-group space usage and quotas, which makes sense because
they're still not quite deleted. If you use 'zfs userspace' or
'zfs groupspace' for space tracking and reporting purposes this
can result in potentially misleading numbers, especially if pending
deletions are 'leaking' (which can happen).
If you actually have and enforce per-user or per-group quotas, well,
you can wind up with users or groups that are hitting quota limits
for no readily apparent reason.
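A quick sketch of checking this, again using the hypothetical dataset from the sidebar:

# zfs userspace -o name,used,quota fs3-corestaff-01/h/281
# zfs groupspace -o name,used,quota fs3-corestaff-01/h/281

A user or group whose 'used' figure is well above what a du of their files can explain may simply own a pile of pending deletions.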
(Needing to add things to the ZFS delete queue has apparently caused problems on full filesystems at least in the past, per this interesting opensolaris discussion from 2006.)
Sidebar: A full example of finding how large a ZFS delete queue is
To dump the ZFS delete queue for a filesystem, first you need to know what its object number is; this is usually either 2 (for sufficiently old filesystems) or 3 (for newer ones), but the sure way to find out is to look at the ZFS master node for the filesystem (which is always object 1). So to start with, we'll dump the ZFS master node to find out the object number of the delete queue.
# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]
Object lvl iblk dblk dsize lsize %full type
1 1 16K 1K 8K 1K 100.00 ZFS master node
[...]
microzap: 512 bytes, 3 entries
DELETE_QUEUE = 2
[...]
The object number of this filesystem's delete queue is 2 (it's an old filesystem, having been originally created on Solaris 10). So we can dump the ZFS delete queue:
# zdb -dddd fs3-corestaff-01/h/281 2
Dataset [...]
Object lvl iblk dblk dsize lsize %full type
2 2 16K 16K 144K 272K 100.00 ZFS delete queue
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 16
Fat ZAP stats:
[...]
ZAP entries: 5
[...]
3977ca = 3766218
3977da = 3766234
397a8b = 3766923
397a87 = 3766919
397840 = 3766336
(The final list here is the ZAP entries themselves, going from some magic key (on the left) to the ZFS object numbers on the right. If we wanted to, we could use these object numbers to inspect (or even read out) the actual things that are pending deletion. This is probably most useful to find out how large they are and thus how much space they should be consuming.)
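For example, to inspect the first pending object in the list above, you'd reuse the same zdb invocation with that object number:

# zdb -dddd fs3-corestaff-01/h/281 3766218

The dump should report the object's type (typically a ZFS plain file) along with its lsize and dsize, which tells you roughly how much space that particular pending deletion is holding down.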
There are two different forms of ZAPs and zdb reports how many
entries they have somewhat differently. In the master node we saw
a 'microzap', used when the ZAP is and always has been small. Here
we see a 'Fat ZAP', which is what a small ZAP turns into if at some
point it grows big enough. Once the ZFS delete queue becomes a fat
ZAP it stays that way even if it later only has a few entries, as
we see here.
In this case the ZFS delete queue for this filesystem holds only five entries, which is not particularly excessive or alarming. Our problem filesystem had over ten thousand entries by the time we resolved the issue.
PS: You can pretty much ignore the summary line with its pretty sizes;
as we see here, they have very little to do with how many delete queue
entries you have right now. A growing ZFS delete queue size may be
a problem indicator,
but here the only important thing in the summary is the type field,
which confirms that we have the right sort of objects both for the ZFS
master node and the ZFS delete queue.
PPS: You can also do this exercise for snapshots of filesystems; just use the full snapshot name instead of the filesystem.
(I'm not going to try to cover zdb usage details at all, partly
because I'm just flailing around with it. See Ben Rockwood's zdb:
Examining ZFS At Point-Blank Range for one
source of more information.)
2015-05-02
OmniOS as a NFS server has problems with sustained write loads
We have been hunting a serious OmniOS problem for some time. Today we finally have enough data that I feel I can say something definitive:
An OmniOS NFS server will lock up under (some) sustained write loads if the write volume is higher than its disks can sustain.
I believe that this issue is not specific to OmniOS; it's likely Illumos in general, and was probably inherited from OpenSolaris and Solaris 10. We've reproduced a similar lockup on our old fileservers, running Solaris 10 update 8.
Our current minimal reproduction is the latest OmniOS (r151014) on our standard fileserver hardware, with 1G networking added and with a test pool of a single mirrored vdev on two (local) 7200 RPM 2TB SATA disks. With both 1G networks being driven at basically full wire speed by a collection of NFS client systems writing out a collection of different files on that test pool, the system will run okay for a while and then suddenly enter a situation where system free memory nosedives abruptly and the amount of kernel memory used for things other than the ARC jumps massively. This leads immediately to a total system hang when the free memory hits rock bottom.
(This is more write traffic than the disks can sustain due to mirroring. We have 200 MBytes/sec of incoming NFS writes, which implies 200 MBytes/sec of writes to each disk. These disks appear to top out at 150 MBytes/sec at most, and that's probably only a burst figure.)
Through a series of relatively obvious tests that are too long to detail here (eg running only one network's worth of NFS clients), we're pretty confident that this system is stable under a write load that it can sustain. Overload is clearly not immediate death (within a few seconds or the like), so we assume that the system can survive sufficiently short periods of overload if the load drops afterwards. However we have various indications that it does not fully recover from such overloads for a long time (if ever).
(Death under sustained overload would explain many of the symptoms we've seen of our various fileserver problems (eg). The common element in all of the triggers is that they cause (or could cause) IO slowdowns: backend disks with errors, backend disks that are just slow to respond, full pools, apparently even pools hitting their quota limits, and 10G networking problems. A slowdown of IO would take a fileserver that was just surviving a current high client write volume and push it over the edge.)
The memory exhaustion appears to be related to a high and increasing level of outstanding incomplete or unprocessed NFS requests. We have some indication that increasing the number of NFS server threads helps stave off the lockup for a while, but we've had our test server lock up (in somewhat different test scenarios) with widely varying thread counts.
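(For reference, the NFS server thread count can be adjusted with sharectl; a sketch, assuming the nfs 'servers' property, which I believe is the counterpart of the old NFSD_SERVERS setting in /etc/default/nfs:

# sharectl get -p servers nfs
# sharectl set -p servers=1024 nfs
# svcadm restart svc:/network/nfs/server

The restart of nfs/server is there to make sure the new thread count actually takes effect; 1024 is just an illustrative number, not a recommendation.)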
In theory this shouldn't happen. An NFS server that is being overloaded should push back on the clients in various ways, not enter a death spiral of accepting all of their traffic, eating all its memory, and then locking up. In practice, well, we have a serious problem in production.
PS: Yes, I'll write something for the OmniOS mailing lists at some point. In practice tweets are easier than blog entries, which are easier than useful mailing list reports.