2014-10-31
With ZFS, rewriting a file in place might make you run out of space
Here's an interesting little issue that I confirmed recently: if you rewrite an existing file in place with random IO on a plain ZFS filesystem, you can wind up using extra space and even run out of space. This is a little bit surprising but is not a bug; it's just fallout from how ZFS works.
It's easy to see how this can happen if you have compression or deduplication turned on on the filesystem and you rewrite different data; the new data might compress or deduplicate less well than the old data and so use up more space. Deduplication might especially be prone to this if you initialize your file with something simple (zeroes, say) and then rewrite with actual data.
(The corollary to this is that continuously rewritten files like the storage for a database can take up a fluctuating amount of disk space over time on such a filesystem. This is one reason of several that we're unlikely to ever turn compression on on our fileservers.)
But this can happen even on filesystems without dedup or compression,
which is a little bit surprising. What's happening is the result of the
ZFS 'record size' (what many filesystems would call their block size).
ZFS has a variable record size, ranging from the minimum block size of
your disks up to the recordsize parameter, usually 128 KB. When you
write data, especially sequential data, ZFS will transparently aggregate
it together into large blocks; this makes both writes and reads more
efficient and so is a good thing.
So you start out by writing a big file sequentially; ZFS aggregates the data into 128 KB on-disk blocks, puts pointers to those blocks into the file's metadata, and so on. Now you come back later and rewrite the file using, say, 8 KB random IO. Because ZFS is a copy on write filesystem, it can't overwrite the existing data in place. Instead, every time you write over a chunk of an existing 128 KB block, the block winds up effectively fragmented and your new 8 KB chunk consumes some amount of extra space for extra block pointers and so on (and perhaps extra metaslab space due to fragmentation).
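(If you want to see this for yourself, here's a rough sketch in Python of the sort of experiment I mean. The path and the sizes are made up; you'd point it at a scratch directory on a plain ZFS filesystem with some free space.)

    import os, random

    path = "/zfspool/test/bigfile"   # made-up path on a plain ZFS filesystem
    size = 1 << 30                   # a 1 GiB test file
    chunk = 8 * 1024                 # 8 KB random rewrites
    nchunks = size // chunk

    def space_used(p):
        # st_blocks is in 512-byte units, so this is actual on-disk usage.
        return os.stat(p).st_blocks * 512

    # Sequential write: ZFS can aggregate this into full 128 KB records.
    with open(path, "wb") as f:
        for _ in range(nchunks):
            f.write(os.urandom(chunk))
        f.flush()
        os.fsync(f.fileno())
    print("after sequential write:", space_used(path))

    # Random overwrites: each one lands inside an existing 128 KB record
    # and, because of copy on write, can leave the file using more space.
    with open(path, "r+b") as f:
        for _ in range(nchunks // 4):
            f.seek(random.randrange(nchunks) * chunk)
            f.write(os.urandom(chunk))
        f.flush()
        os.fsync(f.fileno())
    print("after random rewrites: ", space_used(path))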
To be honest, actually pushing a filesystem or a pool out of space requires you to be doing a lot of rewrites and to already be very close to the space limit. And if you hit the limit, it seems to not cause more than occasional 'out of space' errors for the rewrite IO; things will go to 0 bytes available but the rewrites will continue to mostly work (new write IO will fail, of course). Given comments I've seen in the code while looking into the extra space reservation in ZFS pools, I suspect that ZFS is usually estimating that an overwrite takes no extra space and so usually allowing it through. But I'm guessing at this point.
(The other thing I don't know is what such a partially updated block looks like on disk. Does the entire original 128 KB block get fully read, split and rewritten somehow, or is there something more clever going on? Decoding the kernel source will tell me if I can find and understand the right spot, but I'm not that curious at the moment.)
2014-10-26
Things that can happen when (and as) your ZFS pool fills up
There's a shortage of authoritative information on what actually happens if you fill up a ZFS pool, so here is what I've both gathered about it from other people's information and experienced.
The most often cited problem is bad performance, with the usual cause being ZFS needing to do an increasing amount of searching through ZFS metaslab space maps to find free space. If not all of these are in memory, a write may require pulling some or all of them into memory, searching through them, and perhaps finding not enough space. People cite various fullness thresholds for this starting to happen, eg anywhere from 70% full to 90% full. I haven't seen any discussion about how severe this performance impact is supposed to be (and on what sort of vdevs; raidz vdevs may behave differently than mirror vdevs here).
(How many metaslabs you have turns out to depend on how your pool was created and grown.)
A nearly full pool can also have (and lead to) fragmentation, where the free space is in small scattered chunks instead of large contiguous runs. This can lead to ZFS having to write 'gang blocks', which are a mechanism where ZFS fragments one large logical block into smaller chunks (see eg the mention of them in this entry and this discussion which corrects some bits). Gang blocks are apparently less efficient than regular writes, especially if there's a churn of creation and deletion of them, and they add extra space overhead (which can thus eat your remaining space faster than expected).
If a pool gets sufficiently full, you stop being able to change most filesystem properties; for example, to set or modify the mountpoint or change NFS exporting. In theory it's not supposed to be possible for user writes to fill up a pool that far. In practice all of our full pools here have resulted in being unable to make such property changes (which can be a real problem under some circumstances).
You are supposed to be able to remove files from a full pool (possibly barring snapshots), but we've also had reports from users that they couldn't do so and their deletion attempt failed with 'No space left on device' errors. I have not been able to reproduce this and the problem has always gone away on its own.
(This may be due to a known and recently fixed issue, Illumos bug #4950.)
I've never read reports of catastrophic NFS performance problems for all pools or total system lockup resulting from a full pool on an NFS fileserver. However, both of these have happened to us. The terrible performance issue only happened on our old Solaris 10 update 8 fileservers; the total NFS stalls and then system lockups have now happened on both our old fileservers and our new OmniOS-based fileservers.
(Actually let me correct that; I've seen one report of a full pool killing a modern system. In general, see all of the replies to my tweeted question.)
By the way: if you know of other issues with full or nearly full ZFS pools (or if you have additional information here in general), I'd love to know more. Please feel free to leave a comment or otherwise get in touch.
2014-10-25
The difference in available pool space between zfs list and zpool list
For a while I've noticed that 'zpool list' would report that our pools
had more available space than 'zfs list' did and I've vaguely wondered
about why. We recently had a very serious issue due to a pool filling
up, so suddenly I became very interested in the whole issue and did
some digging. It turns out that there are two sources of the difference
depending on how your vdevs are set up.
For raidz vdevs, the simple version is that 'zpool list' reports more
or less the raw disk space before the raidz overhead while 'zfs list'
applies the standard estimate that you expect (ie that N disks worth of
space will vanish for a raidz level of N). Given that raidz overhead is
variable in ZFS, it's easy to see why the two commands are behaving this
way.
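(As a made-up illustration of the raidz difference, take a raidz2 vdev built from six 1 TiB disks:

    # Made-up numbers: a raidz2 vdev built from six 1 TiB disks.
    disks, disk_tib, parity = 6, 1.0, 2

    zpool_view = disks * disk_tib           # ~6 TiB: raw space, parity included
    zfs_view = (disks - parity) * disk_tib  # ~4 TiB: the usual usable estimate
    print(zpool_view, zfs_view)

'zpool list' reports something close to the first number, 'zfs list' something close to the second.)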
In addition, in general ZFS reserves a certain amount of pool space for various reasons, for example so that you can remove files even when the pool is 'full' (since ZFS is a copy on write system, removing files requires some new space to record the changes). This space is sometimes called 'slop space'. According to the code this reservation is 1/32nd of the pool's size. In my actual experimentation on our OmniOS fileservers this appears to be roughly 1/64th of the pool and definitely not 1/32nd of it, and I don't know why we're seeing this difference.
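To put rough numbers on the slop space (again with a made-up pool size), here's the difference between a 1/32nd and a 1/64th reservation:

    # Made-up pool size; this just shows how much space each reservation
    # would hide from 'zfs list'.
    pool_tib = 20.0

    for frac in (32, 64):
        reserved = pool_tib / frac
        print(f"1/{frac} slop: {reserved:.3f} TiB reserved, "
              f"{pool_tib - reserved:.3f} TiB visible to 'zfs list'")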
(I found out all of this from a Ben Rockwood blog entry and then found the code in the current Illumos codebase to see what the current state was (or is).)
The actual situation with what operations can (or should) use what space
is complicated. Roughly speaking, user level writes and ZFS operations
like 'zfs create' and 'zfs snapshot' that make things should use the
1/32nd reserved space figure, file removes and 'neutral' ZFS operations
should be allowed to use half of the slop space (running the pool down
to 1/64th of its size), and some operations (like 'zfs destroy') have
no limit whatever and can theoretically run your pool permanently and
unrecoverably out of space.
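Here's a toy model (mine, not the actual Illumos code) of those three categories, just to make the thresholds concrete:

    # A toy model of the three categories described above; the real checks
    # live in the Illumos kernel, this just mirrors the description.
    def space_limit(pool_size, kind):
        slop = pool_size // 32              # the 1/32nd slop reservation
        if kind == "normal":                # user writes, zfs create/snapshot
            return pool_size - slop         # must leave all of the slop free
        if kind == "free":                  # file removal, 'neutral' operations
            return pool_size - slop // 2    # may run down to 1/64th free
        if kind == "none":                  # zfs destroy and friends
            return pool_size                # no check at all
        raise ValueError(kind)

    pool = 10 * 2**40                       # a hypothetical 10 TiB pool
    used = pool - pool // 40                # about 97.5% full
    for kind in ("normal", "free", "none"):
        print(kind, "allowed:", used + (1 << 20) <= space_limit(pool, kind))

With a pool that full, a normal write is refused while a 'may use half the slop' operation still goes through.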
The final authority is the Illumos kernel code and its comments. These
days it's on Github so I can just link to the two most relevant bits:
spa_misc.c's discussion of spa_slop_shift
and dsl_synctask.h's discussion of zfs_space_check_t.
(What I'm seeing with our pools would make sense if everything was actually being classified as an 'allowed to use half of the slop space' operation. I haven't traced the Illumos kernel code at this level so I have no idea how this could be happening; the comments certainly suggest that it isn't supposed to be.)
(This is the kind of thing that I write down so I can find it later, even though it's theoretically out there on the Internet already. Re-finding things on the Internet can be a hard problem.)
2014-10-03
When using Illumos's lockstat, check the cumulative numbers too
Suppose, not entirely hypothetically, that you
have an Illumos (or OmniOS or etc) system that is experiencing
something that looks an awful lot like kernel contention; for
example, periodic 'mpstat 1' output where one CPU is spending
100% of its time in kernel code. Perhaps following Brendan Gregg's
Solaris USE method, you stumble
over lockstat and decide to give it a try. This is a fine thing,
as it's a very nice tool and can give you lots of fascinating output.
However, speaking from recent experience, I urge you to at some
point run lockstat with the -P option and check that output
too. I believe that lockstat normally sorts its output by count,
highest first; -P changes this to sort by total time (ie the count
times its displayed average time). The very important thing that
this does is it very prominently surfaces relatively rare but really
long things. In my case, I spent a bunch of time and effort looking
at quite frequent and kind of alarming looking adaptive mutex spins,
but when I looked at 'lockstat -P' I discovered a lock acquisition
that only had 30 instances over 60 seconds but that had an average
spin time (not block time) of 55 milliseconds.
(Similarly, when I looked at the adaptive mutex block times I discovered the same lock acquisition, this time blocked 37 times in 60 seconds with an average block time of 1.6 seconds.)
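To show why this matters, here's a tiny sketch using the real 30-spins-at-55-ms numbers from above plus some invented frequent-but-cheap locks for contrast; sorting by count buries the rare lock, while sorting by total time (which is effectively what -P does) puts it at the top:

    # 'rare_lock' uses the real figures from above (30 spins averaging 55 ms);
    # the other entries are invented for contrast.
    events = [
        # (name, count, average spin time in nanoseconds)
        ("hot_lock_a", 250000,     2000),
        ("hot_lock_b", 120000,     3500),
        ("rare_lock",      30, 55000000),
    ]

    by_count = sorted(events, key=lambda e: e[1], reverse=True)
    by_total = sorted(events, key=lambda e: e[1] * e[2], reverse=True)

    print("by count:     ", [name for name, *_ in by_count])
    print("by total time:", [name for name, *_ in by_total])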
In theory you can spot these things when scanning through the full
lockstat output even without -P, but in practice humans don't
work that way; we scan the top of the list and then as everything
starts to dwindle away into sameness our eyes glaze over. You're
going to miss things, so let lockstat do the work for you to
surface them.
(If you specifically suspect long things you can use -d to only
report on them, but picking a useful -d value probably requires
some guesswork and looking at basic lockstat output.)
By the way, there turn out to be a bunch of interesting tricks you
can do with lockstat. I recommend reading all the way through the
EXAMPLES section and especially paying attention to the discussion
of why various flags get used in various situations. Unlike the usual
manpage examples, it only gets more interesting as it goes along.
(And if you need really custom tooling you can use the lockstat DTrace provider in your own DTrace scripts. I wound up doing that today as part of getting information on one of our problems.)