With ZFS, rewriting a file in place might make you run out of space

October 31, 2014

Here's an interesting little issue that I confirmed recently: if you rewrite an existing file in place with random IO on a plain ZFS filesystem, you can wind up using extra space and even run out of space. This is a little bit surprising but is not a bug; it's just fallout from how ZFS works.

It's easy to see how this can happen if you have compression or deduplication turned on for the filesystem and you rewrite the file with different data; the new data might compress or deduplicate less well than the old data and so use more space. Deduplication might be especially prone to this if you initialize your file with something simple (zeroes, say) and then rewrite it with actual data.

(The corollary to this is that continuously rewritten files, like the storage for a database, can take up a fluctuating amount of disk space over time on such a filesystem. This is one of several reasons that we're unlikely to ever turn compression on for our fileservers.)
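As a rough illustration of the compression case, here is a minimal sketch that compares how well a zero-filled record and a record of random data compress. It uses Python's zlib purely as a stand-in; ZFS uses its own compression algorithms, and the 128 KB `RECORD_SIZE` and the helper names are assumptions made for the example, not anything taken from ZFS.

```python
import os
import zlib

# Assumed record size for the illustration (the usual ZFS recordsize).
RECORD_SIZE = 128 * 1024


def compressed_size(data: bytes) -> int:
    """Size of data after zlib compression (a stand-in for ZFS's own compressor)."""
    return len(zlib.compress(data))


def compare_rewrite(old: bytes, new: bytes) -> int:
    """How many more compressed bytes the new record needs than the old one."""
    return compressed_size(new) - compressed_size(old)


if __name__ == "__main__":
    zeros = bytes(RECORD_SIZE)              # the 'initialized with zeroes' file
    real_data = os.urandom(RECORD_SIZE)     # incompressible data as the worst case
    print("zero-filled record compresses to", compressed_size(zeros), "bytes")
    print("random record compresses to", compressed_size(real_data), "bytes")
    print("extra bytes needed after rewrite:", compare_rewrite(zeros, real_data))
```

Random data is the worst case; real data will usually land somewhere in between, which is why the space used by a rewritten file can move in either direction.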

But this can happen even on filesystems without dedup or compression, which is the surprising part. It's the result of the ZFS 'record size' (what many filesystems would call their block size). ZFS has a variable record size, ranging from the minimum block size of your disks up to the recordsize parameter, which defaults to 128 KB. When you write data, especially sequential data, ZFS will transparently aggregate it together into large blocks; this makes both writes and reads more efficient and so is a good thing.

So you start out by writing a big file sequentially, which aggregates things together into 128 KB on-disk blocks, puts pointers to those blocks into the file's metadata, and so on. Now you come back later and rewrite the file using, say, 8 KB random IO. Because ZFS is a copy-on-write filesystem, it can't overwrite the existing data in place. Instead, every time you write over a chunk of an existing 128 KB block, the block winds up effectively fragmented and your new 8 KB chunk consumes some amount of extra space for extra block pointers and so on (and perhaps extra metaslab space due to fragmentation).
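If you want to try the scenario above yourself, here is a minimal sketch of the experiment: write a file sequentially, rewrite it with 8 KB random IO, and compare the space the filesystem reports as allocated before and after. The function names, sizes, and counts are all made up for illustration. It relies on `st_blocks`, so it is Unix-only, and on ZFS the reported allocation may take a few seconds to settle after the writes; on other filesystems you should expect no change at all.

```python
import os
import random
import tempfile

RECORD_SIZE = 128 * 1024   # assumed recordsize
CHUNK_SIZE = 8 * 1024      # size of each random rewrite


def allocated_bytes(path: str) -> int:
    """Space reported as allocated to the file (st_blocks is in 512-byte units)."""
    return os.stat(path).st_blocks * 512


def write_sequential(path: str, records: int) -> None:
    """Create the file with large sequential writes of random data."""
    with open(path, "wb") as f:
        for _ in range(records):
            f.write(os.urandom(RECORD_SIZE))
        f.flush()
        os.fsync(f.fileno())


def rewrite_random(path: str, writes: int, seed: int = 0) -> None:
    """Overwrite randomly chosen, chunk-aligned 8 KB pieces without growing the file."""
    rng = random.Random(seed)
    chunks = os.path.getsize(path) // CHUNK_SIZE
    with open(path, "r+b") as f:
        for _ in range(writes):
            f.seek(rng.randrange(chunks) * CHUNK_SIZE)
            f.write(os.urandom(CHUNK_SIZE))
        f.flush()
        os.fsync(f.fileno())


def report(path: str, records: int = 64, writes: int = 256) -> tuple:
    """Run the experiment and return (allocated before, allocated after) in bytes."""
    write_sequential(path, records)
    before = allocated_bytes(path)
    rewrite_random(path, writes)
    after = allocated_bytes(path)
    print("allocated before:", before, "after:", after, "delta:", after - before)
    return before, after


if __name__ == "__main__":
    # Uses a temporary directory by default; to test ZFS, call report() with a
    # path on the filesystem you care about instead.
    with tempfile.TemporaryDirectory() as tmp:
        report(os.path.join(tmp, "rewrite-test.dat"))
```

The file's logical size never changes here, so any difference between the two numbers is purely a matter of how the filesystem stores the rewritten data.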

To be honest, actually pushing a filesystem or a pool out of space requires you to be doing a lot of rewrites and to already be very close to the space limit. And if you hit the limit, it doesn't seem to cause more than occasional 'out of space' errors for the rewrite IO; things will go to 0 bytes available but the rewrites will continue to mostly work (new write IO will fail, of course). Given comments I've seen in the code while looking into the extra space reservation in ZFS pools, I suspect that ZFS is usually estimating that an overwrite takes no extra space and so usually allowing it through. But I'm guessing at this point.

(The other thing I don't know is what such a partially updated block looks like on disk. Does the entire original 128 KB block get fully read, split and rewritten somehow, or is there something more clever going on? Decoding the kernel source will tell me if I can find and understand the right spot, but I'm not that curious at the moment.)

