With ZFS, rewriting a file in place might make you run out of space

October 31, 2014

Here's an interesting little issue that I confirmed recently: if you rewrite an existing file in place with random IO on a plain ZFS filesystem, you can wind up using extra space and even run out of space. This is a little bit surprising but is not a bug; it's just fallout from how ZFS works.

It's easy to see how this can happen if you have compression or deduplication enabled on the filesystem and you rewrite the file with different data; the new data might compress or deduplicate less well than the old data and so use up more space. Deduplication may be especially prone to this if you initialize your file with something simple (zeroes, say) and then rewrite it with actual data.
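As a rough illustration of the compression case, here is a minimal sketch that compares how well a block of zeroes and a block of random bytes compress. It uses Python's zlib purely as a stand-in; it is not a model of whichever compression algorithm a given ZFS filesystem is configured with, and the 128 KB size is just chosen to match the usual recordsize.

```python
import os
import zlib

RECORD = 128 * 1024  # one 128 KB chunk of file data (assumed size for illustration)

# What a freshly initialized file might hold.
zeros = bytes(RECORD)
# Stand-in for "actual data" that compresses poorly.
random_data = os.urandom(RECORD)


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of data after zlib compression."""
    return len(zlib.compress(data))


print("zeroes:", compressed_size(zeros), "bytes")
print("random:", compressed_size(random_data), "bytes")
```

The zeroes shrink to a tiny fraction of their original size while the random data does not shrink at all, so overwriting the first with the second needs far more space even though the file's length never changes.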

(The corollary to this is that continuously rewritten files, like the storage for a database, can take up a fluctuating amount of disk space over time on such a filesystem. This is one of several reasons that we're unlikely to ever turn on compression on our fileservers.)

But this can happen even on filesystems without dedup or compression, which is the surprising part. What's happening is the result of the ZFS 'record size' (what many filesystems would call their block size). ZFS has a variable record size, ranging from the minimum block size of your disks up to the recordsize parameter, which is usually 128 KB. When you write data, especially sequential data, ZFS will transparently aggregate it together into large blocks; this makes both writes and reads more efficient and so is a good thing.

So you start out by writing a big file sequentially, which aggregates things together into 128 KB on-disk blocks, puts pointers to those blocks into the file's metadata, and so on. Now you come back later and rewrite the file using, say, 8 KB random IO. Because ZFS is a copy-on-write filesystem, it can't overwrite the existing data in place. Instead, every time you write over a chunk of an existing 128 KB block, the block winds up effectively fragmented and your new 8 KB chunk consumes some amount of extra space for extra block pointers and so on (and perhaps extra metaslab space due to fragmentation).
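If you want to try this yourself, the experiment can be sketched roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, it relies on POSIX `st_blocks` (counted in 512-byte units) to see allocated space, and on ZFS the reported usage may lag behind the writes for a few seconds, so you may need to wait before re-checking.

```python
import os
import random


def allocated_bytes(path):
    """Space the filesystem reports as allocated to the file (POSIX only)."""
    return os.stat(path).st_blocks * 512


def write_sequential(path, size, chunk=1024 * 1024):
    """Create the file with large sequential writes of incompressible data."""
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def rewrite_random(path, io_size=8 * 1024, count=1000, seed=0):
    """Overwrite `count` randomly chosen, io_size-aligned chunks in place."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    slots = size // io_size
    if slots == 0:
        raise ValueError("file is smaller than one IO")
    with open(path, "r+b") as f:
        for _ in range(count):
            f.seek(rng.randrange(slots) * io_size)
            f.write(os.urandom(io_size))
        f.flush()
        os.fsync(f.fileno())
```

The idea is to call `write_sequential()` on a path in the filesystem under test, note `allocated_bytes()`, run `rewrite_random()`, and then compare `allocated_bytes()` again. The file's length never changes, so any difference is down to how the filesystem stores the rewritten data.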

To be honest, actually pushing a filesystem or a pool out of space requires you to be doing a lot of rewrites and to already be very close to the space limit. And if you hit the limit, it doesn't seem to cause more than occasional 'out of space' errors for the rewrite IO; things will go to 0 bytes available but the rewrites will continue to mostly work (new write IO will fail, of course). Given comments I've seen in the code while looking into the extra space reservation in ZFS pools, I suspect that ZFS is usually estimating that an overwrite takes no extra space and so usually allowing it through. But I'm guessing at this point.

(The other thing I don't know is what such a partially updated block looks like on disk. Does the entire original 128 KB block get fully read, split and rewritten somehow, or is there something more clever going on? Decoding the kernel source will tell me if I can find and understand the right spot, but I'm not that curious at the moment.)

Comments on this page:

By Etienne Dechamps at 2014-11-01 04:17:06:

"The other thing I don't know is what such a partially updated block looks like on disk. Does the entire original 128 KB block get fully read, split and rewritten somehow, or is there something more clever going on?"

No, that's exactly what it does. Worse, not only does it do RMW (Read-Modify-Write), it does it in the dumbest way possible and doesn't even know how to coalesce writes: https://github.com/zfsonlinux/zfs/issues/361

This presentation goes into excruciating detail about this behavior: http://www.youtube.com/watch?v=LtY3vpX-cdM

This RMW problem is especially painful when using a ZFS file or zvol as a virtual disk (e.g. for a VM).

By Paul Tötterman at 2014-11-01 09:15:18:
From at 2014-11-02 13:38:43:

Variable block size works a little bit differently than you suggest. If your recordsize is set to 128KB, then files smaller than 128KB will be contained in a single block. However, if a file is larger than the recordsize, then all of its blocks will be exactly recordsize in size. If you overwrite only part of a block, zfs will have to read the entire block (if it is not cached), apply the change, and write out an entire block. If you have an application which often does this (like databases), that is the main reason why you want to set recordsize to match the db blocksize: to avoid partial overwrites and extra unnecessary reads.

Variable block in zfs mainly means that it can be different for different files.
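The read-modify-write cost described in the comments above can be put into rough numbers. This is a back-of-the-envelope sketch only: it assumes a file larger than recordsize (so every block is exactly recordsize), ignores compression and metadata, and uses made-up function names.

```python
def records_touched(offset, length, recordsize=128 * 1024):
    """How many recordsize blocks a write of `length` bytes at `offset` overlaps."""
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return last - first + 1


def rmw_bytes_written(offset, length, recordsize=128 * 1024):
    """Data bytes rewritten if every touched block is written out in full."""
    return records_touched(offset, length, recordsize) * recordsize


def write_amplification(io_size, recordsize=128 * 1024):
    """Bytes rewritten per byte the application wrote, for an aligned IO."""
    return rmw_bytes_written(0, io_size, recordsize) / io_size


print(write_amplification(8 * 1024))            # 8 KB IO, 128 KB records
print(write_amplification(8 * 1024, 8 * 1024))  # recordsize matched to the IO size
```

With 128 KB records, an aligned 8 KB overwrite rewrites 16 times as much data as the application asked for; with recordsize set to 8 KB the factor drops to 1, which is the reason given above for matching recordsize to the database block size.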

Written on 31 October 2014.

Last modified: Fri Oct 31 22:50:21 2014