ZFS's recordsize, holes in files, and partial blocks

September 27, 2017

Yesterday I wrote about using zdb to peer into ZFS's on-disk storage of files, and in particular I wondered whether, if you wrote a 160 Kb file, ZFS would really use two full 128 Kb blocks for it. The answer appeared to be 'no', but I was a little confused by some of what I was seeing. In a comment, Robert Milkowski set me right:

In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressible) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical).

He suggested testing on a filesystem with compression off in order to see the true state of affairs. Having done so and dug some more, I can confirm that he's correct, and there are some interesting things to see here.

The simple thing to report is the state of a 160 Kb file (the same as yesterday) on a filesystem without compression. This allocates two full 128 Kb blocks on disk:

    0  L0 0:53a40ed000:20000 20000L/20000P F=1 B=19697368/19697368
20000  L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368

     segment [0000000000000000, 0000000000040000) size  256K

These are 0x20000 bytes long on disk and the physical size is no different from the logical size. The file size in the dnode is reported as 163840 bytes, and presumably ZFS uses this to know when to return EOF as we read the second block.
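As a rough sketch, this state can be reproduced with something like the following, where /nocomp is a hypothetical scratch filesystem with compression=off (the exact du figure will vary a little with metadata overhead, and you may need to wait a few seconds or sync before du reports final numbers):

dd if=/dev/urandom of=/nocomp/testfile bs=32k count=5
ls -l /nocomp/testfile    # 163840 bytes, the size recorded in the dnode
du -k /nocomp/testfile    # roughly 256 Kb allocated, ie two full 128 Kb blocks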

One consequence of this is that it's beneficial to turn on compression even for filesystems with uncompressible data, because doing so gets you 'compression' of partial blocks (by compressing those zero bytes). On the filesystem without compression, that 32 Kb of uncompressible data forced the allocation of 128 Kb of space; on the filesystem with compression, the same 32 Kb of data only required 33 Kb of space.
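To put rough numbers on that (these are hypothetical figures, not measured output): writing the same 160 Kb of random data to a filesystem with compression on and to one with it off should give du numbers of roughly 161 Kb versus 257 Kb, ie 128 Kb plus a 33 Kb partial block against two full 128 Kb blocks. Here /comp and /nocomp are hypothetical mount points for the two filesystems:

dd if=/dev/urandom of=/comp/t160k bs=32k count=5
dd if=/comp/t160k of=/nocomp/t160k bs=32k
du -k /comp/t160k /nocomp/t160k    # roughly 161 Kb versus 257 Kb, plus metadata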

A more interesting test file has holes that each cover an entire recordsize block. Let's make one that has 128 Kb of data, skips the second 128 Kb block entirely, has 32 Kb of data at the end of the third 128 Kb block, skips the fourth 128 Kb block, and has 32 Kb of data at the end of the fifth 128 Kb block. Set up with dd, this is:

dd if=/dev/urandom of=testfile2 bs=128k count=1
dd if=/dev/urandom of=testfile2 bs=32k seek=11 count=1 conv=notrunc
dd if=/dev/urandom of=testfile2 bs=32k seek=19 count=1 conv=notrunc
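A quick sanity check before going to zdb: ls should report an apparent size of 655360 bytes (five 128 Kb blocks' worth), while du should report much less space actually used, since only three of those blocks contain data:

ls -l testfile2    # apparent size: 655360 bytes, ie five 128 Kb blocks
du -k testfile2    # actual space used: much less, since two blocks are holes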

Up until now I've been omitting the output for the L1 indirect block that contains block information for the L0 blocks. With it included, the file's data blocks look like this:

# zdb -vv -O ssddata/homes cks/tmp/testfile2
[...]
Indirect blocks:
     0 L1  0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016
     0  L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011
 40000  L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015
 80000  L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016

     segment [0000000000000000, 0000000000020000) size  128K
     segment [0000000000040000, 0000000000060000) size  128K
     segment [0000000000080000, 00000000000a0000) size  128K

The blocks at 0x20000 and 0x60000 are missing entirely; these are genuine holes. The blocks at 0x40000 and 0x80000 are 128 Kb logical but less physical, and are presumably compressed. Can we tell for sure? The answer is yes:

# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2
[...]
     0 L1  DVA[0]=<0:8a2c4e2c00:400> DVA[1]=<0:7601b4be00:400> [L1 ZFS plain file] fletcher4 lz4 [...]
     0  L0 DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed [...]
 40000  L0 DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
 80000  L0 DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 [...]

(That we need to use both -vv and -bbbb here is due to how zdb's code is set up, and it's rather a hack to get what we want. I had to read the zdb source code to figure out how to do this.)

Among other things (which I've omitted here), this shows us that the 0x40000 and 0x80000 blocks are compressed with lz4, while the 0x0 block is uncompressed (which is what we expect from 128 Kb of random data). ZFS always compresses metadata (or at least tries to), so the L1 indirect block is also compressed with lz4.

This shows us that sparse files benefit from compression being turned on even if they contain uncompressible data. If this were a filesystem with compression off, the blocks at 0x40000 and 0x80000 would each have used 128 Kb of space, not the 33 Kb of space that they did here. ZFS filesystem compression thus helps space usage both for trailing data (which is not uncommon) and for sparse files (which may be relatively rare on your filesystems).
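To put rough (hypothetical) numbers on it: repeating the same three dd commands on a compression=off filesystem and comparing du should show somewhere around 384 Kb of space used there (three full 128 Kb blocks) versus roughly 194 Kb here (128 Kb plus two 33 Kb partial blocks):

du -k testfile2 /nocomp/testfile2    # roughly 194 Kb versus 384 Kb, plus metadata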

It's sometimes possible to dump the block contents of things like L1 indirect blocks, so you can see a more direct representation of them. This is where it's important to know that metadata is compressed, so we can ask zdb to decompress it with a magic argument:

# zdb -R ssddata 0:8a2c4e2c00:400:id
[...]
DVA[0]=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=3710011L/3710011P fill=1 cksum=3fcb4949b1aa:ff8a4656f2b87fd:d375da58a32c3eee:73a5705b7851bb59
HOLE [L0 unallocated] size=200L birth=0L
DVA[0]=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710015L/3710015P fill=1 cksum=1079fbeda2c0:117fba0118c39e9:3534e8d61ddb372b:b5f0a9e59ccdcb7b
HOLE [L0 unallocated] size=200L birth=0L
DVA[0]=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710016L/3710016P fill=1 cksum=10944482ae3e:11830a40138e0c8:2f1dbd6afa0ee9b4:7d3d6b2c247ae44
HOLE [L0 unallocated] size=200L birth=0L
[...]

Here we can see the direct representation of the L1 indirect block with explicit holes between our allocated blocks. (This is a common way of representing holes in sparse files; most filesystems have some variant of it.)
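One user-visible consequence of those HOLE entries is that reading through them simply produces zero bytes. For instance, dumping the second 128 Kb block of testfile2 should show nothing but zeros (a single line of them, a '*' for the repeats, and then the final offset):

dd if=testfile2 bs=128k skip=1 count=1 2>/dev/null | od -A d -x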

PS: I'm not using 'zdb -ddddd' today because when I dug deeper into zdb, I discovered that 'zdb -O' would already report this information when given the right arguments, thereby saving me an annoying step.
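(For the record, the 'zdb -ddddd' equivalent is something like the following, where the extra step is looking up the file's object number first; on ZFS it's the same as the inode number that 'ls -i' reports:

ls -i testfile2                      # the inode number is the ZFS object number
zdb -ddddd ssddata/homes <objnum>    # <objnum> is the number ls -i printed

)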

Sidebar: Why you can't always dump blocks with zdb -R

To decompress a (ZFS) block, you need to know what it's compressed with and its uncompressed size. This information is stored in whatever metadata points to the block, not in the block itself, and so currently zdb -R simply guesses repeatedly until it gets a result that appears to work out right:

# zdb -R ssddata 0:8a2c4e2c00:400:id
Found vdev type: mirror
Trying 00400 -> 00600 (inherit)
Trying 00400 -> 00600 (on)
Trying 00400 -> 00600 (uncompressed)
Trying 00400 -> 00600 (lzjb)
Trying 00400 -> 00600 (empty)
Trying 00400 -> 00600 (gzip-1)
Trying 00400 -> 00600 (gzip-2)
[...]
Trying 00400 -> 20000 (lz4)
DVA[0]=<0:8a4afe7e00:20000> [...]

The result that zdb -R arrives at may or may not actually be correct, and thus may or may not give you the real decompressed block data. Here it worked; at other times I've tried it, not so much. The last 'Trying' line that zdb -R prints is the guess it settled on, so you can at least check whether it got things right; here, for example, we know that it did, since it picked lz4 with a true logical size of 0x20000, which matches what the metadata we have for the L1 indirect block says.

Ideally zdb -R would gain a way of specifying the compression algorithm and the logical size for the d block flag. Perhaps someday.
