recordsize, holes in files, and partial blocks
Yesterday I wrote about using
zdb to peer into ZFS's on-disk
storage of files, and in particular I
wondered if you wrote a 160 Kb file, would ZFS really use two
128 Kb blocks for it. The answer appeared to be 'no', but I was
a little bit confused by some things I was seeing. In a comment,
Robert Milkowski set me right:
In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressible) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical).
He suggested testing on a filesystem with compression off in order to see the true state of affairs. Having done so and done some more digging, he's correct and we can see some interesting things here.
The simple thing to report is the state of a 160 Kb file (the same as yesterday) on a filesystem without compression. This allocates two full 128 Kb blocks on disk:
0 L0 0:53a40ed000:20000 20000L/20000P F=1 B=19697368/19697368 20000 L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368 segment [0000000000000000, 0000000000040000) size 256K
These are 0x20000 bytes long on disk and the physical size is no different from the logical size. The file size in the dnode is reported as 163840 bytes, and presumably ZFS uses this to know when to return EOF as we read the second block.
One consequence of this is that it's beneficial to turn on compression even for filesystems with uncompressible data, because doing so gets you 'compression' of partial blocks (by compressing those zero bytes). On the filesystem without compression, that 32 Kb of uncompressible data forced the allocation of 128 Kb of space; on the filesystem with compression, the same 32 Kb of data only required 33 Kb of space.
A more interesting test file has holes that cover an entire recordsize
block. Let's make one that has 128 Kb of data, skips the second 128 Kb
block entirely, has 32 Kb of data at the end of the third 128 Kb block,
skips the fourth 128 Kb block, and has 32 Kb of data at the end of the
fifth 128 Kb block. Set up with
dd, this is:
dd if=/dev/urandom of=testfile2 bs=128k count=1 dd if=/dev/urandom of=testfile2 bs=32k seek=11 count=1 conv=notrunc dd if=/dev/urandom of=testfile2 bs=32k seek=19 count=1 conv=notrunc
Up until now I've been omitting the output for the L1 indirect block that contains block information for the L0 blocks. With it included, the file's data blocks look like this:
# zdb -vv -O ssddata/homes cks/tmp/testfile2 [...] Indirect blocks: 0 L1 0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016 0 L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011 40000 L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015 80000 L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016 segment [0000000000000000, 0000000000020000) size 128K segment [0000000000040000, 0000000000060000) size 128K segment [0000000000080000, 00000000000a0000) size 128K
The blocks at 0x20000 and 0x60000 are missing entirely; these are genuine holes. The blocks at 0x40000 and 0x80000 are 128 Kb logical but less physical, and are presumably compressed. Can we tell for sure? The answer is yes:
# zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2 [...] 0 L1 DVA=<0:8a2c4e2c00:400> DVA=<0:7601b4be00:400> [L1 ZFS plain file] fletcher4 lz4 [...] 0 L0 DVA=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed [...] 40000 L0 DVA=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 [...] 80000 L0 DVA=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
(That we need to use both
-bbbb here is due to how
zdb's code is set up, and it's rather a hack to get what we want.
I had to read the
zdb source code to work out how to make it work.)
Among other things (which I've omitted here), this shows us that the 0x40000 and 0x80000 blocks are compressed with lz4, while the 0x0 block is uncompressed (which is what we expect from 128 Kb of random data). ZFS always compresses metadata (or at least tries to), so the L1 indirect block is also compressed with lz4.
This shows us that sparse files benefit from compression being turned on even if they contain uncompressible data. If this was a filesystem with compression off, the blocks at 0x40000 and 0x80000 would each have used 128 Kb of space, not the 33 Kb of space that they did here. ZFS filesystem compression thus helps space usage both for trailing data (which is not uncommon) and for sparse files (which may be relatively rare on your filesystems).
It's sometimes possible to dump the block contents of things like
L1 indirect blocks, so you can see a more direct representation
of them. This is where it's important to know that metadata is
compressed, so we can ask
zdb to decompress it with a magic
# zdb -R ssddata 0:8a2c4e2c00:400:id [...] DVA=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=3710011L/3710011P fill=1 cksum=3fcb4949b1aa:ff8a4656f2b87fd:d375da58a32c3eee:73a5705b7851bb59 HOLE [L0 unallocated] size=200L birth=0L DVA=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710015L/3710015P fill=1 cksum=1079fbeda2c0:117fba0118c39e9:3534e8d61ddb372b:b5f0a9e59ccdcb7b HOLE [L0 unallocated] size=200L birth=0L DVA=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710016L/3710016P fill=1 cksum=10944482ae3e:11830a40138e0c8:2f1dbd6afa0ee9b4:7d3d6b2c247ae44 HOLE [L0 unallocated] size=200L birth=0L [...]
Here we can see the direct representation of the L1 indirect block with explicit holes between our allocated blocks. (This is a common way of representing holes in sparse files; most filesystems have some variant of it.)
PS: I'm not using '
zdb -ddddd' today because when I dug deeper
zdb, I discovered that '
zdb -O' would already report this
information when given the right arguments, thereby saving me an
Sidebar: Why you can't always dump blocks with
To decompress a (ZFS) block, you need to know what it's compressed
with and its uncompressed size. This information is stored in
whatever metadata points to the block, not in the block itself, and
zdb -R simply guesses repeatedly until it gets a
result that appears to work out right:
# zdb -R ssddata 0:8a2c4e2c00:400:id Found vdev type: mirror Trying 00400 -> 00600 (inherit) Trying 00400 -> 00600 (on) Trying 00400 -> 00600 (uncompressed) Trying 00400 -> 00600 (lzjb) Trying 00400 -> 00600 (empty) Trying 00400 -> 00600 (gzip-1) Trying 00400 -> 00600 (gzip-2) [...] Trying 00400 -> 20000 (lz4) DVA=<0:8a4afe7e00:20000> [...]
The result that
zdb -R gets may or may not actually be correct,
and thus may or may not give you the actual decompressed block data.
Here it worked; at other times I've tried it, not so much. The last
zdb -R prints is the one it thinks is correct, so
you can at least see if it got it right (here, for example, we know
that it did, since it picked lz4 with a true logical size of 0x20000
and that's what the metadata we have about the L1 indirect block says
zdb -R would gain a way of specifying the compression
algorithm and the logical size for the
d block flag. Perhaps