Using zdb to peer into how ZFS stores files on disk
All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.
For reasons beyond the scope of this entry, I was wondering if this was actually true. Specifically, suppose you're using the default 128 Kb recordsize and you write a file that is 160 Kb at the user level (128 Kb plus 32 Kb). The way recordsize is usually described implies that ZFS writes this on disk as two 128 Kb blocks, with the second one mostly empty.
It turns out that we can use zdb to find out the answer to this question and other interesting ones like it, and it's not even all that painful. My starting point was Bruning Questions: ZFS Record Size, which has an example of using zdb on a file in a test ZFS pool.
We can actually do this with a test file on a regular pool, like so:
- Create a test file:
      cd $HOME/tmp
      dd if=/dev/urandom of=testfile bs=160k count=1
  I used /dev/urandom here to defeat ZFS compression.
- Use zdb -O to determine the object number of this file:
      ; zdb -O ssddata/homes cks/tmp/testfile
          Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1075431    2   128K   128K   163K     512   256K  100.00  ZFS plain file
  (Your version of zdb may be too old to have the -O option, but it's in upstream Illumos and ZFS on Linux.)
- Use zdb -ddddd to dump detailed information on the object:
      # zdb -ddddd ssddata/homes 1075431
      [...]
           0 L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003
       20000 L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003

          segment [0000000000000000, 0000000000040000) size  256K
  See Bruning Questions: ZFS Record Size for information on what the various fields mean.
  (How many ds to use with the -d switch of zdb is sort of like explosives; if it doesn't solve your problem, add more -ds until it does. This number of ds works with ZFS on Linux for me but you might need more.)
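Reading hex fields by eye gets tedious once a file has more than a couple of blocks. As a rough sketch, the L0 lines from the dump can be pulled apart with a small Python script. The names here (BlockPointer, BP_LINE, parse_block_pointers) are made up for illustration, and the field meanings (logical size before the "L", physical size before the "P", birth transaction group after "B=") are my reading of this output format, not something taken from the zdb documentation.

```python
import re
from typing import List, NamedTuple


class BlockPointer(NamedTuple):
    offset: int     # logical offset of this block within the file, in bytes
    level: int      # indirection level; L0 lines are the data blocks
    dva: str        # "vdev:offset:asize" location string, kept as text
    lsize: int      # logical size in bytes (the hex number before "L")
    psize: int      # physical size in bytes (the hex number before "P")
    birth_txg: int  # first number after "B=" (assumed to be the birth txg)


# Matches lines such as:
#      0 L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003
BP_LINE = re.compile(
    r"^\s*(?P<off>[0-9a-f]+)\s+L(?P<lvl>\d+)\s+"
    r"(?P<dva>\d+:[0-9a-f]+:[0-9a-f]+)\s+"
    r"(?P<lsize>[0-9a-f]+)L/(?P<psize>[0-9a-f]+)P\s+"
    r"F=\d+\s+B=(?P<birth>\d+)/\d+"
)


def parse_block_pointers(dump: str) -> List[BlockPointer]:
    """Return one BlockPointer per matching line; other lines are skipped."""
    blocks = []
    for line in dump.splitlines():
        m = BP_LINE.match(line)
        if m is None:
            continue
        blocks.append(BlockPointer(
            offset=int(m["off"], 16),
            level=int(m["lvl"]),
            dva=m["dva"],
            lsize=int(m["lsize"], 16),
            psize=int(m["psize"], 16),
            birth_txg=int(m["birth"]),
        ))
    return blocks


# The block lines from the dump above, pasted in as sample input.
SAMPLE = """\
[...]
     0 L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003
 20000 L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003

    segment [0000000000000000, 0000000000040000) size  256K
"""

blocks = parse_block_pointers(SAMPLE)
for bp in blocks:
    print(f"offset {bp.offset // 1024:4d}K  logical {bp.lsize // 1024}K  "
          f"physical {bp.psize / 1024:g}K  txg {bp.birth_txg}")
```

In real use you would feed it the captured output of the zdb command instead of SAMPLE. The regular expression is only known to fit the lines shown in this entry; other zdb versions may print more fields.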
What we have here are two on-disk blocks. One is 0x20000 bytes long, or 128 Kb; the other is 0x8400 bytes long, or 33 Kb. I don't know why it's 33 Kb instead of 32 Kb, especially since zdb will also report that the file has a size of 163840 (bytes), which is exactly 160 Kb as expected. It's not the ashift on this pool, because this is the pool I made a little setup mistake on so it has an ashift of 9.
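The arithmetic behind that puzzlement can be checked directly. This is only a sketch of the numbers quoted above; round_up and the variable names are my own, and the one assumption is that an ashift of 9 means allocations are rounded to 512-byte sectors.

```python
KIB = 1024


def round_up(n: int, multiple: int) -> int:
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


file_size = 163840        # size zdb reports for the file, in bytes
recordsize = 128 * KIB    # default recordsize
first_psize = 0x20000     # physical size of the first block
second_psize = 0x8400     # physical size of the second block

tail = file_size - recordsize   # user data left over for the second block
sector = 1 << 9                 # ashift 9 -> 512-byte sectors (assumption)

print(f"first block:  {first_psize // KIB} Kb")
print(f"second block: {second_psize // KIB} Kb for {tail // KIB} Kb of data")
print(f"tail rounded to sectors: {round_up(tail, sector) // KIB} Kb")
print(f"unexplained extra: {second_psize - round_up(tail, sector)} bytes")
```

The tail is already a whole number of 512-byte sectors, so sector rounding accounts for none of the extra 1024 bytes, which matches the observation that ashift is not the explanation.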
Based on what we see here it certainly appears that ZFS will write a short block at the end of a file instead of forcing all blocks in the file to be 128 Kb once you've hit that point. However, note that this second block still has a logical size of 0x20000 bytes (128 Kb), so logically it covers the entire recordsize. This may be part of why it takes up 33 Kb instead of 32 Kb on disk.
That doesn't mean that the 128 Kb recordsize has no effect; in fact, we can show why you might care with a little experiment. Let's rewrite 16 Kb in the middle of that first 128 Kb block, and then re-dump the file layout details:
    ; dd if=/dev/urandom of=testfile conv=notrunc bs=16k count=1 seek=4
    # zdb -ddddd ssddata/homes 1075431
    [...]
         0 L0 0:73610c5a00:20000 20000L/20000P F=1 B=3694207/3694207
     20000 L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003
As you'd sort of expect from the description of recordsize, ZFS has not split the 128 Kb block up into some chunks; instead, it's done a read-modify-write cycle on the entire 128 Kb, resulting in an entirely new 128 Kb block and 128 Kb of read and write IO (at least at a logical level; at a physical level this data was probably in the ARC, since I'd just written the file in the first place).
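To put a number on that, here is a deliberately simple model of which recordsize blocks a write lands in and how much logical data gets rewritten as a result. The function names are invented for this sketch, and it assumes every touched block is a full recordsize block, ignoring caching, compression and the short block at the end of the file.

```python
KIB = 1024
RECORDSIZE = 128 * KIB  # default recordsize, as used in this entry


def records_touched(offset: int, length: int, recordsize: int = RECORDSIZE):
    """Return the indexes of the recordsize blocks that a write overlaps."""
    if length <= 0:
        return []
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return list(range(first, last + 1))


def logical_bytes_rewritten(offset: int, length: int,
                            recordsize: int = RECORDSIZE) -> int:
    """Bytes rewritten if each touched block is rewritten in full (a model)."""
    return len(records_touched(offset, length, recordsize)) * recordsize


# The dd above: bs=16k count=1 seek=4, so 16 Kb written at offset 64 Kb.
offset, length = 4 * 16 * KIB, 16 * KIB
touched = records_touched(offset, length)
rewritten = logical_bytes_rewritten(offset, length)

print(f"blocks touched: {touched}")
print(f"logical bytes rewritten: {rewritten // KIB} Kb "
      f"for a {length // KIB} Kb write ({rewritten // length}x)")
```

Under this model the 16 Kb write touches only block 0 but rewrites all 128 Kb of it, which is consistent with the first block getting a new location and birth number in the dump while the second block is unchanged.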
Now let's give ZFS a slightly tricky case to see what it does. Unix files can have holes, areas where no data has been written; the resulting file is called a sparse file. Traditionally holes don't result in data blocks being allocated on disk; instead they're gaps in the allocated blocks. You create holes by writing beyond the end of file. How does ZFS represent holes? We'll start by making a 16 Kb file with no hole, then give it a hole by writing another 16 Kb at 96 Kb into the file.
    ; dd if=/dev/urandom of=testfile2 bs=16k count=1
    # zdb -ddddd ssddata/homes 1078183
    [...]
         0 L0 0:7330dcaa00:4000 4000L/4000P F=1 B=3694361/3694361

        segment [0000000000000000, 0000000000004000) size 16K
Now we add the hole:
    ; dd if=/dev/urandom of=testfile2 bs=16k count=1 seek=6 conv=notrunc
    [...]
    # zdb -ddddd ssddata/homes 1078183
    [...]
         0 L0 0:73ea07a400:8200 1c000L/8200P F=1 B=3694377/3694377

        segment [0000000000000000, 000000000001c000) size 112K
The file started out as having one block of (physical on-disk) size 0x4000 (16 Kb). When we added the hole, it was rewritten to have one block of size 0x8200 (32.5 Kb), which represents 112 Kb of logical space. This is actually interesting; it means that ZFS is doing something clever to store holes that fall within what would normally be a single recordsize block. It's also suggestive that ZFS writes some extra data to the block over what we did (the .5 Kb), just as it did with the second block in our first example.
(The same thing happens if you write the second 16 Kb block at 48 Kb, so that you create a 64 Kb long file that would be one 64 Kb block if it didn't have a hole.)
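For anyone who wants to reproduce the file-with-a-hole setup without dd, the same write pattern can be sketched in Python. The helper name make_file_with_hole and its parameters are hypothetical; the script only shows what is visible at the user level (the file's length and the zeros read back from the hole), not how ZFS or any other filesystem stores it.

```python
import os
import tempfile

KIB = 1024


def make_file_with_hole(path, first_len=16 * KIB,
                        second_off=96 * KIB, second_len=16 * KIB):
    """Write random data at the start, skip ahead, and write more data."""
    with open(path, "wb") as f:
        f.write(os.urandom(first_len))
        f.seek(second_off)              # seeking past EOF leaves a hole
        f.write(os.urandom(second_len))
    # Stat only after the file is closed so all data has been written out.
    return os.stat(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "testfile2")
    st = make_file_with_hole(path)
    with open(path, "rb") as f:
        f.seek(16 * KIB)
        hole = f.read(80 * KIB)         # the unwritten 16 Kb .. 96 Kb range

apparent_size = st.st_size
hole_is_zero = hole == bytes(80 * KIB)
# st_blocks is not available on every platform, so treat it as optional.
allocated_512 = getattr(st, "st_blocks", None)

print(f"apparent size: {apparent_size // KIB} Kb ({apparent_size:#x} bytes)")
print(f"hole reads back as zeros: {hole_is_zero}")
print(f"512-byte units allocated (if reported): {allocated_512}")
```

The apparent size comes out as 0x1c000 bytes, the same 112 Kb of logical space that zdb reports for the single block above.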
Now that I've worked out how to use zdb for this sort of exploration, there are a number of questions about how ZFS stores files on disk that I want to look into at some point, including how compression interacts with recordsize and block sizes.
(I should probably also do some deeper exploration of what the information zdb is reporting means. I've poked around with zdb before, but always in very 'heads down' and limited ways that didn't involve really understanding ZFS on-disk structures.)
Update: As pointed out by Robert Milkowski in the comments, I'm mistaken here and being fooled by compression being on in this filesystem. See ZFS's recordsize, holes in files, and partial blocks for the illustrated explanation of what's really going on.