Using zdb to peer into how ZFS stores files on disk

September 26, 2017

If you've read much about ZFS and ZFS performance tuning, one of the things you'll have run across is the ZFS recordsize. The usual way it's described is, for example (from here):

All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.

For reasons beyond the scope of this entry, I was wondering if this was actually true. Specifically, suppose you're using the default 128 Kb recordsize and you write a file that is 160 Kb at the user level (128 Kb plus 32 Kb). The way recordsize is usually described implies that ZFS writes this on disk as two 128 Kb blocks, with the second one mostly empty.

It turns out that we can use zdb to find out the answer to this question and other interesting ones like it, and it's not even all that painful. My starting point was Bruning Questions: ZFS Record Size, which has an example of using zdb on a file in a test ZFS pool. We can actually do this with a test file on a regular pool, like so:

  • Create a test file:
    cd $HOME/tmp
    dd if=/dev/urandom of=testfile bs=160k count=1

    I'm using /dev/urandom here to defeat ZFS compression.

  • Use zdb -O to determine the object number of this file:
    ; zdb -O ssddata/homes cks/tmp/testfile
      Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     1075431    2   128K   128K   163K     512   256K  100.00  ZFS plain file

    (Your version of zdb may be too old to have the -O option, but it's in upstream Illumos and ZFS on Linux.)

  • Use zdb -ddddd to dump detailed information on the object:
    # zdb -ddddd ssddata/homes 1075431
         0  L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003
     20000  L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003
         segment [0000000000000000, 0000000000040000) size  256K

    See Bruning Questions: ZFS Record Size for information on what the various fields mean.

    (How many ds to use with the -d option for zdb is sort of like explosives; if it doesn't solve your problem, add more -ds until it does. This number of ds works with ZFS on Linux for me but you might need more.)

What we have here is two on-disk blocks. One is 0x20000 bytes long, or 128 KB; the other is 0x8400 bytes long, or 33 Kb. I don't know why it's 33 Kb instead of 32 Kb, especially since zdb will also report that the file has a size of 163840 (bytes), which is exactly 160 Kb as expected. It's not the ashift on this pool, because this is the pool I made a little setup mistake on so it has an ashift of 9.

Based on what we see here it certainly appears that ZFS will write a short block at the end of a file instead of forcing all blocks in the file to be 128 Kb once you've hit that point. However, note that this second block still has a logical size of 0x20000 bytes (128 Kb), so logically it covers the entire recordsize. This may be part of why it takes up 33 Kb instead of 32 Kb on disk.

That doesn't mean that the 128 Kb recordsize has no effect; in fact, we can show why you might care with a little experiment. Let's rewrite 16 Kb in the middle of that first 128 Kb block, and then re-dump the file layout details:

; dd if=/dev/urandom of=testfile conv=notrunc bs=16k count=1 seek=4
# zdb -ddddd ssddata/homes 1075431
     0  L0 0:73610c5a00:20000 20000L/20000P F=1 B=3694207/3694207
 20000  L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003

As you'd sort of expect from the description of recordsize, ZFS has not split the 128 Kb block up into some chunks; instead, it's done a read-modify-write cycle on the entire 128 Kb, resulting in an entirely new 128 Kb block and 128 Kb of read and write IO (at least at a logical level; at a physical level this data was probably in the ARC, since I'd just written the file in the first place).

Now let's give ZFS a slightly tricky case to see what it does. Unix files can have holes, areas where no data has been written; the resulting file is called a sparse file. Traditionally holes don't result in data blocks being allocated on disk; instead they're gaps in the allocated blocks. You create holes by writing beyond the end of file. How does ZFS represent holes? We'll start by making a 16 Kb file with no hole, then give it a hole by writing another 16 Kb at 96 Kb into the file.

; dd if=/dev/urandom of=testfile2 bs=16k count=1
# zdb -ddddd ssddata/homes 1078183
     0 L0 0:7330dcaa00:4000 4000L/4000P F=1 B=3694361/3694361

      segment [0000000000000000, 0000000000004000) size   16K

Now we add the hole:

; dd if=/dev/urandom of=testfile2 bs=16k count=1 seek=6 conv=notrunc
# zdb -ddddd ssddata/homes 1078183
     0 L0 0:73ea07a400:8200 1c000L/8200P F=1 B=3694377/3694377

      segment [0000000000000000, 000000000001c000) size  112K

The file started out as having one block of (physical on-disk) size 0x4000 (16 Kb). When we added the hole, it was rewritten to have one block of size 0x8200 (32.5 Kb), which represents 112 Kb of logical space. This is actually interesting; it means that ZFS is doing something clever to store holes that fall within what would normally be a single recordsize block. It's also suggestive that ZFS writes some extra data to the block over what we did (the .5 Kb), just as it did with the second block in our first example.

(The same thing happens if you write the second 16 Kb block at 56 Kb, so that you create a 64 Kb long file that would be one 64 Kb block if it didn't have a hole.)

Now that I've worked out how to use zdb for this sort of exploration, there's a number of questions about how ZFS stores files on disks that I want to look into at some point, including how compression interacts with recordsize and block sizes.

(I should probably also do some deeper exploration of what the various information zdb is reporting means. I've poked around with zdb before, but always in very 'heads down' and limited ways that didn't involve really understanding ZFS on-disk structures.)

Update: As pointed out by Robert Milkowski in the comments, I'm mistaken here and being fooled by compression being on in this filesystem. See ZFS's recordsize, holes in files, and partial blocks for the illustrated explanation of what's really going on.

Comments on this page:

By Robert Milkowski at 2017-09-26 05:59:07:

In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressable) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical.

Create a file system with compression=off and try again, and you will see that it ends up allocating two 128KB physical blocks as well.

Written on 26 September 2017.
« What I use printf for when hacking changes into programs
ZFS's recordsize, holes in files, and partial blocks »

Page tools: View Source, View Normal, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Sep 26 01:18:03 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.