ZFS's recordsize, holes in files, and partial blocks
Yesterday I wrote about using
zdb to peer into ZFS's on-disk
storage of files, and in particular I
wondered if you wrote a 160 Kb file, would ZFS really use two
128 Kb blocks for it. The answer appeared to be 'no', but I was
a little bit confused by some things I was seeing. In a comment,
Robert Milkowski set me right:
In your first case (160KB file with 128KB recordsize) it does actually create 2x 128KB blocks. However, because you have compression enabled, the 2nd 128KB block has 32KB of random data (non-compressible) and 96KB of 0s which nicely compresses. You can actually see it reported by zdb as 0x20000L/0x8400P (so 128KB logical and 33KB physical).
He suggested testing on a filesystem with compression off in order to see the true state of affairs. Having done so and done some more digging, he's correct and we can see some interesting things here.
The simple thing to report is the state of a 160 Kb file (the same as yesterday) on a filesystem without compression. This allocates two full 128 Kb blocks on disk:
        0 L0 0:53a40ed000:20000 20000L/20000P F=1 B=19697368/19697368
    20000 L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368

    segment [0000000000000000, 0000000000040000) size 256K
These are 0x20000 bytes long on disk and the physical size is no different from the logical size. The file size in the dnode is reported as 163840 bytes, and presumably ZFS uses this to know when to return EOF as we read the second block.
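Since all of these sizes are in hex, it can help to pull the fields out mechanically. What follows is only a minimal sketch that assumes the line layout shown in the output above (file offset, level, DVA, then logical and physical sizes); the function name and the dictionary keys are my own invention, not anything that zdb provides.

```python
import re

# Matches block lines shaped like the zdb output quoted above, e.g.
#   20000 L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368
# The field layout is an assumption based on that output alone.
BLOCK_LINE = re.compile(
    r"^\s*(?P<offset>[0-9a-f]+)\s+L(?P<level>\d+)\s+"
    r"(?P<vdev>\d+):(?P<dva_offset>[0-9a-f]+):(?P<dva_size>[0-9a-f]+)\s+"
    r"(?P<lsize>[0-9a-f]+)L/(?P<psize>[0-9a-f]+)P"
)


def parse_block_line(line):
    """Return the offset and sizes (in bytes) from one block line, or None."""
    m = BLOCK_LINE.match(line)
    if m is None:
        return None  # e.g. the 'segment [...]' summary lines
    return {
        "file_offset": int(m["offset"], 16),
        "level": int(m["level"]),
        "lsize": int(m["lsize"], 16),  # logical size
        "psize": int(m["psize"], 16),  # physical (on-disk) size
    }


info = parse_block_line(
    "20000 L0 0:53a410d000:20000 20000L/20000P F=1 B=19697368/19697368"
)
print(info["lsize"] // 1024, info["psize"] // 1024)  # prints "128 128"
```

Feeding it a compressed block line such as '40000 L0 0:8a2c4cec00:8400 20000L/8400P ...' gives a physical size of 33792 bytes, which is the 33 Kb figure.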
One consequence of this is that it's beneficial to turn on compression even for filesystems with uncompressible data, because doing so gets you 'compression' of partial blocks (by compressing those zero bytes). On the filesystem without compression, that 32 Kb of uncompressible data forced the allocation of 128 Kb of space; on the filesystem with compression, the same 32 Kb of data only required 33 Kb of space.
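The arithmetic behind this is simple enough to write down. This is a sketch of the space accounting for the uncompressed case only, assuming (as the zdb output above suggests) that a file bigger than one record is stored entirely in full recordsize blocks; the function name is hypothetical and it ignores metadata such as the L1 indirect block.

```python
def uncompressed_tail_slack(file_size, recordsize=128 * 1024):
    """Return (block count, bytes allocated, unused bytes in the last block).

    Only models files larger than one record on a filesystem with
    compression off; smaller files are stored differently.
    """
    if file_size <= recordsize:
        raise ValueError("this sketch only covers multi-record files")
    blocks = -(-file_size // recordsize)  # ceiling division
    allocated = blocks * recordsize
    return blocks, allocated, allocated - file_size


blocks, allocated, slack = uncompressed_tail_slack(160 * 1024)
print(blocks, allocated // 1024, slack // 1024)  # prints "2 256 96"
```

That 96 Kb of slack is what compression squeezes down to almost nothing when it's turned on.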
A more interesting test file has holes that cover an entire recordsize
block. Let's make one that has 128 Kb of data, skips the second 128 Kb
block entirely, has 32 Kb of data at the end of the third 128 Kb block,
skips the fourth 128 Kb block, and has 32 Kb of data at the end of the
fifth 128 Kb block. Set up with
dd, this is:
    dd if=/dev/urandom of=testfile2 bs=128k count=1
    dd if=/dev/urandom of=testfile2 bs=32k seek=11 count=1 conv=notrunc
    dd if=/dev/urandom of=testfile2 bs=32k seek=19 count=1 conv=notrunc
Up until now I've been omitting the output for the L1 indirect block that contains block information for the L0 blocks. With it included, the file's data blocks look like this:
    # zdb -vv -O ssddata/homes cks/tmp/testfile2
    [...]
    Indirect blocks:
            0 L1 0:8a2c4e2c00:400 20000L/400P F=3 B=3710016/3710016
            0 L0 0:8a4afe7e00:20000 20000L/20000P F=1 B=3710011/3710011
        40000 L0 0:8a2c4cec00:8400 20000L/8400P F=1 B=3710015/3710015
        80000 L0 0:8a2c4da800:8400 20000L/8400P F=1 B=3710016/3710016

    segment [0000000000000000, 0000000000020000) size 128K
    segment [0000000000040000, 0000000000060000) size 128K
    segment [0000000000080000, 00000000000a0000) size 128K
The blocks at 0x20000 and 0x60000 are missing entirely; these are genuine holes. The blocks at 0x40000 and 0x80000 are 128 Kb logical but less physical, and are presumably compressed. Can we tell for sure? The answer is yes:
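You can predict which block offsets should be holes from the dd commands alone. Here is a small sketch that maps each write (byte offset and length, so seek times bs for dd) onto 128 Kb block numbers; the helper names are made up for illustration, and it assumes every block is a full recordsize block.

```python
def blocks_touched(writes, recordsize=128 * 1024):
    """Return the set of recordsize block numbers that any write lands in."""
    touched = set()
    for offset, length in writes:
        first = offset // recordsize
        last = (offset + length - 1) // recordsize
        touched.update(range(first, last + 1))
    return touched


def hole_offsets(writes, recordsize=128 * 1024):
    """Return byte offsets of blocks below the end of file that got no data."""
    touched = blocks_touched(writes, recordsize)
    return [n * recordsize for n in range(max(touched) + 1) if n not in touched]


K = 1024
# The three dd commands: bs=128k count=1; bs=32k seek=11; bs=32k seek=19.
writes = [(0, 128 * K), (11 * 32 * K, 32 * K), (19 * 32 * K, 32 * K)]
print([hex(off) for off in hole_offsets(writes)])  # prints "['0x20000', '0x60000']"
```

The blocks that do get data come out as block numbers 0, 2 and 4, which are offsets 0x0, 0x40000 and 0x80000, matching the zdb output.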
    # zdb -vv -bbbb -O ssddata/homes cks/tmp/testfile2
    [...]
            0 L1 DVA=<0:8a2c4e2c00:400> DVA=<0:7601b4be00:400> [L1 ZFS plain file] fletcher4 lz4 [...]
            0 L0 DVA=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed [...]
        40000 L0 DVA=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
        80000 L0 DVA=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 [...]
(That we need to use both -vv and -bbbb here is due to how
zdb's code is set up, and it's rather a hack to get what we want.
I had to read the zdb source code to work out how to make it work.)
Among other things (which I've omitted here), this shows us that the 0x40000 and 0x80000 blocks are compressed with lz4, while the 0x0 block is uncompressed (which is what we expect from 128 Kb of random data). ZFS always compresses metadata (or at least tries to), so the L1 indirect block is also compressed with lz4.
This shows us that sparse files benefit from compression being turned on even if they contain uncompressible data. If this was a filesystem with compression off, the blocks at 0x40000 and 0x80000 would each have used 128 Kb of space, not the 33 Kb of space that they did here. ZFS filesystem compression thus helps space usage both for trailing data (which is not uncommon) and for sparse files (which may be relatively rare on your filesystems).
It's sometimes possible to dump the block contents of things like
L1 indirect blocks, so you can see a more direct representation
of them. This is where it's important to know that metadata is
compressed, so we can ask
zdb to decompress it with a magic block flag:
    # zdb -R ssddata 0:8a2c4e2c00:400:id
    [...]
    DVA=<0:8a4afe7e00:20000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=20000L/20000P birth=3710011L/3710011P fill=1 cksum=3fcb4949b1aa:ff8a4656f2b87fd:d375da58a32c3eee:73a5705b7851bb59
    HOLE [L0 unallocated] size=200L birth=0L
    DVA=<0:8a2c4cec00:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710015L/3710015P fill=1 cksum=1079fbeda2c0:117fba0118c39e9:3534e8d61ddb372b:b5f0a9e59ccdcb7b
    HOLE [L0 unallocated] size=200L birth=0L
    DVA=<0:8a2c4da800:8400> [L0 ZFS plain file] fletcher4 lz4 unencrypted LE contiguous unique single size=20000L/8400P birth=3710016L/3710016P fill=1 cksum=10944482ae3e:11830a40138e0c8:2f1dbd6afa0ee9b4:7d3d6b2c247ae44
    HOLE [L0 unallocated] size=200L birth=0L
    [...]
Here we can see the direct representation of the L1 indirect block with explicit holes between our allocated blocks. (This is a common way of representing holes in sparse files; most filesystems have some variant of it.)
PS: I'm not using 'zdb -ddddd' today because when I dug deeper
into zdb, I discovered that 'zdb -O' would already report this
information when given the right arguments, thereby saving me an
extra step.

Sidebar: Why you can't always dump blocks with zdb -R
To decompress a (ZFS) block, you need to know what it's compressed
with and its uncompressed size. This information is stored in
whatever metadata points to the block, not in the block itself, and
zdb -R simply guesses repeatedly until it gets a
result that appears to work out right:
    # zdb -R ssddata 0:8a2c4e2c00:400:id
    Found vdev type: mirror
    Trying 00400 -> 00600 (inherit)
    Trying 00400 -> 00600 (on)
    Trying 00400 -> 00600 (uncompressed)
    Trying 00400 -> 00600 (lzjb)
    Trying 00400 -> 00600 (empty)
    Trying 00400 -> 00600 (gzip-1)
    Trying 00400 -> 00600 (gzip-2)
    [...]
    Trying 00400 -> 20000 (lz4)
    DVA=<0:8a4afe7e00:20000> [...]
The result that zdb -R gets may or may not actually be correct,
and thus may or may not give you the actual decompressed block data.
Here it worked; at other times I've tried it, not so much. The last
'Trying' line that zdb -R prints is the one it thinks is correct, so
you can at least see if it got it right (here, for example, we know
that it did, since it picked lz4 with a true logical size of 0x20000
and that's what the metadata we have about the L1 indirect block says).
Ideally zdb -R would gain a way of specifying the compression
algorithm and the logical size for the d block flag. Perhaps someday.
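To make the guessing concrete, here is a loose Python analogue of that sort of loop. It is emphatically not zdb's code: it uses the bz2 and zlib codecs from the Python standard library as stand-ins for ZFS's compression algorithms, and the acceptance test (the output decompresses cleanly and fits in the candidate logical size) is my own simplification. The function name and the 512-byte step are likewise assumptions.

```python
import bz2
import zlib

SECTOR = 512

# Stand-in codecs; ZFS's real algorithms (lz4, lzjb, gzip-N, ...) are
# not used here.
CODECS = [("bz2", bz2.decompress), ("zlib", zlib.decompress)]


def guess_decompress(physical, max_lsize=128 * 1024):
    """Try every (logical size, codec) pair and return the first that fits.

    Returns (codec name, guessed logical size, data) or None.
    """
    psize = -(-len(physical) // SECTOR) * SECTOR  # round up to a sector
    for lsize in range(psize + SECTOR, max_lsize + 1, SECTOR):
        for name, decompress in CODECS:
            try:
                out = decompress(physical)
            except Exception:
                continue  # wrong codec for this data; keep guessing
            if len(out) <= lsize:
                # Plausible: pad with zeros up to the guessed logical size.
                return name, lsize, out.ljust(lsize, b"\0")
    return None


data = bytes(range(256)) * 4 + bytes(3072)  # 1 Kb of data, 3 Kb of zeros
name, lsize, out = guess_decompress(zlib.compress(data))
print(name, lsize, out == data)  # prints "zlib 4096 True"
```

The first guess that merely looks plausible wins, and nothing in the block itself confirms it, which is the basic reason the result isn't guaranteed to be right.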
Using zdb to peer into how ZFS stores files on disk
All files are stored either as a single block of varying sizes (up to the recordsize) or using multiple recordsize blocks.
For reasons beyond the scope of this entry, I was wondering if this was actually true. Specifically, suppose you're using the default 128 Kb recordsize and you write a file that is 160 Kb at the user level (128 Kb plus 32 Kb). The way recordsize is usually described implies that ZFS writes this on disk as two 128 Kb blocks, with the second one mostly empty.
It turns out that we can use
zdb to find out the answer to this
question and other interesting ones like it, and it's not even all
that painful. My starting point was Bruning Questions: ZFS Record
Size, which has an example of using zdb on a file in a test ZFS pool.
We can actually do this with a test file on a regular pool, like so:
- Create a test file:
    cd $HOME/tmp
    dd if=/dev/urandom of=testfile bs=160k count=1
  I'm using /dev/urandom here to defeat ZFS compression.
- Use zdb -O to determine the object number of this file:
    ; zdb -O ssddata/homes cks/tmp/testfile
     Object  lvl  iblk  dblk  dsize  dnsize  lsize   %full  type
    1075431    2  128K  128K   163K     512   256K  100.00  ZFS plain file
  (Your version of zdb may be too old to have the -O option, but it's in upstream Illumos and ZFS on Linux.)
- Use zdb -ddddd to dump detailed information on the object:
    # zdb -ddddd ssddata/homes 1075431
    [...]
        0 L0 0:7360fc5a00:20000 20000L/20000P F=1 B=3694003/3694003
    20000 L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003

    segment [0000000000000000, 0000000000040000) size 256K
  See Bruning Questions: ZFS Record Size for information on what the various fields mean.
  (How many ds to use with the -d option to zdb is sort of like explosives; if it doesn't solve your problem, add more
  -ds until it does. This number of
  ds works with ZFS on Linux for me but you might need more.)
What we have here is two on-disk blocks. One is 0x20000 bytes long,
or 128 Kb; the other is 0x8400 bytes long, or 33 Kb. I don't know
why it's 33 Kb instead of 32 Kb, especially since
zdb will also
report that the file has a
size of 163840 (bytes), which is exactly
160 Kb as expected. It's not the
ashift on this pool, because
this is the pool I made a little setup mistake on so it has an
ashift of 9.
Based on what we see here it certainly appears that ZFS will write a short block at the end of a file instead of forcing all blocks in the file to be 128 Kb once you've hit that point. However, note that this second block still has a logical size of 0x20000 bytes (128 Kb), so logically it covers the entire recordsize. This may be part of why it takes up 33 Kb instead of 32 Kb on disk.
That doesn't mean that the 128 Kb recordsize has no effect; in fact, we can show why you might care with a little experiment. Let's rewrite 16 Kb in the middle of that first 128 Kb block, and then re-dump the file layout details:
    ; dd if=/dev/urandom of=testfile conv=notrunc bs=16k count=1 seek=4
    # zdb -ddddd ssddata/homes 1075431
    [...]
        0 L0 0:73610c5a00:20000 20000L/20000P F=1 B=3694207/3694207
    20000 L0 0:73e6826c00:8400 20000L/8400P F=1 B=3694003/3694003
As you'd sort of expect from the description of recordsize, ZFS has not split the 128 Kb block up into some chunks; instead, it's done a read-modify-write cycle on the entire 128 Kb, resulting in an entirely new 128 Kb block and 128 Kb of read and write IO (at least at a logical level; at a physical level this data was probably in the ARC, since I'd just written the file in the first place).
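The cost of this is easy to put a number on. This is a back-of-the-envelope sketch, assuming every record a write touches is rewritten in full at the recordsize, which is what the new block address and birth time in the output above suggest; the function name is hypothetical.

```python
def rewrite_cost(offset, length, recordsize=128 * 1024):
    """Return (records touched, logical bytes rewritten, amplification)."""
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    records = last - first + 1
    rewritten = records * recordsize
    return records, rewritten, rewritten / length


# The dd above: bs=16k seek=4, so a 16 Kb write at offset 64 Kb.
print(rewrite_cost(4 * 16 * 1024, 16 * 1024))  # prints "(1, 131072, 8.0)"
```

So a 16 Kb overwrite becomes 128 Kb of logical write IO, an eightfold amplification. A write that straddled a record boundary would touch two records and do worse.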
Now let's give ZFS a slightly tricky case to see what it does. Unix files can have holes, areas where no data has been written; the resulting file is called a sparse file. Traditionally holes don't result in data blocks being allocated on disk; instead they're gaps in the allocated blocks. You create holes by writing beyond the end of file. How does ZFS represent holes? We'll start by making a 16 Kb file with no hole, then give it a hole by writing another 16 Kb at 96 Kb into the file.
    ; dd if=/dev/urandom of=testfile2 bs=16k count=1
    # zdb -ddddd ssddata/homes 1078183
    [...]
        0 L0 0:7330dcaa00:4000 4000L/4000P F=1 B=3694361/3694361

    segment [0000000000000000, 0000000000004000) size 16K
Now we add the hole:
    ; dd if=/dev/urandom of=testfile2 bs=16k count=1 seek=6 conv=notrunc
    [...]
    # zdb -ddddd ssddata/homes 1078183
    [...]
        0 L0 0:73ea07a400:8200 1c000L/8200P F=1 B=3694377/3694377

    segment [0000000000000000, 000000000001c000) size 112K
The file started out as having one block of (physical on-disk) size 0x4000 (16 Kb). When we added the hole, it was rewritten to have one block of size 0x8200 (32.5 Kb), which represents 112 Kb of logical space. This is actually interesting; it means that ZFS is doing something clever to store holes that fall within what would normally be a single recordsize block. It's also suggestive that ZFS writes some extra data to the block over what we did (the .5 Kb), just as it did with the second block in our first example.
(The same thing happens if you write the second 16 Kb block at 48 Kb, so that you create a 64 Kb long file that would be one 64 Kb block if it didn't have a hole.)
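The logical sizes zdb reports in these single-block cases are just the file length. As a sketch, assuming a file no larger than the recordsize gets one block whose logical size is the file size rounded up to a 512-byte multiple (the rounding is my assumption; the sizes here are already multiples of 512, so this output can't confirm it):

```python
def single_block_lsize(file_size, recordsize=128 * 1024, sector=512):
    """Return the logical size of a one-block file, or None if it needs more."""
    if file_size > recordsize:
        return None  # stored as multiple recordsize blocks instead
    return -(-file_size // sector) * sector  # round up to a sector multiple


K = 1024
print(hex(single_block_lsize(16 * K)))           # prints "0x4000"
print(hex(single_block_lsize(96 * K + 16 * K)))  # prints "0x1c000"
```

The second result is the 1c000L that zdb reported once the hole was added, since the file now ends at 96 Kb plus 16 Kb.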
Now that I've worked out how to use
zdb for this sort of exploration,
there's a number of questions about how ZFS stores files on disks
that I want to look into at some point, including how compression
interacts with recordsize and block sizes.
(I should probably also do some deeper exploration of what the
information that zdb is reporting means. I've poked around with
zdb before, but always in very 'heads
down' and limited ways that didn't involve really understanding
ZFS on-disk structures.)
Update: As pointed out by Robert Milkowski in the comments,
I'm mistaken here and being fooled by compression being on in this
filesystem. See ZFS's
recordsize, holes in files, and partial blocks for the illustrated explanation of
what's really going on.
Looking back at my mixed and complicated feelings about Solaris
So Oracle killed Solaris (and SPARC) a couple of weeks ago. I can't say this is surprising, although it's certainly sudden and underhanded in the standard Oracle way. Back when Oracle killed Sun, I was sad for the death of a dream, despite having had ups and downs with Sun over the years. My views about the death of Solaris are more mixed and complicated, but I will summarize them by saying that I don't feel very sad about Solaris itself (although there are things to be sad about).
To start with, Solaris has been dead for me for a while, basically ever since Oracle bought Sun and certainly since Oracle closed the Solaris source. The Solaris that the CS department used for years in a succession of fileservers was very much a product of Sun the corporation, and I could never see Oracle's Solaris as the same thing or as a successor to it. Hearing that Oracle was doing things with Solaris was distant news; it had no relevance for us and pretty much everyone else.
(Every move Oracle made after absorbing Sun came across to me as a 'go away, we don't want your business or to expand Solaris usage' thing.)
But that's the smaller piece, because I have some personal baggage and biases around Solaris itself due to my history. I started using Sun hardware in the days of SunOS, where SunOS 3 was strikingly revolutionary and worked pretty well for the time. It was followed by SunOS 4, which was also quietly revolutionary even if the initial versions had some unfortunate performance issues on our servers (we ran SunOS 4.1 on a 4/490, complete with an unfortunate choice of disk interconnect). Then came Solaris 2, which I've described as a high speed collision between SunOS 4 and System V R4.
To people reading this today, more than a quarter century removed, this probably sounds like a mostly neutral thing or perhaps just messy (since I did call it a collision). But at the time it was a lot more. In the old days, Unix was split into two sides, the BSD side and the AT&T System III/V side, and I was firmly on the BSD side along with many other people at universities; SunOS 3 and SunOS 4 and the version of Sun that produced them were basically our standard bearers, not only for BSD's superiority at the time but also their big technical advances like NFS and unified virtual memory. When Sun turned around and produced Solaris 2, it was viewed as being tilted towards being a System V system, not a BSD system. Culturally, there was a lot of feeling that this was a betrayal and Sun had debased the nice BSD system they'd had by getting a lot of System V all over it. It didn't help that Sun was unbundling the compilers around this time, in an echo of the damage AT&T's Unix unbundling did.
(Solaris 2 was Sun's specific version of System V Release 4, which itself was the product of Sun and AT&T getting together to slam System V and BSD together into a unified hybrid. The BSD side saw System V R4 as 'System V with some BSD things slathered over top', as opposed to 'BSD with some System V things added'. This is probably an unfair characterization at a technical level, especially since SVR4 picked up a whole bunch of important BSD features.)
Had I actually used Solaris 2, I might have gotten over this cultural message and come to like and feel affection for Solaris. But I never did; our 4/490 remained on SunOS 4 and we narrowly chose SGI over Sun, sending me on a course to use Irix until we started switching to Linux in 1999 (at which point Sun wasn't competitive and Solaris felt irrelevant as a result). By the time I dealt with Solaris again in 2005, open source Unixes had clearly surpassed it for sysadmin usability; they had better installers, far better package management and patching, and so on. My feelings about Solaris never really improved from there, despite increasing involvement and use, although there were aspects I liked and of course I am very happy that Sun created ZFS, put it into Solaris 10, and then released it to the world as open source so that it could survive the death of Sun and Solaris.
The summary of all of that is that I'm glad that Sun created a number of technologies that wound up in successive versions of Solaris and I'm glad that Sun survived long enough to release them into the world, but I don't have fond feelings about Solaris itself the way that many people who were more involved with it do. I cannot mourn the death of Solaris itself the way I could for Sun, because for me Solaris was never a part of any dream.
(One part of that is that my dream of Unix was the dream of workstations, not the dream of servers. By the time Sun was doing interesting things with Solaris 10, it was clearly not the operating system of the Unix desktop any more.)