2023-05-13
The paradox of ZFS ARC non-growth and ARC hit rates
We have one ZFS fileserver that sometimes spends quite a while (many hours) with a shrunken ARC size, one that's tens of gigabytes below its (shrunken) ARC target size. Despite that, its ARC hit rate is still really high. Well, actually, that's not surprising; it's kind of a paradox of ARC growth (for both the actual size and the target size). It comes from the combination of two obvious things: the ARC only grows when it needs to, and a high ARC hit rate means that the ARC isn't seeing much need to grow. More specifically, for reads the ARC only grows when there is a read ARC miss. If your ARC target size is 90 GB, your current ARC size is 40 GB, and your ARC hit rate is 100%, it doesn't matter that you have 50 GB of spare RAM, because the ARC has pretty much nothing to put in it.
This means that your ARC growth rate will usually be correlated with your ARC miss rate, or rather your ARC miss volume (which unfortunately I don't think there are kstats for). The other thing the ARC growth rate can be correlated with is your write volume (because many writes go into the ARC on their way to disk, although I'm not certain all of them do). However, ARC growth from write volume can be a transient thing; if you write something and then delete it, ZFS will first put it in the ARC and then drop it from the ARC.
(Deleting large amounts of data that was in the ARC is one way to rapidly drop the ARC size. If your ARC size shrinks rapidly without the target size shrinking, this is probably what's happened. This data may have been recently written, or it might have been read and then deleted.)
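If you want to check this on one of your own Linux systems, here's a minimal sketch that reads the arcstats kstats (which I go into in more detail in the entries below) and reports the ARC target size, the actual size, and the cumulative hit rate since boot. Treat it as an illustration, not polished tooling.

#!/usr/bin/python3
# Minimal sketch: report the ARC target size, actual size, and cumulative
# hit rate from the Linux arcstats kstats.

def arcstats(path="/proc/spl/kstat/zfs/arcstats"):
    stats = {}
    with open(path) as f:
        for line in list(f)[2:]:        # skip the two kstat header lines
            fields = line.split()
            if len(fields) == 3:
                stats[fields[0]] = int(fields[2])
    return stats

st = arcstats()
gib = float(1024 ** 3)
print("ARC target size (c): %.1f GiB" % (st["c"] / gib))
print("ARC actual size:     %.1f GiB" % (st["size"] / gib))
total = st["hits"] + st["misses"]
if total:
    # hits and misses are cumulative counters, so this is since boot.
    print("ARC hit rate:        %.2f%%" % (100.0 * st["hits"] / total))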
This is in a sense both obvious and general. All disk caches only increase their size while reading if there are cache misses; if they don't have cache misses, nothing happens. ZFS is only unusual in that we worry and obsess over the size of the ARC and how it fluctuates, rather than assuming that it will all just work (for good reasons, especially on Linux, but even on Solaris and later Illumos, the ZFS ARC size was by default constrained to much less than the regular disk cache might have grown to without ZFS).
2023-04-25
Understanding ZFS ARC hit (and miss) kstat statistics
The ZFS ARC exposes a number of kstat statistics about its hit and miss performance, which are obviously quite relevant for understanding if your ARC size and possibly its failure to grow are badly affecting you, or if your ARC hit rate is fine even with a smaller than expected ARC size. Complicating the picture are things like 'MFU hits' and 'MFU ghost hits', where it may not be clear how they relate to plain 'ARC hits'.
There are a number of different things that live in the ZFS ARC, each of which has its own size. Further, the disk blocks in the ARC (both 'data' and 'metadata') are divided between a Most Recently Used (MRU) portion and a Most Frequently Used (MFU) portion (I believe other things like headers aren't in either the MRU or MFU). As covered in eg ELI5: ZFS Caching, the MFU and MRU also have 'ghost' versions of themselves; to simplify, these track what would be in memory if the MFU (or MRU) portion used all of memory.
The MRU, MFU, and the ghost versions of themselves give us our first set of four hit statistics: 'mru_hits', 'mfu_hits', 'mru_ghost_hits', and 'mfu_ghost_hits'. These track blocks that were found in the real MRU or found in the real MFU, in which case they are actually in RAM, or found in the ghost MRU and MFU, in which case they weren't in RAM but theoretically could have been. As covered in ELI5: ZFS Caching, ZFS tracks the hit rates of the ghost MRU and MFU as signs for when to change the balance between the size of the MRU and MFU. If a block wasn't even in the ghost MFU or MRU, there is no specific kstat for it and we have to deduce that from comparing MRU and MFU ghost hits with general misses.
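As an illustration, here's a small sketch that prints these four kstats and does that deduction; it assumes (as I believe is the case) that ghost 'hits' are a subset of the overall misses.

#!/usr/bin/python3
# Sketch: print the MRU/MFU hit and ghost hit kstats and deduce how many
# misses weren't in either ghost list (there's no kstat for that directly).
lines = open("/proc/spl/kstat/zfs/arcstats").read().splitlines()[2:]
st = {f[0]: int(f[2]) for f in (l.split() for l in lines) if len(f) == 3}

for k in ("mru_hits", "mfu_hits", "mru_ghost_hits", "mfu_ghost_hits"):
    print("%-16s %d" % (k, st[k]))
# Ghost 'hits' are really ARC misses; what's left over are misses for
# blocks that weren't even in the ghost MRU or MFU.
nowhere = st["misses"] - (st["mru_ghost_hits"] + st["mfu_ghost_hits"])
print("misses not in either ghost list: %d" % nowhere)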
However, what we really care about for ARC hits and misses is whether the block actually was in the ARC (in RAM) or whether it had to be read off disk. This is what the general 'hits' and 'misses' kstats track, and they do this independently of the MRU and MFU hits (and ghost 'hits'). At this level, all hits and misses can be broken down into one of four categories: demand data, demand metadata, prefetch data, and prefetch metadata (more on this breakdown is in my entry on ARC prefetch stats). Each of these four has hit and miss kstats associated with it, named things like 'demand_data_misses'. As far as I understand it, a 'prefetch' hit or miss means that ZFS was trying to prefetch something and either already found it in the ARC or didn't. A 'demand' read is from ZFS needing it right away.
(This implies that the same ZFS disk block can be a prefetch miss, which reads it into the ARC from disk, and then later a demand hit, when the prefetching paid off and the actual read found it in the ARC.)
In the latest development version of OpenZFS, which will eventually become 2.2, there is an additional category of 'iohits'. An 'iohit' happens when ZFS wants a disk block that already has active IO issued to read it into the ARC, perhaps because there is active prefetching on it. Like 'hits' and 'misses', this has the four demand vs prefetch and data vs metadata counters associated with it. I'm not quite sure how these iohits are counted in OpenZFS 2.1, and some of them may slip through the cracks depending on the exact properties associated with the read (although the change that introduced iohits suggests that they may previously have been counted as 'hits').
If you want to see how your ARC is doing, you want to look at the overall hits and misses. The MRU and MFU hits, especially the 'ghost' hits (which are really misses), strike me as less interesting. If you have ARC misses happening (which leads to actual read IO) and you want to know roughly why, you want to look at the breakdown of the demand vs prefetch and data vs metadata 'misses' kstats.
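Here's a sketch of that breakdown, as a percentage of overall misses; the kstat names follow the 'demand_data_misses' pattern mentioned above.

#!/usr/bin/python3
# Sketch: break ARC misses down by demand vs prefetch and data vs metadata.
lines = open("/proc/spl/kstat/zfs/arcstats").read().splitlines()[2:]
st = {f[0]: int(f[2]) for f in (l.split() for l in lines) if len(f) == 3}

total = st["misses"]
for kind in ("demand_data", "demand_metadata",
             "prefetch_data", "prefetch_metadata"):
    m = st[kind + "_misses"]
    pct = (100.0 * m / total) if total else 0.0
    print("%-18s %12d misses  %5.1f%%" % (kind, m, pct))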
It's tempting to look at MRU and MFU ghost 'hits' as a percentage of misses, but I'm not sure this tells you much; it's certainly not very high on our fileservers.
Somewhat to my surprise, the sum of MFU and MRU hits is just slightly under the overall number of ARC 'hits' on all of our fileservers (which use ZoL 2.1). However, they're exactly the same on my desktops, which run the development version of ZFS on Linux and so have an 'iohits' kstat. So possibly in 2.1, you can infer the number of 'iohits' from the difference between overall hits and MRU + MFU hits.
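If that's right, a sketch of the inference looks like this; on a version with a real 'iohits' kstat you can just print it instead.

#!/usr/bin/python3
# Sketch: on OpenZFS 2.1, infer a would-be 'iohits' count as overall hits
# minus (MRU hits + MFU hits); on newer versions, use the real kstat.
lines = open("/proc/spl/kstat/zfs/arcstats").read().splitlines()[2:]
st = {f[0]: int(f[2]) for f in (l.split() for l in lines) if len(f) == 3}

if "iohits" in st:
    print("iohits (real kstat):", st["iohits"])
else:
    print("inferred 'iohits':", st["hits"] - (st["mru_hits"] + st["mfu_hits"]))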
(I evidently worked much of this out years ago since our ZFS ARC stats displays in our Grafana ZFS dashboards work this way, but I clearly didn't write it down back then. This time around, I'm fixing that for future me.)
2023-04-14
The various sizes of the ZFS ARC (as of OpenZFS 2.1)
The ZFS ARC is ZFS's version of a disk cache. Further general information on it can be found in two highly recommended sources, Brendan Gregg's 2012 Activity of the ZFS ARC and Allan Jude's FOSDEM 2019 ELI5: ZFS Caching (also, via). ZFS exposes a lot of information about the state of the ARC through kstats, but there isn't much documentation about what a lot of them mean. Today we're going to talk about some of the kstats related to the size of the ARC. I'll generally be using the Linux OpenZFS kstat names exposed in /proc/spl/kstat/zfs/arcstats.
The current ARC total size in bytes is size. The ARC is split into a Most Recently Used (MRU) portion and a Most Frequently Used (MFU) portion; the sizes of these two are mru_size and mfu_size. Note that the ARC may contain more than MRU and MFU data; it also holds other things, so size is not necessarily the same as the sum of mru_size and mfu_size.
The ARC caches both ZFS data (which includes not just file contents but also the data blocks of directories) and metadata (ZFS dnodes and other things). All space used by the ARC falls into one of a number of categories, which are accounted for in the following kstats:
data_size metadata_size bonus_size dnode_size dbuf_size hdr_size l2_hdr_size abd_chunk_waste_size
('abd' is short for 'ARC buffered data'. In Linux you can see kstats related to it in /proc/spl/kstat/zfs/abdstats.)
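As a cross-check, here's a sketch that adds these components up and compares the sum with 'size'; I believe they should normally agree (the appendix at the end of this entry says as much).

#!/usr/bin/python3
# Sketch: sum the ARC component size kstats and compare the total to 'size'.
lines = open("/proc/spl/kstat/zfs/arcstats").read().splitlines()[2:]
st = {f[0]: int(f[2]) for f in (l.split() for l in lines) if len(f) == 3}

parts = ("data_size", "metadata_size", "bonus_size", "dnode_size",
         "dbuf_size", "hdr_size", "l2_hdr_size", "abd_chunk_waste_size")
for p in parts:
    print("%-22s %15d" % (p, st[p]))
print("%-22s %15d" % ("sum of the above", sum(st[p] for p in parts)))
print("%-22s %15d" % ("size", st["size"]))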
Generally data_size and metadata_size will be the largest two components of the ARC size; I believe they cover data actually read off disk, with the other sizes being ZFS in-RAM data structures that are still included in the ARC. The l2_hdr_size will be zero if you have no L2ARC. There is also an arc_meta_used kstat; this rolls up everything except data_size and abd_chunk_waste_size as one number that is basically 'metadata in some sense'. This combined number is important because it's limited by arc_meta_limit.
(There is also an arc_dnode_limit, which I believe effectively limits dnode_size specifically, although dnode_size can go substantially over it under some circumstances.)
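A sketch of checking these limits (on OpenZFS 2.1; some of these kstats change or go away in later versions):

#!/usr/bin/python3
# Sketch: compare arc_meta_used to arc_meta_limit and dnode_size to
# arc_dnode_limit, and show arc_meta_used's relationship to 'size'.
lines = open("/proc/spl/kstat/zfs/arcstats").read().splitlines()[2:]
st = {f[0]: int(f[2]) for f in (l.split() for l in lines) if len(f) == 3}

mib = 1024 ** 2
print("arc_meta_used: %d MiB of a %d MiB arc_meta_limit" %
      (st["arc_meta_used"] // mib, st["arc_meta_limit"] // mib))
print("dnode_size:    %d MiB of a %d MiB arc_dnode_limit" %
      (st["dnode_size"] // mib, st["arc_dnode_limit"] // mib))
# arc_meta_used should be roughly size minus data_size and
# abd_chunk_waste_size, per the discussion above.
print("size - data_size - abd_chunk_waste_size: %d MiB" %
      ((st["size"] - st["data_size"] - st["abd_chunk_waste_size"]) // mib))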
When ZFS reads data from disk, in the normal configuration it stores it straight into the ARC in its on-disk form. This means that it may be compressed; even if you haven't turned on ZFS on disk compression for your data, ZFS uses it for metadata. The ARC has two additional sizes to reflect this; compressed_size is the size in RAM, and uncompressed_size is how much this would expand to if it was all uncompressed. There is also overhead_size, which, well, let's quote include/sys/arc_impl.h:
Number of bytes stored in all the arc_buf_t's. This is classified as "overhead" since this data is typically short-lived and will be evicted from the arc when it becomes unreferenced unless the zfs_keep_uncompressed_metadata or zfs_keep_uncompressed_level values have been set (see comment in dbuf.c for more information).
Things counted in overhead_size are not counted in the compressed and uncompressed size; they move back and forth in the code as their state changes. I believe that the compressed size plus the overhead size will generally be equal to data_size + metadata_size, ie both cover 'what is in RAM that has been pulled off disk', but in different forms.
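Here's a sketch of checking that belief on a live system:

#!/usr/bin/python3
# Sketch: compare compressed_size + overhead_size with
# data_size + metadata_size.
lines = open("/proc/spl/kstat/zfs/arcstats").read().splitlines()[2:]
st = {f[0]: int(f[2]) for f in (l.split() for l in lines) if len(f) == 3}

lhs = st["compressed_size"] + st["overhead_size"]
rhs = st["data_size"] + st["metadata_size"]
print("compressed_size + overhead_size: %d" % lhs)
print("data_size + metadata_size:       %d" % rhs)
print("difference:                      %d" % (lhs - rhs))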
Finally we get to the ARC's target size, the famous (or infamous) 'arc_c' or just 'c'. This is the target size of the ARC; if it is larger than size, the ARC will grow as you read (or write) things that aren't in it, and if it's smaller than size the ARC will shrink. The ARC's actual size can shrink for other reasons, but the target size shrinking is a slower and more involved thing to recover from.

In OpenZFS 2.1 and before, there is a second target size statistic, 'arc_p' or 'p' (in arcstats); this is apparently short for 'partition', and is the target size for the Most Recently Used (MRU) portion of the ARC. The target size for the MFU portion is 'c - p' and isn't explicitly put into kstats. How 'c' (and 'p') get changed is a complicated topic that is going in another entry.
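Putting the targets next to the actual sizes is straightforward; here's a sketch for OpenZFS 2.1 and earlier, where 'p' still has this meaning.

#!/usr/bin/python3
# Sketch: show the ARC targets (c, p, and c - p) next to the actual sizes.
# This applies to OpenZFS 2.1 and earlier.
lines = open("/proc/spl/kstat/zfs/arcstats").read().splitlines()[2:]
st = {f[0]: int(f[2]) for f in (l.split() for l in lines) if len(f) == 3}

gib = float(1024 ** 3)
print("c (ARC target):  %6.2f GiB   size:     %6.2f GiB" %
      (st["c"] / gib, st["size"] / gib))
print("p (MRU target):  %6.2f GiB   mru_size: %6.2f GiB" %
      (st["p"] / gib, st["mru_size"] / gib))
print("c - p (MFU tgt): %6.2f GiB   mfu_size: %6.2f GiB" %
      ((st["c"] - st["p"]) / gib, st["mfu_size"] / gib))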
(In the current development version of OpenZFS, there's a new and different approach to MFU/MRU balancing (via); this will likely be in OpenZFS 2.2, whenever that is released, and may appear in a system near you before then, depending. The new system is apparently better, but its kstats are more opaque.)
Appendix: The short form version
size              Current ARC size in bytes. It is composed of
                  data_size + metadata_size + bonus_size + dnode_size +
                  dbuf_size + hdr_size + l2_hdr_size + abd_chunk_waste_size.
arc_meta_used     All of size other than data_size + abd_chunk_waste_size;
                  'metadata' in a broad sense, as opposed to the narrow
                  sense of metadata_size.
mru_size          Size of the MRU portion of the ARC.
mfu_size          Size of the MFU portion of the ARC.
arc_meta_limit    Theoretical limit on arc_meta_used.
arc_dnode_limit   Theoretical limit on dnode_size.
c aka arc_c       The target for size.
p aka arc_p       The target for mru_size.
c - p             The target for mfu_size.
I believe that generally the following holds:
compressed_size + overhead_size = data_size + metadata_size
In OpenZFS 2.1 and earlier, there is no explicit target for MRU data as separate from MRU metadata. In OpenZFS 2.2, there will be.
2023-03-28
An interesting yet ordinary consequence of ZFS using the ZIL
On the Fediverse, Alan Coopersmith recently shared this:
@bsmaalders @cks writing a temp file and renaming it also avoids the failure-to-truncate issues found in screenshot cropping tools recently (#aCropalypse), but as some folks at work recently discovered, you need to be sure to fsync() before the rename, or a failure at the wrong moment can leave you with a zero-length file instead of the old one as the directory metadata can get written before the file contents data on ZFS.
On the one hand, this is perfectly ordinary behavior for a modern filesystem; often renames are synchronous and durable, but if you create a file, write it, and then rename it to something else, you haven't ensured that the data you wrote is on disk, just that the renaming is. On the other hand, as someone who's somewhat immersed in ZFS, this initially felt surprising to me, because ZFS is one of the rare filesystems that enforces a strict temporal order on all IO operations in its core IO model of ZFS transaction groups.
How this works is that everything that happens in a ZFS filesystem goes into a transaction group (TXG). At any given time there's only one open TXG and TXGs commit in order, so if B is issued after A, either it's in the same TXG as A (and the two happen together) or it's in a TXG after A and so A has already happened. With transaction groups, you can never have B happen but A not happen. In the TXG mental model of ZFS IO, this data loss is impossible, since the rename happened after the data write.
However, all of this strict TXG ordering goes out the window once you introduce the ZFS Intent Log (ZIL), because the ZIL's entire purpose is to persist selected operations to disk before they're committed as part of a transaction group. Renames and file creations always go in the ZIL (along with various other metadata operations), but file data only goes in the ZIL if you fsync() it (this is a slight simplification, and file data isn't necessarily directly in the ZIL).
So once the ZIL was in my mental model, I could understand what had happened. In effect the presence of the ZIL had changed ZFS from a filesystem with very strong data ordering properties to one with more ordinary ones, and in such a more ordinary filesystem you do need to fsync() your newly written file data to make it durable.
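As a concrete illustration of the pattern, here's a minimal sketch of 'write a temporary file, fsync() it, then rename it into place'; the file name and the temporary-name convention are made up for the example.

#!/usr/bin/python3
# Sketch: durably replace a file by writing a temporary file, fsync()ing it,
# and only then renaming it over the old name. Without the fsync(), a crash
# at the wrong moment can leave you with a zero-length file.
import os

def replace_file(path, data):
    tmp = path + ".tmp"            # hypothetical temporary name
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())       # the new data is on disk before the rename
    os.rename(tmp, path)
    # Optionally fsync() the directory as well if you want the rename itself
    # to be durable right away.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

replace_file("/tmp/example-file", b"new contents\n")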
(And under normal circumstances ZFS always has the ZIL, so I was engaging in a bit of skewed system programmer thinking.)
2022-11-13
I wouldn't use ZFS for swap (either for swapfiles or with a zvol)
As part of broadly charting how Linux finds where to write and read swap blocks, I recently noted that ZFS on Linux can't be used to hold a swapfile. David Magda noted that you could get around this by creating a zvol and using it for swap. While the Linux kernel will accept this and it works, at least to some extent, I wouldn't rely on it and I wouldn't do it unless I was desperate and had no other choice. Fundamentally, swapping to ZFS is not in accordance with what people (and often Unix kernels) expect from writing pages out to swap.
(On Linux, the Arch wiki has some definite cautions.)
Both the Unix kernel and Unix system administrators expect swapping pages out (and reading them back in) to be low overhead operations, things that are very close to writing a block of memory to some disk blocks or reading disk blocks into (preallocated) memory. This is not how ZFS works, even for writes to zvols. Due to ZFS's fundamental decision to never overwrite data in place, writing out blocks to a zvol requires allocating new space for them in the ZFS pool, collecting all of the relevant changes together, and then writing out a transaction group (or perhaps writing the blocks to a ZFS Intent Log). This more or less intrinsically requires allocating memory for various internal ZFS book-keeping, as well as obtaining various ZFS-related locks (for example, ones that protect the data structures that track free blocks). And it winds up doing a lot more IO than just the direct pages of memory being written to swap.
A lot of the time this will all work. Often you aren't swapping under heavy memory pressure, you're just pushing some unused pages out to swap and it's okay for this to take a while and allocate some extra memory and so on. But, regardless of whether it works, it's much more complicated than swapping is normally supposed to be and that makes it more chancy and less predictable. All of this leaves me feeling that swap to ZFS (and to anything similar to it) is for very unusual situations, not normal operation. If I didn't expect to really ever need swap on a system, I think I'd rather have no swap rather than swap on a zvol.
(ZFS isn't the only swap environment that has this problem. Swapping to a file on NFS has many of the same issues, and is also something that I can't recommend.)
ZFS is good for many things, but not everything, and one of the things it's not good at is very low overhead direct write IO. This is by design, since you can't combine it with copy on write and ZFS decided the latter was more important (and I agree with it).
2022-10-31
I wish ZFS supported per-user reservations, not just per-user quotas
ZFS supports a variety of ways to control space usage in filesystems. You can set a quota or a reservation on a filesystem, and you can set disk space and object count quotas on users, groups, and 'projects' in a filesystem. However, if you look at this list you'll notice an omission; you can't set a reservation for users, groups, or projects in a filesystem. There are some situations (at least in our world) where this would be convenient to have.
The most common case that comes up is that we have a bunch of people in a single filesystem, some of whom may fill up the filesystem by accident in the course of their work and others (such as professors) who we always want to be able to use some additional space so they can keep working. This is the ideal situation for a positive reservation instead of a negative quota, since what we want to put a limit on is the pool of space used by a group of people.
(The real ZFS answer is to put people who need reservations in their own filesystems because filesystems are cheap. But moving people from one filesystem to another is often rather disruptive and not trivial to coordinate, so often it doesn't get seriously contemplated until actual problems happen.)
OpenZFS has supported 'project quotas' since version 0.8.0, as covered in zfs-project(8) and zfs-projectspace(8). Project quotas can be used to give a single person (or group of people) a reservation in a filesystem, by putting their directories into a new project and then putting a project quota limit on the default project. However, you can't use this to give two people each a reservation of their own without putting quotas on each of them too, which is potentially (very) undesirable.
(ZFS project quotas appear to be in the current version of Illumos but I'm not sure when they appeared. It may have been added to the tree in August of 2019, per issue #11479.)
I don't have any personal experience with project quotas. Our Ubuntu ZFS fileservers are still running Ubuntu 18.04, which is too old to support them, and even once we upgrade to 22.04 we probably won't try it because of the various challenges of administering and managing them.
PS: Since ZFS supports project quotas, it also supports tracking space usage by 'project'. Here 'project' is basically 'whatever you want to tag with some unique identifier', which means that you could go through and tag every top level directory in a filesystem with a separate project ID so you could easily get reports on how much space is in use in each of them. Ordinary people probably just use 'du -hs'.
PPS: I think it would be reasonable to require the filesystem to have a reservation that was at least as big as the sum of all of the user reservations in it (or the user, group, and project ones if you wanted to support all of those).
2022-09-20
Why the ZFS ZIL's "in-place" direct writes of large data are safe
I recently read ZFS sync/async + ZIL/SLOG, explained (via), which reminded me that there's a clever but unsafe seeming thing that ZFS does here, that's actually safe because of how ZFS works. Today, I'm going to talk about why ZFS's "in-place" direct writes to main storage for large synchronous writes are safe, despite that perhaps sounding dangerous.
ZFS periodically flushes writes to disk as part of a ZFS transaction group; these days a transaction group commit happens every five seconds by default. However, sometimes programs want data to be sent to disk sooner than that (for example, your editor saving a file; it will sync the file in one way or another at the end, so that you don't lose it if there's a system crash immediately afterward). To do this, ZFS has the ZFS Intent Log (ZIL), which is a log of all write operations since the last transaction group where ZFS promised programs that the writes were durably on disk (to simplify a bit). If the system crashes before the writes can be sent to disk normally as part of a transaction group, ZFS can replay the ZIL to recreate them.
Taken by itself, this means that ZFS does synchronous writes twice, once to the ZIL as part of making them durable and then a second time as part of a regular transaction group. As an optimization, under the right circumstances (which are complicated, especially with a separate log device) ZFS will send those synchronous writes directly to their final destination in your ZFS pool, instead of to the ZIL, and then simply record a pointer to the destination in the ZIL. This sounds dangerous, since you're writing data directly into the filesystem (well, the pool) instead of into a separate log, and in a different filesystem it might be. What makes it safe in ZFS is that in ZFS, all writes go to unused (free) disk space because ZFS is what we generally call a copy-on-write system. Even if you're rewriting bits of an existing file, ZFS writes the new data to free space, not over the existing file contents (and it does this whether or not you're doing a synchronous write).
(ZFS does have to update some metadata in place, but it's a small amount of metadata and it's carefully ordered to make transaction group commits atomic. When doing these direct writes, ZFS also flushes your data to disk before it writes the ZIL that points to your data.)
Obviously, ZFS makes no assumptions about the contents of free disk space. This means that if your system crashes after ZFS has written your synchronous data into its final destination in what was free space until ZFS used it just now, but before it writes out a ZIL entry for it (and tells your editor or database that the data is safely on disk), no harm is done. No live data has been overwritten, and the change to what's in free space is unimportant (well, to ZFS, you may care a lot about the contents of the file that you were just a little bit late to save as power died).
Similarly, if your system crashes after the ZIL is written out but before the regular transaction group commits, the space your new data is written to is still marked as free at the regular ZFS level but the ZIL knows better. When the ZIL is replayed to apply all of the changes it records, your new data will be correctly connected to the overall ZFS pool (meta)data structures, the space will be marked as used, and so on.
(I've mentioned this area in the past when I wrote about the ZIL's optimizations for data writes, but at the time I explained its safety more concisely and somewhat in passing. And the ZFS settings and specific behavior I mentioned in that entry may now be out of date, since it's from almost a decade ago.)
2022-08-31
ZFS DVA offsets are in 512-byte blocks on disk but zdb misleads you about them
Yesterday I asserted that ZFS DVA offsets were in bytes, based primarily on using zdb to dump a znode and then read a data block using the offset that zdb printed. Over on Twitter, Matthew Ahrens corrected my misunderstanding:
The offset is stored on disk as a multiple of 512, see the DVA_GET_OFFSET() macro, which passes shift=SPA_MINBLOCKSHIFT=9. For human convenience, the DVA is printed in bytes (e.g. by zdb). So the on-disk format can handle up to 2^72 bytes (4 ZiB) per vdev.
... but the current software doesn't handle more than 2^64 bytes (16 EiB).
That is to say, when zdb prints ZFS DVAs it is not showing you the actual on-disk representation, or a lightly decoded version of it; instead the offset is silently converted from its on-disk form of 512-byte blocks to a version in bytes. I think that this is also true of other pieces of ZFS code that print DVAs as part of diagnostics, kernel messages, and so on. Based on lightly reading the code, I believe that the size of the DVA is also recorded on disk in 512-byte blocks, because zdb and other things use a similar C macro (DVA_GET_ASIZE()) when printing it.

(Both macros are #define'd in include/sys/spa.h.)
So, to summarize: on disk, ZFS DVA offsets are in units of 512-byte blocks, with offset 0 (on each disk) starting after a 4 Mbyte header. In addition, zdb prints offsets (and sizes) in units of bytes (in hexadecimal), not their on-disk 512-byte blocks, as (probably) do other things. If zdb says that a given DVA is '0:7ea00:400', that is a byte offset of 518656 bytes and a byte size of 1024 bytes. Zdb is decoding these for you from their on-disk form. If a kernel message talks about DVA '0:7ea00:400', it's also most likely using byte offsets, as zdb does.
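To make the units concrete, here's a small sketch that takes a DVA as zdb prints it, recovers the 512-byte block offset that's actually stored on disk, and works out where the data starts on a simple single-disk (or mirror) vdev, using the 4 Mbyte header mentioned above. The example DVA is the one from above.

#!/usr/bin/python3
# Sketch: convert a zdb-printed DVA ('vdev:offset:size', offset and size in
# hex bytes) to its on-disk 512-byte block offset, and to an absolute byte
# offset on a simple single-disk or mirror vdev (data starts after the
# 4 Mbyte header).
def parse_zdb_dva(dva):
    vdev, offset, size = (int(x, 16) for x in dva.split(":"))
    return vdev, offset, size

vdev, byte_off, byte_size = parse_zdb_dva("0:7ea00:400")
print("vdev %d, byte offset %d, byte size %d" % (vdev, byte_off, byte_size))
print("on-disk DVA offset field (512-byte blocks): %d" % (byte_off >> 9))
disk_off = byte_off + 0x400000        # skip the 4 Mbyte header
print("absolute disk offset: %d bytes, 512-byte block %d" %
      (disk_off, disk_off // 512))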
These DVA block offsets are always for 512-byte blocks. The 'block size' of the offset is fixed, and doesn't depend on the physical block size of the disk, the logical block size of the disk, or the ashift of the vdev.
Since 512 bytes is the block size for the minimum ashift, ZFS will never have to assign finer grained addresses than that, even if it's somehow dealing with a disk or other storage with smaller sized blocks. This makes using 512-byte 'blocks' completely safe. That the DVA offsets are in blocks is, in a sense, mostly a way of increasing how large a vdev can be (adding nine bits of size).
(This is not as crazy a concern as you might think, since DVA offsets in a raidz vdev cover the entire (raw) disk space of the vdev. If you want to allow, say, 32 disk raidz vdevs without size limits, the disk size limit is 1/32nd of the vdev size limit. That's still a very big disk by today's standards, but if you're building a filesystem format that you expect may be used for (say) 50 years, you want to plan ahead.)
I haven't looked at the OpenZFS code in depth to see how it handles DVA offsets in the current code. The comments in include/sys/spa.h make it sound like all interpretation of the contents of DVAs go through the macros in the file, including the offset. The only apparent way to get access to the offset is with the DVA_GET_OFFSET() macro, which converts it to a byte offset in the process; this suggests that much or all ZFS code probably passes around DVA offsets as byte offsets, not their on-disk block offset form.
(This is somewhat suggested by what Matthew Ahrens said about how the current software behaves; if it deals primarily or only with byte offsets, it's limited to vdevs of at most 2^64 bytes, although the disk format could accommodate larger ones. If all of the internal ZFS code deals with byte offsets, this might be part of why zdb prints DVAs as byte offsets; if you're going back and forth between a kernel debugger and zdb output, you want them to be in the same units.)
I'm disappointed that I was wrong (both yesterday and in the past), but at least I now have a definitive answer and I understand more about the situation.
2022-08-30
ZFS DVA offsets are in bytes, not (512-byte) blocks
In ZFS, a DVA (Device Virtual Address) is the equivalent of a block address in a regular filesystem. For our purposes today, the important thing is that a DVA tells you where to find data by a combination of the vdev (as a numeric index) and an offset into the vdev (and also a size). However, this description leaves a question open, which is what units are ZFS DVA offsets in. Back when I looked into the details of DVAs in 2017, the code certainly appeared to be treating the offset as being in bytes; however, various other sources have sometimes asserted that offsets are in units of 512-byte blocks. Faced with this uncertainty, today I decided to answer the question once and for all with some experimentation.
(One of the 'various sources' for the DVA offset being in 512-byte blocks is the "ZFS On-Disk Specification" that you can find copies of floating around on the Internet, eg currently here or this repository with the specification and additional information. See section 2.1.)
Update: This turns out to be wrong (or a misunderstanding). On disk, ZFS DVA offsets are stored as (512-byte) blocks, but tools like zdb print them as byte offsets. See ZFSDVAOffsetsInBytesII.
I'll start with the more or less full version of the experiment on a file-based ZFS pool.
# truncate --size 100m disk01
# zpool create tank disk01
# vi /tank/TESTFILE
[.. enter more than 512 bytes of text ...]
# sync
# zdb -vv -O tank TESTFILE
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         6    1   128K     1K     1K    512     1K  100.00  ZFS plain file
                                              176   bonus  System attributes
[...]
Indirect blocks:
               0 L0 0:7ea00:400 400L/400P F=1 B=49/49 cksum=[...]
This file is 1 KB (two 512-byte blocks) on disk (since it's not compressed; I deliberately didn't turn that on), and starts at offset 0x7ea00 aka 518656 in whatever unit that is. Immediately we have one answer; our 'disk01' data file only has 204800 512-byte blocks, so this cannot be a plain 512-byte block offset. However, if we just read at byte 518656 (block 1013), we won't succeed in finding our file data. Per various sources (eg ZFS Raidz Data Walk), there is a 4 MByte (0x400000 bytes) header that we also have to add in. That's 8192 512-byte blocks for the header, plus 1013 for the DVA offset gives us a block offset from the start of the file of 9205 512-byte blocks, so:
# dd if=disk01 bs=512 skip=9205 | sed 5q
2022-08-30
line 2 of zfs test file
line 3 of zfs test file
line 4 of zfs test file
line 5 of zfs test file
I've found my test file exactly where it should be.
Just to be sure, I also did this same experiment on our test fileserver, where the ZFS pool uses mirrored disk partitions (instead of a file). The answer is the same; treating the ZFS DVA offset as a byte offset and adding 4 MBytes gets me the right (byte) offset into the disk partition to find the file contents that should be there. Although I haven't verified it in the code, I would be very surprised if raidz or draid DVA offsets are any different (although raidz DVA offsets snake across all disks in the raidz).
(This experiment is obviously much harder if you have a dataset with compression turned on. I don't know if there's any easy way to get zdb to decompress a block from standard input. Modern versions of zdb can read ZFS blocks directly, with the -R option, but while useful this doesn't quite help answer the question here. I guess I could have strace'd zdb to see what offset it read the block from.)
(This is one of my ZFS uncertainties that has quietly nagged at me for years but that I can now finally put to bed.)
2022-08-21
ZFS DVAs and what they cover on raidz vdevs
In ZFS, a DVA (Device Virtual Address) is the equivalent of a block address in a regular filesystem. For our purposes today, the important thing is that a DVA tells you where to find data by a combination of the vdev (as a numeric index) and an offset into the vdev (and also a size). This implies, for example, that in mirrored vdevs, all mirrors of a block are at the same place on each disk, and that in raidz vdevs the offset is striped sequentially across all of your disks.
Recently I got confused about one bit of how DVA offsets work on raidz vdevs. On a raidz vdev, is the offset an offset into the logical data on the vdev (which is to say, ignoring and skipping over the space used by parity), or is it an offset into the physical data on the vdev (including parity space)?
The answer is that on raidz vdevs, DVA offsets cover the entire available disk space and include parity space. If you have a raidz vdev of five 1 GB disks, the vdev DVA offsets go from 0 to 5 GB (and by corollary, the parity level has no effect on the meaning of the offset). This is what is said by, for example, Max Bruning's RAIDZ On-Disk Format.
If you think about how parity works in raidz vdevs, this makes sense. In raidz, unlike conventional RAID5/6/7, the amount of parity and its location isn't set in advance, and you can make ZFS do 'short writes' (writes of less than the data width of your raidz vdev) that still force it to create full parity. If DVA offsets ignored and skipped over parity, going from an offset to an on disk location would be very complex and the maximum offset limit for a vdev could change. Having the DVA offset cover all disk space on the vdev regardless of what it's used for is much simpler (and it also allows you to address parity with ordinary ZFS IO mechanisms that take DVAs).
I don't know how DVA offsets work on draid vdevs. Since I haven't been successful at understanding the draid code's addressing in the past, I haven't tried delving into it this time around.
(I might have known this at some point when I first looked into ZFS DVAs, but if so I forgot it since then, and recently had a clever ZFS idea that turns out to not work because of this. Now that I've written this down, maybe I'll remember it for the next time around.)