What ZFS gang blocks are and why they exist

January 6, 2018

If you read up on ZFS internals, sooner or later you will run across references to 'gang blocks'. For instance, they came up when I talked about what's in a DVA, where DVAs have a flag to say that they point to a gang block instead of a regular block. Gang blocks are vaguely described as being a way of fragmenting a large logical block into a bunch of separate sub-blocks.
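
As a concrete note on that flag: as far as I can see in the ZFS source, the gang bit is the top bit of the second 64-bit word of a DVA, read by the DVA_GET_GANG() macro in spa.h. Here's a minimal C sketch of that read, with the DVA struct simplified for illustration.

    /* A minimal sketch, not actual ZFS code: reading the gang bit out
     * of a DVA.  In the real source this is the DVA_GET_GANG() macro
     * in spa.h, which extracts bit 63 of the DVA's second word. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct dva {
        uint64_t dva_word[2];
    } dva_t;

    static inline int dva_is_gang(const dva_t *dva)
    {
        return (int)((dva->dva_word[1] >> 63) & 1);
    }

    int main(void)
    {
        dva_t d = { { 0, 1ULL << 63 } };    /* gang bit set by hand */
        printf("gang: %d\n", dva_is_gang(&d));
        return 0;
    }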

A more on-point description can be found in the (draft) ZFS on-disk specification (available as a PDF) or in the source code comments about them in zio.c. I'll selectively quote from zio.c because it's easier to follow:

A gang block is a collection of small blocks that looks to the DMU like one large block. When zio_dva_allocate() cannot find a block of the requested size, due to either severe fragmentation or the pool being nearly full, it calls zio_write_gang_block() to construct the block from smaller fragments.

A gang block consists of a gang header and up to three gang members. The gang header is just like an indirect block: it's an array of block pointers. It consumes only one sector and hence is allocatable regardless of fragmentation. The gang header's bps point to its gang members, which hold the data.

[...]

Gang blocks can be nested: a gang member may itself be a gang block. Thus every gang block is a tree in which root and all interior nodes are gang headers, and the leaves are normal blocks that contain user data. The root of the gang tree is called the gang leader.
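
To make the tree shape concrete, here's a small illustrative sketch. The types are invented stand-ins, not ZFS's real block pointers (the real gang tree handling is in zio.c's zio_gang_* functions); all this does is walk a gang tree, recursing through gang headers down to the leaf data blocks.

    /* Illustrative only: invented types, not ZFS's blkptr_t.  Gang
     * headers are the interior nodes of the tree; the leaves are
     * normal blocks holding user data. */
    #include <stdio.h>

    typedef struct toy_bp {
        int is_gang;                  /* is this a gang header? */
        int nmembers;                 /* member count, if a header */
        struct toy_bp *members[3];    /* at most three per header */
        int data_len;                 /* data bytes, if a leaf */
    } toy_bp_t;

    /* Total up the user data under a gang tree by recursing through
     * gang headers down to the leaf (data) blocks. */
    static int gang_tree_data(const toy_bp_t *bp)
    {
        if (!bp->is_gang)
            return bp->data_len;
        int total = 0;
        for (int i = 0; i < bp->nmembers; i++)
            total += gang_tree_data(bp->members[i]);
        return total;
    }

    int main(void)
    {
        /* the nested 128 Kb example from later in this entry: a gang
         * leader with two 32 Kb leaves plus a nested gang header
         * holding two more 32 Kb leaves */
        toy_bp_t leaf = { 0, 0, { NULL }, 32 * 1024 };
        toy_bp_t inner = { 1, 2, { &leaf, &leaf }, 0 };
        toy_bp_t leader = { 1, 3, { &leaf, &leaf, &inner }, 0 };
        printf("%d bytes of data\n", gang_tree_data(&leader));
        return 0;
    }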

A 'gang header' contains three full block pointers, some padding, and then a trailing checksum. The whole thing is sized so that it takes up only a single 512-byte sector; I believe this means that gang headers in ashift=12 vdevs waste a bunch of space, or at least leave the remaining 3.5 Kb unused.
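
In C terms the layout looks roughly like this. The names and sizes are modeled on zio_gbh_phys_t in the ZFS source (spa.h), but take it as an illustrative sketch rather than a verbatim copy of the real header.

    /* A sketch of the on-disk gang header layout, modeled on
     * zio_gbh_phys_t in the ZFS source but simplified here. */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct blkptr {
        uint64_t bp_word[16];         /* a blkptr_t is 128 bytes */
    } blkptr_t;

    typedef struct zio_eck {          /* trailing embedded checksum */
        uint64_t zec_magic;           /* for validation and endianness */
        uint64_t zec_cksum[4];        /* 256-bit checksum */
    } zio_eck_t;                      /* 40 bytes */

    typedef struct zio_gbh {
        blkptr_t  zg_blkptr[3];       /* three block pointers: 384 bytes */
        uint64_t  zg_filler[11];      /* padding: 88 bytes */
        zio_eck_t zg_tail;            /* checksum tail: 40 bytes */
    } zio_gbh_phys_t;                 /* 384 + 88 + 40 = 512 bytes */

    int main(void)
    {
        /* should print 512 on typical LP64 platforms */
        printf("%zu\n", sizeof(zio_gbh_phys_t));
        return 0;
    }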

To understand more about gang blocks, we need to understand why they're needed. As far as I know, this comes down to the fact that ZFS files only ever have a single (logical) block size. If a file is smaller than the recordsize (usually 128 Kb), it's stored in a single logical block of the appropriate power-of-two size; once it reaches recordsize or larger, it's stored in a number of recordsize'd blocks. This means that writing new data to most files normally requires allocating a contiguous block of some size (up to 128 Kb, although less if the data you're writing is compressible).
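
As a toy restatement of that rule (this is not ZFS code, just the behavior as I described it, including the power-of-two rounding for small files):

    /* A toy restatement of the block size rule described above: files
     * under recordsize get one block, files at or over recordsize get
     * a series of recordsize'd blocks.  Not ZFS code. */
    #include <stdio.h>
    #include <stdint.h>

    #define RECORDSIZE (128 * 1024)   /* the usual default */

    static void describe(uint64_t filesize)
    {
        if (filesize < RECORDSIZE) {
            /* one logical block, rounded up to a power of two */
            uint64_t bs = 512;
            while (bs < filesize)
                bs *= 2;
            printf("%llu bytes -> one %llu-byte block\n",
                (unsigned long long)filesize, (unsigned long long)bs);
        } else {
            /* recordsize'd blocks, enough to cover the file */
            uint64_t n = (filesize + RECORDSIZE - 1) / RECORDSIZE;
            printf("%llu bytes -> %llu blocks of %d bytes\n",
                (unsigned long long)filesize, (unsigned long long)n,
                RECORDSIZE);
        }
    }

    int main(void)
    {
        describe(9000);               /* one 16 Kb block */
        describe(300 * 1024);         /* three 128 Kb blocks */
        return 0;
    }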

(I believe that there is also metadata that's always unfragmented and may be in blocks up to 128 Kb.)

However, ZFS doesn't guarantee that a pool always has free 128 Kb chunks available, or in fact any particular size of chunk. Instead, free space can be fragmented; you might be unfortunate enough to have many gigabytes of free space, but all of it in fragments that are, say, 32 Kb and smaller. This is where ZFS needs to resort to gang blocks, basically in order to lie to itself about still writing single large blocks.
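
The real decision is made in zio_dva_allocate(), which calls zio_write_gang_block() when it can't find a big enough contiguous chunk (per the zio.c comment quoted earlier). Here's a deliberately tiny cartoon of that fallback, with everything other than those function names invented for illustration:

    /* Not ZFS code: a cartoon of the gang block fallback.  Pretend
     * that every free fragment left in the pool is 32 Kb. */
    #include <stdio.h>

    #define LARGEST_FREE_CHUNK (32 * 1024)

    static void write_block(int size)
    {
        if (size <= LARGEST_FREE_CHUNK) {
            printf("%d Kb: one contiguous allocation\n", size / 1024);
        } else {
            /* fall back to a gang block built from fragments */
            int frags = (size + LARGEST_FREE_CHUNK - 1) /
                LARGEST_FREE_CHUNK;
            printf("%d Kb: gang block of %d fragments\n",
                size / 1024, frags);
        }
    }

    int main(void)
    {
        write_block(16 * 1024);       /* fits: a normal block */
        write_block(128 * 1024);      /* doesn't fit: gang block */
        return 0;
    }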

(Before I get too snarky, I should say that this lie probably simplifies the life of higher level code a fair bit. Rather than have a whole bunch of data and metadata handling code that has to deal with all sorts of fragmentation, most of ZFS can ignore the issue and then lower level IO code quietly makes it all work. Actually using gang blocks should be uncommon.)

All of this explains why the gang block bit is a property of the DVA, not of anything else. The DVA is where space gets allocated, so the DVA is where you may need to shim in a gang block instead of getting a contiguous chunk of space. Since different vdevs generally have different levels of fragmentation, whether or not you have a contiguous chunk of the necessary size will often vary from vdev to vdev, which is to say from DVA to DVA.

One quiet complication created by gang blocks is that, according to comments in the source code, the gang members may not wind up on the same vdev as the gang header (although ZFS tries to keep them on the same vdev because it makes life easier). This is different from regular blocks, which always live on a single vdev (although they may be spread across multiple disks if they're on a raidz vdev).

Gang blocks have some space overhead compared to regular blocks (in addition to being more fragmented on disk), but how much is quite dependent on the situation. Because each gang header can only point to three gang member blocks, you may wind up needing multiple levels of nested gang blocks if you have an unlucky combination of fragmented free space and a large block to write. As an example, suppose that you need to write a 128 Kb block and the pool only has 32 Kb chunks free. 128 Kb requires four 32 Kb chunks, which is more than a single gang header can point to, so you need a nested gang block; your overhead is two sectors for the two gang headers needed. If the pool was more heavily fragmented, you'd need more nested gang blocks and the overhead would go up. If the pool had a single 64 Kb chunk left, you could have written the 128 Kb with two 32 Kb chunks and the 64 Kb chunk and thus not needed the nested gang block with its additional gang header.
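
To check that arithmetic, note that each gang header provides three block pointer slots and that turning one slot into a nested gang header spends the slot but gains three more. Here's a toy calculation of the minimum number of headers; it's not ZFS code, and it assumes headers are packed as densely as possible:

    /* A toy calculation, not ZFS code: the minimum number of
     * one-sector gang headers needed for a given number of data
     * fragments, with a fan-out of three per header. */
    #include <stdio.h>

    static int headers_needed(int fragments)
    {
        int headers = 1;    /* the gang leader */
        int slots = 3;      /* data slots available so far */
        while (slots < fragments) {
            headers++;
            slots += 2;     /* one slot spent, three gained */
        }
        return headers;
    }

    int main(void)
    {
        printf("3 fragments -> %d header(s)\n", headers_needed(3));
        printf("4 fragments -> %d header(s)\n", headers_needed(4));
        printf("8 fragments -> %d header(s)\n", headers_needed(8));
        return 0;
    }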

(Because ZFS only uses a gang block when the space required isn't available in a contiguous block, gang blocks are absolutely sure to be scattered on the disk.)

PS: As far as I can see, a pool doesn't keep any statistics on how many times gang blocks have been necessary or how many there currently are in the pool.
