Wandering Thoughts archives

2019-05-31

Some things on how ZFS dnode object IDs are allocated (which is not sequential)

One of the core elements of ZFS is dnodes, which define DMU objects. Within a single filesystem or other object set, dnodes have an object number (aka object ID). For dnodes that are files or directories in a filesystem, this is visible as their Unix inode number, but other internal things get dnodes and thus object numbers as well (for example, a filesystem's delete queue has its own dnode). Object IDs are 64-bit numbers, but many of them can be relatively small (especially the object IDs of internal structures, again such as the delete queue). Very large dnode object numbers are uncommon, and files and directories created early in a filesystem's life can have very small object IDs.

(For instance, the object ID of my home directory on our ZFS fileservers is '5'. I'm the only user in this filesystem.)

You might reasonably wonder how ZFS object IDs are allocated. Inspection of a ZFS filesystem will show that they are clearly not allocated sequentially, but they're also not allocated randomly. Based on an inspection of the dnode allocation source code in dmu_object.c, there seem to be two things going on to spread dnode object IDs around some (but not too much).

The first thing is that dnode allocation is done from per-CPU chunks of the dnode space. The size of each chunk is set by dmu_object_alloc_chunk_shift, which by default creates 128-dnode chunks. The motivation for this is straightforward; if all of the CPUs in the system were allocating dnodes from the same area, they would all have to contend over locks on that area. Spreading allocation out into separate per-CPU chunks reduces lock contention, which means that parallel or highly parallel workloads that frequently create files on a single filesystem don't bottleneck on a shared lock.

(One reason that you might create files a lot in a parallel workload is if you're using files on the filesystem as part of a locking strategy. This is still common in things like mail servers, mail clients, and IMAP servers.)
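To make the per-CPU chunk idea concrete, here is a minimal sketch of it in C. This is my own illustration, not the actual ZFS code; the real logic (which I believe lives in dmu_object_alloc_impl()) also has to check that the object numbers it hands out are actually free, cope with large dnodes that span multiple slots, and so on. All of the names below are made up.

    #include <stdint.h>
    #include <pthread.h>

    #define CHUNK_SHIFT  7                      /* dmu_object_alloc_chunk_shift */
    #define CHUNK_SIZE   (1ULL << CHUNK_SHIFT)  /* 128 dnodes per chunk */
    #define NCPUS        64

    static uint64_t next_chunk;                 /* start of the next unclaimed chunk */
    static uint64_t percpu_next[NCPUS];         /* next object number for each CPU */
    static pthread_mutex_t chunk_lock = PTHREAD_MUTEX_INITIALIZER;

    /*
     * Hand out the next object number for the given CPU (0 <= cpu < NCPUS).
     * The shared lock is only taken when a CPU has used up its current
     * 128-dnode chunk and needs to claim a fresh one.
     */
    uint64_t
    alloc_object_number(int cpu)
    {
            if ((percpu_next[cpu] % CHUNK_SIZE) == 0) {
                    pthread_mutex_lock(&chunk_lock);
                    percpu_next[cpu] = next_chunk;
                    next_chunk += CHUNK_SIZE;
                    pthread_mutex_unlock(&chunk_lock);
            }
            return (percpu_next[cpu]++);
    }

Because each CPU works through its own chunk, files created at about the same time by different CPUs land in different 128-dnode chunks, even though each individual chunk is handed out sequentially.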

The second thing is, well, I'm going to quote the comment in the source code to start with:

Each time we polish off a L1 bp worth of dnodes (2^12 objects), move to another L1 bp that's still reasonably sparse (at most 1/4 full). Look from the beginning at most once per txg. If we still can't allocate from that L1 block, search for an empty L0 block, which will quickly skip to the end of the metadnode if no nearby L0 blocks are empty. This fallback avoids a pathology where full dnode blocks containing large dnodes appear sparse because they have a low blk_fill, leading to many failed allocation attempts. [...]

(In reading the code a bit, I think this comment means 'L2 block' instead of 'L0 block'.)

To understand a bit more about this, we need to know about two things. First, we need to know that dnodes themselves are stored in another DMU object, and this DMU object stores data in the same way as all others do, using various levels of indirect blocks. Then we need to know about indirect blocks themselves. L0 blocks directly hold data (in this case the actual dnodes), while L1 blocks hold pointers to L0 blocks and L2 blocks hold pointers to L1 blocks.

(You can see examples of this structure for regular files in the zdb output in this entry and this entry. If I'm doing the math right, for dnodes a L0 block normally holds 32 dnodes and a L<N> block can address up to 128 L<N-1> blocks, through block pointers.)
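If I'm reading those numbers right, they follow from what I believe are the defaults of 512-byte dnodes, 16 KiB blocks (for both the dnode data blocks and the indirect blocks of the metadnode), and 128-byte block pointers:

    16384 / 512 = 32      dnodes per L0 block
    16384 / 128 = 128     block pointers per L1 (or L2) block
    32 * 128    = 4096    dnodes covered by one L1 block, the '2^12 objects' of the comment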

So, what appears to happen is that at first, the per-CPU allocator gets its chunks sequentially (for different CPUs, or the same CPU) from the same L1 indirect block, which covers 4096 dnodes. When we exhaust all of the 128-dnode chunks in a single group of 4096, we don't move to the sequentially next group of 4096; instead we search around for a sufficiently empty group, and switch to it (where a 'sufficiently empty' group is one with at most 1024 dnodes already allocated). If there is no such group, I think that we may wind up skipping to the end of the currently allocated dnodes and getting a completely fresh empty block of 4096.
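As an illustration only (the names are made up, and this ignores both the 'look from the beginning at most once per txg' limit and the details of how the real code searches the metadnode), the group selection step boils down to something like this:

    #include <stdint.h>

    #define DNODES_PER_L1   4096    /* one L1 block pointer's worth of dnodes */

    /*
     * Pick the starting object number of the next 4096-dnode group to
     * allocate chunks from.  used[g] is how many dnodes group g already
     * has allocated and ngroups is how many groups currently exist.
     */
    uint64_t
    pick_next_group(const uint64_t *used, uint64_t ngroups)
    {
            /* Prefer an existing group that is at most 1/4 full (<= 1024 used). */
            for (uint64_t g = 0; g < ngroups; g++) {
                    if (used[g] <= DNODES_PER_L1 / 4)
                            return (g * DNODES_PER_L1);
            }
            /* Otherwise start a completely fresh group past everything so far. */
            return (ngroups * DNODES_PER_L1);
    }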

If I'm right, the net effect of this is to smear dnode allocations (and especially reallocations) out over an increasingly large portion of the lower dnode object number space. As your filesystem gets used and files get deleted, many of the lower 4096-dnode groups will have some or even many free dnodes, but not the 3072 free dnodes they need in order to be eligible to be selected for further allocation. This can eventually push dnode allocations to relatively high object numbers even though you may not have anywhere near that many dnodes in use on the filesystem. This is not guaranteed, though, and you may still see dnode numbers reused.

(For example, I just created a new file in my home directory. My home directory's filesystem has 1983310 dnodes used right now, but the inode number (and thus dnode object number) that my new test file got was 1804696.)

ZFSDnodeIdsAllocation written at 23:25:17

2019-05-17

One of our costs of using OmniOS was not having 10G networking

OmniOS has generally been pretty good to us over the lifetime of our second generation ZFS fileservers, but as we've migrated various filesystems from our OmniOS fileservers to our new Linux fileservers, it's become clear that one of the costs we paid for using OmniOS was not having 10G networking.

We certainly started out intending to have 10G networking on OmniOS; our hardware was capable of it, with Intel 10G-T chipsets, and OmniOS seemed happy to drive them at decent speeds. But early on we ran into a series of fatal problems with the Intel ixgbe driver which we never saw any fixes for. We moved our OmniOS machines (and our iSCSI backends) back to 1G, and they have stayed there ever since. When we made this move, we did not have detailed system metrics on things like NFS bandwidth usage by clients, and anyway almost all of our filesystems were on HDs, so 1G seemed like it should be fine. And indeed, we mostly didn't see obvious and glaring problems, especially right away.

What setting up a metrics system (even only on our NFS clients) and then later moving some filesystems from OmniOS (at 1G) to Linux (at 10G) made clear was that on some filesystems, we had definitely been hitting the 1G bandwidth limit and doing so had real impacts. The filesystem this was most visible on is the one that holds /var/mail, our central location for people's mailboxes (ie, their IMAP inbox). This was always on SSDs even on OmniOS, and once we started really looking it was clearly bottlenecked at 1G. It was one of the early filesystems we moved to the Linux fileservers, and the improvement was very visible. Our IMAP server, which has 10G itself, now routinely has bursts of over 200 Mbps inbound and sometimes sees brief periods of almost saturated network bandwidth. More importantly, the IMAP server's performance is visibly better; it is less loaded and more responsive, especially at busy times.

(A contributing factor to this is that any number of people have very big inboxes, and periodically our IMAP server winds up having to read through all of such an inbox. This creates a very asymmetric traffic pattern, with huge inbound bandwidth from the /var/mail fileserver to the IMAP server but very little outbound traffic.)

It's less clear how much of a cost we paid for HD-based filesystems, but it seems pretty likely that we paid some cost, especially since our OmniOS fileservers were relatively large (too large, in fact). With lots of filesystems, disks, and pools on each fileserver, it seems likely that there would have been periods where each fileserver could have reached inbound or outbound network bandwidth rates above 1G, if they'd had 10G networking.

(And this excludes backups, where it seems quite likely that 10G would have sped things up somewhat. I don't consider backups as important as regular fileserver NFS traffic because they're less time and latency sensitive.)

At the same time, it's quite possible that this cost was still worth paying in order to use OmniOS back then instead of one of the alternatives. ZFS on Linux was far less mature in 2013 and 2014, and I'm not sure how well FreeBSD would have worked, especially if we insisted on keeping a SAN based design with iSCSI.

(If we had had lots of money, we might have attempted to switch to other 10G networking cards, probably SFP+ ones instead of 10G-T (which would have required switch changes too), or to commission someone to fix up the ixgbe driver, or both. But with no funds for either, it was back to 1G for us and then the whole thing was one part of why we moved away from Illumos.)

OmniOSNo10GCost written at 01:11:05
