Wandering Thoughts


Even on SSDs, ongoing activity can slow down ZFS scrubs drastically

Back in the days of our OmniOS fileservers, which used HDs (spinning rust) across iSCSI, we wound up changing kernel tunables to speed up ZFS scrubs and saw a significant improvement. When we migrated to our current Linux fileservers with SSDs, I didn't bother including these tunables (or the Linux equivalent), because I expected that SSDs were fast enough that it didn't matter. Indeed, our SSD pools generally scrub like lightning.

(Our Linux fileservers use a ZFS version before sequential scrubs (also). It's possible that sequential scrub support would change this story.)
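
As a sketch of the sort of tuning I mean, here is roughly what the old knobs looked like on ZFS on Linux before sequential scrubs. The parameter names here are from the pre-0.8 era and my memory; check them against your version before touching anything:

```shell
# Make scrub IO back off less in the face of regular pool activity.
# zfs_scrub_delay is how many ticks scrub IO is delayed when the pool
# has seen recent non-scrub activity; 0 means 'do not back off at all'.
echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay
# zfs_top_maxinflight caps concurrent scrub IOs per top-level vdev;
# raising it lets scrubs keep more IO outstanding.
echo 64 > /sys/module/zfs/parameters/zfs_top_maxinflight
```

(On OmniOS the equivalents were `zfs:zfs_scrub_delay` and company, set through /etc/system.)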

Then, this weekend, a ZFS pool with 1.68 TB of space used took two days to scrub (48:15, to be precise). This is not something that happens normally; this size of pool usually scrubs much faster, on the order of a few hours. When I poked at it a bit, none of the disks seemed unusually slow and there were no signs of other problems; it was just that the scrub was running slowly. However, looking at NFS client metrics in our metrics system suggested that there was continuous ongoing NFS activity to some of the filesystems in that pool.

Although I don't know for sure, this looks like a classic case of even a modest level of regular ZFS activity causing the ZFS scrub code to back off significantly on IO. Since this is on SSDs, this isn't really necessary (at least for us); we could almost certainly sustain both a more or less full speed scrub and our regular read IO (significant write IO might be another story, but that's because it has some potential performance effects on SSDs in general). However, with no tuning our current version of ZFS is sticking to conservative defaults.

In one sense, this isn't surprising, since it's how ZFS has traditionally reacted to IO during scrubs. In another sense, it is, because it's not something I expected to see affect us on SSDs; if I had expected to see it, I'd have carried forward our ZFS tunables to speed up scrubs.

(Now that I look at our logged data, it appears that ZFS scrubs on this pool have been slow for some time, although not 'two days' slow. They used to complete in a couple of hours, then suddenly jumped to over 24 hours. More investigation may be needed.)

ZFSSSDActivitySlowsScrubs written at 22:48:14


Some thoughts on us overlooking Illumos's syseventadm

In a comment on my praise of ZFS on Linux's ZFS event daemon, Joshua M. Clulow noted that Illumos (and thus OmniOS) has an equivalent in syseventadm, which dates back to Solaris. I hadn't previously known about syseventadm, despite having run Solaris fileservers and OmniOS fileservers for the better part of a decade, and that gives me some tangled feelings.

I definitely wish I'd known about syseventadm while we were still using OmniOS (and even Solaris), because it would probably have simplified our life. Specifically, it probably would have simplified the life of our spares handling system (2, 3). At the least, having it run immediately when some sort of pool state change happened would have sped up its reaction to devices failing (instead, it ran every fifteen minutes or so from cron, creating a bit of time lag).

(On the whole it was probably good to be forced to make our spares system be state based instead of event based. State based systems are easier to make robust in the face of various sorts of issues, like dropped events.)

At the same time, that we didn't realize syseventadm existed is, in my mind, a sign of problems in how Illumos is organized and documented (which is something it largely inherited from Solaris). For instance, syseventadm is not cross referenced in any of the Fault Manager related manpages (fmd, fmdump, fmadm, and so on). The fault management system is the obvious entry point for a sysadmin exploring this area on Illumos (partly because it dumps out messages on you), so some sort of cross reference would have led me to syseventadm. Nor does it come up much in discussions on the Internet, although if I'd asked specifically back in the day I might have had someone mention it to me.

(It got mentioned in this Serverfault question, for example.)

A related issue is that in order to understand what you can do with syseventadm, you have to read Illumos header files (cf). This isn't even mentioned in the syseventadm manpage, and the examples in the manpage are all for custom events generated by things from a hypothetical third party vendor MYCO instead of actual system events. Without a lot of context, there are not many clues that ZFS events show up in syseventadm in the first place for you to write a handler for them. It also seems clear that writing handlers is going to involve a lot of experimentation or reading the source to determine what data you get and how it's passed to you and so on.
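
To illustrate the sort of thing we could have done, here's a hedged sketch of registering a ZFS event handler. The EC_zfs class and ESC_ZFS_* subclasses come from the kernel's sys/sysevent/eventdefs.h, and both the handler path and the $pool_name attribute macro are assumptions of mine that you'd need to verify against your Illumos release:

```shell
# Run a local script whenever a resilver finishes in some pool, passing
# the pool name (assuming the event carries a pool_name attribute).
syseventadm add -c EC_zfs -s ESC_ZFS_resilver_finish \
    /local/scripts/zfs-resilver-done '$pool_name'
syseventadm restart    # tell syseventd to reload its event mappings
```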

(In general and speaking as a sysadmin, the documentation for syseventadm doesn't present itself as something that's for end sysadmins to use. If you have to read kernel headers to understand even part of what you can do, this is aimed at system programmers.)

On the whole I'm not terribly surprised that we and apparently other people missed the existence and usefulness of syseventadm, even if clearly there was some knowledge of it in the Illumos community. That we did miss it while ZFS on Linux's equivalent practically shoved itself in our face is an example of practical field usability (or lack thereof) in action.

At this point interested parties are probably best off writing articles about how to do things with syseventadm (especially ZFS things), and perhaps putting it in Illumos ZFS FAQs. Changing the structure of the Illumos documentation or rewriting the manpages probably has too little chance of good returns for the time invested; for the most part, the system documentation for Illumos is what it is.

OverlookingSyseventadm written at 00:21:02


In ZFS, your filesystem layout needs to reflect some of your administrative structure

One of the issues we sometimes run into with ZFS is that ZFS essentially requires you to reflect your administrative structure for allocating and reserving space in how you lay out ZFS filesystems and filesystem hierarchies. This is because in ZFS, all space management is handled through the hierarchy of filesystems (and perhaps in having multiple pools). If you want to make two separate amounts of space available to two separate sets of filesystems (or collectively reserved by them), either they must be in different pools or they must be under different dataset hierarchies within the pool.

(These hierarchies don't have to be visible to users, because you can mount ZFS filesystems under whatever names you want, but they exist in the dataset hierarchy in the pool itself and you'll periodically need to know them, because some commands require the full dataset name and don't work when given the mount point.)

That sounds abstract, so let me make it concrete. Simplifying only slightly, our filesystems here are visible to people as /h/NNN (for home directories) and /w/NNN (workdirs, for everything else). They come from some NFS server and live in some ZFS pool there (inside little container filesystems), but the NFS server and to some extent the pool is an implementation detail. Each research group has its own ZFS pool (or for big ones, more than one pool, because one pool can only be so big), as do some individual professors. However, there are not infrequently cases where a professor in a group pool would like to buy extra space that is only for their students, and where this professor has several different filesystems in the pool (often a mixture of /h/NNN homedir filesystems and /w/NNN workdir ones).

This is theoretically possible in ZFS, but in order to implement it ZFS would force us to put all of a professor's filesystems under a sub-hierarchy in the pool. Instead of the current tank/h/100 and tank/w/200, they would have to be something like tank/prof/h/100 and tank/prof/w/200. The ZFS dataset structure is required to reflect the administrative structure of how people buy space. One of the corollaries of this is that you can basically only have a single administrative structure for how you allocate space, because a dataset can only be in one place in the ZFS hierarchy.

(So if two professors want to buy space separately for their filesystems but there's a filesystem shared between them (and they each want it to share in their space increase), you have a problem.)

If there were sub-groups of people who wanted to buy space collectively, we'd need an even more complicated dataset structure. Such sub-groups are not necessarily decided in advance, so we can't set up such a hierarchy when the filesystems are created; we'd likely wind up having to periodically modify the dataset hierarchy. Fortunately the manpages suggest that 'zfs rename' can be done without disrupting service to the filesystem, provided that the mountpoint doesn't change (which it wouldn't, since we force those to the /h/NNN and /w/NNN forms).
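
Concretely, with a hypothetical 'prof' dataset name, the reshuffle would look something like this; because the real filesystems have explicit mountpoint properties, the user-visible names don't change:

```shell
# Container datasets for the professor's private space allocation.
zfs create -o mountpoint=none tank/prof
zfs create -o mountpoint=none tank/prof/h
zfs create -o mountpoint=none tank/prof/w
# Move the existing filesystems under the new hierarchy; their explicit
# mountpoints (/h/100 and /w/200) stay the same.
zfs rename tank/h/100 tank/prof/h/100
zfs rename tank/w/200 tank/prof/w/200
# The professor's purchased space can now be limited in one place.
zfs set quota=2T tank/prof
```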

While our situation is relatively specific to how we sell space, people operating ZFS can run into the same sort of situation any time they want to allocate or control collective space usage among a group of filesystems. There are plenty of places where you might have projects that get so much space but want multiple filesystems, or groups (and subgroups) that should be given specific allocations or reservations.

PS: One reason not to expose these administrative groupings to users is that they can change. If you expose the administrative grouping in the user visible filesystem name and where a filesystem belongs shifts, everyone gets to change the name they use for it.

ZFSAdminVsFilesystemLayout written at 22:58:55


The unfortunate limitation in ZFS filesystem quotas and refquota

When ZFS was new, the only option it had for filesystem quotas was the quota property, which I had an issue with and which caused us practical problems in our first generation of ZFS fileservers because it covered the space used by snapshots as well as the regular user accessible filesystem. Later ZFS introduced the refquota property, which did not have that problem but in exchange doesn't apply to any descendant datasets (regardless of whether they're snapshots or regular filesystems). At one level this issue with refquota is fine, because we put quotas on filesystems to limit their maximum size to what our backup system can comfortably handle. At another level, this issue impacts how we operate.

All of this stems from a fundamental lack in ZFS quotas, which is that ZFS's general quota system doesn't let you limit space used only by unprivileged operations. Writing into a filesystem is a normal everyday thing that doesn't require any special administrative privileges, while making ZFS snapshots (and clones) requires special administrative privileges (either from being root or from having had them specifically delegated to you). But you can't tell them apart in a hierarchy, because ZFS only offers you the binary choice of ignoring all space used by descendants (regardless of how it occurs) or ignoring none of it, sweeping up specially privileged operations like creating snapshots with ordinary activities like writing files.

This limitation affects our pool space limits, because we use them for two different purposes: restricting people to only the space that they've purchased and ensuring that pools always have a safety margin of space. Since pools contain many filesystems, we must limit their total space usage using the quota property. But that means that any snapshots we make for administrative purposes consume space that's been purchased, and if we make too many of them we'll run the pool out of space for completely artificial reasons. It would be better to be able to have two quotas, one for the space that the group has purchased (which would limit only regular filesystem activity) and one for our pool safety margin (which would limit snapshots too).
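
To make the limitation concrete, here is a sketch with hypothetical pool names and sizes; the second, snapshot-excluding limit is wishful syntax that doesn't exist in ZFS:

```shell
# What we can do today: one quota at the pool root that counts snapshots
# along with ordinary writes.
zfs set quota=5T tank
# What we would like (NOT real ZFS syntax), a limit on ordinary
# filesystem activity only, plus a higher hard ceiling that snapshots
# also count against:
#   zfs set writequota=4.5T tank    # hypothetical property: purchased space
#   zfs set quota=5T tank           # our safety margin ceiling
```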

(This wouldn't completely solve the problem, though, since snapshots still consume space and if we made too many of them we'd run a pool that should have free space out of even its safety margin. But it would sometimes make things easier.)

PS: I thought this had more of an impact on our operations and the features we can reasonably offer to people, but the more I think about it the less it does. Partly this is because we don't make much use of snapshots, for various reasons that sort of boil down to 'the natural state of disks is usually full'. But that's for another entry.

ZFSHierarchyQuotaLack written at 22:17:27


How we guarantee there's always some free space in our ZFS pools

One of the things that we discovered fairly early on in our experience with ZFS (I think within the lifetime of the first generation Solaris fileservers) is that ZFS gets very unhappy if you let a pool get completely full. The situation has improved since then, but back in the day we couldn't even change ZFS properties, much less remove files as root. Being unable to change properties is a serious issue for us because NFS exports are controlled by ZFS properties, so if we had a full pool we couldn't modify filesystem exports to cut off access from client machines that were constantly filling up the filesystem.

(At one point we resorted to cutting off a machine at the firewall, which is a pretty drastic step. Going this far isn't necessary for machines that we run, but we also NFS export filesystems to machines that other trusted sysadmins run.)

To stop this from happening, we use pool-wide quotas. No matter how much space people have purchased in a pool or even if this is a system pool that we operate, we insist that it always have a minimum safety margin, enforced through a 'quota=' setting on the root of the pool. When people haven't purchased enough to use all of the pool's current allocated capacity, this safety margin is implicitly the space they haven't bought. Otherwise, we have two minimum margins. The explicit minimum margin is that our scripts that manage pool quotas always insist on a 10 MByte safety margin. The implicit minimum margin is that we normally only set pool quotas in full GB, so a pool can be left with several hundred MB of space between its real maximum capacity and the nearest full GB.

All of this pushes the problem back one level, which is determining what the pool's actual capacity is so we can know where this safety margin is. This is relatively straightforward for us because all of our pools use mirrored vdevs, which means that the size reported by 'zpool list' is a true value for the total usable space (people with raidz vdevs are on their own here). However, we must reduce this raw capacity a bit, because ZFS reserves 1/32nd of the pool for its own internal use. We must reserve at least 10 MB over and above this 1/32nd of the pool in order to actually have a safety margin.

(All of this knowledge and math is embodied into a local script, so that we never have to do these calculations by hand or even remember the details.)
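
The math itself is simple enough to sketch. This is my assumption of roughly what the script embodies (the function name is made up, and the real script may round differently):

```shell
# Sketch of the pool quota calculation, assuming mirrored vdevs so that
# 'zpool list' SIZE is true usable space.
pool_quota_gb() {
    # $1: raw pool size in bytes, e.g. from 'zpool list -Hp -o size POOL'
    awk -v size="$1" 'BEGIN {
        slop   = size / 32            # the 1/32nd ZFS reserves internally
        margin = 10 * 1024 * 1024     # our explicit 10 MByte safety margin
        # round down to a whole number of GB, our usual quota granularity
        print int((size - slop - margin) / 1024^3)
    }'
}
pool_quota_gb 2199023255552    # a 2 TiB pool
```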

PS: These days in theory you can change ZFS properties and even remove files when your pool is what ZFS will report as 100% full. But you need to be sure that you really are freeing up space when you do this, not using more because of things like snapshots. Very bad things happen to your pool if it gets genuinely full right up to ZFS's internal redline (which is past what ZFS will normally let you reach unless you trick it); you will probably have to back it up, destroy it, and recreate it to fully recover.

(This entry was sparked by a question from a commentator on yesterday's entry on how big our fileserver environment is.)

ZFSGuaranteeFreeSpace written at 22:51:20


Revisiting what the ZFS recordsize is and what it does

I'm currently reading Jim Salter's ZFS 101—Understanding ZFS storage and performance, and got to the section on ZFS's important recordsize property, where the article attempts to succinctly explain a complicated ZFS specific thing. ZFS recordsize is hard to explain because it's relatively unlike what other filesystems do, and looking back I've never put down a unified view of it in one place.

The simplest description is that ZFS recordsize is the (maximum) logical block size of a filesystem object (a file, a directory, a whatever). Files smaller than recordsize have a single logical block that's however large it needs to be (details here); files of recordsize or larger have some number of recordsize logical blocks. These logical blocks aren't necessarily that large in physical blocks (details here); they may be smaller, or even absent entirely (if you have some sort of compression on and all of the data was zeros), and under some circumstances the physical block can be fragmented (these are 'gang blocks').

(ZFS normally doesn't fragment the physical block that implements your logical block, for various good reasons including that one sequential read or write is generally faster than several of them.)
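
As a concrete sketch of the logical block arithmetic described above (with the default 128 KB recordsize as an assumption, and a made-up function name):

```shell
# How many logical blocks a file occupies: files up to recordsize get a
# single, variable-sized logical block; larger files get
# ceil(size / recordsize) full recordsize blocks.
logical_blocks() {
    awk -v fsize="$1" -v rs="${2:-131072}" 'BEGIN {
        if (fsize <= rs) print 1
        else             print int((fsize + rs - 1) / rs)
    }'
}
logical_blocks 4096      # small file: one 4 KB logical block
logical_blocks 300000    # ~293 KB file: three 128 KB logical blocks
```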

However, this logical block size has some important consequences because ZFS checksums are for a single logical block. Since ZFS always verifies the checksum when you read data, it must read the entire logical block even if you ask only for a part of it; otherwise it doesn't have all the data it needs to compute the checksum. Similarly, it has to read the entire logical block even when your program is only writing a bit of data to part of it, since it has to update the checksum for the whole block, which requires the rest of the block's data. Since ZFS is a copy on write system, it then rewrites the whole logical block (into however large a physical block it now requires), even if you only updated a little portion of it.

Another consequence is that since ZFS always writes (and reads) a full logical block, it also does its compression at the level of logical blocks (and if you use ZFS deduplication, that also happens on a per logical block basis). This means that a small recordsize will generally limit how much compression you can achieve, especially on disks with 4K sectors.

(Using a smaller maximum logical block size may increase the amount of data that you can deduplicate, but it will almost certainly increase the amount of RAM required to get decent performance from deduplication. ZFS deduplication's memory requirements for good performance are why you should probably avoid it; making them worse is not usually a good idea. Any sort of deduplication is expensive and you should use it only when you're absolutely sure it's worth it for your case.)

ZFSRecordsizeMeaning written at 23:35:22


Looking back at DTrace from a Linux eBPF world (some thoughts)

As someone who made significant use of locally written DTrace scripts on OmniOS and has since moved from OmniOS to Linux for our current generation of fileservers, I've naturally been watching the growth of Linux's eBPF tooling with significant interest (and some disappointment, since they're still sort of a work in progress). This has left me with some thoughts on the DTrace experience on Solaris and then OmniOS as contrasted with the eBPF experience on Linux.

On Linux, eBPF and the tools surrounding it are a lot more rough and raw than I think DTrace ever was, and certainly than any version of DTrace that I actively used. By the time I started using it, DTrace was basically fully cooked on Solaris and accordingly there was little change between the start of my DTrace experience and the end of it (I think DTrace gained some more convenience functions for things like dealing with IP addresses, but no major shifts). But at the same time, how Linux as a whole has developed eBPF and the tools surrounding it has made Linux eBPF an open system with multiple interfaces (and levels of interface) in a way that DTrace never was, and this has enabled things that at least I never saw people doing on Solaris and Illumos.

In the DTrace world, the interface almost everyone was supposed to use to deal with DTrace was, well, dtrace the command and its language (the DTrace library is explicitly marked as unstable and private in its manual page). You might run the command in various ways and generate programs for it through templates or other things, but that was how you interacted with things. If other levels of interface were documented (such as building raw DTrace programs yourself and feeding them to the kernel), they were definitely not encouraged and as a result I don't think the community ever did much with them.
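
For instance, the canonical way to work with DTrace was one-liners and D scripts fed to the dtrace command, along these lines:

```shell
# Count read(2) system calls by program name until you hit Ctrl-C;
# dtrace then prints the aggregation as processed text output.
dtrace -n 'syscall::read:entry { @reads[execname] = count(); }'
```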

(People definitely built tools that used the DTrace system that didn't produce processed text output from dtrace, but these were clearly product level work from a dedicated engineer team, not anything you would produce for smaller scale things. Often they came from Sun or an Illumos distribution provider, and so were entitled to use private interfaces.)

By contrast, the Linux eBPF ecology has created a whole suite of tools at various levels of the stack. There's an equivalent of dtrace, but you can also set up eBPF instrumentation of something from inside a huge number of different programming environments and do a bunch of things with the information that it produces. This has led to very flexible things, such as the Cloudflare eBPF Prometheus exporter (which lets you surface Prometheus metrics for anything that you can write an eBPF program for).

I can't help but feel that the Linux eBPF ecology has benefited a lot from the fractured and separated way that eBPF has been developed. No single group owned the entire stack, top to bottom, and so the multiple groups involved were all tacitly forced to provide more or less documented and more or less stable interfaces to each other. The existence of these interfaces then allowed other people to come along and take advantage of them, writing their own tools on top of one or another bit and re-purposing them to do things like create useful Grafana dashboards (via a comment on here). DTrace's single unified development gave us a much more polished dtrace command much sooner (and made it usable on all Solaris versions that supported DTrace the idea), but that was it.

(I don't fault the DTrace developers for keeping libdtrace private and so on; it really is the obvious thing to do in a unitary development environment. Of course you don't want to lock yourself into backward compatibility with a kernel interface or internal implementation that you now realize is not the best idea.)

DTraceVersusEBPF written at 01:35:25


ZFS on Linux has now become the OpenZFS ZFS implementation

The other day I needed to link to a specific commit in ZFS on Linux for my entry on how deduplicated ZFS streams are now deprecated, so I went to the ZFS on Linux Github repository, which I track locally. Somewhat to my surprise, I wound up on the OpenZFS repository, which is now described as 'OpenZFS on Linux and FreeBSD' and is linked as such from the open-zfs.org page that links to all of them.

(The OpenZFS repo really is a renaming of the ZFS on Linux repo, because my git pulls have transparently kept working; the git tree in the OpenZFS repo is the same git tree that was previously ZFS on Linux. I believe that this is a change from whatever previously served as OpenZFS's own repo, although I don't know where that was.)

I knew this was coming (I believe I've seen it mentioned in passing in various places), but it's still something to see that it's been done now. As I thought last year (in this entry), the center of gravity of ZFS development has shifted from Illumos to Linux. The OpenZFS 'zfs' repository doesn't represent itself as the ZFS upstream, but it certainly has a name that tacitly endorses that view (and the view is pretty much the reality).

Although there are risks to this shift, it feels inevitable. Despite ZFS being a third party filesystem on Linux, Linux is still where the action is. It's certainly where we went for our current generation of ZFS fileservers, because Linux could give us things that OmniOS (sadly) could not (such as working 10G-T Ethernet with Intel hardware). As someone who is on Linux ZFS now, I'm glad that it hasn't stagnated even as I'm sad that Illumos apparently did (for ZFS and other things).

I don't think this means anything different for ZFS on Illumos than it did before, when the Illumos people were talking about increasingly adopting changes from what was then ZFS on Linux (cf). I believe that changes and features are flowing between Illumos and OpenZFS (in both directions), but I don't know if there's any effort to make the OpenZFS repo directly useful for (or on) Illumos.

ZFSOnLinuxNowOpenZFS written at 00:39:28


'Deduplicated' ZFS send streams are now deprecated and on the way out

For a fair while, 'zfs send' has had support for a -D argument, aka --dedup, that causes it to send what is called a 'deduplicated stream'. The zfs(1) manpage describes this as:

Generate a deduplicated stream. Blocks which would have been sent multiple times in the send stream will only be sent once. The receiving system must also support this feature to receive a deduplicated stream. This flag can be used regardless of the dataset's dedup property, but performance will be much better if the filesystem uses a dedup-capable checksum (for example, sha256).

This feature is now on the way out in the OpenZFS repository. It was removed in a commit on March 18th, and the commit message explains the situation:

Dedup send can only deduplicate over the set of blocks in the send command being invoked, and it does not take advantage of the dedup table to do so. This is a very common misconception among not only users, but developers, and makes the feature seem more useful than it is. As a result, many users are using the feature but not getting any benefit from it.

Dedup send requires a nontrivial expenditure of memory and CPU to operate, especially if the dataset(s) being sent is (are) not already using a dedup-strength checksum.

Dedup send adds developer burden. It expands the test matrix when developing new features, causing bugs in released code, and delaying development efforts by forcing more testing to be done.

As a result, we are deprecating the use of `zfs send -D` and receiving of such streams. This change adds a warning to the man page, and also prints the warning whenever dedup send or receive are used.

I actually had the reverse misconception about how deduplicated sends worked; I assumed that they required deduplication to be on in the filesystem itself. Since we will never use deduplication, I never looked any further at the 'zfs send' feature. It probably wouldn't have been a net win for us anyway, since our OmniOS fileservers didn't have all that fast CPUs and we definitely weren't using one of the dedup-strength checksums.

(Our current Linux fileservers have better CPUs, but I think they're still not all that impressive.)

The ZFS people are planning various features to deal with the removal of this feature so that people will still be able to use saved deduplicated send streams. However, if you have such streams in your backup systems, you should probably think about aging them out. And definitely you should move away from generating new ones, even though this change is not yet in any release of ZFS as far as I know (on any platform).
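
In concrete terms (with a hypothetical dataset and destination), the change is just dropping the flag:

```shell
# Deprecated: generates a 'deduplicated' stream and, in current code,
# a warning about it.
zfs send -D tank/fs@backup > /dumps/fs.stream
# The replacement is simply the ordinary stream format.
zfs send tank/fs@backup > /dumps/fs.stream
```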

ZFSStreamDedupGone written at 22:58:33


How we set up our ZFS filesystem hierarchy in our ZFS pools

Our long standing practice here, predating even the first generation of our ZFS fileservers, is that we have two main sorts of filesystems, home directories (homedir filesystems) and what we call 'work directory' (workdir) filesystems. Homedir filesystems are called /h/NNN (for some NNN) and workdir filesystems are called /w/NNN; the NNN is unique across all of the different sorts of filesystems. Users are encouraged to put as much stuff as possible in workdirs and can have as many of them as they want, which mattered a lot more in the days when we used Solaris DiskSuite and had fixed-sized filesystems.

(This creates filesystems called things like /h/281 and /w/24.)

When we moved from DiskSuite to ZFS, we made the obvious decision to keep these user-visible filesystem names and the not entirely obvious decision that these filesystem names should work even on the fileservers themselves. This meant using the ZFS mountpoint property to set the mount point of all ZFS homedir and workdir filesystems, which works (and worked fine). However, this raised another question, that of what the actual filesystem name inside the ZFS pool should look like (since it no longer has to reflect the mount point).

There are a number of plausible answers here. For example, because our 'NNN' numbers are unique, we could have made all filesystems be simply '<pool>/NNN'. However, for various reasons we decided that the ZFS pool filesystem should reflect the full name of the filesystem, so /h/281 is '<pool>/h/281' instead of '<pool>/281' (among other things, we felt that this was easier to manage and work with). This created the next problem, which is that if you have a ZFS filesystem of <pool>/h/281, <pool>/h has to exist in some form. I suppose that we could have made these just be subdirectories in the root of the pool, but instead we decided to make them be empty and unmounted ZFS filesystems that are used only as containers:

zfs create -o mountpoint=none fs11-demo-01/h
zfs create -o mountpoint=none fs11-demo-01/w

We create these in every pool as part of our pool setup automation, and then we can make, for example, fs11-demo-01/h/281, which will be mounted everywhere as /h/281.

(Making these be real ZFS filesystems means that they can have properties that will be inherited by their children; this theoretically enables us to apply some ZFS properties only to a pool's homedir or workdir filesystems. Probably the only useful one here is quotas.)
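
For example (with a made-up size), a single quota set on the container dataset caps the collective space used by all of a pool's homedir filesystems, because a ZFS quota covers the dataset and all of its descendants:

```shell
# Limit the total space of every /h/NNN filesystem in this pool at once.
zfs set quota=4T fs11-demo-01/h
```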

ZFSOurContainerFilesystems written at 23:47:32
