Wandering Thoughts

2021-03-06

Some views and notes on ZFS deduplication today

I recently wrote an entry about a lingering sign of old hopes for ZFS deduplication, and got a number of good comments on it that I have reactions and views about. First off, Opk said:

I could be very wrong in my understanding of how zfs dedup works but I've often turned it on for the initial data population. So I turn it on, rsync or zfs send in my data and then I turn it off again. I don't care about memory usage during the initial setup of a system so I assume this is not doing much harm. [...]

How much potential harm this does depends on what you do with the data that was written with deduplication on. If you leave the data sitting there, this is relatively harmless. However, if you delete the data (including overwriting data in files 'in place'), then ZFS must update the DDT (deduplication table) to correctly maintain the reference count of each unique data block. If you don't have enough memory to hold all of the DDT, then this is going to require disk reads to page chunks of it in and out. The amount of reading and slowdown goes up as you delete more and more data at once, for example if you delete an entire snapshot or filesystem.

(This is a classical surprise issue with deduplication, going back to early days. People are disconcerted when operations like 'zfs destroy <snapshot>' sit there for ages, or at least run in the background for ages even if the command returns immediately.)
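
(If you've used this approach and want to know how big a pool's DDT has become, I believe 'zpool status -D' will tell you. The pool name and numbers here are made up and the exact output format varies by ZFS version, but it looks roughly like this:

; zpool status -D tank
[... the usual status output ...]
 dedup: DDT entries 819200, size 1143B on disk, 184B in core
[... followed by a histogram ...]

Multiplying the entry count by the in-core size per entry gives a rough idea of how much RAM it takes to keep the whole DDT resident.)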

Brendan Long asked:

Have the massive price drops for SSD's since 2010 change your opinion on this at all? It seems like the performance hit is quite bad if you have to do random seeks on a spinning disk, but it's ok on SSD's, and you can get a 100 GB SSD for $20 these days.

I'm not sure if it's okay on SSDs, so here's my view. Reads aren't slowed by being deduplicated, but writes (and deletes) require a synchronous check of the DDT for every block, which means a synchronous SSD read IO if the necessary section of the DDT isn't in RAM. It's not clear to me what latency SSDs have for isolated synchronous reads, but my vaguely measured numbers suggest that we should assume at least a couple of milliseconds per read.
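
(If you want a rough number for your own SSDs, one way is a single-threaded, queue depth one random read test, since that approximates an isolated synchronous read. This is a sketch using fio; the device name is a placeholder and you'd point it at whatever you actually care about:

; fio --name=syncread --filename=/dev/sdX --readonly --direct=1 \
      --rw=randread --bs=4k --iodepth=1 --runtime=30 --time_based

The latency percentiles in fio's output are the interesting part here, not the bandwidth.)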

I haven't read the ZFS code, so I don't know if it performs DDT checking serially as it processes each block being written or deleted (which would be a natural approach), or if it somehow batches the checks up to issue them in parallel. If DDT checks are fully serial and you have to go to SSD on each one, you're looking at a write or delete rate of at most a thousand blocks a second. If you're dealing with 128 KB blocks (the typical maximum ZFS recordsize), that works out to about 125 MBytes a second. This is okay but not all that impressive for a SSD, and it would mean that deleting large objects could still take quite a while to complete.

(Deleting 100 GB might take over 13 minutes, for example.)
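
The arithmetic behind that figure, as a quick shell sketch (100 GB is about 819,200 128 KB blocks, and at roughly 1,000 DDT checks a second that works out to a bit over 13 minutes):

; echo $((100 * 1024 * 1024 / 128)) blocks
819200 blocks
; echo $((819200 / 1000 / 60)) minutes
13 minutes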

On the other hand, if we assume that a typical SATA 6 Gb/s SSD has a sustained write bandwidth of 550 Mbytes/sec, you only need around 4,400 DDT checks a second in order to hit that data rate for writing out 128 KB ZFS blocks. In practice you're probably not going to get 550 Mbytes/sec of user level write bandwidth out of a deduplicated ZFS pool on a single SSD, because both the necessary DDT writes and the DDT reads will take up some of the bandwidth to and from the SSD (even if the DDT is entirely in RAM, it gets updated on writes and deletes and those updates have to be written back to the SSD).

(This also implies that 4,400 written out DDT blocks a second is about the maximum you can do on a single SSD, for deletes. But I expect that writing out updated DDT entries for deletes is batched and generally doesn't touch that many different blocks of the DDT.)

On the whole, I think that there are enough uncertainties about the performance of deduplicated ZFS pools even on SSDs that I wouldn't want to build one for general use. I'd say 'without a good amount of testing', but I'm not sure that testing would convince me that I wouldn't run into a corner case in ordinary use after long enough.

ZFSDedupTodayNotes written at 23:07:38; Add Comment

2021-02-19

ZFS pool partial (selective) feature upgrades are coming in OpenZFS

I'm an active user of (Open)ZFS on Linux on my personal machines (office workstation and home Linux machine), where I deliberately run the very latest development versions. But if you ran 'zpool status' on either machine, you would see a lot of:

status: Some supported and requested features are not
        enabled on the pool. The pool can still be used,
        but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once
        this is done, the pool may no longer be accessible
        by software that does not support the features.
        See zpool-features(5) for details.

(This verbose message irks me for other reasons.)

The reason for this is exactly that I run the latest development versions. Right now, if you run 'zpool upgrade' you get no choice about what happens; your pools are upgraded to support absolutely all of the features that the code you're running knows about. The same thing happens by default when you create a pool (although you can specify exact features if you know what you're doing). For people like me, who have old pools but are running the very latest development versions, this is dangerous. I don't want to enable ZFS pool features that aren't in any released version yet, in case I have to revert to using a stable, released version of (Open)ZFS.

The good news is that OpenZFS's development version just landed a fix for this, in fact a very general one. The simple version is that there's a new ZFS pool property called 'compatibility'; if set, it limits what features a pool will be created with or upgraded to. You can set it to a wide variety of general choices, which include things like 'OpenZFS 2.0 on Linux' and 'what Grub2 will support'.

(As a side effect, looking at the files that define these options will tell you what's supported, or believed to be supported, on various platforms.)

Since this is a ZFS pool property, I believe that the way to selectively (or partially) upgrade an existing pool created with no compatibility option set is to run 'zpool set compatibility=...' to whatever you want beforehand and then run 'zpool upgrade'. This is somewhat underdocumented right now (from my perspective) and I rather wish that 'zpool upgrade' itself could take what to upgrade to (well, what to limit upgrades to) as an explicit argument, the way 'zpool create' does. I suppose that setting a pool property (and then leaving it there) is safer than relying on never accidentally running a 'zpool upgrade' with no restrictions.
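
(Concretely, I believe the partial upgrade looks something like this; the pool name is made up, and the valid compatibility values are the file names shipped in ZFS's compatibility.d directory, so check what your version actually provides:

; zpool set compatibility=openzfs-2.0-linux tank
; zpool upgrade tank
; zpool get compatibility tank

After this, 'zpool upgrade' should only have enabled features that are in the OpenZFS 2.0 on Linux feature set.)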

Another useful side effect of the compatibility pool property, at least according to the documentation, is that apparently 'zpool status' will no longer nag you if your pool supports all of the features that are allowed by its compatibility setting, even if that's a very low one. This may finally get 'zpool status' to shut up about this for me, someday.

(I won't be taking advantage of this feature to finally upgrade my pools to the OpenZFS 2.0 level until a bit more time has passed and other people have found any problems with it. The development version of ZFS is well tested, but I'm still cautious.)

This is a quality of life improvement that many OpenZFS users will never really notice, but for system administrators and people like me it's going to be great. Since one of the compatibility options is 'Grub2', it will probably also help people on Linux who want to use ZFS for their root filesystem.

PS: I don't know when (or if) this will be merged back into Illumos. I don't believe that OpenZFS is attempting to explicitly drive this; instead I believe they leave it up to the Illumos developers to pull in OpenZFS changes of interest. As far as Linux goes, I suspect that this won't be part of any 2.0.x update and will likely wait for 2.1.0, whenever that happens.

ZFSPartialUpgradeOption written at 23:27:24; Add Comment

2021-01-24

Thinking through what can go badly with databases on ZFS

Famously, if you're running a database with its storage on ZFS and you care about performance, you need to tune various ZFS parameters for the filesystem (or filesystems) that the database is on. You especially need to tune the ZFS recordsize property; generally people will say that if you change only one thing, you should change this to be either the same size as your database's block size or perhaps twice its size. But this raises a question for a certain sort of person, namely what goes badly when you leave ZFS's recordsize alone and run a database anyway. I can't answer this from experiments and experience (we've never tried to run performance sensitive databases on our ZFS fileservers), but I can work through this based on knowledge of how ZFS works. I'm going to assume SSD or NVMe storage; if you're still running a database on spinning rust and trying for performance, ZFS's recordsize setting is the least of your problems.

(Examples of tuning recommendations include this [PDF] (via) or Let's Encrypt's ZFS datastore for MariaDB (via).)

The default ZFS recordsize is 128 Kb. What this means is that once a file is 128 Kb or larger, it's stored in logical blocks that are 128 Kb in size (this is the size before compression, so the physical size on disk may vary). Within ZFS, both reads and writes must be done to entire (logical) blocks at once, even if at the user level you only want to read or write a small amount of data. This 128 Kb logical block IO forces overheads on both database reads and especially database writes.
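
(This is why the usual tuning advice is to set recordsize on the database's filesystem before any data is written; recordsize only affects blocks written afterward, so changing it later doesn't rewrite existing files. A sketch with made-up names, using 16 Kb to match InnoDB's default page size:

; zfs create -o recordsize=16K tank/db
; zfs get recordsize tank/db

You'd use 8K instead for PostgreSQL's default 8 Kb pages.)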

For reads, ZFS must transfer up to 128 Kb from disk (although in a single IO transaction), checksum the entire (decompressed) 128 Kb, probably hold it in the ARC (ZFS's in kernel disk cache), and finally give the database the 8 Kb or 16 Kb chunk that it really wants. I suspect that what usually hurts the most here is the extra memory overhead (assuming that the database doesn't then go back and want another 8 Kb or 16 Kb chunk out of the same 128 Kb block, which is now ready in memory). SSDs and especially NVMe drives have high bandwidth and support a lot of operations per second, so the extra data transferred probably doesn't have a big effect there, although the extra data transferred, decompressed, and checksummed may increase your read IO latency a bit.

Things are worse for database writes. To update an 8 Kb or 16 Kb chunk, ZFS must read the 128 Kb block into memory if it's not already there (taking the read overheads, including latency), checksum and likely compress the new version of the 128 Kb block, allocate new disk space for it all, and write it. Importantly, the same read, modify, and write process is required most of the time if you're appending to a file, such as a database's write-ahead log. When the database fsync()s its data (either for its log or for the main data files), ZFS may also write the full data into the ZFS Intent Log. Because a fsync() forces the disk to flush data to durable storage and the time this takes usually depends on how much data there is to flush, I think the increased data written to the ZIL will increase fsync() latency and thus transaction commit latency.

(It's not clear to me if a partial write of a block in a file that has hit the full recordsize writes only the new user-level data to the ZIL or if the ZIL includes the full block, probably out of line but still forced to disk.)

On modern SSDs and NVMe drives, there's a limited internal drive cache of fast storage for buffering writes before they have to be put on the slower main flash. If your database has a high enough write volume, the extra data that has to be written with a 128 Kb recordsize might push the drive out of that fast write storage and slow down all writes. I suspect that most people don't have that much write traffic and that this isn't a real concern; my impression is that people normally hit this drive limit with sustained asynchronous writes.

PS: Appending a small amount of data to a file that is 128 Kb or larger usually requires the same read, modify, write cycle because the last block of a file is still 128 Kb even if the file doesn't entirely fill it up. You get to skip the overhead only when you're starting a new 128 Kb block; if you're appending in 16 Kb chunks, this is every 8th chunk.

PPS: I have some thoughts about the common recommendation for a logbias of throughput on modern storage, but that needs another entry. The short version is that what throughput really does is complicated and it may not be to your benefit today on devices where random IO is free and write bandwidth is high.

(This entry was sparked by this Fediverse toot, although it doesn't in the least answer the toot's question.)

ZFSDatabasesWhatHappens written at 00:47:58; Add Comment

2021-01-21

A lingering sign of old hopes for ZFS deduplication

Over on Twitter, I said:

It's funny-sad that ZFS dedup was considered such an important feature when it launched that 'zpool list' had a DEDUP field added, even for systems with no dedup ever enabled. Maybe someday zpool status will drop that field in the default output.

For people who have never seen it, here is 'zpool list' output on a current (development) version of OpenZFS on Linux:

; zpool list
NAME     SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
ssddata  596G   272G  324G        -         -   40%  45%  1.00x  ONLINE  -

The DEDUP field is the ratio of space saved by deduplication, expressed as a multiplier (from the allocated space after deduplication to what it would be without deduplication). It's always present in default 'zpool list' output, and since almost all ZFS pools don't use deduplication, it's almost always 1.00x.
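
(If you just want that figure by itself, you can ask 'zpool list' for specific properties with -o; I believe the 'dedupratio' property is what backs the DEDUP column. This is my pool again:

; zpool list -o name,allocated,capacity,dedupratio ssddata
NAME     ALLOC  CAP  DEDUP
ssddata   272G  45%  1.00x

The same 1.00x shows up, because nothing in this pool has ever been deduplicated.)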

It seems very likely that Sun and the Solaris ZFS developers had great hopes for ZFS deduplication when the feature was initially launched. Certainly the feature was very attention-getting and superficially attractive; back a decade ago, people had heard of it and would recommend it casually, although actual Solaris developers were more nuanced. It also seems likely that the presence of a DEDUP field in the default 'zpool list' output is a product of an assumption that ZFS deduplication would be commonly used and so showing the field was both useful and important.

However, things did not turn out that way. ZFS deduplication is almost never used, because once people tried to use it for real they discovered that it was mostly toxic, primarily because of high memory requirements. Yet the DEDUP field lingers on in the default 'zpool list' output, and people like me can see it as a funny and sad reminder of the initial hopes for ZFS deduplication.

(OpenZFS could either remove it or, if possible, replace it with the overall compression ratio multiplier for the pool, since many pools these days turn on compression. You would still want to have DEDUP available as a field in some version of 'zpool list' output, since the information doesn't seem to be readily available anywhere else.)

PS: Since I looked it up, ZFS deduplication was introduced in Oracle Solaris 11 for people using Solaris, which came out in November of 2011. It was available earlier for people using OpenSolaris, Illumos, and derivatives. Wikipedia says that it was added to OpenSolaris toward the end of 2009 and first appeared in OpenSolaris build 128, released in early December of 2009.

ZFSDedupLingeringSign written at 23:01:10; Add Comment

2020-12-21

The legibility of different versions of ZFS

I'll put the summary right up at the front: one of the refreshing things that I enjoy about OpenZFS on Linux is how comparatively legible and accessible some aspects of its operation are to me. Well, specifically how comparatively legible starting up ZFS on boot is. Now, there are two sides to that. On one side, the Linux setup to start ZFS is complicated. On the other side, this complexity has always existed in ZFS; it's just that on Solaris (and Illumos/OmniOS), the complexity was deliberately hidden away from you. You were not supposed to have to care about how ZFS started on Solaris because the deep integration of ZFS with the rest of the system should make it just work.

This was fine right up until it didn't just work. We had some of those moments at various points, and because we did (and as a general precaution), I wanted to understand the whole process more. When we were running Solaris and then OmniOS, I mostly failed. I never fully understood things like ZFS pool activation and iSCSI or boot time pool activation (also). I'm sure that these things are knowable, and I suspect that they are knowable even for people who aren't Illumos ZFS kernel developers, but I was never able to navigate through everything while we were still running OmniOS.

Given how we overlooked syseventadm, it's quite possible that part of what is going on is that I'm more familiar with Linux boot arcana than I am with Illumos boot arcana. I certainly like systemd much more than SMF, which left me more interested in learning systemd things than in learning SMF ones. And OpenZFS on Linux has no more documentation on the ZFS boot process in Linux than Illumos had for its ZFS boot process the last time I looked, so you're at the mercy of third party documentation like the Arch wiki. But for whatever reasons, I've been more successful at figuring out the Linux ZFS boot process than I ever was with the OmniOS one.
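
(As one concrete illustration of that legibility, on a systemd-based Linux you can at least see the pieces of the ZFS boot process laid out as units. This is roughly what a current OpenZFS install gives you, although the exact units and output format depend on your packaging:

; systemctl list-dependencies zfs.target
zfs.target
  |- zfs-import.target
  |   `- zfs-import-cache.service
  |- zfs-mount.service
  |- zfs-share.service
  `- zfs-zed.service

You can then read each unit file to see exactly what it runs and what it orders itself after.)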

(Illumos also comes pre-set to have all of the ZFS things work while OpenZFS on Linux can leave you to configure things yourself, which is not exactly the greatest experience.)

I do wish that these things were documented (for Illumos and OpenZFS both). They don't have to be officially supported as 'how this will be for all time', but just knowing how things are supposed to work can be a great help when you run into problems. And beyond that, it's good to know more about how your systems operate under the surface. In the end there is no magic, only things that you don't know.

ZFSVersionsLegibility written at 22:30:48; Add Comment

2020-11-29

Some thoughts on how I still miss DTrace (and also mdb)

Although I'm generally happy with our Linux fileservers, every so often we run into an issue where I miss OmniOS's DTrace and mdb; DTrace for dynamic visibility into what the system was doing, and mdb for static inspection and tracing through kernel data structures. In theory Linux has equivalents of both of these. In practice this Linux future is unevenly distributed. It's likely that our Linux fileservers will have great visibility once we upgrade them to Ubuntu 22.04 in 2022 or 2023, but it's taking some time to get there. This is in stark contrast to Solaris, where DTrace (and mdb) were usable from the very beginnings of our ZFS fileservers, very shortly after Solaris 10 included DTrace at all.

It's my feeling that this difference is ultimately because of how Solaris was developed as a unitary whole by Sun, a single organization, in contrast to how Linux's performance and observability systems have been developed in many separated pieces by different groups. Since there was a single organization controlling all Solaris development, people within Sun were in a position to set overriding priorities and decide that DTrace was a good enough idea that it would be finished and shipped all at once in a ready to go state. Since Solaris was a unitary system, the kernel and user tools could ship together as a coordinated thing and the same people could develop both together.

(Similar things apply for mdb, which needs to move in step with the kernel.)

I do feel that the Linux approach has had some important advantages, but it's undeniable that it's been slower to produce actual results (and those results are unevenly distributed, depending on how up to date your version of Linux is and how rapidly it incorporated various things). My perception is that there's been a back and forth where people put forward needs and prototypes, the kernel adds facilities, people build tools that use and push those facilities, and the usefulness of these tools pushes the kernel forward. Some of these tools have wound up being superseded or abandoned, and the overall selection of tools (and facilities) isn't unified. Since the overall Linux system is not a unified thing under the control of one organization, no one in Linux could write an engineering white paper, build a prototype, or otherwise line up everyone behind one thing to get it done rapidly and out into the world all at once.

In a way, this is what I miss from DTrace and mdb. Solaris could move decisively, for better or worse (and to some extent Illumos still can). Not all of its decisive moves were wins (ask me what I feel about SMF), but there was a real power in its ability to make them, and DTrace is a great illustration of that.

PS: It feels possible that not making a public, supported interface between dtrace and the kernel was one of the things that allowed DTrace as a whole to be shipped early, since it reduced the number of public things that had to be carefully designed and tested. This is another area where having a unitary system helps in several ways.

DTraceStillMiss written at 23:52:33; Add Comment

2020-08-26

Even on SSDs, ongoing activity can slow down ZFS scrubs drastically

Back in the days of our OmniOS fileservers, which used HDs (spinning rust) across iSCSI, we wound up changing kernel tunables to speed up ZFS scrubs and saw a significant improvement. When we migrated to our current Linux fileservers with SSDs, I didn't bother including these tunables (or the Linux equivalent), because I expected that SSDs were fast enough that it didn't matter. Indeed, our SSD pools generally scrub like lightning.

(Our Linux fileservers use a ZFS version before sequential scrubs (also). It's possible that sequential scrub support would change this story.)

Then, this weekend, a ZFS pool with 1.68 TB of space used took two days to scrub (48:15, to be precise). This is not something that happens normally; this size of pool usually scrubs much faster, on the order of a few hours. When I poked at it a bit, none of the disks seemed unusually slow and there were no signs of other problems; the scrub was just running slowly. However, looking at NFS client metrics in our metrics system suggested that there was continuous ongoing NFS activity to some of the filesystems in that pool.

Although I don't know for sure, this looks like a classical case of even a modest level of regular ZFS activity causing the ZFS scrub code to back off significantly on IO. Since this is on SSDs, this isn't really necessary (at least for us); we could almost certainly sustain both a more or less full speed scrub and our regular read IO (significant write IO might be another story, but that's because it has some potential performance effects on SSDs in general). However, with no tuning our current version of ZFS is sticking to conservative defaults.
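
(For what it's worth, I believe the relevant knobs on our pre-sequential-scrub version of ZFS on Linux are the old scan tunables, set as module parameters; the names changed in later versions and the right values depend on your situation, so treat this as a sketch:

; echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay
; echo 0 > /sys/module/zfs/parameters/zfs_scan_idle

Setting these to zero essentially tells ZFS not to back off scrub IO just because there's other activity going on.)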

In one sense, this isn't surprising, since it's how ZFS has traditionally reacted to IO during scrubs. In another sense, it is, because it's not something I expected to see affect us on SSDs; if I had expected to see it, I'd have carried forward our ZFS tunables to speed up scrubs.

(Now that I look at our logged data, it appears that ZFS scrubs on this pool have been slow for some time, although not 'two days' slow. They used to complete in a couple of hours, then suddenly jumped to over 24 hours. More investigation may be needed.)

ZFSSSDActivitySlowsScrubs written at 22:48:14; Add Comment

2020-07-24

Some thoughts on us overlooking Illumos's syseventadm

In a comment on my praise of ZFS on Linux's ZFS event daemon, Joshua M. Clulow noted that Illumos (and thus OmniOS) has an equivalent in syseventadm, which dates back to Solaris. I hadn't previously known about syseventadm, despite having run Solaris fileservers and OmniOS fileservers for the better part of a decade, and that gives me some tangled feelings.

I definitely wish I'd known about syseventadm while we were still using OmniOS (and even Solaris), because it would probably have simplified our life. Specifically, it probably would have simplified the life of our spares handling system (2, 3). At the least, running immediately when some sort of pool state change happened would have sped up its reaction to devices failing (instead, it ran every fifteen minutes or so from cron, creating a bit of time lag).

(On the whole it was probably good to be forced to make our spares system be state based instead of event based. State based systems are easier to make robust in the face of various sorts of issues, like dropped events.)

At the same time, that we didn't realize syseventadm existed is, in my mind, a sign of problems in how Illumos is organized and documented (which is something it largely inherited from Solaris). For instance, syseventadm is not cross referenced in any of the Fault Manager related manpages (fmd, fmdump, fmadm, and so on). The fault management system is the obvious entry point for a sysadmin exploring this area on Illumos (partly because it dumps out messages on you), so some sort of cross reference would have led me to syseventadm. Nor does it come up much in discussions on the Internet, although if I'd asked specifically back in the day I might have had someone mention it to me.

(It got mentioned in this Serverfault question, for example.)

A related issue is that in order to understand what you can do with syseventadm, you have to read Illumos header files (cf). This isn't even mentioned in the syseventadm manpage, and the examples in the manpage are all for custom events generated by things from a hypothetical third party vendor MYCO instead of actual system events. Without a lot of context, there are not many clues that ZFS events show up in syseventadm in the first place for you to write a handler for them. It also seems clear that writing handlers is going to involve a lot of experimentation or reading the source to determine what data you get and how it's passed to you and so on.
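
(To give a concrete picture of what this looks like once you've worked it all out, I believe registering a handler for a ZFS event goes roughly like this; the class and subclass names come from those header files, and the handler path and the $pool_name macro argument are my guesses at what you'd want:

; syseventadm add -c EC_zfs -s ESC_ZFS_resilver_finish /our/zfs-handler \$pool_name
; syseventadm restart

The handler then gets run with the pool name as an argument whenever a resilver finishes.)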

(In general and speaking as a sysadmin, the documentation for syseventadm doesn't present itself as something that's for end sysadmins to use. If you have to read kernel headers to understand even part of what you can do, this is aimed at system programmers.)

On the whole I'm not terribly surprised that we and apparently other people missed the existence and usefulness of syseventadm, even if clearly there was some knowledge of it in the Illumos community. That we did miss it while ZFS on Linux's equivalent practically shoved itself in our face is an example of practical field usability (or lack thereof) in action.

At this point interested parties are probably best off writing articles about how to do things with syseventadm (especially ZFS things), and perhaps putting it in Illumos ZFS FAQs. Changing the structure of the Illumos documentation or rewriting the manpages probably has too little chance of good returns for the time invested; for the most part, the system documentation for Illumos is what it is.

OverlookingSyseventadm written at 00:21:02; Add Comment

2020-07-01

In ZFS, your filesystem layout needs to reflect some of your administrative structure

One of the issues we sometimes run into with ZFS is that ZFS essentially requires you to reflect your administrative structure for allocating and reserving space in how you lay out ZFS filesystems and filesystem hierarchies. This is because in ZFS, all space management is handled through the hierarchy of filesystems (and perhaps in having multiple pools). If you want to make two separate amounts of space available to two separate sets of filesystems (or collectively reserved by them), either they must be in different pools or they must be under different dataset hierarchies within the pool.

(These hierarchies don't have to be visible to users, because you can mount ZFS filesystems under whatever names you want, but they exist in the dataset hierarchy in the pool itself and you'll periodically need to know them, because some commands require the full dataset name and don't work when given the mount point.)

That sounds abstract, so let me make it concrete. Simplifying only slightly, our filesystems here are visible to people as /h/NNN (for home directories) and /w/NNN (workdirs, for everything else). They come from some NFS server and live in some ZFS pool there (inside little container filesystems), but the NFS server and, to some extent, the pool are implementation details. Each research group has its own ZFS pool (or for big ones, more than one pool because one pool can only be so big), as do some individual professors. However, there are not infrequently cases where a professor in a group pool would like to buy extra space that is only for their students, and also this professor has several different filesystems in the pool (often a mixture of /h/NNN homedir filesystems and /w/NNN workdir ones).

This is theoretically possible in ZFS, but in order to implement it ZFS would force us to put all of a professor's filesystems under a sub-hierarchy in the pool. Instead of the current tank/h/100 and tank/w/200, they would have to be something like tank/prof/h/100 and tank/prof/w/200. The ZFS dataset structure is required to reflect the administrative structure of how people buy space. One of the corollaries of this is that you can basically only have a single administrative structure for how you allocate space, because a dataset can only be in one place in the ZFS hierarchy.

(So if two professors want to buy space separately for their filesystems but there's a filesystem shared between them (and they each want it to share in their space increase), you have a problem.)

If there were sub-groups of people who wanted to buy space collectively, we'd need an even more complicated dataset structure. Such sub-groups are not necessarily decided in advance, so we can't set up such a hierarchy when the filesystems are created; we'd likely wind up having to periodically modify the dataset hierarchy. Fortunately the manpages suggest that 'zfs rename' can be done without disrupting service to the filesystem, provided that the mountpoint doesn't change (which it wouldn't, since we force those to the /h/NNN and /w/NNN forms).
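
(Concretely, I believe the restructuring would go something like this, with made-up dataset names and quota; the renames shouldn't disturb anyone because we set mountpoints explicitly:

; zfs create -p tank/prof/h
; zfs create tank/prof/w
; zfs rename tank/h/100 tank/prof/h/100
; zfs rename tank/w/200 tank/prof/w/200
; zfs set quota=2T tank/prof

The professor's purchased space then gets set as the quota on tank/prof, covering both filesystems at once.)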

While our situation is relatively specific to how we sell space, people operating ZFS can run into the same sort of situation any time they want to allocate or control collective space usage among a group of filesystems. There are plenty of places where you might have projects that get so much space but want multiple filesystems, or groups (and subgroups) that should be given specific allocations or reservations.

PS: One reason not to expose these administrative groupings to users is that they can change. If you expose the administrative grouping in the user visible filesystem name and where a filesystem belongs shifts, everyone gets to change the name they use for it.

ZFSAdminVsFilesystemLayout written at 22:58:55; Add Comment

2020-06-30

The unfortunate limitation in ZFS filesystem quotas and refquota

When ZFS was new, the only option it had for filesystem quotas was the quota property, which I had an issue with and which caused us practical problems in our first generation of ZFS fileservers because it covered the space used by snapshots as well as the regular user accessible filesystem. Later ZFS introduced the refquota property, which did not have that problem but in exchange doesn't apply to any descendant datasets (regardless of whether they're snapshots or regular filesystems). At one level this issue with refquota is fine, because we put quotas on filesystems to limit their maximum size to what our backup system can comfortably handle. At another level, this issue impacts how we operate.

All of this stems from a fundamental lack in ZFS quotas, which is that ZFS's general quota system doesn't let you limit space used only by unprivileged operations. Writing into a filesystem is a normal everyday thing that doesn't require any special administrative privileges, while making ZFS snapshots (and clones) requires special administrative privileges (either from being root or from having had them specifically delegated to you). But you can't tell them apart in a hierarchy, because ZFS only offers you the binary choice of ignoring all space used by descendants (regardless of how it occurs) or ignoring none of it, sweeping up specially privileged operations like creating snapshots with ordinary activities like writing files.
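
(To make the binary choice concrete, a sketch with made-up names and sizes:

; zfs set quota=500G tank/w/200     # counts snapshots and all descendants
; zfs set refquota=500G tank/w/200  # counts only the filesystem's own data

What's missing is a third option that counts descendant filesystems but not snapshots, or more generally one that only counts space consumed by unprivileged operations.)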

This limitation affects our pool space limits, because we use them for two different purposes: restricting people to only the space that they've purchased and ensuring that pools always have a safety margin of space. Since pools contain many filesystems, we must limit their total space usage using the quota property. But that means that any snapshots we make for administrative purposes consume space that's been purchased, and if we make too many of them we'll run the pool out of space for completely artificial reasons. It would be better to be able to have two quotas, one for the space that the group has purchased (which would limit only regular filesystem activity) and one for our pool safety margin (which would limit snapshots too).

(This wouldn't completely solve the problem, though, since snapshots still consume space and if we made too many of them we'd run a pool that should have free space out of even its safety margin. But it would sometimes make things easier.)

PS: I thought this had more of an impact on our operations and the features we can reasonably offer to people, but the more I think about it the more it doesn't. Partly this is because we don't make much use of snapshots, though, for various reasons that sort of boil down to 'the natural state of disks is usually full'. But that's for another entry.

ZFSHierarchyQuotaLack written at 22:17:27; Add Comment
