Wandering Thoughts

2020-07-08

"Interface smuggling", a Go design pattern for expanding APIs

Interfaces are one of the big ways of creating and defining APIs in Go. Go famously encourages these interfaces to be very minimal; the widely used and implemented io.Reader and io.Writer are each one method. Minimal APIs such as this have the advantage that almost anything can implement them, which means that Go code that accepts an io.Reader or io.Writer can work transparently with a huge range of data sources and destinations.

However, this very simplicity and generality means that these APIs are not necessarily the most efficient way to perform operations. For example, if you want to copy from an io.Reader to an io.Writer, as io.Copy() does, using only the basic API means that you have to perform intermediate data shuffling when in many cases either the source could directly write to the destination or the destination could directly read from the source. Go's solution to this is what I will call interface smuggling.

In interface smuggling, the actual implementation is augmented with additional well known APIs, such as io.ReaderFrom and io.WriterTo. Functions that want to work more efficiently when possible, such as io.Copy(), attempt to convert the io.Reader or io.Writer they obtained to the relevant API and then use it if the conversion succeeded:

if wt, ok := src.(WriterTo); ok {
   // The source knows how to write itself out; let it do so directly.
   return wt.WriteTo(dst)
}
if rt, ok := dst.(ReaderFrom); ok {
   // The destination knows how to pull data directly from a Reader.
   return rt.ReadFrom(src)
}
[... do copy ourselves ...]

I call this interface smuggling because we are effectively smuggling a different, more powerful, and broader API through a limited one. In the case of types supporting io.WriterTo and io.ReaderFrom, io.Copy completely bypasses the nominal API; the .Read() and .Write() methods are never actually used, at least directly by io.Copy (they may be used by the specific implementations of .WriteTo() or .ReadFrom(), or more interface smuggling may take place).

(Go code also sometimes peeks at the concrete types of interface API arguments. This is how under the right circumstances, io.Copy will wind up using the Linux splice(2) or sendfile(2) system calls.)

There is also interface smuggling that expands the API, as seen in things like io.ReadCloser and io.ReadWriteSeeker. If you have a basic io.Reader, you can try to convert it to see if it actually supports these expanded APIs, and then use them if it does.
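
As a sketch of using this pattern when designing your own API (the LineCounter interface and everything else here is hypothetical, not from the standard library), a function can accept the minimal io.Reader but quietly take a faster path when the concrete type offers more:

package main

import (
    "fmt"
    "io"
    "strings"
)

// LineCounter is a hypothetical smuggled interface: a Reader that can
// report its line count more cheaply than scanning all of its data.
type LineCounter interface {
    CountLines() int
}

// countLines accepts the minimal io.Reader API, but upgrades to the
// broader API when the concrete type actually provides it.
func countLines(r io.Reader) (int, error) {
    if lc, ok := r.(LineCounter); ok {
        return lc.CountLines(), nil
    }
    // Fall back to doing the work through the basic API.
    n := 0
    buf := make([]byte, 4096)
    for {
        c, err := r.Read(buf)
        for _, b := range buf[:c] {
            if b == '\n' {
                n++
            }
        }
        if err == io.EOF {
            return n, nil
        }
        if err != nil {
            return n, err
        }
    }
}

func main() {
    n, _ := countLines(strings.NewReader("a\nb\nc\n"))
    fmt.Println(n) // 3
}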

PS: There's probably a canonical Go term for doing this as part of your API design, either to begin with or as part of expanding it while retaining backward compatibility. If so, feel free to let me know in the comments.

programming/GoInterfaceSmuggling written at 23:37:47

2020-07-07

Some thoughts on Fedora moving to btrfs as the default desktop file system

The news of the time interval for me is that there is a Fedora change proposal to make btrfs the default file system for Fedora desktop (via, itself via; see also the mailing list post). Given that in the past I've been a btrfs sceptic (eg, from 2015), long time readers might expect me to have some views here. However, this time around my views are cautiously optimistic for btrfs (and Fedora), although I will only be watching from a safe distance.

The first two things to note are that 2015 is a long time ago (in computer time) and I'm too out of touch with btrfs to have an informed opinion on its current state. I'm confident that people in Fedora wouldn't have proposed this change if there weren't good reasons to believe that btrfs is up to the task. The current btrfs status looks pretty good on a skim, although the section on device replacement leaves me a little alarmed. The Fedora proposal also covers who else is using btrfs and has been for some time, and it's a solid list that suggests btrfs is not going to explode for Fedora users.

I'm a big proponent of modern filesystems with data and metadata checksums, so I like that aspect of btrfs. As far as performance goes, most people on desktops are unlikely to notice the difference, and as a long-term user of ZFS on Linux I can testify to how nice it is not to have to preallocate space to specific filesystems (even if with LVM you can grow them later).

However, I do feel that this is Fedora being a bit adventurous. This is in line with Fedora's goals and general stance of being a relatively fearless leading edge distribution, but at the same time sometimes the leading edge is also the bleeding edge. I would not personally install a new Fedora machine with btrfs in the first few releases of Fedora that defaulted to it, because I expect that there will be teething problems. Some of these may be in btrfs, but others will be in system management programs and practices that don't cope with btrfs or conflict with it.

In the long run I think that this change to btrfs will be good for Fedora and for Linux as a whole. Ext4 is a perfectly decent filesystem (and software RAID works fine), but it's possible to do much better, as ZFS has demonstrated for a long time.

linux/FedoraBtrfsDefaultView written at 23:47:04

I now think that blog 'per day' pages with articles are a mistake

Back in 2005 when I wrote DWiki, the engine that is used for Wandering Thoughts, there was an accepted standard structure for blogs that people followed, me included. For instance, it was accepted convention that the front page of your blog showed a number of the most recent articles, and you could page backward to older ones. Part of this structure was the idea that you would have a page for each day and that page would show the article or articles written that day (if any). When I put together DWiki's URL structure for blog-like areas, I followed this, and to this day Wandering Thoughts has these per-day pages.

I now think that these per-day pages are not the right thing to do on the modern web (for most blogs), for three reasons. The first reason is that they don't particularly help real blog usability, especially getting people to explore your blog after they land on a page. Most people make at most one post a day, so exploring day by day doesn't really get you anything more than 'next entry' and 'previous entry' links in each blog entry would (and if those links carry the destination's title, they probably give you more information than a date does).

The second reason is that because they duplicate content from your actual articles, they confuse search engine based navigation. Perhaps the search engine will know that the actual entry is the canonical version and present that in preference to the per-day page where the entry is also present, but perhaps not. And if you do have two entries in one day, putting both of their texts on one page risks disappointing someone who is searching for a combination of terms where one term appears only in one entry and the other term only in the second.

The third and weakest reason is a consequence of how on the modern web, everything gets visited. Per-day pages are additional pages in your blog and web crawlers will visit them, driving up your blog's resource consumption in the process. These days my feelings are that you generally want to minimize the number of pages in your blog, not maximize them, something I've written about more in The drawback of having a dynamic site with lots of URLs on today's web. But this is not a very strong reason, if you have a reasonably efficient blog and you serve per-day pages that don't have the full article text.

I can't drop per-day pages here on Wandering Thoughts, because I know that people have links to them and I want those links to keep working as much as possible. The simple thing to do is to stop putting full entries on per-day pages, and instead just put in their titles and links to them (just as I already do on per-month and per-year pages); this at least gets rid of the duplication of entry text and makes it far more likely that search engine based navigation will deliver people to the actual entry. The more elaborate thing would be to automatically serve an HTTP redirect to the entry for any per-day page that had only a single entry.

(For relatively obvious reasons you'd want to make this a temporary redirect.)
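
As a small illustration of the redirect approach (this is a generic Go sketch, not DWiki's actual code; the paths and data are made up), a per-day page with exactly one entry can answer with a temporary redirect to that entry:

package main

import "net/http"

// dayEntries maps a per-day page path to the URLs of that day's entries
// (hypothetical data for illustration).
var dayEntries = map[string][]string{
    "/blog/2020/07/07/": {"/blog/2020/07/07/some-entry"},
}

func dayPage(w http.ResponseWriter, r *http.Request) {
    entries := dayEntries[r.URL.Path]
    if len(entries) == 1 {
        // A temporary redirect, so nothing caches it permanently in case
        // the day later gains a second entry.
        http.Redirect(w, r, entries[0], http.StatusFound)
        return
    }
    // Otherwise we'd render a day index of just titles and links.
    w.Write([]byte("day index\n"))
}

func main() {
    http.HandleFunc("/blog/", dayPage)
    http.ListenAndServe(":8080", nil)
}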

There's a bit of me that's sad about this shift in blog design and web usage; the per-day, per-month, and per-year organization had a pleasant regularity and intuitive appeal. But I think its time has passed. More and more, we're all tending toward the kind of minimal URL structure typical of static sites, even when we have dynamic sites and so could have all the different URL structures and ways of accessing our pages that we could ask for.

web/BlogDroppingPerDayPages written at 00:09:25

2020-07-05

A Go lesson learned: sometimes I don't want to use goroutines if possible

We have a heavily NFS based server environment here, with multiple NFS servers and an IMAP server that accesses all mailboxes over NFS. That IMAP server has had ongoing issues with elevated load averages, and what at least seems to be IMAP slowness. However, our current metrics leave a lot of uncertainties about the effects of all of this, because we basically only have a little bit of performance data for a few IMAP operations. One thing I'd like to do is gather some very basic Unix level NFS performance data from our IMAP server and from some other machines, to see if I can see anything.

One very simple metric is how long it takes to read a little file from every NFS filesystem we have mounted on a machine. As it happens, we already have the little files (they're used for another system management purpose), so all I need is a program to open and read each one while timing how long it takes. There's an obvious issue with doing this sequentially, which is that if there's a single slow filesystem, it could delay everything else.

The obvious answer here was Go, goroutines, and some form of goroutine pool. Because the goroutines just do IO (and they're only being used to avoid one bit of IO delaying another separate bit), the natural size of the goroutine pool is fairly large, say 50 to 100 goroutines (we have a lot of NFS filesystems). This is quite easy and obvious to implement in Go, so I put together a little Go program for it and watched the numbers it generated as they jumped around.
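
A minimal sketch of this first version (the file names come from the command line here, and the pool size is illustrative; this isn't the actual program):

package main

import (
    "fmt"
    "os"
    "sync"
    "time"
)

// timeRead times how long it takes to open and read a little file.
func timeRead(target string) (time.Duration, error) {
    buffer := make([]byte, 4096)
    start := time.Now()
    file, err := os.Open(target)
    if err != nil {
        return time.Since(start), err
    }
    defer file.Close()
    _, err = file.Read(buffer)
    return time.Since(start), err
}

func main() {
    targets := os.Args[1:] // one small file per NFS filesystem
    const workers = 50     // the 'fairly large' pool size

    work := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for t := range work {
                d, err := timeRead(t)
                fmt.Printf("%s: %v %v\n", t, d, err)
            }
        }()
    }
    for _, t := range targets {
        work <- t
    }
    close(work)
    wg.Wait()
}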

Then, out of reflexive caution, I tried running the same program with a goroutine pool size of 1, which more or less forced serial execution (the pool goroutine infrastructure was still active but there was only one worker goroutine doing all the reading). To my surprise the 'time to read a file' number for all filesystems was visibly and decidedly lower. I could run the program side by side with the two different goroutine pool sizes and see this clearly.

Some thinking gave me a possible reason why this is so. My core code does essentially the following (minus error checking):

start := time.Now()
file, err := os.Open(target)   // first system call: open
n, err := file.Read(buffer)    // second system call: read
duration := time.Now().Sub(start)

This sequence makes two system calls and each system call is a potential goroutine preemption point. If a goroutine gets preempted during either system call, it can only record the finishing time once it's scheduled again (and finishes the read, if it was preempted in the open). If there are 50 or more goroutines all doing this, some of them could well be preempted and then not scheduled for some time, and that scheduling delay will show up in the final duration. When there aren't multiple goroutines active, there should be very little scheduling delay and the recorded durations (especially the longest durations) will be lower. And the ideal situation for essentially no goroutine contention is of course one goroutine.

(Technically this makes two more system calls to get the time at the start and the end of the sequence, but on modern systems, especially Linux, these don't take long enough to trigger Go's system call preemption and probably don't even enter the kernel itself.)

Because I still worry about individual slow filesystems slowing everything down (or stalls on some filesystems), my solution was a more complicated work pool approach that starts additional worker goroutines only when all of the current ones seem to have stalled for too long. If all goes well (and it generally does in my testing), this runs with only one goroutine.
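
Here's a rough sketch of that stalled-worker expansion (the names, threshold, and structure are illustrative, not my actual code): the main loop collects results and only starts another worker when an entire check interval passes with no read finishing.

package main

import (
    "fmt"
    "os"
    "time"
)

type result struct {
    target   string
    duration time.Duration
    err      error
}

// timeRead is the same open-and-read timing as in the earlier sketch.
func timeRead(target string) (time.Duration, error) {
    buffer := make([]byte, 4096)
    start := time.Now()
    file, err := os.Open(target)
    if err != nil {
        return time.Since(start), err
    }
    defer file.Close()
    _, err = file.Read(buffer)
    return time.Since(start), err
}

func main() {
    targets := os.Args[1:]
    const maxWorkers = 50
    const stallTimeout = 500 * time.Millisecond // illustrative threshold

    work := make(chan string, len(targets))
    results := make(chan result, len(targets))
    for _, t := range targets {
        work <- t
    }
    close(work)

    worker := func() {
        for t := range work {
            d, err := timeRead(t)
            results <- result{t, d, err}
        }
    }

    // Start with a single worker, which is the ideal case.
    go worker()
    workers := 1

    ticker := time.NewTicker(stallTimeout)
    defer ticker.Stop()
    progressed := false
    for seen := 0; seen < len(targets); {
        select {
        case r := <-results:
            fmt.Printf("%s: %v %v\n", r.target, r.duration, r.err)
            seen++
            progressed = true
        case <-ticker.C:
            // Nothing finished for a whole interval: everything currently
            // running looks stalled, so start one more worker.
            if !progressed && workers < maxWorkers {
                go worker()
                workers++
            }
            progressed = false
        }
    }
}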

(My current code has the drawback that once the goroutine worker pool expands, all of them stay active, which means that enough slow filesystems early on in the checks could get me back to the thundering herd problem. I'm still thinking about that issue.)

programming/GoWhenNotManyGoroutines written at 23:52:25

2020-07-04

How you get multiple TLS certificate chains from a server certificate

I've known and read for some time that a single server certificate can have more than one chain to a root certificate that you trust, but I never really thought about the details of how this worked. Then the AddTrust thing happened, I started writing about how Prometheus's TLS checks would have reacted to it, and Guus left a comment on that entry that got me thinking about what else Prometheus could sensibly look at here. So now I want to walk through the mechanics of multiple TLS chains to get this straight in my head.

Your server certificate and the other TLS certificates in a chain are each signed by an issuer; in a verified chain, this chain of issuers eventually reaches a Certificate Authority root certificate that people have some inherent trust in. However, a signed certificate doesn't specifically name and identify the issuer's certificate by, say, its serial number or hash; instead issuers are identified by their X.509 Subject Name and also at least implicitly by their keypair (and sometimes explicitly). By extension, your signed certificate also identifies the key type of the issuer's certificate; if your server certificate is signed by RSA, an intermediate certificate with an ECDSA keypair is clearly not the correct parent certificate.

(Your server certificate implicitly identifies the issuer by keypair because they signed your certificate with it; an intermediate certificate with a different keypair can never validate the signature on your certificate.)

However, several certificates can have the same keypair and X.509 Subject Name, provided that other attributes differ. One such attribute is the issuer that signed them (including whether this is a self-signed CA root certificate). So the first thing is that having more than one certificate for an issuer is generally required to get multiple chains. If you only have one certificate for each issuer, you can pretty much only build a single chain.

There are three places that these additional certificates for an issuer can come from; they can be sent by the server, they can be built into your certificate store in advance, or they can be cached because you saw them in some other context. The last is especially common with browsers, which often cache intermediate certificates that they see and may use them in preference to the intermediate certificate that a TLS server sends. Other software is generally more static about what it will use. My guess is that we're unlikely to have multiple certificates for a single CA root issuer, at least for modern CAs and modern root certificate sets as used by browsers and so on. This implies that the most likely place to get additional issuer certificates is from intermediate certificates sent by a server.

(In any case, it's fairly difficult to know what root certificate sets clients are using when they talk to your server. If your server sends the CA root certificate you think should be used as part of the certificate chain, a monitoring client (such as Prometheus's checks) can at most detect when it's got an additional certificate for that CA root issuer in its own local trust store.)

One cause of additional issuer certificates is what's called cross-signing a CA's intermediate certificate, as is currently the case with Let's Encrypt's certificates. In cross-signing, a CA generates two versions of its intermediate certificate, using the same X.509 Subject Name and keypair; one is signed by its own CA root certificate and one is signed by another CA root certificate. A CA can also cross-sign its own new root certificate (well, the keypair and Subject Name) directly, as is the case with the DST Root CA X3 certificate that Let's Encrypt is currently cross-signed with; one certificate for 'DST Root CA X3' is self-signed and likely in your root certificate set, but two others existed that were cross-signed by an older DST CA root certificate.

(As covered in the certificate chain illustrations in Fixing the Breakage from the AddTrust External CA Root Expiration, this was also the case with the expiring AddTrust root CA certificate. The 'USERTrust RSA Certification Authority' issuer was also cross-signed to 'AddTrust External CA Root', a CA root certificate that expired along with that cross-signed intermediate certificate. And this USERTrust root issuer is still cross-signed to another valid root certificate, 'AAA Certificate Services'.)

This gives us some cases for additional issuer certificates:

  • your server's provided chain includes multiple intermediate certificates for the same issuer, for example both Let's Encrypt intermediate certificates. A client can build one certificate chain through each.

  • your server provides an additional cross-signed CA certificate, such as the USERTrust certificate signed by AddTrust. A client can build one certificate chain that stops at the issuer certificate that's in its root CA set, or it can build another chain that's longer, using your extra cross-signed intermediate certificate.

  • the user's browser knows about additional intermediate certificates and will build additional chains using them, even though your server doesn't provide them in its set of certificates. This definitely happens, but browsers are also good about handling multiple chains.
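
You can see this multiple-chain behavior directly in Go's crypto/x509, whose Verify returns every chain it can build from the server's provided certificates plus the local root pool (the host name here is just an example):

package main

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
)

func main() {
    conn, err := tls.Dial("tcp", "example.com:443", &tls.Config{})
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // The server's leaf certificate comes first; anything after it is a
    // potential intermediate that the server chose to send.
    certs := conn.ConnectionState().PeerCertificates
    inters := x509.NewCertPool()
    for _, c := range certs[1:] {
        inters.AddCert(c)
    }

    chains, err := certs[0].Verify(x509.VerifyOptions{
        DNSName:       "example.com",
        Intermediates: inters,
        // Roots is left nil, so the system's root certificate set is used.
    })
    if err != nil {
        panic(err)
    }
    for i, chain := range chains {
        fmt.Printf("chain %d:\n", i)
        for _, c := range chain {
            fmt.Printf("  %s (expires %s)\n", c.Subject, c.NotAfter.Format("2006-01-02"))
        }
    }
}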

In a good world, all intermediate certificates will have an expiration time no later than that of the best certificate for the issuer that signed them. This was the case with the AddTrust expiration; the cross-signed USERTrust certificate expired at the same time as the AddTrust root certificate. In this case you can detect the problem by noticing that a server-provided intermediate certificate is expiring soon. If only a CA root certificate at the end of an older chain is expiring soon and the intermediate certificate signed by it has a later expiration date, you need to check the expiration time of the entire chain.

As a practical matter, monitoring the expiry time of all certificates provided by a TLS server seems very likely to be enough to detect multiple chain problems such as the AddTrust issue. Competent Certificate Authorities shouldn't issue server or intermediate certificates with expiry times later than their root (or intermediate) certificates, so we don't need to try to find and explicitly check those root certificates. This will also alert on expiring certificates that were provided but that can't be used to construct any chain, but you probably want to get rid of those anyway.
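
A minimal sketch of that practical check (the host and warning threshold are made up; a real monitoring setup would export these times as metrics rather than printing warnings):

package main

import (
    "crypto/tls"
    "fmt"
    "time"
)

func main() {
    const warn = 30 * 24 * time.Hour // illustrative threshold

    conn, err := tls.Dial("tcp", "example.com:443", &tls.Config{})
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Check every certificate the server sent, not just the leaf; an
    // expiring cross-signed intermediate shows up here too.
    for _, c := range conn.ConnectionState().PeerCertificates {
        left := time.Until(c.NotAfter)
        if left < warn {
            fmt.Printf("WARNING: %s expires in %v\n", c.Subject, left)
        }
    }
}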

Sidebar: Let's Encrypt certificate chains in practice

Because browsers do their own thing, a browser may construct multiple certificate chains for Let's Encrypt certificates today even if your server only provides the LE intermediate certificate that is signed by DST Root CA X3 (the current Let's Encrypt default for the intermediate certificate). For example, if you visit Let's Encrypt's test site for their own CA root, your browser will probably cache the LE intermediate certificate that chains to the LE CA root certificate, and then visiting other sites using Let's Encrypt may cause your browser to ignore their intermediate certificate and chain through the 'better' one it already has cached. This is what currently happens for me on Firefox.

tech/TLSHowMultipleChains written at 16:03:00

What a TLS self signed certificate is at a mechanical level

People routinely talk about self signed TLS certificates. You use them in situations where you just need TLS but don't want to set up an internal Certificate Authority and can't get an official TLS certificate, and many CA root certificates are self signed. But until recently I hadn't thought about what a self signed certificate is, mechanically. So here is my best answer.

To simplify a lot, a TLS certificate is a bundle of attributes wrapped around a public key. All TLS certificates are signed by someone; we call this the issuer. The issuer for a certificate is identified by their X.509 Subject Name, and also at least implicitly by the keypair used to sign the certificate (since only an issuer TLS certificate with the right public key can validate the signature).

So this gives us the answer for what a self signed TLS certificate is. It's a certificate that lists its own Subject Name as the issuer and is signed with its own keypair (using some appropriate key signature algorithm, such as SHA256-RSA for RSA keys). It still has all of the usual TLS certificate attributes, especially 'not before' and 'not after' dates, and in many cases they'll be processed normally.

Self signed certificates are not automatically CA certificates for a little private CA. Among other things, the self-signed certificate can explicitly set an 'I am not a CA' marker in itself. Whether software respects this if someone explicitly tells it to trust the self-signed certificate as a CA root certificate is another matter, but at least you tried.
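
Go's crypto/x509 makes the mechanics visible: a self-signed certificate is created by passing the same certificate as both the template and the parent and signing with its own key. This is only an illustrative sketch (the Subject and lifetime are made up); note the explicit 'I am not a CA' marker:

package main

import (
    "crypto/rand"
    "crypto/rsa"
    "crypto/x509"
    "crypto/x509/pkix"
    "encoding/pem"
    "math/big"
    "os"
    "time"
)

func main() {
    key, err := rsa.GenerateKey(rand.Reader, 2048)
    if err != nil {
        panic(err)
    }

    tmpl := x509.Certificate{
        SerialNumber: big.NewInt(1),
        Subject:      pkix.Name{CommonName: "myhost.example.org"},
        NotBefore:    time.Now(),
        NotAfter:     time.Now().Add(365 * 24 * time.Hour),

        // With BasicConstraintsValid set and IsCA false, the certificate
        // carries an explicit 'I am not a CA' marker.
        BasicConstraintsValid: true,
        IsCA:                  false,
    }

    // The template and the parent are the same certificate, and the signing
    // key is the keypair whose public half is embedded in it: self-signed.
    der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
    if err != nil {
        panic(err)
    }
    pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
}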

Self-signed certificates do have a serial number (which should be unique), and a unique cryptographic hash. Browsers that have been told to trust a self-signed certificate are probably using either these or a direct comparison of the entire certificate to determine if you're giving them the same self-signed certificate, instead of following the process used for identifying issuers (of checking the issuer Subject Name and so on). This likely means that if you re-issue a self-signed certificate using the same keypair and Subject Name, browsers may not automatically accept it in place of your first one.
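
For illustration, the sort of identity a browser might remember is a hash over the certificate's entire DER encoding (the file name here is made up); this fingerprint changes if anything in the certificate changes, even if the keypair and Subject Name stay the same:

package main

import (
    "crypto/sha256"
    "crypto/x509"
    "encoding/pem"
    "fmt"
    "os"
)

func main() {
    data, err := os.ReadFile("selfsigned.pem") // illustrative path
    if err != nil {
        panic(err)
    }
    block, _ := pem.Decode(data)
    if block == nil {
        panic("no PEM block found")
    }
    cert, err := x509.ParseCertificate(block.Bytes)
    if err != nil {
        panic(err)
    }
    // cert.Raw is the complete DER encoding of the certificate.
    fmt.Printf("%x\n", sha256.Sum256(cert.Raw))
}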

(As far as other software goes, who knows. There are dragons all over those hills, and I suspect that there is at least some code that accepts a matching Subject Name and keypair as good enough.)

tech/TLSWhatIsSelfSignedCert written at 00:25:10

2020-07-02

The work that's not being done from home is slowly accumulating for us

At the moment, the most recent things I've seen have talked about us not returning to the office before September, and then not all of us at one time. This gives me complicated feelings, including about what work we are doing. From the outside, our current work from home situation probably looks like everything is going pretty well. We've kept the computing lights on, things are being done, and so far all of the ordinary things that people ask of us get done as promptly as usual. A few hardware issues have come up and have been dealt with by people making brief trips into the office. So it all looks healthy; you might even wonder why we need offices.

When I look at the situation from inside, things are a bit different. We may be keeping the normal lights on, but at the same time there's a steadily growing amount of work that is not being done because of our working from home. The most obvious thing is that ordering new servers and other hardware has basically been shut down; not only are we not in the office to work on any hardware, it mostly can't even be delivered to the university right now.

The next obvious thing is the timing of any roll out of Ubuntu 20.04 on our machines. Under normal circumstances, we'd have all of the infrastructure for installing 20.04 machines ready and probably some test machines out there for people to poke at, and we'd be hoping to migrate a number of user-visible machines in August before the fall semester starts. That's looking unlikely, since at this point all we have is an ISO install image that's been tested only in temporary virtual machines. Since we haven't been in the office, we haven't set up any real servers running 20.04 on an ongoing basis. We're in basically the bad case situation I imagined back in early April.

(And of course many of the people we'd like to have poke at 20.04 test machines are busy with their own work from home problems, so even if we had test machines, they would probably get less testing than usual.)

Another sign is that many of our Ubuntu servers have been up without a reboot for what is now an awfully long time for us. Under normal circumstances we might have scheduled a kernel update and reboot by now, but under work from home conditions we only want to take the risk of doing kernel updates and rebooting important servers if there is something critical. If something goes wrong, it's not a walk down to the machine room, it's a trip into the office (and a rather longer downtime).

There's also a slowly accumulating amount of pending physical networking work, where we're asked to change what networks particular rooms or network ports are on because people are moving around. This work traditionally grows as the fall semester approaches and space starts getting sorted out for new graduate students and so on, although that could change drastically this year depending on the university's overall plans for what graduate students will do and where they will work.

(To put it one way, a great deal of graduate student space is not set up for appropriate physical distancing. Nor is a fair amount of other office and lab space.)

One level up from this is that there are a number of projects that need to use some physical servers. We have a bunch of OpenBSD machines on old OpenBSD versions that could do with updates (and refreshes onto new hardware), for example, but we need to build them out in test setups first. Another example is that we have plans to significantly change how we currently use SLURM, but that needs a few machines to set up a little new cluster (on our server network, as part of our full environment).

(A number of these projects need custom network connectivity, such as new test firewalls needing little test networks. Traditionally we build this in some of our 'lab space', with servers just sitting on a table wired together.)

Much of this is inherent in us having and using physical servers. Having physical servers in a machine room means receiving new hardware, racking it, cabling it up, and installing it, all of which we have to do in person (plus at least pulling the cables out of any old hardware that it's replacing). Some of it (such as our reluctance to reboot servers) is because we don't have full remote KVM over IP capabilities on our servers.

PS: We're also lucky that all of this didn't happen in a year when we'd planned to get and deploy a major set of hardware, such as the year when we got the hardware for our current generation of fileservers.

sysadmin/WorkNotDoneFromHome written at 23:06:37

Link: Code Only Says What it Does

Marc Brooker's Code Only Says What it Does (via) is about what code doesn't say and why all of those things matter. Because I want you to read the article, I'm going to quote all of the first paragraph:

Code says what it does. That's important for the computer, because code is the way that we ask the computer to do something. It's OK for humans, as long as we never have to modify or debug the code. As soon as we do, we have a problem. Fundamentally, debugging is an exercise in changing what a program does to match what it should do. It requires us to know what a program should do, which isn't captured in the code. Sometimes that's easy: What it does is crash, what it should do is not crash. Outside those trivial cases, discovering intent is harder.

This is not an issue that's exclusive to programming, as I've written about in Configuration management is not documentation, at least not of intentions (procedures and checklists and runbooks aren't documentation either). In computing we love to not write documentation, but not writing down our intentions in some form is just piling up future problems.

links/CodeOnlySaysWhatItDoes written at 19:49:45

2020-07-01

In ZFS, your filesystem layout needs to reflect some of your administrative structure

One of the issues we sometimes run into with ZFS is that ZFS essentially requires you to reflect your administrative structure for allocating and reserving space in how you lay out ZFS filesystems and filesystem hierarchies. This is because in ZFS, all space management is handled through the hierarchy of filesystems (and perhaps in having multiple pools). If you want to make two separate amounts of space available to two separate sets of filesystems (or collectively reserved by them), either they must be in different pools or they must be under different dataset hierarchies within the pool.

(These hierarchies don't have to be visible to users, because you can mount ZFS filesystems under whatever names you want, but they exist in the dataset hierarchy in the pool itself and you'll periodically need to know them, because some commands require the full dataset name and don't work when given the mount point.)

That sounds abstract, so let me make it concrete. Simplifying only slightly, our filesystems here are visible to people as /h/NNN (for home directories) and /w/NNN (workdirs, for everything else). They come from some NFS server and live in some ZFS pool there (inside little container filesystems), but the NFS server and to some extent the pool is an implementation detail. Each research group has its own ZFS pool (or for big ones, more than one pool because one pool can only be so big), as do some individual professors. However, there are not infrequently cases where a professor in a group pool would like to buy extra space that is only for their students, and also this professor has several different filesystems in the pool (often a mixture of /h/NNN homedir filesystems and /w/NNN workdir ones).

This is theoretically possible in ZFS, but in order to implement it ZFS would force us to put all of a professor's filesystems under a sub-hierarchy in the pool. Instead of the current tank/h/100 and tank/w/200, they would have to be something like tank/prof/h/100 and tank/prof/w/200. The ZFS dataset structure is required to reflect the administrative structure of how people buy space. One of the corollaries of this is that you can basically only have a single administrative structure for how you allocate space, because a dataset can only be in one place in the ZFS hierarchy.

(So if two professors want to buy space separately for their filesystems but there's a filesystem shared between them (and they each want it to share in their space increase), you have a problem.)

If there were sub-groups of people who wanted to buy space collectively, we'd need an even more complicated dataset structure. Such sub-groups are not necessarily decided in advance, so we can't set up such a hierarchy when the filesystems are created; we'd likely wind up having to periodically modify the dataset hierarchy. Fortunately the manpages suggest that 'zfs rename' can be done without disrupting service to the filesystem, provided that the mountpoint doesn't change (which it wouldn't, since we force those to the /h/NNN and /w/NNN forms).

While our situation is relatively specific to how we sell space, people operating ZFS can run into the same sort of situation any time they want to allocate or control collective space usage among a group of filesystems. There are plenty of places where you might have projects that get so much space but want multiple filesystems, or groups (and subgroups) that should be given specific allocations or reservations.

PS: One reason not to expose these administrative groupings to users is that they can change. If you expose the administrative grouping in the user visible filesystem name and where a filesystem belongs shifts, everyone gets to change the name they use for it.

solaris/ZFSAdminVsFilesystemLayout written at 22:58:55

2020-06-30

The unfortunate limitation in ZFS filesystem quotas and refquota

When ZFS was new, the only option it had for filesystem quotas was the quota property, which I had an issue with and which caused us practical problems in our first generation of ZFS fileservers because it covered the space used by snapshots as well as the regular user-accessible filesystem. Later ZFS introduced the refquota property, which did not have that problem but in exchange doesn't apply to any descendant datasets (regardless of whether they're snapshots or regular filesystems). At one level this issue with refquota is fine, because we put quotas on filesystems to limit their maximum size to what our backup system can comfortably handle. At another level, this issue impacts how we operate.

All of this stems from a fundamental lack in ZFS quotas, which is that ZFS's general quota system doesn't let you limit space used only by unprivileged operations. Writing into a filesystem is a normal everyday thing that doesn't require any special administrative privileges, while making ZFS snapshots (and clones) requires special administrative privileges (either from being root or from having had them specifically delegated to you). But you can't tell them apart in a hierarchy, because ZFS only offers you the binary choice of ignoring all space used by descendants (regardless of how it occurs) or ignoring none of it, sweeping up specially privileged operations like creating snapshots with ordinary activities like writing files.

This limitation affects our pool space limits, because we use them for two different purposes: restricting people to only the space that they've purchased and ensuring that pools always have a safety margin of space. Since pools contain many filesystems, we must limit their total space usage using the quota property. But that means that any snapshots we make for administrative purposes consume space that's been purchased, and if we make too many of them we'll run the pool out of space for completely artificial reasons. It would be better to be able to have two quotas, one for the space that the group has purchased (which would limit only regular filesystem activity) and one for our pool safety margin (which would limit snapshots too).

(This wouldn't completely solve the problem, though, since snapshots still consume space and if we made too many of them we'd run a pool that should have free space out of even its safety margin. But it would sometimes make things easier.)

PS: I thought this had more of an impact on our operations and the features we can reasonably offer to people, but the more I think about it the less it does. Partly this is because we don't make much use of snapshots, though, for various reasons that sort of boil down to 'the natural state of disks is usually full'. But that's for another entry.

solaris/ZFSHierarchyQuotaLack written at 22:17:27
