Wandering Thoughts


Some things on Illumos NFS export permissions

Perhaps at one point I fully understood Solaris and thus Illumos NFS export permissions (but I suspect not). If so, that understanding fell out of my mind at some point over the years since then, and just now I had the interesting experience of discovering that our NFS export permissions have sort of been working by accident.

I'll start with a ZFS sharenfs setting and then break it down. The two most ornate ones we have look like this:


The AAA and BBB netgroups don't overlap with nfs_ssh, but nfs_root and nfs_oldmail are both subsets of nfs_ssh.

The first slightly tricky bit is root=. As the manual page explains in the NOTES section, all that root= does is change the interpretation of UID 0 for clients that are already allowed to read or write to the NFS share. Per the manual page, 'the access the host gets is the same as when the root= option is absent' (and this may include no access). As a corollary, 'root=NG,ro=NG' is basically the same as 'ro=NG,anon=0'. Since our root= netgroups are a subset of our general allowed-access netgroups, we're okay here.

(This part I sort of knew already, or at least I assumed it without having hunted it down specifically in the manual page. See eg this entry.)

The next tricky bit is the interaction of rw= and ro=. Before just now I would have told you that rw= took priority over ro= if you had a host that was included in both (via different netgroups), but it turns out that whichever one is first takes priority. We were getting rw-over-ro effects because we always listed rw= first, but I don't think we necessarily understood that when we wrote the second sharenfs setting. The manual page is explicit about this:

If rw= and ro= options are specified in the same sec= clause, and a client is in both lists, the order of the two options determines the access the client gets.

(Note that the behavior is different if you use general ro or rw. See the manpage.)
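
As a rough illustration of the ordering rule quoted above, here is a minimal Python sketch. It is a toy model, not how Illumos implements anything: it assumes the share options have already been split into an ordered list of (name, netgroup list) pairs and that a client's netgroup membership is available as a plain set, and it ignores sec= clauses, bare ro and rw, and root=. The function name and the data layout are made up for this example.

```python
def effective_access(options, client_netgroups):
    """Return 'rw', 'ro', or None for a client, under a simplified model
    where the first rw= or ro= list that matches the client wins."""
    for name, netgroups in options:
        if name not in ("rw", "ro"):
            continue
        if client_netgroups & set(netgroups):
            return name
    return None


# A client that is in both nfs_ssh and nfs_oldmail (the second netgroup
# is a subset of the first, so this is the overlapping case).
client = {"nfs_ssh", "nfs_oldmail"}

rw_first = [("rw", ["nfs_oldmail"]), ("ro", ["nfs_ssh"])]
ro_first = [("ro", ["nfs_ssh"]), ("rw", ["nfs_oldmail"])]

print(effective_access(rw_first, client))  # rw
print(effective_access(ro_first, client))  # ro
```

The same client gets different access purely because of which option is listed first, which is the effect we had been relying on without realizing it.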

We would have noticed if we flipped this around for the one filesystem with overlapping ro= and rw= groups, since the machine that was supposed to be able to write to the filesystem would have failed (and the failure would have stalled our mail system). But it's still sort of a narrow escape.

What this shows me vividly, once again, is the appeal of casual superstition. I really thought I understood how Illumos NFS exports worked (and I only checked the manpage to see if it described things explicitly, and that because I was writing an entry for here). Instead I had drifted into a convenient assumption of how things were.

Sidebar: Our general narrow miss on this

We have a bunch of local programs for managing our fileservers. One of the things these programs do is manipulate NFS exports options, so that we can have a configuration file that sets general share options and then allows us to specify that specific filesystems extend them, with convenient syntax, eg:

# global share options 
shareopts nosuid,sec=sys,rw=nfs_ssh,root=nfs_root

# our SAN filesystems:
fs3-corestaff-01   /h/281   rw+=AAA

This means that /h/281 should be exported read-write to the AAA netgroup as well as the usual main netgroup for our own machines.

The actual code is written in Python and turns all of the NFS exports options into Python dictionary keys and values. Python dictionaries are unordered, so under normal circumstances reassembling the exports options would have put them into some random order, so anything with both rw= and ro= could have wound up in the wrong order. However, conveniently I decided to put the NFS export options into a canonical order when I converted them back to string form, and this put rw= before ro= (and sec=sys before both). There's no sign in my code comments that I knew this was important; it seems to have just been what I thought of as the correct canonical ordering. Possibly I was blindly copying and preserving earlier work where we always had rw= first.
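
To make the idea concrete, here is a small sketch of this sort of round trip. It is not our actual code; the function names are invented for illustration, the position of nosuid and root= in the canonical order is a guess (the only ordering described above is sec= before rw= before ro=), and it assumes that a 'rw+=' extension simply appends to the colon-separated access list.

```python
# Assumed canonical order. Only "sec before rw before ro" comes from the
# description above; where the other options go is a guess.
CANONICAL_ORDER = ["nosuid", "sec", "rw", "ro", "root"]


def parse_shareopts(optstr):
    """Split 'a,b=c' style options into a dict; bare flags map to None."""
    opts = {}
    for item in optstr.split(","):
        key, sep, value = item.partition("=")
        opts[key] = value if sep else None
    return opts


def extend_opt(opts, key, netgroup):
    """Apply a 'key+=netgroup' extension, joining access lists with ':'."""
    new = dict(opts)
    new[key] = f"{new[key]}:{netgroup}" if new.get(key) else netgroup
    return new


def format_shareopts(opts):
    """Reassemble options in canonical order, then unknown ones sorted."""
    known = [k for k in CANONICAL_ORDER if k in opts]
    extra = sorted(k for k in opts if k not in CANONICAL_ORDER)
    return ",".join(k if opts[k] is None else f"{k}={opts[k]}"
                    for k in known + extra)


base = parse_shareopts("nosuid,sec=sys,rw=nfs_ssh,root=nfs_root")
fs_opts = extend_opt(base, "rw", "AAA")
fs_opts["ro"] = "BBB"   # hypothetical extra read-only netgroup
print(format_shareopts(fs_opts))
# nosuid,sec=sys,rw=nfs_ssh:AAA,ro=BBB,root=nfs_root
```

Because the output order comes from the fixed list and not from the dictionary, rw= always lands before ro= no matter what order the options were added in.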

solaris/IllumosNFSExportsPerms written at 23:43:51


Why people are probably going to keep using today's Unixes

A while back I wrote about how the value locked up in the Unix API makes it durable. The short version is that there's a huge amount of effort and thus value invested in both the kernels (that provide one level of the Unix API) and in all of the programs and tools and systems that run on top of them, using the Unix APIs. If you start to depart from this API you start to lose access to all of those things.

The flipside of this is why I think people are probably going to keep using current Unixes in the future instead of creating new Unix-like OSes or Unix OSes. To a large extent, the potential value in departing from current Unixes lies in doing things differently at some API level, and once you depart from the API you're fighting the durable power of the Unix API. If you don't depart from the Unix API, it's hard to see much of a point; 'we wrote a different kernel but we still support all of the Unix API' (and variants) don't appear to have all that high a value. You're spending a lot of effort to wind up in essentially the same place.

(There was a day when you could argue that current Unix kernels and systems were fatally flawed and you could make important improvements. Given how well they work today and how much effort they represent, that argument is no longer very convincing. Perhaps we could do better, but can we do lots better, enough to justify the cost?)

In one way this is depressing; it means that the era of many Unixes and many Unix-like OSes flourishing is over. Not only is the cost of departing from Unix too high, but so is the cost of reimplementing it and possibly even keeping up with the leading implementations. The Unixes we have today are likely to be the only Unixes we ever have, and probably not all of them are going to survive over the long term (and that's apart from the commercial ones that are on life support today, like Solaris).

(This isn't really a new observation; Rob Pike basically made it a long time ago in the context of academic systems software research (see the mention in this entry).)

But this doesn't mean that innovation in Unix and the Unix API is dead; it just means that it has to happen in a different way. You can't drive innovation by creating a new Unix or Unix-like, but you can drive innovation by putting something new into a Unix that's popular enough, so it becomes broadly available and people start taking advantage of it (the obvious candidate here is Linux). It's possible that OpenBSD's pledge() will turn out to be such an innovation (whether other Unixes implement it as a system call or as a library function that uses native mechanisms).

(Note that not all attempts to extend or change the practical Unix API turn out to be good ideas over the long term.)

It also doesn't always mean that what we wind up with is really 'Unix' in a conventional sense. One thing that's already happening is that an existing Unix is used as the heart of something that has custom layers wrapped around it. Android, iOS, and macOS are all versions of this; they have a core layer that uses an existing Unix kernel and so on but then a bunch of things specific to themselves on top. These systems have harvested what they find to be the useful value of their Unix and then ignored the rest of it. Of course all of them represent a great deal of effort in their custom components, and they wouldn't have happened if the people involved couldn't extract a lot of value from that additional work.

(This extends my other tweet from the time of the first entry.)

unix/DurableCurrentUnixes written at 23:42:28


When I'll probably be able to use Python assignment expressions

The big recent Python news is that assignment expressions have been accepted for Python 3.8. This was apparently so contentious and charged a process that in its wake Guido van Rossum has stepped down as Python's BDFL. I don't have any strong feelings on assignment expressions for reasons beyond the scope of this entry, but today I want to think about how soon I could possibly use them in my Python code, and then how soon I could safely use them (ie how soon they will be everywhere I care about). The answers turn out to be surprising, at least to me (they're probably not to experienced Python hands).

The nominal Python 3.8 release schedule is set out in PEP 569. According to it, Python 3.8 is planned to be released in October of 2019; however, there's some signs that the Python people want to move faster on this (see this LWN article). If Python sticks to the original timing, Python 3.8 might make Ubuntu 20.04 LTS (released in April 2020 but frozen before then) and would probably make the next Fedora release if Fedora keeps to their current schedule and does a release in May of 2020. So at this point it looks like the earliest I'd be able to use assignment expressions is in about two years. If Python moves up the 3.8 release schedule significantly, it might make one Fedora release earlier (the fall 2019 release), making that about a year and a half before I could think about it.

There are many versions of 'can safely use' for me, but I'll pick the one for work. There 'safely use' means that they're supported by the oldest Ubuntu LTS release I need to run the Python code on. We're deploying long-lived Ubuntu 18.04 machines now that will only be updated starting in 2022, so if Python 3.8 makes Ubuntu 20.04 that will be when I can probably start thinking about it, because everything will be 2020 or later. That's actually a pretty short time to safe use as these things go, but that's a coincidence due to the release timing of Python 3.8 and Ubuntu LTS versions. If Python 3.8 misses Ubuntu 20.04 LTS, I'd have to wait another two years (to 2024) unless I only cared about running my code on Ubuntu 22.04.

Of course, I'm projecting things four to six years into the future and that's dangerous at the best of times. We've already seen that Python may change its release timing, and who knows about both Ubuntu and Fedora.

(It seems a reasonably safe guess that I'll still be using Fedora on my desktops over that time period, and pretty safe that we'll still be using Ubuntu LTS at work, but things could happen there too.)

The reason that all of this was surprising to me was that I assumed Python 3.8 was further along in its development if controversial and much argued over change proposals were getting accepted for it. I guess the arguments started well before Python 3.7 was released, which makes sense given the 3.7 release schedule; 3.7 was frozen at the end of January, so everyone could start arguing about 3.8 no later than then.

(The official PEP has an initial date of the end of February, but I've heard it was in development and being discussed before then, just not formalized yet as a PEP.)

PS: If Debian keeps to their usual release schedule, it looks like Python 3.8 on its original schedule would miss the next Debian stable version (Debian 10). It would probably miss it even on an aggressive release schedule that saw Python 3.8 come out only a year after 3.7, since 3.7 was released only a few weeks ago.

python/AssignmentExpressionsWhen written at 23:35:50

Understanding ZFS System Attributes

Like most filesystems, ZFS faces the file attribute problem. It has a bunch of file attributes, both visible ones like the permission mode and the owner and internal ones like the parent directory of things and file generation number, and it needs to store them somehow. But rather than using fixed on-disk structures like everyone else, ZFS has come up with a novel storage scheme for them, one that simultaneously deals with both different types of ZFS dnodes wanting different sets of attributes and the need to evolve attributes over time. In the grand tradition of computer science, ZFS does it with an extra level of indirection.

Like most filesystems, ZFS puts these attributes in dnodes using some extra space (in what is called the dnode 'bonus buffer'). However, the ZFS trick is that whatever system attributes a dnode has are simply packed into that space without being organized into formal structures with a fixed order of attributes. Code that uses system attributes retrieves them from dnodes indirectly by asking for, say, the ZPL_PARENT of a dnode; it never cares exactly how they're packed into a given dnode. However, obviously something does.

One way to implement this would be some sort of tagged storage, where each attribute in the dnode was actually a key/value pair. However, this would require space for all of those keys, so ZFS is more clever. ZFS observes that in practice there are only a relatively small number of different sets of attributes that are ever stored together in dnodes, so it simply numbers each distinct attribute layout that ever gets used in the dataset, and then the dnode just stores the layout number along with the attribute values (in their defined order). As far as I can tell from the code, you don't have to pre-register all of these attribute layouts. Instead, the code simply sets attributes on dnodes in memory, then when it comes time to write out the dnode in its on-disk format ZFS checks to see if the set of attributes matches a known layout or if a new attribute layout needs to be set up and registered.

(There are provisions to handle the case where the attributes on a dnode in memory don't all fit into the space available in the dnode; they overflow to a special spill block. Spill blocks have their own attribute layouts.)
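
The layout numbering idea can be sketched in a few lines of Python. This is only a toy model of the scheme as described above, not a rendering of the real ZFS code in sa.c: the class and method names are invented, sorting attribute names stands in for whatever 'defined order' ZFS really uses, and ZPL_MODE and ZPL_UID are used purely as example attribute names alongside ZPL_PARENT.

```python
class LayoutRegistry:
    """Toy model: number each distinct ordered set of attribute names."""

    def __init__(self):
        self._numbers = {}   # tuple of attribute names -> layout number
        self._layouts = []   # layout number -> tuple of attribute names

    def layout_for(self, names):
        """Return the number for this layout, registering it if it's new."""
        key = tuple(names)
        if key not in self._numbers:
            self._numbers[key] = len(self._layouts)
            self._layouts.append(key)
        return self._numbers[key]

    def pack(self, attrs):
        """Turn {name: value} into (layout number, values in layout order)."""
        names = sorted(attrs)    # assumed canonical order for this toy
        return self.layout_for(names), [attrs[n] for n in names]

    def lookup(self, packed, name):
        """Fetch one attribute by name without the caller knowing the order."""
        number, values = packed
        return values[self._layouts[number].index(name)]


reg = LayoutRegistry()
file_a = reg.pack({"ZPL_MODE": 0o644, "ZPL_UID": 1000, "ZPL_PARENT": 4})
file_b = reg.pack({"ZPL_MODE": 0o600, "ZPL_UID": 0, "ZPL_PARENT": 4})
meta = reg.pack({"ZPL_PARENT": 1})

print(file_a[0], file_b[0], meta[0])     # 0 0 1
print(reg.lookup(file_a, "ZPL_PARENT"))  # 4
```

The two file-like objects share a layout number and store only values, while the metadata-like object gets its own layout lazily, the first time that particular set of attributes is packed.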

I'm summarizing things a bit here; you can read all of the details and more in a big comment at the start of sa.c.

As someone who appreciates neat solutions to thorny problems, I quite admire what ZFS has done here. There is a cost to the level of indirection that ZFS imposes, but once you accept that cost you get a bunch of clever bonuses. For instance, ZFS uses dnodes for all sorts of internal pool and dataset metadata, and these dnodes often don't have any use for conventional Unix file attributes like permissions, owner, and so on. With system attributes, these metadata dnodes simply don't have those attributes and don't waste any space on them (and they can use the same space for other attributes that may be more relevant). ZFS has also been able to relatively freely add attributes over time.

By the way, this scheme is not quite the original scheme that ZFS used. The original scheme apparently had things more hard-coded, but I haven't dug into it in detail since this has been the current scheme for quite a while. Which scheme is in use depends on the ZFS pool and filesystem versions; modern system attributes require ZFS pool version 24 or later and ZFS filesystem version 5 or later. You probably have these, as they were added to (Open)Solaris in 2010.

solaris/ZFSSystemAttributes written at 01:11:37


The challenge of storing file attributes on disk

In pretty much every Unix filesystem and many non-Unix ones, files (and more generally all filesystem objects) have a collection of various basic attributes, things like modification time, permissions, ownership, and so on, as well as additional attributes that the filesystem uses for internal purposes (eg). This means that every filesystem needs to figure out how to store and represent these attributes on disk (and to a lesser extent in memory). This presents two problems, an immediate one and a long term one.

The immediate problem is that different types of filesystem objects have different attributes that make sense for them. A plain file definitely needs a (byte) length that is stored on disk, but that doesn't make any sense to store on disk for things like FIFOs, Unix domain sockets, and even block and character devices, and it's not clear if a (byte) length still makes sense for directories either given that they're often complex data structures today. There are also attributes that some non-file objects need that files don't; a classical example in Unix is st_rdev, the device ID of special files.

(Block and character devices may have byte lengths in stat() results but that's a different thing entirely than storing a byte length for them on disk. You probably don't want to pay any attention to the on-disk 'length' for them, partly so that you don't have to worry about updating it to reflect what you'll return in stat(). Non-linear directories definitely have a space usage, but that's usually reported in blocks; a size in bytes doesn't necessarily make much sense unless it's just 'block count times block size'.)

The usual answer for this is to punt. The filesystem will define an on-disk structure (an 'inode') that contains all of the fields that are considered essential, especially for plain files, and that's pretty much it. Objects that don't use some of the basic attributes still pay the space cost for them, and extra attributes you might want either get smuggled somewhere or usually just aren't present. Would you like attributes for how many live entries and how many empty entry slots are in a directory? You don't get it, because it would be too much overhead to have the attributes there for everything.

The long term problem is dealing with the evolution of your attributes. You may think that they're perfect now (or at least that you can't do better given your constraints), but if your filesystem lives for long enough, that will change. Generally, either you'll want to add new attributes or you'll want to change existing ones (for example, widening a timestamp from 32 bits to 64 bits). More rarely you may discover that existing attributes make no sense any more or aren't as useful as you thought.

If you thought ahead, the usual answer for this is to include unused extra space in your on-disk attribute structure and then slowly start using it for new attributes or extending existing ones. This works, at least for a while, but it has various drawbacks, including that because you only have limited space you'll have long arguments about what attributes are important enough to claim some of it. On the other hand, perhaps you should have long arguments over permanent changes to the data stored in the on-disk format and face strong pressures to do it only when you really have to.
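
Here is a small sketch of the reserved space approach, using Python's struct module. The inode layout is entirely hypothetical (it is not any real filesystem's format); it just shows a fixed structure with zeroed reserved bytes, and a later version that claims some of them to widen a 32-bit timestamp to 64 bits while still reading old inodes correctly.

```python
import struct

# Hypothetical version 1 inode: mode, uid, gid, size, 32-bit mtime,
# followed by 8 reserved bytes that are always written as zero.
INODE_V1 = struct.Struct("<HIIQI8x")

# Hypothetical version 2: the first 4 reserved bytes become the high
# 32 bits of mtime, leaving only 4 bytes still in reserve.
INODE_V2 = struct.Struct("<HIIQII4x")


def read_v2(raw):
    """Decode an on-disk inode using the version 2 layout."""
    mode, uid, gid, size, mtime_lo, mtime_hi = INODE_V2.unpack(raw)
    return {"mode": mode, "uid": uid, "gid": gid, "size": size,
            "mtime": (mtime_hi << 32) | mtime_lo}


# An inode written by 'old' code is still readable by 'new' code,
# because the reserved bytes it wrote were zero.
old = INODE_V1.pack(0o100644, 1000, 1000, 4096, 1531000000)
print(INODE_V1.size == INODE_V2.size)   # True
print(read_v2(old)["mtime"])            # 1531000000

# A timestamp that no longer fits in 32 bits round-trips in version 2.
big = 2**32 + 5
new = INODE_V2.pack(0o100644, 1000, 1000, 4096,
                    big & 0xFFFFFFFF, big >> 32)
print(read_v2(new)["mtime"] == big)     # True
```

The drawback is visible in the format strings: after one change, half of the reserve is gone, and every inode of every type has been paying for all of it the whole time.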

As an obvious note, the reason that people turn to a fixed on-disk 'inode' structure is that it's generally the most space-efficient option and you're going to have a lot of these things sitting around. In most filesystems, most of them will be for regular files (which will outnumber directories and other things), and so there is a natural pressure to prioritize what regular files need at the possible expense of other things. It's not the only option, though; people have tried a lot of things.

I've talked about the on-disk format for filesystems here, but you face similar issues in any archive format (tar, ZIP, etc). Almost all of them have file 'attributes' or metadata beyond the name and the in-archive size, and they have to store and represent this somehow. Often archive formats face both issues; different types of things in the archive want different attributes, and what attributes and other metadata needs to be stored changes over time. There have been some creative solutions over the years.

tech/FileAttributesProblem written at 00:28:48


ZFS on Linux's sharenfs problem (of what you can and can't put in it)

ZFS has an idea of 'properties' for both pools and filesystems. To quote from the Illumos zfs manpage:

Properties are divided into two types, native properties and user-defined (or "user") properties. Native properties either export internal statistics or control ZFS behavior. [...]

Filesystem properties are used to control things like whether compression is on, where a ZFS filesystem is mounted, if it is read-only or not, and so on. One of those properties is called sharenfs; it controls whether or not the filesystem is NFS exported, and what options it's exported with. One of the advantages of having ZFS manage this for you through the sharenfs property is that ZFS will automatically share and unshare things as the ZFS pool and filesystem are available or not available; you don't have to try to coordinate the state of your NFS shares and your ZFS filesystem mounts.

As I write this, the current ZFS on Linux zfs manpage says this about sharenfs:

Controls whether the file system is shared via NFS, and what options are to be used. [...] If the property is set to on, the dataset is shared using the default options:


See exports(5) for the meaning of the default options. Otherwise, the exportfs(8) command is invoked with options equivalent to the contents of this property.

That's very interesting wording. It's also kind of a lie, because ZFS on Linux caught itself in a compatibility bear trap (or so I assume).

This wording is essentially the same as the wording in Illumos (and in the original Solaris manpages). On Solaris, the sharenfs property is passed more or less straight to share_nfs as the NFS share options in its -o argument, and as a result what you put in sharenfs is just those options. This makes sense; the original Solaris version of ZFS was not created to be portable to other Unixes, so it made no attempt to have its sharenfs (or sharesmb) be Unix-independent. It was part of Solaris, so what went into sharenfs was Solaris NFS share options, including obscure ones.

It would have been natural for ZFS on Linux to take the same attitude towards what went into sharenfs on Linux, and indeed the current wording of the manpage sort of implies that this is what's happening and that you can simply use what you'd put in exports(5). Unfortunately, this is not the case. Instead, ZFS on Linux attempts to interpret your sharenfs setting as OmniOS NFS share options and tries to convert them to equivalent Linux options.

(I assume that this was done to make it theoretically easier to move pools and filesystems between ZoL and Illumos/Solaris ZFS, because the sharenfs property would mean the same thing and be interpreted the same way on both systems. Moving filesystems back and forth is not as crazy as it sounds, given zfs send and zfs receive.)

There are two problems with this. The first is that the conversion process doesn't handle all of the Illumos NFS share options. Some it will completely reject or fail on (they're just totally unsupported), while others it will accept but produce incorrect conversions that don't work. The set of accepted and properly handled conversions is not documented and is unlikely to ever be. The second problem is that Linux can do things with NFS share options that Illumos doesn't support (the reverse is true too, but less directly relevant). Since ZFS on Linux provides you no way to directly set Linux share options, you can't use these Linux specific NFS share options at all through sharenfs.

Effectively what the current ZFS on Linux approach does is that it restricts you to an undocumented subset of the Illumos NFS share options that are supported by Linux and correctly converted by ZoL. If you're doing anything at all sophisticated with your NFS sharing options (as we are), this means that using sharenfs on Linux is simply not an option. We're going to have to roll our own NFS share option handling and management system, which is a bit irritating.

(We're also going to have to make sure that we block or exclude sharenfs properties from being transferred from our OmniOS fileservers to our ZoL fileservers during 'zfs send | zfs receive' copies, which is a problem that hadn't occurred to me until I wrote this entry.)

PS: There is an open ZFS on Linux issue to fix the documentation; it includes mentions of some mis-parsing sharenfs bugs. I may even have the time and energy to contribute a patch at some point.

PPS: Probably what we should do is embed our Linux NFS share options as a ZFS filesystem user property. This would at least allow our future management system to scan the current ZFS filesystems to see what the active NFS shares and share options should be, as opposed to having to also consult and trust some additional source of information for that.
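
As a sketch of what the scanning side of this might look like, here is a small Python function that parses the output of 'zfs get' in its scripted mode. Everything specific here is an assumption for illustration: the user property name (cslab:nfsopts), the dataset names, and the stored option string are all invented, and the sample text stands in for the output of something like 'zfs get -H -o name,value -t filesystem cslab:nfsopts'. It relies on the scripted mode output being tab-separated and on unset properties being reported with a value of '-'.

```python
def parse_zfs_get(output):
    """Parse tab-separated 'name<TAB>value' lines into {filesystem: value},
    skipping filesystems where the property is unset ('-')."""
    shares = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, value = line.split("\t", 1)
        if value != "-":
            shares[name] = value
    return shares


# Hypothetical output for a hypothetical user property.
sample = (
    "tank\t-\n"
    "tank/h/281\trw,no_subtree_check\n"
    "tank/h/282\t-\n"
)

print(parse_zfs_get(sample))
# {'tank/h/281': 'rw,no_subtree_check'}
```

A real management system would then have to turn those stored options into actual exports, which is the part that sharenfs was supposed to save us from.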

linux/ZFSOnLinuxSharenfsProblem written at 01:03:29


You should probably write down what your math actually means

Let's start with my tweet:

I have no brain, but at least I can work out what this math in my old code actually means, something I didn't bother to do when I wrote the original code & comments years ago. (That was a mistake, but maybe I didn't actually understand the math then, it just looked good.)

For the almost authentic experience I'm going to start with the code and its comments that I was looking at and explain it afterward.

       # if we are not going to protect a lot more data
       # than has already been protected, no. Note that
       # this is not the same thing as looking at the
       # percentages, because it is effectively percentages
       # multiplied by the number of disks being resilvered.
       # Or something.
       if rdata < psum*4 or rdata < esum*2:
               return (False, "won't protect enough extra data")

ZFS re-synchronizes redundant storage in an unusual way, which has some unfortunate implications some of the time. I will just quote myself from that entry:

So: if you have disk failure in one mirror vdev, activate a spare, and then have a second disk fail in another mirror and activate another spare, work on resilvering the first spare will immediately restart from scratch. [...]

My code and comment are from a function in our ZFS spares handling system that is trying to decide if it should activate another spare even though it will abort an in-progress spare replacement, or if it should let the first spare replacement finish.

The problem with this comment is that while it explains the idea of this check to a certain extent, it doesn't explain the math at all; the math is undocumented magic. It's especially undocumented magic if you don't know what rdata, psum, and esum are and where they come from, as I didn't when I was returning to this code for the first time in several years (because I wanted to see if it still made sense in a different environment). Since there's no explanation of the math, we don't know if it actually expresses the comment's idea or if it's making some sort of mistake, perhaps camouflaged by how various terms are calculated.

(It's not that hard to get yourself lost in a tangle of calculated terms. See, for example, the parenthetical discussion of how svctm is calculated in this entry.)

In fact when I dug into this, it turns out that my math was at least misleading for us. I'll quote some comments:

# psum: repaired space in all resilvering vdevs in bytes
# esum: examined space in all resilvering vdevs in bytes
# NOTE: psum == esum for mirrors, but not necessarily for
# raidz vdevs.

Our ZFS fileservers only have mirrors and none of our spares handling code has ever been tested on raidz pools. Using both psum and esum in my code was at best a well intentioned brain slip, but in practice it was misleading. Since both are the same, the real condition is the larger one, ie 'rdata < psum*4'. rdata itself is an estimate of how much currently unredundant data we're going to add redundancy for with our new spare or spares.

To start, let's rewrite that condition to be clearer. Ignoring various pragmatic math issues, 'rdata < psum*4' is the same as 'rdata/4 < psum'. In words and expanding the variables out, this is true if we've already repaired more than one quarter of the additional data we'd make redundant by adding more spares.
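
For illustration, here is roughly what the check might look like with the understanding written down in the code itself. This is a sketch and not a replacement for our real code; the function and parameter names are invented, and only the factor of 4 and the meaning of the two quantities come from the original.

```python
def too_far_along(newly_protected, already_repaired, fraction=4):
    """True if the in-progress resilver has already repaired more than
    1/fraction of the data that activating more spares would protect.

    newly_protected:  bytes of currently unredundant data that the new
                      spare(s) would add redundancy for (was 'rdata').
    already_repaired: bytes repaired so far across all resilvering
                      vdevs (was 'psum'); this work is thrown away if
                      the resilver restarts.
    """
    return newly_protected / fraction < already_repaired


def original_check(rdata, psum):
    """The original form of the condition, for comparison."""
    return rdata < psum * 4


GB = 1024 ** 3
# Protecting 400 GB more, having already repaired 150 GB: don't restart.
print(too_far_along(400 * GB, 150 * GB))   # True
# Protecting 400 GB more, having repaired only 100 GB: go ahead.
print(too_far_along(400 * GB, 100 * GB))   # False
```

Whether one quarter is the right threshold is a separate question, but at least with the names and the docstring the question can now be asked.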

Is this a sensible criterion in general, or with these specific numbers? I honestly have no idea. But at least I now understand what the math is actually doing.

In fact it took two tries to get to this understanding, because it turns out that I misinterpreted the math the first time around, when I made my tweets. Only when I had to break it down again to write this entry did I really work out what it's doing. This really shows very vividly that the moment you understand your math (or think you do), write that understanding of your math down. Be specific. It's not necessarily going to be obvious to you later.

(If you work on some code all the time, or if the math is common knowledge in the field, maybe not; then it falls into the category of obvious comments that are saying 'add 2 + 2'. Also, perhaps better variable names could have helped here, as well as avoiding the too-clever use of a multiplication instead of a division.)

PS: Since I wrote 'Or something.' even in the original comment, I clearly knew at the time that I was waving my hands at least a bit. I should have paid more attention to that danger sign back then, but I was probably too taken with my own cleverness. When it comes to this sort of math and calculation work, this is an ongoing issue and concern for me.

programming/ExplainYourMath written at 20:55:28


Some thoughts on performance shifts in moving from an iSCSI SAN to local SSDs

At one level, we're planning for our new fileserver environment to be very similar to our old one. It will still use ZFS and NFS, our clients will treat it the same, and we're even going to be reusing almost all of our local management tools more or less intact. At another level, though, it's very different because we're dropping our SAN in this iteration. Our current environment is an iSCSI-based SAN using HDs, where every fileserver connects to two iSCSI backends over two independent 1G Ethernet networks; mirrored pairs of disks are split between backends, so we can lose an entire backend without losing any ZFS pools. Our new generation of hardware uses local SSDs, with mirrored pairs of disks split between SATA and SAS. This drastic low level change is going to change a number of performance and failure characteristics of our environment, and today I want to think aloud about how the two environments will differ.

(One reason I care about their differences is that it affects how we want to operate ZFS, by changing what's slow or user-visible and what's not.)

In our current iSCSI environment, we have roughly 200 MBytes/sec of total read bandwidth and write bandwidth across all disks (which we can theoretically get simultaneously) and individual disks can probably do about 100 to 150 MBytes/sec of some combination of reads and writes. With mirrors, we have 2x write amplification from incoming NFS traffic to outgoing iSCSI writes, so 100 Mbytes/sec of incoming NFS writes saturates our disk write bandwidth (and it also seems to squeeze our read bandwidth). Individual disks can do on the order of 100 IOPs/sec, and with mirrors, pure read traffic can be distributed across both disks in a pair for 200 IOPs/sec in total. Disks are shared between multiple pools, which visibly causes problems, possibly because the sharing is invisible to our OmniOS fileservers so they do a bad job of scheduling IO.
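
The arithmetic behind those figures is simple enough to write out. The following is just the numbers from the paragraph above turned into a few lines of Python; the roughly 100 MBytes/sec per 1G network is an approximation, and the variable names are only for this example.

```python
LINK_MB_PER_SEC = 100   # rough usable bandwidth of one 1G iSCSI network
ISCSI_NETWORKS = 2
MIRROR_WAYS = 2         # every NFS write goes to both sides of a mirror

total_bw = LINK_MB_PER_SEC * ISCSI_NETWORKS
max_nfs_writes = total_bw / MIRROR_WAYS

HD_IOPS = 100           # order of magnitude for a single HD
mirror_read_iops = HD_IOPS * MIRROR_WAYS

print(total_bw)          # 200
print(max_nfs_writes)    # 100.0
print(mirror_read_iops)  # 200
```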

Faults have happened at all levels of this SAN setup. We have lost individual disks, we have had one of the two iSCSI networks stop being used for some or all of the disks or backends (usually due to software issues), and we have had entire backends need to be rotated out of service and replaced with another one. When we stop using one of the iSCSI networks for most or all disks of one backend, that backend drops to 100 Mbytes/sec of total read and write bandwidth, and we've had cases where the OmniOS fileserver just stopped using one network so it was reduced to 100 Mbytes/sec to both backends combined.

On our new hardware with local Crucial MX300 and MX500 SSDs, each individual disk has roughly 500 Mbytes/sec of read bandwidth and at least 250 Mbytes/sec of write bandwidth (the reads are probably hitting the 6.0 Gbps SATA link speed limit). The SAS controller seems to have no total bandwidth limit that we can notice with our disks, but the SATA controller appears to top out at about 2000 Mbytes/sec of aggregate read bandwidth. The SSDs can sustain over 10K read IOPs/sec each, even with all sixteen active at once. With a single 10G-T network connection for NFS traffic, a fileserver can do at most about 1 GByte/sec of outgoing reads (which theoretically can be satisfied from a single pair of disks) and 1 GByte/sec of incoming writes (which would likely require at least four disk pairs to get enough total write bandwidth, and probably more because we're writing additional ZFS metadata and periodically forcing the SSDs to flush and so on).
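
The 'single pair' and 'at least four pairs' figures fall out of a short calculation. This is only a back-of-the-envelope sketch using the rough numbers above; it ignores ZFS metadata writes, flushes, and the SATA controller's aggregate limit, and the variable names are invented for the example.

```python
import math

NFS_MB_PER_SEC = 1000   # about 1 GByte/sec over one 10G-T connection
SSD_READ = 500          # MBytes/sec per SSD, roughly
SSD_WRITE = 250         # MBytes/sec per SSD, at least

pair_read = 2 * SSD_READ   # reads can be spread over both sides of a mirror
pair_write = SSD_WRITE     # both sides have to write the same data

pairs_for_reads = math.ceil(NFS_MB_PER_SEC / pair_read)
pairs_for_writes = math.ceil(NFS_MB_PER_SEC / pair_write)

print(pairs_for_reads)    # 1
print(pairs_for_writes)   # 4
```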

As far as failures go, we don't expect to lose either the SAS or the SATA controllers, since both of them are integrated into the motherboard. This means we have no analog of an iSCSI backend failure (or temporary unavailability), where a significant number of physical disks are lost at once. Instead the only likely failures seem to be the loss of individual disks and we certainly hope to not have a bunch fall over at once. I have seen a SATA-connected disk drop from a 6.0 Gbps SATA link speed down to 1.5 Gbps, but that may have been an exceptional case caused by pulling it out and then immediately re-inserting it; this dropped the disk's read speed to 140 MBytes/sec or so. We'll likely want to monitor for this, or in general for any link speed that's not 6.0 Gbps.
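One possible shape for that monitoring is sketched below. It assumes a Linux machine where libata exposes the negotiated speed of each SATA link in /sys/class/ata_link/link*/sata_spd, and that links with nothing attached report '<unknown>'; both of those are assumptions to verify on the real hardware, and the function names are made up for illustration. Disks behind the SAS controller may not show up here at all.

```python
import glob
import os

EXPECTED_SPEED = "6.0 Gbps"


def read_link_speeds(pattern="/sys/class/ata_link/link*/sata_spd"):
    """Return {link name: speed string} as reported by Linux sysfs."""
    speeds = {}
    for path in glob.glob(pattern):
        link = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                speeds[link] = f.read().strip()
        except OSError:
            speeds[link] = "<unreadable>"
    return speeds


def slow_links(speeds, expected=EXPECTED_SPEED, ignore=("<unknown>",)):
    """Return the links whose speed is neither expected nor ignored."""
    return {link: spd for link, spd in speeds.items()
            if spd != expected and spd not in ignore}


if __name__ == "__main__":
    for link, spd in sorted(slow_links(read_link_speeds()).items()):
        print(f"{link}: negotiated {spd}, expected {EXPECTED_SPEED}")
```

Keeping the sysfs reading separate from the comparison means the check itself can be tested with a plain dictionary, without needing a disk that has actually dropped to 1.5 Gbps.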

(We may someday have what is effectively a total server failure, even if the server stays partially up after a fan failure or a partial motherboard explosion or whatever. But if this happens, we've already accepted that the server is 'down' until we can physically do things to fix or replace it.)

In our current iSCSI environment, both ZFS scrubs to check data integrity and ZFS resilvers to replace failed disks can easily have a visible impact on performance during the workday and they don't go really fast even after our tuning; this is probably not surprising given both total read/write bandwidth limits from 1G networking and IOPs/sec limits from using HDs. When coupled with our multi-tenancy, this means that we've generally limited how much scrubbing and resilvering we'll do at once. We may have historically been too cautious about limiting resilvers (they're cheaper than you might think), but we do have a relatively low total write bandwidth limit.

Our old fileservers couldn't have the same ZFS pool use two chunks from the same physical disk without significant performance impact. On our new hardware this doesn't seem to be a problem, which suggests that we may experience much less impact from multi-tenancy (which we're still going to have, due to how we sell storage). This is intuitively what I'd expect, at least for random IO, since SSDs have so many IOPs/sec available; it may also help that the fileserver can now see that all of this IO is going to the same disk and schedule it better.

On our new hardware, test ZFS scrubs and resilvers have run at anywhere from 250 Mbytes/sec upward (on mirrored pools), depending on the test pool's setup and contents. With high SSD IOPs/sec and read and write bandwidth (both to individual disks and in general), it seems very likely that we can be much more aggressive about scrubs and resilvers without visibly affecting NFS fileserver performance, even during the workday. With an apparent 6000 Mbytes/sec of total read bandwidth and perhaps 4000 Mbytes/sec of total write bandwidth, we're pretty unlikely to starve regular NFS IO with scrub or resilver IO even with aggressive tuning settings.

(One consequence of hoping to mostly see single-disk failures is that under normal circumstances, a given ZFS pool will only ever have a single failed 'disk' from a single vdev. This makes it much less relevant that resilvering multiple disks at once in a ZFS pool is mostly free; the multi-disk case is probably going to be a pretty rare thing, much rarer than it is in our current iSCSI environment.)

tech/ShiftsInSANToLocalSSD written at 23:43:52

Remembering that Python lists can use tuples as the sort keys

I was recently moving some old Python 2 code to Python 3 (due to a recent decision). This particular code is sufficiently old that it has (or had) a number of my old Python code habits, and in particular it made repeated use of list .sort() with comparison functions. Python 3 doesn't support this; instead you have to tell .sort() what key to use to sort the list. For a lot of the code the conversion was straightforward and obvious because it was just using a field from the object as the sort key. Then I hit a comparison function that looked like this:

def _pricmp(a, b):
  apri = a.prio or sys.maxint
  bpri = b.prio or sys.maxint
  if apri != bpri:
      return cmp(apri, bpri)
  return cmp(a.totbytes, b.totbytes)

I stared at this with a sinking feeling, because this comparison function wasn't just picking a field, it was expressing logic. Losing complex comparison logic is a long standing concern of mine, so I was worried that I'd finally run into a situation where I would be forced into unpleasant hacks.

Then I remembered something obvious: Python supports sorting on tuples, not just single objects. Sorting on tuples compares the two tuples field by field, so you can easily implement the same sort of tie-breaking secondary comparison that I was doing in _pricmp. So I wrote a simple function to generate the tuple of key fields:

def _prikey(a):
  # Python 3 has no sys.maxint; sys.maxsize is the usual stand-in.
  apri = a.prio or sys.maxsize
  return (apri, a.totbytes)

Unsurprisingly, this just worked (including the tie-breaking, which actually comes up fairly often in this particular comparison). It's probably even somewhat clearer, and it certainly avoids some potential comparison function mistakes.

(It's also shorter, but that's not necessarily a good thing.)

PS: Python has supported sorting tuples for a long time but I don't usually think about it, so things had to swirl around in my head for a bit before the light dawned about how to solve my issue. There's a certain mental shift that you need to make to go from 'the key= function retrieves the key field' to 'the key= function creates the sort key, but it's usually a plain field value'.

python/SortTakesTupleKeys written at 00:43:25


TLS Certificate Authorities and 'trust'

In casual conversation about CAs, it's common for people to talk about whether you trust a CA (or should) and whether a CA is trustworthy. I often bristle at using 'trust' in these contexts, but it's been hard to articulate why. Today, in a conversation on HN prompted by my entry on the first imperative of commercial CAs, I came up with a useful explanation.

Let's imagine that there's a new CA that's successfully set itself up as a copy of how Let's Encrypt operates; it uses the same hardware, runs the same open source software, configures things the same, follows the same procedures, has equally good staff, has been properly audited, and in general has completely duplicated Let's Encrypt's security and operational excellence. However, it has opted for the intellectually pure approach of starting with new root certificates that are not cross-signed by anyone and it is not in any browser root stores yet; as a result, its certificates are not trusted by any browser.

(Let's Encrypt has made this example plausible, because as a non-commercial CA that mostly does things with automation it doesn't have as many reasons to keep how it operates a secret as a commercial CA does.)

In any reasonable and normal sense of the word, this CA is as trustworthy as Let's Encrypt is. It will issue or not issue TLS certificates in the same situations that LE would (ignoring rate limits and pretending that everyone who authorizes LE in CAA records will also authorize this CA and so on), and its infrastructure and procedures are as secure and solid as LE's. If we trust LE, and I think we do, it's hard to say why we wouldn't trust this CA.

If we say that this CA is 'less trustworthy' than Let's Encrypt anyway, what we really mean is 'TLS certificates from this CA currently provoke browser warnings'. This is a perfectly good thing to care about (and it's usually what matters in practice), but it is not really 'trust' and the difference matters because we have a whole tangled set of expectations, beliefs, and intuitions surrounding the idea of trust. When we use the language of trust to talk about technical issues of which CA certificates the browsers accept and when, we create at least some confusion and lose some clarity, and we risk losing sight of what browser-accepted TLS certificates really are, what they tell us, and what we care about with them.

For instance, if we talk about trust and you get a TLS certificate from a CA, it seems to make intuitive sense to say that you need to trust the CA and that it should be trustworthy. But what does that actually mean when we look at the technical details? What should the CA do or not do? How does that affect our security, especially in light of the fundamental practical problem with the CA model?

At the same time, talking about the trustworthiness of a CA is not a pointless thing. If a CA is not trustworthy (in the normal sense of the word), it should not be included in browsers (and eventually will not be). It's just that the trustworthiness of a CA is only loosely correlated with whether TLS certificates from the CA are currently accepted by browsers, which is almost always what we really care about. As we've seen with StartCom, it can take quite a long time to transition from concluding that a CA is no longer trustworthy to having all its TLS certificates no longer accepted by browsers.

There can also be some amount of time when a new CA is trustworthy but is not included in browsers, because inclusion takes a while. This actually happened with Let's Encrypt; it's just that Let's Encrypt worked around this time delay by getting their certificate cross-signed by an existing in-browser CA, so people mostly didn't notice.

(I will concede that using 'trust' casually is very attractive. For example, in the sentence above I initially wrote 'trusted CA' instead of 'in-browser CA', and while that's sort of accurate I decided it was not the right phrasing to use in this entry.)

Sidebar: The one sort of real trust required in the CA model

Browser vendors and other people who maintain sets of root certificates must trust that CAs included in them will not issue certificates improperly and will in general conduct themselves according to the standards and requirements that the browser has. What constitutes improper issuance is one part straightforward and one part very complicated and nuanced; see, for example, the baseline requirements.

tech/CertificateAuthoritiesAndTrust written at 23:02:11

We've decided to write our future Python tools in Python 3

About a year ago I wrote about our decision to restrict what languages we use to develop internal tools and mentioned that one of the languages we picked was Python. At the time, I mostly meant Python 2 (although we already had one Python 3 program even then, which I sort of had qualms about). Since I now believe in using Python 3 for new code, I decided that the right thing for us to do was explicitly consider the issue and reach a decision, rather than just tacitly winding up in some state.

Our decision is perhaps unsurprising; my co-workers are entirely willing to go along with a slow migration to Python 3. We've now actively decided that new or significantly revised tools written in Python will be written in Python 3 or ported to it (barring some important reason not to do so, for example if the new code needs to still run on our important OmniOS machines). Python 3 is the more future proof choice, and all of the machines where we're going to run Python in the future have a recent enough version of Python 3.

That this came up now is not happenstance or coincidence. We have a suite of local ZFS cover programs and our own ZFS spares handling system, which are all primarily written in Python 2. With a significantly different fileserver setup on the horizon, I've recently started work on 'porting' these programs over to our new fileserver environment (where, for example, we won't have iSCSI backends). This work involves significant revisions and an entirely new set of code to do things like derive disk mapping information under Linux on our new hardware. When I started writing this new code, I asked myself whether this new code in this new environment should still be Python 2 code or whether we should take the opportunity to move it to Python 3 while I was doing major work anyway. I now have an answer; this code is going to be Python 3 code.

(We have Python 3 code already in production, but that code is not critical in the way that our ZFS status monitoring and spares system will be.)

Existing Python 2 code that's working perfectly fine will mostly or entirely remain that way, because we have much more important things to do right now (and generally, all the time). We'll have to deal with it someday (some of it is already ten years old and will probably run for at least another ten), but it can wait.

(A chunk of this code is our password propagation system, but there's an outside chance that we'll wind up using LDAP in the future and so won't need anything like the current programs.)

As a side note, moving our spares system over to a new environment has been an interesting experience, partly because getting it running initially was a pretty easy thing. But that's another entry.

sysadmin/Python3ForOurNewTools written at 00:41:13
