Wandering Thoughts


Digging into BSD's choice of Unix group for new directories and files

I have to eat some humble pie here. In comments on my entry on an interesting chmod failure, Greg A. Woods pointed out that FreeBSD's behavior of creating everything inside a directory with the group of the directory is actually traditional BSD behavior (it dates all the way back to the 1980s), not some odd new invention by FreeBSD. As traditional behavior it makes sense that it's explicitly allowed by the standards, but I've also come to think that it makes sense in context and in general. To see this, we need some background about the problem facing BSD.

In the beginning, two things were true in Unix: there was no mkdir() system call, and processes could only be in one group at a time. With processes being in only one group, the choice of the group for a newly created filesystem object was easy; it was your current group. This was felt to be sufficiently obvious behavior that the V7 creat(2) manpage doesn't even mention it.

(The actual behavior is implemented in the kernel in maknode() in iget.c.)

Now things get interesting. 4.1c BSD seems to be where mkdir(2) is introduced and where creat() stops being a system call and becomes an option to open(2). It's also where processes can be in multiple groups for the first time. The 4.1c BSD open(2) manpage is silent about the group of newly created files, while the mkdir(2) manpage specifically claims that new directories will have your effective group (ie, the V7 behavior). This is actually wrong. In both mkdir() in sys_directory.c and maknode() in ufs_syscalls.c, the group of the newly created object is set to the group of the parent directory. Then finally in the 4.2 BSD mkdir(2) manpage the group of the new directory is correctly documented (the 4.2 BSD open(2) manpage continues to say nothing about this). So BSD's traditional behavior was introduced at the same time as processes being in multiple groups, and we can guess that it was introduced as part of that change.

When your process can only be in a single group, as in V7, it makes perfect sense to create new filesystem objects with that as their group. It's basically the same case as making new filesystem objects be owned by you; just as they get your UID, they also get your GID. When your process can be in multiple groups, things get less clear. A filesystem object can only be in one group, so which of your several groups should a new filesystem object be owned by, and how can you most conveniently change that choice?

One option is to have some notion of a 'primary group' and then provide ways to shuffle around which of your groups is the primary group. One problem with this is that it's awkward and error-prone to work in different areas of the filesystem where you want your new files and directories to be in different groups; every time you cd around, you may have to remember to change your primary group. If you move into a collaborative directory, better shift (in your shell) to that group; cd back to $HOME, or simply want to write a new file in $HOME, and you'd better remember to change back.

Another option is the BSD choice of inheriting the group from context. By far the most common case is that you want your new files and directories to be created in the 'context', ie the group, of the surrounding directory. If you're working in $HOME, this is your primary login group; if you're working in a collaborative area, this is the group being used for collaboration. Arguably it's a feature that you don't even have to be in that group (if directory permissions allow you to make new files). Since you can chgrp directories that you own, this option also gives you a relatively easy and persistent way to change which group is chosen for any particular area.

If you fully embrace the idea of Unix processes being in multiple groups, not just having one primary group and then some number of secondary groups, then the BSD choice makes a lot of sense. And for all of its faults, BSD tended to relatively fully embrace its changes (not totally, perhaps partly because it had backwards compatibility issues to consider). While it leads to some odd issues, such as the one I ran into, pretty much any choice here is going to have some oddities. It's also probably the more usable choice in general if you expect much collaboration between different people (well, different Unix logins), partly because it mostly doesn't require people to remember to do things.

(I know that on our systems, a lot of directories intended for collaborative work tend to end up being setgid specifically to get this behavior.)

BSDDirectoryGroupChoice written at 01:00:53; Add Comment


Sometimes, chmod can fail for interesting reasons

I'll start by presenting this rather interesting and puzzling failure in illustrated form:

; mkdir /tmp/newdir
; chmod g+s /tmp/newdir
chmod: /tmp/newdir: Operation not permitted

How can I not be able to make this chmod change when I just made the directory and I own it? For extra fun, some people on this particular system won't experience this problem, and in fact many of them are the people you might report this problem to, namely the sysadmins.

At first I wondered if this particular /tmp filesystem disallowed setuid and setgid entirely, but it turned out to be not that straightforward:

; ls -ld /tmp/newdir
drwxr-xr-x  2 cks  wheel  512 May  3 00:35 /tmp/newdir

This at least explains why my chmod attempt failed. I'm not in group wheel, and for good reasons you can't make a file setgid to a group that you're not a member of. But how on earth did my newly created directory in /tmp wind up in group wheel, a group I'm not a member of? Well, perhaps someone made /tmp setgid, so all directories created in it inherited its group (presumably group wheel). Let's see:

; ld -ld /tmp
drwxrwxrwt  157 root  wheel  11776 May  3 00:41 /tmp

Although /tmp is indeed group wheel, it has perfectly ordinary permissions (mode 777 and sticky ('t'), so you can only delete or rename your own files). There's no setgid to be seen.

The answer to this mystery is that this is a FreeBSD machine, and on FreeBSD, well, let's quote the mkdir(2) manpage:

The directory's owner ID is set to the process's effective user ID. The directory's group ID is set to that of the parent directory in which it is created.

And also the section of the open(2) manpage that deals with creation of new files:

When a new file is created it is given the group of the directory which contains it.

In other words, on FreeBSD all directories have an implicit setgid bit. Everything created inside them (whether directories or files) inherits the directory's group. Normally this is not a problem and you'll probably never notice, but /tmp (and /var/tmp) are special because they allow everyone to create files and directories in them, and so there are a lot of people making things there who are not a member of the directory's group.

(The sysadmins usually are members of group wheel, though, so things will work for them. This should add extra fun if a user reports the general chmod issue as a problem, since sysadmins can't reproduce it as themselves.)

You might think that this is an obscure issue that no one will ever care about, but actually it caused a Go build failure on FreeBSD for a while. Tracking down the problem took me a while and a bunch of head scratching.

PS: arguably GID 0 should not be group wheel but instead something else that only root is a member of and wheel should be a completely separate group. To have group wheel used for group ownership as well as su access to root is at least confusing.

ChmodInterestingFailure written at 01:39:47; Add Comment


Some versions of sort can easily sort IPv4 addresses into natural order

Every so often I need to deal with a bunch of IPv4 addresses, and it's most convenient (and best) to have them sorted into what I'll call their natural ascending order. Unfortunately for sysadmins, the natural order of IPv4 addresses is not their lexical order (ie what sort will give you), unless you zero-pad all of their octets. In theory you can zero pad IPv4 addresses if you want, turning into, but this form has two flaws; it looks ugly and it doesn't work with a lot of tools.

(Some tools will remove the zero padding, some will interpret zero-padded octets as being in octal instead of decimal, and some will leave the leading zeros on and not work at all; dig -x is one interesting example of the latter. In practice, there are much better ways to deal with this problem and people who zero-pad IPv4 addresses need to be politely corrected.)

Fortunately it turns out that you can get many modern versions of sort to sort plain IPv4 addresses in the right order. The trick is to use its -V argument, which is also known as --version-sort in at least GNU coreutils. Interpreting IPv4 addresses as version numbers is basically exactly what we want, because an all-numeric MAJOR.MINOR.PATCH.SUBPATCH version number sorts in exactly the same way that we want an IPv4 A.B.C.D address to sort.

Unfortunately as far as I know there is no way to sort IPv6 addresses into a natural order using common shell tools. The format of IPv6 addresses is so odd and unusual that I expect we're always going to need a custom program for it, although perhaps someday GNU Sort will grow the necessary superintelligence.

This is a specific example of the kind of general thinking that you need in order to best apply Unix shell tools to your problems. It's quite helpful to always be on the lookout for ways that existing features can be reinterpreted (or creatively perverted) in order to work on your problems. Here we've realized that sort's idea of 'version numbers' includes IPv4 addresses, because from the right angle both they and (some) version numbers are just dot-separated sequences of numbers.

PS: with brute force, you can use any version of sort that supports -t and -k to sort IPv4 addresses; you just need the right magic arguments. I'll leaving working them out (or doing an Internet search for them) as an exercise for the reader.

PPS: for the gory details of how GNU sort treats version sorting, see the Gnu sort manual's section on details about version sort. Okay, technically it's ls's section on version sorting. Did you know that GNU coreutils ls can sort filenames partially based on version numbers? I didn't until now.

(This is a more verbose version of this tweet of mine, because why should I leave useful stuff just on Twitter.)

Sidebar: Which versions of sort support this

When I started writing this entry, I assumed that sort -V was a GNU coreutils extension and would only be supported by the GNU coreutils version. Unixes with other versions (or with versions that are too old) would be out of luck. This doesn't actually appear to be the case, to my surprise.

Based on the GNU Coreutils NEWS file, it appears that 'sort -V' appeared in GNU coreutils 7.0 or 7.1 (in late 2008 to early 2009). The GNU coreutils sort is used by most Linux distributions, including all of the main ones, and almost anything that's modern enough to be getting security updates should have a version of GNU sort that is recent enough to include this.

Older versions of FreeBSD appear to use an old version of GNU coreutils sort; I have access to a FreeBSD 9.3 machine that reports that /usr/bin/sort is GNU coreutils sort 5.3.0 (from 2004, apparently). Current versions of FreeBSD and OpenBSD have switched to their own version of sort, known as version '2.3-FreeBSD', but this version of sort also supports -V (I think the switch happened in FreeBSD 10, because a FreeBSD 10.3 machine I have access to reports this version). Exactly how -V orders things is probably somewhat different between GNU coreutils sort and FreeBSD/OpenBSD sort, but it doesn't matter for IPv4 addresses.

The Illumos /usr/bin/sort is very old, but I know that OmniOS ships /usr/gnu/bin/sort as standard and really you want /usr/gnu/bin early in your $PATH anyways. Life is too short to deal with ancient Solaris tool versions with ancient limitations.

SortingIPv4Addresses written at 01:26:50; Add Comment


Wayland is now the future of Unix graphics and GUIs

The big Unix graphics news of the past week is that Ubuntu threw in the towel on their Unity GUI and with it their Mir display server (see the Ars story for more analysis). I say 'Unix' instead of 'Linux' here because I think this is going to have consequences well beyond Linux.

While there was a three-way fight for the future between Wayland, Ubuntu's Mir, and the default of X, it was reasonably likely that support for X was going to remain active in things like Firefox, KDE, and even Gnome. As a practical matter, Mir and Wayland were both going to support X programs, so if you targeted X (possibly as well as Wayland and/or Mir) you could run on everything and people would not be yelling at you and so on. But, well, there isn't a three-way fight any more. There is only X and Wayland now, and that makes Wayland the path forward by default. With only one path forward, the pressure for applications and GUI environments to remain backwards compatible to X is going to be (much) lower. And we already know how the Gnome people feel about major breaking changes; as Gnome 3 taught us, the Gnome developers are perfectly fine with them if they think the gain is reasonable.

In short: running exclusively on Wayland is the future of Gnome and Gnome-based programs, which includes Firefox; I suspect that it's also the future of KDE. It's not an immediate future, but in five years I suspect that it will be at least looming if not arriving. At that point, anyone who is not running Wayland will not be getting modern desktop software and programs and sooner or later won't be getting browser security fixes for what they currently have.

People run desktop software on more Unixes than just Linux. With Gnome and important desktop apps moving to Wayland, those Unixes face a real problem; they can live with old apps, or they can move to Wayland too. FreeBSD is apparently working seriously on Wayland support (cf), and at one point a Dragonfly BSD developer had Wayland running there. OpenBSD? Don't hold your breath. Solaris? That's up to Oracle these days but I don't really expect it; it would be a lot of work and I can't imagine that Oracle has many customers who will pay for it. Illumos? Probably not unless someone gets very energetic.

With that said, old X programs and environments are not going to suddenly go away. Fvwm will be there for years or decades to come, for example, as will xterm and any number of other current X programs and window managers. But people who are stuck in X will also be increasingly stuck in the past, unable to run current versions of more and more programs.

(For some people, this will be just fine. We're probably going to see a fairly strong sorting function among the free Unixes for what sort of person winds up where, which is going to make cultural issues even more fun than usual.)

PS: Some people may sneer at 'desktop software and programs', but this category includes quite a lot of things that are attractive but by and large desktop agnostic, like photography programs, Twitter clients, and syndication feed readers. Most modern graphical programs on Unix are built on top of some mid-level toolkit like GTK+ or QT, not on basic X stuff, because those mid-level toolkits make it so much faster and easier to put together GUIs. If and when those toolkits become Wayland-only and the latest versions of the programs move to depend on recent versions of the toolkits, the programs become Wayland-only too.

WaylandNowTheFuture written at 00:28:59; Add Comment


Why the modern chown command uses a colon to separate the user and group

In the beginning, all chown(1) did was change the owner of a file; if you wanted to change a file's group too, you had to use chgrp(1) as well. This is actually more unusual than I realized before I started to write this entry, because even in V7 Unix the chown(2) system call itself could change both user and group, per the V7 chown(2) manpage. Restricting chown(1) to only changing the owner did make the command itself pretty simple, though.

By the time of 4.1c BSD, chown(1) had become chown(8), because, per the manual page:

Only the super-user can change owner,
in order to simplify as yet unimplemented accounting procedures.

(The System V line of Unixes would retain an unrestricted chown(2) system call for some time and thus I believe they kept the chown command in section 1, for general commands anyone could use.)

In 4.3 BSD, someone decided that chown(8) might as well let you change the group at the same time, to match the system call. As the manual page covers, they used this syntax:

/etc/chown [ -f -R ] owner[.group] file ...

That is, to chown a file to user cks, group staff, you did 'chown cks.staff file'.

This augmented version of the chown command was picked up by various Unixes that descended from 4.x BSD, although not immediately (like many things from 4.3 BSD, it took a while to propagate around). Sometimes this was the primary version of chown, found in /usr/bin or the like; sometimes this was a compatibility version, in /usr/ucb (Solaris through fairly late, for example). Depending on how you set up your $PATH on such systems, you could wind up using this version of chown and thus get used to having 'user:group' rejected as an error.

Then, when it came time for POSIX to standardize this, someone woke up and created the modern syntax for changing both owner and group at once. As seen in the Single Unix Specification for chown, this is 'chown owner[:group] file', ie the separator is now a colon. Since POSIX and the SUS normally standardized existing practice (where it actually existed), you might wonder why they changed it. The answer is simple: a colon is not a valid character in a login, while a dot is.

Sure, dots are unusual in Unix logins in most places, but they're legal and they do show up in some environments (and they're legal in group names as well). Colons are outright illegal unless you like explosions, fundamentally because they're the field separator character in /etc/passwd and /etc/group. The SUS manpage actually has an explicit discussion of this in the RATIONALE section, although it doesn't tell you what it means by 'historical implementations'.

(The SUS manpage also discusses a scenario where using chown and chgrp separately isn't sufficient, and you have to make the change in a single chown() system call.)

PS: Since I think I ran into this dot-versus-colon issue on our old Solaris 10 fileservers, I probably had /usr/ucb before /usr/bin in my $PATH there. I generally prefer UCB versions of things to the stock Solaris versions, but in this case it tripped me up.

PPS: It turns out the GNU chown accepts the dot form as well provided that it's unambiguous, although this is covered only in the chown info file, and is not mentioned in the normal manpage.

ChownColonSeparatorWhy written at 00:47:44; Add Comment


What should it mean for a system call to time out?

I was just reading Evan Klitzke's Unix System Call Timeouts (via) and among a number of thoughts about it, one of the things that struck me is a simple question. Namely, what should it mean for a Unix system call to time out?

This question may sound pointlessly philosophical, but it's actually very important because what we expect a system call timeout to mean will make a significant difference in how easy it would be to add system calls with timeouts. So let's sketch out two extreme versions. The first extreme version is that if a timeout occurs, the operation done by the system call is entirely abandoned and undone. For example, if you call rename("a", "b") and the operation times out, the kernel guarantees that the file a has not been renamed to b. This is obviously going to be pretty hard, since the kernel may have to reverse partially complete operations. It's also not always possible, because some operations are genuinely irreversible. If you write() data to a pipe and time out partway through doing so (with some but not all data written), you cannot reach into the pipe and 'unwrite' all of the already sent data; after all, some of it may already have been read by a process on the other side of the pipe.

The second extreme version is that having a system call time out merely causes your process to stop waiting for it to complete, with no effects on the kernel side of things. Effectively, the system call is shunted to a separate thread of control and continues to run; it may complete some time, or it may error out, but you never have to wait for it to do either. If the system call would normally return a new file descriptor or the like, the new file descriptor will be closed immediately when the system call completes. In practice implementing a strict version of this would also be relatively hard; you'd need an entire infrastructure for transferring system calls to another kernel context (or more likely, transplanting your user-level process to another kernel context, although that has its own issues). This is also at odds with the existing system calls that take timeouts, which generally result in the operation being abandoned part way through with no guarantees either way about its completion.

(For example, if you make a non-blocking connect() call and then use select() to wait for it with a timeout, the kernel does not guarantee that if the timeout fires the connect() will not be completed. You are in fact in a race between your likely close() of the socket and the connection attempt actually completing.)

The easiest thing to implement would probably be a middle version. If a timeout happens, control returns to your user level with a timeout indication, but the operation may be partially complete and it may be either abandoned in the middle of things or completed for you behind your back. This satisfies a desire to be able to bound the time you wait for system calls to complete, but it does leave you with a messy situation where you don't know either what has happened or what will happen when a timeout occurs. If your mkdir() times out, the directory may or may not exist when you look for it, and it may or may not come into existence later on.

(Implementing timeouts in the kernel is difficult for the same reason that asynchronous IO is hard; there is a lot of kernel code that is much simpler if it's written in straight line form, where it doesn't have to worry about abandoning things part way through at essentially any point where it may have to wait for the outside world.)

SystemCallTimeoutMeaning written at 01:03:40; Add Comment


Modern X Windows can be a very complicated environment

I mentioned Corebird, the GTK+ Twitter client the other day, and generally positive. That was on a logical weekend. The next day I went in to the office, set up Corebird there, and promptly ran into a problem: I couldn't click on links in Tweets, or rather I could but it didn't activate the link (it would often do other things). Corebird wasn't ignoring left mouse clicks in general, it's just that they wouldn't activate links. I had not had this problem at home (or my views would not have been so positive. I use basically the same fvwm-based window manager environment at home and at work, but since Corebird is a GTK+ application and GTK+ applications can be influenced by all sorts of magic settings and (Gnome) setting daemons, I assumed that it was something subtle that was different in my work GTK+/Gnome environment and filed a Fedora bug in vague hopes. To my surprise, it turned out to be not merely specific to fvwm, but specific to one aspect of my particular fvwm mouse configuration.

The full version is in the thread in the fvwm mailing list, but normally when you click and release a button, the X server generates two events, a ButtonPress and then a ButtonRelease. However, if fvwm was configured in a way such that it might need to do something with a left button press, a different set of events was generated:

  • a LeaveNotify with mode NotifyGrab, to tell Corebird that the mouse pointer had been grabbed away from it (by fvwm).
  • an EnterNotify with mode NotifyUngrab, to tell Corebird 'here is your mouse pointer back because the grab has been released' (because fvwm was passing the button press through to Corebird).
  • the ButtonPress for the mouse button.

The root issue appears to be that something in the depths of GTK+ takes the LeaveNotify to mean that the link has lost focus. Since GTK+ doesn't think the link is focused, when it receives the mouse click it doesn't activate the link, but it does take other action, since it apparently still understands that the mouse is being clinked in the text of the GtkLabel involved.

(There's a test program that uses a simple GtkLabel to demonstrate this, see this, and apparently there are other anomalies in GtkLabel's input processing in this area.)

If you think that this all sounds very complex, yes, exactly. It is. X has a complicated event model to start with, and then interactions with the window manager add extra peculiarities on top. The GTK+ libraries are probably strictly speaking in the wrong here, but I also rather suspect that this is a corner case that the GTK+ programmers never imagined, much less encountered. In a complex environment, some possibilities will drop through the cracks.

(If you want to read a high level overview of passive and active (mouse button) grabs, see eg this 2010 writeup by Peter Hutter. Having read it, I feel like I understand a bit more about what fvwm is doing here.)

By the way, some of this complexity is an artifact of the state of computing when X was created, specifically that both computers and networking were slow. Life would be simpler for everyone if all X events were routed through the window manager and then the window manager passed them on to client programs as appropriate. However, this would require all events to pass through an extra process (and possibly an extra one or two network hops), and in the days when X was young this could have had a real impact on overall responsiveness. So X goes to a great deal of effort to deliver events directly to programs whenever possible while still allowing the window manager to step in.

(My understanding is that in Wayland, the compositor handles all events and passes them to clients as it decides. The Wayland compositor is a lot more than just the equivalent of an X window manager, but it fills that role, and so in Wayland this issue wouldn't come up.)

ModernXCanBeVeryComplex written at 22:39:05; Add Comment


Cheap concurrency is an illusion (at least on Unix)

Recently I wound up reading this article (via), which contains the following:

[...] this memo assumes that there already exists an efficient concurrency implementation where forking a new lightweight process takes at most hundreds of nanoseconds and context switch takes tens of nanoseconds. Note that there are already such concurrency systems deployed in the wild. One well-known example are Golang's goroutines but there are others available as well.

When designing APIs and similar things, it is quite important to understand that extremely lightweight processes are an illusion. I don't mean that in the sense that they aren't actually lightweight in practice (although you probably want to pay attention to CPU cache effects here). I mean that in the sense that they don't actually exist and their nonexistence has real consequences.

All 'lightweight processes' on Unix are some form of what is known as 'green threads', which is to say that they exist purely at the user level. They are extremely cheap to create and switch to because all that the user level has to do here is shuffle some registers around or mark some memory as allocated for the initial stack. But the moment that you have to materialize a kernel entity to back a lightweight process (perhaps because it is about to do a blocking operation), things become much more expensive.

The reality is that there is no such thing as a genuinely lightweight kernel process, at least not in Unixes. Kernel processes have associated data structures and not insignificant kernel stacks, and they take involved locks to maintain reference counts on virtual memory areas and so on. Kernel processes are certainly a lot more lightweight these days than they used to be ten or twenty years ago, sufficiently so that POSIX threads are mostly 1:1 user to kernel threads because it's simpler, but they don't even start approaching the kind of lightweight you need to have as many of them as you can have, say, goroutines in a Go program. Systems like Go engage in a very careful behind the scenes dance in order to multiplex all of those goroutines onto many fewer kernel processes.

(One reason kernel processes need not insignificant kernel stacks is that Unix kernels are written in C, and C does not like having its stack relocated around in order to resize it. Languages like Go go to a great deal of effort in their compiler and runtime to make this work (and this can be alarmingly complicated).)

The consequence of this is that if you want to do a lot of concurrent things at once, at the level of the API to the kernel you can't do these things in a 1 to 1 relationship between a single thing and a kernel thread. If you try to have a 1:1 relationship, your system will explode under load with too many relatively expensive kernel threads. Instead you really must have some form of aggregation in the userspace-to-kernel API, regardless of whether or not you expose it to people.

This doesn't mean that the illusion of lightweight processes and cheap concurrency doesn't work or isn't useful. Usually it does work and is useful, partly because people find it much easier to reason about the behavior of straight-line code than they do about the behavior of state machines or callbacks. But it's an abstraction, with all that that implies, and if you are designing APIs you need to think about how they will be implemented at the level where this illusion is not true. Otherwise you may wind up designing APIs that promise things that you can't actually deliver.

(This holds true for more than just user to kernel APIs. If you design a Go based system where you appear to promise that a ton of goroutines can all do file IO all at once, you are probably going to have pain because in most environments each separate bit of file IO will require its own kernel thread. The Go runtime may do its best to deliver this, but you probably won't like what happens to your machine when a few thousand goroutines are busy trying to do their theoretically lightweight file IO.)

CheapConcurrencyIllusion written at 21:46:41; Add Comment


My views on X (Windows)

One of the famous quotes about the X Windows System is this one:

Sometimes when you fill a vacuum, it still sucks. - Rob Pike

The first thing to say here is that it is not quite the case that X filled a vacuum and so succeeded by default. Certainly X was the first cross-Unix window system, but Unix workstations existed before X, so of course various Unix vendors had to come up with GUI environments and window systems for them like SunView. However, people do not speak entirely favourably about those systems.

Early Unix window systems were not designed by idiots (despite what some commentary might say); they were designed by dedicated, smart engineers who were doing the best job that they could at the time and on the machines that they had. It's just that people didn't yet know enough about window systems to see what was a good or a bad idea, what sort of APIs worked, and so on, and it didn't help that early machines and Unixes were very limited. Mistakes were made; indeed, mistakes were inevitable. So the early window systems were rather far from ideal and once the dust settled, not entirely appealing.

This is part of the vacuum that X filled (although not all of it). X came along at the right time to learn from prior experience (and evolved a couple of times without having to worry much about backwards compatibility issues), and the result was, well, quite New Jersey. It worked, and through at least the early 1990s, when it didn't work people bashed on it until it did. Generally it even worked better than its competition at the time or delivered compelling additional features, or both (although this is not the only reason it succeeded so well).

At the same time, X has never been elegant. In this it partakes of the spirit of (V7) Unix; not the side that gave us the clean Unix design ideas, but the side that gave us a kernel where active processes were just put in a fixed-size array (that was walked with for loops):

struct proc proc[NPROC];

Why not? It worked well enough, and V7 was a New Jersey approach work in progress.

That X is not elegant or simple or very Unixy is what Rob Pike meant when he said that it sucked. X is a generally unattractive beast under the hood, full of protocol and API complexity, with various messy features and any number of omissions of things that people would really like to be covered (some of which were later sort of added). It's not as policy-free as it claims to be, and anyways being policy-free is ducking various important issues (especially if you want disparate clients to genuinely interoperate). But X did in a sense fill a vacuum and more broadly it definitely filled a need, and doing so was (and is) important. The logic of X existing is inorexable and it's a clear improvement over what came before. For all of X's warts, no one really wants to go back to SunView or the other pre-X options.

These days, my own reaction to X's warts is basically to shrug. Given that it's the best option I have in practice, well, it works well enough; in fact, it generally works pretty great. I'm aware that a great deal of sweat and irritation is being expended behind the scenes to make that happen and every so often I stub my toe on a rough edge of modern X, but for the most part I can ignore all of that. I would like it to be more elegant (or to be magically replaced by something that was), but it's not something I'm passionate about. I am passionate about not going back to a long catalog of not-as-nice window systems that I have used in the past, sometimes out of necessity. And yes, that even includes cool ones like the Blit.

(I'm not convinced we know enough to do a window system design that's significantly more elegant than X and still has X's virtues, including working well across the network. I guess Wayland may give us a test of this. There was NeWS, always the great white hope of Unix window systems, but I think it had the wrong core design and anyways who knows what would have happened to its initial elegance after ten or twenty years of being adapted to the harsh realities of general use.)

PS: Yes, X is not supposed to be called 'X Windows'. Sorry, that ship sailed years ago; in practice, 'X Windows' is a perfectly widely accepted name for it in situations where you don't feel like spelling out the full name but 'X' or 'X11' is too short.

XWindowsViews written at 01:50:21; Add Comment


What it's sensible to use a bunch of Unix swap space for

I've long written that you don't want too much swap space because if you try to actively use more RAM than your machine actually has, swap space basically just gives you more rope. Your machine is not going to perform very well (or at all) while things are busy frantically paging memory in and out, so it's usually better to just have things fail immediately. But there are sensible uses for decent amounts of swap space, even if they're relatively rare, and today I feel like trying to run them down.

First off, you almost always want a bit of swap space configured even if you never plan to use it, generally something on the order of a few hundred megabytes up to a gigabyte. The sad reality of life is that many Unix kernels contains code that simply assumes that you have some amount of swap space, even just a bit (partly because very few kernel developers even try to test with no swap set up). If you have no swap space configured at all, this code can malfunction in various unhappy ways. Feeding your system 256 MB or 512 MB for swap is a small price to pay to avoid running into these corner cases.

But that's just a little space, not a bunch of it. So here are some uses for an appreciable amount of swap space that I can think of:

  • Hibernating a laptop or other machine that can do this (under some configurations).

  • If you have a memory based tmpfs /tmp or other scratch directory, which are increasingly popular and common (although I think it's a bad idea). If you have one and you expect to use much space in it and be under overall memory pressure, you probably want as much swap space as you ever expect something to use in your /tmp, and then probably some more for insurance.

    (Unfortunately there are likely to be a lot of programs that will write large files to /tmp under the right circumstances, especially if they get fed unusually large inputs. This is one reason I don't like tmpfs /tmp.)

  • If you have a system that insists on some relatively strong form of 'strict overcommit', where it wants to reserve RAM or swap space for almost all of the memory that your programs may potentially use, and you have programs that allocate a lot of memory that they don't touch. Here you're using a bunch of swap space to basically trick the system's memory accounting, so that it's happy letting programs allocate memory they'll never use.

  • If you have mostly inactive programs that use an appreciable amount of memory plus programs with occasional (or periodic) short term spike demands for most or almost all of the memory on the machine. With sufficient swap space, the spike demand will push everything else out to swap, run to completion, and then everything else will slowly wake up and page back in. You won't enjoy the system having to page back in a few gigabytes of memory, but it probably beats the alternatives (including splitting things out to separate machines).

    (Bonus points are awarded here if you have a scheduling system that actively SIGSTOPs or otherwise totally suspends the lower priority programs so that they don't even try to run during the demand spike.)

  • If you have programs with (significant) memory leaks and what they lose track of is dirty memory (as opposed to clean memory that they allocated but never touched). As leaked memory, the program is basically never going to touch it again, but as dirty memory it can only live in RAM or swap and clearly you'd rather have it live in swap.

  • If you have programs that use a lot of memory but touch most of it only very rarely, very slowly, or both. Unlike leaked memory, the program will look at this memory again at some point but in the mean time it wastes your valuable RAM, so you would rather page it out to swap space for now and then pay the performance impact of swapping it back in later.

    (At this point you're teetering on the edge. If you've misjudged how much memory the program will want to look at how fast, you can shove your overall system right into memory starvation and swap death.)

  • If the most important thing is that the system not crash, even if it's not doing very much else besides swapping madly. Such systems are probably doing their important work either in the kernel (using reserved memory) or in programs that are locked into memory, so that they keep going even while the rest of the system is more or less locked in swap death. This may or may not work very well in practice; among other issues, kernels often want memory themselves and so may get entangled in what theoretically is only 'unimportant' user space swap death.

As it probably shows from how I described these things, the further down the list you go the more dubious I get about how wise these hacks are (and I maintain that most of them are hacks). Generally you should have a pretty strong confidence that you know exactly what your overall system is going to do (and why).

(You can also be desperate and hoping that adding more swap space will let one of these cases limp along. If so, I hope that you have good monitoring so you can reboot entire machines if or when they fall over.)

SwapLotsWhatFor written at 00:51:47; Add Comment

(Previous 10 or go back to January 2017 at 2017/01/29)

Page tools: See As Normal.
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.