Wandering Thoughts


Making two Unix permissions mistakes in one

I tweeted:

Today's state of work-brain:
mkdir /tmp/fred
umask 077 /tmp/fred

Immediately after these two commands, I hit cursor-up to change the 'umask' to 'chmod', so that I then ran 'chmod 077 /tmp/fred'. Fortunately I was doing this as a regular user, so my next action exposed my error.

This whole sequence of commands is a set of mistakes jumbled together in a very Unix way. My goal was to create a new /tmp/fred directory that was only accessible to me. My second command is not just wrong because I wanted chmod instead of umask (I should have run umask before the mkdir, not after), but because I had the wrong set of permissions for chmod. It was as if my brain wanted Unix to apply a 'umask 077' to the creation of /tmp/fred after the fact. Since the numeric permissions you give to umask are the inverse of the permissions you give to chmod (you tell umask what you don't want instead of what you do), my change of umask to chmod then left /tmp/fred with completely wrong permissions; instead of being only accessible to me, it was fully accessible to everyone except me.

(Had I been doing this as root, I would then have been able to cd into the directory, put files in it, access files in it, and so on, and might not have noticed that the permissions were reversed from what I actually wanted.)
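To make the inversion concrete, here's what the two modes actually do to a scratch directory (a sketch using GNU coreutils stat; BSD stat spells this flag differently):

```shell
mkdir /tmp/fred
chmod 077 /tmp/fred      # the mistaken mode: full access for group and other, nothing for the owner
stat -c '%a' /tmp/fred   # prints 77
chmod 700 /tmp/fred      # the chmod equivalent of a 077 umask
stat -c '%a' /tmp/fred   # prints 700: accessible only to the owner
rmdir /tmp/fred
```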

The traditional Unix umask itself is a very Unix command (well, shell built-in), in that it more or less directly calls umask(). This allows a very simple implementation, which was a priority in early Unixes like V7. A more sensible interface would have you specify, in effect, the maximum permissions you want (for example, that things can be '755'), with umask inverting this to get the value it passes to umask(). But early Unixes took the direct approach, counting on people to remember the inversion and perform it in their heads.

In the process of writing this entry I learned that POSIX umask supports symbolic modes, and that they work this way. You get and set umask modes like 'u=rwx,g=rx,o=rx' (aka '022', the traditional friendly Unix umask), and they're the same permissions as you would use with chmod. I believe that this symbolic mode is supported by any modern Bourne compatible shell (including zsh), but it isn't necessarily supported by non-Bourne shells such as tcsh or rc (which is my shell).
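For example, in Bash (other POSIX shells with symbolic umask support behave the same way):

```shell
umask u=rwx,g=rx,o=rx   # stated chmod-style, as the permissions to allow
umask                   # prints 0022: the inverted numeric form
umask -S                # prints u=rwx,g=rx,o=rx
```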

PermissionsTwoMistakes written at 23:53:11; Add Comment


Why Bash and GNU Readline's "bracketed paste" mode is not for us

Last month, we discovered that recent versions of Bash and GNU Readline now default to special handling of pasting into them. This is called "bracketed paste" mode; it requires you to explicitly hit Return in order to have the pasted text accepted and allows you to edit the paste before then. Locally, we have decided to turn this off for the root account, and I've also turned it off for my own account. We have pragmatic reasons for our collective decision, and I also have a broad general reason for my own account.

Our pragmatic reason is that we are almost always pasting from our own instructions, which we obviously trust, and we do this sort of pasting quite a bit. Since the source is trusted, an extra step to accept each paste is both annoying and almost certainly ineffective at avoiding mistakes; we're extremely unlikely to pause, look at what we're about to hit Return on, and realize we've pasted the wrong thing, especially since we do this all the time.

(People are extremely bad at spotting the one in a thousand exception to routine processes.)

My broader personal reason is that bracketed paste quietly clashes with the xterm model of fast cut and paste. Xterm and things that imitate it are specifically set up so that you can cut and paste entirely from the mouse; you use the left and perhaps right mouse buttons to make your selection and then the middle mouse button to paste it. Bracketed paste doesn't just require an extra step, it requires you to use the keyboard. On top of that, right-handed mouse users with normal keyboards either have to move their right hand from the mouse to the Return key or reach over with their left hand to hit Return.

Bracketed paste probably looks better to laptop users (where the 'mouse' is below the keyboard) and to people who use Copy and Paste through the keyboard with Ctrl-C and Ctrl-V. But for xterm users who trust the source they're pasting from, I suspect that the inconveniences outweigh the occasional advantage of being able to edit what you're pasting to alter it a bit before it takes effect.

(The traditional way to be able to edit what you're pasting in xterm is to not select and paste the trailing newline. However, this usually requires you to use a slower selection method than just triple-clicking the source line.)

PS: What I would actually like is a way to invoke bracketed paste in xterm when pasting, for example with shift plus the middle mouse button. This would improve the convenience of pasting things that I want to edit after pasting, because I could freely select things with newlines but still modify them before they act.

BracketedPasteWhyNot written at 23:44:32; Add Comment


The Unix background of Linux's 'file-max' and nr_open kernel limits on file descriptors

Somewhat recently, Lennart Poettering wrote about File Descriptor Limits. In passing, Poettering said (emphasis mine):

[...] Specifically on Linux there are two system-wide sysctls: fs.nr_open and fs.file-max. (Don't ask me why one uses a dash and the other an underscore, or why there are two of them...) [...]

I can't help much about the first question, but the answer to the second one may be that Linux is carrying on a tradition that goes deep in the history of Unix (and, it turns out, to its early implementation). Specifically, it goes back to the very simple kernel designs of Research Unix versions, such as V7 (the famous starting point for so much diversity).

Unix started out as a small and simple system, and the Research Unix kernels often used simple and what we would consider brute force data structures. In particular, early Unixes tended to use fixed size arrays of things that kernels today allocate dynamically. When it comes to open files and file descriptors, there are two things that you have to keep track of. Each process has some number of file descriptors, then the underlying open files may be shared between processes and so have to be tracked in some global state.

V7 and other Research Unixes implemented this in a straightforward way. Each process had a fixed size array of its open files, the u_ofile array in the user structure, and then there was another fixed size global array for all open files, the file struct (in c.c; the struct file is defined in file.h). The sizes of both of these arrays were set when you built the kernel, in param.h, and influenced how much of the PDP-11's very limited memory the resulting kernel would take up.

(If you read through the V7 param.h, you can see that V7 had any number of very small limits. The limit on the total number of open files may seem small, but standard input, standard output, and standard error are often widely shared; each login session might reasonably use only one open file for all of them for the shell and all of the processes it runs interactively.)

The existence of these compiled in limits lasted a fair while; the 4.3 BSD user.h still has a fixed size array of file descriptors for each process, for example (4.4 BSD switched to a dynamic scheme). So did Linux 0.96c, as seen in sched.h (and also a fixed size global array of open file structures; see the implementation of sys_open in open.c).

Once the actual allocation of both the per-process set of file descriptors and the global set became more dynamic, people naturally started putting limits on just how dynamic this could be. It was natural to make the per-process limit a resource limit (at least normally) while making the global limit a kernel tunable. Linux also has a kernel limit that caps how many open files a process can have, which overrides the normal resource limits.
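On a Linux machine you can inspect all three limits directly; the names are standard but the values here will vary from system to system:

```shell
cat /proc/sys/fs/file-max   # system-wide cap on open file objects (fs.file-max)
cat /proc/sys/fs/nr_open    # ceiling on any single process's fd limit (fs.nr_open)
ulimit -Sn                  # this shell's soft per-process limit
ulimit -Hn                  # its hard limit, which can't be raised past nr_open
```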

Having two separate limits even on kernels which dynamically allocate these things makes some sense, but not necessarily a lot of it. A limit on the number of file descriptors that a single process can have open at once will save you from a program with a coding error that leaks open files (especially if it leaks them rapidly). A separate limit on the total number of open file descriptors across all processes is effectively a limit on the amount of memory one area of the kernel can lock down, which is at least potentially useful.

(I expect that Poettering knows all of this background, but other people don't necessarily, so I decided to write about it. I mentioned some of this in my entry on dup(2) and shared file descriptors.)

PS: The obvious speculation about why Linux's sysctl for per process open file descriptors has an underscore in its name is that the original kernel #define was called NR_OPEN. Since the original kernel #define for the global maximum was NR_FILE, this doesn't explain why file-max uses a dash.

TwoFileDescriptorLimits written at 00:35:59; Add Comment


New versions of Bash (and readline) default to special handling of pasting into the shell (or other programs)

Our standard habit on our OpenBSD machines is to use their packaged version of Bash as the root shell, instead of the default of OpenBSD's version of ksh. When I set up an OpenBSD 6.9 machine recently and started to paste in install steps from our standard instructions, an odd thing happened: the line I'd just pasted in was highlighted in reverse video and Bash just sat there, instead of doing anything. After some fiddling around, I discovered that I had to hit Return in order to get things to go (and at that point Bash would act on all of the input without further prompts, even if I'd pasted in multiple lines).

People who are more familiar with Bash and readline than I was already know what this is; this is the Bash readline setting of enable-bracketed-paste. OpenBSD 6.9 is the first Unix I've encountered where this was turned on by default. This isn't because OpenBSD did anything special; instead, it's because OpenBSD 6.9 is the first Unix I've used that has Bash 5.1. As mentioned in the detailed Bash list of changes, bracketed paste mode was enabled by default starting in bash-5.1-alpha. The reverse video behavior of it is also new there; in 5.0 and before, nothing special shows in bracketed paste mode to signal that something unusual is going on.

Bash 5.1 will be rolling out over time to more Unixes, so I suspect that more people will be running into this behavior unless their particular Unix opts to disable it in one way or another. If I had updated to Fedora 34 by now, it's possible I'd have already encountered this (I believe that Fedora 34 has Bash 5.1, but I don't know how Fedora opted to set bracketed paste).

This change to the default of bracketed paste is also in GNU Readline 8.1 (according to its CHANGES). However, Readline has a configuration time option that can change this, so different Unixes may opt to build Readline differently. On a Unix with Readline 8.1+ and bracketed paste enabled by default, I believe that all programs using GNU Readline will automatically have this behavior.

(This directly affects me because these days I build my alternate shell using GNU Readline. If I do nothing, it will someday inherit this behavior on new versions of Fedora and Ubuntu.)

If you decide that you don't want bracketed paste mode, the safest change is to set this in either your $HOME/.inputrc or globally in /etc/inputrc. You would do this with:

set enable-bracketed-paste off

This will (or should) cover Bash and anything else that starts using Readline 8.1 on a Unix that builds 8.1 with this enabled. Adding this to your .inputrc today is harmless if you have Readline 7.0 and Bash 4.4 or later (the versions where this setting was introduced).

If you just want to turn this off only in Bash (at least for now), I think that you want to set up a $HOME/.bashrc that has in it:

bind 'set enable-bracketed-paste off'

You can put this in your .profile instead, but then bracketed paste won't be turned off in subshells, which don't read .profile.

How to set this bind up globally for Bash depends on how your Unix's version of Bash was built and may not be possible. Ubuntu builds Bash so that there's a global /etc/bash.bashrc you can put this bind into, but Fedora and OpenBSD don't. Fedora provides a starting .bashrc for new accounts that sources /etc/bashrc, so you can put the bind there and probably get most people. Since Bash is an add-on in OpenBSD, it has nothing like this and people are on their own to disable bracketed paste one by one.

BashBracketedPasteChange written at 00:04:51; Add Comment


Simple use of Let's Encrypt on OpenBSD is pleasantly straightforward (as of 6.8)

For reasons beyond the scope of this entry, I recently needed to get a Let's Encrypt TLS certificate for testing on an OpenBSD machine, which isn't something I've done before. On a relatively modern OpenBSD (6.8), it was pleasantly straightforward and easy, with the programs necessary already installed in a full base install (which is what we normally do on our OpenBSD machines, since a full install is so small).

OpenBSD's standard Let's Encrypt client is acme-client, which has to be configured through /etc/acme-client.conf and then invoked (for example) as 'acme-client -v yourhost' to start the process of getting your TLS certificate. As the OpenBSD documentation tells you, a sample acme-client.conf is in /etc/examples and is easy to edit into shape to list the names you want Let's Encrypt certificates for. I opted to add the optional contact option to the 'letsencrypt' authority in my acme-client.conf, although in retrospect it's pointless for a test server where I don't care if the certificate expires after I'm done.
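A trimmed acme-client.conf for this sort of setup looks roughly like the following. The hostname and paths here are made up; the real starting point is the sample in /etc/examples, and acme-client.conf(5) has the exact syntax (including for contact):

```
authority letsencrypt {
	api url "https://acme-v02.api.letsencrypt.org/directory"
	account key "/etc/acme/letsencrypt-privkey.pem"
	contact "mailto:you@example.org"
}

domain example.org {
	domain key "/etc/ssl/private/example.org.key"
	domain certificate "/etc/ssl/example.org.crt"
	domain full chain certificate "/etc/ssl/example.org.fullchain.pem"
	sign with letsencrypt
}
```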

In the OpenBSD tradition, acme-client is very minimal. Unlike more all-encompassing Let's Encrypt clients like Certbot, acme-client neither runs post-renewal actions nor answers the HTTP challenges itself.

(Certbot and many other heavy-weight clients have their own little internal HTTP server that they'll run for you for the duration of a challenge, if you ask. This is Certbot's standalone mode.)

To handle the HTTP side of things, the easiest approach is to run OpenBSD's standard httpd server at least temporarily. OpenBSD ships with a sample httpd.conf in /etc/examples that I was able to use with very few changes. Because I wanted to be able to test my new certificate, I left the HTTPS version of my host in my httpd.conf (although it wasn't serving anything), but you could remove it to have a HTTP server that's just there to answer the Let's Encrypt challenges. Pleasantly, OpenBSD httpd will still start if you have a HTTPS site configured but the TLS certificates for it are missing. This lets you leave your HTTPS site configured before you've gotten the Let's Encrypt certificate for it.

(The default OpenBSD httpd.conf redirects all of your HTTP site to your theoretical HTTPS site, which nicely makes it serve nothing except answers to Let's Encrypt challenges.)
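The challenge-answering part of the configuration is short; this is a sketch along the lines of the shipped example (hostname made up):

```
server "example.org" {
	listen on * port 80
	location "/.well-known/acme-challenge/*" {
		root "/acme"
		request strip 2
	}
	location * {
		block return 302 "https://$HTTP_HOST$REQUEST_URI"
	}
}
```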

Because I was getting my Let's Encrypt TLS certificate for something other than serving a web site, I didn't permanently enable the HTTP server. I then had to start httpd more forcefully than usual, with 'rcctl -f start httpd'; otherwise it reported:

/etc/rc.d/httpd: need -f to force start since httpd_flags=NO

(This is unlike the Prometheus host agent package, which can be started temporarily with 'rcctl start node_exporter' before you decide to permanently enable it. This is probably routine to regular OpenBSD users.)

Once I'd started httpd, acme-client succeeded without problems. Unlike Certbot and some other clients, acme-client has no separate Let's Encrypt account registration process; it registers an account if and when necessary to get your certificate. I opted to get both the certificate ('domain certificate ...' in acme-client.conf) and the default full chain ('domain full chain certificate ...'). This isn't strictly necessary, since you can always manually extract each individual certificate from the full chain file, but I wasn't sure how I was going to use the certificate so I opted to save myself time.

(Since this was a quick test setup, I haven't tried to automate any renewal, but the acme-client manpage has an example cron entry. You need a separate cron entry for every separate certificate you have on the machine; unlike Certbot, there is no 'try to renew anything necessary', even though all of your certificates are listed in your acme-client.conf.)

OpenBSDNiceLetsEncrypt written at 00:53:13; Add Comment


Unix job control has some dark corners and challenging cases, illustrated

Recently I learned that the common Linux version of vipw does some odd things with your terminal's foreground process group, one of the cores of Unix job control, due to a change made in late 2019 (via). It does this to fix a problem that's described in the issue and sort of in the commit, but neither the issue nor the commit discuss the larger context of why all of this is needed. Also, the fix has a small omission which can cause problems in some circumstances. So how did we get here?

Vipw's job is to let you safely edit /etc/passwd (or with the right option /etc/shadow), but it's not an editor itself, it's a front-end on some arbitrary editor of your preference. It works by setting things up and then running the editor of your choice. This creates three different potential interactions with job control and the terminal. First, infrequently vipw will be running an editor that doesn't take over the terminal or otherwise do anything with job control (perhaps someone likes ed). When you type Ctrl-Z, the terminal sends the entire foreground process group SIGTSTP and everything stops. Second, ideally vipw will be running an editor that takes over the terminal and handles Ctrl-Z itself, but the editor carefully sends SIGSTOP (or SIGTSTP) to the entire foreground process group. Third, it may be running an editor that takes over the terminal, handles Ctrl-Z itself, but only SIGSTOPs the editor itself, not the entire process group.

(A version of the third case is that someone manually sends SIGSTOP to the editor.)

This third case is the initial problem. If vipw does nothing about it and you type Ctrl-Z to such an editor, your vipw session will appear to hang until you type Ctrl-Z a second time. The editor has stopped itself, but vipw is only waiting for it to finish and exit, so from the shell's perspective it seems that vipw is still the active foreground program. Your second Ctrl-Z will suspend vipw, now that the editor is no longer trapping Ctrl-Z.

That it's so easy for editors to do the wrong thing here is the first of our dark corners of job control. An editor that does this wrong (or 'wrong') will work almost all of the time, because people are usually not running it with a front-end like vipw or from inside a script (another situation where suspending only yourself is the wrong answer).

To handle this third case, vipw needs to listen for the editor being suspended and then suspend itself too, passing the suspension through to the shell. But now the first and the second cases are a problem. When either the TTY or the editor suspends the entire process group, a notification that the editor process has been suspended may be queued up for vipw. Vipw can't tell this apart from the third case, so when you un-suspend vipw (and the editor) again it will see that the editor was suspended and immediately re-suspend everything. This is the issue report.

The fix vipw made (and in my opinion the correct one) was to put the editor in a new process group and make this new process group the terminal's foreground process group (this is how process groups interact with job control). With the editor isolated in its own process group, this essentially reduces everything to the third case. Vipw will be unaffected whether the TTY suspends the process group, the editor suspends the process group, or the editor just suspends itself, and in all cases vipw will notice that the editor has been suspended and suspend itself, turning control back over to the shell.

However, this change introduced a new bug, which is that when the editor finally finishes and everything is exiting, vipw doesn't change the terminal's foreground process group back to what it used to be before vipw itself exits. That this bug is almost invisible (and thus easy to introduce without noticing) is the second dark corner of job control.

The reason the bug is almost invisible is that almost all shells today are job control shells and a job control shell is basically indifferent to what you leave the foreground process group set to. Because a job control shell always changes the foreground process group away from itself when it starts or resumes a job, it always has to set it back to itself when it regains control after a program exits. This constant shuffling of foreground process groups is intrinsic to how job control works.

(Even the Almquist shell, used as a relatively minimal sh on FreeBSD and some Linuxes, has job control. OpenBSD's sh is ksh, and it too has job control.)

PS: In self defense, my non-job-control shell (also) has evolved to generally reset the foreground process group to itself when it seems necessary; in fact it started doing this in 2000 (really). However, the code path for readline support didn't do this until I stumbled over this vipw issue. GNU Readline itself will clean up a number of things about the terminal it's attempting to deal with, but not this (which is sensible).

JobControlHasDarkCorners written at 21:36:55; Add Comment


Unix job control and its interactions with TTYs (and shells)

For a user, Unix's job control gives you more or less two things. First, you can stop a program you're running with Ctrl-Z (usually), and often then have it run in the background for a while if you want (or you may want it to just stop, for example so it stops using up all of your CPU). Second, it lets you multiplex your terminal between multiple programs that all want to interact with you. You can Ctrl-Z the current foreground program, bring another one to the foreground, interact with it for a while, switch back, and so on.

The actual implementation of this is somewhat tangled. One way to put it is that Unix doesn't trust programs to cooperate with job control (which also means that programs didn't have to all be updated when BSD introduced the idea of job control). Instead, job control is managed by a combination of your Unix shell and the Unix TTY driver, primarily through the mechanism of the foreground (terminal) process group.

A job control shell puts every command or pipeline you run into a different process group. Then Unix TTYs (pseudo-ttys these days, but once upon a time real serial terminal connections) have one and only one foreground process group; normally, only processes in this process group are allowed to interact freely with the terminal. If a process in another process group attempts to either read from or write to the terminal, it will usually get hit with SIGTTIN or SIGTTOU.

(If you ignore SIGTTIN, you generally get an EIO error instead when you read from the TTY and you're not in the foreground process group.)
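The first half of this, each pipeline getting its own process group, can be observed even from a script. This is a sketch that assumes a Linux ps:

```shell
set -m                                  # make sure job control is on; interactive shells default to it
sleep 30 | sleep 30 &                   # one pipeline, so one new process group
pgid=$(ps -o pgid= -p $! | tr -d ' ')
ps -o pid,pgid,comm -p "$$,$!"          # the job's PGID differs from the shell's own
kill -TERM -- "-$pgid"                  # a negative PID signals the entire process group
```

Signalling the negative PGID is also essentially how the TTY driver delivers SIGTSTP and friends to everything in the foreground process group at once.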

Plain ordinary Unix programs do nothing special with the TTY and let this basic handling take care of everything. Some Unix programs need to do more; for example, if they turn off the standard TTY handling, Ctrl-Z stops being special and they need to handle it themselves. Generally they do this by performing any particular clean-up they need and then sending themselves (or their entire process group) a SIGSTOP signal.
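A trivial stand-in shows the 'stop yourself' step; this is just a sketch, with a shell one-liner playing the part of a cleaned-up full-screen program:

```shell
sh -c 'kill -STOP $$' &      # a stand-in 'editor' that suspends itself after cleanup
pid=$!
sleep 1
ps -o stat= -p "$pid"        # shows 'T': the process is stopped, not exited
kill -CONT "$pid"            # let it resume and exit
```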

To make all of this work, job control shells change the TTY's current foreground process group to match whatever they think should currently be in the foreground. They also watch for processes (and groups of them) that have become suspended, which is a sign of things like the user pressing Ctrl-Z. One consequence of this is that when the shell regains control of the terminal because the current foreground process has either been suspended or has finished, the first thing the shell has to do is reset the current foreground process group back to itself. Otherwise, its own attempts to do IO to the TTY will fail (when the shell wants to print its prompt, read input, and perhaps tell you that something has just been suspended).

PS: If there are multiple processes in the current foreground process group, they can freely write a jumble of TTY output or all try to read from the TTY at the same time. Job control only protects you from different command lines (ie, different process groups) interfering with each other. Generally this isn't an issue because if there's more than one process in a process group, they're in a pipeline (or a shell script).

JobControlAndTTYs written at 23:22:11; Add Comment


Trying out learning more Vim on demand

These days, I have a reasonably active interest in learning more Vim. Mostly I've done it by reading articles about Vim (eg, and I'm keeping an index). Sadly, a lot of the time this reading doesn't really stick. Recently I have been trying another approach, one that feels really obvious: when I want to do something, such as uppercase a number of words, I'll do an Internet search for the answer (here [vim capitalize uppercase word], which immediately leads to a stackoverflow answer that's actually summarized right on the results page of the dominant search engine).

(I don't search for everything. Some things I want to do strike me as either too infrequent or too complex.)

The advantage of this is that it is what you could call highly motivated learning (I want to do this) and I reinforce it right away by using what I just read. If it doesn't entirely stick and this is something I do frequently enough, I'll soon re-do the search, re-read what I already read, and reinforce my memories. Perhaps sooner or later it will stick. If it's not something I do regularly it'll probably fade away, and I haven't wasted too much time on the search instead of doing whatever operation by hand (such as cw'ing the word and retyping it in CAPS).

There are two drawbacks to this approach that I can see (and probably some that I don't). The first is that this works much better on operators (such as gU to uppercase something) than on motions (in common vim jargon I believe these are called 'verbs' and 'nouns' respectively). It's natural to want to do an operation while not knowing how to; it feels less natural to want to move around or select in a new and unusual way. Second (and relatedly), this only works for things that I think are likely to be within Vim's capabilities. If I have no idea that Vim might even do something, I'm not going to try to look for it.

(An example of both at once is my stumbling over visual mode. I'm pretty sure it would never have struck me to do an Internet search for it.)

The second issue suggests that I should keep reading about general Vim features, so I have a broad idea of what Vim is capable of. Then I have a better chance of realizing that Vim probably has a way of doing the specific thing I care about right now, and looking it up.

VimLearningOnDemand written at 23:15:07; Add Comment


My view of Wayland here in 2021

Somewhat recently there was some Wayland related commotion in the Unix tech circles that I vaguely follow, where Wayland people were unhappy about what anti-Wayland people are doing (for example, which may have been reacting to this). Somewhat in reaction to all of this, I had a Twitter reaction:

My Wayland hot take is that I have no idea how well Wayland works in general, but I do know that for me, switching to it will be very disruptive because I'll need a new window manager and none of the Wayland ones work like my fvwm setup does. So I am and will be clinging to X.

I can't say anything about what modern Wayland is like, because I have no personal experience with Wayland. On my work laptop I use Cinnamon, which doesn't support Wayland. On my home and work desktops, I use a highly customized X environment that would not port to Wayland.

At this point Wayland has been coming for more than ten years and has nominally been definitely the future of Unix graphics for four years. But it's fully supported on all Linux graphics hardware by only two environments (GNOME and KDE, see Debian's description of the hardware requirements). Only two of the five significant Linux desktop environments support Wayland on any hardware (GNOME and KDE, with Cinnamon, XFCE, and MATE not). Canonical is only just releasing a version of Ubuntu that uses Wayland for GNOME by default (21.04), and even then it may not do this on Nvidia GPUs (cf). There are some additional Wayland 'window managers' (compositors) such as Sway, but nothing like the diversity that exists on X (although there may be good reasons for this).

Today, it's not so much that people are refusing to use Wayland, it's that a lot of people cannot. If you're not using GNOME or KDE, you're pretty much out in the cold. If you're using an Nvidia GPU, you're probably out in the cold (even if you use GNOME, your Linux probably defaults to Xorg on your hardware). If you don't use Linux, you're definitely out in the cold.

It's true that X server development has more or less stopped (eg, also), although the X server 21.1 milestone seems small right now. It's also true that the current X server works and supports a broad range of desktop environments and Unixes, and plenty of people are using those.

PS: I don't like Nvidia GPUs either and I don't use them, but a lot of Unix people have them and some of those people don't have any choice, for example on some laptops (or people who need to do GPU computation). And they work in X.

Sidebar: Some Wayland references

See the Debian wiki page on supported and unsupported desktop environments, toolkits, and so on. Debian has been using Wayland for GNOME since Debian 10, released in 2019, at least on non-Nvidia hardware. Their Wayland and Nvidia wiki pages are unclear if their GNOME defaults to Wayland even on Nvidia hardware, but I suspect not. The Arch wiki has a list of Wayland compositors, but no information on how usable they are. The Gentoo "Wayland Desktop Landscape" may be helpful, especially since it rates some of the available compositors.

WaylandMyView2021 written at 00:28:02; Add Comment


Why NFS servers generally have a 'reply cache'

In the beginning, NFS operated over UDP, with each NFS request and each NFS reply in a separate UDP packet (possibly fragmented). UDP has the charming property that it can randomly drop arbitrary packets (and also reorder them). If UDP drops a NFS client's request to the server, the NFS client will resend it (a 'retransmit' in the jargon of NFS). If UDP drops the server's reply to a client's request, the client will also resend the request, because it can't really tell why it didn't get a reply; it just knows that it didn't.

(Since clients couldn't tell the difference between a sufficiently slow server and packet loss, they also reacted to slow servers by retransmitting their requests.)

A lot of NFS operations are harmless to repeat when the server's response is lost. For instance, repeating any operation that reads or looks up things simply gives the client the current version of the state of things; if this state is different than it was before, it's pretty much a feature that the client gets a more up to date version. However, some operations are very dangerous to repeat if the server response is lost, because the result changes in a bad way. For example, consider a client performing a MKDIR operation that it's using for locking. The first time, the client succeeds but the server's reply is lost; the second time, the client's request fails because the directory now exists, and the server's reply reaches the client. Now you have a stuck lock; the client has succeeded in obtaining the lock but thinks it failed and so nothing is ever going to release the lock.

(This isn't the only way NFS file-based locking problems can happen.)
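The mkdir-for-locking pattern makes the danger easy to see even locally, with no NFS involved; a repeated MKDIR changes from success to failure:

```shell
mkdir /tmp/app.lock && echo "lock acquired"       # first attempt succeeds
mkdir /tmp/app.lock || echo "lock already held"   # the 'retransmit' now fails
rmdir /tmp/app.lock                               # release the lock
```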

To try to work around this issue, NFS servers soon introduced the idea of a "reply cache", which caches the NFS server's reply to various operations that are considered dangerous for clients to repeat. The hope and the idea is that when a client resends such a request that the server has already handled, the server will find its reply in this cache and repeat it to the client. Of course this isn't a guaranteed cure, since the cache has a finite size (and I think it's usually not aware of other operations that might invalidate its answers).

In the days of NFS over UDP and frequent packet loss and retransmits, the reply cache was very important. These days, NFS over TCP uses TCP retransmits below the level that the NFS server and client see, so sent server replies are very hard to lose and actual NFS level retransmissions are relatively infrequent (and I think they're more often from the client deciding that the server is too slow than from actual lost replies).

In past entries (eg on how NFS in unreliable for file-based locking), I've said that this is done for operations that aren't idempotent. This is not really correct. There are very few NFS operations that are truly idempotent if re-issued after a delay; a READDIR might see a new entry, for example, or READ could see updated data in a file. But these differences are not considered dangerous in the way that a MKDIR going from success to failure is, and so they are generally not cached in the reply cache in order to leave room for the operations where it really matters.

(Thus, the list of non-cached NFS v3 operations in the Linux kernel NFS server mostly isn't surprising. I do raise my eyes a little bit at COMMIT, since it may return an error. Hopefully the Linux NFS server insures that a repeated COMMIT gets the same error again.)

NFSServerReplyCacheWhy written at 22:01:48; Add Comment
