Wandering Thoughts

2021-02-22

The mhbuild directives I want for sending MIME attachments with MH

I use and like (N)MH for any number of reasons (it's a very Unix and command line focused mail system, although I often use it with exmh). However, for my sins I sometimes have to send people email with (MIME) attachments of things like PDFs and images, especially these days when I can't just give them physical printouts. This is possible in NMH, but the manual page for mhbuild, the core program you need to do this, is both relatively opaque and full of discussions about things that I mostly don't care about. Since I always wind up picking painfully through the mhbuild manpage when I need it, today I'm going to write down the magic directives you (I) need and also the simpler way that I discovered while writing this entry.

Suppose that we start with a FILE.PDF that we want to send. Our end goal is to wind up with an attachment that has at least the following MIME headers:

Content-Type: application/pdf; name="FILE.PDF"
Content-Disposition: attachment; filename="FILE.PDF"
Content-Transfer-Encoding: base64

In theory the 'name="..."' on the Content-Type is unnecessary and only exists for historical reasons. In practice, MIME messages commonly put the filename in both places. You can also have a Content-Description, and optionally a Content-ID, although I think the latter is generally irrelevant for attachments.

The simple and often sufficient way to add one or more attachments to your message is to list them as Attach headers in your draft (and then use the 'mime' command in whatnow to activate all of mhbuild's processing). The format of these is simply:

Attach: /path/to/some/FILE.PDF

If NMH can get the MIME type right, this will do just what you want (including providing a Content-Description that's the base filename without its directory). Even if it gets the MIME type wrong, you can fix that by editing the Content-Type: by hand afterward, and for that matter you can edit the file name (in all its places) if you would like your recipient to see a different file name than you have locally. In a lot of cases I think this will be good enough for what I want to attach, especially if I rename my PDFs beforehand.
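As a concrete illustration (with a made-up address and file path), a draft using Attach might look like the following; after writing it, you use 'mime' at the whatnow prompt and then send as usual:

To: someone@example.com
Subject: The PDF you asked for
Attach: /u/cks/tmp/FILE.PDF
--------
The PDF we talked about is attached.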

(I didn't discover Attach until now for many reasons, including that the mhbuild manpage doesn't exactly encourage extensive and careful reading of all of it. It's very MH manpage-y, for better or worse.)

The more complex way is to use a mhbuild directive to specify everything you want. The mhbuild manpage has a lot to say about directives, but much of it is for complicated and odd cases that I don't care about. The format we want for attaching files is:

#application/pdf; name="FILE.PDF" <> {attachment; filename="FILE.PDF"} /path/argle.pdf

(Notice that this example renames the PDF; it is argle.pdf on the filesystem, but the recipient will see it called FILE.PDF.)

For my future reference, this directive breaks down as follows. The first part becomes the Content-Type. The <> tells mhbuild to not generate a Content-ID, which is generally surplus. The {...} becomes the Content-Disposition. We're specifying the filename MIME parameter here, but you can leave it out if you're happy with the recipient seeing your local file name (this is the same as with Attach). If you leave out the '{ ... }' portion entirely, the result will have no Content-Disposition header.

If you want to go the extra distance you can also provide a Content-Description by using [...], square brackets, as in '[Amazon purchase invoice #10]'. This goes after the <> and before the {...} (or before the filename, if you leave out the {...}). I don't know how many mail clients show or care about the Content-Description; it's not all that common in my saved mail, and most of the time it's just another copy of the file name.
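For example, a version of the earlier directive with a description added would be:

#application/pdf; name="FILE.PDF" <> [Amazon purchase invoice #10] {attachment; filename="FILE.PDF"} /path/argle.pdf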

Given all of this, the minimal version is:

#application/pdf <> {attachment} /path/to/FILE.PDF

This will have no 'name=' parameter on the Content-Type, but will have a 'filename=' parameter on the Content-Disposition so the recipient's mail client will probably let them save it under some useful name. If you're not sure what MIME type some file should be, you can use 'file --mime' to see what it thinks or default to application/octet-stream. If you do the latter, you'll be in good company; we see plenty of attachments on incoming legitimate email that have been defaulted that way even when there are more applicable MIME types.
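For example (the exact output will vary a bit depending on your version of file):

; file --mime FILE.PDF
FILE.PDF: application/pdf; charset=binary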

MHSendingMIMEAttachments written at 23:06:37

2021-02-06

Talkd and 'mesg n': a story from the old Unix days

I tweeted:

An ancient Unix habit: I often still reflexively run 'mesg n' on my (single-user) workstation before starting screen, even though it's been a very long time since I ran talkd and so had any worries about that.

Back in the old days of Unix there was a program called talk, which actually made it into the POSIX standard (which I was surprised to discover just now). Talk enabled live two-way communication, instead of the one-way communication of write (which is also a POSIX standard command), but the relevant thing about it was that it generally worked through a daemon, talkd. Your Unix server ran talkd, and when you ran talk it communicated with talkd to notify the person you wanted to talk to and then let them connect with you.

Back in those days, this communication was done over IP, not (say) Unix domain sockets. Since it was the old days of a trusting network environment, your talkd would accept requests from everyone, not just the local machine (and talk would chat across the network), letting anyone on your local network or often the entire Internet try to start up a talk session with you. This meant that even on a single-user workstation, there was a reason to run 'mesg n' to avoid having random talk notifications overwrite screen's output and interfere with it.

(On a multi-user machine, other people on the same machine might try to write to you and you'd want to keep that from interfering with your screen session. This isn't an issue on a single user workstation.)

It's been a very long time since my workstation ran talkd, even for requests from the local network. But my reflexes still want to run that 'mesg n' before I start screen.

(I didn't put 'mesg n' in my shell .profile for what are ultimately fuzzy reasons.)

TalkdAndMesgN written at 23:41:38

2021-02-02

The small oddity in the Unix exec*() family

When I recently wrote about find's -exec option, I casually talked about 'the exec() family of system calls'. This is an odd phrasing and sort of an odd thing to say about Unix system calls, because they don't usually come in families. So let's list off the exec() family:

execv() execve() execvp() execvpe()
execl() execlp() execle()

(This is the list on Linux and OpenBSD; the FreeBSD list has execvP() but not execvpe(). The POSIX version leaves out execvpe() and adds fexecve(), which I don't quite put into this exec() family.)

One of these things is not like the others. Out of the entire list of at least six exec() functions, generally only execve() is a system call; the other exec*() functions are library functions layered on top of it. That there are convenient library functions layered on top of a system call (or a bunch of them) isn't odd; that's what all of stdio is, for example. What makes this situation feel odd is that the names are so close to each other. I have a decent memory for Unix libc function names and most of the time I probably couldn't pick the actual exec() system call out of a lineup like this.

(Right now it's very much in my memory that execve() is the underlying system call on most Unixes, of course.)

This multiplicity goes all the way back to V7 Unix, which documents all of execl(), execv(), execle(), and execve() in its exec(2) manpage. In V7, as is the case today, the underlying system call was the one that takes an environment, what we now call execve(), although the actual system call had a different name (see below). Even V6 had execl() and execv() in the V6 exec(2) manpage.

(The V6 system call was just called exec and took only the program to be executed and argv. When V7 added the environment, it kept the V6 exec call but added a new exece system call that took the environment (well, envp) as an additional argument.)

PS: Some Unixes have underlying system calls that are variants of each other, due to the slow growth and improvement in the system call API over time (for example, to add 64-bit variants of what used to be 32-bit calls). However, usually you only use and think about the most recent version of the system call; they aren't a family of variants the way the exec() family is.

ExecFunctionFamilyOddity written at 23:47:51

2021-01-31

The limitations on find's -exec option and implementation convenience

In my entry on how find mostly doesn't need xargs nowadays, I noted that in '-exec ... {} +', the '{}' (for the filenames find was generating) had to come at the end. In a comment on that entry, an anonymous commentator noted that this didn't apply to the -exec version that runs a separate command for each filename; with it, you can put the substituted filename anywhere in the command. This appears to be not just a GNU Find feature, but instead a common one and I think it's even required by the Single Unix Specification for find.

(The SUS specification of the -exec arguments only restricts the '+' form to having the '{}' immediately before it. Its specification for the ';' form just allows for a general argument list, and then the text description says a '{}' in the argument list is replaced by the current pathname. This is tricky, as seems usual for the SUS and POSIX.)

This difference between the two forms of -exec is interesting, and it probably exists because of implementation convenience for the '+' form. So let's start from the beginning. When you use any form of -exec, find runs those commands via the exec() family of system calls (and library functions), which require a (C) array of the command and the arguments to be passed to them (ie, this is argv for the new command). The implementation of this for the single substitution case of '-exec ... ;' is straightforward: you create and pre-populate an argv array of all of the -exec arguments (and the command), and you remember the index of the '{}' parameter in it (if there is one, it's not required). Every time you actually run the command, you put the current pathname in the right argv slot and you're done.

In the restricted form of multiple substitutions, you can sort of do this too. You create an argv array of some size, populate the front of it with all of the fixed arguments, and then append each pathname to the end as an additional argument, keeping track of the total size of all of the arguments so that you know when you have to execute the command to avoid the argument list getting too big. After each execution, you reset your 'the next pathname goes here' index back to the starting position, at the end of the fixed arguments, and repeat.

However, if the '{}' could go anywhere you'd need a more complicated implementation that would have to divide the fixed arguments into two parts, one before and one after the '{}'. You would fill the front of your argv with the 'before' fixed arguments, append pathnames as additional arguments until the total size hit your limit, and then append on the 'after' fixed arguments (if any) before the exec(). This is not much extra work but it is a bit of it, and I have to theorize that it was just enough extra work to push the people implementing the SVR4 version (where this feature first appeared) to pick the restricted form to make their lives slightly more convenient and bug free (since code you don't have to write is definitely bug free).

(I'm sure that this isn't the only area of Unix commands where you can see implementation convenience showing through, but find's contrast between the two versions of -exec is an unusually clear example.)

FindExecImplementationShows written at 22:35:00

2021-01-27

find mostly doesn't need xargs today on modern Unixes

I've been using Unix for long enough that 'find | xargs' is a reflex. When I started and for a long time afterward, xargs was your only choice for efficiently executing a command over a bunch of find results. If you didn't want to run one grep or rm or whatever per file (which was generally reasonably slow in those days), you reached for 'find ... -print | xargs ...'. There were some gotchas in traditional xargs usage, and one of them was why GNU xargs, GNU find, and various other things started growing options to use the null byte as an argument terminator instead of the usual (and surprising) definition of argument separators. Over time I adapted to these and soon was mostly using 'find ... -print0 | xargs -0 ...'.
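For concreteness, the xargs version of the grep example I'm about to show looks like this:

find . ... -print0 | xargs -0 grep -H whatever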

For usage with find, all of this is unnecessary on a modern Unix and has been for some time, because find has folded this capability into itself. Modern versions of find have not just the traditional '-exec', which runs one command per file, but also an augmented version of it that aggregates the arguments together the way xargs does. This augmented version is used by ending the '-exec' with '+' instead of ';', like so:

find . ... -exec grep -H whatever '{}' +

(I'm giving grep the -H argument for reasons covered here.)

Although I sometimes still reflexively use 'find | xargs', more and more I'm trying to use the simple form of just find with this augmented -exec. My reflexes can learn new tricks, eventually.

This augmented form of -exec is in the Single Unix Specification for find, so unsurprisingly it's not just in GNU Find but also OpenBSD, FreeBSD, NetBSD, and Illumos. I haven't tried to look up a find manpage in whatever commercial Unixes are left (probably at least macOS and AIX). Based on the rationale section of the SUS find, this very convenient find feature was introduced in System V R4. The Single Unix Specification also explains why they didn't adopt the arguably more Unixy option of '-print0' for null-terminated output.

(In practice everyone has adopted -print0 as well, even OpenBSD and Illumos. I assume without checking that they also all have 'xargs -0', because it doesn't make much sense to adopt one without the other.)

PS: Unfortunately this feature is not quite as flexible as it looks. Both the specification and actual find implementations require the '{}' to be at the end of the command, instead of anywhere in it. This means you can't do something like 'find ... -exec mv {} /some/dir +'. This makes life slightly simpler for find's code and probably only rarely matters for actual usage.
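If you really do need the pathnames somewhere other than at the end with the '+' form, the standard workaround is to interpose a shell and hand it the aggregated pathnames as positional parameters, along these lines:

find ... -exec sh -c 'mv "$@" /some/dir/' sh {} +

(The extra 'sh' becomes $0 for the inner shell, so that "$@" expands to just the pathnames.)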

FindWithoutXargsToday written at 00:09:21

2021-01-06

Unix shell pipelines have two usage patterns

I've seen a variety of recommendations for safer shell scripting that use Bash and set its 'pipefail' option (for example, this one from 2015). This is a good recommendation in one sense, but it exposes a conflict; this option works great for one usage pattern for pipes, and potentially terribly for another one.

To understand the problem, let's start with what Bash's pipefail does. To quote the Bash manual:

The exit status of a pipeline is the exit status of the last command in the pipeline, unless the pipefail option is enabled. If pipefail is enabled, the pipeline’s return status is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully. [...]

The reason to use pipefail is that if you don't, a command failing unexpectedly in the middle of a pipeline won't normally be detected by you, and won't abort your script if you used 'set -e'. You can go out of your way to carefully check everything with $PIPESTATUS, but that's a lot of extra work.
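For illustration, careful checking with $PIPESTATUS looks something like this in Bash (with made-up commands standing in for a real pipeline):

generate --thing | sort | uniq -c
status=("${PIPESTATUS[@]}")
for i in "${!status[@]}"; do
    # report any pipeline stage that exited non-zero
    [ "${status[$i]}" -eq 0 ] || echo "stage $i exited with ${status[$i]}" >&2
done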

Unfortunately, this is where our old friend SIGPIPE comes into the picture. What SIGPIPE does in pipelines is force processes to exit if they write to a closed pipe. This happens if a later process in a pipeline doesn't consume all of its input, for example if you only want to process the first thousand lines of output of something:

generate --thing | sed 1000q | gronkulate

The sed exits after a thousand lines and closes the pipe that generate is writing to, generate gets SIGPIPE and by default dies, and suddenly its exit status is non-zero, which means that with pipefail the entire pipeline 'fails' (and with 'set -e', your script will normally exit).

(Under some circumstances, what happens can vary from run to run due to process scheduling. It can also depend on how much output early processes are producing compared to what later processes are filtering; if generate produces 1000 lines or less, sed will consume all of them.)

This leads to two shell pipeline usage patterns. In one usage pattern, all processes in the pipeline consume their entire input unless something goes wrong. Since all processes do this, no process should ever be writing to a closed pipe and SIGPIPE will never happen. In another usage pattern, at least one process will stop processing its input early; often such processes are in the pipeline specifically to stop at some point (as sed is in my example above). These pipelines will sometimes or always generate SIGPIPEs and have some processes exiting with non-zero statuses.

Of course, you can deal with this in an environment where you're using pipefail, even with 'set -e'. For instance, you can force one pipeline step to always exit successfully:

(generate --thing || true) | sed 1000q | gronkulate

However, you have to remember this issue and keep track of what commands can exit early, without reading all of their input. If you miss some, your reward is probably errors from your script. If you're lucky, they'll be regular errors; if you're unlucky, they'll be sporadic errors that happen when one command produces an unusually large amount of output or another command does its work unusually soon or fast.

(Also, it would be nice to only ignore SIGPIPE-based failures, not other failures. If generate fails for other reasons, we'd like the whole pipeline to be seen as having failed.)
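One way to get most of that is to have the early command's subshell treat only a death from SIGPIPE as success, relying on the shell reporting such a death as exit status 141 (128 plus signal 13):

# treat only 'killed by SIGPIPE' (exit status 141) as success
(generate --thing || st=$?; [ "${st:-0}" -eq 141 ] && exit 0; exit "${st:-0}") | sed 1000q | gronkulate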

My informal sense is that the 'consume everything' pipeline pattern is far more common than the 'early exit' pipeline pattern, although I haven't attempted to inventory my scripts. It's certainly the natural pattern when you're filtering, transforming, and examining all of something (for example, to count or summarize it).

ShellPipesTwoUsages written at 00:30:46

2020-12-31

GNU Date and several versions of RFC 3339 dates

RFC 3339 is a standard for time strings, which is to say for how to represent a timestamp in textual form. In its RFC form, it says that it is for 'timestamps for Internet protocol events', but it's been adopted as a time format well beyond that; for example, some Prometheus tools use it. In its pure RFC 3339 form, it has two advantages: it doesn't use time zone names of any sort, and an RFC 3339 time is written all in one string without any spaces.

A true RFC 3339 version of some local time has two normal representations: a version expressed in local time with its UTC offset, and a version expressed in UTC ('Zulu') time:

2020-12-31T21:07:14-05:00
2021-01-01T02:07:14Z

These both represent the same time, that of the Unix timestamp 1609466834.

Writing out times in RFC 3339 format is a little bit annoying; either you need a time in UTC or you need to remember your local timezone offset (as of the relevant time, no less). To help make up for this, GNU Date has an option where it can produce RFC 3339 dates. Except that it doesn't:

; date --rfc-3339=seconds
2020-12-31 21:07:14-05:00

RFC 3339 is almost unambiguous here; a date-time is expressed as a date and a time with a 'T' between them (a lower case 't' is also accepted). Unfortunately it provides a little escape hatch that GNU Date has taken advantage of:

NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of readability, to specify a full-date and full-time separated by (say) a space character.

This escape hatch is not present in RFC 3339's ABNF grammar or in its examples, but the regrettable presence of this little paragraph technically lets GNU Date off the hook. However, other programs that deal with RFC 3339 dates are not so forgiving and do not follow this 'may', instead requiring that RFC 3339 dates be given to them with the T. One large case of these is any program parsing time strings with Go's time package, where the RFC 3339 format specifically requires the 'T'.

GNU Date can produce real RFC 3339 time strings with the '--iso-8601=seconds' option (and its manual notes that the only difference between its ISO 8601 format and its RFC 3339 format is that 'T'). However, it has another peculiarity, although this one is RFC ABNF legal:

; date --iso-8601=seconds --utc
2021-01-01T02:07:14+00:00

GNU Date writes out the +00:00 of UTC instead of shortening it to 'Z'. There may be some programs that specifically require RFC 3339 times in UTC with the Z; if there are, they won't accept GNU Date's output here. Opinions may be divided on whether 'Z' or '+00:00' is better; I tend to come down on the side of 'Z'.
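If something insists on the 'Z' form, one way to get it from GNU Date is to fix up the offset yourself (or use an explicit '+' format string, as in the BSD example below):

; date --iso-8601=seconds --utc | sed 's/+00:00$/Z/'
2021-01-01T02:07:14Z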

Other versions of date, such as the ones in FreeBSD and OpenBSD, don't have any specific output option for RFC 3339 dates. Since the standard formatting of strftime() has no option for a '[+-]hh:mm' version of the time zone offset (only a version without the separator, as '%z'), you cannot use them to produce RFC 3339 dates in local time. Instead you must remember to always use 'date -u' and fake the time zone (here I use Z):

$ date -u '+%Y-%m-%dT%H:%M:%SZ'
2021-01-01T02:07:14Z

If you have to do this more than once in a blue moon on a FreeBSD or OpenBSD machine, you're clearly going to wind up writing either a cover script or perhaps a program (so that you can parse a range of time strings as the source time). Or you could install GNU Date, but then you have to deal with the irritation of its version of RFC 3339.

GNUDateAndRFC3339 written at 21:41:57

2020-12-28

It feels like the broad Unix API is being used less these days

A few years ago I wrote about how the practical Unix API is broader than system calls and how the value locked up in the broad Unix API made it pretty durable. I still believe that in one way, but at the same time I've wound up feeling that a lot of modern software development and deployment practices are causing the broad Unix API to be less and less used and useful. What I'm specifically thinking about here is containers.

If you're logging in to a Unix machine and using it, elements of the broad Unix API like $HOME and /tmp matter to you. But for a container (or for deploying a container), they often don't. Containers deliberately ask much less of the host than the broad Unix API (that's one of their features), and to the extent that software inside a container uses the broad API, it's using a sham version that was custom assembled for it. My impression is that some of this shift is social, in attitudes about how containerized software should be put together and what it should use and assume. To put it one way, I don't think it would be seen as a good thing to use a bunch of shell scripts in a container. Containers aren't general purpose Unix systems and people don't write software for them as if they were.

Right now I don't think this is a significant force in the parts of the broad Unix world that I notice, one big enough to be changing Unix as a whole. There are plenty of people still running and deploying traditional Unix systems (including us), and then putting software straight onto such systems (without containers). These people are all using the broad Unix API and exerting a quiet pressure on software to still support (and use) it, instead of requiring containers or at least some emulation of them (although you can find software that really doesn't want to be deployed 'simply', ie outside a container).

One part of this is likely that Unix remains more than Linux, although not everyone really believes this. Right now containers are fairly strongly tied to Linux for various reasons, so if you write container-only software you're implicitly writing Linux only software. My impression is that many open source projects aren't willing to tie themselves down like this.

Of course, there's also a lot of Unix software that isn't the sort of thing you put in containers in the first place, or at least not in conventional containers (Linux has Flatpaks and Snaps for more interactive applications, but they're not very popular). This software is using the broad Unix API when it arranges to install manpages, support files, and so on in the standard locations. It can also sometimes take advantage of standard services and standard integrations with other software (for example Certbot and other Let's Encrypt automation, which cooperate with various daemons to give them TLS certificates).

UnixAPILessUsed written at 23:54:46

2020-12-04

How to get generic interface names and IPs in OpenBSD PF

Over on Twitter, I had a grump that led to me learning some things:

Since OpenBSD Ethernet interface names are tied to the physical hardware, I wish OpenBSD pf.conf syntax had an abstract name that meant 'the interface with the default route, whatever that is'.

We have straightforward OpenBSD systems with a single active interface that get installed on various hardware, with various interface names, and I would really like to not have to change their pf.conf for every different piece of hardware they get put on (physical or virtual).

Thanks to @oclsc I've now learned about OpenBSD interface group names, and reading the manpage pointed me to the predefined 'egress' group name, which means 'all interfaces with the default route'. You can use 'egress:0' to mean the (first) egress IP address.

We're not savages, so of course we use pf.conf macros in our pf.confs:

server_if = "bnx0"
server_ip = 192.168.100.100

[...]
pass in on $server_if from any to $server_ip port = 22

But this still means that we have to define the name of the interface and the server's IP once. When we have two OpenBSD machines that are clones of each other, for example two OpenVPN servers for redundancy, this historically means that we've had to have two copies of pf.conf that are supposed to differ only in server IP.

(Usually we put the two servers on identical hardware, so the interface names are the same. If we used sufficiently different hardware that the interface names changed, we'd have to vary that too.)

However, it turns out that you can get what I want and, for simple configurations, reduce this to a completely generic version. First and well documented in the manpage for pf.conf is that you can use an interface name in PF rules in place of an IP address (cf). If you do, it means all of the IP addresses associated with that interface. If you don't want aliases included, you can add :0 on the end to get just the first address (cf). Combined with macros, we can write:

server_ip = $server_if:0

Written this way, the address is looked up once when the PF ruleset is loaded and then substituted in. If you dump the installed rules with 'pfctl -s rules', you'll see the actual IPs, and the resulting rules are exactly the same as if you'd specified the IP directly.

To get generic names for interfaces, we need to use the name of interface groups, which are documented in the ifconfig manpage. Conveniently there is a predefined group that does what I want, the 'egress' group:

  • The interfaces the default routes point to are members of the “egress” interface group.

(There can be more than one default route pointing out more than one interface, but in normal use on our servers there is exactly one interface with a default route.)
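Putting this together, a completely generic version of the earlier pf.conf example could look like this (a sketch that assumes the machine has exactly one interface with the default route):

server_if = "egress"
server_ip = $server_if:0

[...]
pass in on $server_if from any to $server_ip port = 22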

If you don't want to rely on where the default route is pointing you can explicitly specify a custom group in /etc/hostname.ifname and then use it in a macro in the PF rules. I'm not sure how to best write this in the file, but sort of following information from the hostname.if(5) manpage, I found that it worked to write it on two lines:

inet 192.168.100.100 0xffffff00
group net-sandbox
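On the PF side, you then use the group name the same way you would use 'egress', for example:

sandbox_if = "net-sandbox"
sandbox_ip = $sandbox_if:0

pass in on $sandbox_if from any to $sandbox_ip port = 22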

All of this sounds great but it has a tiny little drawback, which is that it makes your PF configuration a bit more magical. Explicitly writing out the interface name and IP may be annoying some of the time, but it's always extremely obvious what is going on. You don't have to try to remember what 'egress' or 'net-sandbox' means when used as an interface name (or an IP address); it's always right there. Also, you're absolutely guaranteed that your rules are matching only a single IP address or a single interface. With interface group names, you're relying on the rest of your configuration to ensure that there is only ever one 'egress' or 'net-sandbox' interface, no matter what you do to the machine.

A related issue is that the meaning of 'egress', 'net-sandbox', and the like can change between PF ruleset loads (and the associated IPs along with them), without any direct changes to pf.conf. This means that you can boot with the system in one setup, change it for some reason, do a 'pfctl -f /etc/pf.conf' with an unchanged pf.conf, and wind up with a different set of rules. In some environments this is a feature; in others it is a drawback, or at least a potential unpleasant surprise.

(What triggered this today was testing out a version of our OpenBSD VPN servers on the current OpenBSD in a virtual machine. Of course I needed their standard pf.conf, but my virtual machine had both a different IP address and a different interface than the current real servers.)

OpenBSDPFGenericNames written at 23:42:38

2020-11-16

POSIX write() is not atomic in the way that you might like

I was recently reading Evan Jones' Durability: Linux File APIs. In this quite good article, I believe that Jones makes a misstep about what you can assume about write() (both in POSIX and in practice). I'll start with a quote from the article:

The write system call is defined in the IEEE POSIX standard as attempting to write data to a file descriptor. After it successfully returns, reads are required to return the bytes that were written, even when read or written by other processes or threads (POSIX standard write(); Rationale). There is an addition note under Thread Interactions with Regular File Operations that says "If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them." This suggests that all file I/O must effectively hold a lock.

Does that mean write is atomic? Technically, yes: future reads must return the entire contents of the write, or none of it. [...]

Unfortunately, that writes are atomic in general is not what POSIX is saying and even if POSIX tried to say it, it's extremely likely that no Unix system would actually comply and deliver fully atomic writes. First off, POSIX's explicit statements about atomicity apply only in two situations: when anything is writing to a pipe or a FIFO, or when there are multiple threads in the same process all performing operations. What POSIX says about writes interleaved with reads is much more limited, so let me quote it (emphasis mine):

After a write() to a regular file has successfully returned:

  • Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.

This does not require any specific behavior for read()s on files that are started by another process before the write() returns (including ones started before the write() began). If you issue such a read(), POSIX allows it to see none, some, or all of the data from the write(). Such a read() is only (theoretically) atomic if you issue it from another thread within the same process. This definitely doesn't provide the usual atomicity property that everyone sees either all of an operation or none of it, since a cross process read() performed during the write() is allowed to see partial results. We would not call a SQL database that allowed you to see partially complete transactions 'atomic', but that is what POSIX allows for write() to files.

(It is also what real Unixes almost certainly provide in practice, although I haven't tested this and there are many situations. For instance, I wouldn't be surprised if aligned, page-sized writes (or filesystem block sized ones) were atomic in practice on many Unixes.)

If we think about what it would take to implement atomic file writes across processes, this should be unsurprising. Since Unix programs don't expect short writes on files, we can't make the problem simpler by limiting how large a write we have to make atomic and then capping write() to that size; people can ask us to write megabytes or even gigabytes in a single write() call and that would have to be atomic. This is too much data to be handled by gathering it into an internal kernel buffer and then flipping the visible state of that section of the file in one action. Instead this would likely require byte range locking, where write() and read() lock against each other where their ranges overlap. This is already a lot of locking activity, since every write() and every read() would have to participate.

(You could optimize read() from a file that no one has open for writing, but then it would still need to lock the file so that it can't be opened for writing until the read() completes.)

But merely locking against read() is not good enough on modern Unixes, because many programs actually read data by mmap()'ing files. If you really want write() to be usefully atomic, you must make these memory mapped reads lock against write() as well, which requires relatively expensive page table manipulation. Worrying about mmap() also exposes a related issue, which is that when people read through memory mapping, write() isn't necessarily atomic even at the level of individual pages of memory. A reader using mapped memory may see a page that's half-way through the kernel's write() copying bytes into it.

(This may happen even with read() and write(), since they may both access the same page of data from the file in the kernel's buffer cache, but it is probably easier to lock things there.)

On top of the performance issues, there are fairness issues. If write() is atomic against read(), a long write() or a long read() can stall the other side for potentially significant amounts of time. People do not enjoy slow and delayed read() and write() operations. This also provides a handy way to DoS writers of files that you can open for reading; simply set up to read() the entire file in one go (or in as few read()s as possible) over and over again.

However, much of this cost is because we want cross process atomic write()s, which means that the kernel must be the one doing the locking work. Cross thread atomic write() can be implemented entirely at user level within a single process (provided that the C library intercepts read() and write() operations when threading is active). In a lot of cases you can get away with some sort of simple whole file locking, although the database people will probably not be happy with you. Fairness and stalls are also much less of an issue within a single process, because the only person you're hurting is yourself.

(Most programs do not read() and write() from the same file at the same time in two threads.)

PS: Note that even writes to pipes and FIFOs are only atomic if they are small enough; large writes explicitly don't have to be atomic (and generally aren't on real Unixes). It would be rather unusual for POSIX to specify limited size atomicity for pipes and unlimited size atomicity for regular files.

PPS: I would be wary of assuming that any particular Unix actually fully implemented atomic read() and write() between threads. Perhaps I'm being cynical, but I would test it first; it seems like the kind of picky POSIX requirement that people would cut out in the name of simplicity and speed.

WriteNotVeryAtomic written at 23:40:15

Unix doesn't normally do short write()s to files and no one expects it to

A famous issue in handling network IO on Unix is that write() may not send all of your data; you will try to write() 16 KB of data, and the result will tell you that you only actually wrote 4 KB. Failure to handle this case leads to mysteriously lost data, where your sending program thinks it sent all 16 KB but of course the receiver only saw 4 KB. It's very common for people writing network IO libraries on Unix to provide a 'WriteAll' or 'SendAll' operation, or sometimes make it the default behavior.

(Go's standard Write() interface requires full writes unless there was an error, for example.)

In theory the POSIX specification for write() allows it to perform short writes on anything (without an error), not just network sockets, pipes, and FIFOs. In particular it is allowed to do them for regular files, and POSIX even documents some situations where this may happen (for example, if the process received a signal part way through the write() call). In practice, Unixes do not normally do short write()s to files without an error occurring, outside of the special case of a write() being interrupted by a signal that doesn't kill the process outright.

(If the process dies on the spot, there is no write() return value.)

In theory, because it's possible, every Unix program that write()s to a file should be prepared to handle short writes. In practice, since it doesn't really happen, many Unix programs are almost certainly not prepared to handle it. If you (and they) are lucky, these programs check that the return value of the write() is the amount of data they wrote and error out otherwise. Otherwise, they may ignore the write() return value and cheerfully sail on with data lost. Of course they don't actually error out or lose data in practice, because short write()s don't really happen on files.

(Some sorts of programs are generally going to be okay because they are already very careful about data loss. I would expect any good editor to be fine, for example, or at least to report an error.)

This difference between theory and practice means that it would be pretty dangerous to introduce a Unix environment that did routinely have short writes to files (whether it was a new Unix kernel or, say, a peculiar filesystem). This environment would be technically correct and it would be uncovering theoretical issues in programs, but it would probably not be useful.

PS: Enterprising parties could arrange to test this with their favorite programs through a loadable shared library that intercepts write() and shortens the write size. I suspect that you could get an interesting undergraduate Computer Science paper out of it.

WritesNotShortOften written at 00:12:50
