The pending delete problem for Unix filesystems
Unix has a number of somewhat annoying filesystem semantics that
tend to irritate designers and implementors of filesystems. One of
the famous ones is that you can delete a file without losing access
to it. On at least some OSes, if your program
open()s a file and
then tries to delete it, either the deletion fails with 'file is
in use' or you immediately lose access to the file; further attempts
to read or write it will fail with some error. On Unix your program
retains access to the deleted file and can even pass this access
to other processes in various ways. Only when the last process using
the file closes it will the file actually get deleted.
This 'use after deletion' presents Unix and filesystem designers
with the problem of how you keep track of this in the kernel. The
historical and generic kernel approach is to keep both a link count
and a reference count for each active inode; an inode is only marked
as unused and the filesystem told to free its space when both counts
go to zero. Deleting a file via
unlink() just lowers the link
count (and removes a directory entry); closing open file descriptors
is what lowers the reference count. This historical approach ignored
the possibility of the system crashing while an inode had become
unreachable through the filesystem and was only being kept alive
by its reference count; if this happened the inode became a zombie,
marked as active on disk but not referred to by anything. To fix
it you had to run a filesystem checker, which would
find such no-link inodes and actually deallocate them.
(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)
Obviously this is not suitable for any sort of journaling or 'always
consistent' filesystem that wants to avoid the need for a filesystem
check after unclean shutdowns. All such filesystems must keep track of
such 'deleted but not deallocated' files on disk using some mechanism
(and the kernel has to support telling filesystems about such
inodes). When the filesystem is unmounted in an orderly way, these
deleted files will probably get deallocated. If the system crashes,
part of bringing the filesystem up on boot will be to apply all of
the pending deallocations.
Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.
(Asynchronous deallocation is popular because it means your process
can unlink() a big file without having to stall while the kernel
frantically runs around finding all of the file's data blocks and
then marking them all as free. Given that finding out what a file's
data blocks are often requires reading things from disk, such deallocations can be relatively
slow under disk IO load (even if you don't have other issues there).)
PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).
In Go, you need to always make sure that your goroutines will finish
Yesterday I described an approach to writing lexers in Go that pushed the actual lexing into a separate goroutine, so that it could run as straight-line code that simply consumed input and produced a stream of tokens (which were sent to a channel). Effectively we're using a goroutine to implement what would be a generator in some other languages. But because we're using goroutines and channels, there's something important we need to do: we need to make sure the lexer goroutine is run to completion, so that the goroutine will actually finish.
Right now you may be saying 'well of course the lexer will always be run to the end of the input, that's what the parser does'. But not so fast; what happens if the parser runs into a parse error because of a syntax error or the like? The natural thing to do in the parser is to immediately error out without looking at any further tokens from the lexer, which means that the actual lexer goroutine will stall as it sits there trying to send the next token into its communication channel, a channel that will never be read from because it's been abandoned by the parser.
The answer here is that the parser must do something to explicitly run the lexer to completion or otherwise cause it to exit, even if the tokens the lexer is producing will never be used. In some environments having the lexer process all of the remaining input is okay because it will always be small (and thus fast), but if you're lexing large bodies of text you'll want to arrange some sort of explicit termination signal via another channel or something.
This is an important way in which goroutines and channels aren't a perfect imitation of generators. In typical languages with generators, abandoning a generator results in it getting cleaned up via garbage collection; you can just walk away without doing anything special. In Go with goroutines, this isn't the case; you need to consider goroutine termination conditions and generally make sure it always happens.
You might think that this is a silly bug and of course anyone who uses goroutines like this will handle it as a matter of course. If so, I regret to inform you that I didn't come up with this realization on my own; instead Rob Pike taught it to me with his bugfix to Go's standard text/template module. If Rob Pike can initially overlook this issue in his own code in the standard library, anyone can.
Go goroutines as a way to capture and hold state
The traditional annoyance when writing lexers is that lexers have internal state (at least their position in the stream of text), but wind up returning tokens to the parser at basically random points in their execution. This means holding the state somewhere and writing the typical start/stop style of code that you find at the bottom of a pile of subroutine calls; your 'get next token' entry point gets called, you run around a bunch of code, you save all your state, and you return the token. Manual state saving and this stuttering style of code execution doesn't lend itself to clear logic.
Some languages have ways around this structure. In languages with generators, your lexer can be a generator that yields tokens. In lazy evaluation languages your lexer turns into a stream transformation from raw text to tokens (and the runtime keeps this memory and execution efficient, only turning the crank when it needs the next token).
In Rob Pike's presentation on lexing in Go, he puts the lexer
code itself into its own little goroutine. It produces tokens by
sending them to a channel; your parser (running separately) obtains
tokens by reading the channel. There are two ways I could put what
Rob Pike's done here. The first is to say that you can use
goroutines to create generators, with a channel send and receive
taking the place of a
yield operation. The second is that
goroutines can be used to capture and hold state. Just as with
ordinary threads, goroutines turn
asynchronous code with explicitly captured state into synchronous
code with implicitly captured state and thus simplify code.
(I suppose another way of putting it is that goroutines can be used for coroutines, although this feels kind of obvious to say.)
I suspect that this use for goroutines is not new for many people (and it's certainly implicit in Rob Pike's presentation), but I'm the kind of person who sometimes only catches on to things slowly. I've read so much about goroutines for concurrency and parallelism that the nature of what Rob Pike (and even I) were doing here didn't really sink in until now.
(I think it's possible to go too far overboard here; not everything needs to be a coroutine or works best that way. When I started with my project I thought I would have a whole pipeline of goroutines; in the end it turned out that having none was the right choice.)
It's time to stop coddling software that can't handle HTTPS URLs
A couple of years ago I moved my personal website from plain HTTP to using HTTPS. When I did that, one of the lessons I learned was that there were a certain number of syndication feed fetchers that didn't support HTTPS requests at all. My solution at the time was to sigh and add some bits to my lighttpd configuration so they'd be allowed to still fetch the HTTP version of my syndication feeds. Now I'm in the process of moving this blog from HTTP to HTTPS and so I've been considering what I'll do about issues like this for here. This time around my decision is that I'm not going to create any special rules; anything fetching syndication feeds or web pages from here that can't do HTTPS (or follow redirections) is flat out of luck.
There are some pragmatic reasons for this, but ultimately it comes down to that I think it's now clearly time that we stopped accepting and coddling software that can only deal with HTTP URLs. The inevitable changes of the Internet have rendered such software broken. It's clear that HTTPS is increasingly the future of web activity and also clear that a decent number of sites will be moving to it via HTTP to HTTPS redirection. Software that cannot cope with both of these is decaying; the more sites that do this, the more pragmatically broken the software is.
I'm not going to say that you should never give in and accommodate decaying, broken software; if nothing else, I certainly have made some accommodations myself. But when I do that, I do it on a case by case basis and only when I've decided that it's sufficiently important; I don't do it generally. Coddling broken software in general only prolongs the pain, not just for you but for everyone. In this case, the more we accommodate HTTP only software the more traffic remains HTTP (and subject to snooping and alteration) instead of moving to HTTPS. HTTPS is not ideal, but it's clear that an HTTPS network is an improvement over the HTTP one we have today in practice.
This is likely going to hurt me somewhat (and already has, as some Planets (also) that carry Wandering Thoughts apparently haven't coped with this). But even apart from the pragmatic impossibility of trying to pick through all of the requests to here to see which aren't successfully transitioning to HTTPS, I'm currently just not willing to coddle such bad software any more. It's 2015. You'd better be ready for the HTTPS transition because it's coming whether you like it or not.
The reason I feel like this now when I didn't originally is pretty simple: more time has passed. The whole situation with HTTP and HTTPS on the Internet has evolved significantly since 2013, and there is now real and steadily increasing momentum behind the HTTPS shift. What was kind of wild eyed and unreasonable in 2013 is increasingly mainstream.
The problem with proportional fonts for editing code and related things
One of the eternal attractive ideas for programmers, sysadmins, and other people who normally spend a lot of time working with monospaced fonts in editors, terminal emulators, and so on is the idea of switching to proportional fonts. I've certainly considered it myself (there are various editors and so on that will do this) but I've consistently rejected trying to make the switch.
The big problem is text alignment, specifically what I'll call 'interior' text alignment. Having things line up vertically is quite important for readability and I'm not willing to do without it. At one level there's no problem; automatically lining up leading whitespace is a solved issue, and other things you can align by hand. At another level there's a big problem, because I need to interact with an outside world that uses monospace fonts; the stuff I carefully line up in my proportional fonts editor needs to look okay for them and the stuff that they carefully line up in monospace fonts needs to look okay for me. And automatically detecting and aligning things based on implied columns is a hard problem.
(I used to use a manual page reader that used proportional fonts by default. It made some effort to align things but not enough, and on some manual pages the results came out really terrible. This experience has convinced me that proportional fonts with bad alignment are significantly worse than monospaced fonts.)
This is probably not an insoluble problem. But it means that simply writing an editor that uses proportional fonts is the easy part; even properly indenting leading whitespace is the easy part. In turn this means that you need a very smart editor to make using proportional fonts really a nice experience, especially if you routinely interact with code from outside your own sphere. Really smart editors are rare and relatively prickly and opinionated; if you don't like their interface and behavior, well, you're stuck. You're also stuck if you're strongly attached to editors that don't have this kind of smarts.
(The same logic holds for things like terminal programs but even more so. A really smart terminal program that used proportional fonts would have to detect column alignment in output basically on the fly and adjust things.)
So I like the idea of using proportional fonts for this sort of stuff in theory, but I'm pretty sure that in practice I'm never going to find an environment that fully supports it that works for me.
(For those people who wonder why you'd want to consider this idea at all: proportional fonts are usually more readable and nicer than monospaced fonts. This entry is basically all plain text, so you can actually look at the DWikiText monospaced source for it against the web browser version. At least for me, the web browser's proportional font version looks much better.)
Our mail submission system winds up handling two sorts of senders
Yesterday I mentioned that while in
theory our mail submission system could use sender verification to
check whether a
MAIL FROM address at an outside domain was valid,
I didn't feel this was worth it. One of the reasons I feel
this way is that I don't think this check will fail very often for
most outside domains, and to do that I need to talk about how we
have two sorts of senders: real people and machines.
Real people are, well, real people with a MUA who are sending email
out through us. My view is that while real people may send out email
using outside domains in their
From: address, it's extremely
likely that this address will be correct; if it's not correct, the
person is probably going to either notice it or get told by people
they are trying to talk to through some out of band mechanism.
Unless you're very oblivious and closed off, you're just not going
to spend very long with your MUA misconfigured this way. On top of
it, real people have to explicitly configure their address in their
MUA, which means there is a whole class of problems that get avoided.
Machines are servers and desktops and everything we have sitting
around on our network that might want to send status email, report
in to its administrator, spew out error reports to warn people of
stuff, and so on. Email from these machines is essentially
unidirectional (it goes out from the machine but not back), may not
be particularly frequent, and is often more or less automatically
configured. All of this makes it very easy for machines to wind up
with bad or bogus
MAIL FROMs. Often you have to go out of your
way during machine setup in order to not get this result.
(For instance, many machines will take their default domain for
MAIL FROMs from DNS PTR results, which malfunctions in the presence
of internal private zones.)
Most broken machine origin addresses are easily recognized, because they involve certain characteristic mistakes (eg using DNS PTR results as your origin domain). Many of these addresses cannot be definitively failed with sender verification because, for example, the machine doesn't even run a SMTP listener that you can talk to.
You can mostly use sender verification for addresses from real people, but even ignoring the other issues there's little point because they'll almost never fail. Real people will almost always be using sender addresses from outside domains, not from internal hostnames.
What addresses we accept and reject during mail submission
Like many places, our mail setup
includes a dedicated mail submission machine (or two). I mentioned yesterday
that this submission machine refuses some
MAIL FROM addresses,
so today I want to talk about what we accept and refuse during mail
submission and why.
When we were designing our mail submission configuration many years
ago, our starting point was that we didn't expect clients to deal
very well if the submission server gave them a failure response.
What you'd like is for the MUA to notice the error, report it, give
you a chance to re-edit the email addresses involved, and so on and
so forth. What we actually expected would happen would be some
combination of lost email, partially delivered email (if some
RCPT TOs failed but others succeeded), and awkward interfaces for dealing
with failed email sending. So a big guiding decision was that our
mail submission machine should accept the email if at all possible,
even if we knew that it would partially or completely fail delivery.
It was better to accept the email and send a bounce rather than
count on all of the MUAs that our users use to get it right.
(Some but not all
RCPT TO addresses failing during SMTP is a
somewhat challenging problem for any MUA to deal with. How do you
present this to the user, and what do you want to do when the user
corrects the addresses? For example, if the user corrects the
addresses and resends, should it be resent to all addresses or just
the corrected ones? There's all sorts of UI issues involved.)
Given that our recovery method for bad destination addresses is
sending a bounce, we need to have what at least looks like a valid
MAIL FROM to send the bounce back to; if we don't we can't send
bounces, so we're better off rejecting during SMTP and hoping that
the MUA will do something sensible. For email addresses in outside
domains, the practical best we can do is verify that the domain
exists. For email addresses in our own domain, we can check that
the local part is valid (using our list of valid local parts), so we do.
(We also do some basic safety checks for certain sorts of bad
characters and bad character sequences in
MAIL FROM and RCPT TO
addresses. These probably go beyond what the RFCs require and may
not be doing anything useful these days; we basically inherited
them from the stock Ubuntu configuration of close to a decade ago.)
We allow people to use
MAIL FROM addresses that are not in our
domain in part because some people in the department have a real
need to do this as part of their work. In general we log enough
source information that if anyone abuses this we can find them and
deal with it.
(You might say 'but what about spammers compromising accounts and sending spam through you with forged origin addresses?' My answer is that that's a feature.)
PS: In theory checking outside domain
MAIL FROM addresses is one
place where sender verification has a real justification, and you
can even legitimately use the null sender address for it. In practice there are all sorts of
failure modes that seem likely to cause heartburn and it's just not
worth it in my opinion.
Sometimes it's useful to have brute force handy: an amusing IPMI bug
Once upon a time we had gotten in some new servers. These servers
had an IPMI and the IPMI could be configured to send out email
alerts if something happened, like a fan stopping or a power supply
losing power. Getting such alerts (where possible) seemed like a
good idea, so I dutifully configured this in the IPMI's web interface.
Sensibly, the IPMI needed me to set the origin address for the
email, so I set it to
sm-ipmi@<us> (and then made sure there was an
sm-ipmi alias, so our mail submission machine would accept the address).
Of course, configurations are never quite done until they're tested.
So I poked the IPMI to send me some test email. No email arrived.
When I went off to our mail submission machine to look at its logs,
I got rather a surprise; the logs said the machine had dutifully
rejected a message that claimed a MAIL FROM address of '=sm-ipmi@<us>'.
While the insides of an IPMI's embedded software are inscrutable
(at least to lazy sysadmins who are not security researchers), this
smells like some form of classic data storage mismatch bug. The web
interface thinks the email address should be stored with an '=' in
front, maybe as an '
X=Y' thing, whereas whatever is actually using
the address either has an off by one character parsing bug or doesn't
want the extra leading
= that the web interface is adding when
it stores it.
There are probably a bunch of ways we could have dealt with this.
As it happens our mail system is flexible enough to let us do the
brute force approach: we just defined an alias called '=sm-ipmi'.
Our mail system is willing to accept an '=' in local parts, even
at the start, so that's all it took to make everything happy. It
looks a little bit peculiar in the actual email messages, but that's
just a detail.
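Rendered in classic aliases(5) style, the brute force fix is a one-liner. This is a hypothetical rendering with a made-up delivery target, not our actual alias file:

```
# Make the IPMI's '='-prefixed origin address a valid local part.
=sm-ipmi: sysadmins@example.edu
```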
A more picky email system would have given us more heartburn here.
In a way we got quite lucky that none of the many levels of checks
and guards we have choked on this. Our alias generation system was
willing to see '=' as a valid character, even at the start; the
basic syntax checks we do on
MAIL FROM didn't block a = at the
start; Exim itself accepts such a
MAIL FROM local part and can
successfully match it against things. I've used mail systems in the
past that were much more strict about this sort of stuff and they'd
almost certainly have rejected such an address out of hand or at
least given us a lot of trouble over it.
(I don't even know if such an address is RFC compliant.)
The whole situation amuses me. The IPMI has a crazy, silly bug that should never have slipped through development and testing, and we're dealing with it by basically ignoring it. We can do that because our mail system is itself willing to accept a rather crazy local part as actually existing and being valid, which is kind of impressive considering how many different moving parts are involved.
PS: I call this the brute force solution because 'make an alias with a funny character in it' is more brute force than, say, 'figure out how to use sender address rewriting to strip the leading = that the IPMI is erroneously putting in there'.
PPS: Of course, some day maybe we'll update the IPMI firmware and
suddenly find the notification mail not going through because the
IPMI developers noticed the bug and fixed it. I suppose I should
add the '
sm-ipmi' alias back in, just in case.
Why keeping output to 80 columns (or less) is still sensible
When I talked about how monitoring tools should report timestamps and other identifying information, I mentioned that I felt that keeping output to 80 columns or less was still a good idea even if it meant sometimes optionally omitting timestamps. So let's talk about that, since it's basically received wisdom these days that the 80 column limit is old fashioned, outdated, and unnecessary.
I think that there are still several reasons that short output is sensible, especially at 80 columns or less. First, 80 columns is still the default terminal window size in many environments; if you make a new one and do nothing special, 80 columns is what you get by default (often 80 by 24). This isn't just on Unix systems; I believe that eg Windows often defaults to this size for both SSH client windows and its own command line windows. This means that if your line spills over 80 columns, many people have to take an extra step to get readable results (by widening their default sized window) and they may mangle some existing output for the purposes of eg cut and paste (since many terminal windows still don't re-flow lines when the window widens or narrows).
Next, there's an increasingly popular class (or classes) of device with relatively constrained screen size, namely smartphones and small tablets. Even a large tablet might only be 80 columns wide in vertical orientation. Screen space is precious on those devices and there's often nothing the person using the device can really do to get any more of it. And yes, people are doing an increasing amount of work from such devices, especially in surprise situations where a tablet might be the best (or only) thing you have with you. Making command output useful in such situations is an increasingly good idea.
Finally, overall screen real estate can be a precious resource even on large-screen devices because you can have a lot of things competing for space. And there are still lots of situations where you don't necessarily need timestamps and they'll just add clutter to output that you're actively scanning. I won't pretend that my situation is an ordinary one; there are plenty of times where you're basically just glancing at the instantaneous figures every so often or looking at recent past or the like.
(As far as screen space goes, often my screen winds up completely covered in status monitoring windows when I'm troubleshooting something complicated. Partly this is because it's often not clear what statistic will be interesting so I want to watch them all. Of course what this really means is that we should finally build that OS level stats gathering system I keep writing about. Then we'd always be collecting everything and I wouldn't have to worry about maybe missing something interesting.)
Unix's pipeline problem (okay, its problem with file redirection too)
In a comment on yesterday's entry, Mihai Cilidariu sensibly suggested that I not add timestamp support to my tools but instead outsource this to a separate program in a pipeline. In the process I would get general support for this and complete flexibility in the timestamp format. This is clearly and definitely the right Unix way to do this.
Unfortunately it's not a good way in practice, because of a fundamental pragmatic problem Unix has with pipelines. This is our old friend block buffering versus line buffering. A long time ago, Unix decided that many commands should change their behavior in the name of efficiency; if they wrote lines of output to a terminal you'd get each line as it was written, but if they wrote lines to anything else you'd only get it in blocks.
This is a big problem here because obviously a pipeline like 'monitor |
timestamp' basically requires the monitor process to produce output
a line at a time in order to be useful; otherwise you'd get large blocks
of lines that all had the same timestamp because they were written to
the timestamp process in a block. The sudden conversion from line
buffered to block buffered can also affect other sorts of pipeline
processing.
It's certainly possible to create programs that don't have this problem, ones that always write a line at a time (or explicitly flush after every block of lines in a single report). But it is not the default, which means that if you write a program without thinking about it or being aware of the issue at all you wind up with a program that has this problem. In turn people like me can't assume that a random program we want to add timestamps to will do the right thing in a pipeline (or keep doing it).
(Sometimes the buffering can be an accidental property of how a program was implemented. If you first write a simple shell script that runs external commands and then rewrite it as a much better and more efficient Perl script, well, you've probably just added block buffering without realizing it.)
In the end, what all of this really does is that it chips away quietly at the Unix ideal that you can do everything with pipelines and that pipelining is the right way to do lots of stuff. Instead pipelining becomes mostly something that you do for bulk processing. If you use pipelines outside of bulk processing, sometimes it works, sometimes you need to remember odd workarounds so that it's mostly okay, and sometimes it doesn't do what you want at all. And unless you know Unix programming, why things are failing is pretty opaque (which doesn't encourage you to try doing things via pipelines).
(This is equally a potential problem with redirecting program output to files, but it usually hits most acutely with pipelines.)