2016-06-05
My approach for inspecting Go error values
Dave Cheney sort of recently wrote Don't just check errors, handle
them gracefully,
where he strongly suggested that basically you should never check
the actual values of errors. This is generally a great idea, but
sometimes you don't have a choice. For example, for a long time it
was the case that the only way to tell if your DNS lookup had hit
a temporary DNS error (such as 'no authoritative servers for this
domain are responding') or a permanent one ('this name doesn't
exist') was to examine the specific error that you received. While
net.DNSError had a .Temporary() method, it didn't return true
in enough cases; you had to go digging deeper to know.
(This was Go issue 8434 and has since been fixed, although it took a while.)
When I had to work around this issue (in code that I suppose I should now remove), I was at least smart enough to try the official way first:
var serverrstr = "server misbehaving"

// isTemporary reports whether a DNS lookup error should be treated
// as temporary, working around Go issue 8434.
func isTemporary(err error) bool {
    if e, ok := err.(*net.DNSError); ok {
        // Check the official way first, then fall back to
        // matching the known error string.
        if e.Temporary() || e.Err == serverrstr {
            return true
        }
    }
    return false
}
Checking the official way first made it so that once this issue was
resolved, my code would immediately start relying on the official
way. Checking the error string only for net.DNSError errors made
sure that I wouldn't get false positives from other error types,
which seemed like a good idea at the time.
When I wrote this code I felt reasonably smart about it; I thought
I'd done about as well as I could. Then Dave Cheney's article showed
me that I wasn't quite doing this right; as he says in one section
('Assert errors for behaviour, not type'), I should have really
checked for .Temporary() through an interface instead of just
directly checking the error as a net.DNSError. After all, maybe
someday net.LookupMX() and company will return an additional type
of error in some circumstances that has a .Temporary() method;
if that happened, my code here wouldn't work right.
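For illustration, here's a minimal sketch of the behaviour-based
version (the temporary interface name is my own, not anything in
the net package; the string fallback stays scoped to *net.DNSError):

type temporary interface {
    Temporary() bool
}

func isTemporary(err error) bool {
    // Any error type with a Temporary() method now counts, not
    // just *net.DNSError.
    if te, ok := err.(temporary); ok && te.Temporary() {
        return true
    }
    // Fall back to the known error string check from before.
    if e, ok := err.(*net.DNSError); ok {
        return e.Err == serverrstr
    }
    return false
}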
(I even put some comments in musing about the idea, but then rejected
it on the grounds that the current net package code didn't do
that so there didn't seem to be any point. In retrospect that was
the wrong position to take, because I wasn't thinking about potential
future developments in the net package.)
I'm conflicted over whether to type-assert to specific error types if you
have to check the actual error value in some way (as I do here). I
think it comes down to which way is safer for the code to fail. If
you check the value through error.Error(), future changes in the
code you're calling may cause you to match on things that aren't
the specific error type you're expecting. Sometimes this will be
the right answer and sometimes it will be the wrong one, so you
have to weigh the harm of a false positive against the harm of a
false negative.
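To illustrate the false positive side with a deliberately bad
hypothetical version, matching purely on error.Error() will trigger
on any error type whose text happens to match (this uses the standard
strings package):

func looksTemporary(err error) bool {
    // This matches *any* error whose text contains the string,
    // not just *net.DNSError; that is where false positives
    // come from.
    return err != nil && strings.Contains(err.Error(), serverrstr)
}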
2016-06-04
The (Unix) shell is not just for running programs
In the Reddit comments on yesterday's entry, I ran across the following comment:
No. The shell literally has the sole purpose of running external programs. Anything more is extra.
The V1 shell read a line, split on whitespace, and executed the command from /bin. You could change the current directory from in the shell, that was it.
On any version of Unix as far back as at least V7, this is false. The Unix shell may have started out simply being a way to run programs, but it long ago stopped being just that. Since the V7 shell is a ground up rewrite, one cannot even argue that the shell simply drifted into these additional features for convenience. The V7 shell was consciously designed from scratch, and as part of that design it included major programming features including control flow constructs drawn directly from the general Algol line of computer language design. Inclusion of these programming features is not an accident and not a drift over time; it is a core part of the shell's design and thus its intended purpose. The V7 shell is there both to run programs and to write programs (shell scripts), and this is completely intended.
(In terms of control flow, I'm thinking here of if, while, and
for, and there's also case.)
In short, the shell as in part a programming language is part of Unix's nature from at least the first really popular Unix version (V7 became the base of many further lines of Unix). To the extent that the Unix design ethos or philosophy exists as a coherent thing, it demonstrably includes a strongly programmable shell.
You can make an argument that the V6 shell (the 'Mashey shell') shows this too, but it was apparently a derivative of and deliberately backwards compatible with the original 'just run things' Thompson shell. The V7 Bourne shell is a clear, from scratch break with the original Thompson shell, and it was demonstrably accepted by Research Unix as being, well, proper Unix.
(If you want even more proof that Research Unix's view of the shell
includes programming, the shell was reimplemented once again for
Version 10 and Plan 9 in the form of Tom Duff's rc shell and, you guessed it, that included
programmability too, this time with more C-like syntax instead of
the Algol-like syntax of the Bourne shell.)
(You can argue that this conjoining of 'just run programs for people' and 'write shell scripts' in a single program is a mistake and these roles should be split apart into two programs, but that's a different argument. I happen to think that it's also wrong, and on more than one level.)
2016-06-03
One thing that makes the Bourne shell an odd language
In many ways, the Bourne shell is a relatively conventional programming
language. It has a few syntactic abnormalities, a few flourishes
created by the fact that it is an engine for running programs
(although other languages have featured equivalents of $(...) in
the form of various levels of 'eval' functionality), and a different
treatment of unquoted words, but the
overall control structure is an extremely familiar Algol-style one
(which is not surprising, since Steve Bourne really liked Algol).
But the Bourne shell does have one thing that clearly makes it an
odd language, namely that it has outsourced what are normally core
language functions to external programs. Or rather it started out
in its original version by outsourcing those functions; versions
of the Bourne shell since then have pulled them back in in various
ways. Here I am thinking of both evaluating conditionals via test
aka [ and arithmetic via expr (which also does some other things).
(Bourne shells have had test as a builtin for some time (sometimes
with some annoyances), and built-in arithmetic is often present
these days as $((...)).)
There's no reason why test has to be a separate program, and neither
test nor expr seems to have existed in Research Unix V6, so
they both first appeared in V7 along with the Bourne shell itself. They
aren't written in BourneGol, so they
may not have been written by Steve Bourne himself, but at least
test was clearly written as a companion program (the V7 Bourne
shell manpage
explicitly mentions it, among other things).
I don't know why the original Bourne shell made this decision. It's
possible that it was simply forced by the limitations of the PDP-11
environment of V7. Maybe a version of the Bourne shell with test
and/or expr built into the main shell code would have either been
too big or just considered over-bloated for something that would
mostly be used interactively (and thus not be using test et al
very often). Or possibly they were just easier to write as separate
programs (the V7 expr is just a single yacc file).
Note that there are structural reasons in the Bourne shell to make
the conditions of if et al be the exit status of commands, instead
of restricting them to (only) actual conditional expressions. But the original Bourne shell
could have done this with test or the equivalent as a built-in
command, and it certainly has other built in commands. Perhaps
test needing to be an actual command was one of the things that
pushed it towards not being built in. You can certainly see a spirit
of minimalism at work here if you want to (although I have no idea
if that's the reason).
(This expands on a tweet of mine.)
Sidebar: It's not clear when test picked up its [ alias
Before I started writing this entry, I expected that test was
also known as [ right from the beginning in V7. Now I'm not
so sure. On the one hand, the actual V7 shell scripts I can find
eg here
consistently use test instead of [ and the V7 compile scripts
don't seem to create a [ hardlink. On the other hand, the V7
test source
already has special magic handling if it's invoked as [.
(There are V7 disk images out there that you can boot up on a PDP-11
emulator, so in theory I could fire one up and see if it has a
/bin/[. In practice I'm not that energetic.)
2016-06-02
Spammers can abandon SMTP connections not infrequently
As a result of looking at my SMTP session logs, one of the things that I've started tracking on my 'sinkhole' spamtrap SMTP server is how many senders reach the point where they actively get rejected by my server versus how many senders just disconnect with incomplete sessions where everything has gone fine up to that point. My SMTP session logging said that at least some just gave up, but I wasn't sure how many did this.
(Under normal circumstances you'd expect real sending mailers to almost never just abandon an incomplete session. It's not 'never' because there will always be some sending mailers that have their machine reboot out from underneath them or the like as they're trying to send out a message, but this is not exactly common so it should be very low.)
My results so far are early and somewhat incomplete, but I'll give
you representative numbers anyways. The numbers I have handy right
now are that over the past two and a half days, I've seen 123
abandoned sessions to 440 sessions with refused SMTP commands, or
about a fifth of the sessions are just being abandoned. I don't
particularly have data on where the sessions are being abandoned,
but my SMTP logs show that some senders drop the connection
while I'm sending my initial SMTP greeting banner and some drop it
as I answer their EHLO or HELO.
Now, I don't and can't know why senders are choosing to abandon their SMTP sessions to my sinkhole server. But one thing that my server does is trickle out its SMTP replies rather slowly (including the initial banner), specifically at a rate of one character every tenth of a second. I took this idea from OpenBSD's spamd, but when I put it in I didn't really expect it to do anything. It may be that I'm wrong here and there is a not insignificant amount of spammer software that either specifically recognizes this behavior or simply isn't interested in wasting its time on too-slow mailers.
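As a rough sketch of what this trickling amounts to (this is
illustrative Go, not my server's actual code):

func trickleReply(conn net.Conn, reply string) error {
    // Write the reply out one byte at a time, pausing a tenth
    // of a second between bytes, in the style of OpenBSD's spamd.
    for i := 0; i < len(reply); i++ {
        if _, err := conn.Write([]byte{reply[i]}); err != nil {
            return err
        }
        time.Sleep(100 * time.Millisecond)
    }
    return nil
}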
(I don't yet feel like experimenting by turning this feature off and seeing if the number of abandoned sessions drops to almost zero.)
Applications of this to real, non-sinkhole mailers are left as an exercise. As far as I know, no real sending mailer cares about somewhat slow responses at this level, but I admit I haven't exactly attempted to get every major ISP and so on to send my sinkhole server some email just to see if it would work. Big places like Google and Outlook don't seem to have had any problems coping with my sinkhole server, for what that's worth.
Sidebar: what I consider an abandoned session versus a rejected one
A session counts as 'rejected' if the most recent valid HELO/EHLO,
MAIL FROM, RCPT TO, DATA or final '.' on messages was either
5xx'd or 4xx'd. This doesn't consider QUIT, RSET, or other
similar commands and it doesn't consider out of sequence commands.
A session counts as 'abandoned' if it got 'go ahead' 2xx/354 responses
to every valid, in-sequence SMTP command it tried but the sender
either closed the TCP connection or sent a QUIT.
Sessions with things like TLS setup failures don't count as either abandoned or rejected. I see some amount of those, some for sad reasons.
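Expressed as a Go sketch (the names here are illustrative, not from
my actual server's code):

type session struct {
    tlsFailed      bool // TLS setup failed partway through
    lastValidReply int  // reply code to the most recent valid, in-sequence command
    senderStopped  bool // sender sent QUIT or closed the TCP connection
}

func classify(s session) string {
    switch {
    case s.tlsFailed:
        return "neither"
    case s.lastValidReply >= 400:
        // A 4xx or 5xx to HELO/EHLO, MAIL FROM, RCPT TO, DATA,
        // or the final '.'.
        return "rejected"
    case s.senderStopped:
        // Everything got 2xx/354 responses and then the sender
        // gave up.
        return "abandoned"
    default:
        return "neither"
    }
}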
2016-05-31
Understanding the modern view of security
David Magda wrote a good and interesting question in a comment on my entry on the browser security dilemma:
I'm not sure why they can't have an about:config item called something like "DoNotBlameFirefox" (akin to Sendmail's idea).
There is a direct answer to this question (and I sort of wrote it in my comment), but the larger answer is that there has been a broad change in the consensus view of (computer) security. Browsers are a microcosm of this shift and also make a great illustration of it.
In the beginning, the view of security was that your job was to create a system that could be operated securely (often but not always it was secure by default) and give it to people. Where the system ran into problems or operating issues, it would tell people and give them options for what to do next. In the beginning, the diagnostics when something went wrong were terrible (which is a serious problem), but after a while people worked on making them better, clearer, and more understandable by normal people. If people chose to override the security precautions or operate the systems in insecure ways, well, that was their decision and their problem; you trusted people to know what they were doing and your hands were clean if they didn't. Let us call this model the 'Security 1' model.
(PGP is another poster child for the Security 1 model. It's certainly possible to use PGP securely, but it's also famously easy to screw it up in dozens of ways such that you're either insecure or you leak way more information than you intend to.)
The Security 1 model is completely consistent and logical and sound, and it can create solid security. However, like the 'Safety-I' model of safety, it has a serious problem: it not infrequently doesn't actually yield security in real world operation when it is challenged with real security failures. Even when provided with systems that are secure by default, people will often opt to operate them in insecure ways for reasons that make perfect sense to the people on the spot but which are catastrophic for security. Browser TLS security warnings have been ground zero for illustrating this; browser developers have experimentally determined that there is basically no level of strong warnings that will dissuade enough people from going forward to connect to what they think is eg Facebook. There are all sorts of reasons for this, including the vast prevalence of false positives in security alerts and the barrage of warning messages that we've trained people to click through because they're just in the way in the end.
The security failures of the resulting total system of 'human plus computer system' are in one sense not the fault of the designers of the computer system, any more than it is your fault if you provide people with a saw and careful instructions to use it only on wood and they occasionally saw their own limbs off despite your instructions, warnings, stubbornly attached limb guards, and so on. At the same time, the security failures are an entirely predictable failure of the total system. This has resulted in a major shift in thinking about security, which I will call 'Security 2'.
In Security 2 thinking, it is not good enough to have a secure system if people will wind up operating it insecurely. What matters and the goal that designers must focus on is making the total system operate securely, even in adverse conditions; another way to put this is that the security goal has become protecting people in the real world. As a result, a Security 2 focused designer shouldn't allow security overrides to exist if they know those overrides will wind up being (mis)used in a way that defeats the overall security of the system. It doesn't matter if the misuse is user error on the part of the people using the security system; the result is still an insecure total system and people getting owned and compromised, and the designer has failed.
Security 2 systems are designed not necessarily so much to be easy to use as to be hard or impossible to screw up in such a way that you get owned (although often this means making them easy to use too). For example, always-on, automatic end to end encryption of messages in an instant messaging system is a Security 2 feature; end to end encryption that is optional and must be selected or turned on by hand is a Security 1 feature.
Part of the browser shift to a Security 2 mindset has been to increasingly disallow any and all ways to override core security precautions, including being willing to listen to websites over users when it comes to TLS failures. This is pretty much what I'd expect from a modern Security 2 design, given what we know about actual user behavior.
(The Security 2 mindset raises serious issues when it intersects with user control over their own devices and software, because it more or less inherently involves removing some of that control. For example, I cannot tell modern versions of Firefox to do my bidding over some TLS failures without rebuilding them from source with increasing amounts of hackery applied.)
2016-05-30
The browser security dilemma
So Pete Zaitcev ran into the failure mode of modern browsers being strict about security, which is that the browser locks you out of something that you need to access. The only thing I'm much surprised about is that it happened to Pete Zaitcev before it happened to me. On the one hand, this is really frustrating when it happens to you; on the other hand, the browsers are caught on the horns of a real security dilemma here.
To simplify, there are two sorts of browser users; let us call them sysadmins and ordinary people. Sysadmins both know what they're doing and deal with broken cryptography things on a not infrequent basis, such as device management websites that only support terribly outdated cryptography (say SSLv3 only), or have only weak certificates or keys (512 bits only, yes really), or have certificates that have long since expired and are for the wrong name anyways. As a result, sysadmins both want ways to override TLS failures and can (in theory) be trusted to use them safely. By contrast, ordinary people both don't normally encounter broken cryptography and don't really know enough to handle it safely if they do.
In an ideal world, a browser would be able to tell which sort of person you were and give you an appropriate interface. In this less than ideal world, what browser vendors have discovered is that if you expose a 'sysadmin' interface in basically any way, ordinary people will eventually wind up using it for TLS failures that they definitely should not override. It doesn't matter how well you hide it; sooner or later someone will find it and write it up on the Internet and search engines will index it and people will search for it and navigate the ten steps necessary to enable it (and ignore your scary warnings in the process). If we have learned anything, we've learned that people are extremely motivated to get to their websites and are willing to jump through all sorts of hoops to do so. Even when this is a terrible idea.
Since ordinary people vastly outnumber sysadmins, browsers are increasingly opting to throw sysadmins under the bus (ie, completely not supporting our need to override these checks some of the time). At the moment, some major browsers are less strict than others, but I suspect that this will pass and sooner or later Chrome too will give me and Pete Zaitcev no option here. Maybe we'll still be able to rely on more obscure things (on Linux) like Konqueror, at least if they're functional enough to handle the device management websites and IPMIs and so on that I need to deal with.
(Failing that, there may come a day where I keep around an ancient copy of Firefox to handle such sites, in just the same way that I keep around an ancient copy of Java to deal with various Java based 'KVM over IP' IPMI things. Don't worry, my ancient Java isn't wired up as an applet and only works in a non-default browser setup in the first place.)
'Command line text editor' is not the same as 'terminal-based text editor'
A while back, I saw a mention of what was called a new command line text editor. My ears perked up, and then I was disappointed:
Today's irritation: people who say 'command line text editor' when they mean 'terminal/cursor-based text editor'.
I understand why the confusion comes up, I really do; an in-terminal
full screen editor like vi generally has to be started from the
command line instead of eg from a GUI menu or icon. But for people
like me, the two are not the same and another full screen, text
based editor along the lines of vi (or nano or GNU Emacs without
X) is not anywhere near as interesting as a new real command line
text editor is (or would be).
So, what do people like me mean by 'command line text editor'? Well,
generally some form of editor that you use from the command line
but that doesn't take over your terminal screen and have you cursor
around it and all that. The archetype of interactive command line
text editors is ed, but there are other editors which have such
a mode (sam has
one, for example, although it's not used very much in practice).
Now, a lot of the nominal advantages of ed and similar things are
no longer applicable today. Once upon a time they were good for
things like low bandwidth connections where you wanted to make quick
edits, or slow and heavily loaded machines where you didn't want
to wait for even vi to start up and operate. These days this is
not something that most people worry about, and full screen text
editors undeniably make life easier on you. Paradoxically, this is
a good part of why I would be interested in a new real command line
editor. Anyone who creates one in this day and age probably has
something they think it does really well to make up for not being
a full screen editor, and I want to take a look at it to see this.
I also think that there are plausible advantages of a nice command line text editor. The two that I can think of are truly command line based editing (where you have commands or can easily build shell scripts to do canned editing operations, and then you invoke the command to do the edit) and quick text editing in a way that doesn't lose the context of what's already on your screen. I imagine the latter as something akin to current shell 'readline' command line editing, which basically uses only a line or two on the screen. I don't know if either of these could be made to work well, but I'd love to see someone try. It would certainly be different from what we usually get.
(I don't consider terminal emulator alternate screens to be a solution to the 'loss of context' issue, because you still can't see the context at the same time as your editing. You just get it back after you quit your editor again.)
2016-05-29
What does 'success' mean for a research operating system?
Sometimes people talk about how successful (or not successful) an operating system has been, when that operating system was created as a research project instead of a product. One of the issues here is that there are several different things that people can mean by a research OS being a success. In particular, I think that there are at least four sorts of it:
- The OS actually works and thus serves as a proof of concept for the
underlying ideas that motivated this particular research OS
variation. What 'works' means may vary somewhat, since research
projects rarely reach production status; generally you get some
demos running acceptably fast.
Having your research OS actually work is about the baseline definition of success. It means that your ideas don't conflict with each other, can be made to work acceptably, and don't require big compromises to be implemented.
- The OS works well enough and is attractive enough that people
in your research group can and do build things on it and actively
use it. If it's a general purpose OS, people voluntarily and
productively use it for everyday activity; if it's a specialized
real time or whatever OS, people voluntarily build their own
projects on top of it and have them work.
A research OS that has reached this sort of success is more than just a technology demonstration and proving ground. It can do real things.
- At least some of your OS's ideas are attractive enough that they
get implemented in other OSes or at least clearly influence the
development of other OSes. This is especially so if your ideas
propagate to production OSes in some form or other (often in a
somewhat modified and less pure form, because that's just how
things go).
(As anyone who's familiar with academic research knows, a lot of research is basically not particularly influential. Being influential means you've achieved more success than usual.)
- Some form of your research OS winds up being used by outside people to do real work; it becomes a 'success' in the sense of 'it is out in the real world doing things'. Sometimes this is your OS relatively straight, sometimes it's a heavily adapted version of your work, and I'm sure that there have been cases where companies took the ideas and redid the implementation.
Most research OSes reach the first level of success, or at least most that you ever hear about (the research community rarely publishes negative results, among other issues). Or at least they reach the appearance of it; there may be all sorts of warts under the surface in practice in terms of performance, reliability, and so on. On the other hand some research OSes are genuine attempts to achieve genuinely usable, reliable, and performant results in order to demonstrate that their ideas are not merely possible but are actively practical.
It's quite rare for a research OS to reach the fourth level of success of making it into the real world. There are not many 'real world' OSes in the first place and there are very large practical obstacles in the way. To put it one way, there is a lot of non-research work involved in making something a product (even a free one).
(In general purpose OSes, I think only two research OSes have made a truly successful jump into the real world from the 1970s onwards, although it's probably been tried with a few more. I don't know enough about the real time and embedded computing worlds to have an idea there.)
2016-05-28
A problem with using old OmniOS versions: disconnection from the community
One of the less obvious problems with us probably never doing another OmniOS upgrade is that I'm clearly going to become more and more disconnected from the OmniOS community. This is only natural, since most or almost all of the community is using recent versions; as time goes on, those versions and the version we're running are only going to drift more and more apart.
(It's true that OmniOS r151014 is an OmniOS LTS release, supported through early 2018 per here. But in practice I expect that most OmniOS people will be running one of the more up to date stable releases instead, since they won't have our upgrade concerns.)
Being disconnected from the community makes me sad, because the OmniOS community is one of the great parts about OmniOS. There are several dimensions to this disconnection. First, the more disconnected I am from the community, the less I'll be able to give back to it, the less I can contribute answers or information or whatever. Giving back to the community is something that I would like to do for all sorts of reasons (including that I plain like being able to contribute).
Obviously, the more distant we are from what the community is running, the less the community can help us with advice and information and all of that if we run into issues or just have questions about how best to do something or what the community's experiences are. At best they may be able to tell us how things would look or would be done on a newer version of OmniOS. Of course, some things only change slowly, but I suspect that there is only going to be more and more of a gap here over time. I don't want to put too much weight on this; I'm very grateful for the help that the community has given us, but at the same time it's not help that I think we should count on and significantly factor into our plans.
(To put it one way, community help comes from the goodness of its heart and is best considered a pleasant surprise instead of a guarantee or an entitlement. I don't know if all of this makes sense to anyone but me, though.)
Finally, I'll just plain be paying less attention to the community and drifting away from it. It's inevitable; more and more, community discussions will be about things that aren't relevant to our version and that I can't contribute to. If people have problems or questions, I'll only have outdated information or more and more uninformed opinions. That's a recipe for disengagement, even from a nice community.
Having written all of this, I think that what I should do is build one experimental OmniOS server to keep up to date. It doesn't have to use our fileserver hardware; for a lot of things, any old server running OmniOS will serve to keep me at least somewhat current. As a bonus it will provide me with a platform to test things on the current OmniOS version (whatever that is at the time).
(We have enough spare SSDs for our current fileservers so that I could take the test fileserver and build a system SSD set for the current OmniOS, just so I have it around. We did this sort of back and forth OmniOS version testing during our transition to r151014, so we actually have a template for it.)
2016-05-27
Your overall anti-spam system should have manual emergency blocks
We mostly rely on a commercial anti-spam system for our incoming spam filtering (as described here), and many other people rely on a variety of open source options for their spam filtering. This generally works very well, with us (and you) getting to offload the work of maintaining a high quality anti-spam system to other people (and it's certainly a lot of work). But not always (and not just because it malfunctions). The realities of life are that sooner or later you will be hit by a spam run that your anti-spam system doesn't recognize, either because the spam run is really new or because it's pretty specific to you.
Much of the time, you can shrug your shoulders and let this go. No anti-spam system is perfect and one of the tradeoffs you make when relying on a third-party system is that it's broadly out of your hands (sometimes this is an advantage). But some of the time this isn't going to be good enough; either the volume or the threat to your users will be so high that you can't just sit on your hands.
(Modern ransomware is making this clear by creating a potentially very high cost of allowing some things through.)
When this day comes to pass, you'll want to have the ability to step in and block the traffic even though your automated anti-spam system is happy with it. This can take many forms, depending on how you want to handle it; you could figure out how to write custom rules for your anti-spam system (so you can outright block certain sorts of files or certain URLs or whatever), or you can build blocking features into your mailer configuration itself, or any number of other options.
Having been through having to do this on the fly during an emergency, my strong suggestion is that you build the infrastructure for these manual blocks now, before you need them. It's some additional up front work and if you're lucky you may never need it, but doing it now when you have time to plan and test and figure out the best way to do things beats having to do it on the fly, under pressure.
Sidebar: What I think you should have manual blocks for
On the one hand attacker ingenuity is very deep, but on the other hand certain patterns repeat over and over again. So my view is that you can probably cover most ground with the ability to put in place manual blocks against sending IPs, sending domains, file extensions (including inside file containers like ZIP files), and whole and partial URLs (for phishing campaigns). You might also want a general message header and body regular expression matching system, but that's starting to feel like scope creep to me.
(Of course real scope creep would be to start by creating a general, generic framework for writing relatively arbitrary manual blocks on message attributes.)
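As a concrete but entirely hypothetical sketch of the simple version,
the core could be little more than a table of blocks plus one check
function (all names here are illustrative; partial URL matching is
just strings.Contains):

type manualBlocks struct {
    ips        map[string]bool // sending IP addresses
    domains    map[string]bool // sending domains
    extensions map[string]bool // file extensions, including inside ZIPs
    urlPieces  []string        // whole or partial URLs, for phishing runs
}

func (b *manualBlocks) blocks(ip, domain string, exts, urls []string) bool {
    if b.ips[ip] || b.domains[domain] {
        return true
    }
    for _, e := range exts {
        if b.extensions[e] {
            return true
        }
    }
    for _, u := range urls {
        for _, piece := range b.urlPieces {
            if strings.Contains(u, piece) {
                return true
            }
        }
    }
    return false
}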