Wandering Thoughts: Recent Entries

2012-01-24

Why I use exec in my shell scripts

As with the little example yesterday, a fair number of my shell scripts end with running a program and when they do, I almost invariably go the little extra distance and do it with exec. In the old days, the reason to do this was that it used slightly less resources, since it got rid of the shell process and left only the process for the real program you wound up running. But, while I was around then, the reason I use it today isn't that; it's that it lets you freely edit the script while that final program is running.

At this point some of you may be going 'wait, what?' That's because most Bourne shell implementations are a little bit peculiar.

In most interpreted languages on Unix (like Python, Ruby, and Perl), the interpreter completely loads and parses the script file before it starts running it. This means that once your script has actually started running, once that initial load and parse has finished, you can freely change the script's file without the interpreter caring; it will only look at the actual file and its contents again if and when you re-run your script.

Bourne shell implementations have historically not worked this way (and it's possible that it's actually impossible to preparse Bourne shell scripts for some reason). Instead they not only parse the script on the fly as it executes, but also they read the file on the fly as the script runs. This means that if you edit a shell script while it's running you can literally shuffle the code around underneath the script. When the shell resumes reading and parsing the script after the current command finishes, it can be reading from partway through a line, from something that it had already read, or (if you deleted text) wind up skipping over something that it should have run. This often causes the shell script to fail with weird errors or, worse, to malfunction spectacularly. This can happen even if the shell is on the last line of the script.

But if you end a shell script with exec, you avoid this. The actual shell interpreter effectively exits (by turning itself into the actual program) and so there's nothing there to try to read anything more and get confused by your edits.

(Of course nothing helps if you can't use exec; then you just have to remember to never edit the script while it's running, at least with an editor that overwrites the file in place.)

Sidebar: a detailed example of what happens

Let's start with a little script:

#!/bin/sh
echo "a"
firefox

Run this script. While Firefox is running, edit it so that the echo string is four or five characters longer (using vi or some other editor that overwrites files in place). When you exit Firefox, the script will complain something like 'script: line 4: efox: command not found'.

When the shell was running Firefox, its read position in the file was just after the newline at the end of firefox. When you edited the script and added more letters, that same byte position was now pointing to the e in the 'firefox'. When Firefox exited and the shell resumed reading from that byte position, it read 'efox<newline>', saw a perfectly valid command execution, and tried to run 'efox' (and failed).

(It reports that this happened on line 4 because it knew it had already read three lines, so clearly this is line 4. As a corollary, you can't trust the line numbers that are printed when something like this happens.)

WhyShellScriptExec written at 00:06:12; Add Comment

2012-01-22

My view of the purpose of object orientation

A while back I read Rise and Fall of Classic OOP. This caused me to realize that I am kind of a heathen as far as object oriented programming is concerned, probably because I came to explicit OO late and never actually learned how to do it the 'right way'. You see, to me object orientation is a technique for code organization and nothing more.

This gives me a very pragmatic view of when to write OO code and when not to; I use objects and classes where they make my code simpler, and I don't use them when they don't. I don't consider them something that has to be followed at all costs or as the only way to model the real world (or any arbitrary artificial world). If the real world entities that you're working with aren't amenable to being wedged into an OO hierarchy, then don't. Given the wide variety of both code structure and ways of organizing code so that it makes sense, it would be fairly absurd to say that OO is always the right answer; it is just one technique among many. Sometimes it's the right answer, sometimes not.

(Of course, some languages as so in love with OO that they don't give you a choice about it; you can't really have freestanding functions and data containers.)

I won't say that all of those OO examples that modeled the real world always struck me as a bit hokey and artificial, because honestly I never really thought that much about it (and any small example is hokey and artificial if you really look at it). But if people are switching towards my view of the purpose of OO, I'm all for it.

(I would be shocked if this was new and novel. I sure hope that lots of people have had this thought before me, because it just feels so obvious.)

ObjectOrientationPurpose written at 02:27:01; Add Comment

2012-01-21

The C juggernaut illustrated

Perhaps it is tempting, looking back at history from the vantage point of today, to say that C succeeded so much because it was at the right place at the right time. As you could tell the story, all sorts of people in the 1980s wanted a low level programming language, C was around, and so they seized on it. Any similar language would have done; it's just that C was lucky enough to be the one that came out on top, partly because of network effects.

(This story is especially tempting to people who don't like C and Unix.)

This significantly understates the real appeal of C at the time, even and especially to people who had alternative languages. A great illustration of this is C on the early Macintosh. You see, unlike environments like MS-DOS (which had no language associated with it, just assembler), the early Macintosh systems already had a programming language; they were designed to be programmed in Pascal (and the Mac ROMs were originally written in Pascal before being converted to assembler).

This was more than just an issue of Apple's suggested language being Pascal instead of C. The entire Mac API was designed around Pascal calling conventions and various Pascal data structures; it really was a Pascal API. Programming a Mac in C involved basically swimming upstream against this API, full of dealing with things like non-C strings (if I remember right, Mac ROM strings were one byte length plus data). I believe that Mac C compilers had to introduce a special way of declaring that a C function should have the Pascal calling convention so that it could be used as a callback function.

Despite all of this, C crushed Pascal to become by far the dominant programming language on the Macintosh. I don't think it even took all that long. Programmers didn't care that dealing with the API issues were a pain; working in C was worth it to them. It didn't matter that Pascal was the natural language to write Mac programs in or that it was a perfectly good language in its own right. C was enough better to displace Pascal in a hostile environment.

C did not win just because it was at the right place at the right time. C won in significant part because it was (and is) a genuinely good language for the job it does. As a result it was the language that a lot of pragmatic people picked if you gave them anything like a choice.

CTriumph written at 01:03:49; Add Comment

2011-12-31

My thoughts on the mockist versus classicalist testing approaches

To summarize aggressively, one of the quiet long-running disputes in the OO test driven community is between classical TDD, where you use real or stub support classes, and mockist TDD, where you use behavior-based mock objects. My guide on this is primarily Martin Fowler (via Jake Goulding). Jake Goulding summarizes the difference as stubs assert state while mocks assert messages (or method calls). I'm mostly a classicalist as far as testing goes for no greater reason than I generally find it easier, but reading Martin Fowler's article started me rethinking my passive attitude on this. On the whole I think I'm going to remain a classicalist, but I want to run down why.

One way to look at the divide is to look at what is actually being tested. When you use stubs (or real objects), what you are really testing is the end result of invoking the code under test. When you use mocks, you're testing the path that code under test used to get to its end result. So the real question is whether or not the path the code under test uses to derive its result is actually important.

Phrased this way, the answer is clearly 'sometimes'. The most obvious case is situations where the calls to other objects create real side effects; for example, exactly how you debit and credit accounts matters significantly, not just that every account winds up with the right totals in the end. This means means that sometimes you should use mocks. If you're testing with stubs and the path is important, you're not actually testing the full specification of your code; the specification is not just 'it gets proper results', the specification is 'it gets proper results in the following way'.

At the same time I feel that you should not test with mocks unless the specific behavior actually is part of the specification of the thing under test. Otherwise what you are actually testing is not just that the code works but also that it has a specific implementation. I strongly dislike testing specific implementations unless necessary because I've found it to be a great way to unnecessarily break tests when you later change the implementation.

This also ties into what sort of interfaces and interactions your objects have with each other; there's a whole spectrum of closely coupled objects to loosely coupled objects to deliberately isolated objects. Where you have deliberately isolated objects, objects used to create hard and limited interface boundaries, I think you should almost always test the behavior as well as the actual outcome for things that call them (because you want to make sure that the interface is being respected and used in the right way). Conversely, closely coupled objects (where you are only using multiple sorts of objects because it fits the problem best) are a case where I'd almost never test behavior because the split into different objects is essentially an implementation artifact.

(Possibly some or all of this is obvious to experienced practitioners. One of my weaknesses as a programmer is that I learned programing before both OO and the testing boom, and I have never really caught up with either.)

MockistVsClassicalist written at 01:04:56; Add Comment

2011-12-16

Avoiding the classic C quoting bug in your language

To summarize an earlier entry, the classic C quoting bug that I'm talking about here is writing 'printf(ustr)' or 'syslog(pri, ustr)' where 'ustr' is a string that comes from some form of external input. In that entry I mentioned that it's possible for languages to avoid this entire class of bugs with the right design. Well, it's time to elaborate on that parenthetical aside.

To start with, let's turn the issue around: what is it about these functions that causes the bug? The answer is that the bug happens because all of these functions do two things; they do something useful and in the process of doing it they format their arguments for you. The bug happens when you (the programmer) just want to do the useful thing and you overlook the fact that you're also getting the formatting for free. The conclusion is straightforward. To make this bug impossible, we need to make functions like this do only one thing; they should take a plain string and not format it. But people still need string formatting, so we need some easy way for people to do it in order to make these single purpose functions both feasible and convenient, a way that involves as close to no extra work as possible. In short, we need effortless generic string formatting.

(The need for no extra work is why we can't do this in C. We need a language where you don't need to explicitly manage string lifetimes, since the result of string formatting is another string.)

In theory you can implement generic string formatting as a function call (ideally with a very short function name). In practice I think that it isn't going to work the way you want because of perception issues; if string formatting is just a function call, it's still tempting to create convenience functions that bundle the two function calls together for you (one to format the arguments then one to do the useful thing). Doing generic string formatting as an operator (such as Python's '%') has the pragmatic benefit of drawing a much more distinct line between regular function calls and formatting strings.

(The third approach to effortless generic string formatting is string interpolation in certain sorts of strings. This has the benefit of sidestepping the entire problem, although it has problems of its own.)

PS: another approach in an OO language is to give strings an explicit formatting or interpolation method, so that you might write '"%s: %s".fmt(a, b)'. My gut thinks that this is closer to string formatting as an operator than anything else.

AvoidingQuotingBug written at 00:36:14; Add Comment

2011-12-05

Debuggers and two sorts of bugs

In the middle of reading Go Isn't C I ran across some remarks about how it was bad that programmers ignore debuggers in favour of print statements. This immediately sparked my standard reaction, which is that debuggers are focused on telling you what happens next but generally I want to know how on earth my code got into its current state. Then I started thinking some more and realized that I was being too strong because this isn't really accurate. In fact I use debuggers periodically, but only on certain sorts of bugs.

Let us say that there are two sorts of bugs. For lack of better names, I will call them direct bugs and indirect bugs. A direct bug's cause can be determined immediately by looking at the call stack, the local variables, the code, and so on at the time when it happens. You can say 'oh, the caller forgot that this function couldn't be called with a NULL', or see that you forgot to handle a case and something fell through to code that should never have been reached in this situation. A decent debugger works very well on direct bugs, or even features like automatic call stack backtraces on uncaught exceptions (as you get in languages like Python).

Indirect bugs are data structure corruption bugs (or sometimes flow of control bugs), where you are now in a 'this can't happen' situation (whether caught by an assert() or not). Finding the immediate problem in the code or diagnosing the source of corrupt results is only a starting point; the real challenge is discovering where and how things went off the rails so as to get you to where you are now. Indirect bugs are the bugs where you need to look back into the past to answer 'how did I get here?' questions.

(For practical examples, my recent Liferea issue was a direct bug; if I had read the code carefully, the first stack backtrace would have shown me the problem. My SIGCHLD signal handler race in Python was an indirect bug; I always knew what the direct problem was, but I had no idea how the program got into that state until I did some careful analysis.)

My unconscious bias until now has been that direct bugs are uninteresting because they are easy to solve from basic inspection, so I only really thought about what I wanted to deal with indirect bugs. But the main reason that direct bugs are easy to deal with is that I already have tools like stack backtraces and inspection of local variables so it's easy to see what's wrong with the program's current state.

Sidebar: a hazard of dealing with indirect bugs

It's often popular to 'fix' an indirect bug that crashes the program or generates obviously bad results by making the code accept the impossible state; for example, by adding a NULL check to the low-level routine that's crashing with a NULL pointer exception. This is generally the wrong idea (you're treating the symptoms instead of the disease), but it's tempting as a quick fix and it's an easy approach to fall into if you don't understand the code well enough to know that you're dealing with an indirect bug, not a direct bug.

This is one of the issues that always makes me wary about fixing 'obvious' crash bugs, especially if I want to send the fix upstream. Before I add a NULL pointer check or the like I need to be sure that it's the real bug, and I need to understand what the code should do instead of crashing (which is not always obvious).

DebuggersAndBugTypes written at 00:38:56; Add Comment

2011-11-17

A classic and standard C quoting bug

I recently read Evan Martin's entry on quoting and escaping, which brought to mind one of those classic C and Unix programming mistakes which hopefully is not done very much any more.

Suppose that ustr is a string that comes from user input. The classic form of this particular quoting bug is to write printf(ustr) (or perhaps fprintf(out, ustr)); a much obvious version of this is syslog(pri, ustr).

(Some people might think that the printf() version is silly since there are much more nominally efficient ways to print plain strings, but in practice plenty of people find it much more efficient on programmer time to use printf() or fprintf() to output everything.)

The problem with all of these is that the first argument to printf() and syslog() is not a plain string to be printed; it is a format. Like many quoting bugs, this goes undetected for much of the time because a true plain string is also a valid format that formats to itself. However, if someone supplies a 'plain string' that includes printf formatting directives, things rapidly go off the rails; if you are lucky the program crashes right away so you can immediately notice the problem.

(If you are sufficiently unlucky, this is an exploitable security vulnerability. And yes, people have written code like this that made it into important programs.)

The right solution is of course to quote the user supplied string. The simple way to do this is to supply your own simple formatting string: syslog(pri, "%s", ustr) (or the equivalent with printf() et al). Of course at this point you might want to think about other bad characters that could appear in ustr and how you want to display the message to make it clear that the string comes from the user, not from your program's internals.

This bug can happen with any function that takes a format string as an argument and in any language, not just C. C is just a more dangerous language to have it happen in because C generally has no check for the wrong number of arguments being supplied to a function.

(I have opinions on how languages can avoid this entire class of bugs, but that's something for another entry.)

(Perhaps this is not strictly a quoting bug. It is in my mind, but I may have a somewhat odd view of what constitutes one.)

ClassicCQuotingBug written at 00:39:57; Add Comment

2011-11-12

(Not) parsing wikitext

I have a confession: I don't know how to parse WikiText. Oh, I know how to process it (using regular expressions and ad-hoc code), but I don't know how to write a formal parser for it, something based on parsing theory in order to have all of the advantages of real parsers. Parsers matter because they're much easier to write, to maintain, and to reason about than ad-hoc code; ad-hoc processing code is ultimately just a big pile of mud. Complex, fragile mud.

I've been thinking about this for some time, and I've more or less concluded that there are three aspects of wikitext that make it hard to handle with traditional parsing techniques and tools. The first issue is that wikitext is a mixture of stream-oriented and line-oriented content. All of the parsing techniques that I learned in my now long-ago compilers course were built around stream oriented parsing, where you simply have a running stream of tokens and the most line oriented stuff you have is that end of line is a delimiter. This is fine for parsing running text, but pretty much all wikitexts also have an entire second layer of line to line markup to signal things like headings, lists, quoted (or indented) text, preformatted text, and so on. You need to parse this second layer as well as the running text, and you generally can't parse each line's running text independently because markup constructs in running text can span physical lines.

(Even Python is parsed with a straightforward stream oriented parser under the hood; I wrote about it here and here. Unfortunately the technique that Python uses can't be extended to parsing wikitext.)

I'm sure that there are parsing techniques for dealing with line oriented languages (my impression is that a number of early pre-Algol languages were significantly line oriented), but I didn't learn any such techniques and I think they're now out of favour because everyone sensibly shifted to stream oriented language designs decades ago.

The second big issue is that unlike a traditional language, wikitext doesn't have any errors. In a traditional parser, a parse production either succeeds or results in an error. In wikitext, there are no errors and something like an incomplete external link is simply raw text; '[[some text' is not an error, it is '[[some text'. The presence of errors allows parsers to avoid huge amounts of backtracking (or equivalently, a huge amount of lookahead). The dominant forms of parsing today have very little lookahead because people found that that was both possible and efficient for computer languages (and then people carefully fiddled with languages so that they didn't need lookahead). More advanced things like parsing expression grammars can tolerate more lookahead, but my impression is that even a PEG based grammar struggles with wikitext's potential massive backtracking.

The third issue is that the proper meaning of a lot of wikitext is highly context dependent. For example, consider whitespace at the start of a line. In DWikiText, the standard meaning of this is preformatted text, but if we're in a list context it instead means a continued list entry. Similar issues can bedevil parsing various font change characters in running text, especially if you want to insist on proper nesting. I think that some of this can be handled naturally by things like parsing expression grammars, but I'm not sure that all of it can be.

(I read a paper from sweble.org that admitted they didn't handle determining what was and wasn't a font change in the parser but instead deferred all of that to a post-processing pass over the resulting AST. Maybe it's still possible to do this in clear, systematic code instead of gradually devolving back to ad-hoc processing.)

ParsingWikitext written at 01:41:04; Add Comment

2011-10-24

Salting passwords compared to slow password hashing

When you're storing encrypted passwords, you can do two different things to ruin the life of an attacker who got your password hashes, or at least annoy them; you can salt your passwords in various ways and you can use a slow password hashing function. These two have different but complementary effects.

(I've talked about the effects of the various ways of salting passwords back in SaltingPasswords.)

Salting passwords reduces the payoff an attacker gets for making a single password guess. They slow down how fast an attacker can compromise a bunch of your accounts, but salts do not slow down how fast an attacker can compromise a specific account. Individually salted passwords with a fast hash function increase the time to mass compromises but do not necessarily increase the time to the first account compromise.

Using a slow password hashing function slows down how fast an attacker can make password guesses but by itself does nothing to reduce the payoff the attacker gets for making those guesses. If you have no salts you're vulnerable to rainbow tables, and if you just have a single site-wide salt every (slow) password guess can be checked against all of your users at once.

If you want to delay the time to the first account compromise, the worst situation is a fast password hash (like SHA1 or MD5) with at most a single site wide salt; an attacker can make very fast hash checks and all they need is some person to have a reasonably weak password. If you have per user salts with a fast hash, the attacker needs to get lucky by picking a user that used a weak password. How much luck they need depends on the password practices of your users; if most users choose weak passwords, the time to first compromise will barely budge between a site-wide salt and a per-user salt.

(If you want scary figures on how fast this can be done today, see How To Safely Store A Password.)

It's tempting to summarize this as that having good salting means that the attacker can't rapidly break lots of your users at once while having slow hashing means that they can't rapidly break any of your users, but it's not quite that simple. Even with slow password hashing, if you don't have per user salts and you have users who pick (very) weak passwords the attacker won't need to make very many slow password guesses in order to find a vulnerable account.

(Even with a slow hashing function it doesn't take too much time to try 20,000 to 30,000 guesses. That's almost certainly enough to find most seriously weak common passwords.)

(This entry was sparked by reading On cryptography and dogmas (via Hacker News). I wanted to fix some of this in my own head.)

SaltingAndCryptSpeed written at 01:12:56; Add Comment

2011-10-11

Why people are attracted to minimal language cores

In the last entry I described how some languages rewrite loops with loop bodies by turning the loop body into an anonymous function that the loop invokes repeatedly. I then mentioned that some languages go all the way to turning loops into tail-recursive function calls. You might wonder why languages do this and why these sort of crazy transformations are considered attractive.

There are at least two reasons that these things are popular, for some definition of popular. First, the intellectual purity of a minimal core language appeals to a certain sort of language wonk; they tend to call the result simpler. These people are often drawn towards Lisp (especially Scheme, perhaps the canonical illustration of this philosophy in action).

(Lisp is not the only language family that has this sort of minimalism. For example, I think that Forth is just as minimal in its own way, although it gets far less language design attention.)

Second, it means that your code generator or interpreter core only needs to handle a minimal set of things because higher levels have transformed code written in the general version of the language down into this minimal form (often automatically). This has traditionally simplified optimizers; rather than implementing very similar analysis for each of a whole bunch of control flow constructs, they only have to analyze one really well. Which control flow construct you pick as your base depends on what language you're compiling; some languages pick goto, for example.

(Then you can get a PhD thesis or two out of how to do this analysis.)

My understanding is that the pragmatic evidence is mixed on whether this is a good idea or not. There have certainly been some significant successes, but I have also heard stories of compilers where the frontend carefully reduced all control flow constructions down to the single fundamental control flow atom, passed the whole thing to the optimizer, and had the optimizer reverse engineer the high level control flow stuff again from the low-level minimized control flow information.

(The argument for still doing this even when it's inefficient is that this lets the optimizer (re)discover the true high level control flows in your program, regardless of how you actually wrote the code. In a sense it's discovering what you actually meant (or at least, what you created) instead of just what you wrote.)

WhyLanguageTransformations written at 01:28:21; Add Comment

These are my WanderingThoughts
(About the blog)

GettingAround
Full index of entries
Recent comments

This is part of CSpace, and is written by ChrisSiebenmann.

* * *

Atom feeds are available; see the bottom of most pages.

This is a DWiki.
(Help)

Categories: links, linux, programming, python, snark, solaris, spam, sysadmin, tech, unix, web

Search:
[There's more, starting at 2011/10/10 or Previous 10]
(Previous day)
By day for January 2012: 21 22 24; before January.

Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.