What good secure string expansion on Unix should look like

July 23, 2011

In yesterday's entry, I covered some options for how to make string expansion and tokenization of command lines aware of each other. Before I pick what I think is the best approach, let's take a step back and talk about what results we want.

Consider the following hypothetical example:

av_scanner = cmdline:/opt/avscanner ${if isset{$heloname} {-h $heloname}} $recipients %s

Assuming that %s expands to a single argument, the straightforward reading of what we want to happen is for /opt/avscanner to be invoked with four arguments if $heloname is set (-h, the value of $heloname, the value of $recipients, and whatever %s expands to) and with only two if $heloname is unset. The alternative interpretations and their results are all absurd in one way or another.

I think that the simple way to achieve this is to perform string expansion before tokenization but to mark the result of variable expansions as being all in a single token. You don't quite want variable expansion to force token boundaries (otherwise '-h$somevar' would wind up actually meaning '-h $somevar', and that's absurd in its own way), but you don't want the tokenizer to split things inside variable expansions. Fortunately getting this right is only a small matter of programming.
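To make the idea concrete, here is a minimal sketch in Python of expanding first and then tokenizing with expansions marked as unsplittable. Everything in it is hypothetical: the function names, the restriction to plain $name references, and the choice to expand an unknown variable to an empty string are assumptions for illustration, not how any real mailer does it.

```python
import re

# Assumption: only plain $name references are handled in this sketch.
VAR = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")

def expand_marked(template, variables):
    """Expand $name references into (text, protected) pieces.

    Literal text is unprotected; the result of each variable
    expansion is protected so the tokenizer will not split it.
    """
    pieces = []
    pos = 0
    for m in VAR.finditer(template):
        if m.start() > pos:
            pieces.append((template[pos:m.start()], False))
        # Assumption: an unknown variable expands to the empty string.
        pieces.append((variables.get(m.group(1), ""), True))
        pos = m.end()
    if pos < len(template):
        pieces.append((template[pos:], False))
    return pieces

def tokenize_marked(pieces):
    """Split on whitespace, but only whitespace from literal text."""
    args, current, in_token = [], [], False
    for text, protected in pieces:
        if protected:
            # An expansion joins the current token without ending it,
            # so '-h$somevar' stays a single argument.
            current.append(text)
            in_token = True
            continue
        for ch in text:
            if ch.isspace():
                if in_token:
                    args.append("".join(current))
                    current, in_token = [], False
            else:
                current.append(ch)
                in_token = True
    if in_token:
        args.append("".join(current))
    return args
```

With heloname set to 'evil host' and recipients set to 'a@example.com b@example.com', the template '/opt/avscanner -h$heloname $recipients' comes out as three arguments: '/opt/avscanner', '-hevil host', and 'a@example.com b@example.com'. The whitespace inside the variable values never becomes a token boundary.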

(Possibly you want to expose an explicit operator to group several expansions together as a single non-breakable entity. You could call it '${arg ...}'.)

If you want to tokenize before expansion, clearly the tokenizer needs to be language-aware. Roughly speaking, I think what you wind up wanting to do is parse the string into an AST that is composed partly of tokenized literal text, partly of language operators, and partly of variable expansions. Then you evaluate the AST to generate a stream of tokenized text, where a straightforward variable expansion like $heloname or $recipients always gives you a single token regardless of what the contents are.
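A rough sketch of the evaluation half of this, in Python, might look like the following. The node types (Lit, Var, IfSet) and the evaluate function are invented for illustration, the AST for the earlier av_scanner example is built by hand rather than parsed, and %s is simply left as a literal placeholder; a real implementation would need an actual parser and more operators.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

@dataclass
class Lit:
    text: str            # one already-tokenized word of literal text

@dataclass
class Var:
    name: str            # a plain variable expansion such as $heloname

@dataclass
class IfSet:
    name: str            # stands in for ${if isset{$name} {...}}
    body: List["Node"]   # nodes evaluated only when the variable is set

Node = Union[Lit, Var, IfSet]

def evaluate(nodes: List[Node], variables: Dict[str, str]) -> List[str]:
    """Evaluate the AST into a list of arguments."""
    args: List[str] = []
    for node in nodes:
        if isinstance(node, Lit):
            args.append(node.text)
        elif isinstance(node, Var):
            # Always exactly one token, whatever the value contains.
            # Assumption: an unset variable becomes an empty argument.
            args.append(variables.get(node.name, ""))
        elif isinstance(node, IfSet):
            if node.name in variables:
                args.extend(evaluate(node.body, variables))
    return args

# Hand-built AST for the av_scanner example; %s is kept as a placeholder.
AV_SCANNER = [
    Lit("/opt/avscanner"),
    IfSet("heloname", [Lit("-h"), Var("heloname")]),
    Var("recipients"),
    Lit("%s"),
]
```

Evaluating AV_SCANNER with both variables set gives the program name plus four arguments; with heloname missing it gives the program name plus two, no matter what whitespace or shell metacharacters the values contain.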

(I have ripped this idea off from my understanding of the general approach that web frameworks usually take to parsing and evaluating their page templates.)

Sidebar: an alternate tokenization approach

An alternate tokenization approach is to say that the AST should include explicit token boundary markers instead of pre-tokenized text (and whitespace normally turns into such a boundary marker). Then the AST evaluation produces a stream that is a mixture of token boundary markers and text chunks; you take the stream and fuse all text between two boundary markers together into a single argument. This naturally handles cases like '-h$somevar' and '$var1$var2'; in both cases there is no token boundary marker in the middle, so although the AST has two separate nodes the end result fuses the text from both nodes together into a single argument.
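The fusing step is small enough to sketch directly. This Python version assumes the AST evaluation has already produced a flat stream of text chunks and boundary markers; BOUNDARY and fuse are hypothetical names, and treating a run of several boundary markers as a single break is my assumption.

```python
# A unique sentinel standing in for a token boundary marker.
BOUNDARY = object()

def fuse(stream):
    """Fuse all text between boundary markers into single arguments."""
    args = []
    current = None   # None means no text seen since the last boundary
    for item in stream:
        if item is BOUNDARY:
            # Consecutive boundaries collapse, like runs of whitespace.
            if current is not None:
                args.append(current)
                current = None
        else:
            current = item if current is None else current + item
    if current is not None:
        args.append(current)
    return args
```

For example, the stream '/opt/avscanner', BOUNDARY, '-h', 'evil host', BOUNDARY, 'a', 'b c' fuses into the three arguments '/opt/avscanner', '-hevil host', and 'ab c'. An empty text chunk on its own between two markers still produces an (empty) argument, because it is the presence of a chunk that matters, not its length.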

Written on 23 July 2011.