2011-07-23
What good secure string expansion on Unix should look like
In yesterday's entry, I covered some options for how to make string expansion and tokenization of command lines aware of each other. Before I pick what I think is the best approach, let's take a step back and talk about what results we want.
Consider the following hypothetical example:
av_scanner = cmdline:/opt/avscanner ${if isset{$heloname} {-h $heloname}} $recipients %s
Assuming that %s expands to a single argument, the straightforward
reading of what we want to happen is for /opt/avscanner to be
invoked with four arguments if $heloname is set and with only two if
$heloname is unset. The alternate interpretations and results
are all absurd in one way or another.
I think that the simple way to achieve this is to perform string
expansion before tokenization but to mark the result of variable
expansions as being all in a single token. You don't quite want variable
expansion to force token boundaries (otherwise '-h$somevar' would
wind up actually meaning '-h $somevar', and that's absurd in its own
way), but you don't want the tokenizer to split things inside variable
expansions. Fortunately getting this right is only a small matter of
programming.
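To make this concrete, here is a minimal sketch in Python of expansion before tokenization with single-token marking. The variable table, the '$name' syntax, and the NUL-byte placeholder scheme are all my own illustrative choices, not anyone's real implementation; the trick is to replace each expansion with a whitespace-free placeholder, tokenize, and only then substitute the real values back in.

```python
import re

# Hypothetical variable table; the values are purely illustrative.
VARS = {"heloname": "mail.example.com", "recipients": "a b; rm -rf /"}

def expand_and_tokenize(s, variables):
    """Expand $name variables, then whitespace-tokenize, keeping each
    expansion's result inside a single token whatever it contains."""
    values = []
    def stash(m):
        values.append(variables.get(m.group(1), ""))
        # NUL can't appear in the configuration string, so a NUL-wrapped
        # index is a safe placeholder that split() will never break up.
        return "\x00%d\x00" % (len(values) - 1)
    protected = re.sub(r"\$(\w+)", stash, s)
    tokens = protected.split()
    # Substitute the real values back into whatever token they landed in.
    return [re.sub(r"\x00(\d+)\x00", lambda m: values[int(m.group(1))], t)
            for t in tokens]

print(expand_and_tokenize("/opt/avscanner -h$heloname $recipients", VARS))
# ['/opt/avscanner', '-hmail.example.com', 'a b; rm -rf /']
```

Note that '-h$heloname' fuses into one token, as desired, while the hostile whitespace inside $recipients never gets a chance to split anything.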
(Possibly you want to expose an explicit operator to group several
expansions together as a single non-breakable entity. You could call it
'${arg ...}'.)
If you want to tokenize before expansion, clearly the tokenizer needs to
be language aware. Roughly speaking, I think what you wind up wanting to
do is parse the string into an AST that is composed partly of tokenized
literal text, partly of language operators, and partly of variable
expansions. Then you evaluate the AST to generate a stream of tokenized
text, where a straightforward variable expansion like $heloname or
$recipients always gives you a single token regardless of what the
contents are.
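A toy sketch of this in Python (ignoring language operators and supporting only bare '$name' expansions, which is far less than a real expansion language): the AST is a list of literal-token nodes and variable nodes, and evaluation turns each node into exactly one token. In this simple version adjacent text does not fuse with a variable, which is where the boundary-marker scheme in the sidebar comes in.

```python
import re

# Toy AST: each node is ("lit", word) or ("var", name).
def parse(s):
    ast = []
    for piece in re.split(r"(\$\w+)", s):
        if piece.startswith("$"):
            ast.append(("var", piece[1:]))
        else:
            # Literal text is tokenized at parse time, before any
            # variable contents are in play.
            ast.extend(("lit", w) for w in piece.split())
    return ast

def evaluate(ast, variables):
    # Each AST node yields exactly one token; a variable's contents can
    # never introduce extra token boundaries.
    return [variables.get(text, "") if kind == "var" else text
            for kind, text in ast]

ast = parse("/opt/avscanner -h $heloname $recipients")
print(evaluate(ast, {"heloname": "mx.example.com", "recipients": "a b"}))
# ['/opt/avscanner', '-h', 'mx.example.com', 'a b']
```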
(I have ripped this idea off from my understanding of the general approach that web frameworks usually take to parsing and evaluating their page templates.)
Sidebar: an alternate tokenization approach
An alternate tokenization approach is to say that the AST should include
explicit token boundary markers instead of pre-tokenized text (and
whitespace normally turns into such a boundary marker). Then the AST
evaluation produces a stream that is a mixture of token boundary markers
and text chunks; you take the stream and fuse all text between two
boundary markers together into a single argument. This naturally handles
cases like '-h$somevar' and '$var1$var2'; in both cases there is
no token boundary marker in the middle, so although the AST has two
separate nodes the end result fuses the text from both nodes together
into a single argument.
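A minimal Python sketch of the marker-stream version (again with only bare '$name' expansions standing in for a real expansion language): whitespace in literal text becomes a boundary marker, variables become single chunks, and a final pass fuses everything between two markers into one argument.

```python
import re

BOUNDARY = object()  # explicit token-boundary marker in the stream

def evaluate_to_stream(s, variables):
    """Produce a stream of text chunks and boundary markers. Literal
    whitespace becomes a boundary; a variable's value is one chunk."""
    stream = []
    for piece in re.split(r"(\$\w+)", s):
        if piece.startswith("$"):
            # No boundary markers can come from a variable's contents.
            stream.append(variables.get(piece[1:], ""))
        else:
            for frag in re.split(r"(\s+)", piece):
                if frag.strip() == "":
                    if frag:           # a run of whitespace
                        stream.append(BOUNDARY)
                else:
                    stream.append(frag)
    return stream

def fuse(stream):
    # Fuse all text between two boundary markers into a single argument.
    args, current, seen = [], "", False
    for item in stream:
        if item is BOUNDARY:
            if seen:
                args.append(current)
            current, seen = "", False
        else:
            current += item
            seen = True
    if seen:
        args.append(current)
    return args

vs = {"somevar": "a b", "var1": "x", "var2": "y z"}
print(fuse(evaluate_to_stream("-h$somevar $var1$var2", vs)))
# ['-ha b', 'xy z']
```

Both '-h$somevar' and '$var1$var2' come out as one argument apiece, exactly because no boundary marker ever appears between their pieces. (In this sketch a variable that expands to nothing still yields a '' empty argument; you could equally decide the other way.)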
2011-07-22
String expansion and securely running programs on Unix
One of the corollaries of how to securely run programs on Unix is that a general purpose, generic string expansion system is a bad fit with securely running programs. The problem is that there is a fundamental clash of goals between the two systems: a generic string expansion system wants to treat everything as a generic string to be expanded (regardless of what it actually is), and a secure system for running programs wants to tokenize everything using simple rules.
At this point I am going to pick on Exim for illustrative examples. Unfortunately, Exim tries to have it both ways at once and thus is a great source for showing the problems that this causes, no matter how much I like it otherwise. Please note that the problems here are generic; any program that takes either approach (or both at once as Exim does) will have the same issues.
First up is Exim's av_scanner setting. This is not expanded at all
unless it starts with a '$', at which point the entire string must
be expanded before Exim knows how to tokenize it:
av_scanner = ${if bool{true} {cmdline:/opt/avscanner $recipients %s}}
If you are concerned about arbitrary characters appearing in
$recipients, there is no way to make this secure (as discussed
before).
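You can see the core problem in a few lines of Python. This is a deliberately crude stand-in for what expand-then-tokenize amounts to, not Exim's actual code, with an attacker-chosen recipient address as the payload:

```python
# Toy demonstration of why expanding first and tokenizing afterward is
# unfixable: attacker-controlled $recipients rewrites the command line.
recipients = "victim@example.com -d --unsafe-flag"   # attacker-chosen
setting = "/opt/avscanner $recipients %s"
argv = setting.replace("$recipients", recipients).split()
print(argv)
# ['/opt/avscanner', 'victim@example.com', '-d', '--unsafe-flag', '%s']
```

The whitespace that the recipient smuggled in has become token boundaries, and the scanner now gets two extra option arguments that the configuration author never wrote.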
Second, the command setting for running things in pipes. This
tokenizes things before string expansion, but it does the tokenization
purely on a textual basis. As the documentation notes, this causes
serious problems:
command = /some/path ${if eq{$local_part}{postmaster} {xx} {yy}}
Since tokenization is expansion-blind, this fails because all the string
expansion evaluator winds up seeing is '${if' (which is a clear syntax
error). To get this to work you have to force the tokenizer to treat the
entire string expansion as a single token by 'quoting' it.
(The documentation does not quite put it the way that I have here.)
A side effect of tokenization before expansion is that a single string
expansion can only ever expand to a single argument. (You may or may not
be able to expand to nothing instead of a '' empty argument, depending
on the implementation.)
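A sketch of this tokenize-then-expand scheme in Python, using shlex's shell-style quoting as a stand-in for Exim's quoting rules and a bare '$name' expander as a stand-in for the full ${...} language: quoting decides the token boundaries up front, and each token then expands to exactly one argument.

```python
import re
import shlex

def run_setting(setting, variables):
    """Tokenize first (quotes glue a whole expansion expression into
    one token), then expand each token independently."""
    tokens = shlex.split(setting)      # quoting decides token boundaries
    def expand(tok):
        return re.sub(r"\$(\w+)",
                      lambda m: variables.get(m.group(1), ""), tok)
    # Each token expands to exactly one argument, whatever it contains.
    return [expand(t) for t in tokens]

print(run_setting('/some/path "--user=$local_part extra"',
                  {"local_part": "postmaster"}))
# ['/some/path', '--user=postmaster extra']
```

Without the quotes, the tokenizer would have split the setting blind and the expander would have seen '--user=$local_part' and 'extra' as unrelated tokens; with them, the whole expression stays one argument no matter what it expands to.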
What this points out is that command line tokenization and string
expansion need to be aware of each other. Once the dust settles,
either string expansion needs to be able to mark hard token boundaries
(so that $recipients can be marked as a single token regardless of
contents) or tokenization needs to know about the string expansion
language (so that ${if ...} can be parsed into a single token
despite the presence of internal spaces or other special characters).
(I have opinions on the answer here, but this entry is already long enough as it is.)
PS: if you want to be secure with minimal effort, it's clear that you need to do tokenization before expansion and provide some sort of 'quoting' mechanism to glue a string expansion expression into a single token. This is secure while being merely inconvenient and annoying to people writing configuration files. Simple expansion before tokenization cannot be made secure at all, as previously discussed.