Avoiding the classic C quoting bug in your language

December 16, 2011

To summarize an earlier entry, the classic C quoting bug that I'm talking about here is writing 'printf(ustr)' or 'syslog(pri, ustr)' where 'ustr' is a string that comes from some form of external input. In that entry I mentioned that it's possible for languages to avoid this entire class of bugs with the right design. Well, it's time to elaborate on that parenthetical aside.

To start with, let's turn the issue around: what is it about these functions that causes the bug? The answer is that all of these functions do two things: they do something useful, and in the process of doing it they format their arguments for you. The bug happens when you (the programmer) just want to do the useful thing and you overlook the fact that you're also getting the formatting for free.

The conclusion is straightforward. To make this bug impossible, we need to make functions like this do only one thing; they should take a plain string and not format it. But people still need string formatting, so to make these single-purpose functions both feasible and convenient we need a way to format strings that involves as close to no extra work as possible. In short, we need effortless generic string formatting.
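As a minimal sketch of the difference, here is a Python analogue. The names 'log_formatted' and 'log_plain' are made up for illustration and are not real library functions. In Python the mistake surfaces as an exception rather than as the undefined behavior you get in C, but the shape of the bug is the same.

```python
def log_formatted(fmt, *args):
    # Hypothetical dual-purpose function: it formats its arguments
    # and then does the useful thing (here, just building a log line).
    return "LOG: " + (fmt % args)


def log_plain(message):
    # Hypothetical single-purpose function: the string is used as-is.
    return "LOG: " + message


# Pretend this came from some form of external input.
ustr = "user typed %s here"

try:
    log_formatted(ustr)  # the analogue of printf(ustr)
    outcome = "no error"
except TypeError:
    # The '%s' in the input has no matching argument, so formatting fails.
    outcome = "TypeError"

safe_line = log_plain(ustr)  # the '%s' is kept literally
```

With the single-purpose version there is nothing to overlook; a stray '%s' in the input is just more text.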

(The need for no extra work is why we can't do this in C. We need a language where you don't need to explicitly manage string lifetimes, since the result of string formatting is another string.)

In theory you can implement generic string formatting as a function call (ideally with a very short function name). In practice I think that it isn't going to work the way you want because of perception issues; if string formatting is just a function call, it's still tempting to create convenience functions that bundle the two function calls together for you (one to format the arguments, then one to do the useful thing). Doing generic string formatting as an operator (such as Python's '%') has the pragmatic benefit of drawing a much more distinct line between regular function calls and string formatting.
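To illustrate the operator approach, here is a small sketch using Python's '%' at the call site. The 'emit' function is a hypothetical single-purpose sink, standing in for something like a logging call, and the values are invented for the example.

```python
def emit(message):
    # Hypothetical single-purpose function: takes a plain string,
    # does no formatting of its own.
    return "app: " + message


# Pretend 'user' came from external input and contains a format directive.
user = "al%sice"

# Formatting happens visibly at the call site, via the operator.
line = emit("login failed for %s from %s" % (user, "10.0.0.1"))
```

Because the format string is a literal written by the programmer, the '%s' inside 'user' is substituted in as ordinary text and is never interpreted as a directive.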

(The third approach to effortless generic string formatting is string interpolation in certain sorts of strings. This has the benefit of sidestepping the entire problem, although it has problems of its own.)
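One concrete form of string interpolation is Python's f-strings, which were added in Python 3.6, well after this entry was written. They are used here only as an example of the general idea, and the variable names are invented.

```python
# Pretend 'user' came from external input.
user = "%s%n"
attempts = 3

# The values are interpolated directly into the string literal;
# there is no separate format call to bundle into another function.
message = f"{user} failed {attempts} times"
```

The result is a plain string that can be handed to any single-purpose function.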

PS: another approach in an OO language is to give strings an explicit formatting or interpolation method, so that you might write '"%s: %s".fmt(a, b)'. My gut thinks that this is closer to string formatting as an operator than anything else.
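Python has a real example of the method approach in 'str.format', which uses '{}' placeholders rather than '%s'. A minimal sketch, with made-up values:

```python
a = "disk"
b = "95% full"  # a '%' in the data is harmless here

# The formatting method is called on the format string itself.
line = "{}: {}".format(a, b)
```

Since the method lives on the string, the formatting step still reads as part of building the string and not as one more function to wrap.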

Written on 16 December 2011.

Last modified: Fri Dec 16 00:36:14 2011