2013-08-26
An example GNU Readline quoting function
Because I would have killed for an actual example when I was doing this,
here is the quoting function I'm currently using (as kind of promised
in my entry on how to do this in general). It does
rc-style filename quoting (put a ' at the front and end and escape
every ' with a second '):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <readline/readline.h>

char *my_rl_quote(char *text, int m_t, char *qp) {
    char *r, *p, *tp;

    /* The worst case is that every character of
       text needs to be escaped; at that point we
       need 2x its space plus the ' at the start
       and end and a NUL byte. */
    p = r = malloc(strlen(text)*2 + 3);
    if (r == NULL)
        return NULL;
    *p++ = '\'';
    for (tp = text; *tp; tp++) {
        /* double every ' in the filename */
        if (*tp == '\'')
            *p++ = '\'';
        *p++ = *tp;
    }
    /* only close the quote when the completion is final */
    if (m_t == SINGLE_MATCH)
        *p++ = '\'';
    *p++ = 0;
    return r;
}
(I suppose you could bum the for loop a bit more by making the
increment operation be '*p++ = *tp++' and taking out the last line of
the loop body. I feel that would make it a little bit too clever.)
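For the curious, the bummed loop might look something like the sketch below. So that it compiles without Readline, it is written as a standalone function; the name rc_quote_bummed() and the close_quote flag (standing in for the m_t == SINGLE_MATCH test) are made up for this illustration and are not part of my real code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical standalone variant of my_rl_quote() with the copy
   folded into the for loop's increment step. close_quote stands in
   for the (m_t == SINGLE_MATCH) test. The caller free()s the result. */
char *rc_quote_bummed(const char *text, int close_quote) {
    const char *tp;
    char *r, *p;

    /* Worst case: every character doubled, plus two quotes and a NUL. */
    p = r = malloc(strlen(text) * 2 + 3);
    if (r == NULL)
        return NULL;
    *p++ = '\'';
    /* The increment step copies the current character and advances
       both pointers; the body only adds the extra ' when needed. */
    for (tp = text; *tp; *p++ = *tp++) {
        if (*tp == '\'')
            *p++ = '\'';
    }
    if (close_quote)
        *p++ = '\'';
    *p = '\0';
    return r;
}
```

Calling rc_quote_bummed("it's", 1) should give back 'it''s', the same as the plain version of the loop would.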
You should set up things for Readline as follows:
    rl_filename_quoting_function = my_rl_quote;
    rl_attempted_completion_function = my_rl_yesquote;
    rl_completer_quote_characters = "'";
    /* for rc, probably incomplete: */
    rl_filename_quote_characters = " `'=[]{}()<>|&\\\t";
my_rl_yesquote is from my entry on how to do this in general.
2013-08-25
Adding basic quoting to your use of GNU Readline
Suppose that you have a program that makes basic use of GNU Readline
(essentially just calling readline()) and you want to add the feature
of quoting filename expansions when it's needed. Sadly the GNU Readline
documentation is a little bit scanty on what you need to do, so here is
what has worked for me.
(The rest of this assumes that you've read the Readline programming documentation.)
As documented in the manual (eventually) you first need a function
that will actually do the quoting, which you will activate by pointing
rl_filename_quoting_function at it. Although the documentation neglects
to mention it, this function must return a malloc()'d string; Readline
will free() it for you. As far as I can tell from running my code
under valgrind, you don't need to free() the TEXT argument you are
handed.
You must also set rl_filename_quote_characters and
rl_completer_quote_characters to appropriate values. To be fully
correct you probably also want to define a dequoter function, but
I've gotten away without it so far. In simple cases Readline will
simply ignore your quote character at the front when doing further
filename completion; I think you only need a dequoter function to
handle the case where you've had to escape something in the filename.
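For what it's worth, a dequoter for rc-style quoting might look like the following. This is an untested-against-Readline sketch, not code I'm running: the name my_rl_dequote() is made up, the (text, quote_char) shape is what Readline documents for rl_filename_dequoting_function, and I'm assuming that the text handed in does not include the opening quote when quote_char is set and that the result must be malloc()'d, as with the quoting function.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of an rc-style dequoter with the shape Readline documents
   for rl_filename_dequoting_function. ASSUMPTION: text does not
   include the opening quote when quote_char is '\''. Returns a
   malloc()'d string, or NULL if malloc() fails. */
char *my_rl_dequote(char *text, int quote_char) {
    int in_quote = (quote_char == '\'');
    char *r, *p, *tp;

    /* Dequoting only ever removes characters, so the result can
       never be longer than the input. */
    p = r = malloc(strlen(text) + 1);
    if (r == NULL)
        return NULL;
    for (tp = text; *tp; tp++) {
        if (*tp != '\'') {
            *p++ = *tp;             /* ordinary character */
        } else if (in_quote && tp[1] == '\'') {
            *p++ = '\'';            /* '' inside quotes is one literal ' */
            tp++;
        } else {
            in_quote = !in_quote;   /* an opening or closing quote */
        }
    }
    *p = '\0';
    return r;
}
```

You would presumably activate it with 'rl_filename_dequoting_function = my_rl_dequote;' next to the other settings. With quote_char set to ', the text it''s comes back as it's.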
With a sane library this would be good enough. But contrary to what the documentation alleges, this doesn't seem to be sufficient for Readline. Instead you need to hook into Readline completion in order to tell Readline that yes really, it should quote things. You do this by the following:
char **my_rl_yesquote(const char *init, int start, int end) {
    rl_filename_quoting_desired = 1;
    return NULL;
}

/* initialize by setting:
     rl_attempted_completion_function = my_rl_yesquote;
*/
Your 'attempted completion function' exists purely for this, although you can of course do more if you want. Note that the need for this function and its actions is in direct contradiction to the Readline documentation as far as I can tell. On the other hand, following the documentation doesn't work (yes, I tried it). Possibly there is some magic involved in just how you invoke Readline and some unintentional side effects going on.
(On the other hand I got this from a Stackoverflow answer, so other people are having the same problem.)
Note that a really good job of quoting and dequoting filenames needs a certain number of other functions, per the Readline documentation. I can't be bothered to worry about them (or write them) so far.
I was going to put my actual code in here as an example but it turns out it is too embarrassingly ugly and hacky for me to do it in its current state and I'm not willing to include cleaner code that I haven't actually run and tested. Check back later for acceptable code that I know doesn't explode.
(Normally I clean up my hacky 'it finally works' first pass code, but I was rather irritated by the time I got something that worked so I just stopped and put the whole thing out of my mind.)
Update: my example quoting function is now in ReadlineQuotingExample.
2013-08-15
My understanding of modern C undefined behavior and its effects
Back in the old days, it was famously said that using undefined behavior
in your C program gave the compiler license to delete all of your files
if it felt like it. When people heard that we laughed, nodded sagely,
and went cheerfully on our way because of course no actual compiler was
ever going to react to undefined behavior in that way and everyone knew
it. (The closest real compilers ever came to that was how early versions
of GCC reacted to #pragma.)
This left a whole generation of programmers with the attitude that C's large collection of undefined and implementation defined behavior was no big deal. Different CPUs or compilers might behave differently but the whole result would be fundamentally sane and often even predictable in advance (given knowledge of CPU behavior).
In the modern world, as John Regehr has taught me, this is both wrong and dangerous. Modern compilers do not delete your files or launch ICBMs when they encounter undefined behavior, because that would still be very stupid. Instead they do something much more dangerous: modern compilers will assume that undefined behavior can't happen. This knowledge that certain things can't happen is then used in optimization; for example, the compiler may deduce things about variable values which then get fed through into dead code elimination and pretty soon you are removing a security check because the compiler knows it can 'never' trigger (in proper code).
(That led to a cute Linux kernel security vulnerability, by the way.)
The practical upshot is that it is now basically impossible to reason about how a chunk of code will behave in the face of undefined behavior, and in any case the behavior changes as compilers change. To even start requires a thorough understanding of modern compiler optimizations and a ruthlessly objective skeptic's eye so that you can see what the code actually says, not what you think it does. Only then are you in a position to start following the implications of, say, dereferencing a structure pointer as part of local variable initialization before you explicitly check said pointer to see if it's NULL.
Or in short, modern C compilers do terrifying things with undefined behavior.
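To make the structure pointer case concrete, here is a small made-up illustration (struct conn and both function names are invented for this example; this is not the actual kernel code). Whether any particular compiler really removes the check depends on the compiler, its version, and its optimization settings.

```c
#include <stddef.h>

struct conn {
    int fd;
};

/* Risky: c is dereferenced before it is checked. Dereferencing a
   NULL pointer is undefined behavior, so a compiler is allowed to
   assume c != NULL from that point on and delete the check below
   as dead code. */
int conn_fd_unchecked(struct conn *c) {
    int fd = c->fd;
    if (c == NULL)
        return -1;
    return fd;
}

/* Safer: check first, dereference afterwards. There is no
   undefined behavior here for the compiler to exploit. */
int conn_fd_checked(struct conn *c) {
    if (c == NULL)
        return -1;
    return c->fd;
}
```

The two functions look equivalent if you assume a NULL dereference either traps or reads garbage, which is exactly the old-days attitude. Only the second one is guaranteed to return -1 for a NULL pointer.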
PS: I recommend you read John Regehr's blog. It's hair-raising.
(This was inspired by C J Silverio pointing to this HN comment.)
2013-08-13
You should convert wikitext to HTML through an AST
Suppose that you are turning wikitext or some other form of structured markup into HTML. The straightforward and often easiest way to do this is to directly generate the HTML as you process the wikitext; when you encounter and parse a particular bit of markup, you immediately output the relevant HTML. Having done this and stubbed my toes very vigorously, I have a bit of advice: you should parse into an AST and then generate HTML from that AST. Yes, it's more code and it seems more indirect, but it has some significant advantages.
The first general advantage is that it decouples the process of parsing your wikitext from the process of generating HTML. Rather than being two sides of a single chunk of code they now communicate through an API, the AST. The AST then gives you a vantage point to examine and verify each side of the process independently (and to evolve them separately). For example, if you're working on the parsing code you can verify that the results are the same by checking the AST instead of having to compare the output HTML.
(If you use automated tests I expect that having an AST in the middle will make both parsing and HTML generation much easier to test. It should also make it much less annoying to evolve either side, because many fewer tests are likely to need changes if you change parsing or HTML generation.)
The second general advantage is that once you have an AST you don't have to output just HTML. For instance (as I mentioned once before) you can output a different wikitext dialect, giving you a fully reliable way of doing wikitext format conversions. Decide that some part of your markup should be different? Now you can fix that. Or you could transition to a significantly different format (eg, to Markdown or MediaWiki from your own custom format) without giving your users and yourself heartburn. All of these options are simply an AST walker away.
(Go shows the power of being able to do this
sort of change automatically and reliably with their 'go fix' tool, which they've used to do any number
of language and library transitions. My impression is that the existence
of go fix makes the Go people more willing to make such changes.)
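As a toy illustration of 'one AST, several outputs' (this is a sketch with an invented three-node markup, not my actual wikitext code, and every name in it is made up), here is a tiny AST in C with a single walker that is driven by a table describing the output format:

```c
#include <stddef.h>
#include <string.h>

enum kind { N_PARA, N_BOLD, N_TEXT, N_KINDS };

struct node {
    enum kind kind;
    const char *text;                /* only used by N_TEXT */
    const struct node *const *kids;  /* NULL-terminated; NULL for N_TEXT */
};

/* An output format is just the text wrapped around each node kind,
   indexed by enum kind. */
struct format {
    const char *open[N_KINDS];
    const char *close[N_KINDS];
};

const struct format html_format = {
    { "<p>", "<b>", "" }, { "</p>\n", "</b>", "" }
};
const struct format markdown_format = {
    { "", "**", "" }, { "\n\n", "**", "" }
};

/* Append s to out, never using more than cap bytes in total. */
static void emit(char *out, size_t cap, const char *s) {
    strncat(out, s, cap - strlen(out) - 1);
}

/* Walk the AST, appending its rendering to out. Text is copied
   as-is; a real renderer would also escape it for the format. */
void render(const struct node *n, const struct format *f,
            char *out, size_t cap) {
    const struct node *const *k;

    emit(out, cap, f->open[n->kind]);
    if (n->kind == N_TEXT)
        emit(out, cap, n->text);
    for (k = n->kids; k != NULL && *k != NULL; k++)
        render(*k, f, out, cap);
    emit(out, cap, f->close[n->kind]);
}

/* The AST for one paragraph: "hello " followed by a bold "world". */
const struct node hello = { N_TEXT, "hello ", NULL };
const struct node world = { N_TEXT, "world", NULL };
const struct node *const bold_kids[] = { &world, NULL };
const struct node bold = { N_BOLD, NULL, bold_kids };
const struct node *const para_kids[] = { &hello, &bold, NULL };
const struct node sample_para = { N_PARA, NULL, para_kids };
```

Starting from an empty buffer, render(&sample_para, &html_format, buf, sizeof buf) produces the HTML for the paragraph, and passing &markdown_format instead produces the Markdown version from exactly the same tree. The parser never has to know which one you wanted.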
A smaller advantage of an AST is that it gives you structured information. As I've found out the hard way, a large monolithic blob of HTML is not necessarily what you want. Even when you want HTML (as opposed to metadata) it can be very useful to get things like 'the first paragraph' or 'every top-level section header text' and so on. Generating HTML from an AST also lets you defer certain rendering decisions until very late in the process; this can let you cache more (or cache things more easily).
Another AST advantage is simply that it will almost certainly push you to write a relatively systematic parser for your wikitext. Real parsers are important because they are easier to understand.
(This was inspired by the comment left on my earlier entry about my mistake. My new revised code still falls well short of producing an AST, but if I was writing a new parser from scratch I've realized that I definitely would go to an AST as the intermediate form.)
2013-08-07
Understanding how generators help asynchronous programming
I've been reading for a while about how generators mean we can do callback-free asynchronous programming instead of being trapped in callback hell (lately all of the buzz has been about generators in JavaScript; this is typical of what I've read). But I have to confess that I never really got how the whole thing actually worked; all of the example code that people wrote seemed to have a great big 'and then magic happens' surrounding it. Recently I finally had a sudden burst of enlightenment about how it all works (after banging my head on yet another article about JavaScript generators). This is my attempt to explain that enlightenment, if only to stick it in my head.
(I'm going to use a pseudo-Python for example code rather than try to pretend that I can write valid JavaScript without careful testing.)
Let's start with some sort of routine that we want to write in a straight-line way while actually having it be asynchronous:
def process(request):
    user = yield db.getuser(request.user)
    group = yield db.getgroup(request.group)
    ....
The key trick here is that yield (and generators in general) allow
two-way communication. process() both returns values to the outside
world (when it uses yield) and can have values injected into it (as
the value of those yield expressions). These two sorts of values are
not necessarily the same thing, even though they look like it. What
process() yields to its caller is not necessarily what its caller
gives it back.
This leads to how the magic happens. The db.get* functions don't
return actual results; instead they return some sort of object which
will let us register a callback to be invoked when their operation
completes. The main loop takes these objects (returned through the
yields) and registers something which will add process() (and the
value to inject back into it) to a scheduling queue. When the callback
fires the main loop will wind up restarting process() with the actual
result of the database lookup, which emerges as yield's value inside
process(). Effectively the main loop's job is to convert delayed,
asynchronous results into actual results and then give them back to
process() (and other similar routines).
(You could re-invoke process() directly from the callback but it's
possible that the code structure will get more involved that way.)
The main loop and the db.get* functions may be complex, but those
only get written once (they're library routines). Everyone writes lots
of versions of process() and those get to be simple (or at least
simpler).
PS: the main loop needs some additional magic, of course, because the outside world has to inject traffic into this whole thing somewhere. I wave my hands about that part.