Wandering Thoughts archives

2015-04-20

An interesting trick for handling line numbers in little languages

One of the moderately annoying issues you have to deal with when writing a lexer for a language is handling line numbers. Being able to report line numbers is important for passably good error messages, but actually doing this can be a pain in the rear end.

The usual straightforward way is to have your lexer keep track of the current line number and make it available to higher levels on demand. One problem this runs into is that the lexer's current position is not necessarily where the error actually is. The simple case is a language without multi-line constructs, but even there you can wind up off by a line in some situations (for example, when the parser has already pulled in a lookahead token from the start of the next line before it notices the error).
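
As a minimal sketch of this approach in Go (all of the names here are made up for illustration, not taken from any real lexer):

    package main

    import "fmt"

    // lexer counts newlines itself as it consumes input, the
    // straightforward approach described above.
    type lexer struct {
        input string // the entire input
        pos   int    // current byte position in input
        line  int    // current line number, 1-based
    }

    // next consumes and returns one byte, bumping the line count on
    // newlines. It assumes the caller has checked that input remains.
    func (l *lexer) next() byte {
        c := l.input[l.pos]
        l.pos++
        if c == '\n' {
            l.line++
        }
        return c
    }

    func main() {
        l := &lexer{input: "a\nb\n", line: 1}
        for l.pos < len(l.input) {
            l.next()
        }
        fmt.Println("now on line", l.line) // prints: now on line 3
    }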

A more sophisticated approach is to include the line number (and perhaps the position in the line) as part of what you return for every token. Both the parser and the lexer can then use this to report accurate positions for everything without any problems, although the lexer still has to keep counting lines and so on.
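
A hypothetical Go version of such a token (the type and field names are invented for this sketch) might look like:

    package main

    import "fmt"

    type tokenType int // a real lexer would enumerate its token kinds

    // token carries its own position, so neither the parser nor anything
    // downstream has to ask the lexer where an error happened.
    type token struct {
        typ  tokenType
        val  string
        line int // line the token started on, 1-based
        col  int // byte offset within that line
    }

    func main() {
        t := token{val: "foo", line: 3, col: 7}
        fmt.Printf("%q at line %d, col %d\n", t.val, t.line, t.col)
    }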

Somewhat recently I wound up writing a lexer in Go as part of a project, and I modeled it after Rob Pike's presentation on lexing in Go. Pike's lexer uses an interesting low-rent trick for handling line numbers, although it's one that's only suitable for use with a little language. Pike's lexer is given the entire source code to lex at once, so rather than explicitly tracking line numbers it just tracks the absolute character position in the source code (which it needs anyways) and includes this absolute character position as part of the tokens. If you turn out to need the line number, you call back to the lexer with the character position and the lexer counts how many newlines there are between the start of the source and the position.
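
Here is a rough sketch of the idea in Go. The names are mine, not Pike's, and a real lexer has much more to it, but it shows the core of the trick: tokens carry only an absolute position, and the line number is computed only when someone actually asks.

    package main

    import (
        "fmt"
        "strings"
    )

    // lexer holds the entire source in memory, which is what makes
    // the trick possible.
    type lexer struct {
        input string
    }

    type token struct {
        val string
        pos int // absolute byte position in input
    }

    // lineNumber counts the newlines before pos. It's only called on
    // the error path, so the common error-free case does no line
    // counting at all.
    func (l *lexer) lineNumber(pos int) int {
        return 1 + strings.Count(l.input[:pos], "\n")
    }

    func main() {
        l := &lexer{input: "first\nsecond\nthird\n"}
        t := token{val: "third", pos: strings.Index(l.input, "third")}
        fmt.Println("error at line", l.lineNumber(t.pos)) // error at line 3
    }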

Ever since I saw it this has struck me as a really clever approach if you can get away with it. Not only is it really easy to implement, but it's optimized for the common case of not needing the line number at all because you're parsing something without errors. Now that I've run into it, I'll probably reuse it in all future little language lexers.

Note that this isn't a general approach for several reasons. First, serious lexers are generally stream lexers that don't first read all of the source code into memory. Second, many languages routinely want line number information for things like profiling, debugging, and exception traces (and all of these uses are well after lexing has finished). That's why I say Pike's approach here is best for little languages, where it's common to read all of the source in at once for simplicity and you generally don't have those issues.

(If I were dealing with a 'bigger' language, I think that today I would take the approach of returning the line number as part of every token. It bulks up the returned token a bit, but having the line number information directly in the token makes your life simpler in the long run, as I found out from the Go parser I wrote.)

LexerLineNumbersTrick written at 00:41:52

2015-04-17

In practice, programmers mostly understand complexity by superstition

A while back I said that a significant amount of programming is done by superstition. A corollary of that is that a great deal about programming is also understood primarily through superstition and mythology, rather than through the actual deep technical details and academic definitions of things. Today I want to point specifically to complexity.

You are a good, well educated programmer, so you know what 'constant time' or O(1) really means in the context of hash tables (for example). You probably know many of the effects that can keep hash table operations from actually being fast: 'constant time' really only means constant relative to the number of entries, there are pathological cases that violate even that (like extensive hash collisions), there's the practical importance of constant factors, there's the time taken by other necessary operations (which may not be constant time themselves), and then there are the low-level effects of real hardware (RAM fetches, TLB fills, L2 cache misses, and so on). This is an incomplete list, because everything is complex when you dig in.
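
As a deliberately silly illustration of the hash collision case (this is a toy in Go, not how any real hash table is implemented): with a pathologically bad hash function, 'constant time' lookup quietly turns into a linear scan.

    package main

    import "fmt"

    type entry struct {
        key, val string
    }

    type table struct {
        buckets [][]entry
    }

    // badHash sends every key to bucket 0, the worst-case collision
    // behavior that real attacks on hash tables try to provoke.
    func badHash(key string) int { return 0 }

    func (t *table) get(key string) (string, bool) {
        // With everything in one bucket, this scan is O(n), not O(1).
        for _, e := range t.buckets[badHash(key)] {
            if e.key == key {
                return e.val, true
            }
        }
        return "", false
    }

    func main() {
        t := &table{buckets: make([][]entry, 8)}
        for i := 0; i < 1000; i++ {
            k := fmt.Sprintf("key%d", i)
            b := badHash(k)
            t.buckets[b] = append(t.buckets[b], entry{k, "v"})
        }
        _, ok := t.get("key999") // walks all 1000 entries
        fmt.Println(ok)
    }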

Most programmers either don't know about this or don't think about it very much. If something is called 'constant time', they will generally expect it to be fast and to be consistently so under most conditions (certainly normal conditions). Similar things hold for other broad complexity classes, like linear or n log n. In fact you're probably doing well if a typical programmer can even remember what complexity class things are in. Generally this doesn't matter; as before, the important thing is the end result. If you're writing code with a significantly wrong understanding of something's practical behavior and thus time profile, you're probably going to notice.

(Not always, though; see Accidentally Quadratic, especially the rationale. You can choose to see this as a broad problem of mismatch between superstition and actual code behavior, in that people are throwing away performance without realizing it.)

ComplexitySuperstition written at 02:02:54

