Some thoughts on testing parsers
I've been writing a parser lately. Since I am not completely crazy, this means that I've been writing tests for it too. This has naturally left me with some thoughts about testing one's parsers in ways that don't completely drive you up the wall.
(The disclaimer is that I've never tried to write a parser for a serious language. All of mine have been for relatively modest DSLs.)
First off, you really want to be able to serialize your parsed syntax tree to some canonical string representation, or at least something very close to this. In my biased opinion, building and comparing complex objects in tests is for the birds; serializing a complex tree to a string gives you a clean way of comparing parse results with what you expected. The serialization should obviously be unambiguous, even if this sprays it with more parentheses than anyone sane would normally use. The serialization format doesn't have to be complete (you might drop some extraneous information), but it helps if it comes close.
(Your test serialization format doesn't have to be your debugging dump format, of course. Sometimes languages push you towards doing this because they provide a single convenient way to stringify an object, but you can always resist and also write a nicer display format with nice tree indentation and so on.)
Once I have a serialization format, it's also proven useful to make it one that is valid input for the parser. Among other things, this lets you test what happens when you round-trip the serialization output through the parser again. Generally you want to wind up with the same ouput. In the past different output has been a sign that, for example, I was letting unnecessary AST nodes sneak into my parses.
(To detect this sort of stuff you want every AST node in the tree to leave its fingerprints somewhere in the serialization output.)
I suspect that this is only really feasible with relatively simple languages and parsers. Sophisticated parse tree transformations are probably going to make it too hard to go back to anything that really resembles language input.
Beyond that, it's useful to be able to directly invoke specific subsections of your parser, for example the bits dealing with expressions. Direct access makes it much easier to test these subsections; you can just write the appropriate input text without having to wrap it up in a complete and fully syntactically valid program, function, or whatever.
(Parsing less also makes for smaller parse trees and smaller serialized representations, which makes it easier to see things.)
Unfortunately this can wind up in conflict with the most natural way to code your parser. It can be easiest for the internal sub-pieces of the parser to assume a fair amount about the environment they're running in and want a lot of stuff set up already.
(I've been fortunate to write my parsers so far in languages where my tests could be extremely intimate with the insides of the parser, so they can at least in theory reach through all of this. The resulting tests are of course a bit fragile; if I change the internal details of how the bits of the parser interact, the tests are going to explode.)
PS: I suspect that people who write parsers more seriously have developed a bunch of techniques and tricks for good testing. As always, I don't feel quite motivated enough to spend a bunch of time trying to find and digest their stockpiles of information for what is a hobby project where I feel most motivated when I get stuff running instead of writing a lot of test cases.