Wandering Thoughts archives

2022-06-03

Regular expressions are effectively a (hard) programming language

A while back I read Hillel Wayne's Regexes are Cool and Good, which sparked some thoughts about why regular expressions are famously considered difficult. So here is one of them: regular expressions are effectively a programming language, and that language is a 'hard' one. This gives regular expressions two broad problems.

The first problem is the general problem of working with programming languages. Writing code in any programming language requires figuring out how to solve your problem with the features that the language has, without overlooking anything or creating an incomplete solution. Figuring out what code in any programming language does often requires simulating that code in your head, either with concrete inputs or with abstract ones, and if you overlook anything you'll get a wrong answer.

The second problem is that the language most regular expressions are written in is, objectively, a terrible language. It's extremely compact and uses a wide assortment of symbols with specific meanings; as a result, a small slip in writing or reading a regular expression can give you a very different result. It also has a peculiar 'execution model', one that effectively requires people to keep track of a potentially large amount of state. Some of this can be addressed by writing regular expressions in the verbose format, which allows whitespace and comments.
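As a small illustration of how much a single character matters, here is a sketch using Python's re module. The choice of Python and both patterns are my own assumptions for the example, not taken from any particular real case. The two patterns differ only by one backslash.

```python
import re

# Intended: match a decimal number such as "3.14" and nothing else.
strict = re.compile(r"\d+\.\d+")  # escaped dot: a literal '.'
loose = re.compile(r"\d+.\d+")    # bare dot: any single character

# Illustrative inputs; only the first is a decimal number.
samples = ["3.14", "3x14", "2022-06"]

strict_hits = [s for s in samples if strict.fullmatch(s)]
loose_hits = [s for s in samples if loose.fullmatch(s)]

print(strict_hits)
print(loose_hits)
```

The pattern with the escaped dot accepts only the first sample, while the one with the bare dot accepts all three, and nothing about the two patterns makes the difference stand out when you skim them.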

(As a programming language, regular expressions also lack almost all of the supporting tooling for reading, writing, and debugging them. Some of this could be provided by editors and IDEs, but it mostly isn't today.)

Considered just as a programming language (without worrying about syntax), regular expressions may never be an easy thing to write or follow because their execution model is so different from how programs normally work. Part of this is human psychology; we're not the best at considering all of the corner cases, not missing anything, and not being optimistic about things meeting our expectations (so that we spot problems like too-long or too-short matches). Throw in alternation and the number of things we have to keep track of can go up drastically.
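Two of these traps can be shown in a few lines. This is a sketch that assumes Python's re module, which is a backtracking engine; engines with POSIX leftmost-longest semantics behave differently on the alternation case. The sample strings are made up for the illustration.

```python
import re

text = 'say "hello" and "goodbye"'

# Too-long match: the greedy '.*' runs to the end of the text and then
# backs up only as far as the LAST quote, swallowing both quoted strings.
too_long = re.search(r'".*"', text).group()

# Excluding quotes from the body makes the match stop at the first closing quote.
first_only = re.search(r'"[^"]*"', text).group()

# Too-short match: alternatives are tried left to right and the first
# one that succeeds wins, even if a later one would match more.
short_first = re.match(r"cat|category", "category").group()
long_first = re.match(r"category|cat", "category").group()

print(too_long)
print(first_only)
print(short_first, long_first)
```

Here the greedy pattern returns everything from the first quote to the last one, and the alternation with the shorter word first returns only "cat". In both cases the regular expression does match, which is exactly why the optimistic reading ("it found something, so it works") misses the problem.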

All of this makes me think that the regular expressions I write probably need better documentation, but I'm not sure how to do that. Perhaps I should write them in verbose form and annotate every section with what it matches (including a literal text sample) and why it works. In environments without verbose regular expressions, I could at least write comments that break down the regular expression into sections against a sample input.
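Here is a rough sketch of what that annotated verbose form might look like, assuming Python's re.VERBOSE flag. The log line format, the sample text, and the group names are all hypothetical, invented for the illustration.

```python
import re

# Hypothetical sample input that the comments below refer to:
#   2022-06-03 22:26:42 sshd[1234]: Accepted publickey for alice
LOG_LINE = re.compile(r"""
    ^
    (?P<date>\d{4}-\d{2}-\d{2})   # '2022-06-03': ISO-style date
    [ ]                           # one space; bare whitespace is ignored in verbose mode
    (?P<time>\d{2}:\d{2}:\d{2})   # '22:26:42'
    [ ]
    (?P<prog>[^\[\s:]+)           # 'sshd': stops at '[', ':' or whitespace
    (?:\[(?P<pid>\d+)\])?         # '[1234]': optional, some programs log no PID
    :[ ]                          # ': ' separating the header from the message
    (?P<msg>.*)                   # 'Accepted publickey for alice': rest of the line
    $
""", re.VERBOSE)

sample = "2022-06-03 22:26:42 sshd[1234]: Accepted publickey for alice"
fields = LOG_LINE.match(sample).groupdict()

# A line without the optional PID section should still match.
no_pid = LOG_LINE.match("2022-06-03 22:26:42 cron: job started").groupdict()

print(fields)
print(no_pid)
```

Writing each piece next to the sample text it is supposed to consume forces the question of what happens when that piece is missing or different, which is how the optional PID section came to be written down explicitly here.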

(As usual, writing out the comments may well make me realize that my regular expression is wrong or incomplete, since writing comments is a form of talking to the rubber duck.)

tech/RegularExpressionsHardLanguage written at 22:26:42

