Wandering Thoughts archives

2013-09-15

Regular expression performance and performance folklore

I've spent a certain amount of time looking into the performance of the regular expression engines in both (some version(s) of) Perl and Python and reading things like Russ Cox's series on regular expressions. Like a lot of other people I've also read a certain amount of general wisdom on regular expression performance out on the Internet.

Here is what my direct experience has convinced me: I have no real idea how a regular expression engine is going to perform in some situation unless I measure it. The performance of regexp engines is only vaguely predictable and to predict it well you need to know an uncommon amount about the internals of their implementations. Regexp engines that use very similar or identical RE languages can perform very differently on common things you do (the poster child here is the behavior of Perl versus Python on RE alternation, '|').
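As an illustration of what 'measure it' can look like in Python, here is a minimal sketch that times one alternation regexp against a set of separate regexps using the standard timeit module. The word list, the sample text, and all of the function names are made up for this example; they are not from any real workload, and the numbers you get will depend on your Python version and your data.

```python
import re
import timeit

# Hypothetical workload: the words and the sample text are invented
# purely for illustration.
WORDS = ["alpha", "beta", "gamma", "delta", "epsilon"]
TEXT = ("x" * 200 + " ") * 50 + "epsilon"

# One regexp using alternation ('|') versus one compiled regexp per word.
ALTERNATION = re.compile("|".join(re.escape(w) for w in WORDS))
SEPARATE = [re.compile(re.escape(w)) for w in WORDS]


def match_alternation(text):
    """Return True if any word occurs, using a single alternation regexp."""
    return ALTERNATION.search(text) is not None


def match_separate(text):
    """Return True if any word occurs, trying each regexp in turn."""
    return any(r.search(text) is not None for r in SEPARATE)


def measure(func, text, number=1000, repeat=5):
    """Return the best per-call time in seconds over several repeats."""
    timer = timeit.Timer(lambda: func(text))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for func in (match_alternation, match_separate):
        usec = measure(func, TEXT) * 1e6
        print(f"{func.__name__}: {usec:.2f} us per call")
```

Taking the minimum over several repeats is a common way to reduce noise from whatever else the machine is doing. The sketch deliberately doesn't claim which version wins; that is the thing you have to find out.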

All of this makes me very wary of airy Internet pronouncements about how to get fast regular expressions in any particular language (it should be clear that a language-independent prediction is either immediately obvious or completely laughable). These bits of advice may be well researched and thus actually correct or they may be what the author thinks should be the case, and it's often hard to tell which is which.

(Even when they're correct, regular expression engines can and do get changed over time. Hopefully the changes improve performance without any regressions, but you never know.)

I came very close to writing such an entry myself until I came to my senses. That's the other thing that this has convinced me of: I cannot possibly write anything about this subject without actually measuring it. Per my aside in a recent entry I should also then archive and publish my measuring program.

(By the way, the other minefield for regexp engine performance is whether you are using them on Unicode data or 'plain bytes'. Many tests, mine included, have traditionally been run on plain bytes; these days Unicode performance is much more interesting and relevant, especially as some environments turn all normal strings into Unicode.)
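A rough sketch of checking the Unicode versus bytes question in Python 3 is to run what is nominally the same regexp over the same data as both a str and a bytes object. The sample data and the helper names here are invented for the example. Note that the two patterns are not strictly equivalent: on str, \d matches any Unicode decimal digit by default, while on bytes it only matches ASCII 0-9.

```python
import re
import timeit

# Made-up sample data; a real test should use data that resembles your own.
TEXT_STR = "some log line with a number 12345 in it\n" * 1000
TEXT_BYTES = TEXT_STR.encode("ascii")

STR_RE = re.compile(r"\d+")     # str pattern: \d is Unicode-aware by default
BYTES_RE = re.compile(rb"\d+")  # bytes pattern: \d matches only [0-9]


def count_matches(regexp, data):
    """Count the non-overlapping matches of regexp in data."""
    return sum(1 for _ in regexp.finditer(data))


def measure_count(regexp, data, number=20, repeat=5):
    """Return the best per-call time in seconds over several repeats."""
    timer = timeit.Timer(lambda: count_matches(regexp, data))
    return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    print(f"str:   {measure_count(STR_RE, TEXT_STR) * 1e3:.3f} ms per pass")
    print(f"bytes: {measure_count(BYTES_RE, TEXT_BYTES) * 1e3:.3f} ms per pass")
```

Pure ASCII data like this is the friendliest possible case for Unicode strings, so a more honest test would also try text with non-ASCII characters in it.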

programming/RegexpPerformanceFolklore written at 23:28:16

Identities, trust, and work

As part of thinking about 'web of trust' systems, I've recently come to think that there are effectively two sorts of identities on the Internet. For lack of a better terminology I will call these 'internal' and 'external'.

An external identity is an identity that is linked to something in the outside Internet world. In one sense, the identity exists to assert that a body of work all comes from one person and that any new work comes from that same person. 'Trust' for such an identity within your identity system is essentially meaningless; people don't care that Linus Torvalds' GPG key has lots of signatures, they care that it continues to sign Linux kernel releases and that the 'Linus Torvalds' on kernel mailing lists doesn't denounce it as forged and so on. The work done in the name of the identity is its proof and source of trust.

An internal identity is an identity without this property. Its only significant existence is within your identity system and it is otherwise free-floating, not tied to something else out on the Internet that people care about or look at. Trust for these identities is necessarily created within your identity system because there is nothing else to do it; there is nothing significant on the Internet to say 'yes, this is my identity'.

Internal identities are necessarily much more vulnerable than external identities because there is nothing else backing them up; your identity system is it.

Man in the middle attacks are possible on unsupported external identities in situations where you can actually do two-way impersonation and keep it up. When it comes to personal identities I think that this is rare. Other sorts of identities are much more attackable this way and so need stronger internal support from your identity system; here the 'trust' your identity system needs to create is that you are talking to the real thing, not an imposter in the middle.

tech/IdentitiesAndWork written at 00:41:30


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.