Wandering Thoughts archives

2013-07-19

A bit on the performance of lexers in Python

This all starts with Eli Bendersky's Hand-written lexer in Javascript compared to the regex-based ones (via) where he writes in part:

I was expecting the runtime [of the hand written lexer] to be much closer to the single-regex version; in fact I was expecting it to be a bit slower (because most of the regex engine work is done at a lower level). But it turned out to be much faster, more than 2.5x.

In the comments Caleb Spare pointed to Rob Pike's Regular expressions in lexing and parsing which reiterates the arguments for simple lexers that don't use regular expressions. Despite all of this, regular expression based lexers are extremely common in the Python world.

Good lexing and even parsing algorithms are both extremely efficient and very well known (the problem has been studied almost since the start of computer science). A good high performance lexer generally looks at each input character only once and runs a relatively short amount of focused code per character to tokenize the input stream. A good regular expression engine can avoid backtracking but is almost invariably going to run more complex code (and often use more memory) to examine each character. As covered in Russ Cox's series on regular expressions, garden variety regular expression engines in Python, Perl, and several other languages aren't even that efficient and do backtrack (sometimes extensively).
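To make that concrete, here is a minimal sketch (my illustration, not code from any of the linked articles) of what such a hand-written lexer looks like in Python for a tiny expression language: a loop that dispatches on the current character and advances through the input exactly once.

    # Hand-written lexer sketch: each character is examined once,
    # with a small amount of focused code per character.
    def tokenize(text):
        tokens, i, n = [], 0, len(text)
        while i < n:
            ch = text[i]
            if ch.isspace():
                i += 1
            elif ch.isdigit():
                start = i
                while i < n and text[i].isdigit():
                    i += 1
                tokens.append(("NUMBER", text[start:i]))
            elif ch.isalpha() or ch == "_":
                start = i
                while i < n and (text[i].isalnum() or text[i] == "_"):
                    i += 1
                tokens.append(("NAME", text[start:i]))
            elif ch in "+-*/=()":
                tokens.append(("OP", ch))
                i += 1
            else:
                raise ValueError("unexpected character %r at %d" % (ch, i))
        return tokens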

So why don't people use hand-written lexers in Python? Because most Python implementations are slow. Your hand-written lexer may examine each character only once with simple code, but that efficiency advantage is utterly crushed by the gap between the speed of the C-based regular expression engine and the slowness of interpreting your Python code. Hand-written JavaScript lexers are comparatively fast because modern JavaScript interpreters have devoted a lot of effort to translating straightforward JavaScript code into efficient native code. Since the lexer actually is straightforward, a JIT engine is generally going to do quite well on it.
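For contrast, here is a sketch (again my illustration) of the regex-based style that is common in Python: one combined pattern with named groups, so that almost all of the per-character work happens inside the C regular expression engine instead of in interpreted Python code.

    # Regex-based lexer sketch: a single combined pattern does the
    # per-character scanning in C; the Python loop runs once per token.
    import re

    TOKEN_RE = re.compile(r"""
        (?P<NUMBER>\d+)
      | (?P<NAME>[A-Za-z_]\w*)
      | (?P<OP>[+\-*/=()])
      | (?P<SKIP>\s+)
    """, re.VERBOSE)

    def tokenize_re(text):
        # Note: characters that match no alternative are silently
        # skipped in this sketch; a real lexer would report them.
        tokens = []
        for m in TOKEN_RE.finditer(text):
            kind = m.lastgroup
            if kind != "SKIP":
                tokens.append((kind, m.group()))
        return tokens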

(Given this I wouldn't be surprised if a hand-written Python lexer that was run under PyPy was quite fast, either competitive with or even faster than a Python regex-based one. Assembling a test case and doing the benchmarking work is left as an exercise.)
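If you do want to try that exercise, a rough benchmarking harness might look like this (assuming the two tokenizer sketches above); run it under both CPython and PyPy to see how much a JIT changes the picture.

    # Rough timing harness for the two tokenizer sketches above.
    import timeit

    sample = "x_1 = (alpha + 42) * beta / 7 " * 1000

    for name, fn in (("hand-written", tokenize), ("regex-based", tokenize_re)):
        t = timeit.timeit(lambda: fn(sample), number=20)
        print("%-12s %.3f seconds for 20 runs" % (name, t))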

python/PythonLexerPerformance written at 20:31:00

Thinking about the security issues in HTTP 403 versus 404 errors

In the Unix world, the classical view of error responses for login attempts is that you should behave exactly the same for invalid logins as for valid logins; you should prompt for a password, you should spend just as long 'verifying' the password, and then you should tell whatever it is making the login attempt simply 'you didn't succeed'. To do otherwise is to leak information to an attacker about what logins are actually valid, which is potentially useful in all sorts of contexts.
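As an illustration of that classical approach (my sketch, not code from any real login system), you do the same password hashing work and return the same generic failure whether or not the username exists:

    # Uniform login check sketch: hashing time and the error are the
    # same for valid and invalid usernames.
    import hmac, hashlib

    def check_login(username, password, users):
        # users: dict of username -> (salt, stored_hash)
        # Use a dummy entry when the user is unknown so the PBKDF2 work
        # is done either way and timing leaks nothing.
        salt, stored = users.get(username, (b"\x00" * 16, b"\x00" * 32))
        attempt = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        if username in users and hmac.compare_digest(attempt, stored):
            return True
        return False  # one generic 'you didn't succeed'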

Assuming that HTTP error codes actually matter and that people you do not like will accurately interpret small differences in your error responses, there is an equivalent issue in the HTTP world; HTTP explicitly has a difference between 'thing does not exist' (404) and 'you do not have access to thing' (403). Under some circumstances this can leak information to attackers.

Part of what makes HTTP tricky here is the question of what information you want to protect. Put simply, is it more important to hide the existence or non-existence of URLs or to hide the existence of (some) restricted access URLs? This has no general answer; what you choose depends on your local circumstances and security needs.

Now, it's worth mentioning an important difference between HTTP requests and login attempts: in HTTP requests on normal sites, it's routine for people to innocently make requests for bad URLs. Bad URLs come from all over, including you making them bad by reorganizing your site and breaking all of those links. This is unlike logins, where bad logins are frequently from attackers.

(There are situations where the URLs are strictly internal constructs inside your complex web app and no human should ever save such a URL or pass it around. In these cases bad URLs are much more like bad logins.)

There are of course some pragmatic considerations. First, it's generally safer to check authentication very early rather than run substantial parts of your code at the behest of an unauthenticated person; in the jargon, you're said to have less exposed attack surface. In web apps this often means that you have no idea whether or not a URL is valid when you reject the request. As a result you probably can't hide the presence of authentication from a sophisticated attacker but you often do effectively wind up hiding what URLs under the authentication point actually exist.
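A minimal sketch of what 'check authentication very early' can look like in a Python web app (a hypothetical WSGI middleware with an assumed is_authenticated() helper): the request is rejected before the application ever routes the URL, so the rejection cannot depend on whether the URL exists.

    # WSGI middleware sketch: authentication is checked before any URL
    # routing, so the 403 says nothing about whether the path is valid.
    def require_auth(app):
        def middleware(environ, start_response):
            if not is_authenticated(environ):   # hypothetical helper
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)  # routing happens in here
        return middleware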

Second, if you want to deal with this issue by using only one generic error code for 'you can't have this URL for whatever reason', I think you probably want to use a 404 unless your entire site is protected by authentication (in which case 403 is better). You will get innocent bad URLs for things that aren't protected by authentication and returning anything other than a 'page not found' report of some sort will just confuse the poor user.
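As a sketch of that 'one generic error code' approach (my illustration, with a hypothetical user_is_authorized() helper), both unknown URLs and restricted URLs the requester can't see get an identical 404:

    # One-generic-error sketch: the status code leaks nothing about
    # which restricted URLs actually exist.
    def respond(path, user, pages, restricted):
        # pages: dict of path -> content; restricted: set of paths needing auth
        if path not in pages:
            return 404, "Not Found"
        if path in restricted and not user_is_authorized(user, path):
            # Deliberately indistinguishable from a missing page.
            return 404, "Not Found"
        return 200, pages[path]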

(Technically you could return a 403 status but with HTML text that says that the page is not accessible for some reason. Most or all actual people will read the text and never know the status code.)

(What started me thinking about this whole issue was a Tweet from Sean M Puckett.)

web/HTTP403Vs404 written at 00:49:35

