2010-10-18
A module that I want in the standard library: Parsing
Given what I've written before, it will probably not surprise you that I have a little list of modules that I think should be in the standard library. The first one of them is a decent parsing or parser generator module.
There are two reasons in combination for this. First off, various sorts of parsing are an important part of the work that a decent number of programs do. Full-bore language parsers are somewhat uncommon, but (at least in my work) it's reasonably common to be parsing (or at least looking at) less complex things, including the output of other programs. The Python standard library currently has a number of specialized parsers (for XML, for certain standard sorts of configuration files, and so on), but it has no general parser framework.
But wait, you might be saying, Python has a general parser framework in the form of regular expressions. That is the second reason that a real parser module is needed; to paraphrase Jamie Zawinski, using regular expressions for parsing means that you now have two problems. Trying to do parsing with regular expressions is the classic example of using a bad tool for the job. Regular expressions do not help you much if you want to do a real job of parsing something, but they make it very easy to assemble an incomplete, difficult to maintain, and fragile 'parser' that is likely to mishandle any input it doesn't expect (and there have been lots of demonstrations of this). But as long as the standard library has a regular expression module and doesn't have a parser module, people will continue building parsers with regular expressions and then blowing their feet off.
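To make the fragility concrete, here is a sketch of such a regex-based 'parser' for a hypothetical key = value configuration format (the format, names, and code are my own invention for illustration). Note how it silently mangles or drops input it doesn't expect instead of reporting an error:

```python
import re

# A quick regex "parser" for a made-up 'key = value' configuration format.
# This is an illustration of the fragility being described, not a
# recommended approach.
CONFIG_RE = re.compile(r'^(\w+)\s*=\s*(\S+)')

def parse_config(text):
    """Return a dict of settings; silently skips lines it can't match."""
    settings = {}
    for line in text.splitlines():
        m = CONFIG_RE.match(line)
        if m:
            settings[m.group(1)] = m.group(2)
    return settings

# Looks fine on the input we had in mind:
print(parse_config("debug = on\nport = 8080"))
# -> {'debug': 'on', 'port': '8080'}

# But it quietly truncates a quoted value at the first space, and the
# unexpected 'bad-key' line simply vanishes instead of being an error:
print(parse_config('greeting = "hello world"\nbad-key = oops'))
# -> {'greeting': '"hello'}
```

A real parser built from a grammar would instead reject (or correctly handle) the inputs that this code mishandles without a murmur.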
(Regular expressions are useful for lexing, but that is only a part of parsing things. My ideal parsing module would also have good support for building lexers, possibly integrated into the grammar specification.)
A good parsing module for the standard library would make it easier to build a real parser than to fake one with regular expressions, thereby encouraging people to solve their problems the right way. (I would suggest that good error checking and error recovery would also be attractive features, but people who are happy with regular-expression-based parsers are unlikely to be missing them now.)
Python has a number of (third party) parsing modules, none of which I've tried out. Ned Batchelder's comprehensive index and summary is the best resource on this that I know of.
(Since I have tried none of the existing modules, I have no opinion on which one should be in the standard library, if any. I just want some decent parsing module to be in there.)
The cache eviction death spiral
Here's a cache behavior I've touched on in passing several times but feel like covering explicitly now. It's what I've taken to calling the cache eviction death spiral, where cache evictions start you down a path to grinding death.
Cache evictions of useful data have an obvious cost; they slow down the program when it has to pull the needed data back into the cache. But in a contended cache under pressure from multiple sources, they have a second effect as well: because your program is now running slower, it touches the data it needs less often, 'cooling down' that data and making it more likely to be evicted from the cache in turn. Of course this then repeats, worse each time.
(Technically this only happens in caches that are based on some variety of least-recently-used or least-frequently-used. That would be pretty much all of them.)
This is the cache eviction death spiral. It can happen in any situation where the cache is being contended over (as opposed to the single-process case where the cache just isn't big enough), and it does not necessarily produce a winner; if you have enough sources of contention, they can all thrash each other to death, with everyone too busy stealing entries from the other processes to get their own data back into memory to do any real work.
An especially dangerous situation for a cache eviction death spiral is when some users of the cache have a much easier time keeping their data hot (or in general claiming cache space) than other users. This can create a brutal feedback cycle that rapidly freezes the slow users out of the cache more or less entirely (or at least confines them to whatever small portion the fast users don't need). Generally this results in the slow users grinding to a complete halt (although performance monitoring may claim that they are very active).
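The freeze-out shows up even in a toy, deterministic simulation. In this sketch (every number here, from the cache size to the miss penalty, is an arbitrary assumption chosen to make the effect visible, not a measurement), two workers share one LRU cache; a miss costs a worker extra time, so it touches its data less often and cools it down further:

```python
from collections import OrderedDict

# Toy simulation of the feedback loop: two workers round-robin over their
# working sets in a shared LRU cache. A hit costs 1 time unit, a miss 10,
# so a worker that misses accesses the cache less often.
CACHE_SIZE, HIT_COST, MISS_COST = 8, 1, 10

def simulate(total_time=2000):
    cache = OrderedDict()                 # key -> None, ordered by recency
    # 'fast' fits comfortably in the cache; 'slow' needs most of it, and
    # together they don't fit (3 + 6 > 8).
    workers = {'fast': ['f%d' % i for i in range(3)],
               'slow': ['s%d' % i for i in range(6)]}
    clock = {w: 0.0 for w in workers}     # each worker's own timeline
    pos = {w: 0 for w in workers}
    hits = {w: 0 for w in workers}
    accesses = {w: 0 for w in workers}
    while min(clock.values()) < total_time:
        w = min(clock, key=clock.get)     # whoever's next access comes first
        wset = workers[w]
        key = wset[pos[w] % len(wset)]
        pos[w] += 1
        accesses[w] += 1
        if key in cache:
            cache.move_to_end(key)        # refresh recency
            hits[w] += 1
            clock[w] += HIT_COST
        else:
            cache[key] = None
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False) # evict the least recently used
            clock[w] += MISS_COST         # the miss slows this worker down,
                                          # so its data cools off further
    return {w: hits[w] / accesses[w] for w in workers}

print(simulate())
```

Running this, the fast worker's hit rate ends up near 100% while the slow worker's collapses to essentially zero: its entries are always the least recently used, so it pays the miss penalty on every access and never gets a chance to keep anything hot.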
If you're designing a cache into a system, it pays to consider whether you're susceptible to this issue, especially if your system can wind up starving some users. (I don't know if the literature has good ways around this, but if I were searching I would start with the operating system people; this crops up a lot in operating system caching.)