The problem of conditional GET and caches for dynamic websites

March 10, 2014

For reasons beyond the margins of this entry, Aristotle Pagaltzis recently noticed an oddity with syndication feeds for this blog. To paraphrase his message, he made an initial plain feed request with no If-None-Match header, got back something with an ETag header, redid the request as a conditional GET with the same tag in If-None-Match, and got back a different result with a different ETag. On the surface this sounds like my caching is broken, but what is really going on is that the traditional irony of conditional GET for dynamic sites is interacting with the desire to reduce load through caching.

The dynamic site conditional GET problem is that in many dynamic environments you need to more or less build the entire page in order to determine its ETag and Last-Modified information. If you want a full page cache to reduce your load under some circumstances and you don't have explicit cache invalidation (which is very hard in a file-based engine), your ETag values are not necessarily fully accurate; reducing load means relying on the ETag stored in the page cache even though the actual page and its ETag may have changed since then. If you serve an old cached version of a page, expire it, and then regenerate it, the newly generated version may well be different. This is basically the traditional conflict between a desire for more cache hits and a desire for absolutely current information.
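To make the conflict concrete, here is a minimal sketch in Python of a time-based page cache sitting in front of an expensive render. It is an illustration under assumptions, not DWiki's actual code: the names `render_page`, `make_etag`, `PageCache` and `serve` are invented here, the ETag is assumed to be a hash of the rendered body, and the `If-None-Match` check is a plain string comparison (real requests can carry several tags and weak validators).

```python
import hashlib


def render_page(entries):
    """Stand-in for the expensive step: build the whole feed body."""
    return "\n".join(entries).encode("utf-8")


def make_etag(body):
    # Assumption: the ETag is derived from the rendered bytes, so it
    # cannot be known without doing the full render first.
    return '"' + hashlib.sha1(body).hexdigest() + '"'


class PageCache:
    """Time-based page cache with no explicit invalidation."""

    def __init__(self):
        self._store = {}

    def put(self, key, body, etag, now):
        self._store[key] = (now, body, etag)

    def get(self, key, now, max_age):
        hit = self._store.get(key)
        if hit is None:
            return None
        stored_at, body, etag = hit
        if now - stored_at > max_age:
            return None  # too old, treat as a miss
        return body, etag


def serve(cache, key, entries, now, max_age, if_none_match=None):
    """Return (status, etag, body) for one request."""
    cached = cache.get(key, now, max_age)
    if cached is None:
        # Cache miss: pay for the full render to learn the ETag.
        body = render_page(entries)
        etag = make_etag(body)
        cache.put(key, body, etag, now)
    else:
        # Cache hit: trust the stored ETag, even if the page has
        # changed underneath us since it was stored.
        body, etag = cached
    if if_none_match == etag:
        return 304, etag, b""
    return 200, etag, body
```

With this sketch, a client that revalidates while the cache entry is still live gets a 304 for content that has in fact changed, and the first request after the entry expires suddenly sees a new ETag.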

(You can try to track dependency information in the page cache and revalidate it before you use a cache entry and its ETag, but the general problem there is that the more you revalidate, the slower a cache hit is. This is especially acute in a file-based engine because the validators are harder and often less efficient.)
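As a rough illustration of what such revalidation costs in a file-based engine, here is a hedged sketch that records the files a cached page was built from and re-checks them on every cache hit. The function names `snapshot` and `still_valid` are hypothetical, and using modification time plus size as the validator is an assumption, not a description of any particular engine.

```python
import os


def snapshot(paths):
    """Record (mtime_ns, size) for each file a page was built from."""
    result = {}
    for p in paths:
        st = os.stat(p)
        result[p] = (st.st_mtime_ns, st.st_size)
    return result


def still_valid(deps):
    """Re-stat every dependency; costs one stat() per file per cache hit."""
    for path, recorded in deps.items():
        try:
            st = os.stat(path)
        except FileNotFoundError:
            return False  # a source file disappeared
        if (st.st_mtime_ns, st.st_size) != recorded:
            return False  # a source file changed
    return True
```

Even this simple validator misses some changes: a new entry file appearing in a directory is not in the recorded set at all, so catching it means also tracking directories, which makes every cache hit slower still.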

However, this is not a full explanation of what Aristotle Pagaltzis saw (not unless the cache entry for the feed expired between the two requests, which it probably didn't). What is also going on is that DWiki is doing some special hacks in certain circumstances in order to reduce the impact of generating syndication feeds. This is relatively important here because feeds are requested quite often and they're one of the most expensive things to generate (partly because I set the number of entries in feeds quite high).

What I found when I started looking at my conditional GET logs at some point was that I was getting a significant number of requests that were not conditional GETs, i.e. they lacked both an If-None-Match and an If-Modified-Since header. Overcome by grumpiness, I decided that if these people could not be bothered to do conditional GET I was not going to go out of my way to (re)generate current content for them, so what my DWiki setup does is serve them syndication feeds from the page cache for much longer than it does for people who make conditional GETs. This means that if you do what Aristotle did, your first request may get served from an old cache entry but then your second one is recomputed from scratch (now that it's a proper conditional GET). Under the right circumstances this will result in a changed ETag.
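The policy can be sketched in a few lines of Python. This is only an illustration: the two time limits are made-up numbers, and the helper names `is_conditional`, `feed_cache_max_age` and `entry_usable` are hypothetical rather than anything in DWiki.

```python
# Hypothetical limits, in seconds; the real values are not given here.
CONDITIONAL_MAX_AGE = 60
UNCONDITIONAL_MAX_AGE = 3600


def is_conditional(headers):
    """True if the request carries either conditional GET header."""
    names = {name.lower() for name in headers}
    return "if-none-match" in names or "if-modified-since" in names


def feed_cache_max_age(headers):
    """Pick how old a cached feed may be for this request."""
    if is_conditional(headers):
        return CONDITIONAL_MAX_AGE
    return UNCONDITIONAL_MAX_AGE


def entry_usable(entry_age, headers):
    """Decide whether a cache entry of the given age may be served."""
    return entry_age <= feed_cache_max_age(headers)
```

Under these assumed numbers, a ten-minute-old cache entry is good enough for a plain GET but too old for the follow-up conditional GET, which is how two back-to-back requests can see two different ETags.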

(It looks like roughly one seventh of my successful syndication feed requests are being sent without conditional GET at the moment. This doesn't count things like Googlebot that are getting their requests refused outright.)

Conceptually this is tilting the balance between cache hits and avoiding staleness in the direction of more cache hits under some circumstances. I don't think there's anything wrong with this as long as you're doing it deliberately and with your eyes open (and ideally based on numbers).

With that said, now that I've had my attention drawn to this I'm probably going to rethink how I want to handle caching for various sorts of syndication feed requests. My initial syndication feed caching was set up rather a long time ago and there have been several generations of overall cache improvements since then. It's quite likely that the relative cost of generating syndication feeds has shifted in favour of caching them less and generating them more often.

(One of the things that has happened on Wandering Thoughts is that syndication feeds are requested so often that they're almost always in the page cache. I actually routinely flush them from cache by hand any time I publish or revise an entry, which is probably a warning sign I should have paid attention to some time ago.)

Written on 10 March 2014.

Last modified: Mon Mar 10 23:08:43 2014