The problem of conditional GET and caches for dynamic websites

March 10, 2014

For reasons beyond the margins of this entry, Aristotle Pagaltzis recently noticed an oddity with syndication feeds for this blog. To paraphrase his message, he made an initial plain feed request with no If-None-Match header, got back something with an ETag header, redid the request as a conditional GET with the same tag in If-None-Match, and got back a different result with a different ETag. On the surface this sounds like my caching is broken, but what is really going on is that the traditional irony of conditional GET for dynamic sites is interacting with the desire to reduce load through caching.
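To make the sequence concrete, here is a rough sketch of reproducing the observation in Python with the requests module (the feed URL is a hypothetical stand-in; the real request is against whatever feed you care about):

    import requests

    # Hypothetical feed URL; any dynamically generated feed will do.
    FEED_URL = "https://example.org/blog/?atom"

    # First request: a plain GET with no If-None-Match header.
    r1 = requests.get(FEED_URL)
    etag1 = r1.headers.get("ETag")
    print(r1.status_code, etag1)

    # Second request: a conditional GET using the ETag we were just given.
    r2 = requests.get(FEED_URL, headers={"If-None-Match": etag1} if etag1 else {})
    etag2 = r2.headers.get("ETag")
    print(r2.status_code, etag2)

    # The natural expectation is a 304 Not Modified on the second request;
    # what Aristotle saw instead was a 200 with a different body and ETag.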

The dynamic site conditional GET problem is that in many dynamic environments you need to more or less build the entire page in order to determine its ETag and Last-Modified information. If you want to have a full page cache to reduce your load under some circumstances and you don't have explicit cache invalidation (which is very hard in a file based engine), you don't necessarily have fully accurate ETag values; reducing load implies relying on the ETag cached in the page cache even though the actual page and its ETag may have changed since then. If you serve an old cached version of a page, expire it, and then regenerate it, the newly generated version may well be different. This is basically the traditional conflict between a desire for more cache hits and a desire for absolutely current information.
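As a sketch of the tradeoff, a simple full page cache looks something like the following (the names, structure, and TTL here are invented for illustration; this is not DWiki's actual code):

    import hashlib
    import time

    CACHE_TTL = 300          # seconds; an invented number for illustration
    _page_cache = {}         # url -> (expires_at, etag, body)

    def expensive_full_render(url):
        # Stand-in for the real work of building the entire page.
        return "<html> rendered content of %s </html>" % url

    def get_page(url):
        entry = _page_cache.get(url)
        if entry is not None and entry[0] > time.time():
            # Cache hit: hand back the cached ETag and body, even though
            # the real page (and so its real ETag) may have changed since
            # this entry was generated.
            return entry[1], entry[2]
        # Cache miss: we have to build the whole page before we can know
        # what its ETag is.
        body = expensive_full_render(url)
        etag = '"%s"' % hashlib.sha1(body.encode("utf-8")).hexdigest()
        _page_cache[url] = (time.time() + CACHE_TTL, etag, body)
        return etag, body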

(You can try to track dependency information in the page cache and revalidate it before you use a cache entry and its ETag, but the general problem there is that the more you revalidate, the slower a cache hit is. This is especially acute in a file based engine because the validity checks are harder and often less efficient.)
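In a file based engine such revalidation mostly means re-stat()ing the files a page was built from, which is why it erodes the benefit of a cache hit. A sketch, with invented names:

    import os

    def entry_is_current(cached_at, dependencies):
        # Re-stat every file the cached page was built from.  Each extra
        # dependency is another system call on every cache hit, which is
        # how heavy revalidation eats into the benefit of caching.
        for path in dependencies:
            try:
                if os.stat(path).st_mtime > cached_at:
                    return False
            except OSError:
                # A dependency has vanished; treat the entry as stale.
                return False
        return True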

However this is not a full explanation of what Aristotle Pagaltzis saw (not unless the cache entry for the feed expired between the two requests, which it probably didn't). What is also going on is that DWiki is doing some special hacks in certain circumstances in order to reduce the impact of generating syndication feeds. This is relatively important here because feeds are requested quite often and they're one of the most expensive things to generate (partly because I set the number of entries in feeds quite high).

What I found when I started looking at my conditional GET logs at some point was that I was getting a significant number of requests that were not conditional GETs, i.e. they lacked both an If-None-Match and an If-Modified-Since header. Overcome by grumpiness, I decided that if these people could not be bothered to do conditional GET I was not going to go out of my way to (re)generate current content for them, so what my DWiki setup does is serve them syndication feeds from the page cache for much longer than it does for people who make conditional GETs. This means that if you do what Aristotle did, your first request may get served from an old cache entry but then your second one is recomputed from scratch (now that it's a proper conditional GET). Under the right circumstances this will result in a changed ETag.
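The decision itself is simple; a rough sketch of its shape (the TTL values are made up for illustration, not DWiki's real numbers) is:

    # TTL values are invented for illustration, not DWiki's real numbers.
    CONDITIONAL_TTL = 5 * 60         # seconds, for proper conditional GETs
    UNCONDITIONAL_TTL = 4 * 60 * 60  # much longer for plain GETs

    def feed_cache_ttl(request_headers):
        # Clients that send neither validator header get served from the
        # page cache for much longer.
        is_conditional = ("If-None-Match" in request_headers or
                          "If-Modified-Since" in request_headers)
        return CONDITIONAL_TTL if is_conditional else UNCONDITIONAL_TTL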

(It looks like roughly one seventh of my successful syndication feed requests are being sent without conditional GET at the moment. This doesn't count things like Googlebot that are getting their requests refused outright.)

Conceptually this is tilting the balance between cache hits and avoiding staleness in the direction of more cache hits under some circumstances. I don't think there's anything wrong with this as long as you're doing it deliberately and with your eyes open (and ideally based on numbers).

With that said, now that I've had my attention drawn to this I'm probably going to rethink how I want to handle caching for various sorts of syndication feed requests. My initial syndication feed caching was set up rather a long time ago and there have been several generations of overall cache improvements since then. It's quite likely that the relative cost of generating syndication feeds has shifted in favour of caching them less and generating them more often.

(One of the things that has happened on Wandering Thoughts is that syndication feeds are requested so often that they're almost always in the page cache. I actually routinely flush them from cache by hand any time I publish or revise an entry, which is probably a warning sign I should have paid attention to some time ago.)


Comments on this page:

By Ewen McNeill at 2014-03-11 00:19:53:

If you're already flushing the cache of syndication feeds by hand each time you post, and are willing to continue to do so, then you could just pre-generate them to a file and let the conditional GET hit that file instead. (This is basically how Ikiwiki works for feeds -- they get rendered out to files too when the wiki/blog is regenerated.)

In the case of posting to my Ikiwiki-based blog, committing to the git tree that holds the blog causes a git post-hook to run that regenerates everything that needs regenerating (including syndication feed files) and makes the updated files public. Which means I get "behaves like a static file" syndication feeds for no extra effort.

Ewen

By cks at 2014-03-11 00:54:41:

There are three problems with your idea here:

  • there are too many syndication feeds for me to want to generate them by hand or by make/etc (and in fact many of the possible syndication feeds in the overall wiki-thing here are never requested).
  • even if I did, I'd need a new static file generator for them and a bunch of infrastructure around it and around serving them.
  • because the count of comments on an entry is included in the syndication feed, syndication feed entries can change at random times.

(Cached syndication feeds have a predictable name pattern under the cache area, so they are flushed with 'find ... | xargs rm'. A similar trick can be done with any cache where you can inventory the current keys.)
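(As a rough Python equivalent of that flush, with a hypothetical cache directory and a hypothetical name pattern standing in for the real ones:)

    import fnmatch
    import os

    CACHE_DIR = "/path/to/cache-area"   # hypothetical cache location
    FEED_PATTERN = "*.atom*"            # hypothetical name pattern for feed entries

    def flush_cached_feeds(cache_dir=CACHE_DIR, pattern=FEED_PATTERN):
        # Walk the cache area and delete every entry whose name matches
        # the feed pattern, just as the find | xargs rm pipeline does.
        for dirpath, _dirnames, filenames in os.walk(cache_dir):
            for name in fnmatch.filter(filenames, pattern):
                os.unlink(os.path.join(dirpath, name))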

If syndication feeds for Wandering Thoughts were a major resource consumer even after all of DWiki's optimizations it would be worth such major surgery in order to deal with them. But they would probably have to be requested several orders of magnitude more frequently than they are now in order for that to be necessary and it really would be a hack.

(In general, serving static files well requires a web site with an URL layout that is designed for this. Essentially you want something that is the reverse of ADynamicSitePeril.)

By Ewen McNeill at 2014-03-11 15:22:21:

Your cache files are (I assume) static files, served in a (semi-)static-file way. Just saying. (Serving cache files until manually invalidated is a moderately well known front end caching strategy that makes good sense in some situations -- especially if targeted at certain files/file types, and in some cases with automated regeneration via a suitable GET.)

FWIW, the kludge that Wordpress seems to use for comments count is to include an IMG reference which fetches an image with the comment count (rendered in text) in it. I'm not sure if that's overall a load/bandwidth win (over updating the syndication feed), but it does at least offer another tradeoff to play with.

I just wanted to offer you the perspective on other approaches given that a couple of recent posts made it sound like your syndication feed generation tradeoffs weren't an ideal match to current requests/criteria. ("ideal match" is hard to achieve :-) )

Ewen

By cks at 2014-03-11 16:28:17:

All of DWiki's caching, including the full page cache, sits 'behind' it. It's actually relatively hard to do otherwise for full page caches on real dynamic websites because you need a way of forcing the webserver and your application to agree on the ETag for something that is dynamically generated and then cached.

(Synchronization of timestamps is usually relatively easy, but not so much for ETags. Web servers usually like ETag schemes for static files that don't involve actually reading them and often don't document exactly what the scheme is so that you can reproduce it. And the scheme invariably depends on what exact web server you're running under, which has various drawbacks.)
