2014-03-10
The problem of conditional GET and caches for dynamic websites
For reasons beyond the margins of this entry, Aristotle Pagaltzis
recently noticed an oddity with syndication feeds for this blog. To
paraphrase his message, he made an initial plain feed request with no
If-None-Match header, got back something with an ETag header, redid
the request as a conditional GET with the same tag in If-None-Match,
and got back a different result with a different ETag. On the surface
this sounds like my caching is broken, but what is really going on is
that the traditional irony of conditional GET for dynamic sites is
interacting with the desire to reduce load through caching.
The dynamic site conditional GET problem is that in many dynamic
environments you need to more or less build the entire page in order
to determine its ETag and Last-Modified information. If you want to
have a full page cache to reduce your load under some circumstances
and you don't have explicit cache invalidation (which is very hard in
a file based engine), you don't necessarily have fully accurate ETag
values; reducing load implies relying on the ETag cached in the page
cache even though the actual page and its ETag may have changed since
then. If you serve an old cached version of a page, expire it, and
then regenerate it, the newly generated version may well be
different. This is basically the traditional conflict between a
desire for more cache hits and a desire for absolutely current
information.
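
(To make the conflict concrete, here is a minimal sketch of a full
page cache that stores the ETag alongside the rendered page; the
class, its interface, and the TTL handling are invented for
illustration and are not DWiki's actual code.)

    import hashlib
    import time

    # A full page cache that remembers the rendered page together with
    # the ETag computed at render time.  Until the entry expires, hits
    # return the old body and old ETag even if the real page (and thus
    # its real ETag) has changed since the entry was generated.
    class PageCache:
        def __init__(self, ttl):
            self.ttl = ttl
            self.entries = {}       # url -> (expiry time, body, etag)

        def get(self, url, render):
            now = time.time()
            entry = self.entries.get(url)
            if entry and entry[0] > now:
                _, body, etag = entry
                return body, etag   # possibly stale body and ETag
            body = render(url)
            etag = '"%s"' % hashlib.sha1(body.encode()).hexdigest()
            self.entries[url] = (now + self.ttl, body, etag)
            return body, etag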
(You can try to track dependency information in the page cache and
revalidate it before you use a cache entry and its ETag, but the
general problem there is that the more you revalidate, the slower a
cache hit is. This is especially acute in a file based engine because
the validators are harder to build and often less efficient to check.)
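
(As an illustration of that cost, a revalidation check in a file based
engine might look something like the following sketch; the dependency
format is made up, but the point is that every cache hit now pays for
a stat() of each dependency.)

    import os

    # Revalidate a cache entry against its file based dependencies by
    # comparing each file's current mtime with the mtime recorded when
    # the entry was generated.  All of these stat() calls happen on
    # what is supposed to be the fast path, a cache hit.
    def entry_is_valid(deps):
        # deps maps filename -> mtime recorded at render time
        for fname, old_mtime in deps.items():
            try:
                if os.path.getmtime(fname) != old_mtime:
                    return False
            except OSError:
                return False        # a dependency has vanished
        return True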
However this is not a full explanation of what Aristotle Pagaltzis saw (not unless the cache entry for the feed expired between the two requests, which it probably didn't). What is also going on is that DWiki is doing some special hacks in certain circumstances in order to reduce the impact of generating syndication feeds. This is relatively important here because feeds are requested quite often and they're one of the most expensive things to generate (partly because I set the number of entries in feeds quite high).
What I found when I started looking at my conditional GET logs at some
point was that I was getting a significant number of requests that
were not conditional GETs, ie they lacked both an If-None-Match and an
If-Modified-Since header. Overcome by grumpiness, I decided that if
these people could not be bothered to do conditional GET I was not
going to go out of my way to (re)generate current content for them, so
what my DWiki setup does is serve them syndication feeds from the page
cache for much longer than it does for people who make conditional
GETs. This means that if you do what Aristotle did, your first request
may get served from an old cache entry but then your second one is
recomputed from scratch (now that it's a proper conditional GET).
Under the right circumstances this will result in a changed ETag.
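
(In sketch form, the policy amounts to picking a cache lifetime based
on whether the request carries any validator headers at all; the
function name and the TTL values here are made up, not DWiki's real
numbers.)

    # Pick how long a cached feed may be served for this request.
    FEED_TTL_CONDITIONAL = 5 * 60       # conditional GETs: five minutes
    FEED_TTL_UNCONDITIONAL = 60 * 60    # no validators at all: an hour

    def feed_cache_ttl(headers):
        # A conditional GET carries at least one of the two validators.
        if 'If-None-Match' in headers or 'If-Modified-Since' in headers:
            return FEED_TTL_CONDITIONAL
        return FEED_TTL_UNCONDITIONAL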
(It looks like roughly one seventh of my successful syndication feed requests are being sent without conditional GET at the moment. This doesn't count things like Googlebot that are getting their requests refused outright.)
Conceptually this is tilting the balance between cache hits and avoiding staleness in the direction of more cache hits under some circumstances. I don't think there's anything wrong with this as long as you're doing it deliberately and with your eyes open (and ideally based on numbers).
With that said, now that I've had my attention drawn to this I'm probably going to rethink how I want to handle caching for various sorts of syndication feed requests. My initial syndication feed caching was set up rather a long time ago and there have been several generations of overall cache improvements since then. It's quite likely that the relative cost of generating syndication feeds has shifted in favour of caching them less and generating them more often.
(One of the things that has happened on Wandering Thoughts is that syndication feeds are requested so often that they're almost always in the page cache. I actually routinely flush them from cache by hand any time I publish or revise an entry, which is probably a warning sign I should have paid attention to some time ago.)