2024-10-07
Things syndication feed readers do with 'conditional GET'
In HTTP, a conditional GET is a nice way of saving bandwidth (but not always server work) when a web browser or other HTTP agent requests a URL that hasn't changed. Conditional GET is very useful for things that fetch syndication feeds (Atom or RSS), because they often poll much more often than the syndication feed actually changes. However, just because it would be a good thing if feed readers and other feed fetchers did conditional GETs doesn't mean that they actually do. And when feed readers do try conditional GETs, they don't always do it right; for instance, Tiny Tiny RSS at least used to basically make up the 'If-Modified-Since' timestamps it sent (which I put in a hack for).
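For reference, the server side of a conditional GET can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code behind Wandering Thoughts: the names `parse_http_date`, `not_modified` and the `literal` switch are made up for the example, and splitting If-None-Match on commas is a simplification. It checks If-None-Match first (If-Modified-Since is ignored when If-None-Match is present) and otherwise validates If-Modified-Since either as an exact string or as a timestamp.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Return an aware UTC datetime, or None if the value can't be parsed."""
    try:
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    # A '-0000' offset parses to a naive datetime; treat it as UTC so that
    # it can be compared against aware datetimes.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def _opaque(tag):
    """Strip the weak-validator prefix so tags can be compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def not_modified(headers, etag, last_modified, literal=True):
    """Return True if the request can be answered with HTTP 304.

    headers:       dict of request headers (hypothetical shape)
    etag:          the current ETag of the resource, including quotes
    last_modified: the current Last-Modified value as an HTTP date string
    literal:       compare If-Modified-Since as a string, not a timestamp
    """
    inm = headers.get("If-None-Match")
    if inm is not None:
        # Simplification: assumes ETags do not themselves contain commas.
        candidates = [_opaque(t) for t in inm.split(",")]
        return "*" in candidates or _opaque(etag) in candidates

    ims = headers.get("If-Modified-Since")
    if ims is None:
        return False
    if literal:
        return ims == last_modified

    since = parse_http_date(ims)
    current = parse_http_date(last_modified)
    if since is None or current is None:
        return False
    return current <= since
```

One consequence of the two modes: a client that invents an If-Modified-Since later than the real Last-Modified gets a 304 under timestamp comparison but a full 200 under literal comparison, while a timestamp far in the past fails either way.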
For reasons beyond the scope of this entry, I recently looked at my feed fetching logs for Wandering Thoughts. As usually happens when you turn over any rock involving web server logs, I discovered some multi-legged crawling things underneath, and in this case I was paying attention to what feed readers do (or don't do) for conditional GETs. Consider this a small catalog.
- Some or perhaps all versions of NextCloud-News send an If-Modified-Since
header with the value 'Wed, 01 Jan 1800 00:00:00 GMT'. This is
always going to fail validation and turn into a regular GET
request, whether you compare If-Modified-Since values literally
or consider them as a timestamp and do timestamp comparisons.
NextCloud-News might as well not bother sending an If-Modified-Since
header at all.
- A number of feed readers appear to only update their stored ETag
value for your feed if your Last-Modified value also changes.
In practice there are a variety of things
that can change the ETag without changing the Last-Modified value,
and some of them regularly happen here on Wandering Thoughts,
which causes these feed readers to effectively decay into doing
unconditional GET requests the moment, for example, someone leaves
a new comment.
- One feed reader sends If-Modified-Since values that use a numeric
time offset, as in 'Mon, 07 Oct 2024 12:00:07 -0000'. This is
also not a reformatted version of a timestamp I've ever given
out, and is after the current Last-Modified value at the time
the request was made. This client reliably attempts to pull my
feed three times a day, at 02:00, 08:00, and 20:00, and the times
of the If-Modified-Since values for those fetches are reliably
00:00, 06:00, and 12:00 respectively.
(I believe it may be this feed fetcher, but I'm not going to try to reverse engineer its If-Modified-Since generation.)
- Another feed fetcher, possibly Firefox or an extension, made up its own timestamps that were later than the current Last-Modified of my feed at the time it made the request. It didn't send an If-None-Match header on its requests (i.e., it didn't use the ETag I return). This is possibly similar to the Tiny Tiny RSS case, with the feed fetcher remembering the last time it fetched the feed and using that as the If-Modified-Since value when it makes another request.
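By contrast, the client-side behavior that avoids all of the problems above is small. The following is a minimal sketch, assuming a simple dict as the per-feed cache; the function names and the cache keys (`etag`, `last_modified`, `body`) are hypothetical. The two points it illustrates are that the validators are stored verbatim from the server's response rather than invented, and that both are replaced together on every HTTP 200, whether or not Last-Modified changed.

```python
import urllib.error
import urllib.request


def conditional_headers(cache):
    """Build request headers from whatever validators were stored last time."""
    headers = {}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    return headers


def update_cache(cache, status, response_headers, body=None):
    """Record the outcome of a fetch in the cache and return the cache."""
    if status == 200:
        # Always replace both validators, exactly as the server sent them.
        # The ETag can change even when Last-Modified does not.
        cache["etag"] = response_headers.get("ETag")
        cache["last_modified"] = response_headers.get("Last-Modified")
        cache["body"] = body
    # On a 304 the stored validators and body are still current.
    return cache


def fetch(url, cache):
    """Fetch a feed with a conditional GET, updating the cache in place."""
    req = urllib.request.Request(url, headers=conditional_headers(cache))
    try:
        with urllib.request.urlopen(req) as resp:
            return update_cache(cache, resp.status, resp.headers, resp.read())
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return update_cache(cache, 304, err.headers)
        raise
```

A real feed reader would also want to persist the cache between runs and limit how often it polls, neither of which is shown here.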
All of this is what I turned over in a single day of looking at feed fetchers that got a lot of HTTP 200 results (as opposed to HTTP 304 results, which show a conditional GET succeeding). Probably there are more fun things lurking out there.
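For anyone who wants to turn over the same rock in their own logs, one way to find candidates is to tally HTTP 200 versus 304 responses per user agent for the feed URL. This is a rough sketch that assumes access logs in the common Apache 'combined' format; the function names, the `feed_path` argument, and the `min_requests` threshold are all made up for illustration, and it is not how the fetchers above were actually found.

```python
import re
from collections import Counter, defaultdict

# Matches the Apache 'combined' log format (an assumption about the logs).
LOG_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally_feed_statuses(lines, feed_path):
    """Count response status codes per user agent for GETs of feed_path."""
    counts = defaultdict(Counter)
    for line in lines:
        m = LOG_RE.match(line)
        if not m or m["method"] != "GET" or m["path"] != feed_path:
            continue
        counts[m["agent"]][m["status"]] += 1
    return counts


def never_got_304(counts, min_requests=3):
    """List user agents with at least min_requests fetches and no 304s."""
    suspects = []
    for agent, statuses in counts.items():
        total = statuses["200"] + statuses["304"]
        if total >= min_requests and statuses["304"] == 0:
            suspects.append(agent)
    return sorted(suspects)
```

An agent showing up in this list is only a starting point; seeing what it actually sends still requires logging the If-Modified-Since and If-None-Match request headers, which the combined format does not include.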
(I'm happy to have people read my feeds and we're not short on bandwidth, so this is mostly me admiring the things under the rock rather than anything else. Although, some feed readers really need to slow down the frequency of their checks; my feed doesn't update every few minutes.)