Potential pragmatic handling of partial matches for HTTP conditional GET

October 11, 2024

In HTTP, a conditional GET is a GET request that can potentially be answered with a HTTP '304 Not Modified' status; this is quite useful for polling relatively unchanging resources like syndication feeds (although syndication feed readers don't always do so well at it). Generally speaking, there are two potential validators for conditional GET requests: the If-None-Match header, validated against the ETag of the reply, and the If-Modified-Since header, validated against the Last-Modified of the reply. A HTTP client can remember and use either or both of your ETag and your Last-Modified values (assuming you provide both).

When a HTTP client sends both If-Modified-Since and If-None-Match, the fully correct, specification-compliant validation is to require both to match. This makes intuitive sense; both your ETag and your Last-Modified values are part of the state of what you're replying with, and if one doesn't match, the client has a different view of the URL's state than you do, so you shouldn't claim it's 'not modified' from their state. Instead you should return the entire response so that they can update their view of your Last-Modified state.
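
As a minimal sketch of this strict 'everything must match' rule (not the code of any particular server), here is roughly what the check could look like. The function names and the plain dict of request headers are assumptions made up for illustration, and the If-None-Match parsing is deliberately simplified.

```python
def etag_matches(if_none_match, etag):
    """Weakly compare our ETag against an If-None-Match header value."""
    if if_none_match.strip() == "*":
        return True

    def opaque(tag):
        # Drop the weak-validator prefix so W/"x" and "x" compare equal.
        tag = tag.strip()
        return tag[2:] if tag.startswith("W/") else tag

    # Simplification: splitting on commas breaks on ETags that contain commas.
    return opaque(etag) in {opaque(t) for t in if_none_match.split(",")}


def strict_not_modified(headers, etag, last_modified):
    """Return True only if every validator the client sent matches ours.

    headers is a hypothetical dict of request headers; etag and
    last_modified are the exact values we would send in a full reply.
    """
    inm = headers.get("If-None-Match")
    ims = headers.get("If-Modified-Since")
    if inm is None and ims is None:
        return False  # not a conditional request at all
    if inm is not None and not etag_matches(inm, etag):
        return False
    # If-Modified-Since is treated as an opaque echo of our Last-Modified.
    if ims is not None and ims != last_modified:
        return False
    return True
```

With this, a request that echoes back our ETag but sends a made-up If-Modified-Since gets the full response, not a 304.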

In practice, two things potentially get in the way. First, it's common for syndication feed readers and other things to treat the 'If-Modified-Since' value they provide as a timestamp, not as an opaque string that echoes back your previous Last-Modified. Programs will put in what's probably some default time value, they'll use timestamps from internal events, and various other fun things. By contrast, your ETag value is opaque and has no meaning for programs to interpret, guess at, and make up; if a HTTP client sends an ETag, it's very likely to be one you provided (although this isn't certain). Second, it's not unusual for your ETag to be a much stronger validator than your Last-Modified; for example, your ETag may be a cryptographic hash of the contents and will definitely change if they do, while your Last-Modified is an imperfect approximation and may not change even if the content does.

In this situation, if a client presents an If-None-Match header that matches your current ETag and an If-Modified-Since that doesn't match your Last-Modified, it's extremely likely that they have your current content but have done one of the many things that make their 'timestamp' not match your Last-Modified. If you know you have a strong validator in your ETag and they're doing something like fetching your syndication feed (where it's very likely that they're going to do this a lot), it's pragmatically tempting to give them a HTTP 304 response even though you're technically not supposed to.
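
A hedged sketch of what giving in to this temptation might look like: if the client sent an If-None-Match and we trust our ETag, the ETag alone decides, and If-Modified-Since only matters when there is no If-None-Match. The function names, the `trust_etag` flag, and the dict of headers are all hypothetical.

```python
def _etag_in(if_none_match, etag):
    """Simplified weak comparison; splits If-None-Match on commas."""
    if if_none_match.strip() == "*":
        return True

    def opaque(tag):
        tag = tag.strip()
        return tag[2:] if tag.startswith("W/") else tag

    return opaque(etag) in {opaque(t) for t in if_none_match.split(",")}


def pragmatic_not_modified(headers, etag, last_modified, trust_etag=True):
    """Decide whether to answer 304, letting a trusted ETag match win.

    trust_etag is an assumed knob: set it only if the ETag is a strong
    validator, such as a hash of the content.
    """
    inm = headers.get("If-None-Match")
    ims = headers.get("If-Modified-Since")
    if inm is not None:
        if not _etag_in(inm, etag):
            return False
        if trust_etag:
            return True  # ignore whatever they put in If-Modified-Since
        return ims is None or ims == last_modified
    # No If-None-Match: fall back to an exact Last-Modified echo.
    return ims is not None and ims == last_modified
```

A mismatched ETag still always gets the full response, so the only thing relaxed here is the Last-Modified check.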

To reduce the temptation, you can change to comparing your Last-Modified value against people's If-Modified-Since as a timestamp (if you can parse their value that way), and giving people a HTTP 304 response if their timestamp is equal to or after yours. This is what I'd do today given how people actually handle If-Modified-Since, and it would work around many of the bad things that people do with If-Modified-Since (since usually they'll create timestamps that are more recent than your Last-Modified, although not always).
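
Here is a small sketch of that timestamp comparison, using the standard library's `email.utils.parsedate_to_datetime` to parse HTTP dates. The helper names are made up for illustration, and anything that can't be parsed as a date simply falls back to not matching.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Parse an HTTP date header value; return None if it isn't a date."""
    try:
        dt = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:
        # Assumption: a date with no usable zone is treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def ims_not_modified(if_modified_since, last_modified):
    """True if the client's timestamp is equal to or after our Last-Modified."""
    if if_modified_since == last_modified:
        return True  # exact echo of what we sent
    theirs = parse_http_date(if_modified_since)
    ours = parse_http_date(last_modified)
    if theirs is None or ours is None:
        return False  # unparseable: send the full response
    return theirs >= ours
```

For example, a client that sends the current time as its If-Modified-Since gets a 304 as long as our Last-Modified is in the past, while a client that sends something from last year gets the full response.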

Despite everything I've written above, I don't know if this happens all that often. It's entirely possible that syndication feed readers and other programs that invent things for their If-Modified-Since values are also not using If-None-Match and ETag values. I've recently added instrumentation to the software here so that I can tell, so maybe I'll have more to report soon.

(If I was an energetic person I would hunt through the data that rachelbythebay has accumulated in their feed reader behavioral testing project to see what it has to say about this. I don't know of an overall index for the project; see their archives for the most recent update. However, I'm not that energetic.)


Comments on this page:

By nell at 2024-10-12 13:44:56:

the If-Modified-Since header, validated against the Last-Modified of the reply.

Strictly speaking, based on the Mozilla pages you linked, it's validated against the "Last-Modified date of the distant resource", which need not ever have been sent by the server.

First, it's common for syndication feed readers and other things to treat the 'If-Modified-Since' value they provide as a timestamp, not as an opaque string that echoes back your previous Last-Modified.

Well, sure; it's explicitly documented as a timestamp: "The If-Modified-Since request HTTP header makes the request conditional: the server sends back the requested resource, with a 200 status, only if it has been last modified after the given date." It says the date is required to be in "GMT", has explicit instructions for parsing and generating it, and notes the risk of the clients and servers having different clocks (a risk that's significantly decreased in the 25 years since publication).

you can change to comparing your Last-Modified value against people's If-Modified-Since as a timestamp

Where would one get the idea it should be treated as an opaque string, and compared by strict equality? RFC 2616 says the server "may" do that, but I think you're interpreting the word "match" too literally; 14.25 is quite explicit that it should be compared as a "less-than"—although that's only a "should", and it's also stated that clients should use the exact date sent by the server, if known.

So, it's explicitly valid, and possibly even useful, to use arbitrary timestamps that were never sent by the server. Maybe I just want to see "what's Chris posted in the last week?", not knowing the exact last time I checked (like if I'm setting up a feed-reader on a new machine and don't want hundreds of old posts appearing). For that matter, it seems valid for clients to "make up" ETag values. Hypothetically, if I told some stateless downloader to save URL X to path Y, where a file already exists, it could hash that file in various ways to generate plausible ETag values and potentially avoid needless re-downloading. I don't, however, see much benefit to a client sending an ETag and timestamp together.

By cks at 2024-10-12 15:02:10:

The pragmatic problem with treating the If-Modified-Since value as a timestamp and doing time comparisons with it against your own timestamp is that if you do so, your entire web server environment needs to make absolutely sure that the timestamp of a URL can never go backward, and it's my view that this is very hard to guarantee in practice. If the URL timestamp ever goes backward, you will incorrectly tell people that they have the current, correct version of a resource when they don't (unless they gave you an If-None-Match too so you can check a better validator).
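
To make the failure concrete, here is a tiny illustration with made-up timestamps. The scenario (new content that ends up with an older timestamp than the version the client already has) and all of the names are hypothetical.

```python
from datetime import datetime, timezone


def not_modified_by_time(client_ims, server_last_modified):
    """Time-based check: 304 if the client's time is at or after ours."""
    return client_ims >= server_last_modified


def not_modified_by_equality(client_ims, server_last_modified):
    """Opaque-echo check: 304 only on an exact match."""
    return client_ims == server_last_modified


# The client fetched a version stamped October 11th and echoes that back.
client_ims = datetime(2024, 10, 11, 22, 0, tzinfo=timezone.utc)

# The URL now has different content, but its timestamp went backward
# (for example, a file replaced by one with an older modification time).
current_last_modified = datetime(2024, 10, 10, 9, 0, tzinfo=timezone.utc)

# True: the client is wrongly told that it is up to date.
stale_304 = not_modified_by_time(client_ims, current_last_modified)
# False: the client gets the full, current response.
exact_304 = not_modified_by_equality(client_ims, current_last_modified)
```

The equality check doesn't care which direction the timestamp moved, only that it changed.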

(I should have linked to my past entry on this problem in this entry, I just forgot.)

Note that making up a recent If-Modified-Since will not normally limit how many syndication feed entries or the like you get. It's not RFC-compliant to serve you different replies based on the specific values of your If-Modified-Since, so any server that does it would be going well out of its way to guess what a HTTP client means by a random If-Modified-Since. If you make up an If-Modified-Since for a fresh client, either you get nothing (a 304 Not Modified) or you get the full current version of the resource, however big and old it is.

As for ETag (well, If-None-Match), it's not useful for clients to try to make up a value on their own because it is explicitly an opaque server-generated value. Unless you know extremely specific information about how the server generates its ETag values and the server does so only from information that you have access to, you cannot duplicate the server's ETag calculations. If you calculate a hypothetical ETag and it matches the server, there's absolutely no guarantee that you actually have the same content, and in general it's much more likely that it doesn't.

(For example, Apache generates ETag values for static files from stat() information for their inodes. Clients do not know, for example, the inode number of a server side file, so even in the best case you can't duplicate this calculation.)

I agree with you about sending both If-Modified-Since and If-None-Match; if I had my way, everyone would use only If-None-Match. However, tons of actual clients send both (for example, most syndication feed readers that support ETags also send If-Modified-Since). Perhaps part of it has to do with HTTP caches, but I haven't looked into this.

By nell at 2024-10-12 19:58:23:

You are correct that I was misremembering how RSS worked, which invalidates my example. Thus, the actual question I could ask is "did Chris post anything within the last week?", which seems quite a bit less useful.

it's not useful for clients to try to make up a[n ETag] value on their own because it is explicitly an opaque server-generated value. Unless you know extremely specific information about how the server generates its ETag values

The servers etc. are almost all open-source, right? And it appears the client can pass multiple ETag values, with only one having to match. So, while I'm not actually suggesting clients do this, it might be an interesting thing to experiment with.

If you calculate a hypothetical ETag and it matches the server, there's absolutely no guarantee that you actually have the same content, and in general it's much more likely that it doesn't.

I'm not following here. If I were to pass an MD5 tag of the existing data (again, just hypothetically), and a SHA-1 and SHA-256 tag, I don't see why those would have any meaningful chance of matching the actual ETag, if the data were different. Unlike the timestamps, these have to match exactly (even for a "weak" match, the server has to recognize the exact string).
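
Purely as an illustration of this hypothetical experiment, a client could build an If-None-Match header from several hashes of the copy it already has. This assumes that the server's ETag happens to be a quoted hex digest of the content, which nothing guarantees, and the function name is invented.

```python
import hashlib


def candidate_if_none_match(data):
    """Build an If-None-Match value from guessed, hash-based ETags.

    Assumption: the server uses a quoted hex digest of the body as its
    ETag. Real servers are free to generate ETags in any way they like.
    """
    digests = [h(data).hexdigest()
               for h in (hashlib.md5, hashlib.sha1, hashlib.sha256)]
    return ", ".join('"%s"' % d for d in digests)
```

The server only has to recognize one of the listed values for the condition to match.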

If the URL timestamp ever goes backward, you will incorrectly tell people that they have the current, correct version of a resource when they don't (unless they gave you an If-None-Match too so you can check a better validator).

Okay, thanks for linking your previous page. I can kind of see your point; but, also, it'll get me on a bit of a rant about the primitive state of URL management. Before I start that, I'll just note that if you don't think you can manage Last-Modified dates, I reckon it's better to omit the header entirely; maybe your instrumentation will support or refute this idea.

The W3C wrote a document long ago (as you may be able to tell from the images being in GIF, and highly aliased—it was 1998) called "Cool URIs don't change". Unfortunately, it seems that they never actually wrote any software to help people manage that, and nor did anyone else; at least nothing that became popular. In my view, the simplistic file-system-based web-serving should've been stopped long ago, except maybe for "toy" sites. Instead, there should be a database—in 2024, I'm thinking maybe something git-based—relating URLs to data hashes (which could double as ETags; weak ETags would require extra tracking). Discontinuing or changing any URL should be a deliberate choice, something very difficult to do by accident; and records should be kept of all such changes.

By contrast, today's "state of the art" seems to be to accept that any URL is bound to break after a few years, and to have the 404 page apologize, mention some recent "re-organization", and maybe link to Google.

By Ian Z aka nobrowser at 2024-10-13 15:20:33:

Suppose -- half hypothetically :-P -- I'm writing Yet Another Feed Reader. Can I in fact depend on the server providing an ETag? What should I do if there is none? Fall back on timestamps? That's already more complexity than I can tolerate.

By cks at 2024-10-13 23:21:09:

I don't think you can count on the server providing an ETag, so as a good feed reader you have to be willing to use the Last-Modified instead. If I was writing a feed reader that was going to do its best against arbitrary and potentially screwed up feed sources, I would send both If-Modified-Since and If-None-Match headers if the server had provided both, because who knows, it could have broken ETag handling but working Last-Modified.

Unfortunately these are the breaks of writing a good feed reader.
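
As a rough sketch of the client side of this advice, using only Python's standard urllib: send back whichever validators the server gave you last time, exactly as it gave them. The `saved` dict and the function names are assumptions for illustration; a real feed reader would also want things like rate limiting and persistent storage.

```python
import urllib.error
import urllib.request


def conditional_headers(saved):
    """Build request headers from the validators we saved last time.

    saved is a hypothetical dict with optional 'etag' and 'last_modified'
    keys, holding the server's header values verbatim.
    """
    headers = {}
    if saved.get("etag"):
        headers["If-None-Match"] = saved["etag"]
    if saved.get("last_modified"):
        # Echo the server's string back untouched; don't invent a time.
        headers["If-Modified-Since"] = saved["last_modified"]
    return headers


def fetch_feed(url, saved):
    """Return (body, new_saved); body is None if the feed is unchanged."""
    req = urllib.request.Request(url, headers=conditional_headers(saved))
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = resp.read()
            new_saved = {"etag": resp.headers.get("ETag"),
                         "last_modified": resp.headers.get("Last-Modified")}
            return body, new_saved
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None, saved
        raise
```

If the server provided neither header, this just makes an ordinary unconditional GET.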

Written on 11 October 2024.
Last modified: Fri Oct 11 22:02:00 2024