The conflict between caching and tracking on the web

August 21, 2011

The web user privacy story of the recent past has been the news that web tracking companies are using ETag and Last-Modified headers to covertly track users. In the process of thinking about the issue and writing yesterday's entry, I've come to the probably unsurprising conclusion that there is a fundamental conflict between browser caching and avoiding tracking.

The attacks on ETag and Last-Modified are the tip of an iceberg. Both of these headers are quite convenient for tracking because the browser will directly store them and report them back to the server, which means that you can encode a value into them and then recover it later. But cache state itself is also stored information, and the very nature of caching means that the browser has to report the information back to the server if the cache is going to do any good.
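
As an illustration, here is a minimal sketch of the ETag half of this in Python; the handler, port, and ID scheme are all made up for the example, not taken from any real tracker:

    import uuid
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TrackingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # On revalidation the browser hands back whatever ETag we
            # gave it earlier, so the ETag can simply carry a per-user ID.
            seen = self.headers.get('If-None-Match')
            if seen:
                self.log_message('returning visitor: %s', seen)
                self.send_response(304)   # "not modified": keep your copy
                self.end_headers()
                return
            # First visit: mint an ID and smuggle it out as the ETag.
            tag = '"%s"' % uuid.uuid4().hex
            self.log_message('new visitor tagged as: %s', tag)
            body = b'GIF89a'              # placeholder for a real 1x1 GIF
            self.send_response(200)
            self.send_header('ETag', tag)
            self.send_header('Content-Type', 'image/gif')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(('', 8000), TrackingHandler).serve_forever()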

This leads directly to the conflict: the more effective the browser cache is, the easier it is to use the browser cache contents to track you. Conversely, all of the methods of making this tracking harder have the necessary effect of making your browser cache less effective. To make yourself completely untrackable, in theory you need to have no browser cache.

(In practice I think that what you really need to do is inject enough noise into the tracking process that it can't reliably tell people apart. However, this rapidly becomes an arms race between the two sides, with the tracking side storing and reading back more and more redundant information in order to defeat noise-injection measures such as browsers that drop random entries from their cache.)
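
Here is a toy sketch of the redundancy side of that arms race; the encoding is entirely hypothetical, but it shows how simple replication plus majority voting shrugs off a fair amount of random cache dropping:

    import random

    COPIES = 5

    def encode(bits):
        # Store each bit of the ID in several independent cache entries
        # (say, several tracking URLs) so that losing a few doesn't matter.
        return [b for b in bits for _ in range(COPIES)]

    def decode(slots):
        # None means the browser dropped that entry, so it casts no vote;
        # majority-vote whatever survives in each group of copies.
        out = []
        for i in range(0, len(slots), COPIES):
            votes = [s for s in slots[i:i + COPIES] if s is not None]
            out.append(1 if sum(votes) * 2 >= len(votes) else 0)
        return out

    userid = [1, 0, 1, 1, 0, 0, 1, 0]
    stored = encode(userid)
    # A defending browser randomly drops 30% of its cache entries:
    noisy = [None if random.random() < 0.3 else s for s in stored]
    print(decode(noisy) == userid)    # almost always True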

Thus I'm very doubtful that technical countermeasures in browsers can defeat this sort of 'undeletable' tracking; the only technical countermeasure that I see being fully effective is to have no long-lived cache at all. This is only viable in some environments, so I don't expect browsers to make it a default.

(This doesn't mean that we're doomed; it means that we have to use non-technical solutions to the problem, like publicity, shaming, and so on.)

(I doubt that this is new to web privacy people.)


Comments on this page:

From 87.79.236.202 at 2011-08-21 02:55:41:

This affects revalidating cache approaches (any mechanism in HTTP that does the 304 dance), but not Expires-based caching. The latter can also be made perfectly accurate by never recycling URLs. And it’s as much in the interests of servers to support caching as it is in those of clients.

Aristotle Pagaltzis

By cks at 2011-08-21 13:18:44:

My instinct is that Expires-based caching can be exploited as well, although I haven't worked out an exploit to be sure. The basic idea is that every URL from the tracking site gives it one bit of information (from whether or not it was re-requested), and the site can reassemble an identifier from there. It would be potentially noisy, but there are ways to deal with noise.
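
To sketch the shape of the exploit I have in mind (unverified, so real browser behavior may differ in the corner cases):

    def first_visit_urls(userid_bits):
        # First visit: the page references only the resources whose bit
        # is 1, each served with a far-future Expires so they get cached.
        return ['/t/%d.gif' % i for i, b in enumerate(userid_bits) if b]

    def recover_bits(n, requested):
        # Later visit: the page references all n resources. Cached ones
        # are never re-requested (thanks to Expires), so the requests the
        # server *doesn't* see are the 1 bits.
        return [0 if ('/t/%d.gif' % i) in requested else 1 for i in range(n)]

    bits = [1, 0, 1, 1, 0, 0, 0, 1]
    cached = set(first_visit_urls(bits))
    # The browser only re-requests what it doesn't already have cached:
    seen = {'/t/%d.gif' % i for i in range(8)} - cached
    print(recover_bits(8, seen) == bits)   # True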

From 97.107.130.220 at 2011-08-21 18:28:51:

What I mean is the strategy where Expires is set to “essentially never” and if an asset changes, the new version gets a new URL and the referring page changes its reference. The client never needs to revalidate (nor does any caching intermediary, more importantly, which disperses trackability further).

This is not applicable to all scenarios, but it takes care of very many.
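
Concretely, the scheme looks something like this sketch; the content-hash naming is just one way of never recycling URLs:

    import hashlib

    def asset_url(content: bytes) -> str:
        # Name the asset after a hash of its content: a changed asset
        # gets a brand-new URL, so the old URL's cached copy never
        # becomes stale and never needs revalidation.
        return '/static/%s.css' % hashlib.sha256(content).hexdigest()[:16]

    # The asset itself is then served with headers along the lines of:
    #   Expires: Thu, 31 Dec 2037 23:55:55 GMT
    #   Cache-Control: max-age=315360000
    # so neither the client nor any intermediary cache ever revalidates.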

Aristotle Pagaltzis

From 66.57.46.132 at 2011-08-22 15:56:16:

Could someone point to where I might learn how a cache server such as Squid affects these elements?

   - Stephen P. Schaefer