Wandering Thoughts archives

2011-08-21

The conflict between caching and tracking on the web

The big story in web user privacy recently has been the news that web tracking companies are using ETag and Last-Modified headers to covertly track users. In the process of thinking about the issue and writing yesterday's entry, I've come to the probably unsurprising conclusion that there is a fundamental conflict between browser caching and avoiding tracking.

The attacks on ETag and Last-Modified are the tip of an iceberg. Both of these headers are quite convenient for tracking because the browser will directly store them and report them back to the server, which means that you can encode a value into them and then recover it later. But cache state itself is also stored information, and the very nature of caching means that the browser has to report the information back to the server if the cache is going to do any good.
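
To make the mechanism concrete, here's a minimal sketch of the server side of ETag-based tracking. This is hypothetical illustration code, not taken from any real tracker; the point is that the ETag is a per-browser identifier instead of a content fingerprint, and the browser dutifully reports it back.

    import secrets

    # Hypothetical tracking endpoint: the ETag identifies the browser,
    # not the content.
    def handle_request(headers, store):
        etag = (headers.get('If-None-Match') or '').strip('"')
        if etag in store:
            store[etag] += 1   # browser re-identified via its echoed ETag
            return 304, {}     # a 304 keeps the tracking ETag cached
        user_id = secrets.token_hex(16)   # mint a fresh per-browser ID
        store[user_id] = 1
        return 200, {'ETag': '"%s"' % user_id,
                     'Cache-Control': 'max-age=31536000'}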

This leads directly to the conflict: the more effective the browser cache is, the easier it is to use the browser cache contents to track you. Conversely, all of the methods of making this tracking harder have the necessary effect of making your browser cache less effective. To make yourself completely untrackable, in theory you need to have no browser cache.

(In practice I think that what you really need to do is inject enough noise into the tracking process that it can't reliably tell people apart. However, this rapidly gets into an arms race between the two sides, with the tracking side storing and reading back more and more redundant information in order to defeat noise-injection tactics like browsers that drop random entries from their cache.)

Thus I'm very doubtful that technical countermeasures in browsers can defeat this sort of 'undeletable' tracking; the only technical countermeasure that I see being fully effective is to have no long-lived cache at all. This is only viable in some environments, so I don't expect browsers to make it a default.

(This doesn't mean that we're doomed; it means that we have to use non-technical solutions to the problem, like publicity, shaming, and so on.)

(I doubt that this is new to web privacy people.)

CachingVersusTracking written at 01:25:45

2011-08-20

Why browsers can't really change or validate Last-Modified

Quoting from Nik Cubrilovic's Persistant and Unblockable Cookies Using HTTP Headers (via Hacker News):

I will be filing a bug report with the open source browsers and requesting that the date is parsed properly. This won't completely solve the problem, since users can still be tracked by setting a unique datetime - but perhaps one of the more innovative browsers will come up with a solution where the time is rounded off to the nearest hour, and some basic sanity checking is done.

There are two issues here: validating Last-Modified and changing it. As it happens, I feel that changing Last-Modified is basically impossible for the browser to do in a way that is both safe and useful.

Let's set aside the server's view of Last-Modified for now, and talk about how modifying Last-Modified affects caching if we assume a server that does time comparisons on L-M. First, it's effectively pointless for a browser to shift L-M backwards in time, since it guarantees that the server can never give you a 304 response; you're claiming that you only have something that's older than what the server has, so it must give you the current version. You might as well not cache the page at all. Second, it's clearly dangerous to shift L-M into the future (the further the shift the more dangerous), because you'll miss any server updates made between now and that future point.
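
Here is a small sketch of both failure modes, assuming a hypothetical server that does a time comparison on If-Modified-Since:

    from datetime import datetime, timedelta

    # Hypothetical server: time comparison, not exact match.
    def respond(resource_mtime, if_modified_since):
        if if_modified_since is None or resource_mtime > if_modified_since:
            return 200    # send the full current version
        return 304        # not modified; use your cached copy

    now = datetime(2011, 8, 20, 12, 0)
    # L-M shifted backwards: the resource always looks newer, so every
    # request gets a 200 and the cache entry never earns its keep.
    assert respond(now, now - timedelta(hours=1)) == 200
    # L-M shifted into the future: an update half an hour from now is
    # silently missed because it still looks older than what we claimed.
    assert respond(now + timedelta(minutes=30), now + timedelta(hours=1)) == 304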

In theory you might think that it's safe to shift L-M forward provided that the new L-M time is still in the past. In practice I think that there are a number of realistic scenarios where this still causes you to miss server updates; for example, there might have been a server-side rolling deployment of a content update that has not yet gotten to the server that you use. The 'new' content has an old timestamp because it was initially deployed some time ago on the first server (and because the deployment keeps timestamps in sync across servers to promote caching).

(Backing out of a deployment is one reason to avoid a time-based Last-Modified comparison in your server.)

This scenario may seem unusual. But the problem with making general browser changes that modify cache behavior is that they must be correct in general, not just for 'usual' situations, because someday some of your users will hit an unusual situation. And showing out of date content to users because you lied to the web server is a pretty bad sin.

The problem with validating Last-Modified headers is a pragmatic one. It's virtually guaranteed that today there are plenty of websites and web applications that serve up Last-Modified timestamps in formats that are not quite correctly formed and RFC-compliant (for all I know, DWiki is one of them; I'm not sure I paid careful attention to that bit of the RFC when writing the code). This leaves you with three choices: you can ignore non-RFC dates entirely, which means that you cache less; you can be increasingly generous in your date parsing so that you accept common RFC violations, which is a lot of work; or you can skip validating the Last-Modified value entirely and treat it as a magic cookie. It should be no wonder that the last option is relatively popular.
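
As a sketch of the gap between the first and third choices (using Python's email.utils as a stand-in strict parser; actual browsers that validate are more forgiving than this):

    from email.utils import parsedate_to_datetime

    # Choice one: validate, and refuse to remember malformed dates
    # (so you cache less).
    def validated(last_modified):
        try:
            parsedate_to_datetime(last_modified)
        except (TypeError, ValueError):
            return None
        return last_modified

    # Choice three: the magic cookie. Store the header verbatim, parse
    # nothing, and send it back later in If-Modified-Since as-is.
    def magic_cookie(last_modified):
        return last_modified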

(I admit that I would like to see browsers reject clearly impossible things, like the example that Nik Cubrilovic shows. I'm just not sure it's all that easy or reliable for a computer to tell 'clearly impossible' from a merely badly formatted date.)

BrowsersAndLastModified written at 03:24:02

2011-08-10

The 'key plus authenticator' pattern in web apps

As part of our account request system, I need to generate an unguessable URL for a particular account request. This is a widely solved problem, because it's a variant on session IDs; you generate a large random number, encode it to ASCII somehow (base64 is popular), and put it in the URL as the identifier of the particular object. When you get an incoming HTTP request you extract that portion of the URL and use it as the key to do a database lookup to find the record you want.

However, there's an issue. Database keys have to be completely unique, but large random numbers are merely almost certainly unique (cue the birthday paradox again). Since I do not enjoy debugging weird database errors that happen once in a blue moon, I decided that I wanted to do database lookups with something that was guaranteed to be unique; in this case, a primary key. But I still wanted the URL to be unguessable.

My solution was to use what I'm calling a 'key plus authenticator' pattern. I embedded in the URL both the account request's primary key (which is unique but easily guessed) and a large random number (encoded to base64) that I call the access hash. When a URL comes in I extract both pieces of information, use the primary key to look up the account request, and then check that the account request has the right access hash.

An attacker can easily guess potentially valid primary keys, but they can't get access to anything unless they also get the access hash correct (they can't even tell whether or not the primary key exists; I return an identical 404 error in both cases). With enough bits in the large random number this is just as secure as using the large random number alone. I call this the 'key plus authenticator' pattern because demonstrating knowledge of the access hash is what authenticates your use of the easily guessed primary key.
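
A minimal sketch of the pattern; the URL layout and the db interface here are made up for illustration, not lifted from the actual account request system:

    import secrets, hmac

    def new_account_request(db):
        access_hash = secrets.token_urlsafe(24)    # ~192 random bits
        pkey = db.insert(access_hash=access_hash)  # guaranteed-unique key
        return '/request/%d/%s/' % (pkey, access_hash)

    def find_account_request(db, pkey, presented_hash):
        row = db.get(pkey)   # plain unique-key lookup; no collisions possible
        if row is None or not hmac.compare_digest(row.access_hash,
                                                  presented_hash):
            return None      # caller sends the identical 404 in both cases
        return row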

(You may well want to store the authenticators in your database in some hashed form, for the same reason as session IDs. We don't do so in the account request system for a number of reasons, including that I only found out about the session ID issue after I'd written the system.)

KeyPlusAuthenticator written at 01:29:18

2011-08-08

You need to hash web app session IDs

Up until very recently (in fact, today), I could confidently recite what I thought was the best way to handle login sessions in web apps: every time someone logs in, you pick a large random number to be their session ID, give them a cookie with the session ID, and then store all of the details in your database under the session ID. When they come back, you get the session ID cookie, look it up in the database, make sure it's (still) valid, and go.

WRONG.

This approach is dangerously incomplete, as I discovered by reading a stackoverflow thread on web authentication (via Hacker News). To be secure you need to always store your session IDs in your database in some cryptographically hashed form, never in plain text.

To see why, suppose that an attacker gets surreptitious read access to your database and you use plaintext session IDs. Of course you follow best practices and salt and hash your user passwords (using something like bcrypt), so that they are not feasibly recoverable. However, the attacker now has the session IDs for all active sessions. While those sessions remain valid, the attacker can hijack any of them (by creating their own valid session cookie for the session) and thus do anything that doesn't require reauthentication in your app, as any currently logged in user. Depending on your application this may be quite a lot, especially if the attacker gets the sessions of administrative users.

(This is one reason why you should require the account's current password in your password change form, something that I hadn't quite realized until I started thinking about it now.)

On a side note, this applies just as much to sessions that use session cookies as it does to persistent cookies (contrary to what the stackoverflow writeup implies), because an attacker can use valid session cookies just as easily as they can use valid persistent cookies. What matters is whether they can create appropriate cookies given a session ID (you should assume yes) and whether that session is still valid, not what form the cookie takes or what options you've set in your official cookies.

As mentioned, to solve this you need to store the session ID only in some hashed form, just as if it was a password (because in fact it is a somewhat limited password). I can see a number of approaches to doing this.

The simplest change is to hash all session IDs with the same global value, which means that an attacker cannot directly recover a session ID given database access (you should assume that they recover your global value along with your database). This reduces the attacker back to the birthday paradox attack on session IDs, although they can now do it offline and in bulk; you should thus pick a slow hash function, like bcrypt, and make sure that you are still using a suitably large and random session ID. The advantage is that you need no cookie or (significant) database changes; you simply hash the session ID before recording it in the database at login time, and hash the session ID from the cookie before looking it up later.
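
A sketch of that flow, using HMAC-SHA256 for brevity; a deliberately slow hash like bcrypt would need a fixed salt here, since the same session ID has to hash to the same database key every time:

    import hmac, hashlib

    # Loaded from server-side configuration, not stored in the database.
    GLOBAL_KEY = b'a long random server-side value'

    def db_key(session_id):
        return hmac.new(GLOBAL_KEY, session_id.encode(),
                        hashlib.sha256).hexdigest()

    # On login:       db.store(db_key(session_id), session_record)
    # On each visit:  record = db.lookup(db_key(cookie_session_id))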

The better change is to treat session IDs like passwords and salt them individually. However, this means that you need some way to recover the session record given the information in the cookie, so that you can find the right salt. I can think of two approaches. First, you can store some index to the user record in the cookie and have a way of recovering all session records for a given user; this gives you a feasibly small number of sessions to check (by taking each session's salt, hashing the salt plus the session ID from the cookie, and checking whether the result matches the session record's hash). The other approach is to directly store a key to the session record itself in the cookie. In effect, what you have done is change the problem: your hashed 'session ID' is in fact a session ID validator, and the key is the real session ID (which you might as well make a full sized random number of however many bits).

If I was doing this in a web app, my preference would be for the second approach because it leaks less information into the cookie. The first approach necessarily puts some sort of information about the user there, which might be useful for attackers or eavesdroppers.
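
Here's a sketch of that second approach, with a hypothetical db interface; SHA-256 stands in for whatever (possibly slow) hash you prefer:

    import secrets, hashlib, hmac

    def new_session(db, user):
        skey = secrets.token_urlsafe(24)       # the real session ID / DB key
        validator = secrets.token_urlsafe(24)  # the password-like part
        salt = secrets.token_hex(16)
        vhash = hashlib.sha256((salt + validator).encode()).hexdigest()
        db.store(skey, user=user, salt=salt, vhash=vhash)
        return '%s:%s' % (skey, validator)     # the cookie value

    def session_for_cookie(db, cookie):
        skey, _, validator = cookie.partition(':')
        rec = db.get(skey)
        if rec is None:
            return None
        candidate = hashlib.sha256((rec.salt + validator).encode()).hexdigest()
        return rec if hmac.compare_digest(candidate, rec.vhash) else None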

(I'm sure that all of this is well known in the web app security community; I just feel like writing it down so that it sinks into my head. See the stackoverflow thread for more.)

PS: I maintain that you can't solve this by signing your session cookies with HMAC and a server-side secret; you should assume that your server-side secret can be compromised just as your database was. And if you are going to believe in a server-side secret, you might as well use the 'hashed with a global value' approach to storing session IDs with the server-side secret as the global value; you're just as well off either way if the attacker compromises only the database, and you're better off if the attacker compromises both the database and the server-side secret.

HashYourSessionIDs written at 22:52:10

