Hashes are not complete protectors of privacy

November 29, 2007

Suppose that you have been charged with building a traffic tracking system for your NAT gateway, for use when the campus NOC calls to report that your gateway is generating too much of the wrong sort of traffic and gives you the remote IP addresses involved. At the same time, you don't want to log too much information; all the juicy traffic data you're gathering shouldn't be usable for something like finding out all the websites someone has been visiting.

No problem; you're smart and you know about hashes and their uses for privacy, so you just hash all the remote IP addresses and store the hashes instead of the actual IP addresses (you have enough storage space that this isn't a problem). Fishing expeditions are prevented: someone could get the hashes of every address that their grad student had visited, but they can't reverse them to IP addresses. Yet when the NOC gives you an IP address, you can hash it and then look up the hash in your traffic database to find the originators.
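As a rough illustration of the scheme described above, here is a minimal Python sketch. The `TrafficLog` class, its method names, and the choice to hash the dotted-quad string form of the address are all assumptions made for the example, not a description of any real system.

```python
import hashlib


def hash_ip(ip: str) -> str:
    """Return the SHA1 hex digest of an IP address in dotted-quad form."""
    return hashlib.sha1(ip.encode("ascii")).hexdigest()


class TrafficLog:
    """Hypothetical log that stores only hashes of remote addresses."""

    def __init__(self):
        # Maps hash of remote IP -> set of internal hosts that talked to it.
        self._by_hash = {}

    def record(self, internal_host: str, remote_ip: str) -> None:
        self._by_hash.setdefault(hash_ip(remote_ip), set()).add(internal_host)

    def originators(self, remote_ip: str) -> list:
        """What the NOC call needs: who talked to this remote address?"""
        return sorted(self._by_hash.get(hash_ip(remote_ip), set()))

    def hashes_for(self, internal_host: str) -> list:
        """What a fishing expedition gets: opaque hashes, not addresses."""
        return sorted(h for h, hosts in self._by_hash.items()
                      if internal_host in hosts)


log = TrafficLog()
log.record("10.0.0.5", "192.0.2.10")
log.record("10.0.0.9", "192.0.2.10")
log.record("10.0.0.5", "198.51.100.7")

noc_answer = log.originators("192.0.2.10")
fishing_result = log.hashes_for("10.0.0.5")
```

The NOC lookup works because hashing is deterministic, while listing everything for one internal host yields only digests.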

(We will ignore the inconvenient fact that a database of the SHA1s of all 2^32 IP addresses is only about 80 GiB, at 20 bytes per hash, which would let someone reverse any stored hash by simple lookup. You don't even need that much, since a bunch of IP address space is unused, reserved, or not relevant.)
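The size figure is simple arithmetic, and the reversal it enables can be sketched on a small scale. This is only an illustration: `build_reverse_table` is a hypothetical helper, it assumes addresses were hashed in their dotted-quad string form, and it uses a single /24 documentation range rather than the whole address space.

```python
import hashlib
import ipaddress

SHA1_BYTES = 20                     # a raw SHA1 digest is 160 bits
table_bytes = 2**32 * SHA1_BYTES    # one digest per IPv4 address
table_gib = table_bytes / 2**30     # size in GiB (2^30 bytes)


def build_reverse_table(cidr: str) -> dict:
    """Precompute hash -> address for every address in a network."""
    return {
        hashlib.sha1(str(ip).encode("ascii")).hexdigest(): str(ip)
        for ip in ipaddress.ip_network(cidr)
    }


# Reversing a "protected" hash is just a dictionary lookup.
table = build_reverse_table("192.0.2.0/24")
stored_hash = hashlib.sha1(b"192.0.2.10").hexdigest()
recovered = table.get(stored_hash)
```

A real table would store the raw 20-byte digests, not hex strings, and would be indexed on disk instead of held in a dictionary.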

Unfortunately, this hashing scheme does not completely protect your users' privacy. While it stops people from finding out all the websites someone has visited, it does not stop people from finding out who has visited a specific website. People can't get traffic patterns, but they can find out whether a Chinese grad student is browsing a Falun Gong website (or even which Chinese grad students are doing so).
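The targeted query needs nothing more than the same lookup the NOC case relies on. Here is a small sketch, assuming a hypothetical log that maps the SHA1 of each remote address to the internal hosts that contacted it; the addresses are documentation placeholders.

```python
import hashlib


def hash_ip(ip: str) -> str:
    return hashlib.sha1(ip.encode("ascii")).hexdigest()


# Hypothetical stored log: hash of remote IP -> internal hosts seen.
log = {
    hash_ip("192.0.2.10"): {"10.0.0.5", "10.0.0.9"},
    hash_ip("198.51.100.7"): {"10.0.0.5"},
}


def who_visited(log: dict, site_ip: str) -> list:
    """Everyone who contacted one specific, known remote address."""
    return sorted(log.get(hash_ip(site_ip), set()))


def did_visit(log: dict, host: str, site_ip: str) -> bool:
    """Did this particular host contact that specific address?"""
    return host in log.get(hash_ip(site_ip), set())


visitors = who_visited(log, "192.0.2.10")
confirmed = did_visit(log, "10.0.0.9", "192.0.2.10")
ruled_out = did_visit(log, "10.0.0.9", "198.51.100.7")
```

Anyone who already knows the address of the site they care about can ask both questions without ever reversing a hash.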

In other words, using a hash here has created privacy in only one direction; your users have secrecy (their browsing habits can't be determined) but not deniability (it can be shown that they visited something specific). If you need full privacy, with both secrecy and deniability, you need to use more than simple hashes.

(Disclaimer: this is not how we implemented our NAT gateway traffic tracking system.)

Last modified: Thu Nov 29 23:24:56 2007