Hashed Ethernet addresses are not anonymous identifiers

August 21, 2017

Somewhat recently, I read What we've learned from .NET Core SDK Telemtry, in which Microsoft mentioned that in .NET Core 2.0 they will be collecting, well, let's quote them:

  • Hashed MAC address — Determine a cryptographically (SHA256) anonymous and unique ID for a machine. Useful to determine the aggregate number of machines that use .NET Core. This data will not be shared in the public data releases.

(My path to the Microsoft article involved Your tools shouldn't spy on you, via, but it's complicated.)

So, here's the question: is a hashed Ethernet address really anonymous, or at least sufficiently anonymous for most purposes? I will spoil the answer: hashing Ethernet addresses with SHA256 does not appear to make them anonymous in practice.

Hashing by itself does not make things anonymous. For instance, suppose you want to keep anonymous traffic records for IPv4 traffic and you propose to (separately) hash the source and destination IPs with MD5. Unfortunately this is at best weakly anonymous. There are few enough IPv4 addresses that an attacker can simply pre-compute the hashes of all of them, probably keep them in memory, and then immediately de-anonymize your 'anonymous' source and destination data.

Ethernet MAC addresses are 6 bytes long, meaning that there are 2^48 of them that are theoretically possible. However the first three bytes (24 bits) are the vendor OUI, and there are only a limited number of them that have been assigned (you can see one list of these here), so the practical number of MACs is significantly smaller. Even at full size, six bytes is not that many these days and is vulnerable to brute force attacks. Modern GPUs can apparently compute SHA256 hashes at a rate of roughly 2.9 billion hashes a second (from here), or perhaps 4 billion hashes a second (from here). Assuming I'm doing the math right, it would take roughly a day or so to compute the SHA256 hash of all possible Ethernet addresses, which is not very long. The sort of good news is that using SHA256 probably makes it infeasible to pre-compute a rainbow table reverse lookup table for this, due to the massive amount of space required.

However, we shouldn't brute force search the entire theoretical Ethernet address space, because we can do far better (with far worse results for the anonymity of the results). If we confine ourselves to known OUIs, the search space shrinks significantly. There appear to be only around 23,800 assigned OUIs at the moment; even at only 2.9 billion SHA256 hashes a second, it takes less than three minutes to exhaustively hash and search all their MACs (and that's with only a single GPU). The memory requirements for a rainbow table reverse lookup table remain excessive, but it doesn't really matter; three minutes is fast enough for non-realtime deanonymization for analysis and other things. In practice those Ethernet addresses that Microsoft are collecting are not anonymous in the least; they're simply obscured, so it would take Microsoft a modest amount of work to see what they were.

I don't know whether Microsoft is up to evil here or simply didn't run the numbers before they decided that using SHA256 on Ethernet addresses produced anonymous results. It doesn't really matter, because not running the numbers when planning data collection such as this is incompetence. If you proposed to collect anonymous identifiers, it is your responsibility to make sure that they actually are anonymous. Microsoft has failed to do so.


Comments on this page:

By dozzie at 2017-08-21 08:15:48:

The sort of good news is that using SHA256 probably makes it infeasible to pre-compute a rainbow table for this, due to the massive amount of space required.

I believe you misunderstand what a rainbow table is. It's a hash lookup table, yes, but one that saves memory by making single lookup use more computation, and you choose how much memory to save when you build such table. Your comments would apply to a regular lookup table, e.g. a hash or B-tree, which remembers all possible values directly.

Rainbow table is built of strips, each containing a start text, end hash, and number of iterations used to get from start to end. A strip is built by passing the start text through hash function (1 step), and its output again through the hash (2 steps), and its output again (3 steps), and again. After N steps, N is remembered (explicitly or implicitly; there are different strategies of determining N) along with the end result. Strips are indexed by the end hash.

The lookup requires you to calculate hash of your hash value and check if it is in the list of end hashes. If not, you need to calculate hash of this hash and check again, and keep doing it until either you find a match (you've found a strip) or make a chain of hashes longer than N (hash cannot be found in the table). If you have a strip, you rebuild the chain of hashes until you hit the one you're trying to reverse; the hash just before that is what you want.

If you make the strips in the rainbow table longer, it will take less memory, but you'll need more computation for a lookup.

By cks at 2017-08-21 08:59:49:

Whoops, you're completely right. I misused 'rainbow table' because I thought I knew what it mean, but I didn't. I've sort of updated the entry to correct my usage.

(Having now actually looked at the Wikipedia writeup of rainbow tables, I wonder how feasible a rainbow table is for this. I can see an obvious reduction function, but it has a limited output.)

Written on 21 August 2017.
« The surprising longevity of Unix manpage formatting
Understanding rainbow tables »

Page tools: View Source, View Normal.
Search:
Login: Password:

Last modified: Mon Aug 21 01:05:04 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.