Hashed Ethernet addresses are not anonymous identifiers

August 21, 2017

Somewhat recently, I read What we've learned from .NET Core SDK Telemtry, in which Microsoft mentioned that in .NET Core 2.0 they will be collecting, well, let's quote them:

  • Hashed MAC address — Determine a cryptographically (SHA256) anonymous and unique ID for a machine. Useful to determine the aggregate number of machines that use .NET Core. This data will not be shared in the public data releases.

(My path to the Microsoft article involved Your tools shouldn't spy on you, via, but it's complicated.)

So, here's the question: is a hashed Ethernet address really anonymous, or at least sufficiently anonymous for most purposes? I will spoil the answer: hashing Ethernet addresses with SHA256 does not appear to make them anonymous in practice.

Hashing by itself does not make things anonymous. For instance, suppose you want to keep anonymous traffic records for IPv4 traffic and you propose to (separately) hash the source and destination IPs with MD5. Unfortunately this is at best weakly anonymous. There are few enough IPv4 addresses that an attacker can simply pre-compute the hashes of all of them, probably keep them in memory, and then immediately de-anonymize your 'anonymous' source and destination data.

Ethernet MAC addresses are 6 bytes long, meaning that there are 2^48 of them that are theoretically possible. However the first three bytes (24 bits) are the vendor OUI, and there are only a limited number of them that have been assigned (you can see one list of these here), so the practical number of MACs is significantly smaller. Even at full size, six bytes is not that many these days and is vulnerable to brute force attacks. Modern GPUs can apparently compute SHA256 hashes at a rate of roughly 2.9 billion hashes a second (from here), or perhaps 4 billion hashes a second (from here). Assuming I'm doing the math right, it would take roughly a day or so to compute the SHA256 hash of all possible Ethernet addresses, which is not very long. The sort of good news is that using SHA256 probably makes it infeasible to pre-compute a rainbow table reverse lookup table for this, due to the massive amount of space required.

However, we shouldn't brute force search the entire theoretical Ethernet address space, because we can do far better (with far worse results for the anonymity of the results). If we confine ourselves to known OUIs, the search space shrinks significantly. There appear to be only around 23,800 assigned OUIs at the moment; even at only 2.9 billion SHA256 hashes a second, it takes less than three minutes to exhaustively hash and search all their MACs (and that's with only a single GPU). The memory requirements for a rainbow table reverse lookup table remain excessive, but it doesn't really matter; three minutes is fast enough for non-realtime deanonymization for analysis and other things. In practice those Ethernet addresses that Microsoft are collecting are not anonymous in the least; they're simply obscured, so it would take Microsoft a modest amount of work to see what they were.

I don't know whether Microsoft is up to evil here or simply didn't run the numbers before they decided that using SHA256 on Ethernet addresses produced anonymous results. It doesn't really matter, because not running the numbers when planning data collection such as this is incompetence. If you proposed to collect anonymous identifiers, it is your responsibility to make sure that they actually are anonymous. Microsoft has failed to do so.

Written on 21 August 2017.
« The surprising longevity of Unix manpage formatting
Understanding rainbow tables »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Mon Aug 21 01:05:04 2017
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.