Hashed Ethernet addresses are not anonymous identifiers
Somewhat recently, I read What we've learned from .NET Core SDK Telemetry, in which Microsoft mentioned that in .NET Core 2.0 they will be collecting, well, let's quote them:
- Hashed MAC address — Determine a cryptographically (SHA256) anonymous and unique ID for a machine. Useful to determine the aggregate number of machines that use .NET Core. This data will not be shared in the public data releases.
So, here's the question: is a hashed Ethernet address really anonymous, or at least sufficiently anonymous for most purposes? I will spoil the answer: hashing Ethernet addresses with SHA256 does not appear to make them anonymous in practice.
Hashing by itself does not make things anonymous. For instance, suppose you want to keep anonymous traffic records for IPv4 traffic and you propose to (separately) hash the source and destination IPs with MD5. Unfortunately this is at best weakly anonymous. There are only 2^32 (about 4.3 billion) IPv4 addresses, which is few enough that an attacker can simply pre-compute the hashes of all of them, probably keep them in memory, and then immediately de-anonymize your 'anonymous' source and destination data.
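To make the pre-computation concrete, here is a minimal sketch in Python. It assumes the addresses were hashed as dotted-quad ASCII strings (a real system might hash the raw four bytes instead), and it only covers a /24 documentation range so that it runs instantly; the function name build_md5_table is made up for illustration.

```python
import hashlib
import ipaddress


def build_md5_table(cidr):
    """Map MD5 hex digest -> dotted-quad string for every address in cidr.

    Assumption: the 'anonymized' records hashed the address as ASCII text.
    """
    table = {}
    for addr in ipaddress.ip_network(cidr):
        text = str(addr)
        table[hashlib.md5(text.encode("ascii")).hexdigest()] = text
    return table


# Pre-compute once for the range we care about (a /24 here, for speed).
table = build_md5_table("192.0.2.0/24")

# An 'anonymous' record is reversed with a single dictionary lookup.
hashed = hashlib.md5(b"192.0.2.77").hexdigest()
print(table[hashed])
```

The same loop over 0.0.0.0/0 is just a larger version of this; the only things that change are the running time and how much memory the table needs.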
Ethernet MAC addresses are 6 bytes long, meaning that there are
2^48 of them that are theoretically possible. However the first
three bytes (24 bits) are the vendor OUI, and there are only a limited number of them that have
been assigned (you can see one list of these here), so the practical number
of MACs is significantly smaller. Even at full size, six bytes is
not that many these days and is vulnerable to brute force attacks.
Modern GPUs can apparently compute SHA256 hashes at a rate of roughly
2.9 billion hashes a second (from here),
or perhaps 4 billion hashes a second (from here).
Assuming I'm doing the math right, it would take roughly a day or
so to compute the SHA256 hash of all possible Ethernet addresses,
which is not very long. The sort of good news is that using SHA256
probably makes it infeasible to pre-compute a
reverse lookup table for
this, due to the massive amount of space required.
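For anyone who wants to check my math, the arithmetic can be sketched as follows. The two hash rates are the figures quoted above; the lookup table entry size (a 32-byte SHA256 digest plus the 6-byte address, with no index overhead) is my own assumption about the most compact plausible layout.

```python
TOTAL_MACS = 2 ** 48  # every possible 6-byte Ethernet address

# Hash rates quoted in the text, in SHA256 hashes per second.
RATES = {"2.9 billion/s": 2.9e9, "4 billion/s": 4.0e9}


def hours_to_hash(count, rate):
    """Hours needed to hash `count` inputs at `rate` hashes per second."""
    return count / rate / 3600


for label, rate in RATES.items():
    print(f"{label}: {hours_to_hash(TOTAL_MACS, rate):.1f} hours")

# Assumed table entry: 32-byte digest + 6-byte MAC, nothing else.
ENTRY_BYTES = 32 + 6
table_bytes = TOTAL_MACS * ENTRY_BYTES
print(f"reverse lookup table: {table_bytes / 1e15:.1f} PB")
```

This works out to somewhere between 19 and 27 hours on a single GPU, and a table on the order of ten petabytes, which is where both "a day or so" and "massive amount of space" come from.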
However, we shouldn't brute force search the entire theoretical
Ethernet address space, because we can do far better (with far worse
consequences for anonymity). If we confine ourselves
to known OUIs, the search space shrinks significantly. There appear
to be only around 23,800 assigned OUIs at the moment; even at only
2.9 billion SHA256 hashes a second, it takes less than three minutes
to exhaustively hash and search all their MACs (and that's with
only a single GPU). The memory requirements for a
reverse lookup table
remain excessive, but it doesn't really matter; three minutes is
fast enough for non-realtime deanonymization for analysis and other
things. In practice those Ethernet addresses that Microsoft are
collecting are not anonymous in the least; they're simply obscured,
so it would take Microsoft a modest amount of work to see what they
are.
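Here is a rough sketch of both the arithmetic and the search itself. The OUI count and hash rate are the figures from the text. Everything else is an assumption for illustration: I assume the raw 6 bytes of the address are what gets hashed (Microsoft's announcement doesn't say how the address is serialized), the two OUI values are placeholders, find_mac is a made-up name, and the demo caps the per-OUI search at 2^16 suffixes so that plain Python finishes quickly.

```python
import hashlib

OUI_COUNT = 23_800        # assigned OUIs, figure from the text
MACS_PER_OUI = 2 ** 24    # three free bytes under each OUI
RATE = 2.9e9              # SHA256 hashes per second, figure from the text

candidates = OUI_COUNT * MACS_PER_OUI
print(f"full search: {candidates / RATE / 60:.1f} minutes")
# Assumed table entry: 32-byte digest + 6-byte MAC.
print(f"lookup table: {candidates * 38 / 1e12:.1f} TB")


def find_mac(target_digest, ouis, suffix_limit=MACS_PER_OUI):
    """Return the 6-byte MAC whose SHA256 digest matches, or None.

    Assumption: the hash input is the raw 6-byte address.
    """
    for oui in ouis:
        for suffix in range(suffix_limit):
            mac = oui + suffix.to_bytes(3, "big")
            if hashlib.sha256(mac).digest() == target_digest:
                return mac
    return None


# Demo with placeholder OUIs and a reduced suffix range.
ouis = [bytes.fromhex("000000"), bytes.fromhex("00005e")]
secret = bytes.fromhex("00005e00a1b2")
target = hashlib.sha256(secret).digest()
found = find_mac(target, ouis, suffix_limit=2 ** 16)
print(found.hex())
```

The full search comes to a bit over two minutes at the quoted GPU rate, and the table to roughly 15 TB. A real attack would run the inner loop on a GPU instead of in Python, but the structure is the same: enumerate the known OUIs, enumerate the suffixes, compare digests.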
I don't know whether Microsoft is up to evil here or simply didn't run the numbers before they decided that using SHA256 on Ethernet addresses produced anonymous results. It doesn't really matter, because not running the numbers when planning data collection such as this is incompetence. If you propose to collect anonymous identifiers, it is your responsibility to make sure that they actually are anonymous. Microsoft has failed to do so.