Working to understand PCI Express and how it interacts with modern CPUs
PCI Express sort of crept up on me while I wasn't looking. One day everything was PCI and AGP, then there was some PCI-X in our servers, and then my then-new home machine had PCIe instead but I didn't really have anything to put in those slots so I didn't pay attention. With a new home machine in my future, I've been working to finally understand all of this better. Based on my current understanding, there are two sides here.
PCI Express itself is well described by the Wikipedia page. The core unit of PCIe connections is the lane, which carries one set of signals in either direction. Multiple lanes may be used together by the same device (or connection) in order to get more bandwidth, and these lane counts are written with an 'x' prefix, such as 'x4' or 'x16'. For a straightforward PCIe slot or card, the lane count describes both its size and how many PCIe lanes it uses (or wants to use); this is written as, for example 'PCIe x16'. It's also common to have a slot that's one physical size but provides fewer PCIe lanes; this is commonly written with two lane sizes, eg 'PCIe x16 @ x4' or 'PCIe x16 (x4 mode)'.
While a PCIe device may want a certain number of lanes, that doesn't mean it's going to get them. Lane counts are negotiated by both ends, which in practice means that the system can decide that a PCIe x16 graphics card in an x16 slot is actually only going to get 8 lanes (or less). I don't know if in theory all PCIe devices are supposed to work all the way down to one lane (x1), but if so I cynically suspect that in practice there are PCIe devices that can't or won't cope well if their lane count is significantly reduced.
(PCIe interconnection can involve quite sophisticated switches.)
All of this brings me around to how PCIe lanes connect to things. Once upon a time, the Northbridge chip was king and sat at the heart of your PC; it connected to the CPU, it connected to RAM, it connected to your AGP slot (or maybe a PCIe slot). Less important and lower bandwidth things were pushed off to the southbridge. These days, the CPU has dethroned the northbridge by basically swallowing it; a modern CPU directly connects to RAM, integrated graphics, and a limited number of PCIe lanes (and perhaps a few other high-importance things). Additional PCIe lanes, SATA ports, and everything else are connected to the motherboard chipset, which then connects back to the CPU through some interconnect. On modern Intel CPUs, this is Intel's DMI and is roughly equivalent to a four-lane PCIe link; on AMD's modern CPUs, this is apparently literally an x4 PCIe link.
Because you have to get to the CPU to talk to RAM, all PCIe devices that use non-CPU PCIe lanes are collectively choked down to the aggregate bandwidth of the chipset to CPU link for DMA transfers. Since SATA ports, USB ports, and so on are also generally connected to the chipset instead of the CPU, your PCIe devices are contending with them too. This is especially relevant with high-speed x4 PCIe devices such as M.2 NVMe SSDs, but I believe it comes up for 10G networking as well (especially if you have multiple 10G ports, where I think you need x4 PCIe 3.0 to saturate two 10G links).
(I don't know if you can usefully do PCIe transfers from one device to another device directly through the chipset, without touching the CPU and RAM and thus without having to go over the link between the chipset and the CPU.)
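The bandwidth arithmetic here is easy to sketch. Here's a quick back-of-the-envelope calculation in Python, using the standard PCIe 3.0 per-lane figures; treating the DMI chipset link as an x4 PCIe 3.0 equivalent is my own rough approximation, per the above:

```python
# Rough PCIe 3.0 bandwidth math. PCIe 3.0 runs at 8 GT/s per lane with
# 128b/130b line encoding, so usable one-way bandwidth is a bit under
# 1 GB/s per lane.
PCIE3_PER_LANE_GT = 8.0        # GT/s per lane (8 Gb/s raw, one direction)
ENCODING = 128 / 130           # 128b/130b encoding overhead

def pcie3_mbps(lanes):
    """Approximate usable one-way bandwidth in MB/s for a PCIe 3.0 link."""
    return PCIE3_PER_LANE_GT * ENCODING * 1000 / 8 * lanes

dmi_like = pcie3_mbps(4)       # roughly what an x4-equivalent chipset link carries
ten_gbe = 10_000 / 8           # one 10G Ethernet port needs ~1250 MB/s

print(round(dmi_like))         # ~3938 MB/s for the whole chipset link
# Two 10G ports (~2500 MB/s) exceed an x2 link (~1969 MB/s) but fit in x4,
# which matches the claim that you need x4 PCIe 3.0 to saturate two 10G links.
```

A single M.2 NVMe SSD at x4 can in theory saturate that entire chipset link by itself, which is why where its lanes come from matters.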
Typical Intel desktop CPUs have 16 onboard PCIe lanes, which are almost always connected to an x16 and an x16 @ x8 PCIe slot for your graphics cards. Current Intel motherboard chipsets such as the Z370 have what I've seen quoted as '20 to 24' additional PCIe lanes; these lanes must be used for M.2 NVMe drives, additional PCIe slots, and additional onboard chips that the motherboard vendor has decided to integrate (for example, to provide extra USB 3.1 gen 2 ports or extra SATA ports).
The situation with AMD Ryzen and its chipsets is more tangled and gets us into the difference between PCIe 2.0 and PCIe 3.0. Ryzen itself has 24 PCIe lanes to Intel's 16, but the Ryzen chipsets seem to have fewer additional PCIe lanes, and many of them are slower PCIe 2.0 ones. The whole thing is confusing me, which makes it fortunate that I'm not planning to get a Ryzen-based system for various reasons, but for what it's worth I suspect that Ryzen's PCIe lane configuration is better for typical desktop users.
Unsurprisingly, server-focused CPUs and chipsets have more PCIe lanes and more lanes directly connected to the CPU or CPUs (for multi-socket configurations). Originally this was probably aimed at things like multiple 10G links and large amounts of high-speed disk IO. Today, with GPU computing becoming increasingly important, it's probably more and more being used to feed multiple x8 or x16 GPU card slots with high bandwidth.
Understanding M.2 SSDs in practice and how this relates to NVMe
Modern desktop motherboards feature 'M.2' sockets (or slots) for storage in addition to some number of traditional SATA ports (and not always as many of them as I'd like). This has been confusing me lately as I plan a future PC out, so today I sat down and did some reading to understand the situation both in theory and in practice.
M.2 is more or less a connector (and mounting) standard. Like USB-C ports with their alternate mode, M.2 sockets can potentially support multiple protocols (well, bus interfaces) over the same connector. M.2 sockets are keyed to mark different varieties of what they support. Modern motherboards using M.2 sockets for storage are going to be M.2 with M keying, which potentially supports SATA and PCIe x4 (and SMBus).
M.2 SSDs use either SATA or NVMe as their interface, and generally are M keyed these days. My impression is that M.2 NVMe SSDs cost a bit more than similar M.2 SATA SSDs, but can perform better (perhaps much better). An M.2 SATA SSD requires an M.2 socket that supports SATA; an M.2 NVMe SSD requires a PCIe (x4) capable M.2 socket. Actual motherboards with M.2 sockets don't necessarily support both SATA and PCIe x4 on all of their M.2 sockets. In particular, it seems common to have one M.2 socket that supports both and then a second M.2 socket that only supports PCIe x4, not SATA.
(The corollary is that if you want to have two more or less identical M.2 SSDs in the same motherboard, you generally need them to be M.2 NVMe SSDs. You can probably find a motherboard that has two M.2 sockets that support SATA, but you're going to have a more limited selection.)
On current desktop motherboards, it seems to be very common to not be able to use all of the board's M.2 sockets, SATA ports, and PCIe card slots at once. One or more M.2 SATA ports often overlap with normal SATA ports, while one or more M.2 PCIe x4 can overlap with either normal SATA ports or with a PCIe card slot. The specific overlaps vary between motherboards and don't seem to be purely determined by the Intel or AMD chipset and CPU being used.
(Some of the limitations do seem to be due to the CPU and the chipset, because of a limited supply of PCIe lanes. I don't know enough about PCIe lane topology to fully understand this, although I know more than I did when I started writing this entry.)
The ASUS PRIME Z370-A is typical of what I'm seeing in current desktop motherboards. It has two M.2 sockets, the first with both SATA and PCIe x4 and the second with just PCIe x4. The first socket's SATA steals the regular 'SATA 1' port, but its PCIe x4 is unshared and doesn't conflict with anything. The second socket steals two SATA ports (5 and 6) in PCIe x4 mode but can also run in PCIe x2 mode, which leaves those SATA ports active. So if you only want one NVMe SSD, you get 6 SATA ports; if you want one M.2 SATA SSD, you're left with 5 SATA ports; and if you want two M.2 NVMe SSDs (at full speed), you're left with 4 regular SATA ports. The whole thing is rather confusing, and also tedious if you're trying to work out whether a particular current or future disk configuration is possible.
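To make the tedium concrete, here's a little Python sketch that models the Z370-A's sharing rules as I've described them; the port numbers and modes are taken from my reading of the situation above and should be treated as illustrative, not authoritative:

```python
# Hypothetical model of the Z370-A's M.2/SATA port sharing, with port
# numbers and modes as described above; check the manual for real boards.
TOTAL_SATA = 6

def free_sata_ports(m2_1=None, m2_2=None):
    """m2_1 is None, 'sata', or 'nvme'; m2_2 is None, 'nvme-x4', or 'nvme-x2'."""
    stolen = set()
    if m2_1 == "sata":
        stolen.add(1)           # M.2_1 in SATA mode steals SATA port 1
    if m2_2 == "nvme-x4":
        stolen.update({5, 6})   # M.2_2 at full x4 steals SATA ports 5 and 6
    return TOTAL_SATA - len(stolen)

print(free_sata_ports(m2_1="nvme"))                  # 6: the x4 is unshared
print(free_sata_ports(m2_1="sata"))                  # 5
print(free_sata_ports(m2_1="nvme", m2_2="nvme-x4"))  # 4
```

If every motherboard vendor published its sharing rules in a machine-readable form like this, working out disk configurations would be a lot less painful.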
Since I use disks in mirrored pairs, I'm only interested in motherboards with two or more M.2 PCIe (x4) capable sockets. M.2 SATA support is mostly irrelevant to me; if I upgrade from my current SSD (and HD) setup to M.2 drives, it will be to NVMe SSDs.
(You can also get PCIe cards that are PCIe to some number of M.2 PCIe sockets, but obviously they're going to need to go into one of the PCIe slots with lots of PCIe lanes.)
Thinking about whether I'll upgrade my next PC partway through its life
There are some people who routinely upgrade PC components through the life of their (desktop) machine, changing out CPUs, memory, graphics cards, and close to every component (sometimes even the motherboard). I've never been one of those people; for various reasons my PCs have been essentially static once I bought them. I'm in the process of planning a new home and work PC, and this time around I'm considering deliberately planning for some degree of a midway upgrade. Given that I seem to keep PCs for at least five years, that would be two or three years from now.
(Part of my reason for not considering substantial upgrades was I hadn't assembled PCs myself, so thoughts of things like replacing my CPU with a better one were somewhat scary.)
Planning for a significant midway upgrade in advance is always a bit daring and uncertain, since you never know what companies like Intel are going to do in the future with things like new CPU sockets and so on. Despite that I think you can at least consider some things, and thus perhaps build up your initial PC with some eye towards the future changes you're likely to want to make. Well, let me rephrase that; I can think about these things, and I probably should.
However, for me the big change would be a change in mindset. My PC would no longer be something that I considered immutable and set in stone, with me just having whatever I had. Merely deliberately deciding that I'll have this mindset probably makes it more likely that I'll actually carry through and do some upgrades, whatever they turn out to be.
In terms of actual upgrades, the obvious midway change is an increase in RAM and I can make that easier by picking a DIMM layout that only populates two out of four motherboard DIMM slots. These days it's easy to get 32 GB with two 16 GB DIMM modules, and that's probably enough for now (and RAM is still surprisingly expensive, unfortunately).
Planning a midway CPU upgrade is more chancy because who knows if a compatible CPU will still be available at a reasonable price in a few years. Probably I'd have to actively keep up with CPU and socket developments, so that when my PC's CPU socket stops being supported I can wait for the last compatible CPU to hit a suitably discounted price and then get one. If this happens too soon, well, I get to abandon that idea. It's also possible that CPUs won't progress much in the next five years, although I'm hoping that we get more cores at least.
Graphics card upgrades are so common I'm not sure that people think of them as 'upgrading your PC', but they're mostly irrelevant for me (as someone who runs Linux and sticks to open source drivers). However I do sometimes use a program that could use a good GPU if I had one, and someday there may be real open source drivers for a suitable GPU card from either AMD or nVidia (I'm aware that I'm mostly dreaming). This will be an easy drop-in upgrade, as I plan to use Intel motherboard graphics to start with.
(Hard drives are a sufficiently complicated issue that I don't think they fit into the margins of this entry. Anyway, I need to do some research on the current state of things in terms of NVMe and similar things.)
Sorting out the world of modern USB (at least a bit)
Part of thinking about new machines for home and work is figuring out what motherboard I want, and part of that is figuring out what I want and need in motherboard features. I've looked into how many SATA ports I want and what it will take to drive a 4K monitor with onboard graphics, so now I've been trying to figure out USB ports. Part of this is trying to understand the different sorts of USB ports that there are and what you can do with them.
(This would be easier if I'd kept up with all of the twists and turns in PC hardware standards, but I haven't.)
USB is a vast and complicated world, with both a set of signalling standards (the old USB 2.0, USB 3.0, and now USB 3.1 aka USB 3.1 gen 2) and a set of port shapes and sizes (the original USB-A and now USB-C) that may be combined in various ways. Fortunately I'm only interested in modern and non-perverse motherboards, so for me I believe that it breaks down this way:
- Old-fashioned USB 2.0 ports (with black USB-A connectors) are too slow for disks but are (probably) fine for things like keyboards and mice. But I only need a few of these, and there's no need to have any USB 2.0 only ports if I have enough better USB ports.
- USB 3.0 ports (often using blue USB-A connectors) are good enough for general usage (theoretically including disks) but are not the latest hotness. USB 3.0 is old enough that any decent modern (desktop) motherboard should really include a bunch of USB 3.0 ports; even inexpensive motherboards have a number of them.
USB 3.0 is not infrequently called 'USB 3.1 gen 1' in advertising and product specifications. This is technically correct but practically misleading, because it's not the type of USB 3.1 that you and I want if we care about USB 3.1.
- USB 3.1 ports are either USB-C or USB-A, and you may need to look for things specifically described as 'USB 3.1 gen 2'. It's the latest hotness with the fastest connection speed (twice that of USB 3.0 aka USB 3.1 gen 1), but the more that I look the less I'm sure that this will matter to me for the next five years or so.
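The naming tangle above boils down to a small table; here's one way to write it out (the rates are the nominal signalling figures from the specs, and real-world throughput is noticeably lower):

```python
# The USB naming tangle as a lookup table of nominal signalling rates.
USB_GBPS = {
    "USB 2.0": 0.48,
    "USB 3.0 (aka 'USB 3.1 gen 1')": 5.0,   # same thing, two names
    "USB 3.1 gen 2": 10.0,
}

# Only gen 2 actually doubles the rate; 'USB 3.1 gen 1' is rebranded 3.0.
ratio = USB_GBPS["USB 3.1 gen 2"] / USB_GBPS["USB 3.0 (aka 'USB 3.1 gen 1')"]
print(ratio)   # → 2.0
```

So a spec sheet that just says 'USB 3.1' without a generation is ambiguous, which is presumably why vendors like writing it that way.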
Then there is USB-C, the new (and small) connector standard for things. When I started writing this entry I thought life was simple and modern USB-C ports were always USB 3.1 (gen 2), but this is not actually the case. It appears not uncommon for H270 and Z270 based motherboards to have USB-C ports that are USB 3.0, not USB 3.1 (gen 2). It seems likely that over time more and more external devices will expect you to have USB-C connectors even if they don't use USB 3.1 (gen 2), which strongly suggests that any motherboard I get should have at least one USB-C port and ideally more.
(The state of connecting USB-C devices to USB-A ports is not clear to me. According to the Wikipedia page on USB-C, you aren't allowed to make an adaptor with a USB-C receptacle and a USB-A connector that will plug into a USB-A port. On the other hand, you can find a lot of cables with a USB-A connector on one end and a USB-C connector on the other, advertised as letting you connect USB-C devices to old devices with USB-A, and some of them appear to support USB 3.1 gen 2 USB-A ports. There are devices that you plug USB-C cables in to, and devices that basically have a USB-C cable or connector coming out of them; the former you can convert to USB-A but the latter you can't.)
USB-C ports may support something called alternate mode, where some of the physical wires in the port and the cable are used for another protocol instead of USB. Standardized specifications for this theoretically let your USB-C port be a DisplayPort or Thunderbolt port (among others). On a desktop motherboard, this seems far less useful than simply having, say, a DisplayPort connector; among other advantages, this means you get to drive your 4K display at the same time as you have a USB-C thing plugged in. As a result I don't think Alternate Mode support matters to me, which is handy because it seems to be very uncommon on desktop motherboards.
(Alternate Mode support is obviously attractive if you have limited space for connectors, such as on a laptop or a tablet, because it may let you condense multiple ports into one. And USB-C is designed to be a small connector.)
Intel's current H270 and Z270 chipsets don't appear to natively support USB 3.1 gen 2. This means that any support for it on latest-generation motherboards is added by the motherboard vendor using an add-on controller chipset, and I think you're unlikely to find it on inexpensive motherboards. It also means that I get to search carefully to find motherboards with genuine USB 3.1 gen 2, which is being a pain in the rear so far. An alternate approach would be to get USB 3.1 gen 2 through an add-on PCIE card (based on information from here); this might be a lot less of a pain than trying to find and select a suitable motherboard.
(As for how many of each type of port I need or want, I haven't counted them up yet. My current bias is towards at least two USB 3.1 gen 2 ports, at least one USB-C port, and a bunch of USB 3.0 ports. I probably have at least four or five USB 2.0 devices to be plugged in, although some can be daisy-chained to each other. I'm a little surprised by that count, but these things have proliferated while I wasn't paying attention. Everything is USB these days.)
I failed to notice when my network performance became terrible
I've written about how I didn't notice how comparatively slow one of my machines became over time. I have now run into another excellent and uncomfortable illustration of this phenomenon; this time around, it was part of my home network performance quietly becoming rather terrible.
My current home Internet is generally around 15000 Kbps down and 7500 Kbps up; its speed is stable and solid. I also have a little GRE over IPSec VPN tunnel between my home and work Linux machines, and somewhat over a year ago I used it to do some graphics-intensive remote X work, which quite impressed me at the time. Unfortunately, at some time since then the performance of that GRE-over-IPSec VPN tunnel fell off a cliff. Today, my office machine can send my home machine data over it at only about 120 KB/sec; another machine on campus that's several network hops away from my office machine can manage only 4.7 KB/sec. Talking directly to my home machine without the GRE-over-IPsec tunnel, both can manage around 1800 KB/sec.
In one version of this story, I would now tell you how I didn't notice the decrease in performance. Looking back, that isn't what happened; instead, I noticed signs of the decrease but I casually blamed them on other causes. For instance, when rsyncing a backup copy of Wandering Thoughts to my home machine started being visibly slow, I thought 'oh, my disks must be slow'. When merely refreshing the front page of Wandering Thoughts involved a visible lurch due to the browser redoing layout as more of the page showed up, I assumed that either my browser was slow or the web server was slow (or both). In reality all of these had a single root cause, that being that I can only get 5 KB/sec of streaming TCP bandwidth from the web server.
(It was actually the slow rsyncs that caused me to start digging recently. Not only did things reach the point where they were actively irritating, but my office workstation did its own rsync at blazing speed, and I was now using SSDs at home anyway, so they shouldn't be slow. If I had a slow IO problem at home, I had real problems either with my SSDs or with ZFS, so I decided I'd better try to figure out what was going on. Eventually this got me to check network bandwidth just in case, since I was increasingly ruling out everything else I could think of, like disk IO or network latency.)
What interests me most is the psychology of all of this. I'm pretty sure that when problems started, I just assumed that they were inevitable and more or less beyond my control. Since I thought there was nothing that could be done, I didn't pay any real attention to things and I certainly didn't investigate. All of this is the result of sensible human decision-making heuristics, but these heuristics misfire every so often.
(And now I'm irritated with myself for not investigating much earlier, when I might have been able to file a bug report that gives a specific 'it was good here then became bad here' set of software versions. This is as irrational as ever, but humans are not rational creatures even if we like to pretend that we are.)
Sidebar: What I know about the situation so far
My home and office machines are both running Fedora 26, but this problem was present in Fedora 25 as well and perhaps in earlier Fedora versions. I'm pretty sure that it can't have been present when I did my trick with remote X, but that's more than a year ago, with Fedora 23.
The problem is specific to the combination of GRE over IPSec. I've tested IPSec alone and GRE alone (in their actual operating configuration), and both get the full 1800 KB/sec down that I'd expect. Only when I encapsulate my GRE tunnel in IPSec do things go wrong. Conveniently (or inconveniently), this means that the problem is entirely in the Linux kernel, so diagnosing this will probably be what they call 'fun'.
(My GRE tunnel has the same relatively low MTU whether it's inside or outside IPSec, and it's had that MTU for a very long time now.)
Letting go of having an optical drive in my machine
For various reasons I'm finally getting somewhat serious about planning out a new PC to replace my current one, and this is forcing me to confront a number of issues. One of them is the question of an optical drive.
I've had a succession of PCs over the years, and from the start they've had at least one optical drive (initially CD-ROM drives and then DVD drives). Back in those days it was somewhere between a required feature (for reading CD-ROMs and playing music) and a slam-dunk 'why not' thing. For example, look at what I said for the last machine:
I definitely want a DVD reader and the extra cost for a DVD writer is trivial, even if I haven't burned any DVDs at home over the past five years that I've had a DVD writer in my current home machine.
That emphasis is foreshadowing. It's been five years since I wrote that, and in those five years I don't think I've actually used my current home machine's DVD drive more than a handful of times (and I definitely haven't burned anything with it). In fact right now my DVD drive has been broken for more than a year and I haven't missed it.
(At work, I've burned a handful of DVDs over the years because we still do Ubuntu installs from DVDs for various reasons.)
Under normal circumstances I would put some sort of optical drive into my next machine as well, but unfortunately the circumstances are not normal (as far as I know). I now want to be able to connect at least six hard drives to my machine (although I expect to normally have only four), and modern optical drives are all SATA drives and so consume a SATA port on your motherboard. The current generation of Intel chipsets provide at most 6 SATA ports, which means that motherboards with more than 6 SATA ports are much less common and are supplying those extra SATA ports through some additional controller chipset that may or may not work really well with Linux.
(I have a jaundiced view of add-on controller chipsets of all sorts. Even server vendors have cheaped out and used chipsets that just didn't work very well, such as old nVidia Ethernet chipsets.)
In addition, not putting an optical drive in my next machine doesn't mean going without an optical drive at all, because external optical drives generally work fine, both for reading and for writing (and probably also for things like playing music and watching movies). At work we've mostly given up on getting new servers with optical drives unless they're basically free; instead we have a collection of external DVD drives that we move around as needed.
All of this is completely sensible and logical, but it's still something I've had to talk myself into and it's going to feel weird and a bit unnerving to have a machine without an optical drive. It feels like I'm going to be traitorously giving up on the old optical disc media that I have, even if I have an external drive that should be perfectly good for working with them.
(Humans are not entirely rational creatures, myself included, and so this entry is partly to talk myself into this.)
Sidebar: Blu-ray drives aren't a consideration right now
The short version for why not is that Linux can't really play Blu-ray discs. I mean, yes, it's theoretically possible, but reading about it doesn't inspire me with any confidence for its long term viability or general usability. For the immediate future I expect that if I want to view Blu-ray media, I'm going to need a standalone player with all of the annoyance that that implies.
Blu-ray discs don't have enough data capacity to deal with my backup issues, which are going to call for external hard drives for the foreseeable future.
Understanding rainbow tables
In yesterday's entry, I casually misused the term 'rainbow table' because I thought I knew what it meant when I actually didn't; what I was actually talking about was a (reverse) lookup table, which is not at all the same thing. I'm sure that I've read a bit about rainbow tables before but evidently it didn't stick and I didn't really get them. As a result of dozzie's comment pointing out my error, I wound up reading the Wikipedia writeup of rainbow tables, and I think I may now understand them. In an attempt to make this understanding stick, I'm going to write down my version of how they work and some remaining questions I have.
The core ingredient of a rainbow table is a set of reduction functions, R1 through Rn. Reduction functions take a hash value and generate a password from it in some magic way. To create one entry in a rainbow table, you start with a password p1, hash the password to generate hash h1, apply the first reduction function R1 to generate a new password p2, hash that password to generate h2, apply R2 to generate password p3, and so on along this hash chain. Eventually you hit hash hn, use your last reduction function Rn, and generate p(n+1). You then store p1, the first password (the starting point), and p(n+1), the endpoint (I think you could store the hash of p(n+1) instead, but that would take up more space).
To see if a given hash of interest is in the rainbow table, you successively pretend that it is every hash h1 through hn and compute the output of each of the hash chains from that point onward. When you're pretending the hash is hn, this is just applying Rn; when it's h(n-1) this is applying R(n-1) to generate pn, hashing it to get hn and then applying Rn, and so on. You then check to see if any of the computed endpoint passwords are in your rainbow table. If any of them are, you recreate that chain starting with your stored p1 starting point up to the point where in theory you should find the hash of interest, call it hx. If the chain's computed hx actually is the hash of interest, password px from just before it is the corresponding password.
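To make both halves concrete, here's a toy sketch in Python of table construction and lookup over a deliberately tiny password space of 4-digit PINs. The reduction functions here are my own arbitrary choice for illustration, not anything standard:

```python
import hashlib

SPACE = 10_000        # toy password space: the 4-digit PINs '0000'..'9999'
CHAIN_LEN = 20        # n: the number of reduction steps per chain

def H(pw):
    return hashlib.sha256(pw.encode()).hexdigest()

def R(i, h):
    # The i-th reduction function: fold a hash back into the password space.
    return format((int(h[:8], 16) + i) % SPACE, "04d")

def build_table(starts):
    # Each entry stores only (endpoint password -> starting password).
    table = {}
    for start in starts:
        p = start
        for i in range(1, CHAIN_LEN + 1):
            p = R(i, H(p))
        table[p] = start
    return table

def lookup(table, target):
    # Pretend the target hash is h_i for i = n down to 1, finish the
    # chain from there, and see if the endpoint is in the table.
    for i in range(CHAIN_LEN, 0, -1):
        p = R(i, target)
        for j in range(i + 1, CHAIN_LEN + 1):
            p = R(j, H(p))
        start = table.get(p)
        if start is None:
            continue
        # Rebuild the chain from the start to confirm; this guards
        # against false alarms from reduction collisions.
        q = start
        for j in range(1, CHAIN_LEN + 1):
            h = H(q)
            if h == target:
                return q
            q = R(j, h)
    return None
```

Building with, say, every 40th PIN as a starting point gives 250 stored entries that theoretically cover up to 250 × 20 = 5,000 of the 10,000 possible passwords, which illustrates the space-versus-coverage tradeoff nicely.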
Matching an endpoint password doesn't guarantee that you've found a password for the hash; instead it could be that you have two hashes that a given reduction function maps to the same next password. If your passwords of interest are shorter than the output of the password hash function this is guaranteed to happen some of the time (and shorter passwords are the usual case, especially with modern hash functions like SHA256 that have large outputs).
The length of the hash chains in your rainbow table is a tradeoff between storage space and compute time. The longer the hash chains are the less you have to store to cover roughly the same number of passwords, but the longer it will take to check each hash of interest because you will have to compute more versions of chains (and check more endpoint passwords). Also, the longer the hash chain is, the more reduction function variations you have to come up with.
(See the Wikipedia page for an explanation of why you have multiple reduction functions.)
As far as I can see, a rainbow table doesn't normally give you exhaustive coverage of a password space (eg, 'all eight character lower case passwords'). Most of the passwords covered by the rainbow table come from applying your reduction functions to cryptographic hashes; the hashes should have randomly distributed values so normally this will mean that your reduction functions produce passwords that are randomly distributed through your password space. There's no guarantee that these randomly distributed passwords completely exhaust that space. To get a good probability of this, I think you'd need to 'oversample' your rainbow table so that the total number of passwords it theoretically covers is greater than your password space.
(Although I haven't attempted to do any math on this, I suspect that oversampled rainbow tables still take up (much) less space than a full lookup table, especially if you're willing to lengthen your hash chains as part of it. Longer hash chains cover more passwords in the same space, at the cost of more computation.)
The total number of passwords a rainbow table theoretically covers is simply the number of entries times the length of the hash chains. If you have hash chains with 10 passwords (excluding the endpoint password) and you have 10,000 entries in your rainbow table, your table covers at most 100,000 passwords. The number of unique passwords that a rainbow table actually contains is not something that can be determined without recording all of the passwords generated by all of the reduction functions during table generation.
Sidebar: Storing the endpoint password versus the endpoint's hash
Storing the hash of the endpoint password instead of the endpoint password seems superficially attractive and it feels like it should be better, but I've wound up believing that it's usually or always going to be a bad tradeoff. In most situations, your passwords are a lot shorter than your hash values, and often you already have relatively long hash chains. If you have long hash chains, adding one more entry is a marginal gain and you pay a real space penalty for it. Even if you have relatively short chains and a relatively small table, you get basically the same result in less space by adding another reduction function and officially lengthening your chain by one.
(Reduction functions are easily added as far as I can tell; apparently they're often basically '(common calculation) + i', where i is the index of the reduction function.)
Hashed Ethernet addresses are not anonymous identifiers
Somewhat recently, I read What we've learned from .NET Core SDK Telemetry, in which Microsoft mentioned that in .NET Core 2.0 they will be collecting, well, let's quote them:
- Hashed MAC address — Determine a cryptographically (SHA256) anonymous and unique ID for a machine. Useful to determine the aggregate number of machines that use .NET Core. This data will not be shared in the public data releases.
So, here's the question: is a hashed Ethernet address really anonymous, or at least sufficiently anonymous for most purposes? I will spoil the answer: hashing Ethernet addresses with SHA256 does not appear to make them anonymous in practice.
Hashing by itself does not make things anonymous. For instance, suppose you want to keep anonymous traffic records for IPv4 traffic and you propose to (separately) hash the source and destination IPs with MD5. Unfortunately this is at best weakly anonymous. There are few enough IPv4 addresses that an attacker can simply pre-compute the hashes of all of them, probably keep them in memory, and then immediately de-anonymize your 'anonymous' source and destination data.
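Here's a sketch of the attack in Python, over a deliberately small address range so it runs instantly; the dotted-quad input format is an assumption, and a real attacker would simply try all the obvious encodings:

```python
import hashlib

# Why hashing a small identifier space isn't anonymization: precompute
# the MD5 of every address in a range, then reverse any 'anonymized'
# hash with a dictionary lookup.
def md5_ip(ip):
    return hashlib.md5(ip.encode()).hexdigest()

# A reverse lookup table for 10.0.0.0/16 (65,536 addresses) takes well
# under a second to build; all of IPv4 is only 2^32, which fits in RAM.
reverse = {md5_ip(f"10.0.{b}.{c}"): f"10.0.{b}.{c}"
           for b in range(256) for c in range(256)}

anonymized = md5_ip("10.0.23.42")   # what a naive logger might store
print(reverse[anonymized])          # → 10.0.23.42, deanonymized instantly
```

The hash function's cryptographic strength is irrelevant here; what matters is the size of the input space.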
Ethernet MAC addresses are 6 bytes long, meaning that there are 2^48 of them that are theoretically possible. However the first three bytes (24 bits) are the vendor OUI, and there are only a limited number of them that have been assigned (you can see one list of these here), so the practical number of MACs is significantly smaller. Even at full size, six bytes is not that many these days and is vulnerable to brute force attacks. Modern GPUs can apparently compute SHA256 hashes at a rate of roughly 2.9 billion hashes a second (from here), or perhaps 4 billion hashes a second (from here). Assuming I'm doing the math right, it would take roughly a day or so to compute the SHA256 hash of all possible Ethernet addresses, which is not very long. The sort of good news is that using SHA256 probably makes it infeasible to pre-compute a reverse lookup table for this, due to the massive amount of space required.
However, we shouldn't brute force search the entire theoretical
Ethernet address space, because we can do far better (with far worse
results for the anonymity of the results). If we confine ourselves
to known OUIs, the search space shrinks significantly. There appear
to be only around 23,800 assigned OUIs at the moment; even at only
2.9 billion SHA256 hashes a second, it takes less than three minutes
to exhaustively hash and search all their MACs (and that's with
only a single GPU). The memory requirements for a
reverse lookup table
remain excessive, but it doesn't really matter; three minutes is
fast enough for non-realtime deanonymization for analysis and other
things. In practice those Ethernet addresses that Microsoft is
collecting are not anonymous in the least; they're simply obscured,
so it would take Microsoft only a modest amount of work to see what
they really are.
I don't know whether Microsoft is up to evil here or simply didn't run the numbers before they decided that using SHA256 on Ethernet addresses produced anonymous results. It doesn't really matter, because not running the numbers when planning data collection such as this is incompetence. If you propose to collect anonymous identifiers, it is your responsibility to make sure that they actually are anonymous. Microsoft has failed to do so.
On the Internet, merely blocking eavesdropping is a big practical win
One of the things said against many basic encryption measures, such as SMTP's generally weak TLS when one mail server is delivering email to another one, is that they're unauthenticated and thus completely vulnerable to man in the middle attacks (and sometimes to downgrade attacks). This is (obviously) true, but it is focused on the mathematical side of security. On the practical side, the reality is simple:
Forcing attackers to move from passive listening to active interception is almost always a big win.
There are a lot of attackers that can (and will) engage in passive eavesdropping. It is relatively easy, relatively covert, and quite useful, and as a result can be used pervasively and often is. Far fewer attackers can and will engage in active attacks like MITM interception or forced protocol downgrades; such attacks are not always possible for an attacker (they may have only limited network access) and when the attacks are possible they're more expensive and riskier.
Forcing attackers to move from passive eavesdropping to some form of active interception is thus almost always a big practical win. Most of the time you'll wind up with fewer attackers doing fewer things against less traffic. Sometimes attackers will mostly give up; I don't think there are very many people attempting to MITM SSH connections, for example, although in theory you might be able to get away with it some of the time.
(There certainly were people snooping on Telnet connections back
in the day.)
If you can prevent eavesdropping, the theoretical security of the environment may not have gotten any better (you have to assume that an attacker can run a MITM attack if they really want to badly enough), but the practical security certainly has. This makes it a worthwhile thing to do by itself if you can. Of course full protection against even active attacks is better, but don't let the perfect be the enemy of the good. SMTP's basic server to server TLS encryption may be easily defeated by an active attacker and frequently derided by security mavens, but it has probably kept a great deal of email out of the hands of passive listeners (see eg Google's report on this).
(I mentioned this yesterday in the context of the web, but I think it's worth covering in its own entry.)
Understanding a bit about the SSH connection protocol
The SSH connection protocol is the final SSH protocol; depending on your perspective, it sits either on top of or after the SSH transport protocol and SSH user authentication protocol. It's the level of SSH where all of the useful things happen, like remote logins, command execution, and port forwarding.
An important thing to know about the connection protocol is that it's a multiplexed protocol. There is not one logical connection that everything happens over but instead multiple channels, all operating more or less independently. SSH has its own internal flow control mechanism for channels to make sure that data from a single channel won't saturate the overall stream and prevent other channels from being responsive. There are different types of channels (and subtypes of some of them as well); one type of channel is used for 'sessions', another for X11 forwarding, another for port forwarding, and so on. However, the connection protocol and SSH as a whole doesn't really interpret the actual data flowing over a particular channel; once a channel is set up, data is just data and is shuttled back and forth blindly. Like many things in the SSH protocol, channel types and subtypes are specified as strings.
(SSH opted to make a lot of things be strings to make them easily extensible, especially for experiments and implementation specific extensions. With strings, you just need a naming convention to avoid collisions instead of any sort of central authority to register your new number. This is why you will see some SSH ciphers with names like 'firstname.lastname@example.org'.)
The major channel type is a 'session'; sessions are basically containers
that are used to ask for login shells, command execution, X11
forwarding, and 'subsystems', which is a general concept for other
sorts of sessions that can be used to extend SSH (with the right
magic on both the server and the client). Subsystems probably aren't
used much, although they are used to implement SFTP. A single session
can ask for and contain multiple things; if you
ssh in to a server
interactively with X11 forwarding enabled, your session will ask for
both a shell and X11 forwarding.
(However, the RFC requires a session to only have one of a shell, a command execution, or a subsystem. This is probably partly because of data flow issues; if you asked for more than one, the connection protocol provides no way to sort out which input and output is attached to which thing within a single channel. X11 forwarding is different, because a new channel gets opened for each client.)
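The 'channel types are strings' design is visible right in the wire format. Here's a sketch of what an SSH_MSG_CHANNEL_OPEN message looks like per RFC 4254 section 5.1 (this is just the message layout, not any particular SSH implementation; the specific window and packet sizes below are made-up illustrative values):

```python
import struct

def ssh_string(s: bytes) -> bytes:
    # SSH wire-format 'string': a uint32 length followed by the bytes.
    return struct.pack(">I", len(s)) + s

def channel_open(ctype: bytes, sender: int, window: int, max_packet: int) -> bytes:
    # SSH_MSG_CHANNEL_OPEN is message number 90. The channel type is a
    # string, which is what makes new types easy to add without a
    # central registry of numbers; then come the opener's channel id,
    # its initial flow-control window, and its maximum packet size.
    return (bytes([90]) + ssh_string(ctype)
            + struct.pack(">III", sender, window, max_packet))

msg = channel_open(b"session", 0, 2 ** 21, 32768)
```

The initial window size in this message is where SSH's per-channel flow control starts; each side later sends window-adjust messages to let more data flow on that specific channel.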
Channels can be opened by either end. Normally the client opens
most channels, but the server can wind up opening channels for X11
clients and for ports being forwarded from the server to the client
(with eg OpenSSH's
-R set of options).
OpenSSH's connection sharing works through channel multiplexing, since a client can open multiple 'session' channels over a single connection. The client side is going to be a little complicated, but from the server side everything is generic and general.