Sorting out 'PCIe bifurcation' and how it interacts with NVMe drives

December 4, 2024

Suppose, not hypothetically, that you're switching from one mirrored set of M.2 NVMe drives to another mirrored set of M.2 NVMe drives, and so would like to have three or four NVMe drives in your desktop at the same time. Sadly, you already have one of your two NVMe drives on a PCIe card, so you'd like to get a single PCIe card that handles two or more NVMe drives. If you look around today, you'll find two sorts of cards for this: ones that are very expensive, and ones that are relatively inexpensive but require your system to support a feature that is generally called PCIe bifurcation.

NVMe drives are PCIe devices, so a PCIe card that supports a single NVMe drive is a simple, more or less passive thing that wires four PCIe lanes and some other stuff through to the M.2 slot. I believe that in theory, a card could be built that only required x2 or even x1 PCIe lanes, but in practice I think all such single drive cards are physically PCIe x4 and so require a physical x4 or better PCIe slot, even if you'd be willing to (temporarily) run the drive much slower.
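
As a side note, you can check how many lanes a drive actually negotiated on Linux, since the kernel exposes the standard PCIe link attributes through sysfs. This is a quick sketch that assumes the drive shows up as nvme0:

  # /sys/class/nvme/nvme0/device is a symlink to the underlying
  # PCIe device, which carries the usual PCIe link attributes
  cat /sys/class/nvme/nvme0/device/current_link_width
  cat /sys/class/nvme/nvme0/device/max_link_width
  # current_link_speed and max_link_speed live in the same directory

If current_link_width is lower than max_link_width, the drive is running with fewer lanes than it could use.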

A PCIe card that supports more than one M.2 NVMe drive has two options. The expensive option is to put a PCIe bridge on the card, with the bridge (probably) providing a full set of PCIe lanes to the M.2 NVMe drives locally on one side and doing x4, x8, or x16 PCIe with the motherboard on the other. In theory, such a card will work even at x4 or x2 PCIe lanes, because PCIe cards are supposed to cope when the system says 'actually you only get this many lanes' (although obviously you can't drive four x4 NVMe drives at full speed through a single x4 or x2 PCIe connection).

The cheap option is to require that the system be able to split a single PCIe slot into multiple independent groups of PCIe lanes (I believe these are usually called links); this is PCIe bifurcation. In PCIe bifurcation, the system takes what is physically and PCIe-wise an x16 slot (for example) and splits it into four separate x4 links (I've seen this sometimes labeled as 'x4/x4/x4/x4'). This is cheap for the card because it can basically be four single M.2 NVMe PCIe cards jammed together, with each set of x4 lanes wired through to a single M.2 NVMe slot. A PCIe card for two M.2 NVMe drives will require an x8 PCIe slot bifurcated to two x4 links; if you stick this card in an x16 slot, the upper 8 PCIe lanes just get ignored (which means that you can still set your BIOS to x4/x4/x4/x4).
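
One way to tell which sort of multi-drive card you're dealing with (or whether your bifurcation settings actually took effect) is to look at the PCIe topology. With a bridge card, the NVMe drives sit behind an extra PCI bridge or switch; with bifurcation, each drive hangs directly off its own group of lanes. A minimal sketch on Linux:

  # show PCI devices as a tree; a bridge card adds a visible PCI
  # bridge (or several) above the NVMe devices, a bifurcated card
  # doesn't
  lspci -tv
  # list just the NVMe controllers and their PCI addresses
  lspci -nn | grep -i 'non-volatile'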

As covered in, for example, this Synopsys page, PCIe bifurcation isn't something that's negotiated as part of bringing up PCIe connections; a PCIe device can't ask for bifurcation and can't be asked whether or not it supports it. Instead, the decision is made as part of configuring the PCIe root device or bridge, which in practice means it's a firmware ('BIOS') decision. However, I believe that bifurcation may also require hardware support in the 'chipset' and perhaps the physical motherboard.

I put chipset into quotes because for quite some time now, some PCIe lanes have come directly from the CPU and only the rest come through the chipset as such. For example, in desktop motherboards, the x16 GPU slot is almost always driven directly by CPU PCIe lanes, so it's up to the CPU to have support (or not have support) for PCIe bifurcation of that slot. I don't know if common desktop chipsets support bifurcation on the chipset PCIe slots and PCIe lanes, and of course you need chipset-driven PCIe slots that have enough lanes to be bifurcated in the first place. If the PCIe slots driven by the chipset are a mix of x4 and x1 slots, there's no really useful bifurcation that can be done (at least for NVMe drives).

If you have a limited number of PCIe slots that can actually support x16 or x8 and you need a GPU card, you may not be able to use PCIe bifurcation in practice even if it's available for your system. If you have only one PCIe slot your GPU card can go in and it's the only slot that supports bifurcation, you're stuck; you can't have both a bifurcated set of NVMe drives and a GPU (at least not without a bifurcated PCIe riser card that you can use).

(This is where I would start exploring USB NVMe drive enclosures, although on old desktops you'll probably need one that doesn't require USB-C, and I don't know if an NVMe drive set up in a USB enclosure can later be smoothly moved to a direct M.2 connection without partitioning-related problems or other issues.)
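
One thing you can check in advance is whether a USB enclosure presents the drive with the same logical sector size that it will have when connected directly, since a different logical sector size is one way a partition table can fail to line up afterward. A sketch, assuming the enclosure shows up as /dev/sdX and the drive connected directly would be /dev/nvme0n1:

  # logical and physical sector sizes as seen through the enclosure
  blockdev --getss --getpbsz /dev/sdX
  # the same drive when it's connected directly as an NVMe device
  blockdev --getss --getpbsz /dev/nvme0n1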

(This is one of the entries I write to get this straight in my head.)

Sidebar: Generic PCIe riser cards and other weird things

The traditional 'riser card' I'm used to is a special proprietary server 'card' (ie, a chunk of PCB with connectors and other bits) that plugs into a likely custom server motherboard connector and makes a right angle turn that lets it provide one or two horizontal PCIe slots (often half-height ones) in 1U or 2U server cases, which aren't tall enough to handle PCIe cards vertically. However, the existence of PCIe bifurcation opens up an exciting world of general, generic PCIe riser cards that bifurcate a single x16 GPU slot to, say, two x8 PCIe slots. These will work (in some sense) in any x16 PCIe slot that supports bifurcation, and of course you don't have to restrict yourself to x16 slots. I believe there are also PCIe riser cards that bifurcate an x8 slot into two x4 slots.

Now, you are perhaps thinking that such a riser card puts those bifurcated PCIe slots at right angles to the slots in your case, and probably leaves any cards inserted into them with at least their tops unsupported. If you have light PCIe cards, maybe this works out. If you don't have light PCIe cards, one option is another terrifying thing, a PCIe ribbon cable with a little PCB that is just a PCIe slot on one end (the other end plugs into your real PCIe slot, such as one of the slots on the riser card). Sometimes these are even called 'riser card extenders' (or perhaps those are a sub-type of the general PCIe extender ribbon cables).

Another PCIe adapter device you can get is an x1 to x16 slot extension adapter, which plugs into an x1 slot on your motherboard and has an x16 slot (with only one PCIe lane wired through, of course). This is less crazy than it sounds; you might only have an x1 slot available, want to plug in an x4, x8, or x16 card that's short enough, and be willing to settle for x1 speeds. In theory PCIe cards are supposed to still work when their lanes are choked down this way.


Comments on this page:

By jmassey at 2024-12-04 23:29:55:

I don't know if a NVMe drive set up in a USB enclosure can later be smoothly moved to a direct M.2 connection without partitioning-related problems or other issues.)

I've seen sector sizes change when moving hard drives from USB to SATA, and I imagine it could happen on SSDs too. If you keep all your partitions aligned to a large power of 2, as gdisk now does by default (to 2^20 bytes), that shouldn't cause too much trouble; you'll just have to re-create your GPT after doing some math on the sector start/end positions. Re-program the same GUIDs if you have anything depending on them.
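
To illustrate the math: going from 512-byte to 4096-byte logical sectors divides every LBA by 8 (and the reverse multiplies by 8), because the byte offsets on the disk don't move. For example, a partition that starts at sector 2048 in the 512-byte view starts at sector 256 in the 4096-byte view:

  # the same 1 MiB byte offset expressed in both sector sizes
  echo $(( 2048 * 512 ))        # 1048576 bytes
  echo $(( 1048576 / 4096 ))    # sector 256 with 4k sectors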

Your udev names might change, too. And if the drive is to be bootable, you might have to screw around with efibootmgr or its BIOS-menu equivalent. "Smoothly" is probably optimistic, but I don't imagine it would be too difficult.

I generally recommend obtaining motherboard manuals ahead of time, for anything you're considering buying. A quick pdfgrep shows some mentions of bifurcation support in various Asus and Giga-Byte boards, with Asus mentioning a specific product to buy (to support 2 or 4 SSDs in a slot, if you set the slot to "PCIe RAID Mode").

From 193.219.181.219 at 2024-12-05 09:37:35:

I've seen sector sizes change when moving hard drives from USB to SATA,

Yes, some USB-SATA adapter controllers emulate 4k logical sectors despite the disk being 512b, presumably to allow the use of above-2TB disks with MBR partitioning... It kind of makes sense for sealed "external HDD" units but is annoying to deal with for removable SATA bays/toasters/adapters.

If you can detect this before writing data, then the easiest workaround is to create a loop device on top of the physical device (turns out loop devices don't have to be backed by regular files) and have it emulate 512b sectors again. Or vice versa if you're rescuing data out of a disk that was first partitioned through such an adapter.

losetup -P -b 512 /dev/loop0 /dev/sdc

By cks at 2024-12-05 11:06:09:

In the case of this system, it was bought many years ago, sufficiently long ago that M.2 support wasn't something we really thought about, and I'm not sure PCIe bifurcation was particularly available at the time. Support for it certainly wasn't on our minds.

(Any new motherboard I look at will have that as a consideration, along with the number of M.2 NVMe slots it has. A surprising number of motherboards seem to have three or four M.2 slots now, which is nice and useful. Sadly they also seem to have a shortage of PCIe slots, which I care about for things like additional network cards.)

By jmassey at 2024-12-05 11:53:56:

I wonder how that losetup block-size-changing thing would interact with a boot drive. The GPT serial numbers wouldn't change, but the udev name very likely would, and I'm not sure how things like grub-install deal with loop devices. As for "external HDD units", a lot of people open those up to get cheap(er) SATA drives; they're not "sealed" in any meaningful sense. (The PWDIS pin can be a minor problem here, and is a great example of an easily avoidable backward-compatibility screw-up: the feature should've been defined to not activate when all 3 former-3.3V pins were high.)

Chris, yeah, I'm also annoyed at the M.2 vs. PCIe-card situation in motherboards. We might end up going "backward" at some point, with M.2-to-PCIe-card adapters for extra network ports (though I suppose USB's been able to do 10Gbe for a while and will do 100Gbe in its next iteration). I've seen boards with these slots sharing physical space, such that a drive can be under a card. Why not share lanes between the PCIe and M.2 slots if necessary? I'd find it convenient to have 2 dedicated M.2 positions, 2 to 4 PCIe-card positions, plus 1 or 2 accepting either. Or they could just include a passive PCIe-to-multiple-M.2 card in the box (as they've sometimes included stuff like USB/eSATA/serial brackets and "diagnosis cards"), and then we'd know the thing would work.

Anyway, it seems like a good sysadmin-experiment to see what happens with those USB M.2 enclosures/adapters, before you actually need to. It should be a trivial thing to expense and might get you out of a jam someday. I've actually got one sitting around and have been meaning to try it myself.

By nanaya at 2024-12-06 01:00:01:

PCIe x1 to M.2 NVMe adapter cards do exist and I've been using them. Anything but x1 and x16 physical slots is just so rare on consumer boards. At most there are open-ended x1 slots where you can put larger cards in (usually from ASRock).

From 193.219.181.219 at 2024-12-06 05:57:10:

As for "external HDD units", a lot of people open those up to get cheap(er) SATA drives; they're not "sealed" in any meaningful sense.

They are for their intended use. That is, the manufacturer hasn't designed them so that they would be opened up, even if that's physically possible. (I have some that I'm sure wouldn't close back up once opened.)

But more to the point, the manufacturer hasn't designed them so that they would be repeatedly opened up and the disks arbitrarily swapped in and out, but rather that the disk already in there would be accessed through the same controller throughout its lifetime – so it's reasonable for them to do whatever they wish with the data going through that controller (and much less so when the same is done by products that are explicitly designed to have arbitrary disks swapped in and out of them).

By jmassey at 2024-12-06 11:15:48:

But more to the point, the manufacturer hasn't designed them so that they would be repeatedly opened up and the disks arbitrarily swapped in and out,

Honestly, I'm not sure the manufacturer's really designed them (as disk and adapter pairs) at all. The ones I've used, being full-sized models, seem to have totally ordinary SATA drives, with totally generic USB adapters. A co-worker with more experience has mentioned that even external drives that are nominally the same model sometimes have differently-buggy USB adapters. The disk companies probably have thousands of USB adapters and cases sitting around, before they know which SATA drives they'll be selling as externals.

I've heard that some 2.5-inch models have soldered connections or directly speak USB, and that USB2 "WD Elements" drives (not the USB3) had some kind of encryption. But I've never seen anything worse than a sector-size change. It's definitely convenient to test and fill drives in their enclosures, then shuck them for internal use. That leaves me with an extra SATA-to-USB adapter, which usually works just fine with another SATA drive; putting a formerly-internal drive in there is a convenient way to handle backups.

Even the ones I plan to use externally have to be opened once to cover up their LEDs. I can do the EasyStores without damaging them, and they'd otherwise light up the whole room (their light is bright, leaks out the ventilation holes, and could apparently be disabled with a proprietary tool if I had Windows—which I don't, so I use electrical tape).

From what I've read, the situation with external SSDs is somewhat less convenient. Some speak USB directly, some have SATA-to-USB boards, but I don't think any are using M.2 drives on USB interfaces yet. But if they ever did, they'd probably have basically the same USB-to-M.2 chips you can buy at retail today.

By Miksa at 2024-12-12 04:45:58:

I would disagree with the assessment that an x4 slot wouldn't be useful for bifurcation. I suspect that for most uses x1 is adequate for NVMe drives. That still gives the benefits of the NVMe protocol and its latency, and the bandwidth is plenty enough.
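
For rough numbers, a single PCIe 3.0 lane runs at 8 GT/s with 128b/130b encoding, which works out to a bit under 1 GB/s of raw bandwidth in each direction, comfortably above SATA's 600 MB/s ceiling:

  # 8 GT/s per lane with 128b/130b encoding, in MB/s
  echo $(( 8000 * 128 / 130 / 8 ))    # prints 984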

You also mentioned the "terrifying" PCIe extension ribbons. Apparently those work better than could be expected. A few years ago Linus Tech Tips made a video where they tested how the ribbons would affect a powerful GPU. They did this by chaining several 30-60cm ribbon extensions. IIRC, the GPU benchmarks didn't drop until the chain was 3 meters long. Seems that PCIe is badly overengineered.

Written on 04 December 2024.