Working out which of your NVMe drives is in what slot under Linux

December 13, 2019

One of the perennial problems with machines that have multiple drives is figuring out which of your physical drives is sda, which is sdb, and so on; the mirror image of this problem is arranging things so that the drive you want to be the boot drive actually is the first drive. On sanely made server hardware this is generally relatively easy, but with desktops you can run into all sorts of problems, such as desktop motherboards that wire things up oddly. In some situations, NVMe drives make this easier than SATA drives do, because NVMe drives are PCIe devices and so have distinct PCIe bus addresses and possibly distinct PCIe bus topologies.

First off, I will admit something. The gold standard for doing this reliably under all circumstances is to record the serial numbers of your NVMe drives before you put them into your system and then use 'smartctl -i /dev/nvme0n1' (and so on for each drive) to match each drive to its serial number. It's always possible for a motherboard with multiple M.2 slots to do perverse things with its wiring and PCIe bus layout, so that what it labels as the first and perhaps best M.2 slot actually holds the second NVMe drive as Linux sees it. But I think that generally it's pretty likely that the first M.2 slot will be earlier in PCIe enumeration than the second one (if there is a second one). And if you have only one M.2 slot on the motherboard and are using a PCIe to NVMe adapter card for your second NVMe drive, the PCIe bus topology of the two NVMe drives is almost certain to be visibly different.
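As an illustration, here is one way to dump all of your NVMe drive serial numbers at once (a sketch that assumes smartmontools is installed and that your namespaces all show up as /dev/nvmeXn1):

; for d in /dev/nvme?n1; do echo "$d:"; smartctl -i "$d" | grep -i 'serial number'; done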

All of this raises the question of how you get the PCIe bus address of a particular NVMe drive. We can do this by using /sys, because Linux makes your PCIe devices and topology visible in sysfs. Specifically, every NVMe drive appears as a symlink in /sys/block whose target gives you the path to its PCIe device (and in fact the full topology). So on my office machine in its current NVMe setup, working in /sys/block, I have:

; readlink nvme0n1
../devices/pci0000:00/0000:00:03.2/0000:0b:00.0/[...]
; readlink nvme1n1
../devices/pci0000:00/0000:00:01.1/0000:01:00.0/[...]
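If you want to extract just the PCIe bus address of the drive itself, one approach is to take the last thing in the symlink target that looks like a PCIe address (a sketch; it assumes the usual domain:bus:device.function format):

; readlink nvme0n1 | grep -o '[0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-9a-f]' | tail -1
0000:0b:00.0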

This order on my machine is a surprise, because the two NVMe drives are not in the order I expected. In fact, they're apparently not in the order that the kernel initially detected them in, as a look at 'dmesg' shows:

nvme nvme0: pci function 0000:01:00.0
nvme nvme1: pci function 0000:0b:00.0

This is the enumeration order I expected, with the motherboard M.2 slot at 01:00.0 detected before the adapter card at 0b:00.0 (for more on my current PCIe topology, see this entry). Indeed the original order appears to be preserved in bits of sysfs, with path components like nvme/nvme0/nvme1n1 and nvme/nvme1/nvme0n1. Perhaps the kernel assigned actual nvmeXn1 names backward, or perhaps udev renamed my disks for reasons known only to itself.

(But at least now I know which drive to pull if I have trouble with nvme1n1. On the other hand, I'm now doubting the latency numbers that I previously took as a sign that the NVMe drive on the adapter card was slower than the one in the M.2 slot, because I assumed that nvme1n1 was the adapter card drive.)

Once you have the PCIe bus address of a NVMe drive, you can look for additional clues as to what physical M.2 slot or PCIe slot that drive is in beyond just how this fits into your PCIe bus topology. For example, some motherboards (including my home machine) may wind up running the 'second' M.2 slot at x2 instead of x4 under some circumstances, so if you can find one NVMe drive running at x2 instead of x4, you have a strong clue as to which is which (assuming that your NVMe drives are x4 drives). You can also have a PCIe slot be forced to x2 for other reasons, such as motherboards where some slots share lanes and bandwidth. I believe that the primary M.2 slot on most motherboards always gets x4 and is never downgraded (except perhaps if you ask the BIOS to do so).
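Conveniently, the negotiated and maximum link widths are also exposed in sysfs under each device's PCIe bus address (assuming your kernel provides these attributes, which modern kernels do). For example, using my adapter card's bus address from earlier:

; cat /sys/bus/pci/devices/0000:0b:00.0/current_link_width
; cat /sys/bus/pci/devices/0000:0b:00.0/max_link_width

These print plain numbers, so an x4 drive that has been downgraded will report a current_link_width of 2 against a max_link_width of 4.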

You can also get the same PCIe bus address information (and then a lot more) through udevadm, as noted by a commentator on yesterday's entry; 'udevadm info /sys/block/nvme0n1' will give you all of the information that udev keeps. This doesn't seem to include any explicit information on whether the device was renamed, but it does include the kernel's assigned minor number; on my machine, nvme0n1 has minor number 1 while nvme1n1 has minor number 0, which suggests that nvme1n1 was set up first.
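For example, to pull out just the minor number and udev's idea of the PCIe path from all of that output (a sketch that assumes your udev sets the usual MINOR and ID_PATH properties, as it normally does):

; udevadm info /sys/block/nvme0n1 | grep -E 'MINOR|ID_PATH'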

(It would be nice if udev would log somewhere when it renames a device.)

PS: Looking at the PCIe bus addresses associated with SATA drives usually doesn't help, because most of the time all of your SATA drives are attached to the same PCIe device.
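You can check this with the same symlink trick as before; on a typical machine, something like the following will print the same PCIe address (that of your SATA controller) for every sdX device:

; for d in /sys/block/sd?; do echo "$d: $(readlink "$d" | grep -o '[0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-9a-f]' | tail -1)"; done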


Comments on this page:

From 193.219.181.226 at 2019-12-13 02:28:49:

Perhaps the kernel assigned actual nvmeXn1 names backward, or perhaps udev renamed my disks for reasons known only to itself.

udev doesn't assign these index-based names anymore; and even if it did, that would only affect /dev and not /sys (that's one of the reasons why it no longer does this – the other reasons being race conditions and the existence of devtmpfs).

(It seems the only remaining explicit devnode rename in udev base rules is for /dev/device-mapper to /dev/mapper/control. Other than that, the only thing that's still being renamed is network interfaces, since they exist separately from /dev, and the rename does show up in syslog.)

The nvme0/nvme1n1 part is a bit weird though.

(Side note: Yesterday I checked udev's Git history for an IRC discussion, and the %e rule token that would provide an incrementing index was removed in 2013... A user was asking about it because they found it in an online copy of the manpage, and it turns out that the first Google result – linux.die.net – still publishes a version of udev(8) from 2005. Yikes.)

The gold standard for doing this reliably under all circumstances is to record the serial numbers of your NVMe drives before you put them into your system and then use 'smartctl -i /dev/nvme0n1' to find each drive from its serial number.

They're also in /dev/disk/by-id, both for the serial number and the WWN (which is also present on the top sticker for most of my disks).

(These symlinks are created by udev.)
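(For example, 'ls -l /dev/disk/by-id/nvme-*' will list both sorts of symlinks and show which nvmeXnY device each of them points to.)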

From 193.219.181.226 at 2019-12-13 04:14:39:

I'm looking at drivers/nvme/host/ and it looks like the nvme%dn%d names can be based on a "subsystem instance number" if multipath is enabled in the kernel, while the parent nvme%d name doesn't do that and always uses the controller instance number.

As for the weird `/sys/class/nvme/nvme0/nvme1n1` path:

This issue (https://github.com/linux-nvme/nvme-cli/issues/510) and specifically this comment (https://github.com/linux-nvme/nvme-cli/issues/510#issuecomment-500431277) explain that well.

It apparently has to do with NVMe multipathing, where one NVMe "endpoint" (I think it's called a subsystem), the thing identified by an NQN (= NVMe Qualified Name), is reachable through multiple NVMe controllers. (That is done for extra speed and in case one link breaks, I guess; it especially makes sense when you go to NVMe over fabrics.)

So then in `/sys/class/nvme/nvmeX/nvmeYnZ`, Y is the subsystem (identified by its NQN), Z is the namespace, and X is the controller through which the subsystem can be reached (there can be multiple controllers, but only one appears in this path).

There would be an extra /dev/nvme1c1n1 (NVMe subsystem 1 reached through controller 1), but that is apparently hidden.

In Linux 5.4 this naming was changed so that the controller shown for the NQN in /sys is the one with the same number as the subsystem (as far as I understand). The commit for that: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=733e4b69d508d03c20adfdcf4bd27abc60fae9cc
