Wandering Thoughts archives

2019-12-25

Why udev may be trying to rename your VLAN interfaces to bad names

When I updated my office workstation to Fedora 30 back in August, I ran into a little issue:

It has been '0' days since systemd/udev blew up my networking. Fedora 30 systemd/udev attempts to rename VLAN devices to the interface's base name and fails spectacularly, causing the sys-subsystem*.device units to not be present. We hope you didn't depend on them! (I did.)

I filed this as Fedora bug #1741678, and just today I got a clue so that now I think I know why this happens.

The symptom of this problem is that during boot, your system will log things like:

systemd-udevd[914]: em-net5: Failed to rename network interface 4 from 'em-net5' to 'em0': Device or resource busy

As you might guess from the name I've given it here, em-net5 is a VLAN on em0. The name 'em0' itself is one that I assigned, because I don't like the network names that systemd-udevd would assign if left on its own (they are what I would call ugly, or at least tangled and long). The failure here prevents systemd from creating the sys-subsystem-net-devices-em-net5.device unit that it normally would (and then this had further consequences because of systemd's lack of good support for networks being ready).

I use networkd with static networking, so I set up the em0 name through a networkd .link file (as covered here). This looks like:

[Match]
MACAddress=60:45:cb:a0:e8:dd

[Link]
Description=Onboard port
MACAddressPolicy=persistent
Name=em0

Based on what 'udevadm test' reports, it appears that when udevd is configuring the em-net5 VLAN, it (still) matches this .link file for the underlying device and applies things from it. My guess is that this is happening because VLANs and their underlying physical interfaces normally share MACs, and so the VLAN MAC matches the MAC here.
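If you want to see this matching happen (or not) on a particular machine, the most direct check I know of is to run 'udevadm test' against the VLAN's sysfs path and look at which .link file it reports picking up. This is just a sketch; the interface name is from my setup and the grep is only there to cut the output down:

; udevadm test /sys/class/net/em-net5 2>&1 | grep '\.link'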

This appears to be a behavior change in the version of udev shipped in Fedora 30. Before Fedora 30, systemd-udevd and networkd did not match VLAN MACs against .link files; from Fedora 30 onward, they appear to do so. To stop this, you presumably need to limit your .link files to matching only on physical interfaces, not VLANs, but unfortunately this seems difficult to do. The systemd.link manpage documents a 'Type=' match, but while VLANs have a type that can be used for this, native interfaces do not appear to (and there doesn't seem to be a way to negate the match). There are various hacks that could be committed here, but all of them are somewhat unpleasant to me (such as specifying the kernel driver; if the kernel's opinion of what driver to use for this hardware changes, I am up a creek again).
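To illustrate the kernel driver hack, the .link file would wind up looking something like this. The 'e1000e' driver name here is purely an example (it's not necessarily what my hardware actually uses); the idea is that a VLAN interface has no kernel driver of its own, so presumably it can't satisfy the extra Driver= condition even though it shares the MAC:

[Match]
MACAddress=60:45:cb:a0:e8:dd
Driver=e1000e

[Link]
Description=Onboard port
MACAddressPolicy=persistent
Name=em0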

UdevNetworkdVLANLinkMatching written at 01:47:33; Add Comment

2019-12-20

My new Linux office workstation disk partitioning for the end of 2019

I've just had the rare opportunity to replace all of my office machine's disks at once, without having to carry over any of the previous generation the way I've usually had to. As part of replacing everything I got the chance to redo the partitioning and setup of all of my disks, again all at once without the need to integrate a mix of the future and the past. For various reasons, I want to write down the partitioning and filesystem setup I decided on.

My office machine's new set of disks is a pair of 500 GB NVMe drives and a pair of 2 TB SATA SSDs. I'm using GPT partitioning on all four drives for various reasons. All four drives start with my standard two little partitions, a 256 MB EFI System Partition (ESP, gdisk code EF00) and a 1 MB BIOS boot partition (gdisk code EF02). I don't currently use either of them (my past attempt to switch from MBR booting to UEFI was a failure), but they're cheap insurance for the future. Similarly, putting these partitions on all four drives instead of just my 'system' drives is more cheap insurance.

(Writing this down has made me realize that I didn't format the ESPs. Although I don't use UEFI for booting, I have in the past put updated BIOS firmware images there in order to update the BIOS.)

The two NVMe drives are my 'system' drives. They have three additional partitions: a 70 GB partition used for a Linux software RAID mirror of the root filesystem (including /usr and /var, since I put all of the system into one filesystem), a 1 GB partition that is a Linux software RAID mirrored swap partition, and the remaining 394.5 GB as a mirrored ZFS pool that holds filesystems that I want to be as fast as possible and that I can be confident won't grow to be too large. Right now that's my home directory filesystem and the filesystem that holds source code (where I build Firefox, Go, and ZFS on Linux, for example).
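To make this concrete, recreating the NVMe layout with sgdisk would look roughly like the following. This is a sketch, not a transcript; the device name and the GPT type codes other than EF00 and EF02 are illustrative (FD00 is gdisk's 'Linux RAID' code and BF01 is a common choice for ZFS partitions), not a record of what I actually used:

# one of the two NVMe drives; repeat for the other
sgdisk -n 1:0:+256M -t 1:EF00 -c 1:"EFI system" \
       -n 2:0:+1M   -t 2:EF02 -c 2:"BIOS boot" \
       -n 3:0:+70G  -t 3:FD00 -c 3:"root mirror" \
       -n 4:0:+1G   -t 4:FD00 -c 4:"swap mirror" \
       -n 5:0:0     -t 5:BF01 -c 5:"fast ZFS pool" \
       /dev/nvme0n1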

The two SATA SSDs are my 'data' drives, holding various larger but less important things. They have two 70 GB partitions that are Linux software RAID mirrors, and the remaining space is in a single partition for another mirrored ZFS pool. One of the two 70 GB partitions is so that I can make backup copies of my root filesystem before upgrading Fedora (if I bother to do so); the other is essentially an 'overflow' filesystem for some data that I want on an ext4 filesystem instead of in a ZFS pool (including a backup copy of all recent versions of ZFS on Linux that I've installed on my machine, so that if I update and the very latest version has a problem, I can immediately reinstall a previous one). The ZFS pool on the SSDs contains larger and generally less important things like my VMWare virtual machine images and the ISOs I use to install them, and archived data.

Both ZFS pools are set up following my historical ZFS on Linux practice, where they use the /dev/disk/by-id names for my disks instead of the sdX and nvme... names. Both pools are actually relatively old; I didn't create new pools for this and migrate my data, but instead just attached new mirrors to the old pools and then detached the old drives (more or less). The root filesystem was similarly migrated from my old SSDs by attaching and removing software RAID mirrors; the other Linux software RAID filesystems are newly made and copied through ext4 dump and restore (and the new software RAID arrays were added to /etc/mdadm.conf more or less by hand).

(Since I just looked it up, the ZFS pool on the SATA SSDs was created in August of 2014, originally on HDs, and the pool on the NVMe drives was created in January of 2016, originally on my first pair of (smaller) SSDs.)

Following my old guide to RAID superblock formats, I continued to use the version 1.0 format for everything except the new swap partition, where I used the version 1.2 format. By this point using 1.0 is probably superstition; if I have serious problems (for example), I'm likely to just boot from a Fedora USB live image instead of trying anything more complicated.
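As a concrete illustration of the superblock format choice (the array and partition device names here are just for the sketch): the practical difference between the two is that version 1.0 puts the superblock at the end of the partition, while version 1.2 puts it 4 KiB from the start.

# root filesystem mirror, with the older version 1.0 superblock at the end
mdadm --create /dev/md20 --level=1 --raid-devices=2 --metadata=1.0 \
    /dev/nvme0n1p3 /dev/nvme1n1p3
# swap mirror, with the now-default version 1.2 superblock
mdadm --create /dev/md21 --level=1 --raid-devices=2 --metadata=1.2 \
    /dev/nvme0n1p4 /dev/nvme1n1p4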

All of this feels very straightforward and predictable by now. I've moved away from complex partitioning schemes over time and almost all of the complexity left is simply that I have two different sets of disks with different characteristics, and I want some filesystems to be fast more than others. I would like all of my filesystems to be on NVMe drives, but I'm not likely to have NVMe drives that big for years to come.

(The most tangled bit is the 70 GB software RAID array reserved for a backup copy of my root filesystem during major upgrades, but in practice it's been quite a while since I bothered to use it. Still, having it available is cheap insurance in case I decide I want to do that someday during an especially risky Fedora upgrade.)

WorkMachinePartitioning2019 written at 23:52:22; Add Comment

Splitting a mirrored ZFS pool in ZFS on Linux

Suppose, not hypothetically, that you're replacing a pair of old disks with a pair of new disks in a ZFS pool that uses mirrors. If you're a cautious person and you worry about issues like infant mortality in your new drives, you don't necessarily want to immediately switch from the old disks to the new ones; you want to run them in parallel for at least a bit of time. ZFS makes this very easy, since it supports up to four way mirrors and you can just attach devices to add extra mirrors (and then detach devices later). Eventually it will come time to stop using the old disks, and at this point you have a choice of what to do.

The straightforward thing is to drop the old disks out of the ZFS mirror vdev with 'zpool detach', which cleanly removes them (and they won't come back later, unlike with Linux software RAID). However this is a little bit wasteful, in a sense. Those old disks have a perfectly good backup copy of your ZFS pool on them, but when you detach them you lose any real possibility of using that copy. Perhaps you would like to keep that data as an actual backup copy, just in case. Modern versions of ZFS can do this through splitting the pool with 'zpool split'.

To quote the manpage here:

Splits devices off pool creating newpool. All vdevs in pool must be mirrors and the pool must not be in the process of resilvering. At the time of the split, newpool will be a replica of pool. [...]

In theory the manpage's description suggests that you can split a four-way mirror vdev in half, pulling off two devices at once in a 'zpool split' operation. In practice it appears that the current 0.8.x version of ZFS on Linux can only split off a single device from each mirror vdev. This meant that I needed to split my pool in a multi-step operation.

Let's start with a pool, maindata, with four disks in a single mirrored vdev, oldA, oldB, newC, and newD. We want to split maindata so that there is a new pool with oldA and oldB. First, we split one old device out of the pool:

zpool split -R /mnt maindata maindata-hds oldA

Normally the newly split off pool is not imported (as far as I know), and you certainly don't want it imported if your filesystems have explicit 'mountpoint' settings (because then filesystems from the original pool and the split off pool will fight over who gets to be mounted there). However, you can't add devices to exported pools and we need to add oldB, so we have to import the new pool in an altroot. I use /mnt here out of tradition, but you can use any convenient empty directory.

With the pool split off, we need to detach oldB from the regular pool and attach it to oldA in the new pool to make the new pool actually be mirrored:

zpool detach maindata oldB
zpool attach maindata-hds oldA oldB

This will then resilver the new maindata-hds pool onto oldB (even though oldB already has an almost exact copy). Once the resilver is done, you can export the pool:

zpool export maindata-hds

You now have your mirrored backup copy sitting around with relatively little work on your part.

All of this appears to have worked completely fine for me. I scrubbed my maindata pool before splitting it, just in case, but I don't think I bothered to scrub the maindata-hds new pool after the resilver. It's only an emergency backup pool anyway (and it gets less and less useful over time, since there are more divergences between it and the live pool).

PS: I don't know if you can make snapshots, split a pool, and then do incremental ZFS sends from filesystems in one copy of the pool to the other to keep your backup copy more or less up to date. I wouldn't be surprised if it worked, but I also wouldn't be surprised if it didn't.

ZFSSplitPoolExperience written at 00:33:45; Add Comment

2019-12-18

Linux kernel Security Modules (LSMs) need their own errno value

Over on Twitter, I said something I've said before:

Once again, here I am hating how Linux introduced additional kernel security modules without also adding an errno for 'the loadable security module denied permissions'.

Lack of a LSM errno significantly complicates debugging problems, especially if you don't normally use LSMs.

Naturally there's a sysadmin story here, but let's start with the background (even if you probably know it).

SELinux and Ubuntu's AppArmor are examples of Linux Security Modules; each of them adds additional permission checks that you must pass over and above the normal Unix permissions. However, when they reject your access, they don't actually tell you this specifically; instead you get the generic Unix error of EPERM, 'operation not permitted', which is normally what you get if, say, the file is unreadable to your UID for some reason.

We have an internal primary master DNS server for our DNS zones (a so called 'stealth master'), which runs Ubuntu instead of OpenBSD for various reasons. With the winter holiday break coming up, and since we've had problems with this server coming up cleanly in the past, last week seemed like a good time to reboot it under controlled circumstances to make sure that at least that worked. When I did that, named (aka Bind) refused to start with a 'permission denied' error (aka EPERM) when it tried to read its named.conf configuration file. For reasons beyond the scope of this entry, this file lives on our central administrative NFS filesystem, and when you throw NFS into the picture various things can go wrong with access permissions. So I spent some time looking at file and directory permissions, NFS mount state, and so on, until I remembered something my co-worker had mentioned in passing.

Ubuntu defaults to installing and using AppArmor, but we don't like it and we turn it off almost everywhere (we can't avoid it for MySQL, although we can make it harmless). That morning we had applied the pending Ubuntu packages updates, as one does, and one of the packages that got updated had been the AppArmor package. It turns out that in our environment, when an AppArmor package update is applied, AppArmor gets re-enabled (but I think not started immediately); when I rebooted our primary DNS master, it now started AppArmor. AppArmor has a profile for Bind that only allows for a configuration file in the standard place, not where we put our completely different and customized one, and so when Bind tried to read our named.conf, the AppArmor LSM said 'no'. But that 'no' was surfaced only as an EPERM error and so I went chasing down the rabbit hole of all of the normal causes for permission errors.
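For my own future reference, the quick checks I'd reach for the next time an inexplicable 'permission denied' shows up on one of our Ubuntu machines are roughly these (aa-status ships with AppArmor itself while aa-disable comes from apparmor-utils, so this sketch assumes both are installed):

# is AppArmor loaded at all, and is named listed among the confined processes?
aa-status
# look for recent AppArmor denial messages in the kernel log
journalctl -k | grep -i apparmor
# turn off the Bind profile entirely, which is what we want here
aa-disable /etc/apparmor.d/usr.sbin.named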

People who deal with LSMs all of the time will probably be familiar with this issue and will immediately move to the theory that any unfamiliar and mysterious permission denials are potentially the LSM in action. But we don't use LSMs normally, so every time one enables itself and gets in our way, we have to learn all about this all over again. The process of troubleshooting would be much easier if the LSM actually told us that it was doing things by having a new errno value for 'LSM permission denied', because then we'd know right away what was going on.

(If Linux kernel people are worried about some combination of security concerns and backward compatibility, I would be happy if they made this extra errno value an opt-in thing that you had to turn on with a sysctl. We would promptly enable it for all of our servers.)

PS: Even if we didn't have our named.conf on a NFS filesystem, we probably wouldn't want to overwrite the standard version with our own. It's usually cleaner to build your own completely separate configuration file and configuration area, so that you don't have to worry about package updates doing anything to your setup.

ErrnoForLSMs written at 23:59:19; Add Comment

2019-12-13

Working out which of your NVMe drives is in what slot under Linux

One of the perennial problems with machines that have multiple drives is figuring out which of your physical drives is sda, which is sdb, and so on; the mirror problem is arranging things so that the drive you want to be the boot drive actually is the first drive. In sanely made server hardware this is generally relatively easy, but with desktops you can run into all sorts of problems, such as how desktop motherboards can wire things up oddly. Under some situations, NVMe drives make this easier than with SATA drives, because NVMe drives are PCIe devices and so have distinct PCIe bus addresses and possibly PCIe bus topologies.

First off, I will admit something. The gold standard for doing this reliably under all circumstances is to record the serial numbers of your NVMe drives before you put them into your system and then use 'smartctl -i /dev/nvme0n1' to find each drive from its serial number. It's always possible for a motherboard with multiple M.2 slots to do perverse things with its wiring and PCIe bus layout, so that what it labels as the first and perhaps best M.2 slot is actually the second NVMe drive as Linux sees it. But I think that generally it's pretty likely that the first M.2 slot will be earlier in PCIe enumeration than the second one (if there is a second one). And if you have only one M.2 slot on the motherboard and are using a PCIe to NVMe adapter card for your second NVMe drive, the PCIe bus topology of the two NVMe drives is almost certain to be visibly different.

All of this raises the question of how you get the PCIe bus address of a particular NVMe drive. We can do this by using /sys, because Linux makes your PCIe devices and topology visible in sysfs. Specifically, every NVMe device appears as a symlink in /sys/block that gives you the path to its PCIe node (and in fact the full topology). So on my office machine in its current NVMe setup, I have:

; readlink nvme0n1
../devices/pci0000:00/0000:00:03.2/0000:0b:00.0/[...]
; readlink nvme1n1
../devices/pci0000:00/0000:00:01.1/0000:01:00.0/[...]

This order on my machine gives me a surprise, because the two NVMe drives are not in the order I expected. In fact they're apparently not in the order that the kernel initially detected them in, as a look into 'dmesg' reports:

nvme nvme0: pci function 0000:01:00.0
nvme nvme1: pci function 0000:0b:00.0

This is the enumeration order I expected, with the motherboard M.2 slot at 01:00.0 detected before the adapter card at 0b:00.0 (for more on my current PCIe topology, see this entry). Indeed the original order appears to be preserved in bits of sysfs, with path components like nvme/nvme0/nvme1n1 and nvme/nvme1/nvme0n1. Perhaps the kernel assigned actual nvmeXn1 names backward, or perhaps udev renamed my disks for reasons known only to itself.

(But at least now I know which drive to pull if I have trouble with nvme1n1. On the other hand, I'm now doubting the latency numbers that I previously took as a sign that the NVMe drive on the adapter card was slower than the one in the M.2 slot, because I assumed that nvme1n1 was the adapter card drive.)

Once you have the PCIe bus address of a NVMe drive, you can look for additional clues as to what physical M.2 slot or PCIe slot that drive is in beyond just how this fits into your PCIe bus topology. For example, some motherboards (including my home machine) may wind up running the 'second' M.2 slot at x2 instead of x4 under some circumstances, so if you can find one NVMe drive running at x2 instead of x4, you have a strong clue as to which is which (assuming that your NVMe drives are x4 drives). You can also have a PCIe slot be forced to x2 for other reasons, such as motherboards where some slots share lanes and bandwidth. I believe that the primary M.2 slot on most motherboards always gets x4 and is never downgraded (except perhaps if you ask the BIOS to do so).
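You can also read the negotiated link width and speed for a device straight out of sysfs once you know its PCIe bus address, without going through lspci. A sketch, using the adapter card NVMe drive's address from above:

cat /sys/bus/pci/devices/0000:0b:00.0/current_link_width
cat /sys/bus/pci/devices/0000:0b:00.0/max_link_width
cat /sys/bus/pci/devices/0000:0b:00.0/current_link_speed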

You can also get the same PCIe bus address information (and then a lot more) through udevadm, as noted by a commentator on yesterday's entry; 'udevadm info /sys/block/nvme0n1' will give you all of the information that udev keeps. This doesn't seem to include any explicit information on whether the device was renamed, but it does include the kernel's assigned minor number and on my machine, nvme0n1 has minor number 1 while nvme1n1 has minor number 0, which suggests that it was assigned first.

(It would be nice if udev would log somewhere when it renames a device.)

PS: Looking at the PCIe bus addresses associated with SATA drives usually doesn't help, because most of the time all of your SATA drives are attached to the same PCIe device.

MappingNVMeDrives written at 00:41:42; Add Comment

2019-12-12

Linux makes your PCIe topology visible in sysfs (/sys)

Getting some NVMe drives for my office machine has been an ongoing education into many areas of PCIe, including how to see your PCIe topology with lspci and understanding how PCIe bus addresses and topology relate to each other. Today I coincidentally discovered that there is another way to look into your system's PCIe topology, because it turns out that the Linux kernel materializes it as a directory hierarchy in the sysfs filesystem that is usually mounted on /sys.

Generally, the root of the PCI(e) bus hierarchy is going to be found at /sys/devices/pci0000:00. Given an understanding of PCIe addresses, we can see that 0000:00 is the usual domain and starting PCIe bus number. In this directory are a whole bunch of subdirectories, named after the full PCIe bus address of each device, so you get directories named things like '0000:00:03.2'. If you take off the leading '0000:', this corresponds to what 'lspci -v' will report as device 00:03.2. For PCIe devices that act as bridges, there will be subdirectories for the PCIe devices behind the bridge, with the full PCIe address of those devices. So in my office machine's current PCIe topology, there is a '0000:0b:00.0' subdirectory in the 0000:00:03.2 directory, which is my second NVMe drive behind the 00:03.2 PCIe bridge.

(And behind 0000:00:03.1 is my Radeon graphics card, which actually has two exposed PCIe functions; 0000:0a:00.0 is the video side, while 0000:0a:00.1 is 'HDMI/DP Audio'.)
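A quick way to see what's directly behind a particular bridge is to list its directory for child entries that themselves look like PCIe addresses. A sketch, using the 00:03.2 bridge from above:

; ls -d /sys/devices/pci0000:00/0000:00:03.2/0000:*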

There are a number of ways to use this /sys information, some of which are for future entries. The most obvious use is to confirm your understanding of the topology and implied PCIe bus addresses that 'lspci -tv' reports. If the /sys directory hierarchy matches your understanding of the output, you have it right. If it doesn't, something is going on.

The other use is a brute force way of finding out what the topology of a particular final PCIe device is, by simply finding it in the hierarchy with 'find /sys/devices/pci0000:00 -name ..', where the name is its full bus address (with the 0000: on the front). So, for example, if we know we have an Ethernet device at 06:00.0, we can find where it is in the topology with:

; cd /sys/devices/pci0000:00
; find . -type d -name 0000:06:00.0 -print
./0000:00:01.3/0000:02:00.2/0000:03:03.0/0000:06:00.0

(Using '-type d' avoids having to filter out some symlinks for the PCIe node in various contexts; in this case it shows up as '0000:00:00.2/iommu/ivhd0/devices/0000:06:00.0'.)

This shows us the path through the PCIe topology from the root, through 00:01.3, then 02:00.2, then finally 03:03.0. This complex path is because this is a device hanging off the AMD X370 chipset instead of off of the CPU, although not all chipset attached PCIe devices will have such a long topology.

Until I looked at the lspci manpage more carefully, I was going to say that this was the easiest way to go from a PCIe bus address to the full path to the device with all of the PCIe bus addresses involved. However, it turns out that in sufficiently modern versions of lspci, 'lspci -PP' will report the same information in a shorter and more readable way:

; lspci -PP -s 06:00.0
00:01.3/02:00.2/03:03.0/06:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

Unfortunately the version of lspci on our Ubuntu 18.04 machines is not sufficiently modern; on those machines, a find remains the easiest way. You can do it from the output of either 'lspci -tv' or 'lspci -v', as described in an earlier entry, but you have to do some manual work to reconstruct all of the PCIe bus addresses involved.
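If you want the find based approach to produce its answer in the same compact form that 'lspci -PP' uses, a little sed will do it. A sketch, reusing the Ethernet controller example from above:

; find /sys/devices/pci0000:00 -type d -name 0000:06:00.0 | sed -e 's|^/sys/devices/pci0000:00/||' -e 's|0000:||g'
00:01.3/02:00.2/03:03.0/06:00.0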

PCIeTopologyInSysfs written at 00:22:48; Add Comment

2019-12-11

Fedora is not a good choice if long term stability and usability are a high priority

Every so often I hear about people running servers or infrastructure that they care about on Fedora, and my eyebrows almost always go up. I like Fedora and run it by choice on all of my desktops and my work laptop, but I'm the sole user on these machines, I know what I'm getting into, and I'm willing to deal with the periodic disruptions that Fedora delivers. Fedora is a good Linux distribution on the whole, but it is what I would call a 'forward looking' distribution; it is not that interested in maintaining backward compatibility if that conflicts with making the correct choice for right now, the choice that you'd make if you were starting from scratch. The result is that every so often, Fedora will unapologetically kick something out from underneath long term users and you get to fix your setup to deal with the new state of affairs.

All of this sounds very theoretical, so let me make it quite concrete with my tweet:

Today I learned that in Fedora 31, /usr/bin/python is Python 3 instead of Python 2. I hope Ubuntu doesn't do that, because if it does our users are going to kill us.

I learned this because I've recently upgraded my work laptop to Fedora 31 and on it I run a number of Python based programs that started with '#!/usr/bin/python'. Before the upgrade, that was Python 2 and all of those programs worked. After the upgrade, as I found out today, that was Python 3 and many of the programs didn't work.
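If you want to check what a given machine has, the quick inspection is just to look at the symlink itself and ask rpm which package owns it (a minimal sketch):

readlink -f /usr/bin/python
rpm -qf /usr/bin/python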

(Fedora provides no way to control this behavior as far as I can tell. What /usr/bin/python points to is not controlled through Fedora's alternatives system; instead it's a symlink that's directly supplied by a package, and there's no version of the package that provides a symlink pointing to Python 2.)

This is fine for me. It's my own machine, I know what changed on it recently, I don't have to support a mixed base of older and newer Fedora machines, and I'm willing to put the pieces back together. At work, we've been running a Linux environment for fifteen years or so now, we have somewhere around a thousand users, we have to run a mixed base of distribution versions, and some of those users will have programs that start with '#!/usr/bin/python', possibly programs they've even forgotten about because they've been running quietly for so long. This sort of change would cause huge problems for them and thus for us.

Fedora's decision here is not wrong, for Fedora, but it is a very Fedora decision. If you were doing a distribution from scratch for today, with no history behind it at all, /usr/bin/python pointing to Python 3 is a perfectly rational and good choice. Making that decision in a distribution with history is choosing one set of priorities over another; it is prioritizing the 'correct' and modern choice over not breaking existing setups and not making people using your distribution do extra work.

I think it's useful to have Linux distributions that prioritize this way, and I don't mind it in the distribution that I use. But I know what I'm getting into when I choose Fedora, and it's not for everyone.

FedoraVsLongTermUse written at 00:49:11; Add Comment

2019-12-06

PCIe bus addresses, lspci, and working out your PCIe bus topology

All PCIe devices have a PCIe bus address, which is shown by lspci, listed in dmidecode (cf), and so on. As covered in the lspci manpage, the fully general form of PCIe bus addresses is <domain>:<bus>:<device>.<function>. On most systems, the domain is always 0000 and is omitted by lspci, so what you see is the bus, the device, and the function, which looks like this:

0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] [...]

That is the Radeon card on my office machine (or at least the video display portion of it), and it's function 0 of device 00 on bus 0a. Generally the device number and the function number are stable, but the bus number can definitely change depending on what other hardware you have in your machine. As I understand the situation, modern machines have many separate PCIe busses (behind PCIe bridges and other things), and a PCIe address's bus number depends on both the order that things are scanned by the BIOS and how many other busses and sub-busses there are on your system. Some cards have PCIe bridges and other things, and so whether or not you have one of them in your system (and where) can change how other bus numbers are assigned.

As covered in yesterday's entry on looking into PCIe slot topology, lspci will print the actual topology of your PCIe devices with 'lspci -tv'. Ever since I found out about this, I've wondered how to go from the topology to the PCIe addresses in plain 'lspci -v' output, and how I might verify the topology or decode it from 'lspci -vv' output. As it turns out, both are possible (and Matt's comment on yesterday's entry gave me a useful piece of information).

So let's start from 'lspci -tv' output, because it's more complex. The first line of 'lspci -tv' looks like this:

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex

The portion in brackets is the domain and bus that everything under this point in the tree is on. This is domain 0000 and bus 00, which is generally the root of the PCIe topology. Going along, we get to my first NVMe drive:

           +-01.1-[01]----00.0  Kingston Technology Company, Inc. Device 2263

This is not directly on bus 00; instead it is accessed through a device at 01.1 on this bus, which thus has the (abbreviated) PCIe address of 00:01.1. 'lspci -v' tells me that this is a PCIe bridge, as expected:

00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [...]

Much like the []s of the root of the tree, the '[01]' bit after it in 'lspci -tv' means that all PCIe devices under this bridge are on bus 01, and there is only one of them, the NVMe drive, which will thus have the PCIe address 01:00.0:

01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 2263 [...]

The X370 chipset controller presents a more complex picture:

           +-01.3-[02-09]--+-00.0  Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller

The actual bridge is at 00:01.3 (another PCIe GPP bridge), but it has multiple busses behind it, from 02 through 09. If you look at the PCIe topology in yesterday's entry, you can predict that there are three things directly on bus 02, which are actually all functions of a single device; 02:00.0 is a xHCI controller, 02:00.1 is a SATA controller, and 02:00.2 is a 'X370 Series Chipset PCIe Upstream Port'. Behind it are a series of 'Chipset PCIe Port' devices (all on bus 03), and behind them are the actual physical PCIe slots and some onboard devices (a USB 3.1 host controller and the Intel gigabit Ethernet port). Each of these gets their own PCIe bus, 04 through 09, so for example my onboard Ethernet is 08:00.0 (bus 08, device 00, function 0):

08:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

Now let's go the other way, from 'lspci -vv' output to what device is under what. As I showed above, my Radeon card is 0a:00.0. Its upstream device is a PCIe GPP bridge at 00:03.1. If we examine that GPP bridge in 'lspci -vv', we will see a 'Bus:' line:

00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
   [...]
   Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
   [...]

The primary bus is the 00: portion of its PCIe address, and the secondary bus is its direct downstream bus. So we would expect to find anything directly under this on bus 0a, which is indeed where the Radeon is.

A more interesting case is the 00:01.3 PCIe bridge to the X370 chipset. This reports:

   Bus: primary=00, secondary=02, subordinate=09, sec-latency=0

What this appears to mean is that while this bridge's direct downstream bus is 02, somewhere below it are PCIe busses up to 09. I suspect that the PCIe specification requires that busses be assigned sequentially this way in order to make routing simpler.

If you have a device, such as my Radeon card at 0a:00.0, there is no way I can see in the device's verbose PCIe information to find what its parent is (end devices don't have a 'Bus:' line). You have to search through the 'lspci -vv' of other devices for something with a 'secondary=' of bus 0a. I think you'll pretty much always find this somewhere, since generally something has to have this as a direct downstream PCIe bus even if it's under a fan-out PCIe setup like the X370 chipset bridge on this machine.
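You can partly automate this search. The following sketch scans 'lspci -vv' output for a 'Bus:' line with a secondary bus of 0a and prints the header line of whatever device it belongs to (it assumes the output format shown above):

; lspci -vv | awk '/^[0-9a-f]/ { dev = $0 } /secondary=0a,/ { print dev }'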

(Working backward this way instead of using 'lspci -tv' can be a useful way of reassuring yourself that you really do understand the topology. You may also want to look at the details of an upstream device to find out, for example, why your Radeon card appears to be running at PCIe 1.0 speeds. I haven't solved that mystery yet, partly because I've been too busy to reboot my office machine to get it into the BIOS.)

PCIeLspciBusAddresses written at 01:33:24; Add Comment

2019-12-05

Looking into your system's PCIe slot topology and PCIe lane count under Linux

Suppose, not hypothetically, that you want to understand the PCIe slot and bus topology of your Linux system and also work out how many PCIe lanes important things have been given. Sometimes the PCIe lane count doesn't matter, but things like NVMe drives may have lower latency under load and perform better if they get their full number of PCIe lanes (generally four lanes, 'x4'). You can do this under Linux, but it's not as straightforward as you'd like.

The easiest way to see PCIe topology is with 'lspci -tv', which shows you how PCIe slots relate to each other and also what's in them. On my office machine with the first configuration of the PCIe to NVMe adapter card, this looked like the following:

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2  [...]
           +-01.0  [...]
           +-01.1-[01]----00.0  Kingston Technology Company, Inc. Device 2263
           +-01.3-[02-09]--+-00.0  Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller
           |               +-00.1  Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller
           |               \-00.2-[03-09]--+-00.0-[04]----00.0  Kingston Technology Company, Inc. Device 2263
           |                               +-02.0-[05]--
           |                               +-03.0-[06]----00.0  Intel Corporation 82572EI Gigabit Ethernet Controller (Copper)
           |                               +-04.0-[07]----00.0  ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller
           |                               +-06.0-[08]----00.0  Intel Corporation I211 Gigabit Network Connection
           |                               \-07.0-[09]--
           +-02.0  [...]
           +-03.0  [...]
           +-03.1-[0a]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
           +-04.0  [...]
[...]

(By the way, the PCIe numbers shown here have no resemblance to the numbers in the output of plain 'lspci'. Also, I have left out the text for a lot of 'PCIe Dummy Host Bridge' devices and a couple of others.)

The two Kingston PCIe devices are the two NVMe drives. The first one listed is in the motherboard M.2 slot; the second one listed is on its adapter card in a PCIe x16 @ x4 slot driven by the X370 chipset instead of directly by the CPU. Although it's not obvious from the topology listing, one of the Intel Ethernets is on the motherboard (but driven through the X370 chipset) and the other is a card in a PCIe x1 slot.

Now here's the same machine with the PCIe adapter for the second NVMe drive moved to a PCIe x16 @ x8 GPU card slot:

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2  [...]
           +-01.0  [...]
           +-01.1-[01]----00.0  Kingston Technology Company, Inc. Device 2263
           +-01.3-[02-09]--+-00.0  Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller
           |               +-00.1  Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller
           |               \-00.2-[03-09]--+-00.0-[04]--
           |                               +-02.0-[05]--
           |                               +-03.0-[06]----00.0  Intel Corporation 82572EI Gigabit Ethernet Controller (Copper)
           |                               +-04.0-[07]----00.0  ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller
           |                               +-06.0-[08]----00.0  Intel Corporation I211 Gigabit Network Connection
           |                               \-07.0-[09]--
           +-02.0  [...]
           +-03.0  [...]
           +-03.1-[0a]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
           +-03.2-[0b]----00.0  Kingston Technology Company, Inc. Device 2263
           +-04.0  [...]
[...]

It's pretty clear that the second NVMe drive has shifted its position in the topology fairly significantly, and given that we have '03.1' and '03.2' labels for the Radeon GPU and the NVMe drive, it's not hard to guess that it's now in one of a pair of GPU slots. It's also clearly no longer under the X370 chipset, but instead is now on the same (PCIe) level as the first NVMe drive.

Working out how many PCIe lanes a card wants and how many it's actually getting is harder and more annoying. As far as I know, the best way of finding it out is to look carefully through the output of 'lspci -vv' for the device you're interested in and focus on the LnkCap and LnkSta portions, which will list what the card is capable of and what it actually got. For example, for my PCIe to NVMe adapter card in its first location (the first topology), where it was choked down from PCIe x4 to x2, this looks like the following:

04:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 2263 (rev 03) (prog-if 02 [NVM Express])
 [...]
   LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
           ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
 [...]
   LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
           TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
 [...]

This says that it could do PCIe x4 at 8GT/s (which is PCIe 3.0) but was downgraded to PCIe x2 at 5GT/s (which is PCIe 2.0). In the PCIe adapter's new location in a GPU card slot where the NVMe drive could get full speed, the LnkCap didn't change but the LnkSta changed to:

   LnkSta: Speed 8GT/s (ok), Width x4 (ok)
           TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

So now the NVMe drive is getting x4 PCIe 3.0.

In general, searching 'lspci -vv' output for 'downgraded' can be interesting. For example, on my office machine, the Radeon GPU reports the following in both PCIe topologies:

0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00 [VGA controller])
 [...]
   LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
           ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
 [...]
   LnkSta: Speed 2.5GT/s (downgraded), Width x8 (ok)
           TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
 [...]

I have no idea why the Radeon GPU has apparently been downgraded from PCIe 3.0 all the way to what looks like PCIe 1.0, although it may be because it's probably behind a 'PCIe GPP Bridge' that has also been downgraded to 2.5 GT/s. Perhaps I've missed some BIOS setting that is affecting things (BIOS settings can apparently influence this part of PCIe, as they do many bits of it). For my relatively basic X usage, this downgrade may not matter, but having noticed it I'm now curious and somewhat irritated. If my card can do PCIe 3.0, I want to be getting PCIe 3.0 if at all possible.
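To make the 'search for downgraded' step less tedious, here's a small sketch that prints the header line of every device whose LnkSta reports a downgrade. It has to be run as root, since lspci only shows the link status details with enough privileges:

; lspci -vv 2>/dev/null | awk '/^[0-9a-f]/ { dev = $0 } /LnkSta:.*downgraded/ { print dev }'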

(Additional reading on this PCIe stuff under Linux includes here, here, and here. As usual, I'm writing this down so that I have it for reference the next time I need to poke around in this area.)

PCIeTopologyAndLanes written at 00:21:22; Add Comment

