2019-12-25
Why udev may be trying to rename your VLAN interfaces to bad names
When I updated my office workstation to Fedora 30 back in August, I ran into a little issue:
It has been '0' days since systemd/udev blew up my networking. Fedora 30 systemd/udev attempts to rename VLAN devices to the interface's base name and fails spectacularly, causing the sys-subsystem*.device units to not be present. We hope you didn't depend on them! (I did.)
I filed this as Fedora bug #1741678, and just today I got a clue so that now I think I know why this happens.
The symptom of this problem is that during boot, your system will log things like:
systemd-udevd[914]: em-net5: Failed to rename network interface 4 from 'em-net5' to 'em0': Device or resource busy
As you might guess from the name I've given it here, em-net5 is a VLAN on em0. The name 'em0' itself is one that I assigned, because I don't like the network names that systemd-udevd would assign if left on its own (they are what I would call ugly, or at least tangled and long). The failure here prevents systemd from creating the sys-subsystem-net-devices-em-net5.device unit that it normally would (and then this had further consequences because of systemd's lack of good support for networks being ready).
I use networkd with static networking, so I set up the em0 name through a networkd .link file (as covered here). This looks like:
[Match]
MACAddress=60:45:cb:a0:e8:dd

[Link]
Description=Onboard port
MACAddressPolicy=persistent
Name=em0
Based on what 'udevadm test' reports, it appears that when udevd is configuring the em-net5 VLAN, it (still) matches this .link file for the underlying device and applies things from it. My guess is that this is happening because VLANs and their underlying physical interfaces normally share MACs, and so the VLAN MAC matches the MAC here.
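A way to see this for yourself is to ask udevd what it would do with the VLAN device without actually doing it. This is only a sketch using my interface name, and the exact output varies between systemd versions:
# dry-run just udev's link setup for the VLAN to see which .link file it matches
udevadm test-builtin net_setup_link /sys/class/net/em-net5
# or dry-run the full udev rule processing for the device
udevadm test /sys/class/net/em-net5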
This appears to be a behavior change in the version of udev shipped in Fedora 30. Before Fedora 30, systemd-udevd and networkd did not match VLAN MACs against .link files; from Fedora 30 onward, they appear to do so. To stop this, presumably you need to limit your .link files to only matching on physical interfaces, not VLANs, but unfortunately this seems difficult to do. The systemd.link manpage documents a 'Type=' match, but while VLANs have a type that can be used for this, native interfaces do not appear to (and there doesn't seem to be a way to negate the match). There are various hacks that could be committed here, but all of them are somewhat unpleasant to me (such as specifying the kernel driver; if the kernel's opinion of what driver to use for this hardware changes, I am up a creek again).
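As an illustration of the Type= situation, you can look at what udev thinks an interface's type is through its DEVTYPE property; on my understanding the VLAN reports one (eg 'vlan') while the plain Ethernet interface reports nothing, which is why a 'Type=' match only helps on the VLAN side. A sketch with my interface names:
# the VLAN should have a DEVTYPE property; the physical interface likely won't
udevadm info /sys/class/net/em-net5 | grep DEVTYPE
udevadm info /sys/class/net/em0 | grep DEVTYPE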
2019-12-20
My new Linux office workstation disk partitioning for the end of 2019
I've just had the rare opportunity to replace all of my office machine's disks at once, without having to carry over any of the previous generation the way I've usually had to. As part of replacing everything I got the chance to redo the partitioning and setup of all of my disks, again all at once without the need to integrate a mix of the future and the past. For various reasons, I want to write down the partitioning and filesystem setup I decided on.
My office machine's new set of disks are a pair of 500 GB NVMe drives and a pair of 2 TB SATA SSDs. I'm using GPT partitioning on all four drives for various reasons. All four drives start with my standard two little partitions, a 256 MB EFI System Partition (ESP, gdisk code EF00) and a 1 MB BIOS boot partition (gdisk code EF02). I don't currently use either of them (my past attempt to switch from MBR booting to UEFI was a failure), but they're cheap insurance for the future. Similarly, putting these partitions on all four drives instead of just my 'system' drives is more cheap insurance.
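As a sketch of what those two little partitions look like when created with sgdisk (the device name is an example, and this is not necessarily how I actually created mine):
# example only: a 256 MB ESP and a 1 MB BIOS boot partition on one drive
sgdisk -n 1:0:+256M -t 1:EF00 -c 1:"EFI system" /dev/nvme0n1
sgdisk -n 2:0:+1M -t 2:EF02 -c 2:"BIOS boot" /dev/nvme0n1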
(Writing this down has made me realize that I didn't format the ESPs. Although I don't use UEFI for booting, I have in the past put updated BIOS firmware images there in order to update the BIOS.)
The two NVMe drives are my 'system' drives. They have three additional partitions: a 70 GB partition used for a Linux software RAID mirror of the root filesystem (including /usr and /var, since I put all of the system into one filesystem), a 1 GB partition that is a Linux software RAID mirror swap partition, and the remaining 394.5 GB as a mirrored ZFS pool that holds filesystems that I want to be as fast as possible and that I can be confident won't grow to be too large. Right now that's my home directory filesystem and the filesystem that holds source code (where I build Firefox, Go, and ZFS on Linux, for example).
The two SATA SSDs are my 'data' drives, holding various larger but less important things. They have two 70 GB partitions that are Linux software RAID mirrors and the remaining space is in a single partition for another mirrored ZFS pool. One of the two 70 GB partitions is so that I can make backup copies of my root filesystem before upgrading Fedora (if I bother to do so); the other is essentially an 'overflow' filesystem for some data that I want on an ext4 filesystem instead of in a ZFS pool (including a backup copy of all recent versions of ZFS on Linux that I've installed on my machine, so that if I update and the very latest version has a problem, I can immediately reinstall a previous one). The ZFS pool on the SSDs contains larger and generally less important things like my VMWare virtual machine images and the ISOs I use to install them, and archived data.
Both ZFS pools are set up following my historical ZFS on Linux
practice, where they use the /dev/disk/by-id
names for my disks instead of the sdX and nvme... names. Both pools
are actually relatively old; I didn't create new pools for this and
migrate my data, but instead just attached new mirrors to the old
pools and then detached the old drives (more or less). The root filesystem was similarly migrated
from my old SSDs by attaching and removing software RAID mirrors;
the other Linux software RAID filesystems are newly made and copied
through ext4 dump
and restore
(and the new software RAID arrays
were added to /etc/mdadm.conf
more or less by hand).
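For what it's worth, a sketch of what that sort of copy looks like, with made-up array names; the dump | restore pipeline is the standard idiom for cloning an ext4 filesystem:
# copy an existing ext4 filesystem into a new (already mounted) one
mount /dev/md52 /mnt/newfs
cd /mnt/newfs
dump -0f - /dev/md20 | restore -rf -
# print ARRAY lines for the new arrays, to be pasted into /etc/mdadm.conf and edited
mdadm --detail --scan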
(Since I just looked it up, the ZFS pool on the SATA SSDs was created in August of 2014, originally on HDs, and the pool on the NVMe drives was created in January of 2016, originally on my first pair of (smaller) SSDs.)
Following my old guide to RAID superblock formats, I continued to use the version 1.0 format for everything except the new swap partition, where I used the version 1.2 format. By this point using 1.0 is probably superstition; if I have serious problems (for example), I'm likely to just boot from a Fedora USB live image instead of trying anything more complicated.
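The superblock format is picked when an array is created, so the choice looks something like the following sketch (with example device and array names):
# version 1.0 superblock (stored at the end of the devices), as I use for most arrays
mdadm --create /dev/md20 --level=1 --raid-devices=2 --metadata=1.0 /dev/nvme0n1p3 /dev/nvme1n1p3
# version 1.2 superblock (the modern default), as I used for the new swap mirror
mdadm --create /dev/md21 --level=1 --raid-devices=2 --metadata=1.2 /dev/nvme0n1p4 /dev/nvme1n1p4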
All of this feels very straightforward and predictable by now. I've moved away from complex partitioning schemes over time and almost all of the complexity left is simply that I have two different sets of disks with different characteristics, and I want some filesystems to be fast more than others. I would like all of my filesystems to be on NVMe drives, but I'm not likely to have NVMe drives that big for years to come.
(The most tangled bit is the 70 GB software RAID array reserved for a backup copy of my root filesystem during major upgrades, but in practice it's been quite a while since I bothered to use it. Still, having it available is cheap insurance in case I decide I want to do that someday during an especially risky Fedora upgrade.)
Splitting a mirrored ZFS pool in ZFS on Linux
Suppose, not hypothetically, that you're replacing a pair of old disks with a pair of new disks in a ZFS pool that uses mirrors. If you're a cautious person and you worry about issues like infant mortality in your new drives, you don't necessarily want to immediately switch from the old disks to the new ones; you want to run them in parallel for at least a bit of time. ZFS makes this very easy, since it supports up to four way mirrors and you can just attach devices to add extra mirrors (and then detach devices later). Eventually it will come time to stop using the old disks, and at this point you have a choice of what to do.
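The 'run in parallel' stage is just ordinary 'zpool attach'; for example, with the pool and device names I'll use later in this entry:
# add the two new disks as additional mirrors alongside the existing ones
zpool attach maindata oldA newC
zpool attach maindata oldA newD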
The straightforward thing is to drop the old disks out of the ZFS mirror vdev with 'zpool detach', which cleanly removes them (and they won't come back later, unlike with Linux software RAID). However, this is a little bit wasteful, in a sense. Those old disks have a perfectly good backup copy of your ZFS pool on them, but when you detach them you lose any real possibility of using that copy. Perhaps you would like to keep that data as an actual backup copy, just in case. Modern versions of ZFS can do this through splitting the pool with 'zpool split'.
To quote the manpage here:
Splits devices off pool creating newpool. All vdevs in pool must be mirrors and the pool must not be in the process of resilvering. At the time of the split, newpool will be a replica of pool. [...]
In theory the manpage's description suggests that you can split a four-way mirror vdev in half, pulling off two devices at once in a 'zpool split' operation. In practice it appears that the current 0.8.x version of ZFS on Linux can only split off a single device from each mirror vdev. This meant that I needed to split my pool in a multi-step operation.
Let's start with a pool, maindata, with four disks in a single mirrored vdev, oldA, oldB, newC, and newD. We want to split maindata so that there is a new pool with oldA and oldB.
First, we split one old device out of the pool:
zpool split -R /mnt maindata maindata-hds oldA
Normally the just-split-off newpool is not imported (as far as I know), and certainly you don't want it imported if your filesystems have explicit 'mountpoint' settings (because then filesystems from the original and the split-off pool will fight over who gets to be mounted there). However, you can't add devices to exported pools and we need to add oldB, so we have to import the new pool in an altroot. I use /mnt here out of tradition but you can use any convenient empty directory.
With the pool split off, we need to detach oldB
from the regular
pool and attach it to oldA
in the new pool to make the new pool
actually be mirrored:
zpool detach maindata oldB
zpool attach maindata-hds oldA oldB
This will then resilver the new maindata-hds pool onto oldB (even though oldB has an almost exact copy already). Once the resilver is done, you can export the pool:
zpool export maindata-hds
You now have your mirrored backup copy sitting around with relatively little work on your part.
All of this appears to have worked completely fine for me. I scrubbed my maindata pool before splitting it, just in case, but I don't think I bothered to scrub the new maindata-hds pool after the resilver. It's only an emergency backup pool anyway (and it gets less and less useful over time, since there are more divergences between it and the live pool).
PS: I don't know if you can make snapshots, split a pool, and then do incremental ZFS sends from filesystems in one copy of the pool to the other to keep your backup copy more or less up to date. I wouldn't be surprised if it worked, but I also wouldn't be surprised if it didn't.
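If it does work, I'd expect the shape of it to be roughly the following untested sketch (with the same pool names and made-up snapshot names); I haven't verified any of this:
# untested: snapshot before splitting so both pools share a common snapshot
zfs snapshot -r maindata@base
zpool split -R /mnt maindata maindata-hds oldA
# ... later, send everything since the shared snapshot over to the backup pool ...
zfs snapshot -r maindata@sync1
zfs send -R -I @base maindata@sync1 | zfs receive -d -F maindata-hds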
2019-12-18
Linux kernel Security Modules (LSMs) need their own errno value
Over on Twitter, I said something I've said before:
Once again, here I am hating how Linux introduced additional kernel security modules without also adding an errno for 'the loadable security module denied permissions'.
Lack of a LSM errno significantly complicates debugging problems, especially if you don't normally use LSMs.
Naturally there's a sysadmin story here, but let's start with the background (even if you probably know it).
SELinux and Ubuntu's AppArmor
are examples of Linux Security Modules; each of
them adds additional permission checks that you must pass over and
above the normal Unix permissions. However, when they reject your
access, they don't actually tell you this specifically; instead you
get the generic Unix error of EPERM, 'operation not permitted', which is normally what you get if, say, the file is unreadable to your UID for some reason.
We have an internal primary master DNS server for our DNS zones (a so-called 'stealth master'), which runs Ubuntu instead of OpenBSD for various reasons. We have the winter holiday break coming up, and since we've had problems with this server coming up cleanly in the past, last week seemed like a good time to reboot it under controlled circumstances to make sure that at least that worked. When I did that, named (aka Bind) refused to start with a 'permission denied' error (aka EPERM)
when it tried to read its named.conf
configuration file. For
reasons beyond the scope of this entry, this file lives on our
central administrative NFS filesystem, and when you throw NFS into
the picture various things can go wrong with access permissions.
So I spent some time looking at file and directory permissions, NFS
mount state, and so on, until I remembered something my co-worker
had mentioned in passing.
Ubuntu defaults to installing and using AppArmor, but we don't
like it and we turn it off almost everywhere (we can't avoid it for
MySQL, although we can make it harmless).
That morning we had applied the pending Ubuntu package updates, as one does, and one of the packages that got updated was the AppArmor package. It turns out that in our environment, when an AppArmor package update is applied, AppArmor gets re-enabled (but I think not started immediately); when I rebooted our primary DNS master, the reboot started AppArmor. AppArmor has a profile for Bind that only allows for a configuration file in the standard place, not where we put our completely different and customized one, and so when Bind tried to read our named.conf, the AppArmor LSM said 'no'. But that 'no' was surfaced only as an EPERM error and so I went chasing down the rabbit hole of all of the normal causes for permission errors.
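In hindsight, one relatively quick way to check for this sort of thing (assuming AppArmor is the LSM involved and that denials are being logged the way they usually are) is:
# is AppArmor active, and which profiles are being enforced?
aa-status
# AppArmor denials normally show up in the kernel log
dmesg | grep -i 'apparmor="DENIED"'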
People who deal with LSMs all of the time will probably be familiar
with this issue and will immediately move to the theory that any
unfamiliar and mysterious permission denials are potentially the
LSM in action. But we don't use LSMs normally, so every time one
enables itself and gets in our way, we have to learn all about this
all over again. The process of troubleshooting would be much easier
if the LSM actually told us that it was doing things by having a
new errno
value for 'LSM permission denied', because then we'd
know right away what was going on.
(If Linux kernel people are worried about some combination of security concerns and backward compatibility, I would be happy if they made this extra errno value an opt-in thing that you had to turn on with a sysctl. We would promptly enable it for all of our servers.)
PS: Even if we didn't have our named.conf
on a NFS filesystem,
we probably wouldn't want to overwrite the standard version with
our own. It's usually cleaner to build your own completely separate
configuration file and configuration area, so that you don't have to
worry about package updates doing anything to your setup.
2019-12-13
Working out which of your NVMe drives is in what slot under Linux
One of the perennial problems with machines that have multiple drives is figuring out which of your physical drives is sda, which is sdb, and so on; the mirror image of this problem is arranging things so that the drive you want to be the boot drive actually is the first drive. In sanely made server hardware this is generally relatively easy, but with desktops you can run into all sorts of problems, such as how desktop motherboards can wire things up oddly. In some situations, NVMe drives make this easier than with SATA drives, because NVMe drives are PCIe devices and so have distinct PCIe bus addresses and possibly PCIe bus topologies.
First off, I will admit something. The gold standard for doing this reliably under all circumstances is to record the serial numbers of your NVMe drives before you put them into your system and then use 'smartctl -i /dev/nvme0n1' to find each drive from its serial number. It's always possible for a motherboard with multiple M.2 slots to do perverse things with its wiring and PCIe bus layout, so that what it labels as the first and perhaps best M.2 slot is actually the second NVMe drive as Linux sees it. But I think that generally it's pretty likely that the first M.2 slot will be earlier in PCIe enumeration than the second one (if there is a second one). And if you have only one M.2 slot on the motherboard and are using a PCIe to NVMe adapter card for your second NVMe drive, the PCIe bus topology of the two NVMe drives is almost certain to be visibly different.
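For example, something like this little loop prints the serial number of each NVMe drive; it's nothing more than a convenience wrapper around 'smartctl -i':
# print each NVMe namespace device and its drive's serial number (needs root)
for d in /dev/nvme?n1; do
  echo "$d: $(smartctl -i $d | grep -i 'serial number')"
done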
All of this raises the question of how you get the PCIe bus address of a particular NVMe drive. We can do this by using /sys, because Linux makes your PCIe devices and topology visible in sysfs. Specifically, every NVMe device appears as a symlink in /sys/block that gives you the path to its PCIe node (and in fact the full topology). So on my office machine in its current NVMe setup, I have:
; readlink nvme0n1
../devices/pci0000:00/0000:00:03.2/0000:0b:00.0/[...]
; readlink nvme1n1
../devices/pci0000:00/0000:00:01.1/0000:01:00.0/[...]
This order on my machine gives me a surprise, because the two NVMe drives are not in the order I expected. In fact they're apparently not in the order that the kernel initially detected them in, as a look into 'dmesg' reports:
nvme nvme0: pci function 0000:01:00.0
nvme nvme1: pci function 0000:0b:00.0
This is the enumeration order I expected, with the motherboard M.2
slot at 01:00.0 detected before the adapter card at 0b:00.0 (for
more on my current PCIe topology, see this entry).
Indeed the original order appears to be preserved in bits of sysfs, with path components like nvme/nvme0/nvme1n1 and nvme/nvme1/nvme0n1. Perhaps the kernel assigned actual nvmeXn1 names backward, or perhaps udev renamed my disks for reasons known only to itself.
(But at least now I know which drive to pull if I have trouble with nvme1n1. On the other hand, I'm now doubting the latency numbers that I previously took as a sign that the NVMe drive on the adapter card was slower than the one in the M.2 slot, because I assumed that nvme1n1 was the adapter card drive.)
Once you have the PCIe bus address of a NVMe drive, you can look for additional clues as to what physical M.2 slot or PCIe slot that drive is in beyond just how this fits into your PCIe bus topology. For example, some motherboards (including my home machine) may wind up running the 'second' M.2 slot at x2 instead of x4 under some circumstances, so if you can find one NVMe drive running at x2 instead of x4, you have a strong clue as to which is which (assuming that your NVMe drives are x4 drives). You can also have a PCIe slot be forced to x2 for other reasons, such as motherboards where some slots share lanes and bandwidth. I believe that the primary M.2 slot on most motherboards always gets x4 and is never downgraded (except perhaps if you ask the BIOS to do so).
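One way to check the negotiated width and speed without wading through 'lspci -vv' is the attributes that sysfs exposes for each PCIe device; for example, for the NVMe drive at 0000:0b:00.0 from above:
# negotiated and maximum link width, plus the current link speed
cat /sys/bus/pci/devices/0000:0b:00.0/current_link_width
cat /sys/bus/pci/devices/0000:0b:00.0/max_link_width
cat /sys/bus/pci/devices/0000:0b:00.0/current_link_speed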
You can also get the same PCIe bus address information (and then a lot more) through udevadm, as noted by a commentator on yesterday's entry; 'udevadm info /sys/block/nvme0n1' will give you all of the information that udev keeps. This doesn't seem to include any explicit information on whether the device was renamed, but it does include the kernel's assigned minor number and on my machine, nvme0n1 has minor number 1 while nvme1n1 has minor number 0, which suggests that it was assigned first.
(It would be nice if udev would log somewhere when it renames a device.)
PS: Looking at the PCIe bus addresses associated with SATA drives usually doesn't help, because most of the time all of your SATA drives are attached to the same PCIe device.
2019-12-12
Linux makes your PCIe topology visible in sysfs (/sys)
Getting some NVMe drives for my office machine has been an ongoing education into many areas of PCIe, including how to see your PCIe topology with lspci and understanding how PCIe bus addresses and topology relate to each other. Today I coincidentally discovered that there is another way to look into your system's PCIe topology, because it turns out that the Linux kernel materializes it as a directory hierarchy in the sysfs filesystem that is usually mounted on /sys.
Generally, the root of the PCI(e) bus hierarchy is going to be found at /sys/devices/pci0000:00. Given an understanding of PCIe addresses, we can see that 0000:00 is the usual domain and starting PCIe bus number. In this directory are a whole bunch of subdirectories, named after the full PCIe bus address of each device, so you get directories named things like '0000:00:03.2'. If you take off the leading '0000:', this corresponds to what 'lspci -v' will report as device 00:03.2. For PCIe devices that act as bridges, there will be subdirectories for the PCIe devices behind the bridge, with the full PCIe address of those devices. So in my office machine's current PCIe topology, there is a '0000:0b:00.0' subdirectory in the 0000:00:03.2 directory, which is my second NVMe drive behind the 00:03.2 PCIe bridge.
(And behind 0000:00:03.1
is my Radeon graphics card, which actually
has two exposed PCIe functions; 0000:0a:00.0
is the video side,
while 0000:0a:00.1
is 'HDMI/DP Audio'.)
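A quick way to see both functions as subdirectories, using the paths from my machine:
# both functions of the Radeon card appear under its upstream bridge's directory
ls -d /sys/devices/pci0000:00/0000:00:03.1/0000:0a:00.*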
There are a number of ways to use this /sys information, some of which are for future entries. The most obvious use is to confirm your understanding of the topology and implied PCIe bus addresses that 'lspci -tv' reports. If the /sys directory hierarchy matches your understanding of the output, you have it right. If it doesn't, something is going on.
The other use is a brute force way of finding out what the topology of a particular final PCIe device is, by simply finding it in the hierarchy with 'find /sys/devices/pci0000:00 -name ..', where the name is its full bus address (with the 0000: on the front). So, for example, if we know we have an Ethernet device at 06:00.0, we can find where it is in the topology with:
; cd /sys/devices/pci0000:00
; find . -type d -name 0000:06:00.0 -print
./0000:00:01.3/0000:02:00.2/0000:03:03.0/0000:06:00.0
(Using '-type d' avoids having to filter out some symlinks for the PCIe node in various contexts; in this case it shows up as '0000:00:00.2/iommu/ivhd0/devices/0000:06:00.0'.)
This shows us the path through the PCIe topology from the root, through 00:01.3, then 02:00.2, then finally 03:03.0. This complex path is because this is a device hanging off the AMD X370 chipset instead of off of the CPU, although not all chipset attached PCIe devices will have such a long topology.
Until I looked at the lspci manpage more carefully, I was going to say that this was the easiest way to go from a PCIe bus address to the full path to the device with all of the PCIe bus addresses involved. However, it turns out that in sufficiently modern versions of lspci, 'lspci -PP' will report the same information in a shorter and more readable way:
; lspci -PP -s 06:00.0
00:01.3/02:00.2/03:03.0/06:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
Unfortunately the version of lspci on our Ubuntu 18.04 machines is not sufficiently modern; on those machines, a find remains the easiest way. You can do it from the output of either 'lspci -tv' or 'lspci -v', as described in an earlier entry, but you have to do some manual work to reconstruct all of the PCIe bus addresses involved.
2019-12-11
Fedora is not a good choice if long term stability and usability are a high priority
Every so often I hear about people running servers or infrastructure that they care about on Fedora, and my eyebrows almost always go up. I like Fedora and run it by choice on all of my desktops and my work laptop, but I'm the sole user on these machines, I know what I'm getting into, and I'm willing to deal with the periodic disruptions that Fedora delivers. Fedora is a good Linux distribution on the whole, but it is what I would call a 'forward looking' distribution; it is one that is not that much interested in maintaining backward compatibility if that conflicts with making the correct choice for right now, the choice that you'd make if you were starting from scratch. The result is that every so often, Fedora will unapologetically kick something out from underneath long term users and you get to fix your setup to deal with the new state of affairs.
All of this sounds very theoretical, so let me make it quite concrete with my tweet:
Today I learned that in Fedora 31, /usr/bin/python is Python 3 instead of Python 2. I hope Ubuntu doesn't do that, because if it does our users are going to kill us.
I learned this because I've recently upgraded my work laptop to Fedora 31 and on it I run a number of Python based programs that started with '#!/usr/bin/python'. Before the upgrade, that was Python 2 and all of those programs worked. After the upgrade, as I found out today, that was Python 3 and many of the programs didn't work.
(Fedora provides no way to control this behavior as far as I can
tell. What /usr/bin/python
is is not controlled through Fedora's
alternatives system;
instead it's a symlink that's directly supplied by a package, and
there's no version of the package that provides a symlink that goes
to Python 2.)
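You can see the current state of affairs for yourself with something like the following; 'rpm -qf' will report which package owns the symlink:
# where does /usr/bin/python point, and what package supplies it?
ls -l /usr/bin/python
rpm -qf /usr/bin/python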
This is fine for me. It's my own machine, I know what changed on
it recently, I don't have to support a mixed base of older and newer
Fedora machines, and I'm willing to put the pieces back together.
At work, we've been running a Linux environment for fifteen years or so now, we have somewhere around a thousand users, we have to run a mixed base of distribution versions, and some of those users will have programs that start with '#!/usr/bin/python', possibly programs they've even forgotten about because they've been running quietly for so long. This sort of change would cause huge problems for them and thus for us.
Fedora's decision here is not wrong, for Fedora, but it is a very
Fedora decision. If you were doing a distribution from scratch for
today, with no history behind it at all, /usr/bin/python
pointing
to Python 3 is a perfectly rational and good choice. Making that
decision in a distribution with history is choosing one set of
priorities over another; it is prioritizing the 'correct' and modern
choice over not breaking existing setups and not making people using
your distribution do extra work.
I think it's useful to have Linux distributions that prioritize this way, and I don't mind it in the distribution that I use. But I know what I'm getting into when I choose Fedora, and it's not for everyone.
2019-12-06
PCIe bus addresses, lspci, and working out your PCIe bus topology
All PCIe devices have a PCIe bus address, which is shown by lspci, listed in dmidecode (cf), and so on. As covered in the lspci manpage, the fully general form of PCIe bus addresses is <domain>:<bus>:<device>.<function>. On most systems, the domain is always 0000 and is omitted by lspci, and what you see is the bus, the device, and the function, which looks like this:
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] [...]
That is the Radeon card on my office machine (or at least the video display portion of it), and it's function 0 of device 00 on bus 0a. Generally the device number and the function number are stable, but the bus number can definitely change depending on what other hardware you have in your machine. As I understand the situation, modern machines have many separate PCIe busses (behind PCIe bridges and other things), and a PCIe address's bus number depends on both the order that things are scanned by the BIOS and how many other busses and sub-busses there are on your system. Some cards have PCIe bridges and other things, and so whether or not you have one of them in your system (and where) can change how other bus numbers are assigned.
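If you want to see the full form with the domain included, 'lspci -D' will always show it; for the Radeon above, this looks something like:
; lspci -D -s 0a:00.0
0000:0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] [...]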
As covered in yesterday's entry on looking into PCIe slot topology, lspci will print the actual topology of your PCIe devices with 'lspci -tv'. Ever since I found out about this, I've wondered how to go from the topology to the PCIe addresses in plain 'lspci -v' output, and how I might verify the topology or decode it from 'lspci -vv' output. As it turns out, both are possible (and Matt's comment on yesterday's entry gave me a useful piece of information).
So let's start from 'lspci -tv' output, because it's more complex. The first line of 'lspci -tv' looks like this:
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
The portion in brackets is the domain and bus that everything under this point in the tree is on. This is domain 0000 and bus 00, which is generally the root of the PCIe topology. Going along, we get to my first NVMe drive:
+-01.1-[01]----00.0 Kingston Technology Company, Inc. Device 2263
This is not directly on bus 00; instead it is accessed through a device at 01.1 on this bus, which thus has the (abbreviated) PCIe address of 00:01.1. 'lspci -v' tells me that this is a PCIe bridge, as expected:
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [...]
Much like the []s of the root of the tree, the '[01]' bit after it in 'lspci -tv' means that all PCIe devices under this bridge are on bus 01, and there is only one of them, the NVMe drive, which will thus have the PCIe address 01:00.0:
01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 2263 [...]
The X370 chipset controller presents a more complex picture:
+-01.3-[02-09]--+-00.0 Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller
The actual bridge is at 00:01.3 (another PCIe GPP bridge), but it has multiple busses behind it, from 02 through 09. If you look at the PCIe topology in yesterday's entry, you can predict that there are three things directly on bus 02, which are actually all functions of a single device; 02:00.0 is a xHCI controller, 02:00.1 is a SATA controller, and 02:00.2 is a 'X370 Series Chipset PCIe Upstream Port'. Behind it are a series of 'Chipset PCIe Port' devices (all on bus 03), and behind them are the actual physical PCIe slots and some onboard devices (a USB 3.1 host controller and the Intel gigabit Ethernet port). Each of these gets their own PCIe bus, 04 through 09, so for example my onboard Ethernet is 08:00.0 (bus 08, device 00, function 0):
08:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
Now let's go the other way, from 'lspci -vv' output to what device is under what. As I showed above, my Radeon card is 0a:00.0. Its upstream device is a PCIe GPP bridge at 00:03.1. If we examine that GPP bridge in 'lspci -vv', we will see a 'Bus:' line:
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
[...]
    Bus: primary=00, secondary=0a, subordinate=0a, sec-latency=0
[...]
The primary bus is the 00: portion of its PCIe address, and the secondary bus is its direct downstream bus. So we would expect to find anything directly under this on bus 0a, which is indeed where the Radeon is.
A more interesting case is the 00:01.3 PCIe bridge to the X370 chipset. This reports:
Bus: primary=00, secondary=02, subordinate=09, sec-latency=0
What this appears to mean is that while this bridge's direct downstream bus is 02, somewhere below it are PCIe busses up to 09. I suspect that the PCIe specification requires that busses be assigned sequentially this way in order to make routing simpler.
If you have a device, such as my Radeon card at 0a:00.0, there is no way I can see in the device's verbose PCIe information to find what its parent is (end devices don't have a 'Bus:' line). You have to search through the 'lspci -vv' of other devices for something with a 'secondary=' of bus 0a. I think you'll pretty much always find this somewhere, since generally something has to have this as a direct downstream PCIe bus even if it's under a fan-out PCIe setup like the X370 chipset bridge on this machine.
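If you don't feel like eyeballing the output, a little awk can do the search for you; this is just a convenience one-liner that relies on 'lspci -vv' starting each device's block at the left margin:
# print the header line of whatever device has bus 0a as its secondary bus
lspci -vv | awk '/^[0-9a-f]/ { dev = $0 } /secondary=0a/ { print dev }'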
(Working backward this way instead of using 'lspci -tv' can be a useful way of reassuring yourself that you really do understand the topology. You may also want to look at the details of an upstream device to find out, for example, why your Radeon card appears to be running at PCIe 1.0 speeds. I haven't solved that mystery yet, partly because I've been too busy to reboot my office machine to get it into the BIOS.)
2019-12-05
Looking into your system's PCIe slot topology and PCIe lane count under Linux
Suppose, not hypothetically, that you want to understand the PCIe slot and bus topology of your Linux system and also work out how many PCIe lanes important things have been given. Sometimes the PCIe lane count doesn't matter, but things like NVMe drives may have lower latency under load and perform better if they get their full number of PCIe lanes (generally four lanes, 'x4'). You can do this under Linux, but it's not as straightforward as you'd like.
The easiest way to see PCIe topology is with 'lspci -tv', which shows you how PCIe slots relate to each other and also what's in them. On my office machine with the first configuration of PCIe to NVMe adapter card, this looked like the following:
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2 [...]
           +-01.0 [...]
           +-01.1-[01]----00.0 Kingston Technology Company, Inc. Device 2263
           +-01.3-[02-09]--+-00.0 Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller
           |               +-00.1 Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller
           |               \-00.2-[03-09]--+-00.0-[04]----00.0 Kingston Technology Company, Inc. Device 2263
           |                               +-02.0-[05]--
           |                               +-03.0-[06]----00.0 Intel Corporation 82572EI Gigabit Ethernet Controller (Copper)
           |                               +-04.0-[07]----00.0 ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller
           |                               +-06.0-[08]----00.0 Intel Corporation I211 Gigabit Network Connection
           |                               \-07.0-[09]--
           +-02.0 [...]
           +-03.0 [...]
           +-03.1-[0a]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
           |            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
           +-04.0 [...]
           [...]
(By the way, the PCIe numbers shown here have no resemblance to the numbers in the output of plain 'lspci'. Also, I have left out the text for a lot of 'PCIe Dummy Host Bridge' devices and a couple of others.)
The two Kingston PCIe devices are the two NVMe drives. The first one listed is in the motherboard M.2 slot; the second one listed is on its adapter card in a PCIe x16 @ x4 slot driven by the X370 chipset instead of directly by the CPU. Although it's not obvious from the topology listing, one of the Intel Ethernets is on the motherboard (but driven through the X370 chipset) and the other is a card in a PCIe x1 slot.
Now here's the same machine with the PCIe adapter for the second NVMe drive moved to a PCIe x16 @ x8 GPU card slot:
-[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
           +-00.2 [...]
           +-01.0 [...]
           +-01.1-[01]----00.0 Kingston Technology Company, Inc. Device 2263
           +-01.3-[02-09]--+-00.0 Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller
           |               +-00.1 Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller
           |               \-00.2-[03-09]--+-00.0-[04]--
           |                               +-02.0-[05]--
           |                               +-03.0-[06]----00.0 Intel Corporation 82572EI Gigabit Ethernet Controller (Copper)
           |                               +-04.0-[07]----00.0 ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller
           |                               +-06.0-[08]----00.0 Intel Corporation I211 Gigabit Network Connection
           |                               \-07.0-[09]--
           +-02.0 [...]
           +-03.0 [...]
           +-03.1-[0a]--+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X]
           |            \-00.1 Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
           +-03.2-[0b]----00.0 Kingston Technology Company, Inc. Device 2263
           +-04.0 [...]
           [...]
It's pretty clear that the second NVMe drive has shifted its position in the topology fairly significantly, and given that we have '03.1' and '03.2' labels for the Radeon GPU and the NVMe drive, it's not hard to guess that it's now in one of a pair of GPU slots. It's also clearly no longer under the X370 chipset, but instead is now on the same (PCIe) level as the first NVMe drive.
Working out how many PCIe lanes a card wants and how many it's actually getting is harder and more annoying. As far as I know, the best way of finding it out is to look carefully through the output of 'lspci -vv' for the device you're interested in and focus on the LnkCap and LnkSta portions, which will list what the card is capable of and what it actually got. For example, for my PCIe to NVMe adapter card in its first location (the first topology), where it was choked down from PCIe x4 to x2, this looks like the following:
04:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 2263 (rev 03) (prog-if 02 [NVM Express])
[...]
    LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
[...]
    LnkSta: Speed 5GT/s (downgraded), Width x2 (downgraded)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[...]
This says that it could do PCIe x4 at 8GT/s (which is PCIe 3.0) but
was downgraded to PCIe x2 at 5GT/s (which is PCIe 2.0). In the PCIe
adapter's new location in a GPU card slot where the NVMe drive could
get full speed, the LnkCap
didn't change but the LnkSta
changed
to:
    LnkSta: Speed 8GT/s (ok), Width x4 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
So now the NVMe drive is getting x4 PCIe 3.0.
In general, searching 'lspci -vv' output for 'downgraded' can be interesting. For example, on my office machine, the Radeon GPU reports the following in both PCIe topologies:
0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) (prog-if 00 [VGA controller])
[...]
    LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
[...]
    LnkSta: Speed 2.5GT/s (downgraded), Width x8 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[...]
I have no idea why the Radeon GPU has apparently been downgraded from PCIe 3.0 all the way to what looks like PCIe 1.0, although it may be because it's probably behind a 'PCIe GPP Bridge' that has also been downgraded to 2.5 GT/s. Perhaps I've missed some BIOS setting that is affecting things (BIOS settings can apparently influence this part of PCIe, as they do many bits of it). For my relatively basic X usage, this downgrade may not matter, but having noticed it I'm now curious and somewhat irritated. If my card can do PCIe 3.0, I want to be getting PCIe 3.0 if at all possible.
(Additional reading on this PCIe stuff under Linux includes here, here, and here. As usual, I'm writing this down so that I have it for reference the next time I need to poke around in this area.)