Wandering Thoughts

2020-01-15

Stopping udev from renaming your VLAN interfaces to bad names

Back in late December I wrote about Why udev may be trying to rename your VLAN interfaces to bad names, where modern versions of udev try to rename VLAN devices from the arbitrary names you give them to the base name of the network device they're on. Since the base name is already taken, this fails.

There turns out to be a simple cause and workaround for this, at least in my configuration, courtesy of Zbigniew Jędrzejewski-Szmek. In Fedora, all I need to do is add 'NamePolicy=keep' to the [Link] section of my .link file. This makes my .link file:

[Match]
MACAddress=60:45:cb:a0:e8:dd

[Link]
Description=Onboard port
MACAddressPolicy=persistent
Name=em0
# Stop VLAN renaming
NamePolicy=keep

Setting 'NamePolicy=keep' doesn't keep the actual network device from being renamed from the kernel's original name for it to 'em0', but it makes udev leave the VLAN devices alone. In turn this means udev and systemd consider them to have been successfully created, so you get the usual systemd sys-subsystem-net-devices .device units for them showing up as fully up.

In a way, 'NamePolicy=keep' in a .link file is an indirect way for me to tell apart real network hardware from created virtual devices that share the same MAC, or at least ones created through networkd. As covered in the systemd.netdev manpage, giving a name to your virtual device is mandatory (Name= is a required field), so I think such devices will always be considered to already have a name by udev.

(This was a change in systemd-241, apparently. It changes the semantics of existing .link files in a way that's subtly not backward compatible, but such is the systemd way.)

However, I suspect that things might be different if I didn't use 'biosdevname=0' in my kernel command line parameters. These days this is implemented in udev, so allowing udev to rename your network devices from the kernel assigned names to the consistent network device naming scheme may be considered a rename for the purposes of 'NamePolicy=keep'. That would leave me with the same problem of telling real hardware apart from virtual hardware that I had in the original entry.

For actual matching against physical hardware, though, I suspect that you can also generally use a Property= match on selected udev properties (as suggested by Alex Xu in the comments on the original entry). For instance, most people's network devices are on PCI busses, so:

Property=ID_BUS=pci

There's a whole variety of properties that real network hardware has and VLANs don't (based on 'udevadm info' output), although I don't know about other types of virtual network devices. It does seem pretty safe to assume that no virtual network device will claim to be on a PCI bus, though.
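
If I was going to try this, my guess is that the [Match] section would wind up looking something like the following, with the Property= line restricting the match to the real PCI hardware (the MAC is mine; yours will differ):

[Match]
MACAddress=60:45:cb:a0:e8:dd
Property=ID_BUS=pci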

(I haven't tested the Property= approach, since 'NamePolicy=keep' is sufficient in my case.)

UdevNetworkdVLANLinkMatchingII written at 00:42:43

2020-01-10

Fedora 31 has decided to allow (and have) giant process IDs (PIDs)

Every new process and thread on Linux gets a new PID (short for process ID). PIDs are normally assigned sequentially until they hit some maximum value and roll over. The traditional maximum PID value on Unixes has been some number related to a 16-bit integer, either signed or unsigned, and Linux is no exception; the kernel default is generally still 32768 (which is 2^15 exactly, and so not quite authentic to a signed 16-bit int).

(You can find the current limit in /proc/sys/kernel/pid_max, but it may have been increased through sysctls.)
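
Checking the current setting is just a matter of reading the sysctl, for example:

cat /proc/sys/kernel/pid_max
sysctl kernel.pid_max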

A few years ago I discovered that a Fedora package had raised this limit on me, which I was able to see because it turned out that my Fedora machines routinely go through a lot of PIDs. I reverted this by removing the package for various reasons, including that I don't really like gigantic process IDs (they bulk up the output of ps, top, and other similar tools). Then recently I updated to Fedora 31, and not too long afterward noticed that I was getting giant process IDs again (as I write this a new shell on one machine gets PID 4,085,915).

This turns out to be a deliberate choice in modern versions of systemd, instead of another stray package deciding it knows best. In Fedora 31 (with systemd 243), /usr/lib/sysctl.d/50-pid-max.conf says:

# Bump the numeric PID range to its maximum of 2^22
# (from the in-kernel default of 2^16), to make PID
# collisions less likely.
kernel.pid_max = 4194304

(Since the PID that new processes get is so close to the maximum, I suspect that I have actually rolled over even this large range a couple of times in the 21 days that this machine has been up since the last time I got around to a kernel update.)

Given that this is a new official systemd thing, I'm going to let it be and live with gigantic PIDs. It's not really worth fighting systemd; it generally doesn't end well for me.
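
If you did want to fight it, my understanding of sysctl.d is that a later-sorting file overrides earlier ones, so a local override along these lines ought to put the old limit back (the file name is one I made up, and I haven't actually tried this):

# /etc/sysctl.d/99-local-pid-max.conf
kernel.pid_max = 32768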

(Hopefully there aren't any programs on the system that assume PIDs are small and always fit into five-character fields in ASCII. Or at least no programs that will fail when this assumption is incorrect, as opposed to producing ugly output.)

Fedora31GiantPids written at 00:52:15

2020-01-07

eBPF based tools are still a work in progress on common Linuxes

These days, it seems that a lot of people are talking about and praising eBPF and tools like BCC and bpftrace, and you can read about places such as Facebook and Netflix routinely using dozens of eBPF programs on systems in production. All of these are true, by and large; if you have the expertise and are in the right environment, eBPF can do great things. Unfortunately, though, a number of things get in the way of more ordinary sysadmins being able to use eBPF and these eBPF tools for powerful things.

The first problem is that these tools (and especially good versions of these tools) are fairly recent, which means that they're not necessarily packaged in your Linux distribution. For instance, Ubuntu 18.04 doesn't package bpftrace or anything other than a pretty old version of BCC. You can add third party repositories to your Ubuntu system to (try to) fix this, but that comes with various sorts of maintenance problems and anyway a fair number of nice eBPF features also require somewhat modern kernels. Ubuntu LTS's standard server kernel doesn't necessarily qualify. The practical result is that eBPF is off the table for us until 20.04 or later, unless we have a serious enough problem that we get desperate.

(Certainly we're very unlikely to try to use eBPF on 18.04 for the kinds of routine monitoring and so on that Facebook, Netflix, and so on use it for.)

Even on distributions with recent packages, such as Fedora, you can run into issues where people working in the eBPF world assume you're in a very current environment. The Cloudflare ebpf_exporter (also) is a great way to get things like local disk latency histograms into Prometheus, but the current code base assumes you're using a version of BCC that was released only in October. That's a bit recent, even for Fedora.

(The ebpf_exporter does have pre-built release binaries available, so that's something.)

Then there's the fact that sometimes all of this is held together with unreliable glue, because it's not really designed to all work together. Fedora has just updated Fedora 31 to a 5.4.x kernel, and now all BCC programs (including the examples) fail to compile with a stream of "error: expected '(' after 'asm'" errors for various bits of the 5.4 kernel headers. Based on some Internet reading, this is apparently a sign of clang attempting to interpret inline assembly that was written for gcc (which is what the Linux kernel is compiled with). Probably this will get fixed at some point, but for now Fedora people get to choose either 5.4 or BCC, not both.

(bpftrace still works on the Fedora 5.4 kernel, at least in light testing.)
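
A canonical bpftrace one-liner along these lines (it traces what files programs are opening) is enough to check whether the basic bpftrace machinery works at all:

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'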

Finally, there's the general problem (shared with DTrace on Solaris and Illumos) that a fair number of the things you might be interested in require hooking directly into the kernel code and the Linux kernel code famously can change all the time. My impression is that eBPF is slowly getting more stable tracepoints over time, but also that a lot of the time you're still directly attaching eBPF hooks to kernel functions.

In time, all of this will settle down. Both eBPF and the eBPF tools will stabilize, current enough versions of everything will be in all common Linux distributions, even the long term support versions, and the kernel will have stable tracepoints and so on that cover most of what you need. But that's not really the state of things today, and it probably won't be for at least a few years to come (and don't even ask about Red Hat Enterprise 7 and 8, which will be around for years to come in some places).

(This more or less elaborates on a tweet of mine.)

EBPFStillInProgress written at 00:03:38

2019-12-25

Why udev may be trying to rename your VLAN interfaces to bad names

When I updated my office workstation to Fedora 30 back in August, I ran into a little issue:

It has been '0' days since systemd/udev blew up my networking. Fedora 30 systemd/udev attempts to rename VLAN devices to the interface's base name and fails spectacularly, causing the sys-subsystem*.device units to not be present. We hope you didn't depend on them! (I did.)

I filed this as Fedora bug #1741678, and just today I got a clue so that now I think I know why this happens.

The symptom of this problem is that during boot, your system will log things like:

systemd-udevd[914]: em-net5: Failed to rename network interface 4 from 'em-net5' to 'em0': Device or resource busy

As you might guess from the name I've given it here, em-net5 is a VLAN on em0. The name 'em0' itself is one that I assigned, because I don't like the network names that systemd-udevd would assign if left on its own (they are what I would call ugly, or at least tangled and long). The failure here prevents systemd from creating the sys-subsystem-net-devices-em-net5.device unit that it normally would (and then this had further consequences because of systemd's lack of good support for networks being ready).

I use networkd with static networking, so I set up the em0 name through a networkd .link file (as covered here). This looks like:

[Match]
MACAddress=60:45:cb:a0:e8:dd

[Link]
Description=Onboard port
MACAddressPolicy=persistent
Name=em0

Based on what 'udevadm test' reports, it appears that when udevd is configuring the em-net5 VLAN, it (still) matches this .link file for the underlying device and applies things from it. My guess is that this is happening because VLANs and their underlying physical interfaces normally share MACs, and so the VLAN MAC matches the MAC here.
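
You can see this for yourself with something like the following; the grep is just to cut down udevadm's very verbose output, and you'd substitute your own VLAN's name:

udevadm test /sys/class/net/em-net5 2>&1 | grep -i '\.link'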

This appears to be a behavior change in the version of udev shipped in Fedora 30. Before Fedora 30, systemd-udevd and networkd did not match VLAN MACs against .link files; from Fedora 30 onward, it appears that they do. To stop this, you presumably need to limit your .link files to matching only physical interfaces, not VLANs, but unfortunately this seems difficult to do. The systemd.link manpage documents a 'Type=' match, but while VLANs have a type that can be used for this, native interfaces do not appear to (and there doesn't seem to be a way to negate the match). There are various hacks that could be committed here, but all of them are somewhat unpleasant to me (such as specifying the kernel driver; if the kernel's opinion of what driver to use for this hardware changes, I am up a creek again).
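
For illustration, the kernel driver hack would look something like this. The driver name here is a stand-in; you can find the real one for your hardware with 'ethtool -i' or 'udevadm info':

[Match]
MACAddress=60:45:cb:a0:e8:dd
# 'igb' is a placeholder; use your hardware's actual driver.
Driver=igb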

UdevNetworkdVLANLinkMatching written at 01:47:33

2019-12-20

My new Linux office workstation disk partitioning for the end of 2019

I've just had the rare opportunity to replace all of my office machine's disks at once, without having to carry over any of the previous generation the way I've usually had to. As part of replacing everything I got the chance to redo the partitioning and setup of all of my disks, again all at once without the need to integrate a mix of the future and the past. For various reasons, I want to write down the partitioning and filesystem setup I decided on.

My office machine's new set of disks are a pair of 500 GB NVMe drives and a pair of 2 TB SATA SSDs. I'm using GPT partitioning on all four drives for various reasons. All four drives start with my standard two little partitions, a 256 MB EFI System Partition (ESP, gdisk code EF00) and a 1 MB BIOS boot partition (gdisk code EF02). I don't currently use either of them (my past attempt to switch from MBR booting to UEFI was a failure), but they're cheap insurance for the future. Similarly, putting these partitions on all four drives instead of just my 'system' drives is more cheap insurance.
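
For reference, creating these two little partitions non-interactively with sgdisk would go roughly like this (the device name is just an example):

sgdisk -n 1:0:+256M -t 1:EF00 -c 1:"EFI system partition" /dev/nvme0n1
sgdisk -n 2:0:+1M -t 2:EF02 -c 2:"BIOS boot partition" /dev/nvme0n1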

(Writing this down has made me realize that I didn't format the ESPs. Although I don't use UEFI for booting, I have in the past put updated BIOS firmware images there in order to update the BIOS.)

The two NVMe drives are my 'system' drives. They have three additional partitions: a 70 GB partition used for a Linux software RAID mirror of the root filesystem (including /usr and /var, since I put all of the system into one filesystem), a 1 GB partition that is a Linux software RAID mirrored swap partition, and the remaining 394.5 GB as a mirrored ZFS pool that holds filesystems that I want to be as fast as possible and that I can be confident won't grow to be too large. Right now that's my home directory filesystem and the filesystem that holds source code (where I build Firefox, Go, and ZFS on Linux, for example).

The two SATA SSDs are my 'data' drives, holding various larger but less important things. They have two 70 GB partitions that are Linux software RAID mirrors, and the remaining space is in a single partition for another mirrored ZFS pool. One of the two 70 GB partitions is so that I can make backup copies of my root filesystem before upgrading Fedora (if I bother to do so); the other is essentially an 'overflow' filesystem for some data that I want on an ext4 filesystem instead of in a ZFS pool (including a backup copy of all recent versions of ZFS on Linux that I've installed on my machine, so that if I update and the very latest version has a problem, I can immediately reinstall a previous one). The ZFS pool on the SSDs contains larger and generally less important things like my VMWare virtual machine images and the ISOs I use to install them, and archived data.

Both ZFS pools are set up following my historical ZFS on Linux practice, where they use the /dev/disk/by-id names for my disks instead of the sdX and nvme... names. Both pools are actually relatively old; I didn't create new pools for this and migrate my data, but instead just attached new mirrors to the old pools and then detached the old drives (more or less). The root filesystem was similarly migrated from my old SSDs by attaching and removing software RAID mirrors; the other Linux software RAID filesystems are newly made and copied through ext4 dump and restore (and the new software RAID arrays were added to /etc/mdadm.conf more or less by hand).
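
The software RAID side of that sort of migration is standard mdadm work; a sketch with made-up device names is to grow the mirror to three devices, let it resync, and then drop the old device and shrink back down:

mdadm /dev/md0 --add /dev/nvme0n1p3
mdadm --grow /dev/md0 --raid-devices=3
# after the resync finishes:
mdadm /dev/md0 --fail /dev/sda3 --remove /dev/sda3
mdadm --grow /dev/md0 --raid-devices=2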

(Since I just looked it up, the ZFS pool on the SATA SSDs was created in August of 2014, originally on HDs, and the pool on the NVMe drives was created in January of 2016, originally on my first pair of (smaller) SSDs.)

Following my old guide to RAID superblock formats, I continued to use the version 1.0 format for everything except the new swap partition, where I used the version 1.2 format. By this point using 1.0 is probably superstition; if I have serious problems (for example), I'm likely to just boot from a Fedora USB live image instead of trying anything more complicated.

All of this feels very straightforward and predictable by now. I've moved away from complex partitioning schemes over time and almost all of the complexity left is simply that I have two different sets of disks with different characteristics, and I want some filesystems to be fast more than others. I would like all of my filesystems to be on NVMe drives, but I'm not likely to have NVMe drives that big for years to come.

(The most tangled bit is the 70 GB software RAID array reserved for a backup copy of my root filesystem during major upgrades, but in practice it's been quite a while since I bothered to use it. Still, having it available is cheap insurance in case I decide I want to do that someday during an especially risky Fedora upgrade.)

WorkMachinePartitioning2019 written at 23:52:22

Splitting a mirrored ZFS pool in ZFS on Linux

Suppose, not hypothetically, that you're replacing a pair of old disks with a pair of new disks in a ZFS pool that uses mirrors. If you're a cautious person and you worry about issues like infant mortality in your new drives, you don't necessarily want to immediately switch from the old disks to the new ones; you want to run them in parallel for at least a bit of time. ZFS makes this very easy, since it supports up to four way mirrors and you can just attach devices to add extra mirrors (and then detach devices later). Eventually it will come time to stop using the old disks, and at this point you have a choice of what to do.

The straightforward thing is to drop the old disks out of the ZFS mirror vdev with 'zpool detach', which cleanly removes them (and they won't come back later, unlike with Linux software RAID). However this is a little bit wasteful, in a sense. Those old disks have a perfectly good backup copy of your ZFS pool on them, but when you detach them you lose any real possibility of using that copy. Perhaps you would like to keep that data as an actual backup copy, just in case. Modern versions of ZFS can do this through splitting the pool with 'zpool split'.

To quote the manpage here:

Splits devices off pool creating newpool. All vdevs in pool must be mirrors and the pool must not be in the process of resilvering. At the time of the split, newpool will be a replica of pool. [...]

In theory the manpage's description suggests that you can split a four-way mirror vdev in half, pulling off two devices at once in a 'zpool split' operation. In practice it appears that the current 0.8.x version of ZFS on Linux can only split off a single device from each mirror vdev. This meant that I needed to split my pool in a multi-step operation.

Let's start with a pool, maindata, with four disks in a single mirrored vdev, oldA, oldB, newC, and newD. We want to split maindata so that there is a new pool with oldA and oldB. First, we split one old device out of the pool:

zpool split -R /mnt maindata maindata-hds oldA

Normally the just-split-off new pool is not imported (as far as I know), and you certainly don't want it imported if your filesystems have explicit 'mountpoint' settings (because then filesystems from the original and the split-off pool will fight over who gets to be mounted there). However, you can't add devices to exported pools and we need to add oldB, so we have to import the new pool under an altroot. I use /mnt here out of tradition, but you can use any convenient empty directory.

With the pool split off, we need to detach oldB from the regular pool and attach it to oldA in the new pool to make the new pool actually be mirrored:

zpool detach maindata oldB
zpool attach maindata-hds oldA oldB

This will then resilver the maindata-hds new pool on to oldB (even though oldB has an almost exact copy already). Once the resilver is done, you can export the pool:

zpool export maindata-hds

You now have your mirrored backup copy sitting around with relatively little work on your part.

All of this appears to have worked completely fine for me. I scrubbed my maindata pool before splitting it, just in case, but I don't think I bothered to scrub the maindata-hds new pool after the resilver. It's only an emergency backup pool anyway (and it gets less and less useful over time, since there are more divergences between it and the live pool).
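
If I ever do want to look at the backup pool, the safe way should be to import it read-only under an altroot again, something like:

zpool import -o readonly=on -R /mnt maindata-hds
# look around, then:
zpool export maindata-hds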

PS: I don't know if you can make snapshots, split a pool, and then do incremental ZFS sends from filesystems in one copy of the pool to the other to keep your backup copy more or less up to date. I wouldn't be surprised if it worked, but I also wouldn't be surprised if it didn't.
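
If it does work, I'd expect it to look roughly like the following for a single filesystem, with made-up filesystem and snapshot names, both sides already having the earlier snapshot, and maindata-hds imported at the time (again, I haven't tried this):

zfs snapshot maindata/homes@sync2
zfs send -i @sync1 maindata/homes@sync2 | zfs receive maindata-hds/homes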

ZFSSplitPoolExperience written at 00:33:45

2019-12-18

Linux kernel Security Modules (LSMs) need their own errno value

Over on Twitter, I said something I've said before:

Once again, here I am hating how Linux introduced additional kernel security modules without also adding an errno for 'the loadable security module denied permissions'.

Lack of a LSM errno significantly complicates debugging problems, especially if you don't normally use LSMs.

Naturally there's a sysadmin story here, but let's start with the background (even if you probably know it).

SELinux and Ubuntu's AppArmor are examples of Linux Security Modules; each of them adds additional permission checks that you must pass over and above the normal Unix permissions. However, when they reject your access, they don't actually tell you this specifically; instead you get the generic Unix error of EPERM, 'operation not permitted', which is normally what you get if, say, the file is unreadable to your UID for some reason.

We have an internal primary master DNS server for our DNS zones (a so called 'stealth master'), which runs Ubuntu instead of OpenBSD for various reasons. We have the winter holiday break coming up, and since we've had problems with this server coming up cleanly in the past, last week seemed like a good time to reboot it under controlled circumstances to make sure that at least that worked. When I did that, named (aka Bind) refused to start with a 'permission denied' error (aka EPERM) when it tried to read its named.conf configuration file. For reasons beyond the scope of this entry, this file lives on our central administrative NFS filesystem, and when you throw NFS into the picture various things can go wrong with access permissions. So I spent some time looking at file and directory permissions, NFS mount state, and so on, until I remembered something my co-worker had mentioned in passing.

Ubuntu defaults to installing and using AppArmor, but we don't like it and we turn it off almost everywhere (we can't avoid it for MySQL, although we can make it harmless). That morning we had applied the pending Ubuntu package updates, as one does, and one of the packages that got updated was the AppArmor package. It turns out that in our environment, when an AppArmor package update is applied, AppArmor gets re-enabled (but I think not started immediately); so when I rebooted our primary DNS master, AppArmor got started. AppArmor has a profile for Bind that only allows for a configuration file in the standard place, not where we put our completely different and customized one, and so when Bind tried to read our named.conf, the AppArmor LSM said 'no'. But that 'no' was surfaced only as an EPERM error, and so I went chasing down the rabbit hole of all of the normal causes for permission errors.
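
In hindsight, the fastest way to see what's really going on is to look for the kernel's AppArmor audit messages and check AppArmor's status; assuming the denials end up in the kernel log (they do unless auditd is diverting them), something like this would have pointed straight at the culprit:

journalctl -k | grep -i 'apparmor="DENIED"'
aa-status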

People who deal with LSMs all of the time will probably be familiar with this issue and will immediately move to the theory that any unfamiliar and mysterious permission denials are potentially the LSM in action. But we don't use LSMs normally, so every time one enables itself and gets in our way, we have to learn all about this all over again. The process of troubleshooting would be much easier if the LSM actually told us that it was doing things by having a new errno value for 'LSM permission denied', because then we'd know right away what was going on.

(If Linux kernel people are worried about some combination of security concerns and backward compatibility, I would be happy if they made this extra errno value an opt-in thing that you had to turn on with a sysctl. We would promptly enable it for all of our servers.)

PS: Even if we didn't have our named.conf on a NFS filesystem, we probably wouldn't want to overwrite the standard version with our own. It's usually cleaner to build your own completely separate configuration file and configuration area, so that you don't have to worry about package updates doing anything to your setup.

ErrnoForLSMs written at 23:59:19

2019-12-13

Working out which of your NVMe drives is in what slot under Linux

One of the perennial problems with machines that have multiple drives is figuring out which of your physical drives is sda, which is sdb, and so on; the mirror image of this problem is arranging things so that the drive you want to be the boot drive actually is the first drive. In sanely made server hardware this is generally relatively easy, but with desktops you can run into all sorts of problems, such as motherboards that wire things up oddly. In some situations, NVMe drives make this easier than SATA drives do, because NVMe drives are PCIe devices and so have distinct PCIe bus addresses and possibly distinct PCIe bus topologies.

First off, I will admit something. The gold standard for doing this reliably under all circumstances is to record the serial numbers of your NVMe drives before you put them into your system and then use 'smartctl -i /dev/nvme0n1' to find each drive from its serial number. It's always possible for a motherboard with multiple M.2 slots to do perverse things with its wiring and PCIe bus layout, so that what it labels as the first and perhaps best M.2 slot is actually the second NVMe drive as Linux sees it. But I think that generally it's pretty likely that the first M.2 slot will be earlier in PCIe enumeration than the second one (if there is a second one). And if you have only one M.2 slot on the motherboard and are using a PCIe to NVMe adapter card for your second NVMe drive, the PCIe bus topology of the two NVMe drives is almost certain to be visibly different.
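
Gathering the serial numbers is easily done with a little loop over the NVMe devices (plain shell, nothing clever):

for d in /dev/nvme?n1; do
  echo "$d:"
  smartctl -i "$d" | grep -i 'serial number'
done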

All of this raises the question of how you get the PCIe bus address of a particular NVMe drive. We can do this by using /sys, because Linux makes your PCIe devices and topology visible in sysfs. In specific, every NVMe device appears as a symlink in /sys/block that gives you the path to its PCIe node (and in fact the full topology). So on my office machine in its current NVMe setup, I have:

; readlink nvme0n1
../devices/pci0000:00/0000:00:03.2/0000:0b:00.0/[...]
; readlink nvme1n1
../devices/pci0000:00/0000:00:01.1/0000:01:00.0/[...]

This order on my machine gives me a surprise, because the two NVMe drives are not in the order I expected. In fact they're apparently not in the order that the kernel initially detected them in, as a look into 'dmesg' reports:

nvme nvme0: pci function 0000:01:00.0
nvme nvme1: pci function 0000:0b:00.0

This is the enumeration order I expected, with the motherboard M.2 slot at 01:00.0 detected before the adapter card at 0b:00.0 (for more on my current PCIe topology, see this entry). Indeed the original order appears to be preserved in bits of sysfs, with path components like nvme/nvme0/nvme1n1 and nvme/nvme1/nvme0n1. Perhaps the kernel assigned actual nvmeXn1 names backward, or perhaps udev renamed my disks for reasons known only to itself.

(But at least now I know which drive to pull if I have trouble with nvme1n1. On the other hand, I'm now doubting the latency numbers that I previously took as a sign that the NVMe drive on the adapter card was slower than the one in the M.2 slot, because I assumed that nvme1n1 was the adapter card drive.)

Once you have the PCIe bus address of a NVMe drive, you can look for additional clues as to what physical M.2 slot or PCIe slot that drive is in beyond just how this fits into your PCIe bus topology. For example, some motherboards (including my home machine) may wind up running the 'second' M.2 slot at x2 instead of x4 under some circumstances, so if you can find one NVMe drive running at x2 instead of x4, you have a strong clue as to which is which (assuming that your NVMe drives are x4 drives). You can also have a PCIe slot be forced to x2 for other reasons, such as motherboards where some slots share lanes and bandwidth. I believe that the primary M.2 slot on most motherboards always gets x4 and is never downgraded (except perhaps if you ask the BIOS to do so).
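
You can check the negotiated link width of a particular device by looking at the 'LnkSta' line in lspci's verbose output (run as root so lspci can read the PCIe capabilities). For example, for the NVMe drive at 0b:00.0 here:

lspci -vv -s 0b:00.0 | grep -i 'lnksta:'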

You can also get the same PCIe bus address information (and then a lot more) through udevadm, as noted by a commentator on yesterday's entry; 'udevadm info /sys/block/nvme0n1' will give you all of the information that udev keeps. This doesn't seem to include any explicit information on whether the device was renamed, but it does include the kernel's assigned minor number and on my machine, nvme0n1 has minor number 1 while nvme1n1 has minor number 0, which suggests that it was assigned first.
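
Pulling out just the interesting bits is a matter of filtering udevadm's output, for example:

udevadm info /sys/block/nvme0n1 | grep -E 'MINOR|DEVPATH'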

(It would be nice if udev would log somewhere when it renames a device.)

PS: Looking at the PCIe bus addresses associated with SATA drives usually doesn't help, because most of the time all of your SATA drives are attached to the same PCIe device.

MappingNVMeDrives written at 00:41:42

2019-12-12

Linux makes your PCIe topology visible in sysfs (/sys)

Getting some NVMe drives for my office machine has been an ongoing education into many areas of PCIe, including how to see your PCIe topology with lspci and understanding how PCIe bus addresses and topology relate to each other. Today I coincidentally discovered that there is another way to look into your system's PCIe topology, because it turns out that the Linux kernel materializes it as a directory hierarchy in the sysfs filesystem that is usually mounted on /sys.

Generally, the root of the PCI(e) bus hierarchy is going to be found at /sys/devices/pci0000:00. Given an understanding of PCIe addresses, we can see that 0000:00 is the usual domain and starting PCIe bus number. In this directory are a whole bunch of subdirectories, named after the full PCIe bus address of each device, so you get directories named things like '0000:00:03.2'. If you take off the leading '0000:', this corresponds to what 'lspci -v' will report as device 00:03.2. For PCIe devices that act as bridges, there will be subdirectories for the PCIe devices behind the bridge, with the full PCIe address of those devices. So in my office machine's current PCIe topology, there is a '0000:0b:00.0' subdirectory in the 0000:00:03.2 directory, which is my second NVMe drive behind the 00:03.2 PCIe bridge.

(And behind 0000:00:03.1 is my Radeon graphics card, which actually has two exposed PCIe functions; 0000:0a:00.0 is the video side, while 0000:0a:00.1 is 'HDMI/DP Audio'.)

There are a number of ways to use this /sys information, some of which are for future entries. The most obvious use is to confirm your understanding of the topology and implied PCIe bus addresses that 'lspci -tv' reports. If the /sys directory hierarchy matches your understanding of the output, you have it right. If it doesn't, something is going on.

The other use is a brute force way of finding out what the topology of a particular final PCIe device is, by simply finding it in the hierarchy with 'find /sys/devices/pci0000:00 -name ..', where the name is its full bus address (with the 0000: on the front). So, for example, if we know we have an Ethernet device at 06:00.0, we can find where it is in the topology with:

; cd /sys/devices/pci0000:00
; find . -type d -name 0000:06:00.0 -print
./0000:00:01.3/0000:02:00.2/0000:03:03.0/0000:06:00.0

(Using '-type d' avoids having to filter out some symlinks for the PCIe node in various contexts; in this case it shows up as '0000:00:00.2/iommu/ivhd0/devices/0000:06:00.0'.)

This shows us the path through the PCIe topology from the root, through 00:01.3, then 02:00.2, then finally 03:03.0. This complex path is because this is a device hanging off the AMD X370 chipset instead of off of the CPU, although not all chipset attached PCIe devices will have such a long topology.

Until I looked at the lspci manpage more carefully, I was going to say that this was the easiest way to go from a PCIe bus address to the full path to the device with all of the PCIe bus addresses involved. However, it turns out that in sufficiently modern versions of lspci, 'lspci -PP' will report the same information in a shorter and more readable way:

; lspci -PP -s 06:00.0
00:01.3/02:00.2/03:03.0/06:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)

Unfortunately the version of lspci on our Ubuntu 18.04 machines is not sufficiently modern; on those machines, a find remains the easiest way. You can do it from the output of either 'lspci -tv' or 'lspci -v', as described in an earlier entry, but you have to do some manual work to reconstruct all of the PCIe bus addresses involved.
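
Another shortcut that should work anywhere, since it relies only on sysfs, is to read the symlink for the device under /sys/bus/pci/devices; its target spells out the full path through the topology:

readlink /sys/bus/pci/devices/0000:06:00.0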

PCIeTopologyInSysfs written at 00:22:48

2019-12-11

Fedora is not a good choice if long term stability and usability are a high priority

Every so often I hear about people running servers or infrastructure that they care about on Fedora, and my eyebrows almost always go up. I like Fedora and run it by choice on all of my desktops and my work laptop, but I'm the sole user on these machines, I know what I'm getting into, and I'm willing to deal with the periodic disruptions that Fedora delivers. Fedora is a good Linux distribution on the whole, but it is what I would call a 'forward looking' distribution; it is one that is not that much interested in maintaining backward compatibility if that conflicts with making the correct choice for right now, the choice that you'd make if you were starting from scratch. The result is that every so often, Fedora will unapologetically kick something out from underneath long term users and you get to fix your setup to deal with the new state of affairs.

All of this sounds very theoretical, so let me make it quite concrete with my tweet:

Today I learned that in Fedora 31, /usr/bin/python is Python 3 instead of Python 2. I hope Ubuntu doesn't do that, because if it does our users are going to kill us.

I learned this because I've recently upgraded my work laptop to Fedora 31 and on it I run a number of Python based programs that started with '#!/usr/bin/python'. Before the upgrade, that was Python 2 and all of those programs worked. After the upgrade, as I found out today, that was Python 3 and many of the programs didn't work.

(Fedora provides no way to control this behavior as far as I can tell. What /usr/bin/python points to is not controlled through Fedora's alternatives system; instead it's a symlink that's directly supplied by a package, and there's no version of the package that provides a symlink pointing to Python 2.)
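
You can see the situation for yourself easily enough, since /usr/bin/python is just a symlink owned by a package:

ls -l /usr/bin/python
rpm -qf /usr/bin/python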

This is fine for me. It's my own machine, I know what changed on it recently, I don't have to support a mixed base of older and newer Fedora machines, and I'm willing to put the pieces back together. At work, we've been running a Linux environment for fifteen years or so now, we have somewhere around a thousand users, we have to run a mixed base of distribution versions, and some of those users will have programs that start with '#!/usr/bin/python', possibly programs they've even forgotten about because they've been running quietly for so long. This sort of change would cause huge problems for them and thus for us.

Fedora's decision here is not wrong, for Fedora, but it is a very Fedora decision. If you were doing a distribution from scratch for today, with no history behind it at all, /usr/bin/python pointing to Python 3 is a perfectly rational and good choice. Making that decision in a distribution with history is choosing one set of priorities over another; it is prioritizing the 'correct' and modern choice over not breaking existing setups and not making people using your distribution do extra work.

I think it's useful to have Linux distributions that prioritize this way, and I don't mind it in the distribution that I use. But I know what I'm getting into when I choose Fedora, and it's not for everyone.

FedoraVsLongTermUse written at 00:49:11
