Wandering Thoughts

2021-07-26

Understanding plain Linux NVMe device names (in /dev and kernel messages)

On Linux, plain disk names for most modern disk devices are in the form /dev/sda for the whole disk and /dev/sda3 for a partition (regardless of whether the disk is partitioned through modern GPT or old MBR). When I got NVMe SSDs for my office workstation, one of my many discoveries about them was that Linux gives them different and more oddly formed names. Since I had many other NVMe related issues on my mind at the time, I didn't look into the odd names; I just accepted them and moved on. But now I want to actually understand how Linux's NVMe device names are formed and what they mean, and it turns out to be relatively simple.

Let's start with the actual names. On Linux, NVMe devices have three levels of names. On my office workstation, for the first NVMe device there is /dev/nvme0, /dev/nvme0n1, and then a series of /dev/nvme0n1p<X> devices for each partition. Unusually, /dev/nvme0 is a character device, not a block device. Kernel messages will talk about both 'nvme0' and 'nvme0n1':

nvme nvme0: pci function 0000:01:00.0
nvme nvme0: 15/0/0 default/read/poll queues
 nvme0n1: p1 p2 p3 p4 p5

(I don't know yet what names will appear in kernel messages about IO errors.)

If I want to partition the disk, install GRUB bootblocks, or the like, I want to use the 'nvme0n1' name. Querying certain sorts of NVMe information is done using 'nvme0'. I can apparently use either name for querying SMART information with smartctl.
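
(As a concrete reminder to myself of which name goes with which tool, the commands I'd expect to use look roughly like this; the device names are for my first NVMe SSD, and these are just common examples rather than an exhaustive list:)

# block level work uses the namespace device
fdisk -l /dev/nvme0n1
# NVMe-specific queries go to the controller device
nvme id-ctrl /dev/nvme0
# smartctl appears happy with either name
smartctl -a /dev/nvme0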

Numbering NVMe SSDs instead of giving them letters and naming partitions with 'p<X>' instead of plain numbers are both sensible changes from the somewhat arcane sd... naming scheme. The unusual thing is the 'n1' in the middle. This is present because of a NVMe feature called "namespaces", which allows you (or someone) to divide up a NVMe SSD into multiple separate ranges of logical block addresses that are isolated from each other. Namespaces are numbered starting from one, and I think that most NVMe drives have only one, hence 'nvme0n1' as the base name for my first NVMe SSD's disk devices.

(This is also likely part of why 'nvme0' is a character device instead of a block device. Although I haven't checked the NVMe specification, I suspect that you can't read or write blocks from a NVMe SSD without specifying the namespace.)

The Arch wiki page on NVMe drives has a nice overview of all sorts of things you can find out about your NVMe drives through the nvme command. Based on the Arch nvme manpage, it has a lot of sub-commands and options.
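
(For example, some of the sub-commands that look immediately useful to me, based on the manpage and the Arch wiki page rather than extensive personal use, are:)

nvme list
nvme smart-log /dev/nvme0
nvme list-ns /dev/nvme0
nvme id-ns /dev/nvme0n1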

For my expected uses, I suspect that I will never change or manipulate NVMe namespaces on my NVMe drives. I'll just leave them in their default state, as shipped by the company making them. Probably all consumer NVMe SSDs will come with only a single namespace by default, so that people can use the entire drive's official capacity as a single filesystem or partition without having to do strange things.

NVMeDeviceNames written at 23:10:06

2021-07-25

I should probably learn command-line NetworkManager usage

I'm generally not a fan of NetworkManager on the machines I deal with, but I do wind up dealing with it at the command line level every so often, most recently for setting up a WireGuard client on my work laptop. There was a time when it felt that NetworkManager was the inevitable future of networking on Linux even on servers, but fortunately systemd-networkd has mostly made that go away. Still, systemd-networkd has limitations and isn't as comprehensive as NetworkManager; NetworkManager is the face of networking on a lot of Linux configurations, and someday I may be forced to deal with it on a regular basis.

(Fedora keeps threatening to remove the ifup and ifdown scripts that drive my DSL PPPoE link, and systemd-networkd doesn't currently have support for PPPoE.)

All of this leaves me feeling that not really knowing even the basics of NetworkManager's general concepts and command line usage is a gap in my practical Linux knowledge that matters and that I should fix. To put it bluntly, continuing to ignore it feels like burying my head in the sand. Even if I never really use it, learning the basics of NetworkManager command line usage would give me an informed opinion, instead of my current mostly uninformed one.

The low impact approach to learning NetworkManager command line usage would be to explore it on my work laptop, which already uses NetworkManager. I normally use the Cinnamon Network Manager GUI (which is not nm-applet, it turns out), but I could switch to doing my network manipulation through the command line, and also read and try to understand all of the configured connection parameters.
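
(My starting point would probably be the basic read-only nmcli commands from the manpage, which are something like the following; the connection name here is a stand-in for whatever my laptop actually calls its connections:)

nmcli device status
nmcli connection show
nmcli connection show "Wired connection 1"
nmcli device show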

The high impact approach would be to try to set up a version of my home desktop's DSL PPPoE connection in NetworkManager. Many years ago I configured a version of my DSL connection on my laptop, so in theory I could cross-check my NetworkManager flailing against that version (although I should first make sure it still works). As a side benefit, this would leave me prepared for when Fedora carries through its threat to remove ifup and my current DSL PPPoE setup immediately stops working.

(I've written this partly in the hopes of motivating myself into doing some NetworkManager learning, even if I don't manage much.)

NetworkManagerLearning written at 22:38:29

2021-07-21

It's nice when programs switch to being launched from systemd user units

I recently upgraded my home machine from Fedora 33 to Fedora 34. One of the changes in Fedora 34 is that the audio system switched from PulseAudio to PipeWire (the Fedora change proposal, an article on the switch). Part of this switch is that you need to run different daemons in your user session. For normal people, this is transparently handled by whichever standard desktop environment they're using. Unfortunately I use a completely custom desktop, so I have to sort this out myself (this is one way Fedora upgrades are complicated for me). Except this time I didn't need to do anything; PipeWire just worked after the switch.

One significant reason for this is that PipeWire arranges to be started in your user session not through old mechanisms like /etc/xdg/autostart but through a systemd user unit (actually two, one for the daemon and one for the socket). Systemd user units are independent of your desktop and get started automatically, which means that they just work even in non-standard desktop environments (well, so far).
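
(A nice side effect is that I can inspect these daemons with ordinary systemctl commands. Something like the following should show their state, although I believe the exact unit names can vary between distributions and versions:)

systemctl --user status pipewire.service pipewire.socket
systemctl --user list-units 'pipewire*'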

(As covered in the Arch Wiki, there are some things you need to do in an X session.)

One of the things that's quietly making my life easier in my custom desktop environment is that more things are switching to being started through systemd user units instead of the various other methods. It's probably a bit more work for some of the programs involved (since they can't assume direct access to your display any more and so on), but it's handy for me, so I'm glad that they're investing in the change.

PS: It turns out that the basic PulseAudio daemon was also being set up through systemd user units on Fedora 33. But PulseAudio did want special setup under X, with an /etc/xdg/autostart file that ran /usr/bin/start-pulseaudio-x11. It's possible that PipeWire is less integrated with the X server than PulseAudio is. See the PulseAudio X11 modules (also).

PPS: Apparently I now need to find a replacement for running 'amixer -q set Master ...' to control my volume from the keyboard. This apparently still works for some people (also), but not for me; for now 'pactl' does, and it may be the more or less official tool for doing this with PipeWire for the moment, even though it's from PulseAudio.
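
(For now I'm using pactl invocations that are more or less the following; I believe this is standard pactl syntax, and '@DEFAULT_SINK@' is a special name documented in the pactl manpage:)

pactl set-sink-volume @DEFAULT_SINK@ +5%
pactl set-sink-volume @DEFAULT_SINK@ -5%
pactl set-sink-mute @DEFAULT_SINK@ toggle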

SystemdUserUnitsNice written at 01:01:16

2021-07-16

Setting up a WireGuard client with NetworkManager (using nmcli)

For reasons beyond the scope of this entry, I've been building a VPN server that will support WireGuard (along with OpenVPN and L2TP). A server needs a client, so I spent part of today setting up my work laptop as a WireGuard client in a 'VPN' configuration, under NetworkManager because that's what my laptop uses. I was hoping to do this through the Cinnamon GUIs for NetworkManager, but unfortunately while NetworkManager itself has supported WireGuard for some time, this support hasn't propagated into GUIs such as the GNOME Control Center (cf) or the NetworkManager applet that Cinnamon uses.

I'm already quite familiar with WireGuard in general, so I found that the easiest way to start was to set up a basic WireGuard configuration file for the connection in /etc/wireguard/wg0.conf, including both the main configuration (with the laptop's key and my local port) and a [Peer] section for the server. Since I'm using WireGuard here in a VPN configuration, instead of to reach just some internal IPs, I set AllowedIPs to 0.0.0.0/0. After writing wg0.conf, I then imported it into NetworkManager:

nmcli connection import type wireguard file /etc/wireguard/wg0.conf

(For what can go in the configuration file, start with wg(8) and wg-quick(8). I suspect that NetworkManager doesn't support some of the more advanced keys. I stuck to the basics. The import process definitely ignores the various script settings supported by wg-quick(8). Currently, see nm_vpn_wireguard_import() in nm-vpn-helpers.c.)
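
(For illustration, the general shape of the wg0.conf I'm talking about is something like this, with placeholder keys, endpoint, and port instead of my real ones:)

[Interface]
PrivateKey = <the laptop's private key>
ListenPort = 51820

[Peer]
PublicKey = <the VPN server's public key>
Endpoint = <the VPN server's IP>:51820
AllowedIPs = 0.0.0.0/0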

Imported connections are apparently set to auto-connect, which isn't what I wanted, plus there were some other things to adjust (following Thomas Haller's guide, WireGuard in NetworkManager):

nmcli con modify wg0 \
   autoconnect no \
   ipv4.method manual \
   ipv4.address 172.29.50.10/24 \
   ipv4.dns <...>

At this point you might be tempted to set ipv4.gateway, and indeed that's what I did the first time around. It turns out that this is a mistake, because these days NetworkManager will do the right thing based on the 'accept everything' AllowedIPs I set, right down to setting up policy based routing with a fwmark so that encrypted traffic to the WireGuard VPN server doesn't try to go over WireGuard. If you set ipv4.gateway as well, you wind up with two default routes and then your encrypted WireGuard traffic may try to go over your WireGuard connection again, which doesn't work.

(See the description of 'ip4-auto-default-route' in the WireGuard configuration properties. The full index of available NetworkManager settings in various sections is currently here; the ones most useful to me are probably connection.* and ipv4.*.)

Getting DNS to work correctly requires a little extra step, or at least did for me. While the wg0 connection is active, I want all of my DNS queries to go to our internal resolving DNS server and also to have a search path of our university subdomain. This apparently requires explicitly including '~' in the NetworkManager DNS search path:

nmcli con modify wg0 \
  ipv4.dns-search "cs.toronto.edu,~"

This comes from Fedora bug #1895518, which also has some useful resolvectl options.
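
(To double check the result, resolvectl can show the per-link DNS state; assuming the wg0 interface name, something like:)

resolvectl dns wg0
resolvectl domain wg0
resolvectl status wg0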

You (I) can see a lot of settings for the WireGuard setup with 'nmcli connection show wg0', including active ones, but this seems to omit NetworkManager's view of the WireGuard peers. To see that, I needed to look directly at the configuration file that NetworkManager wrote, in /etc/NetworkManager/system-connections/wg0.nmconnection. I'm someday going to need to edit this directly to modify the WireGuard VPN server's endpoint from my test machine to the production machine.

(The NetworkManager RFE for configuring WireGuard peers in nmcli is issue #358.)

With no GUI support for WireGuard connections, I have to bring this WireGuard VPN up and down with 'nmcli con up wg0' and 'nmcli con down wg0'. Once I have the new VPN server in production, I'll be writing little scripts to do this for me. Hopefully this will be improved some day, so that the NetworkManager applet allows you to activate and deactivate WireGuard connections and shows you that one is active.
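
(The scripts will probably be nothing more than a thin wrapper around nmcli, something like this sketch; 'wg-work' is a hypothetical name:)

#!/bin/sh
# wg-work: bring my work WireGuard VPN up or down through NetworkManager.
case "$1" in
  up|down) exec nmcli connection "$1" wg0 ;;
  *)       echo "usage: wg-work up|down" 1>&2; exit 1 ;;
esac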

If I wanted a limited VPN that only sent traffic to our internal networks over my WireGuard link, I would configure the server's AllowedIPs to the list of networks and then I believe that NetworkManager would automatically set up routes for them. However, I don't know how to make this work (in NetworkManager) if the WireGuard VPN server itself was on one of the subnets I wanted to reach over WireGuard. For my laptop, routing all of its traffic to work over WireGuard is no worse than using our OpenVPN or L2TP VPN servers, which do the same thing by default.

(On my home desktop, I use hand built fwmark-based policy rules to deal with my WireGuard endpoint being on a subnet I want to normally reach over WireGuard. NetworkManager will build the equivalents for me when I'm routing 0.0.0.0/0 over the WireGuard link, but I believe not in other situations.)

(For information, I primarily relied on Thomas Haller's WireGuard in NetworkManager, supplemented with a Fedora Magazine article and this article.)

NetworkManagerWireGuardClient written at 01:00:49

2021-07-14

Some ways to get (or not get) information about system memory ranges on Linux

I recently learned about lsmem, which is described as "list[ing] the ranges of available memory [...]". The source I learned it from was curious why lsmem on a modern 64-bit machine didn't list all of the low 4 GB as a single block (they were exploring kernel memory zones, where the low 4 GB of RAM are still a special 'DMA32' zone). To start with, I'll show typical lsmem default output from a machine with 32 GB of RAM:

; lsmem
RANGE                                  SIZE  STATE REMOVABLE  BLOCK
0x0000000000000000-0x00000000dfffffff  3.5G online       yes   0-27
0x0000000100000000-0x000000081fffffff 28.5G online       yes 32-259

Memory block size:       128M
Total online memory:      32G
Total offline memory:      0B

Lsmem is reporting information from /sys/devices/system/memory (see also memory-hotplug.txt). Both the sysfs hierarchy and lsmem itself apparently come originally from the IBM S390x architecture. Today this sysfs hierarchy apparently only exists for memory hotplug, and there are some signs that kernel developers aren't fond of it.

(Update: I'm wrong about where the sysfs memory hierarchy comes from; see this tweet from Dave Hansen.)

On the machines I've looked at, the hole reported by lsmem is authentic, in that /sys/devices/system/memory also doesn't have any nodes for that range (on the machine above, for blocks 28, 29, 30, and 31). The specific gap varies from machine to machine. However, all of the information from lsmem may well be a simplification of a more complex reality.
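
(You can poke at this directly in sysfs; on the machine above, something like the following shows the memory block size and which memoryNN blocks actually exist, which is where the gap shows up. I believe block_size_bytes is reported in hex.)

cat /sys/devices/system/memory/block_size_bytes
ls /sys/devices/system/memory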

The kernel also exposes physical memory range information through /proc in /proc/iomem (on modern kernels you'll probably have to read this as root to get real address ranges). This has a much more complicated view of actual RAM, one with many more holes than what lsmem and /sys/devices/system/memory show. This is especially the case in the low 4G of memory, where for example the system above reports a whole series of chunks of reserved memory, PCI bus address space, ACPI tables and storage, and more. The high memory range is simpler, but still not quite the same:

100000000-81f37ffff : System RAM
81f380000-81fffffff : RAM buffer

The information from /proc/iomem has a lot of information about PCI(e) windows and other things, so you may want to narrow down what you look at. On the system above, /proc/iomem has 107 lines but only nine of them are for 'System RAM', and all but one of them are in the physical memory address range that lsmem lumps into the 'low' 3.5 GB:

00001000-0009d3ff : System RAM
00100000-09e0ffff : System RAM
0a000000-0a1fffff : System RAM
0a20b000-0affffff : System RAM
0b020000-d17bafff : System RAM
d17da000-da66ffff : System RAM
da7e5000-da8eefff : System RAM
dbac7000-ddffffff : System RAM

(I don't have the energy to work out how much actual RAM this represents.)
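
(If I did have the energy, a GNU awk one-liner along these lines would probably do the arithmetic; it needs gawk for strtonum(), needs to be run as root to see the real addresses, and assumes that the 'System RAM' lines are the un-indented top level ones, as they are here:)

awk -F'[- ]+' '/System RAM/ { total += strtonum("0x" $2) - strtonum("0x" $1) + 1 }
               END { printf "%.2f GiB\n", total / (1024*1024*1024) }' /proc/iomem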

Another view of physical memory range information is the kernel's report of the BIOS 'e820' memory map, printed during boot. On the system above, this says that the top of memory is actually 0x81f37ffff:

BIOS-e820: [mem 0x0000000100000000-0x000000081f37ffff] usable

I don't know if the Linux kernel exposes this information in /sys. You can also find various other things about physical memory ranges in the kernel's boot messages, but I don't know enough to analyze them.

What's clear is that in general, a modern x86 machine's physical memory ranges are quite complicated. There are historical bits and pieces, ACPI and other data that is in RAM but must be preserved, PCI(e) windows, and other things.

(I assume that there is low level chipset magic to direct reads and writes for RAM to the appropriate bits of RAM, including remapping parts of the DIMMs around so that they can be more or less fully used.)

SystemMemoryRangeInfo written at 01:00:13

2021-07-12

Understanding something about udev's normal network device names on Linux

For a long time, systemd's version of udev has attempted to give network interfaces what the systemd people call predictable or stable names. The current naming scheme is more or less documented in systemd.net-naming-scheme, with an older version in their Predictable Network Interface Names wiki page. To understand how the naming scheme is applied in practice by default, you also need to read the description of NamePolicy= in systemd.link(5), and inspect the default .link file, '99-default.link', which might be in either /lib/systemd/network or /usr/lib/systemd/network. It appears that the current network name policy is generally going to be "kernel database onboard slot path", possibly with 'keep' at the front in addition. In practice, on most servers and desktops, most network devices will be named based on their PCI slot identifier, using systemd's 'path' naming policy.

A PCI slot identifier is what ordinary 'lspci' will show you as the PCIe bus address. As covered in the lspci manpage, the fully general form of a PCIe bus address is <domain>:<bus>:<device>.<function>, and on many systems the domain is always 0000 and is omitted. Systemd turns this into what it calls a "PCI geographical location", which is (translated into lspci's terminology):

prefix [Pdomain] pbus sdevice [ffunction] [nphys_port_name | ddev_port]

The domain is omitted if it's 0 and the function is only present if it's a multi-function device. All of the numbers are in decimal, while lspci presents them in hex. For Ethernet devices, the prefix is 'en'.

(I can't say anything about the 'n' and 'd' suffixes because I've never seen them in our hardware.)

The device portion of the PCIe bus address is very frequently 0, because many Ethernet devices are behind PCIe bridges in the PCIe bus topology. This is how my office workstation is arranged, and how almost all of our servers are. The exceptions are all on bus 0, the root bus, which I believe means that they're directly integrated into the core chipset. This means that in practice the network device name primarily comes from the PCI bus number, possibly with a function number added. This gives 'path' based names of, eg, enp6s0 (bus 6, device 0) or enp1s0f0 and enp1s0f1 (bus 1, device 0, function 0 or 1; this is a dual 10G-T card, with each port being one function).

(Onboard devices on servers and even desktops are often not integrated into the core chipset and thus not on PCIe bus 0. Udev may or may not recognize them as onboard devices and assign them 'eno<N>' names. Servers from good sources will hopefully have enough correct DMI and other information so that udev can do this.)

As always, the PCIe bus ordering doesn't necessarily correspond to what you think of as the actual order of hardware. My office workstation has an onboard Ethernet port on its ASUS Prime X370-Pro motherboard and an Intel 1G PCIe card, but they are (or would be) enp8s0 and enp6s0 respectively. So my onboard port has a higher PCIe bus number than the PCIe card.

There is an important consequence of this, which is that systemd's default network device names are not stable if you change your hardware around, even if you didn't touch the network card itself. Changing your hardware around can change your PCIe bus numbers, and since the PCIe bus number is most of what determines the network interface name, it will change. You don't have to touch your actual network card for this to happen; adding, changing, or relocating other hardware between physical PCIe slots can trigger changes in bus addresses (primarily if PCIe bridges are added or removed).

(However, adding or removing hardware won't necessarily change existing PCIe bus addresses even if the hardware changed has a PCIe bridge. It all depends on your specific PCIe topology.)

Sidebar: obtaining udev and PCIe topology information

Running 'udevadm info /sys/class/net/<something>' will give you a dump of what udev thinks and knows about any given network interface. The various ID_NET_NAME_* properties give you the various names that udev would assign based on that particular naming policy. The 'enp...' names are ID_NET_NAME_PATH, and on server hardware you may also see ID_NET_NAME_ONBOARD.
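
(As an illustrative sketch, the interesting lines of that output for my office workstation's onboard port would look roughly like the following; the exact set of properties varies from system to system, and the MAC-based name here is a placeholder:)

E: ID_NET_DRIVER=igb
E: ID_NET_NAME_MAC=enx2cfda1xxxxxx
E: ID_NET_NAME_PATH=enp8s0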

(The 'database' naming scheme comes from information in hwdb.)

On modern systems, 'lspci -PP' can be used to show the full PCIe path to a device (or all devices). On Ubuntu 18.04, you can also use sysfs to work through your PCIe topology, in addition to 'lspci -tv'. See also my entry on PCIe bus addresses, lspci, and working out your PCIe bus topology.

UdevNetworkDeviceNaming written at 00:16:04

2021-07-07

The initramfs for old kernels can hide old versions of things

In a recent entry, I more or less blamed a new minor Linux kernel version for changing the naming of my network interface. I had reasonable reasons to say this beyond just rebooting into 5.12.12 and having the problem appear; I also rebooted back into 5.12.11 and the problem disappeared again (I ended up going back and forth repeatedly and this was consistent). When the only changing thing is the kernel version, you can reasonably suspect it, instead of (say) an upgrade to udev that you also installed between the two kernels. However, I'm not so sure of that any more.

I'm running Fedora on this desktop, and Fedora normally doesn't rebuild the initramfs for existing kernels when you upgrade packages and install new kernels. This means that when I boot my Fedora 5.12.11 kernel, I'm not merely running that kernel, I'm running an initramfs with programs and configuration files that were frozen when that kernel was installed. If there was a udev update that changed its early boot behavior, that update isn't in the 5.12.11 initramfs. Although I thought I only changed the kernel version by booting back and forth between 5.12.11 and 5.12.12, I was also changing the versions of what ran during early boot, and possibly also the configuration files they used. This may well have fooled me about what the cause of my problem was.

(I know, I once said Fedora rebuilt the initramfs for all of your kernels when you installed new DKMS modules. Apparently I was wrong about that, and was seeing something else.)

In short, what looks like an issue in the new kernel may actually be a change in the new initramfs that you get along with the new kernel. It's hard to tell for sure, although you can try rebuilding the initramfs for an older kernel if you can work out how to do this correctly. Of course, if you do rebuild an initramfs for an old kernel to see if it's really the kernel that's at fault, you definitely want to save a copy of your working old initramfs.

(I've seen this before for configuration files, for example when Fedora embedded my current sysctl settings in the initramfs.)

Despite potentially causing issues, not rebuilding is quite sensible. Generally you want to preserve old working initramfses the way they are, just as you want to preserve old kernels (certainly I did in this case, since my 5.12.11 environment kept working). People also want to do less work on package upgrades, and not rebuilding four or five initramfses is much less work than doing so.

InitramfsHidesOldThings written at 00:54:42

2021-06-30

Giving your Linux network interfaces fixed names (under udevd and networkd)

Suppose, not entirely hypothetically, that you always want your machine's primary network interface to be called 'em0' regardless of what the combination of the kernel, networkd, and the systemd udevd want to call it today (something that has been known to change). Until recently, my (incorrect) setup for this was a <link>.link file that looked like this:

[Match]
MACAddress=2c:fd:a1:xx:xx:xx

[Link]
Description=Onboard motherboard port
MACAddressPolicy=persistent
Name=em0
# Stop VLAN renaming
NamePolicy=keep

I had this NamePolicy because I had VLANs on top of em0 and this was how I made them work. This .link file worked for about a year and a half, and then I upgraded my Fedora 33 workstation from 5.12.11 to 5.12.12 and rebooted. It promptly dropped off the network because my interface had the wrong name and nothing got configured on it.

What I was trying to do was rename the interface with that MAC address to em0. What my addition of NamePolicy=keep did was create a situation where the interface would be renamed to em0 if and only if nothing else had renamed it before udevd processed my .link file. In 5.12.12 (but not 5.12.11), something (either the kernel or udevd) decided to rename my interface to enp8s0 before my .link file took effect, and then the interface didn't get renamed again to em0.

(This is the implication of '[...] or all policies configured [in NamePolicy] must fail' in the manpage's description of 'Name='. If the device hasn't already been given a name, the 'keep' policy would fail and it would be renamed to em0 by my 'Name='.)

If you (I) want to give your network interfaces fixed names but have your .link files apply only to real Ethernet interfaces instead of matching broadly, what I believe you want is:

[Match]
MACAddress=2c:fd:a1:xx:xx:xx
Type=ether
# Before systemd v245, use eg
# Property=ID_BUS=pci

[Link]
Description=Onboard motherboard port
MACAddressPolicy=persistent
Name=em0

With no NamePolicy, this will unconditionally rename anything matching that MAC to em0. With Type=ether, this will only apply to real Ethernet devices, not your VLANs or other things that inherit the MAC from the underlying Ethernet interface.
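
(To verify after the fact which .link file udevd actually applied to an interface, I believe the ID_NET_LINK_FILE property it records is what to look at:)

udevadm info /sys/class/net/em0 | grep ID_NET_LINK_FILE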

PS: At this point one may want to read the systemd.net-naming-scheme manpage. I believe that names of the form 'emX' are safe from ever colliding with kernel-assigned interface names, but I'm not completely sure.

PPS: In 5.12.12, my kernel boot logs clearly show that there are two renamings with this .link setup:

igb 0000:08:00.0 enp8s0: renamed from eth0
[...]
igb 0000:08:00.0 em0: renamed from enp8s0

So my new .link doesn't prevent the initial renaming in 5.12.12 to enp8s0; it just allows my .link to rename the interface again to the em0 that I want.

NetworkdNamingYourInterfaces written at 16:41:47

2021-06-28

Be careful when matching on Ethernet addresses in systemd-networkd

A not uncommon pattern in networkd is to write a <link>.link or <network>.network file that selects the hardware to work on by MAC address, because that's often more stable than many of the other alternatives. For instance, you might write a .link file for your motherboard like this:

[Match]
MACAddress=2c:fd:a1:xx:xx:xx

[Link]
Description=Onboard motherboard port
MACAddressPolicy=persistent
Name=em0

Unfortunately this is dangerous, because some virtual devices inherit Ethernet addresses from their parent device and networkd will allow virtual devices to match against just Ethernet addresses. In particular VLANs inherit the Ethernet address from their underlying network device, so if you have one or more VLANs on top of em0, they will all match this (and then they'll try to rename themselves to em0). The same can happen if you have a .network file that matches with MACAddress in order to deal with variable network names for the same underlying connection.

(If you have a real device that matches this way and creates VLANs on top of itself, networkd may be smart enough to recognize that it has a recursive situation, or it may blow up. I haven't tested.)

In other words, if you tell networkd that a .link or a .network file applies to anything with a specific Ethernet address, networkd takes that to really mean anything. You may have meant this to apply (only) to your actual Ethernet device, but the .link file doesn't say that and networkd won't infer it.

In systemd v245 or later, what you probably want is to restrict any Ethernet hardware matches to real Ethernet devices with the additional requirement of 'Type=ether':

[Match]
MACAddress=2c:fd:a1:xx:xx:xx
Type=ether

(Systemd v245 was released in February of 2020 and is in Ubuntu 20.04 and the current versions of Fedora, but isn't in Debian stable. Support for the current meaning of Type= that allows matching 'ether' was added in this commit as a result of issue #14952. To my surprise, this significant improvement doesn't seem to have been noted in the NEWS for v245.)

The 'ether' type applies to both PCI Ethernet ports and USB Ethernet devices, but it doesn't apply to wireless devices; those are 'wlan'. As the manpage covers, 'networkctl list' can tell you what your devices are. VLANs are type 'vlan'.
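
(As an illustrative sketch, 'networkctl list' output on a machine with an Ethernet port and a VLAN on top of it looks roughly like this; the TYPE column is the relevant part, and 'vlan10' is just a stand-in name:)

IDX LINK   TYPE     OPERATIONAL SETUP
  1 lo     loopback carrier     unmanaged
  2 em0    ether    routable    configured
  3 vlan10 vlan     routable    configured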

If you have a systemd (and thus a systemd-networkd) that's older than v245, I think the only thing you can do is match on a property of the device, obtained from 'udevadm info /sys/class/net/<what>'. For a lot of physical hardware, the obvious property is that it's on a PCI bus:

[Match]
MACAddress=2c:fd:a1:xx:xx:xx
Property=ID_BUS=pci

(I have to say that I haven't tested this, I'm just following the manpage.)

However, USB Ethernet devices are 'ID_BUS=usb', not PCI, while a laptop's onboard wireless most likely is a PCI device, which is the case on my Dell XPS 13. My laptop's wireless device is also 'DEVTYPE=wlan', while even now real Ethernet devices have no DEVTYPE (as of systemd v248 on a Fedora 34 virtual machine).

(This elaborates on a tweet of mine.)

PS: I'm not sure whether the matching here is being done by systemd-networkd, the systemd version of udevd, or both of them. It's quite possible that both programs and subsystems are doing it at different times and in different circumstances.

NetworkdMACMatchesWidely written at 23:45:06

2021-06-27

Some notes on what's in Linux's /sys/class/net for network interface status

Due to discovering that one of our servers had had a network interface at 100 Mbits/sec for some time, I've become interested in what information is exposed by the Linux kernel about network interfaces in /sys, specifically in /sys/class/net/<interface>. I'm mostly interested in the information there because it's the source of what the Prometheus host agent exposes as network interface status metrics, and thus what's easy to monitor and alert on in our metrics and monitoring setup.

The overall reference for this is the Linux kernel's sysfs-class-net, which documents the /sys fields directly. For the flags sysfs file, you also need the kernel's include/uapi/linux/if.h, and for the type file, include/uapi/linux/if_arp.h. Generally sysfs-class-net is pretty straightforward about what things mean, although you may have to read several entries together. Not all interfaces have all of the files, for instance the phys_port_* files aren't present on any servers we have.

The flags file has a number of common values you may see, which I'm going to write down here for my own reference:

0x1003 or 4099 decimal
This is the common value for active Ethernet interfaces. It is MULTICAST (0x1000) plus UP (0x1) and BROADCAST (0x2). Tools like ifconfig will report RUNNING as well, but that apparently doesn't appear in sysfs.

0x1002 or 4098 decimal
This is the common value for an inactive Ethernet interface, whether or not it has a cable plugged in. It is MULTICAST plus BROADCAST, but without UP.

0x9 or 9 decimal
This is the common value for the loopback interface, made from UP (0x1) and LOOPBACK (0x8).

0x91 or 145 decimal
This is an UP (0x1), POINTOPOINT (0x10) link that is NOARP (0x80). This is the flags value of my Wireguard endpoints.

0x1091 or 4241 decimal
This is an UP (0x1), POINTOPOINT (0x10) link that is MULTICAST (0x1000) in addition to being NOARP (0x80). This is the flags value of my PPPoE DSL link's PPP connection.
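
(The flags file itself is already in hex, so checking an interface is just a matter of reading it; if you want the decimal version, the shell can convert it. Using an example interface name:)

cat /sys/class/net/enp6s0/flags
printf '%d\n' $(cat /sys/class/net/enp6s0/flags)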

The 'addr_assign_type' file is about the (Ethernet) hardware address, not any IP addresses that may be associated with the interface. A physical interface will normally have a value of 0; a value of 3 means that you specifically set the MAC address. VLAN interfaces sitting on top of physical devices have a value of 2 (they take their MAC address from the underlying device's MAC).

The name_assign_type is somewhat random, as far as I can tell. Our Ubuntu machines all have a name assignment type value of 4 ('renamed'), while my Fedora machines mostly have a name assignment type of 3 ('named by userspace'), with one Ethernet device being a 4. My Fedora home machine's ppp0 device has a value of 1.

The most common type values are 1 (Ethernet), 772 (the loopback interface), 512 (PPP), and 65534 ('none', what my Wireguard tunnels have). Possibly someday Wireguard will have its own type value assigned in include/uapi/linux/if_arp.h.

The speed value is, as mentioned in sysfs-class-net, in Mbits/sec. The values I've seen are 100 (100M), 1000 (1G), and 10000 (10G). What gets reported for interfaces without carrier seems to depend. An UP interface with no carrier will report a speed of -1; an interface that isn't up has no speed value and attempts to read the file will report 'Invalid argument'. The Prometheus host agent turns all of these into its speed in bytes metric node_network_speed_bytes by multiplying the speed value by 125000, which normally gives you a metric value of -125000 (UP but no carrier), 12500000 (100M), 125000000 (1G), or 1250000000 (10G).
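
(Checking a specific interface by hand is just reading the file and doing the multiplication yourself; again with an example interface name:)

cat /sys/class/net/enp6s0/speed
echo $(( $(cat /sys/class/net/enp6s0/speed) * 125000 ))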

(Some Linux distributions in some situations will set additional interfaces to UP as part of trying to do DHCP on them. Otherwise they'll quietly stay down.)

The Prometheus host agent exposes what it calls 'non-numeric data' from /sys/class/net in the node_network_info metric. This gives you the device's hardware address and broadcast address, its name, its duplex (which may be blank for things that don't have a duplex mode, such as Wireguard links or virtual Ethernets), and its state (from the operstate file). Somewhat to my surprise, the operstate of the loopback interface is 'unknown', not 'up'.

Update: it turns out that the carrier file is only available for interfaces that are configured 'UP' (and then is either 0 or 1 depending on if carrier is detected). If the interface is not UP, attempting to read carrier fails with 'Invalid argument'.

SysfsNetworkInterfaceStatus written at 01:45:27
