Wandering Thoughts


Using lsblk to get extremely useful information about disks

Every so often I need to know the serial number of a disk, generally because it's the only way to identify one particular disk out of two (or more) identical ones. As one example, perhaps I need to replace a failed drive that's one of a pair. You can get this information from the disks through smartctl, but the process is somewhat annoying if you just want the serial number, especially if you want it for multiple disks.

(Sometimes you have a dead disk so you need to find it by process of elimination starting from the serial numbers of all of the live disks.)

I've used lsblk for some time to get disk UUIDs and raid UUIDs, but I never looked very deeply at its other options. Recently I discovered that lsblk can do a lot more, and in particular it can report disk serial numbers (as well as a bunch of other handy information) in an extremely convenient form. It's simplest to just show you an example:

$ lsblk -o NAME,SERIAL,HCTL,TRAN,MODEL --nodeps /dev/sd?
sda  S21NNXCGAxxxxxH 0:0:0:0    sata   Samsung SSD 850 
sdb  S21NNXCGAxxxxxE 1:0:0:0    sata   Samsung SSD 850 
sdc  Zxxxxx4E        2:0:0:0    sata   ST500DM002-1BC14
sdd  WD-WMC5K0Dxxxxx 4:0:0:0    sata   WDC WD1002F9YZ-0
sde  WD-WMC5K0Dxxxxx 5:0:0:0    sata   WDC WD1002F9YZ-0

(For obscure reasons I don't feel like publishing the full serial numbers of our disks. It might be harmless to do so, but let's not find out otherwise the hard way.)

You can get a full list of possible fields with 'lsblk --help', along with generally what they mean, although you'll find that some of them are less useful than you might guess. VENDOR is always 'ATA' for me, for example, and KNAME is the same as NAME on my systems; TRAN is usually 'sata', as here, but we have some machines where it's different. Looking for a PHY-SEC that's not 512 is a convenient way to find advanced format drives, which may be surprisingly uncommon in some environments. SIZE is another surprisingly handy field; if you know you're looking for a disk of a specific size, it lets you filter disks in and out without checking serial numbers or even specific models, which is handy if you have multiple different-sized drives from one vendor such as WD or Seagate.

(--nodeps tells lsblk to just report on the devices that you gave it and not also include their partitions, software RAID devices that use them, and so on.)
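If you want to script over this information rather than eyeball it, recent versions of lsblk can also emit JSON with -J, which is less fragile than parsing the aligned columns. Here's a small illustrative sketch; the sample JSON is canned, modeled on what 'lsblk -J -o NAME,SERIAL --nodeps' produces, and the exact field set depends on your util-linux version:

```python
import json

# Canned sample in the shape of 'lsblk -J -o NAME,SERIAL --nodeps /dev/sd?'
# output; the real output depends on your util-linux version.
sample = '''
{
   "blockdevices": [
      {"name": "sda", "serial": "S21NNXCGAxxxxxH"},
      {"name": "sdb", "serial": "S21NNXCGAxxxxxE"}
   ]
}
'''

def serials_by_name(lsblk_json):
    """Map device name -> serial number from 'lsblk -J' output."""
    data = json.loads(lsblk_json)
    return {d["name"]: d["serial"] for d in data["blockdevices"]}
```

In real use you'd feed it the output of something like subprocess.run(["lsblk", "-J", "-o", "NAME,SERIAL", "--nodeps"], capture_output=True).stdout instead of a canned string.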

This compact lsblk output is great for summarizing all of the disks on a machine in something that's easy to print out and use. Pretty much everything I need to know is in one spot, and I can easily use this to identify specific drives. I'm quite happy to have stumbled over this additional use of lsblk, and I plan to make much more use of it in the future. Possibly I should routinely collect this output for my machines and save it away.

(This entry is partly to write down the list of lsblk fields that I find useful so I don't have to keep remembering them or sorting through lsblk --help and trying to remember the fields that are less useful than they sound.)

LsblkForDiskInfo written at 01:17:07


DTrace being GPL (and thrown into a Linux kernel) is just the start

The exciting news of the recent time interval comes from Mark J. Wielaard's dtrace for linux; Oracle does the right thing. To summarize the big news, I'll just quote from the Oracle kernel commit message:

This changeset integrates DTrace module sources into the main kernel source tree under the GPLv2 license. [...]

This is exciting news and I don't want to rain on anyone's parade, but it's pretty unlikely that we're going to see DTrace in the Linux kernel any time soon (either the kernel.org main tree or in distribution versions). DTrace being GPL compatible is just the minimum prerequisite for it to ever be in the main kernel, and Oracle putting it in their kernel only helps move things forward so much.

The first problem is simply the issue of integrating foreign code originally written for another Unix into the Linux kernel. For excellent reasons, the Linux kernel people have historically been opposed to what I've called 'code drops', where foreign code is simply parachuted into the kernel more or less intact with some sort of compatibility layer or set of shims. Getting them to accept DTrace is very likely to require modifying DTrace to be real native Linux kernel code that does things in the Linux kernel way and so on. This is a bunch of work, which means that it requires people who are interested in doing the work (and who can navigate the politics of doing so).

(I wrote more on this general issue when I talked about practical issues with getting ZFS into the main Linux kernel many years ago.)

Oracle could do this work, and it's certainly a good sign that they've at least got DTrace running in their own kernel. But since it is their own vendor kernel, Oracle may have just done a code drop instead of a real port into the kernel. Even if they've tried to do a port, similar efforts in the past (most prominently with XFS) took a fairly long time and a significant amount of work before the code passed muster with the Linux kernel community and was accepted into the main kernel.

A larger issue is whether DTrace would even be accepted in any form. At this point the Linux kernel has a number of tracing systems, so the addition of yet another one with yet another set of hooks and so on might not be viewed with particularly great enthusiasm by the Linux kernel people. Their entirely sensible answer to 'we want to use DTrace' might be 'use your time and energy to improve existing facilities and then implement the DTrace user level commands on top of them'. If Oracle followed through on this, we would effectively still get DTrace in the end (I don't care how it works inside the kernel if it works), but this also might cause Oracle to not bother trying to upstream DTrace. From Oracle's perspective, putting a relatively clean and maintainable patchset into their vendor kernel is quite possibly good enough.

(It's also possible that this is the right answer at a technical level. The Linux kernel probably doesn't need three or four different tracing systems that mostly duplicate each other's work, or even two systems that do. Reimplementing the DTrace language and tools on top of, say, kprobes and eBPF would not be as cool as porting DTrace into the kernel, but it might be better.)

Given all of the things in the way of DTrace being in the main kernel, getting it included is unlikely to be a fast process (if it does happen). Oracle is probably more familiar with how to work with the main Linux kernel community than SGI was with XFS, but I would still be amazed if getting DTrace into the Linux kernel took less than a year. Then it would take more time before that kernel started making it into Linux distributions (and before distributions started enabling DTrace and shipping DTrace tools). So even if it happens, I don't expect to be able to use DTrace on Linux for at least the next few years.

(Ironically the fastest way to be able to 'use DTrace' would be for someone to create a version of the language and tools that sat on top of existing Linux kernel tracing stuff. Shipping new user-level programs is fast, and you can always build them yourself.)

PS: To be explicit, I would love to be able to use the DTrace language or something like it to write Linux tracing stuff. I may have had my issues with D, but as far as I can tell it's still a far more casually usable environment for this stuff than anything Linux currently has (although Linux is ahead in some ways, since it's easier to do sophisticated user-level processing of kernel tracing results).

DTraceKernelPessimism written at 00:41:31


The interesting error codes from Linux program segfault kernel messages

When I wrote up what the Linux kernel's messages about segfaulting programs mean, I described what went into the 'error N' codes and how to work out what any particular one meant, but I didn't inventory them all. Rather than put myself through reverse engineering what any particular error code means, I'm going to list them all here, in ascending order.

The basic kernel message looks like this:

testp[9282]: segfault at 0 ip 0000000000401271 sp 00007ffd33b088d0 error 4 in testp[400000+98000]

We're interested in the 'error N' portion, and a little bit in the 'at N' portion (which is the faulting address).

For all of these, the fault happens in user mode so I'm not going to mention it specifically for each one. Also, the list of potential reasons for these segfaults is not exhaustive or fully detailed.

  • error 4: (data) read from an unmapped area.

    This is your classic wild pointer read. On 64-bit x86, most of the address space is unmapped so even a program that uses a relatively large amount of memory is hopefully going to have most bad pointers go to memory that has no mappings at all.

    A faulting address of 0 is a NULL pointer and falls into page zero, the lowest page in memory. The kernel prevents people from mapping page zero, and in general low memory is never mapped, so reads from small faulting addresses should always be error 4s.

  • error 5: read from a memory area that's mapped but not readable.

    This is probably a pointer read of a pointer that is so wild that it's pointing somewhere in the kernel's area of the address space. It might be a guard page, but at least some of the time mmap()'ing things with PROT_NONE appears to make Linux treat them as unmapped areas so you get error code 4 instead. You might think this could be an area mmap()'d with other permissions but without PROT_READ, but it appears that in practice other permissions imply the ability to read the memory as well.

    (I assume that the Linux kernel is optimizing PROT_NONE mappings by not even creating page table entries for the memory area, rather than carefully assembling PTEs that deny all permissions. The error bits come straight from the CPU, so if there are no PTEs the CPU says 'fault for an unmapped area' regardless of what Linux thinks and will report in, eg, /proc/PID/maps.)

  • error 6: (data) write to an unmapped area.

    This is your classic write to a wild or corrupted pointer, including to (or through) a null pointer. As with reads, writes to guard pages mmap()'d with PROT_NONE will generally show up as this, not as 'write to a mapped area that denies permissions'.

    (As with reads, all writes with small faulting addresses should be error 6s because no one sane allows low memory to be mapped.)

  • error 7: write to a mapped area that isn't writable.

    This is either a wild pointer that was unlucky enough to wind up pointing to a bit of memory that was mapped, or an attempt to change read-only data, for example the classical C mistake of trying to modify a string constant (as seen in the first entry). You might also be trying to write to a file that was mmap()'d read only, or in general a memory mapping that lacks PROT_WRITE.

    (All attempts to write to the kernel's area of address space also get this error, instead of error 6.)

  • error 14: attempt to execute code from an unmapped area.

    This is the sign of trying to call through a mangled function pointer (or a NULL one), or perhaps returning from a call when the stack is in an unexpected or corrupted state so that the return address isn't valid. One source of mangled function pointers is use-after-free issues where the (freed) object contains embedded function pointers.

    (Error 14 with a faulting address of 0 often means a function call through a NULL pointer, which in turn often means 'making an indirect call to a function without checking that it's defined'. There are various larger scale causes of this in code.)

  • error 15: attempt to execute code from a mapped memory area that isn't executable.

    This is probably still a mangled function pointer or return address, it's just that you're unlucky (or lucky) and there's mapped memory there instead of nothing.

    (Your code could have confused a function pointer with a data pointer somehow, but this is a lot rarer a mistake than confusing writable data with read-only data.)

If you're reporting a segfault bug in someone else's program, the error code can provide useful clues as to what's wrong. Combined with the faulting address and the instruction pointer at the time, it might be enough for the developers to spot the problem even without a core dump. If you're debugging your own programs, well, hopefully you have core dumps; they'll give you a lot of additional information (starting with a stack trace).

(Now that I know how to decode them, I find these kernel messages to be interesting to read just for the little glimpses they give me into what went wrong in a program I'm using.)
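The cases above boil down to a few bits, so they can be summarized with a little decoder. This is just my own illustration of the bit meanings described here, not something the kernel provides; remember that the kernel prints the code in hex, so 'error 14' is 0x14:

```python
def decode_segfault_error(err):
    """Decode the (hex) 'error N' value from a kernel segfault message."""
    parts = []
    parts.append("user" if err & 0x4 else "kernel")
    if err & 0x10:
        parts.append("instruction fetch")
    else:
        parts.append("write" if err & 0x2 else "read")
    parts.append("mapped area (protection fault)" if err & 0x1 else "unmapped area")
    return ", ".join(parts)
```

For instance, decode_segfault_error(0x6) describes a user-mode write to an unmapped area, matching the 'error 6' case above.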

On 64-bit x86 Linux, generally any faulting address over 0x7fffffffffff will be reported as having a mapping and so you'll get error codes 5, 7, or 15 respectively for read, write, and attempt to execute. These are always wild or corrupted pointers (or addresses more generally), since you never have valid user space addresses up there.

A faulting address of 0 (sometimes printed as '(null)', as covered in the first entry) is a NULL pointer itself. A faulting address that is small, for example 0x18 or 0x200, is generally an offset from a NULL pointer. You get these offsets if you have a NULL pointer to a structure and you try to look at one of the fields (in C, 'sptr = NULL; a = sptr->fld;'), or you have a NULL pointer to an array or a string and you're looking at an array element or a character some distance into it. Under some circumstances a very large address, one near 0xffffffffffffffff (the very top of memory space), can be a sign of a NULL pointer that your code then subtracted from.

(If you see a fault address of 0xffffffffffffffff itself, it's likely that your code is treating -1 as a pointer or is failing to check the return value of something that returns a pointer or '(type *)-1' on error. Sadly there are C APIs that are that perverse.)

KernelSegfaultErrorCodes written at 00:49:09


My failure to migrate my workstation from MBR booting to UEFI

I wrote earlier about how I planned to migrate my work Fedora system from MBR booting to UEFI booting once I moved to its new hardware, complete with a plan of what to do. Today I put that plan into action, but unfortunately it didn't go well. I've now reverted back to MBR booting and plan to stay on it for at least the lifetime of this hardware (probably at least five years). Here is what happened, the various surprises I ran into, and what went wrong.

To start with, when I made my rushed switch to the new hardware I forgot to do one step I'd planned; I didn't save the output of efibootmgr -v from my scratch Fedora install. I don't think this made a real difference in the eventual outcome, but I would have at least liked to know. (I did 'save' the scratch /boot/efi contents in that I set the scratch install's disk aside and later retrieved /boot/efi from it.)

The initial steps to set up my /boot/efi on my real (U)EFI system partition went fine but led to my first surprise. It turns out that grub2-mkconfig won't create an EFI enabled grub.cfg and efibootmgr won't work at all unless your system was booted with UEFI. This creates an obvious but unfortunate chicken and egg situation when you're trying to transition to UEFI, so I created my (U)EFI grub.cfg by copying my existing MBR one and changing 'linux ...' and 'initrd ...' to 'linuxefi ...' and 'initrdefi ...'. I figured that I would have to manually set up a suitable UEFI boot entry in the Asus BIOS, as I had with my Dell XPS 13 laptop.

When I rebooted, I got my second surprise. Unlike Dell's laptop BIOS, the Asus BIOS does not let you configure UEFI boot entries by hand. Instead it automatically and magically hunts around your EFI system partitions to look for whatever plausible EFI boot things it can find, adds them as UEFI boot entries, and then generally tries to boot one. In my case what it tried to boot was what it labeled as 'RedHat Boot Manager', aka EFI/redhat/grub.efi, which was from a very left over grub-efi package from Fedora 18. Since I had not set up any sort of configuration for it, this did not go well (and I don't know if it'd have worked in general). I did eventually manage to get the BIOS to boot its 'Fedora' UEFI boot entry, which was booting EFI/fedora/shim.efi and using my grub.cfg.

(Why not shimx64.efi? I don't know. Both were present, but the Asus BIOS ignored shimx64.efi as it ignored several other EFI things that I believe were bootable.)

Booting my Fedora kernel through UEFI was visibly different. Based on the presence of a row of penguins at the top of the screen, it appears to have come up in framebuffer mode as opposed to the basic text mode that both Grub and it used in my MBR boot setup (based on kernel-parameters.txt, these penguins can be turned off with the kernel command line argument logo.nologo, which I plan to use in the future). With the kernel booted via UEFI, I could use efibootmgr to dump the setup and run grub2-mkconfig to create a brand new proper EFI-based grub.cfg.

Unfortunately, using that grub.cfg caused my system to hang on boot just after the kernel printed messages about initializing the amdgpu driver (if left alone, dracut eventually timed out and dropped me into a RAM-based rescue environment). At the time I was already aggravated and naturally suspicious of the amdgpu driver (because I'd already had problems with it), so I flailed around commenting out more and more weird bits of the generated grub.cfg to make it look like my old one, with no success. At a much greater distance from the whole situation, I've now run kdiff3 against the two grub.cfgs and have discovered that grub2-mkconfig probably left out vital kernel command line arguments that tell the boot environment how to find my actual root filesystem. This would explain why the boot sat around for a while before timing out; it was waiting in the hopes that something with the right UUID would magically show up, which would have let it continue the boot. I don't know why a modern Fedora system doesn't print a clear message about 'your root filesystem is missing', but either it doesn't or I failed to see it.

(I also don't know why grub2-mkconfig left out the magic rd.md.uuid arguments (also) that tell dracut what software RAID devices to assemble, but it did.)

Next, I theorized wildly that booting with shim.efi instead of shimx64.efi was part of my problems, so I flailed around with efibootmgr and was eventually successful at creating UEFI boot entries that used shimx64.efi and booting with them (using my grub.cfg). However this failed to fix my problems with the generated grub.cfg.

At this point I gave up on UEFI booting because there were too many things going wrong and that I didn't understand. I moved /boot/efi back into my root filesystem (which includes /boot as a whole) and completely erased the EFI system partitions on both of my system SSDs by mkfsing them as ext4 filesystems. The Asus BIOS then reverted back to entirely MBR booting and has probably magically erased all of those UEFI boot entries that it magically set up.

(I didn't want to leave the EFI system partitions as anything that the BIOS could understand because I was nervous about the BIOS doing something bad if I had a valid but empty EFI system partition, or an EFI system partition with contents but without any boot loaders.)

Even after working out what was probably wrong, I've decided that I'm going to leave things this way. This system only boots Linux and I don't particularly care about Secure Boot, so as far as I can see UEFI is inferior to MBR booting in practice (for example, I can't mirror /boot/efi across both my system SSDs; I'd have to set up some other mechanism to keep the backup EFI system partition in sync with the primary version). MBR booting involves less BIOS magic, works better for the case that I care about (which is not doing weird graphics things for as long as possible during boot), works perfectly today, and gives me valuable features that I would lose with (U)EFI. All I have to remember to do is to update my MBR bootblocks periodically.

I've taken some lessons learned from this whole episode, ones that I hope to remember to apply the next time around, but they don't go in this entry (partly because it's already long enough). The big meta-lesson is that a lot of things go wrong when I'm rushed, and also if a machine is my main workstation I'm always going to be rushed when dealing with any issues with it.

MBRToUEFIBootFailure written at 01:19:26


What the Linux kernel's messages about segfaulting programs mean on 64-bit x86

For quite a while the Linux kernel has had an option to log a kernel message about every faulting user program, and it probably defaults to on in your Linux distribution. I've seen these messages fly by for years, but for reasons beyond the scope of this entry I've recently wanted to understand what they mean in some moderate amount of detail.

I'll start with a straightforward and typical example, one that I see every time I build and test Go (as this is a test case that is supposed to crash):

testp[19288]: segfault at 0 ip 0000000000401271 sp 00007fff2ce4d210 error 4 in testp[400000+98000]

The meaning of this is:

  • 'testp[19288]' is the faulting program and its PID
  • 'segfault at 0' tells us the memory address (in hex) that caused the segfault when the program tried to access it. Here the address is 0, so we have a null dereference of some sort.
  • 'ip 0000000000401271' is the value of the instruction pointer at the time of the fault. This should be the instruction that attempted to do the invalid memory access. In 64-bit x86, this will be register %rip (useful for inspecting things in GDB and elsewhere).
  • 'sp 00007fff2ce4d210' is the value of the stack pointer. In 64-bit x86, this will be %rsp.
  • 'error 4' is the page fault error code bits from traps.h in hex, as usual, and will almost always be at least 4 (which means 'user-mode access'). A value of 4 means it was a read of an unmapped area, such as address 0, while a value of 6 (4+2) means it was a write of an unmapped area.
  • 'in testp[400000+98000]' tells us the specific virtual memory area that the instruction pointer is in, specifying which file it is (here it's the executable), the starting address that VMA is mapped at (0x400000), and the size of the mapping (0x98000).

With a faulting address of 0 and an error code of 4, we know this particular segfault is a read of a null pointer.
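If you want to pick these messages apart mechanically, say when trawling through logs, a regular expression does the job. This is a rough sketch of my own; the exact message format can shift between kernel versions (and the ip field is sometimes '(null)', as we'll see later), so don't treat it as robust:

```python
import re

# Rough sketch of pulling the fields out of a segfault message; treat it
# as illustrative, since the exact format varies between kernel versions.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+)"
    r" ip (?P<ip>\S+) sp (?P<sp>\S+) error (?P<err>[0-9a-f]+)"
    r" in (?P<vma>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_segfault(line):
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    d["pid"] = int(d["pid"])
    # the addresses, the VMA base and size, and the error code are all hex
    for k in ("addr", "base", "size", "err"):
        d[k] = int(d[k], 16)
    return d

line = ("testp[19288]: segfault at 0 ip 0000000000401271"
        " sp 00007fff2ce4d210 error 4 in testp[400000+98000]")
```

Running parse_segfault(line) on the example message gives you the program name, PID, faulting address, error code, and the VMA base and size as numbers.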

Here's two more error messages:

bash[12235]: segfault at 1054808 ip 000000000041d989 sp 00007ffec1f1cbd8 error 6 in bash[400000+f4000]

'Error 6' means a write to an unmapped user address, here 0x1054808.

bash[11909]: segfault at 0 ip 00007f83c03db746 sp 00007ffccbeda010 error 4 in libc-2.23.so[7f83c0350000+1c0000]

Error 4 and address 0 is a null pointer read but this time it's in some libc function, not in bash's own code, since it's reported as 'in libc-2.23.so[...]'. Since I looked at the core dump, I can tell you that this was in strlen().

On 64-bit x86 Linux, you'll get a somewhat different message if the problem is actually with the instruction being executed, not the address it's referencing. For example:

bash[2848] trap invalid opcode ip:48db90 sp:7ffddc8879e8 error:0 in bash[400000+f4000]

There are a number of such trap types set up in traps.c. Two notable additional ones are 'divide error', which you get if you do an integer division by zero, and 'general protection', which you can get for certain extremely wild pointers (one case I know of is when your 64-bit x86 address is not in 'canonical form'). Although these fields are formatted slightly differently, most of them mean the same thing as in segfaults. The exception is 'error:0', which is not a page fault error code. I don't understand the relevant kernel code enough to know what it means, but if I'm reading between the lines correctly in entry_64.txt, then it's either 0 (the usual case) or an error code from the CPU. Here is one possible list of exceptions that get error codes.

Sometimes these messages can be a little bit unusual and surprising. Here is a silly sample program and the error it produces when run. The code:

#include <stdio.h>
int main(int argc, char **argv) {
   int (*p)();
   p = 0x0;
   return printf("%d\n", (*p)());
}

If compiled (without optimization is best) and run, this generates the kernel message:

a.out[3714]: segfault at 0 ip           (null) sp 00007ffe872aa418 error 14 in a.out[400000+1000]

The '(null)' bit turns out to be expected; it's what the general kernel printf() function generates when asked to print something as a pointer and it's null (as seen here). In our case the instruction pointer is 0 (null) because we've made a subroutine call through a null pointer and thus we're trying to execute code at address 0. I don't know why the 'in ...' portion says that we're in the executable (although in this case the call actually was there).

The error code of 14 is in hex, which means that as bits it's 010100. This is a user mode read of an unmapped area (our usual '4' case), but it's an instruction fetch, not a normal data read or write. Any error 14s are a sign of some form of mangled function call or a return to a mangled address because the stack has been mashed.

(These bits turn out to come straight from the CPU's page fault IDT.)

For 64-bit x86 Linux kernels (and possibly for 32-bit x86 ones as well), the code you want to look at is show_signal_msg in fault.c, which prints the general 'segfault at ..' message, do_trap and do_general_protection in traps.c, which print the 'trap ...' messages, and print_vma_addr in memory.c, which prints the 'in ...' portion for all of these messages.

Sidebar: The various error code bits as numbers

+1 protection fault in a mapped area (eg writing to a read-only mapping)
+2 write (instead of a read)
+4 user mode access (instead of kernel mode access)
+8 use of reserved bits in the page table entry detected (the kernel will panic if this happens)
+16 (+0x10) fault was an instruction fetch, not data read or write
+32 (+0x20) 'protection keys block access' (don't ask me)

Hex 0x14 is 0x10 + 4; (hex) 6 is 4 + 2. Error code 7 (0x7) is 4 + 2 + 1, a user-mode write to a read-only mapping, and is what you get if you attempt to write to a string constant in C:

char *ex = "example";
int main(int argc, char **argv) {
   *ex = 'E';
}

Compile and run this and you will get:

a.out[8832]: segfault at 400540 ip 0000000000400499 sp 00007ffce6831490 error 7 in a.out[400000+1000]

It appears that the program code always gets loaded at 0x400000 for ordinary programs, although I believe that shared libraries can have their location randomized.

PS: Per a comment in the kernel source, all accesses to addresses above the end of user space will be labeled as 'protection fault in a mapped area' whether or not there are actual page table entries there. The kernel does this so you can't work out where its memory pages are by looking at the error code.

(I believe that user space normally ends around 0x7fffffffffff, per mm.txt, although see the comments about TASK_SIZE_MAX in processor.h and also page_64_types.h.)

KernelSegfaultMessageMeaning written at 01:40:07


What the Linux rcu_nocbs kernel argument does (and my Ryzen issues again)

It turns out that my Linux Ryzen kernel hangs appear to be a known bug or issue (Ubuntu, Fedora, kernel); more fortunately, people have found magic incantations that appear to work around the issue. Part of the magic is some kernel command line arguments, usually cited as:

rcu_nocbs=0-N processor.max_cstate=1

(where N is the number of CPUs you have minus one.)
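The '0-N' part is just an ordinary kernel CPU range list covering all of your CPUs, so for a 16-CPU machine you'd write rcu_nocbs=0-15. As a trivial sketch of the arithmetic (the helper name is my own invention):

```python
def rcu_nocbs_arg(ncpus):
    """Build the rcu_nocbs=0-N kernel argument for a machine with ncpus CPUs."""
    if ncpus < 1:
        raise ValueError("need at least one CPU")
    if ncpus == 1:
        return "rcu_nocbs=0"
    # N is the number of CPUs minus one
    return "rcu_nocbs=0-%d" % (ncpus - 1)
```

On the machine you're preparing the command line for, you might feed it os.cpu_count().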

Magic incantations that I don't understand bug me, especially when they seem to be essential to keeping my system from hanging, so I had to go digging.

What processor.max_cstate does is relatively straightforward. As briefly mentioned in kernel-parameters.txt, it limits the C-states (also, also) that Linux will allow processors to go into. Limiting the CPU to C1 at most doesn't allow for much idling and power saving; it might be safe to go as far as C5, since the usual additional advice is to disable C6 in the BIOS (if your BIOS supports doing this). On the other hand, I don't know if Ryzens do anything between C1 and C6.

The rcu_nocbs parameter is more involved (and mysterious). To more or less understand it, we need to start with Read-Copy-Update (RCU) (also Wikipedia). To simplify, RCU handles updates to shared data structures by setting up a new version of the data structure, changing a master location to point to it instead of the old version, and then waiting for everyone to have passed a synchronization point where they're guaranteed to be using the new version instead of the old version. At that point you know the old version is unused and you can free it.

The Linux kernel's main RCU code handles the RCU algorithm for you but it doesn't know how to free up your data structures. For that it relies on RCU callbacks that you give it; when RCU determines that the old version of your data structure can be disposed of, it will invoke your callback to do this. Normally, RCU callbacks are invoked in interrupt context as part of software interrupt (softirq) handling. Various people didn't like this because softirqs preempt whatever happens to be running at the time whenever an appropriate interrupt happens, so people came up with an alternate approach of having these potentially quite time-consuming RCU callbacks handled by regularly scheduled kernel threads instead. This is said to 'offload' RCU callbacks to these threads. Each offloaded CPU gets its own set of RCU offload kernel threads, but these kernel threads can run on any CPU, not just the CPU they're offloading.

This is what rcu_nocbs controls; it's a list of the CPUs in your system that should have their RCU callbacks offloaded to threads. Normally, people use it to fence off a few CPUs from the random interruptions of softirq RCU callbacks.

(See here and here for more information and details.)

However, the rcu_nocbs=0-N setting we're using specifies all CPUs, so it shifts all RCU callbacks from softirq context during interrupt handling (on whatever specific CPU involved) to kernel threads (on any CPU). As far as I can see, this has two potentially significant effects, given that Matt Dillon of DragonFly BSD has reported an issue with IRETQ that completely stops a CPU under some circumstances. First, our Ryzen CPUs will spend less time in interrupt handling, possibly much less time, which may narrow any timing window required to hit Matt Dillon's issue. Second, RCU callback processing will continue even if a CPU stops responding to IPIs, although I expect that a CPU not responding to IPIs is going to cause the Linux kernel various other sorts of heartburn.

(Unfortunately, Matt Dillon's issue doesn't correspond well with the observed symptoms, where Ryzens hang under Linux not while busy but while idle. My kernel stack backtraces do suggest that at least one CPU is spinning waiting for its IPI to other CPUs to be fully acknowledged, though, so perhaps there is a related problem. Perhaps there are even several problems.)

KernelRcuNocbsMeaning written at 23:54:55


Doing something when a Cinnamon-based laptop suspends or hibernates

I've used encrypted SSH keys for some time. To make this tolerable (and even convenient), I load the keys into ssh-agent. On my desktop, I make this more secure by automatically flushing the keys when I lock the screen (details here). To get a similar effect on my laptop, I want to flush the keys before it suspends (suspending the laptop is roughly the equivalent of locking the screen on a desktop; it's almost always what I do before I walk away from it). I also want to force-close any lingering SSH connection masters, because it's pretty likely that my network connection will be different when I un-suspend the laptop (and in any case, the server end will probably have timed out).

In the beginning I did this by hand with a shell script or two. I usually remembered to run it before I suspended (my custom Cinnamon environment made it not too hard), but not always, and it was kind of a pain in general. Then I found out that it's possible to hook into the modern suspend process to automate this.

The important magic is that there is a standard freedesktop DBus signal that is emitted when your modern DBus and systemd-enabled system is about to suspend or hibernate itself. The DBus details are covered in this unix.se answer (via), and I simply copied and modified the Python code from David Newgas' av program to make something I call presusp.py. My version does not have a start() action to do things after the laptop un-suspends, and its shutdown() action simply runs my shell scripts that drop SSH keys and clean up shared SSH sessions. If I used my Yubikey more on Cinnamon (which is possible), I'd also run my script to drop the Yubikey from ssh-agent (covered here).
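
The overall shape of this is simple enough to sketch. The following is not my actual presusp.py, just a minimal illustration of the same idea (it assumes dbus-python and PyGObject are installed, and the cleanup command here is an illustrative stand-in for my real shell scripts):

```python
# React to logind's PrepareForSleep signal: flush ssh-agent keys on the
# way down to suspend. A sketch, not my real presusp.py.
import subprocess

def cleanup_commands(going_down):
    # PrepareForSleep fires with True just before suspend/hibernate and
    # with False on resume; my version only acts on the way down.
    if not going_down:
        return []
    return [["ssh-add", "-D"]]  # flush all keys from ssh-agent

def on_prepare_for_sleep(going_down):
    for cmd in cleanup_commands(going_down):
        subprocess.call(cmd)

def main():
    import dbus
    import dbus.mainloop.glib
    from gi.repository import GLib

    dbus.mainloop.glib.DBusGMainLoop(set_as_default=True)
    bus = dbus.SystemBus()
    bus.add_signal_receiver(
        on_prepare_for_sleep,
        signal_name="PrepareForSleep",
        dbus_interface="org.freedesktop.login1.Manager",
        bus_name="org.freedesktop.login1",
        path="/org/freedesktop/login1",
    )
    GLib.MainLoop().run()

# Call main() from your session startup to run this for the session's life.
```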

Because presusp.py is directly tied to my login session, I just start it in a shell script I already have that does various things to set up my Cinnamon session. This also terminates it when I log out, although usually if I'm going to log out I just power down the laptop.

(Logging out and then back in again has been somewhat flaky under Cinnamon for me.)

PS: According to the logind DBus API there's also a signal emitted before screen locking, although it comes from your session instead of the overall manager. If I cared enough, I could presumably hack up my presusp.py to flush keys even on screen lock. Right now this is too complicated for me to bother with, since I rarely lock the screen on my laptop and step away from it.

Sidebar: Restoring keys to ssh-agent after an unsuspend

Unlike on my desktop, I don't try to automatically re-add my encrypted keys to ssh-agent when the laptop un-suspends. Instead I have my .ssh/config set up with 'AddKeysToAgent yes', so that the first time I ssh somewhere it automatically adds the keys to the agent after prompting me to unlock and use them. This is a bit less convenient than on my desktop but it works well enough under the circumstances. It helps that I don't try to do fancy things with remote X clients on the laptop; mostly what I do is SSH logins in terminals.
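
The relevant stanza is short (shown here as a sketch; AddKeysToAgent also accepts 'ask' and 'confirm' if you want to be prompted each time):

```
# ~/.ssh/config
Host *
    AddKeysToAgent yes
```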

CinnamonActOnLaptopSuspend written at 00:50:48


A recent performance surprise with X on my Fedora Linux desktop

As I discussed yesterday, on Monday I 'upgraded' my office workstation by transplanting my system disks and data disks from my old office hardware to my new office hardware. When I turned the new hardware on, my Fedora install booted right up pretty much exactly as it had been (some advance planning made networking work out), I logged in on the text console, started up X, and didn't think twice about it. Modern X is basically configuration-free, and anyway both the old and the new hardware had Radeon cards (and the same connectors for my monitors, so my dual-screen setup wouldn't get scrambled). I even ran various OpenGL test programs to exercise the new card and see if it would die under far more demanding load than I expected to ever put on it.

(This wound up leading to some lockups.)

All of this sounds perfectly ordinary, but actually I left out an important detail that I only discovered yesterday. My old graphics card is a Radeon HD 5450, which uses the X radeon driver. My new graphics card is a Radeon RX 550, but things have changed since 2011 so it uses the more modern amdgpu driver. And I didn't have the amdgpu driver installed in my Fedora setup (like most X drivers, it's in a separate RPM of its own), so the X server was using neither the amdgpu driver (which it didn't have) nor the radeon driver (which doesn't support the RX 550).

The first surprise is that X worked anyway and I didn't notice anything particularly wrong or off about my X session. Everything worked and was as responsive as I expected, and the OpenGL tests I ran seemed to go acceptably fast (as did a full-screen video). In retrospect there were a few oddities that I noticed as I was trying things due to my system hangs (xdriinfo reported no direct rendering and vdpauinfo spat out odd errors, for example), but there was nothing obvious (and glxinfo reported plausible things).

The second surprise is what X was actually using to drive the display, which turns out to be something called the modesetting driver. This driver is a quite basic one that relies on kernel mode setting but is otherwise more or less unaccelerated. Well, sort of, because modesetting was apparently using glamor to outsource some rendering to OpenGL, in case you have hardware accelerated OpenGL, which I think that I did in this setup. I'm left unsure of how much hardware acceleration I was getting; maybe my CPU was rendering 24-bit colour across two 1920x1200 LCDs without me noticing, or maybe a bunch of it was actually hardware accelerated even with a generic X driver.
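
If you want to check this on your own system, the X server's log records which drivers it loaded and whether glamor came up. A sketch (the log's location depends on whether X runs as root or as your user):

```shell
# Look for driver and glamor lines in the Xorg log; the path varies
# between /var/log (X started as root) and ~/.local/share/xorg (as user).
grep -h -iE 'modeset|glamor|amdgpu|radeon' \
    /var/log/Xorg.0.log ~/.local/share/xorg/Xorg.0.log 2>/dev/null \
    || echo 'no matching Xorg log lines found'
```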

(There is a tangled web of packages here. I believe that the open source AMD OpenGL code is part of the general Mesa packages, so it's always installed if you have Mesa present. But I don't know if the Mesa code requires the X server to have an accelerated driver, or if a kernel driver is good enough.)

PS: Kernel mode setting was available because the kernel also has an amdgpu driver module that's part of the DRM system. That module is in the general kernel-modules package, so it's installed on all machines and automatically loaded whenever the PCI IDs match.
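
Checking the kernel side is straightforward (a sketch; on a machine without a matching AMD GPU the module simply won't have been loaded):

```shell
# Is the amdgpu kernel module loaded, and which kernel driver is
# actually bound to the graphics device?
grep -w amdgpu /proc/modules || echo 'amdgpu not loaded'
command -v lspci >/dev/null && lspci -k | grep -A3 -i 'vga' || true
```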

PPS: Given that I had system lockups before I installed the X server amdgpu driver, the Fedora and freedesktop bugs are really about a kernel bug in the amdgpu kernel driver. Perhaps this is unsurprising and already known.

XBasicDriverPerfSurprise written at 00:54:49


My new Ryzen desktop is causing Linux to hang (and it's frustrating)

Normally I try to stick to a sunny tone here. Today is an unfortunate exception, since it's a problem that I'm very close to and that I have no solution for.

Last Friday, I assembled the hardware for my new office workstation, updated the BIOS, and over the weekend let it sit turned on with a scratch Fedora install, periodically running some burnin tests like mprime. Everything went fine. Normally I would probably have let the assembled machine sit for a while and do additional burnin tests before doing anything more, but over the weekend my current (old) workstation showed worrying signs of more flakiness, so on Monday I swapped my disks over to the new Ryzen-based hardware. Everything came up quite easily and it all looked good (and clearly faster), right up until the machine started locking up. At first I thought I had a culprit in the amdgpu kernel driver used by the new machine's Radeon RX 550 based graphics card, and I turned up a Fedora bug with a workaround. Unfortunately that doesn't appear to be a complete fix, because the machine has hung several times since then. For the latest hangs I've had netconsole enabled, and I've actually gotten output; unfortunately this has just made things more frustrating, because it is just a steady stream of 'watchdog: BUG: soft lockup - CPU#4 stuck for 23s!' reports.

(These reports are interesting in a way, because apparently the system is not so stuck that it cannot increment the timer. However, it is so stuck that it doesn't respond to the network or to the console, and it doesn't seem to notice a console Magic SysRq.)
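
For what it's worth, here's roughly how I'd set up these debugging aids (a sketch that defines but deliberately doesn't run the setup; the IP addresses, interface name, and MAC here are made-up examples):

```shell
# Run enable_debug_helpers on the flaky machine, and 'nc -u -l 6666'
# on the machine that's receiving the kernel messages.
enable_debug_helpers() {
    # Allow all Magic SysRq functions from the console keyboard.
    sysctl kernel.sysrq=1
    # Stream kernel messages over UDP to another machine.
    modprobe netconsole \
        netconsole=6665@192.168.1.7/enp7s0,6666@192.168.1.2/00:11:22:33:44:55
}
```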

In both sets of netconsole traces I've collected so far, I see things running through cross-CPU communication and often TLB stuff in general and specifically native_flush_tlb_others. For example:

Call Trace:
 ? __vma_rb_erase+0x1f1/0x270

This is interesting because of an old reddit post that blames this on 'Core C6 State', and there's also this bug report. On the other hand, the machine sat idle all weekend and didn't hang; in fact, it would have been more idle on the weekend than it was when it hung recently. However, I'm grasping at straws here.

(There's also this Ubuntu bug report, which has a long discussion of tangled and complicated options to work around things, and this Fedora one. Probably there are others.)

The sensible thing to do right now is probably to swap my disks back into my old hardware until I have more time to deal with the problem (I have to get things stabilized tomorrow). But it is tempting to grasp at a number of straws:

  • swap the RX 550 card out for the very basic card in my old machine. This should completely eliminate both amdgpu and the new GPU hardware itself as a source of issues.

  • switch back to a kernel before CONFIG_RETPOLINE, because I use a number of out of tree modules and I've noticed their build process muttering about my gcc not having the needed support for this. I'm using the latest Fedora released gcc and you'd hope that that would be good enough, but I have no idea what's going on.

  • go through the BIOS to turn off 'Core C6 State' and any other fancy tuning options (and verify that it hasn't decided to silently turn on some theoretically mild and harmless automatic overclocking options). It's possible that the BIOS is deciding to do things that Linux objects to, although I don't know why it would have started to fail only once I swapped disks around. (The paranoid person wonders about UEFI versus MBR booting, but I'm not sure I'm that paranoid.)

(If I did all of this and the machine hung anyway, well, I'd be able to swap my disks back into my old desktop with no regrets.)

In the longer term, troubleshooting this and reporting any issues is probably going to be quite complicated. One of the problems is that I absolutely have to have one out of kernel module (ZFS on Linux) and I very much want another one (WireGuard). I suspect that the presence of these will cause any bug reports to be rejected more or less out of hand. In an ideal world this problem will reproduce itself on a scratch Fedora install with a stock kernel environment that's doing things like running graphics stress programs, but I'm not going to hold my breath on that. It seems quite possible that it will only happen if I'm actually using the machine, which has all sorts of problems.

(I have one complicated idea but it is complicated and rather annoying.)

The whole thing is frustrating and puzzling. We have a stable Ubuntu machine with a Ryzen 1800X and the same motherboard (but a different GPU), and this machine itself seemed fine right up until I swapped in my existing disks. Even post-swap it was perfectly fine with a six+ hour mprime -t run last night. But if I use it, it hangs sooner or later, and it now seems to be hanging even when I don't use it.

(And it appears that this motherboard doesn't have a hardware watchdog timer that's currently supported by Linux. I tried enabling the software watchdog, but it didn't trigger for literally hours, and when it finally did, it apparently didn't manage to actually reboot the system, which is perhaps not too surprising under the circumstances.)

PS: This does put a rather large crimp in my Ryzen temptation, especially if this is something systemic and widespread.

Sidebar: It's possible that I've had multiple issues

I may have hit both an amdgpu issue with Radeon RX 550s, which I've now mitigated, and some sort of issue with the BIOS putting a chunk of the machine to sleep and then Linux not being able to wake it up again. My initial hangs definitely happened while I was in front of the machine actively using it, but I believe that the hangs since I set amdgpu.dpm=0 this morning have been when I wasn't around the machine and it was at least partially idle. These are the only hangs that I have netconsole logs for, too, and they show that the machine is partially alive instead of totally hung.

RyzenMachineLinuxHangs written at 02:19:28


Some plans for migrating my workstation from MBR booting to UEFI

Today I finally assembled the machine that is to be my new office workstation, although I wasn't anywhere near brave enough to change over to using it on a Friday. For the most part I expect the change over to be pretty straightforward, because I'm just going to transplant all of my current disks into it (and then I'll have to make sure the network interface gets the right name, although I have a plan). However, I want to switch my Fedora install from its current MBR based booting to UEFI booting (probably with Secure Boot turned on), because it seems pretty clear that UEFI booting is the way of the future (my toe-stubbing notwithstanding).

My current set of SSD root disks are partitioned with GPT and have an EFI system partition, although the partition is unused and unmounted; while I have a /boot/efi directory with some stuff in it, that's part of /boot and the root filesystem. In order to switch over to UEFI booting, I think I need to arrange for the EFI system partition to take over /boot/efi, get populated with the necessary contents (whatever they are), and then set up an UEFI boot entry or two. But the devil is in the details.

Currently I have the machine set up with a Fedora 27 install on a scratch disk, so I can run burnin tests and similar things. This helpfully gives me a live Linux that I can poke around on and save things from, although it also means that I have pre-existing UEFI boot entries that are going to become invalid the moment I remove the scratch disk.

So I think that what I want to do is something like this:

  1. Copy all of /boot/efi from the current scratch install to an accessible place, so I can look at it easily in case I need to.
  2. Make sure that I have all of the necessary and relevant EFI packages installed on my machine. Based on the scratch install's package list, I'm definitely missing some (eg grub2-efi-x64), but I'm not sure if all of them are essential (eg mactel-boot). I might as well install everything, though.

    (It appears that all files in /boot/efi on my scratch install are owned by RPMs, which is what I'd hope.)

  3. Turn off Secure Boot in the new machine's BIOS and enable MBR booting. My motto is 'one problem at a time', so I'd like to move my disks over to the new machine and sort out the inevitable problems without also having to wrangle the UEFI boot transition.

  4. Figure out the right magic way to format my existing EFI system partition in the EFI-proper way, because apparently I never did that. Naturally, Arch's wiki has good information.
  5. Mount the EFI system partition somewhere, copy all of my current /boot/efi to it, and shuffle things around so it becomes /boot/efi.
  6. Either copy my current grub.cfg to /boot/efi/EFI/fedora and edit it up or try to (re)generate a completely new one through some magic command. Probably I should start with what grub2-mkconfig produces when run from scratch and see how different it is from my current one (hopefully it can be persuaded to produce an EFI-based configuration even though my system was MBR booted).

  7. Set up a UEFI boot entry for Fedora on my EFI system partition. As I found out, this should boot shimx64.efi. I'm inclined to try to do this by modifying the existing 'Fedora' boot entry instead of deleting it and recreating it; in theory probably the only thing that needs to change is the partition GUID for the EFI system partition.
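
Put as concrete commands, steps 4 through 7 look roughly like this (a sketch only; the function is defined but deliberately not invoked, since mkfs.fat is destructive, and the device name, partition number, and shim path are illustrative):

```shell
# Check everything against your own disks with lsblk before adapting this.
format_and_register_esp() {
    mkfs.fat -F32 /dev/sda1      # step 4: give the ESP a proper FAT32 fs
    mount /dev/sda1 /boot/efi    # step 5: mount it where Fedora expects
    # step 7: create a UEFI boot entry pointing at shim
    efibootmgr -c -d /dev/sda -p 1 -L Fedora -l '\EFI\fedora\shimx64.efi'
}
```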

Assuming that I got everything right, at this point I should be able to boot my machine through UEFI instead of through MBR booting. I'm not sure if the motherboard's BIOS defaults to UEFI boot or if I'll have to use the boot menu, but either way I can test it. If the UEFI boot works I can turn on Secure Boot, at which point I will be UEFI-only.

I think the most likely failure point is getting a working UEFI grub.cfg. Grub2-mkconfig somehow knows whether or not you're using UEFI (and there doesn't seem to be any command line option to control this), plus my current grub.cfg is fairly different from what the Fedora 27 grub2-mkconfig generates in things like what grub2 modules get loaded. Perhaps this is a good time to figure out and write down what sort of changes I want to make to the stock grub2-mkconfig result, or perhaps I should just abandon having a custom version in general.

(I don't think my custom version is doing anything vital; it's just got a look I'm used to. And I could test a non-custom version on my current machine.)
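
Doing that comparison is cheap, since grub2-mkconfig can write its output anywhere (a sketch; this doesn't touch the real grub.cfg, and the path to the live config is the MBR-boot location on my current machine):

```shell
# Generate a throwaway config and diff it against the live one.
grub2-mkconfig -o /tmp/grub-test.cfg 2>/dev/null \
    && diff -u /boot/grub2/grub.cfg /tmp/grub-test.cfg | head -50 \
    || echo 'grub2-mkconfig not available here'
```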

I have two system SSDs, so I have an EFI system partition on the second SSD as well. In the long run I should set up an EFI boot environment on it as well, but that can definitely wait until I have UEFI booting working in general. I'll also need to worry about keeping at least grub.cfg in sync between the two copies. Or maybe I should just stick to having the basic EFI shell and Grub2 boot environment present, and assume that if I have to boot off the second SSD I'll need to glue things together by hand.

(My root filesystem is mirrored, but I obviously can't do that with the EFI system partition. Yes, technically I might be able to get away with it with the right choice of software RAID superblock, but no, I'm not even going to try to go there.)

MBRToUEFIBootChallenge written at 02:04:38
