Wandering Thoughts

2018-04-15

The unfortunate configuration choice Grub2 makes in UEFI configurations

When I talked about my new home machine, I mentioned that I wasn't even trying to use UEFI on it after my experiences with my work machine, fundamentally because Grub2's UEFI setup makes an unfortunate configuration choice. Today I'm going to talk about what that choice is and why it's an unfortunate one, at least for people like me (and for people with servers).

Put simply, it's that the UEFI Grub2 requires you to put grub.cfg in the EFI system partition. On the one hand, this is a reasonable choice, and probably one that simplifies Grub's life. In a UEFI environment, the EFI system partition exists and is easily accessible through (U)EFI services, so you have a natural place to put everything else that Grub2 needs, including both its additional modules and grub.cfg. In a traditional non-UEFI environment, Grub2 needs to do some magic to be able to load your grub.cfg from whatever sort of filesystem and RAID setup and so on your /boot is in; the actual mechanisms of that are relatively impressive and more than a bit complex (cf). UEFI makes life much simpler.

On the other hand, grub.cfg is the one piece of your Grub2 configuration that changes frequently, because it gets updated every time you add a kernel, remove a kernel, or modify kernel command line arguments. This is an issue for me and for people with servers, because the EFI system partition can't be part of a RAID mirror. If you want to be able to boot from a second disk if your primary disk fails, you need a duplicate copy of your EFI system partition, and because grub.cfg changes frequently, you need to keep this up to date on a frequent basis. Otherwise, not only will you perhaps be booting from an older (but functional) version of Grub2 that you didn't update, you'll probably be trying to boot kernels that don't even exist any more (perhaps with wrong or missing kernel arguments). You'll probably be able to boot your system in the end, but it's not likely to be easy or automatic.

Life would be a lot easier and better here if you could configure Grub2 to load your real grub.cfg from your (non-EFI) /boot. You could use software RAID, LVM, and any filesystem normally supported by Grub2. People with mirrored system disks would still have all of the good stuff that you get with them in a non-EFI configuration (although every so often you'd need to remember to update the stuff in your EFI system partition on the second disk).

My guess is that the easiest way to add this to Grub2 would be to give Grub2 some way of including additional files in grub.cfg. With this, you'd still have a stub grub.cfg in the EFI system partition (which Grub2 could load with UEFI services just like today), and this stub would specify everything else. It would know the UUID of the filesystem with your /boot in it and also what additional RAID or LVM UUIDs it needed to look for and start in order to find it, just as a non-EFI Grub2 knows those details today (cf), but these wouldn't change very often so your EFI system partition grub.cfg would stay mostly unchanging.
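
Such a stub might look something like the following sketch. Grub2 already has the 'insmod', 'search', and 'configfile' commands this uses, but the module list and the UUID here are made up for illustration:

# hypothetical stub grub.cfg in the EFI system partition
insmod part_gpt
insmod mdraid1x
insmod ext2
# find the filesystem holding the real /boot by its UUID (a made-up
# one here) and hand over to the grub.cfg stored there
search --no-floppy --fs-uuid --set=root 0123abcd-5678-90ef-1234-567890abcdef
configfile /grub2/grub.cfg

Nothing in a stub like this would change as kernels come and go; only the real grub.cfg over in /boot would.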

Of course this Grub2 configuration choice isn't important unless you have mirrored system disks. If your system disk is unmirrored, an unmirrored EFI system partition creates no additional problems and the current UEFI Grub2 design is fine. Since this probably describes most systems using UEFI today, I don't expect UEFI Grub2 to change any time soon. Probably it will only start to change when servers become UEFI-only and people running them discover that their mirrored, redundant system disks actually aren't any more because of this.

Grub2UEFIBigMistake written at 03:07:46

2018-04-09

Power consumption numbers for my 2018 home and work machines

It's been a while since my previous set of power consumption numbers for a desktop, which I made back in 2011 for my 2011 home (and office) machine. This time around, I have two slightly different machines and two sets of power numbers. My office machine is a Ryzen 1800X with a Radeon RX 550 graphics card; my home machine is an Intel i7-8700K using its integrated graphics.

Unless otherwise stated, all of the following power measurements are with X running, my normal desktop environment, and the screen unlocked. This turns out to matter with the Radeon RX 550. All figures are somewhat approximate.

                                                   Ryzen           Intel
powered off                                        1-2 watts       0 watts
in the BIOS's fancy 'EZ-Mode' screen, where the
  BIOS displays various hardware monitoring
  graphics                                         ~83 watts       ~56 watts
idle in Linux with the screen blanked and my
  LCD(s) in power save mode                        56 watts        ~40 watts
idle in Linux with the screen unlocked             66 watts        40 watts
a single core busy with a simple CPU soaker        92 watts        67 watts
four real cores (not HyperThread/SMT pairs)
  busy with a simple CPU soaker                    121 watts       103-104 watts
all cores busy with a simple CPU soaker            194 watts       139 watts
running mprime -t (it uses all cores)              200 watts       174 watts
running GpuTest's 'fur' test at 1024x800           126-136 watts   73 watts
mprime -t plus GpuTest                             258 watts       185 watts
compiling Firefox's development tree from SSDs     165-185 watts   144-154 watts
playing a fairly dynamic 1920x1080 video full
  screen with mplayer (using VDPAU)                77 watts        45 watts

(Based on power draw, GpuTest's 'fur' test appears to be more GPU intensive than glmark2. The mprime numbers are for relatively early in its test run, because I only have so much patience.)

On the Ryzen machine, blanking the screen and getting the Radeon RX 550 to put the LCDs into DPMS power saving mode consistently saves about 10 watts while doing any of the CPU-focused things (one exception is compiling Firefox, where it saves more, perhaps because Firefox's build system normally sprays a ton of text over your terminal). With the Intel's integrated 'UHD' graphics there appear to be basically no power savings to be had there, at least in the current Fedora 27 environment. The Radeon RX 550 is driving two LCD panels and the Intel is only driving one, but I hope that doesn't make a big difference.

(They're all 1920x1200 Dell U2412M LCDs.)

Regardless of exactly why, it's clear that the Ryzen based system uses substantially more power than the Intel one. The gap would be smaller if the Intel had a graphics card, but it seems likely that there would still be one (one source says one RX 550 card drew almost 7 watts at idle, although it's not clear if that's with the LCD panel in power saving mode). The power draw difference rises as the CPU usage goes up, too, which suggests that it's not just the GPU.

I have an earlier set of Ryzen power measurements from when I had to use the 'amdgpu.dpm=0' kernel command line parameter to prevent system hangs (sort of covered here). Based on the power usage in them, amdgpu.dpm=0 on my hardware appears to put the Radeon in its lowest power state and avoid the 10 watt power usage for just lighting up the display. Running GpuTest added about 10 watts of system power draw and presumably didn't perform anywhere near as well as it does now (I didn't save performance information).

These two machines have slightly different sets of disks, which may affect their relative power usage; the Ryzen has two SSDs and three HDs, while the Intel has two SSDs and two HDs, although they're 3TB full-height drives and thus perhaps more power hungry. Unlike last time, I've decided that HD power usage isn't interesting enough to measure and report, especially since my important data is on SSDs now.

Compared to my 2011 machine, I think that both the Ryzen and the Intel have improved, although things are more mixed for the Ryzen (partly due to the increased power demands of the graphics card). The Ryzen is almost even at idle or low load, while the Intel i7-8700K is clearly better (despite still being a 95W TDP Intel CPU). At higher load, such as an all-core CPU soaker, the Intel is a clear winner and the Ryzen sort of pulls ahead at comparable levels (eg four CPUs to four CPUs). At truly full power both probably draw more than my 2011 machine (I don't have figures for mprime or GpuTest for it), but they deliver far more performance for that.

(The Intel runs all 12 CPUs of the CPU soaker at 139 watts, compared to 154 watts for 4 CPUs in the i5-2500 in my old machine.)

PS: If there's additional power draw figures of interest, ask; I'll have the Kill-A-Watts on both machines for a bit longer.

Sidebar: A tale of CPU soakers and power draws

I didn't keep notes, so I don't know for sure what I ran as a CPU soaker in my 2011 measurements. But I can guess, because I'm lazy and I would have kept any special tools I wrote for this (and I can't find any sign of them). So I'm pretty sure that back in 2011, I soaked CPU by just using a do-nothing while loop in my usual shell (ie, 'while (;) {;}'). With only four CPUs at most, it was feasible and reasonable to test multiple CPUs by just opening more xterm windows. Here in 2018, though, I wound up writing a very simple CPU soaker in Go; the heart of it is an endless loop that increments an integer.

(I wrote a Go program for this because I wanted it to automatically use all of the CPUs available and also to easily control how many CPUs it used. Go offers convenient features for both.)
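
The heart of it is small enough to sketch here; this is a reconstruction with the same shape as my program, not necessarily its exact code:

package main

import (
	"flag"
	"runtime"
)

// spin burns one CPU by incrementing an integer forever.
func spin() {
	var i uint64
	for {
		i++
	}
}

func main() {
	// default to soaking all CPUs; -n soaks only some of them
	n := flag.Int("n", runtime.NumCPU(), "number of CPUs to soak")
	flag.Parse()
	runtime.GOMAXPROCS(*n)
	for j := 0; j < *n; j++ {
		go spin()
	}
	select {} // block forever while the goroutines spin
}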

Somewhat to my surprise, it turns out that this integer Go CPU soaker has a clearly different power impact than my shell while loop. Not only that, my while loop has a different power impact in rc than in Bash. The numbers above are for the rc version, so here are the numbers for all three, on both platforms, for one CPU, four CPUs, and all CPUs.

  • Go eatcpu: on the Ryzen, 83 watts, 96 watts, and 130 watts; on the Intel, 60 watts, 78 watts, and 97 watts.
  • while in rc: on the Ryzen, 92 watts, 121 watts, and 194 watts; on the Intel, 67 watts, 103-104 watts, and ~139 watts.
  • while in Bash: on the Ryzen, ~91 watts, 117-118 watts, and 182-183 watts; on the Intel, 71-72 watts, 115-116 watts, and ~155 watts.

(Note that neither shell while loop does any system calls, including fork(); everything going on is happening within the main shell process.)

In a way this is not a surprise. The Go version must be a very tight loop that quite possibly executes entirely out of the CPU's cache of decoded micro-instructions. The rc and especially the Bash version likely involve much more machine code and thus require many more pieces of the CPU to be powered up and doing things. But the interesting bit of this is the relative power difference between Bash and rc on the Ryzen, where the while loop in Bash actually appears to draw less power. I have no explanation for this.

PowerConsumptionV written at 23:08:56

2018-04-03

Sorting out my systemd mistake with a script-based service unit

Back in November I wrote about a systemd mistake I made with a script-based service unit, where I left out some service options and got a surprise when my service didn't work. A commentator recently made me realize that I didn't really understand what was going on and what had happened; instead I was working by superstition. So I've now done some experiments and read the systemd.service manpage again, and here's what I know.

The basic situation was that I wrote a .service file that had just this, where ExecStart and ExecStop are scripts that just run briefly and then exit:

[Service]
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C

(In this situation, you get systemd's defaults: an implicit Type=simple and RemainAfterExit=no.)

If you don't have RemainAfterExit and your ExecStart exits with status 0, your service becomes inactive (as opposed to failing to start). If you have an ExecStop, systemd will then run it, even though you haven't explicitly asked for a 'stop' operation; in my situation this mysteriously reversed the effects of my start script. That this happens is unfortunately not clearly documented anywhere that I could see, although it makes a certain amount of sense if you consider ExecStop to be for cleanup actions, where you often want the cleanup actions to happen if the service started successfully and then stops, regardless of just why the service stopped.

(Looking through the stock Fedora 27 systemd .service units, quite a lot of the ExecStop actions appear to be this sort of cleanup, not 'signal the service to shut down' actions.)

It's easy to see this with a test service that just runs some scripts. You'll get output from 'systemctl status yourtest.service' that looks like this:

    Active: inactive (dead) since Mon 2018-04-02 15:37:19 EDT; 41s ago
   Process: 973 ExecStop=/root/stop-script (code=exited, status=0/SUCCESS)
   Process: 964 ExecStart=/root/start-script (code=exited, status=0/SUCCESS)
  Main PID: 964 (code=exited, status=0/SUCCESS)

The ExecStart script ran, was considered the main PID, exited with status 0, and then shortly afterward the ExecStop script was run (since it has a PID only a bit higher, and the start script ran a couple of commands).

Contrary to what I thought in my first entry, the Type=oneshot doesn't affect this as such. As my commentator noted, what Type=oneshot instead of Type=simple really affects is when other units will get started. If you have test.service with the implicit Type=simple, and another service says 'After=test.service', your other service will get started the moment that systemd has started running test.service's ExecStart. This is often not what you want; instead you want things that depend on test.service to only start when its ExecStart has finished preparing things and exited. That's what Type=oneshot enforces, by making it so that test.service is only considered 'started' when your ExecStart program or script exits. Systemd does more or less document this, at the end of the description of Type=:

If set to simple [...], as systemd will immediately proceed starting follow-up units.

[...]

Behavior of oneshot is similar to simple; however, it is expected that the process has to exit before systemd starts follow-up units. [...]

(This is not particularly clearly written, unfortunately. Energetic people can propose a documentation patch in the master repo.)

As the documentation notes, Type=oneshot probably mostly requires that you use RemainAfterExit=yes, because otherwise the service won't be considered to be active. Certainly things will be much less confusing if you use it, because then all of the units involved will stay 'active' and you won't ever have the experience of wondering why something is up and running despite a dependency having failed.
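
Putting this all together, what my unit file should have looked like is something like this (my original paths, with the two settings this entry is about added):

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/var/local/wireguard
ExecStart=/var/local/wireguard/startup
ExecStop=/var/local/wireguard/stop
Environment=LANG=C

With this, the service stays 'active' after the start script exits, the stop script only runs on an actual 'systemctl stop' (or the equivalent), and anything ordered After= this unit waits until the start script has finished.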

(After= doesn't actually create a dependency, of course, just an ordering. But that's another entry.)

SystemdScriptServiceFumbleII written at 00:01:30

2018-03-31

My new Linux home machine for 2018

Back in the fall I planned out a new home machine and then various things happened, especially Meltdown and Spectre (which I feel made it not a great time to get a new CPU) but also my natural inertia. This inertia sort of persisted despite a near miss scare, but in the end I wound up losing confidence in my current (now old) home machine and just wanting to get things over with, so I bit the bullet and got a new home machine (and then I wound up with questions on its graphics).

(There will be better CPUs in the future that probably get real performance boosts from hardware fixes for Meltdown and Spectre, but there are always better CPUs in the future. And on my home machine, I'm willing to run with the kernel mitigations turned off if I feel strongly enough about it.)

The parts list for my new machine is substantially the same as my initial plans, but I made a few changes and indulgences. Here's the parts list for my future reference, if nothing else:

Intel Core i7-8700K
This is the big change from my initial plans, where I'd picked the i7-8700. It's an indulgence, but I felt like it, and using an overclocking-capable CPU did eliminate most of my concerns over high-speed DDR4 memory. I'm not overclocking the 8700K because, well, I don't. After I didn't run into any TDP issues with my work machine and its 95W TDP Ryzen, I decided that I didn't have any concerns with the 8700K's 95W TDP either.

(My concerns about the TDP difference between the 8700K and the 8700 turn out to be overblown, but that's going to be another entry.)

Asus PRIME Z370-A motherboard
I saw no reason to change my initial choice (and it turns out Phoronix was quite positive about it too, which I only found out about later). The RGB LEDs are kind of amusing and they even show a little bit through some of the air vents on the case (especially in the dark).

2x 16GB G.Skill Ripjaws V DDR4-3000 CL15 RAM
I can't remember if the G.Skill DDR4-2666 modules were out of stock at the exact point I was putting in my order, but they certainly had been earlier (when I assembled the parts list at the vendor I was going to use), and in the end I decided I was willing to pay the slightly higher cost of DDR4-3000 RAM just as an indulgence.

(Looking at the vendor I used, the current price difference is $4 Canadian.)

I opted not to try to go faster than DDR4-3000 because the RAM modules I could easily see stopped having flat CL timings above that speed; they would be eg CL 15-18-18-18 instead of a flat CL15. There were some faster modules with flat CL timings, but they were much more expensive. Since I think I care more about latency than slight bandwidth increases, I decided to stick with the inexpensive option that gave me flat latency.

Noctua NH-U12 CPU cooler
I quite liked the Noctua cooler in my work machine, so I just got the general version of it. I've been quite happy on my work machine not just with how well the Noctua cooler works, but also how quiet it is with the CPU under heavy load (and in fact it seems quite difficult to push the CPU hard enough that the Noctua's fan has to run very fast, which of course helps keep any noise down).

In short: real (and large-ish) CPU coolers are a lot better than the old stock Intel CPU coolers. I suppose this is not surprising.

(The Noctua is not inexpensive, but I'm willing to pay for quality and long term reliability.)

EVGA SuperNova G3 550W PSU
As with my work machine, the commentary on my original plans pushed me to getting a better PSU. Since I liked this in the work machine, I just got it again for the home one. It's an amusing degree of overkill for the actual measured power draw of either machine, but I don't really care about that.

(For my own future reference: always use the PSU screws that come with the PSU, not the PSU screws that may come with the case.)

Fractal Design Define R5 case
I got this in white this time around (the work version is black). The story about the colour shift is that we recently built a new Ryzen based machine for one of my co-workers and basically duplicated my work machine for the parts list, except his case is white because he didn't care and it was cheaper and in stock at the time. When his case came in and I got a chance to see it in person, I decided that maybe I liked the white version better than the black one and in the end my vacillation settled on the white one.

(When I pulled the trigger on buying the hardware, the two case colours were the same price (and both in stock). So it goes.)

LG GH24NSC0 DVD/CD Writer
I gave in to an emotional temptation beyond the scope of this entry and got an optical drive, despite what I'd written earlier. But I haven't actually installed it in the case; it's apparently enough that I could if I wanted to.

(It helps that powering the drive would be somewhat awkward, because the PSU doesn't come with SATA modular cables that are long enough or have enough plugs on them.)

I'm running the RAM at full speed by activating its XMP profile in the BIOS, which was a pleasant one-click option (without that it came up at a listed 2133 MHz). There didn't seem to be a one-click option for 'run this at 2666 MHz with the best timings it supports', so I didn't bother trying to set that up by hand. The result appears stable and the BIOS at least claims that the CPU is still running at its official normal speed rating.

(Apparently it's common for Intel motherboards to default to running DDR4 RAM at 2133 MHz for maximum compatibility or something.)

In general I'm quite happy with this new machine. It's not perfect (nothing seems to be under Linux), but everything works, it's pretty nice, and it's surprisingly fast. Honestly, it feels good simply to have a modern machine, especially one where I can't clearly hear the difference between an idle machine and one under CPU load.

(There's still a little bit of noise increase under full load, but it's pretty subtle and I have to really pay attention in a quiet environment to hear it. On my old machine, the stock Intel CPU cooler spinning up created a clearly audible difference in noise that I could hear over, eg, my typing. The old machine might have been improved by redoing the thermal paste, but my old work machine, with the redone paste, still had the same sort of audibility under load.)

After my experiences with UEFI on my work machine, I didn't try to switch from BIOS booting to UEFI on this one either (contrary to earlier plans for this). The two machines probably have somewhat different BIOSes (although their GUIs look very similar), but I didn't feel like taking the chance and I've wound up feeling that BIOS boot is better if you're using a mirrored root filesystem (which I am), fundamentally due to an unfortunate Linux (or Fedora) GRUB2 configuration choice.

PS: Given my experiences with Ryzen on my new work machine (eg, and), I wound up with absolutely no interest in going with a Ryzen home machine. This turned out to be a good choice in general, but that's another entry.

(The short version is on Twitter.)

HomeMachine2018 written at 02:44:54

2018-03-29

The problem with Linux's 'predictable network interface names'

I tweeted:

I find modern Linux auto-generated Ethernet device names to be a big pain, because they're such long jumbles. enp0s31f6? enp1s0f0? Please give me a break (and something short).

The fundamental problem with these 'predictable network interface names' is that they aren't. By that I mean that if you tell me that a system has, say, a motherboard 1G Ethernet port and a 10G-T Ethernet port on a PCIE card, I can't predict what those interfaces will be called unless I happen to have an exactly identical second machine that I can check. If I want to configure the machine's networking or ssh in to run a status check command on the 10G-T interface, I'm pretty much out of luck; I'm going to have to run ifconfig or some similar command to see what this machine has decided to call the interfaces.

(Yes, even for the motherboard network port, which may or may not show up as eno1 depending on the vagaries of life and your specific configuration. The enp0s31f6 from my tweet is such a port, and we have other machines where the motherboard port is, eg, enp7s0.)

The other problem with these names is that they're relatively long jumbles that are different from each other at various random positions (not just at the end). Names like this are hard to tell apart, hard to tell to people, and invite errors when you're working with them (because such errors won't stand out in the jumble). This might be tolerable if we were getting predictability in exchange for that jumble, but we aren't. If all we're going to get is stability, it would be nice to have names that are easier to deal with.

(We aren't even entirely getting stability, since PCI slot numbering isn't stable and that's what these names are based on.)

PS: There is a benefit to this naming scheme, which is that identical hardware will have identical names and you can freely transplant system disks (or system images) around between such hardware. If I have a fleet of truly identical machines (down to the PCIE cards being in the same physical slots), I know that enp1s0f0 on one machine is enp1s0f0 on every machine.

(Over the years, these device names have been implemented with somewhat different names by a number of different Linux components (eg). These days they come from udev, which is now developed as part of systemd in case you wish to throw the usual stones. I'm not sure if udev considers the specific naming scheme to be stable, considering the official documentation points you to the source code.)
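
As a side note, you can ask udev's net_id builtin directly what names it derives for a given interface (the interface name here is just an example):

udevadm test-builtin net_id /sys/class/net/enp0s31f6

This prints the various ID_NET_NAME_* properties that the final name gets picked from.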

Sidebar: What this scheme does give us

Given identical and unchanging hardware (and BIOS), we get names that are consistent from boot to boot, from machine to machine, and are 'stateless' in that they don't depend on the past history of the Linux install running on the machine (your five year old install that's been moved between three machines sees the same names as a from-scratch install made yesterday).

ModernNetworkNameIssue written at 00:45:16

2018-03-27

My uncertainties around X drivers for modern Intel integrated graphics

When I switched from my old office hardware to my new office machine, I got surprised by how well the X server coped with not actually having a hardware driver for my graphics card. That experience left me jumpy about making sure that any new hardware I used was actually being driven properly by X, instead of X falling back to something that was just good enough that I didn't know I should be getting better. All of which is the lead up to me switching my home machine over to new modern Intel-based hardware, using Intel's generally well regarded onboard graphics. Everything looked good, but then I decided to find out if it actually was good, and now I'm confused.

(The specific hardware is only slightly different from my original plans.)

Let's start with the first issue, which is how to determine what X modules you're actually using. People who are expert in reading the X log files can probably work this out from the logs, but it left me confused, so I've now resorted to brute force. X server modules are shared libraries (well, shared objects), so the X server has to map them into its address space in order to use them. If we assume that it unmaps modules it's not using, we can use lsof to determine what modules it currently has loaded. On my machine, this reports that the driver it has loaded is the modesetting driver, along with libfb.so and libglamoregl.so (and various Mesa things).
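
In concrete terms, the lsof check boils down to something like this (it assumes the X server process is called 'Xorg' and uses Fedora's usual module location; adjust both for your system):

# list the X server's currently mapped driver modules
lsof -p $(pidof Xorg) | grep /xorg/modules/

Staring at the X logs then led me to see: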

modeset(0): [DRI2] Setup complete
modeset(0): [DRI2]   DRI driver: i965
modeset(0): [DRI2]   VDPAU driver: va_gl
[...]
AIGLX: Loaded and initialized i965

That seems pretty definite; I'm using the modesetting driver, not the Intel driver. This raises another question, which is whether or not this is a good thing. Although I initially thought there might be problems, a bunch of research in the process of writing this entry suggests that using the modesetting driver is the right answer (eg the Arch wiki entry on Intel graphics, which led me to the announcement that Fedora was switching over to modesetting). In fact now that I look back at my earlier entry, a commentator even talked about this switch.

(Before I found this information, I tried forcing X to use the Intel driver. This sort of worked (with complaints about an unrecognized chipset), but Chrome didn't like life so I gave up pretty much immediately. This might be fixed in the latest git tree of the driver, but if modesetting X works and is preferred, my motivation for building the driver from source is low.)

Unfortunately this leaves me with questions and concerns that I don't have answers to. The first issue is that I don't know how much GPU accelerated OpenGL I have active in this configuration. Since I can run some OpenGL stress tests without seeing the CPU load go up very much, I'm probably not using much software rendering and it's mostly hardware. Certainly xdriinfo reports that I have DRI through an i965 driver and glxinfo seems to know about the hardware.

(Specifically, glxinfo reports in several sections that it's using 'Mesa DRI Intel(R) HD Graphics (Coffeelake 3x8 GT2) (0x3e92)'. The 0x3e92 matches the PCI ID of the integrated graphics controller.)

The second issue is video playback, which for reasons beyond the scope of this entry is one of my interests. One of the ways that I noticed problems the first time around was that vdpauinfo was completely unhappy. This time around it's partially unhappy, but it says it's using the 'OpenGL/VAAPI backend for VDPAU'. VAAPI (also) is an Intel project, and it has its own information command, vainfo. Unfortunately vainfo is not happy with my system; if I read its messages correctly, it's unable to initialize the hardware driver it wants to use. Of course I don't know whether this even matters for basic video playback, even of 1080p content, or if common Linux video players use VAAPI (or VDPAU).

(Part of the issue may be that I have a bit of a Frankenstein mess of packages, with some VAAPI related packages coming from RPM Fusion, including the libva Intel driver package itself. Possibly I should trim down those packages somewhat.)

XCoffeeLakeDriverQuestion written at 02:05:16

2018-03-26

xprt: data for NFS mounts in /proc/self/mountstats is per-fileserver, not per-mount

A while back I wrote about all of the handy NFS statistics that appear in mountstats for all of your NFS mounts, including the xprt: NFS RPC information. For TCP mounts, this includes the local port and at the time I said:

  1. port: The local port used for this particular NFS mount. Probably not particularly useful, since on our NFS clients all NFS mounts from the same fileserver use the same port (and thus the same underlying TCP connection).

I then blithely talked about all of the remaining statistics as if they were specific to the particular NFS mount that the line was for. This turns out to be wrong, and the port number is in fact vital. I can demonstrate how vital by a little exercise:

$ fgrep xprt: /proc/self/mountstats | sort | uniq -c | sort -nr
    105   xprt: tcp 903 1 1 0 62 97817460 97785284 11122 2101962256388 0 574 10700678 55890249
     82   xprt: tcp 1005 1 1 0 0 48538448 48536496 1788 48292655827 0 810 26226830 53362451
[...]

It's not a coincidence that we have 105 NFS filesystems mounted from one fileserver and 82 from another. It turns out that at least with TCP based NFS mounts, all NFS mounts from the same fileserver will normally share the same RPC xprt transport, and it is the xprt transport's statistics that are being reported here. As a result, all of that xprt: NFS RPC information is for all NFS RPC traffic to the entire fileserver, not just the NFS RPC traffic for this specific mount.

(For TCP mounts, the combination of the local port plus the mountaddr= IP address will identify which xprt transport a given NFS mount is using. On our systems all NFS mounts from a given fileserver use the same port and thus the same xprt transport, but this may not always be the case. Also, each different fileserver is using a different local port, but again I'm not sure this is guaranteed.)
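
In shell terms, mapping mounts to their xprt transports looks something like this (the field positions follow from the output above):

# print each NFS mount's device and the local port of the xprt
# transport it is using (the third field of the 'xprt: tcp' line)
awk '/^device / { dev = $2 }
     /xprt: tcp/ { print dev, "-> local port", $3 }' /proc/self/mountstats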

If the system is sufficiently busy doing NFS (and has enough NFS mounts), it's possible to see slightly different xprt: values for different mounts from a given fileserver that are using the same xprt transport. This isn't a true difference; it's just an artifact of the fact that the information for mountstats isn't being gathered all at once. If things update sufficiently frequently and fast, an early mount will report slightly older xprt: values than a later mount.

If you want to get a global view of RPC to a given fileserver, this is potentially convenient. If you want to get a per-mount view, it's inconvenient. For instance, to get the total number of NFS requests sent by this mount or the total bytes sent and received by it, you can't just look at the xprt: stats; instead you'll need to add up the counts from the per-operation statistics. Much of the information you want can be found by summing up per-operation stats this way, but I haven't checked to see if all of it can be.
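
Here's a sketch of doing that summing for a single mount. It assumes the statvers=1.1 layout, where each per-operation line runs: operation name, operations, transmissions, timeouts, bytes sent, bytes received, and then various timings ('/some/mount' is a placeholder):

# total NFS ops and bytes for one mount, summed from its per-op stats
awk -v mnt=/some/mount '
    /^device / { in_mount = ($5 == mnt) }
    in_mount && $1 ~ /^[A-Z_]+:$/ { ops += $2; sent += $5; recv += $6 }
    END { print ops, "ops,", sent, "bytes sent,", recv, "bytes received" }
' /proc/self/mountstats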

There are probably clever things that can be done by combining and contrasting the xprt global stats and the per-mount stats you can calculate. I haven't tried to wrangle those metrics yet, though.

PS: The way that I found this is that the current version of nfsiostat does its sorting for -s based on the xprt: statistics, which gave us results that were sufficiently drastically off that it was obvious something was wrong.

(I suppose I should file a bug report about this with the nfs-utils people. My last bug report experience there went pretty smoothly and the current nfsd(7) manpage is now accurate.)

NFSMountstatsXprtII written at 02:10:51

2018-03-14

What I think I want out of a hypothetical nfsiotop for Linux

I tweeted:

I wish there was a version of Linux's nfsiostat that worked gracefully when you have several hundred NFS mounts across multiple NFS fileservers.

(I'm going to have to write one, aren't I.)

Linux exposes a very large array of per-filesystem NFS client statistics in /proc/self/mountstats (see here) and there are some programs that digest this data and report it, such as nfsiostat(8). Nfsiostat generally works decently to give you useful information, but it's very much not designed for systems with, for example, over 250 NFS mounts. Unfortunately that describes us, and we would rather like to have a tool which tells us what the NFS filesystem hotspots are on a given NFS client if and when it's clearly spending a lot of time waiting for NFS IO.

(We have some machines with this sort of problem.)

As suggested by the name, a hypothetical nfsiotop would have to only report on the top N filesystems, which raises the question of how you sort NFS filesystems here. Modern versions of nfsiostat sort by operations per second, which is a start, but I think that one should also be able to sort by total read and write volume and probably also by write volume alone. Other likely interesting things to sort on are the average response time and the current number of operations outstanding. An ideal tool would also be able to aggregate things into per fileserver statistics.

(All of this suggests that the real answer is that you should be able to sort on any field that the program can display, including some synthetic ones.)
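
In the meantime, something crude along these lines gets part of the way there; note that these are cumulative totals since mount, not rates, so it's only a rough hotspot indicator:

# top 10 NFS mounts by cumulative operation count
awk '/^device / { dev = $2 }
     $1 ~ /^[A-Z_]+:$/ { ops[dev] += $2 }
     END { for (d in ops) print ops[d], d }' /proc/self/mountstats |
    sort -rn | head -10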

As my aside in the tweet suggests, I suspect that I'm going to have to write this myself, and probably mostly from scratch. While nfsiostat is written in Python and so is probably reasonably straightforward for me to modify, I suspect that it has too many things I'd want to change. I don't want little tweaks for things like its output, I want wholesale restructuring. Hopefully I can reuse its code to parse the mountstats file, since that seems reasonably tedious to write from scratch. On the other hand, the current nfsiostat Python code seems amenable to a quick gut job to prototype the output that I'd want.

(Mind you, prototypes tend to drift into use. But that's not necessarily a bad thing.)

PS: I've also run across kofemann/nfstop, which has some interesting features such as a per-UID breakdown, but it works by capturing NFS network traffic and that's not the kind of thing I want to have to use on a busy machine, especially at 10G.

PPS: I'd love to find out that a plausible nfsiotop already exists, but I haven't been able to turn one up in Internet searches so far.

NfsiotopDesire written at 22:48:49

2018-03-09

In Fedora, your initramfs contains a copy of your sysctl settings

It all started when I discovered that my office workstation had wound up with its maximum PID value set to a very large number (as mentioned in passing in this entry). I managed to track this down to a sysctl.d file from Fedora's ceph-osd RPM package, which I had installed for reasons that are not entirely clear to me. That was straightforward. So I removed the package, along with all of the other ceph packages, and rebooted for other reasons. To my surprise, this didn't change the setting; I still had a kernel.pid_max value of 4194304. A bunch of head scratching ensued, including extreme measures like downloading and checking the Fedora systemd source. In the end, the culprit turned out to be my initramfs.

In Fedora, dracut copies sysctl.d files into your initramfs when it builds one (generally when you install a kernel update), and there's nothing that forces an update or rebuild of your initramfs when something modifies what sysctl.d files the system has or what they contain. Normally this is relatively harmless; you will have sysctl settings applied in the initramfs and then reapplied when sysctl runs a second time as the system is booting from your root filesystem. If you added new sysctl.d files or settings, they won't be in the initramfs but they'll get set the second time around. If you changed sysctl settings, the initramfs versions of the sysctl.d files will set the old values but then your updated settings will get set the second time around. But if you removed settings, nothing can fix that up; the old initramfs version of your sysctl.d file will apply the setting, and nothing will override it later.

(In Fedora 27's Dracut, this is done by a core systemd related Dracut module in /usr/lib/dracut/modules.d, 00systemd/module-setup.sh.)

It's my view that this behavior is dangerous. As this incident and others have demonstrated, any time that normal system files get copied into initramfs, you have the chance that the live versions will get out of sync with the versions in initramfs and then you can have explosions. The direct consequence of this is that you should strive to put as little in initramfs as possible, in order to minimize the chances of problems and confusion. Putting a frozen copy of sysctl.d files into the initramfs is not doing this. If there are sysctl settings that have to be applied in order to boot the system, they should be in a separate, clearly marked area and only that area should go in the initramfs.

(However, our Ubuntu 16.04 machines don't have sysctl.d files in their initramfs, so this behavior isn't universal and probably isn't required by either systemd or booting in general.)

Since that's not likely to happen any time soon, I guess I'm just going to have to remember to rebuild my initramfs any time I remove a sysctl setting. More broadly, I should probably adopt a habit of preemptively rebuilding my initramfs any time something inexplicable is going on, because that might be where the problem is. Or at least I should check what the initramfs contains, just in case Fedora's dracut setup has decided to capture something.
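
Both checks are straightforward, at least on Fedora (lsinitrd comes with dracut and defaults to the running kernel's initramfs):

# see what sysctl.d files the current initramfs captured
lsinitrd | grep sysctl.d

# force-rebuild the initramfs for the currently running kernel
dracut --force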

(It's my opinion that another sign that this is a bad idea in general is that there's no obvious package to file a bug against. Who is at fault? As far as I know there's no mechanism in RPM to trigger an action when files in a certain path are added, removed, or modified, and anyway you don't necessarily want to rebuild an initramfs by surprise.)

PS: For extra fun you actually have multiple initramfses; you have one per installed kernel. Normally this doesn't matter because you're only using the latest kernel and thus the latest initramfs, but if you have to boot an earlier kernel for some reason the files captured in its initramfs may be even more out of date than you expect.

FedoraInitramfsSysctl written at 23:00:24

2018-03-07

The lie in Ubuntu source packages (and probably Debian ones as well)

I tweeted:

One of the things that pisses me off about the Debian and Ubuntu source package format is that people clearly do not actually use it to build packages; they use other tools. You can tell because of how things are broken.

(I may have been hasty in tarring Debian with this particular brush but it definitely applies to Ubuntu.)

Several years ago I wrote about one problem with how Debian builds from source packages, which is that it doesn't have a distinction between the package's source tree and the tree that the package is built in; as a result, building the package can contaminate the source tree. This is not just a theoretical concern; it's happened to us. In fact it's now happened with both the Ubuntu 14.04 version of the package and then the Ubuntu 16.04 version, which was contaminated in a different way this time.

This problem is not difficult to find or notice. All you have to do is run debuild twice in the package's source tree and the second one will error out. People who are developing and testing package changes should be doing this all the time, as they build and test scratch versions of their package to make sure that it actually has what they want, passes package lint checks, and so on.
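
The reproduction really is this simple ('somepackage' is a stand-in name, and '-us -uc' just skips package signing):

apt-get source somepackage
cd somepackage-*/
debuild -us -uc    # the first build works
debuild -us -uc    # the second errors out on the leftover changes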

Ubuntu didn't find this issue, or if they found it they didn't care enough to fix it. The conclusion is inescapable: the source package, and all of the documentation that tells you to use debuild on it, is a lie. The nominal source package may contain the source code that went into the binary package (although I'm not sure you can be sure of that), but it's not necessarily an honest representation of how the package is actually built by the people who work on it, and as a result building the package with debuild may or may not reproduce the binary package you got from Ubuntu. Certainly you can't reliably use the source package to develop new versions of the binary package; one way or another, you will have to use some sort of hack workaround.

(RPM based distributions should not feel too smug here, because they have their own package building issues and documentation problems.)

I don't build many Ubuntu packages. That I've stumbled over two broken packages, out of the few I've tried to rebuild, and that they're broken in two different ways, strongly suggests to me that this is pretty common. I could be unlucky (or lucky), but I think it's more likely that I'm getting a reasonably representative random sample.

PS: If Ubuntu and/or Debian care about this, the solution is obvious, although it will slow things down somewhat. As always, if you really care about something you must test it and if you don't bother to test it when it's demonstrably a problem, you probably don't actually care about it. This is not a difficult test to automate.

(Also, if debuild is not what people should be using to build or rebuild packages these days, various people have at least a documentation problem.)

UbuntuPackageBuildingLie written at 01:43:26
