Wandering Thoughts archives

2018-03-31

My new Linux home machine for 2018

Back in the fall I planned out a new home machine and then various things happened, especially Meltdown and Spectre (which I feel made it not a great time to get a new CPU) but also my natural inertia. This inertia sort of persisted despite a near miss scare, but in the end I wound up losing confidence in my current (now old) home machine and just wanting to get things over with, so I bit the bullet and got a new home machine (and then I wound up with questions on its graphics).

(There will be better CPUs in the future that probably get real performance boosts from hardware fixes for Meltdown and Spectre, but there are always better CPUs in the future. And on my home machine, I'm willing to run with the kernel mitigations turned off if I feel strongly enough about it.)

The parts list for my new machine is substantially the same as my initial plans, but I made a few changes and indulgences. Here's the parts list for my future reference, if nothing else:

Intel Core i7-8700K
This is the big change from my initial plans, where I'd picked the i7-8700. It's an indulgence, but I felt like it and using an overclocking capable CPU did eliminate most of my concerns over high-speed DDR4 memory. I'm not overclocking the 8700K because, well, I don't. After I didn't run into any TDP issues with my work machine and its 95W TDP Ryzen, I decided that I didn't have any concerns with the 8700K's 95W TDP either.

(My concerns about the TDP difference between the 8700K and the 8700 turn out to be overblown, but that's going to be another entry.)

Asus PRIME Z370-A motherboard
I saw no reason to change my initial choice (and it turns out Phoronix was quite positive about it too, which I only found out about later). The RGB LEDs are kind of amusing and they even show a little bit through some of the air vents on the case (especially in the dark).

2x 16GB G.Skill Ripjaws V DDR4-3000 CL15 RAM
I can't remember if the G.Skill DDR4-2666 modules were out of stock at the exact point I was putting in my order, but they certainly had been earlier (when I assembled the parts list at the vendor I was going to use), and in the end I decided I was willing to pay the slightly higher cost of DDR4-3000 RAM just as an indulgence.

(Looking at the vendor I used, the current price difference is $4 Canadian.)

I opted not to try to go faster than DDR4-3000 because the RAM modules I could easily see stopped having flat CL timings above that speed; they would be eg CL 15-18-18-18 instead of a flat CL15. There were some faster modules with flat CL timings, but they were much more expensive. Since I think I care more about latency than slight bandwidth increases, I decided to stick with the inexpensive option that gave me flat latency.

Noctua NH-U12 CPU cooler
I quite liked the Noctua cooler in my work machine, so I just got the general version of it. I've been quite happy on my work machine not just with how well the Noctua cooler works, but also how quiet it is with the CPU under heavy load (and in fact it seems quite difficult to push the CPU hard enough that the Noctua's fan has to run very fast, which of course helps keep any noise down).

In short: real (and large-ish) CPU coolers are a lot better than the old stock Intel CPU coolers. I suppose this is not surprising.

(The Noctua is not inexpensive, but I'm willing to pay for quality and long term reliability.)

EVGA SuperNova G3 550W PSU
As with my work machine, the commentary on my original plans pushed me to getting a better PSU. Since I liked this in the work machine, I just got it again for the home one. It's an amusing degree of overkill for the actual measured power draw of either machine, but I don't really care about that.

(For my own future reference: always use the PSU screws that come with the PSU, not the PSU screws that may come with the case.)

Fractal Design Define R5 case
I got this in white this time around (the work version is black). The story about the colour shift is that we recently built a new Ryzen based machine for one of my co-workers and basically duplicated my work machine for the parts list, except his case is white because he didn't care and it was cheaper and in stock at the time. When his case came in and I got a chance to see it in person, I decided that maybe I liked the white version better than the black one and in the end my vacillation settled on the white one.

(When I pulled the trigger on buying the hardware, the two case colours were the same price (and both in stock). So it goes.)

LG GH24NSC0 DVD/CD Writer
I gave in to an emotional temptation beyond the scope of this entry and got an optical drive, despite what I'd written earlier. But I haven't actually installed it in the case; it's apparently enough that I could if I wanted to.

(It helps that powering the drive would be somewhat awkward, because the PSU doesn't come with SATA modular cables that are long enough or have enough plugs on them.)

I'm running the RAM at full speed by activating its XMP profile in the BIOS, which was a pleasant one-click option (without that it came up at a listed 2133 MHz). There didn't seem to be a one-click option for 'run this at 2666 MHz with the best timings it supports', so I didn't bother trying to set that up by hand. The result appears stable and the BIOS at least claims that the CPU is still running at its official normal speed rating.

(Apparently it's common for Intel motherboards to default to running DDR4 RAM at 2133 MHz for maximum compatibility or something.)
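To put rough numbers on the latency side of the RAM speed choice, here is a minimal sketch of the usual first-word latency arithmetic. It assumes the standard DDR relationship (two transfers per clock, so the clock period in nanoseconds is 2000 divided by the data rate in MT/s). The function name is made up for illustration, and holding CL15 fixed at the slower speeds is an assumption for comparison, not something reported by this machine's BIOS.

```python
def first_word_latency_ns(data_rate_mts: float, cas_latency: int) -> float:
    """Approximate CAS latency in nanoseconds.

    DDR memory makes two transfers per clock, so one clock period is
    2000 / data_rate nanoseconds when the data rate is given in MT/s.
    """
    return cas_latency * 2000.0 / data_rate_mts


if __name__ == "__main__":
    # CL15 at every speed is an assumption made purely for comparison.
    for rate, cl in [(2133, 15), (2666, 15), (3000, 15)]:
        print(f"DDR4-{rate} CL{cl}: {first_word_latency_ns(rate, cl):.2f} ns")
```

By this arithmetic DDR4-3000 at CL15 works out to 10 ns and DDR4-2666 at the same CL to roughly 11.25 ns, so the faster modules are a small latency win only as long as the CL number does not grow with the speed.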

In general I'm quite happy with this new machine. It's not perfect (nothing seems to be under Linux), but everything works, it's pretty nice, and it's surprisingly fast. Honestly, it feels good simply to have a modern machine, especially one where I can't clearly hear the difference between an idle machine and one under CPU load.

(There's still a little bit of noise increase under full load, but it's pretty subtle and I have to really pay attention in a quiet environment to hear it. On my old machine, the stock Intel CPU cooler spinning up created a clearly audible difference in noise that I could hear over, eg, my typing. The old machine might have been improved by redoing the thermal paste, but my old work machine, with the redone paste, still had the same sort of audibility under load.)

After my experiences with UEFI on my work machine, I didn't try to switch from BIOS booting to UEFI on this one either (contrary to earlier plans for this). The two machines probably have somewhat different BIOSes (although their GUIs look very similar), but I didn't feel like taking the chance and I've wound up feeling that BIOS boot is better if you're using a mirrored root filesystem (which I am), fundamentally due to an unfortunate Linux (or Fedora) GRUB2 configuration choice.

PS: Given my experiences with Ryzen on my new work machine, I wound up with absolutely no interest in going with a Ryzen home machine. This turned out to be a good choice in general, but that's another entry.

(The short version is on Twitter.)

HomeMachine2018 written at 02:44:54

2018-03-29

The problem with Linux's 'predictable network interface names'

I tweeted:

I find modern Linux auto-generated Ethernet device names to be a big pain, because they're such long jumbles. enp0s31f6? enp1s0f0? Please give me a break (and something short).

The fundamental problem with these 'predictable network interface names' is that they aren't. By that I mean that if you tell me that a system has, say, a motherboard 1G Ethernet port and a 10G-T Ethernet port on a PCIE card, I can't predict what those interfaces will be called unless I happen to have an exactly identical second machine that I can check. If I want to configure the machine's networking or ssh in to run a status check command on the 10G-T interface, I'm pretty much out of luck; I'm going to have to run ifconfig or some similar command to see what this machine has decided to call the interfaces.

(Yes, even for the motherboard network port, which may or may not show up as eno1 depending on the vagaries of life and your specific configuration. The enp0s31f6 from my tweet is such a port, and we have other machines where the motherboard port is, eg, enp7s0.)

The other problem with these names is that they're relatively long jumbles that are different from each other at various random positions (not just at the end). Names like this are hard to tell apart, hard to tell to people, and invite errors when you're working with them (because such errors won't stand out in the jumble). This might be tolerable if we were getting predictability in exchange for that jumble, but we aren't. If all we're going to get is stability, it would be nice to have names that are easier to deal with.

(We aren't even entirely getting stability, since PCI slot numbering isn't stable and that's what these names are based on.)
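To make the 'jumble' a little less opaque, here is a rough sketch that decodes the two name forms mentioned above. It assumes my understanding of udev's scheme: a two-letter type prefix, then either `o<index>` for onboard devices or `p<bus>s<slot>[f<function>]` for PCI locations, with the numbers written in decimal. The real scheme has more variants (USB paths, MAC-based names, PCI domains) that this deliberately ignores, and the `describe` function is purely illustrative.

```python
import re

# Only two common forms are handled; everything else is reported as unknown.
_PCI_PATH = re.compile(r"^(en|wl|ww)p(\d+)s(\d+)(?:f(\d+))?$")
_ONBOARD = re.compile(r"^(en|wl|ww)o(\d+)$")

_KINDS = {"en": "Ethernet", "wl": "wireless LAN", "ww": "wireless WAN"}


def describe(name: str) -> str:
    """Turn a name like 'enp0s31f6' into a human-readable description."""
    m = _PCI_PATH.match(name)
    if m:
        kind, bus, slot, func = m.groups()
        func = func or "0"
        # The name uses decimal numbers; lspci prints the same location in hex.
        return "%s at PCI %02x:%02x.%x" % (
            _KINDS[kind], int(bus), int(slot), int(func))
    m = _ONBOARD.match(name)
    if m:
        return "%s, onboard device index %s" % (_KINDS[m.group(1)], m.group(2))
    return "not a form this sketch understands"
```

With this, `describe("enp0s31f6")` gives `Ethernet at PCI 00:1f.6`, which is exactly the point: the name tells you where the device sits on the PCI bus, not which physical port it is.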

PS: There is a benefit to this naming scheme, which is that identical hardware will have identical names and you can freely transplant system disks (or system images) around between such hardware. If I have a fleet of truly identical machines (down to the PCIE cards being in the same physical slots), I know that enp1s0f0 on one machine is enp1s0f0 on every machine.

(Over the years, these device names have been implemented with somewhat different names by a number of different Linux components. These days they come from udev, which is now developed as part of systemd in case you wish to throw the usual stones. I'm not sure if udev considers the specific naming scheme to be stable, considering the official documentation points you to the source code.)

Sidebar: What this scheme does give us

Given identical and unchanging hardware (and BIOS), we get names that are consistent from boot to boot, from machine to machine, and are 'stateless' in that they don't depend on the past history of the Linux install running on the machine (your five year old install that's been moved between three machines sees the same names as a from-scratch install made yesterday).

ModernNetworkNameIssue written at 00:45:16

2018-03-27

My uncertainties around X drivers for modern Intel integrated graphics

When I switched from my old office hardware to my new office machine, I got surprised by how well the X server coped with not actually having a hardware driver for my graphics card. That experience left me jumpy about making sure that any new hardware I used was actually being driven properly by X, instead of X falling back to something that was just good enough that I didn't know I should be getting better. All of which is the lead up to me switching my home machine over to new modern Intel-based hardware, using Intel's generally well regarded onboard graphics. Everything looked good, but then I decided to find out if it actually was good, and now I'm confused.

(The specific hardware is only slightly different from my original plans.)

Let's start with the first issue, which is how to determine what X modules you're actually using. People who are expert in reading the X log files can probably work this out from the logs but it left me confused and puzzled, so I've now resorted to brute force. X server modules are shared libraries (well, shared objects), so the X server has to map them into its address space in order to use them. If we assume that it unmaps modules it's not using, we can use lsof to determine what current modules it has. On my machine, this reports that the driver it has loaded is the modesetting driver, along with libfb.so and libglamoregl.so (and various Mesa things). Staring at the X logs led me to see:

modeset(0): [DRI2] Setup complete
modeset(0): [DRI2]   DRI driver: i965
modeset(0): [DRI2]   VDPAU driver: va_gl
[...]
AIGLX: Loaded and initialized i965

That seems pretty definite; I'm using the modesetting driver, not the Intel driver. This raises another question, which is whether or not this is a good thing. Although I initially thought there might be problems, a bunch of research in the process of writing this entry suggests that using the modesetting driver is the right answer (eg the Arch wiki entry on Intel graphics, which led me to the announcement that Fedora was switching over to modesetting). In fact now that I look back at my earlier entry, a commentator even talked about this switch.

(Before I found this information, I tried forcing X to use the Intel driver. This sort of worked (with complaints about an unrecognized chipset), but Chrome didn't like life so I gave up pretty much immediately. This might be fixed in the latest git tree of the driver, but if modesetting X works and is preferred, my motivation for building the driver from source is low.)
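The brute-force 'what has the X server mapped' check described above can also be done without lsof by reading the server's `/proc/<pid>/maps` file. The following is a minimal sketch under a couple of assumptions: that X modules live under a directory containing `/xorg/modules/` (true of the Fedora paths I know of, but not guaranteed elsewhere), and that you have permission to read the X server's maps file. The function names are made up for illustration.

```python
import os


def mapped_objects(maps_text: str) -> list:
    """Return the sorted, unique .so paths found in /proc/<pid>/maps content."""
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # The sixth field, when present, is the pathname of the mapping.
        if len(fields) == 6 and ".so" in os.path.basename(fields[5]):
            paths.add(fields[5])
    return sorted(paths)


def x_driver_modules(maps_text: str) -> list:
    """Keep only objects that look like X server modules (path assumption)."""
    return [p for p in mapped_objects(maps_text) if "/xorg/modules/" in p]
```

You would feed this the contents of `/proc/<pid>/maps` for the running X server, for example with `open("/proc/%d/maps" % pid).read()`. Like the lsof approach, it only tells you what is mapped right now, so it relies on the same assumption that unused modules get unmapped.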

Unfortunately this leaves me with questions and concerns that I don't have answers to. The first issue is that I don't know how much GPU accelerated OpenGL I have active in this configuration. Since I can run some OpenGL stress tests without seeing the CPU load go up very much, I'm probably not using much software rendering and it's mostly hardware. Certainly xdriinfo reports that I have DRI through an i965 driver and glxinfo seems to know about the hardware.

(Specifically, glxinfo reports in several sections that it's using 'Mesa DRI Intel(R) HD Graphics (Coffeelake 3x8 GT2) (0x3e92)'. The 0x3e92 matches the PCI ID of the integrated graphics controller.)

The second issue is video playback, which for reasons beyond the scope of this entry is one of my interests. One of the ways that I noticed problems the first time around was that vdpauinfo was completely unhappy. This time around it's partially unhappy, but it says it's using the 'OpenGL/VAAPI backend for VDPAU'. VAAPI (also) is an Intel project, and it has its own information command, vainfo. Unfortunately vainfo is not happy with my system; if I can read its messages correctly, it's unable to initialize the hardware driver it wants to use. Of course I don't know whether this even matters for basic video playback, even of 1080p content, or if common Linux video players use VAAPI (or VDPAU).

(Part of the issue may be that I have a bit of a Frankenstein mess of packages, with some VAAPI related packages coming from RPM Fusion, including the libva Intel driver package itself. Possibly I should trim down those packages somewhat.)

XCoffeeLakeDriverQuestion written at 02:05:16

2018-03-26

xprt: data for NFS mounts in /proc/self/mountstats is per-fileserver, not per-mount

A while back I wrote about all of the handy NFS statistics that appear in mountstats for all of your NFS mounts, including the xprt: NFS RPC information. For TCP mounts, this includes the local port and at the time I said:

  1. port: The local port used for this particular NFS mount. Probably not particularly useful, since on our NFS clients all NFS mounts from the same fileserver use the same port (and thus the same underlying TCP connection).

I then blithely talked about all of the remaining statistics as if they were specific to the particular NFS mount that the line was for. This turns out to be wrong, and the port number is in fact vital. I can demonstrate how vital by a little exercise:

$ fgrep xprt: /proc/self/mountstats | sort | uniq -c | sort -nr
    105   xprt: tcp 903 1 1 0 62 97817460 97785284 11122 2101962256388 0 574 10700678 55890249
     82   xprt: tcp 1005 1 1 0 0 48538448 48536496 1788 48292655827 0 810 26226830 53362451
[...]

It's not a coincidence that we have 105 NFS filesystems mounted from one fileserver and 82 from another. It turns out that at least with TCP based NFS mounts, all NFS mounts from the same fileserver will normally share the same RPC xprt transport, and it is the xprt transport's statistics that are being reported here. As a result, all of that xprt: NFS RPC information is for all NFS RPC traffic to the entire fileserver, not just the NFS RPC traffic for this specific mount.

(For TCP mounts, the combination of the local port plus the mountaddr= IP address will identify which xprt transport a given NFS mount is using. On our systems all NFS mounts from a given fileserver use the same port and thus the same xprt transport, but this may not always be the case. Also, each different fileserver is using a different local port, but again I'm not sure this is guaranteed.)
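Here is a rough sketch of grouping NFS mounts by the transport they appear to share, using the local port plus the mountaddr= address as the key. It assumes the mountstats layout I'm used to seeing: a `device ... mounted on ... with fstype nfs` line, followed by indented `opts:` and `xprt:` lines for that mount. The function name is hypothetical, and mounts with no mountaddr= option simply end up under a `None` address.

```python
import re
from collections import defaultdict


def mounts_by_transport(mountstats_text: str) -> dict:
    """Map (mountaddr, local port) -> list of NFS mount points using it."""
    groups = defaultdict(list)
    mountpoint = addr = None
    for raw in mountstats_text.splitlines():
        line = raw.strip()
        if line.startswith("device "):
            # 'device SRC mounted on DIR with fstype TYPE ...'
            m = re.match(r"device \S+ mounted on (\S+) with fstype (\S+)", line)
            is_nfs = bool(m) and m.group(2).startswith("nfs")
            mountpoint = m.group(1) if is_nfs else None
            addr = None
        elif mountpoint and line.startswith("opts:"):
            m = re.search(r"\bmountaddr=([^,\s]+)", line)
            addr = m.group(1) if m else None
        elif mountpoint and line.startswith("xprt:"):
            fields = line.split()
            # For TCP, the field after the protocol name is the local port.
            if len(fields) > 2 and fields[1] == "tcp":
                groups[(addr, int(fields[2]))].append(mountpoint)
    return dict(groups)
```

Run over the contents of /proc/self/mountstats, any key with more than one mount point is a case where those mounts are reporting the same transport-wide xprt: numbers.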

If the system is sufficiently busy doing NFS (and has enough NFS mounts), it's possible to see slightly different xprt: values for different mounts from a given fileserver that are using the same xprt transport. This isn't a true difference; it's just an artifact of the fact that the information for mountstats isn't being gathered all at once. If things update sufficiently frequently and fast, an early mount will report slightly older xprt: values than a later mount.

If you want to get a global view of RPC to a given fileserver, this is potentially convenient. If you want to get a per-mount view, it's inconvenient. For instance, to get the total number of NFS requests sent by this mount or the total bytes sent and received by it, you can't just look at the xprt: stats; instead you'll need to add up the counts from the per-operation statistics. Much of the information you want can be found by summing up per-operation stats this way, but I haven't checked to see if all of it can be.
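As a sketch of the 'add up the per-operation statistics' approach, the following totals operation counts and bytes for each NFS mount. It assumes the per-operation line layout as I understand it, where each line after the `per-op statistics` header is an operation name followed by counters, with the first being the number of operations and the fourth and fifth being bytes sent and bytes received. That field order is an assumption you should check against your kernel, and the function name is made up.

```python
def per_mount_totals(mountstats_text: str) -> dict:
    """Map NFS mount point -> ops and byte totals summed over all operations."""
    totals = {}
    current = None
    in_ops = False
    for raw in mountstats_text.splitlines():
        line = raw.strip()
        if line.startswith("device "):
            parts = line.split()
            # 'device SRC mounted on DIR with fstype TYPE ...'
            is_nfs = len(parts) >= 8 and parts[7].startswith("nfs")
            current = parts[4] if is_nfs else None
            in_ops = False
            if current:
                totals[current] = {"ops": 0, "bytes_sent": 0, "bytes_recv": 0}
        elif current and line.startswith("per-op statistics"):
            in_ops = True
        elif current and in_ops and ":" in line:
            fields = line.split(":", 1)[1].split()
            if len(fields) >= 5:
                entry = totals[current]
                entry["ops"] += int(fields[0])         # operations requested
                entry["bytes_sent"] += int(fields[3])  # assumed field position
                entry["bytes_recv"] += int(fields[4])  # assumed field position
    return totals
```

These totals are cumulative since the filesystem was mounted, so a rate view needs two samples and a subtraction.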

There are probably clever things that can be done by combining and contrasting the xprt global stats and the per-mount stats you can calculate. I haven't tried to wrangle those metrics yet, though.

PS: The way that I found this is that the current version of nfsiostat does its sorting for -s based on the xprt: statistics, which gave us results that were sufficiently drastically off that it was obvious something was wrong.

(I suppose I should file a bug report about this with the nfs-utils people. My last bug report experience there went pretty smoothly and the current nfsd(7) manpage is now accurate.)

NFSMountstatsXprtII written at 02:10:51

2018-03-14

What I think I want out of a hypothetical nfsiotop for Linux

I tweeted:

I wish there was a version of Linux's nfsiostat that worked gracefully when you have several hundred NFS mounts across multiple NFS fileservers.

(I'm going to have to write one, aren't I.)

Linux exposes a very large array of per-filesystem NFS client statistics in /proc/self/mountstats (see here) and there are some programs that digest this data and report it, such as nfsiostat(8). Nfsiostat generally works decently to give you useful information, but it's very much not designed for systems with, for example, over 250 NFS mounts. Unfortunately that describes us, and we would rather like to have a tool which tells us what the NFS filesystem hotspots are on a given NFS client if and when it's clearly spending a lot of time waiting for NFS IO.

(We have some machines with this sort of problem.)

As suggested by the name, a hypothetical nfsiotop would have to only report on the top N filesystems, which raises the question of how you sort NFS filesystems here. Modern versions of nfsiostat sort by operations per second, which is a start, but I think that one should also be able to sort by total read and write volume and probably also by write volume alone. Other likely interesting things to sort on are the average response time and the current number of operations outstanding. An ideal tool would also be able to aggregate things into per fileserver statistics.

(All of this suggests that the real answer is that you should be able to sort on any field that the program can display, including some synthetic ones.)
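The 'sort on any displayable field, including synthetic ones' idea is simple to express once each filesystem is a dictionary of numbers. This is only a sketch of the shape such a tool might take; every field name here (`read_bytes`, `write_bytes`, `rtt_ms`, `ops`) is hypothetical and not taken from nfsiostat.

```python
from operator import itemgetter


def with_synthetic(row: dict) -> dict:
    """Add derived fields so they can be sorted on like any other column."""
    out = dict(row)
    out["total_bytes"] = row["read_bytes"] + row["write_bytes"]
    out["avg_rtt_ms"] = row["rtt_ms"] / row["ops"] if row["ops"] else 0.0
    return out


def top_n(rows, field, n=10, reverse=True):
    """Return the n rows with the largest (or smallest) value of `field`."""
    return sorted(rows, key=itemgetter(field), reverse=reverse)[:n]
```

Because the synthetic fields are computed before sorting, the sort code doesn't need to know which columns are 'real'. Per-fileserver aggregation would be another pass that sums rows sharing a server before calling `top_n`.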

As my aside in the tweet suggests, I suspect that I'm going to have to write this myself, and probably mostly from scratch. While nfsiostat is written in Python and so is probably reasonably straightforward for me to modify, I suspect that it has too many things I'd want to change. I don't want little tweaks for things like its output, I want wholesale restructuring. Hopefully I can reuse its code to parse the mountstats file, since that seems reasonably tedious to write from scratch. On the other hand, the current nfsiostat Python code seems amenable to a quick gut job to prototype the output that I'd want.

(Mind you, prototypes tend to drift into use. But that's not necessarily a bad thing.)

PS: I've also run across kofemann/nfstop, which has some interesting features such as a per-UID breakdown, but it works by capturing NFS network traffic and that's not the kind of thing I want to have to use on a busy machine, especially at 10G.

PPS: I'd love to find out that a plausible nfsiotop already exists, but I haven't been able to turn one up in Internet searches so far.

NfsiotopDesire written at 22:48:49

2018-03-09

In Fedora, your initramfs contains a copy of your sysctl settings

It all started when I discovered that my office workstation had wound up with its maximum PID value set to a very large number (as mentioned in passing in this entry). I managed to track this down to a sysctl.d file from Fedora's ceph-osd RPM package, which I had installed for reasons that are not entirely clear to me. That was straightforward. So I removed the package, along with all of the other ceph packages, and rebooted for other reasons. To my surprise, this didn't change the setting; I still had a kernel.pid_max value of 4194304. A bunch of head scratching ensued, including extreme measures like downloading and checking the Fedora systemd source. In the end, the culprit turned out to be my initramfs.

In Fedora, dracut copies sysctl.d files into your initramfs when it builds one (generally when you install a kernel update), and there's nothing that forces an update or rebuild of your initramfs when something modifies what sysctl.d files the system has or what they contain. Normally this is relatively harmless; you will have sysctl settings applied in the initramfs and then reapplied when sysctl runs a second time as the system is booting from your root filesystem. If you added new sysctl.d files or settings, they won't be in the initramfs but they'll get set the second time around. If you changed sysctl settings, the initramfs versions of the sysctl.d files will set the old values but then your updated settings will get set the second time around. But if you removed settings, nothing can fix that up; the old initramfs version of your sysctl.d file will apply the setting, and nothing will override it later.

(In Fedora 27's Dracut, this is done by a core systemd related Dracut module in /usr/lib/dracut/modules.d, 00systemd/module-setup.sh.)
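The add, change, and remove cases above can be modelled as two rounds of 'apply these settings on top of the current state'. The following is a toy model and not how dracut or systemd actually implement anything; the function names are made up and the 32768 starting value is only an example.

```python
def parse_sysctl(text: str) -> dict:
    """Parse 'key = value' lines, skipping blanks and '#' or ';' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def boot_result(defaults: dict, initramfs: dict, rootfs: dict) -> dict:
    """Model the two passes: initramfs settings first, then the real root's."""
    state = dict(defaults)
    state.update(initramfs)  # applied early, from the frozen copy
    state.update(rootfs)     # applied again once the root filesystem is up
    return state


def stale_settings(initramfs: dict, rootfs: dict) -> dict:
    """Settings that only the initramfs copy still sets."""
    return {k: v for k, v in initramfs.items() if k not in rootfs}
```

If the initramfs copy still has `kernel.pid_max = 4194304` and the live sysctl.d files no longer mention kernel.pid_max at all, `boot_result` ends up with the initramfs value and `stale_settings` reports it, which is the removed-setting case that nothing fixes up later.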

It's my view that this behavior is dangerous. As this incident and others have demonstrated, any time that normal system files get copied into initramfs, you have the chance that the live versions will get out of sync with the versions in initramfs and then you can have explosions. The direct consequence of this is that you should strive to put as little in initramfs as possible, in order to minimize the chances of problems and confusion. Putting a frozen copy of sysctl.d files into the initramfs is not doing this. If there are sysctl settings that have to be applied in order to boot the system, they should be in a separate, clearly marked area and only that area should go in the initramfs.

(However, our Ubuntu 16.04 machines don't have sysctl.d files in their initramfs, so this behavior isn't universal and probably isn't required by either systemd or booting in general.)

Since that's not likely to happen any time soon, I guess I'm just going to have to remember to rebuild my initramfs any time I remove a sysctl setting. More broadly, I should probably adopt a habit of preemptively rebuilding my initramfs any time something inexplicable is going on, because that might be where the problem is. Or at least I should check what the initramfs contains, just in case Fedora's dracut setup has decided to capture something.

(It's my opinion that another sign that this is a bad idea in general is there's no obvious package to file a bug against. Who is at fault? As far as I know there's no mechanism in RPM to trigger an action when files in a certain path are added, removed, or modified, and anyway you don't necessarily want to rebuild an initramfs by surprise.)

PS: For extra fun you actually have multiple initramfses; you have one per installed kernel. Normally this doesn't matter because you're only using the latest kernel and thus the latest initramfs, but if you have to boot an earlier kernel for some reason the files captured in its initramfs may be even more out of date than you expect.

FedoraInitramfsSysctl written at 23:00:24

2018-03-07

The lie in Ubuntu source packages (and probably Debian ones as well)

I tweeted:

One of the things that pisses me off about the Debian and Ubuntu source package format is that people clearly do not actually use it to build packages; they use other tools. You can tell because of how things are broken.

(I may have been hasty in tarring Debian with this particular brush but it definitely applies to Ubuntu.)

Several years ago I wrote about one problem with how Debian builds from source packages, which is that it doesn't have a distinction between the package's source tree and the tree that the package is built in and as a result building the package can contaminate the source tree. This is not just a theoretical concern; it's happened to us. In fact it's now happened with both the Ubuntu 14.04 version of the package and then the Ubuntu 16.04 version, which was contaminated in a different way this time.

This problem is not difficult to find or notice. All you have to do is run debuild twice in the package's source tree and the second one will error out. People who are developing and testing package changes should be doing this all the time, as they build and test scratch versions of their package to make sure that it actually has what they want, passes package lint checks, and so on.
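The 'build it twice' check is easy to script. Here is a minimal sketch that runs a build command twice in a source tree and reports both exit statuses. The default of `debuild -us -uc` (build without signing) is my assumption about a reasonable invocation, the function name is made up, and a real test would want to start from a freshly unpacked source package each time.

```python
import subprocess


def builds_twice(source_dir, build_cmd=("debuild", "-us", "-uc")):
    """Run build_cmd twice in source_dir; return the two exit statuses."""
    statuses = []
    for _ in range(2):
        result = subprocess.run(
            list(build_cmd),
            cwd=source_dir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        statuses.append(result.returncode)
    return tuple(statuses)
```

A result of `(0, 0)` means the second build survived whatever the first one left behind in the tree; a zero followed by a non-zero status is the contamination problem described here.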

Ubuntu didn't find this issue, or if they found it they didn't care enough to fix it. The conclusion is inescapable; the source package and all of the documentation that tells you to use debuild on it is a lie. The nominal source package may contain the source code that went into the binary package (although I'm not sure you can be sure of that), but it's not necessarily an honest representation of how the package is actually built by the people who work on it and as a result building the package with debuild may or may not reproduce the binary package you got from Ubuntu. Certainly you can't reliably use the source package to develop new versions of the binary package; one way or another, you will have to use some sort of hack workaround.

(RPM based distributions should not feel too smug here, because they have their own package building issues and documentation problems.)

I don't build many Ubuntu packages. That I've stumbled over two packages out of the few that I've tried to rebuild and they're broken in two different ways strongly suggests to me that this is pretty common. I could be unlucky (or lucky), but I think it's more likely that I'm getting a reasonably representative random sample.

PS: If Ubuntu and/or Debian care about this, the solution is obvious, although it will slow things down somewhat. As always, if you really care about something you must test it and if you don't bother to test it when it's demonstrably a problem, you probably don't actually care about it. This is not a difficult test to automate.

(Also, if debuild is not what people should be using to build or rebuild packages these days, various people have at least a documentation problem.)

UbuntuPackageBuildingLie written at 01:43:26

2018-03-05

Getting chrony to not try to use IPv6 time sources on Fedora

Ever since I switched over to chrony, one of the quiet little irritations of its setup on my office workstation has been that it tried to use IPv6 time sources alongside the IPv4 ones. It got these time sources from the default Fedora pool I'd left it using alongside our local time sources (because I'm the kind of person who thinks the more time sources the merrier), and at one level looking up IPv6 addresses as well as IPv4 addresses is perfectly sensible. At another level, though, it wasn't, because my office workstation has no IPv6 connectivity and even no IPv6 configuration. All of those IPv6 time sources that chrony was trying to talk to were completely unreachable and would never work. At a minimum they were clutter in 'chronyc sources' output, but probably they were also keeping chrony from picking up some additional IPv4 sources.

I started out by reading the chrony.conf manpage, on the assumption that that would be where you configured this. When I found nothing, I unwisely gave up and grumbled to myself, eventually saying something on Twitter. This caused @rt2800pci1 to suggest using systemd restrictions so that chronyd couldn't even use IPv6. This had some interesting results. On the one hand, chronyd definitely couldn't use IPv6 and it said as much:

chronyd[4097894]: Could not open IPv6 command socket : Address family not supported by protocol

On the other hand, this didn't stop chronyd from trying to use IPv6 addresses as time sources:

chronyd[4097894]: Source 2620:10a:800f::14 replaced with 2620:10a:800f::11

(I don't know why my office workstation has such high PIDs at the moment. Something odd is clearly going on.)

However, this failure caused me to actually read the chronyd manpage, where I finally noticed the -4 command line option, which tells chrony to only use IPv4 addresses for everything. On Fedora, you can configure what options are given to chronyd in /etc/sysconfig/chronyd, which is automatically used by the standard Fedora chronyd.service systemd service for chrony(d). A quick addition and chrony restart, and now it's not trying to use IPv6 and I'm happy.
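For the record, the 'quick addition' amounts to getting -4 into the options that the service passes to chronyd. The sketch below assumes the file uses a single `OPTIONS="..."` line, which is how Fedora's /etc/sysconfig/chronyd looks as far as I know; the function name is made up, and it only transforms text rather than touching the real file.

```python
import re


def add_chronyd_option(sysconfig_text: str, option: str = "-4") -> str:
    """Return the text with `option` present in the OPTIONS="..." line."""
    out, found = [], False
    for line in sysconfig_text.splitlines():
        m = re.match(r'^OPTIONS="(.*)"\s*$', line)
        if m:
            found = True
            opts = m.group(1).split()
            if option not in opts:
                opts.append(option)
            line = 'OPTIONS="%s"' % " ".join(opts)
        out.append(line)
    if not found:
        # No OPTIONS line at all: add one rather than silently doing nothing.
        out.append('OPTIONS="%s"' % option)
    return "\n".join(out) + "\n"
```

Applying it twice gives the same result as applying it once, and chronyd still has to be restarted afterwards for the new option to take effect.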

There are a number of lessons here. One of them is my perpetual one, which is that I should read the manual pages more often (and make sure I read all of them). There was no reason to stop with just the chrony.conf manpage; I simply assumed that not using IPv6 would be configured there if it was configurable at all. I was wrong and I could have had my annoyance fixed quite a while ago if I'd looked harder.

Another one, on the flipside, is that completely disabling IPv6 doesn't necessarily stop modern programs from trying to use it. Perhaps this is a bug on chrony's part, but I suspect that its authors will be uninterested in fixing it. It's likely becoming a de facto standard that Linux systems have IPv6 enabled, even if they don't have it configured and can't reach anything with it. Someday we're going to see daemons that bind themselves only to the IPv6 localhost, not the IPv4 one.

ChronyDisableIPv6 written at 22:28:36


