Wandering Thoughts archives

2014-02-26

PCI slot based device names are not necessarily stable

One of the ways that Linux tries to get stable device names these days is to base them on information about the PCI bus and slot that a particular device is located at. This naming is behind, for example, hardware-based Ethernet names (see also) and /dev/disk/by-path/ for SATA and SAS drives. The theory is that since the name describes the PCI(E) location, as long as you don't physically relocate the card the name will stay the same. This is especially useful for things on the motherboard (because you can't move them at all).
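
As a quick illustration of the disk side of this, you can look at the PCI-path based names on a live system (the by-path name in the second command is made up; yours will differ):

ls -l /dev/disk/by-path/
# resolve one by-path name back to whatever sdX it currently points at;
# the name here is purely illustrative
readlink -f /dev/disk/by-path/pci-0000:07:00.0-sas-0x4433221101000000-lun-0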

The only problem is that this is not necessarily the case. There exists PC hardware where adding, changing, or removing other hardware will change the PCI bus and slot information for your hardware without you touching it at all; this even includes hardware located on the motherboard. Really. And the shifts aren't necessarily small, either. In the case I ran into today, changing from a dual port to a single port PCIE Gigabit card and moving it one card slot to the left changed two SAS disk controllers from PCI 07:00.0 and 08:00.0 to 04:00.0 and 05:00.0. Of course this totally changed how their disks came up in /dev/disk/by-path.

(For more fun, the new single-port Ethernet became 07:00.0 when the old two ports had been 05:00.0 and 06:00.0.)

The resulting reality is that your PCI-based names are only stable if you change no hardware in the system. The moment you change any hardware all bets are off for all hardware. You may get lucky and have some devices keep their current PCI names but you may well not. And I don't think you're necessarily protected against perverse things like two equivalent devices swapping names (or at least one of them winding up with what was the other's old name).

If I'm reading lspci output correctly, what is really going on is that an increasing number of things are behind PCI bridges. These things create additional PCI buses (the first two digits in the PCI device numbering), and some combination of Linux, the system BIOS, and the PCI specification doesn't have a stable assignment for these additional buses. In fact since PCI(E) cards can themselves include additional bridges, a fully stable assignment would be very hard. This is part of what happened in my case; the old dual-port PCIE gigabit card contained not just two Ethernet controllers but two bridges as well (one for each controller), and these forcibly perturbed the numbering of other PCI 'buses' (which were really individual cards behind their own bridges).
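
If you want to see this structure on your own machine, a minimal sketch is to ask lspci for the topology as a tree:

lspci -tv                 # tree view: bridges and the buses behind them
lspci | grep -i bridge    # just the bridges and their assigned bus addresses

The tree view makes it fairly obvious which devices sit behind which bridges, and thus which bus numbers are liable to shift when cards are added or removed.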

PS: This has probably been the case for some time and this is just the first occasion I've run into it. We normally configure machines identically; it just so happened this time around that the first hardware unit we got in was used in part to test the dual-port card while the final unit configuration only needs a single-port card.

PCINamesNotStable written at 23:15:27; Add Comment

2014-02-19

Some rough things about the naming of SAS drives on Linux

This isn't a comprehensive look at the names of SAS drives because SAS (as a technology) has a lot of options. Instead this is a look at how directly attached SATA drives behind SAS controllers get named. I assume that real SAS drives get named the same (but I don't have any to test with) and I have no idea how things look for SAS (or SATA) drives behind a SAS expander or more complex SAS topologies. Also some of this depends on the SAS controller and its driver; we're using LSI cards driven by the mpt2sas driver.

To start with, SAS drives have sdX names and kernel 'SCSI host names' just like SATA drives do. As with SATA drives, sdX drive names are not necessarily stable across hotswaps. Unlike SATA drives, the SCSI host names are not necessarily fully stable either; I've seen them get renumbered within a SCSI host after a hotswap, so that what was 'scsi6:0:3:0' becomes 'scsi6:0:8:0'. So far, LSI cards have a single SCSI host regardless of how many SAS ports they support and how those ports are physically connected (eg with a 4x SFF-8087 multiplexer or broken out as individual SAS ports).
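
A convenient way to see both sorts of names at once is the lsscsi utility (a separate package on most distributions); it prints one line per device with the [host:channel:target:lun] name next to the /dev/sdX name:

lsscsi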

(For background, see my basic overview of SAS and the names of Linux SATA devices, which covers kernel SCSI host names in more detail along with the three other device namespaces.)

All of this tells you nothing about the drive's physical slot. In SAS (or at least in mpt2sas for directly attached devices), physical location information is the SAS PHY number. To find this you must go grubbing around in sysfs, so I will just show you:

set -P                   # in bash: make cd resolve symlinks to physical paths
cd /sys/block/sdX/device # substitute your actual disk for sdX
cd ../../..              # go up three levels, to the SAS port's directory
ls phy-*                 # the phy-<host>:<N> entry here names the PHY

(There is probably a simpler path from the block device, but sysfs is the kind of thing where I find a working way and then stop.)

The PHY name will be of the form 'phy-6:0', meaning PHY 0 on SCSI host 6. PHY numbers sometimes also show up in kernel messages, such as:

scsi 6:0:2:0: SATA: handle(0x000a), sas_addr(0x4433221101000000), phy(1), device_name(0x50014ee25df6de01)
port-6:7: remove: sas_addr(0x4433221103000000), phy(3)

Mapping PHY numbers to actual physical hardware slots is something that you'll have to do yourself for your specific hardware. Please don't assume that PHY numbering matches how the card BIOS does it (as seen in BIOS information printouts during boot or if you go into the BIOS itself); for our LSI cards, it does not.

(Although it may be obvious, PHY numbers are reused between different SAS controllers. Several controllers may all have a PHY 0.)

Since the SCSI host name of a SAS drive in a given physical slot is not stable, /dev/disk/by-path sensibly does not use it for SAS drives. Instead it uses the 'SAS address' of each disk in combination with the PCI device number. The SAS address for each drive is exposed in sysfs as /sys/block/sdX/device/sas_address and on our hardware with mpt2sas appears to vary only due to the PHY number. You can see SAS addresses in the earlier kernel messages I gave; the first message results in a by-path filename that looks like 'pci-0000:07:00.0-sas-0x4433221101000000-lun-0' (for the whole disk, and the '01' portion appears to mark the PHY number).
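
If you want to dump these for everything at once, a minimal sketch (assuming only that SAS-attached disks expose device/sas_address the way ours do) is:

# print the SAS address of every sd* disk that has one; disks that are
# not behind a SAS controller simply have no sas_address file
for dev in /sys/block/sd*; do
    [ -r "$dev/device/sas_address" ] || continue
    echo "${dev##*/}: $(cat "$dev/device/sas_address")"
done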

Note that SAS addresses are only unique on their particular SAS controller. The system this message comes from has two SAS controllers and both controllers have disks with the SAS address of 0x4433221101000000.

(Possible interesting reading is this writeup based on what the mptsas driver does. What I've checked seems to match fairly well with what mpt2sas does on our system.)

LinuxSASNames written at 23:58:51; Add Comment

2014-02-14

'Broken by design: systemd' is itself kind of broken

Recently, an article by Rich Felker called Broken by design: systemd has been making the rounds. While I am sympathetic with complaints about systemd, the problem is that this article is both more or less deliberately misleading and factually wrong in various of its sections. Normally I would pass over this (per the lesson of the famous xkcd strip), but not today for various reasons. I'll be quoting from the article to comment on specific issues I have with it.

(To hopefully avoid possible misunderstandings, I've written up my overall views of systemd and put them in a sidebar at the bottom of this entry.)

Felker more or less opens with:

My view is that this idea is wrong: systemd is broken by design, and despite offering highly enticing improvements over legacy init systems, it also brings major regressions in terms of many of the areas Linux is expected to excel: security, stability, and not having to reboot to upgrade your system.

To start with, when Felker talks about 'broken by design' and 'major regressions' he means both of these in a theoretical or philosophical sense; in other words he objects to how systemd is designed and feels that it is a bad idea. He does not point out anything that systemd fails at, can't do, or does wrong today in actual use. In practice systems running systemd have not been less secure or less stable and do not have to reboot to upgrade any more (or less) than non-systemd Linux systems do.

(Desktop Linux systems have increasingly been wanting to reboot after upgrades but this is driven by factors independent from systemd.)

On a hardened system without systemd, you have at most one root-privileged process with any exposed surface: sshd. Everything else is either running as unprivileged users or does not have any channel for providing it input except local input from root. Using systemd then more than doubles the attack surface.

Unfortunately this is false on a modern Linux system unless part of Felker's hardening involves disabling DBus and then fixing everything that stops working as a result of that. Any Linux system using DBus has a DBus daemon running as root, whether it is part of systemd or not, and that is a significant and user-accessible exposed surface (although only to local users). It may also expose DBus APIs for other root processes such as udev-related services.

(My understanding is that DBus has become essentially mandatory because udev wants to talk to it to broadcast hotplug events. Udev itself is deeply entwined in the modern Linux boot process to the point where removing it is less 'hardening your system' and more 'creating a new Linux distribution'.)

Update: I'm less and less confident of my understanding of how udev and DBus are linked to each other and how DBus runs. I may be wrong here about how necessary DBus is for udev and the security implications of DBus; this would mean that I'm wrong here and systemd offering DBus services is a real new exposure.

This increased and unreasonable risk is not inherent to systemd's goal of fixing legacy init. However it is inherent to the systemd design philosophy of putting everything into the init process.

I disagree with this view because I feel that a great deal of the increased attack surface systemd exposes is inherent in a number of core design decisions. Systemd is an active supervising init, so you must be able to somehow tell it to manipulate services (and load information about new ones). It holds service state in memory instead of trying to write status files on disk and keep them in sync; this implies you need a way of querying that service state. Systemd has further decided that unprivileged users can query that state, which means that unprivileged users can talk to it in general.
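
You can see this for yourself as an ordinary user; these queries go over DBus and need no special privileges (the unit name here is just an example):

systemctl status sshd.service
systemctl show --property=ActiveState sshd.service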

While systemd uses DBus for most or all of this I think that there is a serious argument that it is better to use a general core facility that a lot of people are paying a lot of attention to rather than reinvent the wheel on your own. A lot of people are worrying about the security and integrity of DBus and DBus libraries, many more than would be worrying about a systemd-specific protocol and set of message encoding and decoding code.

Unfortunately, by moving large amounts of functionality that's likely to need to be upgraded into PID 1, systemd makes it impossible to upgrade without rebooting. [...]

As Felker later admits, this is somewhere between 'factually incorrect' and 'aggressively misleading'. Systemd can and does serialize its state and re-exec itself during upgrades, and in practice this works reliably. My machines have upgraded systemd repeatedly without any kernel reboots involved (and this includes upgrades as drastic as Fedora version upgrades, eg from Fedora 19 to Fedora 20; yes I rebooted afterwards, but systemd was upgraded before then).
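
The mechanism is exposed directly, and package upgrades generally do the equivalent of:

systemctl daemon-reexec    # serialize state, re-exec the (new) systemd binary as PID 1, restore state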

Yes, there are theoretical failure modes of this (as Felker agonizes about). I have a number of views on this but the simple version is that this problem exists in any other init system (most of which have been re-execing themselves on upgrades for years) and for any number of important system daemons as well as init. For example, if sshd fails to restart during an upgrade many servers are just as screwed as if init dies.

Felker also raises the issue of compatibility problems with the serialized state between an old and a new version. If it happened, this would be a distribution bug; when a distribution ships any upgrade it's that distribution's responsibility to make sure that the upgrade is compatible and won't make an upgraded system explode. Distributions have failed at this without systemd, but this is not a failure of what they are packaging, it is a failure of the distribution and its processes.

  • Many of the selling-point features of systemd are server-oriented. State-of-the-art transaction-style handling of daemon starting and stopping is not a feature that's useful on desktop systems. The intended audience for that sort of thing is clearly servers.

If you read the systemd design documents, this is clearly incorrect. One of systemd's explicit goals is to not start daemons on desktop systems until they're needed, especially heavyweight daemons like CUPS. If anything this is a drawback on servers, where people like me want to know right away on reboot if something is not going to work a day from now when someone tries to use it for the first time.
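
For example, on a distribution that ships socket units for CUPS, the on-demand behavior is just a matter of which unit you enable (unit names vary between distributions and these are illustrative):

systemctl disable cups.service
systemctl enable cups.socket    # CUPS itself only starts when something first talks to the socket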

(Systemd's fast boot time due to starting services in parallel and various other tricks is also primarily a desktop advantage in my opinion, with perhaps a sideline in cloud virtual instances. Physical servers reboot infrequently and their boot is often drastically slowed down by the firmware's burning need to lovingly fondle every bit of hardware in sight. Not that I'm grumpy about it or anything.)

  • The desktop is quickly becoming irrelevant. The future platform is going to be mobile and is going to be dealing with the reality of running untrusted applications. While the desktop made the unix distinction of local user accounts largely irrelevant, the coming of mobile app ecosystems full of potentially-malicious apps makes "local security" more important than ever.

The systemd developers disagree about the future irrelevance of the desktop, as do I. Beyond that, systemd has a significant amount of support for running services and other things in confined environments via use of Linux cgroups, something that is highly useful on both servers (for running daemons in lesser-privileged environments or with strong resource limits) and on desktops and other user machines for exactly this sort of untrusted applications.

None of the things systemd "does right" are at all revolutionary. They've been done many times before. DJB's daemontools, runit, and Supervisor, among others, have solved the "legacy init is broken" problem over and over again (though each with some of their own flaws). Their failure to displace legacy sysvinit in major distributions had nothing to do with whether they solved the problem, and everything to do with marketing. [...]

I disagree with this at sufficient length that I wrote an entire entry on why systemd is winning the init wars and other things aren't. The short version is that only Upstart has even been trying to do so.

If none of [of the alternate init systems] are ready for prime time, then the folks eager to replace legacy init in their favorite distributions need to step up and either polish one of the existing solutions or write a better implementation based on the same principles. Either of these options would be a lot less work than fixing what's wrong with systemd.

The final sentence is demonstrably false. Systemd works today on a great number of machines and the alternate init systems do not. Making the alternative init systems work would be a significant amount of effort, especially if you do as Felker advocates and completely replace the current init code to shove most of what init historically has done off to new programs. What might take 'a lot less work' for alternate init systems than systemd is changing them to fit Felker's vision of how init should work, a vision that is not how things work today even in System V init.

Felker does not make it clear if he thinks that legacy init even needs to be replaced (and there is certainly a contingent of people who feel that it doesn't need to be). I feel that System V init has a number of significant issues, issues that really do make a difference when managing systems. Other people seem to share this view given that major Linux distributions have moved to adopt other init systems (first with Upstart in Ubuntu, Fedora, and RHEL, and now with a move to systemd). And going outside of Linux, Solaris's SMF is the granddaddy of drastic modern init overhauls. Clearly this is an idea that has resonated with a lot of technical people over time.

(And as Felker forthrightly says, systemd offers 'highly enticing improvements over legacy init systems'.)

Sidebar: Smaller issues in Felker's article

Among the reasons systemd wants/needs to run as PID 1 is getting parenthood of badly-behaved daemons that orphan themselves, preventing their immediate parent from knowing their PID to signal or wait on them.

This is not the case. Systemd runs parts of itself as PID 1 because that is what an init system does. Systemd actually handles badly behaved daemon processes not through noticing when they are reparented to PID 1 but through Linux cgroups, which provide accurate tracking of what service a process belongs to.
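
You can look at this tracking directly: systemd-cgls prints the control group tree with every process grouped under the service it belongs to, no matter how many times the daemon has forked and detached:

systemd-cgls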

In general inheriting the parentage of badly behaved daemon processes is useless for an init system because in standard Unix the init system has no way of figuring out what (abstract) service a random process it has just inherited is associated with or otherwise where it came from. In short, inits inherit random daemon processes only because they inherit all random processes.

(Why does PID 1 inherit orphan processes as opposed to something else happening to them? The ultimate answer is 'because that's how Unix works'.)

[...] While legacy init systems basically deal with no inputs except SIGCHLD from orphaned processes exiting and manual runlevel changes performed by the administrator, [...]

This is the case much of the time on modern servers but is not historically the case. One of init's major roles over time has been handling getty processes for the console and for serial connections, a role which involves a fair amount of complexity (for instance, most inits have had rate-limiting so that a broken getty or line wouldn't eat the system). And runlevel changes are actually a subset of the more general init-managed facilities exposed in /etc/inittab in System V init.

With that said, it's completely true that systemd deals with a lot more input sources than traditional System V init. Some of this is intrinsic in being an active supervision-based init system instead of a passive one like System V init, as an active init system must have some way of being told to manipulate services.

Sidebar: My overall views of systemd

I want to summarize my view of systemd to avoid misunderstandings. First, I feel that systemd is currently the best Linux init system from a sysadmin's perspective for reasons that I mostly covered in an earlier entry on things that systemd gets right. Second, I don't think that systemd is the ultimate init system (especially the ultimate Unixy init system). Instead I see it as part of Unix's necessary experimentation and growth. System V init is not flawless and systemd is one of a number of attempts to move the state of the art in init systems forward. We'll collectively learn from this over time and either improve systemd or come up with better solutions and replace it.

SystemdAndBrokenByDesign written at 22:10:55; Add Comment

2014-02-11

Why systemd is winning the init wars and other things aren't

Recently, an article by Rich Felker called Broken by design: systemd has been making the rounds. I have a number of things to say about this article but today I want to talk about one specific issue it brings up, which is systemd's novelty (or lack thereof) and why it is succeeding. To start with, here is the relevant quote from Felker's article:

None of the things systemd "does right" are at all revolutionary. They've been done many times before. DJB's daemontools, runit, and Supervisor, among others, have solved the "legacy init is broken" problem over and over again (though each with some of their own flaws). Their failure to displace legacy sysvinit in major distributions had nothing to do with whether they solved the problem, and everything to do with marketing. [...]

This is wrong on several levels. To start with and as usual, social problems are the real problems. In specific, none of these alternate init systems did the hard work to actually become a replacement init system for anything much. Anyone can write an init system, especially a partial one (I did once, long ago). Getting it adopted by people is the hard part and none of these alternatives tackled that effectively (if they did so at all, and some of them certainly didn't). And as Felker admits, each of these theoretical alternatives has flaws of its own.

(Note that this is not a criticism of those alternate init systems. I don't think any of them have really been developed with replacing SysV init in Linux distributions or elsewhere as a goal. DJB daemontools certainly wasn't; I believe that DJB's attitude towards it, as towards more or less everything he's developed, can be summed up as 'I showed you the way, what you do with it is up to you'.)

The reason systemd has succeeded in becoming an SysV init replacement is simple: it did the work. Not only did it put together a lot of good ideas regardless of their novelty or lack thereof but its developers put in the time and effort to convince people that it was a good idea, the right answer, a good solution to problems and so on. Then they dealt with lots and lots of practical concerns, backwards compatibility, corner cases, endless arguments, and so on and so forth. I want to specifically mention here that one of the things the systemd people did was write extensive documentation on systemd's design, how to configure and operate it, and what sorts of neat things you can do with it. While this documentation is not perfect, most init systems are an order of magnitude less well documented.

(I am sure that in some quarters it's popular to believe that Lennart Poettering bulldozed the Fedora technical people into adopting his new thing. I do not think that the Fedora technical people are that easily overrun (or that impressed by Poettering, especially after PulseAudio), and for that matter at least some of the Debian technical people feel that systemd is the best option despite having looked deeply at the alternatives (cf).)

You can call this marketing if you want, although I don't think that that's a useful label for what is really happening. I call this 'trying' versus 'not trying'. If you don't try hard and work hard to become a replacement init system, it should be no surprise when you don't.

(In particular, note that SysV init is not a particularly bad init system so it should be no surprise when it is not particularly easy to displace.)

Beyond that I have some degree of experience with one of these alternate init systems, specifically DJB daemontools, and I've looked at the documentation for the other two. Speaking as a system administrator, systemd solves my problems better. The authors of systemd have looked at problems that are not solved by SysV init and come up with real solutions to them. Many of these problems are not solved by any of the alternatives that Felker put forward. In specific, often the alternatives assume (or require) cooperative daemon processes in order to fully realize their benefits; systemd is deliberately designed so that it does not need such cooperation and can fully manage even existing obstreperous Unix daemons with their willful backgrounding and other inconvenient behaviors.

(I don't know the field of Linux and Unix init-like systems well enough to say whether or not features like socket activation and clever use of control groups are genuinely novel in systemd or simply the first time I've become aware of them. They do feel novel.)

Since that may not be clear, let me be plain: systemd is a better init system than the alternatives. It does more to solve real problems and it does it better. That alone is a good reason for it to win in the practical world, the one where people care about getting stuff done. That systemd is not necessarily novel or the first to come up with the ideas that it embodies is irrelevant to this. Implementation matters more than ideas.

(Arguably it's an advantage that systemd feels no urge to reinvent different wheels when perfectly decent ones exist.)

PS: Please note that the reason that Unix itself succeeded is not its ideas alone, it is that Unix implemented them very well. A number of Unix's ideas are both great and novel, but a bad implementation would have doomed the whole enterprise. The fate of good ideas with a bad implementation is to be reimplemented elsewhere, cf the Xerox Alto and for that matter the Apple Lisa.

PPS: Also note that the one serious competitor to systemd is Upstart, which is also the product of a great deal of work and polishing.

SystemdWhyItWon written at 17:36:32; Add Comment

2014-02-09

Why I want a solid ZFS implementation on Linux

The short version of this is 'ZFS checksums and ZFS scrubs'. Without strong per-block integrity protections, there are two issues that I increasingly worry about for my Linux workstations with mirrored disks: read errors on the remaining live disk when resynchronizing a RAID-1 mirror after it loses one disk, and slow data loss due to undetected read errors and corrupted on-disk data. Slow data loss is also a worry for backups on a single backup disk or especially an archival disk (I'll have more than one archive disk but cross-verification may be very painful).

(ZFS also offers flexible space management for filesystems, but this is less of an issue for me. In practice the filesystems on my workstation just grow slowly over time, which is a scenario that's already handled by LVM. I might do some reorganization if I could shrink filesystems easily but probably not much.)
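
Growing a filesystem under LVM is a routine two-step operation. A minimal sketch, with illustrative volume and filesystem names:

lvextend -L +10G /dev/vg0/home    # grow the logical volume by 10 GB
resize2fs /dev/vg0/home           # grow the ext4 filesystem into it (this works online)

Shrinking is the part that LVM and ext4 make much less convenient.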

ZFS's block checksums combined with regular scrubs basically immunize me against these creeping problems. Unless I'm very unlucky I can pretty much count on any progressive disk damage getting repaired, and if I'm unlucky at least I'll know about it and maybe I can retrieve things from backups. Of course in theory Btrfs can do all of this too, but Btrfs remains not ready for production, and unlike ZFS that verdict applies to its fundamental code, not just to the bits that connect a solid core to Linux.
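
The mechanics are deliberately simple. Assuming a mirrored pool with the illustrative name tank:

zpool scrub tank        # read every block and verify its checksum, repairing from the good mirror side
zpool status -v tank    # show scrub progress, per-device checksum error counts, and anything unrepairable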

(That ZFS is not integrated into the mainline kernel also makes it somewhat risky to use ZFS on distributions like Fedora that stick closely to the current mainline kernels and update frequently. Btrfs is obviously much better off here, so I really wish it was stable and proven in widespread usage.)

I suppose the brute force overkill solution to this dilemma is an OmniOS based fileserver that NFS exports things to my Linux workstation, but there are various drawbacks to that (especially at home).

(Running my entire desktop environment on OmniOS is a complete non-starter.)

(This is sort of the background explanation behind a tweet.)

LinuxZFSWant written at 20:58:04; Add Comment

