Wandering Thoughts

2017-08-18

How ZFS on Linux names disks in ZFS pools

Yesterday I covered how on Illumos and Solaris, disks in ZFS pools have three names: the filesystem path, the 'physical path' (a PCI device name, similar to the information that lspci gives), and a 'devid' built from the vendor, model name, and serial number of the disk. While these are Solaris concepts, Linux has similar things and you could at least mock up equivalents of them in the kernel.

ZFS on Linux doesn't try to do this. Instead of having three names, it has only one:

# zdb -C vmware2
MOS Configuration:
[...]
  children[0]:
    type: 'disk'
    id: 0
    guid: 8206543908042244108
    path: '/dev/disk/by-id/ata-ST500DM002-1BC142_Z2AA6A4E-part1'
    whole_disk: 0
[...]

ZoL stores only the filesystem path to the device, using whatever path you told it to use. To get the equivalent of Solaris devids and physical paths, you need to use the right sort of filesystem path. Solaris devids roughly map to /dev/disk/by-id names and physical paths map to /dev/disk/by-path names (and there isn't really an equivalent of Solaris /dev/dsk names, which are more stable than Linux /dev/sd* names).
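To make this concrete, both sorts of names are udev-maintained symlinks back to the kernel's /dev/sd* names, which you can see with something like this (reusing the disk from the zdb output above; the by-path listing will show PCI-based names):

ls -l /dev/disk/by-id/ata-ST500DM002-1BC142_Z2AA6A4E-part1
ls -l /dev/disk/by-path/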

The comment in vdev_disk_open in vdev_disk.c discusses this in some detail, and it's worth repeating in full:

Devices are always opened by the path provided at configuration time. This means that if the provided path is a udev by-id path then drives may be recabled without an issue. If the provided path is a udev by-path path, then the physical location information will be preserved. This can be critical for more complicated configurations where drives are located in specific physical locations to maximize the systems tolerance to component failure. Alternatively, you can provide your own udev rule to flexibly map the drives as you see fit. It is not advised that you use the /dev/[hd]d devices which may be reordered due to probing order. Devices in the wrong locations will be detected by the higher level vdev validation.

(It's a shame that this information exists only as a comment in a source file that most people will never look at. It should probably be in large type in the ZFS on Linux zpool manpage.)

This means that with ZFS on Linux, you get only one try for the disk to be there; there's no fallback the way there is on Illumos for ordinary disks. If you've pulled an old disk and put in a new one and you use by-id names, ZoL will see the old disk as completely missing. If you use by-path names and you move a disk around, ZoL will not wind up finding the disk in its new location the way ZFS on Illumos probably would.

(The net effect of this is that with ZFS on Linux you should normally see a lot more 'missing device' errors and a lot fewer 'corrupt or missing disk label' errors than you would in the same circumstances on Illumos or Solaris.)

At this point, you might wonder how you change what sort of name ZFS on Linux is using for disks in your pool(s). Although I haven't done this myself, my understanding is that you export the pool and then import it again using the -d option to zpool import. With -d, the import process will end up finding the disks for the pool using the type of names that you want, and then actually importing the pool will rewrite the saved path data in the pool's configuration (and /etc/zfs/zpool.cache) to use these new names as a side effect.
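As a sketch of that procedure (untested by me, as I said; the pool name comes from the zdb output above and the target naming scheme is just an example):

zpool export vmware2
zpool import -d /dev/disk/by-path vmware2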

(I'm not entirely sure how I feel about this with ZFS on Linux. I think I can see some relatively obscure failure modes where no form of disk naming works as well as things do in Illumos. On the other hand, in practice using /dev/disk/by-id names is probably at least as good an experience as Illumos provides, and the disk names are always clear and explicit. What you see is what you get, somewhat unlike Illumos.)

ZFSOnLinuxDiskNames written at 02:35:11; Add Comment

2017-08-15

How to get per-user fair share scheduling on Ubuntu 16.04 (with systemd)

When I wrote up imposing temporary CPU and memory limits on a user on Ubuntu 16.04, I sort of discovered that I had turned on per-user fair share CPU scheduling as a side effect, although I didn't understand exactly how to do this deliberately. Armed with a deeper understanding of how to tell if fair share scheduling was on, I've now done a number of further experiments and I believe I have definitive answers. This applies only to Ubuntu 16.04 and its version of systemd as configured by Ubuntu; it doesn't seem to apply to, for example, a stock Fedora 26 system.

To enable per user fair share CPU scheduling, it appears that you must do two things:

  • First, set CPUAccounting=true on user.slice. You can do this temporarily with 'systemctl --runtime set-property' or enable it permanently.

  • Second, arrange to have CPUAccounting=true set on an active user slice. If you do this temporarily with 'systemctl --runtime', the user must be logged in with some sort of session at the time. If you do this permanently, nothing happens until that user logs in and systemd creates their user-${UID}.slice slice. (A combined example of both steps follows this list.)
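A minimal runtime-only version of both steps looks like this (UID 915 is just an illustrative example of a currently logged in user):

systemctl --runtime set-property user.slice CPUAccounting=true
systemctl --runtime set-property user-915.slice CPUAccounting=true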

Once you've done both of these, all future (user) sessions from any user will have their processes included in per-user fair share scheduling. If you used 'systemctl --runtime' on a user-${UID}.slice, it doesn't matter if that user logs completely out and their slice goes away; the fair share scheduling sticks despite this. However, fair-share scheduling goes away if all users log out and user.slice is removed by systemd. You need at least one remaining user session at all times to keep user.slice still in use (a detached screen session will do).

If you want to force existing processes to be subject to per-user fair share scheduling, you must arrange to set CPUAccounting=true on all current user scopes:

# turn on CPU accounting for every existing session scope
for i in $(systemctl -t scope list-units |
           awk '{print $1}' |
           grep '^session-.*\.scope$'); do
    systemctl --runtime set-property $i CPUAccounting=true
done

This creates a slightly different cgroup hierarchy than you'll get from completely proper fair share scheduling, but the differences are probably unimportant in practice. In regular fair share scheduling, all processes from the same user are grouped together under user.slice/user-${UID}.slice, so they contend evenly with each other. When you force scopes this way, processes stay grouped in their scopes, so they go in user.slice/user-${UID}.slice/session-<blah>.scope; as a result, a user's scopes are also fair-share scheduled against each other. This only applies to current processes and scopes; as users log out and then back in again, their new processes will all be grouped together.

If you have a sufficiently small number of users who will log in to your machines and run CPU-consuming things, it's feasible to create permanent settings for each of them with 'systemctl set-property user-${UID}.slice CPUAccounting=true'. If you have lots of users, as we do, this is infeasible; if nothing else, your /etc/systemd/system directory would wind up utterly cluttered. This means that you have to do it on the fly (and then do it again if all user sessions ended and systemd deleted user.slice).

This is where we run into an important limitation of per-user fair share scheduling on a normally configured Ubuntu 16.04. As we've set it up, fair-share scheduling only applies to processes that are under user.slice; system processes are not fair-share scheduled. It turns out that user cron jobs don't run under user.slice and so are not fair-share scheduled. All processes created by user cron entries wind up grouped together under cron.service; there is no per-user separation and nothing is put under user slices.

(It's possible that you can change this with PAM magic, but this is how a normal Ubuntu 16.04 machine behaves.)

I discovered this because I had the clever idea that I could use a root @reboot /etc/cron.d entry to set things on user.slice and user-0.slice shortly after the system booted. Attempting to do this led to the discovery that neither slice actually existed when my @reboot job ran, and that my process was under cron.service instead. As far as I can see there's no way around this; there just doesn't seem to be a systemd command that will run a command for you under a user slice.

(If there was, you could make a root @reboot crontab that ran the necessary systemctl commands and then didn't exit, so there would always be an active user slice so that user.slice wouldn't get removed by systemd.)

PS: My solution was to wrap up all of these steps into a shell script that we can run if we need to turn on fair-share scheduling on some machine because a bunch of users are contending over it. Such an on-demand, on-the-fly solution is good enough for our case (even if it doesn't include crontab jobs, which is a real pity for some machines).

Ubuntu1604FairShareScheduling written at 00:01:12; Add Comment

2017-08-12

Notes on cgroups and systemd's interaction with them as of Ubuntu 16.04

I wrote recently on putting temporary CPU and memory limits on a user, using cgroups and systemd's features to fiddle around with them on Ubuntu 16.04. In the process I wound up confused about various aspects of how things work today. Since then I've done a bit of digging and I want to write down what I've learned before I forget it again.

The overall cgroup experience is currently a bit confusing on Linux because there are now two versions of cgroups, the original ('v1') and the new version ('v2'). The kernel people consider v1 cgroups to be obsolete and I believe that the systemd people do as well, but in practice Ubuntu 16.04 (and even Fedora 25) use cgroup v1, not v2. You find out which cgroup version your system is using by looking at /proc/mounts to see what sort of cgroup(s) you're mounting. With cgroup v1, you'll see multiple mounts in /sys/fs/cgroup with filesystem type cgroup and various cgroup controllers specified as mount options, eg:

[...]
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,[...],cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,[...],pids 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,[...],net_cls,net_prio 0 0
[...]
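A quick way to pull out just these lines is to match on the filesystem type field (a small awk sketch):

awk '$3 ~ /^cgroup2?$/' /proc/mounts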

According to the current kernel v2 documentation, a cgroup v2 system would have a single mount with the filesystem type cgroup2. The current systemd.resource-control manpage discusses the systemd differences between v1 and v2 cgroups, and in the process mentions that v2 cgroups are incomplete because the kernel people can't agree on how to implement bits of them.

In my first entry, I wondered in an aside how you could tell if per-user fair share scheduling was on. The answer is that it depends on how processes are organized into cgroup hierarchies. You can see this for a particular process by looking at /proc/<pid>/cgroup:

11:devices:/user.slice
10:memory:/user.slice/user-915.slice
9:pids:/user.slice/user-915.slice
8:hugetlb:/
7:blkio:/user.slice/user-915.slice
6:perf_event:/
5:freezer:/
4:cpu,cpuacct:/user.slice/user-915.slice
3:net_cls,net_prio:/
2:cpuset:/
1:name=systemd:/user.slice/user-915.slice/session-c188763.scope

What this means is documented in the cgroups(7) manpage. The important thing for us is the interaction between the second field (the controller) and the path in the third field. Here we see that for the CPU time controller (cpu,cpuacct), my process is under my user-NNN.slice slice, not just systemd's overall user.slice. That means that I'm subject to per-user fair share scheduling on this system. On another system, the result is:

[...]
5:cpu,cpuacct:/user.slice
[...]

Here I'm not subject to per-user fair share scheduling, because I'm only under user.slice and I'm thus not separated out from processes that other users are running.

You can somewhat estimate the overall state of things by looking at what's in the /sys/fs/cgroup/cpu,cpuacct/user.slice directory. If there are a whole bunch of user-NNN.slice directories, processes of those users are at least potentially subject to fair share scheduling. If there aren't, processes from a user definitely aren't. Similar things apply to other controllers, such as memory.
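For example, for the CPU controller you can just look for per-user slice directories (the path is the standard cgroup v1 layout on Ubuntu 16.04):

ls -d /sys/fs/cgroup/cpu,cpuacct/user.slice/user-*.slice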

(The presence of a user-915.slice subdirectory doesn't mean that all of my processes are subject to fair share scheduling, but it does mean that some of them are. On the system I'm taking this /proc/self/cgroup output from, there are a number of people's processes that are only in user.slice in the CPU controller; these processes would not be subject to per-user fair share scheduling, even though other processes of the same user would be.)

If you want a full overview of how everything is structured for a particular cgroup controller, you can use systemd-cgls to see this information all accumulated in one spot. You have to ask for a particular controller specifically, for example 'systemd-cgls /sys/fs/cgroup/cpu,cpuacct', and obviously it's only really useful if there actually is a hierarchy (ie, there are some subdirectories under the controller's user.slice directory). Unfortunately, as far as I know there's no way to get systemd-cgls to tell you the user of a particular process if it hasn't already been put under a user-NNN.slice slice; you'll have to grab the PID and then use another tool like ps.

For setting temporary systemd resource limits on slices, it's important to know that systemd completely removes those user-NNN.slice slices when a user logs out from all of their sessions, and as part of this forgets about your temporary resource limit settings (as far as I know). This may make them more temporary than you expected. I'm not sure if trying to set persistent resource limits with 'systemctl set-property user-NNN.slice ...' actually works; my results have been inconsistent, and since this doesn't work on user.slice I suspect it doesn't work here either.

(As far as I can tell, temporary limits created with 'systemctl --runtime set-property' work in part by writing files to /run/systemd/system/user-NNN.slice.d. When a user fully logs out and their user-NNN.slice is removed, systemd appears to delete the corresponding /run directory, thereby tossing out your temporary limits.)

Although you can ask systemd what it thinks the resource limits imposed on a slice are (with 'systemctl show ...'), the ultimate authority is the cgroup control files in /sys/fs/cgroup/<controller>/<path>. If in doubt, I would look there; the systemd.resource-control manpage will tell you what cgroup attribute is used for which systemd resource limit. Of course you need to make sure that the actual runaway process you want to be limited has actually been placed in the right spot in the hierarchy of the relevant cgroup controller, by checking /proc/<pid>/cgroup.
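As an illustrative sketch for CPUQuota and MemoryLimit style limits (UID 915 is just an example; these are the cgroup v1 attributes that those two settings map to, as far as I know):

cat /sys/fs/cgroup/cpu,cpuacct/user.slice/user-915.slice/cpu.cfs_quota_us
cat /sys/fs/cgroup/memory/user.slice/user-915.slice/memory.limit_in_bytes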

(Yes, this whole thing is a complicated mess. Slogging through it all has at least given me a better idea of what's going on and how to inspect it, though. For example, until I started writing this entry I hadn't spotted that systemd-cgls could show you a specific cgroup controller's hierarchy.)

SystemdCgroupsNotes written at 00:04:29; Add Comment

2017-08-03

Imposing temporary CPU and memory resource limits on a user on Ubuntu 16.04

Suppose, not entirely hypothetically, that you sometimes have users on your primary login server who accidentally run big CPU-consuming and memory-eating compute jobs that will adversely impact the machine. You could kill their process or their entire login session, but that's both a drastic impact and potentially not a sure cure, and life gets complicated if they're running something involving multiple processes. In an ideal world you would probably want to configure this shared login server so that all users are confined with reasonable per-user resource limits. Unfortunately systemd cannot do that today; you need to put limits on the user-${UID}.slice unit that systemd creates for each user, but you can't template this unit to add your own settings.

Without always-on per-user resource limits, what you'd like to do is impose per-user resource limits on your runaway user on the fly, so that they can't use more than, say, half the CPUs or three quarters of the memory or the like (pick your own reasonable limits). Systemd can do this, in a way similar to using systemd-run to limit something's RAM consumption, but on Ubuntu 16.04 this requires a little bit more work than you would expect.

The basic approach is to set limits on the user's user-${UID}.slice slice unit:

systemctl --runtime set-property user-915.slice CPUQuota=200% MemoryLimit=8G

With --runtime, these limits will not persist over the next reboot (although that may be quite some time in the future, depending on how you manage your machines; ours tend to be up for quite a while).

In theory this should be all that you need to do. In practice, on Ubuntu 16.04 the first problem is that this will limit new login sessions for the user but not existing ones. Of course existing ones are the ones that you care about right now, because the user is already logged on and already running those CPU-eaters. The problem appears to be that just setting these properties does not turn on CPUAccounting and MemoryAccounting for existing sessions, so nothing is actually enforcing those limits.

The obvious fix here is to explicitly turn these on for the user-${UID}.slice unit we already manipulated. Sadly this has no effect. Instead the magic fix appears to be to find one of the user's scopes (use 'systemctl status <PID>' for one of the CPU-eating processes) and then set the accounting properties on that scope:

systemctl --runtime set-property session-c178012.scope CPUAccounting=true MemoryAccounting=true

In my testing, the moment that I set these on for any current scope, all of the user's current login sessions were affected. If I sort of understand what systemd is doing with cgroups, this is probably because setting these on a single scope causes (or forces) systemd to ripple this up to parent units. Taken from the systemd.resource-control manpage:

Note that turning on CPU accounting for one unit will also implicitly turn it on for all units contained in the same slice and for all its parent slices and the units contained therein.

It's possible that this will turn on global per-user fair share scheduling all by itself. This is probably not such a bad thing on the kind of shared login server where we'd want to do this.

If you think you're going to need to add these on-the-fly limits, an obvious thing to do is to pre-enable CPU and memory accounting, so that all user slices and login scopes will be created ready for you to add limits. The basic idea works, but several ways to achieve it do not, despite looking like they should. What appears to be the requirement on Ubuntu 16.04 is that you force systemd to adjust its current in-memory configuration. The most straightforward way is this:

systemctl --runtime set-property user.slice CPUAccounting=true MemoryAccounting=true

Doing this works, but it definitely has the side effect that it turns on per-user fair share CPU scheduling. Hopefully this is a feature for you (it probably is for us).

The following two methods don't work, or at least they don't persist over reboots (they may initially appear to work because they're also causing systemd to adjust its current in-memory configuration):

  • Enabling DefaultCPUAccounting and DefaultMemoryAccounting in user.conf via a file in /etc/systemd/user.conf.d, contrary to how I thought you'd set up per-user fair share scheduling. There is no obvious reason why this shouldn't work and it's even documented as working; it just doesn't in the Ubuntu 16.04 version of systemd (nominally version 229). If you do 'systemctl daemon-reload' they may initially appear to work, but if you reboot they will quietly do nothing.

  • Permanently enabling CPUAccounting and MemoryAccounting on user.slice with, for example, 'systemctl set-property user.slice CPUAccounting=true MemoryAccounting=true'. This will create some files in /etc/systemd/system/user.slice.d, but much like the user.conf change, they will do nothing after a reboot.

I can only assume that this is a systemd bug, but I don't expect it to ever be fixed in Ubuntu 16.04's version and I have no idea if it's fixed in upstream systemd (and I have little capability to report a bug, given the version number issue covered here).

(There is presumably some sign in /sys/fs/cgroup/* to show whether per-user fair share CPU scheduling is on or off, but I have no idea what it might be. Alternately, if the presence of user-${UID}.slice directories in /sys/fs/cgroup/cpu,cpuacct/user.slice means that per-user fair share scheduling is on, it's somehow wound up being turned on on quite a few of our machines.)

In general I've wound up feeling that this area of systemd is badly underdocumented. All of the manpage stuff appears to be written for people who already understand everything that systemd is doing internally to manage resource limits, or at least for people who understand much more of how systemd operates here than I do.

SystemdDynamicUserLimits written at 01:16:59; Add Comment

2017-07-31

Using policy based routing to isolate a testing interface on Linux

The other day I needed to do some network bandwidth tests to and from one of our sandbox networks and wound up wanting to use a spare second network port on an already-installed test server that was fully set up on our main network. This calls for policy based routing to force our test traffic to flow only over the sandbox network, so we avoid various sorts of asymmetric routing situations (eg). I've used Linux's policy based routing and written about it here before, but surprisingly not in this specific situation; it's all been in different and more complicated ones.

So here is what I need for a simple isolated testing interface, with commentary so that when I need this again I don't just have the commands; I can also re-learn what they're doing and why I need them.

  • First we need to bring up the interface itself. For quick testing I just use raw ip commands:

    ip link set eno2 up
    ip addr add dev eno2 172.21.1.200/16
    

  • We need a routing table for this interface's routes and a routing policy rule that forces use of them for traffic to and from our IP address on eno2.

    ip route add 172.21.0.0/16 dev eno2 table 22
    ip route add default via 172.21.254.254 table 22
    
    ip rule add from 172.21.1.200 iif lo table 22 priority 6001
    

    We need the local network route in table 22 so that traffic to the rest of 172.21/16 goes directly out eno2 instead of being sent to the gateway. The choice of table number is arbitrary. (A quick verification sketch follows this list.)
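To check that everything landed where you expect, you can dump the rule and the table:

ip rule list
ip route show table 22

(The rule should show up as something like '6001: from 172.21.1.200 iif lo lookup 22', and table 22 should have the two routes we just added.)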

By itself this is good enough for most testing. Other hosts can connect to your 172.21.1.200 IP and that traffic will always flow over eno2, as will outgoing connections that you specifically bind to the 172.21.1.200 IP address using things like ping's -I argument or Netcat's -s argument. You can also talk directly to things on 172.21/16 without having to explicitly bind to 172.21.1.200 first (ie you can do 'ping 172.21.254.254' instead of needing 'ping -I 172.21.1.200 172.21.254.254').

However, there is one situation where traffic will flow over the wrong network, which is if another host in 172.21/16 attempts to talk to your public IP (or if you try to talk to 172.21/16 while specifically using your public IP). Their outbound traffic will come in on eno1, but because your machine knows that it can talk to them directly on eno2 it will just send its return traffic that way (probably with odd ARP requests). What we want is to use the direct connection to 172.21/16 in only two cases. First, when the source IP is set to 172.21.1.200 in some way; this is already covered. Second, when we're generating outgoing traffic locally and we have not explicitly picked a source IP; this allows us to do just 'ping 172.21.254.254' and have it flow over eno2 the way we expect. There are a number of ways we could do this, but it turns out that the simplest way goes as follows.

  • Remove the global routing table entry for eno2:

    ip route del 172.21.0.0/16 dev eno2
    

    (This route in the normal routing table was added automatically when we configured our address on eno2.)

  • Add a new routing table with the local network route to 172.21/16 and use it for outgoing packets that have no source IP assigned yet:

    ip route add 172.21.0.0/16 dev eno2 src 172.21.1.200 table 23
    
    ip rule add from 0.0.0.0 iif lo lookup 23 priority 6000
    

    The nominal IP address 0.0.0.0 is INADDR_ANY (cf). INADDR_ANY is what the socket API uses for 'I haven't set a source IP', and so it's both convenient and sensible that the kernel reuses it during routing as 'no source IP assigned yet' and lets us match on it in our rules.

(Since our two rules here should be non-conflicting, we theoretically could use the same priority number. I'm not sure I fully trust that in this situation, though.)

You can configure up any number of isolated testing interfaces following this procedure. Every isolated interface needs its own separate table of its own routes, but table 23 and its direct local routes are shared between all of them.
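For instance, a hypothetical second isolated interface (eno3 on 172.22/16, with a made-up gateway) would get its own table for its routes but add its direct local route to the shared table 23:

ip link set eno3 up
ip addr add dev eno3 172.22.1.200/16

ip route add 172.22.0.0/16 dev eno3 table 24
ip route add default via 172.22.254.254 table 24
ip rule add from 172.22.1.200 iif lo table 24 priority 6001

ip route del 172.22.0.0/16 dev eno3
ip route add 172.22.0.0/16 dev eno3 src 172.22.1.200 table 23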

IsolatingTestingInterface written at 22:59:19; Add Comment

2017-07-16

Why upstreams can't document their program's behavior for us

In reaction to SELinux's problem of keeping up with app development, one obvious suggestion is to have upstreams do this work instead. A variant of this idea is what DrScriptt suggested in a comment on that entry:

I would be interested in up stream app developers publishing things about their application, including what it should be doing. [...]

Setting aside the practical issue that upstream developers are not interested in spending their time on this, I happen to believe that there are serious and probably unsolvable problems with this idea even in theory.

The first issue is that the behavior of a sophisticated modern application (which is what we most care about confining well) is actually a composite of at least four different sources of behavior and behavior changes: the program itself, the libraries it uses, how a particular distribution configures and builds both of these, and how individual systems are configured. Oh, and as covered, this is really not 'the program' and 'the libraries', but 'the version of the program and the libraries used by a particular distribution' (or when the app was built locally).

In most Linux systems, even simple looking operations can go very deep here. Does your program call gethostbyname()? If so, what files it will access and what network resources it attempts to contact cannot be predicted in advance without knowing how nsswitch.conf (and other things) are configured on the specific system it's running on. The only useful thing that the upstream developers can possibly tell you is 'this calls gethostbyname(), you figure out what that means'. The same is true for calls like getpwuid() or getpwnam(), as well as any number of other things.
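As a small concrete illustration, here's a fairly typical hosts line from an /etc/nsswitch.conf (the exact contents vary from system to system, which is exactly the point):

hosts: files mdns4_minimal [NOTFOUND=return] dns

Each entry pulls in a different NSS module, and thus a different set of files, sockets, and network services that the program will wind up touching at runtime.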

The other significant issue is that when prepared by an upstream, this information is essentially a form of code comments. Without a way for upstreams to test and verify the information, it's more or less guaranteed to be incomplete and sometimes outright wrong (just as comments are incomplete and periodically wrong). So we're asking upstreams to create security sensitive documentation that can be predicted in advance to be partly incorrect, and we'd also like it to be detailed and comprehensive (since we want to use this information as the basis for a fine-grained policy on things like what files the app will be allowed access to).

(I'm completely ignoring the very large question of what format this information would be in. I don't think there's any current machine-readable format that would do, which means either trying to invent a new one or having people eventually translate ad-hoc human readable documentation into SELinux policies and other things. Don't expect the documentation to be written with specification-level rigor, either; if nothing else, producing that grade of documentation is fairly expensive and time-consuming.)

AppBehaviorDocsProblem written at 01:18:05; Add Comment

2017-07-14

SELinux's problem of keeping up with general Linux development

Fedora 26 was released on Tuesday, so today I did my usual thing of doing a stock install of it in a virtual machine as a test, to see how it looks and so on. Predictable things ensued with SELinux. In the resulting Twitter conversation, I came to a realization:

It seems possible that the rate of change in what programs legitimately do is higher than the rate at which SELinux policies can be fixed.

Most people who talk about SELinux policy problems, myself included, usually implicitly treat developing SELinux policies as a static thing. If only one could understand the program's behavior well enough, one could write a fully correct policy and be done with it, but the problem is that fully understanding program behavior is very hard.

However, this is not actually true. In reality, programs not infrequently change their (legitimate) behavior over time as new versions are developed and released. There are all sorts of ways this can happen: there are new features in the program, changes to how the program itself works, changes in how libraries the program uses work, changes in what libraries the program uses, and so on. When these changes in behavior happen (at whatever level and for whatever reason), the SELinux policies need to be changed to match them in order for things to still work.

In effect, the people developing SELinux policies are in a race with the people developing the actual programs, libraries, and so on. In order to end up with a working set of policies, the SELinux people have to be able to fix them faster than upstream development can break them. It would certainly be nice if the SELinux people can win this race, but I don't think it's at all guaranteed. Certainly with enough churn in enough projects, you could wind up in a situation where the SELinux people simply can't work fast enough to produce a full set of working policies.

As a corollary, this predicts that SELinux should work better in a distribution environment that rigidly limits change in program and library versions than in one that allows relatively wide freedom for changes. If you lock down your release and refuse to change anything unless you absolutely have to, you have a much higher chance of the SELinux policy developers catching up to the (lack of) changes in the rest of the system.

This is a more potentially pessimistic view of SELinux's inherent complexity than I had before. Of course I don't know if SELinux policy development currently is in this kind of race in any important way. It's certainly possible that SELinux policy developers aren't having any problems keeping up with upstream changes, and what's really causing them these problems is the inherent complexity of the job even for a static target.

One answer to this issue is to try to change who does the work. However, for various reasons beyond the scope of this entry, I don't think that having upstreams maintain SELinux policies for their projects is going to work very well even in theory. In practice it's clearly not going to happen (cf) for good reasons. As is traditional in the open source world, the people who care about some issue get to be the ones to do the work to make it happen, and right now SELinux is far from a universal issue.

(Since I'm totally indifferent about whether SELinux works, I'm not going to be filing any bugs here. Interested parties who care can peruse some logs I extracted.)

SELinuxCatchupProblem written at 01:19:14; Add Comment

2017-07-10

Ubuntu's 'Daily Build' images aren't for us

In response to my wish for easily updating the packages on Ubuntu ISO images, Aneurin Price brought up jigdo. In researching what Jigdo is, I wound up running into a tantalizing mention of Ubuntu daily builds (perhaps from here). This sent me off to Internet searches and eventually I wound up on Ubuntu's page for the Ubuntu Server 16.04.2 LTS (Xenial Xerus) Daily Build. This looked like exactly what we wanted, already pre-built for us (which is perfectly fine by me, I'm happy to have someone else do the work of putting in all of the latest package updates for us).

However, when I went looking around Ubuntu's site I couldn't find any real mention of these daily builds, including such things as what they were for, how long they got updated, and so on. That made me a bit nervous, so I pulled down the latest 'current' 16.04 server build and took a look inside the ISO image. Unfortunately I must report that it turns out to not be suitable for what we want, ironically because it has packages that are too fresh. Well, a package; all I looked at was the kernel image. At the moment, the current daily ISO has a kernel package that is marked as being '4.4.0-85', while the latest officially announced and released Ubuntu kernel is 4.4.0-83. We may like having current updates in our install ISOs, but we draw the line at future updates that are still presumably in testing and haven't been officially released (and may never be, if some problem is found or they're replaced by even newer ones).

To be clear, I'm not blaming Ubuntu. They do daily builds for not-yet-released Ubuntu versions, which are obviously 'don't use these on anything you care about', so there is no particular reason why the daily builds for released Ubuntu versions would be any different (and there are perfectly good reasons for wanting the very latest test packages in a bundle). I was just hopeful when I found this site, so now I'm reporting a negative result.

PS: I'm just guessing as to why this image has kernel 4.4.0-85. As mentioned, I haven't found much information from Ubuntu on what these daily builds are about (for already-released versions), and I don't know too much about how potential updates flow through Ubuntu's work processes and so on. I did find this page on their kernel workflow and its link to this report page, and also this bug that's tracking 4.4.0-85 and its changelog.

UbuntuDailyISOsNotForUs written at 01:32:52; Add Comment

2017-07-09

Why we're not currently interested in PXE-based Linux installs

In theory, burning Ubuntu install DVDs (or writing USB sticks) and then booting servers from them in order to do installs is an old-fashioned and unnecessary thing. One perfectly functional modern way is to PXE-boot your basic installer image, go through whatever questions your Ubuntu install process needs to ask, and then likely have the installer get more or less everything over the network from regular Ubuntu package repositories (or perhaps a local mirror). Assuming that it works, you might as well enable the Ubuntu update repositories as well as the basic ones, so that you get the latest versions of packages right from the start (which would deal with my wish for easily updated Ubuntu ISO images).

We don't do any sort of PXE or network installs, though, and we probably never will. There are a number of reasons for this. To start with, PXE network booting probably requires a certain amount of irritating extra setup work for each such machine to be installed, for example to add its Ethernet address to a DHCP server (which requires actually getting said Ethernet address). Ways around this are not particularly appealing, because they either require running an open DHCP server on our primary production network (where most of our servers go) or contriving an entire second 'install network' sandbox and assuming that most machines to be installed will have a second network port. It also requires us to run a TFTP server somewhere to maintain and serve up PXE images.

(This might be a bit different if we used DHCP for our servers, but we don't; all of our servers have static IPs.)

Next, I consider it a feature that you can do the initial install of a machine without needing to do much network traffic, because it means that we can install a bunch of machines in parallel at more or less full speed. All you need is a bunch of prepared media (and enough DVD readers, if we're using DVDs). As a purely pragmatic thing this also vastly speeds up my virtual machine installs, since my 'DVD' is actually an ISO image on relatively fast local disks. Even a local Ubuntu mirror doesn't fully help here unless we give it a 10G network connection and a beefy, fast disk system (and we're not going to do that).

(We actually have a local Ubuntu mirror that we do package upgrades and extra package installs from in the postinstall phase of our normal install process. I've seen some signs that it may be a chokepoint when several machines are going through their postinstall process at once, although I'd need to take measurements to be sure.)

Finally, I also consider it a feature that a server won't boot into the installer unless there is physical media plugged into it. Even with an installer that does nothing until you interact with it (and we definitely don't believe in fully automated netboot installs), there are plenty of ways for this to go wrong. All you need is for the machine to decide to prioritize PXE-booting higher than its local drives one day and whoops, your server is sitting dead in the installer until you can come by in person to fix that. On the flipside, having a dedicated 'install network' sandbox does deal with this problem; a machine can't PXE boot unless it's physically connected to that network, and you'd obviously disconnect machines after the install has finished.

(I'm going to assume that the Ubuntu network install process can deal with PXE-booting from one network port but configuring your real IP address on another one and then not configuring the PXE boot port at all in the installed system. This may be overly generous.)

The ultimate reason probably comes down to how often we install machines. If we were (re)installing lots of servers reasonably often, it might be worth dealing with all of these issues so that we didn't have to wrangle media (and DVD readers) all the time and we'd get a faster install overall under at least some circumstances. Our work in learning all about PXE booting, over the network Ubuntu installs, and so on, and building and maintaining the necessary infrastructure would have a real payoff. But on average we don't install machines all that often. Our server population is mostly static, with new arrivals being relatively rare and reinstalls of existing servers being uncommon. This raises the cost and hassles of a PXE netboot environment and very much reduces the payoff from setting one up.

(I was recently installing a bunch of machines, but that's a relatively rare occurrence.)

WhyNotPXEInstalls written at 01:11:30; Add Comment

2017-07-08

I wish you could easily update the packages on Ubuntu ISO images

Our system for installing Ubuntu machines starts from a somewhat customized Ubuntu ISO image (generally burned onto a DVD, although I want to experiment with making it work on a USB stick) and proceeds through some post-install customization scripts. One of the things that these scripts do is apply all of the accumulated Ubuntu updates to the system. In the beginning, when an Ubuntu LTS release is fresh and bright and new, this update process doesn't need to do much and goes quite fast. As time goes by, this changes. With 16.04 about a year old by now, applying updates requires a significant amount of time on real hardware (especially on servers without SSDs).

Ubuntu does create periodic point updates for their releases, with updated ISO images; for 16.04, the most recent is 16.04.2, created in mid-February. But there's still a decent number of updates that have accumulated since then. What I wish for is a straightforward way for third parties (such as us) to create an ISO image that included all of the latest updates, and to do so any time they felt like it. If we could do this, we'd probably respin our install images on a regular basis, which would be good for other reasons as well (for example, getting regular practice with the build procedure, which is currently something we only do once every two years as a new LTS release comes out).

There is an Ubuntu wiki page on Install CD customization, with a section on adding extra packages, but the procedure is daunting and it's not clear if it's what you do if you're updating packages instead of adding new ones. Plus, there's no mention of a tool that will figure out and perhaps fetch all of the current updates for the set of packages on the ISO image (I suspect that such a tool exists, since it's so obvious a need). As a practical matter it's not worth our time to fight our way through the resulting collection of issues and work, since all we'd be doing is somewhat speeding up our installs (and we don't do that many installs).

Sidebar: Why this is an extra pain with Ubuntu (and Debian)

The short version is that it is because of how Debian and thus Ubuntu have chosen to implement package security. In the RPM world, what gets signed is the individual package and any collection of these packages is implicitly trusted. In the Debian and Ubuntu world, what generally gets signed is the repository metadata that describes a pool of packages. Since the metadata contains the cryptographic checksums of all of the packages, the packages are implicitly protected by the metadata's signature (see, for example, Debian's page on secure apt).

There are some good reasons to want signed repository metadata (also), but in practice it creates a real pain point for including extra packages or updating the packages. In the RPM world, any arbitrary collection of signed packages is perfectly good, so you can arbitrarily update an ISO image with new official packages (which will all be signed), or include extra ones. But in the Debian and Ubuntu world, changing the set of packages means that you need new signed metadata, and that means that you need a new key to sign it with (and then you need to get the system to accept your key).
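For illustration, the chain of trust on a mirror looks roughly like this (a sketch only; the files live under the standard dists/<release>/ directory):

gpg --verify Release.gpg Release
grep -A 3 '^SHA256:' Release | head

The signature covers the Release file (or the clearsigned InRelease), the Release file lists checksums for the Packages indexes, and those indexes in turn list checksums for every individual .deb.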

UbuntuISOPackageUpdate written at 00:07:16; Add Comment
