Wandering Thoughts


Link: Stop Using Encrypted Email

Stop Using Encrypted Email, from Latacora, is about all of the fundamental reasons why you should do exactly that. See also eg Patrick McKenzie, and the discussion on Hacker News, where there are more comments from Latacora people (look for tptacek and lvh).

(Latacora people include Thomas Ptacek, who you may remember from Against DNSSEC.)

links/StopUsingEncryptedEmail written at 11:08:36


Load average is now generally only a secondary problem indicator

For a long time I've been in the habit of considering a high load average (or an elevated one) to be a primary indicator of problems. It was one of the first numbers I looked at on a system to see how it was, I ran xloads on selected systems to watch it more or less live, I put it on Grafana dashboards, and we've triggered alerts on it for a long time (since well before our current metrics and alerting setup existed). But these days I've been moving away from that, because of things like how our login server periodically has brief load average spikes and our IMAP server's elevated load average has no clear cause or impact.

When I started planning this entry, I was going to ask if load average even matters any more. But that's going too far. In a good number of situations, looking at the load average will tell you a fair bit about whether you have a significant problem or perhaps the system is operating as expected but close to its limits. For instance, if a machine has a high CPU usage, it might be a single process that is running a lot (which could be expected), or it could be that you have more running processes than the machine can cope with; the load average will help you tell which is which. But a low load average doesn't mean the machine is fine and a high load average doesn't mean it's in trouble. You need to look for primary problem indicators first, and then use load average to assess how much of a problem you have.

(There are echoes of Brendan Gregg's USE method here. In USE terms, I think that load average is mostly a crude measure of saturation, not necessarily of utilization.)
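Part of why it's a saturation measure on Linux is that the load average counts not just runnable tasks but also tasks in uninterruptible (usually IO) sleep. A quick way to peek behind the number on any Linux machine is via procfs (a sketch, using standard procps tools):

```shell
# /proc/loadavg holds the 1/5/15-minute load averages, then a
# running/total task count and the most recently created PID.
cat /proc/loadavg

# Both runnable (R) and uninterruptible-sleep (D) processes feed the
# Linux load average, so listing them shows what's driving it right now.
ps -eo state,pid,comm | awk '$1 ~ /^[RD]/'
```

If the second command shows mostly D-state processes, the load average is telling you about IO (or NFS) waits rather than CPU demand.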

Despite my shifting view on this, we're probably going to keep using load average in our alerts and our dashboards. It provides some information and more importantly it's what we're used to; there's value in keeping with history, assuming that the current state of things isn't too noisy (which it isn't; our load average alerts are tuned to basically never go off). But I'm running fewer xloads and spending less time actually looking at load average, unless I want to know about something I know is specifically reflected in it.

sysadmin/LoadAverageSecondarySign written at 23:37:41

How and why we regularly capture information about running processes

In a recent entry, I mentioned that we periodically capture ps and top output on our primary login server, and in fact we do it on pretty much all of our servers. There are three parts to this; the history of how we wound up here, how we do it, and why we've come to do it as a routine thing on our servers.

We had another monitoring system before our current Prometheus based one. One of its handy features was that when it triggered a load average alert, the alert email would include 'top' output rather than just have the load average. Often this led us right to the cause (generally a user running some CPU-heavy thing), even if it had gone away by the time we could look at the server. Prometheus can't do this in any reasonable way, so I did the next best thing by setting up a system to capture 'top' and 'ps' information periodically and save it on the machine for a while. The process information wouldn't be right in the email any more, but at least we could still go look it up later.

Mechanically, this is a cron job and a script that runs every minute and saves 'top' and 'ps' output to a file called 'procs-<HH>:<MM>' (eg 'procs-23:10') in a specific local directory for this purpose (in /var on the system). Using a file naming scheme based on the hour and minute the cron job started and overwriting any current file with that name means that we keep the last 24 hours of data (under normal circumstances). The files are just plain text files without any compression, because disk space is large these days and we don't need anything fancier. On a busy server this amounts to 230 MBytes or so for 24 hours of data; on less active servers it's often under 100 MBytes.
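A minimal sketch of such a capture script follows. The 'procs-HH:MM' naming and the rolling 24 hours of files are from the entry; the exact directory, the crontab line, and the top/ps flags are my guesses rather than our production script (and /tmp is used as the default here only so the sketch runs unprivileged):

```shell
#!/bin/sh
# Capture a moment-in-time process snapshot; run from cron every minute:
#   * * * * * root /usr/local/sbin/capture-procs
DIR="${PROCDIR:-/tmp/procs-capture}"    # placeholder; ours is under /var
mkdir -p "$DIR"

# Hour:minute naming means tomorrow's run overwrites today's file with
# the same HH:MM, keeping a rolling 24 hours of snapshots.
OUT="$DIR/procs-$(date +%H:%M)"
{
  date
  command -v top >/dev/null && top -b -n 1   # batch mode, one iteration
  echo
  ps auxww                                   # full, unwrapped process list
} > "$OUT"
```

Overwriting in place means no cleanup job is needed; the directory's size stays bounded by the size of 1440 snapshot files.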

Our initial reason for doing this was to be able to identify users with CPU-consuming processes, so we started out only deploying this on our login servers, our general access compute servers (that anyone can log in to at any time), and a few other machines like our general web server. However, over time it became clear that being able to see what was running (and using CPU and RAM) around some time was useful even on servers that aren't user accessible, so we now install the cron job, script, local data directory, and so on on pretty much all of our machines. We don't necessarily look at the information the system captures all that often, but it's a cheap precaution to have in place.

(We also use Unix process accounting on many machines, but that doesn't give you the kind of moment-in-time snapshot that capturing 'top' and 'ps' output does.)

sysadmin/OurProcessInfoCapturing written at 00:13:17


The uncertainty of an elevated load average on our Linux IMAP server

We have an IMAP server, using Dovecot on Ubuntu 18.04 and with all of its mail storage on our NFS fileservers. Because of historical decisions (cf), we've periodically had real performance issues with it; these issues have been mitigated partly through various hacks and partly through migrating the IMAP server and our NFS fileservers from 1G Ethernet to 10G (our IMAP server routinely reads very large mailboxes, and the faster that happens the better). However, the whole experience has left me with a twitch about problem indicators for our IMAP server, especially now that we have a Prometheus metrics system that can feed me lots of graphs to worry about.

For a while after we fixed up most everything (and with our old OmniOS fileservers), the IMAP server was routinely running at a load average of under 1. Since then its routine workday load average has drifted upward, so that a load average of 2 is not unusual and it's routine for it to be over 1. However, there are no obvious problems the way there used to be; 'top' doesn't show constantly busy IMAP processes, for example, indicators such as the percentage of time the system spends in iowait (which on Linux includes waiting for NFS IO) are consistently low, and our IMAP stats monitoring doesn't show any clear slow commands the way it used to. To the extent that I have IMAP performance monitoring, it only shows slow performance for looking at our test account's INBOX, not really other mailboxes.

(All user INBOXes are in our NFS /var/mail filesystem and some of them are very large, so it's a really hot spot and is kind of expected to be slower than other filesystems; there's only really so much we can do about it. Unfortunately we don't currently have Prometheus metrics from our NFS fileservers, so I can't easily tell if there's some obvious performance hotspot on that fileserver.)

All of this leaves me with two closely related mysteries. First, does this elevated load average actually matter? This might be the sign of some real IMAP performance problem that we should be trying to deal with, or it could be essentially harmless. Second, what is causing the load average to be high? Maybe we frequently have processes that are blocked waiting on IO or something else, or processes running in micro-bursts of CPU usage.

(eBPF based tracing might be able to tell us something about all of this, but eBPF tools are not really usable on Ubuntu 18.04 out of the box.)

Probably I should invest in developing some more IMAP performance measurements and also consider doing some measurements of the underlying NFS client disk IO, at least for simple operations like reading a file from a filesystem. We might not wind up with any more useful information than we already have, but at least I'd feel like I was doing something.
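For the 'reading a file' measurement, even something crude gives a baseline to compare against later. A sketch (the path is a placeholder; on the real IMAP server you'd point it at a file on the NFS mount in question):

```shell
#!/bin/sh
# Time a sequential read of one file. On an NFS client this measures
# the whole client-side path (RPC, server, network), not just disk.
f="${NFS_TEST_FILE:-/tmp/nfs-read-test}"       # placeholder path
# Create a small test file if one doesn't already exist.
[ -e "$f" ] || dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null
time dd if="$f" of=/dev/null bs=128k 2>/dev/null
```

One caution: repeat reads will usually come from the client's page cache, so it's the first (cold) read that tells you about NFS latency; nfsiostat from nfs-utils is another way to get per-mount NFS latencies as a cross-check.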

linux/LoadAverageIMAPImpactQuestion written at 22:22:22

The case of mysterious load average spikes on our Linux login server

We have a Linux login server that is our primary server basically by default; it's the first one in numbering and the server a convenient alias is pointed to, so most people wind up using it. Naturally we monitor its OS level metrics as part of our Prometheus setup, and as part of that a graph of its load average (along with all our other interesting servers) appears on our overview Grafana dashboard. For basically as long as we've been doing this, we've noticed that this server experiences periodic and fairly drastic short term load average spikes for no clear reason.

A typical spike will take the 1-minute load average from 0.26 or so (the typical load average for it) up to 6.5 or 7 in a matter of seconds, and then immediately start dropping back down. There seems to often be some correlation with other metrics, such as user and system CPU time usage, but not necessarily a high one. We capture ps and top output periodically for reasons beyond the scope of this entry, and these captures have never shown anything in particular even when they capture the high load average itself. The spikes happen at all times, day or night and weekday or weekend, and don't seem to come in any regular pattern (such as every five minutes).

The obvious theory for what is going on is that there are a bunch of processes that have some sort of periodic wakeup where they do a very brief amount of work, and they've wound up more or less in sync with each other. When the periodic wakeup triggers, a whole bunch of processes become ready to run and so spike the load average up, but once they do run they don't do very much so the log-jam clears almost immediately (and the load average immediately drops). Since it seems to be correlated with the number of logins, this may be something in systemd's per-login process infrastructure. Since all of these logins happen over SSH, it could also partly be because we've set a ClientAliveInterval in our sshd_config so sshd likely wakes up periodically for some connections; however, I'm not clear how that would wind up in sync for a significant number of people.

I don't know how we'd go about tracking down the source of this without a lot of work, and I'm not sure there's any point in doing that work. The load spikes don't seem to be doing any harm, and I suspect there's nothing we could really do about the causes even if we identified them. I rather expect that having a lot of logins on a single Linux machine is now not a case that people care about very much.

linux/LoadAverageMultiuserSpikes written at 01:19:38


With sudo, complex argument validation is best in cover scripts

Suppose, as a not entirely hypothetical case, that you want to allow some people to run 'zfs destroy' to delete only ZFS snapshots (since ZFS cannot delegate this through its own permission system). You can tell ZFS snapshots apart from other ZFS objects because ZFS snapshots all have '@' in their names. There are two approaches to enforcing this restriction on 'zfs destroy' arguments. The first is to write a suitable sudoers rule that carefully constrains the arguments to 'zfs destroy' (see Michael's comment on this entry for one attempt). The second is to write a cover script that takes the snapshot names, validates them itself, and runs 'zfs destroy' on the suitably validated results. My view is that you should generally use cover scripts to do complex argument validation for sudo'd commands, not sudoers.

The reason for this is pretty straightforward and boils down to whitelisting being better than blacklisting. A script is in the position to have minimal arguments and only allow through what it has carefully determined is safe. Using sudoers to permit only some arguments to an underlying general purpose command usually puts you in the position of trying to blacklist anything bad (sometimes explicitly and sometimes implicitly, as in Michael's match pattern that blocks a nominal snapshot name with a leading '-'). General purpose commands are usually not written so that their command line arguments are easy to filter and limit; instead they often have quite a lot of general arguments that can interact in complex ways. If you only want to have a limited subset of arguments accepted, creating a cover script that only accepts those arguments is the simple approach.
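As a concrete sketch of the whitelisting approach (illustrative only, not our production script; the validation rule is just the entry's observation that snapshots always contain '@', plus refusing anything that looks like an option):

```shell
#!/bin/sh
# Hypothetical cover script: accept only ZFS snapshot names, then
# destroy them. A name qualifies only if it contains '@' and doesn't
# start with '-', so it can't be mistaken for an option or a filesystem.
validate_snapshot() {
  case "$1" in
    -*)   return 1 ;;   # refuse anything option-like
    *@?*) return 0 ;;   # snapshots are always 'filesystem@name'
    *)    return 1 ;;
  esac
}

# Validate everything first so we either destroy all arguments or none.
for snap in "$@"; do
  validate_snapshot "$snap" || { echo "not a snapshot: $snap" >&2; exit 1; }
done
for snap in "$@"; do
  zfs destroy "$snap"
done
```

Note that this is a whitelist in the sense that 'zfs destroy' is only ever invoked with names that passed the check; there's no attempt to enumerate all the dangerous argument shapes.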

Cover scripts also have the additional advantage that they can simplify the underlying commands in ways that reduce the chance of errors and make it clearer what you're doing. This is related to the issue of command error distance, although in these cases often your sudo setup is intended to block the dangerous operation in the first place. Still, the principle of fixing low command error distances with cover scripts applies here.

(Of course the downside is that now people have to remember the script instead of the actual command. But if you're extending sudo permissions to people who would not normally use the command at all, you have to train them about it one way or another.)

sysadmin/SudoersAndCoverScripts written at 02:16:35


Unix's /usr split and standards (and practice)

In Rob Landley about the /usr split, Rob Landley doesn't have very good things to say about how the split between /bin and /usr/bin (and various other directories) has continued to exist, especially in various standards. One of my views on this is that the split continuing to exist was always inevitable, regardless of why the split existed and what reasons people might have for preserving it (such as diskless workstations benefiting from it).

As far as standards go, Unix standards have pretty much always been mostly documentation standards, codifying existing practice with relatively little invention of new things. The people trying to create Unix standards are not in a position to mandate that existing Unixes change their practices and setup, and existing Unixes have demonstrated that they will just ignore attempts to do so. Writing a Unix filesystem hierarchy standard that tried to do away with /bin and mandated that /usr was on the root filesystem would have been a great way for it to fail.

(POSIX attempted to mandate some changes in the 1990s, and Unix vendors promptly exiled commands implementing those changes off to obscure subdirectories in /usr. Part of this is because being backward compatible is the path of least resistance and fewest complaints from customers.)

For actual Unixes in practice, conforming to the historical weight of existing other Unixes (including their own past releases) has always been the easiest way. There are countless people and scripts and so on that expect to find some things in /bin and some things in /usr/bin and so on, and the less you disrupt all of that the easier your life is. Inventing new filesystem layouts and pushing for them takes work; any Unix has a finite amount of work it can do and must carefully budget where that work goes. Reforming the filesystem layout is rarely a good use of limited time and work, partly because the returns on it are so low (and people will argue with you, which is its own time sink).

(Totally reinventing Unix from the ground up has been tried, by the people arguably in the best position possible to do it, and the results did not take the world by storm. Plan 9 from Bell Labs still has its fans and some of its ideas have leaked out to mainstream Unix, but that's about it.)

The modern irony about the whole issue is that recent versions of Linux distributions are increasingly requiring /usr to be on the root filesystem and merging /bin, /lib, and so on into the /usr versions, but this has been accomplished by the 800 pound gorilla of systemd, which many people are not happy about in general. The monkey's paw hopes you're happy with sort of achieving the end of this split.

(A clean end to the split would be to remove one or the other of /bin and /usr/bin, and similarly for the other duplicated directories.)

unix/UsrSplitAndStandards written at 23:33:35

The /bin versus /usr split and diskless workstations

I was recently reading Rob Landley about the /usr split (via), which can be summarized as him being not very enthused about the split between /bin and /usr/bin, and how long it has lasted. I have some opinions on this as a whole, but today I want to note that one factor in keeping this split going is diskless workstations and the issue of /etc.

Unix traditionally puts a number of machine specific pieces of information in /etc, especially the machine's hostname and its IP address and basic network configuration. A straightforward implementation of a diskless Unix machine needs to preserve this, meaning that you need a machine-specific /etc and that it has to be available very early (because it will be used to bootstrap a lot of the rest of the system). The simplest way to provide a machine specific /etc is to have a machine specific root filesystem, and for obvious reasons you want this to be as small as possible. This means not including /usr (and later /var) in this diskless root filesystem, which means that you need a place in the root filesystem to put enough programs to boot the system and NFS mount the rest of your filesystems. That place might as well be /bin (and later /sbin).

This isn't the only way to do diskless Unix machines, but it's the one that involves the least changes from a normal Unix setup. All you need is some way to get the root filesystem (NFS) mounted, which can be quite hacky since it's a very special case, and then everything else is normal. An /etc that isn't machine specific and where the machine specific information is set up and communicated in some other way requires significantly more divergence from standard Unix, all of which you will have to write and maintain. And diskless Unix machines remained reasonably popular for quite some time for various reasons.

(There is potentially quite a lot of machine specific information in /etc. Although it was common for diskless Unix machines to all be the same, you could want to run different daemons on some of them, have custom crontabs set up, only allow some people to log in to certain workstations, or all sorts of other differences. And of course all of these potential customizations were spread over random files in /etc, not centralized into some configuration store that you could just provide an instance of. In the grand Unix tradition, /etc was the configuration store.)

unix/DisklessUnixAndUsr written at 00:42:31


You can't delegate a ZFS administration permission to delete only snapshots

ZFS has a system that lets you selectively delegate administration permissions from root to other users (exposed through 'zfs allow') on a per filesystem tree basis. This led to the following interesting question (and answer) over on the fediverse:

@wxcafe: hey can anyone here confirm that there's no zfs permission for destroying only snapshots?

@cks: I can confirm this based on the ZFS on Linux code. The 'can you destroy a snapshot' code delegates to a general 'can you destroy things' permission check that uses the overall 'destroy' permission.

(It also requires mount permissions, presumably because you have to be able to unmount something that you're about to destroy.)

The requirement for unmount means that delegating 'destroy' permissions may not work on Linux (or may not always work), because only root can unmount things on Linux. I haven't tested whether ZFS will let you delegate unmount permission (and thereby pass its internal checks) only to have the actual unmount operation fail later, or whether the permission simply can't be delegated on Linux (which would mean that you can't delegate 'destroy' either).

The inability to allow people to delete only snapshots is a bit unfortunate, because you can delegate the ability to create them (as the 'snapshot' permission). It would be nice to be able to delegate snapshot management entirely to people (or to an unprivileged account used for automated snapshot management) but not let them destroy the filesystem itself.

This situation is the outcome of two separate and individually sensible design decisions, which combine together here in a not great way. First, ZFS decided that creating snapshots would be a separate 'zfs' command but destroying them would be part of 'zfs destroy' (a decision that I personally dislike because of how it puts you that much closer to an irreversible error). Then when it added delegated permissions, ZFS chose to delegate pretty much by 'zfs' commands, although it could have chosen a different split. Since destroying snapshots is part of 'zfs destroy', it is all covered under one 'destroy' permission.

(The code in the ZFS kernel module does not require this; it has a separate permission check function for each sort of thing being destroyed. They all just call a common permission check function.)

The good news is that while writing this entry and reading the 'zfs allow' manpage, I realized that there may sort of be a workaround under specific situations. I'll just quote myself on Mastodon:

Actually I think it may be possible to do this in practice under selective circumstances. You can delegate a permission only for descendants of a filesystem, not for the filesystem itself, so if a filesystem will only ever have snapshots underneath it, I think that a 'descendants only' destroy delegation will in practice only let people destroy snapshots, because that's all that exists.

Disclaimer: this is untested.

On our fileservers, we don't have nested filesystems (or at least not any that contain data), so we could do this; anything that we'll snapshot has no further real filesystems as children. However in other setups you would have a mixture of real filesystems and snapshots under a top level filesystem, and delegating 'destroy' permission would allow people to destroy both.

(This assumes that you can delegate 'unmount' permission so that the ZFS code will allow you to do destroys in the first place. The relevant ZFS code checks for unmount permission before it checks for destroy permission.)
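The workaround from the quote would look something like the following. It's untested, as the entry says; 'someuser' and 'tank/home/user' are placeholders, and the guard at the top only exists so this sketch is a harmless no-op on a machine without ZFS:

```shell
#!/bin/sh
command -v zfs >/dev/null || exit 0   # no-op without ZFS installed

# Delegate snapshot creation on the filesystem itself...
zfs allow someuser snapshot tank/home/user

# ...but delegate mount and destroy with -d, which restricts the
# delegation to *descendants* of the filesystem. If the only descendants
# tank/home/user ever has are its snapshots, then in practice this only
# lets someuser destroy snapshots, because that's all that exists.
zfs allow -d someuser mount,destroy tank/home/user
```

The 'mount' permission is included because, as noted above, the ZFS code checks for unmount permission before it checks for destroy permission.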

solaris/ZFSNoSnapshotDeleteDelegation written at 22:35:04

Some git aliases that I use

As a system administrator, I primarily use git not to develop my own local changes but to keep track of what's going on in projects that we (or I) use or care about, and to pull their changes into local repositories. This has caused me to put together a set of git aliases that are probably somewhat different than what programmers wind up with. For much the same reason that I periodically inventory my Firefox addons, I'm writing down my common aliases here today.

All of these are presented in the form that they would be in the '[alias]' section of .gitconfig.

  • source = remote get-url origin

    I don't always remember the URL of the upstream of my local tracking repository for something, and often I wind up wanting to go there to do things like look at issues, releases, or whatever.

  • plog = log @{1}..

    My most common git operation is to pull changes from upstream and look at what they are. This alias uses the reflog to theoretically show me the log for what was just pulled, which should be from the last reflog position to now.

    (I'm not confident that this always does the right thing, so often I just cut and paste the commit IDs that are printed by 'git pull'. It's a little bit more work but I trust my understanding more.)

  • slog = log --pretty=slog

    Normally if I'm reading a repo's log at all, I read the full log. But there are some repos where this isn't really useful and some situations where I just want a quick overview, so I only look at the short log. This goes along with the following in .gitconfig to actually define the 'slog' format:

        slog = format:* %s

  • pslog = log --pretty=slog @{1}..

    This is a combination of 'git plog' and 'git slog'; it shows me the log for the pull (theoretically) in short log form.

  • ffpull = pull --ff-only
    ffmerge = merge --ff-only

    These are two aliases I use if I don't entirely trust the upstream repo to not have rebased itself on me. The ffpull alias pulls with only a fast-forward operation allowed (the equivalent of setting 'git config pull.ff only', which I don't always remember to do), while ffmerge is what I use in worktrees.

(Probably I should set up a git alias or something that configures a newly cloned repo with all of the settings that I want. So far I think that's only 'pull.ff only', but there will probably be more in the future.)
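That 'configure a newly cloned repo' step could itself start life as an alias. A sketch (the name 'cfg' is made up here, and the scratch HOME is only so the example doesn't touch a real ~/.gitconfig):

```shell
#!/bin/sh
# Use a throwaway HOME so the --global writes below are contained.
export HOME="$(mktemp -d)"

# A hypothetical 'git cfg' alias that applies per-repo settings; the
# leading '!' makes git run it as a shell command.
git config --global alias.cfg '!git config pull.ff only'

# In a fresh clone, configuring it is then just 'git cfg'. Reading the
# alias back shows what was stored:
git config --global alias.cfg    # prints: !git config pull.ff only
```

More settings can be appended to the shell command (separated with '&&') as they accumulate.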

I have other git aliases but in practice I mostly don't remember them (for instance, I have an 'idiff' alias for 'git diff --cached').

programming/GitAliasesIUse written at 00:52:07
