Wandering Thoughts

2021-07-18

On sending all syslog messages to one file

Over on Twitter, I had a view on where syslog messages should go:

Tired sysadmin take: Different sorts of syslog messages going to different places are a mistake. Throw it all into /var/log/allmessages and I'll sort it out myself.

Like many Twitter takes of mine, in retrospect this one is heartfelt but a little bit too extreme as presented. Specifically, I think you should log all syslog messages to one place, but also log some sorts of messages to their own additional places so you can look through them more easily.

In the old days, I used to carefully curate my syslog.conf so that every different syslog facility had its own different file. Often, the net result of this was that I would end up using grep on every current syslog file in /var/log because I'd forgotten (or had never known) what facility a given program logged under. Trying to predict what facility a program will use is often almost as futile as predicting what priority level its messages will be logged under.

(This is worse if you rely on the Unix vendor stock syslog.conf instead of customizing it. Unix vendors are inevitably different from each other, and some of them have rather strange ideas of what should go where.)

All of this leads to the tired sysadmin take of putting everything into one file (/var/log/allmessages is what I prefer) and then searching it. An allmessages file is the brute force solution to unpredictable programs and Unix vendor variability, and it also makes sure everything gets logged. But sending all syslog messages to only a single place is a little bit of overkill. Despite my tired take, there are often syslog facilities that it's sensible to also log to separate files, so you can look at just them.

The obvious case is kernel messages, and it's so obvious that systemd's journalctl has a dedicated flag to show you only kernel messages. If I were starting a syslog configuration from scratch, I would also have a log file dedicated to "auth" and "authpriv" messages, one dedicated to "mail" messages, and on my own systems, one dedicated to "daemon" messages. Everything would still go to allmessages; these files are in addition to it.

(And on some systems you might opt to have specific programs log to specific facilities, like "user" or "local0", and have specific files so you can monitor and see the activities of just those programs.)
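
(If I were writing this down in a traditional syslog.conf, the sketch would look something like the following; the file names and exact selectors are purely illustrative, and your syslogd's dialect may differ.)

  # everything except debug goes to the catch-all file
  *.info                /var/log/allmessages
  # plus extra copies of the facilities we want to look at separately
  kern.*                /var/log/kernlog
  auth,authpriv.*       /var/log/authlog
  mail.*                /var/log/maillog
  daemon.*              /var/log/daemonlog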

Sending all syslog messages to an allmessages file is a blunt hammer, and like all blunt hammers it's possible to overuse it. Being able to scan through a single file that has everything has a lot of positive features, but not everything is best served by searching for it through a giant file. Sometimes you want both options.

SyslogToOnePlace written at 23:13:42; Add Comment

2021-07-17

The minimum for syslog configurations should be to log (nearly) everything

I have some opinions on how the venerable Unix syslog should be set up, but a very strong one of them is that (nearly) every syslog message should be logged somewhere. I consider this a minimum standard for vendor and distribution supplied syslog.conf files. The 'nearly' is that although syslog priorities don't mean much these days, I think a Unix is reasonably justified in not syslog'ing the debug priority for most facilities. However, a stock syslog.conf should definitely log each of the syslog facilities supported by its syslog to somewhere.

(POSIX's syslog.h defines seventeen facilities. Actual Unixes define more; Linux syslog(3) and OpenBSD have 20, while FreeBSD has 23.)

This should also be something you preserve in any local versions or modifications to the standard syslog configuration. Unless you're extremely sure that a syslog facility will never ever be used, you should keep logging it somewhere. And if you're sure it will never be used, well, what's the harm in having it sent to a file that will always wind up being empty? This is especially the case if you're running third party software (whether commercial or open source), because programmers can have all sorts of clever ideas about what syslog facilities to use for what.

If you're extremely sure that you don't need to syslog a particular facility and so you leave it out, please put a comment in your syslog configuration file to explain this. A good goal to strive for in syslog configuration files (for you and for vendors) is to create one that convinces any sysadmin reading it (including your future self) that it covers everything that will ever be logged.
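
(A concrete sketch of both points, in traditional syslog.conf syntax:)

  # Deliberately not logged: the 'debug' priority. It's too noisy and
  # nothing here is expected to need it captured; add it if that changes.
  # Every facility at 'info' and above goes somewhere:
  *.info                /var/log/messages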

(My other syslog configuration opinions are for another entry.)

PS: Out of the Unixes we use, Ubuntu has a default configuration that clearly logs everything to either /var/log/syslog or /var/log/auth.log, while the stock OpenBSD configuration only covers a limited number of facilities. It's possible that OpenBSD covers every use of syslog in the base system (you'd certainly hope so), but if so I doubt it covers all uses of syslog in the packages collection.
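
(If I'm remembering the stock Ubuntu rsyslog configuration correctly, the two relevant lines in /etc/rsyslog.d/50-default.conf are roughly these:)

  auth,authpriv.*                 /var/log/auth.log
  *.*;auth,authpriv.none          -/var/log/syslog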

SyslogLogEverythingSomewhere written at 23:13:58; Add Comment

2021-07-16

The WireGuard VPN challenge of provisioning clients

I mentioned in yesterday's entry that at work I'm building a VPN server that will support WireGuard. I'm quite happy with WireGuard in general and I think it has some important attractive features (such as the lack of 'sessions'), but we won't be offering WireGuard for general use. I would like to, but every time I even consider the idea, I run headlong into the problem of provisioning, specifically of provisioning WireGuard clients in a way that ordinary people can successfully set up and use.

Right now, to set up a WireGuard client you need the server's name and port (which every VPN needs), the server's public key, the IP the server expects you to have inside the WireGuard connection (its AllowedIPs setting for you), and a private key that the server has the public key for. We also need you to set your DNS server(s) to correctly point to us, and for general VPN usage you have to set your AllowedIPs to 0.0.0.0/0. This is a lot more things for you to set up than other VPN servers need, partly because other VPN servers will push your internal IP, the DNS servers to use, and often other information to you. Much of this is also sensitive to typos or, in the case of keys, must be cut and pasted to start with (no one is typing a base64 WireGuard key). If you get your client IP wrong, for example, things just quietly don't work (the server will discard your traffic).

The client keypair is an especially touchy problem. The ideal would be to securely generate it on the client and upload the public key. In practice this is a lot to ask people to do more or less by hand, so in a realistic setup we would probably want to generate your client keypair on the server and then somehow give you access to the private key for you to configure alongside the server's public key. Given this, possibly the most generally usable way of provisioning WireGuard client connections would be to generate the wg.conf that a client would use with the normal WireGuard command line tools, then provide it to people and hope that any WireGuard client will be able to import it.

(The official WireGuard client for iOS and Android will apparently do this, including decoding the configuration from a QR code. I believe the official Windows client does as well. On Unix, you can use the wg.conf directly or import it into NetworkManager.)
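
(For illustration, the wg-quick style configuration file that we'd be generating and handing out looks roughly like the following. Every specific value here is a placeholder, not anything real:)

  [Interface]
  PrivateKey = <the client private key we generated for you>
  Address = <your assigned internal IP>/32
  DNS = <our DNS servers>

  [Peer]
  PublicKey = <the VPN server's public key>
  Endpoint = <the VPN server's name>:<port>
  AllowedIPs = 0.0.0.0/0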

An additional complication is that you need a separate WireGuard configuration on each device that you want to use WireGuard on at the same time. So we wouldn't just be provisioning one WireGuard setup per person; we're looking at one for your laptop, one for your phone, one for your tablet, and so on. This also complicates naming them and keeping track of them (for people and for us), and would likely tempt people into reusing configurations across devices, which leads to fun problems if both devices are in use at the same time.

I don't blame the WireGuard project for this state of affairs. Provisioning is both a hard problem and a high level concern that is sort of out of scope for a project that's deliberately low level and simple. I'm honestly impressed (and happy) that there are official WireGuard clients on as many platforms as there are. I do wish there was some officially supported way to push configuration information to clients, although I understand why there isn't.

(Tailscale is not a solution for us for various reasons, including price. I do admire them for solving the provisioning problem, though.)

WireGuardProvisioningChallenge written at 23:56:07; Add Comment

2021-07-05

Losing track of part of our Amanda configuration and then recovering it

Yesterday, I wrote an entry on two ways to have Amanda always make full backups of a filesystem, and mentioned that we'd started out with one way (forcing them with amadmin) and had switched to the second way (configuring 'dumpcycle 0') a few years ago. There's a story as to why I wrote an entry about it now, instead of a few years ago.

For years, we've always been making full backups of our mail spool, because in our experience 'incremental' backups were almost as big and took longer. We started out doing this with our first method, and it worked fine for years. Then we started having the mail spool not back up at all once in a while. This was obviously bad, and we eventually worked out that it was because 'amadmin .. force' skips doing a backup at all if a full backup won't fit (as covered in yesterday's entry). Having worked out this logic and found the second and (for us) better approach of 'dumpcycle 0', we switched over to it and moved on.

Then, over time, we forgot about the whole chain of logic. Recently we were looking at always doing full backups for another filesystem for reasons outside the scope of this entry, and we couldn't figure out why our existing full backups filesystem was set up to use this odd, indirect method instead of the obvious direct approach of forcing things. To make it more puzzling, our cron setup still had a commented out 'amadmin force' invocation. Fortunately we had archives of our old discussions, and we were able to go through them to recover the context of this bit of our Amanda configuration to understand the 'why' (and more of the 'what', because we hadn't remembered the gotcha of 'amadmin force' as compared to 'dumpcycle 0').

(I've sort of written about this, when I wrote about how you should document why you didn't do attractive things. This isn't quite what I was thinking of in that entry, but it's certainly in the general area.)

Our fix for this was to put a big comment about the situation in our amanda.conf (before the 'dumpcycle 0') to document the why of this configuration setting. We've also revised our cron setup to entirely remove the commented out 'amadmin force' bits, so in the future we couldn't stumble over them and start wondering about why they were there.

(We do often write up the 'why' of changes and configurations, but we mostly do it in our worklog system, which makes such information less immediately accessible and obvious. Here we now have strong evidence (ie, our getting confused this time) that we should make sure this information is very visible. Our worklogs can also have the problem of assumed context, including vaguely mentioning problems that we hadn't recorded in a worklog.)

AmandaLosingTrackOfWhy written at 00:44:11; Add Comment

2021-07-04

Two ways to have Amanda always make full backups of a filesystem

Amanda (also) is the open source backup system that we use to run our disk-based backups (which I see is now more than ten years old and we're still happy with its design). Amanda normally does full backups of your filesystems once in a while and then various levels of incrementals until the next full backup. However, sometimes there are various reasons that you want to always do full backups of a particular filesystem; it may be faster, it may be better for disaster recovery, or whatever. Over our use of Amanda so far, we've used two different Amanda mechanisms for this and have developed a distinct preference for one of them.

The first and most obvious method is a cron job that runs 'amadmin <yourconfig> force targethost /target/fs' before you start your backups. The second and less obvious way is to set a dumpcycle of 0 in a special amanda.conf dumptype and use it for the filesystem or filesystems that you want to always be backed up with full backups. We used the first method for years, but switched to the second one a few years ago.
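
(As a sketch, the second method looks something like this in amanda.conf; the dumptype name and the base dumptype it builds on are made up here:)

  define dumptype always-full {
      comp-user-tar             # or whatever dumptype you normally use
      comment "always make full backups of this"
      dumpcycle 0
  }

and then the disklist entry for the filesystem uses that dumptype:

  targethost /target/fs always-full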

The two methods have an assortment of relatively obvious small tradeoffs. For example, it's easy to temporarily stop always doing full backups with the first method; you comment out the cron job or whatever; with the second method, you have to change your dumptypes around. The second method is more visible in your disklist. And so on. However, there is a big operational difference between the two, and that is what happens if your tapes fill up and Amanda has no room to do a full dump.

(One of the ways to have tapes 'fill up' is to have no tapes available for some reason.)

When you force full dumps with amadmin, Amanda will skip dumping the filesystem rather than fall back to an incremental dump. This is in a sense fair; you told Amanda to do a full dump and it can't do one, so it's not doing anything. But it may well be surprising and not what you actually want. By contrast, if you set the dumpcycle to 0, Amanda will do incrementals if it has to because Amanda considers the dump cycle length to be a goal instead of a hard requirement.

In some environments, attempting an incremental dump may be sufficiently useless or harmful that you'd rather Amanda not even try. In our environment, we would rather have incrementals than nothing, and so we've switched from forcing with amadmin to setting the dumpcycle to 0.

(The corollary of this is that if you set 'dumpcycle 0', you probably want to keep an eye on your dump reports to see if Amanda is able to reliably deliver on this. Probably it will.)

AmandaAlwaysFullBackups written at 00:57:01; Add Comment

2021-06-29

Monitoring the status of Linux network interfaces with Prometheus

Recently I wrote about how we found out a network cable had quietly gone bad and dropped the link to 100 Mbits/sec and mentioned in passing that we were now monitoring for this sort of thing (and in comments, Ivan asked how we were doing this). We're doing our alerts for this through our existing Prometheus setup, using the metrics that node_exporter extracts from Linux's /sys/class/net data for network interface status, which puts some limits on what we can readily check.

To start with, you get the network interface's link speed in node_network_speed_bytes. The values I've seen on our hardware are 1250000000 (10G), 125000000 (1G), 12500000 (100M), and -125000 (for an interface that has been configured up but has no carrier). If all of your network ports are at a single speed, say 1G (or 10G), you can just alert on node_network_speed_bytes being anything other than your normal speed. We have a mixture of speeds, so I had to resort to a collection of alerts to cover all of the cases:
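
(The actual expressions wound up looking something like the following. This is only a sketch with made up host names; the real rules enumerate our actual 10G hosts and our legitimately 100M ports.)

  # hosts that should have 10G interfaces (made up names)
  node_network_speed_bytes{host=~"tengig1|tengig2"} != 1250000000
  # everything else should be at 1G, except the known 100M port
  node_network_speed_bytes{host!~"tengig1|tengig2"} != 125000000
    unless node_network_speed_bytes{host="oldhost",device="eno1"}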

An Ethernet interface that's been configured up but has no carrier has a node_network_carrier of 0 and also a node_network_speed_bytes that's negative (and it also has a node_network_up of 0). You can use either metric to detect this state and alert on it, which will find both unused network interfaces that your system has decided to try to do DHCP on and network interfaces that are supposed to be live but have no carrier. Unfortunately there's no way to detect the inverse condition of an interface that has carrier but that hasn't been configured up. The Linux kernel doesn't report on the link carrier state for interfaces that aren't UP, and so node_exporter has no metric that can detect this.

(I'd like to detect situations where an unused server port has live networking, either because a cable got plugged in or an existing disused cable became live. In our environment, either is a mistake we want to fix.)

These days, almost all network links are full duplex. You can detect links that have come up at half duplex by looking for a 'duplex="half"' label in the node_network_info metric. Since not all network interfaces have a duplex, you can't just look for 'duplex!="full"'. Technically 1G Ethernet can be run at half duplex, although there's nothing that should do this. 10G-T Ethernet is apparently full duplex only.

The node_network_up metric looks tempting but unfortunately it's a combination of dangerous and pointless. node_network_up is 1 if and only if the interface's operstate is 'up', and not all live network interfaces are 'up' when they're working. Prominently, the loopback ('lo') interface's normal operstate is 'unknown', as are WireGuard interfaces (and PPP interfaces). In addition, an operstate of 'up' requires there to be carrier on the interface. Nor does node_network_up being 1 mean that everything is fine, since an interface can be up without any IP addresses being configured on it.

(But if you want to use node_network_up, you probably want to use 'node_network_up != 1 and (node_network_protocol_type == 1)'. This makes it conditional on the interface being an Ethernet interface, so we know that operstate should be 'up' if it's functional. This is sufficiently complicated that I would rather look for up interfaces without carrier, since that's the only error condition we can actually see for Ethernet interfaces.)

Unfortunately, as far as I know there are no metrics that will tell you if an interface has IPv4 or IPv6 addresses configured on it (whether or not it has carrier and so is up). The 'address' that node_network_info and node_network_address_assign_type talk about is the Ethernet address, not IP addresses (as you can see from the values of the label in node_network_info). My conclusion is that you need to check whatever IP addresses you need to be up through the Blackbox exporter.

Given all of this, under normal circumstances, I think there are three sensible alerts or sets of alerts for network interfaces. One alert or set of alerts is for interface speed, based on node_network_speed_bytes, requiring your interfaces to be at their expected speeds. In many environments, you could then look for node_network_carrier being 0 to detect interfaces that are configured but don't have carrier. Finally, you might as well check for half duplex with 'node_network_info{duplex="half"}'.
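
(In expression form, the second and third checks are short; this is a sketch using the metrics mentioned above:)

  # Ethernet interfaces that are configured up but have no carrier
  node_network_carrier == 0 and node_network_protocol_type == 1
  # any interface that negotiated half duplex
  node_network_info{duplex="half"}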

(It seems likely that a cable (or a port) that fails enough to force you down to half duplex will trigger other conditions as well, but who knows.)

PrometheusCheckingNetworkInterfaces written at 23:55:34; Add Comment

2021-06-26

Ethernet network cables can go bad over time, with odd symptoms

Last week we got around to updating the kernels on all of our Ubuntu servers, including our Prometheus metrics server, which is directly connected to four networks. When the metrics server rebooted, one of those network interfaces flapped down and up for a bit, then suddenly had a lot of intermittent ping failures to machines on that subnet. At first I thought that this might be a network driver bug in the new kernel, but when I rebooted the server again the network interface came up at 100 Mbit/sec instead of 1 Gbit/sec and suddenly we had no more ping problems. When we replaced the network cable yesterday, that interface returned to the 1G it was supposed to be at and pinging things on that network may now be more reliable than before.

The first thing I take away from this is that network cables don't just fail cleanly, and when they do have problems your systems may or may not notice. Last week, the network port hardware on both our metrics server and the switch it was connected to spent hours thinking that the cable was fine at 1G when it manifestly wasn't.

For various reasons I wound up investigating how long this had been going on, using both old kernel logs on our syslog server and the network interface speed information captured by the Prometheus host agent. This revealed that the problem most likely started somewhere between June and August of 2019, when the network link speed dropped to 100 Mbit/sec and stayed there other than briefly after some reboots. Over all that time, we didn't notice that the network interface was running at one step down from its expected rate, partly because we weren't doing anything performance sensitive over it.

(We now have alerts for this, just in case it ever happens again.)

The second thing I take away from this is that network cables can fail in place even after they've been plugged in and working for months. This network cable wasn't necessarily completely undisturbed in our machine room, but at most it would have gotten brushed and moved around in the rack cable runs as we added and removed other network cables. But the cable still failed over time, either entirely on its own or with quite mild mechanical stress. It's possible that the cable was always flawed to some degree, but if so the flaws got worse, causing the cable to decay from a reliable 1G link down to 100M.

I don't think there's anything we can really do about this except to keep it in mind as a potential cause of otherwise odd or mysterious problems. We're definitely not going to recable everything with fresh cables just in case, and we're probably not even going to use freshly made or bought cables when we rack new machines.

(Over time we'll turn over our cable stock as we move to 10G, but it's going to be a long time before we have all of the machines there.)

NetworkCablesGoBad written at 00:01:40; Add Comment

2021-06-21

A realization about our VPN and IPv6 traffic

At work, we operate a VPN for our users. The VPN is used to access both internal resources inside our networks and university resources that are normally only available from 'on-campus' IP addresses. Because of the latter, and for historical reasons, our VPN servers are configured to tell VPN clients to route all of their traffic through the VPN, regardless of the destination. In other words, the VPN makes itself the default route for traffic. Today, in the process of investigating an unfortunate Google decision, I realized that there's an important qualification on that statement.

(We actually support two different sorts of VPNs, OpenVPN and L2TP, and have two servers for each type, but all of this is a technical detail. Conceptually, we have 'a VPN service'.)

We and our networks are IPv4 only; we haven't even started to implement IPv6, and it will probably be years before we do. Naturally this means that our VPN is IPv4 only, so its default route only applies to IPv4 traffic, which means that all of the client's IPv6 traffic bypasses our VPN. All of the IPv4 traffic flows through the VPN, but if your client has a working local IPv6 connection, any IPv6 traffic will go through it.

The first consequence of this is for traffic to places outside the university. An increasing number of ISP networks provide IPv6 addresses to people's devices, many of those devices prefer IPv6 where possible, and an increasing number of sites are reachable over IPv6. Connections from people's devices to those sites don't go through our VPN. But if you move the same device over to a network that only provides it an IPv4 address, suddenly you're going through our VPN to reach all of those sites. This makes troubleshooting apparent VPN based connection problems much more exciting than before; we may have to disable IPv6 during our tests, and we may have to find out if a user who's having problems has an IPv6 connection.

The second consequence is that some day some of the university's on-campus websites may start to have IPv6 addresses themselves. Traffic to these websites from IPv6 capable clients that are connected to the VPN will mysteriously (to people) be seen as outside traffic by those on-campus websites, because it's coming directly from the outside client over IPv6 instead of indirectly through our VPN over IPv4. There are also some external websites that have historically given special permissions to the university's IPs. If these websites are IPv6 enabled and your client is IPv6 enabled, they're going to see you as a non-university connection even with the VPN up.

There probably isn't anything we can sensibly do about this. I think it would be a bad idea to try to have our VPN servers grab all client IPv6 traffic and block it, even if that's possible. Among other things, there are probably IPv6 only ISPs out there that this would likely interact very badly with.

(Our VPN isn't officially documented as a privacy aid for general Internet usage, although people may well use it as that. So I don't consider it a security issue that the current configuration leaks people's real IPv6 addresses to sites.)

OurVPNAndIPv6Traffic written at 22:46:17; Add Comment

2021-06-17

In Prometheus queries, on and ignoring don't drop labels from the result

Today I learned that one of the areas of PromQL (the query language for Prometheus) that I'm still a bit weak on is when labels will and won't get dropped from metrics as you manipulate them in a query. So I'll start with the story.

Today I wrote an alert rule to make sure that the network interfaces on our servers hadn't unexpectedly dropped down to 100 Mbit/second (instead of 1Gbit/s or for some servers 10Gbit/s). We have a couple of interfaces on a couple of servers that legitimately are at 100M (or as legitimately as a 100M connection can be in 2021), and I needed to exclude them. The speed of network interfaces is reported by node_exporter in node_network_speed_bytes, so I first wrote an expression using unless and all of the labels involved:

node_network_speed_bytes == 12500000 unless
  ( node_network_speed_bytes{host="host1",device="eno2",...} or
    node_network_speed_bytes{host="host2",device="eno1",...} )

However, most of the standard labels you get on metrics from the host agent (such as job, instance, and so on) are irrelevant and even potentially harmful to include (the full set of labels might have to change someday). The labels I really care about are the host and the device. So I rewrote this as:

node_network_speed_bytes == 12500000 unless on(host,device) [....]

When I wrote this expression I wasn't sure if it was going to drop all other labels beside host and device from the filtered end result of the PromQL expression. It turns out that it didn't; the full set of labels for node_network_speed_bytes is passed through, even though we're only matching on some of them in the unless.

(The host and the device are all that I needed for the alert message so it wouldn't have been fatal if the other labels were dropped. But it's better to retain them just in case.)

Aggregation operators discard labels unless you use without or by, as covered by their documentation (although it's not phrased that way), since aggregating over labels is their purpose. As I've found out, careless use of aggregation operators can lose labels that are valuable for alerts (which may be what left me jumpy about this case). Aggregation over time keeps all labels, though, because it's aggregating over time instead of over some or all labels. But as I was reminded today (since I'm sure I've seen it before), vector matching using on and ignoring doesn't drop labels; it merely restricts what labels are used in the matching (and then it's up to you to make sure you still have a one to one vector match or at least a match that you expect; I've made mistakes there).

(You can also explicitly pull in additional labels from other metrics.)
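
(The usual tool for that is group_left. For example, this sketch copies the nodename label from node_exporter's node_uname_info onto each interface's speed:)

  # node_uname_info is always 1, so the values are unchanged
  node_network_speed_bytes
    * on(instance) group_left(nodename) node_uname_info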

There may be other cases in PromQL where labels are dropped, but if so I can't think of them right now. My overall moral is that I still need to test my assumptions and guesses in order to be sure about this stuff.

Sidebar: Why I used unless (... or ...) in this query

In many cases, the obvious way to exclude some things from an alert rule expression is to use negative label matches. However, these can only match on the value of a single label, not on the combination of several labels. As far as I know, if you want to exclude only certain label combinations (here 'host1 and eno2' and 'host2 and eno1') where the individual label elements can occur separately (so host1 and host2 both have other network interfaces, and other hosts have eno1 and eno2 interfaces), you're stuck with the more awkward construction I used. This construction is unfortunately somewhat brute force.

PrometheusOnIgnoringAndLabels written at 00:40:10; Add Comment

2021-06-16

The challenge of what to set server BIOSes to do on power loss

Modern PC BIOSes, including server BIOSes, almost always have a setting for what the machine should do if the power is lost and then comes back. Generally your three options are 'stay powered off', 'turn on', and 'stay in your last state'. Lately I've been realizing that none of them are ideal in our current 'work from home' environment, and the general problem is probably unsolvable without internal remote power control.

In the normal course of events, what we want while working from home is for servers to stay in their last power state. If the power is lost and then comes back, running servers will power back up but servers that we've shut down to take out of service will stay off. If we set servers to 'always turn on', we would have to remember to take servers out of service by powering down their outlet on our smart PDU, not just telling them to halt and power off at the OS level. And of course if we had them set to 'stay powered off', we would have to go in to manually power them up.

But a power loss is not the only case where we might have to take servers down temporarily. We've had one or two scares with machine room air conditioning, and if we had a serious AC issue we would have to (remotely) turn machines off to reduce the heat load. If we turn machines off remotely from the OS level, the BIOS setting of 'stay in your last state' doesn't give us any straightforward way of turning them back on, even with a smart PDU; if we toggle outlet power at the smart PDU, the server BIOS will say 'well I was powered off before so I will stay powered off'. What we need to recover from this situation is what I called internal remote power control, where we can remotely command the machine to turn on.
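
(One common form of internal remote power control is IPMI through each server's BMC. As a sketch, with a made up BMC hostname and credentials, recovering from this would be something like:)

  ipmitool -I lanplus -H server1-bmc -U admin -P xxx chassis power status
  ipmitool -I lanplus -H server1-bmc -U admin -P xxx chassis power on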

Right now, if we had an AC issue we would probably have to remember to turn machines off through our smart PDUs instead of at the OS level. With our normal BIOS settings, this would let us remotely restart them through the smart PDU afterward. Since this is very different from our normal procedure for powering off machines, I can only hope that we'd remember to do it in the pressure of a serious AC issue.

(Smart PDUs have a few issues. First, not all of our machines are on them because we don't have enough of them and enough outlets. Second, when you power off a machine this way you're trusting your mapping between PDU ports and actual machines. We think our mapping is trustworthy, but we'd rather not find out the hard way.)

BIOSPowerLossChallenge written at 00:04:11; Add Comment
