Wandering Thoughts archives

2021-06-29

Monitoring the status of Linux network interfaces with Prometheus

Recently I wrote about how we found out a network cable had quietly gone bad and dropped the link to 100 Mbits/sec and mentioned in passing that we were now monitoring for this sort of thing (and in comments, Ivan asked how we were doing this). We're doing our alerts for this through our existing Prometheus setup, using the metrics that node_exporter extracts from Linux's /sys/class/net data for network interface status, which puts some limits on what we can readily check.

To start with, you get the network interface's link speed in node_network_speed_bytes. The values I've seen on our hardware are 1250000000 (10G), 125000000 (1G), 12500000 (100M), and -125000 (for an interface that has been configured up but has no carrier). If all of your network ports are at a single speed, say 1G (or 10G), you can just alert on node_network_speed_bytes being anything other than your normal speed. We have a mixture of speeds, so I had to resort to a collection of alerts to cover all of the cases:
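
A sketch of what such a set of alert rules could look like is below; the hostnames, device names, and durations here are made up for illustration rather than being our real configuration:

  groups:
    - name: interface-speeds
      rules:
        # Hosts we expect to have 10G interfaces must report 10G.
        - alert: InterfaceNot10G
          expr: node_network_speed_bytes{host=~"tengig.*"} != 1250000000
          for: 5m
        # Everything else must be at 1G, except one known 100M port.
        - alert: InterfaceNot1G
          expr: |
            node_network_speed_bytes{host!~"tengig.*"} != 125000000
              unless node_network_speed_bytes{host="oldhost", device="eno1"}
          for: 5m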

An Ethernet interface that's been configured up but has no carrier has a node_network_carrier of 0 and also a node_network_speed_bytes that's negative (and it also has a node_network_up of 0). You can use either metric to detect this state and alert on it, which will find both unused network interfaces that your system has decided to try to do DHCP on and network interfaces that are supposed to be live but have no carrier. Unfortunately there's no way to detect the inverse condition of an interface that has carrier but that hasn't been configured up. The Linux kernel doesn't report on the link carrier state for interfaces that aren't UP, and so node_exporter has no metric that can detect this.

(I'd like to detect situations where an unused server port has live networking, either because a cable got plugged in or an existing disused cable became live. In our environment, either is a mistake we want to fix.)
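
For the state that we can detect (an interface that's configured up but has no carrier), a minimal alert expression is simple; either of these works, modulo excluding any interfaces that are deliberately left unconnected:

  # An interface that is administratively up but has no carrier.
  node_network_carrier == 0
  # Equivalently, its reported speed will be negative.
  node_network_speed_bytes < 0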

These days, almost all network links are full duplex. You can detect links that have come up at half duplex by looking for a 'duplex="half"' label in the node_network_info metric. Since not all network interfaces have a duplex, you can't just look for 'duplex!="full"'. Technically 1G Ethernet can be run at half duplex, although there's nothing that should do this. 10G-T Ethernet is apparently full duplex only.

The node_network_up metric looks tempting, but unfortunately it's a combination of dangerous and pointless. node_network_up is 1 if and only if the interface's operstate is 'up', and not all live network interfaces are 'up' when they're working. Prominently, the loopback ('lo') interface's normal operstate is 'unknown', as is that of WireGuard interfaces (and PPP interfaces). In addition, an operstate of 'up' requires there to be carrier on the interface. Nor does node_network_up being 1 mean that everything is fine, since an interface can be up without any IP addresses configured on it.

(But if you want to use node_network_up, you probably want to use 'node_network_up != 1 and (node_network_protocol_type == 1)'. This makes it conditional on the interface being an Ethernet interface, so we know that operstate should be 'up' if it's functional. This is sufficiently complicated that I would rather look for up interfaces without carrier, since that's the only error condition we can actually see for Ethernet interfaces.)

Unfortunately, as far as I know there are no metrics that will tell you if an interface has IPv4 or IPv6 addresses configured on it (whether or not it has carrier and so is up). The 'address' that node_network_info and node_network_address_assign_type talk about is the Ethernet address, not IP addresses (as you can see from the values of the label in node_network_info). My conclusion is that you need to check whatever IP addresses you need to be up through the Blackbox exporter.

Given all of this, under normal circumstances, I think there are three sensible alerts or sets of alerts for network interfaces. One alert or set of alerts is for interface speed, based on node_network_speed_bytes, requiring your interfaces to be at their expected speeds. In many environments, you could then look for node_network_carrier being 0 to detect interfaces that are configured but don't have carrier. Finally, you might as well check for half duplex with 'node_network_info{duplex="half"}'.

(It seems likely that a cable (or a port) that fails enough to force you down to half duplex will trigger other conditions as well, but who knows.)

PrometheusCheckingNetworkInterfaces written at 23:55:34

2021-06-26

Ethernet network cables can go bad over time, with odd symptoms

Last week we got around to updating the kernels on all of our Ubuntu servers, including our Prometheus metrics server, which is directly connected to four networks. When the metrics server rebooted, one of those network interfaces flapped down and up for a bit, then suddenly had a lot of intermittent ping failures to machines on that subnet. At first I thought that this might be a network driver bug in the new kernel, but when I rebooted the server again the network interface came up at 100 Mbit/sec instead of 1 Gbit/sec and suddenly we had no more ping problems. When we replaced the network cable yesterday, that interface returned to the 1G it was supposed to be at and pinging things on that network may now be more reliable than before.

The first thing I take away from this is that network cables don't just fail cleanly, and when they do have problems your systems may or may not notice. Last week, the network port hardware on both our metrics server and the switch it was connected to spent hours thinking that the cable was fine at 1G when it manifestly wasn't.

For various reasons I wound up investigating how long this had been going on, using both old kernel logs on our syslog server and the network interface speed information captured by the Prometheus host agent. This revealed that the problem most likely started sometime between June and August of 2019, when the network link speed dropped to 100 Mbit/sec and stayed there other than briefly after some reboots. Over all that time, we didn't notice that the network interface was running at one step down from its expected rate, partly because we weren't doing anything performance sensitive over it.
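
(The Prometheus side of this is pleasantly simple; graphing something like the following over a multi-year range shows exactly when the reported speed dropped from 125000000 to 12500000. The labels here are made up.)

  node_network_speed_bytes{host="ourmetrics", device="eno3"}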

(We now have alerts for this, just in case it ever happens again.)

The second thing I take away from this is that network cables can fail in place even after they've been plugged in and working for months. This network cable wasn't necessarily completely undisturbed in our machine room, but at most it would have gotten brushed and moved around in the rack cable runs as we added and removed other network cables. But the cable still failed over time, either entirely on its own or with quite mild mechanical stress. It's possible that the cable was always flawed to some degree, but if so the flaws got worse, causing the cable to decay from a reliable 1G link down to 100M.

I don't think there's anything we can really do about this except to keep it in mind as a potential cause of otherwise odd or mysterious problems. We're definitely not going to recable everything with fresh cables just in case, and we're probably not even going to use freshly made or bought cables when we rack new machines.

(Over time we'll turn over our cable stock as we move to 10G, but it's going to be a long time before we have all of the machines there.)

NetworkCablesGoBad written at 00:01:40

2021-06-21

A realization about our VPN and IPv6 traffic

At work, we operate a VPN for our users. The VPN is used to access both internal resources inside our networks and university resources that are normally only available from 'on-campus' IP addresses. Because of the latter, and for historical reasons, our VPN servers are configured to tell VPN clients to route all of their traffic through the VPN, regardless of the destination. In other words, the VPN makes itself the default route for traffic. Today, in the process of investigating an unfortunate Google decision, I realized that there's an important qualification on that statement.

(We actually support two different sorts of VPNs, OpenVPN and L2TP, and have two servers for each type, but all of this is a technical detail. Conceptually, we have 'a VPN service'.)

We and our networks are IPv4 only; we haven't even started to implement IPv6, and it will probably be years before we do. Naturally this means that our VPN is IPv4 only, so its default route only applies to IPv4 traffic, which means that all of the client's IPv6 traffic bypasses our VPN. All of the IPv4 traffic flows through the VPN, but if your client has a working local IPv6 connection, any IPv6 traffic will go through it.

The first consequence of this is for traffic to places outside the university. An increasing number of ISP networks provide IPv6 addresses to people's devices, many of those devices prefer IPv6 where possible, and an increasing number of sites are reachable over IPv6. Connections from people's devices to those sites don't go through our VPN. But if you move the same device over to a network that only provides it an IPv4 address, suddenly you're going through our VPN to reach all of those sites. This makes troubleshooting apparent VPN based connection problems much more exciting than before; we may have to disable IPv6 during our tests, and we may have to find out if a user who's having problems has an IPv6 connection.

The second consequence is that some day some of the university's on-campus websites may start to have IPv6 addresses themselves. Traffic to these websites from IPv6 capable clients that are connected to the VPN will mysteriously (to people) be seen as outside traffic by those on-campus websites, because it's coming directly from the outside client over IPv6 instead of indirectly through our VPN over IPv4. There are also some external websites that have historically given special permissions to the university's IPs. If these websites are IPv6 enabled and your client is IPv6 enabled, they're going to see you as a non-university connection even with the VPN up.

There probably isn't anything we can sensibly do about this. I think it would be a bad idea to try to have our VPN servers grab all client IPv6 traffic and block it, even if that's possible. Among other things, there are probably IPv6 only ISPs out there that this would likely interact very badly with.

(Our VPN isn't officially documented as a privacy aid for general Internet usage, although people may well use it as that. So I don't consider it a security issue that the current configuration leaks people's real IPv6 addresses to sites.)

OurVPNAndIPv6Traffic written at 22:46:17

2021-06-17

In Prometheus queries, on and ignoring don't drop labels from the result

Today I learned that one of the areas of PromQL (the query language for Prometheus) that I'm still a bit weak on is when labels will and won't get dropped from metrics as you manipulate them in a query. So I'll start with the story.

Today I wrote an alert rule to make sure that the network interfaces on our servers hadn't unexpectedly dropped down to 100 Mbit/second (instead of 1Gbit/s or for some servers 10Gbit/s). We have a couple of interfaces on a couple of servers that legitimately are at 100M (or as legitimately as a 100M connection can be in 2021), and I needed to exclude them. The speed of network interfaces is reported by node_exporter in node_network_speed_bytes, so I first wrote an expression using unless and all of the labels involved:

node_network_speed_bytes == 12500000 unless
  ( node_network_speed_bytes{host="host1",device="eno2",...} or
    node_network_speed_bytes{host="host2",device="eno1",...} )

However, most of the standard labels you get on metrics from the host agent (such as job, instance, and so on) are irrelevant and even potentially harmful to include (the full set of labels might have to change someday). The labels I really care about are the host and the device. So I rewrote this as:

node_network_speed_bytes == 12500000 unless on(host,device) [....]

When I wrote this expression I wasn't sure if it was going to drop all other labels beside host and device from the filtered end result of the PromQL expression. It turns out that it didn't; the full set of labels for node_network_speed_bytes is passed through, even though we're only matching on some of them in the unless.

(The host and the device are all that I needed for the alert message so it wouldn't have been fatal if the other labels were dropped. But it's better to retain them just in case.)

Aggregation operators discard labels unless you use without or by, as covered by their documentation (although it's not phrased that way), since aggregating over labels is their purpose. As I've found out, careless use of aggregation operators can lose labels that are valuable for alerts (which may be what left me jumpy about this case). Aggregation over time keeps all labels, though, because it aggregates over time instead of over some or all labels. But as I was reminded today (since I'm sure I've seen it before), vector matching with on and ignoring doesn't drop labels; it merely restricts which labels are used in the matching (and then it's up to you to make sure you still have a one-to-one vector match, or at least a match that you expect; I've made mistakes there).

(You can also explicitly pull in additional labels from other metrics.)
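
A sketch of that technique, here pulling the duplex label from node_network_info onto the speed metric; since node_network_info always has a value of 1, multiplying by it leaves the speed values unchanged:

  node_network_speed_bytes
    * on(instance, device) group_left(duplex)
      node_network_info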

There may be other cases in PromQL where labels are dropped, but if so I can't think of them right now. My overall moral is that I still need to test my assumptions and guesses in order to be sure about this stuff.

Sidebar: Why I used unless (... or ...) in this query

In many cases, the obvious way to exclude some things from an alert rule expression is to use negative label matches. However, these match against the value of a single label; they can't match on the combination of several labels. As far as I know, if you want to exclude only certain label combinations (here 'host1 and eno2' and 'host2 and eno1') where the individual label values also occur separately (so host1 and host2 both have other network interfaces, and other hosts have eno1 and eno2 interfaces), you're stuck with the more awkward construction I used. This construction is unfortunately somewhat brute force.
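
To illustrate the problem, the obvious negative-matcher version of my alert rule throws away too much:

  # This drops every interface on host1 and every eno2 interface anywhere,
  # and it still doesn't exclude host2's eno1.
  node_network_speed_bytes{host!="host1", device!="eno2"} == 12500000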

PrometheusOnIgnoringAndLabels written at 00:40:10

2021-06-16

The challenge of what to set server BIOSes to do on power loss

Modern PC BIOSes, including server BIOSes, almost always have a setting for what the machine should do if the power is lost and then comes back. Generally your three options are 'stay powered off', 'turn on', and 'stay in your last state'. Lately I've been realizing that none of them are ideal in our current 'work from home' environment, and the general problem is probably unsolvable without internal remote power control.

In the normal course of events, what we want while working from home is for servers to stay in their last power state. If the power is lost and then comes back, running servers will power back up but servers that we've shut down to take out of service will stay off. If we set servers to 'always turn on', we would have to remember to take servers out of service by powering down their outlet on our smart PDU, not just telling them to halt and power off at the OS level. And of course if we had them set to 'stay powered off', we would have to go in to manually power them up.

But a power loss is not the only case where we might have to take servers down temporarily. We've had one or two scares with machine room air conditioning, and if we had a serious AC issue we would have to (remotely) turn machines off to reduce the heat load. If we turn machines off remotely from the OS level, the BIOS setting of 'stay in your last state' doesn't give us any straightforward way of turning them back on, even with a smart PDU; if we toggle outlet power at the smart PDU, the server BIOS will say 'well I was powered off before so I will stay powered off'. What we need to recover from this situation is what I called internal remote power control, where we can remotely command the machine to turn on.
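
(The usual concrete form of internal remote power control is a BMC that can be reached over the network even when the server itself is powered off, for example through IPMI. A hypothetical invocation, with a made-up BMC name and credentials, would be something like:)

  ipmitool -I lanplus -H server1-bmc -U admin -P somepassword chassis power on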

Right now, if we had an AC issue we would probably have to remember to turn machines off through our smart PDUs instead of at the OS level. With our normal BIOS settings, this would let us remotely restart them through the smart PDU afterward. Since this is very different from our normal procedure for powering off machines, I can only hope that we'd remember to do it in the pressure of a serious AC issue.

(Smart PDUs have a few issues. First, not all of our machines are on them, because we don't have enough of them or enough outlets. Second, when you power off a machine this way you're trusting your mapping between PDU ports and actual machines. We think our mapping is trustworthy, but we'd rather not find out the hard way.)

BIOSPowerLossChallenge written at 00:04:11

2021-06-11

How we're dealing with our expiring OpenVPN TLS root certificate

Recently I wrote about my failure to arrange a graceful TLS root certificate rollover for our OpenVPN servers. This might leave you wondering what we're doing about this instead, and the answer is that we've opted to use a brute force solution, because we know it works.

Our brute force solution is to set up a new set of OpenVPN servers (we have two of them for redundancy), under a new official name, with a new TLS root certificate that is good for quite a while (I opted to be cautious and not cross into 2038), and with a new host certificate signed by it. With the new servers set up and in production, we've updated our support site to use the new official name and the new TLS root certificate, so people who set up OpenVPN from now onward will be using the new environment.

Since these servers are using a new official name, they and the current (old) OpenVPN servers can operate at the same time. People with the new client configuration go through our new servers; people with the old client configuration and old TLS certificate go through our old servers. There's no flag day where we have to change the TLS root certificate on the old servers, and in fact they won't change; we're going to run them as-is right up until the TLS root certificate expires and no one can connect to them any more.

This leaves us with all of the people who are currently using our old OpenVPN servers with the expiring TLS root certificate. We're just going to have to contact all of them and ask them to update (ie change) their client configuration, changing the OpenVPN server name and getting and installing the new TLS root certificate. This is not quite as bad as it might sound, because we were always going to have to contact the current people to get them to update their TLS root certificate. So they only have to do one extra thing, although that extra thing may be quite a big pain.

(Some environments have nice, simple OpenVPN configuration systems. But on some platforms, the configuration is 'open a text editor and ...', and one of them is probably not one you're thinking of.)
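
In the text file case, what people have to change comes down to two directives in their OpenVPN client configuration, something like the following (the server name and certificate file name are placeholders, not our real ones):

  # Point the client at the new server name and the new root certificate.
  remote openvpn-new.example.org 1194
  ca new-vpn-ca.crt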

Doing the change this way 'costs' us two extra servers for a while, which we have to spare, and more importantly it meant that we needed a new official name for our OpenVPN service. This time around this was acceptable, because our old official name was in retrospect perhaps not the best option. If we have to do this again, we may have a harder time coming up with a good new name, but hopefully next time around we'll be able to roll over the TLS root certificate instead of having to start all over from scratch.

(From my perspective, the most annoying thing about this is that I just rebuilt the OpenVPN servers in January in order to update them to a modern OpenBSD. If I'd known all of this back then, we could have gone straight to our new end state and saved one round of building and installing machines.)

OpenVPNTLSRootExpirySolution written at 23:26:58

2021-06-07

My failure to arrange a graceful TLS root certificate rollover with OpenVPN

Generally, what I write about here is discoveries, questions, and successes. This presents a somewhat misleading picture of what my sysadmin work is like, so today I'm going to talk about a TLS issue that I spent a day or two failing at recently.

(I wouldn't say that failure is a routine event in system administration, but sometimes you can't solve a problem, and it can happen to anyone.)

We have some OpenVPN servers for our users, running on OpenBSD using the OpenBSD packaged version of OpenVPN. When you run OpenVPN, you normally establish a private Certificate Authority, with your own root certificate. This CA is used to authenticate your OpenVPN server to users, who verify that the server presents a host certificate that's ultimately signed by your CA; it can also be used to sign user certificates that in turn authenticate users. Of course, to do all of this your users have to manually tell their OpenVPN client about your root CA. We do this by providing a copy of our local CA root certificate that they need to download and install in their client.

Almost ten years ago, in August of 2011, we set up the first instance of our modern OpenVPN server infrastructure, and generated its root CA certificate. The default expiry time on this CA certificate was ten years, and so it runs out at the end of August of 2021, which is to say in a couple of months. Since we can't assume that all OpenVPN clients will ignore the expiry time of the CA root certificate, we need to do something about this. The simple thing to do is to generate a new CA root certificate (with a long expiry time) and a new host certificate and start using them, but this creates a flag day where all of our OpenVPN users have to download the new CA certificate from us and switch to it; if they don't switch the CA certificate in time, they stop being able to connect to our OpenVPN servers.

We would like to do better, and I wound up with two ideas for how to do it. My first attempt was to create a new cross-signed CA root certificate (and a new host certificate signed by it). One version of the new root certificate was signed by the current root; our OpenVPN servers would provide this in a certificate chain until the old CA root expired. The other version was self-signed, and would be downloaded by people who'd switch to it in advance. The server host certificate would verify through either certificate chain.

Cross signed root certificates are a reasonably common thing in web TLS, and once I fumbled my way through some things the resulting certificate chains passed verification in OpenSSL and another tool I tend to use. But I couldn't get my test OpenVPN client to validate the host certificate using the new CA certificate.

My second attempt was more brute force. I took the keypair from our existing CA root certificate and used it to create a new version of the CA root certificate with the same keypair, the same Subject Name, and a longer expiry. Since this used the same keypair and Subject Name as our existing root certificate, in normal TLS certificate verification it's a perfect substitute for our current expiring CA. My verification tools said it was the same and would verify the current host certificate, and after some work 'openssl asn1parse' said that the two certificates had the same low-level content except the serial number, the validity dates, and the signature. But my test OpenVPN client would not accept the new CA certificate no matter what I did. I even generated and signed a new (test) server host certificate using this new version of the CA certificate and had my test OpenVPN server provide the new CA certificate and the new host certificate while my client was using the new CA certificate. It didn't work.
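
(For the record, the mechanics of this second attempt come down to something like the following openssl sketch; the file names are placeholders and this is the general approach rather than necessarily the exact commands I ran:)

  # Turn the existing CA certificate back into a CSR, reusing its key and
  # Subject Name, then self-sign that CSR again with a longer validity.
  # ca-extensions.cnf must re-add the CA extensions, eg
  # 'basicConstraints = critical, CA:TRUE'.
  openssl x509 -x509toreq -in old-ca.crt -signkey ca.key -out ca.csr
  openssl x509 -req -in ca.csr -signkey ca.key -days 5840 \
      -extfile ca-extensions.cnf -out new-ca.crt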

At this point I'm out of clever ideas to avoid significant pain for our users. Unless something changes in the situation, the best we can do for people is avoid a flag day as much as possible.

(This sort of elaborates on some tweets of mine. My test OpenVPN client was my Fedora 33 laptop; Fedora's OpenVPN client may be a bit atypical, but we have both Fedora and Ubuntu OpenVPN users, so if our work-around doesn't work with them some of our users will have a bad time.)

PS: Official TLS certificates for our OpenVPN servers wasn't really an option back in 2011, and it's probably still not one for various reasons. I made some tests to see if I could make it work in a test setup (hence my use of Let's Encrypt on OpenBSD) but failed there too, although I didn't investigate very deeply.

FailingAtTLSRootRollover written at 22:33:07

2021-06-04

HTTP/3 needs us (and other people) to make firewall changes

The other day, I had a little realization:

Today I realized that the growing enabling of HTTP/3 means that we need to allow UDP 443 through our firewalls (at least outbound), not just TCP 443. Although in the mean time, blocking it shields our users from any HTTP/3 issues. (Which happen.)

Like many places, our network layout has firewalls in it, in fact quite a lot of them. We have a perimeter firewall, of course, then we have firewalls between our internal subnets, our wireless network has a firewall, and our VPN servers have their own set of firewall rules. All of our firewalls have restrictions on outbound traffic, not just inbound traffic.

For obvious reasons, all of our firewalls allow outbound traffic to TCP port 443 (and port 80, and a number of others). However, some of them don't allow outbound traffic to UDP port 443, because until now there's been no protocol that used it. HTTP/3 uses QUIC, which runs over UDP, and so it generates traffic to UDP port 443. Right now any such traffic is probably not getting through.
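
On an OpenBSD pf firewall, for example, the change itself is small; a sketch with made-up interface and macro names (real rulesets are more involved):

  pass out quick on $ext_if proto tcp from $inside_nets to any port 443
  pass out quick on $ext_if proto udp from $inside_nets to any port 443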

Google's Chrome has enabled HTTP/3 (and QUIC) for some time, Firefox enabled HTTP/3 by default in Firefox 88, and Microsoft Edge has also had it for a while (Apple's Safari has yet to enable it by default). All of those browsers will now be sending traffic to UDP port 443 under the right circumstances, or at least trying to; with our firewalls blocking that traffic, they're not getting very far. I don't know how HTTP/3 implementations behave here, but I wouldn't be surprised if this creates at least a little bit of a slowdown.

(Of course, blocking it may shield people from a much bigger slowdown in cases where HTTP/3 appears to work but then has problems.)

We're not the only place that's going to need to update firewalls to allow outbound UDP port 443, of course. But I suspect that Google (the originator of the whole QUIC idea) has studied this and determined that there are fewer firewall blocks in the way than I might expect.

Eventually we may also want to enable inbound UDP to port 443, so that people can run web servers that support HTTP/3. But that will probably take much longer, because server support is apparently rather lacking right now (based on the Wikipedia list). So far most of the web servers we run don't even have HTTP/2 enabled yet, for various reasons.

HTTP3AndOurFirewalls written at 23:54:52

