2024-09-08
I should probably reboot BMCs any time they behave oddly
Today on the Fediverse I said:
It has been '0' days since I had to reset a BMC/IPMI for reasons (in this case, apparently something power related happened that glitched the BMC sufficiently badly that it wasn't willing to turn on the system power). Next time a BMC is behaving oddly I should just immediately tell it to cold reset/reboot and see, rather than fiddling around.
(Assuming the system is already down. If not, there are potential dangers in a BMC reset.)
I've needed to reset a BMC before, but this time was more odd and less clear than the KVM over IP that wouldn't accept the '2' character.
We apparently had some sort of power event this morning, with a number of machines abruptly going down (distributed across several different PDUs). Most of the machines rebooted fine, either immediately or after some delay. A couple of the machines did not, and conveniently we had set up their BMCs on the network (although they didn't have KVM over IP). So I remotely logged in to their BMC's web interface, saw that the BMC was reporting that the power was off, and told the BMC to power on.
Nothing happened. Oh, the BMC's web interface accepted my command, but the power status stayed off and the machines didn't come back. Since I had a bike ride to go to, I stopped there. After I came back from the bike ride I tried some more things (still remotely). One machine I could remotely power cycle through its managed PDU, which brought it back. But the other machine was on an unmanaged PDU with no remote control capability. I wound up trying IPMI over the network (with ipmitool), which had no better luck getting the machine to power on, and then I finally decided to try resetting the BMC. That worked, in that all of a sudden the machine powered on the way it was supposed to (we set the 'what to do after power comes back' on our machines to 'last power state', which would have been 'powered on').
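For the record, a sketch of the sort of thing I mean in ipmitool form (the BMC host, user, and password here are placeholders, and your BMC's interface options may differ):

ipmitool -I lanplus -H <bmc host> -U <user> -P <password> chassis power status
ipmitool -I lanplus -H <bmc host> -U <user> -P <password> chassis power on
# and when that gets nowhere, cold reset the BMC itself and try again:
ipmitool -I lanplus -H <bmc host> -U <user> -P <password> mc reset cold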
As they say, I have questions. What I don't have is any answers. I believe that the BMC's power control talks to the server's motherboard, instead of to the power supply units, and I suspect that it works in a way similar to desktop ATX chassis power switches. So maybe the BMC software had a bug, or some part of the communication between the BMC and the main motherboard circuitry got stuck or desynchronized, or both. Resetting the BMC would reset its software, and it could also force a hardware reset to bring the communication back to a good state. Or something else could be going on.
(Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.)
2024-09-04
Using rsync to create a limited ability to write remote files
Suppose that you have an isolated high security machine and you want to back up some of its data on another machine, which is also sensitive in its own way and which doesn't really want to have to trust the high security machine very much. Given the source machine's high security, you need to push the data to the backup host instead of pulling it. Because of the limited trust relationship, you don't want to give the source host very much power on the backup host, just in case. And you'd like to do this with standard tools that you understand.
I will cut to the chase: as far as I can tell, the easiest way to do this is to use rsync's daemon mode on the backup host combined with SSH (to authenticate either end and encrypt the traffic in transit). It appears that another option is rrsync, but I just discovered that and we have prior experience with rsync's daemon mode for read-only replication.
Rsync's daemon mode is controlled by a configuration file that can restrict what it allows the client (your isolated high security source host) to do, particularly where the client can write, and can even chroot if you run things as root. So the first ingredient we need is a suitable rsyncd.conf, which will have at least one 'module' that defines parameters:
[backup-host1]
    comment = Backup module for host1
    # This will normally have restricted
    # directory permissions, such as 0700.
    path = /backups/host1
    hosts allow = <host1 IP>
    # Let's assume we're started out as root
    use chroot = yes
    uid = <something>
    gid = <something>
The rsyncd.conf 'hosts allow' module parameter works even over SSH; rsync will correctly pull out the client IP from the environment variables the SSH daemon sets.
The next ingredient is a shell script that forces the use of this rsyncd.conf:
#!/bin/sh
exec /usr/bin/rsync --server --daemon --config=/backups/host1-rsyncd.conf .
As with the read-only replication, this script completely ignores command line arguments that the client may try to use. Very cautious people could inspect the client's command line to look for unexpected things, but we don't bother.
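If you did want to be cautious, one minimal sketch is to check $SSH_ORIGINAL_COMMAND (which OpenSSH's sshd sets to the client's requested command when a forced command is in effect) and refuse anything that doesn't look like an rsync server invocation. This is an illustration, not what we actually run:

#!/bin/sh
# Hypothetical cautious variant of the wrapper script.
case "$SSH_ORIGINAL_COMMAND" in
"rsync --server"*) ;;
*) echo "rejected command: $SSH_ORIGINAL_COMMAND" >&2; exit 1 ;;
esac
exec /usr/bin/rsync --server --daemon --config=/backups/host1-rsyncd.conf .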
Finally you need a SSH keypair and a .ssh/authorized_keys entry on the backup machine for that keypair that forces using your script:
from="<host1 IP>",command="/backups/host1-script",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty [...]
(Since we're already restricting the rsync module by IP, we definitely want to restrict the key usage as well.)
On the high security host, you transfer files to the backup host with:
rsync -a --rsh="/usr/bin/ssh -i /client/identity" yourfile LOGIN@SERVER::backup-host1/
Depending on what you're backing up and how you want to do things, you might want to set the rsyncd.conf module parameters 'write only = true' and perhaps 'refuse options = delete', if you're sure you don't want the high security machine to be able to retrieve its files once it has put them there. On the other hand, if the high security machine is supposed to be able to routinely retrieve its backups (perhaps to check that they're good), you don't want this.
(If the high security machine is only supposed to read back files very rarely, you can set 'write only = true' until it needs to retrieve a file.)
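In rsyncd.conf terms that's just a couple of extra lines in the module (the option names are straight from the rsyncd.conf manual page):

# added to the [backup-host1] module from above
write only = true
refuse options = delete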
There are various alternative approaches, but this one is relatively easy to set up, especially if you already have a related rsync daemon setup for read-only replication.
(On the one hand it feels annoying that there isn't a better way to do this sort of thing by now. On the other hand, the problems involved are not trivial. You need encryption, authentication of both ends, a confined transfer protocol, and so on. Here, SSH provides the encryption and authentication and rsync provides the confined transfer protocol, at the cost of having to give access to a Unix account and trust rsync's daemon mode code.)
2024-08-27
Some reasons why we mostly collect IPMI sensor data locally
Most servers these days support IPMI and can report various sensor readings through it, which you often want to use. In general, you can collect IPMI sensor readings either on the host itself through the host OS or over the network using standard IPMI networking protocols (there are several generations of them). We have almost always collected this information locally (and then fed it into our Prometheus based monitoring system), for an assortment of reasons, some of them general and some of them specific to us.
When we collect IPMI sensor data locally, we export it through the standard Prometheus host agent, which has a feature where you can give it text files of additional metrics (cf). Although there is a 'standard' third party network IPMI metrics exporter, we ended up rolling our own for various reasons (through a Prometheus exporter that can run scripts for us). So we could collect IPMI sensor data either way, but we almost entirely collect the data locally.
(These days it is a standard part of our general Ubuntu customizations to set up sensor data collection from the IPMI if the machine has one.)
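To give a hedged illustration of what local collection can look like (this is a simplified sketch, not our real script, and the output path and metric name are made up for the example):

#!/bin/sh
# Sketch: turn local 'ipmitool sensor' temperature readings into
# Prometheus textfile metrics for the host agent's textfile collector.
OUT=/var/lib/node_exporter/textfile/ipmi.prom

ipmitool sensor 2>/dev/null | awk -F'|' '
$3 ~ /degrees C/ && $2 !~ /na/ {
        # trim whitespace and make the sensor name label-friendly
        gsub(/^ +| +$/, "", $1); gsub(/^ +| +$/, "", $2)
        gsub(/ /, "_", $1)
        printf "ipmi_temperature_celsius{sensor=\"%s\"} %s\n", $1, $2
}' >"$OUT.tmp" && mv "$OUT.tmp" "$OUT"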
The generic reasons for not collecting IPMI sensor data over the network are that your server BMCs might not be on the network at all (perhaps they don't have a dedicated BMC network interface), or you've sensibly put them on a secured network that your monitoring system doesn't have access to. We have two additional reasons for preferring local IPMI sensor data collection.
First, even when our servers have dedicated management network ports, we don't always bother to wire them up; it's often just extra work for relatively little return (and it exposes the BMC to the network, which is not always a good thing). Second, when we collect IPMI sensor data through the host, we automatically start and stop collecting sensor data for the host when we start or stop monitoring the host in general (and we know for sure that the IPMI sensor data really matches that host). We almost never care about IPMI data when either the host isn't otherwise being monitored or the host is off.
Our system for collecting IPMI sensor data over the network actually dates from when this wasn't true, because we once had some (donated) blade servers that periodically mysteriously locked up under some conditions that seemed related to load (so much so that we built a system to automatically power cycle them via IPMI when they got hung). One of the things we were very interested in was if these blade servers were hitting temperature or fan limits when they hung. Since the machines had hung we couldn't collect IPMI information through their host agent; getting it from the IPMI over the network was our only option.
(This history has created a peculiarity, which is that our script for collecting network IPMI sensor data used what was at the time the existing IPMI user that was already set up to remotely power cycle the C6220 blades. So now anything we want to remotely collect IPMI sensor data from has a weird 'reboot' user, which these days doesn't necessarily have enough IPMI privileges to actually reset the machine.)
PS: We currently haven't built a local IPMI sensor data collection system for our OpenBSD machines, although OpenBSD can certainly talk to a local IPMI, so we collect data from a few of those machines over the network.
2024-08-24
JSON is usually the least bad option for machine-readable output formats
Over on the Fediverse, I said something:
In re JSON causing problems, I would rather deal with JSON than yet another bespoke 'simpler' format. I have plenty of tools that can deal with JSON in generally straightforward ways and approximately none that work on your specific new simpler format. Awk may let me build a tool, depending on what your format is, and Python definitely will, but I don't want to.
This is re: <Royce Williams Fediverse post>
This is my view as a system administrator, because as a system administrator I deal with a lot of tools that could each have their own distinct output format, each of which I have to parse separately (for example, smartctl's bespoke output, although that output format sort of gets a pass because it was intended for people, not further processing).
JSON is not my ideal output format. But it has the same virtue as gofmt does; as Rob Pike has said, "gofmt's style is no one's favorite, yet gofmt is everyone's favorite" (source, also), because gofmt is universal and settles the arguments. Everything has to have some output format, so having a single one that is broadly used and supported is better than having N of them. And jq shows the benefit of this universality, because if something outputs JSON, jq can do useful things with it.
(In turn, the existence of jq makes JSON much more attractive to system administrators than it otherwise would be. If I had no ready way to process JSON output, I'd be much less happy about it and it would stop being the easy output format to deal with.)
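To illustrate with an entirely hypothetical 'disks' program and an equally hypothetical JSON structure, pulling out just the fields you care about becomes a one-liner instead of another parsing script:

disks --json | jq -r '.disks[] | "\(.device) \(.temp_C)"'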
I don't have any particular objection to programs that want to output in their own format (perhaps a simpler one). But I want them to give me an option for JSON too, and most of the time I'm going to go with JSON. I've already written enough ad-hoc text processing things in awk, and a few too many heavy duty text parsing things in Python. I don't really want to write another one just for you. If your program does use only a custom output format, I want there to be a really good reason why you did it, not just that you don't like the aesthetics of JSON. As Rob Pike says, no one likes gofmt's style, but we all like that everyone uses it.
(It's my view that JSON's increased verbosity over alternates isn't a compelling reason unless there's either a really large amount of data or you have to fit into very constrained space, bandwidth, or other things. In most environments, disk space and bandwidth are much cheaper than people's time and the liability of yet another custom tool that has to be maintained.)
PS: All of this is for output formats that are intended to be further processed. JSON is a terrible format for people to read directly, so terrible that my usual reaction to having to view raw JSON is to feed it through 'jq . | less'. But your tool should almost always also have an option for some machine readable format (trust me, someday system administrators will want to process the information your tool generates).
2024-08-20
Some brief notes on 'numfmt' from GNU Coreutils
Many years ago I learned about numfmt (also) from GNU Coreutils (see the comments on this entry and then this entry). An additional source of information is Pádraig Brady's numfmt - A number reformatting utility. Today I was faced with a situation where I wanted to compute and print multi-day, cumulative Amanda dump total sizes for filesystems in a readable way, and the range went from under a GByte to several TBytes, so I didn't want to just convert everything to TBytes (or GBytes) and be done with it. I was doing the summing up in awk and briefly considered doing this 'humanization' in awk (again, I've done it before) before I remembered numfmt and decided to give it a try.
The basic pattern for using numfmt here was:
cat <amanda logs> | awk '...' | sort -nr | numfmt --to iec
This printed out '<size> <what ...>', and then numfmt turned the first field into humanized IEC values. As I did here, it's better to sort before numfmt, using the full precision raw number, rather than after numfmt (with 'sort -h'), with its rounded (printed) values.
Although Amanda records dump sizes in KBytes, I had my awk print them out in bytes. It turns out that I could have kept them in KBytes and had numfmt do the conversion, with 'numfmt --from-unit 1024 --to iec'.
(As far as I can tell, the difference between --from-unit and --to-unit is that the former multiplies the number and the latter divides it, which is probably not going to be useful with IEC units. However, I can see it being useful if you wanted to mass-convert times in sub-second units to seconds, or convert seconds to a larger unit such as hours. Unfortunately numfmt currently has no unit options for time, so you can only do pure numeric shifts.)
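For example, if I have the arithmetic right:

numfmt --from-unit 1024 --to iec 20971520      # prints '20G'
numfmt --to-unit 3600 --format '%.1f' 86400    # prints '24.0'

(The first treats its input as KBytes; the second converts seconds to hours.)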
If left to do its own formatting, numfmt has two issues (at least when doing conversions to IEC units). First, it will print some values with one decimal place and others with no decimal place. This will generally give you a result that can be hard to skim because not everything lines up, like this:
3.3T [...]
581G [...]
532G [...]
[...]
11G [...]
9.8G [...]
[...]
1.1G [...]
540M [...]
I prefer all of the numbers to line up, which means explicitly specifying the number of decimal places that everything gets. I tend to use one decimal place for everything, but none ('.0') is a perfectly okay choice. This is done with the --format argument:
... | numfmt --format '%.1f' --to iec
The second issue is that in the process of reformatting your numbers, numfmt will by and large remove any nice initial formatting you may have tried to do in your awk. Depending on how much (re)formatting you want to do, you may want another 'awk' step after the numfmt to pretty-print everything, or you can perhaps get away with --format:
... | numfmt --format '%10.1f ' --to iec
Here I'm specifying a field width for enough white space and also putting some spaces after the number.
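One hypothetical version of that extra awk step, which right-aligns the humanized size and passes the rest of the line through (collapsing any other spacing):

... | numfmt --format '%.1f' --to iec | awk '{s = $1; $1 = ""; printf "%8s %s\n", s, substr($0, 2)}'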
Even with the need to fiddle around with formatting afterward, using numfmt was very much the easiest and fastest way to humanize numbers in this script. Now that I've gone through this initial experience with numfmt, I'll probably use it more in the future.
2024-08-15
Workarounds are often forever (unless you work to make them otherwise)
Back in 2018, ZFS on Linux had a bug that could panic the system if you NFS-exported ZFS snapshots. We were setting up ZFS based NFS fileservers and we knew about this bug, so at the time we set things so that only filesystems themselves were NFS exported and available on our servers. Any ZFS snapshots on filesystems were only visible if you directly logged in to the fileservers, which was (and is) something that only core system staff could do. This is somewhat inconvenient; we have to get involved any time people want to get stuff back from snapshots.
It is now 2024. ZFS on Linux became OpenZFS (in 2020) and has long since fixed that issue and released versions with the fix. If I'm retracing Git logs correctly, the fix was in 0.8.0, so it was included (among many others) in Ubuntu 22.04's ZFS 2.1.5 (what our fileservers are currently running) and Ubuntu 24.04's ZFS 2.2.2 (what our new fileservers will run).
When we upgraded the fileservers from 18.04 to 22.04, did we go back to change our special system for generating NFS export entries to allow NFS clients to access ZFS snapshots? You already know the answer to that. We did not, because we had completely forgotten about it. Nor did we go back to do it as we were preparing the 24.04 setup of our ZFS fileservers. It was only today that it came up, as we were dealing with restoring a file from those ZFS snapshots. Since it's come up, we're probably going to test the change and then do it for our future 24.04 fileservers, since it will make things a bit more convenient for some people.
(The good news is that I left comments to myself in one program about why we weren't using the relevant NFS export option, so I could tell for sure that it was this long since fixed bug that had caused us to leave it out.)
It's a trite observation that there's nothing so permanent as a temporary solution, but just because it's trite doesn't mean that it's wrong. A temporary workaround that code comments say we thought we might revert later in the life of our 18.04 fileservers has lasted about six years, despite being unnecessary since no later than when our fileservers moved to Ubuntu 22.04 (admittedly, this wasn't all that long ago).
One moral I take from this is that if I want us to ever remove a 'temporary' workaround, I need to somehow explicitly schedule us reconsidering the workaround. If we don't explicitly schedule things, we probably won't remember (unless it's something sufficiently painful that it keeps poking us until we can get rid of it). The purpose of the schedule isn't necessarily to make us do the thing, it's to remind us that the thing exists and maybe it shouldn't.
(As a corollary, the schedule entry should include pointers to a lot of detail, because when it goes off in a year or two we won't really remember what it's talking about. That's why we have to schedule a reminder.)
2024-08-14
Traceroute, firewalls, and the modern Internet: a horrible realization
The venerable traceroute command sort of reports the hops your packets take to reach a host, and in the process can reveal where your packets are getting dropped or diverted. The traditional default way that traceroute works is by sending UDP packets to a series of high UDP ports with increasing IP TTLs, and seeing where each reply comes from. If the TTL runs out on the way, traceroute gets one reply; if the packet reaches the host, traceroute gets another one (assuming that nothing is listening on the particular UDP port on the host, which usually it isn't). Most versions of traceroute can also use ICMP based probes, while some of them can also use TCP based ones.
While writing my entry on using traceroute with a fixed target port, I had a horrible realization: traceroute's UDP probes mostly won't make it through firewalls. Traceroute's UDP probes are made to a series of high UDP ports (often starting at port 33434 and counting up). Most firewalls are set to block unsolicited incoming UDP traffic by default; you normally specifically configure them to pass only some UDP traffic through to limited ports (such as port 53 for DNS queries to your DNS servers). When traceroute's UDP packets, sent to effectively random high ports, arrive at such a firewall, the firewall will discard or reject them and your traceroute will go no further.
(If you're extremely confident no one will ever run something that listens on the UDP port range, you can make your firewall friendly to traceroute by allowing through UDP ports 33434 to 33498 or so. But I wouldn't want to take that risk.)
The best way around this is probably to use ICMP for traceroute (using a fixed UDP port is more variable and not always possible). Most Unix traceroute implementations support '-I' to do this.
This matters in two situations. First, if you're asking outside people to run traceroutes to your machines and send you the results, and you have a firewall; without having them use ICMP, their traceroutes will all look like they fail to reach your machines (although you may be able to tell whether or not their packets reach your firewall). Second, if you're running traceroute against some outside machine that is (probably) behind a firewall, especially if the firewall isn't directly in front of it. In that case, your traceroute will always stop at or just before the firewall.
A note to myself about using traceroute to check for port reachability
Once upon a time, the Internet was a simple place; if you could ping some remote IP, you could probably reach it with anything. The Internet is no longer such a simple place, or rather I should say that various people's networks no longer are. These days there are a profusion of firewalls, IDS/IDR/IPS systems, and so on out there in the world, and some of them may decide to block access only to specific ports (and only some of the time). In this much more complicated world, you can want to check not just whether a machine is responding to pings, but if a machine responds to a specific port and if it doesn't, where your traffic stops.
The general question of 'where does your traffic stop' is mostly answered by the venerable traceroute. If you think there's some sort of general block, you traceroute to the target and then blame whatever is just beyond the last reported hop (assuming that you can traceroute to another IP at the same destination to determine this). I knew that traceroute normally works by sending UDP packets to 'random' ports (with manipulated (IP) TTLs, and the ports are not actually picked randomly) and then looking at what comes back, and I superstitiously remembered that you could fix the target port with the '-p' argument. This is, it turns out, not actually correct (and these days that matters).
There are several common versions of (Unix) traceroute out there; Linux, FreeBSD, and OpenBSD all use somewhat different versions. In all of them, what '-p port' actually does by itself is set the starting port, which is then incremented by one for each additional hop. So if you do 'traceroute -p 53 target', only the first hop will be probed with a UDP packet to port 53.
In Linux traceroute, you get a fixed UDP port by using the additional argument '-U'; -U by itself defaults to using port 53. Linux traceroute can also do TCP traceroutes with -T, and when you do TCP traceroutes the port is always fixed.
In OpenBSD traceroute, as far as I can see you just can't get a fixed UDP port. OpenBSD traceroute also doesn't do TCP traceroutes. On today's Internet, this is actually a potentially significant limitation, so I suspect that you most often want to try ICMP probes ('traceroute -I').
In FreeBSD traceroute, you get a fixed UDP port by turning on 'firewall evasion mode' with the '-e' argument. FreeBSD traceroute sort of supports a TCP traceroute with '-P tcp', but as the manual page says you need to see the BUGS section; it's going to be most useful if you believe your packets are getting filtered well before their destination. Using the TCP mode doesn't automatically turn on fixed port numbers, so in practice you probably want to use, for example, 'traceroute -P tcp -e -p 22 <host>' (with the port number depending on what you care about).
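Since I'll want this later, a cheat sheet of the invocations discussed above (the ports are just examples):

# Linux: fixed UDP port, or TCP probes (the port is always fixed with -T)
traceroute -U -p 53 <host>
traceroute -T -p 443 <host>
# FreeBSD: fixed UDP port via 'firewall evasion mode', or TCP plus a fixed port
traceroute -e -p 53 <host>
traceroute -P tcp -e -p 22 <host>
# Everywhere, including OpenBSD: ICMP probes
traceroute -I <host>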
Having written all of this down, hopefully I will remember it for the next time it comes up (or I can look it up here, to save me reading through manual pages).
2024-08-13
Some thoughts on OpenSSH 9.8's PerSourcePenalties feature
One of the features added in OpenSSH 9.8 is a new SSH server security feature to slow down certain sorts of attacks. To quote the release notes:
[T]he server will now block client addresses that repeatedly fail authentication, repeatedly connect without ever completing authentication or that crash the server. [...]
This is the PerSourcePenalties configuration setting and its defaults, and also see PerSourcePenaltyExemptList and PerSourceNetBlockSize.
OpenSSH 9.8 isn't yet in anything we can use at work, but it will be in the next OpenBSD release (and then I'll get it on Fedora).
On the one hand, this new option is exciting to me because for the first time it lets us block only rapidly repeating SSH sources that fail to authenticate, as opposed to rapidly repeating SSH sources that are successfully logging in to do a whole succession of tiny little commands. Right now our perimeter firewall is blind to whether a brief SSH connection was successful or not, so all it can do is block on total volume, and this means we need to be conservative in its settings. This is a single machine block (instead of the global block our perimeter firewall can do), but a lot of SSH attackers do seem to target single machines with their attacks (for a single external source IP, at least).
(It's also going to be a standard OpenSSH feature that won't require any configuration, firewall or otherwise, and will slow down rapid attackers.)
On the other hand, this is potentially an issue for anything that makes health checks like 'is this machine responding with a SSH banner' (used in our Prometheus setup) or 'does this machine have the SSH host key we expect' (used in our NFS mount authentication system). Both of these cases will stop before authentication and so fall into the 'noauth' category of PerSourcePenalties. The good news is that the default refusal duration for this penalty is only one second, which is usually not very long and not something you're likely to run into with health checks. The exception is if you're trying to verify multiple types of SSH host keys for a server; because you can only verify one host key in a given connection, if you need to verify both an RSA host key and an Ed25519 host key, you need two connections.
(Even then, the OpenSSH 9.8 default is that you only get blocked once you've built up 15 seconds of penalties. At the default settings, this would be hard to reach even with repeated host key checks, unless the server has multiple IPs and you're checking all of them.)
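When we do get OpenSSH 9.8, I expect any sshd_config adjustment we make to look something like the following sketch. The option names come from the sshd_config manual page, but the specific penalty values and the exempted monitoring addresses here are made up for illustration:

# Keep roughly the default penalties, but exempt our monitoring hosts
# so health checks can never accumulate 'noauth' penalties.
PerSourcePenalties authfail:5s noauth:1s max:10m
PerSourcePenaltyExemptList 192.0.2.10,192.0.2.32/28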
It's going to be interesting to read practical experience reports with this feature as OpenSSH 9.8 rolls out to more and more people. And on that note I'm certainly going to wait for people's reports before doing things like increasing the 'authfail' penalty duration, as tempting as it is right now (especially since it's not clear from the current documentation how unenforced penalty times accumulate).
2024-08-12
Uncertainties and issues in using IPMI temperature data
In a comment on my entry about a machine room temperature distribution surprise, tbuskey suggested (in part) using the temperature sensors that many server BMCs support and make visible through IPMI. As it happens, I have flirted with this and have some pessimistic views on it in practice in a lot of circumstances (although I'm less pessimistic now that I've looked at our actual data).
The big issue we've run into is limitations in what temperature sensors are available with any particular IPMI, which varies both between vendors and between server models even for the same vendor. Some of these sensors are clearly internal to the system and some are often vaguely described (at least in IPMI sensor names), and it's hit or miss if you have a sensor that either explicitly labels itself as an 'ambient' temperature or that is probably this because it's called an 'inlet' temperature. My view is that only sensors that report on ambient air temperature (at the intake point, where it is theoretically cool) are really useful, even for relative readings. Internal temperatures may not rise very much even if the ambient temperature does, because the system may respond with measures like ramping up fan speed; obviously this has limits, but you'd generally like to be alerted before things have gotten that bad.
(Out of the 85 of our servers that are currently reporting any IPMI temperatures at all, only 53 report an inlet temperature and only nine report an 'ambient' temperature. One server reports four inlet temperatures: 'ambient', two power supplies, and a 'board inlet' temperature. Currently its inlet ambient is 22C, the board inlet is 32C, and the power supplies are 31C and 36C.)
The next issue I'm seeing in our data is that either we have temperature differences of multiple degrees C between machines higher and lower in racks, or the inlet temperature sensors aren't necessarily all that accurate (even within the same model of server, which will all have the 'inlet' temperature sensor in the same place). I'd be a bit surprised if our machine room ambient air did have this sort of temperature gradient, but I've been surprised before. But that probably means that you have to care about where in the rack your indicator machines are, not just where in the room.
(And where in the room probably matters too, as discussed. I see about a 5C swing in inlet temperatures between the highest and lowest machines in our main machine room.)
We push all of the IPMI readings we can get (temperature and otherwise) into our Prometheus environment and we use some of the IPMI inlet temperature readings to drive alerts. But we consider them only a backup to our normal machine room temperature monitoring, which is done by dedicated units that we trust; if we can't get readings from the main unit for some reason, we'll at least get alerts if something also goes wrong with the air conditioning. I wouldn't want to use IPMI readings as our primary temperature monitoring unless I had no other choice.
(The other aspect of using IPMI temperature measurements is that either the server has to be up or you have to be able to talk to its BMC over the network, depending on how you're collecting the readings. We generally collect IPMI readings through the host agent, using an appropriate ipmitool sub-command. Doing this through the host agent has the advantage that the BMC doesn't even have to be connected to the network, and usually we don't care about BMC sensor readings for machines that are not in service.)
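For illustration, the two ways of asking look roughly like this (the BMC host and login are placeholders):

# locally, through the host OS (which needs the kernel IPMI drivers)
ipmitool sdr type Temperature
# over the network, talking directly to the BMC
ipmitool -I lanplus -H <bmc host> -U <user> -P <password> sdr type Temperature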