Wandering Thoughts


A pattern for dealing with missing metrics in Prometheus in simple cases

Previously, I mentioned that Prometheus expressions are filters, which is part of Prometheus having a generally set-oriented view of the world. One of the consequences of this view is that you can quite often have expressions that give you a null result when you really want the result to be 0.

For example, let's suppose that you want a Grafana dashboard that includes a box that tells you how many Prometheus alerts are currently firing. When alerts are firing, Prometheus exposes an ALERTS metric for each active alert, so on the surface you would count these up with:

count( ALERTS{alertstate="firing"} )

Then one day you don't have any firing alerts and your dashboard's box says 'N/A' or 'null' instead of the '0' that you want. This happens because 'ALERTS{alertstate="firing"}' matches nothing, so the result is a null set, and count() of a null set is a null result (or, technically, a null set).

The official recommended practice is not to have metrics or metric label values that come and go; all of your metrics and label sets should be as constant as possible. As you can tell from the official Prometheus ALERTS metric, not even Prometheus itself fully follows this, so we need a way to deal with it.

My preferred way of dealing with this is to use 'or vector(0)' to make sure that I'm never dealing with a null set. The easiest thing to use this with is sum():

sum( ALERTS{alertstate="firing"} or vector(0) )

Using sum() has the useful property that the extra vector(0) element has no effect on the result. You can often use sum() instead of count() because many sporadic metrics have the value of '1' when they're present; it's the accepted way of creating what is essentially a boolean 'I am here' metric such as ALERTS.

If you're filtering for a specific value or value range, you can still use sum() instead of count() by using bool on the comparison:

sum( node_load1 > bool 10 or vector(0) )

If you're counting a value within a range, be careful where you put the bool; it needs to go on the last comparison. Eg:

sum( node_load1 > 5 < bool 10 or vector(0) )

If you have to use count() for more complicated reasons, the obvious approach is to subtract 1 from the result.
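For instance, a count()-based version of the original alert count would look something like this (when there are no firing alerts, count() sees only the lone vector(0) element and returns 1, so subtracting 1 gives the 0 you want):

```promql
count( ALERTS{alertstate="firing"} or vector(0) ) - 1
```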

Unfortunately this approach starts breaking down rapidly when you want to do something more complicated. It's possible to compute a bare average over time using a subquery:

avg_over_time( (sum( ALERTS{alertstate="firing"} or vector(0) ))[6h:] )

(Averages over time of metrics that are 0 or 1, like up, are the classical way of figuring out things like 'what percentage of the time is my service down'.)
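As a sketch of that classical pattern (the job label here is just an assumed example of how your targets might be named), the fraction of the last day a target was up would be roughly:

```promql
avg_over_time( up{job="node"}[1d] )
```

One minus this is roughly the fraction of time the target was down, ignoring scrape gaps.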

However I don't know how to do this if you want something like an average over time by alert name or by hostname. In both cases, even alerts that were present some of the time were not present all of the time, and they can't be filled in with 'vector(0)' because the labels don't match (and can't be made to match). Nor do I know of a good way to get the divisor for a manual averaging. Perhaps you would want to do an unnecessary subquery so you can exactly control the step and thus the divisor. This would be something like:

sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[6h:1m] ) / (6*60)

Experimentation suggests that this provides plausible results, at least. Hopefully it's not too inefficient. In Grafana, you need to write the subquery as '[$__range:1m]' but the division as '($__range_s / 60)', because the Grafana template variable $__range includes the time units.
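Put together, the Grafana version of this query would look something like the following ($__range and $__range_s are Grafana's built-in template variables):

```promql
sum_over_time( (sum( ALERTS{alertstate="firing"} ) by (alertname))[$__range:1m] ) / ($__range_s / 60)
```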

(See also Existential issues with metrics.)

PrometheusMissingMetricsPattern written at 00:39:58


Remembering that Prometheus expressions act as filters

In conventional languages, comparisons like '>' and other boolean operations like 'and' give you implicit or explicit boolean results. Sometimes this is a pseudo-boolean result; in Python if you say 'A and B', you famously get either the value of A (if it's false-y) or the value of B as the end result, instead of a plain True or False. However, PromQL doesn't work this way. As I keep having to remember over and over, in Prometheus, comparisons and other boolean operators are filters.

In PromQL, when you write 'some_metric > 10', what happens is that first Prometheus generates a full instant vector for some_metric, with all of the metric points and their labels and their values, and then it filters out any metric point in the instant vector where the value isn't larger than 10. What you have left is a smaller instant vector, but all of the values of the metric points in it are their original ones.

The same thing happens with 'and'. When you write 'some_metric and other_metric', the other_metric is used only as a filter; metric points from some_metric are only included in the result set if there is the same set of labels in the other_metric instant vector. This means that the values of other_metric are irrelevant and do not propagate into the result.

The large scale effect of this is that the values that tend to propagate through your rule expression are whatever started out as the first metric you looked at (or whatever arithmetic you perform on them). Sometimes, especially in alert rules, this can bias you toward putting one condition in front of the other. For instance, suppose that you want to trigger an alert when the one-minute load average is above 20 and the five-minute load average is above 5, and you write the alert rule as:

expr: (node_load5 > 5) and (node_load1 > 20)

The value available in the alert rule and your alert messages is the value of node_load5, not node_load1, because node_load5 is what you started out the rule with. If you find the value of node_load1 more useful in your alert messages, you'll want to flip the order of these two clauses around.
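So if node_load1's value is the one you want in your alert messages, the flipped version of the rule above would be:

```yaml
expr: (node_load1 > 20) and (node_load5 > 5)
```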

As the PromQL documentation covers, you can turn comparison operations from filters into pseudo-booleans by using 'bool', as in 'some_metric > bool 10'. As far as I know, there is no way to do this with 'and', which always functions as a filter, although you can at least select what labels have to match (or what labels to ignore).

PS: For some reason I keep forgetting that 'and', 'or', and 'unless' can use 'on' and 'ignoring' to select what labels you care about. What you can't do with them, though, is propagate some labels from the right side into the result; if you need that, you have to use 'group_left' or 'group_right' and figure out how to re-frame your operation so that it involves a comparison, since 'and' and company don't work with grouping.

(I was going to confidently write an entry echoing something that I said on the Prometheus users mailing list recently, but when I checked the documentation and performed some tests, it turned out I was wrong about an important aspect of it. So this entry is rather smaller in scope, and is written mostly to get this straight in my head since I keep forgetting the details of it.)

PrometheusExpressionsFilter written at 23:59:31


Why selecting times is still useful even for dashboards that are about right now

In the aftermath of our power outage, one of the things that I did was put together a Grafana dashboard that was specifically focused on dealing with large scale issues, meaning a lot of machines being down or having problems. In this sort of situation, we don't need to see elaborate status displays and state information; basically we want a list of down machines and a list of other alerts, and very little else to get in the way.

(We have an existing overview dashboard, but it's designed with the tacit assumption that only a few or no machines are down and we want to see a lot of other state information. This is true in our normal situation, but not if we're going through a power shutdown or other large scale event.)

This dashboard will likely only ever be used in production displaying the current time, because 'what is (still) wrong right now' is its entire purpose. Yet when I built it, I found that I not only wanted to leave in the normal Grafana time setting options but specifically build in a panel that would let me easily narrow in on a specific (end) time. This is because setting the time to a specific point is extremely useful for development, testing, and demos of your dashboard. In my case, I could set my in-development dashboard back to a point during our large scale power outage issues and ask myself whether what I was seeing was useful and complete, or whether it was annoying and missing things we'd want to know.

(And also test that the queries and Grafana panel configurations and so on were producing the results that I expected and needed.)

This is obviously especially useful for dashboards that are only interesting in exceptional conditions, conditions that you hopefully don't see all the time and can't find on demand. We don't have large scale issues all that often, so if I want to see and test my dashboard during one before the next issue happens I need to rewind time and set it at a point where the last crisis was happening.

(Now that I've written this down it all feels obvious, but it initially wasn't when I was staring at my dashboard at the current time, showing nothing because nothing was down, and wondering how I was going to test it.)

Sidebar: My best time-selection option in Grafana

In my experience, the best way to select a time range or a time endpoint in Grafana is through a graph panel that shows something over time. What you show doesn't matter, although you might as well try to make it useful; what you really care about is the time scale at the bottom that lets you swipe and drag to pick the end and start points of the time range. The Grafana time selector at the top right is good for the times that it gives fast access to, but it is slow and annoying if you want, say, '8:30 am yesterday'. It is much faster to use the time selector to get your graph so that it includes the time point you care about, then select it off the graph.

DashboardSetTimeUseful written at 22:45:30


It's always DNS (a story of our circular dependency)

Our building and in fact much of the University of Toronto downtown campus had a major power failure tonight. When power came back on I wasn't really expecting our Ubuntu servers to come back online, but to my surprise they started pinging (which meant not just that the actual servers were booting but that the routers, the firewall, the switches, and so on had come back). However when I started ssh'ing in, our servers were not in a good state. For a start, I didn't have a home directory, and in fact none of our NFS filesystems were mounted and the machines were only part-way through boot, stalled trying to NFS mount our central administrative filesystem.

My first thought was that our fileservers had failed to boot up, either our new Linux ones or our old faithful OmniOS ones, but when I checked they were mostly up. Well, that's getting ahead of things, because when I started to check, what actually happened is that the system I was logged in to reported something like 'cannot resolve host <X>'. That would be a serious problem.

(I could resolve our hostnames from an outside machine, which turned out to be very handy since I needed some way to get their IPs so I could log into them.)

We have a pair of recursive OpenBSD-based resolvers; they had booted and could resolve external names, but they couldn't resolve any of our own names. Our configuration uses Unbound backed by NSD, where the NSD on each resolver is supposed to hold a cached copy of our local zones that is refreshed from our private master. In past power shutdowns, this has allowed the resolvers to boot and serve DNS data from our zones even without the private master being up, but this time around it didn't; both NSDs returned SERVFAIL when queried, and 'nsd-control zonestatus' reported things like:

zone: <our-zone>
      state: refreshing
      served-serial: none
      commit-serial: none

Our private master was up, but like everything else it was stalled trying to NFS mount our central administrative filesystem. Since this central filesystem is where our nameserver data lives, this was a hard dependency. This NFS mount turned out to be stalled for two reasons. The obvious and easy to deal with one was that the private master couldn't resolve the hostname of the NFS fileserver. When I tried to mount by IP address, I found the second one; the fileserver itself was refusing mounts because, without working DNS, it couldn't map IP addresses to names to verify NFS mount permission.

(To break this dependency I wound up adding NFS export permission for the IP address of the private master, then manually mounting the filesystem from the fileserver's IP on the private master. This let the boot continue, our private master's nameserver started, our local resolvers could refresh their zones from it, and suddenly internal DNS resolution started working for everyone. Shortly afterward, everyone could at least get the central administrative filesystem mounted.)

So, apparently it really always is DNS, even when you think it won't be and you've tried to engineer things so that your DNS will always work (and when it's worked right in the past).

OurDNSCircularDependency written at 01:42:40


Our likely ZFS fileserver upgrade plans (as of March 2019)

Our third generation of ZFS fileservers are now in full production, although we're less than half way through migrating all of our filesystems from our second generation fileservers. As peculiar as it sounds, this makes me think ahead to what our likely upgrade plans are.

Our current generation ZFS fileservers are running Ubuntu 18.04 LTS with the Ubuntu version of ZFS (with a frozen kernel version). Given our past habits, it's unlikely that we'll want to upgrade them to Ubuntu 20.04 LTS when that comes out in a year or so, unless there's some important ZFS bugfix or feature that's present in 20.04 (which is possible, cf, although serious bugs will hopefully be fixed in the 18.04 version of ZFS). Instead, we'll only start looking at upgrades when 18.04 goes on its end of life countdown when Ubuntu 22.04 LTS comes out, which historically will be in April of 2022, three years from now.

In 2022, our current server hardware and 2TB data SSDs will be about four years old; based on our past habits, this will not be old enough that we consider them in urgent need of replacement. I hope that we'll turn over the SSDs for new ones with larger capacity (and without four years of write wear), but we might not do it in 2022 at the same time as we execute an upgrade to 22.04. If we have money, we might refresh the servers with new hardware, but if so I think we'd mostly be doing it to have hardware that hadn't been used for four years, instead of more powerful hardware, and in general our SuperMicro servers have been very reliable; our OmniOS generation are now somewhere around five years old and show no signs of problems anywhere. The one exception is that maybe RAM prices will finally have gone down substantially by 2022 so we can afford to put a lot more memory in a new generation of servers.

(We will definitely be upgrading from Ubuntu 18.04 when it starts going out of support, and it's probable that it will be to the current Ubuntu LTS instead of to, say, CentOS. Hardware upgrades are much more uncertain.)

Frankly, next time around I would like us not to have to move our ZFS pools and filesystems over to new fileservers; it takes a lot of work and a lot of time. An 'in place' upgrade for the ZFS pools is now at least possible and I hope that we do it, either by reusing the current servers and swapping in new system disks set up with Ubuntu 22.04, or by moving the data SSDs from one physical server to another and then re-importing the pools and so on.

(We did a 'swap the system disks' upgrade on our OmniOS fileservers when we moved from r151010 to r151014 and it went okay. It turns out that we also did this for a Solaris 10 upgrade many years ago.)

ZFSFileserverUpgradePlans written at 21:47:49


Our current approach for significantly upgrading or modifying servers

Every so often we need to make some significant upgrade or change to one of our servers, for instance to upgrade from Ubuntu version to Ubuntu version. When we do this, we do two things. The first is that we reinstall from scratch rather than try to upgrade the machine's current OS and setup in place. There are a whole bunch of reasons for this (for any OS, not just Linux), including that it gets you as close as possible to ensuring that the current state of the machine isn't dependent on its history.

(A machine that has been through major upgrades inevitably and invariably carries at least some traces of its past, traces that will not be there on a new version that was reinstalled from scratch.)

The second is that we almost always install the new instance of the server on new hardware and swap it into place, rather than reinstalling on the same hardware that is currently the live server. There are exceptions, usually for our generic compute servers, but for anything important we prefer new hardware (this is somewhat of a change from the past). One part of this is that using a new set of hardware makes it easy to refresh the hardware, change the RAM or SSD setup, and so on (and also to put the new server in a different place in your racks). Another part is that when you have two servers, rolling back an upgrade that turns out to have problems is much easier and faster than if you have destroyed the old server in the process of installing the new one. A third reason is more prosaic; there's always less downtime involved in a machine swap than in a reinstall from scratch, and among other things this leads to less or no pressure when you're installing the machine.

One consequence of our approach is that we always have a certain amount of 'not in production' replaced servers that are still in our racks but powered off and disconnected. We don't pull replaced servers immediately, in case we have to roll back to them, so after a while we have to remember that probably we should pull the old version of an upgraded server. We don't always, so every so often we basically wind up weeding our racks, pulling old servers that don't need to be there. One trigger for this weeding is when we need room in a specific rack and it happens to be cluttered up with obsolete servers. Another is when we run short on spare server hardware to turn into more new servers.

(Certain sorts of servers are recycled almost immediately in order to reclaim desirable bits of hardware in them. For example, right now anything with a 10G-T card is probably going to be pulled shortly after an upgrade in order to extract the card, because we don't have too many of them. There was a time when SSDs would have prompted recycling, but not any more.)

PS: We basically never throw out (still) working servers, even very old ones, but they do get less and less desirable over time and so sit deeper and deeper in the depths of our spare hardware storage. The current fate of really old servers is mostly to be loaned or passed on to other people here who need them and who don't mind getting decade old hardware (often with very little RAM by modern standards, which is another reason they get less desirable over time).

PPS: I'm not joking about decade old servers. We recently passed some Dell 1950s on to someone who needed scratch machines.

ServerUpgradeApproach written at 21:48:13


Prometheus's delta() function can be inferior to subtraction with offset

The PromQL delta() function is used on gauges to, well, let's quote its help text:

delta(v range-vector) calculates the difference between the first and last value of each time series element in a range vector v, returning an instant vector with the given deltas and equivalent labels. The delta is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if the sample values are all integers.

Given this description, you would expect that 'delta(yourmetric[24h])' is preferable to the essentially functionally equivalent but more verbose version using offset:

yourmetric - yourmetric offset 24h

(Ignoring some hand waving about any delta extrapolation and so on.)

Unfortunately it is not. In some situations, the offset based version can work when the delta() version fails.

The fundamental problem is unsurprisingly related to Prometheus's lack of label based optimization, and it is that using delta() attempts to load all samples in the entire range into memory, even though most of them will be ignored and discarded. If your metric has a lot of metric points, for example because it has relatively high metric cardinality (many different label values), attempting to load all of the samples into memory can trip Prometheus limits and cause the delta()-based version to fail. The offset based version only ever loads metric points from two times, so it will almost always work.

On the one hand, it's easy to see how Prometheus's implementation of PromQL could wind up doing this. It is natural to write general code that loads range vectors and then have delta() just call it generically and ignore most of the result, especially since there are various special cases. On the other hand, this is a very unfortunate artificial limit that's probably eventually going to affect any delta() query that's made over a sufficiently large timescale.

(This issue doesn't affect rate() and friends, at least in one sense. Because rate() and company have to check for resets over the entire time range, they need to load and use all of the sample points. You can't replace an increase() with an offset unless you're willing to ignore any errors caused by counter resets. If you're doing ad-hoc queries, you probably need to narrow down the number of metric points you're trying to load by using labels and so on. And if you really want to know, say, the average interface bandwidth for a specific network interface over an entire year, you may be plain out of luck until you put more RAM in your Prometheus server and increase its query limits.)

PrometheusDeltaVsOffset written at 18:58:22


Sometimes the simplest version of a graph is a text table

In the past, I've written about learning that sometimes the best way to show information is in a simple graph, for example a basic bar graph of total change instead of a line graph over time. As a co-worker gently encouraged me recently, we can take this further; sometimes the simplest and most accessible form of information is in the form of a plain table of text or something similar to it (eg, as a list).

(The specific situation that prompted this was wanting a simple, easy to read dashboard of ZFS filesystems and pools that are currently more or less full up to their quota limit, especially ones that have recently filled up, because sometimes this causes our fileservers to become upset. We sort of had this information in our existing Grafana dashboards, but a lot of it was in line and bar graphs and so was not the easiest thing to glance at in the heat of the moment.)

I won't say that text is always the simplest and best version of information, because I think it depends on what you want out of it. If you want to clearly read what is essentially textual information, such as the names of full filesystems, then the text format is going to win; even if the information is there in a graph, it's there in labels or things you have to hover over, not the primary visual elements (the lines or bars or points). On the other hand, I think that our bar graphs make it easier to compare the magnitude of things than seeing the same values in text. It's very easy to eyeball a bar graph and see 'that is much bigger than that'; doing the same thing with numbers requires reading the numbers and perhaps interpreting the units (if, for example, we are being helpful by using 'Mbytes' for small numbers and 'Gbytes' for large ones, and so on).

(But if you want to know relatively precisely how much bigger, text is likely to be better. Human beings are good at telling 'smaller' and 'larger', but we are relatively bad at precise measurements of how much. For that matter, there are optical illusions that can fool us on smaller and larger, but hopefully you aren't putting optical illusions in your dashboards.)

The corollary of this is that at some point, I should think about what my dashboards want to say and what information people will want to get from them. I'm not going to say that I should design all of this up front, because right now I don't know enough about what sort of information is even going to be useful to us to do that, but at some point in the design of any particular dashboard I should switch from exploring possibilities to boiling it down to a focused version that's intended for other people.

(If there are elements of an experimental dashboard that are pulling in different directions about what they want to be, I can always make two (or more) production dashboards. Unfortunately Grafana doesn't make this very easy; it's hard to clone dashboards or copy dashboard elements from dashboard to dashboard.)

SimpleTextVsGraphs written at 21:26:30


Prometheus subqueries pick time points in a surprising way

Up until today, I would have confidently told you that I understood how Prometheus subqueries picked the time points that they evaluated your expression at; it was the obvious combination of a range vector with a query step. Given a subquery range such as '[5h:1m]' and assuming an instant query evaluated at 'now', Prometheus would first go back exactly five hours in seconds, as it would for a range vector of '[5h]', and then step forward from that starting time every minute (in seconds), as it would in a Prometheus graphing query (what the HTTP API calls a range query). In fact I did assert this, and then Brian Brazil corrected me. Subqueries do not work this simply and straightforwardly, and how they actually work may have implications for how you want to use them.

It turns out that subqueries use times that are evenly divisible by the step interval, which the blog post and other people describe as 'being aligned with' the step. As part of this, subqueries will start (and finish) earlier than specified in order to generate as many metric points as they intuitively should. This even division is in Unix time, which is in UTC, not in your local timezone.

This is all very abstract, so let's use the example of a subquery range of '[5d:1d]'. This should intuitively yield five metric points as the result, and since the step is one day, the aligned times Prometheus picks will be at midnight UTC (ie, when '<timestamp> % 1d' is 0). As I write this, the time is March 19th 01:40 UTC or so (and 9:40 pm March 18th in local time), and if I execute this query now I will get the following Unix timestamps of metric points, shown here with their translation to UTC time:

@1552608000        March 15 00:00 UTC
@1552694400        March 16 00:00 UTC
@1552780800        March 17 00:00 UTC
@1552867200        March 18 00:00 UTC
@1552953600        March 19 00:00 UTC

Notice that the oldest timestamp is earlier than now minus exactly five days of seconds, which would be March 14th at 01:40 UTC, and the most recent timestamp is not 'now' but back at midnight UTC (which was at 8pm local time).

(Note that this is not when the metric points themselves come from; it is when the subquery expression is evaluated. If I use, say, 'timestamp(node_load1)[5d:1d]' to extract both the metric point timestamp and the evaluation timestamp, I get results that differ, as you'd expect. You can see all of this in the Prometheus web interface by making 'console' queries for subquery range expressions; the web interface will show you all of the returned metric points and their timestamps.)

At small subquery steps, like :1m or :10m or even perhaps :1h, this alignment probably doesn't matter a lot. At large time steps this may well be important to you, because Prometheus always aligns your subqueries to UTC, not to local time. There is no way to make a step of :6h or :12h or :1d align to local midnight, local midnight and local noon, and so on, or to be relative to 'now' instead of being aligned with absolute time.

(Apparently the Prometheus people have a reason for doing it this way; I believe it boils down to helping cache things during query evaluation.)

Sidebar: The exact Prometheus code involved

The code involved here is found in prometheus/promql/engine.go, in the eval() method, and reads:

// Start with the first timestamp after (ev.startTimestamp - offset - range)
// that is aligned with the step (multiple of 'newEv.interval').
newEv.startTimestamp = newEv.interval * ((ev.startTimestamp - offsetMillis - rangeMillis) / newEv.interval)
if newEv.startTimestamp < (ev.startTimestamp - offsetMillis - rangeMillis) {
  newEv.startTimestamp += newEv.interval
}
I believe that ev.startTimestamp is usually the same as the ending timestamp, and if it's not I don't understand what this code is going to wind up doing. The division here is integer division.
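As a cross-check, the alignment arithmetic can be sketched in Python. This is my own translation of the Go above (for positive timestamps, Go's integer division matches Python's floor division), using the '[5d:1d]' example and its evaluation time from earlier in the entry:

```python
def subquery_start(end_ts, rng, step, offset=0):
    # First timestamp at or after (end_ts - offset - rng) that is
    # a multiple of the step, mirroring promql/engine.go.
    earliest = end_ts - offset - rng
    start = step * (earliest // step)
    if start < earliest:
        start += step
    return start

DAY = 86400
now = 1552959600  # 2019-03-19 01:40 UTC
start = subquery_start(now, 5 * DAY, DAY)
points = list(range(start, now + 1, DAY))
print(points[0], points[-1], len(points))
# -> 1552608000 1552953600 5 (March 15 through March 19, at 00:00 UTC)
```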

PrometheusSubqueriesPointTime written at 22:25:40


An easy optimization for restricted multi-metric queries in Prometheus

Here's something that I've repeatedly had to learn and remember the hard way when I was building Prometheus queries for our dashboards, especially for graphs. Suppose that you have a PromQL query that involves multiple metrics, for instance you want to know the number of inodes used on a filesystem:

node_filesystem_files - node_filesystem_files_free

(The reason that the Prometheus host agent doesn't directly export a 'number of inodes used' metric is that statfs(2) doesn't provide that information on common Unixes; it provides only 'total inodes' and 'inodes free'. The host agent is being honest here. I could say a lot of things about statfs(2), but this is not the entry for it.)

Now, suppose that you have a lot of servers and a lot of filesystems and that you're actually only interested in the number of inodes used for a few of them. For example, you have a Grafana dashboard that displays the information for the root filesystem on a single host. A perfectly sensible way to write the query for this dashboard is:

node_filesystem_files{ host="$host", mountpoint="/" } - node_filesystem_files_free

Unfortunately, current versions of Prometheus (2.8.0 as I write this entry) miss the obvious way of optimizing this query when they execute it. Instead of propagating the label restrictions from the left hand side query to the right hand side as well, the PromQL engine will get all of the metrics for node_filesystem_files_free, across all of your servers and filesystems, and then throw out all but the single one that matches the left hand side.

As a result, any time you have a multi-metric query that is matching labels across the metrics and one or more of the metrics is restricted with label matches, you can usually improve things by replicating the restriction into the other metrics. This goes for arithmetic operators, boolean operators, and 'and', but obviously doesn't apply for 'or' or 'unless'. This improvement doesn't just boost performance; under some circumstances, it can make the difference between a query that gets a Prometheus error about loading too many metrics points and a query that works.
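For the inode usage query from earlier, the hand-optimized version simply repeats the label matchers on the right hand side:

```promql
node_filesystem_files{ host="$host", mountpoint="/" } - node_filesystem_files_free{ host="$host", mountpoint="/" }
```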

(I find this a bit unfortunate, in that the natural and more maintainable way to write the query is not always the workable way. The performance impact of the less efficient version I can usually live with, but I really don't want my graphs and queries falling over with 'too many metrics points' when I extend the time ranges far enough.)

PrometheusLabelNonOptimization written at 23:47:24
