Wandering Thoughts

2020-10-24

Why configuration file snippets in a directory should have some extension

After a great deal of painful experience with the combination of local configuration tweaks and vendor upgrades, many systems and people have adopted an approach of splitting monolithic configuration files apart into multiple snippets that sit in a directory. One of the latest I've run into is Fedora's sshd configuration, which is especially relevant to me because I've been customizing mine for years (and carefully re-merging my customizations after various upgrades). However, there is an important thing to bear in mind when setting up such a system.

When you do this split and support including snippets from a directory, you should always require that the snippets have a specific extension, conventionally .conf, instead of just accepting any old file there. A big reason for this is that many Linux packaging systems may wind up creating or leaving oddly named files there when a package is added, upgraded, or removed under the right circumstances; for example, RPM (used on Red Hat Enterprise Linux among others) can create <something>.rpmnew and .rpmsave files. These variously created files should not be treated as live configuration snippets.
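As a concrete illustration of the shape this takes (the directory is Fedora's, but the snippet contents here are made up), the main configuration file only pulls in files ending in .conf, so stray package manager leftovers are simply ignored:

# at or near the top of /etc/ssh/sshd_config
Include /etc/ssh/sshd_config.d/*.conf

# /etc/ssh/sshd_config.d/50-local.conf, a hypothetical local snippet
PermitRootLogin no
PasswordAuthentication no

A 50-local.conf.rpmsave left behind by a package upgrade doesn't match the glob and so never becomes live configuration.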

(Similarly, some systems for automatically modifying files will leave backup versions of the file around with some extension like .bak. You can usually turn this off, but you have to remember to do so; mistakes are inevitable.)

Requiring a specific extension also makes it easier to temporarily deactivate a snippet (just rename it so it no longer ends in the extension, for example by adding a suffix), put in a README file to explain what you're doing, and so on.
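In practice this is all ordinary file manipulation (the snippet name here is made up):

cd /etc/ssh/sshd_config.d
# temporarily deactivate a snippet; it no longer matches the '*.conf' glob
mv 50-local.conf 50-local.conf.disabled
# leave an explanation behind without affecting the live configuration
echo 'local sshd customizations go in 50-local.conf' > README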

Other methods of marking which snippets should be active don't cooperate as well with common package managers and generally aren't as obvious. If you're writing or modifying local software, you may not care about package managers (although you never know, you may want to put your software in one someday), but the other advantages of requiring an extension still apply, and other things on your systems are probably already working this way.

(Fedora's change to move sshd_config customization to snippets in /etc/ssh/sshd_config.d unsurprisingly requires all of the snippets to have a .conf extension.)

PS: This may be a standard new OpenSSH thing, since Ubuntu has it as well, and thus presumably Debian too. If anything Fedora is late to this party.

WhyIncludeWithExtension written at 00:27:59

2020-10-23

An inconvenience of physical hardware is that it has to be delivered

We're in the process of getting various pieces of new hardware here. In fact we've actually been in this process for months and months. Well, sort of. Part of what's made this process take so long is that we've run into an inconvenient aspect of getting physical hardware that I hadn't even considered before, which is that it has to be delivered to us.

Under normal circumstances getting hardware delivered is easy and I don't think much about it; we order things and they show up, sometimes directly in our office and sometimes at mailrooms where we have to fetch them. The most troublesome part is usually getting the websites of various companies to accept the full shipping address that we use (which has various long names in it, extra building names, and so on). But the circumstances haven't been normal for some time, and that complicates everything. Put simply, when there's no one in the building you can't get deliveries.

For some of the time since the end of February, we could have arranged for a person to be present to receive a delivery if we could schedule one. Unfortunately very few delivery firms will do this, even for relatively broad time windows like 'the morning of <X>'; instead they're oriented around the idea that they can turn up any time and find someone present. This is understandable but hasn't been all that workable under a full scale 'work from home' situation.

(Having things delivered to people's homes is not a viable thing once you're dealing with more than a modest amount of hardware, partly because people only have so much room and so much capacity to transport hardware into work. A bunch of 4U rackmount computer cases or Dell 1U servers take up a lot of space and aren't exactly the lightest things in the world.)

I'm sure we're not the only people who are dealing with this and we have been able to come up with some solutions (eventually, and after various confusion). But it's definitely an aspect of getting hardware that I hadn't previously really thought about much. It turns out that there's a bunch of invisible infrastructure in the background of our everyday activities.

(For instance, I don't think much about how we have a collection of carts that we can load up with boxes to move things around, and how our buildings all have access ramps. Without both together, getting deliveries from mailrooms to our offices or storage areas would be a lot more work.)

HardwareRequiresDelivery written at 00:46:14

2020-10-17

A potential Prometheus issue for labeled metrics for infrequent events

One of the things you often get to do with Prometheus is to design your own custom metrics for things (which may be generated and exposed in a number of ways, for example using mtail to extract them from log files). One piece of advice for designing metrics that I've seen is to group closely related measurements together under one metric name using label values to differentiate between them. The classical example here is counting HTTP requests with different return codes using one metric name with a label for the return code, instead of something like one metric name for 200-series responses, a second one for 400-series responses, and so on.
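In exposition format terms the two approaches look like this (the metric names and numbers are illustrative), with the first form being the recommended one:

# one metric name, the return code as a label
http_requests{status="200"} 12480
http_requests{status="404"} 312
http_requests{status="500"} 7

# versus one metric name per class of response
http_requests_2xx 12480
http_requests_4xx 312
http_requests_5xx 7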

(One of the advantages of using a single metric with label values instead of multiple metrics is that it's easier to do operations across all of the different versions of the metric that way. If you want the count of all hits, you can just do 'sum(http_requests) without(status)', instead of having to manually add several metrics together.)

However, using labels this way can create an issue if your metric name is for events that occur only rarely. When something happens that resets your group of metrics (such as the machine reboots), you can wind up in a situation where you've seen no events so you have no time series at all for the metric name, never mind having time series for all of the labels that you expect to eventually see. If there are no time series at all, operations like 'sum()', 'rate()', and so on will fail (well, give no answers), which can potentially make it awkward to do dashboards and graphs that use this metric.

(Your dashboards will usually work, once enough time and events have gone by so that the metric name is populated with labels for everything that you expect.)

The advantage of separate metric names without labels is that most exporter implementations will naturally give them 0 values when the exporter restarts. These 0 values ensure that the metric is always present, so you can easily build reliable graphs that use it and so on. There are various standard solutions for potentially missing data in Prometheus queries (see here), but they all make the PromQL expression less pleasant.
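One common workaround is to 'or' in a default value so that the expression always returns something; a sketch, using a hypothetical metric name:

# without the 'or', this returns nothing at all until the first event is counted
sum(rate(exim_spam_events[5m])) or vector(0)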

As covered in Brian Brazil's Existential issues with metrics, if you know the labels that will eventually be present and your environment lets you, it's ideal to pre-create them ahead of time so that they always exist with a value of 0. Unfortunately not all metrics creation environments provide obvious support for this. Even if you can coerce the metrics creation environment into doing it, it's easy to miss the issue when you're setting things up to start with.
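For example, the Python prometheus_client makes pre-creation reasonably easy, because asking for a child with .labels() brings that time series into existence at zero (the metric and label values here are hypothetical):

from prometheus_client import Counter

# a hypothetical counter of spam-related events, with a 'kind' label
SPAM_EVENTS = Counter('exim_spam_events', 'Spam-related events seen in the mail log', ['kind'])

# touching each expected label value creates that time series at 0, so
# sum() and rate() over the metric work even before any real events happen
for kind in ('greylisted', 'dnsbl_hit', 'rejected'):
    SPAM_EVENTS.labels(kind)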

(This came up when I was augmenting our mtail Exim logfile parsing to count some infrequent spam-related events. Mtail can probably be pushed to do this, but it's not straightforward.)

PrometheusSlowLabeledMetricIssue written at 00:19:36

2020-10-13

As an outsider, I prefer issue tracking to be in its own application

I recently read Josef "Jeff" Sipek's Email vs. Tool du Jour (via). I broadly agree with Sipek's call to generally prefer email to other mechanisms, but after mulling over my reaction to one portion of it, I disagree with Sipek over issue tracking, at least some of the time. Sipek says:

First, what about issue tracking? How does that tie into my email-centric world? Well, you can keep your issue tracker, but in my opinion, the comments feature should not be used. [...]

When I work with outside projects or vendors, what I prefer is an issue tracker that holds all the comments in a ticket and also sends a copy to my email. The great advantage of this is that it automatically creates a durable historical record of that issue's discussion (and only that issue's discussion) in one place, one that people (myself included) can refer back to later.

(Sometimes you're referring back to an issue to see what was discussed at the time and hopefully why things were decided as they were. Sometimes you're referring back because an old issue was suddenly revived on you and now you have to remember stuff from two years ago.)

I can create such a durable historical record myself with email, but to do that I would have to create a folder or a tag or whatever for each issue and then carefully file all email in the right way. I would rather have an issue tracker do this for me, and in the process provide a convenient identifier for the whole issue (a URL, a bug number, etc). The natural state of email is a big pile, which goes badly with wanting to keep track of a bunch of things separately so you can find them (and just them) later.

With that said, we don't use an issue tracker here, although we do archive all of our group discussion email for later reference. I think one difference is the volume involved (we send a lot more email and have a lot more discussions than I do when I interact with outside projects) and another is that we feel many of the issues we deal with are ephemeral and will never need to be looked back at (and won't be revived suddenly in three months, although it sometimes happens).

(And if we did have an issue tracker, I would very much want to interact with it by email since we basically live in email as it is.)

IssueTrackingViaApp written at 23:29:25

2020-10-11

Our current usage and views of UPSes (late 2020 edition)

Over the time I've been here, we've had rather mixed experiences with UPSes. Initially we used them relatively freely on machines that we felt were important, such as our first generation ZFS fileservers. Unfortunately, after a while the UPSes themselves caused us problems (for example, a spontaneous UPS reset that power cycled the machine attached to it), and we grew much more disenchanted with them.

These days we still use UPSes on some machines, such as our fileservers, but pretty much only on machines with redundant power supplies and good IPMIs. One power supply goes to the UPS and the other power supply to line power (via a rack PDU), and we've done as much as possible to make it so that if one power supply sees a power failure, we'll get notified about it. Most of our servers have only one power supply, so we don't even consider putting them on a UPS (even with an automatic transfer switch, of which we have some from days past).

(We don't currently try to monitor or talk to our currently in use UPSes, but we should at least look into that. Probably they're capable of it, since they're decent quality rack mounted units.)

Today, this is the only configuration that I feel comfortable with for production use, because it doesn't add any new single points of failure but will probably give us protection from power loss (assuming the UPS is working properly; if it's not, we're no worse off if we lose main power).

We've historically not bought many servers with redundant power supplies, but that's starting to change now that we're working remotely and fixing a server with a dead PSU is much slower and more work. This may push us toward more use of UPSes, although that's not as useful as it looks because our network switches generally only have one power supply and so won't be on a UPS.

(The other issue with putting switches on a UPS is that right now it would cut off our ability to power cycle them remotely, since they generally don't have a full equivalent of an IPMI to enable internal remote power control. There are various hacks possible here, though.)

Some UPSes stop working if their batteries are unhealthy or dead, which is not something we want to have happen with our UPSes. There are multiple ways to implement a UPS, called topologies, that you can find described in eg Wikipedia, but I don't know if any of them actually require this behavior or if stopping when the battery is dead is just a choice that UPS vendors make. Our current practice is to replace UPS batteries on a reasonably regular basis, partly to avoid any unpleasant surprises like this.

(Once we're actually talking to our UPSes, hopefully they will tell us about this sort of thing.)

UPSOurViews-2020-10 written at 00:40:37

2020-10-10

Wanting to be able to monitor for electrical power quality issues

Today, we had what you could call a "power event" at work. There was some sort of power blip in the electrical feeds from Toronto Hydro to multiple university buildings, and a number of our servers rebooted (not all of them, though). We've seen this kind of thing happen before so in one sense it's not surprising, but this time around the actual experience was rather alarming and confusing because it happened during the working day and we were all working remotely, so we couldn't go down to our machine room to see what was going on.

In the aftermath (and after my recent experience setting up monitoring of my home UPS), I have a new desire for us to be able to know what power blips and power issues have happened at work, in a way that gets recorded and is accessible remotely if enough of the network is still up. Currently, about all we can see is whatever various IPMI systems have recorded in their logs, and generally that amounts to not very much. And if the issue wasn't severe enough for the servers to notice, we don't know about it at all.

Monitoring a sufficiently capable UPS will definitely tell us about power failures (assuming that the system doing the monitoring stays up long enough to record what the UPS tells it). However, I don't know if inexpensive UPSes report smaller power blips or power issues that they detect and clean up, or how much load you have to have on them before they'll report things. While we can induce outright power failures as the UPS sees it by just unplugging it, I don't think we can test other sorts of power issues. Still, setting up communication with one of our existing UPSes that are capable of it would be a step forward and give us some information.
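As a sketch of what talking to a UPS might look like with Network UPS Tools (NUT); the UPS name is made up and the right driver depends on the actual hardware:

# /etc/nut/ups.conf
[rackups]
    driver = usbhid-ups    # or e.g. snmp-ups for a network-managed rack unit
    port = auto

# once the driver and upsd are running, query the UPS:
upsc rackups@localhost
# this reports things like ups.status, battery.charge, and input.voltage,
# which can then be logged or scraped into our monitoring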

Some cursory Internet searches suggest that you can definitely get some products that do this in general (although who knows if we could get them to talk to a Linux server). In news that's no surprise, they are not all that inexpensive. If you need one, you're likely willing to pay the price, but probably we aren't; we have a casual interest, not a suspicion that our power is unstable in general.

(Based on the lack of information in IPMI logs from both machines that rebooted and ones that stayed up, it seems that server PSUs likely can't be used to monitor for this sort of stuff even with a cooperative IPMI. I can't blame them; the PSU's job is to keep the power on, not to report on how good it is.)

Sidebar: What we saw from outside

The initial symptoms were that a whole bunch of machines abruptly dropped off the network (conveniently this did not include our Prometheus server, so we got an alert that something was up). There are quite a number of potential causes of some but not all of your servers abruptly disappearing, including that you've had one or more switches fail (or reboot) or that you've lost some electrical circuits or rack PDUs. Things only got less alarming when servers started coming back up and reporting that they had just rebooted, but then we spent some time looking both at affected machines and ones not affected to try to figure out what had happened.

PowerIssuesMonitoringWish written at 00:27:41

2020-10-09

Whether extra disks should be live or spare now depends on HDs versus SSDs

Suppose, not entirely hypothetically, that you have a server with at least three or four spare drive bays and you want to build a mirrored storage setup that can maintain redundancy without requiring an in-person drive swap should a drive fail. Let's say you go with three drives in the system in total. Obviously two of them have to be mirrored in order to get your basic redundancy, but the third one could be used either to make a three way mirror or held back as a (configured) hot spare.

(If you can use four drives, you can have a three way mirror and a spare, or a four way mirror.)

In the old days of hard drives, you generally might as well use your extra drive as an additional mirror instead of holding it back as a spare. Either way the drive was usually going to be powered on and spinning, and this was enough to start counting down its lifetime. Having it active as a mirror got you some additional read IOs a second and meant you kept all your redundancy without needing any time for a mirror re-synchronization after a drive failed.

(The folklore, at least, was that powered on time was the most important thing for hard drive lifetime because of wear on the main motor. There were other mechanical parts involved in things like the read/write heads, but my impression is that they were usually not seen as a likely source of failure. This may be incorrect in practice.)

In the new days of solid state disks, I've recently realized that that's no longer really true. The amount of time a SSD spends powered on does matter, but so does the amount of data that's been written to it (and perhaps the read volume as well). By keeping your extra SSD out of the mirror, you avoid exposing it to all that write traffic, which prolongs its lifetime. By putting less load on your extra SSD in general, you also hopefully make it less likely that the SSD will suffer an infant mortality death too close to another SSD dying. Or in short, with SSDs, write and read activity now matters. A quiet drive is likely to be a longer-lived drive, so keeping your extra drive quiet is good.

(At the same time you want to poke the extra drive periodically just to make sure it still works. Regular SMART probes might be good enough, but for caution you might want to do a tiny bit of writes every so often or something.)
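A minimal sketch of what periodically poking an otherwise idle spare could look like, run from cron or similar (the device name is a placeholder):

# ask the drive for its overall SMART health self-assessment
smartctl -H /dev/sdc
# every so often, run a short self-test so the drive actually does some work
smartctl -t short /dev/sdc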

SpareOrLiveHDvsSSD written at 00:18:34

2020-09-29

Implementing 'and' conditions in Exim SMTP ACLs the easy way (and in Exim routers too)

One of the things that makes Exim a powerful mailer construction kit is that it has a robust set of string expansions, which can be used to implement conditional logic among other things (examples include 1, 2, 3, and 4). However, this power comes with an obscure, compact syntax that's more or less like a Lisp, but not as nice, and in practice is surprisingly easy to get lost in. String expansions have an 'if' so you can implement conditional things, and since they have an 'if', they have the usual boolean operators, including 'and'. I've written my share of these complicated conditions, but I've never really been happy with how the result came out.

Today, I was writing another Exim ACL statement that was conditional, with two parts that needed to be ANDed together, and I realized that there was a simpler approach than using the power and complexity of '${if and{...}}'. Namely, multiple 'condition =' requirements are ANDed together (just as all requirements are, to be clear). In my case it was clearly simpler to write my two parts separately as:

deny
   condition = ....
   condition = ....
   [... etc ...]

I actually did the multiple 'condition =' version as a quick first version for testing, then rewrote it as a nominally proper single 'condition =' using '${if and{...}}', then went back and reversed the change because the revised version was both longer and less clear.
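To make the contrast concrete with a purely hypothetical example (these aren't the conditions my actual ACL statement checks), the multi-condition version:

deny
   condition = ${if eq{$sender_host_name}{}}
   condition = ${if match{$sender_helo_name}{\N^\d+\.\d+\.\d+\.\d+$\N}}
   message   = HELO is a bare IP address and your host has no reverse DNS

reads rather more clearly, at least to me, than the nominally proper nested version:

deny
   condition = ${if and{{eq{$sender_host_name}{}}\
               {match{$sender_helo_name}{\N^\d+\.\d+\.\d+\.\d+$\N}}}}
   message   = HELO is a bare IP address and your host has no reverse DNS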

This works in Exim routers as well as ACLs, since routers also use 'condition =' as a general precondition.

(This isn't going to be universally more readable, but nothing ever is. Also, I carefully put a comment in my new ACL statement to explain why I was doing such an odd looking thing, although in retrospect the comment's not quite as clear and explanatory as I'd like.)

PS: I'm not sure why this hasn't occurred to me before, since I've always known that multiple requirements in ACLs and Exim routers must all be true (ie are ANDed together). Possibly I just never thought to use 'condition =' twice and fell into the trap of always using the obvious hammer of '${if and{...}}'.

EximMultiConditionsForAnd written at 20:56:15

2020-09-28

Making product names of what you use visible to people is generally a mistake

For years, we've used Sophos PureMessage as the major part of our overall spam filtering. I don't mention specific product names very often for various reasons, but it's now harmless because Sophos is dropping PureMessage (also). We were already planning to almost certainly replace PureMessage for reasons other than this, but Sophos's decision to move to a cloud-based service model forces our hand.

(We actually have the replacement more or less planned out and will likely start switching away from PureMessage very soon.)

As part of our overall filtering (and as standard in a lot of environments), we've set it up so that messages that are considered sufficiently spammy have a tag at the start of the Subject: header. People can then do their own filtering (in procmail or these days in their IMAP mail client) based on that tag, and various other pieces of our mail system also change their behavior if a message's Subject: has been marked this way. The specific tag we use is thus a well known and fundamentally fixed part of our overall mail environment; changing it would require configuration changes across our systems and force people to change their own mail setups, to their annoyance.
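A typical personal procmail rule keying on the tag looks something like this (the destination folder is whatever the person prefers); multiply this by everyone's private filtering and it's clear why the tag is effectively frozen:

:0:
* ^Subject:.*\[PMX:SPAM\]
spam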

The tag we chose, almost fifteen years ago, was (and is) '[PMX:SPAM]'. This was chosen because 'PMX' is the common abbreviation for 'PureMessage' (used in Sophos's documentation, among other places), and we thought that '[SPAM]' was a bit too generic and likely to be added to Subject: headers by other places before the messages got to us.

If things go as expected, in a few months we won't be using Sophos PureMessage any more, and 'PMX' will mean nothing. But I can confidently predict that in ten years, our mail system will still be tagging sufficiently spammy email with '[PMX:SPAM]' (if we still have a mail system at all, and we probably will).

This is not the first time I've made the mistake of burning product names (software or hardware) into things that are visible to people, and it probably won't be the last time, either. Doing this is even a famous sysadmin mistake for hostnames (many '<x>vax' hosts lived on for years after they stopped being actual DEC VAXes). But still, hopefully I can learn something from this and maybe do better for the next time around.

PS: There are clever transition plans like adding a second, more generic tag and then deprecating the first one over the course of many years, but they're not worth it. The other lesson is that sometimes you just shrug and live with the odd name long after you're using a software product or a particular type of hardware. It can even become a part of local folklore.

VisibleProductNamesBad written at 17:02:48

2020-09-27

Remote power control for your machines comes in two flavours

In yesterday's entry on our trouble-free remote reboot of our machines, I mentioned that remote power control for all of our machines would be nice. When I said that, I was insufficiently specific, because there are actually two forms of remote power control and we mostly have one of them, but it's not always good enough. I will call these two sorts of remote power control external and internal remote power control.

In external remote power control, your servers and machines are plugged into a smart power bar or smart rack PDU that you can remotely control to turn outlets on or off. In internal remote power control, the machine itself has some form of lights out management that will control the machine's power; for instance, power control over (networked) IPMI, or a serial connection to the management processor. External remote power control is much easier to set up and is what we have, but it's not quite as good as internal remote power control overall.

The largest limitation of external remote power control is that it leaves you vulnerable to BIOSes with undesirable power control settings. If you have a machine with its BIOS set to 'when power returns after a power loss, stay turned off', then you can't force reboot the machine with external remote power control; once you turn the power to it off, it won't come back up again until you go in to push its physical power button. The machine has power, but the BIOS has been told to keep it off until you push that button.

(Plain KVM over IP will usually not get you out of this, because the BIOS is not powering up the keyboard and video. Wake on LAN probably would, but if your BIOS is set to keep the power off, you probably didn't set WoL up either.)

Plain external remote power control also doesn't do anything if you have the machine plugged into a UPS (either completely, for all of its power supplies, or partially, with one redundant PSU plugged into a UPS and the other into your smart PDU). If you can remotely control outlets on the UPS, you can deal with this, but I'm not sure how many UPSes (even rack ones) support this. I don't believe ours do.

(Our fileservers have dual PSUs with one on a UPS and one on line power this way.)

Internal remote power control deals with both issues because it can directly control the machine's PSUs and will simulate you pushing that front panel power button to overcome BIOS settings. The drawback of having only internal remote power control is that you can't completely cut power to the machine, including to its lights out management processor (you can generally tell the LOM to reboot, but not to turn itself off). So the best of both worlds is to have both smart PDUs and lights out management on your machines.
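As an illustration of the internal flavour in action, with IPMI the whole thing is a couple of commands (the BMC hostname and credentials here are placeholders), and you can even fix the BIOS power restore policy remotely:

# force a power cycle even if the OS is completely wedged
ipmitool -I lanplus -H server-bmc.example.org -U admin -P secret chassis power cycle
# tell the machine to power back on whenever power returns
ipmitool -I lanplus -H server-bmc.example.org -U admin -P secret chassis policy always-on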

RemotePowerControlTwoTypes written at 16:18:57
