Wandering Thoughts

2018-11-13

Our pragmatic attachment to OpenBSD PF for our firewall needs

Today on Twitter, I asked:

#Sysadmin people: does anyone have good approaches for a high-performance 10G OpenBSD firewall (bridging or routing)? Is the best you can do still 'throw the fastest single-core CPU you can find at it'?

A number of people made the reasonable suggestion of looking into FreeBSD or Linux instead of OpenBSD for our 10G Ethernet firewall needs. We have done some investigation of this (and certainly our Linux machines have no problem with 10G wire speeds, even with light firewall rules in place) but it's not a very attractive solution. The problem is that we're very attached to OpenBSD PF for pragmatic reasons.

At this point, we've been using OpenBSD-based firewalls with PF for fifteen years or more. In the process we've built up a bunch of familiarity with the quirks of OpenBSD and of PF, but more importantly we've ended up with thousands of lines of PF rulesets, some of them in relatively complicated firewall configurations, all of which are documented only implicitly in the PF rules themselves because, well, that's what we wrote our firewall rules in.

Moving to anything other than OpenBSD PF means both learning a new rule language and translating our current firewall rulesets to that language. We'd need to do this for at least the firewalls that need to migrate to 10G (one of which is our most complicated firewall), and we'd probably want to eventually do it for all firewalls, just so that we didn't have to maintain expertise in two different firewall languages and environments. We can do this if we have to, but we would very much rather not; OpenBSD works well for us in our environment and we have a solid, reliable setup (including pfsync).

(We don't use CARP, but we count on pfsync to maintain hot spare firewalls in a 'ready to be made live' state. Having pfsync has made shifting between live and hot spare firewalls into something that users barely notice, where in the old pre-pfsync days a firewall shift required scheduled downtimes because it broke everyone's connections. One reason we shift between live and hot spare firewalls is if we think the live firewall needs a reboot or some hardware work.)

We also genuinely like PF; it seems to operate at about the right level of abstraction for what we want to do, and we rarely find ourselves annoyed at it. We would probably not be enthused about trying to move to something that was either significantly higher level or significantly lower level. And, barring our issues with getting decent 10G performance, OpenBSD PF has performed well and been extremely solid for us; our firewalls are routinely up for more than a year and generally we don't have to think about them. Anything that proposes to supplant OpenBSD in a firewall role here has some quite demanding standards to live up to.

PS: For our purposes, FreeBSD PF is a different thing from OpenBSD PF because it hasn't picked up the OpenBSD PF features and syntax changes since OpenBSD 4.5, and we use any number of those in our PF rules (you have to, since OpenBSD loves changing PF syntax). Regardless of how well FreeBSD PF works and how broadly familiar it would be, we'd have to translate our existing rulesets from OpenBSD PF to FreeBSD PF. This might be easier than translating them to anything else, but it would still be a non-trivial translation step (with a non-trivial requirement for testing the result).

sysadmin/OpenBSDPFAttachment written at 23:48:51

2018-11-12

What Python 3 versions I can use (November 2018 edition)

Back several years ago, I did a couple of surveys of what Python versions I could use for both Python 2 and Python 3, based on what was available on the platforms that we (and I) use. What Python 2 versions are available is almost irrelevant to me now; everything I still care about has a sufficiently recent version of 2.7, and anyway I'm moving to Python 3 for new code both personally and for work. So the much more interesting question is what versions of Python 3 are out there, or at least what major versions. Having gone through this exercise, my overall impression is that the Python 3 version landscape has stabilized for the uses that we currently make of Python 3.

At this point, a quick look at the release dates of various Python 3 versions is relevant. Python 3.4 was released March 16, 2014; 3.5 was released September 13, 2015; 3.6 was released December 23, 2016; 3.7 was released only recently, on June 27, 2018. Right now, anyone using 3.7 on Unix is either using a relatively leading-edge Unix distribution or built it themselves (I think it just got into Fedora 29 as the default 'Python 3', for example). However, I suspect that 3.6 is the usual baseline people developing Python 3 packages assume and target, perhaps with some people still supporting 3.5.

At work, we mostly have a mixture of Ubuntu LTS versions. The oldest one is Ubuntu 14.04; it's almost gone but we still have two last 14.04 servers for a couple more months and I actually did write some new Python 3 code for them recently. The current 14.04 Python 3 is 3.4.3, which is close enough to modern Python 3 that I didn't run into any problems in my simple code, but I wouldn't want to write anything significant or tricky that had to run in Python 3 on those machines.

(When I started writing the code, I actually asked myself if I wanted to fall back to Python 2 because of how old these machines were. I decided to see if Python 3 would still work well enough, and it did.)

We have a bunch of Ubuntu 16.04 machines that will be staying like that until 2020 or so, when 16.04 starts falling out of support. Ubuntu 16.04 currently has 3.5.2, and the big feature it doesn't have that I'm likely to run into is probably literal string interpolation ('f-strings', which arrived in 3.6); I can avoid it in my own code, but not necessarily in any third party modules I want to use. Until recently, the 16.04 Python 3.5 was the Python 3 that I developed to and most actively used, so it's certainly a completely usable base for our Python 3 code.
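As a small illustration of the difference (with made-up values), the first form here needs Python 3.6 or later while the second works on 16.04's 3.5:

host, count = "apps0", 3
msg = f"{host}: {count} stuck processes"              # literal string interpolation, 3.6+
msg = "{0}: {1} stuck processes".format(host, count)  # the 3.5-compatible spelling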

Ubuntu 18.04 has Python 3.6.6; 18.04 came out a few months before 3.7 was released. I honestly don't see very much in the 3.7 release notes that I expect to actively miss, although a good part of this is because we don't have any substantial Python programs (Python 3 or otherwise). If we used asyncio, for instance, I think we'd care a lot more about not having 3.7.

We have one CentOS 6 machine, but it's turning into a CentOS 7 machine some time in the next year and we're not likely to run much new Python code on it. However, just as back in 2014, CentOS 7 continues to have no version of Python 3 in the core package set. Fortunately we don't need to run any of our new Python 3 programs on our CentOS machines. EPEL has Python 3.4.9 and Python 3.6.6 if we turn out to need a version of Python 3 (CentOS maintains a wiki page on additional repositories).

My own workstation runs Fedora, which is generally current or almost current (depending on when Fedora releases happen and when Python releases happen). I'm currently still on Fedora 28 as I'm waiting for Fedora 29 to get some more bugs fixed. I have Python 3.6.6 by default and I could get Python 3.7 if I wanted it, and my default Python 3 will become 3.7 when I move to Fedora 29.

The machine currently hosting Wandering Thoughts is running FreeBSD 10.4 at the moment, which seems to have Python 3.6.2 available through the Ports system. However, moving DWiki (the Python software behind the blog) to Python 3 isn't something that I plan to do soon (although the time is closer than it was back in 2015). My most likely course of action with DWiki is to see what the landscape looks like for Python 2 starting in 2020, when it's formally no longer supported (and also what the landscape looks like for Python 3, for example if there are prospects of significant changes or if things appear to have quieted down).

(Perhaps I should start planning seriously for a Python 3 version of DWiki, though. 2020 is not that far away now and I don't necessarily move very fast with personal projects these days, although as usual I expect Python 2 to be viable and perfectly good for well beyond then. I probably won't want to write code in Python 2 any more by then, but then I'm not exactly modifying DWiki much right now.)

python/MyPython3Versions2018-11 written at 22:56:09

2018-11-11

Easy configuration for lots of Prometheus Blackbox checks

Suppose, not entirely hypothetically, that you want to do a lot of Prometheus Blackbox checks, and worse, these are all sorts of different checks (not just the same check against a lot of different hosts). Since the only way to specify a lot of Blackbox check parameters is with different Blackbox modules, this means that you need a bunch of different Blackbox modules. The examples of configuring Prometheus Blackbox probes that you'll find online all set the Blackbox module as part of the scrape configuration; for example, straight from the Blackbox README:

- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  [...]

You can do this for each of the separate modules you need to use, but that means many separate scrape configurations, and for each separate scrape configuration you're going to need those standard seven lines of relabeling configuration. This is annoying and verbose, and it doesn't take too many of these before your Prometheus configuration file is so overgrown with Blackbox scrape configurations that it's hard to see anything else.

(It would be great if Prometheus could somehow macro-ize these or include them from a separate file or otherwise avoid repeating everything for each scrape configuration, but so far, no such luck. You can't even move some of your scrape configurations into a separate included file; they all have to go in the main prometheus.yml.)

Fortunately, with some cleverness in our relabeling configuration we can actually embed the name of the module we want to use into our Blackbox target specification, letting us use one Blackbox scrape configuration for a whole bunch of different modules. The trick is that all a Blackbox check really requires is that, by the end of setting up a particular scrape, the module parameter has ended up in the __param_module label. Normally it winds up there because we set it in the params section of the scrape configuration, but we can also explicitly put it there through relabeling (just as we set __address__ by hand through relabeling).

So, let's start with nominal declared targets that look like this:

- ssh_banner,somehost:25
- http_2xx,http://somewhere/url

This encodes the Blackbox module before the comma and the actual Blackbox target after it (you can use any suitable separator; I picked comma for how it looks).
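In the scrape configuration itself, these are just ordinary (if odd-looking) targets; a minimal sketch of what I mean might be (the job name here is made up):

- job_name: 'blackbox-many'
  metrics_path: /probe
  static_configs:
    - targets:
        - ssh_banner,somehost:25
        - http_2xx,http://somewhere/url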

Our first job with relabeling is to split this apart into the module and target URL parameters, which are the magic __param_module and __param_target labels:

relabel_configs:
  - source_labels: [__address__]
    regex: ([^,]*),(.*)
    replacement: $1
    target_label: __param_module
  - source_labels: [__address__]
    regex: ([^,]*),(.*)
    replacement: $2
    target_label: __param_target

(It's a pity that there's no way to do multiple targets and replacements in one rule, or we could make this much more compact. But I'm probably far from the first person to observe that Prometheus relabeling configurations are very verbose. Presumably Prometheus people don't expect you to be doing very much of it.)

Since we're doing all of our Blackbox checks through a single scrape configuration, we won't normally be able to easily tell which module (and thus which check) failed. To make life easier, we explicitly save the Blackbox module as a new label, which I've called probe:

  - source_labels: [__param_module]
    target_label: probe

Now the rest of our relabeling is essentially standard; we save the Blackbox target as the instance label and set the actual address of our Blackbox exporter:

  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115

All of this works fine, but there turns out to be one drawback of putting all or a lot of your Blackbox checks in a single scrape configuration, which is that you can't set the Blackbox check interval on a per-target or per-module basis. If you need or want to vary the check interval for different checks (ie, different Blackbox modules) or even different targets, you'll need to use separate scrape configurations, even with all of the extra verbosity that that requires.

(As you might suspect, I've decided that I'm mostly fine with a lot of our Blackbox checks having the same frequency. I did pull ICMP ping checks out into a separate scrape configuration so that we can do them a lot more frequently.)

PS: If you wanted to, you could go further than this in relabeling; for instance, you could automatically add the :25 port specification on the end of hostnames for SSH banner checks. But it's my view that there's a relatively low limit on how much of this sort of rewriting one should do. Rewriting to avoid having a massive prometheus.yml is within my comfort limit here; rewriting just to avoid putting a ':25' on hostnames is not. There is real merit to being straightforward and sticking as close to normal Prometheus practice as possible, without extra magic.
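For what it's worth, a hypothetical rule for that might look something like the following; it relies on the default ';' label separator, would have to go after the module and target are split apart (and before instance is set), and I haven't tested it:

  - source_labels: [__param_module, __param_target]
    regex: ssh_banner;([^:]+)
    replacement: ${1}:25
    target_label: __param_target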

(I think that the 'module,real-target' format of target names I've adopted here is relatively easy to see and understand even if you don't know how it works, but I'm biased and may be wrong.)

sysadmin/PrometheusBlackboxBulkChecks written at 22:35:04

The needs of Version Control Systems conflict with capturing all metadata

In a comment on my entry Metadata that you can't commit into a VCS is a mistake (for file based websites), Andrew Reilly put forward a position that I find myself in some sympathy with:

Doesn't it strike you that if your VCS isn't faithfully recording and tracking the metadata associated with the contents of your files, then it's broken?

Certainly I've wished for VCSes to capture more metadata than they do. But, unfortunately, I've come to believe that there are practical issues for VCS usage that conflict with capturing and restoring metadata, especially once you get into advanced cases such as file attributes. In short, what most users of a VCS want is actively in conflict with the VCS being a complete and faithful backup and restore system, especially in practice (ie, with limited programming resources to build and maintain the VCS).

The obvious issue is file modification times. Restoring file modification time on checkout can cause many build systems (starting with make) to not rebuild things if you check out an old version after working on a recent version. More advanced build systems that don't trust file modification timestamps won't be misled by this, but not everything uses them (and not everything should have to).
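As a purely hypothetical illustration of the make problem, imagine a VCS that restored original modification times on checkout ('vcs' here is not a real command):

$ vcs checkout release-1.0    # imagine this restored the original mtimes
$ make
make: 'prog' is up to date.

The checked-out source files now look older than the 'prog' binary left over from your recent work, so make rebuilds nothing and you quietly wind up testing the wrong binary.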

More generally, metadata has the problem that much of it isn't portable. Non-portable metadata raises multiple issues. First, you need system-specific code to capture and restore it. Then you need to decide how to represent it in your VCS (for instance, do you represent it as essentially opaque blobs, or do you try to translate it to some common format for its type of metadata). Finally, you have to decide what to do if you can't restore a particular piece of metadata on checkout (either because it's not supported on this system or because of various potential errors).

(Capturing certain sorts of metadata can also be surprisingly expensive and strongly influence certain sorts of things about your storage format. Consider the challenges of dealing with Unix hardlinks, for example.)

You can come up with answers for all of these, but the fundamental problem is that the answers are not universal; different use cases will have different answers (and some of these answers may actually conflict with each other; for instance, whether on Unix systems you should store UIDs and GIDs as numbers or as names). VCSes are not designed or built to be comprehensive backup systems, partly because that's a very hard job (especially if you demand cross system portability of the result, which people do very much want for VCSes). Instead they're designed to capture what's important for version controlling things and as such they deliberately exclude things that they think aren't necessary, aren't important, or are problematic. This is a perfectly sensible decision for what they're aimed at, in line with how current VCSes don't do well at handling various sorts of encoded data (starting with JSON blobs and moving up to, say, word processor documents).

Would it be nice to have a perfect VCS, one that captured everything, could restore everything if you asked for it, and knew how to give you useful differences even between things like word processor documents? Sure. But I can't claim with a straight face that not being perfect makes a VCS broken. Current VCSes explicitly make the tradeoff that they are focused on plain text files in situations where only some sorts of metadata are important. If you need to go outside their bounds, you'll need additional tooling on top of them (or instead of them).

(Or, the short version, VCSes are not backup systems and have never claimed to be ones. If you need to capture everything about your filesystem hierarchy, you need a carefully selected, system specific backup program. Pragmatically, you'd better test it to make sure it really does back up and restore unusual metadata, such as file attributes.)

tech/VCSVsMetadata written at 18:40:40

OpenSSH 7.9's new key revocation support is welcome but can't be a full fix

I was reading the OpenSSH 7.9 release notes, as one does, when I ran across a very interesting little new feature (or combination of features):

  • sshd(8), ssh-keygen(1): allow key revocation lists (KRLs) to revoke keys specified by SHA256 hash.

  • ssh-keygen(1): allow creation of key revocation lists directly from base64-encoded SHA256 fingerprints. This supports revoking keys using only the information contained in sshd(8) authentication log messages.

Any decent security system designed around Certificate Authorities needs a way of revoking CA-signed keys to make them no longer valid. In a disturbingly large number of these systems as people actually design and implement them, you need a fairly decent amount of information about a signed key in order to revoke it (for instance, its full public key). In theory, of course you'll have this information in your CA system's audit records because you'll capture all of it in your audit system when you sign a key. In practice there are many things that can go wrong even if you haven't been compromised.

Fortunately, OpenSSH was never one of these systems; as covered in ssh-keygen(1)'s 'Key Revocation Lists', you could specify keys in a variety of ways that didn't require a full copy of the key's certificate (by serial number or serial number range, by 'key id', or by its SHA1 hash). What's new in OpenSSH 7.9 is that they've reduced how much you need to know in practice; now you can revoke a key given only the information in your ordinary log messages. This includes but isn't limited to CA-signed SSH keys (as I noticed recently).

(This took both the OpenSSH 7.9 change and an earlier change to log the SHA256 of keys, which happened in OpenSSH 6.8.)
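If I'm reading the updated ssh-keygen(1) manpage correctly, using this looks roughly like the following (the file names are made up); you put the fingerprint from the log message into a KRL specification file and then update your existing KRL from it:

$ cat revoke-spec
hash: SHA256:<base64 fingerprint from the sshd log message>
$ ssh-keygen -k -u -f ours.krl revoke-spec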

This OpenSSH 7.9 new feature is a very welcome change; it's now much easier to go from a log message about a bad login to blocking all future use of that key, including and especially if that key is a CA-signed key and so you don't (possibly) have a handy copy of the full public key in someone's ~/.ssh/authorized_keys. However, this isn't and can't be a full fix for the tradeoff of having a local CA. The tradeoff is still there, it's just somewhat easier to deal with either a compromised signed key or the disaster scenario of a compromised CA (or a potentially compromised one).

With a compromised key, you can immediately push it into your system for distributing revocation lists (and you should definitely build such a system if you're going to use a local CA); you don't have to go to your CA audit records first to fish out the full key and other information. With a potentially compromised CA, it buys you some time to roll over your CA certificate, distribute the new one, re-issue keys, and so on, without being in a panic situation where you can't do anything but revoke the CA certificate immediately and invalidate everyone's keys. Of course, you may want to do that anyway and deal with the fallout, but at least now you have more options.

(If you believe that your attacker was courteous enough to use unique serial numbers, you can also do the brute force approach of revoking every serial number range except the ones that you're using for known, currently valid keys. Whether or not you want to use consecutive serial numbers or random ones is a good question, though, and if you use random ones, this probably isn't too feasible.)

PS: I continue to believe that if you use a local CA, you should be doing some sort of (offline) auditing to look for use of signed keys or certificates that are not in your CA audit log. You don't even have to be worried that your CA has been compromised, because CA software (and hardware) can have bugs, and you want to detect them. Auditing used keys against issued keys is a useful precaution, and it shouldn't need to be expensive at most people's scale.

tech/SSHSignedKeyRevocation written at 17:31:47

2018-11-10

Why Prometheus turns out not to be our ideal alerting system

What we want out of an alert system is relatively straightforward (and was probably once typical for sysadmins who ran machines). We would like to get notified once and only once for any new alert that shows up (and for some of them, to get notified again when they go away), and we'd also like these alerts to be aggregated together to some degree so we aren't spammed to death if a lot of things go wrong at once.

(It would be ideal if the degree of aggregation was something we could control on the fly. If only a few machines have problems we probably want to get separate emails about each machine, but if a whole bunch of machines all suddenly have problems, please, just send us one email with everything.)

Unfortunately Prometheus doesn't do this, because its Alertmanager has a fundamentally different model of how alert notification should work. Alertmanager's core model is that instead of sending you new alerts, it will send you the entire current state of alerts any time that state changes. So, if you group alerts together and initially there are two alerts in a group and then a third shows up later, Alertmanager will first notify you about the initial two alerts and then later re-notify you with all three alerts. If one of the three alerts clears and you've asked to be notified about cleared alerts, you'll get another notification that lists the now-cleared alert and the two alerts that are still active. And so on.

(One way to put this is to say that Alertmanager is sort of level triggered instead of edge triggered.)

This is not a silly or stupid thing for Alertmanager to do, and it has some advantages; for instance, it means that you only need to read the most recent notification to get a full picture of everything that's currently wrong. But it also means that if you have an escalating situation, you may need to carefully read all of the alerts in each new notification to realize this, and in general you risk alert fatigue if you have a lot of alerts that are grouped together; sooner or later the long list of alerts is just going to blur together. Unfortunately this describes our situation, especially if we try to group things together broadly.

(Alertmanager also sort of assumes other things, for example that you have a 24/7 operations team who deal with issues immediately. If you always deal with issues when they come up, you don't need to hear about an alert clearing because you almost certainly caused that and if you didn't, you can see the new state on your dashboards. We're not on call 24/7 and even when we're around we don't necessarily react immediately, so it's quite possible for things to happen and then clear up without us even looking at anything. Hence our desire to hear about cleared alerts, which is not the Alertmanager default.)

I consider this an unfortunate limitation in Alertmanager. Alertmanager internally knows what alerts are new and changed (since that's part of what drives it to send new notifications), but it doesn't expose this anywhere that you can get at it, even in templating. However I suspect that the Prometheus people wouldn't be interested in changing this, since I expect that distinguishing between new and old alerts doesn't fit their model of how alerting should be done.

On a broader level, we're trying to push a round solution into a square hole and this is one of the resulting problems. Prometheus's documentation is explicit about the philosophy of alerting that it assumes; basically it wants you to have only a few alerts, based on user-visible symptoms. Because we look after physical hosts instead of services (and to the extent that we have services, we have a fair number of them), we have a lot of potential alerts about a lot of potential situations.

(Many of these situations are user visible, simply because users can see into a lot of our environment. Users will notice if any particular general access login or compute server goes down, for example, so we have to know about it too.)

Our current solution is to make do. By grouping alerts only on a per-host basis, we hope to keep the 'repeated alerts in new notifications' problem down to a level where we probably won't miss significant new problems, and we have some hacks to create one-time notifications (basically, we make sure that some alerts just can't group together with anything else, which is more work than you'd think).
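For concreteness, a minimal sketch of an Alertmanager configuration for 'group on a per-host basis and tell us when alerts clear' might look like the following; the receiver name, the label name, and the intervals are invented for illustration and aren't our real setup:

route:
  receiver: sysadmins
  group_by: ['host']         # whatever label carries the host name in your alerts
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: sysadmins
    email_configs:
      - to: 'sysadmins@example.com'
        send_resolved: true  # hear about alerts clearing; not the default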

(It's my view that using Alertmanager to inhibit 'less severe' alerts in favour of more severe ones is not a useful answer for us for various reasons beyond the scope of this entry. Part of it is that I think maintaining suitable inhibition rules would take a significant amount of care in both the Alertmanager configuration and the Prometheus alert generation, because Alertmanager doesn't give you very much power for specifying what inhibits what.)

Sidebar: Why we're using Prometheus for alerting despite this

Basically, we don't want to run a second system just for alerting unless we really have to, especially since a certain number of alerts are naturally driven from information that Prometheus is collecting for metrics purposes. If we can make Prometheus work for alerting and it's not too bad, we're willing to live with the issues (at least so far).

sysadmin/PrometheusAlertsProblem written at 23:35:56

Character by character TTY input in Unix, then and now

In Unix, normally doing a read() from a terminal returns full lines, with the kernel taking care of things like people erasing characters and words (and typing control-D); if you run 'cat' by itself, for example, you get this line at a time input mode. However Unix has an additional input mode, raw mode, where you read() every character as it's typed (or at least as it becomes available to the kernel). Programs that support readline-style line editing operate in this mode, such as shells like Bash and zsh, as do editors like vi (and emacs if it's in non-windowed mode).

(Not all things that you might think operate in raw mode actually do; for example, passwd and sudo don't use raw mode when you enter your password, they just turn off echoing characters back to you.)
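As a quick modern illustration of the difference between the two modes, here is a minimal Python sketch of reading a single keystroke by switching the terminal into raw mode and back again (this is just my example, not code from any particular program):

import sys, termios, tty

def read_one_keystroke():
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)     # remember the cooked-mode settings
    try:
        tty.setraw(fd)                # character at a time, no echo, no line editing
        return sys.stdin.read(1)      # returns as soon as a single character arrives
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)  # back to cooked mode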

Unix has pretty much always had these two terminal input modes (kernel support for both goes back to at least Research Unix V4, which seems to be the oldest one that we have good kernel source for through tuhs.org). However, over time the impact on the system of using raw mode has changed significantly, and not just because CPUs have gotten faster. In practice, modern cooked (line at a time) terminal input is much closer to raw mode than it was in the days of V7, because over time we've moved from an environment where input came from real terminals over serial lines to one where input takes much more complicated and expensive paths into Unix.

In the early days of Unix, what you had was real, physical terminals (sometimes hardcopy ones, such as in famous photos of Bell Labs people working on Unix in machine rooms, and sometimes 'glass ttys' with CRT displays). These terminals were connected to Unix machines by serial lines. In cooked, line at a time mode, what happened when you hit a character on the terminal was that the character was sent over the serial line, the serial port hardware on the Unix machine read the character and raised an interrupt, and the low level Unix interrupt handler read the character from the hardware, perhaps echoed it back out, and immediately handled a few special characters like ^C and CR (which made it wake up the rest of the kernel) and perhaps the basic line editing characters. When you finally typed CR, the interrupt handler would wake up the kernel side of your process, which was waiting in the tty read() handler. This higher level would eventually get scheduled, process the input buffer to assemble the actual line, copy it to your user-space memory, and return from the read() to user space, at which point your program would actually wake up to handle the new line it got.

(Versions of Research Unix through V7 actually didn't really handle your erase or line-kill characters at interrupt level. Instead they pushed everything into a 'raw buffer', and only once a CR was typed was this buffer canonicalized by applying the effects of those characters to determine the final line that was returned to user level.)

The important thing here is that in line at a time tty input in V7, the only code that had to run for each character was the low level kernel interrupt handler, and it deliberately did very little work. However, if you turned on raw mode all of this changed and suddenly you had to run a lot more code. In raw mode, the interrupt handler had to wake the higher level kernel at each character, and the higher level kernel had to return to user level, and your user level code had to run. On the comparatively small and slow machines that early Unixes ran on, going all the way to user-level code for every character would have been and probably was a visible performance hit, especially if a bunch of people were doing it at the same time.

Things started changing in BSD Unix with the introduction of pseudo-ttys (ptys). BSD Unix needed ptys in order to support network logins over Telnet and rlogin, but network logins and ptys fundamentally change what the character input path looks like in practice. Programs reading from ptys still ran basically the same sort of code in the kernel as before, with a distinction between low level character processing and high level line processing, but now getting characters to the pty wasn't just a matter of a hardware interrupt. For a telnet or rlogin login, the path looked something like this:

  • the network card gets a packet and raises an interrupt.
  • the kernel interrupt handler reads the packet and passes it to the kernel's TCP state machine, which may not run entirely at interrupt level and is in any case a bunch of code.
  • the TCP state machine eventually hands the packet data to the user-level telnet or rlogin daemon, which must wake up to handle it.
  • the woken-up telnetd or rlogind injects the new character into the master side of the pty with a write() system call, which percolates down through various levels of kernel code.

In other words, with logins over the network, a bunch of code, including user-level code, had to run for every character even for line at a time input.

In this new world, having the shell or program that's reading input from the pty operate in line at a time mode remained somewhat more efficient than raw mode but it wasn't anywhere near the difference in the amount of code that it was (and is) for terminals connected over serial lines. You weren't moving from no user level wakeups to one; you were moving from one to two, and the additional wakeup was on a relatively simple code path (compared to TCP packet and state handling).

(It's a good thing Vaxes were more powerful than PDP-11s; they needed to be.)

Things in Unix have only gotten worse for the character input path since then. Modern input over the network is through SSH, which requires user-level decryption and de-multiplexing before you end up with characters that can be written to the master side of the pseudo-tty; the network input may also involve kernel level firewall checks or even another level of decryption from a VPN (either at kernel level or at user level, depending on the VPN technology). Windowing systems such as X or Wayland add at least two processes to the stack, as generally the window server has to read and process the keystroke and then pass it to the terminal window process (as a generalized event). Sometimes there are more processes, and keyboard event handling is complicated in general (which means that there's a lot of code that has to run).

I won't say that character at a time input has no extra overhead in Unix today, because that's not quite true. What is true is that the extra overhead it adds is now only a small percentage of the total cost (in time and CPU instructions) of getting a typed character from the keyboard to the program. And since readline-style line editing and other features that require character at a time input add real value, they've become more and more common as the relative expense of providing them has fallen, to the point where it's now a bit unusual to find a program that doesn't have readline editing.

The mirror image of this modern state is that back in the old days, avoiding raw mode as much as possible mattered a lot (to the point where it seems that almost nothing in V7 actually uses its raw mode). This persisted even into the days of 4.x BSD on Vaxes, if you wanted to support a lot of people connected to them (especially over serial terminals, which people used for a surprisingly long time). This very likely had a real influence on what sort of programs people developed for early Unix, especially Research Unix on PDP-11s.

PS: In V7, the only uses of RAW mode I could spot were in some UUCP and modem related programs, like the V7 version of cu.

PPS: Even when directly connected serial terminals started going out of style for Unix systems, with sysadmins and other local users switching to workstations, people often still cared about dial-in serial connections over modems. And generally people liked to put all of the dial-in users on one machine, rather than try to somehow distribute them over a bunch.

unix/RawTtyInputThenAndNow written at 19:20:34

2018-11-09

Getting CPU utilization breakdowns efficiently in Prometheus

I wrote before about getting a CPU utilization breakdown in Prometheus, where I detailed building up a query that would give us a correct 0.0 to 1.0 CPU utilization breakdown. The eventual query is:

(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)

(As far as using irate() here goes, see rate() versus irate().)

This is a beautiful and correct query, but as it turns out you may not want to actually use it. The problem is that in practice, it's also an expensive query when evaluated over a sufficient range, especially if you're using some version of it for multiple machines in the same graph or Grafana dashboard. In some reasonably common cases, I saw Prometheus query durations of over a second for our setup. Once I realized how slow this was, I decided to try to do better.

The obvious way to speed up this query is to precompute the number that's essentially a constant, namely the number of CPUs (the thing we're dividing by). To make my life simpler, I opted to compute this so that we get a separate metric for each mode, so we don't have to use group_left in the actual query. The recording rule we use is:

- record: instance_mode:node_cpus:count
  expr: count(node_cpu_seconds_total) without (cpu)

(The name of this recording rule metric is probably questionable, but I don't understand the best practices suggestions here.)
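With this recording rule in place, the utilization query presumably becomes something like this (the division still matches up because the mode label is preserved on both sides):

(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / instance_mode:node_cpus:count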

This cuts out a significant amount of the query cost (anywhere from one half to two thirds or so in some of my tests), but I was still left with some relatively expensive versions of this query (for instance, one of our dashboards wants to display the amount of non-idle CPU utilization across all of our machines). To do better, I decided to try to pre-compute the sum() of the CPU modes across all CPUs, with this recording rule:

- record: instance_mode:node_cpu_seconds_total:sum
  expr: sum(node_cpu_seconds_total) without (cpu)
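The fully pre-computed version of the query would then presumably be something like:

rate(instance_mode:node_cpu_seconds_total:sum {mode!="idle"} [1m]) / instance_mode:node_cpus:count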

In theory this should provide basically the same result with a clear saving in Prometheus query evaluation time. In practice this mostly works but occasionally there are some anomalies that I don't understand, where a rate() or irate() of this will exceed 100% (ie, will return a result greater than the number of CPUs in the machine). These excessive results are infrequent and you do save a significant amount of Prometheus query time, which means that there's a tradeoff to be made here; do you live with the possibility of rare weird readings in order to get efficient general trends and overviews, or do you go for complete correctness even at the cost of higher CPU usage (and graphs that take a bit of time to refresh or generate themselves)?

(If you know that you want a particular resolution of rate() a lot, you can pre-compute that (or pre-compute an irate()). But you have to know the resolution, or know that you want irate(), and you may not, especially if you're using Grafana and its magic $__interval template variable.)

I've been going back and forth on this question since I discovered this issue. Right now my answer is that I'm defaulting to correct results even at more CPU cost unless the CPU cost becomes a real, clear problem. But we have the luxury that our dashboards aren't likely to be used very much.

Sidebar: Why I think the sum() in this recording rule is okay

The documentation for both rate() and irate() tells you to always take the rate() or irate() before sum()'ing, in order to detect counter resets. However, in this case all of our counters are tied together; all CPU usage counters for a host will reset at the same time, when the host reboots, and so rate() should still see that reset even over a sum().

(And the anomalies I've seen have been over time ranges where the hosts involved haven't been rebooting.)

I have two wild theories for why I'm seeing problems with this recording rule. First, it could be that the recording rule is summing over a non-coherent set of metric points, where the node_cpu_seconds_total values for some CPUs come from one Prometheus scrape and others come from some other scrape (although one would hope that metrics from a single scrape appear all at once, atomically). Second, perhaps the recording rule is being evaluated twice against the same metric points from the same scrape, because it is just out of synchronization with a slow scrape of a particular node_exporter. This would result in a flat result for one point of the recording rule and then a doubled result for another one, where the computed result actually covers more time than we expect.

(Figuring out which it is is probably possible through dedicated extraction and processing of raw metric points from the Prometheus API, but I lack the patience or the interest to do this at the moment. My guess is currently the second theory, partly based on some experimentation with changes().)

sysadmin/PrometheusCPUStatsII written at 23:40:14

2018-11-08

The future of our homedir-based mail server system design

In a comment on my entry on our self-serve system for autoreplies, I was asked a very good question:

Do you think it will ever be possible for you to move to a non-homedir-based mail server at all?

My answer is that I no longer think it would be a good thing for us to move to a non-homedir-based mail system.

Most mail systems segregate all mail storage and mail processing away from regular user files. As the commentator is noting, our mail system doesn't work this way and instead does things like allow users to have .forward files in their home directories. Sometimes this causes us difficulties, and so the question here is a sensible one. In the past I would have told you that we would eventually have to move to an IMAP-only environment where mail handling and storage was completely segregated from user home directories. Today I've changed my mind; I now think that we should not move mail out of people's home directories. In fact I would like more of their mail to live under their home directories; in an IMAP-only environment, I would like to put people's INBOXes somewhere in there too, instead of in the shared /var/mail filesystem that we have today.

The reasons for this ultimately come down to that eternal issue of storage allocation, plus the fact that any number of our users have quite a lot of email (gigabytes and gigabytes of it). No matter what you do about email, it has to live somewhere, and someone has to pay for the storage space, the backups, and so on. In our environment, how we allocate storage in general is that people get however much disk space they're willing to pay for. There are various obvious good reasons to stick with this for mail storage space, and once we're doing that, there are good reasons to stick with our standard model of providing disk space when providing mail folder space, including that we already have an entire collection of systems for managing it, backing it up, and so on. Since we use ZFS pools, in theory this mail storage space doesn't have to go in people's home directories; we could make separate 'mail storage' filesystems for every person and every group. In practice, we already have home directory filesystems.

It's possible that IMAP-only mail storage in some dedicated format would be more efficient and faster than what we currently have (most people keep their mail in mbox format files). In practice we don't have a mail environment that is that demanding (and since we're moving to an all-SSD setup for all storage, our IO rates are about to get much better anyway).

As far as things like .forwards go, my view is that this is a pragmatic tradeoff. Supporting .forwards ties our hands to some degree, but it also means that we don't have to build and commit to a user accessible server side mail filtering system, with all of the many questions it would raise. As with mail folder storage, using people's home directories and all of the regular Unix infrastructure is the easiest and most obvious approach.
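(For illustration, a classic .forward that both delivers mail normally and runs vacation might be '\someuser, "|/usr/bin/vacation someuser"', although the exact path to our vacation program here is just a guess.)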

PS: Our Unix-focused approach is not one I would necessarily recommend for other environments. It works here for various reasons, including that we already have general Unix login servers for people to use and that we don't have that many users.

sysadmin/MailAndHomedirs written at 23:54:27

2018-11-07

What email messages to not send autoreplies to (late 2018 edition)

Our mail system is very old. Much of the current implementation dates back about ten years, when we moved it to be based on Exim, but the features and in some cases the programs involved go back much further than that. One part of it is that we have a local version of the venerable Unix vacation program, and this local version goes back a very long time (some comments say it is the 4.3 BSD-Reno version, which would date it to 1990). By now our version is ancient and creaky, and in general we're no longer enthused about maintaining locally hacked versions of software, so we need to move to using the standard Ubuntu version. Unfortunately, our local version has some differences from the standard one; it supports an additional command line option that's used by an unknown number of people, and we long ago made it not autoreply to some additional things beyond what the standard vacation already ignored. To deal with both problems we're using the standard computer science solution of adding another layer of indirection, in the form of a cover script. One of the jobs of this cover script is knowing what not to autoreply to (beyond extremely obvious things like messages that we detect as spam).

When I started out writing the cover script, I thought this would be simple. This is not the case, as what not to autoreply to has gotten a little bit more complicated since 1990 or so; for instance, there is now an actual RFC for this, RFC 3834. Based on Internet searches and this very helpful Superuser answer, the current list appears to start with the following (there's a small code sketch of the header checks after the list):

  • a Precedence: header value of 'bulk', 'list', or 'junk'; this is the old standard way.

  • an Auto-submitted: header value of anything but 'no', which is the RFC 3834 standard way. In practice, this is effectively 'if there is an Auto-submitted header'; I searched through a multi-year collection of email and couldn't find anything that used it with a 'no' value.

  • an X-Auto-Response-Suppress: header with effectively any value, although Microsoft's official documentation says that a value of 'none' means that you can auto-reply. In practice that multi-year collection of email contains no cases with the 'none' value.

    (Energetic people can look for only 'All' or 'OOF', but matching this is annoying and, again, my mail collection shows no hits for anything without one or the other of those.)

  • Any of the various headers that indicate a mailing list message, such as List-Id: or List-Unsubscribe:. In a sane world you would only need to look for one of them, but this is not a sane world (especially once spammers get involved); I have seen at least one message with only a List-Unsubscribe:.

  • A null (envelope) sender address, although of course any autoreplies to that aren't going to get very far. Generally you'll want to not autoreply to postmaster@ or mailer-daemon@, although it's not clear how much stuff gets sent out with such envelope senders.
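To make the header side of this concrete, here is a minimal Python sketch using the standard email module; the function and its exact behaviour are my own illustration, not our actual cover script (which also has to consider the envelope sender, something that isn't in the headers at all):

import email

def headers_forbid_autoreply(raw_message):
    msg = email.message_from_string(raw_message)
    if msg.get('Precedence', '').lower() in ('bulk', 'list', 'junk'):
        return True
    auto = msg.get('Auto-Submitted')
    if auto is not None and auto.lower() != 'no':
        return True
    if msg.get('X-Auto-Response-Suppress') is not None:
        return True
    if msg.get('List-Id') is not None or msg.get('List-Unsubscribe') is not None:
        return True
    return False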

In theory you could stop here and be nominally correct, more or less. In practice it seems clear that you want to do some additional matching on the sender address, to not auto-reply to at least:

  • Definitely various variations on 'noreply' and 'donotreply' sender addresses. You might think that people sending emails with these sender addresses would tag them in various ways to avoid auto-replies, but it is not so; for example, just yesterday Flickr sent me a notification email about some important upcoming changes that came from 'donotreply@flickr.com' and had none of those 'please do not reply' header markers.

  • Probably anything that appears to be an address that exists to collect bounces, especially tagged sender addresses. There are a bunch of patterns for these, where they start with 'bounce-' or 'bounce.' or 'bounce+' or 'bounces+', or come from a domain that is 'bounce.<something>' or 'bounces.<something>'. Just to be different, Google uses '@<something>.bounces.google.com'.

    Some of these 'bounces' addresses are also tagged with various 'do not autoreply' headers, but not all of them. Since tagged bounce addresses are always unique, they'll generally always bypass vacation's attempts to only send an autoreply notification every so often, which is one reason I think one should suppress autoreplies to them.

  • Perhaps all detectable tagged sender addresses, especially repeated sources of them. The one that we've already seen in our logs is AmazonSES ones, some of which don't have any 'don't autoreply' headers. Perhaps there are some AmazonSES senders who should get vacation autoreplies, but I suspect that there are not that many.

(I'm sure that there are some senders who would like to get vacation autoreplies so they know that their email is sort of getting through. It's less clear that our users want those senders to know that, given some of the uses of AmazonSES.)

Possibly you also want to not autoreply to sender addresses with various generic local parts, such as 'root', 'www-data', 'apache', and so on. Perhaps you also want to include 'info', but that feels more potentially questionable; there might actually be a human who reads replies to that and cares about out of office things and so on.
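Again purely as an illustration, the sort of sender address checks described above might look like this in Python; the exact patterns are my assumptions and would need tuning against real mail:

import re

NOREPLY_RE = re.compile(r'^(no[-_.]?reply|do[-_.]?not[-_.]?reply)', re.I)
BOUNCE_LOCAL_RE = re.compile(r'^bounces?[-.+]', re.I)
BOUNCE_DOMAIN_RE = re.compile(r'^bounces?\.|\.bounces\.google\.com$', re.I)
GENERIC_LOCALS = {'root', 'www-data', 'apache', 'postmaster', 'mailer-daemon'}

def sender_looks_nonhuman(addr):
    local, _, domain = addr.lower().partition('@')
    return (local in GENERIC_LOCALS
            or NOREPLY_RE.search(local) is not None
            or BOUNCE_LOCAL_RE.search(local) is not None
            or BOUNCE_DOMAIN_RE.search(domain) is not None)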

(In general my view is that it's only useful to send autoreplies to actual people, and in some cases sending autoreplies to non-people addresses is at least potentially harmful. If we can establish fairly confidently that a given sender address is not a person, not sending vacation and out of office and so on autoreplies to it is harmless and perhaps beneficial. At the same time it's important not to be too aggressive, because our users do count on their autoreplies reliably telling people about their status.)

PS: In an extremely cautious world, you would not autoreply to anything that hadn't passed either strict SPF checks or strict DMARC policies. You can use DKIM too, but I think only if you carefully check that you're verifying a DKIM signature for the sender domain, because only then have you verified attribution to the domain. I rather expect that this is too strict to make users happy today, because it would exclude too many real people that send them email and so should get their autoreply messages.

Sidebar: My guess about non-human email that lacks these markers

One might wonder why email notifications and other similar large scale messages don't have some version of 'please do not autoreply' tags. My suspicion is that people have found that email without such tags is more likely to appear in people's inboxes on large providers like GMail and so on, while email with those tags is more likely to get dumped into a less frequently examined location.

If you're someone like Flickr (well, SmugMug, who bought Flickr) and really do have an important message that many Flickr members need to read, this leaves you with an unfortunate dilemma. On the whole I can't blame SmugMug for making the email choice that they did; with data at future risk, it is better to err on the side of getting more autoreplies than having people not see your message.

(In this view, the 'donotreply' email sender address is mostly there in the hopes that actual people will not hit 'reply' and send email back, email that will not have the desired effect.)

spam/AutorepliesWhatNot written at 22:31:05
