Wandering Thoughts archives

2018-11-10

Why Prometheus turns out not to be our ideal alerting system

What we want out of an alert system is relatively straightforward (and was probably once typical for sysadmins who ran machines). We would like to get notified once and only once for any new alert that shows up (and for some of them to get notified again when they go away), and we'd also like these alerts to be aggregated together to some degree so we aren't spammed to death if a lot of things go wrong at once.

(It would be ideal if the degree of aggregation was something we could control on the fly. If only a few machines have problems we probably want to get separate emails about each machine, but if a whole bunch of machines all suddenly have problems, please, just send us one email with everything.)

Unfortunately Prometheus doesn't do this, because its Alertmanager has a fundamentally different model of how alert notification should work. Alertmanager's core model is that instead of sending you new alerts, it will send you the entire current state of alerts any time that state changes. So, if you group alerts together and initially there are two alerts in a group and then a third shows up later, Alertmanager will first notify you about the initial two alerts and then later re-notify you with all three alerts. If one of the three alerts clears and you've asked to be notified about cleared alerts, you'll get another notification that lists the now-cleared alert and the two alerts that are still active. And so on.

(One way to put this is to say that Alertmanager is sort of level triggered instead of edge triggered.)
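To make this concrete, here's a toy Python sketch of that notification behaviour (the group name and alert names here are invented, and this is a simplification of the model rather than anything Alertmanager actually does internally):

  # A toy sketch of the level-triggered notification model described above:
  # every time the set of alerts in a group changes, you get re-sent the
  # whole current state of the group, not just what's new.
  def notify(group, firing, resolved=()):
      print(f"[{group}] firing: {sorted(firing)} resolved: {sorted(resolved)}")

  alerts = set()

  # Two alerts show up and get grouped together: one notification with both.
  alerts |= {"disk full on apps0", "disk full on apps1"}
  notify("per-host problems", alerts)

  # A third alert joins the group later: we get re-notified with all three.
  alerts |= {"load high on apps0"}
  notify("per-host problems", alerts)

  # One alert clears (and we've asked to hear about cleared alerts): another
  # notification listing the cleared alert plus the two still-active ones.
  alerts -= {"disk full on apps1"}
  notify("per-host problems", alerts, resolved={"disk full on apps1"})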

This is not a silly or stupid thing for Alertmanager to do, and it has some advantages; for instance, it means that you only need to read the most recent notification to get a full picture of everything that's currently wrong. But it also means that if you have an escalating situation, you may need to carefully read all of the alerts in each new notification to realize this, and in general you risk alert fatigue if you have a lot of alerts that are grouped together; sooner or later the long list of alerts is just going to blur together. Unfortunately this describes our situation, especially if we try to group things together broadly.

(Alertmanager also sort of assumes other things, for example that you have a 24/7 operations team who deal with issues immediately. If you always deal with issues when they come up, you don't need to hear about an alert clearing because you almost certainly caused that and if you didn't, you can see the new state on your dashboards. We're not on call 24/7 and even when we're around we don't necessarily react immediately, so it's quite possible for things to happen and then clear up without us even looking at anything. Hence our desire to hear about cleared alerts, which is not the Alertmanager default.)

I consider this an unfortunate limitation in Alertmanager. Alertmanager internally knows what alerts are new and changed (since that's part of what drives it to send new notifications), but it doesn't expose this anywhere that you can get at it, even in templating. However I suspect that the Prometheus people wouldn't be interested in changing this, since I expect that distinguishing between new and old alerts doesn't fit their model of how alerting should be done.

On a broader level, we're trying to push a round solution into a square hole and this is one of the resulting problems. Prometheus's documentation is explicit about the philosophy of alerting that it assumes; basically it wants you to have only a few alerts, based on user-visible symptoms. Because we look after physical hosts instead of services (and to the extent that we have services, we have a fair number of them), we have a lot of potential alerts about a lot of potential situations.

(Many of these situations are user visible, simply because users can see into a lot of our environment. Users will notice if any particular general access login or compute server goes down, for example, so we have to know about it too.)

Our current solution is to make do. By grouping alerts only on a per-host basis, we hope to keep the 'repeated alerts in new notifications' problem down to a level where we probably won't miss significant new problems, and we have some hacks to create one time notifications (basically, we make sure that some alerts just can't group together with anything else, which is more work than you'd think).

(It's my view that using Alertmanager to inhibit 'less severe' alerts in favour of more severe ones is not a useful answer for us for various reasons beyond the scope of this entry. Part of it is that I think maintaining suitable inhibition rules would take a significant amount of care in both the Alertmanager configuration and the Prometheus alert generation, because Alertmanager doesn't give you very much power for specifying what inhibits what.)

Sidebar: Why we're using Prometheus for alerting despite this

Basically, we don't want to run a second system just for alerting unless we really have to, especially since a certain number of alerts are naturally driven from information that Prometheus is collecting for metrics purposes. If we can make Prometheus work for alerting and it's not too bad, we're willing to live with the issues (at least so far).

sysadmin/PrometheusAlertsProblem written at 23:35:56

Character by character TTY input in Unix, then and now

In Unix, normally doing a read() from a terminal returns full lines, with the kernel taking care of things like people erasing characters and words (and typing control-D); if you run 'cat' by itself, for example, you get this line at a time input mode. However Unix has an additional input mode, raw mode, where you read() every character as it's typed (or at least as it becomes available to the kernel). Programs that support readline-style line editing, such as shells like Bash and zsh, operate in this mode, as do editors like vi (and emacs if it's in non-windowed mode).

(Not all things that you might think operate in raw mode actually do; for example, passwd and sudo don't use raw mode when you enter your password, they just turn off echoing characters back to you.)
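As a concrete illustration of the difference, here's a minimal Python sketch that reads one line in cooked mode and then one character in raw mode, using the standard termios and tty modules (thin wrappers over tcgetattr() and tcsetattr()). It's not how any particular shell or editor manages its terminal; it just shows the basic mode switch:

  import os
  import sys
  import termios
  import tty

  fd = sys.stdin.fileno()

  # Cooked (line at a time) mode: read() doesn't return until you press
  # Enter, and the kernel handles character and line erasing for you.
  line = sys.stdin.readline()
  print("cooked mode returned a whole line:", repr(line))

  # Raw mode: each character is returned as soon as it's typed, with no
  # kernel line editing and no special treatment of things like ^C.
  saved = termios.tcgetattr(fd)        # remember the cooked-mode settings
  try:
      tty.setraw(fd)                   # switch the terminal to raw mode
      ch = os.read(fd, 1)              # returns after a single keystroke
  finally:
      termios.tcsetattr(fd, termios.TCSADRAIN, saved)   # always restore
  print("raw mode returned a single character:", repr(ch))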

Unix has pretty much always had these two terminal input modes (kernel support for both goes back to at least Research Unix V4, which seems to be the oldest one that we have good kernel source for through tuhs.org). However, over time the impact on the system of using raw mode has changed significantly, and not just because CPUs have gotten faster. In practice, modern cooked (line at a time) terminal input is much closer to raw mode than it was in the days of V7, because over time we've moved from an environment where input came from real terminals over serial lines to one where input takes much more complicated and expensive paths into Unix.

In the early days of Unix, what you had was real, physical terminals (sometimes hardcopy ones, such as in famous photos of Bell Labs people working on Unix in machine rooms, and sometimes 'glass ttys' with CRT displays). These terminals were connected to Unix machines by serial lines. In cooked, line at a time mode, what happened when you hit a character on the terminal was that the character was sent over the serial line, the serial port hardware on the Unix machine read the character and raised an interrupt, and the low level Unix interrupt handler read the character from the hardware, perhaps echoed it back out, and immediately handled a few special characters like ^C and CR (which made it wake up the rest of the kernel) and perhaps the basic line editing characters. When you finally typed CR, the interrupt handler would wake up the kernel side of your process, which was waiting in the tty read() handler. This higher level would eventually get scheduled, process the input buffer to assemble the actual line, copy it to your user-space memory, and return from the read() to user space, at which point your program would actually wake up to handle the new line it got.

(Versions of Research Unix through V7 actually didn't really handle your erase or line-kill characters at interrupt level. Instead they pushed everything into a 'raw buffer', and only once a CR was typed was this buffer canonicalized by applying the effects of the erase and line-kill characters to determine the final line that was returned to user level.)
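Here's a toy Python sketch of that 'canonicalize at CR time' approach, using the traditional '#' erase and '@' line-kill characters; this is just an illustration of the idea, not a transcription of the actual V7 code:

  ERASE = '#'   # the old default erase character
  KILL = '@'    # the old default line-kill character

  def canonicalize(raw_buffer):
      """Apply erase and kill processing to everything typed before CR."""
      line = []
      for ch in raw_buffer:
          if ch == ERASE:
              if line:
                  line.pop()       # rub out the previous character
          elif ch == KILL:
              line.clear()         # throw away the line typed so far
          else:
              line.append(ch)
      return ''.join(line)

  print(canonicalize("lx#s -l"))    # the erase fixes the typo: 'ls -l'
  print(canonicalize("whoo@who"))   # the kill starts over: 'who'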

The important thing here is that in line at a time tty input in V7, the only code that had to run for each character was the low level kernel interrupt handler, and it deliberately did very little work. However, if you turned on raw mode all of this changed and suddenly you had to run a lot more code. In raw mode, the interrupt handler had to wake the higher level kernel at each character, and the higher level kernel had to return to user level, and your user level code had to run. On the comparatively small and slow machines that early Unixes ran on, going all the way to user-level code for every character would have been and probably was a visible performance hit, especially if a bunch of people were doing it at the same time.

Things started changing in BSD Unix with the introduction of pseudo-ttys (ptys). BSD Unix needed ptys in order to support network logins over Telnet and rlogin, but network logins and ptys fundamentally change what the character input path looks like in practice. Programs reading from ptys still ran basically the same sort of code in the kernel as before, with a distinction between low level character processing and high level line processing, but now getting characters to the pty wasn't just a matter of a hardware interrupt. For a telnet or rlogin login, the path looked something like this:

  • the network card gets a packet and raises an interrupt.
  • the kernel interrupt handler reads the packet and passes it to the kernel's TCP state machine, which may not run entirely at interrupt level and is in any case a bunch of code.
  • the TCP state machine eventually hands the packet data to the user-level telnet or rlogin daemon, which must wake up to handle it.
  • the woken-up telnetd or rlogind injects the new character into the master side of the pty with a write() system call, which percolates down through various levels of kernel code.

In other words, with logins over the network, a bunch of code, including user-level code, had to run for every character even for line at a time input.
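To make the pty mechanism concrete, here's a minimal Python sketch using the standard os.openpty(). The write to the master side here stands in for what telnetd or rlogind does with each character that arrives from the network, and the read from the slave side stands in for the shell or other program that thinks it's reading from a terminal:

  import os

  # Create a pseudo-tty pair; the slave side looks like a normal terminal.
  master_fd, slave_fd = os.openpty()

  # Pretend a line of input arrived from the network and inject it into the
  # pty by writing it to the master side, the way the daemon does.
  os.write(master_fd, b"hi\n")

  # With the usual default (cooked) settings, a read on the slave side
  # returns the whole line once the newline has been processed, just as it
  # would for input from a serial terminal.
  print("slave side read:", os.read(slave_fd, 16))

  os.close(master_fd)
  os.close(slave_fd)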

In this new world, having the shell or program that's reading input from the pty operate in line at a time mode remained somewhat more efficient than raw mode, but the difference in the amount of code involved was nowhere near what it was (and is) for terminals connected over serial lines. You weren't moving from no user level wakeups to one; you were moving from one to two, and the additional wakeup was on a relatively simple code path (compared to TCP packet and state handling).

(It's a good thing Vaxes were more powerful than PDP-11s; they needed to be.)

Things in Unix have only gotten worse for the character input path since then. Modern input over the network is through SSH, which requires user-level decryption and de-multiplexing before you end up with characters that can be written to the master side of the pseudo-tty; the network input may also involve kernel level firewall checks or even another level of decryption from a VPN (either at kernel level or at user level, depending on the VPN technology). Windowing systems such as X or Wayland add at least two processes to the stack, as generally the window server has to read and process the keystroke and then pass it to the terminal window process (as a generalized event). Sometimes there are more processes involved, and keyboard event handling is complicated in general (which means that there's a lot of code that has to run).

I won't say that character at a time input has no extra overhead in Unix today, because that's not quite true. What is true is that the extra overhead it adds is now only a small percentage of the total cost (in time and CPU instructions) of getting a typed character from the keyboard to the program. And since readline-style line editing and other features that require character at a time input add real value, they've become more and more common as the relative expense of providing them has fallen, to the point where it's now a bit unusual to find a program that doesn't have readline editing.

The mirror image of this modern state is that back in the old days, avoiding raw mode as much as possible mattered a lot (to the point where it seems that almost nothing in V7 actually uses its raw mode). This persisted even into the days of 4.x BSD on Vaxes, if you wanted to support a lot of people connected to them (especially over serial terminals, which people used for a surprisingly long time). This very likely had a real influence on what sort of programs people developed for early Unix, especially Research Unix on PDP-11s.

PS: In V7, the only uses of RAW mode I could spot were in some UUCP and modem related programs, like the V7 version of cu.

PPS: Even when directly connected serial terminals started going out of style for Unix systems, with sysadmins and other local users switching to workstations, people often still cared about dial-in serial connections over modems. And generally people liked to put all of the dial-in users on one machine, rather than try to somehow distribute them over a bunch of machines.

unix/RawTtyInputThenAndNow written at 19:20:34

