Wandering Thoughts

2018-10-22

Some DKIM usage statistics from our recent inbound email (October 2018 edition)

By this point in time, DKIM (DomainKeys Identified Mail) has been around for long enough, and enough large providers like GMail have been pushing for it, that it sees a decent amount of usage. In particular, a surprising number of sources of undesirable email seem to have adopted DKIM, or at least they add DKIM headers to their email. Our Exim setup logs the DKIM status of incoming email on our external MX gateway, and for reasons beyond the scope of today's entry I have become interested in gathering some statistics about what sort of DKIM usage we see, who from, and how many of those DKIM signatures actually verify.

All of the following statistics are from the past ten days of full logs. Over that time we received 105,000 messages, or about 10,000 messages a day, which is broadly typical volume for us from what I remember. Over this ten day period, we saw 69,400 DKIM signatures, of which 55 were so mangled that Exim only reported:

DKIM: Error while running this message through validation, disabling signature verification.

(Later versions of Exim appear to log details about what went wrong, but the Ubuntu 16.04 version we're currently using doesn't.)
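
(As an aside, the basic tallying involved is simple. Here's a rough sketch, not our actual script, of counting up DKIM verification results from an Exim main log; the regular expression is an assumption, since the exact wording and fields of Exim's 'DKIM: d=... [status]' log lines vary between versions and log settings.)

import re
import sys
from collections import Counter

# Assumed log line shape (adjust the regexp for your Exim version):
#   ... DKIM: d=example.com s=sel c=relaxed/relaxed a=rsa-sha256 [verification succeeded]
dkimre = re.compile(r'DKIM: d=(\S+) .*\[(.+)\]')

def main():
    statuses = Counter()
    domains = Counter()
    for line in sys.stdin:
        m = dkimre.search(line)
        if not m:
            continue
        domains[m.group(1)] += 1
        statuses[m.group(2)] += 1
    for status, count in statuses.most_common():
        print("%6d  %s" % (count, status))
    print("%d different d= domains seen" % len(domains))

if __name__ == "__main__":
    main()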

Now things get interesting, because it turns out that a surprising number of messages have more than one DKIM signature. Specifically, roughly 7,600 have two or more (and the three grand champions have six); in total we actually have only 61,000 unique messages with DKIM signatures (which still means that more than half of our incoming email had DKIM signatures). On top of that, 297 of those messages were actually rejected at SMTP time during DATA checks; it turns out that if you get as far as post-DATA checks, Exim is happy to verify the DKIM signature before it rejects the message.

The DKIM signatures break down as follows (all figures rounded down):

 62240  verification succeeded
  3340  verification failed - signature did not verify (headers probably modified in transit)
  2660  invalid - public key record (currently?) unavailable
   790  verification failed - body hash mismatch (body probably modified in transit)
   310  invalid - syntax error in public key record

Of the DKIM signatures on the messages we rejected at SMTP time, 250 had successful verification, 45 had no public key record available, 5 had probably modified headers, and two were mangled. The 250 DKIM verifications for messages rejected at SMTP time had signatures from around 100 different domains, but a number of them were major places:

    41  d=yahoo.com 
    18  d=facebookmail.com 
    13  d=gmail.com 

(I see that Yahoo is not quite dead yet.)

There were 5,090 different domains with successful DKIM verifications, of which 2,170 had only one DKIM signature and 990 had two. The top eight domains each had at least 1,000 DKIM signatures, and the very top one had over 6,100. That very top one is part of the university, so it's not really surprising that it sent us a lot of signed email.

Overall, between duplicate signatures and whatnot, 55,780 or so of the incoming email messages that we accepted at SMTP time had verified DKIM signatures, or just over half of them. On the one hand, that's a lot more than I expected. On the other hand, that strongly suggests that no one should expect to be able to insist on valid DKIM signatures any time soon; there are clearly a lot of mail senders that either don't do DKIM at all, don't have it set up right, or are having their messages mangled in transit (perhaps by mailing list software).

Among valid signatures, 46,270 were rsa-sha256 and 15,960 were rsa-sha1. The DKIM canonicalization (the 'c=' value reported by Exim) breaks down as follows:

 51470  c=relaxed/relaxed 
  9440  c=relaxed/simple 
  1290  c=simple/simple 
    20  c=simple/relaxed

I don't know if this means anything, but I figured I might as well note it. Simple/simple is apparently the default.

spam/DKIMIncomingMailStats-2018-10 written at 23:15:41; Add Comment

Using group_* vector matching in Prometheus for database lookups

On Mastodon, I said:

Current status: writing a Prometheus expression involving 'group_left (sendto) ...' and cackling maniacally.

Boy am I abusing metrics as a source of facts and configuration information, but it's going to beat writing and maintaining a bunch of Prometheus alert rules for people.

(If a system gives me an awkward hammer as my only tool, why yes, I will hit everything with it. Somehow.)

There are many things bundled up in this single toot, but today I'm going to write down the details of what I'm doing in my PromQL before I forget them, because it involves some tricks and hacks (including my use of group_left).

Suppose, not hypothetically, that you have quite a lot of ZFS filesystems and pools and that you want to generate alerts when they start running low on disk space. We start out with a bunch of metrics on the currently available disk space that look like this:

our_zfs_avail_gb{ pool="tank", fs="/h/281", type="fs" } 35.1
our_zfs_avail_gb{ pool="tank", fs="tank", type="pool" } 500.8

(In real life you would use units of bytes, not fractional GB, but I'm changing it to avoid having to use giant numbers. Also, this is an incomplete set of metrics; I'm just including enough for this entry.)

If life was very simple, we could write an alert rule expression for our space alerts that looked like this:

our_zfs_avail_gb < 10

The first problem with this is that we might find that space usage was oscillating right around our alert point. We want to smooth that out, and while there are probably many ways of doing that, I'll go with the simple approach of looking at the average space usage over the last 15 minutes:

avg_over_time(our_zfs_avail_gb [15m]) < 10

In PromQL, avg_over_time is one of the family of X_over_time functions that do their operation over a time range to give you a single number.

If life was simple, we could stop now. Unfortunately, not only do we have a wide variety of ZFS filesystems, but they're owned by a wide variety of people, who are the ones who should be notified when space is low because they're the only ones who can do anything about it, and these people have widely varying opinions about what level of free space is sufficiently low to be alertable on. In other words, we need to parameterize both our alert level and who gets notified on a per-filesystem basis.

In theory you could do this with a whole collection of Prometheus alerting rules, one for each combination of an owner and a set of filesystems with the same low space alert level. In practice this would be crazy to maintain by hand; you'd have to generate all of the alert rules from templates and external information and it would get very complicated very fast. Instead we can use brute force and the only good tool that Prometheus gives us for dynamic lookups, which is metrics.

We'll create a magic metrics sequence that encodes both the free space alert level and the owner of each filesystem. These metrics will look like this:

our_zfs_minfree_gb{ fs="/h/281", sendto="cks" }     50
our_zfs_minfree_gb{ fs="tank", sendto="sysadmins" } 200

These metrics can be pushed into Prometheus in various ways, for example by writing them into a text file for the Prometheus node exporter to pick up, or sent into a Pushgateway (which will persist them for us).
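
As an illustration of the textfile route, here's a hedged sketch of generating these metrics from a simple data file; the data file format, the output file name, and the textfile directory are all made-up assumptions, not our actual tooling. (We write to a temporary file and rename it into place so that the node exporter never sees a half-written file.)

import os
import sys
import tempfile

# Assumed locations; adjust for your node exporter setup.
TEXTFILE_DIR = "/var/lib/node_exporter/textfile"
OUTPUT = "our_zfs_minfree.prom"

def main(datafile):
    # Data file lines are assumed to look like:  /h/281  cks  50
    lines = []
    with open(datafile) as f:
        for ln in f:
            ln = ln.strip()
            if not ln or ln.startswith("#"):
                continue
            fs, sendto, mingb = ln.split()
            lines.append('our_zfs_minfree_gb{fs="%s",sendto="%s"} %s\n'
                         % (fs, sendto, mingb))
    # Write and then rename so readers only ever see a complete file.
    fd, tmpname = tempfile.mkstemp(dir=TEXTFILE_DIR)
    with os.fdopen(fd, "w") as out:
        out.writelines(lines)
    os.chmod(tmpname, 0o644)
    os.rename(tmpname, os.path.join(TEXTFILE_DIR, OUTPUT))

if __name__ == "__main__":
    main(sys.argv[1])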

So our starting point for a rule is the obvious (but non-working):

avg_over_time(our_zfs_avail_gb [15m]) < our_zfs_minfree_gb

If we tried this, we would get no results at all. The reason this doesn't work is that Prometheus normally requires completely matching labels across your expression (as described in the documentation for comparison binary operators and vector matching). These metrics don't have matching labels; even if they had no other labels that clashed (and in real life they will), our_zfs_avail_gb has the pool and type labels, and our_zfs_minfree_gb has the sendto label.

As I've learned the hard way, in any PromQL expression involving multiple metrics it's vital to understand what labels you have and where they might clash. It's very easy to write a query that returns no data because you have mis-matched labels (I've done it a lot as I've been learning to work with PromQL).

To work around this issue, we need to tell PromQL to do the equivalent of a database join on the fs label to pick out the matching our_zfs_minfree_gb value for a given filesystem. Since we're doing a comparison between (instant) vectors, this is done with the on modifier for vector matching:

avg_over_time(our_zfs_avail_gb [15m]) < on (fs) our_zfs_minfree_gb

If we apply this by itself (and assuming /h/281 has been at its current space level over our 15 minute window), we will get a result that looks like this:

{ fs="/h/281" } 35.1

What has happened here is that Prometheus is sort of doing what we told it to do. We implicitly told it that fs was the only label that mattered to us by making it the label we cross-matched on, so it reduced the labels in the result down to that label.

This is not what we want. We want to carry all of the labels from our_zfs_avail_gb over to the output, so that our alerts can be summarized by pool and so on, and we need to pull in the sendto label from our_zfs_minfree_gb so that Alertmanager knows who to send them to. To do this, we abuse the group_left many-to-one vector matching operator.

The full expression is now (with a linebreak for clarity):

avg_over_time(our_zfs_avail_gb [15m]) <
  on (fs) group_left (sendto) our_zfs_minfree_gb

When we use group_left here, two things happen for us. First, all of the labels from the metric on the left side of the expression are included in the result, so we get all of the labels from our_zfs_avail_gb, including pool. Second, group_left also includes the label we listed from the right metric. The result is:

{ pool="tank", fs="/h/281", type="fs", sendto="cks" } 35.1

Strictly speaking, this is an abuse of group_left because our left and our right metrics have the same cardinality. So let's talk about PromQL cardinality for a moment. When PromQL does vector matches in operations like <, it normally requires that exactly one metric on the left match exactly one metric on the right; if there are too many metrics on either the left or the right, PromQL just punts and skips the metric(s). The matching is done on their full labels by default. When you use on or without, you narrow the matching to happen only on those labels or without those labels, but PromQL still requires a one to one match.

Since plain on worked for us, we had that one to one matching already. So we're using group_left only for its side effects of including extra labels, not because we need it for a many to one match. If we changed group_left to group_right, we would get the same set of matches and outputs, but the labels would change:

{ fs="/h/281", sendto="cks" } 35.1

This is because now the labels are coming from the right metric, augmented by any labels from the left metric added by group_*, which in this case doesn't include anything new. If we wanted to get the same results, we would have to include the left side labels we wanted to add:

avg_over_time(our_zfs_avail_gb [15m]) <
  on (fs) group_right (pool, type) our_zfs_minfree_gb

This would get us the same labels, although in a different order because group_* appends the extra labels they add on the end:

{ fs="/h/281", sendto="cks", pool="tank", type="fs" } 35.1

Now, suppose that we didn't have the sendto label and we were using our_zfs_minfree_gb purely to set a per-filesystem level. However, we still want to carry over all of the labels from our_zfs_avail_gb into the output, so that they can be used by Alertmanager. Our quick first attempt at this would probably be:

avg_over_time(our_zfs_avail_gb [15m]) <
  on (fs) group_left (fs) our_zfs_minfree_gb

If we try this, PromQL will immediately give us an error message:

[...]: label "fs" must not occur in ON and GROUP clause at once

This restriction is documented but annoying. Fortunately we can get around it because the group_* operators don't require that their new label(s) actually exist. So we can just give them a label that isn't even in our metric and they're happy:

avg_over_time(our_zfs_avail_gb [15m]) <
  on (fs) group_left (bogus) our_zfs_minfree_gb

This will give us just the labels from the left:

{ pool="tank", fs="/h/281", type="fs" } 35.1

(If we wanted just the labels from the right we could use group_right instead.)

PS: In the expression that I've built up here, any filesystem without an our_zfs_minfree_gb metric will have no free space alert level; it can run right down to 0 bytes left and you'll get no alert about it. Fixing this in the PromQL expression is complicated for reasons beyond the scope of this entry, so in my opinion the best place to fix it is in the tools that generate and check your our_zfs_minfree_gb metrics from some data file in a more convenient format.

sysadmin/PrometheusGroupLeftHack written at 01:44:30; Add Comment

2018-10-21

Some tradeoffs of having a Certificate Authority in your model

There are two leading models for checking identity via public key cryptography in a self-contained environment; you can simply maintain and distribute a list of valid keys, or you can set up a little local CA, have it sign the keys, and then verify keys against the CA's public key. One prominent system that offers you the choice of either option is OpenSSH, which famously supports both models and lets you choose between them for both server public keys and user authentication keys. Despite writing a certain amount about what I see as the weaknesses of the CA model, I accept that the CA model has advantages and makes tradeoffs (like many things in security).

The obvious advantage of the CA model is that using a CA means that you don't have to distribute your keylist around. In the non-CA model, everyone needs to have a copy of the entire list of valid keys they may need to deal with; in the CA model, this is replaced by the much smaller job of distributing the CA keys. Given this, the CA model is likely to be especially popular with big places where distributing the keylist is a hard problem; you have lots of updates, a big keylist, many places to send it, or all of the above. Conversely, if your keylist is essentially constant and you have only a few places where it needs to be distributed to, the CA model is not necessarily much of a win.

(The CA model doesn't entirely eliminate moving keys around, because you still need to get keys signed and the signatures transported to the things that need to use the keys. Nor does the CA model prevent the compromise of individual keys; they can still be copied or stolen by attackers.)

By removing the keylist distribution problem, the CA model enables the use of more keys and more key changes than might be feasible otherwise. One potentially important consequence of removing the distribution problem is that new CA-signed keys are instantly valid everywhere. When you get a new key, you can use it immediately; you don't have to wait for systems to synchronize and update.

(Frequent key changes and limited key lifetimes have the traditional security advantages of limiting the impact of key theft and perhaps making it significantly harder in the first place.)

A more subtle advantage of the CA model is that using CAs enables mass invalidation of keys, because the validity of a whole bunch of keys is centrally controlled through the validity of their CA. If you remove or invalidate a CA, all keys signed (only) by it immediately stop working (assuming that your software gets things right by, eg, checking CA validity dates, not just key validity dates).

The drawback of the CA model is the same as it ever was, which is that a local CA is a single point of compromise for your entire authentication system, and having a CA means you can no longer know for sure what keys have access to your systems. If your systems are working properly you haven't signed any improper or unauthorized keys and you have a complete list of what keys you have signed, but you ultimately have to take this on trust (and audit key use to make sure that all keys you see are ones you know about). The story of the modern public CA system over the past few years is a cautionary story about how well that's worked out in the large, which is so well that people are now creating what is in effect a set of giant key distribution systems for TLS.

(That is ultimately what Certificate Transparency is; it's a sophisticated technique to verify that all keys are in a list.)

Using a local CA is thus a tradeoff. You're getting out of key distribution and giving yourself some multi-level control over key validity in exchange for creating a single point of security failure. It remains my view that in most environments, key distribution is not a big problem and properly operating a genuinely secure CA is. However, setting up a sort-of-secure CA is certainly much easier than readily solving key distribution (especially key distribution to end devices), so I expect using local CAs to remain eternally attractive.

(Or perhaps these days there's easily installed and operated software for local CAs that relies on the security of, say, a Yubikey for actually signing keys. Of course if the CA operator has to touch their Yubikey every time to sign a new key, you're not going to be doing that a lot.)

tech/LocalCATradeoffs written at 02:23:38; Add Comment

2018-10-20

The original Unix ed(1) didn't load files being edited into memory

These days almost all editors work by loading the entire file (or files) that you're editing into memory, either into a flat buffer or two or into some more sophisticated data structure, and then working on them there. This approach to editing files is simple, straightforward, and fast, but it has an obvious limitation; the file you want to edit has to fit into memory. These days this is generally not much of an issue.

V7 was routinely used on what are relatively small machines by modern standards, and those machines were shared by a fairly large number of people, so system memory was a limited resource. Earlier versions of Research Unix had to run on even smaller machines, too. On one of those machines, loading the entire file you wanted to edit into memory was somewhere between profligate and impossible, depending on the size of the file and the machine you were working on. As a result, the V7 ed does not edit files in memory.

The V7 ed manpage says this explicitly, although it's tempting to regard this as hand waving. Here's the quote:

Ed operates on a copy of any file it is editing; [...]. The copy of the text being edited resides in a temporary file called the buffer.

The manual page is being completely literal. If you started up V7 in a PDP-11 emulator and edited a file with ed, you would find a magic file in /tmp (called /tmp/e<something>, the name being created by the V7 mktemp()). That file is the buffer file, and you will find much or all of the file you're editing in it (in some internal format that seems to have null bytes sprinkled around through it).

(V6 is substantially the same, so you can explore this interactively here. I was surprised to discover that V6 doesn't seem to have sed.)

I've poked through the V7 ed.c to see if I could figure out what ed is doing here, but I admit the complete lack of comments has mostly defeated me. What I think it's doing is only allocating and storing some kind of index to where every line is located, then moving line text in and out of a few small 512-byte buffers as you work on them. As you add text or move things around, I believe that ed writes new copies of the line(s) you've touched to new locations in the buffer file, rather than try to overwrite the old versions in place. The buffer file has a limit of 256 512-byte blocks, so if you do enough editing of a large enough file I believe you can run into problems there.

(This agrees with the manpage's section on size limitations, where it says that ed has a 128 KB limit on the temporary file and the limit on the number of lines you can have is the amount of core, with each line taking up one PDP-11 'word' (in the code this is an int).)

Exploring the code also led me to discover how ed handled errors internally, which is by using longjmp() to jump back to main() and re-enter the main command loop from there. This is really sort of what I should have expected from a V7 program; it's straight, minimal, and it works. Perhaps it's not how we'd do that today, but V7 was a different and smaller place.

PS: If you're reading the V7 ed source and want to see where this is, I think it runs through getline(), putline(), getblock(), and blkio(). I believe that the tline variable is the offset that the next line will be written to by putline(), and it gets stored in the dynamically allocated line buffer array that is pointed to by zero. The more I look at it, the more that the whole thing seems pretty clever in an odd way.

(My looking into this was sparked by Leah Neukirchen's comment on my entry on why you should be willing to believe that ed is a good editor. Note that even if you don't hold files in memory, editing multiple files at once requires more memory. In ed's scheme, you would need to have multiple line-mapping arrays, one for each file, and probably you'd want multiple buffer files and some structures to keep track of them. You might also be more inclined to do more editing operations in a single session and so be more likely to run into the size limit of a buffer file, which I assume exists for a good reason.)

unix/EdFileNotInMemory written at 01:17:05; Add Comment

2018-10-18

Some things on delays and timings for Prometheus alerts

One of the things I've been doing lately is testing and experimenting with Prometheus alerts. When you're testing alerts you become quite interested in having your alerts fire and clear rapidly, so you can iterate tests rapidly; it is no fun sitting there for five or ten minutes waiting for everything to reset so you can try something new from a clean slate. Also, of course, I have been thinking about how we want to set various alert-related parameters in an eventual production deployment.

Let's start with the timeline of a simple case, where an event produces a single alert:

  1. the actual event happens (your service falls over)
  2. Prometheus notices when it next scrapes (or tries to scrape) the service's metrics. This may be up to your scrape_interval later, if your timing is unlucky. At this point the event is visible in Prometheus metrics.

  3. Prometheus evaluates alerting rules and realizes that this is alertable. This may be up to your evaluation_interval later. If the alert rule has no 'for: <duration>' clause, the alert is immediately firing (and we go to #5); otherwise, it is pending.

    At this point, the alert's existence now appears in Prometheus's ALERTS metric, which means that your dashboards can potentially show it as an alert (if they refresh themselves, or you tell them to).

  4. if the alert is pending, Prometheus continues to check the alert rule; if it remains true in every check made through your for: duration, the alert becomes firing. This takes at least your for: duration, maybe a bit more. Prometheus uses whatever set of metrics it has on hand at the time it makes each of these checks, and presumably they happen every evaluation_interval as part of alert rule evaluation.

    This means that there isn't much point to a for duration that is too short to allow for a second metrics scrape. Sure, you may check the alert rule more than once, but you're checking with the same set of metrics and you're going to get the same answer. You're just stalling a bit.

    (So, really, I think you should think of the for duration as 'how many scrapes do I want this to have to be true for'. Then add a bit more time for the delays involved in scrapes and rule evaluations.)

  5. Prometheus sends the firing alert to Alertmanager, and will continue to do so periodically while it's firing (cf).
  6. Alertmanager figures out the grouping for the alert. If the grouping has a group_wait duration, it starts waiting for that much time.
  7. If the alert is (still) firing at the end of the group_wait period, Alertmanager sends notifications. You finally know that the event is happening (if you haven't already noticed from your dashboards, your own use of your service, or people telling you).

The end to end delay for this alert is a composite of the event to scrape delay, the scrape to rule evaluation delay, at least your for duration with maybe a bit more time as you wait for the next alert rule evaluation (if you use a for: duration), and the Alertmanager group_wait duration (if you have one). If you want fast alerts, you will have to figure out where you can chop out time. Alternately, if you're okay with slow alerts if it gets you advantages, you need to think about which advantages you care about and what the tradeoffs are.
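
If you want to put rough numbers on this, here's a little sketch of the worst case under a simplified model of the timeline above (it ignores scrape timeouts and slow rule evaluations, so it's a floor on the worst case rather than a guarantee):

import math

def worst_case_delay(scrape_interval, evaluation_interval,
                     for_duration, group_wait):
    # Worst case: the event just misses a scrape, that scrape just
    # misses a rule evaluation, the for: countdown only advances at
    # rule evaluations (so round it up to whole evaluation intervals),
    # and then Alertmanager waits out group_wait.
    for_evals = math.ceil(for_duration / evaluation_interval)
    return (scrape_interval + evaluation_interval +
            for_evals * evaluation_interval + group_wait)

# For example, with everything in seconds:
print(worst_case_delay(15, 15, 60, 45))   # 135, i.e. 2m15s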

The obvious places to add or cut time are in Prometheus's for and Alertmanager's group_wait. There's a non-obvious and not entirely documented difference between them, which is that Alertmanager only cares about the state of the alert at the start and at the end of the group_wait time; unlike Prometheus, it doesn't care if the alert stops being firing for a while in the middle. This means that only Prometheus can de-bounce flapping alerts. However, if your alerts don't flap and you're only waiting in the hopes that they'll cure themselves, in theory you can wait in either place. Well, if you have a simple situation.

Now suppose that your service falling over creates multiple alerts that are driven by different metric scrapes, and you would like to aggregate them together into one alert message. Now it matters where you wait, because the firing alerts can only be aggregated together once they reach Alertmanager and only if they all get there within the group_wait time window. The shorter the group_wait time, the narrower your window to have everything scraped and evaluated in Prometheus is, so that they all escape their for duration close enough to each other for Alertmanager.

(Or you could get lucky, with both sets of metrics scraped sufficiently close to each other that their alert rules are both processed in the same alert rule evaluation cycle, so that their for waits start and end at the same time and they'll go to Alertmanager together.)

So, I think that the more you care about alert aggregation, the more you need to shift your delay time from Prometheus's for to Alertmanager's group_wait. To get a short group_wait and still reliably aggregate alerts together, I think you need to set up your scrape and rule evaluation intervals so that different metrics scrapes are all reliably ingested and processed within the group_wait interval.

Suppose, for example, that you have both set to 15 seconds. Then when an event begins at time T, you know that all metrics reflecting it will be scraped by Prometheus at most 15 seconds after T (plus up to almost your scrape timeout interval), and their alert rules should be processed at most 15 seconds or so after that. At this point all alerts with for conditions will have become pending and started counting down, and they will all transition to firing at most 30 seconds apart (plus wiggle room for scrape and rule evaluation slowness). If you give Alertmanager a 45 second group_wait, it's almost certain that you'll get them aggregated together. 30 seconds might be pushing it, but you'll probably make it most of the time; you would have to be pretty unlucky for one scrape to happen immediately after T with an alert rule evaluation right after it (so that it becomes pending at T+2 or so), then another metric scrape at T+14.9 seconds, have that scrape be slow, and then only get its alert rules evaluated at, say, T+33.

Things are clearly different if you have one scrape source on a 15 second scrape_interval (perhaps an actual metrics point) and another one on a one or two minute scrape_interval or update interval (perhaps an expensive blackbox check, or perhaps a Pushgateway metric that only gets updated once a minute from cron). Here you'd have to be lucky to have Alertmanager aggregate the alerts together with a 30 or 45 second group_wait time.

(One thing that was useful here is Prometheus: understanding the delays on alerting, which has pictures. It dates from 2016 so some of the syntax is a bit different, but the concepts don't seem to have changed.)

PS: I tested this, and Alertmanager does not send out any message if it receives a firing alert from Prometheus and the alert then goes away before the group_wait period is up, not even a 'this was resolved' message if you have send_resolved turned on. This is reasonable from one perspective and potentially irritating from another, depending on what you want.

sysadmin/PrometheusAlertDelays written at 23:53:14; Add Comment

Link: Vectorized Emulation [of CPUs and virtual machines]

Vectorized Emulation: Hardware accelerated taint tracking at 2 trillion instructions per second (via) is about, well, let me quote from the introduction rather than try to further summarize it:

In this blog I’m going to introduce you to a concept I’ve been working on for almost 2 years now. Vectorized emulation. The goal is to take standard applications and JIT them to their AVX-512 equivalent such that we can fuzz 16 VMs at a time per thread. The net result of this work allows for high performance fuzzing (approx 40 billion to 120 billion instructions per second [the 2 trillion clickbait number is theoretical maximum]) depending on the target, while gathering differential coverage on code, register, and memory state.

Naturally you need to do all sorts of interesting tricks to make this work. The entry is an overview, and the author is going to write more entries later on the details of various aspects of it, which I'm certainly looking forward to even if I'm not necessarily going to fully follow the details.

I found this interesting both by itself and for giving me some more insight into modern SIMD instructions and what goes into using them. SIMD and GPU computing feel like something that I should understand some day.

(I find SIMD kind of mind bending and I've never really dug into how modern x86 machines do this sort of stuff and what you use it for.)

links/VectorizedEmulation written at 20:06:28; Add Comment

Why you should be willing to believe that ed(1) is a good editor

Among the reactions to my entry on how ed(1) is no longer a good editor today were people wondering out loud if ed was ever a good editor. My answer is that yes, ed is and was a good editor in the right situations, and I intend to write an entry about that. But before I write about why ed is a good editor, I need to write about why you should be willing to believe that it is. To put it simply, why you should believe that ed is a good editor has nothing to do with its technical merits and everything to do with its history.

Ed was created and nurtured by the same core Bell Labs people who created Unix, people like Dennis Ritchie and Ken Thompson. Ed wasn't their first editor; instead, it was the end product of a whole series of iterations of the same fundamental idea, created in the computing environment of the late 1960s and early to mid 1970s. The Bell Labs Unix people behind ed were smart, knew what they were doing, had done this many times before, had good taste, were picky about their tools, used ed a great deal themselves, and were not afraid to completely redo Unix programs that they felt were not up to what they should be (the Unix shell was completely redesigned from the ground up between V6 and V7, for example). And what these people produced and used was ed, not anything else, even though it's clear that they could have had something else if they'd wanted it and they certainly knew that other options were possible. Ed is clearly not the product of limited knowledge, imagination, skill, taste, or indifference to how good the program was.

It's certainly possible to believe that the Bell Labs Research Unix people had no taste in general, if you dislike Unix as a whole; in that case, ed is one more brick in the wall. But if you like Unix and think that V7 Unix is well designed and full of good work, it seems a bit of a stretch to believe that all of the Bell Labs people were so uniquely blinded that they made a great Unix but a bad editor, one that they didn't recognize as such even though they used it to write the entire system.

Nor do I think that resource constraints are a convincing explanation. While the very limited hardware of the very earliest Unix machines might have forced early versions of ed to be more limited than prior editors like QED, by the time of V7, Bell Labs was running Unix on reasonably good hardware for the time.

The conclusion is inescapable. The people at Bell Labs who created Unix found ed to be a good editor. Since they got so much else right and saw so many things so clearly, perhaps we should consider that ed itself has merits that we don't see today, or don't see as acutely as they did back then.

unix/EdBelieveGoodEditor written at 00:42:26; Add Comment

2018-10-17

When metrics disappear on updates with Prometheus Pushgateway

To simplify, Prometheus has a notion of current versus stale metrics. As you'd want, straightforward Prometheus queries (for instance, for your current CPU usage) return only current metrics. There are also a number of ways to push metrics into Prometheus from places like scripts, such as the node exporter's textfile collector, where your script can write files to a magic directory on a machine that is already running the node exporter, and the Pushgateway, where your script can use curl to just poke metrics straight into Prometheus.

(A metric here is a combination of a metric name and a set of labels with their values.)

If you use the node exporter's textfile collector, when your metrics stop being current is pretty straightforward. If Prometheus can't talk to the node exporter, all of your metrics go stale; if it can, any metrics that aren't there go stale. So if you remove your file entirely, all of your metrics go stale, while if you write a new version of the file that's missing some metrics, just those go stale. Basically the current state of the world is, well, current, and everything else is stale.

(However, if you write your file and let it sit for a month, those metrics are still current as far as Prometheus is concerned. The textfile collector exposes metrics for the most recent times those files were modified.)

Pushgateway famously does not work this way, in that metrics pushed to it have no expiry time and will be considered current by Prometheus for as long as the Pushgateway responds. To quote from When to use the Pushgateway's rundown of pitfalls of using Pushgateway:

  • The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.

(Whether or not this is a feature depends on your usage.)

This is true in one sense and is not quite completely true in another sense. If you push metrics and then go away, it is true. But if you are regularly pushing new versions of metrics, as you would be regularly generating new versions of your metrics file for the node exporter's textfile collector, what metrics disappear when depends on both what metrics you push, especially what metric names, and whether you push them with POST or with PUT.

Here's an example. We start by pushing the following metrics to /metrics/job/test/instance/fred on our Pushgateway (the job and instance here form what Pushgateway calls a 'grouping key'):

sensor_temp{id="1"}   23.1
sensor_temp{id="2"}   25.6
sensor_switch{id="1"} 1

Then we push to the same URL with the following new version of our metrics, which no longer mentions either sensor_temp{id="2"} or sensor_switch{id="1"}:

sensor_temp{id="1"}   24.0

If you send this with a POST, Pushgateway will remove the old sensor_temp{id="2"} metric, making it stale, but will continue to expose sensor_switch{id="1"}. If you send this with PUT, Pushgateway removes both.

If you use PUT, Pushgateway assumes that you are completely authoritative for what metrics currently exist under your grouping key; any metrics that you didn't push are removed and become stale in Prometheus. If you use POST, Pushgateway assumes that you're only authoritative for the metric names that you're using in your push. Metric names that you didn't mention might be handled by some other job, so it doesn't touch metrics from them.

As the Pushgateway documentation mentions but does not explicitly explain, this means that a POST with an empty body does nothing except update the push_time_seconds metric for your group key; since you pushed no metric names, Pushgateway doesn't touch any of the existing metrics. If you did a PUT with an empty body, in theory you would get the same effect as DELETE (but Pushgateway may consider this an error, I haven't checked).

Given this, my opinion is you should normally use PUT when sending metrics to Pushgateway. If you actually want to have several things separately pushing to the same group key with POST, you need to explicitly coordinate who gets to use what metric name(s), because otherwise you will quietly have push sources stepping on each other's toes and things will probably get very confusing (as metrics become stale or current depending on who pushed last and when Prometheus scraped your Pushgateway).

(One use of POST is to explicitly only update the last pushed time, with no chance of touching any of the current metrics. In this use it's the Pushgateway equivalent of the Unix touch command.)
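
Here's a sketch of what this looks like from a script, using the Python requests module and the same made-up grouping key as above (the Pushgateway host name is also made up):

import requests

URL = "http://pushgw.example.com:9091/metrics/job/test/instance/fred"

# PUT: we are authoritative for the whole grouping key, so metrics we
# don't mention (sensor_temp{id="2"}, sensor_switch{id="1"}) are
# removed and go stale in Prometheus.
resp = requests.put(URL, data='sensor_temp{id="1"} 24.0\n')
resp.raise_for_status()

# POST with an empty body: only updates push_time_seconds for the
# grouping key, leaving all existing metrics alone (the 'touch' usage).
resp = requests.post(URL, data="")
resp.raise_for_status()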

I think that it's kind of unfortunate that the Pushgateway README implicitly uses POST in their examples (by using curl with no special options). If I really wanted to try to shave this particular yak I suppose that I could always submit a pull request, although I wonder if it would be declined on the grounds of being too verbose and explaining the nominally obvious.

Sidebar: When persistence in your metrics is a feature

The short version is that I see a use for pushing metrics that basically represent general facts into Pushgateway and then letting it persist them for us. These facts are not per-host things (or at least not things that we really want to generate on the individual hosts), so while we could expose them through the Prometheus host's node exporter and textfiles, that seems a bit like a hack.

Some people would say 'don't put general facts into Prometheus metrics'. My answer is that there isn't really a better option due to the paucity of features in things like alerting rules; you get PromQL expressions and that's mostly it, so either you write lots of alert rules or put your facts where PromQL can get at them.

(Or perhaps I'm missing something.)

sysadmin/PrometheusPushgatewayDropMetrics written at 23:10:05; Add Comment

2018-10-16

Quickly bashing together little utilities with Python is nice

One of the reasons that I love Python, and one of the things that I love using it for, is its ability to quickly and easily bash together little utility programs. By their nature, these things don't get talked about very much (they're so small and often so one-off), but this time I have a couple of examples to talk about.

As part of writing yesterday's entry on external email delivery delays we see, I found myself wanting to turn the human-friendly format that Exim reports time durations in into an integer number of seconds, so that I could sort them, and then later on I found myself wanting to convert the other way, so that I could tell what the few large numbers of seconds I was getting looked like in human-readable terms.

Exim's queue delay format looks like '1d19h54m13s'. The guts of my little Python program to convert these into seconds looks like this:

import re
import sys

rexp = re.compile("([a-z])")
timemap = {'s': 1,
           'm': 60,
           'h': 60*60,
           'd': 24*60*60,
           }

def process():
    for l in sys.stdin:
        sr = rexp.split(l)
        # The minimum split should be '1', 's', '\n'.
        if len(sr) < 3:
            continue
        secs = 0
        for i in range((len(sr)-1) // 2):
            o = i*2
            secs += int(sr[o]) * timemap[sr[o+1]]
        print(secs)

The core trick is to use a Python regexp to split '1d19h54m13s' into ['1', 'd', '19', 'h', '54', 'm', '13', 's'] (plus a trailing newline in this case). We can then take pairs of these things, turn the first into a number, and multiply it by the unit conversion determined by the second.
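
For concreteness, here's what that split and pairing look like if you poke at the code above in an interactive Python session:

>>> import re
>>> rexp = re.compile("([a-z])")
>>> rexp.split("1d19h54m13s\n")
['1', 'd', '19', 'h', '54', 'm', '13', 's', '\n']
>>> sr = rexp.split("1d19h54m13s\n")
>>> [(sr[i*2], sr[i*2+1]) for i in range((len(sr)-1) // 2)]
[('1', 'd'), ('19', 'h'), ('54', 'm'), ('13', 's')]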

Going the other direction looks surprisingly similar (for example, I literally copied the timemap over):

timemap = {'s': 1,
           'm': 60,
           'h': 60*60,
           'd': 24*60*60,
           }

def tsstring(secs):
    o = []
    for i in ('d', 'h', 'm', 's'):
        if secs >= timemap[i] or o:
            n = secs // timemap[i]
            o.append("%d%s" % (n, i))
            secs %= timemap[i]
    return "".join(o)

There are probably other, more clever ways to do both conversions, as well as little idioms that could make these shorter and perhaps more efficient. But one of the great quiet virtues of Python is that I didn't need to reach for any special idioms to bash these together. The only trick is the regular expression split and subsequent pairing in the first program. Everything else I just wrote out as the obvious thing to do.

Neither of these programs worked quite right the first time I wrote them, but in both cases they were simple enough that I could realize my oversight by staring at things for a while (and it didn't take very long). Neither needed a big infrastructure around them, and with both I could explore their behavior and the behavior of pieces of them interactively in the Python interpreter.

(Exploring things in the Python interpreter was where I discovered that the .split() was going to give me an odd number of elements no matter what, so I realized that I didn't need to .strip() the input line. I'm sure that people who work with RE .split() in Python know this off the top of their head, but I'm an infrequent user at this point.)

Neither of these programs have any error handling, but neither would really be improved in practice by having to include it. They would be more technically correct, but I should never feed these bad input and if I do, Python will give me an exception despite me skipping over the entire issue. That's another area where Python has an excellent tradeoff for quickly putting things together; I don't have to spend any time on error handling, but at the same time I can't blow my foot off (in the sense of quietly getting incorrect results) if an error does happen.

(I almost started writing the Exim format to seconds conversion program in Go, except when I looked it up time.ParseDuration() doesn't accept 'd' as a unit. I think writing it in Python was easier overall.)

Sidebar: Code bumming and not thinking things through

I knew I would be processing a decent number of Exim delay strings in my program, so I was worried about it being too slow. This led me to write the core loop in the current not quite obvious form with integer indexing, when it would be more straightforward to iterate through the split list itself, pulling elements off the front until there wasn't enough left. There are alternate approaches with an explicit index.

Looking at this now the day afterward, I see that I could have made my life simpler by remembering that range() takes an iteration step. That would make the core loop:

        for i in range(0, len(sr)-1, 2):
            secs += int(sr[i]) * timemap[sr[i+1]]

This would also perform better. Such are the perils of paying too close attention to one thing and not looking around to think about if there's a better, more direct way.

python/PythonQuickUtilsNice written at 21:51:42; Add Comment

2018-10-15

The external delivery delays we see on our central mail machine

These days, our central mail machine almost always has a queue of delayed email that it's trying to deliver to the outside world. Sometimes this is legitimate email and it's being delayed because of some issue on the remote end, ranging from the remote end being down or unreachable to the destination user being over quota. But quite a lot of the time it is some variety of email that we didn't successfully recognize as spam: either we're forwarding it to some place that does recognize it (but hasn't outright rejected it yet), or we've had a bounce (or an autoreply) and we're trying to deliver that message to the envelope sender address, except that the sender in question doesn't have anything there to accept (or reject) it (which is very common behavior, for all that I object strongly to it).

(Things that we recognize as spam either don't get forwarded at all or go out through a separate machine, where bounces are swallowed. At the moment users can autoreply to spam messages if they work at it, although we try to avoid it by default and we're going to do better at that in the near future.)

Our central mail machine has pretty generous timeouts for delivering external email. For regular email, most destinations get six days, and for bounces or autoreplies most destinations get three days. These durations are somewhat arbitrary, so today I found myself wondering how long our successful external deliveries took and what the longest delays for successful deliveries actually were. The results surprised me.

(By 'external deliveries' I mean deliveries to all other mail systems, both inside and outside the university. I suppose I will call these 'SMTP deliveries' now.)

Most of my analysis was done on the last roughly 30 full days of SMTP deliveries. Over this time, we did about 136,000 successful SMTP deliveries to other systems. Of these, only 31,000 took longer than one second to be delivered (from receiving the message to the remote end accepting the SMTP transaction). That's still about 23% of our messages, but it's still impressive that more than 75% of the messages were sent onward within a second. A further 15,800 completed in two seconds, while only 5,780 took longer than ten seconds; of those, 3,120 were delivered in 60 seconds or less.

Our Exim queue runner runs every five minutes, which means that a message that fails its first delivery or is queued for various reasons will normally see its first re-delivery attempt within five minutes or so. Actually there's a little bit of extra time because the queue runner may have to try other messages before it gets to yours, so let's round this up to six minutes. Only 368 successfully delivered messages took longer than six minutes to be delivered, which suggests that almost everything is delivered or re-delivered in the first queue run in which it's in the queue. At this point I'm going to summarize:

  • 63 messages delivered in between 6 minutes and 11 minutes.
  • 252 messages delivered in between 11 minutes and an hour.
  • 24 messages delivered in between one and two hours.
  • 29 messages delivered in over two hours, with the longest five being delivered in 2d 3h 22m, 2d 3h 14m, 2d 0h 11m, 1d 5h 20m, and 1d 2h 39m respectively. Those are the only five that took over a day to be delivered.

We have mail logs going back 400 days, and over that entire time only 45 messages were successfully delivered with longer queue times than our 2d 3h 22m champion from the past 30 days. On the other hand, our long timeouts are actually sort of working; 12 of those 45 messages took at least five days to be delivered. One lucky message was delivered in 6d 0h 2m, which means that it was undergoing one last delivery attempt before being expired.

Despite how little good our relatively long expiry times are doing for successfully delivered messages, we probably won't change them. They seem to do a little bit of good every so often, and our queues aren't particularly large even when we have messages camped out in them going nowhere. But if we did get a sudden growth of queued messages that were going nowhere, it's reassuring to know that we could probably cut down our expire times quite significantly without really losing anything.

spam/ExternalDeliveryDelays written at 22:23:21; Add Comment

Garbage collection and the underappreciated power of good enough

Today I read The success of Go heralds that of Rust (via). In it the author argues that Rust is poised to take over from Go because Rust is a more powerful language environment. As one of their arguments against Go, the author says:

All the remaining issues with Go stem from three design choices:

  • It’s garbage collected, rather than having compile time defined lifetimes for all its resources. This harms performance, removes useful concepts (like move semantics and destructors) and makes compile-time error checking less powerful.

[...]

Fix those things, and Go essentially becomes the language of the future. [...]

I think the author is going to be disappointed about what happens next, because in my opinion they've missed something important about how people use programming languages and what matters to people.

To put it simply, good garbage collection is good enough in practice for almost everyone and it's a lot easier than the fully correct way of carefully implementing lifetimes for all resources. And Go has good garbage collection, because the Go people considered it a critical issue and spent a lot of resources on it. This means that Go's garbage collection is not an issue in practice for most people, regardless of what the author thinks, and thus that Rust's compile time defined lifetimes are not a compelling advantage for Rust that will draw many current (or future) Go users from Go to Rust.

There are people who need the guarantees and minimal memory usage that lifetimes for memory give you, and there are programs that blow up under any particular garbage collection scheme. But most people do not have such needs or such programs. Most programs are perfectly fine with using some more memory than they strictly need and spending some more CPU time on garbage collection, and most people are fine with this because they have other priorities than making their programs run as fast and with as little resource use as possible.

(These people are not being 'lazy'. They are optimizing what matters to them in their particular environment, which may well be things like speed to deployment.)

The underappreciated power of 'good enough' is that good enough is sufficient for most people and most programs; really, it's almost always sufficient. People don't always stop at good enough, but they often do, especially when achieving better than good enough is significantly harder. This is not a new thing and we have seen this over and over again in programming languages; just look at how people have flocked to Perl, Python, PHP, Ruby, and JavaScript, despite there being more powerful alternatives.

(Good enough changes over time and over scale. What is good enough at small scale is not good enough at larger scale, but many times you never need to handle that larger scale.)

What matters to most people most of the time is not perfection, it is good enough. They may go beyond that, but you should not count on it, and especially you should not count on perfection pulling lots of people away from good enough. We have plenty of proof that it doesn't.

programming/GarbageCollectionGoodEnough written at 00:06:10; Add Comment
