Wandering Thoughts

2019-07-18

Switching Let's Encrypt clients is currently quite disruptive

On Twitter, I said:

At the moment, changing between Let's Encrypt clients appears to be about as disruptive as changing to or from Let's Encrypt and another CA. Certificate paths change, software must be uninstalled and installed, operational practices revised, and nothing can be moved over easily.

I didn't mention that you are probably going to have to get reissued certificates unless you like doing a lot of work, but that's true too. Let's Encrypt makes this easy, but some people may run into rate limits here.

This is on my mind because we're replacing acmetool with something else (almost certainly Certbot for reasons beyond the scope of this entry), so I've been thinking about the mechanics of the switch. Unfortunately there are a lot of them. Acmetool and Certbot do almost everything differently; they put their certificates in different places, they have different command lines and setup procedures, and Certbot needs special handling during some system installs that acmetool doesn't.

So to transition a machine we're going to have to install Certbot (or whatever client), install our Certbot customizations (we need at least a hook script or two), uninstall acmetool to remove its cron job and Apache configuration snippet, set up the Apache configuration snippet that Certbot needs, register a new account, request certificates, and then update the configuration of all of our TLS-using programs to the new certificate locations. Then the setup instructions for the machine need to be revised to perform the Certbot install and setup instead of the current acmetool one. We get to repeat this for every system we have that uses Let's Encrypt. All of this requires manual work; it's not something we can really automate in a sensible amount of time (at least not safely, cf).
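To make that concrete, here is a rough sketch of what the per-machine switch might look like on, say, an Ubuntu machine with the packaged Certbot. Everything here is an illustration rather than our actual procedure; the hook script, webroot path, email address, and host name are stand-ins, and the real Certbot invocations will depend on how we end up driving it:

# remove acmetool's cron job and its Apache configuration snippet
apt-get remove acmetool
rm -f /etc/apache2/conf-enabled/acmetool.conf

# install Certbot plus our customizations, then register and get certificates
apt-get install certbot
install -m 755 our-deploy-hook /etc/letsencrypt/renewal-hooks/deploy/
certbot register --agree-tos -m sysadmins@example.org
certbot certonly --webroot -w /var/www/letsencrypt -d ourhost.example.org

# finally, point Apache, Exim, Dovecot, and so on at the new certificate
# paths under /etc/letsencrypt/live/ourhost.example.org/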

(Then when we need new TLS certificates we'll have to use different commands to get them, and if we run into rate limits we'll have to use different ways to deal with the situation.)

There are multiple causes for this. One of them is simply that clients are different, with different command lines (and Certbot has some very ornate ones, which we'll almost certainly fix with a cover script that provides our standard local options). But a big one is that clients have not standardized even where and how they store data about certificates and Let's Encrypt accounts, much less anything more. As a result, for example, as far as I know there's no official way to import current certificates and accounts into Certbot, or extract them out afterward. Your Let's Encrypt client, whatever it is, is likely to be a hermetically sealed world that assumes you're starting from scratch and you'll never want to leave.

(It would be nice if future clients could use Certbot's /etc/letsencrypt directory structure for storing your TLS certificates and keys. At least then switching clients wouldn't require updating all of the TLS certificate paths in configuration files for things like Apache, Exim, and Dovecot.)
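As far as I know, that directory structure looks roughly like the following, with a placeholder host name; the 'live' paths are the ones that end up in Apache, Exim, and Dovecot configurations:

/etc/letsencrypt/
  accounts/                          account keys and registration details
  renewal/ourhost.example.org.conf   how each certificate gets renewed
  archive/ourhost.example.org/       every issued key and certificate, versioned
  live/ourhost.example.org/          symlinks to the current versions:
    cert.pem  chain.pem  fullchain.pem  privkey.pem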

sysadmin/LetsEncryptClientChangeHassle written at 23:10:45

2019-07-17

Django 1.11 has a bug that causes intermittent CSRF validation failures

Over on Twitter, I said:

People say that Django version upgrades are easy and reliable. That is why our web app, moved from 1.10 to 1.11, is now throwing CSRF errors on *a single form* but only when 'DEBUG=False' which, you know, doesn't help debug the issue.

Last week I updated our Django web application from Django 1.10.7 to 1.11.22. Today, one of its users reported that when they tried to submit a form, the application reported:

Forbidden (403)
CSRF verification failed. Request aborted.

More information is available with DEBUG=True.

At first I expected this to be a simple case of Django's CSRF browser cookie expiring or getting blocked. However, the person reproduced the issue, and then I reproduced the issue too, except that when I switched the live web app over to 'DEBUG=True', it didn't happen, and then sometimes it didn't happen even when debugging was off.

(Our application is infrequently used, so it's not surprising that this issue didn't surface (or didn't get reported) for a week.)

There are a number of reports of similar things on the Internet, for example here, here, here, and especially Django ticket #28488. Unfortunately not only was ticket 28488 theoretically fixed years ago, but it doesn't match what I see in Firefox's Network pane; there are no 404 HTTP requests served by our Django app, just regular successful ones.

(Here hints that maybe the issue involves using both sessions and CSRF cookies, which we do because sessions are a requirement for HTTP Basic Authentication, or at least they were at one point.)

The most popular workaround appears to be to stop Django from doing CSRF checks, often by setting CSRF_TRUSTED_ORIGINS to some value. My workaround for now is to revert back to Django 1.10.7; it may not be supported, but it actually works reliably for us, unlike Django 1.11. I am not sure that we will ever try 1.11 again; an intermittent failure that only happens in production is a really bad thing and not something I am very enthused about risking.
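For reference, that popular workaround is a one-line settings change along these lines (in Django 1.11 the entries are host names rather than full origins, and the domain here is a placeholder):

# settings.py: trust unsafe (POST) requests from these hosts for CSRF checking
CSRF_TRUSTED_ORIGINS = ['.example.com']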

(I'm not particularly happy about this state of affairs and I have low expectations for the Django people fixing this issue in the remaining lifetime of 1.11, since this has clearly been happening with 1.11 for some time. Since I'm not willing to run 1.11 in production to test and try things for the Django people, it doesn't seem particularly useful to even try to report a bug.)

python/Django111CSRFFailures written at 21:30:52

2019-07-16

Go's proposed try() will be used and that will change how code is written

One of the recent commotions in Go is over the Go team's proposed try() built-in error check function, which is currently planned to be part of Go 1.14 (cf). To simplify, 'a, [...] := try(f(...))' can be used to replace what you would today have to write as:

a, [...], err := f(...)
if err != nil {
   return [...], err
}

Using try() lets you drop that standard if block and makes your function clearer; much more of the code that remains is relevant and important.

Try() is attractive and will definitely be used in Go code, probably widely, and especially by people who are new to Go and writing more casual code. However, this widespread use of try() is going to change how Go code is written.

One of my firm beliefs is that most programmers are strongly driven to do what their languages make easy, and I don't think try() is any exception (I had similar thoughts about the original error handling proposal). What try() does is make returning an unchanged error the easiest thing to do. You can wrap the error from f() with more context if you work harder, but the easiest path is to not wrap it at all. This is a significant change from the current state of Go, where wrapping an error is a very easy thing that needs almost no change to the boilerplate code:

a, [...], err := f(...)
if err != nil {
   return [...], fmt.Errorf("....", ..., err)
}

In a try() world, adding that error wrapping means adding those three lines of near boilerplate back in. As a result, I think that once try() is introduced, Go code will see a significantly increased use of errors being returned unchanged and unwrapped from deep in the call stack. Sure, it's not perfect, but programmers are very good at convincing themselves that it's good enough. I'm sure that I'll do it myself.

This change isn't necessarily bad by itself, but it does directly push against the Go team's efforts to put more context into error values, an effort that actually landed changes in the forthcoming Go 1.13 (see also the draft design). It's possible to combine try() and better errors in clever ways, as shown by How to use 'try', but it's not the obvious, easy path, and I don't think it's going to be a common way to use try().
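That clever combination looks roughly like the following sketch. Since try() was never added to the language, this is illustrative pseudo-Go that won't compile, and Config and parseConfig are made up:

func ReadConfig(path string) (cfg Config, err error) {
    // wrap whatever error escapes with context, once, via a deferred helper
    defer func() {
        if err != nil {
            err = fmt.Errorf("reading config %s: %v", path, err)
        }
    }()
    f := try(os.Open(path))
    defer f.Close()
    data := try(ioutil.ReadAll(f))
    cfg = try(parseConfig(data))
    return cfg, nil
}

The deferred wrapper has to be written out once per function, which is exactly the sort of extra work that I think most people will skip.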

I am neither for nor against try() at the moment, because I think that being for or against it in isolation is asking the wrong question. The important question is how Go wants errors to work, and right now the Go team does not seem to have made up its mind. If the Go team decides that errors should frequently be wrapped on their way up the call stack, I believe that try() in its current form is a bad idea.

(If the Go team thinks that they can have both try() in its current form and people routinely wrapping errors, I think that they are fooling themselves. try() will be used in the easiest way to use it, because that's what people do.)

PS: While Go culture is currently relatively in favour of wrapping errors with additional information, I don't think that this culture will survive the temptation of try(). You can't persuade people to regularly do things the hard way for very long.

Update: The Go team dropped the try() proposal due to community objections, rendering the issue moot.

programming/GoTryWillBeUsedSimply written at 23:36:22

2019-07-15

ZFS on Linux still has annoying issues with ARC size

I'll start with my tweet:

One of the frustrating things about operating ZFS on Linux is that the ARC size is critical but ZFS's auto-tuning of it is opaque and apparently prone to malfunctions, where your ARC will mysteriously shrink drastically and then stick there.

Linux's regular filesystem disk cache is very predictable; if you do disk IO, the cache will relentlessly grow to use all of your free memory. This sometimes disconcerts people when free reports that there's very little memory actually free, but at least you're getting value from your RAM. This is so reliable and regular that we generally don't think about 'is my system going to use all of my RAM as a disk cache', because the answer is always 'yes'.

(The general filesystem cache is also called the page cache.)

This is unfortunately not the case with the ZFS ARC in ZFS on Linux (and it wasn't necessarily the case even on Solaris). ZFS has both a current size and a 'target size' for the ARC (called 'c' in ZFS statistics). When your system boots this target size starts out as the maximum allowed size for the ARC, but various events afterward can cause it to be reduced (which obviously limits the size of your ARC, since that's its purpose). In practice, this reduction in the target size is both pretty sticky and rather mysterious (as ZFS on Linux doesn't currently expose enough statistics to tell why your ARC target size shrunk in any particular case).
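(You can at least watch both numbers through /proc/spl/kstat/zfs/arcstats. A minimal check is something like:

# current ARC size, target size ('c'), and maximum ('c_max'), all in bytes
awk '$1 == "size" || $1 == "c" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats

What you can't currently get out of arcstats is why 'c' went down.)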

The net effect is that the ZFS ARC is not infrequently quite shy and hesitant about using memory, in stark contrast to Linux's normal filesystem cache. The default maximum ARC size starts out as only half of your RAM (unlike the regular filesystem cache, which will use all of it), and then it shrinks from there, sometimes very significantly, and once shrunk it only recovers slowly (if at all).

This sounds theoretical, so let me make it practical. We have six production ZFS on Linux NFS fileservers, all with 196 GB of RAM and a manually set ARC maximum size of 155 GB. At the moment their ARC sizes range from 117 GB to 145 GB; specifically, 117 GB, 127 GB, three at 132 GB, and 145 GB. On top of this, the fileserver at 117 GB of ARC is a very active one with some very popular and big filesystems (such as our mail spool, which is perennially the most active filesystem we have). Even if we're still getting a good ARC hit rate during active periods, I'm pretty sure that we could get some use out of it caching more ZFS data in RAM than it actually is.
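For completeness, the ARC maximum itself is set through the zfs_arc_max module parameter. A sketch, using the 155 GB figure in bytes (the modprobe.d file name is our own choice):

# persistent, takes effect when the zfs module is loaded
echo 'options zfs zfs_arc_max=166429982720' >/etc/modprobe.d/zfs-arc.conf
# or changed on the fly
echo 166429982720 >/sys/module/zfs/parameters/zfs_arc_max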

(We don't currently have ongoing ARC stats for our fileservers, so I don't know what the ARC hit rate is or why ARC misses happen (cf).)

Part of the problem here is not just that the ARC target size shrinks, it's that you can't tell why and there aren't really any straightforward and reliable ways to tell ZFS to reset it. And since you can't tell why the ARC target size shrunk, you can't tell if ZFS actually did have a good reason for shrinking the ARC. The auto-sizing is great when it works but very opaque when it doesn't, and you can't tell the difference.

PS: Several years ago, I saw memory competition between the ARC and the page cache on my workstation, but then the issue went away. I don't think our fileserver ARC issues are due to page cache contention, partly because the entire ext4 root filesystem on them is only around 20 GB. Even if all of it is completely cached in RAM, there's a bunch of ARC shrinkage that's left unaccounted for. Similarly, the sum of smem's PSS for all user processes is only a gigabyte or two. There just isn't very much happening on these machines.

PPS: This is with an older version of ZFS on Linux, but my office workstation with a bleeding edge ZoL doesn't do any better (in fact it does worse, with periodic drastic ARC collapses).

linux/ZFSOnLinuxARCShrinkage written at 22:49:23

2019-07-14

We're going to be separating our redundant resolving DNS servers

We have a number of OpenBSD machines in various roles; they're our firewalls, our resolving DNS servers as well as our public authoritative DNS server, and so on. For pretty much all of these, we actually have two identical servers per role in a hot spare setup, so that we can rapidly recover from various sorts of failures. For our firewalls, switching from one to another takes manual action (we have to change which one is plugged into the live network, although their firewall state is synchronized with pfsync so that a switch is low impact). For our DNS resolvers, we have both on the network and list both addresses in our /etc/resolv.conf, because this works perfectly fine with DNS servers.
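Concretely, the relevant part of each machine's /etc/resolv.conf is just a pair of nameserver lines (the addresses here are made up):

# both resolving DNS servers
nameserver 192.0.2.11
nameserver 192.0.2.12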

(All of our machines list the same resolver first, which we consider a feature for reasons beyond the scope of this entry. Our routing firewalls don't use CARP for various reasons, some of them historical, but in practice it doesn't matter, as we haven't had a firewall hardware failure. When we have switched firewalls, it's been for software reasons.)

All of this sounds great, except for the bit where I haven't mentioned that these redundant resolving DNS servers are racked next to each other (one on top of the other), plugged into the same rack PDU, and connected to the same leaf switch. We have great protection against server failure, which is what we designed for, but after we discovered that switches can wind up in weird states after power failures it no longer feels quite so sufficient, since working DNS is a crucial component of our environment (as we found out in an earlier power failure).

(Most of our paired redundant servers are racked up this way because it's the most convenient option. They're installed at the same time, generally worked on at the same time, and they need the same network connections. For firewalls, in fact, you need to switch their network cables back and forth to change which is the live one.)

So, as the title of this entry says, we're now going to be separating our resolving DNS servers, both physically and for their network connection, so that the failure of a single rack PDU or leaf switch can't take both of them offline. Unfortunately we can't put one DNS server directly on the same switch as our fileservers; the fileserver switch is a 10G-T switch with a very limited supply of ports.

(Now that I write this entry the obvious question is whether all of our fileservers should be on the same 10G-T switch. Probably it's harmless, because our entire environment will grind to a halt if even a single fileserver drops off the network.)

PS: I suspect that our resolving DNS servers are the only redundant pair that are important to separate this way, but it's clearly something we should think about. We could at least add some extra redundancy for our VPN servers by separating the pairs, and that might be important during a serious problem.

sysadmin/SeparatingOurDNSResolvers written at 22:21:22

2019-07-13

Our switches can wind up in weird states after a power failure

We've had two power failures so far this year, which is two more than we usually have. Each has been a learning experience, because both times around our overall environment failed to come back up afterward. The first time around the problem was DNS, due to a circular dependency that we still don't fully understand. The second time around, what failed was much more interesting.

Three things failed to come back up after the second power failure. The more understandable and less fatal problem was that our OpenBSD external bridging firewall needed some manual attention to deal with a fsck issue. By itself this just cut us off from the external world. Much worse, two of our core switches didn't fully boot up; instead, they stopped in their bootloader, waiting for someone to tell them to continue. Since the switches didn't boot and apply their configuration, they didn't light up their ports and none of our leaf switches could pass traffic around. The net effect was to create little isolated pools of machines, one pool per leaf switch.

(Then naturally most of these pools didn't have access to our DNS servers, so we also had DNS problems. It's always DNS. But no one would have gotten very far even with DNS, because all of our fileservers were isolated on their own little pool on a 10G-T switch.)

We've never seen this happen before (and certainly it didn't happen in prior power outages and scheduled shutdowns), so we've naturally theorized that the power failure wasn't a clean one (either during the loss of power or when it came back) and this did something unusual to the switches. It's more comforting to think that something exceptional happened than that this is a possibility that's always lurking there even in clean power loss and power return situations.

(While we shut down all of our Unix servers in advance for scheduled power shutdowns, we've traditionally left all of our switches powered on and just assumed that they'd come back cleanly afterward. We probably won't change that for the next scheduled power shutdown, but we may start explicitly checking that the core switches are working right before we start bringing servers up the next day.)

That we'd never seen this switch behavior before also complicated our recovery efforts, because we initially didn't recognize what had gone wrong with the switches or even what the problem with our network was. Even once my co-worker recognized that something was anomalous about the switches, it took a bit of time to figure out what the right step to resolve it was (in this case, to tell the switch bootloader to go ahead and boot the main OS).

(The good news is that the next time around we'll be better prepared. We have a console server that we access the switch consoles through, and it supports informational banners when you connect to a particular serial console. The consoles for the switches now have a little banner to the effect of 'if you see this prompt from the switch it's stuck in the bootloader, do the following'.)

PS: What's likely booting here is the switch's management processor. But the actual switching hardware has to be configured by the management processor before it lights up the ports and does anything, so we might as well talk about 'the switch booting up'.

sysadmin/SwitchesAndPowerGlitch written at 23:58:51

2019-07-12

Browsers can't feasibly stop web pages from talking to private (local) IP addresses

I recently read Jeff Johnson's A problem worse than Zoom (via), in which Johnson says:

[...] The major browsers I've tested — Safari, Chrome, Firefox — all allow web pages to send requests not only to localhost but also to any IP address on your Local Area Network! Can you believe that? I'm both astonished and horrified.

(Johnson mostly means things with private IP addresses, which is the only sense of 'on your local and private network' that can be usefully determined.)

This is a tempting and natural viewpoint, but unfortunately this can't be done in practice without breaking things. To understand this, I'll outline a series of approaches and then explain why they fail or cause problems.

To start with, a browser can't refuse to connect to private IP addresses unless the URL was typed in the URL bar because there are plenty of organizations that use private IP addresses for their internal web sites. Their websites may link to each other, load resources from each other, put each other in iframes, and in general do anything you don't want an outside website to do to your local network, and it is far too late to tell everyone that they can't do this all of a sudden.

It's not sufficient for a browser to just block access by explicit IP address, to stop web pages from poking URLs like 'http://192.168.10.10/...'. If you control a domain name, you can make hosts in that domain resolve to arbitrary IP addresses, including private IP addresses and 127.0.0.1. Some DNS resolvers will screen these out except for 'internal' domains where you've pre-approved them, but a browser can't assume that it's always going to be behind such a DNS resolver.

(Nor can the browser implement such a resolver itself, because it doesn't know what the valid internal domains even are.)
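As an illustration of such screening, Unbound can be told to strip private addresses out of answers except under domains you explicitly trust; the networks and domain here are placeholders:

server:
    # drop RFC 1918 and loopback addresses from upstream answers...
    private-address: 10.0.0.0/8
    private-address: 192.168.0.0/16
    private-address: 127.0.0.0/8
    # ...except for names under internal domains that have been pre-approved
    private-domain: "corp.example.com"

But again, the browser can't count on something like this being in place.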

To avoid this sort of DNS injection, let's say that the browser will only accept private IP addresses if they're the result of looking up hosts in top level domains that don't actually exist. If the browser looks up 'nasty.evil.com' and gets a private IP address, it's discarded; the browser only accepts it if it comes from 'good.nosuchtld'. Unfortunately for this idea, various organizations like to put their internal web sites into private subdomains under their normal domain name, like '<host>.corp.us.com' or '<host>.internal.whoever.net'. Among other reasons to do this, this avoids problems when your private top level domain turns into a real top level domain.

So let's use a security zone model. The browser will divide websites and URLs into 'inside' and 'outside' zones, based on what IP address the URL is loaded from (something that the browser necessarily knows at the time it fetches the contents). An 'inside' page or resource may refer to outside things and include outside links, but an outside page or resource cannot do this with inside resources; going outside is a one-way gate. This looks like it will keep internal organizational websites on private IP addresses working, no matter what DNS names they use. (Let's generously assume that the browser manages to get all of this right and there are no tricky cases that slip by.)

Unfortunately this isn't sufficient to keep places like us working. We have a 'split horizon' DNS setup, where the same DNS name resolves to different IP addresses depending on whether you're inside or outside our network perimeter, and we also have a number of public websites that actually live in private IP address space but that are NAT'd to public IPs by our external firewall. These websites are publicly accessible, get linked to by outside things, and may even have their resources loaded by outside public websites, but if you're inside our network perimeter and you look up their name, you get a private IP address and you have to use this IP address to talk to them. This is exactly an 'outside' host referring to an 'inside' resource, which would be blocked by the security zone model.

If browsers were starting from scratch today, there would probably be a lot of things done differently (hopefully more securely). But they aren't, and so we're pretty much stuck with this situation.

web/BrowsersAndLocalIPs written at 21:49:48

Reflections on almost entirely stopping using my (work) Yubikey

Several years ago (back in 2016), work got Yubikeys for a number of us for reasons beyond the scope of this entry. I got designated as the person to figure out how to work with them, and in my usual way with new shiny things, I started using my Yubikey's SSH key for lots of additional things over and above their initial purpose (and I added things to my environment to make that work well). For a long time since then, I've had a routine of plugging my Yubikey in when I got in to work, before I unlocked my screen the first time.

The last time I did that was almost exactly a week ago. At first, I just forgot to plug in the Yubikey when I got in and didn't notice all day. But after I noticed that had happened, I decided that I was more or less done with the whole thing. I'm not throwing the Yubikey away (I still need it for some things), but the days when I defaulted to authenticating SSH with the Yubikey SSH key are over. In fact, I should probably go through and take that key out of various authorized_keys files.

The direct trigger for not needing the Yubikey as much any more and walking away from it is that I used it to authenticate to our OmniOS fileservers, and we took the last one out of service a few weeks ago. But my dissatisfaction has been building for some time for an assortment of reasons. Certainly one part of it is that the big Yubikey security issue significantly dented my trust in the whole security magic of a hardware key, since using a Yubikey actually made me more vulnerable instead of less (well, theoretically more vulnerable).

Another part of it is that for whatever reason, every so often the Fedora SSH agent and the Yubikey would stop talking to each other. When this happened various things would start failing and I would have to manually reset everything, which obviously made relying on Yubikey based SSH authentication far from the transparent experience of things just working that I wanted. At some points, I adopted a ritual of locking and then un-locking my screen before I did anything that I knew required the Yubikey.
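For what it's worth, with ssh-agent's PKCS#11 support the reset generally amounts to removing and re-adding the PKCS#11 module, something like the following; the module path is an illustration and varies by system:

ssh-add -e /usr/lib64/opensc-pkcs11.so    # detach the PKCS#11 provider
ssh-add -s /usr/lib64/opensc-pkcs11.so    # re-add it, which re-prompts for the PIN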

Another surprising factor is that I had to change where I plugged in my Yubikey, and the new location made it less convenient. When I first started using my Yubikey I could plug it directly into my keyboard at the time, in a position that made it very easy to see it blinking when it was asking for me to touch it to authenticate something. However I wound up having to replace that keyboard (cf) and my new keyboard has no USB ports, so now I have to plug the Yubikey into the USB port at the edge of one of my Dell monitors. This is more awkward to do, harder to reach and touch the Yubikey's touchpad, and harder to even see it blinking. The shift in where I had to plug it in made everything about dealing with the Yubikey just a bit more annoying, and some bits much more annoying.

(I have a few places where I currently use a touch authenticated SSH key, and these days they almost always require two attempts, with a Yubikey reset in the middle because one of the reliable ways to have the SSH agent stop talking to the Yubikey is not to complete the touch authentication stuff in time. You can imagine how enthused I am about this.)

On the whole, the most important factor has been that using the Yubikey for anything has increasingly felt like a series of hassles. I think Yubikeys are still reasonably secure (although I'm less confident and trusting of them than I used to be), but I'm no longer interested in dealing with the problems of using one unless I absolutely have to. Nifty shiny things are nice when they work transparently; they are not so nice when they don't, and it has surprised me how little it took to tip me over that particular edge.

(It's also surprised me how much happier I feel after having made the decision and carrying it out. There's all sorts of things I don't have to do and deal with and worry about any more, at least until the next occasion when I really need the Yubikey for something.)

sysadmin/YubikeyMostlyDropped written at 01:27:37

2019-07-10

I brought our Django app up using Python 3 and it mostly just worked

I have been worrying for some time about the need to eventually get our Django web application running under Python 3; most recently I wrote about being realistic about our future plans, which mostly amounted to not doing anything until we had to. Well, guess what happened since then.

For reasons beyond the scope of this entry, last Friday I ended up working on moving our app from Django 1.10.7 to 1.11.x, which was enlivened by the usual problem. After I had it working under 1.11.22, I decided to try running it (in development mode, not in production) using Python 3 instead of Python 2, since Django 1.11.22 is itself fully compatible with Python 3. To my surprise, it took only a little bit of cleanup and additional changes beyond basic modernization to get it running, and the result is so far fully compatible with Python 2 as well (I committed the changes as part of the 1.11 move, and since Monday they're running in production).

I don't think this is particularly due to anything I've done in our app's code; instead, I think it's mostly due to the work that Django has done to make everything work more or less transparently. As the intermediate layer between your app and the web (and the database), Django is already the place that has to worry about character set conversion issues, so it can spare you from most of those. And generally that's the big difference between Python 2 and Python 3.

(The other difference is the print statement versus 'print()', but you can make Python 2.7 work in the same way as Python 3 with 'from __future__ import print_function', which is what I did.)

I haven't thoroughly tested our web app under Python 3, of course, but I did test a number of the basics and everything looks good. I'm fairly confident that there are no major issues left, only relatively small corner cases (and then the lurking issue of how well the Python 3 version of mod_wsgi works and if there are any traps there). I'm still planning to keep us on Python 2 and Django 1.11 through at least the end of this year, but if we needed to I could probably switch over to a current Django and Python 3 with not very much additional work (and most of the work would be updating to a new version of Django).

There was one interesting and amusing change I had to make, which is that I had to add a bunch of __str__ methods to various Django models that previously only had __unicode__ methods. When building HTML for things like form <select> fields, Django string-izes the names of model instances to determine what to put in here, but in Python 2 it actually generates the Unicode version and so ends up invoking __unicode__, while in Python 3 str is Unicode already and so Django was using __str__, which didn't exist. This is an interesting little incompatibility.
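In code that has to run under both versions, the straightforward fix is to give models both methods, either by aliasing one to the other or with Django's python_2_unicode_compatible decorator. A sketch with a made-up model:

from django.db import models

class Account(models.Model):
    name = models.CharField(max_length=100)

    def __unicode__(self):
        return self.name
    # Python 3's str() looks for __str__, so make it the same thing
    __str__ = __unicode__

(django.utils.encoding.python_2_unicode_compatible handles the bytes versus unicode subtleties on Python 2 for you, if you'd rather use it.)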

Sidebar: The specific changes I needed to make

I'm going to write these down partly because I want a coherent record, and partly because some of them are interesting.

  • When generating a random key to embed in a URL, read from /dev/urandom using binary mode instead of text mode and switch from an ad-hoc implementation of base64.urlsafe_b64encode to using the real thing. I don't know why I didn't use the base64 module in the first place; perhaps I just didn't look for it, since I already knew about Python 2's special purpose encodings.

  • Add __str__ methods to various Django model classes that previously only had __unicode__ ones.

  • Switch from print statements to print() as a function in some administrative tools the app has. The main app code doesn't use print, but some of the administrative commands report diagnostics and so on.

  • Fix mismatched tabs versus spaces indentation, which snuck in because my usual editor for Python used to use all-tabs and now uses all-spaces. At some point I should mass-convert all of the existing code files to use all-spaces, perhaps with four-space indentation.

  • Change a bunch of old style exception syntax, 'except Thing, e:', to 'except Thing as e:'. I wound up finding all of these with grep.

  • Fix one instance of sorting a dictionary's .keys(), since in Python 3 this returns a view object rather than a list that can be sorted in place.

Many of these changes were good ideas in general, and none of them are ones that I find objectionable. Certainly switching to just using base64.urlsafe_b64encode makes the code better (and it makes me feel silly for not using it to start with).
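For the curious, the first change boils down to something like this sketch; the function name and key length are made up:

import os
import base64

def gen_url_key(nbytes=18):
    # os.urandom() hands back bytes under both Python 2 and 3, and
    # urlsafe_b64encode() turns them into URL-safe ASCII for us.
    return base64.urlsafe_b64encode(os.urandom(nbytes)).decode('ascii')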

python/DjangoAppPython3Surprise written at 21:46:22

2019-07-09

Systemd services that always restart should probably set a restart delay too

Ubuntu 18.04's package of the Prometheus host agent comes with a systemd .service unit that is set with 'Restart=always' (something that comes from the Debian package, cf). This is a perfectly sensible setting for the host agent for a metrics and monitoring system, because if you have it set to run at all, you almost always want it to be running all the time if at all possible. When we set up a local version of the host agent, I started with the Ubuntu .service file and kept this setting.

In practice, pretty much the only reason the Prometheus host agent aborts and exits on our machines is that the machine has run out of memory and everything is failing. When this happens with 'Restart=always' and the default systemd settings, systemd will wait its default of 100 milliseconds (the normal DefaultRestartSec value) and then try to restart the host agent again. Since the out of memory condition has probably not gone away in 100 ms, this restart is almost certain to fail. Systemd will repeat this until the restart has failed five times in ten seconds, and then, well, let me quote the documentation:

[...] Note that units which are configured for Restart= and which reach the start limit are not attempted to be restarted anymore; [...]

With the default restart interval, this takes approximately half a second. Our systems do not clear up out of memory situations in half a second, and so the net result was that when machines ran out of memory sufficiently badly that the host agent died, it was dead until we restarted it manually.

(I can't blame systemd for this, because it's doing exactly what we told it to do. It is just that what we told it to do isn't the right thing under the circumstances.)

The ideal thing to do would be to try restarting once or twice very rapidly, just in case the host agent died due to an internal error, and then to back off to much slower restarts, say once every 30 to 60 seconds, as we wait out the out of memory situation that is the most likely cause of problems. Unfortunately systemd only offers a single restart delay, so the necessary setting is the slower one; in the unlikely event that we trigger an internal error, we'll accept that the host agent has a delay before it comes back. As a result we've now revised our .service file to have 'RestartSec=50s' as well as 'Restart=always'.
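Putting it together, the relevant portion of the revised .service file looks roughly like this (the ExecStart path is a stand-in for wherever your copy of the host agent lives):

[Service]
ExecStart=/usr/local/sbin/node_exporter
Restart=always
# wait out transient problems such as memory exhaustion instead of burning
# through the start limit in half a second
RestartSec=50s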

(We don't need to disable StartLimitBurst's rate limiting, because systemd will never try to restart the host agent more than once in any ten second period.)

There are probably situations where the dominant reason for a service failing and needing to be restarted is an internal error, in which case an almost immediate restart minimizes downtime and is the right thing to do. But if that's not the case, then you definitely want to have enough of a delay to let the overall situation change. Otherwise, you might as well not set a 'Restart=' at all, because it's probably not going to work and will just run you into the (re)start limit.

My personal feeling is that most of the time, your services are not going to be falling over because of their own bugs, and as a result you should almost always set a RestartSec delay and consider what sort of (extended) restart limit you want to set, if any.

Sidebar: The other hazard of always restarting with a low delay

The other big reason for a service to fail to start is if you have an error in a configuration file or the command line (eg a bad argument or option). In this case, restarting in general does you no good (since the situation will only be cleared up with manual attention and changes), and immediately restarting will flood the system with futile restart attempts until systemd hits the rate limits and shuts things off.

It would be handy to be able to tell systemd that it should not restart the service if it immediately fails during a 'systemctl start', or at least to tell it that the failure of an ExecStartPre program should not trigger the restarting, only a failure of the main ExecStart program (since ExecStartPre is sometimes used to check configuration files and so on). Possibly systemd already behaves this way, but if so it's not documented.

linux/SystemdRestartUseDelay written at 23:45:39
