Wandering Thoughts

2018-12-03

Linux disk IO stats in Prometheus

Suppose, not hypothetically, that you have a shiny new Prometheus setup and you are running the Prometheus host agent on your Linux machines, some of which have disks whose IO statistics might actually matter (for example, we once had a Linux Amanda backup server with a very slow disk). The Prometheus host agent provides a collection of disk IO stats, but it is not entirely clear where they come from and what they mean.

The good news is that the Prometheus host agent gives you the raw Linux kernel disk statistics and they're essentially unaltered. You get statistics only for whole disks, not partitions, but the host agent includes stats for software RAID devices and other disk level things. I've written about what these stats cover in my entry on what stats you get and also on what information you can calculate from them, which includes an aside on disk stats for software RAID devices and LVM devices on modern Linux kernels.

(The current version of the host agent makes two alterations to the stats; it converts the time-based ones from milliseconds into seconds, and it converts the sector-based ones into bytes using the standard Linux kernel thing where one sector is 512 bytes. Both of these are much more convenient in a Prometheus environment.)

The mapping between Linux kernel statistics and Prometheus metrics names is fortunately straightforward, and it is easy to follow because the host agent's help text for all of the stats is pretty much their description in the kernel's Documentation/iostats.txt. There are a few changes, but they are pretty obvious. For example, the kernel description of field 10 is '# of milliseconds spent doing I/Os'; the host agent's corresponding description of node_disk_io_time_seconds_total is 'Total seconds spent doing I/Os'.

(In the current host agent the help text is somewhat inconsistent here; for instance, some of it talks about 'milliseconds'. This will probably be fixed in the future.)

Since Prometheus exposes all of the Linux kernel disk stats, you can generate all of the derived stats that I discussed in my entry on this. Actually calculating them will involve a lot of use of rate() or irate(); for pretty much every stat, you'll have to start out calculations by taking the rate() of it and then performing the relevant calculations from there. This is a bit annoying for several reasons, but Prometheus is Prometheus.
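
To give a rough illustration of the pattern (assuming current metric names such as node_disk_reads_completed_total and node_disk_read_time_seconds_total, which have changed between host agent versions), the PromQL you wind up writing looks something like this:

# average read IOs per second, per device
rate(node_disk_reads_completed_total[5m])

# average read completion time in seconds (queue wait plus service time)
rate(node_disk_read_time_seconds_total[5m])
  / rate(node_disk_reads_completed_total[5m])

# device utilization, as a 0 to 1 fraction
rate(node_disk_io_time_seconds_total[5m])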

There are two limitations of these stats. First, as always, they're averages with everything that that implies (see here and here). Second, they're going to be averages over appreciable periods of time. At the limit, you're unlikely to be pulling stats from the Prometheus host agent more than once every 10 or 15 seconds, and sometimes less frequently than that. Very short high activity bursts will thus get smeared out into lower averages over your 10 or 15 or 30 second sample resolution. To get a second by second view that captures very short events, you're going to need to sit there on the server with a tool like mxiostat, or iostat if you must.

You can get around at least the issue of averages with something like the Cloudflare eBPF exporter (see also Cloudflare's blog post on it). If other burst events matter to you, you could probably build some infrastructure that would capture them in histograms in a similar way.

(Histograms that capture down to single exceptional events are really the way to go if you care a lot about this, because even a second by second view is still an average over that second. However, you're a lot more likely to see things in a second by second view than in a 15, 30, or 60 second one, assuming that you can spot the exceptions as they flow by.)

PrometheusLinuxDiskIOStats written at 01:00:15

2018-12-02

Checking to see if a process is alive (on Linux)

For a long time I've used the traditional Unix way of checking to see if a given process was (still) alive, which is sending it the special signal of 0 with 'kill -0 <PID>'. If you're root, this only fails if the process doesn't exist; if you're not root, this can also fail because you lack the required permissions and sorting that case out is up to you.

(For the kill command, you'll need to scan the error message. If you can directly use the system call, you want to check for the difference between an EPERM and an ESRCH error.)
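
As a minimal shell sketch of the 'scan the error message' approach (the exact error wording varies between shells and kill implementations, so the grep pattern is an assumption, not a guarantee):

if err=$(kill -0 "$PID" 2>&1); then
    : # $PID exists and we're allowed to signal it
elif echo "$err" | grep -q 'Operation not permitted'; then
    : # $PID exists but belongs to someone else
else
    : # $PID (almost certainly) doesn't exist
fi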

This is an okay method but it has various drawbacks in shell scripts (even when you're root). Today it struck me that there is another alternative on Linux; you can just check to see if /proc/<PID> exists. In a shell script this is potentially a lot more convenient, because it's very simple:

if [ -e /proc/$PID ]; then
   ....
fi

It's easy to invert, too, so that you take action when the PID doesn't exist (just use '! -e /proc/$PID').

I was going to say that this had a difference from the kill case that might be either an advantage or a fatal drawback, but then I decided to test Linux's behavior and I got a surprise (which maybe shouldn't have been a surprise). Linux threads within a process have their own PIDs, which I knew, and these PIDs also show up in /proc, which I hadn't known. Well, they sort of show up.

Specifically, the /proc/<PID> directories for threads are present in /proc if you directly access them, for example by doing '[ -e /proc/NNNN ]'. However, they are not visible if you just get a directory listing of /proc (including with such things as 'echo *' in your shell); in a directory listing, only full processes are visible. This is one way of telling whether you have a process or a thread. Another way is that a thread's /proc/<PID>/status file has a different Tgid than its Pid (for a discussion of this, see the manual page for proc(5)).
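
A quick shell sketch of the second check, assuming the usual field layout of /proc/<PID>/status:

tgid=$(awk '/^Tgid:/ {print $2}' "/proc/$PID/status" 2>/dev/null)
if [ -z "$tgid" ]; then
    echo "$PID does not exist"
elif [ "$tgid" = "$PID" ]; then
    echo "$PID is a full process"
else
    echo "$PID is a thread of process $tgid"
fi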

(Whether excluding threads is a feature or a serious limitation depends on your usage case. If you know that the PID you're checking should be a main process PID, not a thread, then only seeing full processes will help you avoid false positives from things like PID rollover. As I've encountered, rapid PID rollover can definitely happen to you in default Linux configurations.)

PS: FreeBSD and Illumos (and so OmniOS and other Illumos derivatives) also have a /proc with PIDs visible in it, so this approach is at least somewhat portable. OpenBSD doesn't have a /proc (Wikipedia says it was dropped in 5.7), and I haven't looked at NetBSD or Dragonfly BSD (I don't have either handy the way I have the others).

CheckForPIDViaProc written at 02:55:45

2018-11-23

Some Linux disk IO stats you can calculate from kernel information

I've written in the (distant) past about what disk IO stats you get from the Linux kernel and what per-partition stats you get. Now I'm interested in what additional stats you can calculate from these, especially stats that aren't entirely obvious.

These days, the kernel's Documentation/iostats.txt more or less fully documents all of the raw stats that the kernel makes available (it's even recently been updated to add some very recent new stats that are only available in 4.18+), and it's worth reading in general. However, for the rest of this entry I'm mostly going to use my names for the fields from my original entry for lack of better ones (iostats.txt describes the fields but doesn't give them short names).

The total number of reads or writes submitted to the kernel block layer is rio plus rmerge (or wio plus wmerge), since rio includes only reads that go to the disk. As it was more than a decade ago, the count of merged IOs (rmerge and wmerge) is increased when the IO is submitted, while the count of completed IOs only increases at the end of things. However, these days the count of sectors read and written is for completed IO, not submitted IO. If you ignore merged IOs (and I usually do), you can basically ignore the difference between completed IOs and submitted but not yet completed IOs.

(A discussion of the issues here is beyond the scope of this entry.)

The per-second rate of rio and wio over time is the average number of IOs per second. The per-second rate of rsect and wsect is similarly the average bandwidth per second (well, once you convert it from 512-byte 'sectors' to bytes or KB or whatever unit you like). As usual, averages can conceal wild variations and wild outliers. Bytes over time divided by requests over time gives you the average request size, either for a particular type of IO or in general.

(When using these stats, always remember that averages mislead (in several different ways), and the larger the time interval the more they mislead. Unfortunately the Linux kernel doesn't directly provide better stats.)

The amount of time spent reading or writing (ruse or wuse) divided by the number of read or write IOs gives you the average IO completion time. This is not the device service time; it's the total time from initial submission, including waiting in kernel queues to be dispatched to the device. If you want the average wait time across all IOs, you should compute this as '(ruse + wuse) / (rio + wio)', because you want to weight the average wait time for each type of IO by how many of them there were.

(In other words, a lot of very fast reads and a few slow writes should give you a still very low average wait time.)

The amount of time that there has been at least one IO in flight (my use) gives you the device utilization over your time period; if you're generating per-second figures, it should never exceed 1, but it may be less. The average queue size is 'weighted milliseconds spent doing IOs' (my aveq) divided by use.

Total bandwidth (rsect + wsect) divided by use gives you what I will call a 'burst bandwidth' figure. Writing 100 MBytes to a SSD and to a HD over the course of a second gives you the same per-second IO rate for both; however, if the HD took the full second to complete the write (with a use of the full second) while the SSD took a quarter of a second (with a use of .25 second), the SSD was actually writing at 400 MBytes/sec instead of the HD's 100 MBytes/sec. Under most situations this is going to be close to the actual device bandwidth, because you won't have IOs in the queue without at least one being dispatched to the drive itself.
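
To make this concrete, here's a rough sketch that computes several of these derived figures for a single disk from two samples of /proc/diskstats. It assumes the traditional 11-field layout (ignoring the new 4.18+ fields) and takes a whole-disk name like 'sda' as its argument; it's an illustration of the arithmetic, not a polished tool:

#!/bin/sh
# usage: diskcalc DEV [INTERVAL]
dev=$1
interval=${2:-1}
s1=$(grep " $dev " /proc/diskstats)
sleep "$interval"
s2=$(grep " $dev " /proc/diskstats)
printf '%s\n%s\n' "$s1" "$s2" | awk -v t="$interval" '
# fields: $4 rio, $6 rsect, $7 ruse, $8 wio, $10 wsect, $11 wuse,
#         $13 use, $14 aveq (times in ms, sectors of 512 bytes)
NR == 1 { for (i = 4; i <= 14; i++) old[i] = $i; next }
{
    rio = $4 - old[4];   rsect = $6 - old[6];   ruse = $7 - old[7]
    wio = $8 - old[8];   wsect = $10 - old[10]; wuse = $11 - old[11]
    use = $13 - old[13]; aveq = $14 - old[14]
    printf "IOs/sec: %.1f   KB/sec: %.1f\n", (rio + wio) / t, (rsect + wsect) / 2 / t
    if (rio + wio > 0)
        printf "average wait: %.2f ms\n", (ruse + wuse) / (rio + wio)
    printf "utilization: %.1f%%\n", use / (t * 1000) * 100
    if (use > 0) {
        printf "average queue size: %.2f\n", aveq / use
        printf "burst bandwidth: %.1f KB/sec\n", (rsect + wsect) / 2 / (use / 1000)
    }
}'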

There are probably additional stats you can calculate that haven't occurred to me yet, partly because I don't think about this very often (it certainly feels like there should be additional clever derived stats). I only clued in to the potential importance of my 'burst bandwidth' figure today, in fact, as I was thinking about what else you could do with use.

If you want latency histograms for your IO and other more advanced things, you can turn to eBPF, for instance through the BCC tools. See also eg Brendan Gregg on eBPF. But all of that is beyond the scope of this entry, partly because I've done basically nothing with these eBPF tools yet for various reasons.

Sidebar: Disk IO stats for software RAID devices and LVM

The only non-zero stats that software RAID devices provide (at least for mirrors and stripes) are read and write IOs completed and the number of sectors read and written. Unfortunately we don't get any sort of time or utilization information for software RAID devices.

LVM devices (which on my Fedora systems show up in /proc/diskstats as 'dm-N') do appear to provide time information; I see non-zero values for my ruse, wuse, use, and aveq fields. I don't know how accurate they are and I haven't attempted to use any of my tools to see.

DiskIOStatsIII written at 23:30:56

Qualified praise for the Linux ss program

For a long time now, I've reached for a combination of netstat and 'lsof -n -i' whenever I wanted to know things like who was talking to what on a machine. Mostly I've tended to use lsof, even though it's slower, because I find netstat to be vaguely annoying (and I can never remember the exact options I want without checking the manpage yet again). Recently I've started to use another program for this, ss, which is part of the iproute2 suite (also Wikipedia).

The advantage of ss is that it will give you a bunch of useful information, quite compactly, and it will do this very fast and without fuss and bother. Do you want to know every listening TCP socket and what program or programs are behind it? Then you want 'ss -tlp'. The output is pretty parseable, which makes it easy to feed to programs, and a fair bit of information is available without root privileges. You can also have ss filter the output so that you don't have to, or at least so that you don't have to do as much.

In addition, some of the information that ss will give you is relatively hard to get anywhere else (or at least easily) and can be crucial to understanding network issues. For example, 'ss -i' will show you the PMTU and MSS of TCP connections, which can be very useful for some sorts of network issues.

One recent case where I reached for ss was I wanted to get a list of connections to the local machine's port 25 and port 587, so I could generate metrics information for how many SMTP connections our mail servers were seeing. In ss, the basic command for this is:

ss -t state established '( sport = :25 or sport = :587 )'

(Tracking this information was useful to establish that we really were seeing a blizzard of would-be spammers connecting to our external MX gateway and clogging up its available SMTP connections.)
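
If all you want is a connection count to turn into a metric, something like this will do (assuming a reasonably recent ss with the -H option to suppress the header line; on older versions you'd strip the header yourself):

ss -H -t state established '( sport = :25 or sport = :587 )' | wc -l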

Unfortunately, this is where the qualifications come in. As you can see here, ss has a filtering language, and a reasonably capable one at that. Unfortunately, this filtering language is rather underdocumented (much like many things in iproute2). Using ss without any real documentation on its filtering language is kind of frustrating, even when I'm not trying to write a filter expression. There is probably a bunch of power that I could use, except it's on the other side of a glass wall and I can't touch it. In theory there's documentation somewhere; in practice I'm left reading other people's articles like this and this copy of the original documentation.

(This is my big lament about ss.)

As you'll see if you play around with it, ss also has a weird output format for all of its extended information. I'm sure it makes sense to its authors, and you can extract it with determination ('egrep -o' will help), but it isn't the easiest thing in the world to deal with. It's also not the most readable thing in the world if you're using ss interactively. It helps a bit to have a very wide terminal window.
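
For example, here's a crude way to fish out just the PMTU of each established connection; treat the exact token name as an illustration, since what ss -i reports varies with the kernel and socket type:

ss -tin state established | egrep -o 'pmtu:[^ ]+'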

Despite my gripes about it, I've wound up finding ss an increasingly important tool that I reach for more and more. Partly this is for all of the information it can tell me, partly it's for the filtering capabilities, and partly it's for its speed and low impact on the system.

(Also, unlike lsof, it doesn't complain about random things every so often.)

(ss was mentioned in passing back when I wrote about how there's real reasons for Linux to replace ifconfig and netstat. I don't think of ss as a replacement for netstat so much as something that effectively obsoletes it; ss is just better, even in its relatively scantily documented and awkward state. With that said, modern Linux netstat actually shows more information than I was expecting, and in some ways it's in a more convenient and readable form than ss provides. I'm probably still going to stick with ss for various reasons.)

SsQualifiedPraise written at 00:30:49

2018-11-18

Old zombie Linux distribution versions aren't really doing you any favours

One bit of recent news in the Linux distribution world is Mark Shuttleworth's recent announcement that Ubuntu 18.04 LTS will get ten years of support (Slashdot, ServerWatch). As it happens, I have some views on this. First, before people start getting too excited, note that Shuttleworth hasn't said anything about what form this support will take, especially whether or not you'll have to pay for it. My own guess is that Canonical will be expanding their current paid Ubuntu 12.04 ESM (Extended Security Maintenance) to also cover 18.04 and apparently 16.04. This wouldn't be terribly surprising, since back in September they expanded it to cover 14.04.

More broadly, I've come to feel that keeping on running really old versions of distributions is generally not helping you, even if they have support. After a certain point, old distribution versions are basically zombies; they shamble on and they sort of look alive, but they are not because time has moved past them. Their software versions are out of date and increasingly will lack features that you actively want, and even if you try to build your own versions of things, a steadily increasing number of programs just won't build on the versions of libraries, kernels, and so on that those old Linuxes have. Upgrading from very old versions is also an increasing problem as time goes by; often, so much has changed that what you do is less upgrading and more rebuilding the same functionality from scratch on a new, more modern base.

(Here I'm not just talking about the actual systems; I'm talking about things like configuration files for programs. You can often carry configuration files forward with only modest changes even if you reinstall systems from scratch, but that only works so far.)

You can run such zombie systems for a long time, but they have to essentially be closed and frozen appliances, where absolutely nothing on them needs to change. This is very hard to do on systems that are exposed directly or indirectly to the Internet, because Internet software decays and must be actively maintained. Even if you don't have systems that are exposed this way, you may find that you end up wanting to put new software on them, for example a metrics and monitoring system, except that your old systems are too old for that to work well (or perhaps at all).

(Beyond software you want to put on such old systems, you're also missing out on an increasing number of new features of things. Some of the time these are features that you could actively use and that will improve your life when you can finally deploy them and use them. I know it sounds crazy, but software on Linux really does improve over time in a lot of cases.)

Having run and used a certain number of ancient systems in my time (and we're running some now), my view is that I now want to avoid doing it if I can. I don't know what the exact boundary is for Linux today (and anyway it varies depending on what you're using the system for), but I think getting towards ten years is definitely too long. An eight year old Linux system is going to be painfully out of date on pretty much everything, and no one is going to be sympathetic about it.

So, even if Ubuntu 18.04 had ten years of free support (or at least security updates), I'm pretty certain that neither you nor we really want to be taking advantage of that. At least not for those full ten years. Periodic Linux distribution version updates may be somewhat of a pain at the time, but overall they're good for us.

ZombieDistroVersions written at 22:43:43

2018-11-14

Linux iptables compared to OpenBSD PF (through a real example)

One of the things I've been asked about in response to our attachment to OpenBSD PF is how OpenBSD PF differs from the Linux alternatives, especially iptables. I don't have a good, satisfying answer to that, so today I'm going to cheat by showing a realistic case written out in both and then discussing some of the less obvious differences between the two.

To implement our custom NFS mount authorization, we need to block access to a collection of NFS-related ports (for both TCP and UDP) unless the source machine is either on our own server subnet or is in a dynamically maintained table of machines that have authenticated to us through our special system. In OpenBSD PF syntax, this looks something like the following:

table <NFSAUTHED> persist
pass in quick on $IFACE inet proto { tcp, udp } \
    from { 128.100.3.0/24, <NFSAUTHED> } to any \
    port { 111, 2049, 10100 }

block in quick log on $IFACE inet proto { tcp, udp } \
    from any to any \
    port { 111, 2049, 10100 }

(For obvious reasons I haven't actually tested this, although I think it would work. There is a cultural assumption embedded here in the form of $IFACE; it's usual, at least here, to have pf.conf know what the system's interfaces are called.)

What I usually say about Linux iptables is that it's an assembly language for creating firewalls. As the equivalent of an assembly language it's very flexible, but it's also rather verbose and there are almost always a bunch of different ways to do something. With that said, here's the actual iptables ruleset we're using (and I know it works). Because we're using ipsets, our ruleset has to be represented as a series of commands (as far as I know), since we have to run commands to create the ipsets before we create the iptables rules using them.

ipset create nfsports bitmap:port range 0-12000 counters
ipset add nfsports 111
ipset add nfsports 2049
ipset add nfsports 10100

ipset create nfsauthed hash:ip counters

# Accept from localhost as a precaution
iptables -A INPUT -p tcp -i lo -m set --match-set nfsports dst -j ACCEPT
iptables -A INPUT -p udp -i lo -m set --match-set nfsports dst -j ACCEPT

# Accept our server network.
iptables -A INPUT -p tcp -s 128.100.3.0/24 -m set --match-set nfsports dst -j ACCEPT
iptables -A INPUT -p udp -s 128.100.3.0/24 -m set --match-set nfsports dst -j ACCEPT

# Accept authorized machines.
iptables -A INPUT -p tcp -m set --match-set nfsports dst -m set --match-set nfsauthed src -j ACCEPT
iptables -A INPUT -p udp -m set --match-set nfsports dst -m set --match-set nfsauthed src -j ACCEPT

# Reject everyone else, with logging
iptables -A INPUT -p tcp -m set --match-set nfsports dst -j NFLOG --nflog-prefix "deny"
iptables -A INPUT -p udp -m set --match-set nfsports dst -j NFLOG --nflog-prefix "deny"
iptables -A INPUT -p tcp -m set --match-set nfsports dst -j REJECT
iptables -A INPUT -p udp -m set --match-set nfsports dst -j REJECT

(The choice of explicitly allowing loopback versus guarding everything else with '-i $IFACE' is partly cultural and partly pragmatic; we would have to write the latter in a lot more places than on OpenBSD, and interface names are often rather less predictable.)

Some of the rule count difference here is illusory. For example, if we applied the OpenBSD pf.conf stanza on an OpenBSD machine and then dumped the actual resulting rules with 'pfctl -s rules', we'd discover that pfctl had expanded each of those '{ ... }' groups of things out into separate rules, one for each option. If I'm doing the math right, that means our two lines of pf.conf would turn into 18 actual rules (two protocols times two sources times three ports for the pass rule, plus two protocols times three ports for the block rule), which is more than the Linux version has. However, generally what matters is how many rules people need to write, not what the rules expand to when implemented at a low level. Here OpenBSD PF gives us a number of tools to write compact rules that set out everything we're doing in one place (which is important for coming back later and understanding what your rules are doing).

(OpenBSD tables specifically only contain IP addresses and networks, so as far as I know we can't create an equivalent of the nfsports ipset we're matching port numbers against. This does mean that in OpenBSD, we couldn't change the ports we were matching against on the fly; if they changed, we'd have to update the pf.conf rules and reload them.)

A larger difference is that these rules don't actually mean quite the same thing, because Linux iptables are normally stateless while OpenBSD is stateful by default and in customary use. Here this actually would make a difference in how we want to operate the overall authentication system. In Linux removing something from the nfsauthed ipset immediately removes all its access, because the rules are checked for every packet, while in OpenBSD we would also have to kill off the removed IP's state table entries (if any) with 'pfctl -k', because once a TCP connection is in the OpenBSD state tables it completely bypasses pf.conf rules.
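
To make the operational difference concrete, here's roughly what de-authorizing a machine looks like on each side (the IP address is made up for illustration):

# Linux: iptables checks the rules for every packet, so this is enough
ipset del nfsauthed 128.100.10.25

# OpenBSD: remove it from the table and then kill its existing states
pfctl -t NFSAUTHED -T delete 128.100.10.25
pfctl -k 128.100.10.25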

Our OpenBSD 'block in quick ...' rule is also not necessarily natural in PF because of another cultural difference. OpenBSD PF configurations are normally written with denying packets being the default, with a default catch-all 'block in log all' somewhere in your pf.conf; in a default-deny environment, a specific block rule such as this is unnecessary. But with Linux, you normally leave iptables accepting packets by default on an otherwise un-firewalled machine such as we have here, which means that we have to write out those four lines of 'log and reject' iptables stuff at the end. In a real OpenBSD PF configuration this would probably cause us to write more pf.conf rules to specify what other inbound traffic we wanted to allow.

PS: Linux ipsets are now well over five years old so they're pretty universally available in distribution versions you actually want to run, but in the past you could easily find Linux systems that didn't have them available (they weren't in RHEL/CentOS 5, for example). Without ipsets, this setup would be far more difficult. OpenBSD PF has basically always had tables.

IptablesVsOpenBSDPF written at 22:24:29

2018-11-04

My view on Debian versus Ubuntu LTS for us today

When we started with Ubuntu in 2006, Debian was mired in problems such as slow releases and outdated software that drove people to run 'testing' instead of 'stable'. Ubuntu essentially offered 'Debian with the problems fixed'; Ubuntu LTS had regularly scheduled releases, offered a wide package selection of reasonably current software, and gave us a long support period of five years. This was very attractive to us and made Ubuntu the dominant Linux here ever since (cf). However, we don't and never really have entirely liked it. We weren't enthused from very early on, and we soon came to understand various limitations of Ubuntu such as them not really fixing bugs. Recently we've come to understand that a large portion of Ubuntu's packages are effectively abandonware, cloned once from Debian and then never updated (making bug reports to Ubuntu useless).

(It's not just packages in Ubuntu's 'universe' repo that are abandonware, although being in 'universe' basically guarantees it; our experience is that packages from 'main' don't see many bug fixes either. And 'universe' is much of what's important to us.)

All by itself this has started making Debian look more attractive to me. Debian doesn't have the reliable release schedule of Ubuntu but these days it's managing roughly every two years (which is the same as Ubuntu LTS), and we're not locked to upgrading only at a specific time of year the way some people are. Our user-facing machines are upgraded every Ubuntu LTS release, so they're already not taking advantage of the long LTS support cycle, and we would likely get better support for packages in practice. And since Debian and Ubuntu are already so close, switching probably wouldn't be too hard. But things are actually better for Debian than this, because since I looked last in 2014 Debian has gained some degree of relatively official long term support (and even extra extended LTS for Debian 7).

(Part of the extended support is driven by people paying for it, which is both good in general and means that it might be possible for us to contribute if we started to use Debian.)

As a result, I now have a much more positive view of Debian and I've come around to thinking that it'd probably be a perfectly viable alternative to Ubuntu LTS for us, and in some ways likely a superior one (although we wouldn't know for sure until we actually tried to use it over the full life cycle of a machine).

Will we actually switch? Probably not, unfortunately. Debian being just as good and maybe a bit better doesn't overcome the fact that we're already using Ubuntu and it hasn't blown up in our faces yet. Perhaps I'll do an experimental install of the next Debian when it comes out (hopefully in mid 2019) to see what it's like and how easy it would be to integrate into our environment.

(This entry was prompted by an exchange on Twitter, except that it turns out I was wrong about the Debian support duration; I found out about Debian LTS support as a result of doing research for this entry.)

DebianVsUbuntuForUs written at 01:55:59

2018-11-01

In Linux, hitting a strict overcommit limit doesn't trigger the OOM killer

By now, we're kind of used to our Linux machines running out of memory, because people or runaway processes periodically do it (to our primary login server, to our primary web server, or sometimes other service machines). It has a familiar litany of symptoms, starting with massive delays and failures in things like SSH logins and ending with Linux's OOM killer activating to terminate some heavy processes. If we're lucky the OOM killer will get the big process right away; if we're not, it will first pick off a few peripheral ones before getting the big one.

However, every so often recently we've been having some out of memory situations on some of our machines that didn't look like this. We knew the machines had run out of memory because log messages told us:

systemd-networkd[828]: eno1: Failed to save LLDP data to /run/systemd/netif/lldp/2: No space left on device
[...]
systemd[1]: user@NNN.service: Failed to fork: Cannot allocate memory
[...]
sshd[29449]: fatal: fork of unprivileged child failed
sshd[936]: error: fork: Cannot allocate memory

That's all pretty definitely telling us about a memory failure (note that /run is a tmpfs filesystem, and so 'out of space on device' means 'out of memory'). What we didn't see was any indication that the OOM killer had been triggered. There were no kernel messages about it, for example, and the oom_kill counter in /proc/vmstat stubbornly reported '0'. We spent some time wondering where the used memory was going so that we didn't really see it and, more importantly, why the kernel didn't think it had to invoke the OOM killer. Was the kernel failing to account for memory used in tmpfs somewhere, for example?

(In the process of looking into this issue I did learn that memory used by tmpfs shows up in /proc/meminfo's Shmem field. Tmpfs also apparently gets added to Cached, which is a bit misleading since it can't be freed up, unlike a lot of what else gets counted in Cached.)

Then last night the penny dropped and I came to a sudden realization about what was happening. What was happening with these machines was that they were running into strict overcommit limits, and when your machine hits strict overcommit limits, the kernel OOM killer is not triggered (or at least, isn't necessarily triggered). Most of our machines don't use strict overcommit (and this is generally the right answer), but our Linux compute servers do have it turned on, and it was on our compute servers that we were experiencing these unusual out of memory situations. This entirely explains how we could be out of memory without the kernel panicking about it; we had simply run into the limit of how much memory we told the kernel to allow people to allocate.

(Since the OOM killer wasn't invoked, it seems likely that some of this allocated memory space wasn't in active use and may not even have been touched.)
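
If you want to see whether a machine is set up this way and how close it is to the limit, the relevant pieces are all visible in /proc:

# 2 means strict overcommit ('never overcommit') is turned on
cat /proc/sys/vm/overcommit_memory

# how much commit space the kernel will allow versus how much is
# already spoken for
grep -E 'CommitLimit|Committed_AS' /proc/meminfo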

In a way, things are working exactly as designed. We said to use strict overcommit on these machines and the kernel is dutifully carrying out what we told it to do. We're enforcing memory limits that ensure these machines don't get paralyzed, and they mostly don't. In another way, how this is happening is a bit unfortunate. If the OOM killer activates, generally you lose a memory hog but other things aren't affected (in our environment the OOM killer seems pretty good at only picking such processes). But if a machine runs into the strict overcommit limit, lots of things can start failing because they suddenly can't allocate memory, can't fork, can't start new processes or daemons, and so on. Sometimes this leaves things in a failed or damaged state, because your average Unix program simply doesn't expect memory allocation or fork or the like to fail. In an environment where we're running various background tasks for system maintenance, this can be a bit unfortunate.

(Go programs panic, for instance. We got a lot of stack traces from the Prometheus host agent.)

One of the things that we should likely be doing to deal with this is increasing the vm.admin_reserve_kbytes sysctl (documented here). This defaults to 8 MB, which is far too low on a modern machine. Unfortunately it turns out to be hard to find a good value for it, because it includes existing usage from current processes as well. In experimentation, I found that a setting as high as 4 GB wasn't enough to allow a login through ssh (5 GB was enough, at the time, but I didn't try binary searching from there). This appears to be partly due to memory surges from logging in, because an idle machine has under a GB in /proc/meminfo's Committed_AS field.
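
Adjusting it is straightforward; finding the right number is the hard part (the 5 GB here is simply what was enough in my testing at the time, not a general recommendation):

# the current reservation, in KB
sysctl vm.admin_reserve_kbytes

# raise it to 5 GB; put the setting in /etc/sysctl.d/ to persist it
sysctl -w vm.admin_reserve_kbytes=5242880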

(I didn't know about admin_reserve_kbytes until I started researching this entry, so once again blogging about something turns out to be handy.)

StrictOvercommitVsOOM written at 21:51:25

2018-10-31

Do I feel uncertain about CentOS's future now? Yes, a bit

I was going to write an entry about how CentOS remains quietly important to us because of its long support period, with CentOS 7 supported through 2024 or so for security updates (per the CentOS wiki and FAQ). Then I paused to think about that in light of IBM buying Red Hat.

The end of 2023 is five years from now. A lot of things can happen in five years after a company is acquired, and a lot of intentions and plans can change. IBM is probably not going to stop doing RHEL 7 updates, so a CentOS can continue in some form, but at the same time the current Red Hat provides a certain degree of support and assistance to CentOS (see the CentOS Wikipedia entry for one account of this). It seems likely that CentOS's activities would at least slow down if IBM decided that it wasn't going to, say, continue to fund as many people to work on CentOS as Red Hat currently does.

Of course, nothing was certain even when Red Hat was an independent company; although we may pretend otherwise, blithe projections of support for Linux distributions three or four or five years into the future are just that (for many of them, not just CentOS). But I can't help but feel that things are now a bit more uncertain than they used to be, and I'm not quite as confident at projecting support for CentOS 7 out to the middle of 2024 as I was a month ago. A month ago, Red Hat pulling away from CentOS would have felt cataclysmic; in the future, it will just be IBM conducting a strategic reassessment (although perhaps a drastic one).

CentOS is still going to continue to be our best option for certain things, but that's another entry (the one I was going to write before I started thinking). And we're certainly not going to migrate our current CentOS 7 machines to anything else any time soon. So in practice what we're going to do is nothing; we're going to carry on and hope for no changes any time soon.

(As far as what this means for Fedora, well, I have no idea (and I certainly hope nothing changes, since I don't want to change Linux distributions). But again things feel a little bit more uncertain now than they used to.)

CentOSUncertainty written at 21:17:08

2018-10-10

Even systemd services and dependencies are not self-documenting

I tweeted:

I'm sure that past-me had a good reason for configuring my Wireguard tunnel to only start during boot after the VMWare modules had been loaded. I just wish he'd written it down for present-me.

Systemd units are really easy to write, straightforward to read, and quite easy to hack on and modify. But, just like everything else in system administration, they aren't really self-documenting. Systemd units will generally tell you clearly what they're doing, but they won't (and can't) tell you why you set them up that way, and one of the places where this can be very acute is in what their dependencies are. Sometimes those dependencies are entirely obvious, and sometimes they are sort of obvious and also sort of obviously superstitious. But sometimes, as in this case, they are outright mysterious, and then your future self (if no one else) is going to have a problem.

(Systemd dependencies are often superstitious because systemd still generally lacks clear documentation for standard dependencies and 'depend on this if you want to be started only when <X> is ready'. Admittedly, some of this is because the systemd people disagree with everyone else about how to handle certain sorts of issues, like services that want to activate only when networking is nicely set up and the machine has all its configured static IP addresses or has acquired its IP address via DHCP.)

Dependencies are also dangerous for this because it is so easy to add another one. If you're in a hurry and you're slapping dependencies on in an attempt to get something to work right, this means that adding a comment to explain yourself adds proportionally much more work than it would if you already had to do a fair bit of work to add the dependency itself. Since it's so much extra work, it's that much more tempting to not write a comment explaining it, especially if you're in a hurry or can talk yourself into believing that it's obvious (or both). I'm going to have to be on the watch for this, and in general I should take more care to document my systemd dependency additions and other modifications in the future.
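
One low-effort way to leave that breadcrumb is to put the dependency in its own drop-in file and explain it in a comment right there; for example (the unit names and the path here are made up to illustrate the idea, not taken from my actual setup):

# /etc/systemd/system/wg-quick@wg0.service.d/after-vmware.conf
#
# 2018-10: the wg0 tunnel sometimes came up broken on boot. My theory
# at the time was that VMWare's networking setup was taking the
# interface down, so only start after VMWare has finished. Revisit if
# the real cause turns out to be something else.
[Unit]
After=vmware.service

Since 'systemctl cat wg-quick@wg0.service' shows drop-ins along with their comments, the explanation then travels with the dependency instead of living only in your memory.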

(This is one of the things that version-controlled configuration files are good for. Sooner or later you'll have to write a commit message for your change, and when you do hopefully you'll get pushed to explain it.)

As for this particular case, I believe that what happened is that I added the VMWare dependency back when I was having mysterious Wireguard issues on boot because, it eventually turned out, I had forgotten to set some important .service options. When I was working on the issue, one of my theories was that Wireguard was setting up its networking, then VMWare's own networking stuff was starting up and taking Wireguard's interface down because the VMWare code didn't recognize this 'wireguard' type interface. So I set a dependency so that Wireguard would start after VMWare, then when I found the real problem I never went back to remove the spurious dependency.

(I uncovered this issue today as part of trying to make my machine boot faster, which is partially achieved now.)

DocumentStartupDependencies written at 01:37:27
