vm.admin_reserve_kbytes sysctl is both not big enough and not sufficient
We enable Linux's strict overcommit on some
of our servers (mostly compute servers). Every so often people run
big enough programs that they run the machine out of memory, and
some of the time when this happens we get various plaintive reports
from cron and other things that periodic system processes have
failed with out of memory errors. The Linux kernel has a sysctl
that's supposed to help with this, vm.admin_reserve_kbytes
(documented in vm.txt), but
in practice we've found two issues.
The first is that the default value of admin_reserve_kbytes is set for systems not operating in strict overcommit mode, and in any case the value dates from 2013. The kernel's own documentation suggests turning this up to 128 MB for strict overcommit, but I suspect that that's not sufficient for modern programs (a brief check suggests the total virtual size is at least 190 MB or so for sshd, bash, and top on 64-bit x86 Ubuntu 18.04; their combined RSS is over 16 MB). Perhaps 256 MB would be enough in strict overcommit mode. In any case, we need to tune this up, and it's hard to know by how much in order to make sure that cron jobs still keep running while not taking too much memory away from people, especially on machines with modest amounts of memory.
(If we were serious about this, we should look into collecting some sort of memory usage information from cron jobs on at least a test machine. As it is, this is a sufficiently infrequent issue that we don't care enough to do that work.)
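To make the arithmetic concrete, here's a sketch of how you might work out and apply a larger reserve (the 256 MB figure is the guess from above, not a tested value):

```shell
# The sysctl value is in kilobytes; a 256 MB reserve is:
reserve_kb=$((256 * 1024))
echo "vm.admin_reserve_kbytes = $reserve_kb"
# prints: vm.admin_reserve_kbytes = 262144

# To estimate what the system programs you care about need, sum their
# virtual sizes, eg:  ps -o vsz=,comm= -C sshd,bash,top
# Apply the new reserve live (as root) with:
#   sysctl -w vm.admin_reserve_kbytes=262144
# and persist it by putting the echoed line into a file in /etc/sysctl.d/.
```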
The second is that often, no setting of admin_reserve_kbytes
will let you log in to a server that's out of memory,
because of what I'll call the DBus daemon problem.
Specifically, during login, parts of the SSH server run as
non-privileged users. Since these are deliberately unprivileged UIDs,
memory allocations made by these processes are not covered by
admin_reserve_kbytes. If ordinary users can't allocate memory,
you're almost certainly not going to be able to ssh in even as root.
If you could get the SSH daemon to authenticate you, your eventual
bash processes would be covered by
admin_reserve_kbytes, but sadly you need that authentication
to happen before you get there.
(Turning off sshd's privilege separation is a cure far worse than the disease.)
The second issue lowers my motivation to try to fix the first problem by finding a setting of admin_reserve_kbytes so that our administrative cron jobs reliably keep working. If a machine runs out of memory and stays there, we may not be able to get in to deal with whatever the problem is, and things other than cron jobs may run into issues (we've seen the DBus daemon have problems in the past). Plus, our machines almost never run out of memory to the extent that we get cron email complaints about it.
PS: Someday our Ubuntu LTS machines may run systemd-oomd, which will undoubtedly need its own configuration and tuning. This might even show up in the future Ubuntu 22.04 LTS, which is not all that far away.
A bit on what Unix system pre-boot environments used to look like
Unix was first implemented on general purpose computers that hadn't been specifically designed for it, such as the PDP-11 and the DEC VAX. These machines could have intricate start up procedures (also) and in any case their pre-boot environment wasn't designed for Unix specifically. This changed in the early 1980s when computers both got more complex and began to be designed specifically for Unix, such as the Sun-1. These Unix computers, designed and built by Unix vendors who integrated their hardware and their Unix, soon got increasingly sophisticated and Unix specific pre-boot environments. The most well known and commonly experienced of these Unix machine pre-boot environments is probably Sun Microsystems' OpenBoot (later Open Firmware).
Broadly speaking, this pre-boot firmware tended to have three jobs. First, it had to configure and bring up the low level hardware, doing things like starting the CPU, enabling DRAM refresh, and any other basic hardware setup work required. Often the firmware would also do power on self tests, sometimes very time consuming ones; some SGI servers that we used to have could take five minutes to complete this phase of boot (and they weren't particularly big servers). Second, the firmware loaded and started the vendor's Unix kernel itself, possibly passing various hardware information to it. At the height of the Unix era, this was a complex job; the kernel could be found on any number of devices, would have to be read from the (Unix) filesystem when it was on a local disk, might have different default arguments or a different default name, and you could be network booting the machine, which even required configuring Ethernet hardware and talking protocols like BOOTP or DHCP.
(As part of being able to read the kernel from disk, the firmware naturally understood how the Unix vendor chose to implement disk slices and partitioning. There was no standard for this, although I think there were common approaches inherited from the historic Unix variant the vendor's Unix was derived from.)
The third broad job of the firmware was debugging the kernel, including forcing a reboot when it was hung. Most Unix machines let you break into a firmware debugger from the console while the system was running, which would let you poke around at machine state and often force a crash dump. My memory is that forcing a crash dump called into the kernel to do the actual work, but there may have been firmware that could write out memory to a designated disk area on its own.
(On Unix workstations, the firmware typically could work with the graphical display to write text over top of your windowing session. Unix workstations typically didn't have the separation between text mode and graphics modes that x86 PCs wound up with.)
Although all Unix firmware was capable of booting on its own if you let it sit (and had set it up right), it generally gave you a command line environment that you could break into to change things like what would be booted from where. On workstations, the Unix firmware would generally talk to either or both of a hardware console and a serial connection; on servers (which in that era were headless, without video output), the firmware only talked to a serial interface. Naturally you could configure the baud rate and so on of the serial interface in the firmware settings. Firmware settings tended to be represented as some form of environment variables.
I believe that some of the modern free BSDs have x86 second stage boot environments that are broadly similar to the old Unix firmware environments (such as OpenBSD's boot(8); see also the FreeBSD boot process).
The x86 PC BIOS does the same job of early hardware initialization that Unix pre-boot firmware did, but its traditional way of booting things is much more primitive (although also much more general). And obviously PC BIOSes haven't tended to offer command lines, instead having some kind of 'graphical' user interface (first using text mode graphics, then later real pixel graphics). Modern UEFI BIOSes have many of the general features of Unix firmware, such as firmware variables, extensive firmware services, and loading the operating system (or a next stage boot environment) from a real filesystem, but they still don't have a command line or the kind of full bore support for serial consoles that Unix firmware tended to have.
Most computer standards broadly require good faith implementations
One of the quietly unsaid things about computer standards, both formal and relatively informal, is that they were generally not created to defend us against deliberately perverse implementors with malicious intent. Most useful computer standards aren't written so tightly and narrowly that they block out useless but conforming behavior (or perverse but conforming); they assume that people following the standards are actively attempting to deliver a useful result. The implementors may misunderstand things or be sloppy, but they aren't trying to be evil.
This is broadly not a problem in practice in any situation where people can choose their implementation (including choosing not to use any implementation), or where things actively have to work. When quality of implementation matters, deliberately perverse implementations generally score very low. When whether it works or not matters, perverse implementations generally don't work in any practical sense.
However, perverse implementors can still claim that they conform to the standard, and if they were sufficiently clever, that claim is actually true by the letter of the standard, although never by its spirit. In certain social contexts, this can cause people great irritation from cognitive dissonance (on the one hand, they value standards conformance, but on the other hand, this 'conforming' implementation is terrible and wrong). Also, sometimes what really matters about a standard is the social claims of conformance around it, so the perverse implementor gets the benefit of the social claims without actually implementing things in a useful way.
(If you're using standards for legal reasons, this may not help you. But if you're dealing with hostile counterparties, you have bigger problems.)
Sidebar: Perverse conformance and requests for bids
Requests for bids by large organizations may sometimes require conformance to various computing standards in order to theoretically level the field between competitors (for example, to POSIX). When actual useful conformance to the standard matters (for example when the organization plans to actually use the purchased systems for Unix things), a perverse implementation (or a crippled but nominally conforming one) doesn't really matter, because it should never get past the organization's general quality checking processes.
My views so far on bookmarklets versus addons in Firefox
A while back I saw a message in the fediverse advocating for bookmarklets (via):
90% of browser extensions could be bookmarklets, and they’re 100% better for privacy. Bookmarklets should make a comeback.
* Opt in to an action
* No snooping on pages in the background
* Universally supported in browsers
I miss bookmarklets.
I'm a heavy user of Firefox addons, and I've also recently been using bookmarklets in the toolbar for dealing with fixed position page elements and making pages more readable with brute force. The fediverse message got me thinking, and now I think I have some views that are well enough formed to write down.
For me, the simultaneous advantage and drawback of bookmarklets is that they don't activate automatically when you load the page. This makes addons better for things that I want to happen all of the time, automatically. Conversely, bookmarklets are better for things that are sometimes great but also sometimes bad ideas that destroy the page's readability or functionality. Both of my current bookmarklets are of this nature, where I want to apply them to only some web pages (and sometimes only some of the time).
(There are also practical differences, such as addons being able to present a UI and be interactive, but let's hand wave that away. I think most of my addons would currently be flat out impossible as bookmarklets.)
Now that I've thought about this, it's not really a coincidence that my addons are all things that I want to happen all the time. Because addons work this way, I've been strongly biased to using only addons that I actually did want all of the time. An 'all the time' addon that you only want some of the time is painful, even if it offers UI controls for this. I probably have blind spots about useful 'some of the time' page modifications that could be bookmarklets.
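As an illustration of the 'some of the time' kind of page modification, here is a sketch of the core of an 'un-fix page elements' bookmarklet. This is my reconstruction of the general idea, not either of the actual bookmarklets mentioned above:

```javascript
// In a real bookmarklet this would be wrapped up as a
// 'javascript:(function () { ... })();' URL. The document and
// getComputedStyle parameters are passed in explicitly here just to
// make the logic easy to exercise outside a browser.
function unfixElements(doc, getStyle) {
  let changed = 0;
  for (const el of doc.querySelectorAll('*')) {
    const pos = getStyle(el).position;
    if (pos === 'fixed' || pos === 'sticky') {
      // Turn fixed banners and sticky headers into ordinary elements.
      el.style.position = 'static';
      changed += 1;
    }
  }
  return changed;
}
```

Run against some pages this is exactly what you want; run against others it will cheerfully destroy the page's layout, which is why it works better as an on-demand bookmarklet than an always-on addon.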
On servers maybe moving to M.2 NVMe drives for their system drives
We've been looking into getting some new servers (partly because a number of our existing Dell R210 IIs are starting to fail). Although we haven't run into this yet ourselves, one of the things we've heard in the process of this investigation is that various lines of basic servers are trying to move to M.2 NVMe system disks instead of 3.5" or 2.5" disks. People are generally unhappy about this, for a number of reasons including that these are not fancy hot-swappable M.2 NVMe, just basic motherboard plug-in M.2.
(Not all basic servers have hot swap drive bays even as it stands, though. Our Dell R210 IIs have fixed disks, for example, and we didn't consider that a fatal flaw at the time.)
My first knee-jerk reaction was that server vendors were doing this in order to get better markups from more expensive M.2 NVMe drives (and I just assumed that M.2 NVMe drives were more expensive than 2.5" SATA SSDs, which doesn't seem to always be the case). However, now I think they have a somewhat different profit focused motive, which is to lower their manufacturing costs (they may or may not lower their prices; certainly they would like to pocket the cost savings as profit).
As far as I know, basic motherboard M.2 NVMe is fairly straightforward on a parts basis. I believe that you break out some PCIe lanes that the chipset already supplies to a new connector with some additional small mounting hardware, and that's about it. The M.2 NVMe drive PCB that you've received as a complete part then plugs straight into that motherboard connector during server assembly.
(If the chipset doesn't have the spare two sets of x4 PCIe lanes (or maybe one set), you don't try to make the server an M.2 NVMe one.)
By contrast, 2.5" or 3.5" drives require more physical stuff and work inside the server. Each drive bay needs a power cable and a SATA or SAS cable (which go into their own set of motherboard connectors), and then you need some physical mounting hardware in the server itself, either as hot swap drive bays on the front or internal mounts. During physical enclosure design you'll have extra objects in the way of the airflow through the server and you'll need to figure out the wiring harness setup. During server assembly you'll have extra work to wire up the extra wiring harness, mount whatever drives people have bought into the drive bays, and put the drive bays in the machine.
All of this is probably not a huge extra cost, especially at scale. But it's an extra cost, and I suspect that server vendors (especially of inexpensive basic servers) are all for getting rid of it, whether or not their customers really like or want M.2 NVMe system disks. If M.2 NVMe ends up being the least expensive SSD form factor, as I suspect will happen, server vendors get another cost savings (or an opportunity to undercut other server vendors on pricing).
Unfortunately, damaged ZFS filesystems can be more or less unrepairable
An unfortunate piece of ZFS news of the time interval is that Ubuntu 21.10 shipped with a serious ZFS bug that created corrupted ZFS filesystems (see the 21.10 release notes; via). This sort of ZFS bug happens from time to time and has likely happened as far back as Solaris ZFS, and there are two unfortunate aspects of them.
(For an example of Solaris ZFS corruption, Solaris ZFS could write ACL data that was bad in a way that it ignored but modern ZFS environments care about. This sort of ZFS issue is not specific to Ubuntu or modern OpenZFS development, although you can certainly blame Ubuntu for this particular case of it and for shipping Ubuntu 21.10 with it.)
The first unfortunate aspect is that many of these bugs normally panic your kernel. At one level it's great that ZFS is loaded with internal integrity and consistency checks that try to make sure the ZFS objects it's dealing with haven't been corrupted. At another level it's not so great that the error handling for integrity problems is generally to panic. Modern versions of OpenZFS have made some progress on allowing some of these problems to continue instead of panicking, but there are still a lot left.
The second unfortunate aspect is that generally you can't repair this damage the way you can in more conventional filesystems. Because of ZFS's immutability and checksums, once something makes it to disk with a valid checksum, it's forever. If what made it to disk was broken or corrupted, it stays broken or corrupted; there's no way to fix it in place and no mechanism in ZFS to quietly fix it in a new version. Instead, the only way to get rid of the problem is to delete the corrupted data in some way, generally after copying out as much of the rest of your data as you can (and need to). If you're lucky, you can delete the affected file; if you're somewhat unfortunate, you're going to have to destroy the filesystem; if you're really unlucky, the entire pool needs to be recreated.
This creates two reasons to make regular backups (and not with
'zfs send', because that may well just copy the damage to your
backups). The first reason is of course so that you have the backup
to restore from. The second reason is that making a backup with
rsync, or another user level tool of your choice, will read
everything in your ZFS filesystems, which creates regular assurance
that everything is free of corruption.
PS: Even if you don't make regular backups, perhaps it's a good
idea just to read all of your ZFS filesystems every so often by
tar'ing them to /dev/null or similar things. I should probably
do this on my home machine, which I am really bad at backing up.
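A minimal version of this periodic read-everything check might look like the following sketch (the mountpoint and the 'tank' pool name are hypothetical; substitute your own):

```shell
# Read all file data under a mountpoint and discard it; ZFS verifies
# checksums on every read, so any corruption shows up as read errors
# here (and afterward in 'zpool status').
readcheck() {
    tar -cf /dev/null -C "$1" . || echo "read errors under $1" >&2
}

# Usage (hypothetical mountpoint and pool):
#   readcheck /tank/home
#   zpool status -v tank
```

Note that this only reads file data through the POSIX layer; a 'zpool scrub' checks all blocks in the pool, including metadata and snapshots, so the two are complementary rather than equivalent.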
Prometheus will make persistent connections to agents (scrape targets)
The other day, I discovered something new to me about Prometheus:
Today I learned that the Prometheus server will keep persistent TCP/HTTP connections open to clients it's pulling data from if you're pulling data frequently enough (eg, every 15 seconds). This can matter if you add firewall rules that will block new connections to things.
Julien Pivotto added an important additional note:
It also [h]as an impact if you use DNS because we will only resolve when we make new connections
I discovered this when we set up some new firewall rules on a machine
that should have blocked connections to its host agent (and did, for my
wget), but our Prometheus server seemed to find it still
perfectly reachable. That was because the machine's new firewall rules
allowed existing connections to continue, so the Prometheus server could
still talk to the client's host agent to scrape it.
Prometheus itself doesn't do anything special to get these persistent connections. Instead, they're a feature of HTTP keep alives and the underlying Go HTTP library, specifically of a http.Transport. Prometheus enables KeepAlives in the Transport it uses, and currently allows a rather large number of idle connections (in NewRoundTripperFromConfig, in http_config.go), and allows them to stay idle for five minutes before being closed (which in practice means they won't go idle, because they're scraped more frequently than that).
One of the non-obvious consequences of this is that your Prometheus server process may use more file descriptors than you expect, especially if you have a big environment where you scrape a lot of different agents. Roughly speaking, every separate scrape target is a persistent connection and so a file descriptor. One exception is multiple probes to things like Blackbox, where each Blackbox instance will be one or perhaps a few connections but probably many more probes.
(Of course, now that I look at the file descriptor usage of our Prometheus server, the file descriptor usage for persistent connections to targets is a drop in the bucket compared to the file descriptor usage for Prometheus's TSDB files.)
I don't know if Prometheus closes and reopens these connections if (and when) it reloads its scrape targets through service discovery or whatever. In blithe ignorance of what Prometheus does, I'm not certain which I'd prefer. Not closing connections for existing target hosts that are going to continue involves less resource usage, but it also postpones finding problems like new firewall rules or DNS resolution problems.
Speaking of DNS problems, this means that if your local resolving DNS stops being able to resolve your hosts, your Prometheus server will still (probably) be scraping them even if you specify targets by hostname instead of IP. You won't get a sudden wave of scrape target failures because you can't resolve their DNS names, because your Prometheus server has already connected to them while your DNS was working. For us, this is a feature, as we already explicitly monitor our DNS resolvers.
(Of course if your local DNS resolution is broken, your Alertmanager may not be able to contact anything to tell you about it.)
The long term relative prices of M.2 NVMe drives and 2.5" SSDs
For reasons outside the scope of this entry, I recently found myself wondering if in the end, M.2 NVMe drives will wind up being less expensive at moderate capacities than 2.5" SSDs of the same capacity (regardless of the 2.5" interface involved, which might be SATA, SAS, or U.2 NVMe). I reflexively think of M.2 NVMe drives as a better, high end product that is and always will be more expensive than ordinary 2.5" SATA SSDs, but the more I thought about it, the more I suspect that the economics tilt the other way in the long run.
The reason why is what we could call the tyranny of physical stuff. Both M.2 and 2.5" SSDs have the same basic electronics; they need flash chips, some DRAM (hopefully), a controller, a PCB, and assorted bits and pieces on the PCB. But 2.5" SSDs also need a case and some extra connectors. Those extra components have a cost, and eventually that cost (and the cost of assembling them) will probably dominate over other things like the relative cost of controller chipsets.
(I suspect that SATA SSD controller chipsets currently are cheaper in bulk than NVMe controller chipsets, partly because NVMe controller chipsets keep iterating for things like PCIe 4.0, while SATA SSD controllers have a relatively static job.)
The 2.5" form factor provides more room for internal electronics, but anecdotes are that modern 2.5" SSDs are mostly empty air inside the case. Right now, I believe they have an advantage over M.2 NVMe at larger capacities, because it's hard to squeeze that much flash into the modest space of the M.2 form factor (and you may have to use more expensive, higher density flash), but flash chips seem to be relentlessly marching to higher and higher densities in general. Plus, a lot of drives are relatively modest sizes, especially if you're looking at the server space. The 2.5" form factor may always have a price advantage at really large sizes, just like HDs so far have that advantage over all SSDs, but that seems likely to be less and less of the market over time.
In fact, now that I look at pricing online, it seems that this may have already happened without me noticing. At the very least, it looks like M.2 and 2.5" SATA SSD prices are roughly the same for moderate sizes, and there may be more M.2 NVMe drives (or sometimes M.2 SATA) than 2.5" SSDs listed. On the other hand, any number of the M.2 drives are from brand names that I barely recognize, while most of the 2.5" SSDs are from relatively well known names. It may be that the M.2 market is where everyone thinks the opportunity is for really low end, compromised products.
(I checked one well regarded brand that sells both M.2 and 2.5" drives, and for their smallest drives the sale price on the retailer I looked at was the same, although the list price of the M.2 NVMe version was slightly higher.)
PS: It's likely helped the M.2 form factor that it's what laptops have used for a while if they're going to have separate drives, and laptops still have a decently large volume. Although I don't know if more M.2 drives are made and shipped than 2.5" SSDs, even with laptops. But I may be biased because we get servers and use 2.5" SSDs in them (so far).
The problem I have with Pip's dependency version handling
Python's Pip package manager has
a system where main programs and packages can specify the general
versions of dependencies that they want. When you install a program
through pip (either directly into a virtual environment or with a
convenient tool like
pipx), pip resolves the
general version specifications to specific versions of the packages
and installs them too. Like many language package managers, pip
follows what I'll call a maximal version selection algorithm; it
chooses the highest currently available version of dependencies
that satisfy all constraints. Unfortunately I have come to feel
that this is a bad choice, at least for programs, for two reasons.
One of the reasons is general and one of them is specific to pip's
current capabilities and tooling.
The general reason is that it makes the installed set of dependencies not automatically reproducible. If I install the Python LSP server today and you install it a week from now, we may well not wind up with the same total version of everything even if the Python LSP server project hasn't released a new version. All it takes is a direct or indirect dependency to release a new version that's compatible with the version restrictions in the intervening week. Your pip install will pick up that new version, following pip's maximal version selection.
This is theoretically great, since you're getting the latest and thus best versions of everything. It is not necessarily practically great, since as we've all experienced, sometimes the very latest versions of things are not in fact the best versions, or at least the best versions in the context you're using them. If nothing else, you're getting a different setup than I am, which may wind up with confusing differences in behavior.
(For instance, your Python LSP server environment might have a new useful capability that mine doesn't. You'll tell me 'just do <X>', and I'll say 'what?'.)
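One way to get reproducibility back by hand is to capture and replay exact versions; a sketch (the file name here is arbitrary):

```shell
# Record the exact versions of everything installed in the current
# environment; 'pip freeze' pins indirect dependencies too, not just
# the ones you asked for directly.
pip freeze > lsp-versions.txt

# Someone else (or future me) can then reproduce the same set exactly:
#   pip install -r lsp-versions.txt
```

This works, but it requires the first person to remember to capture the file and pass it around, which is exactly the kind of manual coordination that version pinning in the installed program itself would avoid.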
The specific reason is that once pip has installed my version of
something, pip doesn't really seem to provide a good way to update
it to the versions of everything I'd get if I reinstalled today.
That way, it would at least be easy for me and you to get the same
versions of everything in our installs of the Python LSP server,
which would let us get rid of problems (or at least let me see your
problems, if more recent package versions have new problems). Pip
has some features to try to do this, but in
practice they don't seem to work very well for me. I'm left to do
manual inspection with 'pip list --outdated', manual upgrades of
things with 'pip install --upgrade', and then use of 'pip check'
afterward to make sure that I haven't screwed up and upgraded
something too far.
Pip is not going to change its general approach of maximal version selection (I think only Go has been willing to go as far as the alternative, minimal version selection). But I hope that someday either pip or additional tools will have a good way to bring existing installs up to what they would be if reinstalled today.
(Pipx has its 'reinstall' option, but that's a blunt hammer and I'm not sure it works in all cases. I suppose I should try it someday on my Python LSP server installation, which has various additional optional packages installed too.)
Two stories of how and why simultaneous multithreading works
Simultaneous multithreading has been controversial since Intel introduced hyper-threading. Some people felt they were getting a nice bonus; others thought they were getting a net negative (back in 2011 I sort of had this view). I don't have any well informed views on whether or not SMT is useful for me (or for us), but what I do have is two stories I've absorbed over the years about how it works and could benefit you.
The first story is that SMT works by covering up memory latency. When a core would otherwise have to stall the currently executing thread to wait for a memory fetch (or a memory store), it can instead instantly switch to another thread and perhaps get some additional work done. This switching can't be handled by the OS for various reasons; instead, the other thread of execution must be ready for the core's hardware to start executing its instructions. The simplest way to do this is to present the extra hardware thread context as an additional CPU. This then generally forces the processor to actually schedule back and forth between both threads, rather than starving the secondary thread until (and unless) the primary thread stalls.
(The memory fetch might be for either data or for branches and calls in the program's code. Of course, as we famously know due to Spectre and Meltdown, a core may continue on with speculative execution of the thread even when it's nominally stalling for a memory fetch.)
The second and more recent story is that SMT works partly by increasing the utilization of a core's execution units (EUs). Modern x86 processor cores have a number of execution units of each type in order to extract as much instruction level parallelism as possible (a superscalar processor). However, not all code can use all of those execution units at once, and some code can leave entire types of execution units totally idle (for example, integer only code leaves floating point EUs idle). If you have a core execute more than one thread at once, your EU utilization is what both of them can use, not just what one thread can use. This intrinsically requires both threads to be executing simultaneously, because otherwise they won't both be using EUs at the same time.
These two stories are probably both true these days, but to what extent they're each true will depend partly on what your code actually does. The worst case for SMT is probably dense, highly optimized code that makes minimal memory fetches and tries to use all of the available execution units.