Wandering Thoughts

2023-01-31

I've had bad luck with transparent hugepages on my Linux machines

Normally, pages of virtual memory are a relatively small size, such as 4 Kbytes. Hugepages are a CPU and Linux kernel feature which allows programs to selectively have much larger pages, which generally improves their performance. Transparent hugepage support is an additional Linux kernel feature where programs can be more or less transparently set up with hugepages if it looks like this will be useful for them. This sounds good, but generally I haven't had the best of luck with them:

It appears to have been '0' days since Linux kernel (transparent) hugepages have dragged one of my systems into the mud for mysterious reasons. Is my memory too fragmented? Who knows, all I can really do is turn hugepages off.

(Yes they have some performance benefit when they work, but they're having a major performance issue now.)

This time around, the symptom was that Go's self-tests were timing out while I was trying to build it (or in some runs, the build itself would stall). While this was going on, top said that the 'khugepaged' kernel daemon process was constantly running (on a single CPU).

(I'm fairly sure I've seen this sort of 'khugepaged at 100% and things stalling' behavior before, partly because when I saw top I immediately assumed THP were the problem, but I can't remember details.)

One of the things that can cause problems with hugepages is that to have huge pages, you need large areas of contiguous RAM. These aren't always available, and not having them is one of the reasons for kernel page allocation failures. To get these areas of contiguous RAM, the modern Linux kernel uses (potentially) proactive compaction, which is normally visible as the 'kcompactd0' kernel daemon. Once you have aligned contiguous RAM that's suitable for use as huge pages, the kernel needs to turn runs of ordinary sized pages into hugepages. This is the job of khugepaged; to quote:

Unless THP is completely disabled, there is [a] khugepaged daemon that scans memory and collapses sequences of basic pages into huge pages.

In the normal default kernel settings, this only happens for processes that use the madvise(2) system call to tell the kernel that a mmap()'d area of theirs is suitable for this. Go can do this under some circumstances, although I'm not sure what they are exactly (the direct code that does it is deep inside the Go runtime).
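
For illustration, here's a minimal Python sketch of the application side of this. It is not what Go's runtime actually does, just the same madvise() mechanism dressed up in Python's mmap module (which needs Python 3.8 or later on Linux for MADV_HUGEPAGE):

import mmap

# Map 8 MBytes of anonymous memory; on x86 a transparent hugepage is
# normally 2 MBytes, so a large region gives the kernel something to work with.
length = 8 * 1024 * 1024
mem = mmap.mmap(-1, length)

# Mark the region as a candidate for transparent hugepages; this is what
# the 'madvise' setting of .../enabled keys on.
mem.madvise(mmap.MADV_HUGEPAGE)

# Touch the memory so it actually gets populated.
mem[:] = b"\0" * length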

If you look over the Internet, there are plenty of reports of khugepaged using all of a CPU, often with responsiveness problems to go along with it. Sometimes this stops if people quit and restart some application; at other times, people resort to disabling transparent hugepages or rebooting their systems. No one seems to have identified a cause, or figured out what's going on to cause the khugepaged CPU usage or system slowness (presumably the two are related, perhaps through lock contention or memory thrashing).

Disabling THP is done through sysfs:

echo never >/sys/kernel/mm/transparent_hugepage/enabled

The next time around I may try to limit THP's 'defragmentation' efforts:

echo never >/sys/kernel/mm/transparent_hugepage/defrag

(These days the normal setting for both of these is 'madvise'.)

If I'm understanding the documentation correctly, this will only use a hugepage if one is available at the time that the program calls madvise(); it won't try to get one later and swap it in.

(Looking at the documentation makes me wonder if Go and khugepaged were both fighting back and forth trying to obtain hugepages when Go made a madvise() call to enable hugepages for some regions.)

I believe I've only really noticed this behavior on my desktops, which are unusual in that I use ZFS on Linux on them. ZFS has its own memory handling (the 'ARC'), and historically has had some odd and uncomfortable interaction with the normal Linux kernel memory system. Still, it doesn't seem to be just me who has khugepaged problems.

(I don't think we've seen these issues on our ZFS fileservers, but then we don't run anything else on the fileservers. They sit there handling NFS in the kernel and that's about it. Well, there is one exception these days in our IMAP server, but I'm not sure it runs anything that uses madvise() to try to use hugepages.)

TransparentHugepagesBadLuck written at 23:04:27; Add Comment

2023-01-26

Some notes on using TRIM on SSDs with ZFS on Linux

One of the things you can do to keep your SSDs performing well over time is to explicitly discard ('TRIM') disk blocks that are currently unused. ZFS on Linux has had support for TRIM commands for some time; the development version got it in 2019, and it first appeared in ZoL 0.8.0. When it was new, I was a bit nervous about using it immediately, but it's been years since then and recently I did some experimentation with it. Well, with one version of ZoL's TRIM support, the manual one.

ZFS on Linux has two ways to periodically TRIM your pool(s), the automatic way and the manual way. The automatic way is to set 'autotrim=on' for selected pools; this comes with various cautions that are mostly covered in zpoolprops(7). The manual way is to periodically run 'zpool trim' with suitable arguments. One significant advantage of explicitly running 'zpool trim' is that you have a lot more control over the process, and in particular manual trims let you restrict trimming to a single device, instead of having trimming happen on all of them at once. If you trim your pools for only one device at a time (or only one device per vdev) and then scrub your pool afterward, you're pretty well protected against something going wrong in the TRIM process and the wrong disk blocks getting erased.

(My current experiments with 'zpool trim' are on Ubuntu 22.04 on some test pools, and scrubs say that nothing has gotten damaged in them afterward.)

The manual 'zpool trim' supports a -r command line option that controls how fast ZFS asks the disk to TRIM blocks. If you set this to, for example, 100 MBytes (per second), ZoL will only ask your SSD (or SSDs) to TRIM 100 MBytes of blocks every second. Sending TRIM commands to the SSD doesn't use read or write bandwidth as such, but it does ask the SSD to do things and that may affect other things that the SSD is doing. I wouldn't be surprised if some SSDs can TRIM at basically arbitrary rates with little to no impact on IO, while other SSDs get much more visibly distracted. As far as I can tell from some tests, this rate option does work (at least as far as ZFS IO statistics report).
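
Putting the pieces together, here's a rough Python sketch of the cautious 'one device at a time, then scrub' approach; the pool name, device names, and the 100 MByte rate are placeholders, and 'zpool wait' needs a reasonably recent OpenZFS (otherwise you'd poll 'zpool status -t' instead):

import subprocess

# Placeholder pool and per-vdev device names; substitute your own.
POOL = "tank"
DEVICES = ["sda", "sdb"]

def run(*args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

for dev in DEVICES:
    # TRIM one device at a time, rate-limited to roughly 100 MBytes/second.
    run("zpool", "trim", "-r", "100M", POOL, dev)
    run("zpool", "wait", "-t", "trim", POOL)
    # Scrub (and wait for the scrub) before touching the next device, so any
    # damage from the TRIM is caught while the other side of each mirror is intact.
    run("zpool", "scrub", POOL)
    run("zpool", "wait", "-t", "scrub", POOL)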

I'm not sure how much information 'zpool iostat' will report about ongoing TRIMs (either automatic or manual), but various information is available in the underlying statistics exported from the kernel. Your options for getting at this detailed information aren't great. At the moment, the available IO statistics appear to be a per-vdev 'bytes trimmed' number that counts up during TRIM operations (in sys/fs/zfs.h's vdev_stat structure), which only seems to have non-zero values in the per-disk IO statistics, and histograms of the 'IO size' of TRIM operations. However, an 'individual' TRIM IO is not necessarily what you think it is, and there are comments suggesting that individual TRIM 'IOs' larger than 16 MBytes get counted as 16 MBytes in the histograms, since that's the largest bucket. As with the 'rate' of trimming, all of these numbers are really counting the amount of data that ZFS has told the SSD or SSDs to throw away.

(All of these TRIM IO statistics are exposed by my version of the ZFS exporter for Prometheus.)

I'm not sure you can do very much with these IO statistics except use them to tell when your TRIMs ran and on what vdev, and for that there are other IO 'statistics' that are exposed by ZFS on Linux, although probably 'zpool iostat' won't tell you about them.

(The 'vdev trim state' is the vdev_trim_state_t enum in sys/fs/zfs.h, where 1 means a trim is active, 2 is it's been canceled, 3 is it's been suspended, and 4 is that it has completed. A zero means that a trim hasn't been done on this disk.)
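
If you're post-processing these numbers, the mapping is small enough to just embed. Here's a Python version of it (the values come straight from the enum described above):

# vdev_trim_state_t values from sys/fs/zfs.h, as described above.
VDEV_TRIM_STATES = {
    0: "never trimmed",
    1: "active",
    2: "canceled",
    3: "suspended",
    4: "complete",
}

def trim_state_name(state):
    return VDEV_TRIM_STATES.get(state, "unknown ({})".format(state))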

ZFSOnLinuxTrimNotes written at 22:59:00; Add Comment

2023-01-24

Linux software RAID mirrors, booting, mdadm.conf, and disk counts for non-fun

Linux software RAID mirrors have a count of the number of active disks that are in the array; this is what is set or changed by mdadm's --raid-devices argument. Your mdadm.conf may also list how many active disks an array is supposed to have, in the 'num-devices=' setting (aka a 'tag') for a particular array. The mdadm.conf manual page dryly describes this as "[a]s with level= this is mainly for compatibility with the output of mdadm --examine --scan", which historically and currently is not quite accurate, at least when booting (perhaps only under systemd).

I will give my current conclusion up front; if you're currently specifying num-devices= for any software RAID mirrors in your mdadm.conf, you should probably take the setting out. I can't absolutely guarantee that this is either harmless or an improvement, but the odds seem good.
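
For concreteness, here's roughly what the setting looks like in an mdadm.conf ARRAY line, with a placeholder UUID; the suggestion is to drop just the num-devices= tag and leave the rest alone:

# With num-devices, boot assembly wants (at least) two devices for md0:
ARRAY /dev/md0 metadata=1.2 num-devices=2 UUID=01234567:89abcdef:01234567:89abcdef
# Without it, md0 is assembled purely by its UUID:
ARRAY /dev/md0 metadata=1.2 UUID=01234567:89abcdef:01234567:89abcdef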

Updating the device count in software RAID mirrors is required when you add devices, for example to add your new disks alongside your old disks, and recommended when you remove disks (removing your old disks because you've decided that your new disks are fine). If you don't increase the number of devices when you add extra disks, what you're really doing is adding spares. If you don't decrease the number of devices on removal, mdadm will send you error reports and generally complain that there are devices missing. So let's assume that your software RAID mirror has a correct count.

Let's suppose that you have num-devices set in mdadm.conf and that your root filesystem's mdadm.conf is the same as the version in your initramfs (an important qualification because it's the version in the initramfs that counts during boot). Then there are several cases you may run into. The happy case is that the mdadm.conf disk count matches the actual array's disk count and all disks are visible and included in the live array. Congratulations, you're booting fine.

If the mdadm.conf num-devices is higher than the number of devices claimed by the software RAID array, and the extra disks you removed are either physically removed or have had their RAID superblocks zeroed, then your boot will probably stall and likely error out, or at least that's my recent experience. This is arguably reasonable, especially if num-devices is a genuinely optional parameter in mdadm.conf; you told the boot process this array should have four devices but now it has two, so something is wrong.

If the mdadm.conf num-devices is higher than the number of devices claimed by the array but the extra disks you removed are present and didn't have their RAID superblock zeroed, havoc may ensue. It seems quite likely that your system will assemble the wrong disks into the software RAID array; perhaps it prefers the first disk you failed out and removed, because it still claims to be part of a RAID array that has the same number of disks as mdadm.conf says it should have.

(The RAID superblocks on devices have both a timestamp and an event count, so mdadm could in theory pick the superblocks with the highest event count and timestamp, especially if it can assemble an actual mirror out of them instead of only having one device out of four. But mdadm is what it is.)

If the mdadm.conf num-devices is lower than the number of devices claimed by the software RAID array and all of the disks are present and in sync with each other, then your software RAID array will assemble without problems during boot. This seems to make num-devices a minimum for the number of disks your boot environment expects to see before it declares the RAID array healthy; if you provide extra disks, that's fine with mdadm. However, if you've removed some disks from the array and not zeroed their superblocks, in the past I've had the system assemble the RAID array with the wrong disk even though the RAID superblocks on the other disks agreed with mdadm.conf's num-devices. That may not happen today.

A modern system with all the disks in sync will boot with an mdadm.conf that doesn't have any num-devices settings. This is in fact the way that our Ubuntu 18.04, 20.04, and 22.04 servers set up their mdadm.conf for the root software RAID array, and it works for me on Fedora 36 for some recently created software RAID arrays (that aren't my root RAID array). However, I don't know how such a system reacts when you remove a disk from the RAID array but don't zero the disk's RAID superblock. On the whole I suspect that it won't be worse than what happens when num-devices is set.

SoftwareRaidDiskCountEffects written at 22:37:33; Add Comment

2023-01-13

Ubuntu 22.04 LTS servers and phased apt updates

I was working on getting one of our 22.04 LTS servers up to date, even for packages we normally hold, when I hit a mystery and posted about it on the Fediverse:

Why does apt on this 22.04 Ubuntu machine want to hold back a bunch of package updates even with '--with-new-pkgs --ignore-hold'? Who knows, it won't tell me why it doesn't like any or all of:

open-vm-tools openssh-client openssh-server openssh-sftp-server osinfo-db python3-software-properties software-properties-common

(Apt is not my favorite package manager for many reasons, this among them.)

Steve suggested that it was Ubuntu's "Phased Update" system, which is what it turned out to be. This set me off to do some investigations, and it turns out that phased (apt) updates explain some other anomalies we've seen with package updates on our Ubuntu 22.04 machines.

The basic idea of phased updates is explained in the "Phasing" section of Ubuntu's page on Stable Release Updates (SRUs); it's a progressive rollout of the package to more and more of the system base. Ubuntu introduced phased updates in 2013 (cf) but initially they weren't directly supported by apt, only by the desktop upgrade programs. Ubuntu 21.04 added apt support for phased updates and Ubuntu 22.04 LTS is thus the first LTS version to subject servers to phased updates. More explanations of phased updates are in this askubuntu answer, which includes one way to work around them.

(Note that as far as I know and have seen, security updates are not released as phased updates; if it's a security update, everyone gets it right away. Phased updates are only used for regular, non-security updates.)

Unfortunately apt (or apt-get) won't tell you if an update is being held back because of phasing. This user-hostile apt issue is tracked in Ubuntu bug #1988819 and you should add yourself as someone it affects if this is relevant to you. Ubuntu has a web page on what updates are currently in phased release, although packages are removed from this page once they reach 100%. Having reached 100%, such a package is no longer a phased update, which will become relevant soon. If you can't see a reason for a package to be held back, it's probably a phased update but you can check the page to be sure.

(As covered in the "Phasing" section, packages normally move forward through the phased rollout every six hours, so you can have a package held back on some server in the morning and then be not-held in the afternoon. This is great fun for troubleshooting why a given server didn't get a particular update.)

Your place in a phased update is randomized across both different servers and different packages. If you have a fleet of servers, they will get each phased update at different times, and the order won't be consistent from package to package. This explains an anomaly we've been seeing in our package updates for some time, where different 22.04 servers would get updates at different times without any consistent pattern.

The phased update related apt settings available and some of the technical details are mostly explained in this askubuntu answer. If you want to opt out of phased updates entirely, you have two options; you can have your servers install all phased updates right away (basically putting you at the 0% start line), or you can skip all phased updates and only install such packages when they reach 100% and stop being considered phased updates at all. Unfortunately, as of 22.04 there's no explicit option to set your servers to have a particular order within all updates (so that you can have, for example, a 'canary' server that always installs updates at 0% or 10%, ahead of the rest of the fleet).
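
If I'm reading apt_preferences(5) and the askubuntu answer correctly, these two opt-outs correspond to apt configuration options that you can drop into a file under /etc/apt/apt.conf.d, something like this (pick one, not both):

// Install phased updates immediately, as if always at the 0% start line.
APT::Get::Always-Include-Phased-Updates "true";
// Or instead: hold back phased updates until they reach 100% and stop being phased.
APT::Get::Never-Include-Phased-Updates "true";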

For any given package update, machines are randomized based on the contents of /etc/machine-id, which can be overridden for apt by setting APT::Machine-ID to a 32 hex digit value of your choice (the current version of apt appears to only use the machine ID for phased updates). If you set this to the same value across your fleet, your fleet will update in sync (although not at a predictable point in the phase process); you can also set subsets of your fleet to different shared values so that the groups will update at different times. The assignment of a particular machine to a point in the phased rollout is done through a relatively straightforward approach; the package name, version, and machine ID are all combined into a seed for a random number generator, and then the random number generator is used to produce a 0 to 100 value, which is your position in the phased rollout. The inclusion of the package name and version means that a given machine ID will be at different positions in the phased update for different packages. All of this turns out to be officially documented in the "Phased Updates" section of apt_preferences(5), although not in much detail.
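
To make the general idea concrete, here's a rough Python approximation of that assignment. This is only an illustration of the scheme, not apt's actual code, so the specific numbers it produces won't match what apt computes:

import hashlib
import random

def phased_position(pkg, version, machine_id):
    # Mix the package name, version, and machine ID into a seed and then
    # derive a 0..100 position in the phased rollout from it.
    seed = "{}-{}-{}".format(pkg, version, machine_id)
    rng = random.Random(int(hashlib.sha256(seed.encode()).hexdigest(), 16))
    return rng.randint(0, 100)

# Placeholder machine ID and made-up versions, purely for illustration.
mid = "0123456789abcdef0123456789abcdef"
for pkg, ver in [("openssh", "1.0-1"), ("open-vm-tools", "2.0-1")]:
    print(pkg, phased_position(pkg, ver, mid))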

(There is a somewhat different mechanism for desktop updates, covered in the previously mentioned askubuntu answer.)

As far as I can see from looking at the current apt source code, apt doesn't log anything at any verbosity if it holds a package back because the package is a phased update and your machine doesn't qualify for it yet. The fact that a package was a phased update the last time apt looked may possibly be recorded in /var/log/apt/eipp.log.xz, but documentation on this file is sparse.

Now that I've looked at all of this and read about APT::Machine-ID, we'll probably set it to a single value across all of our fleet because we find different machines getting updates at different times to be confusing and annoying (and it potentially complicates troubleshooting problems that are reported to us, since we normally assume that all 22.04 machines have the same version of things like OpenSSH). If we could directly control the position within a phased rollout we'd probably set up some canary machines, but since we can't I don't think there's a strong reason to have more than one machine-id group of machines.

(We could set some very important machines to only get updates when packages reach 100% and stop being phased updates, but Ubuntu has a good record of not blowing things up with eg OpenSSH updates.)

Ubuntu2204ServerPhasedUpdates written at 22:56:18; Add Comment

2023-01-11

Sometimes it actually is a kernel bug: bind() in Linux 6.0.16

There's a common saying and rule of thumb in programming (possibly originating in the C world) that it's never a compiler bug, it's going to be a bug in your code even if it looks crazy or impossible. Like all aphorisms it's not completely true, because compilers have bugs, but it's almost always the case that you haven't actually found a compiler bug and it's something else. You can say a similar thing about weird system issues (not) being the fault of a kernel bug, and so that's what I thought when the development version of Go started failing a self test when I built it on my Fedora 37 office desktop:

--- FAIL: TestTCPListener (0.00s)
    listen_test.go:72: skipping tcp  test
    listen_test.go:72: skipping tcp 0.0.0.0 test
    listen_test.go:72: skipping tcp ::ffff:0.0.0.0 test
    listen_test.go:72: skipping tcp :: test
    listen_test.go:90: tcp 127.0.0.1 should fail

Where this test in net/listen_test.go is failing is when it attempts to listen twice on the same localhost IPv4 address and port. It first binds to and listens on 127.0.0.1 port 0 (that port causes the kernel to assign a free ephemeral port for it), extracts the actual assigned port, and then attempts to bind to 127.0.0.1 on the same port a second time.

(The Go networking API bundles the binding and listening together in one Listen() API, but the socket API itself has them as two operations; you bind() a socket to some address, then listen() on it.)

This obviously should fail, except the development version of Go was claiming that it didn't. At first I thought this had to be a Go change, but soon I found that even older versions of Go didn't pass this test (even though I knew they had passed it when I'd built them), and also that this test passed on my Fedora 36 home desktop. Which, I noticed, was running Fedora's 6.0.15 kernel, while my office machine was running 6.0.16. That certainly looked like a kernel bug, and indeed I was able to reproduce it in Python (which is when I eventually realized this was an issue with bind() instead of listen()).

The Python version allows me to see more about what's going on:

>>> from socket import *
>>> s1 = socket(AF_INET, SOCK_STREAM)
>>> s2 = socket(AF_INET, SOCK_STREAM)
>>> s1.bind(('127.0.0.1', 0))
>>> s2.bind(('127.0.0.1', s1.getsockname()[1]))
>>> s1.getsockname()
('127.0.0.1', 54785)
>>> s2.getsockname()
('0.0.0.0', 0)

Rather than binding the second socket or failing with an error, the kernel has effectively left it unbound (the s2.getsockname() result here is the same as when the socket is newly created; '0.0.0.0' is usually known as INADDR_ANY). Replacing SOCK_STREAM with SOCK_DGRAM causes things to fail with 'address already in use' (errno 98), so this issue seems specific to TCP.
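
For convenience, here's the same check wrapped up as a small self-contained script that complains if the second bind() silently 'succeeds' the way it does on affected kernels:

#!/usr/bin/python3
import sys
from socket import socket, AF_INET, SOCK_STREAM

s1 = socket(AF_INET, SOCK_STREAM)
s1.bind(('127.0.0.1', 0))
port = s1.getsockname()[1]

s2 = socket(AF_INET, SOCK_STREAM)
try:
    s2.bind(('127.0.0.1', port))
except OSError:
    print("ok: second bind() failed as it should")
    sys.exit(0)

print("bug: second bind() to port", port, "succeeded; s2 is", s2.getsockname())
sys.exit(1)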

This kernel error is in Fedora 37's 6.0.16 and 6.0.18, but is gone in the Rawhide 6.2.0-rc2 and isn't present in the Fedora 6.0.15. I don't know if it's in any version of 6.1, but I'll probably find out soon when Fedora updates to it. Interested parties can try it for themselves, and it's been filed as Fedora bug #2159802.

(This elaborates on a Fediverse thread. I looked at the 6.0.16 changelog, but nothing jumped out at me.)

KernelBindBugIn6016 written at 22:59:37; Add Comment

2023-01-07

Our ZFS spares handling system sort of relies on our patterns of disk usage

In my entry on our Linux ZFS spares handling system, I wrote about how we used spares in a two step preference order, first on another disk connected the same way (SATA or SAS) and then on any disk necessary. In a comment on the entry, Simon asked:

Doesn't this mean you could end up mirroring to the same (real) disks? So the redundancy you can normally expect is severely reduced. Mirroring to the same disk only helps with localized read/write errors (like a bad sector), but not things like a failed disk.

This is a question with a subtle answer, which starts with how we use disks and what that implies for available spares. We always use disks in mirrored pairs, and the pairs are fixed; every partition of every disk has a specific partner. The first partition of the first SAS-connected disk is always mirrored with the first partition of the first SATA-connected disk, and so on. This means that in normal operation (when a disk hasn't failed), all spares also come in pairs; if the last partition of the first 'SAS' disk isn't used, neither will be the last partition of the first 'SATA' disk, so both are available as spares. In addition, we spread our partition usage across all disks, using the first partition on all pairs before we start using the second partition on any of them, and so on.

Since spares come in pairs, if we have as many pairs of spares as we have partitions on a disk (so four pairs, eight spares in total, with our current 2 TB disks with four partitions), we're guaranteed to have enough spares on the same 'type' (SAS connected or SATA connected) of disk to replace a failed disk. Since the other side of every mirrored pair is on the different type, the replacement spares can't wind up on the same physical disk as the other side. Since we don't entirely allocate one disk before we mostly allocate all of them, all disks have either zero partitions free or one partition free and our spares are all on different disks.

(Now that I've written this down I've realized that it's only true as long as we have no more partitions per disk than we have disks of a particular type. We have eight disks per type so we're safe with 4 TB disks and eight partitions per disk, but we'll need to think about this again if we move beyond that.)

If we have fewer spares than that, we could be forced to use a spare on the same type of disk as the surviving side of a pair. Even then we can try to avoid using a partition on the same disk and often we'll be able to. If the failed disk had no free partitions, its pair also has no free partitions and we're safe. If it had one free partition and we have more spares than the number of partitions per disk (eg six spares with 2 TB disks), we can still find a spare on another disk than its pair.

The absolute worst case in our current setup is if we're down to four spares and we lose a disk with one of the spares. Here we need three spares (for the used partitions on the disk), we only have three spares left, and one of them is on the pair disk to the one we lost, which is the disk that needs new mirroring. In this case we'll mirror one partition on that disk with another partition on that disk. This still gives us protection against ZFS checksum errors, but it also means that we overlooked a case when we decided it was okay to drop down to a minimum of only four spares.

I'll have to think about this analysis for our 4 TB disk, eight partition case, but certainly for the 2 TB disk, four partition case it means that the minimum number of spares we should be keeping is six, not four. Fortunately we don't have any fileservers that have that few spares at the moment. Also, I need to re-check our actual code to see if it specifically de-prioritizes the disk of the partition we're adding a spare to.

(One fileserver wound up at four spares before we upgraded its data disks to 4 TB SSDs.)

ZFSOurSparesSystemVI written at 21:14:48; Add Comment

2023-01-04

Finding people's use of /usr/bin/python with the Linux audit framework

Our Ubuntu systems have had a /usr/bin/python that was Python 2 for more or less as long as we've had Ubuntu systems, which by now is more than fifteen years. Over that time, our users have written a certain amount of Python 2 programs that use '#!/usr/bin/python' to get Python 2, because that's been the standard way to do it for a relatively long time. However, Python 2 is going away on Ubuntu, as it already has on Debian, and as part of that we're probably going to stop having a /usr/bin/python in our future 24.04 LTS servers. It would be nice to find out which of our users are still using '/usr/bin/python' so that we can contact them in advance and get them either to move their programs to Python 3 or at the very least start using '#!/usr/bin/python2'. One way to do this is to use the Linux kernel's audit framework. Or, really, two ways, the broad general way and the narrow specific way. Unfortunately neither of these are ideal.

The ideal option we would like is an audit rule for 'if /usr/bin/python is being used as the first argument to execve()', or equivalently 'if the name /usr/bin/python is being accessed in order to execute it'. Unfortunately, as far as I can tell you can't write either of these potential audit rules, although it may appear that you can.

The narrow specific way is to set a file audit on '/usr/bin/python' for read access, and then post-process the result to narrow it down to suitable system calls. For example:

-w /usr/bin/python -p r -k bin-python-exec

When you run a program that has a '#!/usr/bin/python', it will result in an audit log line like:

type=SYSCALL msg=audit(1672884109.008:233812): arch=c000003e syscall=89 success=yes exit=7 a0=560a65335980 a1=7fffe74c59e0 a2=1000 a3=560a632ba4e0 items=1 ppid=183379 pid=184601 auid=915 uid=915 gid=1010 euid=915 suid=915 fsuid=915 egid=1010 sgid=1010 fsgid=1010 tty=pts1 ses=6837 comm="fred" exe="/usr/bin/python2.7" subj=unconfined key="bin-python-exec"

Syscall 89 is (64-bit x86) readlink() (per this table), which in this case is being done inside the kernel as part of an execve() system call. The 'exe=' being Python 2.7 means that this can't be a readlink() call being done by some other program (ls, for example); however, we can't tell this from someone running Python 2.7 themselves and doing 'os.readlink("/usr/bin/python")'. This last case is probably sufficiently uncommon that you can not worry about it, and just contact the person in question (obtained from the uid= value) to let them know.

(One drawback of this narrow specific way is that you may not be able to tell people very much about what program of theirs is still using '/usr/bin/python'. The comm= value tells you what it's more or less called, but you don't have the specific path, although often you can dig it out by decoding the associated 'type=PROCTITLE' audit line for this audit record.)

Using '-p x' to trigger this on 'execute' in the audit rule doesn't work, because as far as the audit framework is concerned the symbolic link here is not being executed, it's being read (this is the same trap as I ran into when I worked out how to use the audit framework to find 32-bit programs).

The other approach, the broad general way, is to start by auditing execve(), possibly limited to execve() of /usr/bin/python2.7. I'm using a filter key option as good general practice, but we're going to see that it's not actually important:

-a always,exit -F arch=b64 -S execve -F path=/usr/bin/python2.7 -k bin-python-exec
-a always,exit -F arch=b32 -S execve -F path=/usr/bin/python2.7 -k bin-python-exec

(You can leave out the second line if you don't have to worry about 32-bit x86 programs.)

This will get you a set of audit records every time Python 2 gets executed, either directly or via some symlink. Starting from these, you want to pick out the records where "/usr/bin/python" is the initial argument to execve. The relevant lines from these records will look like this:

type=SYSCALL msg=audit(1672886976.075:237851): arch=c000003e syscall=59 success=yes exit=0 a0=55fa1621dd90 a1=55fa1621ddf0 a2=55fa16220e40 a3=8 items=3 ppid=183379 pid=189501 auid=915 uid=915 gid=1010 euid=915 suid=915 fsuid=915 egid=1010 sgid=1010 fsgid=1010 tty=pts1 ses=6837 comm="fred" exe="/usr/bin/python2.7" subj=unconfined key="bin-python-exec"
type=EXECVE msg=audit(1672886976.075:237851): argc=2 a0="/usr/bin/python" a1="/tmp/fred"

The 'type=EXECVE' record's 'a0=' value tells you that execve() was called on the /usr/bin/python symlink, instead of eg /usr/bin/python2. To get the user ID of the person doing this, you need to look back to the corresponding 'type=SYSCALL' record for the execve, which has a matching msg= value. Unfortunately as far as I know the audit system can't directly match type=EXECVE records for you. The second argument of the EXECVE record will generally tell you what Python program is being run, which you can pass on to the user involved.
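
Since the audit tools won't do this matching for you, here's a rough Python sketch of the post-processing, run against a copy of your audit log; it assumes the plain text record format shown above and ignores complications like hex-encoded arguments:

import re
import sys

# Records for the same event share a 'msg=audit(TIMESTAMP:SERIAL):' value,
# which is what lets us match an EXECVE record up with its SYSCALL record.
msg_re = re.compile(r'msg=audit\(([\d.]+:\d+)\)')
field_re = re.compile(r'(\w+)=("[^"]*"|\S+)')

syscalls = {}
execves = {}
for line in open(sys.argv[1]):
    m = msg_re.search(line)
    if not m:
        continue
    fields = {k: v.strip('"') for k, v in field_re.findall(line)}
    if fields.get('type') == 'SYSCALL':
        syscalls[m.group(1)] = fields
    elif fields.get('type') == 'EXECVE' and fields.get('a0') == '/usr/bin/python':
        execves[m.group(1)] = fields

for msgid, ev in execves.items():
    sc = syscalls.get(msgid, {})
    print("uid", sc.get('uid', '?'), "ran", ev.get('a1', '?'), "via /usr/bin/python")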

The advantage of the broad general way is that you may already be tracing execve() system calls for general system auditing purposes. If you are, you can exploit your existing auditing logs by just searching for the relevant EXECVE lines.

Because Linux's audit framework is quite old by now, it's everywhere and all of the programs and components work. However, these days it's probably not the best tool for this sort of narrowly scoped question. Instead, I suspect that something using eBPF tracing is a better approach these days, even though various aspects of the eBPF tools are still works in progress, even on relatively recent Linux distributions.

(I'm still a little bit grumpy that both Ubuntu 22.04 LTS and Fedora 36 broke bits of bpftrace for a while, and I believe 22.04 LTS still hasn't fixed them. We're better than we were in 2020, but still not great, and then there's problems with kernel lockdown mode in some environments.)

FindingPython2UsesWithAudit written at 22:57:50; Add Comment

2022-12-28

Some practical notes on the systemd cgroups/units hierarchies

Systemd organizes everything on your system into a hierarchy of cgroups, or if you prefer a hierarchy of units that happen to be implemented with cgroups. However, what this hierarchy is (or is going to be) isn't always obvious, and sometimes what shows up matters, for example because you're generating per-cgroup metrics and might hit a cardinality explosion. So here are some notes on things you may see in, for example, systemd-cgls or 'systemctl status' (or if you're writing something to dump cgroup memory usage).

At the top level, systemd has a -.slice (the root slice or cgroup). Underneath that are up to three slices: user.slice, for all user sessions, system.slice, for all system services, and machine.slice, for your virtual machines that are started in ways that systemd knows about (for example, libvirt). You'll probably always have a system.slice and usually a user.slice if you're looking at a machine, but many of your machines may not have a machine.slice. There's also an init.scope, which has PID 1 in it, and possibly some essentially empty .mount cgroups that systemd-cgls won't bother showing you.

In a virtualization environment using libvirt, machine.slice will have a 'machine-qemu-<N>-<name>.scope' for every virtual machine, except that everything after the 'machine-qemu' bit will have bits of hex encoding, such as '\x2d' for the '-'. Under each active VM are some libvirt-created cgroups under 'libvirt', which isn't a systemd unit (I'm going to skip inventorying them, since I don't feel qualified to comment). If you've started some virtual machines and then shut them all down again, 'systemd-cgls' probably won't show you machine.slice any more, but it's still there as a cgroup and may well have some amount of RAM usage still charged to it.

Under user.slice, there will normally be a hierarchy for any individual user login that I'm going to present in a text diagram form (from systemd-cgls):

├─user.slice
│ └─user-<UID>.slice
│   ├─user@<UID>.service
│   │ ├─session.slice
│   │ │ ├─dbus-broker.service
[...]
│   │ │ └─pipewire.service
│   │ └─init.scope
│   └─session-<NNN>.scope
[...]

Depending on the system setup, things may also be in an 'app.slice' and a 'background.slice' instead of a session.slice; see Desktop Environment Integration. What units you see started in the session and app slices depends on your system and how you're logging in to it (and you may be surprised by what gets started for a SSH login, even on a relatively basic server install).

(The init.scope for a user contains their systemd user instance.)

Under system.slice, you will normally see a whole succession of '<thing>.service', one for every active systemd service. You can also see a two level hierarchy for some things, such as templated systemd services:

  ├─system-serial\x2dgetty.slice 
  │ └─serial-getty@ttyS0.service 
  [...]
  ├─system-getty.slice 
  │ └─getty@tty1.service 
  [...]
  ├─system-postfix.slice 
  │ └─postfix@-.service 
  [...]

Templated systemd socket service units (with their long names) will show up (possibly very briefly) under a .slice unit for them, eg 'system-oidentd.slice'. This slice won't necessarily show in 'systemd-cgls' unless there's an active socket connection at the moment, but systemd seems to leave it there in /sys/fs/cgroup/system.slice even when it's inactive.

You can also get nested system.slice cgroups for dbus services:

  ├─system-dbus\x2d:1.14\x2dorg.freedesktop.problems.slice
  │ └─dbus-:1.14-org.freedesktop.problems@0.service

Inspecting the actual cgroups in /sys/fs/cgroup may also show you <thing>.mount, <thing>.socket, and <thing>.swap cgroups. Under rare circumstances you may also see a 'system-systemd\x2dfsck.slice' cgroup with one or more .service cgroups for fscks of specific devices.

Now that I've looked at all of this, my view is that if I'm generating resource usage metrics, I want to stop one level down from the top level user and system slices in the cgroup hierarchy (which means I will get 'system-oidentd.slice' but not the individually named socket activations). This captures most everything interesting and mostly doesn't risk cardinality explosions from templated units. Virtual machines under machine.slice need extra handling for cardinality, because the <N> in 'machine-qemu-<N>-[...]' is a constantly incrementing sequence number; I'll need to take that out somehow.
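
As a concrete sketch of that 'one level down' idea, here's roughly what a collection pass might look like in Python on a cgroup v2 system, reading memory.current; the two-level depth limit is the part you'd tune, and real metrics generation would also need the machine.slice name mangling mentioned above:

import os

CGROOT = "/sys/fs/cgroup"

def memory_current(path):
    try:
        with open(os.path.join(path, "memory.current")) as f:
            return int(f.read())
    except OSError:
        return None

# Report the top level slices and scopes, plus one level underneath them,
# which gets us things like system-oidentd.slice but not every templated
# unit underneath it.
for top in sorted(os.listdir(CGROOT)):
    tpath = os.path.join(CGROOT, top)
    if not os.path.isdir(tpath):
        continue
    usage = memory_current(tpath)
    if usage is not None:
        print(top, usage)
    for sub in sorted(os.listdir(tpath)):
        spath = os.path.join(tpath, sub)
        if not os.path.isdir(spath):
            continue
        usage = memory_current(spath)
        if usage is not None:
            print(top + "/" + sub, usage)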

If I'm reporting on the fly on resource usage, it's potentially interesting to break user slices down into each session scope and then the user@<UID>.service. Being detailed under the user service runs into issues because there's so much potential variety in how processes are broken up into cgroups. I'd definitely want to be selective about what cgroups I report on so that only ones with interesting resource usage show up in the report.

Sidebar: User cgroups on GNOME and perhaps KDE desktops

You may remember my bad experience with systemd-oomd, where it killed my entire desktop session. Apparently one reason for systemd-oomd's behavior is that on a modern GNOME desktop, a lot of applications are confined into separate cgroups, so if (for example) your Firefox runs away with memory, systemd-oomd will only kill its cgroup, not your entire session-<NNN>.scope cgroup. On Fedora 36, this appears to look like this:

 │ ├─app.slice
[...]
 │ │ ├─app-cgroupify.slice
 │ │ │ └─cgroupify@app-gnome-firefox-2838.scope.service
 │ │ │   └─ 2845 /usr/libexec/cgroupify app-gnome-firefox-2838.scope
[...]
 │ │ ├─app-gnome-firefox-2838.scope
 │ │ │ ├─3028
 │ │ │ │ └─ 3028 /usr/lib64/firefox/firefox -contentproc [...]
 │ │ │ ├─3024
 │ │ │ │ └─ 3024 /usr/lib64/firefox/firefox -contentproc [...]
[...]

Gnome terminal sessions also have a complex structure:

 │ │ ├─app-org.gnome.Terminal.slice (#10028)
 │ │ │ ├─vte-spawn-04ae3315-d673-47fc-a31e-f657648a0146.scope (#10774)
 │ │ │ │ ├─ 2625 bash
 │ │ │ │ ├─ 2654 systemd-cgls
 │ │ │ │ └─ 2655 less
 │ │ │ └─gnome-terminal-server.service (#10508)
 │ │ │   └─ 2478 /usr/libexec/gnome-terminal-server

And then there's:

 │ │ ├─app-gnome\x2dsession\x2dmanager.slice (#5885)
 │ │ │ └─gnome-session-manager@gnome.service
 │ │ │   └─ 1663 /usr/libexec/gnome-session-binary [...]

So a GNOME desktop can have a lot of nested things in a session or under an app.slice.

The Fedora 36 Cinnamon desktop doesn't seem to go as far as this, with a bunch of things still running in the 'session-NNN.scope' unit, but it does seem to do some things to split Firefox and other processes off into their own systemd units and cgroups.

(Cgroupify apparently comes from the uresourced RPM package.)

SystemdCgroupsHierarchies written at 23:38:23; Add Comment

2022-12-24

The systemd journal is primarily time-based

Recently, I encountered a surprising systemd behavior:

Dear systemd: apparently using absolute system date (at boot?!) to determine 'the system boot before this one' and so on is not necessarily a great idea, given that systems can boot with scrambled system time. If I want to look at 'the previous boot', I want that to work even given bad time. Especially if the system booted with bad time, ran with bad time for a bit, and then rebooted again.

(There are some replies by Lennart Poettering that are worth reading, 1, 2, and a systemd issue about this.)

However, on deeper examination I realized that this goes deeper than just what 'journalctl -b-1' will report is the boot before the current one (or even earlier boots). If your system boots with a bad time in the past and then corrects the time, 'journalctl -r' will stop abruptly at a log line where the time began to be stepped to the correct value, like this (in reversed order):

[...]
Dec 19 08:49:44 <host> systemd-journald[881]: [...] Journal header limits reached or header out-of-date, rotating.
Dec 19 08:49:44 <host> systemd-journald[881]: [...] is older than the configured file retention duration (1month), suggesting rotation.
Jul 13 20:00:33 <host> chronyd[2510]: Selected source [...]
-- Boot <previous boot id> --
[...]

If you can retrieve the correct journalctl boot identifier, you can see all of that boot's messages (in forward or reverse time order), but otherwise you get cut off and jump straight back to the previous boot. You do at least get one line of warning about it, and if you're specifically looking for early boot lines you'll likely notice that they're missing.

Assuming that the system time was corrected after boot, the times reported for such a boot in 'journalctl --list-boots' seem to be set to when the time was corrected. This particular boot had its start time reported as '2022-12-19 08:49:44', which is actually not the correct start of boot time even after the time correction (it seems that it took 33 seconds to reach that point). I'm not sure what sort of boot record you get if you boot a system with such incorrect time and never reset it to the correct time before it reboots.

This is probably old hand to systemd journal people, but it's not at all how I expect '(syslog) log files' to work. Log lines have an order and that order is not dependent on the nominal system time at the time they were logged. Still, the journal is what it is, and we can always use syslog files in addition to the journal when this matters.

(And I do feel that the systemd journal is in many ways an improvement on syslog files. It captures more information and provides more convenient access to it.)

SystemdJournalTimeBased written at 22:13:52; Add Comment

2022-12-19

Systemd unit templates don't provide a native way to have multiple parameters

We have a collection of Apache servers, and as part of our Prometheus metrics system we indirectly scrape metrics from them, using a third party Apache exporter that queries Apache's mod_status information and converts it to Prometheus metrics. Because the exporter's author chose to write it that way, you need to run one instance of the exporter per Apache server you want to extract metrics from. Ordinary people would probably run these exporter instances on the Apache hosts themselves. We opted to instead run all of the exporters on our metrics server, which means that we need one systemd unit per exporter instance. As we gather metrics from more and more Apache servers, this has led to more and more systemd units, each of them basically identical.

On the surface this sounds like a great use for a systemd template unit file. We could create a templated 'apache-exp@.service' file, and then start 'apache-exp@server1', 'apache-exp@server2', and so on. However, there is a problem that I haven't yet mentioned: because they're all running on the same host, each of these exporter instances needs to be told not just what Apache to query but also what local port to listen on (since they can't all listen on the same one, the way they could if they were on separate servers).

One obvious way of providing these two arguments to a template unit is some format of multi-part instance names, for example 'server1:9112'. Systemd will create multi-part instance names for templated socket units, but there are no systemd specifiers that will separate an instance name into parts for you. Any separation would have to be done by a front end script, taking the '%I' (or '%i') from systemd and breaking it up into the expected parts. This is perfectly doable but does mean adding a local script, with all that entails.
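
Such a front end script doesn't have to be much. Here's a hypothetical Python version that splits the instance name itself; the exporter path and its option names are placeholders, since they depend on the specific exporter and version you run:

#!/usr/bin/python3
# Intended to be run from the unit file as, for example:
#   ExecStart=/usr/local/sbin/apache-exp-wrapper %i
# with instance names that look like 'server1:9112'.
import os
import sys

instance = sys.argv[1]
host, _, port = instance.partition(':')
if not port:
    sys.exit("expected an instance name of the form host:port, got: " + instance)

# Placeholder exporter path and option names; adjust for the real exporter.
exporter = "/usr/local/sbin/apache_exporter"
os.execv(exporter, [exporter,
                    "--scrape_uri", "http://{}/server-status?auto".format(host),
                    "--telemetry.address", ":" + port])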

Systemd can do some processing of command lines, but it doesn't support Bourne shell style extraction of parts of variables. In theory your systemd ExecStart can be a 'sh -c' invocation (there's an example in the Command lines section), but in practice this feels like a lot of (fragile) work just to get some inline extraction of two parts of the instance name.

(You'll also have to put the instance name in a shell variable, because you can only use expansions like '${...%:*}' and '${...#*:}' with variables, not with literal text that you shove in. This is a fair Bourne shell limitation.)

I think this is a fair limitation for systemd instance name handling to have. The situation I'm in is somewhat obscure and handling anything like it would get complicated fast. Systemd is not the place to put powerful text rewriting and substitution mangling, because honestly we've seen where that goes and it often doesn't go anywhere good once you start.

SystemdTemplatesParameterLimit written at 22:21:50; Add Comment
