Wandering Thoughts


The Linux Out-Of-Memory killer process list can be misleading

Recently, we had a puzzling incident where the OOM killer was triggered for a cgroup, listed some processes, and then reported that it couldn't kill anything:

acctg_prof invoked oom-killer: gfp_mask=0x1100cca (GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
memory: usage 16777224kB, limit 16777216kB, failcnt 414319
swap: usage 1040kB, limit 16777216kB, failcnt 0
Memory cgroup stats for /system.slice/slurmstepd.scope/job_31944
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 252732]     0 252732     1443        0    53248        0             0 sleep
[ 252728]     0 252728    37095     1915    90112       54         -1000 slurmstepd
[ 252740]  NNNN 252740  7108532    17219 39829504        5             0 python3
[ 252735]     0 252735    53827     1886    94208      151         -1000 slurmstepd
Out of memory and no killable processes...

We scratched our heads a lot, especially as something seemed to be killing systemd-journald at the same time and the messages being logged suggested that it had been OOM-killed instead (although I'm no longer so sure). Why was the kernel saying that there were no killable processes when there was a giant Python process right there?

What was actually going on is that the OOM task state list leaves out a critical piece of information, namely whether or not the process in question had already been killed. A surprising number of minutes before this set of OOM messages, the kernel had done another round of a cgroup OOM kill for this cgroup and:

oom_reaper: reaped process 252740 (python3), now anon-rss:0kB, file-rss:68876kB, shmem-rss:0kB

So the real problem was that this Python process was doing something that had it stuck sitting there, using memory, even after it was OOM killed. The Python process was indeed not killable, for the reason that it had already been killed.
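To spot this situation in retrospect, you can cross-check the kernel log: any PID that shows up in an OOM 'Tasks state' listing after the oom_reaper has already reaped it is one of these killed-but-not-gone processes. A minimal sketch in Python (the message formats are the ones from our logs and may vary between kernel versions):

```python
import re

# The oom_reaper message format as seen in our kernel logs.
REAPED = re.compile(r"oom_reaper: reaped process (\d+) \((\S+)\)")

def already_reaped_pids(kernel_log_lines):
    """Return the set of PIDs the oom_reaper has already reaped.

    A PID in this set that still appears in a later OOM 'Tasks state'
    listing is a process that was killed but never cleaned up, which
    is why the kernel then reports no killable processes.
    """
    reaped = set()
    for line in kernel_log_lines:
        m = REAPED.search(line)
        if m:
            reaped.add(int(m.group(1)))
    return reaped

log = [
    "oom_reaper: reaped process 252740 (python3), now anon-rss:0kB, file-rss:68876kB, shmem-rss:0kB",
    "Out of memory and no killable processes...",
]
print(already_reaped_pids(log))  # {252740}
```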

The whole series of events is probably sufficiently rare that it's not worth cluttering the tasks state listing with some form of 'task status' that would show if a particular process was already theoretically dead, just not cleaned up. Perhaps it could be done with some clever handling of the adjusted OOM score, for example marking such processes with a blank value or a '-'. This would make the field not parse as a number, but then kernel log messages aren't an API and can change as the kernel developers like.

(This happened on one of the GPU nodes of our SLURM cluster, so our suspicion is that some CUDA operation (or a GPU operation in general) was in progress and until it finished, the process could not be cleaned up and collected. But there were other anomalies at the time so something even odder could be going on.)

linux/OOMKillProcessListMisleading written at 23:30:12


SSH has become our universal (Unix) external access protocol

When I noted that brute force attackers seem to go away rapidly if you block them, one reaction was to suggest that SSH shouldn't be exposed to the Internet. While this is viable in some places and arguably broadly sensible (since SSH has a large attack surface, as we've seen recently in CVE-2024-6387), it's not possible for us. Here at a university, SSH has become our universal external access protocol.

One of the peculiarities of universities is that people travel widely, and during that travel they need access to our systems so they can continue working. In general there are a lot of ways to give people external access to things; you can set up VPN servers, you can arrange WireGuard peer to peer connections, and so on. Unfortunately, two issues often surface: our people have widely assorted devices that they want to work from, with widely varying capabilities and ease of using VPNs and VPN-like things, and their (remote) network environments may or may not like any particular VPN protocol (and they probably don't want to route their entire Internet traffic the long way around through us).

The biggest advantage of SSH is that pretty much everything can do SSH, especially because it's already a requirement for working with our Unix systems when you're on campus and connecting from within the department's networks; this is not necessarily so true of the zoo of different VPN options out there. Because SSH is so pervasive, it's also become a lowest common denominator remote access protocol, one that almost everyone allows people to use to talk to other places. There are a few places where you can't use SSH, but most of them are going to block VPNs too.

In most organizations, even if you use SSH (and IMAP, our other universal external access protocol), you're probably operating with a lot less travel and external access in general, and hopefully a rather more controlled set of client setups. In such an environment you can centralize on a single VPN that works on all of your supported client setups (and meets your security requirements), and then tell everyone that if they need to SSH to something, first they bring up their VPN connection. There's no need to expose SSH to the world, or even let the world know about the existence of specific servers.

(And in a personal environment, the answer today is probably WireGuard, since there are WireGuard clients on most modern things and it's simple enough to only expose SSH on your machines over WireGuard. WireGuard has less exposed attack surface and doesn't suffer from the sort of brute force attacks that SSH does.)

sysadmin/SSHOurUniversalAccessProtocol written at 22:59:18

My self-inflicted UPS and computer conundrum

Today the area where I live experienced what turned out to be an extended power outage, and I wound up going in to the office. In the process, I wound up shooting myself in the foot as far as my ability to tell from the office if power had returned to home, oddly because I have a UPS at home.

The easy way to use a UPS is just to let it run down until it powers off. But this is kind of abrupt on the computer, even if you've more or less halted it already, and it also means that the computer's load reduces the run time for everything else. In my case this matters a bit because after a power loss, my phone line is typically slow to get DSL signal and sync up, so that I can start doing PPPoE and bring up my Internet connection. So if it looks like the UPS's power is running low, my reflex is to power off the computer and hope that power will come back before the UPS bottoms out and the DSL modem turns off (and thus loses line sync).

The first problem is that this only really works if I'm going to stick around to turn the computer on should the power outage end early enough (before the UPS loses power). That turned out not to be a problem this time; the power outage lasted more than long enough to run the UPS out of power, even with only the minor load of the DSL modem, a five-port switch, and a few other small things. The bigger problem is that because of how I have my computer set up right now due to hardware reasons, if I want the computer to be drawing no power (as opposed to being 'off' in some sense), I have to turn the computer off using the hard power switch on the PSU. Once I've flipped this switch, the computer is off until I flip it back, and if I flip it back with (UPS) power available, the computer will power back up again and start drawing power and all that.

(My BIOS is set to 'always power up when AC power is restored', and apparently one side effect of this is that the chassis fans and so on keep spinning even when the system is 'powered off' from Linux.)

The magic UPS feature I would like in order to fix this is a one-shot push-button switch for every outlet that temporarily switches the outlet to 'wait until AC power returns to give this outlet any power'. With this, I could run 'poweroff' on my computer, then push the button to cut power and have it come back when Toronto Hydro restored service. I believe it might be possible to do this with commands to the UPS, but that mostly doesn't help me since the host that would issue those commands is the one I'm running 'poweroff' on.

(The better solution would be a BIOS and hardware that turns everything off after 'poweroff' even when set to always power up after AC comes back. Possibly this is fixed in a later BIOS revision than I have.)

tech/MySelfInflictedUPSConundrum written at 00:44:09


People at universities travel widely and unpredictably

Every so often, people make the entirely reasonable suggestion that if one day you see a particular person log in locally and then a few days later they're logging in from halfway around the world, perhaps you should investigate. This may work for many organizations, but unfortunately it is one of the ways in which universities are peculiar places. At universities, a lot of people travel, they do it a fair bit (and unpredictably), and they go to all sorts of strange places, where they will connect back to the university to continue doing work (for professors and graduate students, at least).

There are all sorts of causes for this travel. Professors, postdocs, and graduate students go to conferences in various locations. Professors go on sabbatical, or go visit another university for a month or two, or even go hang out at a company for a while (perhaps as a visiting researcher). Graduate students also go home to visit their family, which can put them pretty much anywhere in the world, and they can also visit places for other reasons.

(Graduate students are often strongly encouraged to keep working all the time, including on holiday visits to their family. Even professors can feel similar pressures in the modern academic environment.)

Professors, postdocs, and graduate students will not tell you all of this information ahead of time, and even if you forced them to share their travel plans, it would not necessarily be useful because they may well have no idea how they will be connecting to the Internet at their destination (and what IP address ranges that would involve). Plus, geolocation of Internet IP addresses is not particularly exact or accurate, especially if you need to do it for free.

One corollary of this is that at a university, you often can't safely do broad 'geographic' blocks of logins (or VPN connections, or whatever) from IP address ranges, because there's no guarantee that one of your people isn't going to pop up there. The more populous the geographic area, the more likely that some of your people are going to be there sooner or later.

(An additional complication is people who move elsewhere (or are elsewhere) but maintain a relationship with your part of the university, and as part of that may visit in person every so often. These people travel too, and are even less likely to tell you their travel plans, since now you're a third party to them.)

tech/UniversityPeopleTravelWidely written at 22:11:08


The Firefox source code's 'StaticPrefs' system (as of Firefox 128)

The news of the time interval is that Mozilla is selling out Firefox users once again (although Firefox remains far better than Chrome), in the form of 'Privacy-Preserving Attribution', which you might impolitely call 'browser managed tracking'. Mozilla enabled this by default in Firefox 128 (cf), and if you didn't know already you can read about how to disable it here or here. In the process of looking into all of this, I attempted to find where in the Firefox code the special dom.private-attribution.submission.enabled preference was actually used, but initially failed. Later, with guidance from @mcc's information, I managed to find the code and learned something about how Firefox handles certain 'about:config' preferences through a system called 'StaticPrefs'.

The Firefox source code defines a big collection of statically known about:config preferences, with commentary, default values, and their types, in modules/libpref/init/StaticPrefList.yaml (reading the comments for preferences you're interested in is often quite useful). Our dom.private-attribution.submission.enabled preference is defined there. However, you will search the Firefox source tree in vain for any direct reference to accessing these preferences from C++ code, because their access functions are actually created as part of the build process, and even in the build tree they're accessed through #defines that are in StaticPrefListBegin.h. In the normal C++ code, all that you'll see is calls to 'StaticPrefs::<underscore_name>()', with the name of the function being the name of the preference with '.' (and '-') converted to '_' (underscores), giving names like dom_private_attribution_submission_enabled. You can see this in dom/privateattribution/PrivateAttribution.cpp in functions like 'PrivateAttribution::SaveImpression()' (for as long as this source code lives in Firefox before Mozilla rips it out, which I hope is immediately).
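As a small illustration of the name mangling (my assumption, from the generated headers, is that both '.' and '-' in a preference name become '_'):

```python
import re

def staticprefs_name(pref):
    """Mangle an about:config preference name into its StaticPrefs
    accessor name, assuming (as the generated headers suggest) that
    '.' and '-' both turn into '_'."""
    return re.sub(r"[.\-]", "_", pref)

print(staticprefs_name("dom.private-attribution.submission.enabled"))
# dom_private_attribution_submission_enabled
```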

(In the Firefox build tree, the generated file to look at is modules/libpref/init/StaticPrefList_dom.h.)

Some preferences in StaticPrefList.yaml aren't accessed this way by C++ code (currently, those with 'mirror: never' that are used in C++ code), so their name will appear in .cpp files in the Firefox source if you search for them. I believe that Firefox C++ code can also use additional preferences not listed in StaticPrefList, but it will obviously have to access those preferences using their preferences name. There are various C++ interfaces for working with such preferences, so you'll see things like a preference's value looked up by its name, or its name passed to methods like nsIPrincipal's 'IsURIInPrefList()'.

A significant amount of Firefox is implemented in JavaScript. As far as I know, that JavaScript doesn't use StaticPrefs or any equivalent of it and always accesses preferences by their normal about:config name.

web/FirefoxStaticPrefsSystem written at 22:02:47


That software forges are often better than email is unfortunate

Over on the Fediverse, there was a discussion of doing software development things using email and I said something:

My heretical opinion is that I would rather file a Github issue against your project than send you or your bug tracker email, because I do not trust you to safeguard my email against spammers, so I have to make up an entire new email address for you and carefully manage it. I don't trust Github either, but I have already done all of this email address handling for them.

(I also make up an email address for my Git commits. And yes, spammers have scraped it and spammed me.)

Github is merely a convenient example (and the most common one I deal with). What matters is that the forge is a point of centralization (so it covers a lot of projects) and that it does not require me to expose my email to lots of people. Any widely used forge-style environment has the same appeal (and conversely, small scale forges do not; if I am going to report issues to only one project per forge, it is not much different than a per-project bug tracker or bug mailing list).

That email is so much of a hassle today is a bad and sad thing. Email is a widely implemented open standard with a huge suite of tools that allows for a wide range of ways of working with it. It should be a great light-weight way of sending in issues, bug reports, patches, etc etc, and any centralized, non-email place to do this (like Github) has a collection of potential problems that should make open source/free software people nervous.

Unfortunately email has been overrun by spammers in a way that forges have not (yet) been, and in the case of email the problem is essentially intractable. Even my relatively hard to obtain Github-specific email address gets spam email, and my Git commit email address gets more. And demonstrating the problem with not using forges, the email address I used briefly to make some GNU Emacs bug reports about MH-E got spam almost immediately, which shows why I really don't want to have to send my issues by email to an exposed mailing list with public archives.

While there are things that might make the email situation somewhat better (primarily by hiding your email address from as many parties as possible), I don't think there's any general fix for the situation. Thanks to spam and abuse, we're stuck with a situation where setting yourself up on a few central development sites with good practices about handling your contact methods is generally more convenient than an open protocol, especially for people who don't do this all the time.

programming/EmailVsForgesUnfortunate written at 23:09:48


Network switches aren't simple devices (not even basic switches)

Recently over on the Fediverse I said something about switches:

"Network switches are simple devices" oh I am so sorry. Hubs were simple devices. Switches are alarmingly smart devices even if they don't handle VLANs or support STP (and almost everyone wants them to support Spanning Tree Protocol, to stop loops). Your switch has onboard packet buffering, understands Ethernet addresses, often generates its own traffic and responds to network traffic (see STP), and is actually a (layer 2) high speed router with a fallback to being a hub.

(And I didn't even remember about multicast, plus I omitted various things. The trigger for my post was seeing a quote from Making a Linux-managed network switch, which is speaking (I believe) somewhat tongue in cheek and anyway is a fun and interesting article.)

Back in the old days, a network hub could simply repeat incoming packets out each port, with some hand waving about having to be aware of packet boundaries (see the Wikipedia page for more details). This is not the case with switches. Even a very basic switch must extract source and destination Ethernet addresses out of packets, maintain a mapping table between ports and Ethernet addresses, and route incoming packets to the appropriate port (or send them to all ports if they're to an unknown Ethernet address). This generally needs to be done at line speed and handle simultaneous packets on multiple ports at once.
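The core forwarding decision of such a basic switch can be sketched in a few lines (a toy model, ignoring the line-rate hardware, table entry timeouts, and everything else real switches do):

```python
class LearningSwitch:
    """Toy model of a basic switch's forwarding logic: learn which port
    each source MAC address lives on, send known destinations out one
    port, and flood unknown destinations to every other port."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def handle(self, in_port, src_mac, dst_mac):
        self.mac_table[src_mac] = in_port  # learn (or refresh) the source
        out = self.mac_table.get(dst_mac)
        if out is None:
            return self.ports - {in_port}  # unknown destination: flood
        if out == in_port:
            return set()  # destination is where the packet came from: drop
        return {out}  # known destination: exactly one port

sw = LearningSwitch(ports=[1, 2, 3, 4])
print(sw.handle(1, "aa:aa", "bb:bb"))  # unknown destination: {2, 3, 4}
print(sw.handle(2, "bb:bb", "aa:aa"))  # learned earlier: {1}
```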

Switches must have some degree of internal packet buffering, although how much buffering switches have can vary (and can matter). Switches need buffering to deal with both a high speed port sending to a low speed one and several ports all sending traffic to the same destination port at the same time. Buffering implies that packet reception and packet transmission can be decoupled from each other, although ideally there is no buffering delay if the receive to transmit path for a packet is clear (people like low latency in switches).

A basic switch will generally be expected to both send and receive special packets itself, not just pass through network traffic. Lots of people want switches to implement STP (Spanning Tree Protocol) to avoid network loops (which requires the switch to send, receive, and process packets itself), and probably Ethernet flow control as well. If the switch is going to send out its own packets in addition to incoming traffic, it needs the intelligence to schedule this packet transmission somehow and deal with how it interacts with regular traffic.

If the switch supports VLANs, several things get more complicated (although VLAN support generally requires a 'managed switch', since you have to be able to configure the VLAN setup). In common configurations the switch will need to modify packets passing through to add or remove VLAN tags (as packets move between tagged and untagged ports). People will also want the switch to filter incoming packets, for example to drop a VLAN-tagged packet if the VLAN in question is not configured on that port. And they will expect all of this to still run at line speed with low latency. In addition, the switch will generally want to segment its Ethernet mapping table by VLAN, because bad things can happen if it's not.

(Port isolation, also known as "private VLANs", adds more complexity but now you're well up in managed switch territory.)

PS: Modern small network switches are 'simple' in the sense that all of this is typically implemented in a single chip or close to it; the Making a Linux-managed network switch article discusses a couple of them. But what is happening inside that IC is a marvel.

tech/NetworkSwitchesNotSimple written at 23:13:55


Brute force attackers seem to switch targets rapidly if you block them

Like everyone else, we have a constant stream of attackers trying brute force password guessing against us using SSH or authenticated SMTP, from a variety of source IPs. Some of the source IPs attack us at a low rate (although there can be bursts when a lot of them are trying), but some of them do so at a relatively high rate, high enough to be annoying. When I notice such IPs (ones making hundreds of attempts an hour, for example), I tend to put them in our firewall blocks. After recently starting to pay attention to what happens next, what I've discovered is that at least currently, most such high volume IPs give up almost immediately. Within a few minutes of being blocked their activity typically drops to nothing.

Once I thought about it, this behavior feels like an obvious thing for attackers to do. Attackers clearly have a roster of hosts they've obtained access to and a whole collection of target machines to try brute force attacks against, with very low expectations of success for any particular attack or target machine; to make up for the low success rate, they need to do as much as possible. Wasting resources on unresponsive machines cuts down the number of useful attacks they can make, so over time attackers have likely had a lot of motivation to move on rapidly when their target stops responding. If the target machine comes back some day, well, they have a big list, they'll get around to trying it again sometime.

The useful thing about this attacker behavior is that if attackers are going to entirely stop using an IP to attack you (at least for a reasonable amount of time) within a few minutes of it being blocked, you only need to block attacker IPs for those few minutes. After five or ten or twenty minutes, you can remove the IP block again. Since the attackers use a lot of IPs and their IPs may get reused later for innocent purposes, this is useful for keeping the size of firewall blocks down and limiting the potential impact of a mis-block.
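The bookkeeping for such short-lived blocks is simple; here's a sketch of the idea (in real life you'd pair this with actual firewall rules, for example via nftables sets with timeouts):

```python
import time

class ExpiringBlocklist:
    """Sketch of short-lived attacker blocks: each blocked IP expires
    automatically after ttl seconds, so the block list stays small and
    a mis-blocked or reassigned IP recovers on its own."""

    def __init__(self, ttl=600):  # ten minutes, per the discussion above
        self.ttl = ttl
        self.blocked = {}  # IP -> expiry time

    def block(self, ip, now=None):
        now = time.time() if now is None else now
        self.blocked[ip] = now + self.ttl

    def is_blocked(self, ip, now=None):
        now = time.time() if now is None else now
        expiry = self.blocked.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            del self.blocked[ip]  # lazily expire old entries
            return False
        return True

bl = ExpiringBlocklist(ttl=600)
bl.block("192.0.2.10", now=1000)
print(bl.is_blocked("192.0.2.10", now=1100))  # True
print(bl.is_blocked("192.0.2.10", now=1700))  # False (expired)
```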

(A traditional problem with putting IPs in your firewall blocks is that often you don't have a procedure to re-assess them periodically and remove them again. So once you block an IP, it can remain blocked for years, even after it gets turned over to someone completely different. This is especially the case with cloud provider IPs, which are both commonly used for attacks and then commonly turn over. Fast and essentially automated expiry helps a lot here.)

sysadmin/BruteAttackersSeemToGoAwayFast written at 22:27:17


Fedora 40 probably doesn't work with software RAID 0.90 format superblocks

On my home machine, I have an old pair of HDDs that have (had) four old software RAID mirrors. Because these were old arrays, they were set up with the old 0.90 superblock metadata format. For years the arrays worked fine, although I haven't actively used them since I moved my home machine to all solid state storage. However, when I upgraded from Fedora 39 to Fedora 40, things went wrong. When Fedora 40 booted, rather than finding four software RAID arrays on sdc1+sdd1, sdc2+sdd2, sdc3+sdd3, and sdc4+sdd4 respectively, Fedora 40 decided that the fourth RAID array was all I had, and it was on sdc plus sdd (the entire disks). Since the fourth array had an LVM logical volume that I was still mounting filesystems from, things went wrong from there.

One of the observed symptoms during the issue was that my /dev had no entries for the sdc and sdd partitions, although the kernel messages said they had been recognized. This led me to stopping the 'md53' array and running 'partprobe' on both sdc and sdd, which triggered an automatic assembly of the four RAID arrays. Of course this wasn't a long term solution, since I'd have to redo it (probably by hand) every time I rebooted my home machine. In the end I wound up pulling the old HDDs entirely, something I probably should have done a while back.

(This is filed as Fedora bug 2280699.)

Many of the ingredients of this issue seem straightforward. The old 0.90 superblock format is at the end of the object it's in, so a whole-disk superblock is at the same place as a superblock in the last partition on the disk, if the partition goes all the way to the end. If the entire disk has been assembled into a RAID array, it's reasonable to not register 'partitions' on it, since those are probably actually partitions inside the RAID array. But this doesn't explain why the bug started happening in Fedora 40; something seems to have changed so that Fedora 40's boot process 'sees' a whole disk RAID array based on the 0.90 format superblock at the end, where Fedora 39 did not.
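The offset collision can be illustrated with a little arithmetic. My understanding is that the 0.90 superblock sits in the last 64 KiB-aligned 64 KiB chunk of its device; since partitions are normally aligned themselves (to 1 MiB these days), a last partition that runs to the end of the disk puts its superblock at exactly the same absolute disk offset as a whole-disk array would:

```python
CHUNK = 64 * 1024  # the 0.90 superblock occupies the last 64 KiB-aligned
                   # 64 KiB chunk of its device

def sb090_offset(device_bytes):
    """Byte offset of a 0.90 superblock from the start of its device
    (a sketch of the kernel's calculation, as I understand it)."""
    return (device_bytes // CHUNK) * CHUNK - CHUNK

disk = 2 * 1000 * 1000 * 1000 * 1000  # a hypothetical 2 TB disk, in bytes
part_start = 1024 * 1024              # last partition starts 1 MiB in and
                                      # runs to the end of the disk

# Superblock of a whole-disk array, and of an array on the last partition,
# both expressed as absolute disk offsets:
whole_disk_sb = sb090_offset(disk)
partition_sb = part_start + sb090_offset(disk - part_start)
print(whole_disk_sb == partition_sb)  # True: the two are indistinguishable
```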

I don't know if other Linux distributions have also picked up whatever change in whatever software is triggering this in Fedora 40, or if they will; it's possible that this is a Fedora specific issue. But the general moral I think people should take from this is that if you still have software RAID arrays using superblock format 0.90, you need a plan to change that. The Linux Raid Wiki has a somewhat dangerous looking in-place conversion process, but I wouldn't want to try that without backups. And if you have software RAID arrays that old, they probably contain old filesystems that you may want to recreate so they pick up new features (which isn't always possible with an in-place conversion).

Sidebar: how to tell what superblock format you have

The simple way is to look at /proc/mdstat. If the status line for a software RAID array mentions a superblock version, you have that version, for example:


md26 : active raid1 sda4[0] sdb4[1]
      94305280 blocks super 1.2 [2/2] [UU]

This is a superblock 1.2 RAID array.

If the status line doesn't mention a 'super' version, then you have an old 0.90 superblock. For example:

md53 : active raid1 sdd4[1] sdc4[0]
      2878268800 blocks [2/2] [UU]
      bitmap: 0/22 pages [0KB], 65536KB chunk

Unless you made your software RAID arrays a very long time ago and have faithfully kept upgrading the system they live on ever since, you probably don't have superblock 0.90 format arrays.

(Although you could have deliberately asked mdadm to make new arrays with 0.90 format superblocks.)
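If you have a lot of machines to check, the 'no super version means 0.90' rule is easy to automate; a sketch:

```python
import re

def old_superblock_arrays(mdstat_text):
    """Scan /proc/mdstat output for arrays whose status line has no
    'super <version>' field, which (as described above) means they
    are using the old 0.90 superblock format."""
    old = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"(md\d+) : active", line)
        if m:
            current = m.group(1)
        elif current and "blocks" in line:
            if "super" not in line:
                old.append(current)
            current = None
    return old

mdstat = """\
md26 : active raid1 sda4[0] sdb4[1]
      94305280 blocks super 1.2 [2/2] [UU]

md53 : active raid1 sdd4[1] sdc4[0]
      2878268800 blocks [2/2] [UU]
      bitmap: 0/22 pages [0KB], 65536KB chunk
"""
print(old_superblock_arrays(mdstat))  # ['md53']
```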

linux/Fedora40OldSuperblockIssue written at 22:47:31


Some (big) mail senders do use TLS SNI for SMTP even without DANE

TLS SNI (Server Name Indication) is a modern TLS feature where clients that are establishing a TLS session with a server tell it what name they are connecting to, so the server can give them the right TLS server certificate. TLS SNI is essential for the modern web's widespread HTTPS hosting, and so every usable HTTPS-capable web client uses SNI. However, other protocols also use TLS, and whether or not the software involved uses SNI is much more variable.

DANE is a way to bind TLS certificates to domain names through DNS and DNSSEC. In particular it can be used to authenticate the SMTP connections used to deliver email (RFC 7672). When you use DANE with TLS over SMTP, using SNI is required and is also straightforward, because DNSSEC and DANE have told you (the software trying to deliver email over SMTP) what server name to use.

Recently, SNI came up on the Exim mailing list, where I learned that when it's sending email, Exim doesn't normally use SNI when establishing TLS over SMTP (unless it's using DANE). According to Exim developers on the mailing list, the reasons for this include not being sure of what TLS SNI name to use and uncertainties over whether SMTP servers would malfunction if given SNI information. This caused me to go look at our (Exim) logs for our incoming mail gateway, where I noticed that although we don't use DANE and don't have DNSSEC, a number of organizations sending email to us were using SNI when they established their TLS sessions (helpfully, Exim logs this information). In fact, the SNI information logged is more interesting than I expected.

We have a straightforward inbound mail situation; our domains have a single DNS MX record to a specific host name that has a direct DNS A record (IP address). Despite that, a small number of senders supplied wild SNI names of 'dummy' (mostly from what look like spammers), an RFC 1918 IP address (a sendnode.com host), and the IP address of the inbound mail gateway (from barracuda.com). However, most sending mailers that used SNI at all provided our inbound mail gateway's host name as the SNI name.

Using yesterday's logs because it's easy, roughly 40% of the accepted messages were sent using SNI; a better number is that about 46% of the messages that used TLS at all were using SNI (roughly 84% of the accepted incoming messages used TLS). One reason the percentage of SNI is so high is that a lot of the SNI sources are large, well known organizations (often ones with a lot invested in email), including amazonses.com, outlook.com, google.com, quora.com, uber.com, mimecast.com, statuspage.io, sendgrid.net, and mailgun.net.
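The percentages are related in the obvious way; as a sketch of the arithmetic, using made-up per-message flags in roughly the proportions above (rather than real Exim log lines):

```python
def sni_stats(messages):
    """Compute the percentages discussed above from a list of
    (used_tls, used_sni) flags, one per accepted message (a stand-in
    for what you would extract from a mailer's logs)."""
    total = len(messages)
    tls = sum(1 for t, _ in messages if t)
    sni = sum(1 for t, s in messages if t and s)
    return {
        "sni_of_all": 100 * sni / total,
        "sni_of_tls": 100 * sni / tls,
        "tls_of_all": 100 * tls / total,
    }

# A made-up sample in roughly the proportions from the day's logs:
msgs = [(True, True)] * 40 + [(True, False)] * 44 + [(False, False)] * 16
stats = sni_stats(msgs)
print(round(stats["sni_of_all"]))  # 40
print(round(stats["tls_of_all"]))  # 84
print(round(stats["sni_of_tls"]))  # 48
```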

Given this list of organizations that are willing to use SNI when talking to what is effectively a random server on the Internet with nothing particularly special about its DNS setup, my assumption is that today, sending SNI when you set up TLS over SMTP doesn't hurt delivery very much. At the same time, that some people's software sends bogus values suggests that fumbling the SNI name doesn't do too much harm, which is often unlike the situation with HTTPS.

PS: I suspect that the software setting 'dummy' as the SNI name isn't actually mail software, but is instead some dedicated spam sending software that's using a TLS library that has a default SNI name set and is (of course) not overriding the name, much as some web spider software doesn't specifically set the HTTP User-Agent and so inherits whatever vague User-Agent their HTTP library defaults to.

spam/MailersUseSNIWithoutDANE written at 22:31:36
