2024-09-15
Why we're interest in FreeBSD lately (and how it relates to OpenBSD here)
We have a long and generally happy history of using OpenBSD and PF for firewalls. To condense a long story, we're very happy with the PF part of our firewalls, but we're increasingly not as happy with the OpenBSD part (outside of PF). Part of our lack of cheer is the state of OpenBSD's 10G Ethernet support when combined with PF, but there are other aspects as well; we never got OpenBSD disk mirroring to be really useful and eventually gave up on it.
We wound up looking at FreeBSD after another incident with OpenBSD doing weird and unhelpful hardware things, because we're a little tired of the whole area. Our perception (which may not be reality) is that FreeBSD likely has better driver support for modern hardware, including 10G cards, and has gone further on SMP support for networking, hopefully including PF. The last time we looked at this, OpenBSD PF was more or less limited by single-'core' CPU performance, especially when used in bridging mode (which is what our most important firewall uses). We've seen fairly large bandwidth rates through our OpenBSD PF firewalls (in the 800 MBytes/sec range), but never full 10G wire bandwidth, so we've wound up suspecting that our network speed is partly being limited by OpenBSD's performance.
(To get to this good performance we had to buy servers that focused on single-core CPU performance. This created hassles in our environment, since these special single-core performance servers had to be specially reserved for OpenBSD firewalls. And single-core performance isn't going up all that fast.)
FreeBSD has a version of PF that's close enough to OpenBSD's older versions to accept much or all of the syntax of our pf.conf files (we're not exactly up to the minute on our use of PF features and syntax). We also perceive FreeBSD as likely more normal to operate than OpenBSD has been, making it easier to integrate into our environment (although we'd have to actually operate it for a while to see if that was actually the case). If FreeBSD has great 10G performance on our current generation commodity servers, without needing to buy special servers for it, and fixes other issues we have with OpenBSD, that makes it potentially fairly attractive.
(To be clear, I think that OpenBSD is (still) a great operating system if you're interested in what it has to offer for security and so on. But OpenBSD is necessarily opinionated, since it has a specific focus, and we're not really using OpenBSD for that focus. Our firewalls don't run additional services and don't let people log in, and some of them can only be accessed over a special, unrouted 'firewall' subnet.)
2024-09-14
Getting maximum 10G Ethernet bandwidth still seems tricky
For reasons outside the scope of this entry, I've recently been trying to see how FreeBSD performs on 10G Ethernet when acting as a router or a bridge (both with and without PF turned on). This pretty much requires at least two more 10G test machines, so that the FreeBSD server can be put between them. When I set up these test machines, I didn't think much about them so I just grabbed two old servers that were handy (well, reasonably handy), stuck a 10G card into each, and set them up. Then I actually started testing their network performance.
I'm used to 1G Ethernet, where long ago it became trivial to achieve full wire bandwidth, even bidirectional full bandwidth (with test programs; there are many things that can cause real programs to not get this). 10G Ethernet does not seem to be like this today; the best I could do was get close to around 950 MBytes a second in one direction (which is not 10G's top speed). With the right circumstances, bidirectional traffic could total to just over 1 GByte a second, which is of course nothing like what we'd like to see.
(This isn't a new problem with 10G Ethernet, but I was hoping this had been solved in the past decade or so.)
There's a lot of things that could be contributing to this, like the speed of the CPU (and perhaps RAM), the specific 10G hardware I was using (including if it lacked performance increasing features that more expensive hardware would have had), and Linux kernel or driver issues (although this was Ubuntu 24.04, so I would hope that they were sorted out). I'm especially wondering about CPU limitations, because the kernel's CPU usage did seem to be quite high during my tests and, as mentioned, they're old servers with old CPUs (different old CPUs, even, one of which seemed to perform a bit better than the other).
(For the curious, one was a Celeron G530 in a Dell R210 II and the other a Pentium G6950 in a Dell R310, both of which date from before 2016 and are something like four generations back from our latest servers (we've moved on slightly since 2022).)
Mostly this is something I'm going to have to remember about 10G Ethernet in the future. If I'm doing anything involving testing its performance, I'll want to use relatively modern test machines, possibly several of them to create aggregate traffic, and then I'll want to start out by measuring the raw performance those machines can give me under the best circumstances. Someday perhaps 10G Ethernet will be like 1G Ethernet for this, but that's clearly not the case today (in our environment).
2024-09-13
Threads, asynchronous IO, and cancellation
Recently I read Asynchronous IO: the next billion-dollar mistake? (via), and had a reaction to one bit of it. Then yesterday on the Fediverse I said something about IO in Go:
I really wish you could (easily) cancel io Reads (and Writes) in Go. I don't think there's any particularly straightforward way to do it today, since the io package was designed way before contexts were a thing.
(The underlying runtime infrastructure can often actually do this because it decouples 'check for IO being possible' from 'perform the IO', but stuff related to this is not actually exposed.)
Today this sparked a belated realization in my mind, which is that a model of threads performing blocking IO in each thread is simply a harder environment to have some sort of cancellation in than an asynchronous or 'event loop' environment. The core problem is that in their natural state, threads are opaque and therefor difficult to interrupt or stop safely (which is part of why Go's goroutines can't be terminated from the outside). This is the natural inverse of how threads handle state for you.
(This is made worse if the thread is blocked in the operating system itself, for example in a 'read()' system call, because now you have to use operating system facilities to either interrupt the system call so the thread can return to user level to notice your user level cancellation, or terminate the thread outright.)
Asynchronous IO generally lets you do better in a relatively clean way. Depending on the operating system facilities you're using, either there is a distinction between the OS telling you that IO is possible and your program doing IO, providing you a chance to not actually do the IO, or in an 'IO submission' environment you generally can tell the OS to cancel a submitted but not yet completed IO request. The latter is racy, but in many situations the IO is unlikely to become possible right as you want to cancel it. Both of these let you implement a relatively clean model of cancelling a conceptual IO operation, especially if you're doing the cancellation as the result of another IO operation.
Or to put it another way, event loops may make you manage state explicitly, but that also means that that state is visible and can be manipulated in relatively natural ways. The implicit state held in threads is easy to write code with but hard to reason about and work with from the outside.
Sidebar: My particular Go case
I have a Go program that at its core involves two goroutines, one reading from standard input and writing to a network connection, one reading from the network connection and writing to standard output. Under some circumstances, the goroutine reading from the network will want to close down the network collection and return to a top level, where another two way connection will be made. In the process, it needs to stop the 'read from stdin, write to the network' goroutine while it is parked in 'read from stdin', without closing stdin (because that will be reused for the next connection).
To deal with this cleanly, I think I would have to split the 'read from standard input, write to the network' goroutine into two that communicated through a channel. Then the 'write to the network' side could be replaced separately from the 'read from stdin' side, allowing me to cleanly substitute a new network connection.
(I could also use global variables to achieve the same substitution, but let's not.)
2024-09-12
What admin access researchers have to their machines here
Recently on the Fediverse, Stephen Checkoway asked what level of access fellow academics had to 'their' computers to do things like install software (via). This is an issue very relevant to where I work, so I put a short-ish answer in the Fediverse thread and now I'm going to elaborate it at more length. Locally (within the research side of the department) we have a hierarchy of machines for this sort of thing.
At the most restricted are the shared core machines my group operates in our now-unusual environment, such as the mail server, the IMAP server, the main Unix login server, our SLURM cluster and general compute servers, our general purpose web server, and of course the NFS fileservers that sit behind all of this. For obvious reasons, only core staff have any sort of administrative access to these machines. However, since we operate a general Unix environment, people can install whatever they want to in their own space, and they can request that we install standard Ubuntu packages, which we mostly do (there are some sorts of packages that we'll decline to install). We do have some relatively standard Ubuntu features turned off for security reasons, such as "user namespaces", which somewhat limits what people can do without system privileges. Only our core machines live on our networks with public IPs; all other machines have to go on separate private "sandbox" networks.
The second most restricted are researcher owned machines that want to NFS mount filesystems from our NFS fileservers. By policy, these must be run by the researcher's Point of Contact, operated securely, and only the Point of Contact can have root on those machines. Beyond that, researchers can and do ask their Point of Contact to install all sorts of things on their machines (the Point of Contact effectively works for the researcher or the research group). As mentioned, these machines live on "sandbox" networks. Most often they're servers that the researcher has bought with grant funding, and there are some groups that operate more and better servers than we (the core group) do.
Next are non-NFS machines that people put on research group "sandbox" networks (including networks where some machines have NFS access); people do this with both servers and desktops (and sometimes laptops as well). The policies on who has what power over these machines is up to the research group and what they (and their Point of Contact) feel comfortable with. There are some groups where I believe the Point of Contact runs everything on their sandbox network, and other groups where their sandbox network is wide open with all sorts of people running their own machines, both servers and desktops. Usually if a researcher buys servers, the obvious person to have run them is their Point of Contact, unless the research work being done on the servers is such that other people need root access (or it's easier for the Point of Contact to hand the entire server over to a graduate student and have them run it as they need it).
Finally there are generic laptops and desktops, which normally go on our port-isolated 'laptop' network (called the 'red' network after the colour of network cables we use for it, so that it's clearly distinct from other networks). We (the central group) have no involvement in these machines and I believe they're almost always administered by the person who owns or at least uses them, possibly with help from that person's Point of Contact. These days, some number of laptops (and probably even desktops) don't bother with wired networking and use our wireless network instead, where similar 'it's yours' policies apply.
People who want access to their files from their self-managed desktop or laptop aren't left out in the cold, since we have a SMB (CIFS) server. People who use Unix and want their (NFS, central) home directory mounted can use the 'cifs' (aka 'smb3') filesystem to access it through our SMB server, or even use sshfs if they want to. Mounting via cifs or sshfs is in some cases superior to using NFS, because they can give you access to important shared filesystems that we can't NFS export to machines outside our direct control.
2024-09-11
Rate-limiting failed SMTP authentication attempts in Exim 4.95
Much like with SSH servers, if you have a SMTP server exposed to the Internet that supports SMTP authentication, you'll get a whole lot of attackers showing up to do brute force password guessing. It would be nice to slow these attackers down by rate-limiting their attempts. If you're using Exim, as we are, then this is possible to some degree. If you're using Exim 4.95 on Ubuntu 22.04 (instead of a more recent Exim), it's trickier than it looks.
One of Exim's ACLs, the ACL specified by acl_smtp_auth, is consulted just before Exim accepts a SMTP 'AUTH <something>' command. If this ACL winds up returning a 'reject' or a 'defer' result, Exim will defer or reject the AUTH command and the SMTP client will not be able to try authenticating. So obviously you need to put your ratelimit statement in this ACL, but there are two complications. First, this ACL doesn't have access to the login name the client is trying to authenticate (this information is only sent after Exim accepts the 'AUTH <whatever>' command), so all you can ratelimit is the source IP (or a network area derived from it). Second, this ACL happens before you know what the authentication result is, so you don't want to actually update your ratelimit in it, just check what the ratelimit is.
This leads to the basic SMTP AUTH ACL of:
acl_smtp_auth = acl_check_authbegin aclacl_check_auth: # We'll cover what this is for later warn set acl_c_auth = true deny ratelimit = 10 / 10m / per_cmd / readonly / $sender_host_address delay = 10s message = You are failing too many authentication attempts. # you might also want: # log_message = .... # don't forget this or you will be sad # (because no one will be able to authenticate) accept
(The 'delay = 10s' usefully slows down our brute force SMTP authentication attackers because they seem to wait for the reply to their SMTP AUTH command rather than giving up and terminating the session after a couple of seconds.)
This ratelimit is read-only because we don't want to update it unless the SMTP authentication fails; otherwise, you will wind up (harshly) rate-limiting legitimate people who repeatedly connect to you, authenticate, perhaps send an email message, and then disconnect. Since we can't update the ratelimit in the SMTP AUTH ACL, we need to somehow recognize when authentication has failed and update the ratelimit in that place.
In Exim 4.97 and later, there's a convenient and direct way to do this through the events system and the 'auth:fail' event that is raised by an Exim server when SMTP authentication fails. As I understand it, the basic trick is that you make the auth:fail event invoke a special ACL, and have the user ACL update the ratelimit. Unfortunately Ubuntu 22.04 has Exim 4.95, so we must be more clever and indirect, and as a result somewhat imperfect in what we're doing.
To increase the ratelimit when SMTP authentication has failed, we add an ACL that is run at the end of the connection and increases the ratelimit if an authentication was attempted but did not succeed, which we detect by the lack of authentication information. Exim has two possible 'end of session' ACL settings, one that is used if the session is ended with a SMTP QUIT command and one that is ended if the SMTP session is just ended without a QUIT.
So our ACL setup to update our ratelimit looks like this:
[...] acl_smtp_quit = acl_count_failed_auth acl_smtp_notquit = acl_count_failed_auth begin acl [...] acl_count_failed_auth: warn: condition = ${if bool{$acl_c_auth} } !authenticated = * ratelimit = 10 / 10m / per_cmd / strict / $sender_host_address accept
Our $acl_c_auth SMTP connection ACL variable tells us whether or not the connection attempted to authenticate (sometimes legitimate people simply connect and don't do anything before disconnecting), and then we also require that the connection not be authenticated now to screen out people who succeeded in their SMTP authentication. The settings for the two 'ratelimit =' settings have to match or I believe you'll get weird results.
(The '10 failures in 10 minutes' setting works for us but may not work for you. If you change the 'deny' to 'warn' in acl_check_auth and comment out the 'message =' bit, you can watch your logs to see what rates real people and your attackers actually use.)
The limitation on this is that we're actually increasing the ratelimit based not on the number of (failed) SMTP authentication attempts but on the number of connections that tried but failed SMTP authentication. If an attacker connects and repeatedly tries to do SMTP AUTH in the session, failing each time, we wind up only counting it as a single 'event' for ratelimiting because we only increase the ratelimit (by one) when the session ends. For the brute force SMTP authentication attackers we see, this doesn't seem to be an issue; as far as I can tell, they disconnect their session when they get a SMTP authentication failure.
2024-09-10
Ways ATX power supply control could work on server motherboards
Yesterday I talked about how ATX power supply control seems to work on desktop motherboards, which is relatively straightforward; as far as I can tell from various sources, it's handled in the chipset (on modern Intel chipsets, in the PCH), which is powered from standby power by the ATX power supply. How things work on servers is less clear. Here when I say 'server' I mean something with a BMC (Baseboard management controller), because allowing you to control the server's power supply is one of the purposes of a BMC, which means the BMC has to hook into this power management picture.
There appear to be a number of ways that the power control and management could or may be done and the BMC connected to it. People on the Fediverse replying to my initial question gave me a number of possible answers:
- The power management still happens in the chipset and the
BMC has a PCIe lane to the chipset
that it uses to have the chipset do power things.
- The power management happens in the chipset but the physical
power button is wired to the BMC and the BMC controls the 'power
button wire' that goes to the chipset.
To power on or off the server the BMC basically pretends there's
a person doing the appropriate things with the power button,
and then lets the chipset handle it.
(The BMC has various ways to tell what the current state of the power supply is.)
- The BMC is where parts of the chipset functionality are implemented, including the power button parts.
I found documentation for some of Intel's older Xeon server chipsets (with provisions for BMCs) and as of that generation, power management was still handled in the PCH and described in basically the same language as for desktops. I couldn't spot a mention of special PCH access for the BMC, so BMC control over server power might have been implemented with the 'BMC controls the power button wire' approach.
I can also imagine hybrid approaches. For example, you could in theory give the BMC control over the 'turn power on' wire to the power supplies, and route the chipset's version of that line to the BMC, in addition to routing the power button wire to the BMC. Then the BMC would be in a position to force a hard power off even if something went wrong in the chipset (or a hard power on, although if the chipset refuses to trigger a power on there might be a good reason for that).
(Server power supplies aren't necessarily 'ATX' power supplies as such, but I suspect that they all have similar standby power, 'turn power on', and 'is the PSU power stable' features as ATX PSUs do. Server PSUs often clearly aren't plain ATX units because they allow the BMC to obtain additional information on things like the PSU's state, temperature, current power draw, and so on.)
Our recent experience with BMCs that wouldn't let their servers power on when they should have suggests that on these servers (both Dell R340s), the BMC has some sort of master control or veto power over the normal 'return to last state' settings in the BIOS. At the same time, the 'what to do after AC power returns' setting is in the BIOS, not in the BMC, so it seems that the BMC is not the sole thing controlling power.
(I tried to take a look at how this was done in OpenBMC, but rapidly got lost in a twisty maze of things. I think at least some of the OpenBMC supported hardware does this through I2C commands, although what I2C device it's talking to is a good question. Some of the other hardware appears to have GPIO signal definitions for power related stuff, including power button definitions.)
2024-09-09
How ATX power supply control seems to work on desktop motherboards
Somewhat famously, the power button on x86 PC desktop machines with ATX power supplies is not a 'hard' power switch that interrupts or enables power through the ATX PSU but a 'soft' button that is controlled by the overall system. The actual power delivery is at least somewhat under software control, both the operating system (which enables modern OSes to actually power off the machine under software control) and the 'BIOS', broadly defined, which will do things like signal the OS to do an orderly shutdown if you merely tap the power button instead of holding it down for a few seconds. Because they're useful, 'soft' power buttons and the associated things have also spread to laptops and servers, even if their PSUs are not necessarily 'ATX' as such. After recent events, I found myself curious about actually did handle the chassis power button and associated things. Asking on the Fediverse produced a bunch of fascinating answers, so today I'm starting with plain desktop motherboards, where the answer seems to be relatively straightforward.
(As I looked up once, physically the power button is normally a momentary-contact switch that is open (off) when not pressed. A power button that's stuck 'pressed' can have odd effects.)
At the direct electrical level, ATX PSUs are either on, providing their normal power, or "off", which is not really completely off but has the PSU providing +5V standby power (with a low current limit) on a dedicated pin (pin 9, the ATX cable normally uses a purple wire for this). To switch an ATX PSU from "off" to on, you ground the 'power on' pin and keep it grounded (pin 16; the green wire in normal cables, and ground is black wires). After a bit of stabilization time, the ATX PSU will signal that all is well on another pin (pin 8, the grey wire). The ATX PSU's standby power is used to power the RTC and associated things, to provide the power for features like wake-on-lan (which requires network ports to be powered up at least a bit), and to power whatever handles the chassis power button when the PSU is "off".
On conventional desktop motherboards, the actual power button handling appears to be in the PCH or its equivalent (per @rj's information on the ICH, and also see Whitequark's ICH/PCH documentation links). In the ICH/PCH, this is part of general power management, including things like 'suspend to RAM'. Inside the PCH, there's a setting (or maybe two or three) that controls what happens when external power is restored; the easiest to find one is called AFTERG3_EN, which is a single bit in one of the PCH registers. To preserve this register's settings over loss of external power, it's part of what the documentation calls the "RTC well", which is apparently a chunk of stuff that's kept powered as part of the RTC, either from standby power or from the RTC's battery (depending on whether or not there's external power available). The ICH/PCH appears to have a direct "PWRBTN#" input line, which is presumably eventually connected to the chassis power button, and it directly implements the logic for handling things like the 'press and hold for four seconds to force a power off' feature (which Intel describes as 'transitioning to S5', the "Soft-Off" state).
('G3' is the short Intel name for what Intel calls "Mechanical Off", the condition where there's no external power. This makes the AFTERG3_EN name a bit clearer.)
As far as I can tell there's no obvious and clear support for the modern BIOS setting of 'when external power comes back, go to your last state'. I assume that what actually happens is that the ICH/PCH register involved is carefully updated by something (perhaps ACPI) as the system is powered on and off. When the system is powered on, early in the sequence you'd set the PCH to 'go to S0 after power returns'; when the system is powered off, right at the end you'd set the PCH to 'stay in S5 after power returns'.
(And apparently you can fiddle with this register yourself (via).)
All of the information I've dug up so far is for Intel ICH/PCH, but I suspect that AMD's chipsets work in a similar manner. Something has to do power management for suspend and sleep, and it seems that the chipset is the natural spot for it, and you might as well put the 'power off' handling into the same place. Whether AMD uses the same registers and the same bits is an open question, since I haven't turned up any chipset documentation so far.
2024-09-08
I should probably reboot BMCs any time they behave oddly
Today on the Fediverse I said:
It has been '0' days since I had to reset a BMC/IPMI for reasons (in this case, apparently something power related happened that glitched the BMC sufficiently badly that it wasn't willing to turn on the system power). Next time a BMC is behaving oddly I should just immediately tell it to cold reset/reboot and see, rather than fiddling around.
(Assuming the system is already down. If not, there are potential dangers in a BMC reset.)
I've needed to reset a BMC before, but this time was more odd and less clear than the KVM over IP that wouldn't accept the '2' character.
We apparently had some sort of power event this morning, with a number of machines abruptly going down (distributed across several different PDUs). Most of the machines rebooted fine, either immediately or after some delay. A couple of the machines did not, and conveniently we had set up their BMCs on the network (although they didn't have KVM over IP). So I remotely logged in to their BMC's web interface, saw that the BMC was reporting that the power was off, and told the BMC to power on.
Nothing happened. Oh, the BMC's web interface accepted my command, but the power status stayed off and the machines didn't come back. Since I had a bike ride to go to, I stopped there. After I came back from the bike ride I tried some more things (still remotely). One machine I could remotely power cycle through its managed PDU, which brought it back. But the other machine was on an unmanaged PDU with no remote control capability. I wound up trying IPMI over the network (with ipmitool), which had no better luck getting the machine to power on, and then I finally decided to try resetting the BMC. That worked, in that all of a sudden the machine powered on the way it was supposed to (we set the 'what to do after power comes back' on our machines to 'last power state', which would have been 'powered on').
As they say, I have questions. What I don't have is any answers. I believe that the BMC's power control talks to the server's motherboard, instead of to the power supply units, and I suspect that it works in a way similar to desktop ATX chassis power switches. So maybe the BMC software had a bug, or some part of the communication between the BMC and the main motherboard circuitry got stuck or desynchronized, or both. Resetting the BMC would reset its software, and it could also force a hardware reset to bring the communication back to a good state. Or something else could be going on.
(Unfortunately BMCs are black boxes that are supposed to just work, so there's no way for ordinary system administrators like me to peer inside.)
2024-09-07
I wish (Linux) WireGuard had a simple way to restrict peer public IPs
WireGuard is an obvious tool to build encrypted, authenticated connections out of, over which you can run more or less any network service. For example, you might expose the rsync daemon only over a specific WireGuard interface, instead of running rsync over SSH. Unfortunately, if you want to use WireGuard as a SSH replacement in this fashion, it has one limitation; unlike SSH, there's no simple way to restrict the public IP address of a particular peer.
The rough equivalent of a WireGuard peer is a SSH keypair. In SSH, you can restrict where a keypair will be accepted from with the 'from="..."' restriction in your .ssh/authorized_keys. This provides an extra layer of protection against the key being compromised; not only does an attacker have to acquire the key, they have to be able to use it from exactly the same IP (or the expected IPs). However, more or less by design WireGuard doesn't have a particular restriction on where a WireGuard peer key can be used from. You can set an expected public IP for the peer, but if the peer contacts you from another IP, your (Linux kernel) WireGuard will update its idea of where the peer is. This is handy for WireGuard's usual usage cases but not what we necessarily want for a wired down connection where the IPs should never change.
(I don't think this is a technical restriction in the WireGuard protocol, just something not done in most or all implementations.)
The normal answer is firewall rules that restrict access to the WireGuard port, but this has two limitations. The first and lesser limitation is that it's external to WireGuard, so it's possible to have WireGuard active but your firewall rules not properly applied, theoretically allowing more access than you intend. The bigger limitation is that if you have more than one such wired down WireGuard peer, firewall rules can't tell which WireGuard peer key is being used by which external peer. So in a straightforward implementation of firewall rules, any peer public IP can impersonate any other (if it has the required WireGuard peer key), which is different from the SSH 'from="..."' situation, where each key is restricted separately.
(On the other hand, the firewall situation is better in one way in that you can't accidentally add a WireGuard peer that will be accepted from anywhere the way you can with a SSH key by forgetting to put in a 'from="..."' restriction.)
To get firewall rules that can tell peers apart, you need to use different listening ports for each peer on your end. Today, this requires different WireGuard interfaces (and probably different server keys) for each peer. I think you can probably give all of the interfaces the same internal IP to simplify your life, although I haven't tested this.
(Having written this entry, I now wonder if it would be possible to write an nftables or iptables extension that hooked into the kernel side of WireGuard enough to know peer identities and let you match on them. Existing extensions are already able to be aware of various things like cgroup membership, and there's an existing extension for IPsec. Possibly you could do this with eBPF programs, since there's a BPF/eBPF iptables extension.)
Operating system threads are always going to be (more) expensive
Recently I read Asynchronous IO: the next billion-dollar mistake? (via). Among other things, it asks:
Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads [...]
I don't think this would have worked as well as you'd like, at least not with any conventional operating system. One of the core problems with making operating system threads really fast is the 'operating system' part.
A characteristic of all mainstream operating systems is that the operating system kernel operates in a separate hardware security domain than regular user (program) code. This means that any time the operating system becomes involved, the CPU must do at least two transitions between these security domains (into kernel mode and then back out). Doing these transitions is always more costly than not doing them, and on top of that the CPU's ISA often requires the operating system go through non-trivial work in order to be safe from user level attacks.
(The whole speculative execution set of attacks has only made this worse.)
A great deal of the low level work of modern asynchronous IO is about not crossing between these security domains, or doing so as little as possible. This is summarized as 'reducing system calls because they're expensive', which is true as far as it goes, but even the cheapest system call possible still has to cross between the domains (if it is an actual system call; some operating systems have 'system calls' that manage to execute entirely in user space).
The less that doing things with threads crosses the CPU's security boundary into (and out of) the kernel, the faster the threads go but the less we can really describe them as 'OS threads' and the harder it is to get things like forced thread preemption. And this applies not just for the 'OS threads' themselves but also to their activities. If you want 'OS threads' that perform 'synchronous IO through simple system calls', those IO operations are also transitioning into and out of the kernel. If you work to get around this purely through software, I suspect that what you wind up with is something that looks a lot like 'green' (user-space) threads with asynchronous IO once you peer behind the scenes of the abstractions that programs are seeing.
(You can do this today, as Go's runtime demonstrates. And you still benefit significantly from the operating system's high efficiency asynchronous IO, even if you're opting to use a simpler programming model.)
(See also thinking about event loops versus threads.)