2024-10-01
Two views of what a TLS certificate verifies
One of the things that you could ask about TLS is what a validated TLS certificate means or is verifying. Today there is a clear answer, as specified by the CA/Browser Forum, and that answers is that when you successfully connect to https://microsoft.com/, you are talking to the "real" microsoft.com, not an impostor who is intercepting your traffic in some way. This is known as 'domain control' in the jargon; to get a TLS certificate for a domain, you must demonstrate that you have control over the domain. The CA/Browser Forum standards (and the browsers) don't require anything else.
Historically there has been a second answer, what TLS (then SSL) sort of started with. A TLS certificate was supposed to verify that not just the domain but that you were talking to the real "Microsoft" (which is to say the large, world wide corporation with its headquarters in Redmond WA, not any other "Microsoft" that might exist). More broadly, it was theoretically verifying that you were talking to a legitimate and trustworthy site that you could, for example, give your credit card number to over the Internet, which used to be a scary idea.
This second answer has a whole raft of problems in practice, which is why the CA/Browser Forum has adopted the first answer, but it started out and persists because it's much more useful to actual people. Most people care about talking to (the real) Google, not some domain name, and domain names are treacherous things as far as identity goes (consider IDN homograph attacks, or just 'facebook-auth.com'). We rather want this human version of identity and it would be very convenient if we could have it. But we can't. The history of TLS certificates has convincingly demonstrated that this version of identity has comprehensively failed for a collection of reasons including that it's hard, expensive, difficult or impossible to automate, and (quite) fallible.
(The 'domain control' version of what TLS certificates mean can be automated because it's completely contained within the Internet. The other version is not; in general you can't verify that sort of identity using only automated Internet resources.)
A corollary of this history is that no Internet protocol that's intended for wide spread usage can assume a 'legitimate identity' model of participants. This includes any assumption that people can only have one 'identity' within your system; in practice, since Internet identity can only verify that you are something, not that you aren't something, an attacker can have as many identities as they want (including corporate identities).
PS: The history of commercial TLS certificates also demonstrates that you can't use costing money to verify legitimacy. It sounds obvious to say it, but all that charging someone money demonstrates is that they willing and able to spend some money (perhaps because they have a pet cause), not that they're legitimate.
2024-09-21
TLS certificates were (almost) never particularly well verified
Recently there was a little commotion in the TLS world, as discussed in We Spent $20 To Achieve RCE And Accidentally Became The Admins Of .MOBI. As part of this adventure, the authors of the article discovered that some TLS certificate authorities were using WHOIS information to validate who controlled a domain (so if you could take over a WHOIS server for a TLD, you could direct domain validation to wherever you wanted). This then got some people to realize that TLS Certificate Authorities were not actually doing very much to verify who owned and controlled a domain. I'm sure that there were also some people who yearned for a hypothetical old days when Certificate Authorities actually did that, as opposed to the modern days when they don't.
I'm afraid I have bad news for anyone with this yearning. Certificate Authorities have never done a particularly strong job of verifying who was asking for a TLS (then SSL) certificate. I will go further and be more controversial; we don't want them to be thorough about identity verification for TLS certificates.
There are a number of problems with identity verification in theory and in practice, but one of them is that it's expensive, and the more thorough and careful the identity verification, the more expensive it is. No Certificate Authority is in a position to absorb this expense, so a world where TLS certificates are carefully verified is also a world where they are expensive. It's also probably a world where they're difficult or impossible to obtain from a Certificate Authority that's not in your country, because the difficulty of identity verification goes up significantly in that case.
(One reason that thorough and careful verification is expensive is that it takes significant time from experienced, alert humans, and that time is not cheap.)
This isn't the world that we had even before Let's Encrypt created the ACME protocol for automated domain verifications. The pre-LE world might have started out with quite expensive TLS certificates, but it shifted fairly rapidly to ones that cost only $100 US or less, which is a price that doesn't cover very much human verification effort. And in that world, with minimal human involvement, WHOIS information is probably one of the better ways of doing such verification.
(Such a world was also one without a lot of top level domains, and most of the TLDs were country code TLDs. The turnover in WHOIS servers was probably a lot smaller back then.)
PS: The good news is that using WHOIS information for domain verification is probably on the way out, although how soon this will happen is an open question.
2024-09-13
Threads, asynchronous IO, and cancellation
Recently I read Asynchronous IO: the next billion-dollar mistake? (via), and had a reaction to one bit of it. Then yesterday on the Fediverse I said something about IO in Go:
I really wish you could (easily) cancel io Reads (and Writes) in Go. I don't think there's any particularly straightforward way to do it today, since the io package was designed way before contexts were a thing.
(The underlying runtime infrastructure can often actually do this because it decouples 'check for IO being possible' from 'perform the IO', but stuff related to this is not actually exposed.)
Today this sparked a belated realization in my mind, which is that a model of threads performing blocking IO in each thread is simply a harder environment to have some sort of cancellation in than an asynchronous or 'event loop' environment. The core problem is that in their natural state, threads are opaque and therefor difficult to interrupt or stop safely (which is part of why Go's goroutines can't be terminated from the outside). This is the natural inverse of how threads handle state for you.
(This is made worse if the thread is blocked in the operating system itself, for example in a 'read()' system call, because now you have to use operating system facilities to either interrupt the system call so the thread can return to user level to notice your user level cancellation, or terminate the thread outright.)
Asynchronous IO generally lets you do better in a relatively clean way. Depending on the operating system facilities you're using, either there is a distinction between the OS telling you that IO is possible and your program doing IO, providing you a chance to not actually do the IO, or in an 'IO submission' environment you generally can tell the OS to cancel a submitted but not yet completed IO request. The latter is racy, but in many situations the IO is unlikely to become possible right as you want to cancel it. Both of these let you implement a relatively clean model of cancelling a conceptual IO operation, especially if you're doing the cancellation as the result of another IO operation.
Or to put it another way, event loops may make you manage state explicitly, but that also means that that state is visible and can be manipulated in relatively natural ways. The implicit state held in threads is easy to write code with but hard to reason about and work with from the outside.
Sidebar: My particular Go case
I have a Go program that at its core involves two goroutines, one reading from standard input and writing to a network connection, one reading from the network connection and writing to standard output. Under some circumstances, the goroutine reading from the network will want to close down the network collection and return to a top level, where another two way connection will be made. In the process, it needs to stop the 'read from stdin, write to the network' goroutine while it is parked in 'read from stdin', without closing stdin (because that will be reused for the next connection).
To deal with this cleanly, I think I would have to split the 'read from standard input, write to the network' goroutine into two that communicated through a channel. Then the 'write to the network' side could be replaced separately from the 'read from stdin' side, allowing me to cleanly substitute a new network connection.
(I could also use global variables to achieve the same substitution, but let's not.)
2024-09-10
Ways ATX power supply control could work on server motherboards
Yesterday I talked about how ATX power supply control seems to work on desktop motherboards, which is relatively straightforward; as far as I can tell from various sources, it's handled in the chipset (on modern Intel chipsets, in the PCH), which is powered from standby power by the ATX power supply. How things work on servers is less clear. Here when I say 'server' I mean something with a BMC (Baseboard management controller), because allowing you to control the server's power supply is one of the purposes of a BMC, which means the BMC has to hook into this power management picture.
There appear to be a number of ways that the power control and management could or may be done and the BMC connected to it. People on the Fediverse replying to my initial question gave me a number of possible answers:
- The power management still happens in the chipset and the
BMC has a PCIe lane to the chipset
that it uses to have the chipset do power things.
- The power management happens in the chipset but the physical
power button is wired to the BMC and the BMC controls the 'power
button wire' that goes to the chipset.
To power on or off the server the BMC basically pretends there's
a person doing the appropriate things with the power button,
and then lets the chipset handle it.
(The BMC has various ways to tell what the current state of the power supply is.)
- The BMC is where parts of the chipset functionality are implemented, including the power button parts.
I found documentation for some of Intel's older Xeon server chipsets (with provisions for BMCs) and as of that generation, power management was still handled in the PCH and described in basically the same language as for desktops. I couldn't spot a mention of special PCH access for the BMC, so BMC control over server power might have been implemented with the 'BMC controls the power button wire' approach.
I can also imagine hybrid approaches. For example, you could in theory give the BMC control over the 'turn power on' wire to the power supplies, and route the chipset's version of that line to the BMC, in addition to routing the power button wire to the BMC. Then the BMC would be in a position to force a hard power off even if something went wrong in the chipset (or a hard power on, although if the chipset refuses to trigger a power on there might be a good reason for that).
(Server power supplies aren't necessarily 'ATX' power supplies as such, but I suspect that they all have similar standby power, 'turn power on', and 'is the PSU power stable' features as ATX PSUs do. Server PSUs often clearly aren't plain ATX units because they allow the BMC to obtain additional information on things like the PSU's state, temperature, current power draw, and so on.)
Our recent experience with BMCs that wouldn't let their servers power on when they should have suggests that on these servers (both Dell R340s), the BMC has some sort of master control or veto power over the normal 'return to last state' settings in the BIOS. At the same time, the 'what to do after AC power returns' setting is in the BIOS, not in the BMC, so it seems that the BMC is not the sole thing controlling power.
(I tried to take a look at how this was done in OpenBMC, but rapidly got lost in a twisty maze of things. I think at least some of the OpenBMC supported hardware does this through I2C commands, although what I2C device it's talking to is a good question. Some of the other hardware appears to have GPIO signal definitions for power related stuff, including power button definitions.)
2024-09-09
How ATX power supply control seems to work on desktop motherboards
Somewhat famously, the power button on x86 PC desktop machines with ATX power supplies is not a 'hard' power switch that interrupts or enables power through the ATX PSU but a 'soft' button that is controlled by the overall system. The actual power delivery is at least somewhat under software control, both the operating system (which enables modern OSes to actually power off the machine under software control) and the 'BIOS', broadly defined, which will do things like signal the OS to do an orderly shutdown if you merely tap the power button instead of holding it down for a few seconds. Because they're useful, 'soft' power buttons and the associated things have also spread to laptops and servers, even if their PSUs are not necessarily 'ATX' as such. After recent events, I found myself curious about actually did handle the chassis power button and associated things. Asking on the Fediverse produced a bunch of fascinating answers, so today I'm starting with plain desktop motherboards, where the answer seems to be relatively straightforward.
(As I looked up once, physically the power button is normally a momentary-contact switch that is open (off) when not pressed. A power button that's stuck 'pressed' can have odd effects.)
At the direct electrical level, ATX PSUs are either on, providing their normal power, or "off", which is not really completely off but has the PSU providing +5V standby power (with a low current limit) on a dedicated pin (pin 9, the ATX cable normally uses a purple wire for this). To switch an ATX PSU from "off" to on, you ground the 'power on' pin and keep it grounded (pin 16; the green wire in normal cables, and ground is black wires). After a bit of stabilization time, the ATX PSU will signal that all is well on another pin (pin 8, the grey wire). The ATX PSU's standby power is used to power the RTC and associated things, to provide the power for features like wake-on-lan (which requires network ports to be powered up at least a bit), and to power whatever handles the chassis power button when the PSU is "off".
On conventional desktop motherboards, the actual power button handling appears to be in the PCH or its equivalent (per @rj's information on the ICH, and also see Whitequark's ICH/PCH documentation links). In the ICH/PCH, this is part of general power management, including things like 'suspend to RAM'. Inside the PCH, there's a setting (or maybe two or three) that controls what happens when external power is restored; the easiest to find one is called AFTERG3_EN, which is a single bit in one of the PCH registers. To preserve this register's settings over loss of external power, it's part of what the documentation calls the "RTC well", which is apparently a chunk of stuff that's kept powered as part of the RTC, either from standby power or from the RTC's battery (depending on whether or not there's external power available). The ICH/PCH appears to have a direct "PWRBTN#" input line, which is presumably eventually connected to the chassis power button, and it directly implements the logic for handling things like the 'press and hold for four seconds to force a power off' feature (which Intel describes as 'transitioning to S5', the "Soft-Off" state).
('G3' is the short Intel name for what Intel calls "Mechanical Off", the condition where there's no external power. This makes the AFTERG3_EN name a bit clearer.)
As far as I can tell there's no obvious and clear support for the modern BIOS setting of 'when external power comes back, go to your last state'. I assume that what actually happens is that the ICH/PCH register involved is carefully updated by something (perhaps ACPI) as the system is powered on and off. When the system is powered on, early in the sequence you'd set the PCH to 'go to S0 after power returns'; when the system is powered off, right at the end you'd set the PCH to 'stay in S5 after power returns'.
(And apparently you can fiddle with this register yourself (via).)
All of the information I've dug up so far is for Intel ICH/PCH, but I suspect that AMD's chipsets work in a similar manner. Something has to do power management for suspend and sleep, and it seems that the chipset is the natural spot for it, and you might as well put the 'power off' handling into the same place. Whether AMD uses the same registers and the same bits is an open question, since I haven't turned up any chipset documentation so far.
2024-09-07
Operating system threads are always going to be (more) expensive
Recently I read Asynchronous IO: the next billion-dollar mistake? (via). Among other things, it asks:
Now imagine a parallel universe where instead of focusing on making asynchronous IO work, we focused on improving the performance of OS threads [...]
I don't think this would have worked as well as you'd like, at least not with any conventional operating system. One of the core problems with making operating system threads really fast is the 'operating system' part.
A characteristic of all mainstream operating systems is that the operating system kernel operates in a separate hardware security domain than regular user (program) code. This means that any time the operating system becomes involved, the CPU must do at least two transitions between these security domains (into kernel mode and then back out). Doing these transitions is always more costly than not doing them, and on top of that the CPU's ISA often requires the operating system go through non-trivial work in order to be safe from user level attacks.
(The whole speculative execution set of attacks has only made this worse.)
A great deal of the low level work of modern asynchronous IO is about not crossing between these security domains, or doing so as little as possible. This is summarized as 'reducing system calls because they're expensive', which is true as far as it goes, but even the cheapest system call possible still has to cross between the domains (if it is an actual system call; some operating systems have 'system calls' that manage to execute entirely in user space).
The less that doing things with threads crosses the CPU's security boundary into (and out of) the kernel, the faster the threads go but the less we can really describe them as 'OS threads' and the harder it is to get things like forced thread preemption. And this applies not just for the 'OS threads' themselves but also to their activities. If you want 'OS threads' that perform 'synchronous IO through simple system calls', those IO operations are also transitioning into and out of the kernel. If you work to get around this purely through software, I suspect that what you wind up with is something that looks a lot like 'green' (user-space) threads with asynchronous IO once you peer behind the scenes of the abstractions that programs are seeing.
(You can do this today, as Go's runtime demonstrates. And you still benefit significantly from the operating system's high efficiency asynchronous IO, even if you're opting to use a simpler programming model.)
(See also thinking about event loops versus threads.)
2024-09-03
TLS Server Name Indications can be altered by helpful code
In TLS, the Server Name Indication is how (in the modern TLS world) you tell the TLS server what (server) TLS certificate you're looking for. A TLS server that has multiple TLS certificates available, such as a web server handling multiple websites, will normally use your SNI to decide what server TLS certificate to provide to you. If you provide an SNI that the TLS server doesn't know or don't provide a SNI at all, the TLS server can do a variety of things, but many will fall back to some default TLS certificate. Use of SNI is pervasive in web PKI but not always used elsewhere; for example, SMTP clients don't always send SNI when establishing TLS with a SMTP server.
The official specification for SNI is section 3 of RFC 6066, and it permits exactly one format of the SNI data, which is, let's quote:
"HostName" contains the fully qualified DNS hostname of the server, as understood by the client. The hostname is represented as a byte string using ASCII encoding without a trailing dot. [...]
Anything other than this is an incorrectly formatted SNI. In particular, sending a SNI using a DNS name with a dot at the end (the customary way of specifying a fully qualified name in the context of DNS) is explicitly not allowed under RFC 6066. RFC 6066 SNI names are always fully qualified and without the trailing dots.
So what happens if you provide a SNI with a trailing dot? That depends. In particular, if you're providing a name with a trailing dot to a client library or a client program that does TLS, the library may helpfully remove the trailing dot for you when it sends the SNI. Go's crypto/tls definitely behaves this way, and it seems that some TLS libraries may. Based on observing behavior on systems I have access to, I believe that OpenSSL does strip the trailing dot but GnuTLS doesn't, and probably Mozilla's NSS doesn't either (since Firefox appears to not do this).
(I don't know what a TLS server sees as the SNI if it uses these libraries, but it appears likely that OpenSSL doesn't strip the trailing dot but instead passes it through literally.)
This dot stripping behavior is generally silent, which can lead to confusion if you're trying to test the behavior of providing a trailing dot in the SNI (which can cause web servers to give you errors). At the same time it's probably sensible behavior for the client side of TLS libraries, since some of the time they will be deriving the SNI hostname from the host name the caller has given them to connect to, and the caller may want to indicate a fully qualified DNS name in the customary way.
PS: Because I looked it up, the Go crypto/tls client code strips a trailing dot while the server code rejects a TLS ClientHelo that includes a SNI with a trailing dot (which will cause the TLS connection to fail).
2024-09-01
The status of putting a '.' at the end of domain names
A variety of things that interact with DNS interpret the host or domain name 'host.domain.' (with a '.' at the end) as the same as the fully qualified name 'host.domain'; for example this appears in web browsers and web servers. At this point one might wonder whether this is an official thing in DNS or merely a common convention and practice. The answer is somewhat mixed.
In the DNS wire protocol, initially described in RFC 1035, we can read this (in section 3.1):
Domain names in messages are expressed in terms of a sequence of labels. Each label is represented as a one octet length field followed by that number of octets. Since every domain name ends with the null label of the root, a domain name is terminated by a length byte of zero. [...]
DNS has a 'root', which all DNS queries (theoretically) start from, and a set of DNS servers, the root nameservers, that answer the initial queries that tell you what the DNS servers are for the top level domain is (such as the '.edu' or the '.ca' DNS servers). In the wire format, this root is explicitly represented as a 'null label', with zero length (instead of being implicit). In the DNS wire format, all domain names are fully qualified (and aren't represented as plain text).
RFC 1035 also defines a textual format to represent DNS information, Master files. When processing these files there is usually an 'origin', and textual domain names may be relative to that origin or absolute. The RFC says:
[...] Domain names that end in a dot are called absolute, and are taken as complete. Domain names which do not end in a dot are called relative; the actual domain name is the concatenation of the relative part with an origin specified in a $ORIGIN, $INCLUDE, or as an argument to the master file loading routine. A relative name is an error when no origin is available.
So in textual DNS data that follows RFC 1035's format, 'host.domain.' is how you specify an absolute (fully qualified) DNS name, as opposed to one that is under the current origin. Bind uses this format (or something derived from it, here in 2024 I don't know if it's strictly RF 1035 compliant any more), and in hand-maintained Bind format zone files you can find lots of use of both relative and absolute domain names.
DNS data doesn't have to be represented in text in RFC 1035 form (and doing so has some traps), either for use by DNS servers or for use by programs who do things like look up domain names. However, it's not quite accurate to say that 'host.domain.' is only a convention. A variety of things use a more or less RFC 1035 format, and in those things a terminal '.' means an absolute name because that's how RFC 1035 says to interpret and represent it.
Since RFC 1035 uses a '.' at the end of a domain name to mean a fully qualified domain name, it's become customary for code to accept one even if the code already only deals with fully qualified names (for example, DNS lookup libraries). Every program that accepts or reports this format creates more pressure on other programs to accept it.
(It's also useful as a widely understood signal that the textual domain name returned through some API is fully qualified. This may be part of why Go's net package consistently returns results from various sorts of name resolutions with a terminating '.', including in things like looking up the name(s) of IP addresses.)
At the same time, this syntax for fully qualified domain names is explicitly not accepted in certain contexts that have their own requirements. One example is in email addresses, where 'user@some.domain.' is almost invariably going to be rejected by mail systems as a syntax error.
2024-08-31
In practice, abstractions hide their underlying details
Very broadly, there are two conflicting views of abstractions in computing. One camp says that abstractions simplify the underlying complexity but people still should know about what is behind the curtain, because all abstractions are leaky. The other camp says that abstractions should hide the underlying complexity entirely and do their best not to leak the details through, and that people using the abstraction should not need to know those underlying details. I don't particularly have a side, but I do have a pragmatic view, which is that many people using abstractions don't know the underlying details.
People can debate back and forth about whether people should know the underlying details and whether they are incorrect to not know them, but the well established pragmatic reality is that a lot of people writing a lot of code and building a lot of systems don't know more than a few of the details behind the abstractions that they use. For example, I believe that a lot of people in web development don't know that host and domain names can often have a dot at the end. And people who have opinions about programming probably have a favorite list of leaky abstractions that people don't know as much about as they should.
(One area a lot of programming abstractions 'leak' is around performance issues. For example, the (C)Python interpreter is often much faster if you make things local variables inside a function than if you use global variables because of things inside the abstraction it presents to you.)
That this happens should not be surprising. People have a limited amount of time and a limited amount of things that they can learn, remember, and keep track of. When presented with an abstraction, it's very attractive to not sweat the details, especially because no one can keep track of all of them. Computing is simply too complicated to see behind all of the abstractions all of the way down. Almost all of the time, your effort is better focused on learning and mastering your layer of the abstraction stack rather than trying to know 'enough' about every layer (especially when it's not clear how much is enough).
(Another reason to not dig too deeply into the details behind abstractions is that those details can change, especially if one reason the abstraction exists is to allow the details to change. We call some of these abstractions 'APIs' and discourage people investigating and using the specific details behind the current implementations.)
One corollary of this is that safety and security related abstractions need to be designed with the assumption that people using them won't know or remember all of the underlying details. If forgetting one of those details will leave people using the abstraction with security problems, the abstraction has a design flaw that will inevitably lead to a security issue sooner or later. This security issue is not the fault of the people using the abstraction, except in a mathematical security way.
2024-08-23
My (current) view on open source moral obligations and software popularity
A while back I said something pretty strong in a comment on my entry on the Linux kernel CVE story:
(I feel quite strongly that the importance of a project cannot create substantial extra obligations on the part of the people working on the project. We do not get to insist that other people take on more work just because their project got popular. In my view, this is a core fallacy at the heart of a lot of "software supply chain security" stuff, and I think things like the Linux kernel CVE handling are the tip of an iceberg of open source reactions to it.)
After writing that, I thought about it more and I think I have a somewhat more complicated view on moral obligations (theoretically) attached to open source software. To try to boil it down, I feel that other people's decisions should not create a moral obligation on your part.
If you write a project to scratch your itch and a bunch of other people decide to use it too, that is on them, not on you. You have no moral obligation to them that accrues because they started using your software, however convenient it might be for them if you did or however much time might be saved if you did something instead of many or all of them doing something. Of course you may be a nice person, and you may also be the kind of person who is extremely conscious of how many people are relying on your software and what might happen to them if you did or didn't do various things, but that is your decision. You don't have a positive moral obligation to them.
(It's my view that this lack of obligations is a core part of what makes free software and open source software work at all. If releasing open source software came with firm moral or legal obligations, we would see far less of it.)
However, in a bit of a difference from what I implied in my comment, I also feel that while other people's actions don't create a moral obligation on you, your own actions may. If you go out and actively promote your software, try to get it widely used, put forward that you're responsive and stand ready to fix problems, and so on, then the moral waters are at least muddy. If you explicitly acted to put yourself and your software forward, other people sort of do have the (moral) right to assume that you're going to live up to your promises (whether they're explicit or implicit). However, there has to be a line somewhere; you shouldn't acquire an unlimited, open-ended obligation to do work for other people using your software just because you promoted your software a bit.
(The issue of community norms is another thing entirely. I'm sure there are some software communities where merely releasing something into the community comes with the social expectation that you'll support it.)