2025-01-10
The problem with combining DNS CNAME records and anything else
A famous issue when setting up DNS records for domains is that you can't combine a CNAME record with any other type, such as an MX record or a SOA (which is required at the top level of a domain). One modern reason that you would want such a CNAME record is that you're hosting your domain's web site at some provider and the provider wants to be able to change what IP addresses it uses for this, so from the provider's perspective they want you to CNAME your 'web site' name to 'something.provider.com'.
The obvious reason for 'no CNAME and anything else' is 'because the RFCs say so', but this is unsatisfying. Recently I wondered why the RFCs couldn't have said that when a CNAME is combined with other records, you return the other records when asked for them but provide the CNAME otherwise (or maybe you return the CNAME only when asked for the IP address if there are other records). But when I thought about it more, I realized the answer, the short version of which is caching resolvers.
If you're the authoritative DNS server for a zone, you know for sure what DNS records are and aren't present. This means that if someone asks you for an MX record and the zone has a CNAME, a SOA, and an MX, you can give them the MX record, and if someone asks for the A record, you can give them the CNAME, and everything works fine. But a DNS server that is a caching resolver doesn't have this full knowledge of the zone; it only knows what's in its cache. If such a DNS server has a CNAME for a domain in its cache (perhaps because someone asked for the A record) and it's now asked for the MX records of that domain, what is it supposed to do? The correct answer could be either the CNAME record the DNS server has or the MX records it would have to query an authoritative server for. At a minimum combining CNAME plus other records this way would require caching resolvers to query the upstream DNS server and then remember that they got a CNAME answer for a specific query.
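To make the caching resolver's dilemma concrete, here's a toy sketch in Python of a cache keyed by (name, record type); the names and records in it are made up. The point is that a cache holding only a CNAME answer obtained for one query type can't know whether other record types also exist at that name without going back to an authoritative server.

# A toy model of a caching resolver's view of 'CNAME plus other records'.
# The cached data is entirely made up; real resolvers are far more involved.
cache = {
    # We previously asked for the A record of www.example.org and were
    # handed a CNAME, which we cached.
    ("www.example.org", "A"): ("CNAME", "host.provider.example"),
}

def lookup(name, rtype):
    # If we have an exact answer cached for this (name, type), use it.
    if (name, rtype) in cache:
        return cache[(name, rtype)]
    # Otherwise, if we have a cached CNAME for this name (from some other
    # query type), today's DNS lets us reuse it, because a CNAME applies to
    # every record type. If CNAMEs could coexist with other records, we
    # couldn't do this; the name might also have real MX records we've
    # simply never fetched, so we'd have to re-query upstream every time.
    for (cached_name, cached_type), answer in cache.items():
        if cached_name == name and answer[0] == "CNAME":
            return answer
    return ("MISS", "must ask an authoritative or upstream server")

print(lookup("www.example.org", "MX"))   # served from the cached CNAME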
In theory this could have been written into DNS originally, at the cost of complicating caching DNS servers and causing them to make more queries to upstream DNS servers (which is to say, making their caching less effective). But once DNS existed with CNAMEs working the way they do, so that caching resolvers could cache a CNAME response and serve it for any query type, the behavior was effectively set in stone.
(This is probably obvious to experienced DNS people, but since I had to work it out in my head I'm going to write it down.)
Sidebar: The pseudo-CNAME behavior offered by some DNS providers
Some DNS providers and DNS servers offer an 'ANAME' or 'ALIAS' record type. This isn't really a DNS record; instead it's a processing instruction to the provider's DNS software that it should look up the A and AAAA records of the target name and insert them into your zone in place of the ANAME/ALIAS record (and redo the lookup every so often in case the target name's IP addresses change). In theory any changes in the A or AAAA records should trigger a change in the zone serial number; in practice I don't know if providers actually do this.
(If your DNS provider doesn't have ANAME/ALIAS 'records' but does have an API, you can build this functionality yourself.)
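As a rough sketch of what building it yourself could look like, here's a small Python loop using the dnspython module to re-resolve the target name and hand the results to your provider. The update_provider_records() function is a hypothetical placeholder for whatever API your provider actually offers, and the names involved are made up.

# A rough sketch of do-it-yourself ANAME/ALIAS behavior, assuming your DNS
# provider has an API for replacing records. update_provider_records() is a
# hypothetical placeholder for that API, and the names are made up.
import time
import dns.resolver   # from the dnspython package

TARGET = "something.provider.example"   # what we'd otherwise CNAME to
OURNAME = "example.org"                 # the apex name that can't be a CNAME

def current_addresses(name):
    addrs = set()
    for rtype in ("A", "AAAA"):
        try:
            answer = dns.resolver.resolve(name, rtype)
            addrs.update(rdata.to_text() for rdata in answer)
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            pass
    return addrs

def update_provider_records(name, addrs):
    # Placeholder: call your provider's API here to replace the A and AAAA
    # records for 'name' with 'addrs' (and bump the zone serial number).
    print(f"would set {name} to {sorted(addrs)}")

last = None
while True:
    addrs = current_addresses(TARGET)
    if addrs and addrs != last:
        update_provider_records(OURNAME, addrs)
        last = addrs
    time.sleep(300)   # redo the lookup every so often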
2025-01-04
There are different sorts of WireGuard setups with different difficulties
I've now set up WireGuard in a number of different ways, some of which were easy and some of which weren't. So here are my current views on WireGuard setups, starting with the easiest and going to the most challenging.
The easiest WireGuard setup is where the 'within WireGuard' internal IP address space is completely distinct from the outside space, with no overlap. This makes routing completely straightforward; internal IPs reachable over WireGuard aren't reachable in any other way, and external IPs aren't reachable over WireGuard. You can do this as a mesh or use the WireGuard 'router' pattern (or some mixture). If you allocate all internal IP addresses from the same network range, you can set a single route to your WireGuard interface and let AllowedIPs sort it out.
(An extreme version of this would be to configure the inside part of WireGuard with only link local IPv6 addresses, although this would probably be quite inconvenient in practice.)
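Coming back to the 'let AllowedIPs sort it out' point: WireGuard's cryptokey routing effectively is a routing table, with each peer's AllowedIPs as prefixes and the most specific match winning. Here's a small illustration of that idea using only Python's standard ipaddress module; the peers and address ranges are made up.

# An illustration of how AllowedIPs picks a peer: longest-prefix match over
# each peer's allowed ranges. The peers and addresses here are made up.
import ipaddress

peers = {
    "peer-a": ["192.168.100.1/32"],
    "peer-b": ["192.168.100.2/32"],
    "gateway": ["192.168.100.0/24", "10.20.0.0/16"],  # a 'router' peer
}

def pick_peer(dest):
    dest = ipaddress.ip_address(dest)
    best, best_len = None, -1
    for peer, ranges in peers.items():
        for cidr in ranges:
            net = ipaddress.ip_network(cidr)
            if dest in net and net.prefixlen > best_len:
                best, best_len = peer, net.prefixlen
    return best

print(pick_peer("192.168.100.2"))   # peer-b; the /32 beats the gateway's /24
print(pick_peer("10.20.30.40"))     # gateway
print(pick_peer("8.8.8.8"))         # None; not reachable over WireGuard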
A slightly more difficult setup is where some WireGuard endpoints are gateways to additional internal networks, networks that aren't otherwise reachable. This setup potentially requires more routing entries but it remains straightforward in that there's no conflict on how to route a given IP address.
The next most difficult setup is using different IP address types inside WireGuard than you use outside it, where the inside IP address type isn't otherwise usable for at least one of the ends. For example, you have an IPv4-only machine that you're giving a public IPv6 address through an IPv6 tunnel. This is still not too difficult because the inside IP addresses associated with each WireGuard peer aren't otherwise reachable, so you never have a recursive routing problem.
The most difficult type of WireGuard setup I've had to do so far is a true 'VPN' setup, where some or many of the WireGuard endpoints you're talking to are reachable both outside WireGuard and through WireGuard (or at least there are routes that try to send traffic to those IPs through WireGuard, such as a VPN style 'route all traffic through my WireGuard link' default route). Since your system could plausibly route your already encrypted WireGuard traffic back over WireGuard, you need some sort of additional setup to avoid this. On Linux, this is often done with a fwmark and some policy-based routing rules, along the lines of the sketch below.
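As a concrete illustration, here is roughly the fwmark plus policy routing arrangement that wg-quick sets up on Linux for a 'route everything through WireGuard' configuration, as I understand it, written as a small Python wrapper around the ip and wg commands. The interface name and the mark/table number are arbitrary choices for the sketch.

# A rough sketch of fwmark plus policy based routing for a 'route all
# traffic through WireGuard' setup on Linux, along the lines of what
# wg-quick does. The interface name and mark/table number are arbitrary.
import subprocess

WGDEV = "wg0"
MARK = 51820   # used as both the fwmark and the routing table number

def run(*args):
    print(" ".join(args))
    subprocess.run(args, check=True)

# Mark WireGuard's own (already encrypted) UDP traffic so it can be exempted.
run("wg", "set", WGDEV, "fwmark", str(MARK))
# Send everything else through the WireGuard interface via a separate table.
run("ip", "route", "add", "default", "dev", WGDEV, "table", str(MARK))
# Traffic without the mark uses that table; marked traffic falls through
# to the normal routing table and goes out the real network interface.
run("ip", "rule", "add", "not", "fwmark", str(MARK), "table", str(MARK))
# Let more specific routes in the main table still win over the default.
run("ip", "rule", "add", "table", "main", "suppress_prefixlength", "0")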
One of the reasons I find it useful to explicitly think about these different types of setups is to better know what to expect and what I'll need to do when I'm planning a new WireGuard environment. Either I will be prepared for what I'm going to have to do, or I may rethink my design in order to move it up the hierarchy, for example deciding that we can configure services to talk to special internal IPs (over WireGuard) so that we don't have to set up fwmark-based routing on everything.
(Some services built on top of WireGuard handle this for you, for example Tailscale, although Tailscale can have routing challenges of its own depending on your configuration.)
2024-12-29
My screens now have areas that are 'good' and 'bad' for me
Once upon a time, I'm sure that everywhere on my screen (because it would have been a single screen at that time) was equally 'good' for me; all spots were immediately visible, clearly readable, didn't require turning my head, and so on. As the number of screens I use has risen, as the size of the screens has increased (for example when I moved from 24" non-HiDPI 3:2 LCD panels to 27" HiDPI 16:9 panels), and as my eyes have gotten older, this has changed. More and more, there is a 'good' area that I've arranged to be looking straight at, and then increasingly peripheral areas that are not as good.
(This good area is not necessarily the center of the screen; it depends on how I sit relative to the screen, the height of the monitor, and so on. If I adjust these I can change what the good spot is, and I sometimes will do so for particular purposes.)
Calling the peripheral areas 'bad' is relative. I can see them, but especially on my office desktop (which has dual 27" 16:9 displays), these days the worst spots can be so far off to the side that I don't really notice things there much of the time. If I want to really look, I have to turn my head, which means I need a reason to look over there at whatever I put there. Hopefully it's not too important.
For a long time I didn't really notice this change or think about its implications. As the physical area covered by my 'display surface' expanded, I carried over much the same desktop layout that I had used (in some form) for a long time. It didn't register that some things were effectively being exiled into the outskirts where I would never notice them, or that my actual usage was increasingly concentrated in one specific area of the screen. Now that I have consciously noticed this shift (which is a story for another entry), I may want to rethink some of how I lay things out on my office desktop (and maybe my home one too) and what I put where.
(One thing I've vaguely considered is whether I should turn my office displays sideways, so the long axis is vertical, although I don't know if that's feasible with their current stands. I have what is in practice too much horizontal space today, so that would be one way to deal with it. But probably this would give me two screens that are each a bit too narrow to be comfortable for me. And sadly there are no ideal LCD panels these days; what I'd really like is a HiDPI 24" or 25" 3:2 panel, but vendors don't make those.)
2024-12-25
x86 servers, ATX power supply control, and reboots, resets, and power cycles
I mentioned recently a case where power cycling an (x86) server wasn't enough to recover it, although perhaps I should have put quotes around "power cycling". The reason for the scare quotes is that I was doing this through the server's BMC, which means that what actually happened isn't clear, because there are a variety of ways the BMC could be doing power control and the BMC may have done something different for what it described as a 'power cycle'. In fact, to make it less clear, this particular server's BMC offers both a "Power Cycle" and a "Power Reset" option.
(According to the BMC's manual, a "power cycle" turns the system off and then back on again, while a "power reset" performs a 'warm restart'. I may have done a 'power reset' instead of a 'power cycle'; it's not clear from what logs we have.)
There is a spectrum of ways to restart an x86 server, and they (probably) vary in their effects on peripherals, PCIe devices, and motherboard components. The most straightforward-looking one is to ask the Linux kernel to reboot the system, although in practice I believe that actually getting the hardware to do the reboot is somewhat complex (and in the past Linux sometimes had problems where it couldn't persuade the hardware, so your 'reboot' would hang). Looking at the Linux kernel code suggests that there are multiple ways to invoke a reboot, involving ACPI, UEFI firmware, old-fashioned BIOS firmware, a PCIe configuration register, the keyboard controller, and so on (for a fun time, look at the 'reboot=' kernel parameter). In general, a reboot can only be initiated by the server's host OS, not by the BMC; if the host OS is hung you can't 'reboot' the server as such.
Your x86 desktop probably has a 'reset' button on the front panel. These days the wire from this is probably tied into the platform chipset (on Intel, the ICH, which came up for desktop motherboard power control) and is interpreted by it. Server platforms probably also have a (conceptual) reset wire, and that wire may well be connected to the BMC, which can then control it to implement, for example, a 'reset' operation. I believe that a server reboot can also trigger the same platform chipset reset handling that the reset button does, although I'm not sure of this. If I'm reading Intel ICH chipset documentation correctly, triggering a reset this way will or may signal PCIe devices and so on that a reset has happened, although I don't think it cuts power to them; in theory anything getting this signal should reset its state.
(The CF9 PCI "Reset Control Register" can be used to initiate a 'soft' or 'hard' CPU reset, or a full reset in which the (Intel) chipset will do various things to signals to peripherals, not just the CPU. I don't believe that Linux directly exposes these options to user space (partly because it may not be rebooting through direct use of PCI CF9 in the first place), although some of them can be controlled through kernel command line parameters. I think this may also control whether the 'reset' button and line do a CPU reset or a full reset. It seems possible that the warm restart of this server's BMC's "power reset" works by triggering the reset line and assuming that CF9 is left in its default state to make this a CPU reset instead of a full reset.)
Finally, the BMC can choose to actually cycle the power off and then back on again. As discussed, 'off' is probably not really off, because standby power and BMC power will remain available, but this should put both the CPU and the platform chipset through a full power-on sequence. However, it likely won't leave power off long enough for various lingering currents to dissipate and capacitors to drain. And nothing you do through the BMC can completely remove power from the system; as long as a server is connected to AC power, it's supplying standby power and BMC power. If you want a total reset, you must either disconnect its power cords or turn its outlet or outlets off in your remote controllable PDU (which may not work great if it's on a UPS). And as we've seen, sometimes a short power cycle isn't good enough and you need to give the server a time out.
(While the server's OS can ask for the server to be powered down instead of rebooted, I don't think it can ask for the server to be power cycled, not unless it talks to the BMC instead of doing a conventional reboot or power down.)
One of the things I've learned from this is that if I want to be really certain I understand what a BMC is doing, I probably shouldn't rely on any option to do a power cycle or power reset. Instead I should explicitly turn power off, wait until that has taken effect, and then turn power on. Asking a BMC to do a 'power cycle' is a bit optimistic, although it will probably work most of the time.
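For a BMC that speaks IPMI, the explicit version can be scripted with ipmitool, something like the following sketch. The BMC address and credentials are placeholders, and you'd want to check exactly how your BMC reports chassis power state before trusting the wait loop.

# A sketch of an explicit 'off, wait until it took, then on' sequence
# through a BMC that speaks IPMI, using ipmitool. The BMC address and
# credentials here are placeholders.
import subprocess
import time

BMC = ["ipmitool", "-I", "lanplus", "-H", "bmc.example.org",
       "-U", "admin", "-P", "secret"]

def chassis(*args):
    res = subprocess.run(BMC + ["chassis", "power", *args],
                         capture_output=True, text=True, check=True)
    return res.stdout.strip()

chassis("off")
# Wait until the BMC reports that the host is actually powered off.
while "off" not in chassis("status").lower():
    time.sleep(2)
# Optionally give lingering charge a little time to drain before powering on.
time.sleep(60)
chassis("on")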
(If there's another occurrence of our specific 'reset is not enough' hang, I will definitely make sure to use at least the BMC's 'power cycle', and perhaps the full brief off-then-on approach.)
2024-12-21
When power cycling your (x86) server isn't enough to recover it
We have various sorts of servers here, and generally they run without problems unless they experience obvious hardware failures. Rarely, we experience Linux kernel hangs on them, and when this happens, we power cycle the machines, as one does, and the server comes back. Well, almost always. We have two servers (of the same model), where something different has happened once.
Each of the servers either crashed in the kernel and started to reboot or hung in the kernel and was power cycled (both were essentially unused at the time). As each server was running through the system firmware ('BIOS'), both of them started printing an apparently endless series of error dumps to their serial consoles (which had been configured in the BIOS as well as in the Linux kernel). These were like the following:
!!!! X64 Exception Type - 12(#MC - Machine-Check)  CPU Apic ID - 00000000 !!!!
RIP  - 000000006DABA5A5, CS  - 0000000000000038, RFLAGS - 0000000000010087
RAX  - 0000000000000008, RCX - 0000000000000000, RDX - 0000000000000001
RBX  - 000000007FB6A198, RSP - 000000005D29E940, RBP - 000000005DCCF520
RSI  - 0000000000000008, RDI - 000000006AB1B1B0
R8   - 000000005DCCF524, R9  - 000000005D29E850, R10 - 000000005D29E8E4
R11  - 000000005D29E980, R12 - 0000000000000008, R13 - 0000000000000001
R14  - 0000000000000028, R15 - 0000000000000000
DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
GS   - 0000000000000030, SS  - 0000000000000030
CR0  - 0000000080010013, CR2 - 0000000000000000, CR3 - 000000005CE01000
CR4  - 0000000000000668, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 0000000076E46000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000006AC3D018 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - 000000005D29E5A0
!!!! Can't find image information. !!!!
(The last line leaves me with questions about the firmware/BIOS but I'm unlikely to get answers to them. I'm putting the full output here for the usual reason.)
Some of the register values varied between reports; others stopped changing after the first one (for example, from the second report onward the RIP appears to have always been 6DAB14D1, which suggests it's maybe an exception handler).
In both cases, we turned off power to the machines (well, to the hosts; we were working through the BMC, which stayed powered on), let them sit for a few minutes, and then powered them on again. This returned them to regular, routine, unexciting service, and neither of them has had problems since.
I knew in a theoretical way that there are parts of an x86 system that aren't necessarily completely reset if the power is only interrupted briefly (my understanding is that a certain amount of power lingers until capacitors drain and so on, but this may be wrong and there may be a different mechanism in action). But I usually don't have it demonstrated in front of me this way, where a simple power cycle isn't good enough to restore a system but a cool-down period works.
(Since we weren't cutting external power to the entire system, this also left standby power available, which means some things never completely lost power even with the power being 'off' for a couple of minutes.)
PS: Actually there's an alternate explanation, which is that the first power cycle didn't do enough to reset things but a second one would have worked if I'd tried that instead of powering the servers off for a few minutes. I'm not certain I believe this and in any case, powering the servers off for a cool down period was faster than taking a chance on a second power cycle reset.
2024-12-06
Common motherboards are supporting more and more M.2 NVMe drive slots
Back at the start of 2020, I wondered if common (x86 desktop) motherboards would ever have very many M.2 NVMe drive slots, where by 'very many' I meant four or so, which even back then was a common number of SATA ports for desktop motherboards to provide. At the time I thought the answer was probably no. As I recently discovered from investigating a related issue, I was wrong, and it's now fairly straightforward to find x86 desktop motherboards that have as many as four M.2 NVMe slots (although not all four may be able to run at x4 PCIe lanes, especially if you have things like a GPU).
For example, right now it's relatively easy to find a page full of AMD AM5-based motherboards that have four M.2 NVMe slots. Most of these seem to be based on the high end X series AMD chipsets (such as the X670 or the X870), but I found a few that were based on the B650 chipset. On the Intel side, should you still be interested in an Intel CPU in your desktop at this point, there are also a number of them based primarily on the Z790 chipset (and some on the older Z690). There's even a B760 based motherboard with four M.2 NVMe slots (although two of them are only x1 lanes and PCIe 3.0), and an H770 based one that manages to (theoretically) support all four M.2 slots at x4 lanes.
One of the things that I think has happened on the way to this large supply of M.2 slots is that these desktop motherboards have dropped most of their PCIe slots. These days, you seem to commonly get three slots in total on the kind of motherboard that has four M.2 slots. There's always one x16 slot, often two, and sometimes three (although that's physical x16; don't count on getting all 16 PCIe lanes in every slot). It's not uncommon to see the third PCIe slot be physically x4, or a little x1 slot tucked away at the bottom of the motherboard. It also isn't necessarily the case that lower end desktops have more PCIe slots to go with their fewer M.2 slots; they too seem to have mostly gone with two or three PCIe slots, generally with a limited number of lanes even if they're physically x16.
(I appreciate having physical x16 slots even if they're only PCIe x1, because that means you can use any card that doesn't require PCIe bifurcation and it should work, although slowly.)
As noted by commentators on my entry on PCIe bifurcation and its uses for NVMe drives, a certain amount of what we used to need PCIe slots for can now be provided through high speed USB-C and similar things. And of course there are only so many PCIe lanes to go around from the CPU and the chipset, so those USB-C ports and other high-speed motherboard devices consume a certain amount of them; the more onboard devices the motherboard has the fewer PCIe lanes there are left for PCIe slots, whether or not you have any use for those onboard devices and connectors.
(Having four M.2 NVMe slots is useful for me because I use my drives in mirrored pairs, so four M.2 slots means I can run my full old pair in parallel with a full new pair, either in a four way mirror or doing some form of migration from one mirrored pair to the other. Three slots is okay, since that lets me add a new drive to a mirrored pair for gradual migration to a new pair of drives.)
2024-12-04
Sorting out 'PCIe bifurcation' and how it interacts with NVMe drives
Suppose, not hypothetically, that you're switching from one mirrored set of M.2 NVMe drives to another mirrored set of M.2 NVMe drives, and so would like to have three or four NVMe drives in your desktop at the same time. Sadly, you already have one of your two NVMe drives on a PCIe card, so you'd like to get a single PCIe card that handles two or more NVMe drives. If you look around today, you'll find two sorts of cards for this; ones that are very expensive, and ones that are relatively inexpensive but require that your system supports a feature that is generally called PCIe bifurcation.
NVMe drives are PCIe devices, so a PCIe card that supports a single NVMe drive is a simple, more or less passive thing that wires four PCIe lanes and some other stuff through to the M.2 slot. I believe that in theory, a card could be built that only required x2 or even x1 PCIe lanes, but in practice I think all such single drive cards are physically PCIe x4 and so require a physical x4 or better PCIe slot, even if you'd be willing to (temporarily) run the drive much slower.
A PCIe card that supports more than one M.2 NVMe drive has two options. The expensive option is to put a PCIe bridge on the card, with the bridge (probably) providing a full set of PCIe lanes to the M.2 NVMe drives locally on one side and doing x4, x8, or x16 PCIe with the motherboard on the other. In theory, such a card will work even at x4 or x2 PCIe lanes, because PCIe cards are supposed to do that if the system says 'actually you only get this many lanes' (although obviously you can't drive four x4 NVMe drives at full speed through a single x4 or x2 PCIe connection).
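To put rough numbers on that last point, here's the back of the envelope arithmetic; the per-lane throughput figures are approximate usable numbers and real-world results will be somewhat lower.

# Back of the envelope PCIe bandwidth arithmetic for NVMe drives behind a
# bridge card. Per-lane figures are approximate usable throughput in GB/s.
per_lane_gbs = {3: 0.985, 4: 1.969, 5: 3.938}   # PCIe generation -> GB/s/lane

gen = 4
drives = 4
lanes_per_drive = 4
uplink_lanes = 4    # the bridge card is sitting in an x4 slot

drive_side = drives * lanes_per_drive * per_lane_gbs[gen]
uplink = uplink_lanes * per_lane_gbs[gen]
print(f"drives could move ~{drive_side:.0f} GB/s in total")   # ~32 GB/s
print(f"the x4 uplink tops out at ~{uplink:.0f} GB/s")        # ~8 GB/s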
The cheap option is to require that the system be able to split a single PCIe slot into multiple independent groups of PCIe lanes (I believe these are usually called links); this is PCIe bifurcation. In PCIe bifurcation, the system takes what is physically and PCIe-wise an x16 slot (for example) and splits it into four separate x4 links (I've seen this sometimes labeled as 'x4/x4/x4/x4'). This is cheap for the card because it can basically be four single M.2 NVMe PCIe cards jammed together, with each set of x4 lanes wired through to a single M.2 NVMe slot. A PCIe card for two M.2 NVMe drives will require an x8 PCIe slot bifurcated to two x4 links; if you stick this card in an x16 slot, the upper 8 PCIe lanes just get ignored (which means that you can still set your BIOS to x4/x4/x4/x4).
As covered in, for example, this Synopsys page, PCIe bifurcation isn't something that's negotiated as part of bringing up PCIe connections; a PCIe device can't ask for bifurcation and can't be asked whether or not it supports it. Instead, the decision is made as part of configuring the PCIe root device or bridge, which in practice means it's a firmware ('BIOS') decision. However, I believe that bifurcation may also require hardware support in the 'chipset' and perhaps the physical motherboard.
I put chipset into quotes because for quite some time now, some PCIe lanes have come directly from the CPU and only some others come through the chipset as such. For example, in desktop motherboards, the x16 GPU slot is almost always driven directly by CPU PCIe lanes, so it's up to the CPU to have support (or not have support) for PCIe bifurcation of that slot. I don't know if common desktop chipsets support bifurcation on the chipset PCIe slots and PCIe lanes, and of course you need chipset-driven PCIe slots that have enough lanes to be bifurcated in the first place. If the PCIe slots driven by the chipset are a mix of x4 and x1 slots, there's no really useful bifurcation that can be done (at least for NVMe drives).
If you have a limited number of PCIe slots that can actually support x16 or x8 and you need a GPU card, you may not be able to use PCIe bifurcation in practice even if it's available for your system. If you have only one PCIe slot your GPU card can go in and it's the only slot that supports bifurcation, you're stuck; you can't have both a bifurcated set of NVMe drives and a GPU (at least not without a bifurcated PCIe riser card that you can use).
(This is where I would start exploring USB NVMe drive enclosures, although on old desktops you'll probably need one that doesn't require USB-C, and I don't know if a NVMe drive set up in a USB enclosure can later be smoothly moved to a direct M.2 connection without partitioning-related problems or other issues.)
(This is one of the entries I write to get this straight in my head.)
Sidebar: Generic PCIe riser cards and other weird things
The traditional 'riser card' I'm used to is a special proprietary server 'card' (ie, a chunk of PCB with connectors and other bits) that plugs into a likely custom server motherboard connector and makes a right angle turn that lets it provide one or two horizontal PCIe slots (often half-height ones) in a 1U or 2U server case, which aren't tall enough to handle PCIe cards vertically. However, the existence of PCIe bifurcation opens up an exciting world of general, generic PCIe riser cards that bifurcate a single x16 GPU slot to, say, two x8 PCIe slots. These will work (in some sense) in any x16 PCIe slot that supports bifurcation, and of course you don't have to restrict yourself to x16 slots. I believe there are also PCIe riser cards that bifurcate an x8 slot into two x4 slots.
Now, you are perhaps thinking that such a riser card puts those bifurcated PCIe slots at right angles to the slots in your case, and probably leaves any cards inserted into them with at least their tops unsupported. If you have light PCIe cards, maybe this works out. If you don't have light PCIe cards, one option is another terrifying thing, a PCIe ribbon cable with a little PCB that is just a PCIe slot on one end (the other end plugs into your real PCIe slot, such as one of the slots on the riser card). Sometimes these are even called 'riser card extenders' (or perhaps those are a sub-type of the general PCIe extender ribbon cables).
Another PCIe adapter device you can get is an x1 to x16 slot extension adapter, which plugs into an x1 slot on your motherboard and has an x16 slot (with only one PCIe lane wired through, of course). This is less crazy than it sounds; you might only have an x1 slot available, want to plug in an x4, x8, or x16 card that's short enough, and be willing to settle for x1 speeds. In theory PCIe cards are supposed to still work when their lanes are choked down this way.
2024-11-23
The general issue of terminal programs and the Alt key
When you're using a terminal program (something that provides a terminal window in a GUI environment, which is now the dominant form of 'terminals'), there's a fairly straightforward answer for what should happen when you hold down the Ctrl key while typing another key. For upper and lower case letters, the terminal program generates ASCII bytes 1 through 26, for Ctrl-[ you get byte 27 (ESC), and there are relatively standard versions of some other characters. For other characters, your specific terminal program may treat them as aliases for some of the ASCII control characters or ignore the Ctrl. All of this behavior is relatively standard from the days of serial terminals, and none of it helps terminal programs decide what should be generated when you hold down the Alt key while typing another key.
(A terminal program can hijack Alt-<key> to control its behavior, but people will generally find this hostile because they want to use Alt-<key> with things running inside the terminal program. In general, terminal programs are restricted to generating things at the character layer, where what they send has to fit in a sequence of bytes and be generally comprehensible to whatever is reading those bytes.)
Historically and even currently there have been three answers. The simplest answer is that Alt sets the 8th bit on what would otherwise be a seven-bit ASCII character. This behavior is basically a relic of the days when things actually were seven bit ASCII (at least in North America) and doing this wouldn't mangle things horribly (provided that the program inside the terminal understood this signal). As a result it's not too popular any more and I think it's basically died out.
The second answer is what I'll call the Emacs answer, which is that Alt plus another key generates ESC (Escape) and then the other key. This matches how Emacs handled its Meta key binding modifier (written 'M-...' in Emacs terminology) in the days of serial terminals; if an Emacs keybinding was M-a, you typed 'ESC a' to invoke it. Even today, when we have real Alt keys and some programs could see a real Meta modifier, basically every Emacs or Emacs-compatible system will accept ESC as the Meta prefix even when it's not running in a terminal.
(I started with Emacs sufficiently long ago that ESC-<key> is an ingrained reflex that I still sometimes use even though Alt is right there on my keyboard.)
The third answer is that Alt-<key> generates various accented or special characters in the terminal program's current locale (or in UTF-8, because that's increasingly hard-coded). Once upon a time this was the same as the first answer, because the accented and special characters were whatever was found in the upper half of an eight-bit character set (bytes 128 to 255). These days, with people using UTF-8, it's generally different; for example, your Alt-a might generate 'á', but the actual UTF-8 representation of this single Unicode codepoint is two bytes, 0xc3 0xa1.
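To make the three answers concrete, here's what each of them would transmit for Alt-a, sketched in Python (the 8th-bit answer is shown assuming a Latin-1 style world where byte 0xE1 meant 'á').

# What a terminal program might transmit for Alt-a under each of the
# three answers.
key = "a"

# Answer 1: set the 8th bit on the ASCII character (a relic of 8-bit charsets).
print(bytes([ord(key) | 0x80]))       # b'\xe1', which is 'á' in Latin-1

# Answer 2: the Emacs answer, ESC followed by the character.
print(b"\x1b" + key.encode("ascii"))  # b'\x1ba'

# Answer 3: send an accented character in the current charset; in UTF-8
# the single codepoint 'á' becomes two bytes.
print("á".encode("utf-8"))            # b'\xc3\xa1'
print("á".encode("latin-1"))          # b'\xe1', the same byte as answer 1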
Some terminal programs still allow you to switch between the second and the third answers (Unix xterm is one such program and can even be switched on the fly, see the 'Meta sends Escape' option in the menu you get with Ctrl-<mouse button 1>). Others are hard-coded with the second answer, where Alt-<key> sends ESC <key>. My impression is that the second answer is basically the dominant one these days and only a few terminal programs even potentially support the third option.
PS: How xterm behaves can be host specific due to different default X resources settings on different hosts. Fedora makes xterm default to Alt-<key> sending ESC-<key>, while Ubuntu leaves it with the xterm code default of Alt creating accented characters.
2024-11-09
A rough guess at how much IPv6 address space we might need
One of the reactions I saw to my entry on why NAT might be inevitable (at least for us) even with IPv6 was to ask if there really was a problem with being generous with IPv6 allocations, since they are (nominally) so large. Today I want to do some rough calculations on this, working backward from what we might reasonably assign to end user devices. There's a lot of hand-waving and assumptions here, and you can question a lot of them.
I'll start with the assumption that the minimum acceptable network size is a /64, for various reasons including SLAAC. As discussed, end devices presenting themselves on our network may need some number of /64s for internal use. Let's assume that we'll allocate sixteen /64s to each device, meaning that we give out /60s to each device on each of our subnets.
I think it's unlikely we'll want to ever have a subnet with more than 2048 devices on it (and even that's generous). That many /60s is a /49. However, some internal groups have more than one IPv4 subnet today, so for future expansion let's say that each group gets eight IPv6 subnets, so we give out /46s to research groups (or we could trim some of these sizes and give out /48s, which seems to be a semi-standard allocation size that various software may be more happy with).
We have a number of IPv4 subnets (and of research groups). If we want to allow for growth, various internal uses, and so on, we want some extra room, so I think we'd want space for at least 128 of these /46 allocations, which gets us to an overall allocation for our department of a /39 (a /38 if we want 256 just to be sure). The University of Toronto currently has a /32, so we actually have some allocation problems. For a start, the university has three campuses and it might reasonably want to split its /32 allocation into four and give one /34 to each campus. At a /34 for the campus, there's only 32 /39s and the university has many more departments and groups than that.
If the university starts with a /32, splits it to /34s for campuses, and wants to have room for 1024 or 2048 allocations within a campus, each department or group can get only a /44 or a /45 and all of our sizes would have to shrink accordingly; we'd need to drop at least five or six bits somewhere (say four subnets per group, eight or even four /64s per device, maybe 1024 devices maximum per subnet, etc).
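The prefix arithmetic here is easy to double-check with Python's standard ipaddress module; 2001:db8::/32 is just the documentation prefix standing in for the university's actual allocation.

# Double-checking the prefix arithmetic with the standard ipaddress module.
# 2001:db8::/32 is the documentation prefix standing in for the real one.
import ipaddress

univ = ipaddress.ip_network("2001:db8::/32")

# Four campuses: the /32 split into /34s.
campuses = list(univ.subnets(new_prefix=34))
print(len(campuses))                      # 4

# How many /39 departmental allocations fit in one campus /34?
print(2 ** (39 - 34))                     # 32, nowhere near enough departments

# Working the other way: 1024 or 2048 allocations per campus means each
# department or group gets only a /44 or a /45.
print(2 ** (44 - 34), 2 ** (45 - 34))     # 1024 2048

# And inside our own plan: a /60 per device is 16 /64s, 2048 /60s per
# subnet is a /49, and eight /49 subnets per group is a /46.
print(2 ** (64 - 60), 2 ** (60 - 49), 2 ** (49 - 46))   # 16 2048 8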
If my understanding of how you're supposed to do IPv6 is correct, what makes all of this more painful in a purist IPv6 model is that you're not supposed to allocate multiple, completely separate IPv6 subnets to someone, unlike in the IPv4 world. Instead, everything is supposed to live under one IPv6 prefix. This means that the IPv6 prefix absolutely has to have enough room for future growth, because otherwise you have to go through a very painful renumbering to move to another prefix.
(For instance, today the department has multiple IPv4 /24s allocated to it, not all of them contiguous. We also work this way with our internal use of RFC 1918 address space, where we just allocate /16s as we need them.)
Being able to allocate multiple subnets of some size (possibly a not that large one) to departments and groups would make it easier to not over-allocate to deal with future growth. We might still have problems with the 'give every device eight /64s' plan, though.
(Of course we could do this multiple subnets allocation internally even if the university gives us only a single IPv6 prefix. Probably everything can deal with IPv6 used this way, and it would certainly reduce the number of bits we need to consume.)
2024-11-05
The general problem of losing network based locks
There are many situations and protocols where you want to hold some sort of lock across a network between, generically, a client (who 'owns' the lock) and a server (who manages the locks on behalf of clients and maintains the locking rules). Because a network is involved, one of the broad problems that can happen in such a protocol is that the client can have a lock abruptly taken away from it by the server. This can happen because the server was instructed to break the lock, or the server restarted in some way and notified the clients that they had lost some or all of their locks, or perhaps there was a network partition that led to a lock timeout.
When the locking protocol and the overall environment is specifically designed with this in mind, you can try to require clients to specifically think about the possibility. For example, you can have an API that requires clients to register a callback for 'you lost a lock', or you can have specific error returns to signal this situation, or at the very least you can have a 'is this lock still valid' operation (or 'I'm doing this operation on something that I think I hold a lock for, give me an error if I'm wrong'). People writing clients can still ignore the possibility, just as they can ignore the possibility of other network errors, but at least you tried.
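As an illustration of the kind of API surface I mean, here's a hypothetical Python client interface; it doesn't correspond to any specific real protocol, it just shows the 'register a callback, check validity, get errors on guarded operations' shape.

# A hypothetical client-side API shape for network locks that can be lost.
# This doesn't correspond to any specific real protocol.
class LockLostError(Exception):
    pass

class NetworkLock:
    def __init__(self, name, on_lost=None):
        # on_lost is the 'you lost a lock' callback the client registers.
        self.name = name
        self.on_lost = on_lost
        self._valid = True

    def is_valid(self):
        # The 'is this lock still valid' operation.
        return self._valid

    def guarded(self, operation):
        # 'I'm doing this operation on something I think I hold a lock
        # for; give me an error if I'm wrong.'
        if not self._valid:
            raise LockLostError(self.name)
        return operation()

    def _server_said_lost(self):
        # Called by the protocol layer when the server breaks the lock,
        # restarts and drops it, or times it out after a partition.
        self._valid = False
        if self.on_lost is not None:
            self.on_lost(self)

# A client that at least had to think about the possibility:
lock = NetworkLock("some-resource", on_lost=lambda l: print("lost", l.name))
lock.guarded(lambda: print("doing work under the lock"))
lock._server_said_lost()
print(lock.is_valid())    # False; further guarded() calls now raise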
However, network locking is sometimes added to things that weren't originally designed for it. One example is (network) filesystems. The basic 'filesystem API' doesn't really contemplate locking, and especially it doesn't consider that you can suddenly have access to a 'file' taken away from you in mid-flight. If you add network locking, you don't have a natural answer for handling lost locks and there's no obvious point in the API to add one, especially if you want to pretend that your network filesystem is the same as a local filesystem. This makes it much easier for people writing programs to not even think about the possibility of losing a network lock during operation.
(If you're designing a purely networked filesystem-like API, you have more freedom; for example, you can make locking operations turn a regular 'file descriptor' into a special 'locked file descriptor' that you have to do subsequent IO through and that will generate errors if the lock is lost.)
One of the meta-problems with handling losing a network lock is that there's no single answer for what you should do about it. In some programs, you've violated an invariant and the only safe move for the program is to exit or crash. In some programs, you can pause operations until you can re-acquire the lock. In other programs you need to bail out to some sort of emergency handler that persists things in another way or logs what should have been done if you still held the lock. And when designing your API (or APIs) for losing locks, how likely you think each option is will influence what features you offer (and it will also influence how interested programs are in handling losing locks).
PS: A contributing factor to programmers and programs not being interested in handling losing network locks is that lost locks are generally somewhere between uncommon and rare. If lots of people are writing code to deal with your protocol and lost locks are uncommon enough, some of those people will just ignore the possibility, just as some programmers ignore the possibility of IO errors.