Wandering Thoughts

2024-05-18

Realizing the hidden complexity of cloud server networking

We have our first cloud server. This cloud server has a public IP address that we can talk to, which is good because we need one, and it feels straightforward; we have lots of machines with public IP addresses. This public IP address has a firewall that we have to set rules for, which feels perfectly normal; we have firewalls too. Although if I think about it, the cloud provider is working at a much bigger scale, which makes it harder and more impressive. Except that our actual cloud server has an RFC 1918 IP address and is on an internal private network segment, so what we're actually working with is a NAT firewall gateway. And the RFC 1918 address is a sufficiently generic /24 that it's clearly not unique to us; plenty of other cloud customers' servers must have their own version of the same RFC 1918 /24.

That was when I realized how complex all of the infrastructure for this networking has to be behind the scenes. The cloud provider is not merely operating a carrier-grade NAT, which is already non-trivial. They're operating a CGNAT firewall system that can connect a public IP to an IP on a specific internal virtual network, where the IP (and subnet) aren't unique across all of the (internal) networks being NAT'd. I feel that I'm reasonably knowledgeable about networking and I'm not sure how I'd even approach designing a system that did that. It's different in kind from the NAT firewalls I work on, not merely in size (the way plain CGNAT sometimes feels).
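
To make the problem concrete, here's a toy sketch in Python (with entirely made-up network names and addresses) of why the provider's NAT state can't simply be keyed on the private IP the way a normal NAT firewall's is, and instead has to include the customer's virtual network as well:

# Keying NAT state on the private IP alone collides as soon as two
# customers both use the same RFC 1918 /24, which they certainly will.
flat_nat = {}
flat_nat["10.0.0.5"] = "203.0.113.10"   # customer A's server
flat_nat["10.0.0.5"] = "198.51.100.7"   # customer B's entry silently replaces A's

# What the provider has to do instead: key on (virtual network, private IP).
tenant_nat = {
    ("vnet-customer-a", "10.0.0.5"): "203.0.113.10",
    ("vnet-customer-b", "10.0.0.5"): "198.51.100.7",
}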

Intellectually, I knew that cloud environments were fearsomely complex behind the scenes, with all sorts of spectacular technical underpinnings (and thus all sorts of things to go wrong). But running 'ip -br a' on our first cloud server and then thinking a bit about how it all worked was the first time it really came home to me. Things like virtual machine provisioning, replicated storage, and so on were sufficiently far outside what I work on that I just admired them from a distance. Connecting our cloud server's public IP with its actual IP was the first time I had the 'I work in this area and nothing I know of could pull that off' feeling.

(Of course if we'd all switched over to IPv6 we might not need this complex NAT environment, because in theory all of those cloud servers could have globally unique IPv6 addresses and subnets, and all you'd need would be a carrier-grade firewall system. I'm not sure that would work in practice, though, and I don't know how clouds handle IPv6 allocation for customer servers. Our cloud server didn't get assigned an IPv6 address when we set it up.)

tech/CloudNetworkHiddenComplexity written at 21:52:56

2024-05-17

The trade-offs in not using WireGuard to talk to our cloud server

We recently set up our first cloud server in order to check the external reachability of some of our services; the cloud server runs a Prometheus Blackbox instance, and our Prometheus server talks to it to have it run checks and return the results. Originally, I was planning for there to be a WireGuard tunnel between our Prometheus server and the cloud VM, which Prometheus would use to talk to Blackbox. In the actual realized setup there's no WireGuard, and instead we use firewall rules to restrict potentially dangerous access to Blackbox to just the Prometheus server.

I had expected to use WireGuard for a combination of access control to Blackbox and dealing with the cloud server having a potentially variable public IP. In practice, this cloud provider gives us a persistent public IP (as far as I can tell from their documentation) and required us to set up firewall rules either way (by default all inbound traffic is blocked), so not doing WireGuard meant a somewhat simpler configuration. In particular, it meant not needing to set up WireGuard on the Prometheus server.

(My plan for WireGuard and the public IP problem was to have the cloud server periodically ping the Prometheus server over WireGuard. This would automatically teach the Prometheus server's WireGuard the current public IP, while the WireGuard internal IP of the cloud server would stay constant. The cloud server's Blackbox would listen only on its internal WireGuard IP, not anything else.)

In some ways the result of relying on a firewall instead of WireGuard is more secure, in that an attacker would have to steal our IP address instead of stealing our WireGuard peer private key. In practice neither is worth worrying about, since all an attacker would get is our Blackbox configuration (and the ability to make assorted Blackbox probes from our cloud VM, which has no special permissions).

The one clear thing we lose in not using WireGuard is that the Prometheus server is now querying Blackbox using unencrypted HTTP over the open Internet. If there is some Intrusion Prevention System (IPS) in the path between us and the cloud server, it may someday decide that it is unhappy with this HTTP traffic (perhaps it trips some detection rule) and that it should block said traffic. An encrypted WireGuard tunnel would hide all of our Prometheus HTTP query traffic (and responses) from any in-path IPS.
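
(To illustrate what such an IPS would see, here's a rough sketch of the sort of plain HTTP probe query the Prometheus server makes. This isn't our actual configuration; the hostnames are made up, and 9115 is simply Blackbox's default port.)

import requests

# Ask the remote Blackbox exporter to probe a target and return the results
# as plain-text Prometheus metrics, all over unencrypted HTTP.
resp = requests.get(
    "http://cloud-server.example.org:9115/probe",
    params={"module": "http_2xx", "target": "https://our-service.example.org"},
    timeout=30,
)
print(resp.text)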

(Of course we have alerts that would tell us that we can't talk to the cloud server's Blackbox. But it's better not to have our queries blocked at all.)

There are various ways to work around this, but they all give us a more complicated configuration on at least the cloud server so we aren't doing any of them (yet). And of course we can switch to the WireGuard approach when (if) we have this sort of problem.

sysadmin/CloudVMNoWireGuardTradeoffs written at 23:42:11

2024-05-16

Thoughts on (not) automating the setup of our first cloud server

I recently set up our first cloud server, in a flailing way that's probably familiar to anyone who still remembers their first cloud VM (complete with a later discovery of cloud provider 'upsell'). The background for this cloud server is that we want to check external reachability of some of our systems, in addition to the internal reachability already checked by our metrics and monitoring system. The actual implementation of this is quite simple; the cloud server runs an instance of the Prometheus Blackbox agent for service checks, and our Prometheus server performs a subset of our Blackbox service checks through it (in addition to the full set of service checks that are done through our local Blackbox instance).

(Access to the cloud server's Blackbox instance is guarded with firewall rules, because giving access to Blackbox is somewhat risky.)

The proper modern way to set up cloud servers is with some automated provisioning system, so that you wind up with 'cattle' instead of 'pets' (partly because every so often the cloud provider is going to abruptly terminate your server and maybe lose its data). We don't use such an automation system for our existing physical servers, so I opted not to try to learn both a cloud provider's way of doing things and a cloud server automation system at the same time, and set up this cloud server by hand. The good news for us is that the actual setup process for this server is quite simple, since it does so little and reuses our existing Blackbox setup from our main Prometheus server (all of which is stored in our central collection of configuration files and other stuff).

(As a result, this cloud server is installed in a way fairly similar to our other machine build instructions. Since it lives in the cloud and is completely detached from our infrastructure, it doesn't have our standard local setup and customizations.)

In a way this is also the bad news. If this server and its operating environment were more complicated to set up, we would have more motivation to pick one of the cloud server automation systems, learn it, and build our cloud server's configuration in it so we could have, for example, a command line 'rebuild this machine and tell me its new IP' script that we could run as needed. Since rebuilding the machine as needed is so simple and fast, it's probably never going to motivate us into learning a cloud server automation system (at least not by itself; if we had a whole collection of simple cloud VMs we might feel differently, but that's unlikely for various reasons).

Although setting up a new instance of this cloud server is simple enough, it's also not trivial. Doing it by hand means dealing with the cloud vendor's website and going through a bunch of clicking on things to set various settings and options we need. If we had a cloud automation system we knew and already had all set up, it would be better to use it. If we're going to do much more with cloud stuff, I suspect we'll soon want to automate things, both to make us less annoyed at working through websites and to keep everything consistent and visible.

(Also, cloud automation feels like something that I should be learning sooner or later, and now I have a cloud environment I can experiment with. Possibly my very first step should be exploring whatever basic command line tools exist for the particular cloud vendor we're using, since that would save dealing with the web interface in all its annoyance.)

sysadmin/FirstCloudVMAndAutomation written at 22:52:33

2024-05-15

Turning off the X server's CapsLock modifier

In the process of upgrading my office desktop to Fedora 40, I wound up needing to turn off the X server's CapsLock modifier. For people with a normal keyboard setup, this is simple; to turn off the CapsLock modifier, you tap the CapsLock key. However, I turn CapsLock into another Ctrl key (and then I make heavy use of tapping CapsLock to start dmenu), which leaves the regular CapsLock functionality unavailable to me under normal circumstances. Since I don't have a CapsLock key, you might wonder how the CapsLock modifier got turned on in the first place.

The answer is that sometimes I have a CapsLock key after all. I turn CapsLock into Ctrl with setxkbmap settings, and apparently some Fedora packages clear these keyboard mapping settings when they're updated. Since upgrading to a new Fedora release updates all of these packages, my 'Ctrl' key resets to CapsLock during the process and I don't necessarily notice immediately. Because I expect my input settings to get cleared, I have a script to re-establish them, which I run when I notice my special Ctrl key handling isn't working. What happened this time around was that I noticed my keyboard settings had been cleared when CapsLock didn't work as Ctrl, then reflexively invoked the script. Of course at this point I had tapped CapsLock, which turned on the CapsLock modifier, and then when the script reset CapsLock to be Ctrl, I no longer had a key that I could use to turn CapsLock off.

(Actually dealing with this situation was made more complicated by how I could now only type upper case letters in shells, browser windows, and so on. Fortunately I had a phone to do Internet searches on, and I could switch to another Linux virtual console, which had CapsLock off, and access the X server with 'export DISPLAY=:0' so I could run commands that talked to it.)

There are two solutions I wound up with, the narrow one and the general one. The narrow solution is to use xdotool to artificially send a CapsLock key down/up event with this:

xdotool key Caps_Lock

This will toggle the state of the CapsLock modifier in the X server, which will turn CapsLock off if it's currently on, as it was for me. This key down/up event works even if you have the CapsLock key remapped at the time, as I did, and you can run it from another virtual console with 'DISPLAY=:0 xdotool key Caps_Lock' (although you may need to vary the :0 bit). Or you can put it in a script called 'RESET-CAPSLOCK' so you can type its name with CapsLock active.

(Possibly I should give my 'reset-input' script an all-caps alias. It's also accessible from a window manager menu, but modifiers can make those inaccessible too.)

However, I'd like something to clear the CapsLock modifier that I can put in my 're-establish my keyboard settings' script, and since this xdotool trick only toggles the setting it's not suitable. Fortunately you can clear modifier states from an X client; unfortunately, as far as I know there's no canned 'capslockx' program the way there is a numlockx (which people have and use for good reasons). Fortunately, the same AskUbuntu question and answers that I got the xdotool invocation from also had a working Python program (you want the one from this answer by diegogs). For assorted reasons, I'm putting my current version of that Python program here:

#!/usr/bin/python
# Clear the X server's CapsLock modifier via XkbLockModifiers.
from ctypes import cdll, c_int, c_uint, POINTER, Structure

class Display(Structure):
    """Opaque stand-in for Xlib's Display structure."""

X11 = cdll.LoadLibrary("libX11.so.6")
X11.XOpenDisplay.restype = POINTER(Display)

# Open the default display (the 0 is effectively a NULL pointer here).
display = X11.XOpenDisplay(c_int(0))
# XkbLockModifiers(dpy, XkbUseCoreKbd (0x0100), affect, values):
# affect 2 is the Lock (CapsLock) modifier bit; value 0 forces it off.
X11.XkbLockModifiers(display, c_uint(0x0100), c_uint(2), c_uint(0))
X11.XCloseDisplay(display)

(There is also a C version in the question and answers, but you obviously have to compile it.)

In theory there is probably some way to reset the setxkbmap settings state so that CapsLock is a CapsLock key again (after all, package updates do it), which would have let me directly turn off CapsLock. In practice I couldn't find out how to do this in my flailing Internet searches, so I went with the answer I could find. In retrospect I might also have been able to reset settings by unplugging and replugging my USB keyboard or plugging in a second keyboard, and we do have random USB keyboards sitting around in the office.

unix/XTurningOffCapslock written at 23:05:12

2024-05-14

The X Window System and the curse of NumLock

In X, like probably any graphical environment, there are a variety of layers to keys and characters that you type. One of the layers is the input events that the X server sends to applications. As covered in the xlib manual, these contain a keycode, representing the nominal physical key, a keysym, representing what is nominally printed on the key, and a bitmap of the modifiers currently in effect, which are things like 'Shift' or 'Ctrl' (cf). The separation between keycodes and keysyms lets you do things like remap your QWERTY keyboard to Dvorak; you tell X to change what keysyms are generated for a bunch of the keycodes. Programs like GNU Emacs read the state of the modifiers to determine what you've typed (from their perspective), so they can distinguish 'Ctrl-Return' from plain 'Return'.

Ordinary modifiers are normally straightforward, in that they are additional keys that are held down as you type the main key. Control, Shift, and Alt all work this way (by default). However, some modifiers are 'sticky', where you tap their key once to turn them on and then tap their key again to turn them off. The obvious example of this is Caps Lock (unless you turn its effects off, remapping its physical key to be, say, another Ctrl key). Another example, one that many X users have historically wound up quietly cursing, is NumLock. Why people wind up cursing NumLock, and why I have a program to control its state, is because of how X programs (such as window managers) often do their key and mouse button bindings.

(There are also things that will let you make non-sticky modifier keys into sticky keys.)

Suppose, for example, that you have a bunch of custom fvwm mouse bindings that are for things like 'middle mouse button plus Alt', 'middle mouse button plus Shift and Alt', 'plain right mouse button on the root', and so on. Fvwm and most other X programs will normally (have to) interpret this completely literally; when you create a binding for 'middle mouse plus Alt', the state of the current modifiers must be exactly 'Alt' and nothing else. If the X server has NumLock on for some reason (such as you hitting the key on the keyboard), the state of the current modifiers will actually be 'NumLock plus Alt', or 'NumLock plus Alt and Shift', or just 'NumLock' (instead of 'no modifiers in effect'). As a result, fvwm will not match any of your bindings and nothing will happen as you're poking away at your keyboard and your mouse.
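
(As a small illustrative sketch: the standard X modifier mask bits are fixed, and by common convention Mod1 is Alt and Mod2 is NumLock, although that mapping isn't guaranteed. The literal comparison that bindings normally use looks like this, and it's easy to see why it stops matching once NumLock is engaged.)

# Standard X modifier mask bits; Mod1 is usually Alt, Mod2 is usually NumLock.
ShiftMask, LockMask, ControlMask, Mod1Mask, Mod2Mask = 1, 2, 4, 8, 16

binding_mods = Mod1Mask              # a 'middle mouse plus Alt' binding
event_mods = Mod1Mask | Mod2Mask     # what actually arrives with NumLock on

print(event_mods == binding_mods)                  # False: the binding never fires
print((event_mods & ~Mod2Mask) == binding_mods)    # True: a match that ignores NumLock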

Of course, this can also happen with CapsLock, which has the same sticky behavior. But CapsLock has extremely obvious effects when you type ordinary characters in terminal windows, editors, email, and so on, so it generally doesn't take very long before people realize they have CapsLock on. NumLock doesn't normally change the main letters or much of anything else; on some keyboard layouts, it may not change anything you can physically type. As a result, having NumLock on can be all but invisible (or completely invisible on keyboards with no NumLock LED). To make it worse, various things have historically liked 'helpfully' turning NumLock on for you, or starting in a mode with NumLock on.

(X programs can alter the current modifier status, so it's possible for NumLock to get turned on even if there is no NumLock key on your keyboard. The good news is that this also makes it possible to turn it off again. A program can also monitor the state of modifiers, so I believe there are ones that give you virtual LEDs for some combination of CapsLock, ScrollLock, and NumLock.)
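
(For what it's worth, a minimal sketch of such a program, using the same Xlib XkbLockModifiers call that can clear CapsLock, might look like the following; it assumes NumLock is on the usual Mod2 modifier, which is common but not guaranteed.)

#!/usr/bin/python
# Force the NumLock modifier off, assuming NumLock is on Mod2 (0x10).
from ctypes import cdll, c_uint, POINTER, Structure

class Display(Structure):
    """Opaque stand-in for Xlib's Display structure."""

X11 = cdll.LoadLibrary("libX11.so.6")
X11.XOpenDisplay.restype = POINTER(Display)

display = X11.XOpenDisplay(None)   # None is a NULL pointer: use $DISPLAY
# XkbLockModifiers(dpy, XkbUseCoreKbd (0x0100), affect, values):
# affect Mod2 (0x10) and set its value to 0, i.e. turn NumLock off.
X11.XkbLockModifiers(display, c_uint(0x0100), c_uint(0x10), c_uint(0))
X11.XCloseDisplay(display)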

So the curse of NumLock in X is that having NumLock on can cause mysterious key binding failures in various programs, while often being more or less invisible. And for X protocol reasons, I believe it's hard for window managers to tell the X server 'ignore NumLock when considering my bindings' (see, for example, the discussion of IgnoreModifiers in the fvwm3 manual).

unix/XNumlockCurse written at 23:15:38

2024-05-13

Some ideas on what Linux distributions can do about the new kernel situation

In a comment on my entry on how the current Linux kernel CVE policy is sort of predictable, Ian Z aka nobrowser asked what a distribution like Debian is supposed to do today, now that the kernel developers are not going to be providing security analysis of fixes, especially for unsupported kernels (this is a concise way of describing the new kernel CVE policy). I don't particularly have answers, but I have some thoughts.

The options I can see today are:

  • More or less carrying on with a distribution specific kernel and backporting fixes into it if they seem important enough or otherwise are security relevant. This involves accepting that there will be some number of security issues in your kernel that are not in the upstream kernel, but this is already the case in reality today (cf).

  • Synchronize distribution releases to when the upstream kernel developers put out a LTS kernel version that will be supported for long enough, and then keep updating (LTS) kernel patch levels as new ones are released. Unfortunately the lifetime of LTS kernels is a little bit uncertain.

    My guess is that this will still leave distributions with any number of kernel security issues, because only bugfixes recognized as important are applied to LTS kernels. The Linux kernel developers are historically not great at recognizing when a bugfix has a security impact (cf again). However, once a security issue is recognized in your (LTS) kernel, at least the upstream LTS team are the ones who'll be fixing it, not you.

  • Give up on the idea of sticking with a single kernel version (much less a single patch level within that version) for the lifetime of a distribution release. Instead, expect to more or less track the currently supported kernels, or at least the LTS kernels (which would let you do releases whenever you want).

    (This is what Fedora currently does with the mainline kernel, although a distribution like Debian might want to be less aggressive about tracking the latest kernel version and patchlevel.)

Broadly, distributions are going to have to decide what is important to them. Just as we say 'good, cheap, fast, pick at most two', distributions are not going to be able to release whenever they want, use a single kernel version for years and years, and get perfect security in that kernel. Or at least they are not going to get that without doing a lot of work themselves.

(Again, the reality is that distributions didn't have this before; any old distribution kernel probably had a number of unrecognized security issues that had already been fixed upstream. That's kind of what it means for the average kernel CVE fix time between 2006 and 2018 to be '-100 days' and for 41% of kernel CVEs to have already been fixed when the CVE was issued.)

A volunteer-based distribution that prioritizes security almost certainly has no option other than closely tracking mainline kernels, and accepting whatever stability churn ensues (probably it's wise to turn off most or all configurable new features, freezing on the feature set of your initially released kernel). Commercial distributions like Red Hat Enterprise Linux and Canonical Ubuntu can do whatever their companies are willing to pay for, but in general I don't think we're going to keep getting long term support for free.

(A volunteer based distribution that prioritizes not changing anything will have to accept that there are going to be security issues in their kernels and they will periodically scramble to find fixes or create fixes for them, and maybe get their own CVEs issued (and possibly have people write sad articles about how this distribution is using ancient kernels with security issues). I don't think this is a wise or attractive thing myself; I would rather keep up with kernel updates, at least LTS ones.)

Distributions don't have to jump on new kernel patchlevels (LTS or otherwise) or kernel versions immediately when they're released; not even Fedora does that. It's perfectly reasonable to do as much build farm testing as you can before rolling out a new LTS patch release or whatever, assuming that there are no obvious security issues that force a fast release.

linux/DistributionKernelHandling2024 written at 23:32:49

2024-05-12

The Linux kernel giving CVEs to all bugfixes is sort of predictable

One of the controversial recent developments in the (Linux kernel) security world is that the Linux kernel developers have somewhat recently switched to a policy of issuing CVEs for basically all bugfixes made to stable kernels. This causes the kernel people to issue a lot of CVEs and means that every new stable kernel patch release officially fixes a bunch of them, and both of these are making some people annoyed. This development doesn't really surprise me (although I wouldn't have predicted it in advance), because I feel it's a natural result of the overall situation.

(This change happened in February when the Linux kernel became a CVE Numbering Authority; see also the LWN note and its comments.)

As I understand it, the story starts with all of the people who maintain their own version of the kernel, which in practice means that they're maintaining some old version of the kernel, one that's not supported any more by the main developers. For a long time, these third parties have wanted the main kernel developers to label all security fixes. They wanted this because they wanted to know which changes they should backport into their own kernels; they only wanted to do this for security fixes, not every bugfix or change the kernel makes at some point during development.

Reliably identifying that a kernel bug fix is also a security fix is a quite hard problem, possibly a more or less impossible one, and there are plenty of times when the security impact of a fix has been missed, such as CVE-2014-9940. In many cases, seriously trying to assess whether a bugfix is a security fix would take noticeable extra effort. Despite all of this, my impression is that third party people keep yelling at the main Linux kernel developers about not 'correctly' labeling security fixes, and have been yelling at them for years.

(Greg Kroah-Hartman's 2019 presentation about CVEs and the Linux kernel notes that between 2006 and 2018, 41% of the Linux kernel CVEs were fixed in official kernels before the CVE had been issued, and the average fix date was '-100 days' (that is, 100 days before the CVE was issued).)

These third party people are exactly that: third parties. They would like the kernel developers to do extra work (work that may be impossible in general), not to benefit the kernel developers, but to benefit themselves, the third parties. These third parties could take on the (substantial) effort of classifying every bug fix to the kernel and evaluating its security impact, either individually or collectively, but they don't want to do the work; they want the mainstream kernel developers to do it for them.

The Linux kernel is an open source project. The kernel developers work on what is interesting to them, or in some cases what they're paid by their employers to work on. They do not necessarily do free work for third parties, even (or especially) if the third parties yell at them. And if things become annoying enough (what with all of the yelling), then the kernel developers may take steps to make the whole issue go away. If every bug fix has a CVE, well, you can't say that the kernel isn't giving CVEs to security issues it fixes. Dealing with the result is your problem, not the kernel developers' problem. This is not a change in the status quo; it has always been your problem. It was just (more) possible to pretend otherwise until recently.

(This elaborates on part of something I said on the Fediverse.)

Sidebar: The other bits of the kernel's CVE policy

There are two interesting other aspects of the current policy. First, the kernel developers will only issue CVEs for currently supported versions of the kernel. If you are using some other kernel and you find a security issue, the kernel people say you should go to the provider of that kernel, but the resulting CVE won't be a 'kernel.org Linux kernel CVE', it will be an 'organization CVE'. Second, you can't automatically get CVEs assigned for unfixed issues; you have to ask the kernel's CVE team (after you've reported the security issue through the kernel's process for this). This means that you have to persuade the kernel developers that there actually is an issue, which I think is a reaction to junk kernel CVEs that people have gotten issued in the past.

linux/KernelBugfixCVEsAStory written at 23:23:23

2024-05-11

Where NS records show up in DNS replies depends on who you ask

Suppose, not hypothetically, that you're trying to check the NS records for a bunch of subdomains to see if one particular DNS server is listed (because it shouldn't be). In DNS, there are two places that have NS records for a subdomain: the nameservers for the subdomain itself (which list NS records as part of the zone's full data), and the nameservers for the parent domain, which have to tell resolvers what the authoritative DNS servers for the subdomain are. Today I discovered that these two sorts of DNS servers can return NS records in different parts of the DNS reply.

(These parent domain NS records are technically not glue records, although I think they may commonly be called that and DNS people will most likely understand what you mean if you call them 'NS glue records' or the like.)

A DNS server's answer to your query generally has three sections, although not all of them may be present in any particular reply. The answer section contains the 'resource records' that directly answer your query, the 'authority' section contains NS records of the DNS servers for the domain, and the 'additional' section contains potentially helpful additional data, such as the addresses of some of the DNS servers in the authority section. Now, suppose that you ask a DNS server (one that has the data) for the NS records for a (sub)domain.

If you send your NS record query to either a DNS resolver (a DNS server that will make recursive queries of its own to answer your question) or to an authoritative DNS server for the domain, the NS records will show up in the answer section. You asked a (DNS) question and you got an answer, so this is exactly what you'd expect. However, if you send your NS record query to an authoritative server for the parent domain, its reply may not have any NS records in the answer section (in fact the answer section can be empty); instead, the NS records show up in the authority section. This can be surprising if you're only printing the answer section, for example because you're using 'dig +noall +answer' to get compact, grep'able output.

(If the server you send your query to is authoritative for both the parent domain and the subdomain, I believe you get NS records in the answer section and they come from the subdomain's zone records, not any NS records explicitly listed in the parent.)

This makes a certain amount of sense in the DNS mindset once you (I) think about it. The DNS server is authoritative for the parent domain but not for the subdomain you're asking about, so it can't give you an 'answer'; it doesn't know the answer and isn't going to make a recursive query to the subdomain's listed DNS servers. And the parent domain's DNS server may well have a different list of NS records than the subdomain's authoritative DNS servers have. So all the parent domain's DNS server can do is fill in the authority section with the NS records it knows about and send this back to you.
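
(As a sketch of the difference, using the dnspython module with made-up domain names and server IPs, you can see which section the NS records land in depending on which server you ask.)

import dns.message
import dns.query

query = dns.message.make_query("sub.example.org", "NS")

# Ask an authoritative server for sub.example.org itself:
# the NS records come back in the answer section.
resp = dns.query.udp(query, "192.0.2.1", timeout=5)
print("answer:", resp.answer, "authority:", resp.authority)

# Ask an authoritative server for the parent example.org zone instead:
# the answer section may well be empty and the NS records show up in
# the authority section.
resp = dns.query.udp(query, "192.0.2.2", timeout=5)
print("answer:", resp.answer, "authority:", resp.authority)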

So if you (I) are querying a parent domain authoritative DNS server for NS records, you (I) should remember to use 'dig +noall +authority +answer', not my handy 'cdig' script that does 'dig +noall +answer'. Using the latter will just lead to some head scratching about how the authoritative DNS server for the university's top level domain doesn't seem to want to tell me about its DNS subdomain delegation data.

sysadmin/DNSRepliesWhereNSRecordsShowUp written at 22:08:38

2024-05-10

It's very difficult to tell if a Linux kernel bug is a security issue

One of the controversial recent developments in the (Linux kernel) security world is that the Linux kernel developers have somewhat recently switched to a policy of aggressively issuing CVEs for kernel changes. It's simplest to quote straight from the official kernel.org documentation:

Note, due to the layer at which the Linux kernel is in a system, almost any bug might be exploitable to compromise the security of the kernel, but the possibility of exploitation is often not evident when the bug is fixed. Because of this, the CVE assignment team is overly cautious and assign CVE numbers to any bugfix that they identify. [...]

Naturally this results in every new patch release of a stable kernel containing a bunch of CVE fixes, which has people upset. There are various things I have to say about this, but I'll start with the straightforward one: the kernel people are absolutely right about the difficulty of telling whether a kernel bug is also a security issue.

Modern exploit development technology has become terrifyingly capable, as have today's exploit developers. It's routine to chain multiple bugs together in complex ways to create an exploit, reliably achieving results that often seem like sorcery to an outsider like me (consider Google Project Zero's Analyzing a Modern In-the-wild Android Exploit or An analysis of an in-the-wild iOS Safari WebContent to GPU Process exploit). Modern exploit techniques have made entire classes of bugs previously considered relatively harmless into security bugs, like 'use after free' or 'double free' memory usage bugs; the assumption today is that any instance of one of those in the kernel can probably be weaponized as part of an exploit chain, even if no one has yet worked out how to do it for a specific instance.

Once upon a time, it was reasonably possible to immediately tell whether or not a bug had a security impact, and you could divide fixed kernel bugs into 'ordinary bug fixes' and 'fixes a security problem'. For many years now, that has not been the case; instead, the fact that a bugfix was also a security fix may only become clear years after it was made. Today, the line between the two is not so much narrow as invisible. Often the difference between 'a kernel bug' and 'a kernel bug that can be weaponized as part of an exploit chain' seems to be whether or not a highly skilled person (or team) has spent enough time and effort to come up with something sufficiently ingenious.

Are there bug fixes that are genuinely impossible to weaponize as part of a security exploit? Probably. But reliably identifying them has proven to be very challenging, or to put it another way the Linux kernel has tried to do it in the past and failed repeatedly, identifying bug fixes as non-security ones when they turned out to be bugs that could be weaponized.

(The Linux kernel still periodically has security bugs that are obvious once discovered and often straightforward to exploit, but these are relatively uncommon.)

(This expands a bit on something I said in a conversation on the Fediverse.)

linux/KernelBugsSecurityNotClear written at 22:27:59

2024-05-09

One of OCSP's problems is the dominance of Chrome

To simplify greatly, OCSP is a set of ways to check whether or not a (public) TLS certificate has been revoked. It's most commonly considered in the context of web sites and things that talk to them. Today I had yet another problem because something was trying to check the OCSP status of a website and it didn't work. I'm sure there's a variety of contributing factors to this, but it struck me that one of them is that Chrome, the dominant browser, doesn't do OCSP checks.

If you break the dominant browser, people notice and fix it; indeed, people prioritize testing against the dominant browser and making sure that things are going to work before they put them in production. But if something is not supported in the dominant browser, it's much less noticeable if it breaks. And if something breaks in a way that doesn't affect even less well used browsers (like Firefox), the odds of it being noticed are even lower. Something in the broad network environment broke OCSP for wget, but perhaps not for browsers? Good luck having that noticed, much less fixed.

Of course this leads to a spiral. When people run into OCSP problems on less common platforms, they can either try to diagnose and fix the problem (if fixing it is even within their power), or they can bypass or disable OCSP. Often they'll choose the latter (as I did), at which point they increase the number of non-OCSP people in the world and so further reduce the chances of OCSP problems being noticed and fixed. For instance, I couldn't cross-check the OCSP situation with Firefox, because I'd long ago disabled OCSP in Firefox after it caused me problems there.

I don't have any particular solutions, and since I consider OCSP to basically be a failure in practice I'm not too troubled by the problem, at least for OCSP.

PS: In this specific situation, OCSP was vanishingly unlikely to actually be telling me that there was a real security problem. If GitHub had to revoke any of its web certificates due to them being compromised, I'm sure I would have heard about it because it would be very big news.

web/OCSPVersusDominantBrowser written at 23:23:35
