Why I'm not enthused about live patching kernels and systems
Unpopular Unix sysadmin opinion: long system uptimes are generally a sign of multiple problems (including lack of security updates).
(This is a drum that I've beaten before, more or less.)
In response a number of people raised the possibility of long uptimes through kernel live patching (eg) and I was rather unenthused about the idea and the technology involved. My overall view is that live patching really needs to be designed into the system from the ground up in order to really work well; otherwise it is a hack to avoid having to reboot. This hack may be justified under some exceptional circumstances, but I'm not enthused about it being used routinely.
Live patching has two broad pragmatic problems as generally implemented. First, the result of live patching a server is a machine where the system in memory is not the same as the system that you'd get after a reboot. You certainly hope that they're very similar, similar enough that you can assume 'it works now' means 'it'll work after a reboot', but you can't be sure. This is the same fundamental problem we have on a larger scale with servers that have been installed a while ago and then updated in place, where they're not the same as what you'd get if you rebuilt from scratch now (although you sure hope that they're close enough). The pragmatic problems with this have increasingly driven people to designs involving immutable system installs (in one way or another), where you never try to update things in place and always rebuild from scratch.
The second is that live patching becomes a difficult technical problem when data structures in memory change between versions of the code, for example by being reshaped or by having the semantics of fields change (including such thing as new flag values in flag fields). To really deal with this you need to disallow such changes entirely, do flawless on the fly translation of data structures between their old and new versions, or have data structures be versioned and have the code be able to deal with all versions (which is basically on the fly translation implemented in the OS code instead of the patcher). This problem is sufficiently hard that many live patching systems simply punt and explicitly can't be used if data structures have changed (it looks like all of the Linux kernel live patching systems take this approach).
(Not supporting live patching if data structures have changed semantics in any way does open up the interesting question of how you detect if this has happened. Do you rely on people to carefully read the code changes by hand and hope that they never make a mistake?)
You can build a system that's designed to make live patching fully reliable even in the face of these issues. But I don't think you can add live patching to an existing system (well) after the fact and get this; you have to build in handling both issues from the ground up, and this is going to affect your system design in a lot of ways. For instance, I suspect it drives you towards a lot of modularity and explicitly codified (and enforced) internal APIs and message formats. No current widely used Unix system has been designed this way, and so all kernel live patching systems for them are hacks.
(Live patching a kernel is a technically very neat hack, as is automatically analyzing source code diffs in order to determine what to patch where in the generated binary code. I do admire the whole technology from a safe distance.)
Comments on this page:Written on 09 November 2017.