2017-02-06
Systemd's slowly but steadily increasing idealism about the world
When it started out, systemd was in many ways relentlessly pragmatic.
My shining example of this is that the developers went to fairly
great lengths to integrate both System V init scripts and /etc/fstab
into systemd in a deep and thus quite convenient way. The
easy way would have been to just run things and mount filesystems
through some compatibility shims and programs. Systemd went the extra
distance to make them more or less real units, which means that you can
do things like add extra dependencies to System V init scripts through
/etc/systemd/system
overrides, just as if they were native systemd
units.
(This has not always worked seamlessly, especially for mounts, but it has gotten better over time.)
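To make this concrete, here is a minimal sketch of such an override (the 'legacyd' service name is hypothetical). Because systemd exposes /etc/init.d/legacyd as a legacyd.service unit, a drop-in file applies to it just as it would to a native unit:

    # /etc/systemd/system/legacyd.service.d/deps.conf
    # Make the SysV init script wait for the network to be fully up.
    [Unit]
    Wants=network-online.target
    After=network-online.target

After a 'systemctl daemon-reload', systemd picks up these extra dependencies the next time it starts the generated legacyd.service unit.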
As well as being convenient for people using systemd, I suspect that this was a pragmatic decision. Being a better System V init than SysV init itself undoubtedly didn't hurt systemd's case to be the winning init system; it gave people a few more reasons to like systemd and approve of it and maybe even push for it.
Unfortunately, since then the systemd developers have shown an increasing streak of idealism. More and more, systemd seems not to be interested in dealing with the world as it actually is, with all of its grubby inconvenient mess; instead it simply assumes some pure and ideal version of things. If the world does not measure up to how it is supposed to be, well, that is not systemd's problem. Systemd will do the nominally right thing no matter how that works out in practice, or doesn't.
Exhibit one for this is how systemd interprets LSB dependencies in System V init scripts. These dependencies are predictably wrong in any number of cases, because they've never really been used before. Ignoring them and just running init scripts in order (with some pauses in the middle) would be the pragmatic choice, but instead systemd chose the idealistic one of 'we will assume that declared dependencies are correct, gain a minor speed boost, and if things blow up it's not our fault'.
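(For illustration, the LSB dependencies in question live in a comment block at the top of each init script, which systemd's systemd-sysv-generator reads; the script name here is hypothetical:

    ### BEGIN INIT INFO
    # Provides:          legacyd
    # Required-Start:    $network $remote_fs
    # Required-Stop:     $network $remote_fs
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    ### END INIT INFO

Roughly speaking, systemd takes Required-Start at face value, turns it into After= ordering dependencies on the corresponding units, and starts the script as early as those dependencies allow, so a script with wrong or missing headers can be run before things it actually needs.)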
Exhibit two for me is the general non-resolution of our SATA port multiplier issue with device naming. The general systemd view seems to be that this is not their problem; either it should be in some vague diffuse other system that no one is writing today, or the kernel's sysfs should provide more direct information, or both. In no case is this going to be solved by systemd. Never mind that systemd is getting things blatantly wrong; it is not their problem to fix, even though they could. This once again is clear idealism and purity triumphing over actual usability on actual hardware and systems.
It seems clear to me that systemd is less and less a pragmatic system where the developers are willing to make compromises and go out of their way to deal with the grubby, messy world as it actually is, and more and more a project where the developers want to deal with a pure world where things are all done in the philosophically right way. We all know how this ends up, because we have seen this play out in security, among other places. If you're not solving problems in the real world, you're not really solving problems; you are just being smug and 'clever'.
(This elaborates on and explains an old tweet of mine.)
PS: Or perhaps systemd was always sort of like this, and I didn't really notice it before. You do need more than a little bit of idealism to think 'we will do an init system right this time', and certainly systemd had some idealistic beliefs right from the start. Socket activation and (not) handling things that wanted to depend on the network being up are the obvious cases. Systemd was arguably correct but certainly kind of annoying about them.
Our advantage in reliable backups is that we get restore requests
Recently Gitlab had a data loss incident where part of the problem was that, to quote their incident report:
- So in other words, out of five backup/replication techniques deployed none are working reliably or set up in the first place. [...]
Like many people, this set me to thinking about the perennial issue of quietly broken backups and how we're doing on that score (since there's nothing like someone else's catastrophe to make you wonder how you'd do in a similar situation). One of my realizations here is that we have a subtle advantage over many other organizations. That advantage masquerades as what some people would see as a drawback; it is that we get asked to restore accidentally deleted files on a reasonably routine basis.
In a lot of places, backups are only for disaster recovery. If you have users and the users delete something thoroughly enough to defeat whatever you have in place to let them change their minds, well, that thing is gone. Sure, you could theoretically spin up the backups, pull out a database backup that should have the deleted thing, restore the database to a scratch system, extract the deleted data, and add it back to the live database, but in practice you're not going to do that short of an extreme case that involves someone high up in the organization making a very special exemption. In these places your backups are completely untested unless you go out of your way to check them.
But we serve a general user population keeping (Unix) files on (Unix) fileservers, and sometimes they delete their files by accident and want them back, and our backups exist in part for that. So we restore them on request, with no exceptional permission needed or anything. The mere process of satisfying routine restore requests of 'I deleted file X sometime Tuesday, please get it back' serves as a basic test of our backup system (and sometimes of our long term archive system as well, depending on how far back they deleted what they want back). We probably get one or two such restore requests a month, so once or twice a month we re-verify that our backups are working as a side effect of handling them.
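(As a concrete sketch, a routine single-file restore with Amanda's interactive amrecover tool can go roughly like this, where the configuration name, host, filesystem, and file names are all hypothetical and setdate selects a dump from before the deletion:

    # amrecover daily
    amrecover> sethost fileserver1
    amrecover> setdisk /h/281
    amrecover> setdate 2017-01-31
    amrecover> cd auser
    amrecover> add important-file
    amrecover> extract

Every such session implicitly checks that the relevant dumps exist, that their indexes are intact, and that the data actually reads back.)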
One consequence of this is that we actively pay attention to the state of our backup system, because we know it's going to be used reasonably regularly. We get daily backup reports and we look for at least outright failures to back things up (and they happen every so often). And in general, tending the backup system is something that's actively on our minds; not as a high priority most of the time, but as a routine ongoing background activity that does need a certain amount of manual attention. Since we know it needs that attention, if the system went quiet and stopped, eg, demanding new 'tapes', we'd probably start wondering what was wrong.
(We don't necessarily look carefully at everything our backup system reports, which can mean some issues linger for a while.)
There are decided limits to this implicit testing we're doing. For a start it's neither systematic nor comprehensive, and on top of that we very rarely get asked to restore more than a single file, so we don't really test full restores of even a directory hierarchy very often (and we've never had to test a full restore of a filesystem). Even when we do restore a directory hierarchy, we haven't tested to see if deleted or renamed files are handled correctly; if the restore leaves kind of a mess, that's left for the user and their Point of Contact to sort out.
Possibly we should try to do better than this. On the other hand you can argue that the likely payoff for much better testing is pretty low, since we have a good understanding of the components of our backup system and it doesn't involve anything that's particularly likely to be fragile in ways that Amanda's existing reports wouldn't notice and cover.
(I'm waving my hands here and justifying it is going to require another entry.)