The general lesson from the need for metrics
What I learned about why metrics are important is a valuable lesson, but it's a specific one. It would be a shame to stop there, because there is a general lesson lurking in the underbrush behind it. That is:
Fallible humans are always going to overlook something.
This is the real lesson of fragile complexity, in all its various specific facets. Our systems are too complex for us to genuinely understand, and that complexity means we are always going to overlook something (and sooner or later that something will matter).
One of the things we need to do in system administration is to engineer large-scale, high-level approaches to our problems that can deal with this messy realization and that do not depend on post-facto specific fixes. It's always tempting to apply post-facto fixes, to say things like 'I'll make sure to check for performance problems after future changes to our fileserver infrastructure', but this is never going to be good enough. Even apart from the pragmatic issues pointed out by Perry Lorier in a comment, this is a fundamentally backwards-looking solution; it deals with the problem we found this time around but it doesn't necessarily deal with a future problem.
This is the generalized reason for automated metrics collection and monitoring. If you gather metrics you're constructing a backstop for human fallibility. If and when something goes wrong because of something people overlooked, you have a chance to see it and catch it before things explode, a chance that you would not have if you relied purely on post-facto fixes.
A direct corollary of this is that it's important to gather all the metrics that you can, even for things that you don't think you have any use for. Gathering only metrics you have a use for now is a backwards-looking solution; you're assuming that you know what you need. Fragile complexity says that you're wrong; you don't yet know what you're going to want in order to spot the next problem, a problem that you didn't even foresee being possible. So gather everything you can. That way you have a chance to beat the future.
Things that systemd gets right
On Twitter, I recently put forward the heretical opinion that systemd is actually a good thing (as I've written a bit about before). Now, systemd is not flawless or without worrisome tendencies and it has a number of features that I'm indifferent to, but I do think that it gets quite a lot of things right. Today I feel like trying to list them off (partly so that I have this in one place for future use).
(A disclaimer: this is from the perspective of someone who runs servers and thus doesn't really care about systemd features like minimizing boot time or not actually starting various sorts of programs until someone asks for them.)
To begin with, a terminology note. What systemd calls a unit is what we would otherwise call an init script (well, it's a superset of that, but we'll ignore that for now). I'll be using 'unit' and 'units' throughout this.
So, in no particular order:
- systemd has a strong separation between system-supplied units, which
go in /lib/systemd, and sysadmin-supplied units, which go in
/etc/systemd. This is very helpful for keeping track of the latter.
- you can override a system-supplied unit with a sysadmin-supplied one
without changing or removing the system-supplied one.
(Why, you would think that systemd was written by people who understood modern package management.)
- what units are enabled in various states is stored in the filesystem
in a visible form, not locked up in a magic database somewhere.
- you can have units installed without being activated, unlike Upstart.
- systemd allows units to shim themselves into the startup order so that
they get started before some other unit; you do not have to alter
the other unit to enable this (unlike Upstart again).
(systemd is not perfect here; in the general case you can't reorder existing units without editing some of them. But you can do this by overriding the system-supplied unit with your own copy, per above.)
- systemd unit configuration files are easy to write and easy to read;
they contain almost the minimal
information necessary with very little extraneous fluff. They do not
- systemd handles a lot of annoying infrastructure for you; for example,
you do not have to arrange to daemonize programs you run.
- systemd starts and restarts services in a consistent and isolated
environment, not in whatever your
current environment is when you run the start and restart commands.
- systemd keeps track of what processes belong to a particular service,
so it can both list all the processes that are part of a service and
tell you what service a particular process is part of. This is a boon
- because it actively tracks unit status, conditional restarts are
not dangerous; it shares this behavior with
any competently implemented active init system.
(SysV init scripts are a passive system; Upstart, Solaris SMF, and systemd are all active ones.)
- during boot, systemd reports unit startups as they happen (and reports
if they succeeded or failed). You would think that this is a basic
feature that everyone has, but no; neither SMF nor Upstart does this.
- unit names are succinct (unlike SMF).
- it apparently does per-user fair share scheduling by default (but I haven't had a chance to run systemd in a situation where I could really see this in action).
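To make a few of the points above concrete, here is a rough sketch of what a small sysadmin-supplied unit might look like. This is not taken from any real package: the 'frobd' service, its binary path and '--foreground' flag, and 'other.service' are all made up for illustration. The directives themselves (Description=, Before=, ExecStart=, Restart=, WantedBy=) are standard systemd ones.

```ini
# /etc/systemd/system/frobd.service
# Hypothetical example: 'frobd', its path and flag, and other.service are invented.

[Unit]
Description=Frob daemon (illustrative example)
# Ask to be ordered before another unit, without editing that unit.
Before=other.service

[Service]
# The program just runs in the foreground; systemd supervises it,
# so it does not have to daemonize itself.
ExecStart=/usr/local/sbin/frobd --foreground
Restart=on-failure

[Install]
# 'systemctl enable frobd.service' records this as a visible symlink
# under /etc/systemd/system/multi-user.target.wants/.
WantedBy=multi-user.target
```

Dropping this file in place does not by itself activate anything; the unit only starts at boot once it has been enabled, and the enablement shows up as an ordinary symlink in the filesystem.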
In common with other active systems, systemd starts units in parallel when possible. I don't consider this a striking advantage, especially because other systems do it too.
(I may update this with additional things as they occur to me or as people mention them, since I've probably missed some.)
Sidebar: how I feel about the competition
The competition that I know of is SMF and Upstart. SMF is encrusted with complexity and dates from the days when people thought XML was a good idea; it is 'enterprisey' in a bad way. I consider it a step backwards from System V init scripts. Upstart is a flawed attempt and not bold enough; even ignoring the flaws, it isn't a significant enough improvement over SysV init scripts to be worth the pain of conversion.
(In other words, Upstart is an improvement but not a significant and worthwhile one.)