Something systemd is missing for diagnosing oddly slow boots
In yesterday's entry, I mentioned that
my office machine has always been oddly slow to boot (and this goes
back years, as far back as when I switched to systemd's networkd). Over the years I have made various attempts
to figure out why this was so and what I could do about it, generally
not getting anywhere with them. As part of this, I have of course
poked around at what
systemd-analyze could tell me, looking at
things like 'systemd-analyze blame' and 'systemd-analyze plot'. None of them have given me any useful
understanding of what's going on.
(One of the things that drives this is that I have a very similar
Fedora system at home, and it has always booted visibly faster.
This has driven me not just with the general question of 'why does
my office machine boot what feels like slowly' but also the specific
question of 'what makes my office machine slower than my home
machine'. However I've just realized that I've never systematically
explored the details of this difference, for example by carefully
comparing 'systemd-analyze blame' output on both machines.)
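Were I to do that comparison systematically, a small script could line the two 'systemd-analyze blame' outputs up against each other. Here's a rough sketch; the unit names and timings below are made up for illustration:

```python
# Sketch: compare saved 'systemd-analyze blame' output from two machines.
# The sample data below is invented for illustration.
import re

def parse_blame(text):
    """Parse 'systemd-analyze blame' lines into {unit: seconds}."""
    times = {}
    for line in text.strip().splitlines():
        dur, unit = line.rsplit(None, 1)
        secs = 0.0
        for num, suffix in re.findall(r'([\d.]+)(min|ms|s)', dur):
            secs += float(num) * {'min': 60, 's': 1, 'ms': 0.001}[suffix]
        times[unit] = secs
    return times

office = parse_blame("""
12.450s zfs-import-cache.service
 6.320s unbound.service
 1.210s NetworkManager.service
""")
home = parse_blame("""
 3.100s zfs-import-cache.service
 2.050s unbound.service
 1.300s NetworkManager.service
""")

# Units sorted by how much slower the office machine is than home.
for unit in sorted(office, key=lambda u: home.get(u, 0.0) - office[u]):
    print(f"{unit}: {office[unit] - home.get(unit, 0.0):+.2f}s")
```

This at least turns the vague feeling of 'slower' into a per-unit number you can chase.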
Based on 'systemd-analyze blame', I suspect that some of the delay
is happening inside the startup of specific units and even specific
programs (for example, starting up ZFS seems to take significantly
longer on my office machine than at home). Diagnosing this is
something that systemd can only partially help with, although it
could do more than it currently does. It would be quite helpful to
get the timings of each of the separate Exec steps in a single
.service unit, so that I could see, for example, which bit of
Unbound's multi-step .service is making it take over 12 seconds to
start.
(In Unbound's case I have a suspicion about which step is
so slow, but it would be nice to know for sure. This information
may inconsistently be available in the giant blob of stuff you
get from 'systemd-analyze dump'; it is on one machine and it
isn't on the other. It may depend on how long ago the system was
booted.)
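When the dump does have this information, it shows up as start_time and stop_time fields on each Exec entry (they read '[n/a]' when it doesn't), and you can scrape them out. A rough sketch against an invented dump excerpt; the exact field layout may vary between systemd versions:

```python
# Sketch: pull per-Exec-step timings out of saved 'systemd-analyze dump'
# output. The excerpt below is invented for illustration; on real output
# the start_time/stop_time fields may just say [n/a].
import re
from datetime import datetime

SAMPLE = """\
-> Unit unbound.service:
  ExecStartPre={ path=/usr/sbin/unbound-checkconf ; argv[]=/usr/sbin/unbound-checkconf ; start_time=[Wed 2024-05-01 09:00:01 EDT] ; stop_time=[Wed 2024-05-01 09:00:09 EDT] }
  ExecStart={ path=/usr/sbin/unbound ; argv[]=/usr/sbin/unbound ; start_time=[Wed 2024-05-01 09:00:09 EDT] ; stop_time=[Wed 2024-05-01 09:00:13 EDT] }
"""

def parse_ts(s):
    # Drop the leading weekday and the trailing timezone name.
    return datetime.strptime(' '.join(s.split()[1:3]), '%Y-%m-%d %H:%M:%S')

for m in re.finditer(
        r'(Exec\w+)=\{ path=(\S+) .*?'
        r'start_time=\[([^]]+)\] ; stop_time=\[([^]]+)\]', SAMPLE):
    step, path, start, stop = m.groups()
    if start == 'n/a':
        continue
    secs = (parse_ts(stop) - parse_ts(start)).total_seconds()
    print(f'{step} {path}: {secs:.0f}s')
```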
However, there's something important that systemd could do that it doesn't, which is record what it was that finally allowed (or caused) a service to be started. Every boot-time unit has a set of dependencies and preconditions, and it only starts when all of them are ready; the whole thing forms an ordered graph. But right now, as far as I can tell, systemd does not explicitly record the information required to reconstruct the critical path through the graph for any given service. To put it another way, you can't reliably ask 'why didn't this service start before then?', at least not easily. In an ideal world you could load up a plot of your entire boot, click on any unit, and have at least its critical predecessor shown.
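To illustrate what recording this would buy you: if you knew each unit's ordering predecessors and when each of them finished activating, walking the critical path is just repeatedly following whichever predecessor finished last. A toy sketch, with invented units and times:

```python
# Toy sketch of critical-path reconstruction. The units, ordering
# dependencies, and finish times here are all made up for illustration.

# unit -> (units it is ordered after, time its activation finished)
units = {
    'sysinit.target':    ([], 2.0),
    'network.target':    (['sysinit.target'], 9.5),
    'zfs-mount.service': (['sysinit.target'], 14.0),
    'unbound.service':   (['network.target'], 12.0),
    'multi-user.target': (['zfs-mount.service', 'unbound.service'], 14.1),
}

def critical_chain(unit):
    """Follow, from each unit, the predecessor that finished last."""
    chain = [unit]
    while True:
        preds, _finished = units[chain[-1]]
        if not preds:
            return list(reversed(chain))
        chain.append(max(preds, key=lambda p: units[p][1]))

print(' -> '.join(critical_chain('multi-user.target')))
```

The catch, and the point of the complaint, is that systemd doesn't durably record the per-unit data you'd need to feed something like this.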
(In theory 'systemd-analyze critical-chain' might answer this question. In practice it gives me answers that are clearly off; for example, it currently claims that one of the services on the critical chain for getty@tty1.service started more than ten seconds after it says that the getty did. Also, the manual page specifically mentions various cautions.)
In a related issue, I also wish that systemd-analyze would give
you the unit start time and duration information in text form,
instead of only embedded in an SVG through 'systemd-analyze plot'.
This would make it easier to pick through and do my own analysis
of this sort of thing, and the information is clearly available.
It's just that systemd-analyze has apparently decided that the only
form we could want it in is an SVG.
Sidebar: Deciding on what you care about for startup duration
If you run 'systemd-analyze critical-chain', it gives you the
critical chain for multi-user.target by default. But is this actually
what you care about? In my case, I've decided that one thing that
I care about is when I can log in to my just rebooted machine, which
seems to happen well before multi-user.target is considered officially
finished. That's why I've also been looking at getty@tty1.service,
which I think is a reasonable indication of reaching this point.