Something systemd is missing for diagnosing oddly slow boots

October 9, 2018

In yesterday's entry, I mentioned that my office machine has always been oddly slow to boot (and this goes back years, as far back as when I switched to systemd's networkd). Over the years I have made various attempts to figure out why this was so and what I could do about it, generally not getting anywhere with them. As part of this, I have of course poked around at what systemd-analyze could tell me, looking at things like 'systemd-analyze blame', 'systemd-analyze critical-path', and 'systemd-analyze plot'. None of them have given me any useful understanding of what's going on.

(One of the things that drives this is that I have a very similar Fedora system at home, and it has always booted visibly faster. This has driven me not just with the general question of 'why does my office machine boot what feels like slowly' but also the specific question of 'what makes my office machine slower than my home machine'. However I've just realized that I've never systematically explored the details of this difference, for example by carefully comparing 'systemd-analyze blame' on both machines.)

Based on 'systemd-analyze blame', I suspect that some of the delay is happening inside the startup of specific units and even specific programs (for example, starting up ZFS seems to take significantly longer on my office machine than at home). Diagnosing this is something that systemd can only partially help with, although it could do more than it currently does. It would be quite helpful to get the timings of each of the separate Exec steps in a single .service unit, so that I could see, for example, which bit of Unbound's multi-step .service is making it take over 12 seconds to start.

(In Unbound's case I suspect that it's 'unbound-anchor' that's so slow, but it would be nice to know for sure. This information may inconsistently be available in the giant blob of stuff you get from 'systemd-analyze dump'; it is on one machine and it isn't on the other. It may depend on how long ago the system was booted.)

However, there's something important that systemd could do that it doesn't, which is record what it was that finally allowed (or caused) a service to be started. Every boot time unit has a set of dependencies and preconditioned, and it only starts when all of them are ready; the whole thing forms an ordered graph. But right now, as far as I can tell, systemd does not explicitly record the information required to reconstruct the critical path through the graph for any given service. To put it another way, you can't reliably ask 'why didn't this service start before then', at least not easily. In an ideal world you could load up a plot of your entire boot, click on any unit, and have at least its critical predecessor shown.

(In theory 'systemd-analyze critical-chain' might answer this question. In practice it gives me answers that are clearly off; for example, it currently claims that one of the services on the critical chain for getty@tty1.service started more than ten seconds after it says that the getty did. Also, the manual page specifically mentions various cautions.)

In a related issue, I also wish that systemd-analyze would give you the unit start time and duration information in text form, instead of only embedded in a SVG through 'systemd-analyze plot'. This would make it easier to pick through and do my own analysis of this sort of thing, and the information is clearly available. It's just that systemd-analyze has apparently decided that the only form we could want it in is a SVG.

Sidebar: Deciding on what you care about for startup duration

If you run 'systemd-analyze critical-chain', it gives you the critical chain for by default. But is this actually what you care about? In my case, I've decided that one thing that I care about is when I can log in to my just rebooted machine, which seems to happen well before is considered officially finished. That's why I've also been looking at getty@tty1.service, which that's a reasonable indication of reaching this point.

Written on 09 October 2018.
« It's good to check systemd for new boot-time things every so often
Even systemd services and dependencies are not self-documenting »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Tue Oct 9 00:03:41 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.