Chris's Wiki :: blog/linux/SystemdUbuntuRebootFailure Commentshttps://utcc.utoronto.ca/~cks/space/blog/linux/SystemdUbuntuRebootFailure?atomcommentsDWiki2017-09-07T15:45:36ZRecent comments in Chris's Wiki :: blog/linux/SystemdUbuntuRebootFailure.By Tiago on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:2423b01a11579b3bf0cd656e2f947568a97892d7Tiago<div class="wikitext"><p>Did you try <a href="https://freedesktop.org/wiki/Software/systemd/Debugging/#shutdowncompleteseventually">https://freedesktop.org/wiki/Software/systemd/Debugging/#shutdowncompleteseventually</a>?</p>
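<p>(For reference, the approach that page describes is roughly the following; this is reconstructed from memory, so verify the exact parameters against the page itself. The idea is to boot with systemd's debug logging directed at the kernel log buffer, so that whatever the shutdown is waiting on shows up in the log:)</p>

```
# Kernel command line additions (approximate; see the Debugging page):
systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on
```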
<p>You could also <a href="https://bugs.launchpad.net/ubuntu/+source/systemd/+filebug">open a bug against systemd in Ubuntu</a> with the steps you have taken and logs collected so far.</p>
</div>2017-09-07T15:45:36ZFrom 12.34.36.250 on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:acfb364c57715e3850c0c3e0562f9d5b09615755From 12.34.36.250<div class="wikitext"><p>I don't know, Chris - that seems like more blame shifting. Every time there's a problem with systemd, the response is 'well... it's not really systemd's fault, it's something else; the best response is to fix the underlying problem with the foobar service', and that really kicks up the angst. At some point, deserved or not, systemd needs to actually own the process of starting, stopping, and rebooting the system without having to rely on finger-pointing.</p>
<p>But that's the gripe some of us have -- systemd is under active development -- it's a large, complex piece of software that we're trusting to do the right thing, and it's failing in many cases. It seems like it could have used a few more years of development so that it doesn't feel like we're beta testing it.</p>
</div>2017-09-07T13:33:17ZBy Chris Adams on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:851df52ec8b305f15a127f7f49cd8a3651a5709fChris Adams<div class="wikitext"><p>D.F., if this turns out to be related to filesystem unmounting, note that the same problem occurs with Upstart and SysV. There's no simple answer that satisfies everyone: a hard unmount loses data, and waiting too long is effectively a denial of service. The underlying fix would be hardening NFS so it wouldn't fail so easily into an unresponsive state.</p>
</div>2017-09-06T23:36:50ZBy D.F. on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:86842cc68197a4de52cc31aadc8570ae2bbc779fD.F.<div class="wikitext"><p>I just have to ask... after all the crapulence that systemd has introduced into the Linux world, why do people continue to indulge in this horrible software? Why are there excuses made for it? It's the Microsoft of yesteryear: 'well, sure... it messed up on something easy and erased all my data, but I love it!'</p>
</div>2017-09-06T21:49:31ZBy K.C. Marshall on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:d00267d95dda98c4cf485a6368ee445a1a904ca1K.C. Marshall<div class="wikitext"><p>Perhaps you could activate a hardware watchdog to kick the machine at the BIOS level if the reboot doesn't finish in 5 or 10 minutes. Another option might be to use the IPMI/iDRAC interface to power cycle the machine when the reboot takes too long to finish. The current solution is a hard boot anyway, so a watchdog- or IPMI-triggered reboot is not much different. There is still some need to monitor the machine to know if it is supposed to be kicked harder.</p>
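<p>(Along these lines: systemd itself can arm a hardware watchdog for exactly this case, via <code>ShutdownWatchdogSec=</code> in <code>system.conf</code>. This assumes the machine exposes a watchdog device the kernel supports; the value here is illustrative:)</p>

```
# /etc/systemd/system.conf
[Manager]
# Arm the hardware watchdog during reboot/shutdown, so the board
# forcibly resets the box if the reboot wedges for this long:
ShutdownWatchdogSec=10min
```

<p>Alternatively, from another machine, <code>ipmitool -H &lt;bmc&gt; -U &lt;user&gt; -I lanplus chassis power cycle</code> does the external power cycle described.</p>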
</div>2017-09-06T17:58:48ZBy Alan on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:ed6baa1c7285f36565903ba58c815275020a092bAlan<div class="wikitext"><p>@Dan: there's a job timeout (<code>JobTimeoutSec</code>) set on the shutdown targets:</p>
<p><a href="https://github.com/systemd/systemd/commit/58f2fab16da947052756b7f9ace40f6ee7fa1519#diff-bd47e097256c120a87c488b56cc9b133">https://github.com/systemd/systemd/commit/58f2fab16da947052756b7f9ace40f6ee7fa1519#diff-bd47e097256c120a87c488b56cc9b133</a></p>
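<p>(If I'm reading that diff right, the stanza it adds to <code>reboot.target</code> looks roughly like the following; the exact values may differ in your systemd version, so check the commit itself. It would also line up neatly with a stuck reboot finally going through after about 30 minutes:)</p>

```
# units/reboot.target (approximate, per the commit above)
[Unit]
JobTimeoutSec=30min
JobTimeoutAction=reboot-force
```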
</div>2017-09-06T16:25:16ZBy Dan.Astoorian on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:e81b00359c37b963e0345a08b7b5ab2a47b9ddd8Dan.Astoorian<div class="wikitext"><p>For what it's worth, I've also experienced this a few times with one of our CentOS7 servers (a Dell PowerEdge R430).</p>
<p>On one of these instances earlier this summer, I left it alone when it failed to reboot, and it eventually did reboot on its own after almost exactly 30 minutes. (I was being particularly patient on that occasion because the reboot was for a BIOS update.)</p>
<p>The server was an NFS and CIFS client (but not a network fileserver), so it's plausible that the delay was related to one of those services, but I really had no way of evaluating what systemd might have been waiting for. It's also a web server, but I can't think of anything Apache could be doing to tie things up.</p>
<p>I spent a short time hunting for 30-minute timeouts in the service definitions, but didn't find anything promising; if anyone knows what takes 30 minutes for systemd and/or the kernel to give up on, that might be a clue.</p>
<p>--Dan</p>
</div>2017-09-06T14:32:03ZBy Florian Beer on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:63b5115611191ba5f500db0d48e1e18561a3c149Florian Beerhttps://blog.no-panic.at<div class="wikitext"><p>We have the same problem here, although on CentOS 7 VMs (ESXi) with systemd. They are <em>very</em> unreliable when rebooting after our periodic patch cycles, and so far we have found no remedy.</p>
<p>What we've found out is that it only affects busy systems and is somehow (possibly?) connected to the swap partition not being unmountable quickly enough, or being locked somehow. We even went as far as defining custom runlevels (aka targets) that encompass all software installed after the bare OS setup. Our reboot process would then first shut down this software (webservers, DBs, etc.) before actually attempting the reboot. The hope was that swap would then be sufficiently empty/unused that systemd wouldn't hang, but it has proven very difficult to even test this, because the bug is so inconsistent.</p>
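<p>(A sketch of the custom-target arrangement described, with hypothetical unit names; the <code>PartOf=</code> drop-ins are what make stopping the target propagate to the member services:)</p>

```
# /etc/systemd/system/appstack.target (hypothetical)
[Unit]
Description=Locally installed application stack
Wants=httpd.service mariadb.service

# Drop-in for each member service, e.g.
# /etc/systemd/system/httpd.service.d/appstack.conf:
[Unit]
PartOf=appstack.target
```

<p>Then <code>systemctl stop appstack.target</code> before rebooting stops the whole stack first.</p>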
</div>2017-09-06T10:50:17ZBy Alan on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:d77f7f2216188d6aea7920f69600e31f33c92d96Alanhttps://twitter.com/sourcejedi<div class="wikitext"><p>In summary, this one might be sufficiently widely infuriating that you'll eventually see some work on it merged upstream.</p>
<p>It seems a clear victim of code churn (and people giving up and/or completely failing to understand the previous design).</p>
<p>It would be interesting to know how Debian has handled this. I know they're even less likely to backport fixes; it's just that their discussions work a bit differently, so there might be less noise about possibly unrelated hardware issues. (I've seen a developer point out that their tracker effectively raises the bar on who will even try to report issues, reducing their workload.)</p>
</div>2017-09-06T09:28:32ZBy Alan on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:61a8557e08dd47f0377fc070c655ebb4c69aec06Alanhttps://twitter.com/sourcejedi<div class="wikitext"><p>There are also some other comments in the issue:</p>
<p><a href="https://github.com/systemd/systemd/issues/6115">https://github.com/systemd/systemd/issues/6115</a></p>
<p>Looking at sysvinit on Debian I can't immediately see any timeouts for unmounting. It has a nice sequence</p>
<ul><li>K03sendsigs</li>
<li>K04rsyslog</li>
<li>K05hwclock.sh</li>
<li>K05umountnfs.sh</li>
<li>K06networking</li>
<li>K07umountfs</li>
<li>K08lvm2</li>
<li>K09umountroot</li>
<li>K10halt</li>
</ul>
<p>The main difference is that, as you've noticed, systemd doesn't have a <code>sendsigs</code> unit.</p>
<p>If you have any user processes that survive, including any SSH session (SSH sessions aren't affected by the getty units or the xdm unit, unlike other types of session, and they're deliberately supposed not to be affected by the SSH unit), then those processes survive service bringdown (shutdown.target), the mount units they pin will "fail" (not stop cleanly, I think, with no timeout delay), and you're then reliant on the crude systemd-shutdown logic. But the networking service has been stopped by that point, so I guess NFS mounts get cranky.</p>
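<p>(One mitigation for surviving user processes, if its side effects are acceptable: tell logind to kill leftover session processes when the session ends, so they can't pin mounts at shutdown. Note this also kills things like screen/tmux sessions:)</p>

```
# /etc/systemd/logind.conf
[Login]
KillUserProcesses=yes
```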
</div>2017-09-06T08:22:29ZBy Alan on /blog/linux/SystemdUbuntuRebootFailuretag:CSpace:blog/linux/SystemdUbuntuRebootFailure:187ca987d346845cab798d3dbf69d3ea334350a6Alanhttps://twitter.com/sourcejedi<div class="wikitext"><p>It might not be the only issue, but I was working on systemd recently and I remember seeing this PR come in</p>
<blockquote><p>The remount read-only and subsequent umount operations are currently not limited. As a result, the shutdown operation can stall endlessly due to inaccessible NFS mounts, or a number of similar factors. This results in a manual system reset being necessary.</p>
</blockquote>
<blockquote><p>With these changes, the remount is now limited to a maximum of 6 attempts (UMOUNT_MAX_RETRIES + 1). In addition, the remount operation has been moved to a separate child process that is limited in duration. Each remount operation is limited to 90 seconds (DEFAULT_TIMEOUT_USEC) before the child process exits with a SIGALRM and reports the failure.</p>
</blockquote>
<p><a href="https://github.com/systemd/systemd/pull/6598">https://github.com/systemd/systemd/pull/6598</a></p>
</div>2017-09-06T07:52:50Z