Systemd on Ubuntu 16.04 can't (or won't) reliably reboot your server
We just went through a periodic exercise of rebooting all of our Ubuntu servers in order to get up to date on kernels and so on. By now almost all of our servers are running Ubuntu 16.04, which means that they're using systemd. Unfortunately this gives us a real problem, because on Ubuntu 16.04, systemd won't reliably reboot your system. On some servers, usually the busiest and most important ones, the system will just stop during the shutdown process and sit there. And sit there. And sit there. Perhaps it would eventually recover after tens of minutes, but as mentioned these are generally our busiest and most important servers, so we're not exactly going to let them sit there to find out what happens eventually.
(There also probably isn't much point to finding out. It's unlikely that there's some miracle cure we can do ourselves, and making a bug report to Ubuntu is almost completely pointless since Ubuntu only fixes security issues and things that are actively on fire. My previous experience wasn't productive and produced no solutions from anyone.)
This goes well beyond my previous systemd reboot irritation. Reliably rebooting servers despite what users are doing to them is a fairly foundational thing, yet Ubuntu's systemd not only can't get this right but doesn't even tell us what's wrong (in the sense of 'what is keeping me from rebooting'). The net effect is to turn rebooting many of our servers into a minefield. Not only may a reboot require in-person intervention in our machine room, but the fact that we can't count on a reboot just working means that we have to actively pay attention to the state of every machine when we reboot it; we can't just assume that machines will come back up on their own unless something is fairly wrong. The whole experience angers me every time I have to go through it.
By now we've enabled persistent systemd journals on most everything in the hopes of capturing useful information so we can perhaps guess why this is happening. Unfortunately so far we've gotten nothing useful; systemd has yet to log or display on the screen, say, 'still waiting N seconds for job X'. I'm not even convinced that the systemd journal has captured all of the log messages that it should from an unsuccessful shutdown, as what 'journalctl -b-1' shows is much less than I'd expect and just stops abruptly.
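As a rough illustration of how one might check where the previous boot's journal actually stops, here is a minimal Python sketch. It assumes journalctl's JSON output format (one object per line, with the '__REALTIME_TIMESTAMP', '_SYSTEMD_UNIT', 'SYSLOG_IDENTIFIER' and 'MESSAGE' fields). The function names tail_of_boot and previous_boot_lines are made up for this example, and this is not something we actually run.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def tail_of_boot(json_lines, count=5):
    """Return (time, source, message) for the last `count` journal entries.

    `json_lines` is an iterable of lines in `journalctl -o json` format,
    assumed to be in chronological order (journalctl's default).
    """
    entries = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # __REALTIME_TIMESTAMP is microseconds since the epoch, as a string.
        usec = int(record["__REALTIME_TIMESTAMP"])
        when = datetime.fromtimestamp(usec // 1_000_000, tz=timezone.utc)
        when += timedelta(microseconds=usec % 1_000_000)
        # Not every entry comes from a unit; fall back to the syslog tag.
        source = record.get("_SYSTEMD_UNIT", record.get("SYSLOG_IDENTIFIER", "?"))
        # MESSAGE is not always a string (for example, non-UTF-8 data).
        message = record.get("MESSAGE", "")
        if not isinstance(message, str):
            message = repr(message)
        entries.append((when, source, message))
    return entries[-count:]


def previous_boot_lines():
    """Hypothetical helper: fetch the previous boot's journal as JSON lines."""
    result = subprocess.run(
        ["journalctl", "-b", "-1", "-o", "json", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Calling tail_of_boot(previous_boot_lines()) on an affected machine would show the last few entries and their timestamps, which can then be compared against when the reboot was started and when the machine was finally power cycled. This only shows what the journal managed to record; it can't tell us what systemd was waiting on.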
(Without an idea of how and why systemd is screwing up, I'm reluctant to change DefaultTimeoutStopSec from its Ubuntu default, as I once discussed here, or make other changes like forcing all user cron jobs to run under user slices.)
(This Ubuntu bug matches one set of symptoms we see, but not all of them. Note that our problem is definitely not the Linux kernel having problems rebooting the hardware; the same Dell servers were previously running Ubuntu 14.04 and rebooting fine, and Magic SysRq will force reboots without problems. There's also this Ubuntu bug and this report of problems with shutting down when you have NFS mounts, which certainly could be part of our problems.)