NetworkManager and network device races
When we set up CentOS 7 on our new iSCSI backends we (and by that I mean 'I') left them using NetworkManager because it worked and I generally believe in leaving systems in their standard state when possible. Things have changed since then and we're now in the process of dropping NetworkManager.
The direct reason we're doing this is that we discovered that our iSCSI backends would not boot reliably with NetworkManager; some of the time the machines would have one of their iSCSI networks be unconfigured after the boot finished (we never saw it happen to both, but I believe it theoretically could). This is obviously really bad, and on top of it NetworkManager would claim that the interface was unavailable and so would refuse to bring it up (this was true even if you did not invoke NM directly).
As far as we can tell, what was happening is that during boot an interface would sometimes bounce; the network link would come up, go down, and then come back up in very rapid succession. While NetworkManager got told about all of these events in order, it would sometimes not handle the final 'link up' event; it would configure the interface due to the initial link up event, deconfigure it due to the link down event, and then be completely convinced that the link was still down. Do not pass go, do not collect any network traffic.
(NM's logging clearly demonstrated that it saw the final link up event, and in fact saw it before it claimed to have started the deconfiguration process.)
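To make the suspected sequence concrete, here is a toy model of it in Python. This is not NetworkManager's code and I have not read NM's source; the names (ToyManager, on_link_event, finish_deconfigure, boot) are invented for illustration. It assumes only what the logs suggested: the teardown is deferred work, the final link up event is seen before that teardown starts, and nothing looks at the link again once the teardown finishes.

```python
class ToyManager:
    """Hypothetical model of the suspected race; not NetworkManager's real logic."""

    def __init__(self, recheck_after_deconfigure=False):
        self.recheck = recheck_after_deconfigure
        self.last_event_up = False       # most recent link event seen (what the logs show)
        self.believed_up = False         # the link state the manager acts on
        self.configured = False          # does the interface have its IP address?
        self.deconfigure_queued = False  # teardown is scheduled but has not run yet

    def on_link_event(self, up):
        # Every event is seen and recorded, in order.
        self.last_event_up = up
        if not up:
            self.believed_up = False
            self.deconfigure_queued = True   # teardown happens later, not now
        elif self.deconfigure_queued:
            # Assumed flaw: the link up is seen but not acted on,
            # because a teardown is still pending.
            pass
        else:
            self.believed_up = True
            self.configured = True

    def finish_deconfigure(self):
        if not self.deconfigure_queued:
            return
        self.deconfigure_queued = False
        self.configured = False
        # One possible fix: look at the link again once the teardown is done.
        if self.recheck and self.last_event_up:
            self.believed_up = True
            self.configured = True


def boot(events, **kwargs):
    """Feed a sequence of link events (True = up) and then run the deferred teardown."""
    nm = ToyManager(**kwargs)
    for up in events:
        nm.on_link_event(up)
    nm.finish_deconfigure()
    return nm
```

Feeding this the bounce (boot([True, False, True])) leaves the interface unconfigured and believed down even though the last event seen was a link up, which matches what we saw. With recheck_after_deconfigure=True the same sequence ends with the interface configured.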
Having this actually happen to us during boot was bad enough. Worse was the idea that this might happen during operation if the link signal bounced for one reason or another. If NetworkManager could decide to blow up our interfaces at any time if things went badly, it had to go.
I'm not going to question NetworkManager's decision to fully deconfigure a statically configured interface (eg removing its IP address) if the network link goes down, because no doubt it has good reason to do so. But this behavior is what made this bug a fatal one; if NM had left the IP address and so on on the 'down' interface, the only actual consequences would have been that NM was wrong about the state of the interface.
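If you want to see whether NM is wrong about an interface in this way, one option is to ask the kernel directly instead of going through NM. The following is a minimal sketch, not something we ran at the time. It assumes the usual Linux sysfs layout, where /sys/class/net/IFACE/operstate and /sys/class/net/IFACE/carrier report the link state; the function name and the sysfs_root parameter are my own inventions (the latter is there so the code can be pointed at a test directory).

```python
from pathlib import Path


def kernel_link_state(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's view of an interface's link, independent of NetworkManager."""
    base = Path(sysfs_root) / iface
    # operstate is a short word such as "up", "down", or "unknown".
    operstate = (base / "operstate").read_text().strip()
    try:
        # carrier contains "1" when the link is up and "0" when it is down.
        carrier = (base / "carrier").read_text().strip() == "1"
    except OSError:
        # Reading carrier fails when the interface is administratively down.
        carrier = None
    return {"operstate": operstate, "carrier": carrier}
```

If this reports a carrier on an interface that NM insists is unavailable, the disagreement is on NM's side.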
(We haven't reported this bug anywhere, which is another entry.)