NetworkManager and network device races

September 30, 2014

When we set up CentOS 7 on our new iSCSI backends we (and by that I mean 'I') left them using NetworkManager because it worked and I generally believe in leaving systems in their standard state when possible. Things have changed since then and we're now in the process of dropping NetworkManager.

The direct reason we're doing this is that we discovered that our iSCSI backends would not boot reliably with NetworkManager. Some of the time a machine would finish booting with one of its iSCSI networks unconfigured (we never saw it happen to both, but I believe it theoretically could). This is obviously really bad. On top of it, NetworkManager (in the form of nmcli) would claim that the interface was unavailable and so would refuse to bring it up; this was true even if you went through ifup instead of invoking NM directly.
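One way to spot a machine in this state would be to look for an interface that has link but no IPv4 address. What follows is only a rough sketch of such a check, not something we actually ran. It assumes a Linux system with the usual /sys/class/net/<name>/carrier file and the ip command from iproute2, and the interface names eth1 and eth2 are made up for illustration.

```python
import subprocess
from pathlib import Path


def link_has_carrier(carrier_text):
    """True if the contents of a sysfs 'carrier' file report link ("1")."""
    return carrier_text.strip() == "1"


def ipv4_addresses(ip_output):
    """Pull the address/prefix fields out of 'ip -o -4 addr show' output."""
    addrs = []
    for line in ip_output.splitlines():
        fields = line.split()
        if "inet" in fields:
            idx = fields.index("inet")
            if idx + 1 < len(fields):
                addrs.append(fields[idx + 1])
    return addrs


def is_stranded(carrier_text, ip_output):
    """Link is up but no IPv4 address is configured on the interface."""
    return link_has_carrier(carrier_text) and not ipv4_addresses(ip_output)


def check_interface(name):
    """Check one live interface; treats an unreadable carrier file as no link."""
    try:
        carrier_text = Path(f"/sys/class/net/{name}/carrier").read_text()
    except OSError:
        # Reading carrier fails if the interface is missing or administratively down.
        carrier_text = "0"
    result = subprocess.run(
        ["ip", "-o", "-4", "addr", "show", "dev", name],
        capture_output=True, text=True, check=False,
    )
    return is_stranded(carrier_text, result.stdout)


if __name__ == "__main__":
    # Hypothetical names for the two iSCSI interfaces.
    for name in ("eth1", "eth2"):
        if check_interface(name):
            print(f"{name}: link is up but no IPv4 address is configured")
```

A check like this only reports the problem; it does nothing to make NetworkManager reconfigure the interface.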

As far as we can tell, what was happening is that during boot an interface would sometimes bounce; the network link would come up, go down, and then come back up in very rapid succession. While NetworkManager got told about all of these events in order, it would sometimes not handle the final 'link up' event properly; it would configure the interface due to the initial link up event, deconfigure it due to the link down event, and then be completely convinced that the link was still down. Do not pass go, do not collect any network traffic.

(NM's logging clearly demonstrated that it saw the final link up event, and in fact saw it before it claimed to have started the deconfiguration process.)
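To make the shape of this race concrete, here is a toy model of one way that sequence of events could leave an interface stranded. This is purely my guess at the general pattern, assuming that deconfiguration is deferred and then clobbers newer state when it finally runs; it is not NetworkManager's actual logic, and every name in it (ToyLinkManager, replay, the 'tick' event) is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLinkManager:
    """Hypothetical model of the race; not NetworkManager's real code."""
    believes_link_up: bool = False   # the manager's idea of link state
    configured: bool = False         # whether the interface has its IP address
    pending_deconfig: bool = False   # a deconfiguration is queued but not yet run
    log: list = field(default_factory=list)

    def on_link_event(self, up):
        if up:
            self.believes_link_up = True
            if not self.configured:
                self.configured = True
                self.log.append("configure")
        else:
            self.believes_link_up = False
            # Deconfiguration is deferred instead of happening immediately.
            self.pending_deconfig = True
            self.log.append("schedule deconfigure")

    def run_deferred(self):
        if self.pending_deconfig:
            self.pending_deconfig = False
            self.configured = False
            # The assumed bug: stale 'down' state overwrites a newer 'up' event.
            self.believes_link_up = False
            self.log.append("deconfigure")


def replay(events):
    """Feed 'up', 'down' and 'tick' (run deferred work) events to a fresh manager."""
    mgr = ToyLinkManager()
    for event in events:
        if event == "tick":
            mgr.run_deferred()
        else:
            mgr.on_link_event(event == "up")
    return mgr


if __name__ == "__main__":
    fast = replay(["up", "down", "up", "tick"])   # bounce faster than deferred work
    slow = replay(["up", "down", "tick", "up"])   # deferred work runs before link returns
    print("fast bounce:", fast.configured, fast.log)
    print("slow bounce:", slow.configured, slow.log)
```

In the fast case the final link up arrives before the deferred deconfiguration runs, so the model sees it, does nothing (the interface still looks configured), and then tears everything down anyway. In the slow case the same three link events leave the interface configured.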

Having this actually happen to us during boot was bad enough. Worse was the idea that it might also happen during normal operation if the link bounced for one reason or another. If NetworkManager could decide to blow up our interfaces at any time if things went badly, it had to go.

I'm not going to question NetworkManager's decision to fully deconfigure a statically configured interface (e.g. removing its IP address) if the network link goes down, because no doubt it has a good reason to do so. But this behavior is what made this bug a fatal one; if NM had left the IP address and so on in place on the 'down' interface, the only actual consequence would have been that NM was wrong about the state of the interface.

(We haven't reported this bug anywhere, which is a topic for another entry.)


Last modified: Tue Sep 30 23:30:40 2014