Wandering Thoughts archives

2014-09-30

NetworkManager and network device races

When we set up CentOS 7 on our new iSCSI backends we (and by that I mean 'I') left them using NetworkManager because it worked and I generally believe in leaving systems in their standard state when possible. Things have changed since then and we're now in the process of dropping NetworkManager.

The direct reason we're doing this is that we discovered that our iSCSI backends would not boot reliably with NetworkManager; some of the time the machines would have one of their iSCSI networks be unconfigured after the boot finished (we never saw it happen to both, but I believe it theoretically could). This is obviously really bad, and on top of it NetworkManager (in the form of nmcli) would claim that the interface was unavailable and so would refuse to bring it up (this was true even if you went through ifup instead of invoking NM directly).

As far as we can tell, what was happening is that during boot an interface would sometimes bounce; network link would come up, go down, and then come back up in very rapid succession. While NetworkManager got told about all of these events in order, it would sometimes not handle the final 'link up' event; it would configure the interface due to the initial link up event, deconfigure it due to the link down event, and then be completely convinced that the link was still down. Do not pass go, do not collect any network traffic.

(NM's logging clearly demonstrated that it saw the final link up event, and in fact saw it before it claimed to have started the deconfiguration process.)
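
To make the sequence concrete, here is a toy model in C of one way a race like this could play out. This is purely an illustration and not NetworkManager's actual code or design; the struct, the deferred teardown flag, and every function name are hypothetical. It assumes that the deconfiguration triggered by a link down event runs later than the event handling itself, so a stale teardown can clobber an interface whose link has already come back.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names come from NetworkManager. */
enum link_event { LINK_UP, LINK_DOWN };

struct iface {
    bool carrier;           /* last link state we were told about */
    bool configured;        /* does the interface have its IP address? */
    bool teardown_pending;  /* a deconfiguration has been scheduled */
};

/* Handle one link event immediately; teardown is only scheduled here. */
static void handle_event(struct iface *ifc, enum link_event ev)
{
    if (ev == LINK_UP) {
        ifc->carrier = true;
        if (!ifc->configured)
            ifc->configured = true;
    } else {
        ifc->carrier = false;
        ifc->teardown_pending = true;
    }
}

/* Run the scheduled teardown. If recheck_carrier is false, it acts on the
 * stale link down event even though the link has since come back up. */
static void run_deferred(struct iface *ifc, bool recheck_carrier)
{
    if (!ifc->teardown_pending)
        return;
    ifc->teardown_pending = false;
    if (recheck_carrier && ifc->carrier)
        return;             /* link is back; leave the configuration alone */
    ifc->configured = false;
}

/* Feed the events in order, then run deferred work.
 * Returns whether the interface ends up configured. */
bool simulate(const enum link_event *events, size_t n, bool recheck_carrier)
{
    struct iface ifc = { false, false, false };
    for (size_t i = 0; i < n; i++)
        handle_event(&ifc, events[i]);
    run_deferred(&ifc, recheck_carrier);
    return ifc.configured;
}
```

In this model, an up, down, up sequence leaves the interface unconfigured unless the deferred teardown rechecks the current carrier state before acting.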

Having this actually happen to us during boot was bad enough. Worse was the idea that this might happen during operation if the link signal bounced for one reason or another. If NetworkManager could decide to blow up our interfaces at any time if things went badly, it had to go.

I'm not going to question NetworkManager's decision to fully deconfigure a statically configured interface (eg removing its IP address) if the network link goes down, because no doubt it has good reason to do so. But this behavior is what made this bug a fatal one; if NM had left the IP address and so on on the 'down' interface, the only actual consequences would have been that NM was wrong about the state of the interface.

(We haven't reported this bug anywhere, which is another entry.)

linux/NetworkManagerRaceProblem written at 23:30:40

Don't split up error messages in your source code

Every so often, developers come up with really clever ways to frustrate system administrators and other people who want to go look at their code to diagnose problems. The one that I ran into today looks like this:

if (rval != IDM_STATUS_SUCCESS) {
	cmn_err(CE_NOTE, "iscsi connection(%u) unable to "
	    "connect to target %s", icp->conn_oid,
	    icp->conn_sess->sess_name);
	idm_conn_rele(icp->conn_ic);
}

In the name of keeping the source lines under 80 characters wide, the developer here has split the error message into two parts, using modern C's string literal concatenation (adjacent string literals are merged at compile time) to have the compiler put them back together.
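
As a small illustration of the language feature involved, the sketch below checks that a format string split across two adjacent literals is the same string as the one-line version. The format text is copied from the excerpt above; the variable and function names are made up for the example.

```c
#include <string.h>

/* Split the same way as the excerpt: two adjacent string literals. */
static const char split_fmt[] = "iscsi connection(%u) unable to "
    "connect to target %s";

/* The same message kept on one (long) source line. */
static const char whole_fmt[] = "iscsi connection(%u) unable to connect to target %s";

/* Returns 1 if the compiler produced identical strings, 0 otherwise. */
int formats_identical(void)
{
    return strcmp(split_fmt, whole_fmt) == 0;
}
```

The compiled program can't tell the two apart, which is exactly why the split only hurts people reading and searching the source.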

Perhaps it is not obvious why this is at least really annoying. Suppose that you start with the following error message in your logs:

iscsi connection(60) unable to connect to target <tgtname>

You (the bystander, who is not a developer) would like to find the code that produces this error message, so that you can understand the surrounding context. If this error message was on one line in the code, it would be very easy to search for; even if you need to wild-card some stuff with grep, the core string 'unable to connect to target' ought to be both relatively unique and easy to find. But because the message has been split onto multiple source lines, it's not; your initial search will fail. In fact a lot of substrings will fail to find the correct source of this message (eg 'unable to connect'). You're left to search for various substrings of the message, hoping both that they are unique enough that you are not going to be drowned in hits and that you have correctly guessed how the developer decided to split things up or parameterize their message.

(I don't blame developers for parameterizing their messages, but it does make searching for them in the code much harder. Clearly some parts of this message are generated on the fly, but are 'connect' or 'target' among them instead of being a constant part of the message? You don't know and have to guess. 'Unable to <X> to <Y> <Z>' is not necessarily an irrational message format string, or you equally might guess 'unable to <X> to target <Z>'.)
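
If you're stuck searching code written this way, one workaround is to merge adjacent string literals before searching. What follows is a rough sketch of that idea in C, not an existing tool; the function names are hypothetical, and it is a heuristic that ignores comments and character literals. It also does nothing about parameterized messages, where you still have to guess which parts are constant.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'd copy of C source text in which adjacent string
 * literals ("foo " <whitespace> "bar") are merged into one ("foo bar").
 * Heuristic only: comments and character literals such as '"' are not
 * handled. The caller frees the result; NULL means allocation failed. */
char *join_adjacent_literals(const char *src)
{
    size_t len = strlen(src);
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    size_t o = 0;
    bool in_string = false;
    for (size_t i = 0; i < len; i++) {
        char c = src[i];
        if (in_string && c == '\\' && i + 1 < len) {
            /* Copy an escape sequence such as \" without interpreting it. */
            out[o++] = c;
            out[o++] = src[++i];
            continue;
        }
        if (c == '"') {
            if (!in_string) {
                in_string = true;
            } else {
                /* Closing quote: is another literal next, after whitespace? */
                size_t j = i + 1;
                while (j < len && isspace((unsigned char)src[j]))
                    j++;
                if (j < len && src[j] == '"') {
                    i = j;  /* drop closing quote, whitespace, opening quote */
                    continue;
                }
                in_string = false;
            }
        }
        out[o++] = c;
    }
    out[o] = '\0';
    return out;
}

/* Report whether message appears in src once split literals are joined. */
bool source_contains_message(const char *src, const char *message)
{
    char *joined = join_adjacent_literals(src);
    if (joined == NULL)
        return false;
    bool found = strstr(joined, message) != NULL;
    free(joined);
    return found;
}
```

Run over the excerpt above, a plain substring search for 'unable to connect to target' finds nothing, while `source_contains_message()` does find it because the two literals have been stitched back together first.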

The developers doing this are not making life impossible for people, of course. But they are making it harder and I wish they wouldn't. It is worth long lines to be able to find things in source code with common tools.

(Messages aren't the only example of this, of course, just the one that got to me today.)

programming/DontBreakUpMessages written at 00:33:12

