Wandering Thoughts archives

2015-12-18

A fun tale of network troubleshooting involving VLANs and MACs

The following is not my story; it comes from my co-workers, who had the real fun of trying to figure this one out and then finding a fix.

To start with, let's set the background. We have an (OpenBSD) routing firewall machine that sits on a network segment whose egress router is not under our control. Actually, we have two of them, one active and one as a warm spare (it's on and being updated, but it is not connected to any of the production networks because otherwise it would fight the live firewall for the public IPs). A while back, as part of trying to fail over from the live machine to the warm spare, we discovered that the egress router for the network caches ARP information for a long time. Like, apparently, hours. This was obviously no good for being able to switch over (such as in the case of hardware failure). Since the egress router is not under our control, the only thing we could really do was explicitly set the warm spare to have the same Ethernet address as the active machine.

(This was tested at the time it was set up and worked, but we believe the test at the time was misleading.)

Recently my co-workers wanted to swap from the active machine to the warm spare, because the active machine had been up for literally years (we don't update OpenBSD all that often). Unfortunately, when they made the swap the (ex-)warm spare was not reachable on its public IP, so they failed back to the active machine and took the warm spare off for testing. Testing established that the warm spare was showing 'incomplete' for other machines in its ARP cache, although other machines picked it up fine for their ARP caches. Further, trying to inspect the traffic with tcpdump made the network suddenly work, but things broke again when they stopped tcpdump. Oh, and the problem was specific to using our preferred Intel Ethernet cards; if the warm spare was switched to use non-Intel network hardware, everything worked.

Now, it happens that this machine has a slightly unusual network configuration. Because it needs to talk to a number of external networks, it actually gets all of its external networks as tagged VLANs over a single physical network port. When we changed the machine to use the MAC of the active machine, we had set the Ethernet address on the VLAN for that particular network, because that was the network that mattered; we didn't change the MAC of anything else.

It turned out that this was the problem. Using Intel cards on our (old) version of OpenBSD, when the MAC of the VLAN differed from the MAC of the underlying physical interface and the interface was not in promiscuous mode, ARP (at least) didn't work because the kernel apparently never received the replies to its ARP queries. If you put the interface into promiscuous mode, such as by running tcpdump, things suddenly worked; the kernel received ARP replies and so on. We think that the whole setup worked when tested because we likely tested it with tcpdump running to watch traffic (and verify what MACs were being used).

(The obvious suspect here is hardware level receive filtering; perhaps the hardware is only being set by the driver to recognize the physical port MAC as its MAC. This is a driver and/or hardware issue, but these things happen.)

Once my co-workers figured out what the problem was, the fix was simple: explicitly set the MACs of both the physical port and all the VLANs on it to the active machine's MAC. But getting there took a whole frustrating and puzzling journey. This wasn't exactly a Heisenbug, but until my co-workers noticed the pattern that running tcpdump made it disappear it did look like one.
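(To make the suspected mechanism concrete, here is a toy model in Python. It is only a sketch of the receive-filtering guess above, not how the Intel driver or OpenBSD actually behaves; the ToyNic class, the arp_reply_received helper, and both MAC addresses are made up for illustration.)

```python
from dataclasses import dataclass

BROADCAST = "ff:ff:ff:ff:ff:ff"

# Hypothetical addresses, purely for illustration.
PHYS_MAC = "02:00:00:00:00:01"  # the physical port's own MAC
VLAN_MAC = "02:00:00:00:00:aa"  # the active machine's MAC, set on the VLAN


@dataclass
class ToyNic:
    """Toy NIC whose hardware filter knows exactly one unicast MAC.

    This models the *suspected* behaviour only: the driver programs the
    physical port's MAC into the filter and nothing else.
    """
    hw_filter_mac: str
    promiscuous: bool = False

    def accepts(self, dst_mac: str) -> bool:
        # Promiscuous mode (what tcpdump turns on by default) bypasses the filter.
        if self.promiscuous:
            return True
        dst = dst_mac.lower()
        return dst == BROADCAST or dst == self.hw_filter_mac.lower()


def arp_reply_received(nic: ToyNic, vlan_mac: str) -> bool:
    """ARP replies are unicast back to the MAC that sent the request,
    which here is the MAC configured on the VLAN interface."""
    return nic.accepts(vlan_mac)


# Broken setup: only the VLAN got the active machine's MAC.
broken = ToyNic(hw_filter_mac=PHYS_MAC)
# Same setup while tcpdump is running.
sniffed = ToyNic(hw_filter_mac=PHYS_MAC, promiscuous=True)
# The fix: physical port and VLAN share the active machine's MAC.
fixed = ToyNic(hw_filter_mac=VLAN_MAC)

if __name__ == "__main__":
    for label, nic in (("broken", broken), ("tcpdump", sniffed), ("fixed", fixed)):
        print(label, arp_reply_received(nic, VLAN_MAC))
```

(In this model the warm spare's own ARP requests are broadcasts, so other machines learn its MAC without trouble, but the unicast replies addressed to the VLAN's MAC are dropped unless the filter is bypassed or the two MACs match. That lines up with all three observations, which is about as much as a toy model can claim.)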

(Using 'tcpdump -p' is the obvious thing for the future, but I don't know if it would actually have worked in this situation. Still, it's something to try to remember for the next time around. Maybe tcpdump should default to -p these days.)

sysadmin/VLANAndMACSurprise written at 23:56:37

Some things about the XSettings system

Yesterday I mentioned the XSettings standard for exposing (some) toolkit related configuration options to theoretically interested parties in a theoretically toolkit-independent way. There are some slightly non-obvious or not entirely documented things about this and daemon support for it.

First, as hinted by the 'X' in the name, this is not a DBus-based system. Instead it uses the old-fashioned approach of setting an X property on the root window and having programs read this property. Because this is an X property, all clients can see it, whether they are on the local machine or on a remote machine. In turn this means that remote clients may change their behavior if you start running xsettingsd or the like, because now they can see your (local) configuration settings. How your local configuration settings interact with what's available on a remote machine can be potentially chancy; for example, it's perfectly possible to specify a Gtk/FontName that doesn't exist on other machines.

Some but not all settings daemons have side effects when run. For example gnome-settings-daemon appears to also add some X resources for things like Xft settings. This itself can cause (some) programs to change their behavior, even if they don't use a toolkit with support for XSettings. As far as I can tell, xsettingsd does not do this.

At least xsettingsd allows you to set essentially arbitrary settings properties, including in existing namespaces; for instance, it sure looks like you can set all sorts of XFT properties in XSettings. However, this is an illusion. In practice, there is a small set of known shared settings for general cross-toolkit things and if something's not in there, you setting it will do nothing. Where this really starts to matter (at least to me) is that the available XFT settings are pretty minimal. In particular, they don't include the fontconfig lcdfilter setting, which turns out to be one of the settings necessary to get fonts to look how I want them to.

(It's not clear to me if lcdfilter can be set in the Xft.* X resources either. I suspect not, but it probably can't hurt to try.)

At the same time, modern GTK has way more settings exposed through XSettings than are documented in the registry. To find out what all of them are, you basically need to fire up gnome-settings-daemon temporarily and run dump_xsettings to extract them all. I don't know what settings KDE exposes (if any); I haven't tried to find and run the KDE equivalent of gnome-settings-daemon.

For XFT settings specifically, I'm not sure what reads XSettings, what reads the X resource database, and what ignores all of this. I expect that GTK applications read XSettings, but I've seen some basic X programs like xterm appear to read either XSettings or X resources or perhaps both.

(And gnome-settings-daemon itself seems to do at least some DBus stuff, although I don't know if that's used for querying settings. All of this is annoyingly complicated. See this blog entry from 2010 for a picture of how complicated it was back then, and it's probably worse now.)

On the whole, if you have a mostly or entirely working environment now without a settings daemon involved, it seems safest to have the daemon publish only an extremely minimal set of XSettings settings. I started out feeling quite enthused about setting all of the XFT options but I'm now shifting more and more towards publishing only Gtk/FontName as the minimal fix for my issues. Of course, the mere existence of an active XSettings daemon may change program behavior (most especially including on remote machines), but you take what you can get in the world of modern X.
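(One way to get to a minimal configuration is to dump everything a full settings daemon publishes and then keep only the names you actually want. Below is a rough Python sketch of that filtering step. It assumes the dump is in the one-setting-per-line 'Name value' form that xsettingsd's configuration file uses; the keep_settings function and the sample values are made up for illustration and are not taken from a real dump.)

```python
def keep_settings(dump_text, wanted):
    """Return only the 'Name value' lines whose setting name is in wanted.

    Assumes one setting per line with the name as the first
    whitespace-separated field; blank lines and '#' comments are skipped.
    """
    kept = []
    for line in dump_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        name = stripped.split(None, 1)[0]
        if name in wanted:
            kept.append(stripped)
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical sample input; real dumps will have many more settings.
SAMPLE_DUMP = """\
# sample settings, not from a real machine
Gtk/FontName "DejaVu Sans 10"
Xft/Antialias 1
Xft/Hinting 1
Net/ThemeName "Adwaita"
"""

if __name__ == "__main__":
    # Publish only the font name, as the minimal fix discussed above.
    print(keep_settings(SAMPLE_DUMP, {"Gtk/FontName"}), end="")
```

(The result is what you would save as the daemon's configuration file. Whether a single Gtk/FontName line is enough for your environment is something you have to test; this only does the trimming.)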

linux/XSettingsNotes written at 01:40:23



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.