Our problem with Netplan and routes on Ubuntu 18.04
Today I've wound up getting back to our netplan dysfunction, so I think it's time to write a blog entry. Spoiler: highly specific network device names and configurations that can only be attached to or specified for a named network device interact very badly at scale.
We have a bunch of internal 'sandbox' networks, which connect to our main server subnet through various routing firewalls. Obviously the core router on our server subnet knows how to route to every sandbox; however, we also like the servers themselves to have specific sandbox subnet routes that point to the appropriate routing firewall. For some servers this is merely vaguely nice; for high traffic servers (such as some NFS fileservers) it may be pretty important to avoid a pointless traffic bottleneck on the router. So many years ago we built a too-smart system to automatically generate the appropriate routes for any given host from a central set of information about what subnets were behind which gateway, and we ran it on boot to set things up. The result is a bunch of routing commands:
ip route add 10.63.0.0/16 via 184.108.40.206
ip route add 10.70.0.0/16 via 220.127.116.11
ip route add 172.31.0.0/16 via 18.104.22.168
[...]
This system is completely indifferent to what the local system's
network interface is called, which is good because in our environment
there is a huge assortment of interface names. We have
and on and on.
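As a rough illustration of the idea (not our actual system), route generation from a central table might look something like the sketch below. The table contents, the gateway addresses (taken from a documentation address range), and the function name are all assumptions made up for the example. The point is that nothing in it needs to know an interface name; it only needs the host's own addresses.

```python
import ipaddress

# Hypothetical central table: sandbox subnet -> routing firewall address.
# The gateway addresses are invented for illustration.
SANDBOX_GATEWAYS = {
    "10.63.0.0/16": "192.0.2.10",
    "10.70.0.0/16": "192.0.2.11",
    "172.31.0.0/16": "192.0.2.12",
}


def route_commands(local_addrs, table=SANDBOX_GATEWAYS):
    """Return 'ip route add' commands for the gateways this host can reach.

    local_addrs is the host's own addresses in CIDR form,
    for example ["192.0.2.50/24"].
    """
    local_nets = [ipaddress.ip_interface(a).network for a in local_addrs]
    cmds = []
    for subnet, gateway in table.items():
        gw = ipaddress.ip_address(gateway)
        # Only emit a route if the gateway is directly reachable,
        # i.e. it sits on one of the subnets this host is attached to.
        if any(gw in net for net in local_nets):
            cmds.append(f"ip route add {subnet} via {gateway}")
    return cmds


if __name__ == "__main__":
    for cmd in route_commands(["192.0.2.50/24"]):
        print(cmd)
```

A host that is not on the gateways' subnet simply gets no commands, which is the behaviour you want from something that runs unattended at boot.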
All of this worked great for the better part of a decade, until Ubuntu 18.04 came along with netplan. Netplan has two properties that combine to be quietly nearly fatal to what we want to do. First, the netplan setup on Ubuntu 18.04 will wipe out any 'foreign' routes it finds if and when it is re-run, which happens every so often during things like package upgrades. Second, the 18.04 version of netplan has no way to specify routes that are attached to a subnet instead of a specific named interface. If you want netplan to add extra routes to an interface, you cannot say 'associate the routes with whatever interface is on subnet <X>'; instead, you must associate the routes with an interface called <Y>, for whatever specific <Y> is in use on this system. As mentioned, <Y> is what you could call highly variable across our systems.
(Netplan claims to have some support for wildcards, but I couldn't get it to work and I don't think it ever would because it is wildcarding network interface names alone. Many of our machines have more than one network interface, and obviously only one of them is on the relevant subnet (and most of the others aren't connected to anything).)
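Since netplan won't do the 'whatever interface is on subnet <X>' lookup for us, any workaround has to do it itself. Here is a minimal sketch of that lookup, assuming the input has the general shape of parsed 'ip -json addr show' output (a list of entries with "ifname" and "addr_info" fields). The function name, the sample data, and the interface names in it are hypothetical.

```python
import ipaddress


def interface_on_subnet(addr_data, subnet):
    """Return the name of the first interface with an address in subnet.

    addr_data is assumed to look like parsed `ip -json addr show` output:
    a list of dicts, each with "ifname" and an "addr_info" list whose
    entries carry a "local" address. Returns None if nothing matches.
    """
    net = ipaddress.ip_network(subnet)
    for link in addr_data:
        for info in link.get("addr_info", []):
            addr = ipaddress.ip_address(info["local"])
            # Compare only like with like (IPv4 vs IPv6).
            if addr.version == net.version and addr in net:
                return link["ifname"]
    return None


# Hypothetical sample: one unconnected port, one port on the server subnet.
SAMPLE = [
    {"ifname": "lo", "addr_info": [{"family": "inet", "local": "127.0.0.1", "prefixlen": 8}]},
    {"ifname": "enp4s0f1", "addr_info": []},
    {"ifname": "enp4s0f0", "addr_info": [{"family": "inet", "local": "192.0.2.50", "prefixlen": 24}]},
]

if __name__ == "__main__":
    print(interface_on_subnet(SAMPLE, "192.0.2.0/24"))
```

On a real machine the data would presumably come from running the ip command and feeding its output to json.loads(); the sample list just stands in for that here.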
The result is that there appears to be no good way for our perfectly
sensible desire for generic routing to interact well with netplan.
In a netplan world it appears that we should be writing and re-writing
a /etc/netplan/02-cslab-routes.yaml file, but that file has to
have the name of the system's current network interface burned into
it instead of being generic. We do shuffle network interfaces around
every so often (for instance to move a system from 1G to 10G-T),
which would require us to remember that there is an additional magic
step to regenerate this file.
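To make the 'burned in' problem concrete, here is a sketch of what generating such a file could look like. It uses netplan's 'routes' list with 'to' and 'via' keys under a named ethernet device. The function name, interface name, and gateway address are invented for the example, and whether netplan merges a fragment like this with the main configuration the way we'd want is something I'd check on a test machine first.

```python
def render_routes_yaml(ifname, routes):
    """Render a netplan fragment attaching routes to one named interface.

    routes is a list of (destination, gateway) pairs. Note that the
    interface name ends up hard-coded in the output.
    """
    lines = [
        "network:",
        "  version: 2",
        "  ethernets:",
        f"    {ifname}:",
        "      routes:",
    ]
    for dest, gateway in routes:
        lines.append(f"        - to: {dest}")
        lines.append(f"          via: {gateway}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical interface name and gateway address.
    print(render_routes_yaml("enp4s0f0", [("10.63.0.0/16", "192.0.2.10")]), end="")
```

The interface name is an argument, which is exactly the issue: move the system to a different network port and the generated file is silently wrong until someone reruns the generator.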
There are various additional problems here too, of course. First,
there appears to be no way to get netplan to redo just your routes
without touching anything else about interfaces, and we very much
want that. Second, on most systems we establish these additional
sandbox routes only after basic networking has come up and we've
NFS mounted our central administrative filesystem that has the data
file on it, which is far too late for normal netplan. I guess we'd
have to rewrite this file and then run '
(Ubuntu may love netplan a whole lot but I certainly hope no one else does.)