2011-05-16
How our network is implemented (as of May 2011)
In an earlier entry I gave a nice simple description of how our network looks at the logical level of subnets and so on. Today, it's time for a look at how it's actually implemented with switches and wiring. As before, this picture is somewhat simplified.
The core of our physical network is a string of core switches. It's a string because each switch has only two 10G ports, so we have to chain them together in order to get all of them connected. We have space in three different buildings, and between the three buildings we have five core switches (two of them in our main machine room, one each in the core wiring area of the other two buildings, and the fifth in another machine room in the building that has our main machine room). This core backbone transports research group sandboxes, and in our main machine room it transports our primary public network across the room.
There are two approaches to connecting machines to this sort of core network; you can hang multi-VLAN switches off the core switches with tagged links and then configure individual ports on those switches with the appropriate sandbox network, or you can hang single-network switches off untagged single-network ports on the core switches. We've opted to do the second; this requires more configuration of the core switches but keeps all of the edge switches simple (in fact, identical). Thus there are a bunch of single-network switches hanging off core switches in the places where the various networks are needed.
These core switches do not transport our port isolated networks. Those are carried by a separate network of port isolated switches (with a separate set of links between our various buildings). This network is connected to one core switch so that it can be joined together with other sandboxes and transported to the NAT gateway.
(A few sandboxes also have dedicated links between buildings that run between their own single-network switches. Mostly this is for historical reasons; there's a lot of history in action around here.)
How our public networks are handled is a little peculiar. Only one public network (what is now our primary one) is carried on the core switches; all other public networks simply live on single-network switches in the machine room. Our public networks interconnect through our core router, which has an untagged port for each network that connects to the top level switch for that network (including the touchdown network that is our connection to the campus backbone). The 'top level' switch for our primary public network also connects to an untagged port on one of the core switches, thereby connecting everything up.
(The core router also carries static routes for the sandboxes that point to their NAT gateway.)
All of our fileservers are on the primary public network; to maximize aggregate bandwidth to them, each fileserver gets a direct connection to one of the core switches. A few machines that do a lot of NFS activity, such as our IMAP server and our Samba server, also get direct connections. We have also split our login and compute servers onto several switches, each of which is directly connected to the core switch with the fileservers.
There is also a completely separate management network, which has its own switches and its own links between buildings. For reasons beyond the scope of this entry (they involve switch bugs), the main things on it in the other buildings are serial port to Ethernet bridges that give us remote serial console access to various switches, most crucially the core switches. In our main machine room, it also has various other things such as web-enabled PDUs.
2011-05-15
The most interesting reason for an unrouted sandbox network
I recently wrote up our network layout and mentioned that while most of our private 'sandbox' networks are routed internally, we had a few that are unrouted for various reasons. This is the story of the most interesting unrouted one.
We have a number of firewalls. Because we are cautious people and these are very important systems, we have hot spares. Now, there are at least two ways to handle hot spare firewalls; you can have them physically plugged into the production networks but with their network interfaces not configured with the live firewall IP addresses, or you can have their interfaces fully configured but not have cables plugged in. We've decided that we prefer the second approach for various reasons, including that it avoids certain sorts of bad accidents.
(This does mean that we have to go to the machine room to swap in a hot spare, but in practice we consider this not a serious limitation. Among other things, we almost never have to do that.)
Since we're sane sysadmins, we don't edit firewall rules on the actual machines; instead, we edit them on a central master machine and then push the update to the firewall (with an automatic reversion if there are obvious problems). Of course, to keep the hot spares ready to go at a moment's notice we need to push the configuration updates to them too.
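To make the push-with-reversion idea concrete, here's a minimal sketch of how such a scheme might work. This is not our actual tooling; the file path, the 'fw-reload' command, and the timings are all hypothetical stand-ins. The important property is that the reversion timer runs on the firewall itself, so the old rules come back even if the new rules cut the master machine off entirely.

    # A minimal sketch of push-with-automatic-reversion, not our actual tooling;
    # the file path, the 'fw-reload' command, and the timings are hypothetical.
    import subprocess
    import sys
    import time

    RULES = "/etc/fw/rules.conf"

    def ssh(host, command):
        """Run a shell command on a remote host; returns ssh's exit status."""
        return subprocess.call(["ssh", host, command])

    def push_rules(host):
        # Save the current rules on the firewall, then copy over the freshly
        # edited rules from the central master machine.
        ssh(host, "cp %s %s.prev" % (RULES, RULES))
        subprocess.check_call(["scp", RULES, "%s:%s" % (host, RULES)])
        # Arm an unattended reversion on the firewall itself before loading the
        # new rules; if the new rules lock us out, we can't cancel it and the
        # old rules come back on their own.
        ssh(host, "(sleep 120 && cp %s.prev %s && fw-reload) >/dev/null 2>&1 &"
                  " echo $! > /tmp/fw-revert.pid" % (RULES, RULES))
        ssh(host, "fw-reload")
        # The 'obvious problems' check: is the firewall still reachable at all?
        time.sleep(5)
        if ssh(host, "true") != 0:
            print("lost contact with %s; reversion timer will restore old rules" % host)
            return False
        # Still reachable, so cancel the pending reversion.
        ssh(host, "kill $(cat /tmp/fw-revert.pid)")
        return True

    if __name__ == "__main__":
        sys.exit(0 if all(push_rules(h) for h in sys.argv[1:]) else 1)

In a setup like this the same push goes to the live firewall and to its hot spare alike; the spare simply never has its production network cables plugged in.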
So: what management IP address does a hot spare firewall get, and on what network? It can't be an address on our public networks, because hot spares are fully configured with the gateway public IPs and so their network cable can't be plugged in lest they and the live firewall have a big fight over who owns those IP addresses. It can't be an address on any of our routed sandboxes, because then the hot spare for the firewall that provides routing for that sandbox can't be plugged in for the same reason.
Our conclusion was that they had to be on a new unrouted sandbox. This sandbox has their management interfaces and the central master machine that updates are pushed from, and if we need to log in to the firewalls (which we do every so often) we do it by indirecting through the central machine.
(OpenSSH has some stuff that makes this indirection painless and effectively invisible with the right setup.)
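As a sketch of what that indirection can look like (with entirely made-up host names): OpenSSH's ProxyCommand, combined with 'ssh -W', bounces the connection through the central master machine, and you would normally set this up once in ssh_config so that a plain 'ssh fw-spare1' just works. Here it's spelled out explicitly in a little Python wrapper.

    # A sketch (with made-up host names) of reaching a hot spare firewall's
    # management address by indirecting through the central master machine,
    # using OpenSSH's ProxyCommand and 'ssh -W' to forward the connection.
    # In practice the ProxyCommand line would live in ~/.ssh/config instead.
    import subprocess
    import sys

    MASTER = "fwmaster"    # hypothetical name of the central master machine

    def ssh_via_master(spare, command="true"):
        """ssh to a hot spare's management name, bouncing through MASTER."""
        return subprocess.call(
            ["ssh", "-o", "ProxyCommand=ssh -W %h:%p " + MASTER,
             spare, command])

    if __name__ == "__main__":
        # e.g. 'python fwssh.py fw-spare1 uptime'
        sys.exit(ssh_via_master(sys.argv[1], " ".join(sys.argv[2:]) or "true"))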
2011-05-14
Our environment illustrated: what network cable colours mean what
Yesterday I mentioned in passing that we had a colour scheme for network cables, at least in our machine room and some wiring closets. To illustrate our environment, here are the cable colours we currently use:
| red | Our 'red' network for general user-managed machines. |
| pink | The wireless network. |
| white | Serial console cables; rather than have real serial cables, we run serial over standard Cat-5 cables using adapters. |
| green | The management subnet(s). |
| yellow | One of our iSCSI networks. |
| black | The other iSCSI network. |
| blue | Everything else; public subnets and sandboxes. |
(I should note that this is for the cables themselves, not for the cable shrouds at the end of the cable. Our usual habit is to make the cable shrouds the same colour as the cable, but we don't worry about it too much.)
A couple of research groups also use some of these colours to keep their own network setups straight; I believe that black and yellow are 'shared' colours, and there is also someone using purple cables for something. The shared colours don't cause confusion because they're in completely separate places; our black and yellow wires only get used in one small area of our machine room.
(But where they get used they are very handy at keeping straight which iSCSI network is which and making sure that no one ever mistakes an iSCSI network or switch for anything ordinary.)
We're not very big on labeling our cables unless they're really important or unusual. Our experience with things like cable labels is that they are like comments in source code; they're fine until things start changing and then they never get updated and become actively misleading. If we're not going to be able to trust cable labels, it's better not to have them at all.
2011-05-13
Our network layout (as of May 2011)
Today, I feel like talking about how our networks are organized at the logical level of subnets and routing and interconnections (instead of the physical level of switches and so on; that's another entry). This will be simplified somewhat because going into full detail would make it feel too much like work.
We have a single connection to the campus backbone; these days it uses a touchdown network that connects to our core router (although it didn't use to, which caused some issues). Hanging off of the core router are all of our public subnets.
Now, we have nowhere near enough public IP addresses to go around, and in particular we don't have enough subnets to isolate research groups from each other. So basically the only things on our public subnets are centrally managed servers and infrastructure; actual people and research groups use subnets in private IP address spaces that we call 'sandboxes'. Most sandboxes are routed sandboxes; they sit behind routing NAT gateways (although sandbox machines are not NAT'd when they talk to internal machines), are routed internally, and can reach both internal machines and the outside world (and with the right configuration, the outside world can reach them if desired). Generally each research group gets its own sandbox (which they have a lot of control over), and we have a few generic sandboxes as well.
(We provide a few services for research group sandboxes; we host the DNS data, provide DHCP if they want it, and of course manage the NAT gateways. DNS-wise, sandbox machines have names that look like sadat.core.sandbox; we have a complex split horizon DNS setup, and of course .sandbox exists only in the internal view.)
The most important generic sandbox is what we sometimes call our 'laptop network', which is for general user-managed machines such as Windows machines. Unlike regular sandboxes, this sandbox is port isolated so that user machines can't infect each other. Because people sometimes want to run servers on their own machines and have them reachable by other people on the laptop network, we have a second port isolated subnet for 'servers' (broadly defined). We also have a port isolated wireless network that sits behind a separate NAT gateway. Unlike conventional sandboxes, the wireless network is NAT'd even for internal machines.
(There is also a PPTP VPN server, which needs yet another chunk of private address space for the tunnels it creates with clients.)
These NAT gateways sit on our normal public subnets, or I should actually say 'subnet'; we have been slowly relocating our servers so that almost everything lives on a single subnet. Among other advantages, this means that we avoid round trips through our core router when servers talk to each other or to sandbox machines. Since some research groups have compute clusters in their sandbox that NFS mount filesystems from our fileservers and care about their performance, we like avoiding extra hops when possible. Our core router has static routes configured for all of the routable sandbox subnets, and various important servers on the subnet also have them for efficiency.
(Well, okay, basically all of the servers on the subnet have the static routes because we automated almost all of the route setup stuff, and once you do that it's easy enough to make it happen everywhere.)
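To give a concrete (and entirely made-up) illustration of the sort of automation meant here, this is a minimal sketch of static route setup for a Linux machine with iproute2; the sandbox subnets and gateway addresses are invented, and our real tooling and platforms differ.

    # A sketch of automated static route setup for routed sandboxes, assuming
    # a Linux machine with iproute2; subnets and gateway addresses are made up.
    import subprocess

    # Each routed sandbox subnet and the public IP of its NAT gateway.
    SANDBOX_ROUTES = {
        "10.27.1.0/24": "192.0.2.51",   # hypothetical research group sandbox
        "10.27.2.0/24": "192.0.2.52",   # another hypothetical sandbox
    }

    def add_sandbox_routes():
        """Point a static route at the NAT gateway for every sandbox subnet."""
        for subnet, gateway in sorted(SANDBOX_ROUTES.items()):
            # 'ip route replace' is idempotent, so re-running this is harmless.
            subprocess.check_call(["ip", "route", "replace", subnet,
                                   "via", gateway])

    if __name__ == "__main__":
        add_sandbox_routes()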
There is a general campus-wide requirement that networks at least try to block anonymous access. For the wireless network, we deal with this by requiring authentication on the wireless gateway (or that you use our VPN, which does its own authentication); for the laptop network, we have a couple of self-serve DHCP registration systems with a tangled history. For research group sandboxes, we leave it up to the research group; anyways, most research group sandboxes can only be used from physical areas with good access control, like the group's lab space.
We (the central people) use a number of sandboxes ourselves for various reasons. Some of them are routed, but some are not (for example, we consider it a feature that the iSCSI interconnect subnets are not reachable even by internal machines). There are a few sandboxes that we don't route for especially interesting reasons, but that's another entry.
Sidebar: on some names
Our laptop network is conventionally called the red network or just 'the red', and the collective public subnets that we have our servers on are called the blue network. These names come from the colours of network cables used for each subnet, especially in areas that normal people see. Being strict on cable colour allows us to tell people things like 'only plug your laptop into a red cable, but you can plug it into any red cable you find and it should work'.
(Partly because there aren't enough commonly available cable colours to go around, sandbox network cables are generally blue and blue has thus become our generic cable colour in public areas. A few groups have different, specific cable colours for their sandboxes for historical reasons. The rules are different and much more complicated in our machine room and in wiring closets.)