2024-05-18
Realizing the hidden complexity of cloud server networking
We have our first cloud server. This cloud server has a public IP address that we can talk to, which is good because we need it, and which feels straightforward; we have lots of machines with public IP addresses. This public IP address has a firewall that we have to set rules for, which feels perfectly normal; we have firewalls too. Although if I think about it, the cloud provider is working at a much bigger scale, which makes it harder and more impressive. Except that our actual cloud server has an RFC 1918 IP address and is on an internal private network segment, so what we're actually working with is a NAT firewall gateway. And the RFC 1918 address is in a sufficiently straightforward /24 that it's clear it's not unique to us; plenty of cloud customer servers must have their own version of the same RFC 1918 /24.
That was when I realized how complex all of the infrastructure for this networking has to be behind the scenes. The cloud provider is not merely operating a carrier-grade NAT, which is already non-trivial. They're operating a CGNAT firewall system that can connect a public IP to an IP on a specific internal virtual network, where the IP (and subnet) aren't unique across all of the (internal) networks being NAT'd. I feel that I'm reasonably knowledgeable about networking and I'm not sure how I'd even approach designing a system that did that. It's different in kind from the NAT firewalls I work on, not merely in size (the way plain CGNAT sometimes feels).
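To make the difference concrete, here is a minimal sketch of the bookkeeping such a system needs at the very least. It is not how any particular cloud provider does it. The class name, the 'vnet' identifiers, and the addresses (taken from the documentation ranges) are all hypothetical. The point is only that the private IP can't be a lookup key by itself, because two customers can both be using 10.0.0.4; the key has to be the pair of virtual network and private IP.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """A private address, qualified by the virtual network it lives on."""
    vnet_id: str  # hypothetical per-customer virtual network identifier
    private_ip: ipaddress.IPv4Address


class MultiTenantNat:
    """Toy 1:1 NAT table where private IPs may repeat across networks."""

    def __init__(self):
        self._inbound = {}   # public IP -> Endpoint
        self._outbound = {}  # Endpoint -> public IP

    def bind(self, public_ip, vnet_id, private_ip):
        pub = ipaddress.IPv4Address(public_ip)
        ep = Endpoint(vnet_id, ipaddress.IPv4Address(private_ip))
        # Public IPs must be globally unique; (vnet, private IP) pairs
        # must be unique too, but the private IP alone need not be.
        if pub in self._inbound:
            raise ValueError(f"{pub} is already bound")
        if ep in self._outbound:
            raise ValueError(f"{ep} already has a public IP")
        self._inbound[pub] = ep
        self._outbound[ep] = pub

    def translate_inbound(self, public_ip):
        """Public destination -> (virtual network, private IP)."""
        return self._inbound[ipaddress.IPv4Address(public_ip)]

    def translate_outbound(self, vnet_id, private_ip):
        """(virtual network, private source IP) -> public source IP."""
        ep = Endpoint(vnet_id, ipaddress.IPv4Address(private_ip))
        return self._outbound[ep]


nat = MultiTenantNat()
# Two different customers, same RFC 1918 address on their own networks.
nat.bind("192.0.2.10", "vnet-a", "10.0.0.4")
nat.bind("198.51.100.20", "vnet-b", "10.0.0.4")

print(nat.translate_inbound("192.0.2.10"))
print(nat.translate_inbound("198.51.100.20"))
print(nat.translate_outbound("vnet-b", "10.0.0.4"))
```

A dictionary is the easy part, of course. What the sketch leaves out is everything that makes the real thing hard: getting the packet onto the right virtual network after translation, per-customer firewall rules, connection state, and doing all of it at cloud scale.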
Intellectually, I knew that cloud environments were fearsomely complex behind the scenes, with all sorts of spectacular technical underpinnings (and thus all sorts of things to go wrong). But running 'ip -br a' on our first cloud server and then thinking a bit about how it all worked was the first time it really came home to me. Things like virtual machine provisioning, replicated storage, and so on were sufficiently far outside what I work on that I just admired them from a distance. Connecting our cloud server's public IP with its actual IP was the first time I had the 'I work in this area and nothing I know of could pull that off' feeling.
(Of course, if we'd all switched over to IPv6 we might not need this complex NAT environment, because in theory all of those cloud servers could have globally unique IPv6 addresses and subnets, and all you'd need would be a carrier-grade firewall system. I'm not sure that would work in practice, though, and I don't know how clouds handle IPv6 allocation for customer servers. Our cloud server didn't get assigned an IPv6 address when we set it up.)