2023-01-11
Sometimes it actually is a kernel bug: bind() in Linux 6.0.16
There's a common saying and rule of thumb in programming (possibly originating in the C world) that it's never a compiler bug; the bug is going to be in your code even if that looks crazy or impossible. Like all aphorisms it's not completely true, because compilers do have bugs, but it's almost always the case that you haven't actually found a compiler bug and the problem is something else. You can say a similar thing about weird system issues (not) being the fault of a kernel bug, and so that's what I thought when the development version of Go started failing a self test when I built it on my Fedora 37 office desktop:
--- FAIL: TestTCPListener (0.00s)
    listen_test.go:72: skipping tcp test
    listen_test.go:72: skipping tcp 0.0.0.0 test
    listen_test.go:72: skipping tcp ::ffff:0.0.0.0 test
    listen_test.go:72: skipping tcp :: test
    listen_test.go:90: tcp 127.0.0.1 should fail
This test in net/listen_test.go fails at the point where it attempts to listen twice on the same localhost IPv4 address and port. It first binds to and listens on 127.0.0.1 port 0 (port 0 tells the kernel to assign a free ephemeral port), extracts the actual assigned port, and then attempts to bind to 127.0.0.1 on the same port a second time.
(The Go networking API bundles the binding and listening together in one Listen() API, but the socket API itself has them as two operations; you bind() a socket to some address, then listen() on it.)
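As a rough illustration of those two underlying operations, here is a minimal Python sketch of what a combined Listen() amounts to at the socket level, followed by the first half of the test's sequence. The helper name listen_tcp is made up for this example; it is not how Go implements Listen(), only an approximation of the same system calls.

```python
import socket

def listen_tcp(host, port):
    """Hypothetical helper: bind() and then listen(), as two separate steps."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))  # step 1: attach the socket to an address
        s.listen()            # step 2: start accepting connections on it
    except OSError:
        s.close()
        raise
    return s

# Port 0 asks the kernel to pick a free ephemeral port for us.
first = listen_tcp("127.0.0.1", 0)
# getsockname() reports the address and port the kernel actually assigned.
port = first.getsockname()[1]
```

The test then makes the equivalent of a second listen_tcp("127.0.0.1", port) call while the first socket is still open.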
This obviously should fail, except the development version of Go was claiming that it didn't. At first I thought this had to be a Go change, but I soon found that even older versions of Go didn't pass this test (when I knew they had when I'd built them), and also that this test passed on my Fedora 36 home desktop, which I noticed was running Fedora's 6.0.15 kernel while my office machine was running 6.0.16. That certainly looked like a kernel bug, and indeed I was able to reproduce it in Python (which is when I eventually realized this was an issue with bind() instead of listen()).
The Python version allows me to see more about what's going on:
>>> from socket import *
>>> s1 = socket(AF_INET, SOCK_STREAM)
>>> s2 = socket(AF_INET, SOCK_STREAM)
>>> s1.bind(('127.0.0.1', 0))
>>> s2.bind(('127.0.0.1', s1.getsockname()[1]))
>>> s1.getsockname()
('127.0.0.1', 54785)
>>> s2.getsockname()
('0.0.0.0', 0)
Rather than binding the second socket or failing with an error, the kernel has effectively left it unbound (the s2.getsockname() result here is the same as when the socket is newly created; '0.0.0.0' is usually known as INADDR_ANY).
Replacing SOCK_STREAM with SOCK_DGRAM causes things to fail with 'address already in use' (errno 98), so this issue seems specific to TCP.
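For anyone who wants to check a kernel without typing at the interpreter, the comparison can be wrapped up as a small script. This is only a sketch: the function name second_bind_outcome and its three result strings are invented for this example, and it assumes that a port of 0 from getsockname() after a bind() that raised no error means the socket was left unbound, as in the session above.

```python
import errno
import socket

def second_bind_outcome(sock_type):
    """Bind two sockets of sock_type to the same 127.0.0.1 port and
    report what happened to the second bind() (hypothetical helper)."""
    with socket.socket(socket.AF_INET, sock_type) as s1, \
         socket.socket(socket.AF_INET, sock_type) as s2:
        s1.bind(("127.0.0.1", 0))   # kernel picks a free port
        addr = s1.getsockname()     # ('127.0.0.1', <assigned port>)
        try:
            s2.bind(addr)
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return "refused"    # the expected, correct behavior
            raise
        if s2.getsockname()[1] == 0:
            # bind() returned success but the socket has no port
            return "silently unbound"
        return "bound"

if __name__ == "__main__":
    for name, kind in (("SOCK_STREAM", socket.SOCK_STREAM),
                       ("SOCK_DGRAM", socket.SOCK_DGRAM)):
        print(name, second_bind_outcome(kind))
```

On a kernel without the bug, both socket types should report "refused"; the behavior described above would show up as "silently unbound" for SOCK_STREAM only.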
This kernel bug is in Fedora 37's 6.0.16 and 6.0.18 kernels, but it's gone in the Rawhide 6.2.0-rc2 kernel and isn't present in Fedora's 6.0.15. I don't know if it's in any version of 6.1, but I'll probably find out soon when Fedora updates to it. Interested parties can try it for themselves, and it's been filed as Fedora bug #2159802.
(This elaborates on a Fediverse thread. I looked at the 6.0.16 changelog, but nothing jumped out at me.)