Sometimes it actually is a kernel bug: bind() in Linux 6.0.16

January 11, 2023

There's a common saying and rule of thumb in programming (possibly originating in the C world) that it's never a compiler bug, it's going to be a bug in your code even if it looks crazy or impossible. Like all aphorisms it's not completely true, because compilers have bugs, but it's almost always the case that you haven't actually found a compiler bug and it's something else. You can say a similar thing about weird system issues (not) being the fault of a kernel bug, and so that's what I thought when the development version of Go started failing a self test when I built it on my Fedora 37 office desktop:

--- FAIL: TestTCPListener (0.00s)
    listen_test.go:72: skipping tcp  test
    listen_test.go:72: skipping tcp 0.0.0.0 test
    listen_test.go:72: skipping tcp ::ffff:0.0.0.0 test
    listen_test.go:72: skipping tcp :: test
    listen_test.go:90: tcp 127.0.0.1 should fail

Where this test in net/listen_test.go is failing is when it attempts to listen twice on the same localhost IPv4 address and port. It first binds to and listens on port 0 (which causes the kernel to assign a free ephemeral port), extracts the actual assigned port, and then attempts to bind to the same port a second time.

(The Go networking API bundles the binding and listening together in one Listen() API, but the socket API itself has them as two operations; you bind() a socket to some address, then listen() on it.)
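As a rough illustration of those two separate socket operations and of what port 0 does, here is a minimal Python sketch. The helper name listen_on_ephemeral_port and the use of 127.0.0.1 as the localhost address are my own assumptions for illustration, not anything taken from the Go test.

```python
import socket


def listen_on_ephemeral_port(host="127.0.0.1"):
    """Bind to port 0, start listening, and return (socket, assigned port).

    Hypothetical helper: it only illustrates the bind()/listen() split.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Step 1: bind(). Port 0 asks the kernel to pick a free ephemeral port.
    sock.bind((host, 0))
    # Step 2: listen(). Only now does the socket accept connections.
    sock.listen()
    # getsockname() reveals which port the kernel actually assigned.
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    sock, port = listen_on_ephemeral_port()
    print("listening on port", port)
    sock.close()
```

The port that comes back is the one a second bind() would then be aimed at.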

This obviously should fail, except the development version of Go was claiming that it didn't. At first I thought this had to be a Go change, but I soon found that even older versions of Go didn't pass this test (when I knew they had passed when I'd built them), and also that this test passed on my Fedora 36 home desktop, which I noticed was running Fedora's 6.0.15 kernel while my office machine was running 6.0.16. That certainly looked like a kernel bug, and indeed I was able to reproduce it in Python (which is when I eventually realized this was an issue with bind() instead of listen()).

The Python version allows me to see more about what's going on:

>>> from socket import *
>>> s1 = socket(AF_INET, SOCK_STREAM)
>>> s2 = socket(AF_INET, SOCK_STREAM)
>>> s1.bind(('127.0.0.1', 0))
>>> s2.bind(('127.0.0.1', s1.getsockname()[1]))
>>> s1.getsockname()
('127.0.0.1', 54785)
>>> s2.getsockname()
('0.0.0.0', 0)

Rather than binding the second socket or failing with an error, the kernel has effectively left it unbound (the s2.getsockname() result here is the same as when the socket is newly created; the all-zero IPv4 address 0.0.0.0 is usually known as INADDR_ANY). Replacing SOCK_STREAM with SOCK_DGRAM causes the second bind() to fail with 'address already in use' (errno 98), so this issue seems specific to TCP.
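For anyone who would rather run a script than type at the interpreter, the TCP versus UDP comparison can be sketched as below. This is only a sketch under my own assumptions: probe_double_bind is a hypothetical helper name, and 127.0.0.1 is assumed as the localhost address. On an unaffected kernel both lines should report a failure; on an affected one I would expect the TCP line to report no error.

```python
import socket


def probe_double_bind(sock_type, host="127.0.0.1"):
    """Bind two sockets of the same type to the same address and port.

    Hypothetical helper: returns a short description of what the
    second bind() did.
    """
    with socket.socket(socket.AF_INET, sock_type) as s1, \
            socket.socket(socket.AF_INET, sock_type) as s2:
        s1.bind((host, 0))  # the kernel picks a free port
        port = s1.getsockname()[1]
        try:
            s2.bind((host, port))  # should fail: the port is in use
        except OSError as e:
            return f"failed: {e.strerror} (errno {e.errno})"
        # No error: show what the kernel thinks s2 is bound to.
        return f"no error, second socket reports {s2.getsockname()}"


if __name__ == "__main__":
    for name, sock_type in (("TCP", socket.SOCK_STREAM),
                            ("UDP", socket.SOCK_DGRAM)):
        print(name, probe_double_bind(sock_type))
```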

This kernel bug is in Fedora 37's 6.0.16 and 6.0.18 kernels, but is gone in Rawhide's 6.2.0-rc2 and isn't present in Fedora's 6.0.15. I don't know if it's in any version of 6.1, but I'll probably find out soon when Fedora updates to it. Interested parties can try it for themselves, and it's been filed as Fedora bug #2159802.

(This elaborates on a Fediverse thread. I looked at the 6.0.16 changelog, but nothing jumped out at me.)

Written on 11 January 2023.