2007-09-27
Fixing Ubuntu's ethN device names when you swap hardware
If you swap an Ubuntu 6.06 system disk between hardware units (even nominally identical hardware units, like from one Dell PE2950 to another PE2950), the system will come up with scrambled Ethernet devices and you won't be on the network. (As we found out the hard way today.)
This happens because the only reason Ethernet devices get consistent names on Ubuntu machines is that Ubuntu remembers the mapping between hardware Ethernet addresses and the ethN name that each should get. When you transplant the disk to new hardware, the hardware addresses change, nothing matches up any more, and your Ethernet NICs get random names.
(This also means that eth0 in kernel messages is usually not
the device that ifconfig eth0 is talking about.)
/etc/iftab has the ethN ↔ hardware Ethernet address mapping; interfaces
with hardware addresses that aren't listed will be given ethN names that
are not already claimed in it.
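As an illustration, an iftab has lines of roughly the form 'name selector value'; something like the following (the MAC addresses here are made-up examples, not real ones):
  eth0 mac 00:13:72:aa:bb:01
  eth1 mac 00:13:72:aa:bb:02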
/etc/network/interfaces tells you what ethN names you are using for
what actual network connections.
Because Ubuntu seems to more or less randomly renumber ethN devices on
each reboot, you can't just remove /etc/iftab, fix your interfaces
file to use the 'natural' names, and be done. Instead you have to
reconstruct it with the right MAC addresses for any interfaces you care
about. Steps to do this:
- look at /etc/network/interfaces to find out what ethN names you care about.
- find out what current ethN names are connected to the network:
  cd /sys/class/net
  for i in eth*; do ifconfig $i up; done
  dmesg | tail -10
  Now look for devices that have reported that their link is up. If you are lucky, the machine only has one network connection and you are done.
  If the machine has multiple network connections you will need to use various means (eg, unplugging and replugging network cables and watching kernel messages) to figure out which ethN name is which connection.
- look up the MAC of each interface you care about with ifconfig ethN, and edit /etc/iftab to update the MAC for the 'correct' name for that interface. Delete every other ethN name in iftab, just so they don't confuse anyone.
Once you have updated everything you care about, reboot.
Optionally and possibly later, add name/MAC mappings for everything
to iftab, so that if you add another network connection in the future
it's sure to have a stable name. If you do this you will want to map out
what each physical port's Ethernet address is, so that you can assign
the names in a consistent and logical way and do things like make sure
that a card with two NICs gets two consecutive ethN names.
For bonus charm iftab is apparently being deprecated in favour of
another way of doing this, so the next Ubuntu LTS release will probably
require an entirely different way of fixing this.
(Of course, Ubuntu does not supply a convenient program to (re)build an
iftab for you.)
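If you want something that regenerates iftab from the current state once the names are actually the way you want them, a minimal sketch along the following lines works; it assumes the simple 'name mac address' iftab format and that /sys/class/net is available on your kernel:
  # Sketch: rebuild /etc/iftab from the current interface names and MACs.
  # Only run this once the ethN names are the way you want them to stay.
  cd /sys/class/net
  for i in eth*; do
      echo "$i mac $(cat $i/address)"
  done > /etc/iftab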
Sidebar: on the random renumbering issue
If a kernel device driver is responsible for more than one Ethernet
device, it always reports them in a consistent internal order. If you
have no iftab file, Ubuntu will probably only reorder blocks of ethN
devices served by different drivers. Eg, if you have four Intel NICs and
two Broadcom NICs, the internal order in the Intel NICs and the Broadcom
NICs is probably always going to be consistent, and the first Intel NIC
will be either eth0 (Intel driver loaded first) or eth2 (Broadcom driver
loaded first). However, which driver is loaded first seems to be more or
less random.
If Ubuntu does have an iftab file it will rename the kernel's initial
ethN names to not use any ethN name claimed by iftab, and I believe
the order it does this renaming is really random and can completely
shuffle ethN names, so that if iftab claims eth0 through eth3, you
could have eth4 being the third Intel NIC, eth5 being the first Broadcom
NIC, and so on.
2007-09-21
An interesting bind(2) failure
I fired up a version of Dovecot on a testing server today, only to be greeted with:
Fatal: listen(993) failed: Address already in use
That was kind of peculiar, since nothing else was running on the
machine, certainly nothing that should be using the imap-over-SSL port.
I tried starting Dovecot again and got the same error, looked at the
xinetd configuration just in case, tried lsof -i and saw that no
strange daemon was listening, tried connecting to the port and got a
connection refused, and finally wound up straceing the Dovecot process
just in case I had somehow asked it to bind to port 993 twice and it was
the second time around that was failing. None of it provided any enlightenment.
(At this point, as you might imagine, I was both frustrated and worried. Binding to sockets like this is just not supposed to fail mysteriously. I couldn't even suspect SELinux, since this was an Ubuntu machine.)
Finally I ran netstat --inet -a. In the listing of connected ports I
saw a TCP connection between port 993 on the local machine and port 2049
on one of our NFS servers, and the proverbial penny dropped.
What had happened is that the NFS client code uses so-called 'reserved ports' (ports under 1024) locally, starting from 1023 and counting down. Linux won't let you bind a listening port to a port that is already in use for the local end of a connection, and by coincidence we had enough NFS mounts (set up in the right sequence) so that port 993 was in use by the time I tried to start the version of Dovecot that we wanted to test.
The lesson I take away from this is that we should be sure all of our network daemons are started by the time we do NFS mounts, or we may run into this for real someday. And we're lucky that almost all of our mounts are UDP-based, or we would have run into this before now, since we have several hundred NFS mounts and, contrary to what I wrote the other day, it appears that the NFS client creates a new TCP socket for each separate mount.
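If you want to check in advance whether the port a daemon needs has already been claimed as the local end of some existing connection, a quick check along these lines will do (993 here is just our example port):
  # Is local port 993 already in use, either as a listening socket or as
  # the local end of an existing connection? (Column 4 is the local address.)
  netstat --inet -an | awk '$4 ~ /:993$/'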
Sidebar: what reserved ports are and why NFS uses them
On Unix-based operating systems, only root is allowed to use local ports of 1023 or below; these are called reserved ports, since they are reserved for root.
As a weak security measure to prevent users on a client machine from forging NFS requests and reading the replies, NFS servers often require that clients talk to them using a reserved port. This way the server has some assurance that it is getting requests from root on the client, not a random user.
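On Linux NFS servers this is the 'secure' export option (the default); a hypothetical /etc/exports line would look something like this, with the path and client name made up:
  # 'secure' makes the server reject requests not from a reserved port.
  /export/home    nfsclient.example.com(rw,secure)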
2007-09-18
Linux NFS client kernel tunable settings
We had a serious (lack of) performance issue today with a Linux NFS client machine, so I spent some time delving into the underdocumented world of kernel parameters that affect the Linux NFS client (not the NFS server, which has better documented stuff).
(I am going to use sysctl(8) notation for kernel tunables.)
The major tunable is sunrpc.udp_slot_table_entries and/or
sunrpc.tcp_slot_table_entries. These are the maximum number of outstanding
NFS RPC requests allowed per (UDP or TCP) RPC connection; the default is
16 and the maximum is 128 (and the minimum is 2). I believe that this
is effectively per NFS server (technically per NFS server IP address),
because it appears that the kernel reuses the same RPC connection for
all NFS filesystems mounted from the same server IP address.
Unfortunately existing RPC connections are not resized if you change the number of slot table entries.
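A minimal sketch of raising these, assuming you want the maximum; because existing RPC connections aren't resized, this has to happen before the relevant filesystems are mounted (or you have to unmount and remount them afterwards):
  # Raise the per-RPC-connection slot limits (the maximum is 128).
  sysctl -w sunrpc.tcp_slot_table_entries=128
  sysctl -w sunrpc.udp_slot_table_entries=128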
Contrary to what you might read in various places, changing
net.core.[rw]mem_default and [rw]mem_max is not necessary and
does not help. The kernel RPC client code directly sets its send and
receive buffer sizes based on the read/write size and the number of
slot table entries it has, and ignores [rw]mem_max in the process;
rmem_max and wmem_max only limit the sizes that user-level code
can set.
(This does mean that if you set a high slot table size and mount from a lot of different NFS servers, you could possibly use up a decent amount of kernel memory with socket send buffers.)
If you are doing NFS over UDP, as we are for some fileservers, you may
want to check the value of net.ipv4.ipfrag_high_thresh, but I'm not
sure what a good value would be. I suspect that the minimum size should
be enough memory to reassemble a full-sized read from every different
NFS fileserver at once.
(I believe this is a global amount of memory, not per connection or per fileserver, so it is safe to set it to several megabytes.)
It's possible that you will also want to increase
net.core.netdev_max_backlog, the maximum number of received network
packets that can be queued for processing, because it kicks in before
fragment reassembly. It's safest to consider it a global limit, although
it's not quite that.
(It is a per-CPU queue limit, but you can't be sure that all of your network packet receive interrupts won't wind up being handled by the same CPU in a multi-CPU system).
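To put some example numbers on both of these, something like the following in /etc/sysctl.conf is in the right spirit; the specific values are only illustrative guesses, not tested recommendations:
  # Illustrative values only; tune for your own environment.
  # Memory allowed for reassembling fragmented UDP packets (bytes).
  net.ipv4.ipfrag_high_thresh = 4194304
  # Maximum queued received packets awaiting processing.
  net.core.netdev_max_backlog = 2500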
2007-09-13
Limiting a process's memory usage on Linux
Due to recent events I have become interested in this issue, so I have been poking around and doing some experiments. Unfortunately, while Linux has a bewildering variety of memory related per-process resource limits that you can set, most of them don't work or don't do you any good.
What you have, in theory and practice:
- ulimit -m, the maximum RSS, doesn't do anything; the kernel maintains the number but never seems to use it for anything.
- ulimit -d, the maximum data segment size, is effectively useless since it only affects memory that the program obtains through brk(2)/sbrk(2). These days, these aren't used very much; GNU libc does most of its memory allocation using mmap(), especially for big blocks of memory.
- ulimit -v, the maximum size of the address space, works but affects all mmap()s, even of things that will never require swap space, such as mmap()ing a big file.
What I really want is something that can effectively limit a process's
'committed address space' (to use the term that /proc/meminfo and the
kernel documentation on swap overcommit use). I don't care if a process
wants to mmap() a 50 gigabyte file, but I care a lot if it wants 50G
of anonymous, unbacked address space, because the latter is what will
drive the system into out-of-memory.
Unfortunately I can imagine entirely legitimate reasons to want to
mmap() huge files (especially huge sparse files) on a 64-bit machine,
so any limit on the total process address space on our compute servers
will have to be a soft limit.
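As an interim measure, a soft address-space limit can at least be set per shell or per job; a sketch of what I have in mind (the 8 GB figure and the job name are just examples):
  # Soft-limit the address space to 8 GB (ulimit -v takes kilobytes);
  # a user who genuinely needs more can raise it, up to any hard limit.
  ulimit -S -v 8388608
  ./some-big-compute-job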
Since the Linux kernel already tracks committed address space information for the whole system, it's possible that it would not be too much work to extend it to a per-process limit. (The likely fly in the ointment is that memory regions can be shared between processes, which complicates the accounting and raises questions about what you do when a process modifies a virtual memory region in a way that is legal for it but pushes another process sharing the VMA over its limit.)
2007-09-10
A small drawback of 64-bit machines
It used to be that on a large memory 32-bit compute server, no single process could run away and exhaust all of the machine's memory. On an eight or sixteen gigabyte machine, processes ran into the 3 gigabyte (max) or so limit on per-process virtual address space well before they could run the machine itself into the ground.
(On a large enough machine you could survive a couple of such processes.)
This is no longer true on 64-bit large memory compute servers, as I noticed today; it is now possible for a single runaway process to take even a 32 gigabyte machine into an out of memory situation. I am now a bit nervous of what the kernel's OOM handling will do to us, since these are shared machines that can be running jobs for several people at once.
(Adding more swap space is probably not the solution.)
I have to say that the kernel OOM log messages are a beautiful case of messages being logged for developers instead of sysadmins. As a sysadmin, I would like a list of the top few processes by OOM score, with information like their start time, total memory usage, and their recent growth in memory usage if that information is available.
(And on machines with lots of CPUs, the kernel OOM messages get rather verbose. I hate to think what they will be like on our 16-core machine.)