Wandering Thoughts archives

2007-09-27

Fixing Ubuntu's ethN device names when you swap hardware

If you swap an Ubuntu 6.06 system disk between hardware units (even nominally identical ones, like from one Dell PE2950 to another), the system will come up with scrambled Ethernet device names and you won't be on the network. (As we found out the hard way today.)

This happens because the only way Ethernet devices get consistent names on Ubuntu machines is that Ubuntu remembers the mapping between hardware Ethernet addresses and the ethN names they should get. When you transplant the disk, the hardware addresses change, nothing matches up any more, and your Ethernet NICs get random names.

(This also means that eth0 in kernel messages is usually not the device that ifconfig eth0 is talking about.)

/etc/iftab holds the ethN ↔ hardware Ethernet address mapping; unknown hardware addresses will be given ethN names that are not already in use there. /etc/network/interfaces tells you what ethN names you are using for what actual network connections.
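For reference, iftab entries map a name to a selector plus a value, one mapping per line. A minimal sketch (these MAC addresses are invented):

    # /etc/iftab: give each MAC a fixed ethN name
    eth0 mac 00:13:72:aa:bb:01
    eth1 mac 00:13:72:aa:bb:02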

Because Ubuntu seems to more or less randomly renumber ethN devices on each reboot, you can't just remove /etc/iftab, fix your interfaces file to use the 'natural' names, and be done. Instead you have to reconstruct it with the right MAC addresses for any interfaces you care about. Steps to do this:

  • look at /etc/network/interfaces to find out what ethN names you care about.

  • find out what current ethN names are connected to the network:
    cd /sys/class/net
    # bring every interface up so its driver probes and reports link state
    for i in eth*; do ifconfig "$i" up; done
    # link status shows up as kernel messages
    dmesg | tail -10

    Now look for devices that have reported that their link is up. If you are lucky, the machine only has one network connection and you are done.

    If the machine has multiple network connections you will need to use various means (eg, unplugging and replugging network cables while watching kernel messages or link status, as sketched after this list) to figure out which ethN name is which connection.

  • look up the MAC of each interface you care about with ifconfig ethN, and edit /etc/iftab to update the MAC for the 'correct' name for that interface. Delete every other ethN name in iftab, just so they don't confuse anyone.
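For watching link status per interface, two possibilities (both tools have to be installed, and not every driver supports them):

    # mii-tool reports 'link ok' / 'no link' for many common drivers
    mii-tool eth0
    # ethtool works with more modern drivers
    ethtool eth0 | grep 'Link detected'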

Once you have updated everything you care about, reboot.

Optionally and possibly later, add name/MAC mappings for everything to iftab, so that if you add another network connection in the future it's sure to have a stable name. If you do this you will want to map out what each physical port's Ethernet address is, so that you can assign the names in a consistent and logical way and do things like make sure that a card with two NICs gets two consecutive ethN names.

For bonus charm, iftab is apparently being deprecated in favour of another mechanism, so the next Ubuntu LTS release will probably require an entirely different fix.

(Of course, Ubuntu does not supply a convenient program to (re)build an iftab for you.)
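If you want a starting point, here is a minimal sketch that emits iftab-format lines for every current ethN device by reading its MAC out of sysfs; you still have to rename and reorder the results by hand to get the assignments you actually want:

    #!/bin/sh
    # print an iftab-style 'ethN mac MAC' line for each current device
    for i in /sys/class/net/eth*; do
        echo "${i##*/} mac $(cat "$i/address")"
    done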

Sidebar: on the random renumbering issue

If a kernel device driver is responsible for more than one Ethernet device, it always reports them in a consistent internal order. If you have no iftab file, Ubuntu will probably only reorder blocks of ethN devices served by different drivers. Eg, if you have four Intel NICs and two Broadcom NICs, the internal order of the Intel NICs and of the Broadcom NICs is probably always going to be consistent, and the first Intel NIC will be either eth0 (Intel driver loaded first) or eth2 (Broadcom driver loaded first). However, which driver is loaded first seems to be more or less random.

If Ubuntu does have an iftab file, it will rename the kernel's initial ethN names so that they don't use any ethN name claimed by iftab. I believe the order it does this renaming in is genuinely random and can completely shuffle ethN names; if iftab claims eth0 through eth3, you could have eth4 being the third Intel NIC, eth5 being the first Broadcom NIC, and so on.

UbuntuEthernetNaming written at 15:20:07

2007-09-21

An interesting bind(2) failure

I fired up a version of Dovecot on a testing server today, only to be greeted with:

Fatal: listen(993) failed: Address already in use

That was kind of peculiar, since nothing else was running on the machine, certainly nothing that should be using the IMAP-over-SSL port. I tried starting Dovecot again and got the same error, looked at the xinetd configuration just in case, ran lsof -i and saw that no strange daemon was listening, tried connecting to the port and got connection refused, and finally wound up strace'ing the Dovecot process in case I had somehow asked it to bind to port 993 twice and it was the second bind that was failing. None of this yielded any enlightenment.

(At this point, as you might imagine, I was both frustrated and worried. Binding to sockets like this is just not supposed to fail mysteriously. I couldn't even suspect SELinux, since this was an Ubuntu machine.)

Finally I ran netstat --inet -a. In the listing of connected ports I saw a TCP connection between port 993 on the local machine and port 2049 on one of our NFS servers, and the proverbial penny dropped.

What had happened is that the NFS client code uses so-called 'reserved ports' (ports under 1024) for its local end, starting from 1023 and counting down. Linux won't let you bind a listening socket to a port that is already in use as the local end of a connection, and by coincidence we had enough NFS mounts (set up in the right sequence) that port 993 was in use by the time I tried to start the version of Dovecot that we wanted to test.
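A quick check for this situation, assuming TCP mounts to the standard NFS port:

    # TCP connections to port 2049 (NFS); the local address column
    # shows which reserved ports the mounts have eaten
    netstat --inet -n | grep ':2049'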

The lesson I take away from this is that we should be sure all of our network daemons are started by the time we do NFS mounts, or we may run into this for real someday. And we're lucky that almost all of our mounts are UDP-based, or we would have run into this before now, since we have several hundred NFS mounts and, contrary to what I wrote the other day, it appears that the NFS client creates a new TCP socket for each separate mount.

Sidebar: what reserved ports are and why NFS uses them

On Unix-based operating systems, only root is allowed to use local ports of 1023 or below; these are called reserved ports, since they are reserved for root.

As a weak security measure to prevent users on a client machine from forging NFS requests and reading the replies, NFS servers often require that clients talk to them using a reserved port. This way the server has some assurance that it is getting requests from root on the client, not a random user.
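On a Linux NFS server this check is the exports option secure, which is the default; a hypothetical /etc/exports line making it explicit:

    # 'secure' (the default) rejects requests from non-reserved source
    # ports; 'insecure' turns the check off
    /export/home  192.168.1.0/24(rw,secure)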

InterestingBindFailure written at 23:00:01

2007-09-18

Linux NFS client kernel tunable settings

We had a serious (lack of) performance issue today with a Linux NFS client machine, so I spent some time delving into the underdocumented world of kernel parameters that affect the Linux NFS client (not the NFS server, which has better documented stuff).

(I am going to use sysctl(8) notation for kernel tunables.)

The major tunables are sunrpc.udp_slot_table_entries and sunrpc.tcp_slot_table_entries. These are the maximum number of outstanding NFS RPC requests allowed per (UDP or TCP) RPC connection; the default is 16, the maximum is 128, and the minimum is 2. I believe that this is effectively per NFS server (technically per NFS server IP address), because it appears that the kernel reuses the same RPC connection for all NFS filesystems mounted from the same server IP address.

Unfortunately existing RPC connections are not resized if you change the number of slot table entries.
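A minimal sketch of turning these up; because of the caveat above, the new values only apply to RPC connections created afterwards, so you have to remount (or reboot) to benefit:

    # raise the per-connection RPC slot limits to their maximum
    sysctl -w sunrpc.udp_slot_table_entries=128
    sysctl -w sunrpc.tcp_slot_table_entries=128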

Contrary to what you might read in various places, changing net.core.[rw]mem_default and [rw]mem_max is not necessary and does not help. The kernel RPC client code directly sets its send and receive buffer sizes based on the read/write size and the number of slot table entries it has, and ignores [rw]mem_max in the process; rmem_max and wmem_max only limit the sizes that user-level code can set.

(This does mean that if you set a high slot table size and mount from a lot of different NFS servers, you could possibly use up a decent amount of kernel memory with socket send buffers.)

If you are doing NFS over UDP, as we are for some fileservers, you may want to check the value of net.ipv4.ipfrag_high_thresh, but I'm not sure what a good value would be. I suspect that at a minimum it should be enough memory to reassemble a full-sized read from every different NFS fileserver at once.

(I believe this is a global amount of memory, not per connection or per fileserver, so it is safe to set it to several megabytes.)
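As an illustration only (the right numbers depend on your read size and fileserver count), something like:

    # allow up to 4MB of memory for reassembling IP fragments
    sysctl -w net.ipv4.ipfrag_high_thresh=4194304
    # keep the low-water mark below the high one
    sysctl -w net.ipv4.ipfrag_low_thresh=3145728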

It's possible that you will also want to increase net.core.netdev_max_backlog, the maximum number of received network packets that can be queued for processing, because it kicks in before fragment reassembly. It's safest to consider it a global limit, although it's not quite that.

(It is a per-CPU queue limit, but you can't be sure that all of your network packet receive interrupts won't wind up being handled by the same CPU in a multi-CPU system).
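Again a sketch, simply raising it well above the usual default:

    # queue more received packets (per CPU) before dropping them
    sysctl -w net.core.netdev_max_backlog=2500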

KernelNFSClientTunables written at 23:34:45

2007-09-13

Limiting a process's memory usage on Linux

Due to recent events I have become interested in this issue, so I have been poking around and doing some experiments. Unfortunately, while Linux has a bewildering variety of memory related per-process resource limits that you can set, most of them don't work or don't do you any good.

What you have, in theory and practice:

  • ulimit -m, the maximum RSS, doesn't do anything; the kernel maintains the number but never seems to use it for anything.

  • ulimit -d, the maximum data segment size, is effectively useless since it only affects memory that the program obtains through brk(2)/sbrk(2). These days, these aren't used very much; GNU libc does most of its memory allocation using mmap(), especially for big blocks of memory.

  • ulimit -v, the maximum size of the address space, works but affects all mmap()s, even of things that will never require swap space, such as mmap()ing a big file (see the sketch after this list).
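To illustrate how blunt ulimit -v is (memory-hog is a hypothetical test program):

    # bash's ulimit -v takes kilobytes; cap the address space at 1GB
    ulimit -v 1048576
    # now even a large read-only file mmap() can fail, although it
    # would never need any swap space
    ./memory-hog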

What I really want is something that can effectively limit a process's 'committed address space' (to use the term that /proc/meminfo and the kernel documentation on swap overcommit use). I don't care if a process wants to mmap() a 50 gigabyte file, but I care a lot if it wants 50G of anonymous, unbacked address space, because the latter is what will drive the system into out-of-memory.
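You can see the system-wide version of this number directly:

    # system-wide committed address space versus the commit limit
    grep Commit /proc/meminfo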

Unfortunately I can imagine entirely legitimate reasons to want to mmap() huge files (especially huge sparse files) on a 64-bit machine, so any limit on the total process address space on our compute servers will have to be a soft limit.

Since the Linux kernel already tracks committed address space information for the whole system, it's possible that it would not be too much work to extend it to a per-process limit. (The likely fly in the ointment is that memory regions can be shared between processes, which complicates the accounting and raises questions about what you do when a process modifies a virtual memory region in a way that is legal for it but pushes another process sharing the VMA over its limit.)

MemoryRlimits written at 23:18:21

2007-09-10

A small drawback of 64-bit machines

It used to be that on a large memory 32-bit compute server, no single process could run away and exhaust all of the machine's memory. On an eight or sixteen gigabyte machine, processes ran into the 3 gigabyte (max) or so limit on per-process virtual address space well before they could run the machine itself into the ground.

(On a large enough machine you could survive a couple of such processes.)

This is no longer true on 64-bit large memory compute servers, as I noticed today; it is now possible for a single runaway process to take even a 32 gigabyte machine into an out of memory situation. I am now a bit nervous about what the kernel's OOM handling will do to us, since these are shared machines that can be running jobs for several people at once.

(Adding more swap space is probably not the solution.)

I have to say that the kernel OOM log messages are a beautiful case of messages being logged for developers instead of sysadmins. As a sysadmin, I would like a list of the top few processes by OOM score, with information like their start time, total memory usage, and their recent growth in memory usage if that information is available.

(And on machines with lots of CPUs, the kernel OOM messages get rather verbose. I hate to think what they will be like on our 16-core machine.)

64BitDrawback written at 23:36:12

