Wandering Thoughts archives

2007-12-25

Process memory layout for 32-bit Linux programs

There are actually two memory layouts for 32-bit x86 Linux programs: the old one and the new one. First, the simple version of the common bits:

  • first off, the 4 Gb address space is split into the lower 3 Gb, for the process, and the upper 1 Gb, for the kernel.
  • the program's code, data, and bss start at 128 Mb (0x08048000) and go upwards.
  • the stack grows down from just below the 3 Gb boundary, with its top at 0xbfffffff (just under 0xc0000000).

On modern Linuxes the top of the stack is randomized a bit, so you won't see it ending exactly at the 3 Gb boundary.

The difference between the old and the new memory layouts is the default location for mmap()'d objects, which includes the dynamic linker and shared libraries. In the old memory layout, mmap() starts at 1 Gb (0x40000000), aka TASK_UNMAPPED_BASE, and grows upwards. In the new memory layout, mmap() starts below the 'bottom' of the stack and grows downwards (although, like the stack, it is generally randomized a bit).

(More technically, the kernel picks a 'top of mmap() area' address that is at least 128 Mb below the top of the process's stack, but may be more if the process has a larger soft stack size limit.)
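
To see the layout for yourself, the simplest thing is to look at /proc/<pid>/maps; here is a minimal C sketch that dumps it from inside a process and also prints the rough location of a few things directly. This assumes a 32-bit build (gcc -m32); the exact addresses you get depend on your kernel and on address randomization.

  /* memlayout.c: print rough addresses of the major regions of a
     32-bit process.  Build with 'gcc -m32 -o memlayout memlayout.c'.
     A sketch only; exact numbers vary with kernel and randomization. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>

  int global_bss[1024];             /* bss: a bit above 0x08048000 */

  int main(void)
  {
      int on_stack;                 /* stack: just below the 3 Gb boundary */
      void *from_heap = malloc(16); /* brk heap: just above the bss */
      void *from_mmap = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      printf("text  (main)  : %p\n", (void *)main);
      printf("bss   (global): %p\n", (void *)global_bss);
      printf("heap  (malloc): %p\n", from_heap);
      printf("mmap  (anon)  : %p\n", from_mmap);
      printf("stack (local) : %p\n", (void *)&on_stack);

      /* for the full picture, just dump /proc/self/maps */
      FILE *fp = fopen("/proc/self/maps", "r");
      if (fp) {
          char buf[4096];
          size_t n;
          while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
              fwrite(buf, 1, n, stdout);
          fclose(fp);
      }
      return 0;
  }

On an old-layout kernel the anonymous mmap line will come out at or above 0x40000000; on a new-layout kernel it will be somewhere not too far below the stack.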

The old memory layout had a number of limitations and unfortunate consequences; for example, a dynamically linked process could not grow its heap very far with sbrk() before it ran into the dynamic linker at 1 Gb.

(In fact, on at least some kernel versions you simply couldn't run a dynamically linked program that had more than 896 Mb of code, data, and bss, because it would collide explosively with the dynamic linker. People periodically ran into this limitation.)

If you look at /proc/<pid>/maps you will also see a 'vdso' object at the top of memory, inside the kernel's upper 1 Gb. This is a virtual shared library that the kernel maps into people's address space for them to use; you can see it in ldd output as 'linux-gate.so.1'.
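
You can also ask the kernel directly where it put the vdso, since its address is passed to the process in the auxiliary table as AT_SYSINFO_EHDR. A minimal sketch; note that getauxval() is a relatively modern glibc interface, so this will not build on older systems:

  /* vdso.c: print where the kernel mapped the vdso for this process.
     getauxval() needs a reasonably modern glibc, so this is a sketch
     for current systems. */
  #include <stdio.h>
  #include <sys/auxv.h>   /* getauxval(), AT_SYSINFO_EHDR */

  int main(void)
  {
      unsigned long vdso = getauxval(AT_SYSINFO_EHDR);
      if (vdso)
          printf("vdso is mapped at %#lx\n", vdso);
      else
          printf("no AT_SYSINFO_EHDR entry; no vdso?\n");
      return 0;
  }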

32BitProcessMemoryLayout written at 23:53:44

2007-12-19

How x86 Linux executes ELF programs

Yesterday I said that the kernel directly executes programs in place. Because I feel like walking through the details, here is what the kernel does to start ELF programs on x86 Linux; for simplicity, I'm going to talk about 32-bit programs.

  • first, the kernel maps the program's text, data, and BSS into memory. Almost all programs require these to be mapped at fixed addresses starting from 0x08048000 (128 Mb) and going on up.

  • if the program is dynamically linked, the kernel also maps the dynamic linker's text, data, and BSS into memory. Dynamic linkers are generally willing to be loaded anywhere in memory, so they get wedged into the first spot the kernel considers available.

    (ELF executables specify the full path of their dynamic linker, which is confusingly called the 'ELF interpreter' in various places.)

  • the kernel sticks an 'auxiliary table' of various information on the top of the stack.
  • the environment and the arguments are copied onto the stack.

If the program is statically linked, the kernel sets the user-level program counter for the process to the start address in the program's ELF header, which is somewhere after 0x08048000. When the kernel returns back to user space, the program will wind up running directly.

(What the start address is depends on how much stuff has to go at the start of the program's text area, so it varies from program to program.)
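
You can check both the start address and the dynamic linker path of a binary by reading its ELF header and program headers for yourself (readelf -h and readelf -l will show the same things). Here is a minimal, lightly error-checked C sketch for 32-bit executables:

  /* elfinfo.c: print the entry point and the 'interpreter' path of a
     32-bit ELF executable.  A sketch with minimal error checking.
     Usage: ./elfinfo /some/binary */
  #include <elf.h>
  #include <stdio.h>
  #include <string.h>

  int main(int argc, char **argv)
  {
      int i;

      if (argc != 2) {
          fprintf(stderr, "usage: %s <32-bit ELF executable>\n", argv[0]);
          return 1;
      }
      FILE *fp = fopen(argv[1], "rb");
      if (!fp) { perror("fopen"); return 1; }

      Elf32_Ehdr eh;
      if (fread(&eh, sizeof(eh), 1, fp) != 1 ||
          memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
          eh.e_ident[EI_CLASS] != ELFCLASS32) {
          fprintf(stderr, "%s: not a 32-bit ELF file\n", argv[1]);
          return 1;
      }
      printf("entry point: %#lx\n", (unsigned long)eh.e_entry);

      /* walk the program headers looking for PT_INTERP, which names
         the dynamic linker; statically linked programs have none. */
      for (i = 0; i < eh.e_phnum; i++) {
          Elf32_Phdr ph;
          fseek(fp, eh.e_phoff + (long)i * eh.e_phentsize, SEEK_SET);
          if (fread(&ph, sizeof(ph), 1, fp) != 1)
              break;
          if (ph.p_type == PT_INTERP && ph.p_filesz > 0) {
              char interp[256] = "";
              size_t n = ph.p_filesz < sizeof(interp) - 1 ?
                         ph.p_filesz : sizeof(interp) - 1;
              fseek(fp, ph.p_offset, SEEK_SET);
              if (fread(interp, 1, n, fp) == n)
                  printf("dynamic linker: %s\n", interp);
          }
      }
      fclose(fp);
      return 0;
  }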

If the program is dynamically linked, the kernel instead sets the program counter to the start address of the dynamic linker, and the process will start running the dynamic linker's code directly. The dynamic linker uses information in the auxiliary table to find the real program's code and data, and eventually start it.

(From this we can see how calling the dynamic linker an 'interpreter' is a misnomer; it works nothing like an interpreter for a script, although it is a regular ELF executable.)
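
If you are curious about the auxiliary table, you can ask for some of its entries from inside a program: AT_ENTRY is the real program's entry point, AT_PHDR points at its program headers, and AT_BASE is where the dynamic linker itself was loaded (typically 0 for statically linked programs). Again this uses getauxval(), a relatively modern glibc interface, so treat it as a sketch:

  /* auxv.c: print a few auxiliary table entries that the dynamic
     linker uses to find the real program.  Needs a glibc new enough
     to have getauxval(). */
  #include <stdio.h>
  #include <sys/auxv.h>

  int main(void)
  {
      printf("AT_ENTRY (program entry point): %#lx\n", getauxval(AT_ENTRY));
      printf("AT_PHDR  (program headers)    : %#lx\n", getauxval(AT_PHDR));
      printf("AT_BASE  (dynamic linker base): %#lx\n", getauxval(AT_BASE));
      return 0;
  }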

Technically, you could make dynamically linked ELF executables that contained no actual machine code but instead had a 'dynamic linker' that actually was an interpreter. However, this would be tricky to pull off, because dynamic linkers cannot themselves be dynamically linked, so your interpreter would need either to not use any shared libraries (including the normal C and Unix runtime) or to bootstrap the regular dynamic linker somehow.

HowProgramsExecute written at 00:53:10

2007-12-13

Doing one-shot booting with GRUB

For a long time, one of the advantages of LILO over GRUB was that LILO had a one-shot mode, where it would make an entry the default for only one reboot. I've recently been pleased to discover that GRUB can now do this too, although in the grand Linux tradition it's not well documented.

The magic incantation is the GRUB shell command:

savedefault --default=<num> --once

(This has to be issued in a GRUB shell, such as what you get by just running grub. On my workstation I find the --no-floppy command line argument useful, but I may be one of the few people who still has a floppy drive in their machine.)

If you are using Red Hat's version of GRUB (found in Red Hat Enterprise and Fedora), this is all you need to do. On Ubuntu or anywhere else that uses the normal version of GRUB, you need to modify your grub.conf or menu.lst to have 'default = saved', and then add 'savedefault' (without a number) to every menu entry, or at least every menu entry that you will ever try to use in one-shot mode.
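
As an illustration (the entry titles, kernel versions, and device names here are invented, so adjust to taste), the relevant bits of such a menu.lst might look something like:

  default = saved
  timeout 5

  title Linux 2.6.23
      root (hd0,0)
      kernel /vmlinuz-2.6.23 ro root=/dev/sda2
      initrd /initrd-2.6.23.img
      savedefault

  title Linux 2.6.22 (older kernel)
      root (hd0,0)
      kernel /vmlinuz-2.6.22 ro root=/dev/sda2
      initrd /initrd-2.6.22.img
      savedefault

With this in place, 'savedefault --default=1 --once' in the GRUB shell would boot the older kernel on the next reboot only.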

(I believe that you can tell which version of GRUB you have by whether or not you have a /boot/grub/default file and a grub-set-default command.)

The normal GRUB saves this in /boot/grub/default, which is a plain text file; a one-time entry shows up as N:O, where N is your normal default and O is the entry you'll boot once. Red Hat's GRUB directly embeds the information in the stage2 binary, and I don't know how you can see the current state.
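
For example, if your normal default is entry 0 and you have told GRUB to boot entry 1 once, the interesting part of /boot/grub/default will read something like:

  0:1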

(Judging from the changelog of the Fedora GRUB RPM, this may have been in their version of GRUB for a very long time. Unfortunately it has been undocumented for that very long time, or I might have felt slightly differently about GRUB back then.)

OneShotGrub written at 23:35:19

2007-12-10

A surprise about Linux serial consoles

Here is a surprise I just discovered about magic SysRq and serial consoles: in order to have a serial console respond to magic SysRq, you need something talking to it. If your serial console is /dev/console, I believe this happens automatically; however, if it is not (say you have hooked it up to debug system lockups in X, to pick a not entirely hypothetical example), you need to run a getty or something similar on the serial port in order for the kernel to notice magic SysRq sequences.

If you don't, what happens is relatively weird. Instead of just ignoring the magic SysRq sequences outright, the kernel seems to buffer them until something opens the serial port, at which point they all suddenly take effect. And if you stop your getty process, things go back to sleep until you restart it.

(I tried to follow the code paths through the kernel to figure out what was really going on, but my knowledge of Linux kernel internals is not deep enough to be up to the task.)

Unfortunately, all this didn't do me much good. While the serial console did capture a kernel panic, it was the sort where the kernel is left in such a bad state that magic SysRq doesn't work afterwards, so I suspect that I have recurring hardware issues, hopefully heat and cooling related (that would be cheaper to fix than a broken motherboard).

SerialConsolesNeedGetty written at 22:47:08

2007-12-07

Using Linux's magic SysRq feature

The 'magic SysRq' is a feature of the Linux kernel where you can directly invoke certain kernel commands, regardless of what happens to be going on at user level and often even if the system is fairly broken otherwise. Commands are conventionally described by the key that's used to invoke them.

There are three ways of invoking magic SysRq commands:

  • writing the command key to /proc/sysrq-trigger, for example with 'echo <key> >/proc/sysrq-trigger' in a root shell.
  • on the physical console with Alt-SysRq-<key>.
  • on a serial console by BREAK followed by the key (within five seconds or so).

The first method always works, but the latter two are controlled by the kernel.sysrq sysctl setting; this defaults to on but is sometimes deliberately turned off by distributions.

(Red Hat seems to turn it off, but Ubuntu leaves it alone and thus enabled. Check your /etc/sysctl.conf.)
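
Checking and changing the setting is straightforward; a quick sketch (the last line is what you would add to /etc/sysctl.conf to make it permanent):

  # see the current setting
  sysctl kernel.sysrq
  # turn magic SysRq on right now
  sysctl -w kernel.sysrq=1
  # and to make it stick across reboots, put this in /etc/sysctl.conf:
  kernel.sysrq = 1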

The magic SysRq keys that I find most useful are:

  • s: Try to sync all filesystems.
  • u: Try to remount all filesystems read-only.
  • b: Immediately reboot the machine, without syncing.
  • h: Print succinct help (any unknown key will do, but h is conventional).
  • o: Immediately power off the machine.
  • t: Dump out information about all processes.
  • r: Recover your keyboard from a crashed X server, allowing you to switch console virtual terminals, although this may not do you much good.

(This list is in roughly the order of how often I actually use them. The full list of available commands is in Documentation/sysrq.txt in the kernel source tree.)

A number of sources will give you involved sequences for rebooting a machine through magic SysRq commands. I tend to just sync, remount read-only, and then reboot (s, u, b); if a machine is damaged enough that I have to reboot it through magic SysRq, it is generally damaged enough that processes can't shut down cleanly even if you send them signals.
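
Spelled out as something you can do from a root shell, my sequence looks like this (the sleeps are just to give the sync and the remount a moment to make progress before the reboot):

  echo s > /proc/sysrq-trigger   # sync filesystems
  sleep 5
  echo u > /proc/sysrq-trigger   # remount filesystems read-only
  sleep 5
  echo b > /proc/sysrq-trigger   # reboot immediately, no syncing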

(The case that's happened to us several times is a server's system disk going partly read-only and partly inaccessible; many things that weren't in active enough use to be in the kernel's cache were unavailable, including shutdown et al. In this situation it is very useful that echo is a shell builtin.)

A number of magic SysRq commands are basically kernel diagnostic aids, such as t. These dump out a great deal of information, so they tend to only be really useful if the system is still intact enough to log things or you have a serial console. Triggering such diagnostic dumps without physical access (or bothering to use the serial console) is one of the uses of /proc/sysrq-trigger.

(If the system is moderately intact but not able to log things, perhaps because your /var has become read-only, you may be able to fish the dumped information out with dmesg and capture it with cut and paste or the like. This has happened to us.)

A trivia note: magic SysRq stuff always works on the physical console, whether or not it is a kernel console. It only works on serial ports if they are actual serial consoles.

UsingMagicSysrq written at 23:58:12

