2008-11-14
How to force a crash dump on Solaris 10 x86
Because SPARCs have OpenBoot, forcing Solaris SPARC to panic and do a crash dump is generally a pretty simple process. Because x86 hardware has no PROM environment (despite Sun trying to pretend otherwise), forcing a crash dump on Solaris 10 x86 is a little bit more intricate.
Crash dumps are done through the kernel debugger, which has to be loaded ahead of time, and loaded from the system console itself. (Technically I think that you can load it on a serial connection, should you have one that is not the system console.)
You load the kernel debugger by running the command 'mdb -K' as root.
This immediately drops you into the kernel debugger, halting the rest
of the kernel and the system, so when you see the debugger prompt you
want to use the ':c' command to continue everything. Once the kernel
debugger is loaded, you can (theoretically) break into it at any time
with either F1-A or Shift-Pause.
(In a stroke of what I can only describe as Sun's typical brilliance, neither key sequence can be set as a hotkey in the SunFire X2100 and X2200 ILOM KVM-over-IP environment. F1-A does work if your local machine will pass it to the ILOM console application.)
Once you are in the kernel debugger, the command to crash the system is:
$<systemdump
If all goes well, the crash dump will appear in /var/crash/<hostname>/
as usual after the machine reboots.
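
(Once the machine is back up, a quick way to see what savecore actually wrote is something like the sketch below. It assumes the usual paired unix.N and vmcore.N file names in the crash directory; the function name list_crash_dumps is made up for illustration.)

```shell
# Print an 'mdb' command line for every complete crash dump pair in a
# crash directory. ASSUMPTION: savecore wrote its usual unix.N / vmcore.N
# pairs. The directory defaults to /var/crash/<hostname>.
list_crash_dumps() {
    dir="${1:-/var/crash/$(hostname)}"
    for core in "$dir"/vmcore.*; do
        [ -f "$core" ] || continue      # the glob matched nothing
        n="${core##*.}"                 # dump number, e.g. 0
        if [ -f "$dir/unix.$n" ]; then
            echo "mdb $dir/unix.$n $dir/vmcore.$n"
        else
            echo "# vmcore.$n has no matching unix.$n" >&2
        fi
    done
}
```

Running list_crash_dumps with no arguments looks in the default directory; each line it prints can be pasted as-is to open that dump in mdb.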
When you are done with the kernel debugger (for example, the system
works fine during testing instead of crashing), you can and should
unload it again by running 'mdb -U'. Among other things, this makes
F1-A and Shift-Pause on the console not dangerous any more.
Mdb, including the kernel debugging stuff, is mostly
documented in the Solaris Modular Debugger Guide and in the mdb and kmdb
manpages. Note that the SMDG has not been updated for recent updates of
Solaris 10 (for example, its instructions on how to start the kernel
debugger on boot are for the old x86 boot environment, not the new one).
If the machine is still running normally, you have two additional options:
- 'reboot -d' will force a crash dump before/while rebooting.
  (However, speaking from personal experience, it is possible to get a Solaris 10 system into such a state that it cannot reboot, although it's still running relatively normally.)
- 'savecore -L' will take a 'crash dump' of a live system without interrupting it, provided that you have configured a dedicated dump device with dumpadm. You'll want the system to be as quiet as possible, and even then you may not get a usable dump.
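
(Since 'savecore -L' depends on a dedicated dump device, it can be worth checking for one first. The following is a minimal sketch, not a tested procedure. It assumes that the 'Dump device:' line printed by dumpadm ends in '(dedicated)' or '(swap)'; the function names are hypothetical.)

```shell
# Succeed if dumpadm-style output on stdin reports a dedicated dump device.
# ASSUMPTION: the "Dump device:" line ends in "(dedicated)" or "(swap)".
has_dedicated_dump() {
    grep 'Dump device:.*(dedicated)' >/dev/null
}

# Take a live dump only if the check passes.
# Set DRY_RUN=1 to print the savecore command instead of running it.
live_dump() {
    if dumpadm | has_dedicated_dump; then
        ${DRY_RUN:+echo} savecore -L
    else
        echo "no dedicated dump device; configure one with dumpadm -d" >&2
        return 1
    fi
}
```

Run live_dump as root; with DRY_RUN=1 set it only reports what it would do.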
(This is one of those entries I write to have this information in an easily accessible place that I can remember.)
2008-11-05
An issue with quotas on ZFS pools
For peculiar local reasons, we have some ZFS pools that have overall pool quotas (on Solaris 10 U5, so these are real full quotas). We just had the first such pool fill up and it turns out that when this happens, ZFS has a somewhat undesirable bit of behavior:
$ rm tankdata
rm: tankdata not removed: Disc quota exceeded
(You can't truncate anything either. Not even root can remove or truncate files.)
This does not happen if the pool has no quota and fills up, and it
also does not happen if the quota is on anything but the pool itself.
For example, you can put all of your filesystems under a 'quota'
pseudo-filesystem and put what would otherwise be a pool quota on this
'quota' filesystem and everything works (users run out of space but
can fix it themselves).
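
(As a rough sketch of that layout: the pool name, the size, and the function name below are all hypothetical, and this assumes a fresh pool rather than one with existing filesystems to move around.)

```shell
# Put what would otherwise be a pool quota on a 'quota' container
# filesystem instead. Set DRY_RUN=1 to print the zfs commands instead of
# running them. The pool name and size are whatever you pass in.
setup_quota_container() {
    pool="$1"; size="$2"
    ${DRY_RUN:+echo} zfs create "$pool/quota"
    ${DRY_RUN:+echo} zfs set quota="$size" "$pool/quota"
    ${DRY_RUN:+echo} zfs set quota=none "$pool"
    # Real filesystems then get created under $pool/quota,
    # for example $pool/quota/homes.
}
```

For example, 'DRY_RUN=1 setup_quota_container tank 100G' prints the three zfs commands without touching anything.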
(Note that there are no snapshots involved here; neither the pool that this happened to nor the test pool that I used to explore what was going on had any snapshots at all.)
I assume that what is going on here is that ZFS counts the very temporary extra space needed for the internal metadata involved in removing the file against the pool quota and, since there is no space left in the pool quota, disallows the action. This is consistent with its snapshot behavior, although even less useful.
I suspect (and hope) that this behavior will go away with Solaris 10 update 6's new 'refquota' ZFS feature, which makes this yet another reason to upgrade to Solaris 10 U6 as soon as we can (now that it's finally out).
(By the way, the way to fix a pool with this problem is of course to temporarily increase or remove the ZFS pool quota.)
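
(A minimal sketch of that fix, assuming you already know the old quota value, which 'zfs get -H -o value quota <pool>' should report. The function name and the example arguments are made up.)

```shell
# with_quota_lifted POOL OLD_QUOTA COMMAND...
# Remove the pool quota, run COMMAND, then put the old quota back even if
# COMMAND failed. Set DRY_RUN=1 to print each step instead of running it.
with_quota_lifted() {
    pool="$1"; old="$2"; shift 2
    status=0
    ${DRY_RUN:+echo} zfs set quota=none "$pool"
    ${DRY_RUN:+echo} "$@" || status=$?
    ${DRY_RUN:+echo} zfs set quota="$old" "$pool"
    return "$status"
}
```

For example, 'with_quota_lifted tank 100G rm /tank/bigfile' would lift the quota on tank, remove the file, and restore the 100G quota afterwards.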