Dear Solaris boot sequence: SHUT UP

September 20, 2010

Here is how Solaris reduced me to a grim rage just now.

We had a power glitch this weekend, and one of our Solaris servers wouldn't come up afterwards; it hit some problem that left it needing attention in single-user mode.

What problem? I couldn't see. Solaris printed the problem to the machine console, but then it kept producing more boot-time output. Lots of boot-time output, all of it unimportant (one or two lines per iSCSI target that it could connect to). This was more than enough to scroll the problem message off the top, and of course the Solaris console has no scrollback buffer. So I was left with a system that needed manual attention, but I had no idea of exactly what manual attention it needed.

The Solaris boot sequence is an egregious offender in the category of bad boot-time messages; it is a peculiar mixture of completely uninformative and pointlessly verbose. In our environment, possibly the entire blame for this rests on iSCSI, but it's an OS component and so I blame Solaris cultural attitudes as a whole.

(Frankly, it's yet another really odd way that an 'enterprise' OS is not actually enterprise-ready. One defining trait of enterprise environments is that they have a lot of whatever; a lot of disks, a lot of iSCSI targets, a lot of RAM, and so on. Thus I'd think that an enterprise-ready OS would think about scale issues in messages and the like. But Solaris? Not so much.)
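
(As a rough sketch of what scale-aware output could look like, here is a small shell filter that collapses repetitive lines into a single count at the end. It assumes the boot output has been captured to a file or a pipe, which mine was not. The function name collapse_noise and the sample messages are made up for illustration; they are not real Solaris output.)

```shell
#!/bin/sh
# collapse_noise PATTERN: copy stdin to stdout, but replace every line
# matching the awk regex PATTERN with one summary line at the end.
# Needs an awk that supports -v (POSIX awk; nawk on older systems).
collapse_noise() {
    awk -v pat="$1" '
        $0 ~ pat { n++; next }   # count and swallow the noisy lines
        { print }                # pass everything else through untouched
        END { if (n > 0) printf "[%d lines matching /%s/ collapsed]\n", n, pat }
    '
}

# Made-up sample boot output, for illustration only.
printf '%s\n' \
    'mounting local filesystems' \
    'iscsi: target 1 online' \
    'WARNING: something needs attention' \
    'iscsi: target 2 online' |
    collapse_noise '^iscsi: '
```

The point is only that the one line that matters stays on screen while the per-target chatter shrinks to a count; a real fix would have to live in whatever prints the messages in the first place.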

As usual, I got out of this by figuring out how to turn off iSCSI entirely, so that I could sort of see the boot messages. Fortunately our Solaris servers don't actually require iSCSI to boot, or we'd be in much more trouble.

PS: for people who are wondering why we don't have a serial console, one reason is this. I refuse to touch Solaris 10 serial support until there is some user friendly guide that actually works and explains what is going on and how you do things.

Sidebar: what it took to turn off iSCSI

Roughly:

  • fsck, so that the filesystems were clean
  • mount -o remount,rw /; this complained but seems to have actually worked.
  • iscsiadm to turn off discovery. Without the read-write remount of /, this completed without visible problems but didn't actually turn things off permanently.
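
The steps above can be sketched as a small script. This is a reconstruction under assumptions, not a record of what I typed: I haven't noted the fsck arguments, so they are left out, and which discovery methods get disabled (SendTargets and static here) is a guess that depends on how iSCSI is configured. The run and turn_off_iscsi names are made up. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the single-user-mode steps from the list above.
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to really run them (as root, on the Solaris machine).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

turn_off_iscsi() {
    # 1. Get the filesystems clean. Arguments are omitted because the
    #    original steps do not record them.
    run fsck

    # 2. Make / writable so that the iSCSI change actually persists.
    run mount -o remount,rw /

    # 3. Turn off discovery. ASSUMPTION: SendTargets and static discovery
    #    are the methods in use; adjust to match the real configuration.
    run iscsiadm modify discovery --sendtargets disable
    run iscsiadm modify discovery --static disable

    # Show the resulting discovery state as a check.
    run iscsiadm list discovery
}

turn_off_iscsi
```

Keeping the remount ahead of iscsiadm is the important part of the ordering, since without a writable / the change appeared to work but did not stick.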

Ironically, something in this whole sequence made the 'needs attention' issue go away too, but I don't know exactly what it was (or what the issue was), since I couldn't see the message that Solaris printed and as far as I know, it wasn't logged anywhere. Possibly it was my old friend boot archives.


Comments on this page:

From 8.8.38.2 at 2010-09-20 11:27:33:

What I've had to do in these situations (not just Solaris) is record a movie of the boot up sequence with my cellphone's camera and play it back to find the error. This doesn't help in data centers where cameras aren't allowed, but might help in some situations.

From 207.23.96.10 at 2010-09-20 11:57:59:

Just wondering, do you have a Network Management port on that server (along with the dreaded serial port)? If so, I've found the Sun 'Advanced Lights Out Management Guide' (819-3250-11_ALOM is the name of the PDF I have for our T2000 server) to be very helpful for boot management/etc. I have the Net MGMT port assigned a private address that I can SSH to internally, etc (everything the serial console can do, but actually manageable :) ).

Not sure it would help in this situation, but I've abandoned the serial management port as a result.

Regards, Mike

By cks at 2010-09-20 12:58:03:

These machines have an ILOM/ALOM (and I was in fact using it), but the problem is that the ILOM emulates an x86 video console, and that doesn't have any scrollback unless the OS implements it. It would take a serial console to get external scrollback.

From 86.143.180.242 at 2010-09-21 03:00:14:

Regarding using a camera, I've seen that the Dell DRACs have this built in by recording the first few minutes of the boot sequence automatically. It then lets you step through the frames (one every second or so) via the web interface, which is pretty useful. -- Dominic

From 131.58.64.193 at 2010-09-23 06:48:58:

I agree the Sun boot time scrolling off the console has been annoying since the dawn of Sun. Wasn't dmesg helpful?

I'm not sure I follow what you mean about ALOM/ILOM requiring a serial console to get scrollback. You can ssh in to ILOM/ALOM if properly configured from any machine on that network and surely you have machines with an X display (or putty from a windows PC if you must) where you can scroll back in an xterm or such? I almost never mess with consoles / KVM's anymore and just do everything via ILOM/ALOM which works extremely well for me. It would be painful to go back.

By cks at 2010-09-23 11:03:20:

Dmesg was no help; the boot stuff doesn't seem to have logged whatever it wanted me to fix in dmesg. Presumably the people who wrote the code that printed the message saw no need to dmesg it because, after all, it was sitting right there on the screen.

The serial console I'm talking about is the server serial console, not the ILOM serial port. The host server has two (possible) consoles; the normal PC video console and any serial console that's been configured in the OS. Like everything else, Solaris defaults to the video console, and as far as I know there's no way to scroll this back in the ILOM.

(I believe that the only access the ILOM gives you to the server video console is through its KVM-over-IP applet. This is handy but has a few limits.)

From 81.178.191.122 at 2010-09-23 18:13:35:

Well, in enterprise you need more than just an enterprise OS. You need enterprise infrastructure and people who can actually manage it. In an enterprise environment you would most certainly use a serial console with message logging so you could easily retrieve all the messages.

ps. nevertheless it would be nice to have a scroll buffer on a gfx console in Solaris

From 81.178.191.122 at 2010-09-23 18:15:01:

additionally, kmdb -> $<msgbuf

From 98.109.163.36 at 2010-12-11 21:17:16:

The Solaris boot sequence is an egregious offender in the category of bad boot time messages; it is a peculiar mixture of completely uninformative and pointlessly verbose. In our environment, possibly the entire blame for this rests on iSCSI, but it's an OS component and so I blame Solaris cultural attitudes as a whole.

Not saying anything that hasn't been said many times by people more knowledgeable and competent than I, but it seems to me that this is characteristic of current-day *nix and open-source communities more generally.

A few examples...

In my experience, Samba has one error message that it issues when anything goes wrong, and sometimes even when nothing is wrong (as best I can tell, anyway). Exactly one. Turn on super-chatty debugging, and you get the same error message and an enormous amount of idle chatter. Maybe the chit-chat is useful if you're reading source. Maybe.

When I upgraded from FC8 to FC10 the experience was so painful that I didn't upgrade until I had to - FC13. Quite the experience. Little did I know how easily I had gotten off with the FC8->FC10 upgrade. I got FC13 tweaked to my liking about the time FC14 was released. (And by "upgrade" I mean "install on a clean partition", thanks just the same.)

VMware Workstation 6.5.4 installs broke on Fedora about kernel 2.6.29 - that's the install ... there are work-arounds, but once you get it installed, it turns out the VMware Tools install is similarly broken ... and once you get Tools installed, you really seriously need to upgrade to VMware Workstation 7 if you want the XP guest to run faster than about 100MHz - no idea, but there's an I/O bottleneck of some sort, even if you put the guest on a different physical disk (so I/O may be a symptom of a deeper problem). I found a couple of posters who had had similar experiences, but no-one seems to know, and vmware isn't saying.

uh, video cards. I ran the nVidia proprietary drivers for a long time, until I made the mistake of allowing an automatic upgrade ... because surely after the passage of a year, nVidia would have fixed the dual-screen bug I had seen.

Oops, silly me. (Truly, I should have known better. Reminds me of the joke that "remarriage is the triumph of hope over experience.")

sudo ... broken in several ways.

awk ... unless I've missed a trick, the awk Fedora installs wants me to convert all numeric input strings with strtonum, else they're just strings. This change may even be documented somewhere, if it isn't a bug.

dhclient ... with version 4, ISC moved dhclient-exit-hooks into /etc/dhcp (on Fedora, anyway, but same problem on Ubuntu). Quite the amusing rant from an Ubuntu user several months ago. I had been running my firewall out of /etc/dhclient-exit-hooks. Imagine my surprise...

That one's been in redhat bugzilla for more than a year.

remote sudo X apps over ssh ... worked on something like 4 out of 6 machines, all six running FC13. The fix wasn't in ssh client or ssh server config... I don't remember the details, but something to do with user environment variables that sudo preserves (or doesn't.) The sudo that didn't work was the current version; the sudo that did was (current_version - 1). I had held back the sudo update because of other breakage in sudo, but hadn't blacklisted it on the machine(s) that didn't work.

SELinux - almost certainly an excellent idea - in theory - but as you observe elsewhere, the difference between theory and practice is much smaller in theory than it is in practice. To say that the docs suck would be an insult to poor documenters everywhere. I've spent days trying to bend some recalcitrant program to my will, only to find (when it finally occurs to me to 'sudo setenforce 0') that there's a boolean I should have enabled... (uh, dhclient ... why didn't I guess that dhcp couldn't execute iptables?? But then again, where is the audit log or syslog entry that would have fingered SELinux? And yeah: I'm sure this is documented somewhere.)

Just a few of the uncounted joys of *nix computing.

While no doubt this says more about me than about linux (to wit: I'm a troglodyte misfit grouch who cannot get in step with the gnome and/or kde user experience), I wonder if there isn't a serious point here, perhaps that the rapid growth in complexity has outstripped the ability of even relatively experienced users to manage.

For example, how many different security subsystems are enabled in a current linux distro? If you include the (relatively) obscure ones like the posix capability set (which may not be compiled into Fedora kernels, even though the user space code and config files are present), I believe I could list as many as a dozen, maybe more. (I'm talking user-level security, not eg apache.) These overlapping subsystems reduce performance, introduce vulnerabilities, and increase the sheer drudgery of sysadmin ... no feeling of accomplishment when one finally wrestles one of these legacy problems to earth.

Maybe it's time to raze the edifice to the ground, sow salt into the earth, and make a clean start.

Jim Snyder (devnull at jhsnyder dot com)

Written on 20 September 2010.

Last modified: Mon Sep 20 11:18:05 2010