How to accidentally reboot a server

August 8, 2013

This is a little war story.

To start out, determine that you need to add more memory to one of your login servers. This requires taking it down, which requires getting all of its users to log off first. Then go through the following steps:

  1. Keep new logins off the system by changing your automated password management system to give all non-staff accounts a login shell that just prints a 'this system is under maintenance, please use another one' message and then quits.

    (Otherwise people will keep logging in to the server and you'll never, ever get a moment where there is no one on. Especially during working hours.)

  2. Mail all of the current users to ask them to log off.

  3. Get email from one user saying 'you can kill all of my processes on the machine'.

  4. Do this in the obvious way. From an existing root session:
    /bin/su user
    kill -1 -1

  5. Have all of your (staff) sessions to the machine suddenly disconnect. Get a sinking feeling.

Perhaps you can spot the mistake already. The mistake is that the su to the user did not actually work. The user had a locked login shell, so all the su did was print a message and then dump me back into the root shell. Then I ran the 'kill -1 -1' as root. A PID of -1 means 'every process I am allowed to signal', which for root is more or less everything, so of course it SIGHUP'd all processes on the machine, effectively rebooting it.
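One way to guard against this failure mode is to confirm who you actually are before sending a broadcast signal. What follows is only a minimal sketch, not something we actually run; the names am_user and hup_all_as are made up for illustration, and the functions would have to live somewhere that the su'd shell can find them (for instance in a script on the PATH).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the current effective user is "$1".
am_user() {
    [ "$(id -un)" = "$1" ]
}

# Hypothetical helper: SIGHUP everything we are allowed to signal, but only
# after confirming that we really are the intended user and are not root.
hup_all_as() {
    if [ "$(id -u)" -eq 0 ] || ! am_user "$1"; then
        echo "refusing: running as $(id -un), not $1" >&2
        return 1
    fi
    kill -1 -1
}
```

The same check also works inline, as '[ "$(id -un)" = user ] && kill -1 -1', which is only a little more typing than the bare kill.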

(It didn't actually reboot and in fact enough stayed up that I could ssh back in, which surprised me a little bit.)

I should have used '/bin/su user -c "kill -1 -1"', where the command either runs as the user or does not run at all, or in fact one of the ways we keep around to do 'run command as user no matter what their shell is'. But I didn't take the time to do either.
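As a rough sketch of the 'su -c' form, here is a small wrapper that hands the command to su directly so that nothing is ever typed into whatever shell you happen to be left in. The run_as name and the DRY_RUN variable are hypothetical conveniences for illustration, not features of su. With a locked login shell I would expect the command simply not to run, which is the harmless way to fail.

```shell
#!/bin/sh
# Hypothetical wrapper. Usage: run_as USER COMMAND
# If DRY_RUN is set (an illustrative variable, not part of su), only print
# the su invocation that would be used instead of running it.
run_as() {
    user=$1
    cmd=$2
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "/bin/su $user -c \"$cmd\""
        return 0
    fi
    # The command is passed to su itself, so it runs as $user or not at all.
    /bin/su "$user" -c "$cmd"
}
```

For example, setting DRY_RUN=1 and running 'run_as someuser "kill -1 -1"' shows the su command line without executing anything.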

(On the 'good' side, we got to immediately add more memory to that server.)


Comments on this page:

From 129.102.5.21 at 2013-08-08 12:18:07:

Couldn't you just use the nologin file (/etc/nologin on Linux, /var/run/nologin on BSD)? It is designed for this kind of purpose and it still allows root to su to regular user accounts.

-- Arnaud Gomes

From 91.198.246.131 at 2013-08-08 12:28:36:

sudo is your friend, Chris. It stands in for su very well in virtually every respect (even if you want to require the target user's password to be entered, just like su does). sudo can preserve $SHELL, which I rely on regularly (I log in to my account, then sudo to root and still work in zsh with my config, not the bash that root has set as its default).

From 138.246.85.204 at 2013-08-08 14:00:40:

/etc/nologin and pkill -u $USER... ;)

From 75.119.247.31 at 2013-08-08 18:39:16:

You're quite the reasonable person, trying to get everyone to log off first. :)

Generally if you give people (e.g.) at least a week's notice, it may be easier to simply say "okay, it's the appointed time" and do an "init 0".

I don't think you'd qualify as a BOFH. :)

From 188.162.36.124 at 2013-08-13 04:10:17:

I usually use `killall -u user` when I need to kill all of a user's processes.

By cks at 2013-08-13 16:31:32:

Belatedly:

The problem with /etc/nologin is that it locks out everyone except root. We wanted to lock out normal users but not staff. Things like pkill or killall were what I should have used, but for me it's faster to su and then kill -1 -1 than remember exactly what arguments I need (and test them and so on).

Sudo versus su is one of those historical and cultural issues. I will summarize it by saying that we make basically no use of sudo (and it's not a direct substitution for su in this situation since they take different arguments).

On the advance notice: when we give a week's advance notice to everyone we don't wait. This was a lot less notice (in fact I sent the email only minutes before I made my mistake).

Written on 08 August 2013.

Last modified: Thu Aug 8 11:33:01 2013