Wandering Thoughts archives

2009-11-28

'Conditional restart' in init.d scripts can be dangerous

Yesterday, the lighttpd instance that I run on my workstation was effectively down for about twelve hours; while the daemon was running, it was using the wrong configuration file and so it wasn't really serving anything. In turn, this happened because I installed a lighttpd package update, and as part of the post-update actions the package did '/etc/init.d/lighttpd condrestart'.

In theory, conditional restart in an init.d script will only restart things if the init script has started the daemon in the first place. This is subtly different from 'if the daemon is running', which is what many init.d scripts implement, and what happened to me illustrates the importance of that difference. I don't start lighttpd with /etc/init.d/lighttpd, I start it with a different init.d script that points it to my local configuration file, so when the normal init.d script 'restarted' lighttpd, the new version was running with the system configuration file and thus not doing much.
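To make the difference concrete, here is a minimal sketch of the two checks. The function names and the idea that the init script records the PID it started in its own pidfile are assumptions for illustration, not taken from any real init script. Note that even the stricter check is fooled if some other script writes to the same pidfile, or if the PID has been reused.

```shell
# Hypothetical check: "did *this* init script start a daemon that is still
# alive?"  Assumes the script wrote the daemon's PID to its own pidfile.
started_by_this_script() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    pid=$(cat "$pidfile") || return 1
    case $pid in
        '' | *[!0-9]*) return 1 ;;    # not a plain number: treat as stale
    esac
    # kill -0 sends no signal; it only tests that the process exists
    # (and that we are allowed to signal it, which root always is).
    kill -0 "$pid" 2>/dev/null
}

# The looser check: "is any process with this name running at all?",
# no matter who started it or with what arguments.
any_copy_running() {
    pgrep -x "$1" >/dev/null 2>&1
}
```

A condrestart built on the second function is the kind that will happily 'restart' a daemon that something else started with different arguments.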

I can't blame lighttpd and its init script for this problem; it's relying on standard functions provided by the Fedora init.d environment. And I can't really blame Fedora's init.d environment, because the problem is subtle and reasonably difficult to do completely correctly (and I've seen the same problem on other Linuxes). But regardless of where any fault is or isn't, the underlying issue is that 'condrestart' and similar features are dangerously fragile.

The only way to fix this and make conditional restart reliable is to make the daemons restart themselves; on some signal, any running copy of the daemon arranges to re-exec itself with appropriate command line arguments, environments, and so on. Then the init.d condrestart action simply sends this signal to all copies of the daemon that are currently running and lets them sort it all out.

(As a bonus you will have arranged to fix any copies of the daemon that are running, regardless of how they got started, which is probably what you really want to do.)
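As a rough illustration of the daemon side, here is a toy stand-in written as a shell script. Everything in it (DAEMON_SELF, restart_in_place, daemon_main, the choice of SIGHUP) is made up for the sketch; a real daemon would do the equivalent in its own language and would want an absolute path to itself.

```shell
#!/bin/sh
# Toy daemon that re-execs itself on SIGHUP (all names are hypothetical).
# Remember how we were invoked; defaults to the path this script was run as.
DAEMON_SELF=${DAEMON_SELF:-$0}

restart_in_place() {
    # exec replaces this process with a fresh copy of the program, keeping
    # the same PID, the same arguments, and the inherited environment.
    exec "$DAEMON_SELF" "$@"
}

daemon_main() {
    # The trap body is expanded when the signal arrives, so "$@" is the
    # argument list that daemon_main itself was started with.
    trap 'restart_in_place "$@"' HUP
    while :; do
        sleep 1     # a real daemon would be serving requests here
    done
}

# A real script would end with:  daemon_main "$@"
```

With something like this in place, the init.d condrestart action only has to send SIGHUP to every running copy; it never needs to know which configuration file each copy was started with.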

If you do not do this, please create an officially supported and documented way of changing all of the command line parameters that your init.d script uses to start the daemon, or as a minimum changing the configuration file.

(Note that this being official is important, because that means that I can count on it not breaking over updates.)
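One possible shape for such an override is sketched below. The override file and the LIGHTTPD_CONF and LIGHTTPD_OPTS variable names are assumptions for illustration; only lighttpd's '-f' option for naming its configuration file is real.

```shell
# Hypothetical fragment of an init.d script: build the daemon's arguments
# from packaged defaults plus an administrator-owned override file.
build_daemon_args() {
    overrides=$1        # absolute path to the override file
    LIGHTTPD_CONF=/etc/lighttpd/lighttpd.conf   # packaged default
    LIGHTTPD_OPTS=
    # The override file belongs to the local administrator, so package
    # updates should never replace it.
    if [ -r "$overrides" ]; then
        . "$overrides"
    fi
    echo "-f $LIGHTTPD_CONF${LIGHTTPD_OPTS:+ $LIGHTTPD_OPTS}"
}
```

An override file containing the single line 'LIGHTTPD_CONF=/local/lighttpd.conf' would then survive a package update and still be honored by a condrestart.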

CondRestartDangerous written at 01:48:12

2009-11-26

Some notes for myself on git bisect

I know I'm late to the party on this one, but I just used git bisect for the first time today in order to hunt down where a kernel bug started showing up. Since there are more kernel bugs in my future, here are some things that I want to remember about the process and other random comments.

(I'll probably have more later when I do it again.)

First, I have to say that the whole process really is both cool and addictive, and it works. It doesn't help the addictiveness that each step usually gets faster and faster as you get closer and closer to identifying the bad commit (because you are generally rebuilding less and less).

Notes:

  • I can't see an easy way to go back a step (in case you make a mistake in 'git bisect [good|bad]'), so it might be a good idea to keep a log of the start points and the steps. When I do this again, I'll keep lab notes, and include the commit IDs of each step.

    Update: I'm wrong and there is a way to back up steps. In comments, Sergey Vlasov pointed out git bisect log and associated things.

    (I was especially nervous about this because I was building the kernel on one machine and testing on another, all while doing other things.)

  • having the tree be unbuildable is irritating but it happens every so often. If it does, the first thing to do is to stop and analyze why it's broken and what's going on, not to blindly start doing 'git bisect skip'. As it happens I got away with my first few uses of skip (done hastily before I dug deeper), but I was lucky.

    (Once I figured out the bad change, I was fortunately able to fix it up by hand so that the tree could be built. Which was important, because the first bad commit turned out to be in that otherwise unbuildable section.)

  • if I'm restricting the bisection to an area of the tree (and possibly if not), the interwoven 'branch and merge' kernel development means that a bunch of changes can show up basically out of nowhere due to an out-of-area merge. Things like 'git bisect visualize' are not great at helping sort this out, because they restrict your view to just the area of the tree that you're fixed on at the moment.

    (In hindsight it might have been faster to clone the repository and start an automated 'git bisect run' pass to find the change that broke the build. Instead I did it by various flailing around with 'git blame', 'gitk', and so on.)

  • gitk in a repository that's being bisected does somewhat odd things. If I want gitk instead of git bisect visualize, it's simpler to do it in a separate master repository.

  • restricting the bisection to a narrow area of the tree is good because it can speed things up a lot, but it's also potentially dangerous since you're implicitly assuming that things broke because of changes in that area of the tree, and this might not be correct. This is probably especially a concern if, like me, you're not all that familiar with kernel internals and are just going on guesses like 'let's restrict things to drivers/net/wireless since it's a wireless card that broke'.

    (I had a nervous moment when the tree stopped building, because the breakage wasn't in the area that I was bisecting on. That's what rubbed my nose in how merges make out-of-area code changes show up. For bonus nervousness, the breakage was in net/wireless, a directory that I had not previously been aware of.)

  • I would really like to be able to easily build 32-bit kernels on 64-bit machines.

    Update: I'm wrong (apparently I'm too scarred by memories of trying to build 32-bit RPMs on 64-bit Fedora). Sergey Vlasov also pointed out 'make ARCH=i386 ...', which works fine.

  • there must be a better solution to pushing kernels from a build server to the target machine than rsync'ing the entire kernel build tree just so I can run 'make modules_install install' on the target. I just have to find it.

    (I built on a separate machine because the target machine was a slow laptop. As it happens, all of our good servers to build on are 64-bit, so I had to use a less than ideal 32-bit server.)
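The 'git bisect log' approach from the update to the first note can be tried out safely on a throwaway repository. This is only a sketch: the repository, the tag names, and the 'mistake' are invented, and in real use you would edit the saved log by hand in an editor rather than with grep.

```shell
# Throwaway repository with eight trivial commits (all names are made up).
repo=$(mktemp -d)
steps=$(mktemp)
git -C "$repo" init -q
git -C "$repo" config user.email demo@example.invalid
git -C "$repo" config user.name "Bisect Demo"
for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > "$repo/counter"
    git -C "$repo" add counter
    git -C "$repo" commit -q -m "commit $i"
    if [ "$i" -eq 1 ]; then git -C "$repo" tag demo-good; fi
done
git -C "$repo" tag demo-bad

git -C "$repo" bisect start demo-bad demo-good >/dev/null
before=$(git -C "$repo" rev-parse HEAD)   # the commit bisect wants tested first

# Suppose this answer turns out to be a mistake.
git -C "$repo" bisect good >/dev/null

# Save the steps taken so far, minus the mistaken answer.
git -C "$repo" bisect log | grep -v "^git bisect good $before" > "$steps"

# Start over and replay the corrected log.
git -C "$repo" bisect reset >/dev/null 2>&1
git -C "$repo" bisect replay "$steps" >/dev/null
after=$(git -C "$repo" rev-parse HEAD)
[ "$before" = "$after" ] && echo "back at the step before the mistake"
```

Saving 'git bisect log' output after every step also gives you the lab notes for free, commit IDs included.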

In this case it would have been possible to totally automate the kernel testing (the laptop has Ethernet, and the wireless failure is easy to observe from a script), but I'd have had to build an entire set of scripts and sudo operations and so on and it just wasn't worth it for this. However, if I do this regularly I should look into it, since a by-hand git bisect can clearly totally eat all of my day.

(The actual work didn't take much time, but git bisect is by and large fast enough that I was constantly being interrupted to do the next by-hand thing.)
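For reference, the automated version is 'git bisect run', which drives the whole bisection from a test command's exit status. The sketch below runs it against a throwaway repository where the 'bug' is simply a file called BROKEN appearing; the repository, tag, and test are invented stand-ins, and a real kernel test would have to build, install, boot, and probe the wireless instead.

```shell
# Throwaway repository: ten commits, with the "bug" introduced in commit 6.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.email demo@example.invalid
git -C "$repo" config user.name "Bisect Demo"
for i in 1 2 3 4 5 6 7 8 9 10; do
    echo "$i" > "$repo/counter"
    if [ "$i" -eq 6 ]; then : > "$repo/BROKEN"; fi
    git -C "$repo" add -A
    git -C "$repo" commit -q -m "commit $i"
    if [ "$i" -eq 1 ]; then git -C "$repo" tag known-good; fi
done

git -C "$repo" bisect start HEAD known-good >/dev/null
# Exit status 0 means "good", 125 means "skip this commit", and any other
# status from 1 to 127 means "bad".
git -C "$repo" bisect run sh -c 'test ! -e BROKEN' >/dev/null
first_bad=$(git -C "$repo" log -1 --format=%s refs/bisect/bad)
git -C "$repo" bisect reset >/dev/null 2>&1
echo "first bad commit: $first_bad"
```

The exit status 125 convention is what lets a test script say 'this commit does not build' without mislabeling it as good or bad.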

GitBisectNotes written at 01:53:47

2009-11-24

An important lesson for me on Fedora upgrades

As I mentioned in a recent entry, Flash has been broken for me for all of Fedora 11 on my 64-bit machine. The comments on my entry caused me to dig into the situation, and after some experimentation (building a freshly installed 64-bit Fedora 11 in a virtual machine to see if Flash worked, which it did), I found my problem and now have working Flash.

In fact, I found the problem doing what I should have done in the first place; I ran 'package-cleanup --problems', which reported that I had a libflashsupport RPM (in both 64-bit and 32-bit versions) that had unmet dependencies on specific versions of libcrypto.so.6 and libssl.so.6. Removing the libflashsupport RPMs unsurprisingly made Flash work again.

(libflashsupport seems to come from Fedora 8. package-cleanup comes from yum-utils, which I think everyone should have installed.)

I think it's time that I ran all of package-cleanup's tests to make sure my machine is relatively clean. In fact, I should make a habit of doing this after every Fedora upgrade that I do, because upgrading can easily leave you with exactly this sort of problem.

(Note that not all of what package-cleanup complains about is a genuine problem, especially for its --orphans option.)
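A post-upgrade habit along these lines could be wrapped up as below. The function name and the section labels are made up; the three package-cleanup options are real, although they are only some of its tests.

```shell
# Hypothetical wrapper around a few of package-cleanup's checks.
post_upgrade_checks() {
    if ! command -v package-cleanup >/dev/null 2>&1; then
        echo "package-cleanup not found (it is part of yum-utils)" >&2
        return 127
    fi
    echo "== unmet dependencies =="
    package-cleanup --problems
    echo "== duplicate packages =="
    package-cleanup --dupes
    echo "== packages not in any enabled repository =="
    package-cleanup --orphans
}
```

The output still needs a human to read it, since a leftover package is not always a genuine problem.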

CheckForPackageProblems written at 12:15:15

2009-11-23

My current unhappy thoughts on Fedora 12

Right now, I have two machines (a 64-bit desktop and a 32-bit laptop) at Fedora 11 and one (another 64-bit desktop) that's still back at Fedora 8. Upgrading to Fedora 12 soon is the obvious thing to do, since there are drawbacks to waiting too long to upgrade (although this is not an issue if you use PreUpgrade or yum-based upgrades).

Except, well, I haven't been having the best of luck with Fedora 11. On my desktop, Flash has been broken for all of Fedora 11 (and the free alternatives don't work for me), and then sound stopped really working in the 2.6.30 kernels. On my laptop, wireless stopped working with the 2.6.30 kernels (I'm detecting a trend). It doesn't seem likely that upgrading to Fedora 12 will fix those problems, especially the kernel related ones.

(I have not bothered filing a bug for the sound issue, because my impression is that sound is a huge mess in Fedora right now and worse, it is partly a political mess. I've certainly seen Fedora bugs of 'my sound card stopped working' be answered with replies that boil down to 'well, that's what you get for buying a sound card from people who aren't open source friendly'.)

I would like to upgrade to Fedora 12; I generally like getting the new stuff (although not always), and it avoids various future issues. But upgrading doesn't seem like a wise decision right now, and I'm not convinced that it ever will be; I have no confidence that any of my issues will get solved over the lifetime of Fedora 12.

(My cynicism suggests that things that stay broken in kernels for more than a relatively short amount of time stay broken for good, because no one cares enough to try to fix them. I can't blame the kernel hackers; I certainly don't have enough energy to try to build stock kernels and git bisect my way to the changeset that broke wireless. Not on what is an old and slow system that, for now, works fine when I stick to an older Fedora 11 kernel.)

What this really leaves me nervous about is the further future. If these issues aren't fixed in Fedora 12, will they be fixed in Fedora 13, Fedora 14, and so on? The odds seem against this, and there's only so long I can run Fedora 10 and Fedora 11.

Sidebar: dealing with my Fedora 8 machine

This means that I should bite the bullet and do the odd thing of upgrading the Fedora 8 machine to Fedora 10. Yes, it's just about to go out of support, but I don't really have a choice; it's the most recent Fedora where Flash worked for me in a 64-bit environment.

ConsideringFedora12 written at 00:47:08

2009-11-07

A gotcha with Bash on Ubuntu 8.04

Suppose that you have an Ubuntu 8.04 system where you have opted to make /bin/sh be bash, the way it used to be in 6.06, and you have an account with /bin/sh as the login shell (for example, you created it with plain useradd). So you log in to the account and everything seems normal and bash-y, until you try to do filename completion and get:

$ cd /-sh: <( compgen -d -- '/' ): No such file or directory

(The 'cd /' at the start of the line is what you typed before you hit <TAB>; the rest is Bash's error message.)

I'll give you the fix first: use chsh to change your shell to be /bin/bash. Then everything will work right.

This is one of those interestingly misleading error messages, although if you read very carefully Bash is actually sort of telling you what is going on. Let me give you a related example:

$ cat < 'a random name'
sh: a random name: No such file or directory

This error message has the same form as the first one but makes it much more obvious what the shell is complaining about.

For filename completion, what seems to be going on is that when Bash is operating in sh-compatible mode as a login shell, it is bash-like enough to cause the Ubuntu 8.04 default dotfiles to load the bash command line completion shell functions, but those functions use Bash-specific syntax (here, process substitution with '<( ... )'). As a result, Bash in sh-compatible mode interprets the compgen command seen in the error message as one giant redirection and, of course, cannot find such a peculiarly named file.

(I spent a long time being confused by the error message because I didn't read it carefully and thus didn't realize that it was complaining about a failed redirection instead of a failure to find compgen.)

Short summary: this is an Ubuntu 8.04 bug caused by them not expecting /bin/sh to be Bash and to be used as a login shell, although this is a theoretically supported configuration. This doesn't really surprise me; we've had plenty of experience to the effect that Ubuntu goes off the rails when you depart from their one standard configuration.

BashCompletionIssue written at 01:24:39


