Wandering Thoughts archives

2012-04-22

I may be wrong about my simple answer being the right one

In a recent entry I wrote about how I had misinterpreted an error message from bash about a script failing, and I also mentioned in passing that if I had paid attention to the structure of the error message I would have known that I was wrong. I take that back. Detailed investigation has now left me more confused than I was before and less confident of what exactly my co-worker's problem was (and absolutely sure that paying attention to the structure of the error message does not really help). The problem is that bash is too smart for its own good in its error messages; it is smart enough to add extra detail to them, but not smart enough that we can tell what the actual error is.

As a reminder, here's bash's error message:

bash: /a/local/script: /bin/sh: bad interpreter: No such file or directory

You would think that this means that /bin/sh is not present; after all, it is the straightforward interpretation of the error, plus bash has actually gone out of its way to give you a more detailed error message. Unfortunately, that is the wrong interpretation of the error message. What bash is really reporting is two separate facts:

  • /bin/sh is the listed interpreter for /a/local/script
  • when bash attempted to exec() the script, the kernel told it ENOENT, 'No such file or directory'.

Bash does not mean that /bin/sh is missing; it never bothers to check that (and arguably it can't do so reliably). This matters because, as we saw in my previous entry, the kernel will also report ENOENT if the ELF interpreter for a binary is missing. And, you guessed it, you get exactly the same error message if your script has a #! line that points to a binary with a missing ELF interpreter:

bash: /tmp/exmpl: /tmp/a.out: bad interpreter: No such file or directory

(/tmp/a.out exists and is nominally executable, but I binary edited it to have a nonexistent ELF interpreter.)
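To see the ambiguity directly, here's a minimal C sketch (my own illustration, not anything bash itself does) that execs whatever path you give it and reports the resulting errno. Point it at a script with a missing #! interpreter and then at one whose interpreter has a broken ELF interpreter, and you get the same 'No such file or directory' both times:

/* exec a path and report the errno; userspace only sees ENOENT,
   not which piece of the chain was actually missing. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s /path/to/script\n", argv[0]);
        return 1;
    }
    execv(argv[1], &argv[1]);
    /* we only get here if the exec failed */
    fprintf(stderr, "exec of %s failed: %s\n", argv[1], strerror(errno));
    return 1;
}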

So in my co-worker's case, we can't definitively conclude that /bin/sh was temporarily missing. All we know is that for some reason the exec() returned ENOENT, and that there are at least two potential reasons for it. A missing /bin/sh symlink is still probably the most likely explanation, but on a system that's under unusual stress, things start getting rather uncertain.

(I am far from certain that I could predict all of the reasons that the Linux kernel would return ENOENT on exec() without actually tracing the kernel code. And even then I'm not sure, since there's a lot of deep bits involved and thus a lot of code to really understand.)

BashNoInterpreterMsgII written at 03:12:22

2012-04-20

Sometimes the simple answers are the right ones (a lesson from bash)

A co-worker recently had a cronjob report the following error message, which he asked us for help with:

bash: /a/local/script: /bin/sh: bad interpreter: No such file or directory

At the time that this happened, other messages got logged suggesting that the machine was also apparently under memory pressure.

When I saw this, my mind immediately jumped to ELF interpreters, better known as the dynamic loader. I promptly suggested that maybe the kernel wasn't able to load the dynamic loader for sh, perhaps because of memory pressure or something. However, I was wrong, as some web searches for the error message showed me when I bothered to do them. What's going on is much simpler (although maybe odder) than something complicated about some part of the dynamic loader not working right.

In fact, what's going on is right there in the error message if I had bothered to read it. Here, let me show you with a little test:

$ bash
$ cat /tmp/exmpl
#!/bin/shx
echo hi there
$ /tmp/exmpl
bash: /tmp/exmpl: /bin/shx: bad interpreter: No such file or directory

As odd as it sounds, this error message was almost certainly generated because (temporarily) there was no /bin/sh. Since /bin/sh is a magically maintained symlink on many current Linuxes, this is slightly less odd and peculiar than it seems (it's possible that some package manipulation made the symlink disappear temporarily), but it's still pretty odd.
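If you ever want to catch this sort of thing in the act, here's a quick C sketch (mine, purely illustrative) that reports whether /bin/sh currently exists and, if it's a symlink, where it points; something like it could be run from a cron job wrapper to record the state of /bin/sh at the moment a failure happens:

/* report the current state of the /bin/sh symlink */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st;
    char target[4096];
    ssize_t n;

    if (lstat("/bin/sh", &st) != 0) {
        printf("/bin/sh is missing: %s\n", strerror(errno));
        return 1;
    }
    if (S_ISLNK(st.st_mode)) {
        n = readlink("/bin/sh", target, sizeof(target) - 1);
        if (n >= 0) {
            target[n] = '\0';
            printf("/bin/sh -> %s\n", target);
        }
    } else {
        printf("/bin/sh exists and is not a symlink\n");
    }
    return 0;
}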

To sum up: bash really was telling my co-worker what was wrong, and the error report was not some peculiar coded message in a bottle that needed a complex, obscure interpretation. The simple answer was the right one. Sometimes that's just how it goes; not everything in system administration is a complex puzzle.

(As it happens I feel that if I'd paid attention to the error message and how it was structured, I would have seen that my complex theory was pretty sure to be wrong. But that's sufficiently tangled to need another entry.)

BashNoInterpreterMsg written at 01:32:47

2012-04-17

ls -l should show the presence of Linux capabilities

The other day I discovered that /bin/ping is not setuid on Fedora 16 machines, in particular on my machines. But ping still worked, and trying to strace it showed that it somehow still had some sort of setuid-like privileges (run under strace it stopped working, failing with permission errors, just as a setuid program does when tracing strips its privileges).

(Ping normally needs to be setuid root in order to send raw ICMP packets. You can question why this is a privileged operation and not just another socket type, but that's the historical practice.)

On a regular Fedora 16 machine I might have said 'oh, SELinux' and left it at that. But my machines have SELinux disabled, so something more was clearly going on. By prodding my fading memory and doing some searching with man -k, I eventually found capabilities and especially getcap, and it told me most of the answer:

/bin/ping = cap_net_raw+ep

Now here's the problem with this picture: when I did an ls -l of ping, there was no sign that ping also had capabilities. In fact there was no sign that ping had anything beyond normal permissions (and the normal permissions were showing that it had no setuid).

I don't expect ls to have shown me the capabilities themselves. But I've come to feel strongly that ls -l should always indicate if a file has additional attributes of some sort. If a file has ACLs, or capabilities, or even extended attributes, ls -l should display this. The reason is straightforward pragmatics; putting something in ls -l creates visibility, while having ls -l silent makes it more or less invisible.

(I was able to work out what was going on because I'm lucky and reasonably well informed, but plenty of people are not.)
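To show how cheap such a check could be, here's a rough C sketch (my own illustration, not actual ls code) that detects file capabilities by asking whether a file has the security.capability extended attribute, which is where Linux stores them:

/* print a marker for files that carry Linux file capabilities,
   by probing for the security.capability extended attribute */
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        /* size 0 just asks whether the attribute exists and how big it is */
        ssize_t n = getxattr(argv[i], "security.capability", NULL, 0);
        printf("%s%s\n", argv[i], n >= 0 ? " [has capabilities]" : "");
    }
    return 0;
}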

This is especially important on a modern Linux system because modern Linux systems are already completely overgrown with special permission systems for having things magically happen or magically acquire privileges (and most of these systems are effectively invisible). The last thing modern Linux needs is more invisible permission systems; instead, programs such as ls should be working to make them visible.

(All of this was sparked by a Twitter conversation with @aderixon.)

Sidebar: why ls -l having a marker would work

The goal of putting a marker in ls -l output is twofold. If you already know about capabilities it would tell you that the file has them and you can get additional information with getcap and so on. If you don't know about capabilities, seeing something odd in the ls -l output would at least tell you that there's something odd about the file. You could then read the ls manpage to find out what it means, which would lead you to capabilities and getcap and so on. Or to ACLs, or to whatever else people want to add.

(Note that various versions of ls (Linux's included) have already fiddled around with the textual representation of permissions in ls -l to add various things, so it's not as if the format of this is frozen in stone beyond change.)

LsShowCapabilities written at 00:30:19

2012-04-05

Why I hate having /tmp as a tmpfs

There is a recent move in Linux to turn /tmp into a tmpfs. As a sysadmin, I am afraid that I have a visceral dislike of this (and always have).

The core problem with a RAM-backed /tmp is that it creates a new easy way to accidentally DOS your machine. When /tmp is just a disk filesystem, it's pretty clear how much space you have left, and filling up /tmp is only a minor to moderate inconvenience. When /tmp is backed by RAM, filling up /tmp means driving your machine out of memory (something that Linux generally has an explosive reaction to). Worse, how much /tmp space you really have is unpredictable because it depends on how much RAM you need for other things. In theory this might be predictable, but in practice RAM demands are subject to abrupt and rapid swings as programs start and stop and change what they're doing.

(Even without a bad reaction from the Linux kernel to an OOM, an OOM situation is both worse and more wide-ranging than having /tmp or even the root filesystem run out of space. Being out of memory affects pretty much everything on the machine, and that's assuming you don't have enough swap space to cause your machine to melt down.)
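To make the 'nominal versus real' distinction concrete, here's a small C sketch (mine, not from the entry) that asks statvfs() how big /tmp claims to be; on a tmpfs those numbers are just a configured cap, not a promise that the memory behind them is actually free right now:

/* report the nominal size and free space of /tmp */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs vfs;
    if (statvfs("/tmp", &vfs) != 0) {
        perror("statvfs /tmp");
        return 1;
    }
    unsigned long long total = (unsigned long long)vfs.f_blocks * vfs.f_frsize;
    unsigned long long avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;
    /* on tmpfs, 'total' is the configured size limit (by default half of RAM);
       whether you can really write that much depends on memory and swap pressure */
    printf("/tmp: %llu MB total, %llu MB nominally available\n",
           total / (1024 * 1024), avail / (1024 * 1024));
    return 0;
}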

This is bad enough on a single-user machine, where at least you are only blowing your own foot off when you DOS the machine through an accidental OOM because you (through old habits) or your program (through not being revised to the latest nominal standards) innocently put something sufficiently large in /tmp. On shared multi-user machines it's pretty close to intolerable; the damage done is much larger and so are the chances of it happening, since all it takes is one person having one accident.

(By the way, this is not theoretical. We have had people put multi-gigabyte temporary files in /tmp, especially on our compute servers. Sometimes they manage to fill /tmp up, even though it has many gigabytes of disk space.)

Ultimately, what making /tmp into a tmpfs does in practice is to make the machine more fragile. How much more fragile depends on what happens on the machine, but it's undeniably more fragile. I don't like things that make my machines more fragile, so I don't like this.

By the way, I'm aware that other systems (such as Solaris) did this years ago. I didn't like this transition on them either, for exactly this reason. I consider it a really good thing that only staff can log on to our Solaris machines, because a RAM-backed /tmp makes them too fragile for me to be happy with general access to them.

(See also SysAdmin1138.)

Sidebar: the right way to do this transition

It's really simple: make a new /tmpfs mount point that is, well, a tmpfs. The latest new standards make it clear that any number of programs need revising anyway to put their data in the right place; while you are revising those programs, you can perfectly well make them use /tmpfs when appropriate. And the result does not blow people's feet off when they continue following decades of user behavior and program defaults. If you want and it makes you feel better, you can then make /tmp into a symlink to /var/tmp.

(As usual, this is certain Linux people not solving the real problem.)

WhyNotTmpAsTmpfs written at 01:09:48

