|
2012-01-24
Why I use exec in my shell scripts
As with the little example yesterday, a fair number of my shell
scripts end with running a program and when they do, I almost invariably
go the little extra distance and do it with exec. In the old days, the
reason to do this was that it used slightly less resources, since it got
rid of the shell process and left only the process for the real program
you wound up running. But, while I was around then, the reason I use it
today isn't that; it's that it lets you freely edit the script while
that final program is running.
At this point some of you may be going 'wait, what?' That's because
most Bourne shell implementations are a little bit peculiar.
In most interpreted languages on Unix (like Python, Ruby, and Perl),
the interpreter completely loads and parses the script file before it
starts running it. This means that once your script has actually started
running, once that initial load and parse has finished, you can freely
change the script's file without the interpreter caring; it will only
look at the actual file and its contents again if and when you re-run
your script.
Bourne shell implementations have historically not worked this way (and
it's possible that it's actually impossible to preparse Bourne shell
scripts for some reason). Instead they not only parse the script on
the fly as it executes, but also they read the file on the fly as
the script runs. This means that if you edit a shell script while
it's running you can literally shuffle the code around underneath the
script. When the shell resumes reading and parsing the script after the
current command finishes, it can be reading from partway through a line,
from something that it had already read, or (if you deleted text) wind
up skipping over something that it should have run. This often causes
the shell script to fail with weird errors or, worse, to malfunction
spectacularly. This can happen even if the shell is on the last line of
the script.
But if you end a shell script with exec, you avoid this. The actual
shell interpreter effectively exits (by turning itself into the actual
program) and so there's nothing there to try to read anything more and
get confused by your edits.
(Of course nothing helps if you can't use exec; then you just have to
remember to never edit the script while it's running, at least with an
editor that overwrites the file in place.)
Sidebar: a detailed example of what happens
Let's start with a little script:
#!/bin/sh
echo "a"
firefox
Run this script. While Firefox is running, edit it so that the echo
string is four or five characters longer (using vi or some other
editor that overwrites files in place). When you exit Firefox, the
script will complain something like 'script: line 4: efox: command not
found'.
When the shell was running Firefox, its read position in the file was
just after the newline at the end of firefox. When you edited the
script and added more letters, that same byte position was now pointing
to the e in the 'firefox'. When Firefox exited and the shell resumed
reading from that byte position, it read 'efox<newline>', saw a
perfectly valid command execution, and tried to run 'efox' (and
failed).
(It reports that this happened on line 4 because it knew it had already
read three lines, so clearly this is line 4. As a corollary, you can't
trust the line numbers that are printed when something like this
happens.)
WhyShellScriptExec written at 00:06:12; Add Comment
2012-01-22
My view of the purpose of object orientation
A while back I read Rise and Fall of Classic OOP.
This caused me to realize that I am kind of a heathen as far as
object oriented programming is concerned, probably because I came
to explicit OO late and never actually learned how to do it the
'right way'. You see, to me object orientation is a technique
for code organization and nothing more.
This gives me a very pragmatic view of when to write OO code and when
not to; I use objects and classes where they make my code simpler,
and I don't use them when they don't. I don't consider them something
that has to be followed at all costs or as the only way to model the
real world (or any arbitrary artificial world). If the real world
entities that you're working with aren't amenable to being wedged
into an OO hierarchy, then don't. Given the wide variety of both code
structure and ways of organizing code so that it makes sense, it would
be fairly absurd to say that OO is always the right answer; it is just
one technique among many. Sometimes it's the right answer, sometimes
not.
(Of course, some languages as so in love with OO that they don't give
you a choice about it; you can't really have freestanding functions and
data containers.)
I won't say that all of those OO examples that modeled the real world
always struck me as a bit hokey and artificial, because honestly I
never really thought that much about it (and any small example is hokey
and artificial if you really look at it). But if people are switching
towards my view of the purpose of OO, I'm all for it.
(I would be shocked if this was new and novel. I sure hope that lots of
people have had this thought before me, because it just feels so obvious.)
ObjectOrientationPurpose written at 02:27:01; Add Comment
2012-01-21
The C juggernaut illustrated
Perhaps it is tempting, looking back at history from the vantage point
of today, to say that C succeeded so much because it was at the right
place at the right time. As you could tell the story, all sorts of people
in the 1980s wanted a low level programming language, C was around,
and so they seized on it. Any similar language would have done; it's
just that C was lucky enough to be the one that came out on top, partly
because of network effects.
(This story is especially tempting to people who don't like C and
Unix.)
This significantly understates the real appeal of C at the time,
even and especially to people who had alternative languages. A great
illustration of this is C on the early Macintosh. You see, unlike
environments like MS-DOS (which had no language associated with it,
just assembler), the early Macintosh systems already had a programming
language; they were designed to be programmed in Pascal (and the Mac
ROMs were originally written in Pascal before being converted to
assembler).
This was more than just an issue of Apple's suggested language being
Pascal instead of C. The entire Mac API was designed around Pascal
calling conventions and various Pascal data structures; it really was a
Pascal API. Programming a Mac in C involved basically swimming upstream
against this API, full of dealing with things like non-C strings (if
I remember right, Mac ROM strings were one byte length plus data). I
believe that Mac C compilers had to introduce a special way of declaring
that a C function should have the Pascal calling convention so that it
could be used as a callback function.
Despite all of this, C crushed Pascal to become by far the dominant
programming language on the Macintosh. I don't think it even took all
that long. Programmers didn't care that dealing with the API issues
were a pain; working in C was worth it to them. It didn't matter that
Pascal was the natural language to write Mac programs in or that it was
a perfectly good language in its own right. C was enough better to
displace Pascal in a hostile environment.
C did not win just because it was at the right place at the right
time. C won in significant part because it was (and is) a genuinely good
language for the job it does. As a result it was the
language that a lot of pragmatic people picked if you gave them anything
like a choice.
CTriumph written at 01:03:49; Add Comment
2011-12-31
My thoughts on the mockist versus classicalist testing approaches
To summarize aggressively, one of the quiet long-running disputes
in the OO test driven community is between classical TDD, where you
use real or stub support classes, and mockist TDD, where you use
behavior-based mock objects. My guide on this is primarily Martin
Fowler (via
Jake Goulding).
Jake Goulding summarizes the difference as stubs assert state while
mocks assert messages (or method calls). I'm mostly a classicalist
as far as testing goes for no greater reason than I generally find
it easier, but reading Martin Fowler's article started me rethinking
my passive attitude on this. On the whole I think I'm going to
remain a classicalist, but I want to run down why.
One way to look at the divide is to look at what is actually being
tested. When you use stubs (or real objects), what you are really
testing is the end result of invoking the code under test. When you use
mocks, you're testing the path that code under test used to get to its
end result. So the real question is whether or not the path the code
under test uses to derive its result is actually important.
Phrased this way, the answer is clearly 'sometimes'. The most obvious
case is situations where the calls to other objects create real side
effects; for example, exactly how you debit and credit accounts
matters significantly, not just that every account winds up with the
right totals in the end. This means means that sometimes you should
use mocks. If you're testing with stubs and the path is important,
you're not actually testing the full specification of your code; the
specification is not just 'it gets proper results', the specification
is 'it gets proper results in the following way'.
At the same time I feel that you should not test with mocks unless the
specific behavior actually is part of the specification of the thing
under test. Otherwise what you are actually testing is not just that the
code works but also that it has a specific implementation. I strongly
dislike testing specific implementations unless necessary because I've
found it to be a great way to unnecessarily break tests when you later
change the implementation.
This also ties into what sort of interfaces and interactions your
objects have with each other; there's a whole spectrum of closely
coupled objects to loosely coupled objects to deliberately isolated
objects. Where you have deliberately isolated objects, objects used
to create hard and limited interface boundaries, I think you should
almost always test the behavior as well as the actual outcome for
things that call them (because you want to make sure that the interface
is being respected and used in the right way). Conversely, closely
coupled objects (where you are only using multiple sorts of objects
because it fits the problem best) are a case where I'd almost never test
behavior because the split into different objects is essentially an
implementation artifact.
(Possibly some or all of this is obvious to experienced practitioners.
One of my weaknesses as a programmer is that I learned programing before
both OO and the testing boom, and I have never really caught up with
either.)
MockistVsClassicalist written at 01:04:56; Add Comment
2011-12-16
Avoiding the classic C quoting bug in your language
To summarize an earlier entry, the classic C
quoting bug that I'm talking about here is writing 'printf(ustr)' or
'syslog(pri, ustr)' where 'ustr' is a string that comes from some
form of external input. In that entry I mentioned that it's possible
for languages to avoid this entire class of bugs with the right design.
Well, it's time to elaborate on that parenthetical aside.
To start with, let's turn the issue around: what is it about these
functions that causes the bug? The answer is that the bug happens
because all of these functions do two things; they do something useful
and in the process of doing it they format their arguments for you. The
bug happens when you (the programmer) just want to do the useful thing
and you overlook the fact that you're also getting the formatting for
free.
The conclusion is straightforward. To make this bug impossible, we need
to make functions like this do only one thing; they should take a plain
string and not format it. But people still need string formatting, so
we need some easy way for people to do it in order to make these single
purpose functions both feasible and convenient, a way that involves as
close to no extra work as possible. In short, we need effortless generic
string formatting.
(The need for no extra work is why we can't do this in C. We need a
language where you don't need to explicitly manage string lifetimes,
since the result of string formatting is another string.)
In theory you can implement generic string formatting as a function call
(ideally with a very short function name). In practice I think that
it isn't going to work the way you want because of perception issues;
if string formatting is just a function call, it's still tempting to
create convenience functions that bundle the two function calls together
for you (one to format the arguments then one to do the useful thing).
Doing generic string formatting as an operator (such as Python's '%')
has the pragmatic benefit of drawing a much more distinct line between
regular function calls and formatting strings.
(The third approach to effortless generic string formatting is string
interpolation in certain sorts of strings. This has the benefit of
sidestepping the entire problem, although it has problems of its own.)
PS: another approach in an OO language is to give strings an explicit
formatting or interpolation method, so that you might write '"%s:
%s".fmt(a, b)'. My gut thinks that this is closer to string formatting
as an operator than anything else.
AvoidingQuotingBug written at 00:36:14; Add Comment
2011-12-05
Debuggers and two sorts of bugs
In the middle of reading Go Isn't C I ran across
some remarks about how it was bad that programmers ignore debuggers
in favour of print statements. This immediately sparked my standard
reaction, which is that debuggers are focused
on telling you what happens next but generally I want to know how on
earth my code got into its current state. Then I started thinking some
more and realized that I was being too strong because this isn't really
accurate. In fact I use debuggers periodically, but only on certain
sorts of bugs.
Let us say that there are two sorts of bugs. For lack of better names, I
will call them direct bugs and indirect bugs. A direct bug's cause
can be determined immediately by looking at the call stack, the local
variables, the code, and so on at the time when it happens. You can
say 'oh, the caller forgot that this function couldn't be called with
a NULL', or see that you forgot to handle a case and something fell
through to code that should never have been reached in this situation.
A decent debugger works very well on direct bugs, or even features like
automatic call stack backtraces on uncaught exceptions (as you get in
languages like Python).
Indirect bugs are data structure corruption bugs (or sometimes flow of
control bugs), where you are now in a 'this can't happen' situation
(whether caught by an assert() or not). Finding the immediate problem
in the code or diagnosing the source of corrupt results is only a
starting point; the real challenge is discovering where and how things
went off the rails so as to get you to where you are now. Indirect bugs
are the bugs where you need to look back into the past to answer 'how
did I get here?' questions.
(For practical examples, my recent Liferea issue was a direct bug; if I had read the code
carefully, the first stack backtrace would have shown me the problem.
My SIGCHLD signal handler race in Python
was an indirect bug; I always knew what the direct problem was, but
I had no idea how the program got into that state until I did some
careful analysis.)
My unconscious bias until now has been that direct bugs are
uninteresting because they are easy to solve from basic inspection, so I
only really thought about what I wanted to deal with indirect bugs. But
the main reason that direct bugs are easy to deal with is that I already
have tools like stack backtraces and inspection of local variables so
it's easy to see what's wrong with the program's current state.
Sidebar: a hazard of dealing with indirect bugs
It's often popular to 'fix' an indirect bug that crashes the program or
generates obviously bad results by making the code accept the impossible
state; for example, by adding a NULL check to the low-level routine
that's crashing with a NULL pointer exception. This is generally the
wrong idea (you're treating the symptoms instead of the disease), but
it's tempting as a quick fix and it's an easy approach to fall into if
you don't understand the code well enough to know that you're dealing
with an indirect bug, not a direct bug.
This is one of the issues that always makes me wary about fixing
'obvious' crash bugs, especially if I want to send the fix
upstream. Before I add a NULL pointer check or the like I need to be
sure that it's the real bug, and I need to understand what the code
should do instead of crashing (which is not always obvious).
DebuggersAndBugTypes written at 00:38:56; Add Comment
2011-11-17
A classic and standard C quoting bug
I recently read Evan Martin's entry on quoting and escaping, which
brought to mind one of those classic C and Unix programming mistakes
which hopefully is not done very much any more.
Suppose that ustr is a string that comes from user input. The classic
form of this particular quoting bug is to write printf(ustr) (or
perhaps fprintf(out, ustr)); a much obvious version of this is
syslog(pri, ustr).
(Some people might think that the printf() version is silly since
there are much more nominally efficient ways to print plain strings, but
in practice plenty of people find it much more efficient on programmer
time to use printf() or fprintf() to output everything.)
The problem with all of these is that the first argument to printf()
and syslog() is not a plain string to be printed; it is a format.
Like many quoting bugs, this goes undetected for much of the time
because a true plain string is also a valid format that formats to
itself. However, if someone supplies a 'plain string' that includes
printf formatting directives, things rapidly go off the rails; if
you are lucky the program crashes right away so you can immediately
notice the problem.
(If you are sufficiently unlucky, this is an exploitable security
vulnerability. And yes, people have written code like this that
made it into important programs.)
The right solution is of course to quote the user supplied string. The
simple way to do this is to supply your own simple formatting string:
syslog(pri, "%s", ustr) (or the equivalent with printf() et
al). Of course at this point you might want to think about other bad
characters that could appear in ustr and how you want to display the
message to make it clear that the string comes from the user, not from
your program's internals.
This bug can happen with any function that takes a format string as an
argument and in any language, not just C. C is just a more dangerous
language to have it happen in because C generally has no check for the
wrong number of arguments being supplied to a function.
(I have opinions on how languages can avoid this entire class of bugs,
but that's something for another entry.)
(Perhaps this is not strictly a quoting bug. It is in my mind, but I
may have a somewhat odd view of what constitutes one.)
ClassicCQuotingBug written at 00:39:57; Add Comment
2011-11-12
(Not) parsing wikitext
I have a confession: I don't know how to parse WikiText. Oh, I know how
to process it (using regular expressions and ad-hoc code), but I don't
know how to write a formal parser for it, something based on parsing
theory in order to have all of the advantages of real parsers. Parsers
matter because they're much easier to write, to maintain, and to reason
about than ad-hoc code; ad-hoc processing code is ultimately just a big
pile of mud. Complex, fragile mud.
I've been thinking about this for some time, and I've more or less
concluded that there are three aspects of wikitext that make it hard
to handle with traditional parsing techniques and tools. The first issue
is that wikitext is a mixture of stream-oriented and line-oriented
content. All of the parsing techniques that I learned in my now long-ago
compilers course were built around stream oriented parsing, where you
simply have a running stream of tokens and the most line oriented stuff
you have is that end of line is a delimiter. This is fine for parsing
running text, but pretty much all wikitexts also have an entire second
layer of line to line markup to signal things like headings, lists,
quoted (or indented) text, preformatted text, and so on. You need to
parse this second layer as well as the running text, and you generally
can't parse each line's running text independently because markup
constructs in running text can span physical lines.
(Even Python is parsed with a straightforward stream
oriented parser under the hood; I wrote about it
here and here. Unfortunately the technique that Python uses
can't be extended to parsing wikitext.)
I'm sure that there are parsing techniques for dealing with line
oriented languages (my impression is that a number of early pre-Algol
languages were significantly line oriented), but I didn't learn any
such techniques and I think they're now out of favour because everyone
sensibly shifted to stream oriented language designs decades ago.
The second big issue is that unlike a traditional language, wikitext
doesn't have any errors. In a traditional parser, a parse production
either succeeds or results in an error. In wikitext, there are no errors
and something like an incomplete external link is simply raw text;
'[[some text' is not an error, it is '[[some text'. The presence
of errors allows parsers to avoid huge amounts of backtracking (or
equivalently, a huge amount of lookahead). The dominant forms of
parsing today have very little lookahead because people found that
that was both possible and efficient for computer languages (and then
people carefully fiddled with languages so that they didn't need
lookahead). More advanced things like parsing expression grammars can tolerate
more lookahead, but my impression is that even a PEG based grammar
struggles with wikitext's potential massive backtracking.
The third issue is that the proper meaning of a lot of wikitext is
highly context dependent. For example, consider whitespace at the
start of a line. In DWikiText, the standard meaning of this is
preformatted text, but if we're in a list context it instead means
a continued list entry. Similar issues can bedevil parsing various
font change characters in running text, especially if you want to
insist on proper nesting. I think that some of this can be handled
naturally by things like parsing expression grammars, but I'm not
sure that all of it can be.
(I read a paper from sweble.org
that admitted they didn't handle determining what was and wasn't a
font change in the parser but instead deferred all of that to a
post-processing pass over the resulting AST. Maybe it's still possible to do
this in clear, systematic code instead of gradually devolving back
to ad-hoc processing.)
ParsingWikitext written at 01:41:04; Add Comment
2011-10-24
Salting passwords compared to slow password hashing
When you're storing encrypted passwords, you can do two different things
to ruin the life of an attacker who got your password hashes, or at
least annoy them; you can salt your passwords in various ways and you
can use a slow password hashing function. These two have different but
complementary effects.
(I've talked about the effects of the various ways of salting passwords
back in SaltingPasswords.)
Salting passwords reduces the payoff an attacker gets for making
a single password guess. They slow down how fast an attacker can
compromise a bunch of your accounts, but salts do not slow down how
fast an attacker can compromise a specific account. Individually
salted passwords with a fast hash function increase the time to mass
compromises but do not necessarily increase the time to the first
account compromise.
Using a slow password hashing function slows down how fast an attacker
can make password guesses but by itself does nothing to reduce the
payoff the attacker gets for making those guesses. If you have no salts
you're vulnerable to rainbow tables, and if you just have a single
site-wide salt every (slow) password guess can be checked against all of
your users at once.
If you want to delay the time to the first account compromise, the worst
situation is a fast password hash (like SHA1 or MD5) with at most a
single site wide salt; an attacker can make very fast hash checks and
all they need is some person to have a reasonably weak password. If you
have per user salts with a fast hash, the attacker needs to get lucky
by picking a user that used a weak password. How much luck they need
depends on the password practices of your users; if most users choose
weak passwords, the time to first compromise will barely budge between a
site-wide salt and a per-user salt.
(If you want scary figures on how fast this can be
done today, see How To Safely Store A Password.)
It's tempting to summarize this as that having good salting means that
the attacker can't rapidly break lots of your users at once while
having slow hashing means that they can't rapidly break any of your
users, but it's not quite that simple. Even with slow password hashing,
if you don't have per user salts and you have users who pick (very)
weak passwords the attacker won't need to make very many slow password
guesses in order to find a vulnerable account.
(Even with a slow hashing function it doesn't take too much time to try
20,000 to 30,000 guesses. That's almost certainly enough to find most
seriously weak common passwords.)
(This entry was sparked by reading On cryptography and dogmas (via Hacker News).
I wanted to fix some of this in my own head.)
SaltingAndCryptSpeed written at 01:12:56; Add Comment
2011-10-11
Why people are attracted to minimal language cores
In the last entry I described how some languages
rewrite loops with loop bodies by turning the loop body into an
anonymous function that the loop invokes repeatedly. I then mentioned
that some languages go all the way to turning loops into tail-recursive
function calls. You might wonder why languages do this and why these
sort of crazy transformations are considered attractive.
There are at least two reasons that these things are popular, for some
definition of popular. First, the intellectual purity of a minimal
core language appeals to a certain sort of language wonk; they tend
to call the result simpler. These people are often drawn towards
Lisp (especially Scheme, perhaps the canonical illustration of this
philosophy in action).
(Lisp is not the only language family that has this sort of minimalism.
For example, I think that Forth is just as minimal in its own way,
although it gets far less language design attention.)
Second, it means that your code generator or interpreter core only needs
to handle a minimal set of things because higher levels have transformed
code written in the general version of the language down into this
minimal form (often automatically). This has traditionally simplified
optimizers; rather than implementing very similar analysis for each of
a whole bunch of control flow constructs, they only have to analyze one
really well. Which control flow construct you pick as your base depends
on what language you're compiling; some languages pick goto, for
example.
(Then you can get a PhD thesis or two out of how to do this analysis.)
My understanding is that the pragmatic evidence is mixed on whether
this is a good idea or not. There have certainly been some significant
successes, but I have also heard stories of compilers where the frontend
carefully reduced all control flow constructions down to the single
fundamental control flow atom, passed the whole thing to the optimizer,
and had the optimizer reverse engineer the high level control flow stuff
again from the low-level minimized control flow information.
(The argument for still doing this even when it's inefficient is that
this lets the optimizer (re)discover the true high level control flows
in your program, regardless of how you actually wrote the code. In a
sense it's discovering what you actually meant (or at least, what you
created) instead of just what you wrote.)
WhyLanguageTransformations written at 01:28:21; Add Comment
|
These are my WanderingThoughts
(About the blog)
GettingAround
Full index of entries
Recent comments
This is part of CSpace, and is written by ChrisSiebenmann.
* * *
Atom feeds are available; see the bottom of most pages.
This is a DWiki.
(Help)
Categories: links, linux, programming, python, snark, solaris, spam, sysadmin, tech, unix, web
|