Wandering Thoughts archives

2012-04-29

My two approaches to learning (programming) languages

Since I just learned enough JavaScript to do something, I've been thinking about how I get to the point with a programming language where I can do something useful in it. There are two major patterns that I can recognize in my past, which I will call the surgical strike and assembling a construction kit.

In the surgical strike, I focus on learning just enough of the new language (and its libraries) to write the little program that I want. Surgical strikes work best in 'obvious' languages without dangerous surprises, ones that are similar to things I already know, and on relatively simple problems that the languages are good at dealing with. I've started learning a whole lot of languages through surgical strikes, including both Perl and Python.

(For example, my first Python program was a quite simple 20-line script to scan a directory and generate a lilo.conf that corresponded to what kernels were there. Yes, this was quite a while ago.)

In the assembling a construction kit approach I spend a bunch of time learning much of the language (and its environment and libraries) in order to build up a 'construction kit' of how to use the language in my area. Only once I've assembled my kit can I actually tackle my real problem and write code. The construction kit approach is what I resort to when I'm faced with a complex environment, a novel language, or a problem that the language is not tuned for. I learned JavaScript (and JQuery) almost entirely this way, because it was clear to me that browser JavaScript is an inherently complex environment and any attempt at a surgical strike would probably produce a horrible mess of hacks.

The advantage of the surgical strike is that it gets me results fast, both solving my immediate problem and giving me firm evidence that I'm getting somewhere with the language. The drawback is that each step doesn't get me very far; it can take a lot of surgical strikes before I actually really know a language (and generally I will stop trying to do surgical strikes at some point and just thoroughly learn the remaining stuff).

The drawback of assembling a construction kit is that there is a lot of slogging before I feel I've actually achieved anything. I at least find it somewhat demotivating to spend a bunch of time reading through (eg) JavaScript and JQuery resources without actually doing anything real; sure, I've read things but that's not actual, visible progress. It's just a slog. The advantage of building a construction kit is that once I've assembled the kit I can suddenly make a lot of progress quite fast because there's a lot I can now build with very little further learning. Once it starts happening this is very motivating and makes a great payoff for all of my prior slogging.

The two are different experiences for me. I wouldn't call one better or worse, although it's easier to stay motivated with surgical strikes.

(Of course one can do things as a hybrid approach, for example assembling a construction kit with the base language and then mostly taking a surgical strike approach with some package or library you need to use. I sort of did this with the JQuery side of my JavaScript learning.)

LearningLanguagesTwoWays written at 02:13:06

2012-04-11

Faking (or not) a ternary if operator with && and ||

Suppose that you have a multiline if in some language and you want to rewrite it into a single line and a single expression; however, your language only has && and || operators and no ternary if operator (C's ?: operator). Can you do this rewrite safely?

To be concrete, let's do this in the Bourne shell (which is where I saw this done recently). Suppose you have the following multiline bit of shell script and you want to reduce it to something that's as short as possible:

if cmd1; then
  cmd2
else
  cmd3
fi

Is 'cmd1 && cmd2 || cmd3' a good translation of this?

(Let's ignore the set -e gotcha.)

The answer is no. It's easy to see why if I rewrite it with specific commands:

true && false || echo oops

This will echo 'oops', because cmd3 runs whenever cmd2 fails, not just when cmd1 does; the multiline if would never have run the else branch here.

The answer to the question turns out to depend on whether you care about the value of the ternary if (or the multiline if). If you don't care about the value and are evaluating things purely for their side effects, then you can write 'a ? b : c' as 'a && (b || true) || c', which avoids the problem by forcing the middle clause to always be considered true. If you do need to expose the (truth) value of b, this rewrite is no good and you are almost certainly up the creek; the only way out is if you know that b's value will always be considered true.

(I first saw this problem in Python, but seeing it reappear in the Bourne shell started me down the path of thinking about the general issue.)

The Bourne shell version of this rewrite would be

cmd1 && (cmd2 || true) || cmd3

I think that most of the time this should be fine; it would be very odd style to check the exit code of cmd2 or cmd3 after the if, and certainly the example I saw this in didn't do it. My one twitch is that I am reflexively nervous about () potentially introducing surprise subshells, but if cmd2 is a real command that's irrelevant.
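
Since Python is where I first ran into this, here is the same trap and the same workaround sketched in Python (b() and c() are just stand-in functions I made up for illustration):

# The naive and/or translation goes wrong when the middle value is false-y:
True and False or "oops"          # evaluates to "oops", not False

def b():
    print("doing b")
    return False                  # a false-y result would otherwise trigger c()

def c():
    print("doing c")

# Mirroring the shell fix; this only works if you just want the side effects:
True and (b() or True) or c()     # runs b() only, never c()

# If you do care about the value, Python (since 2.5) has a real conditional
# expression, which avoids the whole problem:
result = b() if True else c()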

TernaryIfVsAndOr written at 01:15:38

2012-04-10

Revisiting checksum functions

In his comment on Checksums and hashes, nothings correctly called me on some stuff in my discussion of checksum functions. So let me revisit my bit on checksums and try to do it better.

In a strict technical sense, per Wikipedia, there is a relatively narrow mathematical definition of a checksum function. Or perhaps a somewhat broader one (this is the joy of Wikipedia). In either case, it seems that at least some of what we informally think of as 'checksums' are not checksum functions in this mathematical sense.

But as nothings points out, in common programming use most people have a broader view of checksums; something is a checksum if it computes a value that we use to detect errors (well, changes) in the input data, more or less. The exact mathematics don't matter as much as the intended usage and result. Checksum algorithms vary in their complexity (and thus their speed), in the size of the output value (anywhere from one bit for parity, up to 16 or 32 bits for common fast checksums, to 128 bits or more for cryptographic hashes used as checksums), and in what sort of errors they detect; you get to pick an algorithm for your problem that meets your needs.

(Note that checksums are generally considered to not protect you against deliberate, malicious modification of the message. If you need that protection you need a good cryptographic hash, among other things.)

Using a cryptographic hash is an obvious brute force way to get a solid checksum function. They have relatively large output values (128 bits and up) and run quite slowly compared to something like Adler-32, but for many applications neither is a significant drawback. This makes cryptographic hashes the easy checksum choice for many applications; just run your data through the standard library SHA1 interface and you're done.
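
To show how little work that is, here is a rough Python sketch of that brute force approach with the standard hashlib module (the data and the comparison here are just made-up illustration):

import hashlib

def checksum(data):
    # SHA1 used purely as a change detector; 160 bits is far more output
    # than a fast checksum gives you, but it's one line of code.
    return hashlib.sha1(data).hexdigest()

original = b"some file contents"
stored = checksum(original)

# Later, recompute and compare to detect (accidental) changes.
if checksum(original) != stored:
    print("data has changed")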

(My view is that the rules for checksum functions are a lot like the rules for cryptography: never invent your own algorithms and never do your own implementations, because both are too easy to screw up. Screwing up a checksum is less explosively dangerous than screwing up cryptography, but why take chances if you don't need to?)

Sidebar: checksums versus error correcting codes

Checksums blur into the whole area of error correcting codes in general. I'm not sure where the boundaries get drawn, especially in common usage as opposed to strict mathematics. My mostly uneducated understanding is that something is a checksum if it has a constant-sized output regardless of the input size (eg SHA1 is always 160 bits regardless of how large the input data is); it turns into an error correcting code if it has to add X bits per Y bits of input.

However, the boundaries are clearly fuzzy here. For example, parity is considered a checksum in general usage but is usually done in fixed size blocks with one bit of parity per N bits of input. Possibly people just merge error correcting codes into a very broad view of 'checksums'.

(What this really means is that I don't know as much about the whole field of checksums as I should.)

ChecksumsRevisited written at 01:15:29

2012-04-06

Checksums and hashes

Every so often I run across something on the Internet that makes me grind my teeth at how wrong it is. Today's example is Jeff Atwood's Speed Hashing, where he significantly mangles what both checksums and hashes are in an otherwise reasonable article. Since it has made me grumpy, I'm going to write a summary of what they actually are.

(Yes, yes, you may now cue the obligatory XKCD strip.)

A checksum function is designed to always change its value if an N bit or less change happens to the input; what the N is depends on the specific checksum function. Checksum functions are used all over computing, often invisibly, and real checksums can be mathematically proven to have their desired properties. See Wikipedia for a far more thorough overview of ways of doing error detection and correction, where I discovered that a certain number of the things that I think of as checksums are technically other related things (eg ECC).

(In the fully general definition a checksum function changes value if up to N values in the input change. In computing, we generally take the value unit to be 'a bit'; in other applications of checksums they can be things like 'a printed digit'.)
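
For a toy but concrete illustration of the N=1 case, a longitudinal parity byte (the XOR of all the input bytes) is guaranteed to change if any single bit of the input flips. A quick sketch in Python, with made-up data:

def parity_byte(data):
    # XOR all of the bytes together. Flipping any single bit of the input
    # flips the corresponding bit of the result, so every 1-bit error is
    # detected (but two flips in the same bit position cancel out).
    result = 0
    for b in bytearray(data):
        result ^= b
    return result

data = bytearray(b"hello world")
before = parity_byte(data)
data[0] ^= 0x01                     # flip one bit of the input
print(parity_byte(data) != before)  # True: the change is detected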

A plain hash function is designed to have an even distribution over the output values for a given set of input values. Good hashing functions can be very specific to the expected input values and the range of output values; the ultimate version of this is a perfect hash function, which efficiently maps a known set of input values to an output range. Hash functions are not guaranteed to change their output value if a bit in the input changes, partly because the hash function may consider the changed bit an 'unimportant' one and not include it at all in the calculation of the hash value.

(There are a lot of subtle issues when designing hash functions that I'm not going to try to cover here.)
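
To make 'plain hash function' concrete, here is a sketch of one classic example, the 32-bit FNV-1a hash, in Python; it's just a simple, fast, non-cryptographic hash I'm picking for illustration:

def fnv1a_32(data):
    # FNV-1a: a small, fast, non-cryptographic hash with 32 bits of output.
    h = 0x811c9dc5                          # FNV-1a offset basis
    for b in bytearray(data):
        h ^= b
        h = (h * 0x01000193) & 0xffffffff   # FNV prime, truncated to 32 bits
    return h

print(hex(fnv1a_32(b"hello")))
print(hex(fnv1a_32(b"Hello")))              # different input, different 32-bit value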

A good cryptographic hash function (aka a secure hash) is a hash function that is also collision resistant (see also), non-reversible, unpredictable, and what I'll call tamper-resistant. What these mean in plain language is that it's hard to find two inputs that hash to the same output, it's hard to find an input that hashes to a specific output, the output value doesn't let you determine properties of the input, and you can't change the input and have it hash to the same output. Note that using a cryptographic hash does not by itself make you automatically secure against attack; you have to use it right.

(With a conventional hash function you might be able to determine that certain output values generally only occur if the input has certain properties, like being a string in all lower case as opposed to mixed case.)

Whether a given hash is a good cryptographic hash changes over time as people work out how to attack it. For example, MD5 was once a good one but is now very close to completely broken and should no longer be used. Note the implications for burning a specific cryptographic hash into long term things like the specifications for file and communication formats.

Cryptographic hashes can often be used as checksums, although I don't know if any have been mathematically proven to be completely effective checksums (where a change of N bits or less in any arbitrary input is guaranteed to change the output).

People use plain hash functions instead of cryptographic hash functions because plain hash functions can be much faster and have smaller output ranges. A plain hash function can give you a 32-bit integer, whereas secure hashes start at 160 bits (SHA1) for anything you want to use today. Even MD5 is 128 bits.

(Yes, people say that cryptographic hashes are fast. That's a relative measure; they're fast compared to, say, communication links or the time it takes for feasible brute force attacks. They're not fast when compared to plain hash functions. You do not want your kernel to be MD5'ing every filename as part of maintaining a cache mapping filenames to file information.)
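
As a rough way to see both differences for yourself with Python's standard library (using zlib's CRC32 as a stand-in for a plain 32-bit hash; the exact timings will obviously depend on your machine):

import hashlib, timeit, zlib

data = b"x" * 65536

# Output sizes: CRC32 is a 32-bit value, MD5 is 128 bits, SHA1 is 160 bits.
print(zlib.crc32(data))
print(hashlib.md5(data).digest_size * 8, hashlib.sha1(data).digest_size * 8)

# Relative speed; run it yourself rather than trusting anyone's numbers.
print(timeit.timeit(lambda: zlib.crc32(data), number=1000))
print(timeit.timeit(lambda: hashlib.md5(data).digest(), number=1000))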

ChecksumsAndHashes written at 21:09:26

