Wandering Thoughts archives

2015-08-17

Getting dd's skip and seek straight once and for all

Earlier today I wanted to lightly damage a disk in a test ZFS pool in order to make sure that some of our status monitoring code was working right when ZFS was recovering from checksum failures. The reason I wanted to do light damage is that under normal circumstances, if you do too much damage to a disk, ZFS declares the disk bad and ejects it from your pool entirely; I didn't want this to happen.

So I did something like this:

for i in $(seq 128 256 10240); do
    dd if=/dev/urandom of=<disk> bs=128k count=4 skip=$i
done

The intent was to poke 512 KB of random data into the disk at a number of different places, with the goal of hopefully overwriting some space that was actually in use without overwriting too much of it. This turned out not to do very much, and I spent some time scratching my head before the penny dropped.

I've used skip before and honestly, I wasn't thinking clearly here. What I actually wanted to use was seek. The difference is this:

skip skips over initial data in the input, while seek skips over initial data in the output.

(Technically I think skip usually silently consumes the initial input data you asked it to skip over, although dd may try to lseek() on inputs that seem to support it. seek definitely must lseek() and dd will error out if you ask it to seek on something that doesn't support lseek(), like a pipe.)
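To make the difference concrete, here are two illustrative commands (the file names here are made up):

# made-up example: read starting 1 MB into the input, write at the start of the output
dd if=input.img of=output.img bs=1M skip=1

# made-up example: read from the start of the input, write starting 1 MB into the output
dd if=input.img of=output.img bs=1M seek=1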

What I was really doing with my dd command was throwing away increasing amounts of data from /dev/urandom and then repeatedly writing 512 KB (of random data) over the start of the disk. This was nowhere near what I intended and certainly didn't have the effects on ZFS that I wanted.
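The corrected version of my loop just swaps skip for seek:

for i in $(seq 128 256 10240); do
    dd if=/dev/urandom of=<disk> bs=128k count=4 seek=$i
done

Since seek counts in units of the block size, each pass writes 512 KB of random data starting at an offset of $i * 128 KB into the disk, which is what I was after in the first place.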

I guess the way for me to remember this is 'skip initial data from the input, seek over space in the output'. Hopefully it will stick after this toe-stubbing experience.

Sidebar: the other thing I initially did wrong

The test pool was full of test files, which I had created by copying /dev/zero into files. My initial dd was also using /dev/zero to overwrite disk blocks. It struck me that I was likely to be mostly overwriting file data blocks full of zeroes with more zeroes, which probably wasn't going to cause checksum failures.
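For illustration, creating such a zero-filled test file looks something like this (the path and size here are invented):

# path and size invented for illustration
dd if=/dev/zero of=/testpool/file01 bs=1M count=256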

unix/DdSkipVersusSeek written at 22:34:17

Why languages like 'declare before use' for variables and functions

I've been reading my way through Lisp as the Maxwell's equations of software and ran into this 'problems for the author' note:

As a general point about programming language design it seems like it would often be helpful to be able to define procedures in terms of other procedures which have not yet been defined. Which languages make this possible, and which do not? What advantages does it bring for a programming language to be able to do this? Are there any disadvantages?

(I'm going to take 'defined' here as actually meaning 'declared'.)

To people with certain backgrounds (myself included), this question has a fairly straightforward set of answers. So here's my version of why many languages require you to declare things before you use them. We'll come at it from the other side, by asking what your language can't do if it allows you to use things before declaring them.

(As a digression, we're going to assume that we have what I'll call an unambiguous language, one where you don't need to know what things are declared as in order to know what a bit of code actually means. Not all languages are unambiguous; for example, C is not. If you have an ambiguous language, it absolutely requires 'declare before use' because you can't understand things otherwise.)

To start off, you lose the ability to report a bunch of errors at the time you're looking at a piece of code. Consider:

lvar = ....
res = thang(a, b, lver, 0)

In basically all languages, we can't report the lver for lvar typo (we have to assume that lver is an unknown global variable), we don't know if thang is being called with the right number of arguments, and we don't even know if thang is a function instead of, say, a global variable. Or whether it even exists; maybe it's a typo for thing. We can only find these things out at the point where all valid identifiers must have been declared; in fully dynamic languages like Lisp and Python, that point is 'the moment we reach this line of code during execution'. In other languages we might be able to emit error messages only at the end of compiling the source file, or even only when we try to build the final program and find missing or wrongly-typed symbols.
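As a concrete illustration, here is a made-up Python version of that snippet (the wrapper function frob is invented). Python byte-compiles it without complaint; all of the problems only surface when the code actually runs:

def frob(a, b):
    lvar = a * b
    # 'lver' is a typo for 'lvar' and thang() doesn't exist anywhere,
    # but Python just assumes both are global names that will exist
    # by the time this line runs.
    res = thang(a, b, lver, 0)
    return res

frob(1, 2)    # only here do we get a NameError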

In languages with typed variables and arguments, we don't know if the arguments to thang() are the right types and if thang() returns a type that is compatible with res. Again we'll only be able to tell when we have all identifiers available. If we want to do this checking before runtime, the compiler (or linker) will have to keep track of the information involved for all of these pending checks so that it can check things and report errors once thang() is defined.

Some typed languages have features for what is called 'implicit typing', where you don't have to explicitly declare the types of some things if the language can deduce them from context. We've been assuming that res is pre-declared as some type, but in an implicit typing language you could write something like:

res := thang(a, b, lver, 0)
res = res + 20

At this point, if thang() is undeclared, the type of res is also unknown. This will ripple through to any code that uses res, such as the second line here; is that line valid, or is res perhaps a complex structure that can in no way have 20 added to it? We can't tell until later, perhaps much later.

In a language with typed variables and implicit conversions between some types, we don't know what type conversions we might need in either the call (to convert some of the arguments) or the return (to convert thang()'s result into res's type). Note that in particular we may not know what type the constant 0 is. Even languages without implicit type conversions often treat constants as being implicitly converted into whatever concrete numeric type they need to be in any particular context. In other words, thang()'s last argument might be a float, a double, a 64-bit unsigned integer, a 32-bit signed integer, or whatever, and the language will convert the 0 to it. But it can only know what conversion to do once thang() is declared and the types of its arguments are known.

This means that a language with any implicit conversions at all (even for constants like 0) can't actually generate machine code for this section until thang() is declared even under the best of circumstances. However, life is usually much worse for code generation than this. For a start, most modern architectures pass and return floating point values in different ways than integer values, and they may pass and return more complex values in a third way. Since we don't know what type thang() returns (and we may not know what types the arguments are either, cf lver), we basically can't generate any concrete machine code for this function call at the time we parse it even without implicit conversions. The best we can do is generate something extremely abstract with lots of blanks to be filled in later and then sit on it until we know more about thang(), lver, and so on.

(And implicit typing for res will probably force a ripple effect of abstraction on code generation for the rest of the function, if it doesn't prevent it entirely.)

This 'extremely abstract' code generation is in fact what things like Python bytecode are. Unless the bytecode generator can prove certain things about the source code it's processing, what you get is quite generic and thus slow (because it must defer a lot of these decisions to runtime, along with checks like 'do we have the right number of arguments').
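You can watch this happening with Python's dis module. The bytecode for a call to an undefined thang() is just 'look up this global and call it with three arguments', with everything else deferred to runtime (caller() is a made-up function, and the exact opcode names vary between Python versions):

import dis

def caller(a, b):
    return thang(a, b, 0)    # thang isn't defined anywhere

dis.dis(caller)
# This prints a LOAD_GLOBAL of 'thang' followed by a generic call
# instruction; whether thang exists, is callable, and accepts three
# arguments is only checked when caller() actually runs.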

So far we've been talking about thang() as a simple function call. But there are a bunch of more complicated cases, like:

res = obj.method(a, b, lver, 0)
res2 = obj1 + obj2

Here we have method calls and operator overloading. If obj, obj1, and/or obj2 are undeclared or untyped at this point, we don't know if these operations are valid (the actual obj might not have a method() method) or what concrete code to generate. We need to generate either abstract code with blanks to be filled in later or code that will do all of the work at runtime via some sort of introspection (or both, cf Python bytecode).
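In Python, for example, both operations are resolved entirely at runtime; the method lookup and the '+' both turn into dynamic dispatch on whatever the objects turn out to be. A sketch with a made-up class:

class Thing:
    # a made-up class purely for illustration
    def __init__(self, n):
        self.n = n
    def method(self, a, b, c, d):
        return self.n + a + b + c + d
    def __add__(self, other):
        # obj1 + obj2 is really a runtime call to this special method
        return Thing(self.n + other.n)

obj = Thing(1)
obj1, obj2 = Thing(2), Thing(3)
lver = 4
res = obj.method(5, 6, lver, 0)   # .method is looked up on obj at runtime
res2 = obj1 + obj2                # dispatched to Thing.__add__ at runtime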

All of this prepares us to answer the question about what sort of languages require 'declare before use': languages that want to do good error reporting or (immediately) compile to machine code or both without large amounts of heartburn. As a pragmatic matter, most statically typed languages require declare before use because it's simpler; such languages either want to generate high quality machine code or at least have up-front assurances about type correctness, so they basically fall into one or both of those categories.

(You can technically have a statically typed language with up-front assurances about type correctness but without declare before use; the compiler just has to do a lot more work and it may well wind up emitting a pile of errors at the end of compilation when it can say for sure that lver isn't defined and you're calling thang() with the wrong number and type of arguments and so on. In practice language designers basically don't do that to compiler writers.)

Conversely, dynamic languages without static typing generally don't require declare before use. Often the language is so dynamic that there is no point. Carefully checking the call to thang() at the time we encounter it in the source code is not entirely useful if the thang function can be completely redefined (or deleted) by the time that code gets run, which is the case in languages like Lisp and Python.

(In fact, given that thang can be redefined by the time the code is executed we can't even really error out if the arguments are wrong at the time when we first see the code. Such a thing would be perfectly legal Python, for example, although you really shouldn't do that.)
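Here's a small made-up sketch of that in Python (use() is an invented name); everything below is legal, and the mismatch only shows up as an error at runtime:

def use():
    return thang(10)

def thang(a):           # defined after use(), which is perfectly fine
    return a + 1

print(use())            # works; thang is looked up when use() runs

def thang(a, b):        # redefine thang with a different signature
    return a + b

use()                   # now this raises a TypeError, at runtime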

programming/WhyDeclareBeforeUse written at 01:03:28

