Wandering Thoughts archives

2006-08-31

How dd does blocking

For a conceptually simple program, dd has a number of dark corners. One of them (at least for me) is how it deals with input and output block sizes, and how the various blocking arguments change things around.

  • ibs= sets the input block size, the size of the read()s that dd will make. Since you can get partial reads in various situations, this is really the maximum size that dd will ever read at once.
  • obs= sets the output block size and makes dd 'reblock' output; dd will accumulate input until it can write a full sized output block (except at EOF, where it may write a final partial block).
  • bs= sets the (maximum) IO block size for both reads and writes, but it turns off reblocking; if dd gets a partial read, it will immediately write that partial block.

Because of the reblocking or lack thereof, 'ibs=N obs=N' is subtly different from 'bs=N'. The former will accumulate multiple partial reads together in order to write N bytes, while the latter won't.

(On top of this is the 'conv=sync' option, which pads partial reads.)

So if you're reading from a network or a pipe but want to write in large efficient blocks, you want to use obs, not bs (and you probably want to use ibs too, because otherwise you'll be doing a lot of 512 byte reads, which are kind of inefficient).
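As a concrete sketch (the sizes, host, and file names here are purely illustrative), compare pulling a disk image over ssh with and without reblocking:

# obs= reblocks: short reads from the pipe get accumulated into full 1 MB writes
ssh somehost 'cat disk.img' | dd ibs=64k obs=1024k of=local.img

# bs= does not reblock: every short read from the pipe becomes a short write
ssh somehost 'cat disk.img' | dd bs=1024k of=local.img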

DdBlocking written at 19:14:44

2006-08-29

Documentation should be cheap

Although documentation is not free, it should be cheap. By that, I mean that documentation should cost as little as possible to produce, so that you get as much of it as possible for your budget. Again, the major cost is in people's time, so you want writing documentation to be as fast (and easy) as possible.

The golden rule is that time that people are spending doing anything except writing down the actual content is overhead. You get the most bang for the buck by minimizing this overhead. And remember, the perfect is the enemy of the good.

There are two sides to this: the technical and the social. On the technical side, cheap documentation needs to be as simple to write as possible. To me, this means that it should use a simple markup language that is very close to plaintext, in a decent editor. (Web browsers do not qualify.)

(Ideally you want something where you can pretty much write basic text paragraphs and have them come out right. I think that you need some formatting, because some things really need it; ASCII art diagrams are just sad, and ASCII tables need a lot of futzing, especially if you have to revise them.)

On the social side, cheap needs a tolerance for ad-hoc things. Not everything has to be ad-hoc, but there should be room for people to just dump a couple of paragraphs in a file somewhere. Adopt the Google approach for finding things: just search everything. Then you can add more structure on top in various ways.

(In practice, many organizations use archived email lists for this purpose.)

Unfortunately, despite what I said about documentation needing testing, cheap also calls for a tolerance for various forms of inaccuracy, whether that's outright mistakes or just something that is now out of date. One way to deal with this is to have multiple levels of documentation, ranging from carefully vetted operations manuals to scribbled back-of-the-text-file notes. People can still be steered wrong, but at least they're not being misled about how trustworthy the information is.

(I feel that the problem isn't inaccurate information, it's that people trust it too much. I even like outdated historical stuff, because it gives me useful and sometimes fascinating insights into how things evolved. But then, I'm a geek.)

There's an important secondary reason for making documentation cheap: it increases the chances that you'll be able to capture knowledge while it's still fresh in people's minds. The faster it is to write things, the more likely it is that people will have the time to write something down right after they've actually done it. (This is another reason for the popular 'send email to this mail alias to describe what you just did' approach to change documentation.)

DocumentationNeedsToBeCheap written at 00:04:59

2006-08-27

Documentation is not free

Yes, system administrators should document things. At the same time, it's important to understand that documentation is not free; it has a cost. Since documentation does not appear from thin air in zero time, producing it costs people's time.

(And this doesn't even count testing it afterward.)

Unless your sysadmins don't have enough to do (which is a rather rare occurrence around here), this means that documentation costs real dollars. If you want to do everything you currently do plus documenting things, you need more time, which means that you need more people, which means more money. If you hold money constant and still demand documentation, you will get fewer other things done; the time has to come from somewhere.

(Please don't try to force your sysadmins to work extra time for nothing. This rarely works well.)

Thus, if you want documentation you need to explicitly budget for it, one way or another. If you don't, you're unlikely to get any documentation until people run out of other things to do.

(A corollary is that you can measure the real importance of documentation to an organization by how much they budget for it.)

If you really want documentation, you also need to defend it during crunches. It's a very tempting target for schedule cuts, since nothing breaks without it (at least in the short term) and it's usually cleanup work, done after systems are working.

(Another way to put this is that the costs of no documentation are usually a lot less obvious than the costs of unbuilt systems.)

DocumentationIsNotFree written at 23:46:33

2006-08-22

How not to set up your DNS (part 11)

Presented in the traditional illustrated form:

; dig +short ns hinet.net.tw.
reg.hinet.net.tw.
www.hinet.net.tw.
hinet.net.tw.

This looks good, except for the fact that they all have the same IP address, 210.65.1.231. You would think that a large Taiwanese ISP would be able to afford more than one DNS server, but perhaps they're too busy beefing up their network infrastructure to cope with the amount of spam their customers send. (At the moment Hinet is #4 on Spamhaus.org's list of the 10 worst spam service ISPs, with 53 listings.)
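(If you want to check this sort of thing yourself, a quick shell loop over dig will do it; a rough sketch, substituting whatever domain you're curious about:

for ns in $(dig +short ns hinet.net.tw.); do echo "$ns: $(dig +short a $ns)"; done

All three names come back with the same single address.)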

But it gets worse.

; dig +short a ms4.hinet.net.tw @210.65.1.231
210.65.1.231

(Okay, perhaps they only have one IP for all their servers.)

; dig mx ms4.hinet.net.tw @210.65.1.231
[...]
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 18296
[...]

The correct way for a DNS server to answer a query that it has no data for is a 'no data' response: NOERROR status with an empty answer section. SERVFAIL causes other people to retry, instead of just going away. In this case it caused us to not accept email from an alleged 'wbjtewyeox@ms4.hinet.net.tw', because every attempt to look up the MX record for ms4.hinet.net.tw looked like a temporary failure to us.
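(If you want to see for yourself what status a server hands back for a query like this, dig will show you just the response header; a rough sketch, reusing the server and name from above:

dig +noall +comments mx ms4.hinet.net.tw @210.65.1.231 | grep 'status:'

A well-behaved server would report 'status: NOERROR' with an empty answer section; this one reports SERVFAIL.)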

HowNotToDoDNSXI written at 17:04:06

2006-08-21

How not to set up your DNS (part 10)

This one is a close variation of HowNotToDoDNSIX, but it earns extra points for making the reverse of a common error. Presented in semi-illustrated format:

; dig +short ns system-bank.net.
dns01.system-bank.net.
dns02.system-bank.net.

(At this point I will pause to note that dns01.system-bank.net and dns02.system-bank.net have the same IP address, 218.227.163.13, a trick that was featured back at the start of this series.)

; dig a server.system-bank.net. @218.227.163.13
[...]
;; AUTHORITY SECTION:
system-bank.net. IN NS dns01.
system-bank.net. IN NS dns02.

(TTLs have been omitted for clarity.)

The usual error is for people to leave out the trailing dot on things like NS records pointing to external machines, so that you get an NS record of 'ns1.other.net.yourdomain.com' or the like. These people have done the reverse by adding some dots where they shouldn't have, leaving their domain name off some things that really need it.
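To make the difference concrete, here is a rough guess at the zone file line they have versus the ones they should have (this is my reconstruction, not their actual zone data):

; with a trailing dot the name is absolute, so this literally points at the host 'dns01.'
system-bank.net.  IN NS  dns01.
; either of these gives you dns01.system-bank.net.
system-bank.net.  IN NS  dns01
system-bank.net.  IN NS  dns01.system-bank.net.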

(The net result is the same as in HowNotToDoDNSIX. I wonder how many people accept their email anyways? If all their email bounced, I'd have expected them to notice this problem by now.)

HowNotToDoDNSX written at 11:22:05

2006-08-20

Finally, a good reason to periodically reboot servers

Recently, we had an interesting fire drill that actually winds up being the first decent argument I've seen for periodically rebooting otherwise healthy servers.

We have a number of very important servers, the kind of important servers that are active all the time and that have their downtimes carefully scheduled well in advance. Recently, we (and by this I mean 'a co-worker') had to patch the OS on a couple of the servers in the pack. This went fine.

As part of patching the OS, you have to reboot the machines. This did not go fine; both servers refused to come up, puking up obscure error messages. Once pored over and decoded (partly by the vendor's hardware people), the error messages on both machines boiled down to more or less 'the configuration NVRAM is corrupt'.

(The configuration NVRAM had not been touched by the OS patching process.)

This was, naturally, a big problem. A disruptive problem. Emergency bandaids were slapped into place, things were postponed, and hardware maintenance was summoned (and duly fixed the problem).

Of course, the only time anything looks at the configuration NVRAM (and cares that it's corrupted) is when the system is booted. Since these systems are almost never rebooted, we have very little idea how long ago the NVRAM got zapped; it could have been months. Since both systems failed, we're also somewhat nervous about the state of the rest of the pack, which have more or less identical hardware and haven't been rebooted recently. Do they have corrupt configuration NVRAM too?

(Test reboots are now being scheduled.)

Thus, the first decent argument for periodic precautionary reboots: there are bits of hardware that only get exercised when the machine reboots (and that vendors don't expose for testing). If something has gone wrong with one of them, it is better to find out during a scheduled time than as an unpleasant surprise.

This does have an important consequence: because tests can fail, you had better have a plan for what to do if the server won't come up after its precautionary reboot. (For the important machines I'm responsible for, the answer is 'failover to the backup'; naturally, one should never reboot the primary and the backup at the same time.)

As a corollary, it is probably better to schedule precautionary reboots for somewhat before the start of the workday on a weekday morning, so that if something goes wrong all of your vendor's people will soon be around.

(Naturally, we did the OS patching process in the evening, to have a margin of error for software problems.)

(My apologies to my co-workers if I've mangled the story in my retelling of it.)

RebootReason written at 23:47:14

2006-08-19

Documentation needs testing

One of the under-appreciated areas of writing systems documentation is testing it. And I don't mean proofreading it (although that's necessary too); I mean making sure that it is, among other things, comprehensible, clear, correct, and complete.

Testing is vital, because documentation that is incomplete or incorrect is often worse than no documentation at all. Documentation creates confidence, which means that bad documentation lets people go wrong with confidence, turning a small mess into a much bigger one. ('But the instructions said to newfs /dev/sda1...')

Unfortunately, testing documentation is hard and time consuming. You don't really know if it's good until someone else follows it and things work, and this can be hard to arrange, especially in small or specialized environments. (This is a hidden benefit of part-time student labour at a university; it provides you with a steady stream of guinea pigs to test your documentation on.)

And of course for testing documentation on procedures, you need to try out the procedure in an environment where you can afford to fail. Which often means spare time and spare equipment, which can also be hard to get. (Testing overview and orientation documentation may be even harder; how do you test that someone learned enough from it to be useful?)

In the absence of good guinea pigs, your co-workers are the best testers you have. Unfortunately, it's usually difficult to get people to really take the time to go over the documentation carefully (unless people already get all this). After all, your co-workers either already know most of this stuff or don't need to, and they have other work to do.

(I've always liked to get pre-readers and the like, but it took a recent discussion of documentation to make me understand why and to see this issue.)

Sidebar: why you need other people

Documentation has to be tested by someone else for the same reason you can't thoroughly test your own programs; you're too close to it. You really need someone who's ignorant enough that they can't just fill in the blanks and the unclear bits on their own, without noticing.

(And part of the problem is that after a while, your initially ignorant person learns enough to start filling in the blanks on their own, and you need another guinea pig.)

Serious disaster recovery tests are the extreme example of this, where you send an accounting clerk out to the backup site with your instruction binder and a pile of backup tapes and see what happens.

DocumentationNeedsTesting written at 00:17:44

2006-08-09

A Bourne shell irritation

I was bitten by this today, so I am going to grump about it. Today's irritation with the Bourne shell is that the following is illegal syntax:

echo hi &;

This probably looks peculiar and silly, so let me give a real example that is also illegal:

for i in *watch; do nohup ./$i >/dev/null 2>&1 </dev/null &; done

So is '; ;', and for the same reason: in the Bourne shell, you can't have empty 'simple commands', and both '&' and ';' are command terminators.

(The status of newline is somewhat confusing; the best explanation may be the original V7 sh manpage, which calls it an optional command delimiter. It cannot be a simple command terminator, because that would mean multiple blank lines would be a syntax error.)
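For the record, the legal spelling of my examples just drops the ';' after the '&', since '&' already terminates the command:

echo hi &
for i in *watch; do nohup ./$i >/dev/null 2>&1 </dev/null & done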

This becomes more irritating when you write command lines with multiple backgroundings in them, for example:

foo & bar & baz & wait

Speaking for myself, that makes my eyes bleed. I find it much more readable and easier to write:

foo &; bar &; baz &; wait

(I think a lot of the eye-bleeding is that 'a & b' is very similar to the much more common 'a && b', yet does something radically different.)

I have no idea why Bourne decided to be so nit-picky about this aspect of the shell's grammar, but I suspect it mirrors some bit of Algol.

(As an aside, I note that the original V7 Bourne shell manpage is a marvel of packing a great deal of information into not much space and being reasonably lucid in the process.)

BourneIrritation written at 22:48:03

2006-08-02

In praise of installing from Live CDs

I've recently had the experience of installing Ubuntu from one of their live CDs, and I now have to say that this is a genius idea that should be widely imitated, and as soon as possible.

For me, the genius of a live CD installation is three-fold, and is only truly compelling when the machine has a network connection:

  • I can Google around to figure out what I want to do next, understand any peculiar questions the installer is asking me, and so on.
  • If anything goes wrong, I have a full Unix environment where I can poke around to diagnose what's up (and maybe fix things).
  • I can still be productive while the machine is installing. All too often, setting up machines is an exercise in twiddling my thumbs. But with a live CD and a network, I can get productive work done on other machines right from the machine I am installing.

(Plus the obvious benefit of live CDs: you get to find out if the hardware actually works under (that) Unix.)

Prior to live CD installs, my usual practice was to start an install, go somewhere I could get actual work done, come back somewhat later, discover that the installer had stopped to ask me a question, answer it, go away to do productive work again, lather, rinse, and repeat. Live CDs are a vast improvement.

(For those that have been living under a rock, like me, a 'live CD' is a CD that boots a fully working Unix environment, with X and networking (if available). That this is possible without tedious manual configuration is an impressive testament to how far Unix and the X server have come in automatic hardware detection, as well as the amount of spare RAM that modern machines have. Installing from a live CD is what it sounds like: you boot into the live CD environment and run a program from there to install the distribution on your hard drive.)

LiveCDPraise written at 23:26:45

2006-08-01

One serial problem I should remember

I hate RS-232 serial connections, because something always goes wrong whenever I have to deal with them and then I have to figure out how to make the things work. Needless to say, there are next to no common diagnostic tools for troubleshooting serial communication problems.

In this case the problem was a mysterious inability to type anything to the remote device when I could receive the remote device's output fine. For my future reference, this can be caused by having hardware flow control on when the remote device doesn't support this.

(Don't ask why something in this day and age doesn't support hardware flow control. (And the device in question does seem to be from this age, since its manual is copyright 2005.))
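On a Linux machine, at least, stty can turn hardware flow control off on the port directly (and set the parity and bit count while you're at it); a sketch, with the device name and settings being whatever your situation actually calls for:

# turn off RTS/CTS hardware flow control
stty -F /dev/ttyS0 -crtscts
# explicitly ask for no parity, 8 data bits, 9600 baud
stty -F /dev/ttyS0 -parenb cs8 9600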

Also for my future reference:

  • setting odd or even parity when the device expects none mostly just garbles output from the device, while my input is echoed back fine.
  • setting space parity is generally OK, but setting mark parity or the wrong bit count clonks into things hard, with garbled output and no input.

Once I club it suitably, minicom turns out to be a reasonably good environment for experimenting with this, although I still like the minimalism of cu for a lot of routine serial communication. (Like cu, minicom appears to have a few problems with exiting promptly when I ask it to.)

OneSerialProblem written at 15:49:42

