2008-09-28
Why I hate Solaris 10's service facility right now
I have a production Solaris 10 x86 server with an onboard serial port. I
want to make the Solaris 10 equivalent of a getty process listen on
it, so that I can still log in when the system's network falls off a
cliff. Ideally it would do so with a (fixed) baud rate of my choice.
This is all that I want to do; in particular, I do not want to make
the serial port into the system console (especially as this would
require a reboot and, as mentioned, this is a production server).
And I need to do this without running a GUI, because my connection
to this server is very limited right now.
(Those of you with Solaris 10 experience may be laughing helplessly right now.)
You might think that this would be reasonably simple. You would be
sorely disappointed. Solaris 10 handles all of this with a separate
system, the Service Access Facility (sac, the 'Service Access
Controller', plus ttymon and several other bits), which is absurdly
complex and cryptic, not to mention rather under-documented (with
beautiful touches like the ttyadm program not being how you administer
tty ports).
The problem with all this being absurdly complex and cryptic is not
just that it is all but impossible to do things if you are not deeply
steeped in its arcana, but also that trying to solve the problem by
reading manpages and just trying things is not an option. Usually on an
unfamiliar Unix, I can fumble my way around to solve problems because
I understand enough of the basic logic of the system to avoid making
terrible, damaging mistakes (at most I will make localized messes
that I can clean up). Not so with Solaris 10's ttymon system;
absurdly complex things that I do not completely understand and am doing
for the first time are not things that I can touch on a production
machine. Especially in a crisis, which is exactly why I want to set
this up in the first place.
(What makes this all the more frustrating is that I think I may know how to solve the problem, but I dare not try it out on a production system.)
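(For the record, here is what I think the incantation looks like, based purely on reading the sacadm, pmadm, and ttyadm manpages and entirely untested; the 'ttya' port monitor tag and the 'a' service tag are names I am making up, and I am assuming the port is /dev/term/a and that a '9600' ttylabel exists in /etc/ttydefs:

sacadm -a -p ttya -t ttymon -c /usr/lib/saf/ttymon -v `ttyadm -V`
pmadm -a -p ttya -s a -i root -fu -v `ttyadm -V` \
  -m "`ttyadm -d /dev/term/a -l 9600 -s /usr/bin/login`"

If I ever get a test machine to try this on, I may find out how wrong I am.)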
2008-09-15
Why ZFS's raidz design decision is sensible (or at least rational)
Given the downsides covered in yesterday's entry, ZFS's decision to stripe data blocks across all of the disks in a raidz probably sounds rather odd. However, it actually is a sensible decision given ZFS's overall design goals. The easiest way to see why is to contemplate the problems posed by an alternate design.
Suppose that ZFS did not turn a single data block into a 'parity stripe', but instead had the idea of what I will call a 'block set', a group of N data blocks and either one or two parity blocks. Then you would only need to read the full block set if you have to try to reconstruct data; during normal reads you could read a single data block by itself from a single disk and still verify its checksum.
Now consider how this interacts with the twin ZFS design goals of never updating a data block in place, and not having the RAID-5 write hole problem. When a data block changes, it must be rewritten elsewhere, and so this orphans the old data block. However, you cannot reuse that space, because that would invalidate the block set's parity blocks unless you updated them in place, and updating them in place both breaks a ZFS rule and creates the RAID-5 write hole problem.
So the conclusion is that you can only reclaim the space in a block set when all of the data blocks in the block set are orphaned and dead. As a corollary, you must write all of the block set at the same time. This is not fatal, since it is more or less equivalent to having a data block that is as big as the whole block set. But it certainly would complicate ZFS's design, and I think that skipping that complication is a rational choice.
(Because the block set approach has more data blocks, it also uses up more space for metadata; instead of one pointer and checksum for the entire parity stripe, you need N pointers and checksums, one for each data block in the block set.)
2008-09-14
A read performance surprise with ZFS's raidz and raidz2
Sun's ZFS contains a performance surprise for people using its version of RAID-5 and RAID-6, which ZFS calls raidz and raidz2. To understand what is going on, it is necessary to start with some basic ZFS ideas.
One of the things that ZFS is worried about is disk corruption. To deal with this, ZFS famously checksums everything that it writes (both filesystem metadata and file data) and then verifies that checksum when you read things back; specifically, ZFS computes and checks a separate checksum for every data block. One consequence of this is that ZFS must always read a whole data block, even if the user-level code only asked for a single byte. (This is pretty typical behavior for filesystems, and generally doesn't matter; modern disks care far more about seeks than about the amount of data being transferred.)
(You can read more about this here.)
Now we come to the crucial decision ZFS has made for raidz and raidz2: the data block is striped across all of the disks. Instead of a model where a parity stripe is a bunch of data blocks, each with an independent checksum, ZFS stripes a single data block (and its parity), with a single checksum, across all of the disks (or as many of them as necessary).
This is a rational implementation decision, but combined with the need to verify checksums it has an important consequence: in ZFS, reads always involve all disks, because ZFS must always verify the data block's checksum, which requires reading all of the data block, which is spread across all of the drives. This is unlike normal RAID-5 or RAID-6, where a small enough read will only touch one drive, and it means that adding more disks to a ZFS raidz pool does not increase how many random reads you can do per second.
(A normal RAID-5 or RAID-6 array has a (theoretical) random read IO capacity equal to the sum of the random IO operation rates of each of the disks in the array, and so adding another disk adds its IOPS to your read capacity. A ZFS raidz or raidz2 pool instead has a capacity equal to the slowest disk's IOPS, and adding another disk does nothing to help. Effectively a ZFS raidz gives you a single disk's random read IOPS.)
Assuming that you can afford the disk space loss, you can somewhat improve this situation by creating your pools from several smaller raidz or raidz2 vdevs, instead of from one large vdev that has all of the drives. This doesn't get you the same random read rate as a normal array, but at least it will get you a higher rate than a single drive would. (You effectively get one drive's worth of random reads per vdev.)
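To put some entirely made-up illustrative numbers on this, suppose each disk can do roughly 100 random reads a second. Then, in theory:

a 6-disk RAID-5 array:      about 6 x 100 = 600 random reads/sec
one 6-disk raidz vdev:      about 1 x 100 = 100 random reads/sec
two 3-disk raidz vdevs:     about 2 x 100 = 200 random reads/sec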
(Credit where credit is due: I didn't discover this on my own; I gathered it more or less by osmosis from various discussions on the ZFS mailing list.)
2008-09-12
ZFS's helpful attention to detail
Because I just ran into this:
Imagine that you have a ZFS pool, call it tank to match the convention
of ZFS examples, and that it has three spare devices configured. Further
imagine that you want to remove those three configured spares from the
pool:
# zpool remove tank c4t0d0 c4t0d1 c4t0d2
(No error messages are emitted.)
Then, because you are a cautious person (or because you have a tool
that automatically saves your pool configurations somewhere), you
actually examine the spares configuration for tank. Guess what
you will find?
You'll find that tank still has two spares. Silently, ZFS has only
removed the c4t0d0 spare disk.
Now, in theory this behavior is documented in the zpool manpage and
help, because the remove command is described as:
zpool remove pool vdev
Notice how that doesn't say you can specify more than one vdev. In the grand Unix tradition of informative manpages, this means that you can't.
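As far as I can tell, this means that the way to actually remove all three spares is to repeat the command once per spare:

# zpool remove tank c4t0d0
# zpool remove tank c4t0d1
# zpool remove tank c4t0d2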
Of course, it would be nice if ZFS's administrative commands actually gave an error message when you used them wrong, instead of just ignoring anything that wasn't supposed to be there. Ignoring things that you don't expect is not robust.
2008-09-05
Something to remember when using DTrace on userland programs
As mentioned before, Solaris's DTrace
is fundamentally a kernel debugger; in order to extract information
from userland programs, you need to copy it to kernel space (generally
using DTrace's copyin() function) before you can start printing it or
otherwise use it.
The most important thing to remember about this is that you wind up
dealing with two sorts of pointers: userland pointers, which are what
you find in all of those data structures you copy in from your program,
and kernel pointers, which are what copyin() and friends return. Let's
call the former 'locations' and the latter 'pointers'.
Keeping them straight is vital, because you have to do different things
in order to use them; you dereference pointers but give locations to
copyin(), which returns a pointer that you have to dereference in
order to get the actual userland data. Fortunately they have different
types; locations are type uintptr_t, while pointers are, well,
pointers.
Thus the incantation to get an object of primitive type TYPE (like an
integer, a long, or a pointer) from a userland location LOC is:
uintptr_t LOC;
TYPE var;
var = * ((TYPE *) copyin(LOC, sizeof(TYPE)));
(Unfortunately DTrace doesn't have #define, or you could make this
a macro.)
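For example, here is a hypothetical sketch (the function name is made up) of fetching the int that a function's first argument points to, assuming that arg0 is really an 'int *' in the program:

pid$target::myfunc:entry
{
        /* arg0 is a userland location; copyin() returns a kernel
           pointer to a copy of the int that it points to. */
        this->val = * (int *) copyin(arg0, sizeof(int));
        printf("myfunc(): *arg0 is %d\n", this->val);
}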
If this seems confusing, well, it is.
The DTrace language is very C-like and it lets you define structures
just like C, enough so that you can generally just copy the structure
definitions from your C header files into your DTrace program. However,
watch out; all of the pointers in your program's structures are userland
pointers, ie locations, not kernel pointers (real pointers). You will
avoid a lot of confusion if you take all of those structure definitions
and change the type of every pointer field to uintptr_t, so that
you get a compile-time error if you ever attempt to directly
dereference one (instead of tediously doing it via copyin()).
(The DTrace language helps you out by not having a #include, so
you have to copy the structure definitions to start with.)
As a suggestion: if you do this, leave yourself a comment about what
type the pointers actually point to, so you can remember what you get
when you dereference them via copyin() (and how much data you need to
tell copyin() to copy).
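Putting all of this together, here is a hypothetical and untested sketch. Suppose the program has a C structure like 'struct info { int count; char *name; }' and some function (which I'll call somefunc) takes a pointer to one as its first argument; I'm also assuming a 64-bit program, so that uintptr_t is the same size as the original pointer field. The D version would look something like:

/* copied from the program's header file, with the pointer
   field changed to a uintptr_t location */
typedef struct info {
        int       count;
        uintptr_t name;         /* really a char * in userland */
} info_t;

pid$target::somefunc:entry
{
        /* arg0 is the userland location of an info structure */
        this->inf = (info_t *) copyin(arg0, sizeof(info_t));
        /* copyinstr() chases the userland string location for us */
        printf("count %d, name %s\n", this->inf->count,
               copyinstr(this->inf->name));
}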