Wandering Thoughts archives

2014-08-22

Where DTrace aggregates are handled for printing et al

DTrace, the system, is split into a kernel component and a user level component (the most obvious piece of which is the dtrace command). However the DTrace documentation has very little discussion of what features are handled where. You might reasonably ask why we care; the answer is that anything done at user level can easily be made more sophisticated while things done at kernel level must be minimal and carefully safe. Which brings us around to DTrace aggregates.

For a long time I've believed that DTrace aggregates had to be mostly manipulated at the user level. The sensible design was for the kernel to ship most or all of the aggregate to user level with a memcpy() into a buffer that user space had set up, then let user level handle, for example, printa(). However, I haven't known for sure. Well, now I do. DTrace aggregate normalization and printing are handled at user level.

This means that D (the DTrace language) could have a lot of very useful features if it wanted to. The obvious one is that you could set the sort order for aggregates on a per-aggregate basis. With a bit more work DTrace could support, say, multi-aggregate aware truncation (dealing with one of the issues mentioned back here). If we go further, there's nothing preventing D from allowing much more sophisticated access to aggregates (including explicit key lookup in them for printing things and so on), something that would really come in handy in any number of situations.
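
(For contrast, the aggregate sort controls that D has today are global options such as aggsortkey, aggsortpos, and aggsortrev; they apply to every aggregation a script prints, not to individual ones, and trunc() similarly operates on only one aggregation at a time. Here is a minimal sketch of the global options in action; the probes and formatting are just my own illustration, not anything from the entry above.)

#pragma D option quiet
/* sort aggregation output by key instead of by value ... */
#pragma D option aggsortkey
/* ... and reverse the sort order */
#pragma D option aggsortrev

syscall::read:entry, syscall::write:entry
{
   @calls[probefunc] = count();
}

tick-10sec
{
   /* both pragmas apply here and to every other printa() in the script */
   printa("%-8s %@d\n", @calls);
   exit(0);
}

A hypothetical per-aggregate version would have to hang the sort choice off the aggregation itself (or off printa()), which is exactly the kind of change that's possible in principle once the work is being done at user level.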

(I don't expect this to ever happen for reasons beyond the scope of this entry. I expect that the official answer is 'D is low level, if you need sophisticated processing just dump output and postprocess in a script'. One of the reasons that this is a bad idea is that it puts a very sharp cliff in your way at a certain point in D sophistication. Another reason is that it invites you to play in the Turing tarpit of D.)

Sidebar: today's Turing Tarpit D moment

This is a simplified version.

syscall::read:return, syscall::write:return
/ ... /
{
   /* 0 for reads, 1 for writes; a low-digit discriminant */
   this->dirmarker = (probefunc == "read") ? 0 : 1;
   this->dir = this->dirmarker == 0 ? "r" : "w";
   /* inflate the fd by 10000 and tuck the direction into the low digits
      so reads sort ahead of writes for the same fd */
   @fds[this->dir, self->fd] = avg(self->fd * 10000 + this->dirmarker);
   ....
}

tick-10sec
{
   /* scale the stored values back down so printa()'s %@2d shows the real fd */
   normalize(@fds, 10000);
   printa("fd %@2d%s: ....\n", @fds, @....);
}

If a given file descriptor had both read and write IO, I wanted the read version to always come before the write version instead of potentially flip-flopping back and forth randomly. So I artificially inflate the fd number, add in a little discriminant in the low digits to make it sort right, and then normalize away the inflation afterwards. (For fd 5, for example, a read is stored as an avg() of 50000 and a write as 50001, so the read entry sorts first; after normalize(@fds, 10000), both print as 5.) I have to normalize away the inflation because the value of the aggregation has to be used in printa(), which means that the actual FD number has to come from it rather than from its part of the key tuple.

Let me be clear here: this may be clever, but it's clearly a Turing tarpit. I've spent quite a lot of time figuring out how to abuse D features in order to avoid the extra pain of a post-processing script and I'm far from convinced that this actually was a good use of my time once the dust settled.

DTraceAggregatesUserLevel written at 01:10:04

2014-08-03

Our second generation ZFS fileservers and their setup

We are finally in the process of really migrating to the second generation of our ZFS fileserver setup, so it seems like a good time to write up all of the elements in one place. Our fundamental architecture remains unchanged: NFS servers export filesystems from ZFS pools to our client machines (which are mostly Ubuntu). The ZFS pools are made up of mirrored pairs, where each side of a mirror comes from a separate iSCSI backend. The fileservers and iSCSI backends are interconnected over two separate 'networks', which are actually single switches.

The actual hardware involved is unchanged from our basically finalized hardware; both fileservers and backends are SuperMicro motherboards with 2x 10G-T onboard in SuperMicro 16+2 drive bay cases. The iSCSI networks run over the motherboard 10G-T ports, and the fileservers also have a dual Intel 10G-T card for their primary network connection so we can do 10G NFS to them. Standard backends have 14 2TB WD Se drives for iSCSI (the remaining two data slots may someday be used for ZFS ZIL SSDs). One set of two backends (and a fileserver) is for special SSD-based pools, so they have some number of SSDs instead.

On the fileservers, we're running OmniOS (currently r151010j) in an overall setup that is essentially identical to our old S10U8 fileservers (including our hand-rolled spares system). On the iSCSI backends we're running CentOS 7, after deciding that we didn't like Ubuntu 14.04. Although CentOS 7 comes with its own iSCSI target software, we decided to carry on using IET, the same software we use on our old backends; there just didn't seem to be any compelling reason to switch.

As before, we have sliced up the 2TB data disks into standard sized chunks. We decided to make our lives simple and have only four chunks on each 2TB disk, which means that they're about twice as big as our old chunk size. The ZFS 4K sector disk problem means that we have to create new pools and migrate all data anyways, so this difference in chunk size between the old and the new fileservers doesn't cause us any particular problems.

Also as before, each fileserver is using a different set of two backends to draw its disks from; we don't have or plan any cases where two fileservers use disks from the same backend. This assignment is just a convention, as all fileservers can see all backends and we're not attempting to do any sort of storage fencing; we're still not planning any failover, and fencing feels like too much complexity and potential for problems.

In the end we went for a full scale replacement of our existing environment: three production fileservers and six production backends with HDs, one production fileserver and two production backends with SSDs, one hot spare fileserver and backend, and a testing environment of one fileserver and two fully configured backends. To save you the math, that's six fileservers, eleven backends, and 126 2TB WD Se disks. We also have three 10G-T switches (plus a fourth as a spare), two for the iSCSI networks and the third as our new top level 10G switch on our main machine room network.

(In the long run we expect to add some number of L2ARC SSDs to the fileservers and some number of ZFS ZIL SSDs to the backends, but we haven't even started any experimentation with this to see how we want to do it and how much benefit it might give us. Our first priority has been building out the basic core fileserver and backend setup. We definitely plan to add an L2ARC for one pool, though.)

ZFSFileserverSetupII written at 00:53:07

