Wandering Thoughts archives

2012-11-16

Why DTrace does not attract people to Solaris very often

Back when Solaris 10 was fresh and new, DTrace was supposed to be one of its significant competitive advantages, one of the major things that made it attractive to system administrators (and organizations). In practice Solaris 10, DTrace included, got a tepid reception despite all of the drum-banging from Solaris supporters (and somewhat to their disbelief, at least in my impression of things). Having now used DTrace somewhat, I have an opinion on why.

It's pretty simple:

DTrace is not really a tool that sysadmins use to fix their own problems, it's a mechanism that your support provider and/or OS vendor uses to provide support.

(This idea was first expressed by a commentator on my entry on why we haven't taken to DTrace and has probably been percolating in the back of my mind ever since.)

You can certainly use DTrace to write your own diagnostic tools; after all, we did. But DTrace's almost complete lack of useful documented tracepoints means that doing this basically requires that you can read and understand (Open)Solaris code. Or, to put it another way, you need a system programmer. And system programmers are now uncommon; most places do not have one around and so will never write their own DTrace scripts (beyond at most very simple ones).

As for DTrace as a mechanism to help your OS vendor, well, sysadmins and organizations don't care how OS vendors provide support, just that they do (or don't). A mechanism to provide diagnostic tools is not a selling feature; at best, the diagnostic tools themselves are.

(Also, as I've sort of written before, most sysadmins don't expect to run into problems.)

The less pleasant way to put this is that people are not attracted to technologies, they are attracted to solutions to problems. DTrace is a fine technology but for most places it is not a solution.

(I don't know what Sun's true vision for DTrace was. If they planned for it to be a serious competitive advantage as opposed to mostly an internal tool, I think that they dropped the ball badly.)

DTraceWhyNotII written at 01:33:56

2012-11-07

DTrace: figuring out what you have access to at tracepoints

I've said before that OpenSolaris (or Illumos) source code is an essential component for serious use of DTrace, because it's what you need in practice to figure out how to get useful information out of the kernel. But sometimes you don't want to go spelunking through OpenSolaris code to find structure definitions and all of that, and you'd like a more convenient way. As it happens, there sort of is.

The first step is to find out what arguments are available at a particular DTrace tracepoint. DTrace has this information available (because it uses it to complain if you get things wrong), but it doesn't make this readily accessible. For function tracepoints, you can use an undocumented feature of 'dtrace -v' (at least in Solaris 10 update 8); it will print argument details as well as the uninteresting stability level stuff. For example, 'dtrace -vln fbt::zio_create:entry' reports (for me) that the second argument (args[1]) is a spa_t *.

That's nice, but what's in that spa_t that we have a pointer to? For that, we turn to mdb -k and its '::print -t' command, because the debugger (like DTrace itself) knows all about the structures in your kernel. For example:

# mdb -k
Loading modules: [....]
> ::print -t spa_t
{
    char [256] spa_name 
    avl_node_t spa_avl {
        struct avl_node *[2] avl_child 
        uintptr_t avl_pcb 
    }
    nvlist_t *spa_config 
[.. huge output elided ..]

What all of this will not tell you is what these structures and fields mean and what the various functions use them for. For that, you need the source code. However, using mdb to dump structures has two advantages. First, it's generally faster than trying to find and read the right header file in the depths of the source code tree. Second, it's guaranteed to be accurate for your kernel (which is not true for the OpenSolaris source code). The second bit makes mdb the thing to turn to on the occasions when DTrace is telling you that some structure member is missing when you can clearly see it there in the source code.
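
As an illustration of putting the two pieces together, here's a sketch of a one-liner that counts zio_create() calls per ZFS pool. It assumes that your kernel's zio_create() really does get a spa_t * as its second argument (as 'dtrace -vln' reported for mine) and that spa_name is still a plain char array in your spa_t:

# dtrace -qn 'fbt::zio_create:entry { @zio[stringof(args[1]->spa_name)] = count(); }'

Let it run for a bit, hit Ctrl-C, and DTrace prints the aggregation's per-pool counts.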

Unfortunately 'dtrace -v' will not tell you anything about the arguments for non-fbt tracepoints (at least in my version and as far as I can tell). For that you need to read the source code itself. What you want to look at is the various DTRACE_PROBE* macros; for example, in iscsi_cmd_state_machine we have:

DTRACE_PROBE3(event, iscsi_cmd_t *, icmdp, char *,
    iscsi_cmd_state_str(icmdp->cmd_state),
    char *, iscsi_cmd_event_str(event));

This corresponds to sdt::iscsi_cmd_state_machine:event. This DTrace tracepoint has three arguments, an iscsi_cmd_t *, a char *, and then another char *, and you can also see what function arguments they correspond to. Again, for what they mean you need to read the surrounding code.

(This particular sdt tracepoint is almost exactly equivalent to fbt::iscsi_cmd_state_machine:entry, but there are more useful sdt tracepoints. Sometimes they expose execution paths and information that are hard or all but impossible to get with function tracepoints alone.)
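
At least in my experience, sdt arguments come through untyped, as raw arg0 through argN, so you do the casting yourself. Here's a sketch of watching this tracepoint; it assumes that the two char * arguments really are ordinary kernel strings (which is what the source suggests):

# dtrace -qn 'sdt::iscsi_cmd_state_machine:event { printf("%s -> %s\n", stringof((char *)arg1), stringof((char *)arg2)); }'

This prints a command's current state and the event being delivered to it each time the state machine runs.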

PS: that DTrace refuses to tell you information that it knows makes me very annoyed. There is no excuse for hiding this stuff. Doing so is yet another sign that despite propaganda to the contrary, DTrace is not actually designed for anyone except the OS vendor. Make no mistake, it's a great tool if you're the OS vendor or can fake it, but if you're not, forget it.

DTraceTracepointArguments written at 01:27:53

2012-11-02

Why our ZFS fileservers sometimes boot slowly

This is both a sysadmin war story and a great example of what I'll call a full-stack problem.

One of our fileservers is significantly more problematic than all of the others, for reasons that we don't understand. One of the ways that it was problematic was that when it rebooted, it would connect to iSCSI target disks only very slowly. Solaris prints a line for every configured iSCSI target (which can be a problem), and we'd usually see these lines going by at a rate of one every few seconds. When you have quite a lot of targets, this makes the overall boot very slow. At one point we rebuilt this fileserver from scratch (using new hardware and a new, from the ground up install of Solaris) and our testing then showed that it didn't have the problem. Well, sadly this fileserver crashed the other day more or less in front of me and I noticed that yes, the slow boot problem was back.

(An iSCSI target roughly corresponds to a SCSI ID and has one or more LUNs which are the actual disks, or in our case which are partitions on a single disk. A single backend server serves up multiple targets. Yes, it gets confusing.)

Thanks to good suggestions from a coworker, I was able to more or less reproduce the problem in our test environment. While the Solaris boot process is basically inaccessible, I decided to capture network traces so that I could at least rule out the backends as a source of the delay. Much to my surprise, inspection of the packet traces in Wireshark revealed what was going on.
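
For what it's worth, this sort of capture doesn't need anything special. Since the Solaris side can't be instrumented during boot, the natural place to capture is on (or next to) the Linux backends; a sketch, assuming the backends speak iSCSI on the standard port of 3260 and with FILESERVER standing in for the real fileserver name:

backend# tcpdump -n -w fs-boot.pcap host FILESERVER and port 3260

The resulting pcap file can then be pulled into Wireshark at leisure.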

The explanation goes like this:

  • when the rebooted Solaris test fileserver made TCP connections to certain iSCSI targets, the packet sequence went SYN (from the fileserver), ACK (from the Linux backend), RST (from the fileserver), a three or so second pause as Solaris TCP went through some timeout, and then the normal SYN, SYN-ACK, and ACK three-way handshake you'd expect. This retry timeout pause was exactly the per-target pause we saw during Solaris boot.

    (In iSCSI, you make a separate TCP connection for every target (and every network path to it).)

  • Solaris is completely consistent in the local ephemeral port numbers that it uses for iSCSI connections during boot. As a result, every time a fileserver booted it reused the same local IP, local port, backend IP, and (fixed) backend port for a given connection as it had the last time. At the TCP level, this made the new connection more or less indistinguishable from the old one.

    (Assuming that the list of configured targets hadn't changed.)

  • when iSCSI connections between the fileserver and backends were not shut down in an orderly way, the underlying TCP connections remained in the established state on the backend (once I started looking I could see these in netstat). Thus the backend thought that the fileserver's initial SYN was part of the established connection instead of the start of a new connection.
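
    (A sketch of the check on a Linux backend, assuming the standard iSCSI port of 3260:

    backend# netstat -tn | grep ':3260 '

    The stale connections show up as ESTABLISHED sessions to the fileserver even though the fileserver's side of them no longer exists.)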

If a fileserver crashed or was abruptly powered off, of course none of its iSCSI connections were closed in an orderly way, the TCP connections all stayed established on the backends, and when the fileserver came back up every iSCSI connection was delayed by about three seconds for the RST and subsequent timeout (Solaris evidently doesn't try to make iSCSI connections in parallel). But it turned out that these stalls could happen during ordinary reboots too because of an additional factor.

In an orderly shutdown our version of Solaris appears to try to close down iSCSI connections before it unmounts ZFS filesystems and shuts down ZFS pools. However, the Solaris iSCSI initiator has a safety feature; it will not let you close the last connection to an iSCSI target that has a disk in use by something (you can only disconnect from completely unused disks) and this still applies during system shutdown. So our ZFS pools prevent orderly disconnects for all of the targets they use, leaving lingering TCP connections on the backends which then stall for three seconds each when the fileserver reboots.

(Again, once I was paying attention during a fileserver shutdown I could see messages about 'cannot log out from target whatever'. I'm pretty certain we'd seen them before but just ignored them as yet more shutdown-time noise.)

I call this a full stack problem because of how it goes from the low level details of TCP connection establishment all the way up to relatively high level details of how Solaris shuts down various bits of the kernel while it's rebooting. Those high level bits look completely unrelated to the low level TCP connection stall until you trace the thread that runs all the way from the bottom to the top to tie them together.

This whole root cause explained several things:

  • when we rebuilt the fileserver it looked like the problem was gone, because of course we tested rebooting the rebuilt fileserver without any ZFS pools. With no pools it could always completely close down its iSCSI connections, leaving no lingering established TCP sessions on the backends to stall out during boot.
  • we hadn't noticed this problem during scheduled fileserver shutdowns because I believe we've always shut down the backends as well.
  • we noticed the problem on this particular fileserver because it's pretty much the only fileserver that's ever crashed or had to be rebooted outside of a planned downtime. We now believe that we'd see this problem on all of the fileservers if we rebooted them.

(In retrospect, testing the rebuilt fileserver without any ZFS pools was a subtle example of thinking we understood the problem. We didn't put any ZFS pools on because we thought that the problem didn't have anything to do with ZFS, and the reason we thought that was that it happened before ZFS pools and filesystems were activated.)

Ironically I might not have found the TCP-level issue if it wasn't for a Wireshark glitch, but this entry is already long enough so that's a story for another time.

Sidebar: the hack workaround that's occurred to me

Solaris has a network tunable for the start of the ephemeral port range. One could write a shell script that runs during sufficiently early boot and randomizes this by a sufficiently large amount that the ports used for iSCSI connections aren't likely to overlap with the last reboot of the machine. With no ephemeral port overlap, the lingering old connections wouldn't matter; they could just time out on their own through whatever mechanism iSCSI has for it.
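
A minimal sketch of the script part of this, leaving out all of the SMF glue that would actually get it run early enough in boot (and assuming the stock tunable, tcp_smallest_anon_port, which normally starts the ephemeral range at 32768):

#!/bin/sh
# Shift the bottom of the ephemeral port range by a boot-dependent
# amount so that this boot's iSCSI connections are unlikely to land
# on the same source ports that the previous boot used.
off=`date +%S`
base=`expr 33000 + $off \* 400`
ndd -set /dev/tcp tcp_smallest_anon_port $base

(The seconds field of the current time is a crude source of variation, but for this it only has to differ from the previous boot most of the time, and the result always stays comfortably inside the normal 32768-65535 ephemeral range.)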

Since this idea involves both wrestling with SMF and early boot hackery, I suspect that we are not going to come near it. The problem is generally not severe and this would probably be a case where the cure is worse than the disease.

ZFSISCSIStartupDelay written at 01:33:36

