Why our ZFS fileservers sometimes boot slowly

November 2, 2012

This is both a sysadmin war story and a great example of what I'll call a full-stack problem.

One of our fileservers is significantly more problematic than all of the others, for reasons that we don't understand. One of the ways that it was problematic was that when it rebooted, it would connect to iSCSI target disks only very slowly. Solaris prints a line for every configured iSCSI target (which can be a problem), and we'd usually see these lines going by at a rate of one every few seconds. When you have quite a lot of targets, this makes the overall boot very slow. At one point we rebuilt this fileserver from scratch (using new hardware and a new, from-the-ground-up install of Solaris) and our testing then showed that it didn't have the problem. Well, sadly this fileserver crashed the other day more or less in front of me and I noticed that yes, the slow boot problem was back.

(An iSCSI target roughly corresponds to a SCSI ID and has one or more LUNs which are the actual disks, or in our case which are partitions on a single disk. A single backend server serves up multiple targets. Yes, it gets confusing.)
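
If it helps to see that hierarchy laid out, here's its rough shape as a little Python sketch. Everything in it is invented for illustration and doesn't correspond to our real setup:

    # Rough shape only, with invented names: one Linux backend serves several
    # targets, and each target's LUNs are (in our case) partitions of one disk.
    backend = {
        "target-0": ["LUN 0 = partition 1 of /dev/sdb",
                     "LUN 1 = partition 2 of /dev/sdb"],
        "target-1": ["LUN 0 = partition 1 of /dev/sdc",
                     "LUN 1 = partition 2 of /dev/sdc"],
    }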

Thanks to good suggestions from a coworker, I was able to more or less reproduce the problem in our test environment. While the Solaris boot process is basically inaccessible, I decided to capture network traces so that I could at least rule out the backends as a source of the delay. Much to my surprise, inspection of the packet traces in Wireshark revealed what was going on.
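
(For what it's worth, the same pattern can also be pulled out of a capture without Wireshark's GUI. Something like this rough, untested scapy sketch would do; the capture file name is invented and 3260 is just the standard iSCSI target port.)

    # Sketch: print timestamps and TCP flags for all iSCSI-port traffic in a
    # packet capture, so the per-connection pattern described below stands out.
    from scapy.all import rdpcap, IP, TCP

    for pkt in rdpcap("fileserver-boot.pcap"):      # invented capture file
        if IP in pkt and TCP in pkt and 3260 in (pkt[TCP].sport, pkt[TCP].dport):
            print("%.3f %s:%d -> %s:%d flags=%s"
                  % (pkt.time, pkt[IP].src, pkt[TCP].sport,
                     pkt[IP].dst, pkt[TCP].dport, pkt[TCP].flags))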

The explanation goes like this:

  • when the rebooted Solaris test fileserver made TCP connections to certain iSCSI targets, the packet sequence went SYN (from the fileserver), ACK (from the Linux backend), RST (from the fileserver), a three or so second pause as Solaris TCP went through some timeout, and then the normal SYN, SYN-ACK, and ACK three-way handshake you'd expect. This retry timeout pause was exactly the per-target pause we saw during Solaris boot.

    (In iSCSI, you make a separate TCP connection for every target (and every network path to it).)

  • Solaris is completely consistent in the local ephemeral port numbers that it uses for iSCSI connections during boot. As a result, every time a fileserver booted it reused the same local IP, local port, backend IP, and (fixed) backend port for a given connection as it had the last time. At the TCP level, this made the new connection more or less indistinguishable from the old one (there's a small sketch of this effect after the list).

    (Assuming that the list of configured targets hadn't changed.)

  • when iSCSI connections between the fileserver and backends were not shut down in an orderly way, the underlying TCP connections remained in the established state on the backend (once I started looking I could see these in netstat). Thus the backend thought that the fileserver's initial SYN was part of that established connection instead of the start of a new one and answered it with a plain ACK for the old connection. Since the freshly booted fileserver knew nothing about the old connection, it replied with a RST, which cleared the stale state on the backend; the fileserver's retried SYN a few seconds later then got a normal SYN-ACK.
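
Here's the sketch mentioned above. The effect of Solaris's consistent port choice is as if the initiator explicitly bound each connection to a fixed local port, so the connection's 4-tuple is identical from boot to boot. All of the addresses and the local port number below are invented; 3260 is the standard iSCSI target port:

    # Sketch of the effect of Solaris's consistent ephemeral port choice: the
    # 4-tuple (local IP, local port, backend IP, backend port) is the same on
    # every boot, as if we had bound to a fixed local port ourselves.
    import socket

    FILESERVER_IP = "10.0.0.10"   # invented fileserver address
    BACKEND_IP = "10.0.0.20"      # invented backend address
    ISCSI_PORT = 3260             # standard iSCSI target port

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((FILESERVER_IP, 32800))      # the same 'ephemeral' port as last boot
    s.connect((BACKEND_IP, ISCSI_PORT))
    # If the backend still has an ESTABLISHED connection for this exact
    # 4-tuple, our SYN gets a bare ACK back, our kernel answers with a RST, and
    # connect() only finishes once the SYN is retransmitted a few seconds later.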

If a fileserver crashed or was abruptly powered off, of course none of its iSCSI connections were closed in an orderly way, the TCP connections all stayed established on the backends, and when the fileserver came back up every iSCSI connection was delayed by about three seconds for the RST and subsequent timeout (Solaris evidently doesn't try to make iSCSI connections in parallel). But it turned out that these stalls could happen during ordinary reboots too because of an additional factor.

In an orderly shutdown our version of Solaris appears to try to close down iSCSI connections before it unmounts ZFS filesystems and shuts down ZFS pools. However, the Solaris iSCSI initiator has a safety feature; it will not let you close the last connection to an iSCSI target that has a disk in use by something (you can only disconnect from completely unused disks) and this still applies during system shutdown. So our ZFS pools prevent orderly disconnects for all of the targets they use, leaving lingering TCP connections on the backends which then stall for three seconds each when the fileserver reboots.

(Again, once I was paying attention during a fileserver shutdown I could see messages about 'cannot log out from target whatever'. I'm pretty certain we'd seen them before but just ignored them as yet more shutdown-time noise.)

I call this a full-stack problem because of how it goes from the low-level details of TCP connection establishment all the way up to relatively high-level details of how Solaris shuts down various bits of the kernel while it's rebooting. Those high-level bits look completely unrelated to the low-level TCP connection stall until you trace the thread that runs all the way from the bottom to the top to tie them together.

This whole root cause explained several things:

  • when we rebuilt the fileserver it looked like the problem was gone, because of course we tested rebooting the rebuilt fileserver without any ZFS pools. With no pools it could always completely close down its iSCSI connections, leaving no lingering established TCP sessions on the backends to stall out during boot.
  • we hadn't noticed this problem during scheduled fileserver shutdowns because I believe we've always shut down the backends as well.
  • we noticed the problem on this particular fileserver because it's pretty much the only fileserver that's ever crashed or had to be rebooted outside of a planned downtime. We now believe that we'd see this problem on all of the fileservers if we rebooted them.

(In retrospect, testing the rebuilt fileserver without any ZFS pools was a subtle example of thinking we understood the problem. We didn't put any ZFS pools on it because we thought that the problem didn't have anything to do with ZFS, and the reason we thought that was that it happened before ZFS pools and filesystems were activated.)

Ironically, I might not have found the TCP-level issue if it hadn't been for a Wireshark glitch, but this entry is already long enough so that's a story for another time.

Sidebar: the hack workaround that's occurred to me

Solaris has a network tunable for the start of the ephemeral port range. One could write a shell script that runs sufficiently early in boot and randomizes this by a large enough amount that the ports used for iSCSI connections aren't likely to overlap with the ones used the last time the machine booted. With no ephemeral port overlap, the lingering old connections wouldn't matter; they could just time out on their own through whatever mechanism iSCSI has for it.
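
To make that concrete, here's an untested sketch of the idea, written in Python purely for illustration (in practice it would be a small shell script hung off an SMF service). It assumes the tunable in question is tcp_smallest_anon_port, set via ndd, and that it normally defaults to 32768; the size of the random offset is arbitrary:

    # Untested sketch: nudge the bottom of the ephemeral port range by a
    # random amount at early boot, so that this boot's iSCSI connections are
    # unlikely to land on the same local ports as the previous boot's.
    import random
    import subprocess

    # Assumption: the tunable is tcp_smallest_anon_port and defaults to 32768.
    base = 32768 + random.randrange(0, 8192)
    subprocess.check_call(["ndd", "-set", "/dev/tcp",
                           "tcp_smallest_anon_port", str(base)])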

Since this idea involves both wrestling with SMF and early boot hackery, I suspect that we are not going to come near it. The problem is generally not severe and this would probably be a case where the cure is worse than the disease.
