2011-12-10
ZFS pool activation and iSCSI (part II)
Here's an interesting question that should have occurred to me much earlier:
Why is a bug about boot-time ordering between iSCSI disk discovery and ZFS pool activation fixed with a kernel patch?
I don't have a sure answer to this; the best I can do is a theory. But before we get there, let's talk about how ZFS pool activation seems to connect up with the Solaris iSCSI initiator.
At the SMF level, the iSCSI initiator is svc:/network/iscsi/initiator.
Nothing explicitly depends on it (at least according to 'svcs
-D'). Despite this, on our S10U8 machines it finishes starting
immediately before svc:/system/filesystem/local does (which is
exactly what you want, since the latter SMF service seems to be
what starts ZFS pools). Exactly why SMF uses
or enforces this order is opaque to me. For that matter it's not
clear if SMF itself is enforcing the order; because SMF only shows
the order by end time, not by start time, it's possible that the
start order is different than the finish order.
(A great deal of SMF is opaque, annoying, or both.)
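If you want to poke at this yourself, svcs can show both directions of
the dependency graph and the state-change times of services. Here's a
sketch of the commands I mean (column names and output details may
vary a bit between Solaris versions):

    # services that depend on the iSCSI initiator (nothing, in our case)
    svcs -D svc:/network/iscsi/initiator
    # services that the initiator itself depends on
    svcs -d svc:/network/iscsi/initiator
    # STIME is when each service reached its current state, ie roughly
    # when it finished starting; svcs doesn't show when it started.
    svcs -o stime,state,fmri svc:/network/iscsi/initiator \
        svc:/system/filesystem/local

This is exactly the limitation mentioned above: you only get to see
finish times, not start times.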
Now it's time for theorizing.
If we take SMF at its word, there is no explicit ordering dependency
in SMF. Any ordering we get is either a lucky coincidence or enforced
by something else, and I don't believe it's a lucky coincidence. The
obvious candidate to enforce an ordering is the kernel, since it
handles both the iSCSI and ZFS parts of all of this. It would make a
kind of sense if the kernel delayed ZFS pool activation until iSCSI
discovery had finished; it's very analogous to how kernels often delay
things for SCSI disk discovery. Given that iSCSI disk discovery can be
quite protracted, it would also make sense if at some point a clever
Sun kernel developer broke that absolute dependency so that the boot
could still proceed even if iSCSI discovery was taking ages; such a
dependency break would match the symptoms we saw here, where 'zfs mount
-a' ran after iSCSI discovery had started but before it had finished.
The fix for this kernel dependency issue would of course be another
kernel change.
(Since Oracle no longer updates the OpenSolaris source code it's impossible to verify this theory. Besides, my patience for spelunking Solaris kernel code is pretty close to being exhausted.)
2011-12-07
Understanding the Solaris iSCSI initiator (a bit)
If you're an innocent person (like I used to be), the Solaris iSCSI
initiator appears to work much like it does on other Unixes. You have an
administrative command (iscsiadm), a system daemon (iscsid, which
is what the SMF service svc:/network/iscsi/initiator starts), and a
kernel component that presumably turns iSCSI connections into SCSI
disks. Unfortunately this view of Solaris is highly misleading or,
as I should actually admit, wrong.
Because of the complexity involved, most systems split the iSCSI initiator into two pieces: a user-level system daemon that does target discovery, iSCSI login, and session initiation, and a kernel component that takes established iSCSI sessions and does iSCSI IO with them. Solaris does not work this way.
In Solaris, the entire iSCSI protocol stack is in the kernel,
including all target discovery. Yes, this includes the extra protocols
used for finding targets (iSNS and SendTargets). That tempting-looking
iscsid daemon actually only has two little jobs: it tells the kernel
to start up the iSCSI initiator (and keep it running) and it does
hostname lookups for the kernel. Oh, and it tries to avoid reporting
'service ready' to SMF until the kernel seems to have completed iSCSI
discovery or discovery has stalled out.
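Confusingly, none of this changes what you type as an administrator;
the iscsiadm commands look just the way you'd expect, it's just that
(as far as I can tell) they wind up telling the kernel what to do
rather than talking to iscsid. A sketch of setting up SendTargets
discovery (the target address here is made up):

    # point the initiator at a target portal and turn on SendTargets
    iscsiadm add discovery-address 192.168.1.10:3260
    iscsiadm modify discovery --sendtargets enable
    # see what discovery methods are enabled and what targets turned up
    iscsiadm list discovery
    iscsiadm list target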
(iscsid does not even read and write the iSCSI initiator configuration
database in /etc/iscsi; the kernel does it directly. By the way, the
database is stored as a serialized nvlist (of
course). Normally there are two copies, the current database and the
previous database.)
None of this is documented, of course, or at best it's only documented if you read carefully between the lines in the way that the Solaris people want you to.
PS: according to comments in the OpenSolaris iscsid code, the hostname
lookup is incomplete. iscsid only returns to the kernel a single IP
address for a hostname, regardless of how many the host has; it picks
the first one that the underlying library call returns.
Sidebar: when iscsid reports things to SMF
Because I was just looking at this in the source code and we may
need it sometime: first, if the kernel reports that all forms of
iSCSI target discovery have completed, service startup is obviously
done. Failing that, iscsid gives up and declares 'service started' once
it's been 60 seconds without any new LUNs being discovered. As long as
you discover at least one LUN every minute, SMF will keep waiting for
svc:/network/iscsi/initiator to complete.
(What effect this has on the rest of the system is unclear, since nothing depends on the iSCSI initiator in SMF from what I can see.)
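If you want to watch this from the outside, SMF marks a service whose
start method is still running with a '*' after its state, so something
like the following should show whether iscsid is still sitting there
waiting on discovery (a sketch; I haven't timed it against a genuinely
slow set of targets):

    svcs svc:/network/iscsi/initiator
    # while iscsid's start method is still waiting, the STATE column
    # should read something like 'offline*'; once iscsid reports ready
    # to SMF, it changes to 'online'.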
2011-12-06
What I know about boot time ZFS pool activation (part I)
In response to my entry on the boot time ZFS and iSCSI sequencing bug, a commentator asked if SMF dependencies could be used to work around the issue. As it happens, this is not a simple question to answer because how ZFS pools are activated at boot time is at best an obscure thing (at least as far as I can tell). Here's what I think is going on, which has to come with a lot of disclaimers.
ZFS pool information for pools that will be imported during boot
is in /etc/zfs/zpool.cache; this is a serialized nvlist of pool information. zpool.cache is read in by
the kernel very early during boot; as far as I can disentangle the
OpenSolaris code, it's loaded when the ZFS module is first loaded (or
as the root filesystem is being brought up, if the root filesystem is a
ZFS one). However this doesn't seem to actually activate the ZFS pools,
just set up the (potential) pool configuration in the kernel.
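As far as I know you can look at what's in zpool.cache with zdb, which
prints the deserialized nvlists. A sketch (zdb is officially
undocumented and its behaviour shifts between releases; 'tank' here is
just a stand-in pool name):

    # dump the cached pool configurations from /etc/zfs/zpool.cache
    zdb
    # or just one pool's cached configuration
    zdb -C tank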
(ZFS pool activation is, or at least seems to be, when the kernel tries to find all of the pool's devices and either finds enough of them to start the pool up or marks it as failed. Thus ZFS pool activation is the point at which all devices need to have been brought up.)
It's not clear to me when and how ZFS pools are actually activated. At
a low level pools seem to be activated on demand when they are
looked at. However there is no high level SMF service that says
'activate ZFS pools'; instead, they seem to get activated as a side
effect of other SMF services. I suspect that the primary path to
ZFS pool activation is the 'zfs mount -a' that is done in the SMF
svc:/system/filesystem/local service (this is what prints the
'Reading ZFS config:' message that you see during Solaris boot).
There is also some special magic for activating ZFS swap volumes
(exactly where the magic is depends on which Solaris 10 update you're
on), which may activate pools that have swap volumes.
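You can see where this happens on your own system by finding the start
method for svc:/system/filesystem/local and looking for the zfs
invocation in it. A sketch (the method script's exact path and contents
vary between Solaris 10 updates):

    # find the start method script for the service
    svcprop -p start/exec svc:/system/filesystem/local
    # this typically points at something like /lib/svc/method/fs-local,
    # which is where the 'zfs mount -a' call lives
    grep -n zfs /lib/svc/method/fs-local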
How iSCSI comes into this picture is sufficiently complicated that it needs another entry.