2014-10-31
With ZFS, rewriting a file in place might make you run out of space
Here's an interesting little issue that I confirmed recently: if you rewrite an existing file in place with random IO on a plain ZFS filesystem, you can wind up using extra space and even run out of space. This is a little bit surprising but is not a bug; it's just fallout from how ZFS works.
It's easy to see how this can happen if you have compression or deduplication turned on on the filesystem and you rewrite different data; the new data might compress or deduplicate less well than the old data and so use up more space. Deduplication might especially be prone to this if you initialize your file with something simple (zeroes, say) and then rewrite with actual data.
(The corollary to this is that continuously rewritten files like the storage for a database can take up a fluctuating amount of disk space over time on such a filesystem. This is one reason of several that we're unlikely to ever turn compression on on our fileservers.)
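As a rough sketch of how you might watch this on such a filesystem (the filesystem name here is hypothetical, and the logicalused property needs a reasonably modern ZFS):

  # turn compression on for a filesystem that gets rewritten a lot
  zfs set compression=lz4 tank/db

  # compare the logical (uncompressed) size with what's actually
  # allocated; the gap will drift around as the data gets rewritten
  zfs get used,logicalused,compressratio tank/db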
But this can happen even on filesystems without dedup or compression,
which is a little bit surprising. What's happening is the result of the
ZFS 'record size' (what many filesystems would call their block size).
ZFS has a variable record size, ranging from the minimum block size of
your disks up to the recordsize parameter, usually 128 KB. When you
write data, especially sequential data, ZFS will transparently aggregate
it together into large blocks; this makes both writes and reads more
efficient and so is a good thing.
So you start out by writing a big file sequentially, which aggregates things together into 128 KB on-disk blocks, puts pointers to those blocks into the file's metadata, and so on. Now you come back later and rewrite the file using, say, 8 KB random IO. Because ZFS is a copy on write filesystem, it can't overwrite the existing data in place. Instead every time you write over a chunk of an existing 128 KB block, the block winds up effectively fragmented and your new 8 KB chunk consumes some amount of extra space for extra block pointers and so on (and perhaps extra metaslab space due to fragmentation).
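If you want to see the effect for yourself, here's a sketch of the sort of experiment involved (the pool, filesystem, and sizes are all hypothetical, and compression and dedup are assumed to be off so the actual bytes written don't matter):

  # the filesystem's maximum block size, normally 128K
  zfs get recordsize tank/fs

  # write a 1 GB file sequentially so it lands in full 128 KB blocks
  dd if=/dev/zero of=/tank/fs/testfile bs=128k count=8192

  # now rewrite an 8 KB chunk somewhere in the middle of an existing
  # block; conv=notrunc keeps dd from truncating the file first
  dd if=/dev/zero of=/tank/fs/testfile bs=8k count=1 seek=1000 conv=notrunc

  # repeat with lots of different seek offsets and watch 'used' creep up
  zfs list -o name,used,available tank/fs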
To be honest, actually pushing a filesystem or a pool out of space requires you to be doing a lot of rewrites and to already be very close to the space limit. And if you hit the limit, it seems to not cause more than occasional 'out of space' errors for the rewrite IO; things will go to 0 bytes available but the rewrites will continue to mostly work (new write IO will fail, of course). Given comments I've seen in the code while looking into the extra space reservation in ZFS pools, I suspect that ZFS is usually estimating that an overwrite takes no extra space and so usually allowing it through. But I'm guessing at this point.
(The other thing I don't know is what such a partially updated block looks like on disk. Does the entire original 128 KB block get fully read, split and rewritten somehow, or is there something more clever going on? Decoding the kernel source will tell me if I can find and understand the right spot, but I'm not that curious at the moment.)
2014-10-26
Things that can happen when (and as) your ZFS pool fills up
There's a shortage of authoritative information on what actually happens if you fill up a ZFS pool, so here is what I've gathered about it from other people's information plus what I've experienced myself.
The most often cited problem is bad performance, with the usual cause being ZFS needing to do an increasing amount of searching through ZFS metaslab space maps to find free space. If not all of these are in memory, a write may require pulling some or all of them into memory, searching through them, and perhaps finding not enough space. People cite various fullness thresholds for this starting to happen, eg anywhere from 70% full to 90% full. I haven't seen any discussion about how severe this performance impact is supposed to be (and on what sort of vdevs; raidz vdevs may behave differently than mirror vdevs here).
(How many metaslabs you have turns out to depend on how your pool was created and grown.)
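If you're curious about the metaslab situation on a particular pool, zdb will show it to you (a sketch; 'tank' stands in for your pool name, and zdb's output is verbose and not a stable interface):

  # show each vdev's metaslabs and how much free space each one has
  zdb -m tank

  # -mm adds the detailed space map information for each metaslab
  zdb -mm tank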
A nearly full pool can also have (and lead to) fragmentation, where the free space is in small scattered chunks instead of large contiguous runs. This can lead to ZFS having to write 'gang blocks', which are a mechanism where ZFS fragments one large logical block into smaller chunks (see eg the mention of them in this entry and this discussion which corrects some bits). Gang blocks are apparently less efficient than regular writes, especially if there's a churn of creation and deletion of them, and they add extra space overhead (which can thus eat your remaining space faster than expected).
If a pool gets sufficiently full, you stop being able to change most filesystem properties; for example, to set or modify the mountpoint or change NFS exporting. In theory it's not supposed to be possible for user writes to fill up a pool that far. In practice all of our full pools here have resulted in being unable to make such property changes (which can be a real problem under some circumstances).
You are supposed to be able to remove files from a full pool (possibly barring snapshots), but we've also had reports from users that they couldn't do so and their deletion attempt failed with 'No space left on device' errors. I have not been able to reproduce this and the problem has always gone away on its own.
(This may be due to a known and recently fixed issue, Illumos bug #4950.)
I've never read reports of catastrophic NFS performance problems for all pools or total system lockup resulting from a full pool on an NFS fileserver. However both of these have happened to us. The terrible performance issue only happened on our old Solaris 10 update 8 fileservers; the total NFS stalls and then system lockups have now happened on both our old fileservers and our new OmniOS based fileservers.
(Actually let me correct that; I've seen one report of a full pool killing a modern system. In general, see all of the replies to my tweeted question.)
By the way: if you know of other issues with full or nearly full ZFS pools (or if you have additional information here in general), I'd love to know more. Please feel free to leave a comment or otherwise get in touch.
2014-10-25
The difference in available pool space between zfs list and zpool list
For a while I've noticed that 'zpool list' would report that our pools
had more available space than 'zfs list' did and I've vaguely wondered
about why. We recently had a very serious issue due to a pool filling
up, so suddenly I became very interested in the whole issue and did
some digging. It turns out that there are two sources of the difference
depending on how your vdevs are set up.
For raidz vdevs, the simple version is that 'zpool list' reports more
or less the raw disk space before the raidz overhead while 'zfs list'
applies the standard estimate that you expect (ie that N disks worth of
space will vanish for a raidz level of N). Given that raidz overhead is
variable in ZFS, it's easy to see why the two commands are behaving this
way.
In addition, in general ZFS reserves a certain amount of pool space for various reasons, for example so that you can remove files even when the pool is 'full' (since ZFS is a copy on write system, removing files requires some new space to record the changes). This space is sometimes called 'slop space'. According to the code this reservation is 1/32nd of the pool's size. In my actual experimentation on our OmniOS fileservers this appears to be roughly 1/64th of the pool and definitely not 1/32nd of it, and I don't know why we're seeing this difference.
(I found out all of this from a Ben Rockwood blog entry and then found the code in the current Illumos codebase to see what the current state was (or is).)
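You can see the difference for yourself by putting the two views side by side and doing a little arithmetic (a sketch; the pool name is hypothetical):

  # the raw, pool-level view (before raidz overhead and slop space)
  zpool list -o name,size,allocated,free tank

  # the filesystem-level view of what can actually be used
  zfs list -o name,used,available tank

  # 1/32nd and 1/64th of, say, a 10 TB pool, in bytes
  echo $((10 * 1024 * 1024 * 1024 * 1024 / 32))
  echo $((10 * 1024 * 1024 * 1024 * 1024 / 64))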
The actual situation with what operations can (or should) use what space
is complicated. Roughly speaking, user level writes and ZFS operations
like 'zfs create' and 'zfs snapshot' that make things should use the
1/32nd reserved space figure, file removes and 'neutral' ZFS operations
should be allowed to use half of the slop space (running the pool down
to 1/64th of its size), and some operations (like 'zfs destroy') have
no limit whatever and can theoretically run your pool permanently and
unrecoverably out of space.
The final authority is the Illumos kernel code and its comments. These
days it's on Github so I can just link to the two most relevant bits:
spa_misc.c's discussion of spa_slop_shift
and dsl_synctask.h's discussion of zfs_space_check_t.
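If you want to check what your own system is using, the fraction is controlled by the spa_slop_shift kernel variable (5 means 1/(2^5), ie 1/32nd); here's a sketch of peeking at it on a live machine:

  # print the current value of spa_slop_shift as a decimal number
  echo "spa_slop_shift/D" | mdb -k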
(What I'm seeing with our pools would make sense if everything was actually being classified as an 'allowed to use half of the slop space' operation. I haven't traced the Illumos kernel code at this level so I have no idea how this could be happening; the comments certainly suggest that it isn't supposed to be.)
(This is the kind of thing that I write down so I can find it later, even though it's theoretically out there on the Internet already. Re-finding things on the Internet can be a hard problem.)
2014-10-03
When using Illumos's lockstat, check the cumulative numbers too
Suppose, not entirely hypothetically, that you
have an Illumos (or OmniOS or etc) system that is experiencing
something that looks an awful lot like kernel contention; for
example, periodic 'mpstat 1' output where one CPU is spending
100% of its time in kernel code. Perhaps following Brendan Gregg's
Solaris USE method, you stumble
over lockstat and decide to give it a try. This is a fine thing,
as it's a very nice tool and can give you lots of fascinating output.
However, speaking from recent experience, I urge you to at some
point run lockstat with the -P option and check that output
too. I believe that lockstat normally sorts its output by count,
highest first; -P changes this to sort by total time (ie the count
times its displayed average time). The very important thing that
this does is it very prominently surfaces relatively rare but really
long things. In my case, I spent a bunch of time and effort looking
at quite frequent and kind of alarming looking adaptive mutex spins,
but when I looked at 'lockstat -P' I discovered a lock acquisition
that only had 30 instances over 60 seconds but that had an average
spin time (not block time) of 55 milliseconds.
(Similarly, when I looked at the adaptive mutex block times I discovered the same lock acquisition, this time blocked 37 times in 60 seconds with an average block time of 1.6 seconds.)
In theory you can spot these things when scanning through the full
lockstat output even without -P, but in practice humans don't
work that way; we scan the top of the list and then as everything
starts to dwindle away into sameness our eyes glaze over. You're
going to miss things, so let lockstat do the work for you to
surface them.
(If you specifically suspect long things you can use -d to only
report on them, but picking a useful -d value probably requires
some guesswork and looking at basic lockstat output.)
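To make this concrete, the sort of invocations I mean look like this (a sketch; pick your own measurement interval):

  # the usual view: lock contention events over 60 seconds, sorted
  # by how often each one happened
  lockstat sleep 60

  # the same data sorted by total time (count times average time),
  # showing only the top 20 entries of each kind; this is what
  # surfaces the rare but really long cases
  lockstat -P -D 20 sleep 60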
By the way, there turn out to be a bunch of interesting tricks you
can do with lockstat. I recommend reading all the way through the
EXAMPLES section and especially paying attention to the discussion
of why various flags get used in various situations. Unlike the usual
manpage examples, it only gets more interesting as it goes along.
(And if you need really custom tooling you can use the lockstat DTrace provider in your own DTrace scripts. I wound up doing that today as part of getting information on one of our problems.)
2014-09-17
In praise of Solaris's pfiles command
I'm sure that at one point I was introduced to pfiles through a
description that called it the Solaris version of lsof for a
single process. This is true as far as it goes and I'm certain that
I used pfiles as nothing more than this for a long time, but it
understates what pfiles can do for you. This is because pfiles
will give you a fair amount more information than lsof will, and
much of that information is useful stuff to know.
Like lsof, pfiles will generally report what a file descriptor
maps to (file, device, network connection, and Solaris IPC 'doors',
often with information about what process is on the other end of
the door). Unlike on some systems, the pfiles information is good
enough to let you track down who is on the other end of Unix domain
sockets and pipes. Socket endpoints are usually reported directly;
pipe information generally takes cross-correlating with other
processes to see who else has an S_IFIFO with the specific ino
open.
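The cross-correlation itself is usually nothing more than a grep through everyone's pfiles output (a sketch; the PIDs and the ino number here are made up):

  # look at the process you care about and note the ino of the
  # S_IFIFO entry for the pipe in question
  pfiles 1234

  # then hunt for the same ino in other likely processes
  for pid in 2345 3456 4567; do
      echo "== $pid"
      pfiles $pid | grep 98765
  done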
(You would think that getting information on the destination of Unix domain sockets would be basic information, but on some systems it can take terrible hacks.)
Pfiles will also report some state information for sockets, like
the socket flags and the send and receive buffers. Personally
I don't find this deeply useful and I wish that pfiles also showed
things like the TCP window and ACK state. Fortunately you can get
this protocol information with 'netstat -f inet -P tcp' or 'netstat
-v -f inet -P tcp' (if you want lots of details).
Going beyond this lsof-like information, pfiles will also report
various fcntl() and open() flags for the file descriptor. This
will give you basic information like the FD's read/write status,
but it goes beyond this; for example, you can immediately see whether
or not a process has its sockets open in non-blocking mode (which
can be important). This is
often stuff that is not reported by other tools and having it handy
can save you from needing deep dives with DTrace, a debugger, or
the program source code.
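For instance, a quick and dirty check for non-blocking file descriptors looks something like this (the process name is hypothetical, and in practice you'll probably want to read the full pfiles output to see which descriptors the flags belong to):

  # O_NONBLOCK shows up in the per-descriptor flags lines
  pfiles $(pgrep -x amandad) | grep -i nonblock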
(I'm sensitive to several of these issues because my recent Amanda
troubleshooting left me needing to chart out the flow of pipes and
to know whether some sockets were nonblocking or not. I could also
have done with information on TCP window sizes at the time, but I
didn't find the netstat stuff until just now. That's how it goes
sometimes.)
2014-09-05
A DTrace script to help figure out what process IO is slow
I recently made public a dtrace script I wrote, which gives you per file descriptor IO breakdowns for a particular process. I think it's both an interesting, useful tool and probably not quite the right approach to diagnose this sort of problem, so I want to talk about both the problem and what it tells you. To start with, the problem.
Suppose, not entirely hypothetically, that you have a relatively complex multi-process setup with data flowing between the various processes and the whole thing is (too) slow. Somewhere in the whole assemblage is a bottleneck. Basic monitoring tools for things like disk IO and network bandwidth will give you aggregate status over the entire assemblage, but they can only point out the obvious bottlenecks (total disk IO, total network bandwidth, etc). What we'd like to do here is peer inside the multi-process assemblage to see which data flows are fast and which are slow. This per-data-flow breakdown is why the script shows IO on a per file descriptor basis.
What the DTrace script's output looks like is this:
  s fd   7w: 10 MB/s  waiting ms: 241 / 1000  ( 10 KB avg * 955)
  p fd   8r: 10 MB/s  waiting ms:  39 / 1000  ( 10 KB avg * 955)
  s fd  11w:  0 MB/s  waiting ms:   0 / 1000  (  5 KB avg * 2)
  p fd  17r:  0 MB/s  waiting ms:   0 / 1000  (  5 KB avg * 2)
  s fd  19w: 12 MB/s  waiting ms: 354 / 1000  ( 10 KB avg * 1206)
  p fd  21r: 12 MB/s  waiting ms:  43 / 1000  ( 10 KB avg * 1206)
    fd 999r: 22 MB/s  waiting ms:  83 / 1000  ( 10 KB avg * 2164)
    fd 999w: 22 MB/s  waiting ms: 595 / 1000  ( 10 KB avg * 2164)
  IO waits:  read: 83 ms  write: 595 ms  total: 679 ms
(This is a per-second figure averaged over ten seconds and file
descriptor 999 is for the total read and write activity. pfiles
can be used to tell what each file descriptor is connected to if
you don't already know.)
Right away we can tell a fair amount about what this process is
doing; it's clearly copying two streams of data from inputs to
outputs (with a third one not doing much). It's also spending much
more of its IO wait time writing the data rather than waiting for
there to be more input, although the picture here is misleading
because it's also making pollsys() calls and I wasn't tracking
the time spent waiting in those (or the time spent in other syscalls).
(The limited measurement is partly an artifact of what I needed to diagnose our problem.)
What I'm not sure about this DTrace script is if it's the most useful and informative way to peer into this problem. Its output points straight to network writes being the bottleneck (for reasons that I don't know) but that discovery seems indirect and kind of happenstance, visible only because I decided to track how long IO on each file descriptor took. In particular it feels like there are things I ought to be measuring here that would give me more useful and pointed information, but I can't think of what else to measure. It's as if I'm not asking quite the right questions.
(I've looked at Brendan Gregg's Off-CPU Analysis; an off-cpu flamegraph analysis actually kind of pointed in the direction of network writes too, but it was hard to interpret and get too much from. Wanting some degree of confirmation and visibility into this led me to write fdrwmon.d.)
2014-08-22
Where DTrace aggregates are handled for printing et al
DTrace, the system, is split into a kernel component and a user
level component (the most obvious piece of which is the dtrace
command). However the DTrace documentation has very little discussion
of what features are handled where. You might reasonably ask why
we care; the answer is that anything done at user level can easily
be made more sophisticated while things done at kernel level must
be minimal and carefully safe. Which brings us around to DTrace
aggregates.
For a long time I've believed that DTrace
aggregates had to be mostly manipulated at the user level. The
sensible design was for the kernel to ship most or all of the
aggregate to user level with memcpy() into a buffer that user
space had set up, then let user level handle, for example, printa().
However I haven't known for sure. Well, now I do. DTrace aggregate
normalization and printing is handled at user level.
This means that D (the DTrace language) could have a lot of very useful features if it wanted to. The obvious one is that you could set the sort order for aggregates on a per-aggregate basis. With a bit more work DTrace could support, say, multi-aggregate aware truncation (dealing with one of the issues mentioned back here). If we go further, there's nothing preventing D from allowing much more sophisticated access to aggregates (including explicit key lookup in them for printing things and so on), something that would really come in handy in any number of situations.
(I don't expect this to ever happen for reasons beyond the scope of this entry. I expect that the official answer is 'D is low level, if you need sophisticated processing just dump output and postprocess in a script'. One of the reasons that this is a bad idea is that it puts a very sharp cliff in your way at a certain point in D sophistication. Another reason is that it invites you to play in the Turing tarpit of D.)
Sidebar: today's Turing Tarpit D moment
This is a simplified version.
syscall::read:return, syscall::write:return
/ ... /
{
    this->dirmarker = (probefunc == "read") ? 0 : 1;
    this->dir = (this->dirmarker == 0) ? "r" : "w";
    /* inflate the fd and tuck the r/w marker into its low digits so
       that reads always sort just ahead of writes for the same fd */
    @fds[this->dir, self->fd] = avg(self->fd * 10000 + this->dirmarker);
    ....
}

tick-10sec
{
    /* strip the inflation back off before printing */
    normalize(@fds, 10000);
    printa("fd %@2d%s: ....\n", @fds, @....);
}
If a given file descriptor had both read and write IO, I wanted the
read version to always come before the write version instead of
potentially flip-flopping back and forth randomly. So I artificially
inflate the fd number, add in a little discriminant in the low
digits to make it sort right, and then normalize away the inflation
afterwards. I have to normalize away the inflation because the value
of the aggregation is what gets used in printa(), which means that
the actual FD number has to come from it rather than from its spot
in the key tuple.
Let me be clear here: this may be clever, but it's clearly a Turing tarpit. I've spent quite a lot of time figuring out how to abuse D features in order to avoid the extra pain of a post-processing script and I'm far from convinced that this actually was a good use of my time once the dust settled.
2014-08-03
Our second generation ZFS fileservers and their setup
We are finally in the process of really migrating to the second generation of our ZFS fileserver setup, so it seems like a good time to write up all of the elements in one place. Our fundamental architecture remains unchanged. That architecture is NFS servers that export filesystems from ZFS pools to our client machines (which are mostly Ubuntu). The ZFS pools are made up of mirrored pairs, where each side of a mirror comes from a separate iSCSI backend. The fileservers and iSCSI backends are interconnected over two separate 'networks', which are actually single switches.
The actual hardware involved is unchanged from our basically finalized hardware; both fileservers and backends are SuperMicro motherboards with 2x 10G-T onboard in SuperMicro 16+2 drive bay cases. The iSCSI networks run over the motherboard 10G-T ports, and the fileservers also have a dual Intel 10G-T card for their primary network connection so we can do 10G NFS to them. Standard backends have 14 2TB WD SE drives for iSCSI (the remaining two data slots may someday be used for ZFS ZIL SSDs). One set of two backends (and a fileserver) is for special SSD based pools so they have some number of SSDs instead.
On the fileservers, we're running OmniOS (currently r151010j) in an overall setup that is essentially identical to our old S10U8 fileservers (including our hand rolled spares system). On the iSCSI backends we're running CentOS 7 after deciding that we didn't like Ubuntu 14.04. Although CentOS 7 comes with its own iSCSI target software, we decided to carry on using IET, the same software we use on our old backends; there just didn't seem to be any compelling reason to switch.
As before, we have sliced up the 2TB data disks into standard sized chunks. We decided to make our lives simple and have only four chunks on each 2TB disk, which means that they're about twice as big as our old chunk size. The ZFS 4K sector disk problem means that we have to create new pools and migrate all data anyways, so this difference in chunk size between the old and the new fileservers doesn't cause us any particular problems.
Also as before, each fileserver is using a different set of two backends to draw its disks from; we don't have or plan any cases where two fileservers use disks from the same backend. This assignment is just a convention, as all fileservers can see all backends and we're not attempting to do any sort of storage fencing; even though we're still not planning any failover, it still feels like too much complexity and too many potential problems.
In the end we went for a full scale replacement of our existing environment: three production fileservers and six production backends with HDs, one production fileserver and two production backends with SSDs, one hot spare fileserver and backend, and a testing environment of one fileserver and two fully configured backends. To save you the math, that's six fileservers, eleven backends, and 126 2TB WD Se disks. We also have three 10G-T switches (plus a fourth as a spare), two for the iSCSI networks and the third as our new top level 10G switch on our main machine room network.
(In the long run we expect to add some number of L2ARC SSDs to the fileservers and some number of ZFS ZIL SSDs to the backends, but we haven't even started any experimentation with this to see how we want to do it and how much benefit it might give us. Our first priority has been building out the basic core fileserver and backend setup. We definitely plan to add an L2ARC for one pool, though.)
2014-07-25
The OmniOS version of SSH is kind of slow for bulk transfers
If you look at the manpage and so on, it's sort of obvious that the Illumos and thus OmniOS version of SSH is rather behind the times; Sun branched from OpenSSH years ago to add some features they felt were important and it has not really been resynchronized since then. It (and before it the Solaris version) also has transfer speeds that are kind of slow due to the overhead of its SSH ciphers and MACs. I tested this years ago (I believe close to the beginning of our ZFS fileservers), but today I wound up retesting it to see if anything had changed from the relatively early days of Solaris 10.
My simple tests today were on essentially identical hardware (our new fileserver hardware) running OmniOS r151010j and CentOS 7. Because I was doing loopback tests with the server itself for simplicity, I had to restrict my OmniOS tests to the ciphers that the OmniOS SSH server is configured to accept by default; at the moment that is aes128-ctr, aes192-ctr, aes256-ctr, arcfour128, arcfour256, and arcfour. Out of this list, the AES ciphers run from 42 MBytes/sec down to 32 MBytes/sec while the arcfour ciphers mostly run around 126 MBytes/sec (with hmac-md5) to 130 Mbytes/sec (with hmac-sha1).
(OmniOS unfortunately doesn't have any of the umac-* MACs that I found to be significantly faster.)
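The loopback tests themselves are nothing sophisticated; they amount to pushing a fixed amount of data through ssh to localhost with the cipher and MAC forced, something like this sketch (assuming a shell where 'time' covers the whole pipeline):

  # push 1 GB of zeroes through ssh loopback with a specific cipher
  # and MAC, then divide 1024 MB by the elapsed time to get MB/sec
  time dd if=/dev/zero bs=1024k count=1024 | \
      ssh -c aes128-ctr -m hmac-sha1 localhost 'cat > /dev/null'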
This is actually an important result because aes128-ctr is the default cipher for clients on OmniOS. In other words, the default SSH setup on OmniOS is about a third of the speed that it could be. This could be very important if you're planning to do bulk data transfers over SSH (perhaps to migrate ZFS filesystems from old fileservers to new ones).
The good news is that this is faster than 1G Ethernet; the bad news is that this is not very impressive compared to what Linux can get on the same hardware. We can make two comparisons here to show how slow OmniOS is compared to Linux. First, on Linux the best result on the OmniOS ciphers and MACs is aes128-ctr with hmac-sha1 at 180 Mbytes/sec (aes128-ctr with hmac-md5 is around 175 MBytes/sec), and even the arcfour ciphers run about 5 Mbytes/sec faster than on OmniOS. If we open this up to the more extensive set of Linux ciphers and MACs, the champion is aes128-ctr with umac-64-etm at around 335 MBytes/sec and all of the aes GCM variants come in with impressive performances of 250 Mbytes/sec and up (umac-64-etm improves things a bit here but not as much as it does for aes128-ctr).
(I believe that one reason Linux is much faster on the AES ciphers is that the version of OpenSSH that Linux uses has tuned assembly for AES and possibly uses Intel's AES instructions.)
In summary, through a combination of missing optimizations and missing ciphers and MACs, OmniOS's normal version of OpenSSH is leaving more than half the performance it could be getting on the table.
(The 'good' news for us is that we are doing all transfers from our old fileservers over 1G Ethernet, so OmniOS's ssh speeds are not going to be the limiting factor. The bad news is that our old fileservers have significantly slower CPUs and as a result max out at about 55 Mbytes/sec with arcfour (and interestingly, hmac-md5 is better than hmac-sha1 on them).)
PS: If I thought that network performance was more of a limit than
disk performance for our ZFS transfers from old fileservers to the
new ones, I would investigate shuffling the data across the network
without using SSH. I currently haven't seen any sign that this is
the case; our 'zfs send | zfs recv' runs have all been slower
than this. Still, it's an option that I may experiment with (and
who knows, a slow network transfer may have been having knock-on
effects).
2014-07-02
Why Solaris's SMF is not a good init system
An init system has two jobs: running and supervising services, and managing what it runs and supervises. SMF is a perfectly good init system as far as the former goes, and better than some (eg the traditional System V system). It is the second job where SMF falls down terribly because it's decidedly complex, opaque, fragile, and hard to manipulate. The result is a very divided experience where as long as you don't have to do anything to SMF it's a fine init system but the moment you do everything becomes immensely frustrating.
Here is an illustration of how complex, opaque, and fragile SMF is.
The following is a script of commands that must be fed to svccfg
as one block of actions in order to make two changes to our OmniOS
systems: to start syseventd only after filesystem/local (it normally
starts earlier), and to start ssh after filesystem/minimal (ie very
early in the boot process, so if things go wrong we have system
access).
  select svc:/system/filesystem/local
  delpg npiv-filesystem
  select svc:/milestone/devices
  delpg devices
  select svc:/milestone/single-user
  delpg syseventd_single-user
  select svc:/system/sysevent:default
  addpg filesystems-dep dependency
  setprop filesystems-dep/grouping = astring: "require_all"
  setprop filesystems-dep/restart_on = astring: "none"
  setprop filesystems-dep/type = astring: "service"
  setprop filesystems-dep/entities = fmri: "svc:/system/filesystem/local:default"
  select svc:/network/ssh
  setprop fs-local/entities = fmri: "svc:/system/filesystem/minimal"
  delpg fs-autofs
  end
There are two obvious things about this sequence of commands, namely that there are quite a lot of them and they are probably pretty opaque if you're not familiar with SMF. But there are several other things that are worthy of mention. The first is that it is actually fairly difficult to discover and work out what these commands should be and need to be; I had multiple false steps and missteps during the process. Many of the names involved in this process are arbitrary, ie up to the individual services to decide on and as you can see many of the services have chosen different names. These names are of course not documented and thus presumably not officially supported.
(Nor do the OmniOS SMF manpages discuss how your changes interact with, say, applying a package update for the packages you've manipulated.)
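For what it's worth, the tools you wind up leaning on to dig out the current names and dependencies before making changes like this are svcs, svccfg, and svcprop; a sketch:

  # what does sysevent depend on, and what depends on it?
  svcs -d svc:/system/sysevent:default
  svcs -D svc:/system/sysevent:default

  # list a service's property groups; the dependency ones show up
  # with a type of 'dependency', which is how you find their
  # arbitrary names
  svccfg -s svc:/network/ssh listpg

  # then inspect one of those property groups in detail
  svcprop -p fs-local svc:/network/ssh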
The next inobvious thing is that if you get some of these wrong, SMF will thoroughly blow up your system by, for example, detecting a dependency cycle and then refusing to have any part of it instead of trying some sort of fallback cycle-breaking in order to allow your system to boot to some extent. Nor does SMF prevent you from creating a dependency cycle by (for example) refusing to commit a service change that would set up such a cycle; instead it just tells you that you've made one somehow. This is why I call managing SMF a fragile thing.
Oh, and the third inobvious thing is that there are several ways to do what I've done above, all of them probably roughly equivalent. At least one thing I've done is a convenient hack instead of what would be the theoretically 'correct' way; I've done the hack because the theoretically correct way is too much of a pain in the rear. That by itself is a glaring problem indicator, as doing the correct thing should be the easiest approach.
(The hack is that instead of deleting ssh's property group for its dependency on filesystem/local and creating a new property group for a new dependency on filesystem/minimal, I have instead rewritten the specific service that 'fs-local' depends on and thus its name is now kind of a lie. But this change is one line instead of six lines, making it an attractive hack.)
To be manageable, an init system needs to be clear, well documented, and easy to use. You should be able to easily discover what properties a service has, what properties it can have, how these affect its operation, and so on. It should be obvious or at least well documented how to change service start order (because this is a not uncommon need). For dependency-based init systems without a strict ordering, it should be easy to discover what depends on what (including transitively) and either impossible or as harmless as possible to create dependency cycles. It should not require a major obscure bureaucracy to change things, nor hunts through the Internet and Stack Overflow to work out how to do things.
SMF is not a success at any of these, especially being easy to use (about the only thing that is simple in SMF is simply disabling and enabling services). That is why I say that it is not a good init system. If I had to describe it in a nutshell, I would say that SMF is a perfect illustration of what Fred Brooks calls the second system effect. People at Sun clearly wanted to make a better init system that fixed all of the problems people had ever had with System V init, but what they put together is utterly over-engineered and complex and opaque.
(I also have to mention that SMF falls down badly on the small matter of managing serial port logins. Doing this in SMF is so complicated that no one I've talked to can tell me how to successfully enable logins on a particular serial port. Really. This is yet another sign that something is terribly wrong in how SMF is configured and manipulated, even if it's perfectly fine at starting and restarting services once you can configure them.)