2007-08-28
A gotcha with Solaris Volume Manager metasets
Here is something that I have learned the hard way: at least on Solaris 10 x86, you cannot create a metaset unless you have a local metadb. It can be completely empty, but you have to have one to start with, even if you are never going to create any local meta-disks and the like.
This is a non-trivial issue because you have to put that local metadb on a disk partition, and a default Solaris install doesn't create any spare ones. The result is that you can install a system, decide you want to use SVM metasets, and have to reinstall just to get a spare partition that you will never actually use for anything.
(Don't even think about using that tempting slice 8. I made that mistake and in the end blew up my test system, which was very educational in a certain sort of way, seeing as I now know a lot more about recovering from various sorts of Solaris boot problems and about things that you should never ever do.)
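As a concrete illustration, here is roughly the sequence that works, assuming you have carved out a spare slice for the metadb (c0t0d0s7 here, purely as an example) and that 'thishost' and 'otherhost' are the machines that will share the metaset; all of the names are hypothetical:

    # put local metadb replicas on the otherwise unused spare slice;
    # -f is needed because these are the first replicas on the machine
    metadb -a -f -c 2 c0t0d0s7

    # only now does creating a metaset actually work
    metaset -s testset -a -h thishost otherhost
    # add a (whole) disk to the new metaset
    metaset -s testset -a c2t0d0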
The limitation may be an artificial one that can be bypassed with enough cleverness. The metaset error messages talk about a service not running; I suspect that this is because the service starter sees that there is no local metadb configured and assumes that SVM is not in use on the system. If you could persuade the necessary service to start anyway, things might start working; unfortunately, the new Solaris 10 SMF machinery (svcs and friends) is so convoluted that it's hard to see what actually gets run when the metainit service starts.
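For what it's worth, if you want to poke at this yourself, something like the following is where I would start; the SVM-related service names here are from memory and may not be exactly right on your system:

    # see what SVM-related services SMF knows about and their state
    svcs -a | grep -i meta

    # see what actually gets run when a given service starts
    svcprop -p start/exec svc:/system/metainit:default
    svcs -l svc:/network/rpc/meta:default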
The overall lesson I have learned from this is to always configure a spare partition on any Solaris 10 machine. 32 to 64 MB will do, which is easy to fit in given the size of modern disks.
2007-08-01
A ZFS-based fileserver design
The following is the ZFS-based design we would like to use for our new fileserver environment, presented for your entertainment and whatever use you can get from it.
The basic thing we give people is a 'storage pool', which is made up of one or more standard-sized 'bricks' of storage. Each storage pool contains one or more filesystems and is owned by a group (or a single person).
(Here I am using 'filesystem' to mean 'distinct mount point' or 'different name', which is really what users see when things get NFS exported to our actual user servers.)
Mechanically, each storage pool is a ZFS storage pool and each brick is a logical drive (or a slice of a logical drive) from a backend SAN controller. Because of ZFS's long-term storage management issues, the SAN backend has to handle all of the RAID stuff; ZFS's own RAID support is used only for storage migration and for highly available storage pools, which would be mirrored between several SAN controllers.
(This turns ZFS into a more featureful Solaris Volume Manager, which is kind of a pity.)
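To make this concrete, here is roughly what creating the two sorts of storage pools looks like; the pool names and the SAN LUN device names are made up for illustration:

    # an ordinary storage pool: one brick, no ZFS-level redundancy,
    # because the SAN controller is doing the RAID work
    zpool create grppool c4t600A0B800012D0A4d0

    # a highly available storage pool: a brick mirrored across
    # LUNs from two different SAN controllers
    zpool create hapool mirror c4t600A0B800012E1B2d0 c5t600A0B8000339ABCd0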
You expand your available storage by getting another brick, which can either be added to an existing storage pool or be used to start a new one. If you don't have an existing storage pool, you have to start one; you can't buy a brick, add it to an existing group storage pool, and still reserve it exclusively for your own use.
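In ZFS terms this is just a zpool add (or another zpool create); again, the names are hypothetical:

    # grow an existing storage pool with the new brick
    zpool add grppool c4t600A0B80004455AAd0

    # or use the new brick to start a separate pool instead
    zpool create newpool c4t600A0B80004455AAd0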
Groups can add new filesystems any time they want to; they just tell us which of their storage pools the new filesystem should go into. However, filesystems don't move between storage pools once they're created. Groups can also tell us to remove filesystems, although each storage pool always has to have at least one filesystem.
(Technically we can move filesystems between storage pools, but it involves manual data copies and forced NFS remounts and user-visible downtimes and so on, and we don't want to do it very often.)
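The mechanics on our side are about as simple as ZFS administration gets; here is a sketch with hypothetical filesystem names and mount points:

    # add a new filesystem to a group's storage pool
    zfs create grppool/projects
    zfs set mountpoint=/export/grppool/projects grppool/projects

    # remove one later (as long as it is not the pool's last filesystem)
    zfs destroy grppool/scratch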
The advantage of having a big storage pool with multiple filesystems is that a group does not have to decide ahead of time how much space they want in each different filesystem; they can let them expand (and contract) as needed. The drawback of piling everything into one storage pool is that if a group gets grant funding and buys an entire SAN backend controller to get more storage, they can only mirror or transfer entire storage pools to their new space; they can't make that decision on a filesystem-by-filesystem basis. (They can expand existing storage pools by adding bricks from their new controller, but then those storage pools and their filesystems depend on their new controller as well as ours.)
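Mechanically, mirroring an entire existing pool onto a group's new controller would be a zpool attach per brick, something like this sketch (device names made up):

    # pair each existing brick with a LUN from the group's new controller
    zpool attach grppool c4t600A0B800012D0A4d0 c6t600A0B90007777EEd0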
Storage pools can never shrink (at least until ZFS adds that feature). This is not too much of a problem, since we don't currently buy storage back from people. (If they need a bunch of space only temporarily, we can create and then later destroy an entire storage pool.)
There will be some maximum size for storage pools, probably somewhere around 2 TB, so that a single storage pool can't eat too much of a single SAN RAID controller's disk space. There is no size limit for filesystems, except that if you want them to be backed up they can only be so big (probably around 200 GB; it's based on how big our backup tapes are).
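The filesystem backup limit would most likely be enforced with ZFS quotas, along these lines (the 200 GB figure is ours; the filesystem name is made up):

    # keep a filesystem small enough to fit on our backup tapes
    zfs set quota=200g grppool/projects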