2008-05-20
Getting live network bandwidth numbers on Solaris
After I wrote netvolmon for Linux,
I started getting curious about how much bandwidth our current Solaris
NFS servers were using. Unfortunately, Solaris's version of ifconfig
does not report byte counts; fortunately, the kernel does keep this
information and you can dig it out with kstat (information courtesy
of here, which has a
bunch of more sophisticated programs to report on this stuff).
The magic kstat incantation is 'kstat -p "*:*:<DEV>:*bytes64"',
which gets you the obytes64 and rbytes64 counters for the device; this
works on at least Solaris 8 and Solaris 10. (In this, Solaris does Linux
one better: 32-bit Linux machines use 32-bit network counters, and on a
saturated gigabit link those can roll over in roughly half a minute.)
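For illustration, running the incantation against a single interface produces something like this (bge0 and the counter values here are made up; the real format is module:instance:name:statistic, a tab, and the value):
kstat -p '*:*:bge0:*bytes64'
bge:0:bge0:obytes64	2154261025
bge:0:bge0:rbytes64	9419762530
Since the value is always the second field, awk '{print $2}' is all it takes to pull the numbers out, which is what the script below does.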
Armed with this we can write the obvious Solaris version of netvolmon:
#!/bin/sh
# usage: netvolmon DEV [INTERVAL]
DEV=$1
IVAL=${2:-5}

# print the device's two 64-bit byte counters, one value per line
getrxtx() {
    kstat -p "*:*:$1:*bytes64" |
        awk '{print $2}'
}

rxtx=`getrxtx $DEV`
while sleep $IVAL; do
    nrxtx=`getrxtx $DEV`
    # hand the interval plus the old and new counters to awk,
    # which turns the two deltas into MB/s figures
    (echo $IVAL $rxtx $nrxtx) |
        awk 'BEGIN {
            msg = "%6.2f MB/s RX %6.2f MB/s TX\n"}
            {rxd = ($4 - $2) / (1024*1024*$1);
             txd = ($5 - $3) / (1024*1024*$1);
             printf msg, rxd, txd}'
    rxtx="$nrxtx"
done
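A typical invocation, assuming bge0 is the interface you care about and you want two-second samples, would be:
./netvolmon bge0 2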
Vaguely to my surprise, it turns out that Solaris 8 awk doesn't allow
you to split printf (and presumably print) statements over multiple
lines. The stock Solaris /bin/sh is backwards and doesn't support the
POSIX $(...) command substitution syntax, even in Solaris 10, so this
version uses the less pleasant backquote syntax.
(This can easily be extended to report packets per second as well; the device counters you want are 'opackets64' and 'rpackets64'. I didn't put it in this version for a petty reason, namely that it would make this entry too wide, but you can get the full versions for both Solaris and Linux here.)
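As a rough sketch of that extension (untested as written here, and assuming the packet counters follow the same kstat naming pattern as the byte counters), the counter-fetching function becomes something like:
# bytes first, then packets; the reporting awk has to expect the
# counters in this order and grow extra fields for the packet deltas
getstats() {
    kstat -p "*:*:$1:*bytes64" | awk '{print $2}'
    kstat -p "*:*:$1:*packets64" | awk '{print $2}'
}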
2008-05-17
Why we're interested in many ZFS pools
I wrote up our basic fileserver design plan back in ZFSFileserverDesign, but it is worth explaining why we are looking at using many pools. In a nutshell:
Given that we sell fixed-size chunks of space to people (that being how we allocate our storage space), we are always going to have a certain number of logical pools of storage to manage. The only question is whether to handle them as separate ZFS pools or to aggregate them into fewer ZFS pools and then administer them as sub-hierarchies using quotas. Our current belief is that it's simpler to use separate pools: there is one less thing to keep track of when you add space, you avoid the possibility of certain sorts of stupid errors, and it is simpler to explain to users.
(In our situation it also lessens the amount of data we'd lose if we lost both disks in a mirrored pair.)
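For concreteness, the two approaches look roughly like this (the pool, disk, and filesystem names are invented for the example):
# one ZFS pool per chunk of space we sell
zpool create grp1 mirror c1t1d0 c2t1d0

# versus one big pool, with a quota-limited filesystem per chunk
zfs create bigpool/grp1
zfs set quota=400G bigpool/grp1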
We're unlikely to have, say, 132 pools on a single fileserver under normal conditions. However, we are going to have a failover environment, which means that we may sometimes have to limp along with the pools from several fileservers temporarily all running on one machine. Figuring out the limits in advance may save us a lot of heartburn during a crisis.
(Plus, learning about this stuff helps us plan out the fileservers and how to split groups and people between them, and so on.)
2008-05-13
Things I have learned about ZFS (and a Linux iSCSI target)
I've been testing ZFS over iSCSI storage as an NFS server recently, which has caused me to discover a number of interesting things. In the order that I discovered them:
- each ZFS pool has a cache with a minimum size that it won't shrink
below, no matter how severe the memory pressure is; this is apparently
10 MB by default. If you have 2 GB of memory on a Solaris 10U4
x86 machine and 132 separate pools, there is not enough memory
left after all their caches fill to this level to let the machine
keep running regular programs.
(Because the caches initially start out empty and thus much smaller, the machine will boot and seem healthy until you do enough IO to enough pools. This can be mysterious, especially if your IO load is to 'zfs scrub' them one after another.)
Our workaround was to add more memory to the Solaris machine; it seems happy at 4GB.
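There's no simple per-pool view of this that I know of, but two general ways to keep an eye on ZFS cache and kernel memory usage on Solaris 10 are the ARC kstats and mdb's ::memstat:
# all of the ZFS ARC statistics, including the current 'size'
kstat -p zfs:0:arcstats
# overall kernel versus user memory breakdown; run as root
echo ::memstat | mdb -k
Neither gives a per-pool breakdown, but both show when the caches have eaten your memory.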
- a Linux iSCSI target machine needs more than 512 MB of memory to support
simultaneous IO against 132 LUNs. If you have only 512 MB and throw enough
IO against the machine, the kernel will get hung in an endless OOM loop.
(Your mileage may differ depending on which iSCSI target implementation you're using.)
- despite being released in 2007, Solaris 10 U4 still defaults to running only 16 NFS server threads.
- however, increasing this to 1024 threads (the commonly advised starting
point) and then trying to do simultaneous IO against 132 ZFS pools from
an NFS client will cause your now 4GB server to bog down into complete
unusability. (At one point I saw a load average of 4000.)
This appears to happen because NFS server threads are very high priority threads on Solaris, so if you have too many of them they can eat all of your server for breakfast. 1024 is definitely too many; 512 may yet prove to be too many, but has survived so far.
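For reference, the usual way to change this on Solaris 10 is the NFSD_SERVERS setting in /etc/default/nfs, followed by a restart of the NFS server; roughly:
# in /etc/default/nfs
NFSD_SERVERS=512
# then, as root:
svcadm restart svc:/network/nfs/server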
- ZFS has really aggressive file-level prefetching, even when used as an
NFS server and even when the system is under enough pressure that
most of it gets thrown away. For example, if you have 132 streams
of sequential read IO, Solaris can wind up wasting 90% to 95% of
the IO it does.
(It is easiest to see this if you have a Linux iSCSI target and a Linux NFS client, because then you can just measure the network bandwidth usage of both. At 132 streams, the iSCSI target was transmitting at 118 MBytes/sec but the NFS client was receiving only 6 MBytes/sec.)
The workaround for this is to turn off ZFS file prefetching (following the directions from the ZFS Evil Tuning Guide). Unfortunately, this costs you noticeable performance on single-stream sequential IO.
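If I'm remembering the guide correctly, the Solaris 10 way to do this is the zfs_prefetch_disable tunable in /etc/system, which takes effect on the next reboot; verify the exact name against the guide for your release:
* in /etc/system: turn off ZFS file-level prefetching
set zfs:zfs_prefetch_disable = 1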
It is possible that feeding the server yet more memory would help with this, but going beyond 4 GB of memory for the hardware we're planning to use as our NFS servers will be significantly more expensive (we'd have to move to 2 GB DIMMs, which are still pricey).
Given that we have a general NFS environment, I suspect that we are going to have to accept that tradeoff; better a system that's slower than it could be when it's under low load than a system that totally goes off the cliff when it's under high load.
2008-05-07
Today's Solaris 10 irritation: the fault manager daemon
More and more, Solaris 10 strikes me as being much like Ubuntu 6.06: a
system with plenty of big ideas but only half-finished implementations.
Today's half-implemented idea is fmd, the new fault manager daemon.
One of the things I expect out of a fault monitoring system is that it
should not report things as faulted when they are now fine, especially
not with scary messages that get dumped on the console at every boot
(it's acceptable to report them as faulted and now better, provided that
you only do it once). As I discovered today, under some circumstances
involving ZFS pools and iSCSI, fmd falls down on this; I got verbose
error messages about missing pools (that were there and fine) dumped to
the console (and syslog) on every boot.
Unfortunately, I couldn't find any simple way to clear these errors.
There is probably a magic fmadm flush incantation, but I couldn't find
the right argument, and doing fmadm reset on the two ZFS modules that
fmadm config reported didn't do anything. I had to resort to picking
event UUIDs out of fmadm faulty output and running fmadm repair on
each one.
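With a bunch of these to clear, a loop saves some typing. This is only a sketch, though: the way the UUIDs are extracted below is an assumption, since fmadm faulty's output format varies between Solaris releases, so check your own output first.
# sketch: repair every token in 'fmadm faulty' output that looks like
# an event UUID; adjust the awk pattern to match your actual output
for uuid in `fmadm faulty | awk '$1 ~ /^[0-9a-f][0-9a-f]*-[0-9a-f-]*[0-9a-f]$/ {print $1}'`; do
    fmadm repair $uuid
done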
(And why didn't Sun give the fault manager an option to send email to someone when faults happen? I'd have thought that that would be basic functionality, and it would make it actually useful for us.)
Sidebar: How I got fmd to choke this way
I ran a test overnight that hung the iSCSI target machine, which caused the Solaris machine to reboot and then hang during boot. In the process of straightening all of this out there was a time when the iSCSI machine was refusing connections, which caused the Solaris machine to finally boot but with none of the ZFS pools available. When I brought the iSCSI machine back up, the pools reappeared but the fault manager had somehow latched on to the original 'pool not present' events and kept repeating them.