Wandering Thoughts archives

2008-08-24

An update to the ZFS excessive prefetching situation

A while back I wrote about how I had discovered that ZFS could wind up doing excessive readahead when faced with many streams of sequential read IO, throwing away 90% to 95% of the IO that it had done (with terrible consequences for application performance). It's time for an update on that situation.

First, for various reasons we wound up moving to Solaris server machines with 8 GB of memory (SunFire X2200s instead of X2100s), so I re-enabled ZFS file prefetching and re-ran my experiments. Initial testing was encouraging; with 8 GB, the ZFS ARC cache was big enough that even under my heavy test load ZFS could keep prefetched data around for long enough to not kill application-level performance.

Well. Usually big enough, but sometimes the ZFS ARC would spontaneously decide to limit itself to 2 GB (instead of the usual 5 to 7 GB), despite the test machines being otherwise unused and idle. This destroyed performance, and worse, I could find no way of resetting the adaptive ARC target size (what you see as c in the output of 'kstat -m zfs') to recover from the situation. So we turned off ZFS file prefetching again and there things sat for a while.
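
As a side note, the easy way to keep an eye on this is the ZFS 'arcstats' kstat; a minimal sketch of checking the target size (and the current size), with the exact invocation not guaranteed, is something like:

    # dump all of the ZFS kstats, including arcstats, where 'c' is the target size
    kstat -m zfs
    # or pull just the target size and the current size out in parseable form
    kstat -p zfs:0:arcstats:c zfs:0:arcstats:size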

Recently I discovered the under-documented zfs_arc_min ZFS tuning parameter, which sets the minimum ZFS ARC size (it is the mirror of the better-documented zfs_arc_max tuning parameter for setting the maximum size). Since a large minimum size should prevent the catastrophic ARC shrinkage, our test systems now have it set to 5 GB and it seems to be working so far (in that the ARC hasn't shrunk on either of them).
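
For reference, this sort of ZFS tunable goes in /etc/system and takes effect on the next reboot. A minimal sketch of the setting involved, with 5 GB written out in bytes (and the exact spelling worth double-checking against your own Solaris version), looks like:

    * minimum ZFS ARC size: 5 GB (5 * 1024 * 1024 * 1024 bytes)
    set zfs:zfs_arc_min = 0x140000000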

(On dedicated NFS servers, I am pretty sure that we actively want most of the memory to be reserved for ZFS caches. Nothing that is particularly memory-consuming should ever run on them, and if it does, I would prefer that it swap itself to death rather than impacting NFS server performance.)

Update, October 22nd: see an important update. I can no longer recommend that you do this.

ZFSOverPrefetchingUpdate written at 00:39:47

2008-08-04

Our answer to the ZFS SAN failover problem

A while back I wrote about the ZFS SAN failover problem, and recently a commentator asked what we've decided to do about it. Our current answer to the problem is simple but somewhat brutal: we're not going to do failover as such.

We're still including basic support for failover in our NFS server environment, things like virtual fileserver IPs and a naming convention for ZFS pools that includes what fileserver they are part of, but we're not trying to build any explicit failover support, especially automatic failover. If we have to fail over a fileserver, it will be a by-hand process.

Note that ZFS makes by-hand failover for NFS servers relatively little work, because almost everything you need is already remembered by the pools. All we'd need to do is get a list of the pools on the down fileserver (made easy by the naming convention and 'zpool import'), import them all on another server, and add the virtual fileserver IP as an alias on the new server.
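
To sketch the whole thing out, with the pool name, network interface, and IP address all made up for illustration, it is roughly:

    # see which pools are visible on the SAN but not imported here; our
    # naming convention tells us which ones belong to the dead fileserver
    zpool import
    # force-import one of its pools (it was last in use on another host)
    zpool import -f fs2-data1
    # bring up the dead fileserver's virtual IP as an alias on this server
    ifconfig e1000g0 addif 10.0.0.42 netmask 255.255.255.0 up

Repeat the import for each of the dead fileserver's pools and you're more or less done, since the pools remember things like their NFS sharing options for you.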

Apart from the relative ease of doing manual failover if we have to, there are several mitigating factors that make this more sensible than it looks. First, it seems clear that we can't do automatic failover, because it is just too dangerous if anything goes wrong (and we don't trust ourselves to build a system that guarantees nothing will ever go wrong). This means that we are not losing much by not automating some of the by-hand work, and an after-hours problem won't get fixed any slower this way; in either case it has to wait for sysadmins to come in.

Second, given the slow speed of zpool import in a SAN environment, any failover is a very slow process (we're looking at tens of minutes). Since even automatic failover would be very user-visible, having manual failover be more user-visible is not necessarily a huge step worse. This also means that the only time any sort of failover makes sense is when a server has failed entirely.

Third, we're using generic hardware with mirrored, easily swapped system disks. This means that if even a single system disk has survived, we can transplant it into a spare chassis and (with a bit of work) bring the actual server back online; it might even be faster than failing over the fileserver. So to entirely lose a server, we have to lose both system disks at once, which we hope is a very rare event.

(This is when operating system bugs and sysadmin mistakes come into the picture, of course.)

OurZFSSanFailoverAnswer written at 15:05:57

2008-08-03

First impressions of using DTrace on user-level programs

I've finally gotten around to trying out DTrace, and I have to say that it actually is pretty cool so far. I haven't used it on kernel-side stuff, just to poke at user programs, for which it makes a nice print-based debugger; it's easy to point DTrace at a process and see what's going on (easier than attaching a debugger and getting anywhere, in my opinion), and you have a bunch of interesting analysis options, none of which require you to sit there holding the debugger's hand.

(For example, it is easy to see all of the library calls that a program makes, or all of the calls that it makes to a specific shared library; this is a powerful way of working out how a program is making decisions.)
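
As a sketch of the kind of one-liners I mean (with the process ID here purely made up), you can count every libc call a process makes, or just watch its calls into one particular shared library:

    # tally every libc function the process calls, by function name
    dtrace -n 'pid$target:libc::entry { @[probefunc] = count(); }' -p 1234
    # or watch calls into one specific shared library as they happen
    dtrace -n 'pid$target:libsocket::entry' -p 1234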

One drawback to using DTrace on userland programs is that DTrace is fundamentally a kernel debugger, so it does not give you direct access to a process's user-side memory and thus its variables and data structures. In order to get any of this sort of thing, you have to first copy it from user space with various DTrace functions, primarily copyin() for random data and copyinstr() for C-style strings. Another drawback is that DTrace has no looping and very little conditional decision making, which makes it hard to trace down complex data structures.
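
The classic small example of this (again with a made-up process ID) is printing the path argument of a process's open() calls; arg0 is a user-space address, so it has to go through copyinstr() before you can print it:

    dtrace -n 'syscall::open:entry /pid == $target/ { printf("%s\n", copyinstr(arg0)); }' -p 1234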

(I understand why DTrace has to have this basic model, but I really wish that the DTrace people had put more convenience functions for this sort of thing into the DTrace language. And I don't understand the complete avoidance of an if statement or equivalent at all.)

That DTrace is not a user-level debugger is actually reassuring in one sense; user-level debuggers are traditionally rather invasive and do things like stopping the process you're monitoring while you hold their hand. This is alarming for a sysadmin, since the processes we want to monitor are often rather important system ones that we want disturbed as little as possible (and certainly not stopped and potentially aborted).

UserlandDtraceImpressions written at 01:02:59

