2015-06-26
The status of our problems with overloaded OmniOS NFS servers
Back at the start of May, we narrowed down our production OmniOS problems to the fact that OmniOS NFS servers have problems with sustained 'too fast' write loads. Since then there have been two pieces of progress and today I feel like writing about them.
The first is that this was identified as a definite Illumos issue. It turns out that Nexenta stumbled over this and fixed it in their own tree in this commit. The commit has since been upstreamed to the Illumos master here (issue) and has made it into the repo for OmniOS r151014 (although I believe it's not yet in a released update). OmniTI's Dan McDonald did the digging to find the Nexenta change after I emailed the OmniOS mailing list, and he built us a kernel with it patched in that we were able to run in our test environment, where it passed with flying colors. This is clearly our long term solution to the problem.
(In case it's not obvious, Dan McDonald was super helpful to us here, which we're quite grateful for. Practically the moment I sent in my initial email, our problem was on the way to getting solved.)
In the short term we found out that taking a fileserver from 64 GB of RAM to 128 GB of RAM made the problem unreproducible both in our test environment and on the production fileserver that was having problems. In addition, it appears to make our test fileserver significantly more responsive under heavy load. Currently the production fileserver is running without problems with 128 GB of RAM and 4096 NFS server threads (and an increase in kernel rpcmod parameters to go with it). It has definitely survived memory use situations that, based on prior experience, we'd have expected to lock it up.
(At the moment we've only upgraded the one problem fileserver to 128 GB and left the others at 64 GB. The others get much less load due to some decisions we made during the migration from the old fileservers to our current ones.)
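(For concreteness, here is a rough sketch of how the thread count and rpcmod settings above get configured. The sharectl property is the standard one for NFS server threads; which rpcmod tunables actually matter is my assumption, based on the RPC duplicate request cache parameters, so treat the names as illustrative rather than a recipe.)

# sharectl set -p servers=4096 nfs

* /etc/system: scale the RPC duplicate request caches to match
* (assumed relevant rpcmod tunables; values mirror the thread count)
set rpcmod:maxdupreqs = 4096
set rpcmod:cotsmaxdupreqs = 4096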
We still have some other issues with our OmniOS fileservers, but for now the important thing is that we have what seems to be a stable production fileserver environment. After all our problems getting here, that is a very big relief. We can live with 1G Ethernet instead of 10G; we can't live with fileservers that lock up under load.
2015-06-18
The cost of OmniOS not having /etc/cron.d
Systems without /etc/cron.d just make my sysadmin life harder and more annoying. OmniOS, I'm looking at you.
For those people who have not encountered it, this is a Linux cron
feature where you can basically put additional crontab files in
/etc/cron.d. To many people this may sound like a minor feature;
let me assure you it is not.
Here is why it is an important feature: it makes adding, modifying,
or deleting your crontab entries as trivial as copying a file.
It is very easy to copy files (or create them). You can trivially
script it, there are tons of tools to do this for you in various
ways and from various sources (from rsync on up), and it is very
easy to scale file copies up for a fleet of machines.
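As a tiny illustration, a hypothetical /etc/cron.d/zfs-reports file
(the script path, name, and schedule here are all made up) is the
entire installation; deploying or removing the job fleet-wide is just
copying or deleting this one file:

# /etc/cron.d/zfs-reports: nightly ZFS space report (hypothetical)
# minute hour day month weekday user command
15 3 * * * root /local/sbin/zfs-space-report --quiet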
Managing crontab entries without this is painfully manual, involves
attempts at reliable automated file editing through interfaces not
designed for it, or requires you to basically build your own custom
equivalent and then treat the system crontab file as an implementation
detail inside your cron.d equivalent.
This is a real cost and it matters for us.
With /etc/cron.d, adding a new custom-scheduled service on some
or all of our fileservers would be trivial
and guaranteed to not perturb anything else. Especially, adding it
to all of them is no more work than adding it to one or two (and
may even be slightly less work). With current OmniOS cron, it is
dauntingly and discouragingly difficult. We have to log in to each
fileserver, run 'crontab -e' by hand, worry about an accidental
edit mistake damaging other things, and then update our fileserver
install instructions to account for the new crontab edits. Changed
your mind and need to revise just what your crontab entry is (eg
to change when it runs)? You get to do all that all over again.
The result is that we'll do a great deal to avoid having to update
OmniOS crontabs. I actually found myself thinking about how I would
invent my own job scheduling system in central shell scripts that
we already run out of cron, just because doing that seemed like
less work and less annoyance than slogging around to run 'crontab
-e' even once (and it probably wouldn't have been just once).
(Updates to the shell scripts et al are automatically distributed to our OmniOS machines, so they're 'change once centrally and we're done'.)
Note that it's important that /etc/cron.d supports multiple files,
because that lets you separate each crontab entry (or logically
related chunk of entries) into an independently managed thing. If
it were only a single file, multiple separate things that all
wanted crontab entries would have to coordinate updates to the file.
This would get you back to all sorts of problems, like 'can I
reliably find or remove just my entries?' and 'are my entries
already there?'. With /etc/cron.d, all you need is for
people (and systems) to pick different filenames for their particular
entries. This generally happens naturally because you get to use
descriptive names for them.
2015-05-17
A bit more on the ZFS delete queue and snapshots
In my entry on ZFS delete queues, I mentioned that a filesystem's delete queue is captured in snapshots and so the space used by pending deletes is held by snapshots. A commentator then asked:
So in case someone uses zfs send/receive for backup he accidentially stores items in the delete queue?
This is important enough to say explicitly: YES. Absolutely.
Since it's part of a snapshot, the delete queue and all of the space
it holds will be transferred if you use zfs send to move a
filesystem snapshot elsewhere for whatever reason. Full backups,
incremental backups, migrating a filesystem, they all copy all of
the space held by the delete queue (and then keep it allocated on
the received side).
This has two important consequences. The first is that if you
transfer a filesystem with a heavy space loss due to things being
held in the delete queue for whatever reason,
you can get a very head-scratching result. If you don't actually
mount the received dataset, you'll wind up with a dataset that
reports all of its space as used by the dataset itself, not by
snapshots, yet if you 'zfs destroy' the transfer snapshot the
dataset promptly shrinks. Having gone through this experience
myself, this is a very WAT moment.
The second important consequence is that apparently the moment you
mount the received dataset, the current live version will immediately
diverge from the snapshot (because ZFS wakes up, says 'ah, a delete
queue with no live references', and applies all of those pending
deletes). This is a problem if you're doing repeated incremental
receives, because the next incremental receive will tell you
'filesystem has diverged from snapshot, you'll have to tell me to
force a rollback'. On the other hand, if ZFS space accounting is
working right this divergence should transfer a bunch of the space
the filesystem is consuming into the usedbysnapshots category.
Still, this must be another head-scratching moment, as just mounting
a filesystem suddenly causes a (potentially big) swing in space
usage and a divergence from the snapshot.
(I have not verified this mounting behavior myself, but in retrospect
it may be the cause of some unexpected divergences we've experienced
while migrating filesystems. Our approach was always just to use
'zfs recv -F ...', which is perfectly viable if you're really
sure that you're not blowing your own foot off.)
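As a sketch of what this looks like in practice (with placeholder
dataset and snapshot names): the usedby* properties show which bucket
the space lands in on the received side, and the forced rollback is
just the -F flag to 'zfs recv'.

# zfs list -o name,used,usedbydataset,usedbysnapshots tank/backup/somefs
# zfs send -i tank/somefs@snap1 tank/somefs@snap2 | zfs recv -F tank/backup/somefs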
2015-05-15
Your Illumos-based NFS fileserver may be 'leaking' deleted files
By now you may have guessed the punchline of my sudden interest in ZFS delete queues: we had a problem with ZFS leaking space for deleted files that was ultimately traced down to an issue with pending deletes that our fileserver wasn't cleaning up when it should have been.
As a well-debugged filesystem, ZFS should not outright leak pending
deletions, where there are no remaining references anywhere yet the
files haven't been cleaned up (well, more or less; snapshots come
into the picture, as mentioned). However it's
possible for both user-level and kernel-level things to hold
references to now-deleted files in the traditional way and thus
keep them from being actually removed. User-level things holding
open files should be visible in, eg, fuser, and anyways this is
a well-known issue that savvy people will immediately ask you
about. Kernel level things may be less visible, and there is at
least one in mainline Illumos and thus OmniOS r151014 (the current
release as I write this entry).
Per George Wilson on the illumos-zfs mailing list here, Delphix
found that the network lock manager (the nlockmgr SMF service)
could hold references to (deleted) files under some circumstances
(see the comment in their fix).
Under the right circumstances this can cause significant space
lossage over time; we saw loss rates of 5 GB a week. This is worked
around by restarting nlockmgr; this restart drops the old references
and thus allows ZFS to actually remove the files and free up
potentially significant amounts of your disk space. Rebooting
the whole server will do it too, for obvious reasons, but is somewhat
less graceful.
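The restart itself is a single SMF operation, and afterwards you can
watch the filesystem's space usage to see if it drops (the dataset
name below is a placeholder):

# svcadm restart network/nfs/nlockmgr
# zfs list -o name,used,avail somepool/somefs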
(Restarting nlockmgr is said to be fully transparent to clients,
but we have not attempted to test that. When we did our nlockmgr
restart we did as much as possible to make any locking failures a
non-issue.)
As far as I know there is no kernel-level equivalent of fuser,
something that would let you list, say, all currently active
kernel-level references to files in a particular filesystem (never
mind which kernel subsystem is holding those references). I'd love
to be wrong here; it's an annoying gap in Illumos's observability.
The ZFS delete queue: ZFS's solution to the pending delete problem
Like every other always-consistent filesystem, ZFS needs a solution to the Unix pending delete problem (files that have been deleted on the filesystem but that are still in use). ZFS's solution is implemented with a type of internal ZFS object called the 'ZFS delete queue', which holds a reference to any and all ZFS objects that are pending deletion. You can think of it as a kind of directory (and technically it's implemented with the same underlying storage as directories are, namely a ZAP store).
Each filesystem in a ZFS pool has its own ZFS delete queue object, holding pending deletes for objects that are in (or were originally in) that filesystem. Also, each snapshot has a ZFS delete queue as well, because the current state of a filesystem's ZFS delete queue is captured as part of making a snapshot. This capture of delete queues in snapshots has some interesting consequences; the short version is that once a delete queue with entries is captured in a snapshot, the space used by those pending deleted objects cannot be released until the snapshot itself is deleted.
(I'm not sure that this space usage is properly accounted for in
the 'usedby*' space usage properties, but I haven't tested this
specifically.)
There is no simple way to find out how big the ZFS delete queue is
for a given filesystem. Instead you have to use the magic zdb
command to read it out, using 'zdb -dddd DATASET OBJNUM' to dump
details of individual ZFS objects so that you can find out how many
ZAP entries a filesystem's 'ZFS delete queue' object has; the number
of current ZAP entries is the number of pending deletions. See the
sidebar for full details, because it gets long and tedious.
(In some cases it will be blatantly obvious that you have some
sort of problem because df and 'zfs list' and so on report
very different space numbers than eg du does, and you don't
have any of the usual suspects like snapshots.)
Things in the ZFS delete queue still count in and against per-user
and per-group space usage and quotas, which makes sense because
they're still not quite deleted. If you use 'zfs userspace' or
'zfs groupspace' for space tracking and reporting purposes this
can result in potentially misleading numbers, especially if pending
deletions are 'leaking' (which can happen).
If you actually have and enforce per-user or per-group quotas, well,
you can wind up with users or groups that are hitting quota limits
for no readily apparent reason.
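For example, this is the sort of per-user and per-group report where
space held on the delete queue can quietly inflate the numbers (the
dataset name is a placeholder):

# zfs userspace -o type,name,used,quota somepool/somefs
# zfs groupspace -o type,name,used,quota somepool/somefs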
(Needing to add things to the ZFS delete queue has apparently caused problems on full filesystems at least in the past, per this interesting opensolaris discussion from 2006.)
Sidebar: A full example of finding how large a ZFS delete queue is
To dump the ZFS delete queue for a filesystem, first you need to know what its object number is; this is usually either 2 (for sufficiently old filesystems) or 3 (for newer ones), but the sure way to find out is to look at the ZFS master node for the filesystem (which is always object 1). So to start with, we'll dump the ZFS master node to find out the object number of the delete queue.
# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         1    1    16K     1K     8K     1K  100.00  ZFS master node
[...]
    microzap: 512 bytes, 3 entries
        DELETE_QUEUE = 2
[...]
The object number of this filesystem's delete queue is 2 (it's an old filesystem, having been originally created on Solaris 10). So we can dump the ZFS delete queue:
# zdb -dddd fs3-corestaff-01/h/281 2
Dataset [...]
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         2    2    16K    16K   144K   272K  100.00  ZFS delete queue
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 16
        Fat ZAP stats:
[...]
                ZAP entries: 5
[...]
                3977ca = 3766218
                3977da = 3766234
                397a8b = 3766923
                397a87 = 3766919
                397840 = 3766336
(The final list here is the ZAP entries themselves, going from some magic key (on the left) to the ZFS object numbers on the right. If we wanted to, we could use these object numbers to inspect (or even read out) the actual things that are pending deletion. This is probably most useful to find out how large they are and thus how much space they should be consuming.)
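For instance, to look at the first pending object listed here, using
an object number taken straight from this output:

# zdb -dddd fs3-corestaff-01/h/281 3766218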
There are two different forms of ZAPs and zdb reports how many
entries they have somewhat differently. In the master node we saw
a 'microzap', used when the ZAP is and always has been small. Here
we see a 'Fat ZAP', which is what a small ZAP turns into if at some
point it grows big enough. Once the ZFS delete queue becomes a fat
ZAP it stays that way even if it later only has a few entries, as
we see here.
In this case the ZFS delete queue for this filesystem holds only five entries, which is not particularly excessive or alarming. Our problem filesystem had over ten thousand entries by the time we resolved the issue.
PS: You can pretty much ignore the summary line with its pretty sizes;
as we see here, they have very little to do with how many delete queue
entries you have right now. A growing ZFS delete queue size may be
a problem indicator,
but here the only important thing in the summary is the type field,
which confirms that we have the right sort of objects both for the ZFS
master node and the ZFS delete queue.
PPS: You can also do this exercise for snapshots of filesystems; just use the full snapshot name instead of the filesystem.
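For example, with a placeholder snapshot name:

# zdb -dddd fs3-corestaff-01/h/281@somesnap 2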
(I'm not going to try to cover zdb usage details at all, partly
because I'm just flailing around with it. See Ben Rockwood's zdb:
Examining ZFS At Point-Blank Range for one
source of more information.)
2015-05-02
OmniOS as a NFS server has problems with sustained write loads
We have been hunting a serious OmniOS problem for some time. Today we finally have enough data that I feel I can say something definitive:
An OmniOS NFS server will lock up under (some) sustained write loads if the write volume is higher than its disks can sustain.
I believe that this issue is not specific to OmniOS; it's likely Illumos in general, and was probably inherited from OpenSolaris and Solaris 10. We've reproduced a similar lockup on our old fileservers, running Solaris 10 update 8.
Our current minimal reproduction is the latest OmniOS (r151014) on our standard fileserver hardware, with 1G networking added and with a test pool of a single mirrored vdev on two (local) 7200 RPM 2TB SATA disks. With both 1G networks being driven at basically full wire speed by a collection of NFS client systems writing out a collection of different files on that test pool, the system will run okay for a while and then suddenly enter a situation where system free memory nosedives abruptly and the amount of kernel memory used for things other than the ARC jumps massively. This leads immediately to a total system hang when the free memory hits rock bottom.
(This is more write traffic than the disks can sustain due to mirroring. We have 200 MBytes/sec of incoming NFS writes, which implies 200 MBytes/sec of writes to each disk. These disks appear to top out at 150 MBytes/sec at most, and that's probably only a burst figure.)
Through a series of relatively obvious tests that are too long to detail here (eg running only one network's worth of NFS clients), we're pretty confident that this system is stable under a write load that it can sustain. Overload is clearly not immediate death (within a few seconds or the like), so we assume that the system can survive sufficiently short periods of overload if the load drops afterwards. However we have various indications that it does not fully recover from such overloads for a long time (if ever).
(Death under sustained overload would explain many of the symptoms we've seen of our various fileserver problems (eg). The common element in all of the trigger causes is that they cause (or could cause) IO slowdowns: backend disks with errors, backend disks that are just slow to respond, full pools, apparently even pools hitting their quota limits, and even 10G networking problems. A slowdown of IO would take a fileserver that was just surviving a current high client write volume and push it over the edge.)
The memory exhaustion appears to be related to a high and increasing level of outstanding incomplete or unprocessed NFS requests. We have some indication that increasing the number of NFS server threads helps stave off the lockup for a while, but we've had our test server lock up (in somewhat different test scenarios) with widely varying thread counts.
In theory this shouldn't happen. An NFS server that is being overloaded should push back on the clients in various ways, not enter a death spiral of accepting all of their traffic, eating all its memory, and then locking up. In practice, well, we have a serious problem in production.
PS: Yes, I'll write something for the OmniOS mailing lists at some point. In practice tweets are easier than blog entries, which are easier than useful mailing list reports.
2015-04-22
Don't make /opt a filesystem on OmniOS (or probably Illumos generally)
OmniOS boot environments are in general pretty cool things, but
they do create one potential worry: how much data gets captured in
them and thus how much space they can consume over time. Since boot
environments are ultimately ZFS snapshots and clones, the amount
of space each individual one uses over time is partly a function
of how much of the data captured changes over time. Taking an extreme
case, if you have a very large /var/log that is full of churning
logs for some reason, each boot environment and BE snapshot you
have will probably wind up with its own unique copy of all of this
data.
(Unchanging data is free since it's shared between all BEs and BE snapshots.)
One of the things that this can push you towards is limiting what's
included in your boot environments by making some things into separate
filesystems. One obvious candidate here is /opt, where you may wind
up with any number of local and semi-local packages that you update
and otherwise churn at a much faster rate than base OmniOS updates and
upgrades. After all this is the entire point of the OmniOS KYSTY
principle and the historical
use of /opt.
Well, I'll cut to the chase: don't do this if you want to be able
to do upgrades between OmniOS releases, at least right now. You can
create separate ZFS filesystems under bits of /opt, but if you
take the obvious route of making all of /opt its own ZFS filesystem
things will go terribly wrong. The problem is that some core OmniOS
packages that you may wind up installing (such as their GCC packages)
are put into /opt but upgraded as part of making a new boot
environment on major upgrades. Because a boot environment only contains
things directly in /, this doesn't work too well; pkg tries to update
things that aren't actually there and will either fail outright or create
things in /opt in the root of your new BE, which blocks mounting the
real /opt.
I will summarize the resulting OmniOS mailing list discussion as
'a separate /opt is not a supported configuration; don't do that'.
At best, pkg may some day report an explicit error, so that if
you're stuck in this situation you'll know and you can temporarily
remove all of those OmniOS packages in /opt.
(Our solution is to abandon plans to upgrade machines from r151010
to r151014. We'll reinstall from scratch and this time we'll make
the largest single piece of /opt into a filesystem instead.)
My personal view is that this means that you do not want to build
or install anything in /opt. Make up your own hierarchy, maybe
/local, and use that instead; that should always be safe to make
into its own filesystem. OmniOS effectively owns /opt and so you
should stay out.
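A minimal sketch of what I mean, assuming your root pool is called
rpool and you want /local to live outside of all boot environments:

# zfs create -o mountpoint=/local rpool/local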
I believe that this is a general issue with all Illumos derived
distributions if they put any of their own packages in /opt, such
as GCC. I have not looked at anything other than OmniOS. I don't
know if it's an issue on Solaris 11; I'd like to hope not, but then
I have low confidence in Oracle getting this right either.
(You may think that being concerned about disk space is so 00s, in this day of massively large hard drives. Well, inexpensive SSDs are not yet massively large and they're what we like to use as root drives these days. They're especially not large in the face of crash dumps, where OmniOS already wants a bunch of space.)
2015-04-14
Allowing people to be in more than 16 groups with an OmniOS NFS server
One of the long standing problems with traditional NFS is that the protocol only uses 16 groups; although you can be in lots of groups on the client (and on the server), the protocol itself only allows the client to tell the server about 16 of them. Recent versions of Illumos added a workaround (based on the Solaris one) where the server will ignore the list of groups the client sent it and look up the UID's full local group membership. Well, sometimes it will do this, if you get all of the conditions right.
There are two conditions. First, the request from the client must have a full 16 groups in it. This is normally what should happen if GIDs are synchronized between the server and the clients, but in exceptional cases you should watch out for this; if the client sends only 15 groups, the server won't do any lookups locally and so can deny you access to a file that your server-side group list would actually give you access to.
Second and less obviously, the server itself must be explicitly
configured to allow more than 16 groups. This is the kernel tunable
ngroups_max, set in /etc/system:
set ngroups_max = 64
Any number larger than 16 will do, although you want it to cover the
maximum number of groups you expect people to be in. I don't know if you
can set it dynamically with mdb, so you probably really want to plan
ahead on this one. On the positive side, this is the only server side
change you need to make; no NFS service parameters need to be altered.
(This ngroups_max need is a little bit surprising if you're
mostly familiar with other Unixes, which generally have much
larger out of the box settings for this.)
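You can at least read the current value back from the running kernel
to confirm that your /etc/system setting took effect:

# echo ngroups_max/D | mdb -k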
This Illumos change made it into the just-released OmniOS r151014 but is not in any earlier version as far as I know. Anyways, r151014 is an LTS release so you probably want to be using it. I don't know enough about other Illumos distributions like SmartOS and Nexenta's offering to know when (or if) this change made it into them.
(The actual change is Illumos issue 5296 and was committed to the Illumos master in November 2014. The issue has a brief discussion of the implementation et al.)
Note that as far as I know the server and the client do not need to agree on the group list, provided that the client sends 16 groups. My test setup for this actually had me in exactly 16 groups on the client and some additional groups on the server, and it worked. This is a potential gotcha if you do not have perfect GID synchronization between server and client. You should, of course, but every so often things happen and things go wrong.
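A quick way to see how many groups the client side will have on hand
for a user is simply to count them there (the username is a
placeholder and this is a Linux client invocation):

# id -G someuser | wc -w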
2015-03-16
Our difficulties with OmniOS upgrades
We are not current on OmniOS and we've been having problems with
it. At some point, well-meaning people are going to suggest that we
update to the current release version with the latest updates and
mention that OmniOS makes this really quite easy with beadm and boot
environments. Well, yes and no.
Yes, mechanically (as far as I know) OmniOS package updates and even release version updates are easy to do and easy to revert from. Boot environments and snapshots of them are a really nice thing and they enable relatively low-risk upgrades, experiments, and so on. Unfortunately the mechanics of an upgrade are in many ways the easy part. The hard part is that we are running (unique) production services that are directly exposed to users. In short, users very much notice if one of our servers goes down or doesn't work right.
The first problem is that this makes reboots noticeable and since they're noticeable they have to be scheduled. Kernel and OmniOS release updates both require reboots (in fact I believe you really want to reboot basically immediately after doing them), which means pre-scheduled, pre-announced downtimes that are set up well in advance.
The second problem is that we don't want to put something into production and then find out that it doesn't work or that it has problems. This means updating is not as simple as updating the production server at a scheduled downtime; instead we need to put the update on a test server and then try our best to fully test it (both for load issues and to make sure that important functionality like our monitoring systems still work). This is not a trivial exercise; it's going to consume time, especially if we discover potential issues.
The final problem is that changes increase risk as well as potentially reducing it. Our testing is not and cannot be comprehensive, so applying an update to the production environment risks deploying something that will actually be worse than we have now. The last thing we need is for our current fileservers to get worse than they are now. This means that even considering updates involves a debate over what we're likely to get versus the risks we're taking on, one in which we need to persuade ourselves that the improvements in the update are worth taking on the risks to a core piece of our infrastructure.
(In an ideal world, of course, an update wouldn't introduce new bugs and issues. We do not live in that world; even if people try to avoid it, such things can slip through.)
PS: Obviously, people with different infrastructure will have different tradeoffs here. If you can easily roll out an update on some production servers without anyone noticing when they're rebooted, monitor them in live production, and then fail them out again immediately if anything goes wrong, an OmniOS update is easy to try out as a pilot test and then either apply to your entire fleet or revert back from if you run into problems. This gets into the cattle versus pets issue, of course. If you have cattle, you can paint some of them pink without anyone caring very much.
2015-03-08
Why ZFS's 'directory must be empty' mount restriction is sensible
If you've used ZFS for a while, you may have run across the failure mode where some of your ZFS filesystems don't mount because the mount point directories have accidentally wound up with something in them. This isn't a general Unix restriction (even on Solaris); it's an extra limit that ZFS has added. And I actually think that it's a sensible restriction, although it gets in my way on rare occasions.
The problem with mounting a filesystem over a directory that has things in it is that those things immediately become inaccessible (unless you do crazy hacks). Unix lets you do this anyways for the same reason it lets you do other apparently crazy things; it assumes you know what you're doing and have a good reason for it.
The problem with doing this for ZFS mounts too is that ZFS mounts are generally implicit, not explicit, and as a result ZFS mounts can basically appear from nowhere. If you import a pool, all of its filesystems normally get automatically mounted at whatever their declared mount point is. When you imported that old pool to take a look at it (or maybe you're failing over a pool from one machine to another), did you remember that it had a filesystem with an unusual mountpoint that you've now turned into a normal directory?
(As it stands, you can't even find out this information about an
unimported pool. You'd have to use a relatively unusual import
command and then poke around the imported but hopefully inactive
pool. Note that just 'zpool import -N' isn't quite enough.)
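Roughly the sort of unusual import involved, sketched with a
placeholder pool name: import the pool unmounted, read-only, and
under an alternate root, then look at what mountpoints its
filesystems would claim.

# zpool import -N -o readonly=on -R /tmp/inspect somepool
# zfs list -r -o name,mountpoint,canmount somepool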
Given the potential risks created by magically appearing filesystems,
ZFS's behavior here is at least defensible. Unix's traditional behavior
of allowing this at least required you to explicitly request a
specific mount (more or less, let's wave our hands about /etc/fstab),
so there was a pretty good chance that you really meant it even if
you were going to make some stuff inaccessible. With ZFS pool
imports, perhaps not so much.
(You can also get this magically appearing mount effect by assigning
the mountpoint property, but you can argue that this is basically
the same thing as doing an explicit mount. You're at least doing
it to a specific object, instead of just taking whatever happens
to be in a pool with whatever mountpoints they happen to have.)