Wandering Thoughts archives

2015-06-26

The status of our problems with overloaded OmniOS NFS servers

Back at the start of May, we narrowed down our production OmniOS problems to the fact that OmniOS NFS servers have problems with sustained 'too fast' write loads. Since then there have been two pieces of progress and today I feel like writing about them.

The first is that this was identified as a definite Illumos issue. It turns out that Nexenta stumbled over this and fixed it in their own tree in this commit. The commit has since been upstreamed to the Illumos master and has made it into the repo for OmniOS r151014 (although I believe it's not yet in a released update). OmniTI's Dan McDonald did the digging to find the Nexenta change after I emailed the OmniOS mailing list, and then built us a kernel with it patched in. We were able to run that kernel in our test environment, where it passed with flying colors. This is clearly our long-term solution to the problem.

(In case it's not obvious, Dan McDonald was super helpful to us here, which we're quite grateful for. Practically the moment I sent in my initial email, our problem was on the way to getting solved.)

In the short term, we found that taking a fileserver from 64 GB of RAM to 128 GB of RAM meant we could no longer reproduce the problem in either our test environment or the production fileserver that was having problems. In addition, it appears to make our test fileserver significantly more responsive under heavy load. Currently the production fileserver is running without problems with 128 GB of RAM and 4096 NFS server threads (and an increase in kernel rpcmod parameters to go with it). It's definitely survived getting into memory use situations that we'd have expected to lock it up based on prior experience.

(At the moment we've only upgraded the one problem fileserver to 128 GB and left the others at 64 GB. The others get much less load due to some decisions we made during the migration from the old fileservers to our current ones.)

We still have some other issues with our OmniOS fileservers, but for now the important thing is that we have what seems to be a stable production fileserver environment. After all our problems getting here, that is a very big relief. We can live with 1G Ethernet instead of 10G; we can't live with fileservers that lock up under load.

OmniOSNFSOverloadStatus written at 01:17:35

2015-06-18

The cost of OmniOS not having /etc/cron.d

I tweeted:

Systems without /etc/cron.d just make my sysadmin life harder and more annoying. OmniOS, I'm looking at you.

For those people who have not encountered it, this is a Linux cron feature where you can basically put additional crontab files in /etc/cron.d. To many people this may sound like a minor feature; let me assure you it is not.

Here is why it is an important feature: it makes adding, modifying, or deleting your crontab entries as trivial as copying a file. It is very easy to copy files (or create them). You can trivially script it, there are tons of tools to do this for you in various ways and from various sources (from rsync on up), and it is very easy to scale file copies up for a fleet of machines.
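To make 'as trivial as copying a file' concrete, here is a minimal sketch of installing a crontab fragment on a system that does have /etc/cron.d. The function name, the fragment name, and the example entry are all hypothetical. It assumes the Linux-style cron.d format, where each entry has a user field after the time fields, and it assumes cron ignores the temporary dotfile while it briefly exists.

```python
import os
import tempfile


def install_cron_fragment(name, entries, cron_dir="/etc/cron.d"):
    """Atomically install a crontab fragment as cron_dir/name.

    'entries' is a list of cron.d-format lines (with a user field).
    Hypothetical helper for illustration, not part of any cron package.
    """
    body = "\n".join(entries) + "\n"
    # Write to a temporary file in the same directory and then rename it
    # into place, so cron never sees a half-written fragment.
    fd, tmp_path = tempfile.mkstemp(dir=cron_dir, prefix=".tmp-" + name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(body)
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, os.path.join(cron_dir, name))
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        os.unlink(tmp_path)
        raise
    return os.path.join(cron_dir, name)


# Example (hypothetical job; needs root to write to the real /etc/cron.d):
# install_cron_fragment("pool-check",
#                       ["15 3 * * * root /usr/local/sbin/pool-check"])
```

Changing when the job runs is just calling this again with a new entry, and removing the job is just deleting the one file. Nothing else in cron's configuration gets touched either way.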

Managing crontab entries without this is either painfully manual, involves attempts to do reliable automated file editing through interfaces not designed for it, or requires you to basically build your own custom equivalent of it and then treat the system crontab file as an implementation detail inside your cron.d equivalent. This is a real cost and it matters for us.
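As a rough illustration of the 'build your own custom equivalent' option, here is a sketch that assembles a directory of fragments into a single crontab text. The directory layout and the function name are assumptions for illustration, not something we actually run. Because the result is meant to become one user's crontab, the fragments here would use the plain crontab format with no user field.

```python
import os


def assemble_crontab(fragment_dir):
    """Concatenate every fragment in fragment_dir into one crontab text.

    Hypothetical home-grown stand-in for /etc/cron.d: the fragments are
    the real source of truth and the generated crontab is only an
    implementation detail.
    """
    parts = []
    # Sort the names so the generated crontab is stable between runs.
    for name in sorted(os.listdir(fragment_dir)):
        # Skip dotfiles and editor backup files.
        if name.startswith(".") or name.endswith("~"):
            continue
        path = os.path.join(fragment_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            text = f.read()
        # Make sure each fragment ends with a newline before joining.
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append("# --- from %s ---\n%s" % (name, text))
    return "".join(parts)
```

The generated text would still have to be loaded into cron on every machine (as far as I know, by handing a file to the crontab command), and anything edited by hand with 'crontab -e' would be overwritten by the next regeneration. That's the extra machinery you wind up owning.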

With /etc/cron.d, adding a new custom-scheduled service on some or all of our fileservers would be trivial and guaranteed to not perturb anything else. In particular, adding it to all of them would be no more work than adding it to one or two (and might even be slightly less work). With current OmniOS cron, it is dauntingly and discouragingly difficult. We have to log in to each fileserver, run 'crontab -e' by hand, worry about an accidental edit mistake damaging other things, and then update our fileserver install instructions to account for the new crontab edits. Changed your mind and need to revise just what your crontab entry is (eg to change when it runs)? You get to do all that all over again.

The result is that we'll do a great deal to avoid having to update OmniOS crontabs. I actually found myself thinking about how I would invent my own job scheduling system in central shell scripts that we already run out of cron, just because doing that seemed like less work and less annoyance than slogging around to run 'crontab -e' even once (and it probably wouldn't have been just once).

(Updates to the shell scripts et al are automatically distributed to our OmniOS machines, so they're 'change once centrally and we're done'.)

Note that it's important that /etc/cron.d supports multiple files, because that lets you separate each crontab entry (or logically related chunk of entries) into an independently managed thing. If it were only a single file, multiple separate things that all wanted crontab entries would have to coordinate updates to the file. This would get you back to all sorts of problems, like 'can I reliably find or remove just my entries?' and 'are my entries actually there?'. With /etc/cron.d, all you need is for people (and systems) to pick different filenames for their particular entries. This generally happens naturally because you get to use descriptive names for them.
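For contrast, here is a sketch of what 'find or remove just my entries' tends to look like when everything lives in one shared file. It assumes a hypothetical convention where each owner wraps its entries in '# BEGIN name' and '# END name' marker comments. Nothing in cron itself knows about or enforces this.

```python
def replace_managed_block(crontab_text, owner, new_entries):
    """Replace (or remove) one owner's marked block in a shared crontab.

    Hypothetical marker convention: lines between '# BEGIN <owner>' and
    '# END <owner>' belong to that owner. Pass an empty list to remove
    the owner's entries entirely.
    """
    begin = "# BEGIN %s" % owner
    end = "# END %s" % owner
    kept = []
    inside = False
    for line in crontab_text.splitlines():
        if line == begin:
            inside = True       # start dropping this owner's old block
        elif line == end and inside:
            inside = False      # old block finished
        elif not inside:
            kept.append(line)   # everyone else's lines pass through
    if new_entries:
        # Re-add the owner's block at the end of the file.
        kept.append(begin)
        kept.extend(new_entries)
        kept.append(end)
    return "\n".join(kept) + "\n" if kept else ""
```

This only works if every writer follows the convention perfectly. If an END marker goes missing (say, from a stray hand edit), this version silently drops everything after the BEGIN line, including other people's entries. With separate files, the filesystem does all of this bookkeeping for you.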

NoCronDCost written at 00:51:05
