2010-12-31
One group that can sensibly use non-GPL'd Linux kernel modules
A rather long time ago, a commenter on my entry on the practicalities of non-GPL'd kernel modules wrote, in part:
ZFS is most beneficial to companies running storage servers. These companies will only use supported modules [...]
I actually think that this is backwards. Ignoring the various legal issues for the moment, companies building commercial storage servers are exactly the people that I would not be surprised to see using a kernel module that was not and could never be in the mainline Linux kernel.
(I agree that organizations rolling their own hand-built storage servers on commodity Linux distributions would be crazy to adopt any out of kernel module that was not really crucial and very well maintained, or at least trying very hard to get into the mainline and likely to succeed.)
More generally, I think that any company selling a black box system based on Linux has a number of decent reasons to consider using out-of-kernel modules:
- delivering a black box instead of a Linux system (or distribution)
drastically reduces the pressure of updating to new Linux kernels and
thus the support burden of dealing with porting code and so on.
- they may genuinely need the features that the out-of-kernel module
gives them; to put it another way, such modules can give them
significant competitive advantages (as in the case of ZFS).
- if they're doing a good job, they already need to have fairly deep kernel expertise on hand in order to tune things for their specific environment and to fix any bugs that they run into.
The last point deserves a bit of elaboration. First, the attraction of black boxes to your customers is that they just work. Thus, as a company, you do not have the time to wait for anyone else to solve the problems you run into, either before or after you ship; no matter what, you need your own local expertise so that you can fix problems in a timely manner.
In a sense, the extra work required to make out-of-kernel modules work is a competitive advantage. In the long term, a good part of your profits are determined by how much value you add; the more value you add, the more profits you get (conversely, the less value you add, the less money you make). Doing the work costs you money, but if you choose correctly it gives you more than that back in extra value added.
2010-12-19
A tale of memory allocation failure
I have a message for developers, and not the one you might think. Here is my message:
Sometimes memory allocation failures are not because the system is out of memory.
There is a subset of programmers who do not really want to deal with memory allocation failures, and to be honest I can't entirely disagree with them; recovering from memory allocation failures in a complex program that also wants to use as little memory as possible is quite non-trivial and rather contorted. Apart from just exiting on the spot, these programmers' non-coping strategies generally involve either retrying the allocation (usually after a short delay) or failing the immediate operation without doing anything fundamental to change the program's state, so that it will almost immediately try to allocate more memory again.
(For instance, you might try to respond to network input being available by allocating some data structures and then reading the data in and processing it. When your initial allocation fails, you immediately drop back into the main loop without doing anything more, where you once again find that there's network input ready to be read and the whole thing repeats.)
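As a concrete illustration, here is a minimal C sketch of that second pattern (the function and names here are made up for the example, not taken from any real program). If malloc() keeps failing, this loop spins forever, because nothing about the program's state ever changes:

    #include <poll.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BUFSIZE 65536

    /* Placeholder for the real work: just read and discard the pending data. */
    static void process_input(int fd, char *buf, size_t len)
    {
        read(fd, buf, len);
    }

    void main_loop(int sockfd)
    {
        struct pollfd pfd = { .fd = sockfd, .events = POLLIN };

        for (;;) {
            if (poll(&pfd, 1, -1) < 1)
                continue;
            if (pfd.revents & POLLIN) {
                char *buf = malloc(BUFSIZE);
                if (buf == NULL)
                    continue;   /* 'fail' the operation; the input stays pending */
                process_input(sockfd, buf, BUFSIZE);
                free(buf);
            }
        }
    }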
These are sort of sensible coping strategies if the system is out of memory; either you're likely to get memory back soon, or the system is likely to crash. But, well, not all memory allocation failures are because the system is out of memory. And that's where my story comes in.
After upgrading our login servers to Ubuntu 10.04 recently, we started
seeing a number of locked up GNU Emacs processes; clearly abandoned by
their users, they were sitting there burning CPU endlessly. Some work
with strace showed that they were looping around trying to do memory
allocations, which kept failing. Oh, and all the processes were 3 GB in
size. On 32-bit machines.
(Technically they were just under 3 GB of address space; most of the tools I used to look at them rounded this up.)
Given that on conventional 32-bit x86 Linux kernels, processes can only have 3 GB of virtual address space, those allocations were never going to succeed. They were not failing because of any transient condition; they were failing because GNU Emacs had allocated all of the memory it ever could. Not doing something to fundamentally change the situation just sent the program into an endless loop.
(I'm honestly impressed that GNU Emacs managed to get so close to 3GB of address space, given various holes in the address space.)
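If I were going to sketch a slightly more defensive approach (this is purely my own illustration, not what GNU Emacs or anything else actually does), it would be to bound the retries and then treat the failure as permanent, so that a persistent failure surfaces to the caller instead of turning into a busy loop:

    #include <stdlib.h>
    #include <unistd.h>

    /* Retry a few times in case the shortage really is transient, then give
     * up and return NULL so the caller has to genuinely cope.  If the process
     * has simply run out of address space (as with a 3 GB process on 32-bit
     * x86), no amount of retrying will ever help. */
    void *alloc_with_retries(size_t size)
    {
        for (int attempt = 0; attempt < 5; attempt++) {
            void *p = malloc(size);
            if (p != NULL)
                return p;
            sleep(1);   /* brief pause in case memory comes back */
        }
        return NULL;    /* persistent failure; don't just loop around and retry */
    }

Even this only pushes the problem back to the caller, but at least a process that has exhausted its address space fails visibly instead of spinning forever.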
2010-12-13
Fumbling towards understanding how to use DRBD
For a long time I thought of DRBD as a way of getting shared storage or network based RAID-1 (as its website puts it), and I didn't entirely see the point. But there's a different way of looking at it, one that's quite eye-opening and which I was recently exposed to somewhere.
A standard high availability setup for something like virtualization has four machines: two frontends and two backend disk storage machines, with the storage mirrored between the backends and both frontends seeing all storage (this is our setup, for example). If one frontend machine fails, you just bring the services up on the other; if one backend machine fails, you still have a copy of the data on the other one. This is a traditional shared storage setup.
But many services these days have relatively modest disk space demands (virtualization is again a common example). If all you need is a TB or two, the kind of disk space it's easy to fit into a modern server, it seems rather wasteful to use two entire backend machines (and possibly a couple of switches) to deliver it. So let's do without them.
Take two frontend machines with as much disk as possible and split their data space in half. One half is used to host local services, and is replicated to the other frontend with DRBD; the other half is the replica target for the other frontend's local data space. All services get local disk speeds for reads (and maybe close to that for writes). If one frontend fails, the other has a full copy of its data; it declares itself the primary for that half and starts up the services that normally run on the other frontend.
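I haven't built this myself, but as a rough sketch a drbd.conf for this cross-replicating layout might look something like the following (the hostnames, partitions, and IP addresses are all invented):

    resource services-alpha {
        protocol C;
        on alpha {
            device    /dev/drbd0;
            disk      /dev/sda5;      # the half alpha uses for its own services
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on beta {
            device    /dev/drbd0;
            disk      /dev/sda5;      # the half of beta's disk holding the replica
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }

    resource services-beta {
        protocol C;
        on alpha {
            device    /dev/drbd1;
            disk      /dev/sda6;      # replica of beta's data
            address   10.0.0.1:7789;
            meta-disk internal;
        }
        on beta {
            device    /dev/drbd1;
            disk      /dev/sda6;      # the half beta uses for its own services
            address   10.0.0.2:7789;
            meta-disk internal;
        }
    }

The configuration doesn't say which machine is primary for which resource; that's runtime state, so in normal operation you'd run 'drbdadm primary services-alpha' on alpha and 'drbdadm primary services-beta' on beta, and a failover is basically promoting the survivor to primary for the other resource as well and then starting its services there.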
This approach doesn't scale up as well as an actual SAN; as you add more frontends that need to be able to replace each other, you lose an increasing amount of disk space to data replicas. But it has the great virtue that it works quite efficiently at a small scale, where it lets you use about the minimum number of machines possible (since you're always going to need two machines for frontend redundancy).
(It turns out that this is another story of me not reading the documentation, since I think this is kind of spelled out on the DRBD website. In my defense, it never sounded interesting enough to make me want to read the website; 'networked RAID-1' is not really something I think of as very attractive, and iSCSI and AOE are both more broadly supported for general network disk access.)
2010-12-12
One problem with the files-in-directory approach to configuration
Over the past while, there's been a definite pattern in Linux software of moving from a single monolithic configuration file (or a control file) to a configuration that is created by scanning a directory and looking for eligible files. From a packaging system and system management perspective, there is a lot to recommend this approach; it makes it much easier for packages and sysadmins to add and remove pieces of the configuration.
However there is a drawback to this approach as commonly implemented,
which is handily illustrated by Ubuntu 10.04's new superintelligent
scheme to automatically update /etc/motd. On 10.04, the motd is
created by running scripts in /etc/update-motd.d; each script prints
out some portion of the standard motd's contents, and the driver script
collects them all together. With some glue, this lets Ubuntu give people
various standardized, up-to-the-minute reports on things like how many
package updates are pending and whether the system should be rebooted.
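As an illustration of the mechanism (the script name here is made up), a fragment is just an executable that writes its piece of the motd to standard output, and the pieces are assembled in filename order:

    #!/bin/sh
    # /etc/update-motd.d/50-demo -- a made-up example fragment.
    # Each fragment writes its piece of the motd to standard output.
    echo "Welcome to $(hostname), running kernel $(uname -r)"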
This sounds great, except that on our systems we don't want these notes about pending package updates, needed reboots, and so on to be there. Given that we run the systems and apply package updates when we want to, all they do is confuse and perhaps alarm users.
(Right now, we don't disable the entire thing because we do want the current kernel and Ubuntu version to be in the motd.)
The problem is that Ubuntu's magic motd system doesn't give us any way to turn these standard bits off short of deleting the files that create them. Deleting files is a bad approach to controlling things in any environment with a package management system; you run the risk that a deleted file will be restored the next time you upgrade the package, package verification will distract you by reporting the issue, you can no longer look at the file to see what it does (or did), and so on.
(In some package management systems, if the particular package marked a file that you deleted in just the right way, it won't get restored when you apply a package update. In practice this is useless to sysadmins, because it's not something that we can always count on.)
This is a general problem with pretty much all of these schemes. They're very well set up for packages that want to add and remove their own bits of a configuration (and they're a substantial improvement on how things used to be), but they are not at all well set up for sysadmins who want to manage the configuration of packages with confidence.
Sadly I can't think of a great solution to this problem, although
there are a number of decent ones (such as the Debian Apache-style
separation of availability from activation, where module configuration
files go in /etc/apache2/mods-available and are then symlinked into
/etc/apache2/mods-enabled to actually turn them on).
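For illustration, here is roughly what the Apache version of that pattern looks like from the sysadmin side (in practice a2enmod and a2dismod manage these symlinks for you); something similar for update-motd.d would let us turn fragments off without deleting anything that the package owns:

    # The package ships its configuration fragment in mods-available:
    ls /etc/apache2/mods-available/status.load
    # Enabling the module is just making a symlink in mods-enabled:
    ln -s ../mods-available/status.load /etc/apache2/mods-enabled/status.load
    # Disabling it again removes the symlink and leaves the package's file alone:
    rm /etc/apache2/mods-enabled/status.load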