2010-06-28
The great irritation of hidden access controls
One of the things that I really hate about modern desktop environments, specifically their graphical system administration tools, is that they increasingly hide their access controls and authorization systems entirely. I know that they have them, because they have some mechanism for doing various operations that require root permissions without bothering to ask me for the root password. But it's become just about impossible to find them; what was once at least visible in PAM configurations has vanished into what I believe is a twisty mess of Kits (PackageKit, PolicyKit, whatever) and DBus-based agents.
(The most flagrant recent example of this is my Fedora 13 machine, which will happily let me click buttons to apply package updates without ever asking me for the root password or giving me any way to control this. Where this is controlled is undocumented and opaque, as far as I can tell. Perhaps the system will just permit all console users to do this, or perhaps I accidentally clicked on a button to tell it to remember this permission permanently. Fortunately I am not trying to run a lab full of Fedora 13 machines, or this really would be hell.)
I'm reasonably sure that all of this is perfectly transparent if you are an expert who works in this area all of the time (and hopefully there are magic controls somewhere). I'm equally sure that it's completely opaque if you are not, and that's a problem.
One irritation of this is that I wind up feeling that I'm not in control of my system, which is not something that I'm accustomed to on a Unix machine. Unix systems are not supposed to be mysterious black boxes where things just happen for your own good (or just because), controlled by distant developers and system creators; they are supposed to be systems where you can follow everything that is going on. The 'we know best', complex, opaque world is the world of Windows, and I am not enthused about Unix becoming Windows.
(This view of Unix is starting to be a trend.)
Sadly, part of the problem is DBus, which has become the great nexus of spooky privileged action at a distance on modern Linux machines. In the old world, even the world of heavily PAM-ified magic authorization, you could at least follow the path of setuid binaries (and what ran them) to work out what was going on. In the new DBus world, setuid is no longer required; all you need is the right DBus system agents that programs can talk to, and magic happens.
(Sometimes the magic is even documented, but don't hold your breath on that. These things are not intended to be user-accessible parts, so they seem to get about the documentation you'd expect. And in this, sysadmins pretty much count as ordinary users.)
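If you want to at least see what is lurking on the other end of all this, one starting point is to enumerate what owns names on the DBus system bus; privileged services such as PackageKit's backend register well-known names there. Here's a minimal sketch using the Python DBus bindings (it assumes the python-dbus package is installed, and it only shows you who is on the bus, not what PolicyKit will let them do for whom):

    # Minimal sketch: list the well-known names on the DBus system bus.
    # Assumes the python-dbus bindings are installed; this only shows
    # what services exist, not what they are authorized to do.
    import dbus

    bus = dbus.SystemBus()
    proxy = bus.get_object('org.freedesktop.DBus', '/org/freedesktop/DBus')
    iface = dbus.Interface(proxy, dbus_interface='org.freedesktop.DBus')
    for name in sorted(iface.ListNames()):
        if not name.startswith(':'):    # skip anonymous connection IDs
            print(name)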
2010-06-14
The practicalities of non-GPL'd Linux kernel modules
A commotion has broken out on the ZFS mailing list; once again, someone has announced an open-source project to port the CDDL-licensed ZFS code to the Linux kernel, turning it into a kernel module. Predictably, a number of people are pointing out that this mixing of the CDDL and the GPLv2 is (probably) not legally viable, and equally predictably other people are both saying that it is and putting forth ways around the problem. But all the legal argument is missing the point; regardless of what the legal situation really is, in practice a non-GPL'd open-source Linux kernel module project is mostly useless.
The reasons are cultural (and rationally risk-based):
- you'll never be included in the main kernel.org kernel source. The main kernel source is GPL v2 only, period. (Since the Linux kernel has no stable kernel API, your project will be burdened with relatively constant effort to merge upstream API changes into your codebase.)
- you'll basically never be included in strict distributions like Debian and Fedora.
- you'll almost certainly not be included in more pragmatic distributions like Ubuntu and possibly Red Hat Enterprise, unless you provide some really compelling advantage.
(While the situation of an open-source kernel module with a conflicting license is not quite the same as the situation with essentially closed-source graphics card drivers, ATI's and NVidia's closed source drivers are a much more compelling advantage than another filesystem and they're still very much on the outside.)
The practical effect is that a non-GPL'd kernel module will simply not be used by the vast majority of Linux users because it will not be included in their distribution (or won't be in the default setup, which is more or less the same thing). This is especially the case for things like filesystems that need significant extra work over just installing the module; now you need users who not only know about your module and where to find it (even if this is relatively easy) but also care enough to do the extra work to use it.
(Graphics card drivers are not a comparable case because by and large they deliver benefits without users having to do anything beyond installing the software package (although configuration can give you even more benefits).)
The net result is that you wind up with an unimportant niche project that's used by almost no one.
(Many of these issues apply to all out-of-kernel kernel modules, regardless of their license. It's just that they especially apply to non-GPL'd kernel modules, because they are always going to be out-of-kernel modules.)
Sidebar: the risk to distributions
A Linux distribution that adopts any out of kernel module is taking on a practical risk, namely that the module will stop being kept up to date with the main kernel. If this happens, the distribution can either drop the module, annoying users who were using it, or start developing it itself; neither is an appetizing option.
This risk is especially acute for filesystems that want to be first class citizens in a distribution, with support for them at install time, users being encouraged to put real data on the filesystem, and so on. Dropping support for an uncommon feature is moderately painful to most people; making their data inaccessible or their system unbootable (or just not upgradeable) is really, really bad. It loses you users in a real hurry and causes incandescently angry people who will remember you very unkindly for years to come. As a result, you are effectively stuck with supporting the code, almost no matter how much work it takes.
(This can happen even to in-kernel filesystems, which is one reason that distributions are really conservative about what filesystems they consider first class citizens. Look at what happened with SuSE and ReiserFS 3, where my impression is that SuSE was basically forced to take over supporting the filesystem because the main developers first lost interest in favour of ReiserFS 4 and then evaporated.)
2010-06-12
iSCSI Enterprise Target 1.4.20.1 and disk write caches (and ZFS)
First, a note about IO modes. IET
has two ways of doing IO to whatever actual storage is behind the iSCSI
targets it advertises, called 'fileio' and 'blockio' respectively. Fileio
more or less does regular filesystem IO, as if you had opened the backing
storage at user level and were doing read() and write() to it. Blockio
does low-level direct block IO to the backing storage. All of the following
is specific to blockio.
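As a concrete illustration, the IO mode is chosen per LUN in ietd.conf with a Type= setting. The target name and paths here are made-up placeholders, and this is only a sketch of the syntax rather than a complete, known-good configuration:

    Target iqn.2010-06.com.example:sandisk1
        Lun 0 Path=/dev/sdb,Type=blockio
        Lun 1 Path=/srv/iscsi/lun1.img,Type=fileio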
There are two levels of possible write caching in any Linux iSCSI target implementation: write caching done in the target server's RAM, and the write caches of the physical disks themselves. IET in blockio mode has no in-memory caching; blockio iSCSI targets are always in write-through mode, where all writes are immediately sent to the underlying storage (whatever it is). IET does nothing in particular about any write caching in the underlying storage.
IET advertises all blockio iSCSI LUNs as having write caching disabled ('WCD'); this is theoretically justifiable because it doesn't do any in-memory write caching. IET doesn't allow the initiator to change this; the write cache status of a LUN (whether blockio or fileio) is a local configuration decision that is not subject to remote overrides. Besides, it can't turn on an in-RAM write cache for blockio LUNs, as there simply isn't any code for it.
(In actual fact the code is generic and advertises WCD versus WCE based on whether the LUN is in writethrough or writeback mode. Since blockio LUNs are always writethrough, IET always advertises them as WCD.)
IET ignores cache flush operations on blockio LUNs; cache flush commands do not get errors the way MODE SELECT does, but they don't have any effect. In particular, cache flushes do not get passed to the underlying disk. This is somewhat unfortunate. While IET itself has no cached writes to flush, the underlying physical disk may have its write cache enabled and if so, you would like it to get flushed.
Thus, the only truly safe way to use IET 1.4.20.1 (or any prior version) in blockio mode is to turn the write caches off on your physical disks unless your 'disks' themselves have some sort of non-volatile write caches (for example, a hardware RAID card with NVRAM). Nor is there any way for an initiator to discover the true end-to-end state of write caching, since IET always claims blockio LUNs have write caching disabled.
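For ordinary SATA disks, turning the write caches off usually comes down to running 'hdparm -W0' against each disk (SAS/SCSI disks are generally handled with sdparm instead). Here is a minimal Python sketch of the hdparm approach, with placeholder device names standing in for whatever actually backs your blockio LUNs:

    # Minimal sketch: disable the volatile write cache on each backing disk.
    # Assumes hdparm is installed and you're running as root; the device
    # names are placeholders, not anything from our real configuration.
    import subprocess

    DISKS = ["/dev/sdb", "/dev/sdc"]

    for disk in DISKS:
        # 'hdparm -W0' turns off the drive's write cache on ATA/SATA disks.
        subprocess.check_call(["hdparm", "-W0", disk])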
(There is a patch under development to add real cache flush support for blockio, but I don't know if or when it will appear in an IET release.)
PS: note that IET cannot sanely support allowing initiators to turn on and off the disk write caches of the disk(s) behind a LUN. Since a single physical disk may be shared between multiple LUNs and it only has one global setting for its write cache, allowing this would allow one initiator on one LUN to change the write cache setting behind another LUN's back.
Sidebar: how ZFS interacts with all of this
Earlier I wrote:
(The situation with ZFS, IET, and write caches is beyond the scope of this entry, but ZFS's inability to nominally enable the nominal disk write caches is not currently a problem for us.)
Now I can explain what I meant by that. For blockio LUNs, the only thing that enabling or disabling the nominal disk write cache could do is change whether IET reports the LUN as WCD or WCE. Per yesterday, the net effect of reporting LUNs as WCD is to cause ZFS to not send cache flush requests for them. And in turn this is 'okay' because IET wouldn't do anything with the cache flush requests even if ZFS did send them.
The patch being developed for IET currently makes blockio LUNs report as WCE if the underlying storage appears to support cache flush operations. With the patch, ZFS sees that the 'disks' need cache flushes, sends them, and the cache flushes are propagated to the physical disks. As a result, you can safely run the physical disks with their write caches enabled.
2010-06-10
What an iSCSI Enterprise Target kernel message really means
IET is the Linux iSCSI target implementation that we use. Periodically it will spit out somewhat alarming kernel messages that look like this:
iscsi_trgt: scsi_cmnd_start(1045) Unsupported 15
iscsi_trgt: cmnd_skip_pdu(454) e0da9005 1c 15 32
(The numbers in brackets will vary from IET version to version; they are line numbers in the source code of IET's kernel module.)
What this actually means is best summarized as 'iscsi_trgt:
unsupported or unimplemented SCSI opcode 0x15'. How important this
is depends on what SCSI opcode 0x15 turns out to be and how much your
initiator machine cares about having it supported (and what happens
if it's not). Sometimes everything will go on without problems, but
at other times this is the prelude to a nasty explosion because your
initiator really does need that particular SCSI operation.
The most convenient breakdown that I know of for what SCSI opcode
has what hex value is include/scsi/scsi.h in the Linux kernel
source; you can see it online here.
Or there's also the handy Wikipedia page (silly me for not looking
there before now), which even links to explanations of some of the opcodes.
The ones that I've seen in our environment are:
- 0x15 (MODE SELECT); sent by ZFS to try to turn on disk write caches. IET doesn't support the initiator changing whatever write caching the target has set up (or any of the other things theoretically settable in SCSI mode pages).
- 0x4d (LOG SENSE), presumably because the Solaris initiator is trying to check for disk errors. Fortunately we are already doing that on the iSCSI backends (well, usually).
- 0x5e (PERSISTENT RESERVE IN); this is a query, so the Solaris initiators may be trying to see if there is cluster storage fencing in effect.
So far, none of these have turned out to be essential for us.
(The situation with ZFS, IET, and write caches is beyond the scope of this entry, but ZFS's inability to nominally enable the nominal disk write caches is not currently a problem for us.)
Sidebar: the exact format of this message
For my future reference: the number after the Unsupported is the SCSI opcode byte, in hex. The cmnd_skip_pdu numbers are all in hex and are, in order, the command's initiator task tag (ITT), the command's iSCSI opcode, the SCSI opcode byte, and the iSCSI PDU's data size. The names are the names of the functions in the IET kernel module that print these errors (and as mentioned, the number in the brackets is the line number in the source code).
(Note that iSCSI opcodes are not the same thing as SCSI opcodes. Welcome to the 'enterprise' madness that is iSCSI.)
iSCSI opcodes probably mean something to people who know the
protocol, but for people like me they are basically opaque
and probably unimportant. The IET source has a list of them
in kernel/iscsi_hdr.h, augmented with some aliases by
kernel/iscsi.h.
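To make the sidebar's description concrete, here is a quick decoder for the two messages shown earlier. This is purely my own sketch, not anything from the IET source; the regular expressions and the (very partial) table of SCSI opcode names are mine and only cover the opcodes mentioned in this entry:

    import re

    # A very partial table of SCSI opcode names; just the ones from this entry.
    SCSI_OPCODES = {
        0x15: "MODE SELECT",
        0x4d: "LOG SENSE",
        0x5e: "PERSISTENT RESERVE IN",
    }

    def decode(line):
        # 'scsi_cmnd_start(NNN) Unsupported XX': XX is the SCSI opcode in hex.
        m = re.search(r'scsi_cmnd_start\(\d+\) Unsupported ([0-9a-f]+)', line)
        if m:
            op = int(m.group(1), 16)
            name = SCSI_OPCODES.get(op, "something I haven't looked up")
            return "unsupported SCSI opcode 0x%02x (%s)" % (op, name)
        # 'cmnd_skip_pdu(NNN) itt iscsi-op scsi-op size', all four fields in hex.
        m = re.search(r'cmnd_skip_pdu\(\d+\) (\w+) (\w+) (\w+) (\w+)', line)
        if m:
            itt, iscsi_op, scsi_op, size = (int(x, 16) for x in m.groups())
            return "ITT 0x%x, iSCSI opcode 0x%02x, SCSI opcode 0x%02x, %d bytes of PDU data" % (itt, iscsi_op, scsi_op, size)
        return "not a message this sketch recognizes"

    print(decode("iscsi_trgt: scsi_cmnd_start(1045) Unsupported 15"))
    print(decode("iscsi_trgt: cmnd_skip_pdu(454) e0da9005 1c 15 32"))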