Wandering Thoughts archives

2012-09-27

Microkernels and device drivers

A commentator on my entry on microkernel modularity asked a good question:

One common argument pro microkernel is that they are more robust, because individual drivers can crash without taking the whole system down. What's your take on this?

I don't know if anyone has done practical studies on this, and I certainly don't have any personal experience with it myself, but I'm dubious about this claim for several reasons.

To start with, drivers control hardware, and hardware itself is extraordinarily powerful. On most machines, hardware can do DMA to (and from) any memory that the driver asks it to, so a driver bug can already smash memory regardless of what access rights the driver code theoretically has or doesn't have. The counter-argument here is that bugs in DMA targets are relatively rare; addressing bugs in the driver code itself are much more common, and microkernels do protect against those.

However the big issue is, well, let me repeat something that Dan Astoorian quoted in a comment on here:

"Never test for an error condition you don't know how to handle." -- Steinbach's Guideline for Systems Programmers.

So, one of your microkernel's driver processes has crashed. What do you do next?

Active drivers are highly likely to be crucial to the operation of the system. If you lose the disk controller, the network hardware, or any number of other drivers, your system is very close to being a paperweight even if other bits keep going. In a technical sense the whole system may not have crashed, but the effects are basically the same. The obvious fix for this is to restart and re-run the driver process somehow, just as if it was a user-level process that you were restarting after a crash. Unfortunately, hardware drivers deal with hardware. This means that a crashed driver leaves its hardware in some unknown state, possibly one that's dangerous to touch. In general, to recover the hardware itself and thus make the driver actually useful you need to return the hardware to a known state somehow. And you need to do this with the hardware in an arbitrary state, without you knowing anything about that state.

I'm sure that there's hardware where you can do this (for example, hardware where you can tell the bus to turn off power to the device and then turn it back on). I'm also sure that there's plenty of hardware where you can't and a certain amount of hardware where mis-programming it (or sometimes partially programming it) will lock up your entire machine. This is not something that any amount of microkernel driver isolation can help you with.

The counter-argument is that a microkernel's isolation gives you more options. The system can choose to leave the driver down until a restart is initiated by some user-level action, for example. And if the system decides that the best way out is to reboot, you're likely to have more of a chance to save things and terminate processes in an orderly way, since it's probable that any damage the crashing driver did has been confined to its own memory.
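
As a very rough sketch of the recovery flow being described here (everything in it is invented: the simulated driver, power_cycle(), and the restart policy are stand-ins, and real recovery depends entirely on what the bus and the hardware actually let you do), a user-level driver supervisor might look something like this in Python:

    import multiprocessing
    import time

    # Hypothetical sketch only; nothing here corresponds to a real driver API.

    def fake_driver(device):
        # Simulated driver process: runs briefly, then hits a bug and dies,
        # leaving the (simulated) hardware in an unknown state.
        time.sleep(0.1)
        raise RuntimeError("driver bug while programming %s" % device)

    def power_cycle(device):
        # Stand-in for 'ask the bus to cut power to the device and restore it',
        # forcing the hardware back to a known state.  Plenty of hardware has
        # no such capability, which is the core problem here.
        print("power-cycling", device)
        return True

    def supervise(device, max_restarts=3):
        for attempt in range(1, max_restarts + 1):
            driver = multiprocessing.Process(target=fake_driver, args=(device,))
            driver.start()
            driver.join()                 # returns when the driver process dies
            if driver.exitcode == 0:
                return                    # clean exit, nothing to recover
            # The driver crashed; the hardware state is now unknown.
            if not power_cycle(device):
                print("cannot reset", device, "- only a reboot will help")
                return
            print("restarting driver for", device, "(attempt %d)" % attempt)

    if __name__ == "__main__":
        supervise("disk0")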

MicrokernelDrivers written at 02:19:24

2012-09-26

Microkernels and modularity: do microkernels ever make sense?

Here's a question that I've been mulling over for a while, stated here as a thesis and argument.

The meta-goal of microkernels is to make it easier to write OSes and to make them more reliable by isolating OS components from each other; this increased isolation and modularity is achieved by limiting their ability to talk to each other and affect each other. In a conventional kernel, you have global variables, shared global data structures, and potential semi-random function calls as the flow of control zig-zags back and forth through various levels and modules. A microkernel replaces all of this with something more regimented, where the separate OS components can only interact with each other through the well defined IPC mechanisms provided by the microkernel.

But this is mostly an illusion. What we've really done is replace an obvious API (all of those function calls across modules) with a far less obvious and much more indirect API in the form of all of the messages passed between separated components over the microkernel IPC. Microkernels don't fundamentally reduce the complexity of what an OS must do or the interconnections between OS components; they just obscure it by wrapping it up in a layer or two more of abstraction and indirection. In fact this may well make the complexity worse precisely because it makes what's actually happening less obvious and harder to follow.
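
To make the indirection concrete, here's a deliberately tiny Python sketch (not modelled on any real microkernel; read_block(), the message format, and the 'driver' thread are all invented for illustration). Both versions perform exactly the same interaction; the second just hides the call inside a message and a reply:

    import queue
    import threading

    # Toy illustration only: the same 'read a block' interaction expressed two
    # ways.  Nothing here models a real kernel or microkernel API.

    # Monolithic style: a direct, visible function call across modules.
    def read_block(block_no):
        return b"data-for-block-%d" % block_no

    # Microkernel style: the same interaction, routed as a message to a
    # separate 'driver' component and answered with a reply message.
    requests = queue.Queue()

    def driver_server():
        while True:
            block_no, reply = requests.get()
            if reply is None:              # shutdown sentinel
                return
            reply.put(b"data-for-block-%d" % block_no)

    threading.Thread(target=driver_server, daemon=True).start()

    def read_block_via_ipc(block_no):
        reply = queue.Queue()
        requests.put((block_no, reply))
        return reply.get()

    print(read_block(7))
    print(read_block_via_ipc(7))    # same interaction, more plumbing
    requests.put((None, None))      # stop the toy driver thread

The interaction surface is identical in both cases; all the second version adds is machinery that makes it harder to see who is calling what.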

(One can draw an analogy to the whole modern web service approach of tunneling all APIs over HTTP requests, and observe that this has not exactly improved the APIs. If anything, it has complicated everyone's life.)

Or in summary: what matters is the interaction surface between your OS components, not how that interaction is achieved. At the superficial level, microkernels trade preventing accidents and blocking certain sorts of interaction that're considered dangerous (ie, global variables and 'APIs' that have not been explicitly planned out) for obscuring what the actual interactions are and which components they're happening between. If having to work through a microkernel is so awkward that you figure out a better structure for your OS components so they don't talk to each other as much, well, you could have done that even without the microkernel. The microkernel just gave you a useful push. As a tool for creating (increased) modularity, microkernels are a very blunt instrument with potentially bad consequences.

MicrokernelsAndModularity written at 02:02:19

2012-09-16

The problem with noise

The problem with noise is that humans habituate to it very fast. This is a problem because humans see (and hear, and generally perceive) what they expect to see (hear, etc). If you are habituated to noise, you hear noise and you ignore it. We are extremely badly adapted to picking up the one time out of many when what we are hearing is actual signal instead of noise.

(There are a number of fascinating and sometimes disturbing psychology experiments that demonstrate just how much of what we think we're perceiving is faked by our mind and how many things we overlook. I'll let you do the Internet searches for things like 'selective attention' yourself.)

In short, people are terrible at detecting true positives in a sea of false positives.

Any system that mostly generates false positives or other forms of noise that can't immediately be told from actual important things is in trouble; the more noise there is, the more trouble. In a relatively strong way, such systems are useless. By extension, creating a system that generates a lot of noise mixed in with its moderate signal and then saying 'but the information is there if you pay attention' is not solving the real problem, as usual. Engineering in the real world requires understanding the limitations of the real humans who will be using your system.

This applies all over the computing world, for example to security alerts, and underlies many other problems.

(By the way, you're still in trouble even if your signal is relatively distinct from noise. People do what they've been habituated to do and if that is 'delete email' or 'ignore phone', well, there you go. It doesn't really matter that people could easily tell that this was different if they paid attention to the email message because they probably won't, not if you've deluged them with noise before.)

(This is not novel observation and I've touched on this issue before in other entries. I just feel like writing the core issue down explicitly for once.)

NoiseProblem written at 00:42:27

2012-09-11

A realization about ratelimit time horizons

Here's something that's smacked me in the nose recently as I started working with Exim's ratelimits.

When you have a ratelimit it's usually expressed in terms of 'X events in Y time', and you generally get to pick both X and Y. Mathematically, there is no difference in how many total events a ratelimit allows over a long time period if you scale both X and Y together; 20 events per 10 minutes is the same as 120 events per hour. But in practice this is not the whole story. The two ratelimits behave differently, and here's how: the shorter the ratelimit's time interval is, the harsher it is on bursts of traffic. If you send to 50 recipients in a five-minute burst, you will trip the '20 in 10 minutes' limit but not the '120 in an hour' version.
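
To make the arithmetic concrete, here's a small Python simulation. It uses a plain sliding-window counter, which is not how Exim's ratelimit condition actually works (Exim computes a smoothed rate), but it's enough to show the difference between the two limits:

    from collections import deque

    # Simple sliding-window ratelimit: at most max_events in any period seconds.
    # This is just to illustrate the arithmetic, not Exim's actual algorithm.

    class Ratelimit:
        def __init__(self, max_events, period):
            self.max_events = max_events    # X: this many events...
            self.period = period            # Y: ...per this many seconds
            self.times = deque()

        def allow(self, now):
            # Forget events that have fallen out of the window, then check.
            while self.times and now - self.times[0] >= self.period:
                self.times.popleft()
            if len(self.times) >= self.max_events:
                return False
            self.times.append(now)
            return True

    def burst(limit, count, duration):
        # Spread 'count' events evenly over 'duration' seconds and report how
        # many get through before the limit cuts in.
        return sum(limit.allow(i * duration / count) for i in range(count))

    # 50 recipients in a five-minute burst:
    print(burst(Ratelimit(20, 600), 50, 300))     # 20 - '20 in 10 minutes' trips
    print(burst(Ratelimit(120, 3600), 50, 300))   # 50 - '120 in an hour' never does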

The result is that picking a time period for your ratelimit is a tradeoff. On the one hand, the shorter the ratelimit period, the less allowance it gives for bursts. On the other hand, a short period also limits the amount of damage that an aggressor can do, because it cuts in sooner. If someone is sending messages as fast as possible, the '20 in 10 minutes' limit will allow them to send to 20 recipients and then cut them off, while the '120 in an hour' version will let them spam a lot more people before it stops them.

I think that this means that my first step for setting ratelimit numbers in the future should be to figure out how much we're willing to let a bad guy get away with. The larger that number is, the more flexibility I have with the time period, and generally longer is going to be better unless the events aren't very bursty.

(I'm sure that this is well known among people who deal with ratelimits regularly, and probably it was even mentioned in the documentation for Exim's. I'm slow sometimes and writing things down helps me make them stick.)

RatelimitPeriodsRealization written at 02:06:10

2012-09-03

People are not ignorant (usually)

One of the eternal complaints in the computer world is, roughly, 'people are ignoring this marvelous thing because they are ignorant' (and its flipside version of 'people are only using X/doing X because they don't know any better'). A closely related version of this (arguably the same one) is 'people would use X if only they really understood how good it is'. You can fill in the blank here with any number of technologies, often classical ones; these days, you can add various practices that people are not doing (or doing) to the list as well.

If you say this, you are probably dead wrong, at least at a global level. There are two reasons for this. The first is that this is a form of ignoring the real problem; creating a technology is only the start of the work, not the end of it. The second is that to the extent that people actually are ignorant of your marvelous thing, there are almost always good reasons for this. To put it one way, people are generally ignorant about your thing because it is not important enough for them to be informed about, and in turn this is generally because your thing does not actually offer a significant enough advantage to matter to them (not once you add up both the advantages and the disadvantages involved).

(What a lot of this comes down to is good enough versus better. 'Is X better than Y?' is the wrong question to ask; the right question is whether X is enough better than Y to justify switching if you are already using Y and there is a lot of support for Y and so on. And even when the answer is yes, there is a lot of momentum behind any existing decision.)

To put it another way, saying 'people are doing this because they are ignorant' is a comfortable slam that spares you the bother of asking uncomfortable questions about why. Both why people are ignorant and then once you get past that, why non-ignorant people might still make a different decision than you have. Any time you're tempted to explain something this way, you should be very certain that the people in question really are acting out of true lack of knowledge and that they would make a different decision if they knew more. Otherwise you do not really understand the situation, which is a great way to go badly wrong.

(It is also a great way to insult people, which has all sorts of effects that you are unlikely to want, at least if you genuinely want people to adopt your marvelous thing. Sometimes I wonder about various groups that are prone to this behavior.)

PeopleAndIgnorance written at 23:19:59

