Wandering Thoughts archives

2013-06-23

'Human error' is not a root cause of problems

Whenever something bad happens, like people changing files that are controlled by an automated system and then having their modifications overwritten, it is tempting to blame the person, to say 'the root cause of this incident is human error'. This is both wrong and a mistake. What we call 'human error' is basically always really a failure of process or the surrounding environment, in at least three different ways.

First (and famously), the person who committed the error may have been working in an environment and with an interface that magnified the chances of errors, in some cases making it almost certain that someone would make a mistake sooner or later. Confusing interfaces, incomplete information, overwhelming flows of information: there are lots of ways to fail here. Bad interfaces and environments don't make errors certain (if they do, they get fixed), but they make them more likely. Unless you're lucky this will not be an obvious thing, because people very rarely build interfaces that are obviously bad; instead, they tend to build interfaces that look superficially okay but have hidden flaws.

(Often the people who build the interfaces are not well placed to see how they lead people astray. You need a skeptical outside eye.)

But even with good interfaces, people sometimes make errors because they are sleepy or under pressure of some emergency or any number of other reasons. This too is a failure of process, specifically a failure to understand that people inevitably make mistakes and to improve your overall environment to deal with this. A resilient environment needs to work even in the face of occasionally sleepy or forgetful or over-stressed people, because all of these are going to keep happening.

(You can try to do something about some of these causes with high-level process changes. For example you could decide to deal with the sleepy people problem by saying 'no more midnight downtimes for work, we'll do them during the day when people are fully awake even if it's a bit more disruptive'.)

Finally, maybe you can say that you tried all of this and someone just can't keep from making mistakes anyways. Perhaps you have a sysadmin who just keeps editing files directly despite lots and lots of attempts to educate them otherwise. In this case you still have a failure of process; to put it bluntly, how did you manage to miss the problem when you hired this person and how come they are still working for you? Hiring and retaining bad or incompetent people is itself a failure that you need to address.

Understanding that human error is not the root cause is important because your goal should be to stop problems from happening again, and to do that you must understand why people commit those errors. Very few people deliberately do things that they know are wrong. Either they do things that they think are right, in which case you need to figure out why they thought they were doing the right thing, or they make a mistake in execution, in which case you should figure out both how that mistake was possible and why it was so damaging.

(Note that 'ignorance' is not really a good explanation for why someone thought they were doing the right thing, and even if it's correct, it leads to the process failure questions of why this ignorance wasn't detected beforehand and why it wasn't fixed before it caused an incident.)

(None of this is original to me and if I had planned this entry ahead of time I would have all sorts of links for you. Much of this information ultimately comes from general system safety research and especially aviation safety research and has reached me through sources like John Allspaw and the general Twitter sphere I follow. See eg here and this excellent presentation, which are the best recent links I could find in my Twitter stream right now.)

Sidebar: on what is and isn't a mistake

Note that there are a whole bunch of situations where people are not making mistakes in an ordinary sense, in that they are doing things that get them the results that they want but in a way that you don't like. These are 'mistakes' from your perspective but not from theirs, and in this situation it is even more important to understand why these people are taking the actions that they are. Preemptively declaring these cases a 'mistake' that you then define as being made due to 'human error' is two mistakes in one decision and is basically guaranteed to not solve your problems and to give you increasingly toxic relations with those people to boot.

HumanErrorNotRootCause written at 01:36:30

2013-06-13

I don't usually think about IO transfer times

All of this started when I was thinking about the differences between iSCSI disk IO and normal disk IO to regular disks. In light of our experiences I wound up thinking 'well, one of them is that iSCSI disk IO takes appreciable time to actually transfer between the 'disk' that is the iSCSI target and the host that uses it'. After all, a 128 Kbyte iSCSI IO will take around a millisecond just to get between the target and the initiator (in the best case). Then some useful part of my brain poked me, asking what the SATA data transmission rate is. The answer is of course that it's usually 3 Gbits/sec, or only (roughly) three times faster than gigabit Ethernet. Suddenly the difference between the two looks much smaller than I was thinking.

(It's even less of a difference if your system winds up falling back to 1.5 Gbits/sec basic SATA. Per Wikipedia, signaling overhead makes it only somewhat faster than gigabit Ethernet.)

What's really going on is that an inaccurate mental model of hard drive disk IO had settled into my head. In this model transmission speeds were so much higher than platter read and write IO speeds that they could be ignored. I was thinking that the time consuming bits of disk IO were seeking and then a bit of reading the data, but sending the data over the wire was trivial (at least for old-style spinning rust, since I knew that SSDs were fast enough that it mattered). This model may have been accurate at some point but it is dangerously close to being incorrect today; modern 3 Gb SATA is only around three times as fast as the platter read IO speeds that I usually deal with. Transmitting data between the drive and the computer can now take an appreciable amount of time.
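The back-of-the-envelope arithmetic here is simple enough to sketch. The speeds below are assumed ballpark figures (including SATA's 8b/10b line coding overhead), not measurements from our hardware:

```python
# Rough transfer times for a single 128 KB IO over various paths.
# All rates here are assumed ballpark figures, not measurements.

size_bits = 128 * 1024 * 8            # 128 KB in bits

# SATA uses 8b/10b line coding, so 3 Gbit/s of signaling carries
# roughly 2.4 Gbit/s of actual data.
rates = {
    "gigabit Ethernet": 1e9,
    "3 Gb SATA (after 8b/10b)": 3e9 * 0.8,
    "platter read at ~100 MB/s": 100e6 * 8,
}

for name, bits_per_sec in rates.items():
    ms = size_bits / bits_per_sec * 1000
    print(f"{name}: {ms:.2f} ms")
```

This reproduces the numbers above: about a millisecond over gigabit Ethernet versus roughly 0.44 ms over 3 Gb SATA, while the platter read itself takes around 1.3 ms. The wire time is clearly no longer trivial next to the platter time.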

This has a knock-on effect on the impact of 'over-reading', such as always reading a 128 KB block when the user asked for less. My traditional view has been that this was basically free (for local disks), because disk drives usually read an entire track at a time and transmitting the extra data cost effectively nothing because transmission was so fast in general. But since transmission speed does matter, this is not necessarily the case in real life; the extra transmission time alone may make a difference.

(Of course SATA has the edge over iSCSI, even in our environment, because systems generally have more SATA channels than they have iSCSI gigabit network links. A disk generally gets that 3 Gbits/sec to itself while with iSCSI all N disks are sharing one gigabit connection.)

Sidebar: this is actually worse in our environment

We're not just using SATA; we're using SATA port multipliers. My understanding of port multipliers is that they don't increase the link bandwidth, so we actually have 3 Gbits/sec being shared between four or five disks (depending on the specific chassis). This is enough disks that I'd expect simultaneous full-bore IO from all of the disks at once to run into the channel's bandwidth limit.
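The division is easy to work out, assuming (as an illustration) that the link is shared evenly and that each disk can do roughly 100 MB/s of sequential IO on its own:

```python
# Per-disk share of a 3 Gbit/s SATA port multiplier link, assuming
# an even split. 8b/10b coding means 10 line bits per data byte,
# so the link carries about 300 MB/s of data.
link_mb_per_sec = 3e9 / 10 / 1e6      # ~300 MB/s

for disks in (4, 5):
    share = link_mb_per_sec / disks
    print(f"{disks} disks: ~{share:.0f} MB/s each")
```

At 60 to 75 MB/s per disk, four or five drives doing full-speed sequential IO at once will clearly run into the channel's limit.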

(I guess it's time to go test that. I did some tests before but I should probably revisit them with the benefit of more thinking about things. And looking back it's striking that I didn't think to do the math on the SATA channel bandwidth limit at the time of those tests; I just handwaved things with the assumption that the channel would be more than fast enough.)

IOTransferTimeAssumption written at 01:24:11

2013-06-07

My current understanding of 'software defined networking'

I've been hearing about 'software defined networking' (hereafter SDN) for a while, but it's never been entirely clear just what it was and what people meant when they talked about it. Partly this is because by the time I started hearing about SDN it had already become encrusted in a thick layer of marketing and insider jargon due to being the hot new thing. I've recently been poking around some things and this is what I've gathered (after a misstep and a correction or two).

At least on a conceptual level a modern piece of network gear can be divided into two parts, the data plane and the control plane. The data plane moves packets around really fast (possibly mangling them slightly); it's usually a lot of custom ASICs and high bandwidth internal cross-connects and so on. The control plane tells the data plane where to put all of those packets and otherwise, well, controls the switch or router. At the top of the control plane is an embedded processor that handles things like SNMP, the overall device configuration, the web interface, the serial console (if there is one), and so on. On routers this processor will also handle things like OSPF and BGP routing (which means that on high end routers it's pretty powerful).

(See, for example, the OpenFlow description of this split.)

Software Defined Networking is the idea of splitting the data plane and the control plane apart and putting them on different pieces of hardware. The data plane is the core of the actual network hardware (the switch or router), the control plane runs on some general server, and the two talk to each other through (standard and open) protocols that let you replace both sides (or use one control plane server to talk to a whole bunch of different sorts and makes of network hardware). It is software-defined in the sense that the control plane software, running on your server, defines the network topology of what gets routed or forwarded where. The OpenFlow people describe this as a programmable network.

(In practice I imagine vendors sell (or will sell) dedicated 'appliance' control plane servers and software that you can just plug in and use, just as vendors today sell, eg, iSCSI appliances even though you can build your own.)

There are a number of advantages (or potential advantages) to SDN. In no particular order:

  • The hardware becomes more generic, with all that implies; you'll have more choice, you can mix and match it, you can replace it in pieces, and it will probably get cheaper because vendors will lose the ability to charge extra for what are basically control plane features.

  • Central management of the network configuration (what I initially mistook SDN for). If you want to make a change such as adding a VLAN or changing network topology you no longer need to go off and touch a bunch of switches and routers individually; instead you can make the change centrally and everything gets pushed out from there.

  • Your control plane software can do more sophisticated things (including interacting with things like your database of DHCP registrations) because it's now just software running on a server, not special firmware on a magic piece of embedded hardware.

  • You can dynamically manage how data moves through your networks based on a global view of the state of all your networks and links. In the process you have a central spot to track and troubleshoot things (and to see that global network picture).

That's pretty abstract so let me try a concrete example (based on my understanding). Suppose that you have a bunch of routed networks and a bunch of routers to route them. Today those routers probably work out network reachability (and routes) through a distributed protocol like OSPF, with each router advertising its links to its peers, building up a picture of its surroundings, and routing things accordingly. You may or may not get much insight into how your routing changes over time and what exactly happens if something goes wrong, depending on how (and if) routers log things. In the SDN future all of the routers will simply report link state information to your central control plane server; the control plane software will then build a central picture of what links are available (and possibly how much traffic is trying to go over them) and tell all of the routers how they should route things. In the process you can have the control plane software keep good track of what was going on when, what it saw, and why it made the decisions it did.
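As a toy illustration of the central computation (this invents its own trivial report format and is not any real SDN or OpenFlow API), the controller's job amounts to building a global graph from the routers' link-state reports and then working out a next-hop table for each router:

```python
import heapq

# Toy sketch of the centralized route computation described above:
# routers report their links, the controller builds a global graph
# and computes a next-hop table for every router. The report format
# and router names here are invented for illustration.

# Link-state reports: (router, neighbour, link cost)
reports = [
    ("r1", "r2", 1), ("r2", "r3", 1), ("r1", "r3", 5), ("r3", "r4", 1),
]

graph = {}
for a, b, cost in reports:
    graph.setdefault(a, []).append((b, cost))
    graph.setdefault(b, []).append((a, cost))

def next_hops(source):
    """Dijkstra from source; returns {destination: first hop}."""
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue
        for nbr, cost in graph[node]:
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(heap, (nd, nbr))
    # Walk the predecessor chain back to find the first hop.
    hops = {}
    for dest in graph:
        if dest != source and dest in prev:
            hop = dest
            while prev[hop] != source:
                hop = prev[hop]
            hops[dest] = hop
    return hops

# The controller would push each router's table out to it;
# here we just print them.
for router in sorted(graph):
    print(router, next_hops(router))
```

Note how r1 reaches r3 via r2 because the direct link is more expensive; with everything in one place, the controller can log exactly why it made that choice.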

The bit of SDN that's most interesting to me is the centralized management. I eagerly look forward to a future where you'll no more log into an individual switch to set up a new VLAN than you would log into an individual server to configure an application. Artisanal hand-crafted switch configurations are as annoying as the server equivalents.

(Managing network devices and automating their configuration has been terrible for a very long time. When expect and web scraping are close to the state of the art, you are not in a good place. I'm sure you can get expensive 'enterprise' single-vendor management solutions from the big vendors, but we've never had that kind of money.)

SDNWhatItIs written at 00:10:06

2013-06-02

Security is not the most important thing to most people

I'm a security aware sysadmin and yet yesterday I casually admitted that I made less-secure choices because the really secure option was too annoying and potentially inconvenient. In fact this is not the only case where I make this tradeoff, picking a less secure but more convenient option.

This shouldn't really surprise people. In real life security is almost never the most important thing to people, even to security aware people. Even aware, knowledgeable people prioritize other things over security; we disable SELinux, we use fixed-keyed IPSec tunnels, we almost never try to verify new SSH host keys through out of band methods, and so on.

(This is somewhat distinct from how users don't care about security; this is people who care about security but only so far.)

One of the many reasons for this is that most people are not operating in a high-threat environment. We aren't being specifically targeted by attackers and if we take basic broad precautions we'll probably never experience a security breach. This biases almost everyone towards a tradeoff that I can describe as 'availability over security' and it also makes painful security precautions have a very low return on investment; we're being asked to invest potentially a lot of work and aggravation in exchange for what is in practice a very small gain.

(The worst case is when being 'truly secure', whatever that means, means not doing something that we want to do. When I couldn't get IKE rekeying working on my IPSec tunnel, the really secure thing would have been to say 'well, that means no IPSec tunnel at all'. Very few people are going to make that sort of tradeoff.)

(Yes, I know, I bang on this particular drum a lot. That's because I still think that a lot of people in computing have very mistaken attitudes on what security really means and how it can be achieved, attitudes that result in mistake after mistake.)

SecurityNotImportant written at 23:42:46; Add Comment

Why I do IPSec improperly and reduce my security

I've had an IPSec tunnel between my home machine and work for a fairly long time. Now I have a confession: during all of this time, the tunnel has had fixed keys.

If you don't know IPSec you may not understand what this means, so let me explain. To simplify a bit, IPSec connections are protected by symmetric stream ciphers, which require keys; this is very similar to SSH, TLS, and any other network encryption protocol. But in those other protocols the stream cipher keys are normally arranged on the fly through the session setup protocol, and long-running sessions are often re-keyed periodically. Periodic re-keying (if done securely) both limits the amount of encrypted data that an attacker can mount brute-force attacks against and limits the damage of a key compromise; it's thus generally considered a good idea.

My IPSec tunnel doesn't have any of this. The stream cipher keys it uses are hard-coded into my scripts and it uses the same stream cipher keys for however long until I go through the manual effort to change them. As you might guess, this is far from ideal from a security perspective, even though my keys are completely random.

(I have a program that dumps N bits of randomness from /dev/urandom and I use that to generate keys whenever I decide to rekey.)
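Such a program amounts to only a few lines. A minimal sketch (an illustration, not my actual program) might look like this:

```python
# Minimal sketch of this sort of key generator: pull N bits of
# randomness out of /dev/urandom and print them as a hex string
# that can be pasted into a keying script. An illustration, not
# the actual program I use.

def urandom_key_hex(bits=256):
    """Return `bits` bits from /dev/urandom as a 0x-prefixed hex string."""
    with open("/dev/urandom", "rb") as f:
        return "0x" + f.read(bits // 8).hex()

# eg, generate a 256-bit key:
print(urandom_key_hex(256))
```

Generating the keys was never the hard part; the painful bits are distributing them to both ends and remembering to rotate them.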

In theory it's possible to do better than this, because IPSec has a whole set of systems called IKE to handle exactly the job of negotiating IPSec stream cipher session keys. The problem is that in practice all of this is fiendishly complex, generally badly documented, and doesn't always actually work (at least on Linux, other environments may have better experiences). Part of the problem is that IPSec IKE systems are generally designed and documented for complex configurations where you want things like X.509 certificates and so on; simple scenarios sometimes seem like an afterthought.

(IPSec itself falls into the usual crypto obsession of having too many options and thus quite a lot of ways of blowing your feet off, in addition to the fundamental complexity that you have to understand to use it at all.)

I actually once got an IKE setup 'working' between two systems in that it could set up the initial keys but then things blew up whenever it re-keyed the ongoing connection (I believe that existing TCP connections died, but it was a long time ago so I've forgotten the details). Of course trying to figure out where the bug might be was hopeless so I just gave up (and went to fixed keys, among other things).

(There is a general lesson for security here but you can probably already guess what it is and I think I've probably mentioned it before.)

(I was reminded of all of this pain by a recent package upgrade on my Linux machines that changed things from one IKE system to another. I saw that and briefly considered trying to set up a proper re-keying IKE configuration before I began laughing bitterly at the very idea of wading back into the IKE swamps.)

IPSecConstantKeysWhy written at 01:26:01

