Wandering Thoughts archives

2008-02-23

Where the risk is with virtualization (and iSCSI)

Both virtualization and iSCSI are fundamentally 'sharing' technologies, where multiple things use a shared infrastructure. This means that they have a shared set of general security risks, pretty much common to any sharing technologies: cross-contamination and base system compromise.

Cross-contamination is the ability to interfere with other things using the shared environment and to compromise them in turn; having a shared infrastructure necessarily gives you a channel to attack other things that are using that infrastructure. This isn't novel, but virtualization and iSCSI make it worse in that much of the attack surface they expose is trusted by the underlying operating environment.

(In other words, your SCSI driver is probably less hardened than your network driver.)

Base system compromise is breaking into the base system; for example, escaping virtualization and compromising the host operating system. This is worse than compromising other guest operating systems because the host OS probably has more access than any of the guests do; it may be on special management networks and so on. The same is true of breaking into an actual iSCSI target (and this is almost certainly possible, especially as it's very unlikely that iSCSI targets have been seriously hardened).

(This risk is not new with iSCSI, but it's certainly more accessible now that your SAN infrastructure will be using Ethernet and TCP instead of FibreChannel. And I am pretty sure that iSCSI plus TCP is significantly more complex than SCSI over FC, which increases the attack surface and thus the chance of a problem.)

SharingRisk written at 23:40:55

2008-02-14

Why does anyone buy iSCSI TOE network cards?

We're (still) looking for a new SAN, so I have recently been doing some work with a homebrew iSCSI testing environment built from ordinary servers. The experience leaves me with a simple question: why does anyone buy special iSCSI accelerator network cards?

(These are sometimes called 'iSCSI HBA cards'.)

I ask because my ordinary test environment, with commodity server NICs and no tuning besides 9000-byte jumbo frames, can basically saturate gigabit Ethernet doing bulk reads over iSCSI. Through the filesystem, no less. Neither end is hurting for CPU; the initiator machine shows 75% idle and the target machine shows 40% idle.

(I don't care about the target machine's CPU all that much, since it has no other role in life besides being an iSCSI target; what really matters is initiator CPU usage.)
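For scale, here is a back-of-the-envelope sketch (mine, not a measurement from the test environment) of the TCP payload ceiling for gigabit Ethernet with 9000-byte jumbo frames; it assumes plain IPv4 and TCP with no options and ignores iSCSI PDU headers, so the real ceiling is slightly lower.

    # Back-of-the-envelope ceiling for bulk data over gigabit Ethernet with
    # 9000-byte jumbo frames.  Assumes plain IPv4 + TCP with no options and
    # ignores iSCSI PDU headers (amortized over large data-in PDUs), so the
    # real payload ceiling is slightly lower than this.

    GIG_E_BYTES_PER_SEC = 125_000_000   # 1 Gbit/s on the wire
    MTU = 9000

    # Per-frame overhead that never shows up inside the MTU:
    # preamble+SFD (8) + Ethernet header (14) + FCS (4) + inter-frame gap (12)
    WIRE_OVERHEAD = 8 + 14 + 4 + 12      # 38 bytes

    IP_HEADER = 20
    TCP_HEADER = 20                      # no TCP options assumed

    payload_per_frame = MTU - IP_HEADER - TCP_HEADER   # 8960 bytes
    wire_bytes_per_frame = MTU + WIRE_OVERHEAD         # 9038 bytes

    efficiency = payload_per_frame / wire_bytes_per_frame
    ceiling = GIG_E_BYTES_PER_SEC * efficiency

    print(f"payload efficiency: {efficiency:.1%}")           # ~99.1%
    print(f"max payload rate:   {ceiling / 1e6:.1f} MB/s")   # ~123.9 MB/s

So with jumbo frames the protocol overhead is under one percent, and 'saturating gigabit Ethernet' means getting somewhere near 120 MB/s of actual data.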

To me this smells an awful lot like the hardware RAID situation, where you have to be both CPU constrained and doing a large volume of IO in order for an iSCSI HBA to make any sense. My suspicion is that this is not very common, and thus that a lot of people are being scared into buying iSCSI accelerators that they don't need.
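To put a rough number on that, here is a purely illustrative calculation based on the idle figure above; the ~110 MB/s throughput value and the assumption that CPU cost scales linearly with throughput are mine, not measurements.

    # Rough sketch of when an iSCSI HBA could start to matter, using the
    # initiator idle figure above.  The 110 MB/s number and the linear
    # scaling assumption are illustrative, not measured.

    observed_mb_per_s = 110          # assumed: roughly gigabit wire speed
    cpu_fraction_used = 1.0 - 0.75   # 75% idle -> about 25% of the CPU busy

    # If the iSCSI + TCP CPU cost scales linearly with throughput, the
    # initiator CPU only becomes the bottleneck at roughly this data rate:
    cpu_limited_mb_per_s = observed_mb_per_s / cpu_fraction_used

    print(f"CPU-limited ceiling: ~{cpu_limited_mb_per_s:.0f} MB/s")  # ~440 MB/s

On this crude model you would need several gigabit links' worth of iSCSI traffic, plus something else competing for the CPU, before offload starts to look attractive.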

(The one possible exception I can think of is if you are using iSCSI to supply disk space to virtualized servers. Still, I can't help but think that such a machine is maybe a bit overcommitted.)

Sidebar: my test environment

If for no other use than my future reference:

  • the initiator machine is a SunFire X2100 M2 running Solaris 10 U4 with ZFS and the native Solaris iSCSI initiator.
  • the target machine is an HP DL320s running Red Hat Enterprise Linux 5 (in 64-bit mode) with the IET (iSCSI Enterprise Target) driver. The iSCSI-exported storage is LVM over software RAID 6 across eleven 80GB SATA disks.

(The HP's native hardware RAID 6 turned out to have abysmal performance, so I fell back to Linux software RAID, which performs fine.)

The Solaris machine is using an Intel gigabit NIC (via a Sun expansion card); the HP machine is using one of the onboard Broadcom BCM5714 ports. Linux enabled RX and TX checksumming and scatter-gather support on the Broadcom, and I don't know what Solaris did. The machines are connected through a Netgear switch.

(There is a rant about switches that claim to support jumbo frames but don't that could go here.)

ISCSIOffloadPuzzle written at 23:43:51

2008-02-11

Why commercial support needs to solve your customers' problems

If you're providing commercial support, I think it is very important that you actually solve your customers' problems or, if you cannot, at least realize this very quickly and tell the customers so.

(This is not so much because customers who get nothing out of their support contracts will stop paying for them; indeed, quite possibly they won't.)

If you do not actually solve their problems, your customers have become your unpaid debuggers; in fact, they're paying you for the privilege. This is because going through the work of filing bug and problem reports only benefits your customers if you then solve the problem. If you do not, all of the effort the customer has put into the process only benefits you (in the long run it helps you improve your product).

(This is the corollary of who benefits from bug reports.)

This matters a lot because humans fiercely resent being taken advantage of (a reaction that seems relatively hardwired). Your customers may dislike spending money and receiving little value, but they will hate you for taking advantage of them, whether or not you intended it and whether or not they consciously realize it.

(This is why it is important to fail fast for problems that you can't fix, before you have the customer put in a lot of work doing diagnostics and the like.)

CommercialSupportNote written at 23:28:44

2008-02-09

How your fileservers can wind up spreading over your SAN

Consider a fileserver and SAN environment where you have a number of frontend NFS fileservers and a number of backend SAN disk units, with at least as many backend disk units as fileservers. The obvious sensible way to split the disk units among the fileservers is to have each SAN unit used by only one fileserver, because that makes various things much easier to manage.

(You might even be tempted to design an infrastructure around the assumption that disk units won't be shared.)

The problem with this arrangement is that you can wind up with a fileserver that has enough activity to overload its SAN disk units, because you may not know in advance which filesystems will be significantly active. To fix the issue, you need to move some active filesystems to other disk units, rebalancing the IO load among them.

If you want to avoid user-visible disturbances, you generally can't do this by moving filesystems to another fileserver, but a well designed storage setup will let you do it by migrating the filesystem's storage to another disk unit on its current fileserver. In this case that means getting some free disk space from a less loaded SAN disk unit that normally belongs to another fileserver, since we're assuming that all of the fileserver's proper disk units are overloaded already.

Presto: you have SAN disk units that are shared between multiple fileservers, and your fileservers can potentially wind up spreading across many (or all) of your SAN disk units.
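As a concrete illustration, here is a minimal, entirely hypothetical sketch of the kind of placement decision involved; the data structures, names, thresholds, and numbers are made up, but they show how the least-loaded disk unit with enough free space can easily belong to another fileserver.

    # Hypothetical sketch: pick a SAN disk unit to receive a migrated
    # filesystem.  All names, thresholds, and numbers are made up.

    from dataclasses import dataclass

    @dataclass
    class DiskUnit:
        name: str
        owner: str       # the fileserver that 'normally' uses this unit
        free_gb: int
        io_load: float   # some 0..1 measure of how busy the unit is

    def pick_unit(units, fileserver, needed_gb, overloaded=0.75):
        """Prefer one of the fileserver's own units that isn't overloaded
        and has enough free space; otherwise borrow from anyone's."""
        usable = [u for u in units
                  if u.free_gb >= needed_gb and u.io_load < overloaded]
        own = [u for u in usable if u.owner == fileserver]
        pool = own or usable
        return min(pool, key=lambda u: u.io_load) if pool else None

    units = [
        DiskUnit("san1", owner="fs1", free_gb=300, io_load=0.90),  # fs1's units
        DiskUnit("san2", owner="fs1", free_gb=250, io_load=0.85),  # are all busy
        DiskUnit("san3", owner="fs2", free_gb=400, io_load=0.20),
    ]

    chosen = pick_unit(units, "fs1", needed_gb=200)
    print(chosen.name)   # 'san3': fs1 now spreads onto one of fs2's units

Run that decision a few times for a few busy filesystems and 'one disk unit per fileserver' quietly stops being true.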

SANSpread written at 01:03:31

