Wandering Thoughts archives

2015-01-18

Limited retention policies for email are user-hostile

I periodically see security people argue for policies and technology to limit the retention of email and other files, ie to enact policies like 'all email older than X days is automatically deleted for you'. Usually the reason given is that this limits the damage done in a compromise (for example), since attackers can't copy things that have already been deleted. The problem is that limited retention periods are clearly user-hostile.

The theory of limited retention policies is that people will manually save the small amount of email that they really need past the retention period. The reality is that many people can't pick out in advance all of the email that will be needed later or that will turn out to be important. This is a lesson I've learned over and over myself; many times I've fished email out of my brute force archive that I'd otherwise have deleted because I had no idea I'd want it later. The inevitable result is that either people don't save email and then wind up wanting it, or they over-save (up to and including 'everything') just in case.

Beyond that, such policies clearly force make-work on people. Unless you adopt an 'archive everything' approach that you can automate, you're going to spend some amount of your time sorting out which email you need to save and then saving it off somewhere before it expires. This is time spent not doing your actual job and taking care of your actual work. It would clearly be much less work to keep everything around and not have to worry that some of your email will be vanishing out from underneath you.

The result is that a limited retention policy is a classic 'bad' security policy in most environments. It's a policy that wouldn't exist without security (or legal) concerns, it makes people's lives harder, and it actively invites people to get around it (in fact you're supposed to get around it to some extent, just not too much).

(I can think of less user hostile ways to deal with the underlying problem, but what you should do depends on what you think the problem is.)

LimitedRetentionUserHostile written at 03:15:14

2015-01-07

Forwarding access to only a subset of ssh-agent's identities

Suppose, not entirely hypothetically, that you have three sorts of machines: a laptop or desktop, some general access login servers that you do most of your sysadmin work from, and some fileservers that the login servers mount all sorts of things from. It's convenient to have passwordless ssh access from the login servers where I work to the fileservers so that I can do things like conveniently run a command on several fileservers at once, but at the same time there are some security and convenience issues.

If I use an unencrypted key for the login server to fileserver ssh session, well, it's passwordless access but the key is sitting there on disk for anyone to use. Compromise my account on the login server and you get bonus fileserver access, which is not ideal. If I use an encrypted key on the login server, I have to start and manage a ssh-agent session on the login server every time I log in to it (including ideally purging the key if I lock the screen on my desktop to walk away for a while), and an attacker on the login server has a chance to capture the unencrypted version of the key. What I really want is for my desktop to securely hold the actual key (and include it in all of my existing desktop key management stuff) but let the login server use it to authenticate me to the fileserver.

Of course SSH sort of has a way to do this in the form of ssh-agent, since the entire purpose of ssh-agent is to hold your keys and do authentication with them so that other things can't get at them. The problem is that allowing the login server full access to my desktop's ssh-agent is too much power, as that lets it use all of my keys instead of just the fileserver access key. What would be ideal is if ssh itself could limit what keys are accessible when you have it do agent forwarding; unfortunately, as far as I know no version of ssh has such a feature. Right now the only native way you have to limit what keys are available through ssh agent forwarding is to limit what keys the ssh-agent being forwarded has access to, which can get inconvenient fast (as it multiplies the number of ssh-agents that you must start, manage, purge keys from, and so on).
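
For concreteness, the native approach looks something like this (a sketch; the key file name and hostname here are made up):

	# on the desktop: start a separate agent holding only the fileserver key
	eval $(ssh-agent)
	ssh-add ~/.ssh/fileserver_key
	# ssh from this shell now forwards only the restricted agent
	ssh -A login.example.org

Every restricted agent like this is one more agent that you have to start, manage, and purge keys from, which is exactly the inconvenience.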

If ssh won't do the filtering itself, the next best solution is a filtering ssh-agent proxy. Fortunately just such a thing exists in the form of Timo Weingärtner's ssh-agent-filter. While this works, it turns out there is an important limitation in using a filtering proxy that can complicate your life. The problem is that ssh with agent forwarding is actually using your ssh-agent for two different things: first for authenticating to the login server, and then for the agent forwarding to the login server that the login server will use for further authentication attempts to elsewhere (eg to a fileserver). If you give ssh a filtered view with only the fileserver key, the fileserver key has to work for both purposes. If you want the fileserver key to just be for fileserver access, you need a second key that is only for login server access. And remember that the login server will have access to this key through the agent forwarding.
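
Concretely, using it looks something like the following. This is a sketch based on my reading of the ssh-agent-filter documentation; the key comments and hostname are made up, and I'm assuming that ssh-agent-filter prints ssh-agent style shell commands (check the manpage before relying on any of this):

	# expose only these two keys from the real agent, selected by comment
	eval $(ssh-agent-filter -c fileserver-key -c login-key)
	ssh -A login.example.org

	# or use the bundled afssh wrapper, which starts a filtered agent,
	# runs ssh against it with agent forwarding, and then cleans up
	afssh -c fileserver-key -c login-key -- login.example.org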

(ssh-agent-filter has a mode where you can be asked to approve each use of a key, so you could at least set the login server key to this mode and then generally refuse after the first use. Hacks are possible here.)
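
As an illustration of that hack (again with made-up key comments, and assuming I have the confirmation option right; verify it against the manpage):

	# pass the fileserver key through freely, but ask for per-use
	# confirmation whenever something wants the login server key
	afssh -c fileserver-key -C login-key -- login.example.org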

PS: I've only just discovered ssh-agent-filter and experimented with it a little bit, so there may be other surprises lurking in the underbrush. I did get a test setup working, and I accumulated some notes in the process that I'm going to scribble down in another entry once I have more experience with it.

PPS: In our specific configuration allowing the fileserver access key to also log me in to our login servers is probably okay because in practice trust basically flows outwards from the fileservers since, well, they have all the filesystems that the login servers NFS mount. This is definitely not going to be the case in all situations.

SshAgentFiltering written at 00:50:42

2015-01-03

The effects of our fileserver multi-tenancy

I wrote yesterday about the ways our fileserver environment has multi-tenancy. In the aftermath of that, one entirely reasonable question to ask is whether the multi-tenancy actually matters, ie whether we notice effects from it. Unfortunately the answer is unquestionably yes. While we have much more experience with our old fileserver environment and some reason to hope that not all of it transfers to our new fileservers, essentially all levels of the multi-tenancy have caused us heartburn in the past.

The obvious direct way that multi-tenancy has caused problems is through one 'tenant' (here a ZFS pool and the IO to it) damaging the performance of another pool, or of all pools on the same fileserver. We have had cases where problems in one pool essentially took down the fileserver; in some cases the cause was merely lots of IO, especially write IO. In less severe cases people just get worse performance without things actually exploding, and sometimes it doesn't affect everyone on the fileserver, just some of them.

(We've also seen plenty of cases where IO to a pool slows the pool down for everyone using it, even people doing unrelated IO. Since our pools generally aggregate a fair number of people's home directories together, this can easily happen, especially with bigger pools.)

The less obvious way that multi-tenancy has caused us problems is by complicating our troubleshooting. Multi-tenancy makes it so that the activity causing the problem might be only vaguely correlated to the problems that people are reporting; group A reporting slow IO from system X may actually be caused by group B banging away on a separate ZFS pool from system Y. We have gotten very used to starting our troubleshooting by looking at overall system stats, drilling down to any hotspots, and then just assuming that these are causing all of our problems. Usually this works out, but sometimes it's caused us to send urgent 'please stop that' email to people about activity that turns out in the end to be totally unrelated and okay.

(The other issue with multi-tenancy is that many disk failure modes appear as really slow IO, and through multi-tenancy a single failing disk can have ripple effects to an entire fileserver.)

All of this makes multi-tenancy sound like a really bad idea, which brings me around to the final important effect of multi-tenancy. Namely, multi-tenancy saves us a lot of money. To be blunt, this is the largest reason people do multi-tenancy at all, including on things like public clouds. It's cheaper to share resources and put up with the occasional problems that result instead of getting separate dedicated hardware (and other resources) for everything. The latter might be more predictable, but it's certainly a lot more expensive. For us, it simply wouldn't be feasible to give every current ZFS pool owner their own dedicated fileserver hardware, not unless we had a substantially larger hardware budget.

(Let's assume that if we got rid of multi-tenancy we'd also get rid of iSCSI and host the disks on the ZFS fileservers, because that's really the only approach that makes any sort of cost sense. That's still a $2K or so server per ZFS pool, plus some number of disks.)
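
As a back-of-the-envelope illustration with made-up numbers (the pool count and disk prices are hypothetical; only the ~$2K server figure is from above):

	# say 30 pools, each needing a ~$2000 server plus 6 disks at ~$150
	$ echo $((30 * (2000 + 6 * 150)))
	87000

So call it on the order of $87K in hardware for dedicated everything, versus a handful of shared fileservers and backends.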

OurMultiTenancyEffects written at 03:21:46

2015-01-02

Where we have multi-tenancy in our fileserver environment

One of the things that people worry about and often consider a bad idea when designing systems is multi-tenancy, where one resource serves multiple people (or, well, multiple upstream things). The problems inherent in multi-tenancy are well known to people who use public clouds, namely that other people's activity (often invisible to you) can adversely impact you. We have a number of levels of multi-tenancy in our fileserver environment, and today I feel like enumerating them.

(All of this is theoretically something that you can deduce from my writeups, but writing it down explicitly doesn't hurt.)

At the moment, we have multi-tenancy on three different levels: fileservers, iSCSI backends, and individual disks. On fileservers we have multiple pools, each of which generally serves a different set of users. Since each fileserver only uses two backends to support all of its pools, this implies that we have multi-tenancy on the iSCSI backends as well; each backend hosts disks for multiple pools and thus multiple user groups. We also have multi-tenancy on individual backend disks because we slice each physical disk up into fixed-size chunks to make them more manageable and then parcel out the chunks to different pools.

(When a fileserver doesn't have much of its disk space allocated, the disks themselves may only have one chunk used and so not be multi-tenanted yet. As more disk space gets allocated, we can and do run out of disks and start having to reuse them. On our current generation, the trigger point for disk multi-tenancy is more than about 5.7 TB of space getting allocated to pools.)
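
To illustrate how a trigger point like that arises, here is some made-up arithmetic; the disk count and chunk size are illustrative numbers chosen to land near our figure, not our actual configuration:

	12 disks x ~480 GB for each disk's first chunk = ~5.76 TB
	(one chunk per disk; the next chunk allocated after that point
	forces some disk to host chunks from two different pools)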

We don't share backends or backend disks between fileservers any more, so backends are not multi-tenanted to that degree. This at least makes it somewhat easier to work out what's causing a chokepoint; you only have to look at one fileserver's activity instead of more than one.

Each of these multi-tenancy points creates obvious chokepoints. The most significant one is individual disks, since they have strict and often very low performance limits in the face of any significant volume of seeks or writes (including resilvers). I think that our current generation backends don't have an internal limit on aggregate disk bandwidth, but with only two (currently 1G) network interfaces they can easily hit total iSCSI bandwidth limits (200 Mbytes/sec is nothing these days if you've got a lot of disks going at once with sequential activity). The fileservers are limited on both NFS and iSCSI network bandwidth (1G and 2x1G respectively), but less obviously they're also limited on RAM for caching (which is effectively shared between all pools) and on NFS and iSCSI processing in general (a single fileserver can only do so many NFS and iSCSI things at once).
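
For the record, the rough arithmetic behind that 200 Mbytes/sec figure:

	1 Gbit/s is ~125 Mbytes/s of raw bandwidth, so
	2 x 1G interfaces is ~250 Mbytes/s at wire speed,
	or roughly 200 Mbytes/sec of useful iSCSI payload
	after Ethernet, TCP, and iSCSI protocol overhead.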

(I'm ignoring the multi-tenancy created by having more than one person or project in a single ZFS pool as more or less out of scope. Most ZFS deployments will probably have some degree of multi-tenancy at that level for various reasons.)

OurFileserverMultiTenancy written at 03:08:04

