The effects of our fileserver multi-tenancy

January 3, 2015

I wrote yesterday about the ways our fileserver environment has multi-tenancy. In the aftermath of that, one entirely reasonable question to ask is whether the multi-tenancy actually matters, i.e. whether we notice effects from it. Unfortunately the answer is unquestionably yes. While we have much more experience in our old fileserver environment and some reason to hope that not all of it transfers to our new fileservers, essentially all levels of the multi-tenancy have caused us heartburn in the past.

The obvious direct way that multi-tenancy has caused problems is through one 'tenant' (here a ZFS pool and IO to it) contaminating the performance of another pool, or all pools on the same fileserver. We have had cases where problems in one pool essentially took down the fileserver; in some cases the cause was nothing more exotic than lots of IO, especially write IO. In less severe cases people just get worse performance without things actually exploding, and sometimes it doesn't affect everyone on the fileserver, just some of them.

(We've also seen plenty of cases where IO to a pool slows the pool down for everyone using it, even people doing unrelated IO. Since our pools generally aggregate a fair number of people's home directories together, this can easily happen, especially with bigger pools.)

The less obvious way that multi-tenancy has caused us problems is by complicating our troubleshooting. Multi-tenancy makes it so that the activity causing the problem might be only vaguely correlated to the problems that people are reporting; group A reporting slow IO from system X may actually be caused by group B banging away on a separate ZFS pool from system Y. We have gotten very used to starting our troubleshooting by looking at overall system stats, drilling down to any hotspots, and then just assuming that these are causing all of our problems. Usually this works out, but sometimes it's caused us to send urgent email to people about 'please stop that' for activity that turns out in the end to be totally unrelated and okay.
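(The first-pass triage described above can be sketched concretely. This is a hypothetical sequence for an illumos-style ZFS fileserver; the exact commands and flags vary by platform, so treat it as an illustration rather than our actual procedure.)

```shell
# Hypothetical first-pass triage on a ZFS fileserver (illumos-style);
# command availability and flags vary by OS, so this is only a sketch.

# Overall per-pool and per-vdev IO, sampled every 5 seconds:
# which pool is hot right now?
zpool iostat -v 5

# Extended per-device statistics with service times: is one disk or
# iSCSI LUN responding slowly (often a sign of a failing disk)?
iostat -xn 5

# NFS server operation counts, for a rough picture of what clients
# are actually asking the fileserver to do.
nfsstat -s
```

The trap the entry describes lies in the next step: sampling like this only shows you what is hot, and jumping from 'this pool is busy' to 'this pool is the cause of the reported problem' is exactly the correlation mistake that multi-tenancy makes so easy.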

(The other issue with multi-tenancy is that many disk failure modes appear as really slow IO, and through multi-tenancy a single failing disk can have ripple effects to an entire fileserver.)

All of this makes multi-tenancy sound like a really bad idea, which brings me around to the final important effect of multi-tenancy. Namely, multi-tenancy saves us a lot of money. To be blunt this is the largest reason people do multi-tenancy at all, including on things like public clouds. It's cheaper to share resources and put up with the occasional problems that result instead of getting separate dedicated hardware (and other resources) for everything. The latter might be more predictable but it's certainly a lot more expensive. For us, it simply wouldn't be feasible to give every current ZFS pool owner their own dedicated fileserver hardware, not unless we had a substantially larger hardware budget.

(Let's assume that if we got rid of multi-tenancy we'd also get rid of iSCSI and host the disks on the ZFS fileservers, because that's really the only approach that makes any sort of cost sense. That's still a $2K or so server per ZFS pool, plus some number of disks.)
