What we need in our fileservers (in the abstract)

July 13, 2013

A few weeks ago, a commenter on one of my fileserver entries asked if we'd considered using Ceph instead of our ZFS plus iSCSI setup. My initial reaction was strongly and more or less reflexively negative, but for various reasons I've been thinking about the general issues involved off and on since then (partly because the timing for it is singularly good, since we have to migrate our data anyways). One of the first steps of any sort of semi-objective evaluation of options is to come up with a list of what general, abstracted features we need in our fileservers.

This is my list so far:

  • Full Unix permissions on the machines that people use; our users would (rightfully) lynch us if we took away all of the things that moving to a real distributed filesystem costs you. I think that this requires NFS to the 'fileservers', whatever those are.

  • Traditional Unix filesystem semantics (more or less) and performance, again on the machines that people use. People are going to run all sorts of general Unix programs against this file service; they will not be happy if some of them suddenly don't work or perform really badly because (for example) database-style random write IO performs poorly.

    We can't make any particular predictions about how people will use our fileservice. Some people will make light use of it. Some people will append to files a lot over and over again. Some people will do intensive stream IO. Some people will run databases on it, doing lots of random read and/or write IO. Some people will unpack and shuffle around huge trees of files (updating git repositories, unpacking tar files, or whatever). We can't tell them 'don't do that, our fileservers aren't built for it'.

  • No single point of (server) hardware failure that can take down storage. I think a good way to put this is that we should be able to paper over any server dying even if we're working remotely and can't touch the physical hardware (beyond forcing a hard power off). Our current environment has this property; we can fail over (virtual) fileservers if the physical server dies, and if an iSCSI backend dies we can re-mirror storage to our hot spare backend.

    This implies that storage can't be tightly tied to 'fileservers' because a dead server would then require physical work to shift disks, connectors, or what have you to another server.

  • Adding more storage and file service can't cost too much; we have to always be able to buy them in relatively inexpensive units even if we're full up at the moment. This implies that the individual components can't be too expensive.

  • It has to be possible to update and replace hardware without the users noticing (possibly apart from minor downtime for some changes). Migrating storage from one set of disks on one set of machines to another set of disks on another set of machines should not involve user-visible downtime. Nor should expanding the space available to any particular set of users.

    (The more changes and so on that can be done without users noticing, the better.)

  • Two levels of space limitations and space reservations, however that's accomplished. We need one level because for better or worse our model of providing storage is to sell space to groups and professors, which means that we need to be able to sell them X amount of guaranteed space that they can use. We need a second level within that in order to limit the size of 'backup entities' and to reserve space for specific people within a group.

    (Our current implementation uses ZFS pools and ZFS filesystems (with quotas and reservations) to provide the two levels.)

  • We need to be able to allocate top-level space to people in units smaller than 'one (replicated) physical disk'. Among other reasons, physical disk sizes change over time.

    (Today this is done by slicing physical disks into fixed-size logical chunks and exposing those chunks by iSCSI.)

  • In short, flexible space management; the 'filesystems' within the 'pools' should not need to have space preallocated to them. That free space is shared between and flows between filesystems in pools has been a major win in our current environment because it means that people simply buy generic space and don't have to carefully plan out where it goes and who gets it and then re-balance it as needs and space usage change. We would be very reluctant to give this up.

  • Data integrity checksums and resilient handling of disk errors. To summarize the issue, they need to be as good as ZFS's.

  • Space allocation and IO patterns that we can understand and analyze. We're not interested in shoveling a bunch of disks on storage servers into a great big cloud and having theoretically generic fileservice come out the other side; we need to be able to understand, control, and monitor what 'pools' are putting their data where.

    (And not all of our disks will be uniform. We'll likely put some specific storage on SSDs but most of it will be on good old inexpensive spinning rust.)

  • In general we need to understand how the whole system works and why it should perform well, survive explosions, and so on. 'And then complex magic happens' makes us nervous and unhappy.

  • Confidence that what we pick has a good chance of being around and working well in, say, ten years' time. No one can guarantee anything, but turning over an entire fileserver environment is very painful and we want to at least have good confidence that we won't have to do it any time soon.
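
To make the two-level space requirement and the 'free space flows between filesystems' point from the list above a bit more concrete, here is a minimal sketch in Python. It is only a toy model under my own assumptions: the `Pool` and `Filesystem` classes, their field names, and the accounting rules are hypothetical and merely loosely modeled on how quotas and reservations behave, not a description of how ZFS or any other system actually implements them. The pool is the first level (the space sold to a group); per-filesystem quotas and reservations are the second level.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Filesystem:
    """Second level: one filesystem inside a pool (all sizes in GB)."""
    name: str
    used: int = 0
    quota: Optional[int] = None   # upper limit; None means "no limit"
    reservation: int = 0          # space guaranteed to this filesystem


@dataclass
class Pool:
    """First level: the fixed amount of space sold to a group."""
    name: str
    size: int
    filesystems: Dict[str, Filesystem] = field(default_factory=dict)

    def _committed(self) -> int:
        # Each filesystem ties up whichever is larger: what it already
        # uses, or what has been promised to it.
        return sum(max(f.used, f.reservation)
                   for f in self.filesystems.values())

    def add(self, fs: Filesystem) -> None:
        # Refuse to promise more space than the pool actually has.
        if self._committed() + max(fs.used, fs.reservation) > self.size:
            raise ValueError("pool cannot honour this reservation")
        self.filesystems[fs.name] = fs

    def available(self, name: str) -> int:
        fs = self.filesystems[name]
        # Free space nobody has a claim on, shared by every filesystem.
        shared_free = self.size - self._committed()
        # Plus whatever is left of this filesystem's own reservation.
        own_reserved_free = max(fs.reservation - fs.used, 0)
        avail = shared_free + own_reserved_free
        if fs.quota is not None:
            avail = min(avail, fs.quota - fs.used)
        return max(avail, 0)

    def write(self, name: str, amount: int) -> None:
        if amount > self.available(name):
            raise OSError("out of space (quota or pool exhausted)")
        self.filesystems[name].used += amount


if __name__ == "__main__":
    pool = Pool("group", size=100)
    pool.add(Filesystem("alice", reservation=20))
    pool.add(Filesystem("backups", quota=30))
    pool.add(Filesystem("shared"))
    pool.write("shared", 70)
    for fs_name in pool.filesystems:
        print(fs_name, pool.available(fs_name))
```

The point of the sketch is that nothing is preallocated: a filesystem with no quota can grow into any unclaimed space in the pool, a quota caps a 'backup entity', and a reservation keeps space available for a specific person even when everyone else has filled the rest of the pool.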

This is all abstract (and hopefully high level) because I'm trying to be open-mindedly generic rather than viewing everything through the limiting goggles of our current fileserver solution and its model of how the storage world can be structured.
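
As one concrete illustration of an otherwise abstract requirement, the 'units smaller than one (replicated) physical disk' item can be sketched as follows. This is a hedged toy example, not our actual tooling: the 250 GB chunk size, the `Chunk` type, the function names, and the idea of pairing one chunk from each of two backends to form a mirror are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

CHUNK_GB = 250  # hypothetical fixed chunk size, chosen only for the example


@dataclass(frozen=True)
class Chunk:
    backend: str   # which storage backend exports this chunk
    disk: str      # which physical disk it lives on
    index: int     # position of the chunk on that disk


def slice_disk(backend: str, disk: str, disk_gb: int,
               chunk_gb: int = CHUNK_GB) -> List[Chunk]:
    """Cut a physical disk into as many whole fixed-size chunks as fit."""
    return [Chunk(backend, disk, i) for i in range(disk_gb // chunk_gb)]


def allocate_mirrored(free_a: List[Chunk], free_b: List[Chunk],
                      want_gb: int,
                      chunk_gb: int = CHUNK_GB) -> List[Tuple[Chunk, Chunk]]:
    """Hand out mirrored chunk pairs, one chunk from each backend.

    Space is rounded up to whole chunks; the free lists are modified
    in place so that allocated chunks are not handed out twice.
    """
    needed = -(-want_gb // chunk_gb)  # ceiling division
    if needed > min(len(free_a), len(free_b)):
        raise ValueError("not enough free chunks on both backends")
    return [(free_a.pop(0), free_b.pop(0)) for _ in range(needed)]


if __name__ == "__main__":
    # Disks of different sizes simply yield different numbers of chunks.
    backend_a = slice_disk("backend-a", "disk0", 2000)
    backend_b = slice_disk("backend-b", "disk0", 3000)
    pairs = allocate_mirrored(backend_a, backend_b, want_gb=600)
    print(len(pairs), "mirrored chunks allocated")
    print(len(backend_a), len(backend_b), "chunks still free")
```

Because the unit of sale is the chunk rather than the disk, a group can buy space in small increments, and newer, larger disks just contribute more chunks of the same size.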

I'm reluctant to consider 'source available' or even 'open source' as a strict requirement, but at this point it might be very hard to persuade us that a closed alternative was sufficiently better to overcome the substantial disadvantage of not having source code. Real source code makes it much easier to understand and inspect systems and we've repeatedly found this to be very important. Similar things apply to monolithic single-vendor solutions as compared to solutions built on open standards with replaceable components.

(I'm assuming that everything today will deliver basic features like fault and problem monitoring and the ability to be driven from a Unix command line environment.)

Comments on this page:

From at 2013-07-13 06:27:57:

Could you please say something about the network you use between the iSCSI Linux boxes and the Solaris ZFS servers? Can you do this with 1Gbit/s or do you use something faster?

Thank you, Edwin de Graaf

From at 2013-07-13 07:39:24:

The closest Linux commercially-supported widget seems to be glusterfs.

By trs80 at 2013-07-13 10:09:02:

GlusterFS is good, but having run it on Debian squeeze I wouldn't run it on anything but RHEL or Fedora at the moment. You'd want to run it on XFS (LWN has coverage on why). Also, it's more an NFS replacement than a ZFS replacement.

By cks at 2013-07-13 15:17:30:

Our current environment is plain 1 Gbit/s Ethernet everywhere. We'd like to move at least part of things up to 10 Gbit/s in the next generation if we can afford it.

Written on 13 July 2013.

Last modified: Sat Jul 13 02:37:23 2013