What we need in our fileservers (in the abstract)
A few weeks ago, a commenter on one of my fileserver entries asked if we'd considered using Ceph instead of our ZFS plus iSCSI setup. My initial reaction was strongly and more or less reflexively negative, but for various reasons I've been thinking about the general issues involved off and on since then (partly because the timing for it is singularly good, since we have to migrate our data anyways). One of the first steps of any sort of semi-objective evaluation of options is to come up with a list of what general, abstracted features we need in our fileservers.
This is my list so far:
- Full Unix permissions on the machines that people use; our users
would (rightfully) lynch us if we took away the things that you
typically lose by moving to a real distributed filesystem. I think
that this requires NFS to the 'fileservers', whatever those are.
- Traditional Unix filesystem semantics (more or less) and
performance, again on the machines that people use. People are
going to run all sorts of general Unix programs against this file
service; they will not be happy if some of them suddenly don't
work or perform really badly because (for example) database-style
random write IO performs poorly.
We can't make any particular predictions about how people will use our fileservice. Some people will make light use of it. Some people will append to files a lot over and over again. Some people will do intensive stream IO. Some people will run databases on it, doing lots of random read and/or write IO. Some people will unpack and shuffle around huge trees of files (updating git repositories, unpacking tar files, or whatever). We can't tell them 'don't do that, our fileservers aren't built for it'.
- No single point of (server) hardware failure that can take down
storage. I think a good way to put this is that we should be able
to paper over any server dying even if we're working remotely and
can't touch the physical hardware (beyond forcing a hard power off).
Our current environment has this
property; we can fail over (virtual) fileservers if the physical
server dies, and if an iSCSI backend dies we can re-mirror storage to
our hot spare backend.
This implies that storage can't be tightly tied to 'fileservers' because a dead server would then require physical work to shift disks, connectors, or what have you to another server.
- Adding more storage and file service can't cost too much; we have
to always be able to buy them in relatively inexpensive units even
if we're full up at the moment. This implies that the individual
components can't be too expensive.
- It has to be possible to update and replace hardware without
the users noticing (possibly apart from minor downtime for some
changes). Migrating storage from one set of disks on one set of
machines to another set of disks on another set of machines should
not involve user-visible downtime. Nor should expanding the space
available to any particular set of users.
(The more changes and so on that can be done without users noticing, the better.)
- Two levels of space limitations and space reservations, however
that's accomplished. We need one level because for better or worse
our model of providing storage is to sell space to groups and
professors, which means that we need to be able to sell them X
amount of guaranteed space that they can use. We need a second
level within that in order to limit the size of 'backup entities'
and to reserve space for specific people within a group.
(Our current implementation uses ZFS pools and ZFS filesystems (with quotas and reservations) to provide the two levels.)
- We need to be able to allocate top-level space to people in units
smaller than 'one (replicated) physical disk'. Among other reasons,
physical disk sizes change over time.
(Today this is done by slicing physical disks into fixed-size logical chunks and exposing those chunks by iSCSI.)
- Flexible space management;
the 'filesystems' within the 'pools' should not need to have space
preallocated to them. That free space is shared between and flows
between filesystems in pools has been a major win in our current
environment because it means that people simply buy generic space
and don't have to carefully plan out where it goes and who gets
it and then re-balance it as needs and space usage change. We
would be very reluctant to give this up.
- Data integrity checksums and resilient handling of disk errors.
To summarize the issue, they need to be as good as ZFS's.
- Space allocation and IO patterns that we can understand and analyze.
We're not interested in shoveling a bunch of disks on storage servers
into a great big cloud and having theoretically generic fileservice
come out the other side; we need to be able to understand, control,
and monitor what 'pools' are putting their data where.
(And not all of our disks will be uniform. We'll likely put some specific storage on SSDs but most of it will be on good old inexpensive spinning rust.)
- In general we need to understand how the whole system works and why
it should perform well, survive explosions, and so on. 'And then
complex magic happens' makes us nervous and unhappy.
- Confidence that what we pick has a good chance of being around and working well in, say, ten years' time. No one can guarantee anything, but turning over an entire fileserver environment is very painful and we want to at least have good confidence that we won't have to do it any time soon.
This is all abstract (and hopefully high level) because I'm trying to be open-mindedly generic rather than viewing everything through the limiting goggles of our current fileserver solution and its model of how the storage world can be structured.
I'm reluctant to consider 'source available' or even 'open source' as a strict requirement, but at this point it might be very hard to persuade us that a closed alternative was enough better to overcome the substantial disadvantage of not having source code. Real source code makes it much easier to understand and inspect systems and we've repeatedly found this to be very important. Similar things apply for monolithic single vendor solutions as compared to solutions built on open standards with replaceable components.
(I'm assuming that everything today will deliver basic features like fault and problem monitoring and the ability to be driven from a Unix command line environment.)
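As a rough illustration of the space management items in the list (two levels of limits and reservations, top-level space sold in units smaller than a disk, and free space that flows between filesystems), here is a toy Python model. It is only a sketch of the abstract behaviour we want, not a description of how ZFS or anything else actually accounts for space; the class names, the 100 GB chunk size, and the example numbers are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

CHUNK_GB = 100  # hypothetical fixed chunk size; the real figure is an assumption


@dataclass
class Filesystem:
    name: str
    used: int = 0                 # GB currently in use
    quota: Optional[int] = None   # second-level limit; None means "whatever the pool has"
    reservation: int = 0          # GB guaranteed to this filesystem


@dataclass
class Pool:
    """First level: a group buys some number of fixed-size chunks."""
    name: str
    chunks: int
    filesystems: Dict[str, Filesystem] = field(default_factory=dict)

    @property
    def size(self) -> int:
        return self.chunks * CHUNK_GB

    def _claimed(self, fs: Filesystem) -> int:
        # A filesystem ties up its usage or its reservation, whichever is larger.
        return max(fs.used, fs.reservation)

    def add_filesystem(self, name: str, quota: Optional[int] = None,
                       reservation: int = 0) -> Filesystem:
        claimed = sum(self._claimed(f) for f in self.filesystems.values())
        if claimed + reservation > self.size:
            raise ValueError("pool cannot honour this reservation")
        fs = Filesystem(name, quota=quota, reservation=reservation)
        self.filesystems[name] = fs
        return fs

    def available_to(self, name: str) -> int:
        """Space this filesystem could still grow by, in GB."""
        fs = self.filesystems[name]
        others = sum(self._claimed(f) for f in self.filesystems.values()
                     if f is not fs)
        room = self.size - others - fs.used
        if fs.quota is not None:
            room = min(room, fs.quota - fs.used)
        return max(room, 0)

    def write(self, name: str, gb: int) -> None:
        if gb > self.available_to(name):
            raise ValueError("out of space")
        self.filesystems[name].used += gb


if __name__ == "__main__":
    pool = Pool("some-group", chunks=5)           # 500 GB of purchased space
    pool.add_filesystem("alice", reservation=200)  # guaranteed space for one person
    pool.add_filesystem("backups", quota=150)      # capped 'backup entity'
    pool.add_filesystem("shared")                  # no preallocation at all
    pool.write("shared", 250)
    for fs_name in pool.filesystems:
        print(fs_name, pool.available_to(fs_name))
    pool.chunks += 1                               # the group buys one more chunk
    print("shared after growth:", pool.available_to("shared"))
```

Nothing here preallocates space to a filesystem: 'shared' simply uses whatever the pool has left after other filesystems' usage and reservations, and buying another chunk immediately raises what every filesystem can grow into.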