2017-03-03
Why exposing only blocking APIs is ultimately a bad idea
I recently read Marek's Socket API thoughts, which mulls over a number of issues and ends with the remark:
But nonetheless, I very much like the idea of only blocking API's being exposed to the user.
This is definitely an attractive idea. All of the various attempts at select() style APIs have generally not gone well, high level callbacks give you 'callback hell', and it would be conceptually nice to combine cheap concurrency with purely blocking APIs to have our cake and eat it too. It's no wonder this idea comes up repeatedly and I feel the tug of it myself.
Unfortunately, I've wound up feeling that it's fundamentally a mistake. While superficially attractive, attempting to do this in the real world is going to wind up with an increasingly ugly mess in practice. For the moment let's set aside the issue that cheap concurrency is fundamentally an illusion and assume that we can make the illusion work well enough here. This still leaves us with the select() problem: sooner or later the result of one IO will make you want to stop doing another waiting IO. Or more generally, sooner or later you'll want to stop doing some bit of blocking IO as the result of other events and processing inside your program.
When all IO is blocking, separate IO must be handled by separate threads and thus you need to support external (cross-thread) cancellation of in-flight blocked IO out from underneath a thread. The moment you have this sort of unsynchronized and forced cross-thread interaction, you have a whole collection of thorny concurrency issues that we have historically not been very good at dealing with. It's basically guaranteed that people will write IO handling code with subtle race conditions and unhandled (or mishandled) error conditions, because (as usual) they didn't realize that something was possible or that their code could be trying to do thing X right as thing Y was happening.
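(To make this concrete, here's a quick Go sketch, purely illustrative, of what this sort of cross-thread cancellation usually looks like in practice: one goroutine sits in a blocking Read() while another forcibly closes the connection out from under it, and the reader only finds out it was cancelled by getting an error that it then has to interpret. The endpoint is just a placeholder.)

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        conn, err := net.Dial("tcp", "example.com:80") // placeholder endpoint
        if err != nil {
            fmt.Println("dial:", err)
            return
        }

        done := make(chan struct{})
        go func() {
            defer close(done)
            buf := make([]byte, 4096)
            for {
                // Blocks until data arrives, the peer closes, an error
                // happens, or someone else calls conn.Close() on us.
                n, err := conn.Read(buf)
                if err != nil {
                    // Was this a genuine network error, or did another
                    // goroutine Close() the connection to cancel us? The
                    // code has to work that out, and getting it wrong is easy.
                    fmt.Println("read ended:", err)
                    return
                }
                fmt.Println("got", n, "bytes")
            }
        }()

        // Some other part of the program decides the read is no longer wanted.
        time.Sleep(100 * time.Millisecond)
        conn.Close() // forcibly unblocks the Read in the other goroutine
        <-done
    }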
(I'm sure that there are API design mistakes that can and will be made here, too, just as there have been a series of API design mistakes around select() and its successors. Even APIs are hard to get completely right in the face of concurrency issues.)
There is no fix for this that I can see for purely blocking APIs. Either you allow external cancellation of blocked IO, which creates the cross-thread problems, or you disallow it and significantly limit your IO model, creating real complications as well as limiting what kind of systems your APIs can support.
(For the people who are about to say 'but Go makes it work', I'm afraid that Go doesn't. It chooses to limit what sort of systems you can build, and I'm not just talking about the memory issues.)
PS: I think it's possible to sort of square the circle here, but the solution must be deeply embedded into the language and its runtime. The basic idea is to create a CSP-like environment where waiting for IO to complete is a channel receive or send operation, and may be mixed with other channel operations in a select. Once you have this, you have a relatively clean way to cancel a blocked IO; the thread performing the IO simply uses a multi-select, where one channel is the IO operation and another is the 'abort the operation' channel. This doesn't guarantee that everyone will get it right, but it does at least reduce your problem down to the existing problem of properly handling channel operation ordering and so on. But this is not really an 'only blocking API' as we normally think of it and, as mentioned, it requires very deep support in the language and runtime (since under the hood this has to actually be asynchronous IO and possibly involve multiple threads).
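As an illustration only, here's a rough Go-shaped sketch of that multi-select. The startRead helper is something I've made up for the sketch; it just runs the read in another goroutine and reports completion on a channel, which means the underlying Read still quietly blocks a goroutine even after we stop waiting for it. That's exactly why the real version needs the deep runtime support I'm talking about.

    package main

    import (
        "fmt"
        "io"
        "strings"
        "time"
    )

    type readResult struct {
        data []byte
        err  error
    }

    // startRead exposes the completion of a read as a channel, which is the
    // shape described above: 'the IO operation is a channel operation'.
    func startRead(r io.Reader) <-chan readResult {
        ch := make(chan readResult, 1)
        go func() {
            buf := make([]byte, 4096)
            n, err := r.Read(buf)
            ch <- readResult{data: buf[:n], err: err}
        }()
        return ch
    }

    func main() {
        ioDone := startRead(strings.NewReader("some data")) // stand-in for a real fd
        abort := make(chan struct{})

        // Something elsewhere decides we should give up; here it's a timer.
        go func() {
            time.Sleep(50 * time.Millisecond)
            close(abort)
        }()

        // The multi-select: wait on the IO and the abort channel at once.
        select {
        case res := <-ioDone:
            fmt.Printf("read %q, err=%v\n", res.data, res.err)
        case <-abort:
            fmt.Println("gave up waiting for the IO")
        }
    }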
This is also going to sometimes be somewhat of a lie, because on many systems there is a certain amount of IO that is genuinely synchronous and can't be interrupted at all, despite you putting it in a multi-channel select statement. Many Unixes don't really support asynchronous reads and writes from files on disk, for example.
Some notes on ZFS per-user quotas and their interactions with NFS
In addition to quotas on filesystems themselves (refquota) and quotas on entire trees (plain quota), ZFS also supports per-filesystem quotas on how much space users (or groups) can use. We haven't previously used these for various reasons, but today we had a situation with an inaccessible runaway user process eating up all the free space in one pool on our fileservers and we decided to (try to) stop it by sticking a quota on the user. The result was reasonably educational and led to some additional educational experimentation, so now it's time for notes.
User quotas for a user on a filesystem are created by setting the userquota@<user> property of the filesystem to some appropriate value. Unlike overall filesystem and tree quotas, you can set a user quota that is below the user's current space usage. To see the user's current space usage, you look at userused@<user> (which will have its disk space number rounded unless you use 'zfs get -p userused@<user> ...'). To clear the user's quota limit after you don't need it any more, set it to none instead of a size.
(The current Illumos zfs manpage has an annoying mistake, where its section on the userquota@<user> property talks about finding out space by looking at the 'userspace@<user>' property, which is the wrong property name. I suppose I should file a bug report.)
Since user quotas are per-filesystem only (as mentioned), you need to know which filesystem or filesystems your errant user is using space on in your pool in order to block a runaway space consumer. In our case we already have some tools for this and had localized the space growth to a single filesystem; otherwise, you may want to write a script in advance so you can freeze someone's space usage at its current level on a collection of filesystems.
(The mechanics are pretty simple; you set the userquota@<user> value to the value of the userused@<user> property, if it exists. I'd use the precise value unless you're sure no user will ever use enough space on a filesystem to make the rounding errors significant.)
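Here's a minimal Go sketch of that freezing operation, shelling out to the zfs command. The user and filesystem names are made up, and a real tool would want better error handling (and would later clear the quotas again by setting them back to none).

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    // freezeUser pins a user's quota at their current usage on each filesystem.
    func freezeUser(user string, filesystems []string) error {
        for _, fs := range filesystems {
            // Get the precise (-p) byte count in scripting-friendly (-H) form.
            out, err := exec.Command("zfs", "get", "-Hp", "-o", "value",
                "userused@"+user, fs).Output()
            if err != nil {
                return fmt.Errorf("zfs get on %s: %v", fs, err)
            }
            used := strings.TrimSpace(string(out))
            if used == "" || used == "-" {
                continue // no recorded usage for this user on this filesystem
            }
            // Set the quota to exactly the current usage.
            err = exec.Command("zfs", "set",
                "userquota@"+user+"="+used, fs).Run()
            if err != nil {
                return fmt.Errorf("zfs set on %s: %v", fs, err)
            }
            fmt.Printf("%s: froze %s at %s bytes\n", fs, user, used)
        }
        return nil
    }

    func main() {
        // Hypothetical names; a real script would discover the filesystems.
        fsList := []string{"tank/homes/a", "tank/homes/b"}
        if err := freezeUser("someuser", fsList); err != nil {
            fmt.Println("error:", err)
        }
    }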
Then we have the issue of how firmly and how fast quotas are enforced. The zfs manpage warns you explicitly:

Enforcement of user quotas may be delayed by several seconds. This delay means that a user might exceed their quota before the system notices that they are over quota and begins to refuse additional writes with the EDQUOT error message.
This is especially the case over NFS (at least NFS v3), where NFS clients may not start flushing writes to the NFS server for some time. In my testing, I saw the NFS client's kernel happily accept a couple of GB of writes before it started forcing them out to the fileserver.
The behavior of an OmniOS NFS server here is somewhat variable. On the one hand, we saw space usage for our quota'd user keep increasing over the quota for a certain amount of time after we applied the quota (unfortunately I was too busy to time it or carefully track it). On the other hand, in testing, if I started to write to an existing but empty file (on the NFS client) once I was over quota, the NFS server refused all writes and didn't put any data in the file. My conclusion is that at least for NFS servers, the user may be able to go over your quota limit by a few hundred megabytes under the right circumstances. However, once ZFS knows that you're over the quota limit a lot of things shut down immediately; you can't make new files, for example (and NFS clients helpfully get an immediate error about this).
(I took a quick look at the kernel code but I couldn't spot where ZFS updates the space usage information in order to see what sort of lag there is in the process.)
I haven't tested what happens to fileserver performance if an NFS client keeps trying to write data after it has hit the quota limit and has started getting EDQUOT errors. You'd think that the fileserver should be unaffected, but we've seen issues when pools hit overall quota size limits.
(It's not clear if this came up today when the user hit the quota limit and whatever process(es) they were running started to get those EDQUOT errors.)