2011-10-31
Deduplication is always going to be expensive
I've seen a reaction or two to my entry on ZFS deduplication that suggest that the reason ZFS dedup has significant issues is that the Solaris programmers did a bad job on it. I want to set the record straight: this is false. Deduplication is fundamentally not an easy operation and no form of deduplication can be fast unless it is fed huge amounts of resources.
Any deduplication system needs to maintain a dedup table in some form. It's the core data structure because the core operation in deduplication is 'given this new object to be written, do I already have a copy?' The performance of any dedup system depends on how fast it can look up and modify random entries in this table, while the size of this table depends primarily on how many unique entities the system is tracking. All dedup systems want to keep their dedup tables in RAM for good performance on writes and deletes.
(Dedup table IO is random because the dedup checksum has to be assumed to be randomly distributed. This is certainly the case for all good cryptographic hashes.)
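To make the core operation concrete, here is a minimal sketch in Python of what the write path of a dedup system has to do. The flat in-memory dictionary and all of the names here are my invention for illustration; they are not how ZFS actually structures its dedup table.

    import hashlib

    class DedupTable:
        """Toy dedup table: checksum -> (block address, reference count)."""
        def __init__(self):
            self.entries = {}    # the real table must mostly live in RAM to be fast
            self.next_addr = 0

        def write_block(self, data):
            """Store a block, deduplicating it if we already have a copy."""
            key = hashlib.sha256(data).digest()
            entry = self.entries.get(key)              # random lookup keyed by checksum
            if entry is not None:
                addr, refcnt = entry
                self.entries[key] = (addr, refcnt + 1) # duplicate: just bump the count
                return addr
            addr = self.next_addr                      # new data: 'write' it for real
            self.next_addr += 1
            self.entries[key] = (addr, 1)
            return addr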
Picking block level deduplication is not a ZFS mistake. Block level deduplication is more powerful and more effective than file level deduplication, at the cost of keeping track of more unique entities. Block level dedup is also much easier than file level dedup to add to a Unix filesystem for various reasons that all devolve to 'writing blocks is a fundamental filesystem activity and writing files is not'.
In short, dedup inherently requires bunches of RAM; the more you dedup, the more memory. Block level dedup requires more memory than file level dedup, but gets better results.
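To put rough numbers on 'bunches of RAM' (these are back of the envelope assumptions, not exact ZFS figures; something on the order of 320 bytes per table entry is a commonly quoted approximation):

    ENTRY_SIZE = 320                 # assumed bytes per dedup table entry
    TB, GB = 1024 ** 4, 1024 ** 3

    def ddt_ram(unique_bytes, block_size):
        """Rough RAM needed for the dedup table covering this much unique data."""
        return (unique_bytes // block_size) * ENTRY_SIZE

    print(ddt_ram(1 * TB, 128 * 1024) / GB)   # 128 KB blocks: ~2.5 GB per TB
    print(ddt_ram(1 * TB, 4 * 1024) / GB)     # 4 KB blocks:   ~80 GB per TB

The same arithmetic done at the file level would track far fewer entities, which is exactly the RAM versus results tradeoff.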
There are only two and a half areas where ZFS dedup really makes life worse for itself. The half is that ZFS dedup is done at the pool level instead of the filesystem level and pools are generally larger (sometimes significantly so) than you would make any single filesystem. All else being equal, the larger the entity that you do deduplication over, the more unique entities you'll be dealing with and the larger your dedup table will be.
(Actually this is only half true; if you have N terabytes of disk space and a fixed block size, you can't magically shrink the collective dedup tables you need by slicing the space up into smaller chunks. If anything, smaller chunks will result in more total dedup table space because there is no cross-filesystem sharing of duplicate content. What you gain is potentially needing less RAM to hold the dedup tables of the active filesystem areas, if only some of your filesystems see much write traffic.)
The genuinely unforced ZFS error is artificially limiting the dedup table to less than 25% of system RAM. I actually think that this is less significant than it looks, because I think that the limit only costs you a factor of two or three in how large a ZFS pool you can dedup in RAM, and a factor of two or three is often not going to be enough to save you anyway.
The final ZFS issue is that deleting a snapshot is a uniquely bad operation for any dedup system, because it causes a huge number of object deletions in a single action that's generally uninterruptible. ZFS could do better than it does, but only at the expense of mutating a snapshot that's in the process of being deleted (which would enable genuinely suspending a snapshot removal). The other way you can get this effect on any block level dedup system is to delete a sufficiently large file.
(Even today removing a sufficiently large file can be surprisingly slow on many filesystems, precisely because file removal doesn't report success until it has freed up all blocks associated with the file. It can take a significant amount of disk access just to determine what all of these blocks are for a large file.)
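As a sketch of why mass deletion hurts (continuing the toy DedupTable from above, so again the structure is made up): freeing even one block under dedup means a random table lookup plus an update, and a snapshot or large-file removal has to repeat that for every block involved.

    def free_blocks(table, checksums):
        """Free a pile of blocks at once, as a snapshot or big-file delete must."""
        for key in checksums:
            addr, refcnt = table.entries[key]          # random read of the dedup table
            if refcnt > 1:
                table.entries[key] = (addr, refcnt - 1)
            else:
                del table.entries[key]                 # last reference: actually free
                # (a real filesystem would release the on-disk block here)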
2011-10-23
Boiling frogs and PC performance
If you'd asked me a week ago, I would have confidently asserted that the performance of my old home machine was pretty much fine (with a few narrow exceptions); certainly no longer top of the line, but still mostly more than adequate in general. Sure, it was five years old, but I'd gotten reasonably good components five years ago, I don't do many things that are performance demanding, and anyways I've long heard (and agreed) about how both performance demands and system performance have leveled off.
I was wrong. There's no other way to put it. Using my new machine has been a revelation in how much my old machine had quietly been slow. Even for programs that I knew were compute-heavy, the change has been dramatic. And other programs have kept surprising me when they're noticeably nicer and smoother on the new machine.
(My favorite example of improvement is that my photo processing program has not merely gotten faster at doing the same things I was doing before, it has turned into an interactive editor where I can drag adjustment sliders and see the results instantly.)
When I look back at this, I think that what happened is that I was being the computing equivalent of a frog being boiled. Sure, my machine had performed well five years ago, with the programs of five years ago. But over the past five years, both programs and what we do with them have slowly become more demanding, bit by bit. As things got more demanding my machine's effective performance slowly degraded, but because this was a slow and progressive change I didn't consciously notice anything. Things were just a little bit slower than they were last month, never suddenly a lot slower, and soon enough it became the new normal that I expected and was used to.
In short, I had gotten accustomed to having a genuinely slow machine. And because it had happened gradually I didn't realize that it was possible to have a qualitatively different and better experience with a modern fast machine; I expected only moderate improvements from the upgrade, not an eye-opening sea change.
(Okay, there were a few things that I was hoping would speed up visibly, but that was because they were broken to start with and I would be throwing huge amounts of memory and CPU at them in the hopes of papering over the issue.)
(In retrospect, I perhaps should have noticed that there were an increasing number of things that I wasn't able to do. I've never been able to view Flash-based video in HD, for example, and I knew that was a CPU issue, but I sort of wrote it off as 'oh, Flash on Linux is just handicapped in general'.)
2011-10-13
The problem with event loops
There are a number of surface reasons why programming in an event loop paradigm is a pain in the ass; anyone who has ever done it can probably reel off a litany of complaints for you. I maintain, however, that many of them come down to a single fundamental issue. In a nutshell, event loops make you manage execution context manually, by yourself.
As I mentioned in my entry on event loops versus threads, both event loops and threads are switching contexts back and forth; it's just that threads do it all for you and in event loops you do it explicitly. Well, with that explicit context switching comes the need to explicitly manage those contexts, maintaining their state by hand. Much of the extra pain of event loops is ultimately from the various pains of doing manual context management.
(My entry on the real problem with asynchronous IO talks a bit about why manual context management is a pain. Asynchronous IO itself is obviously very closely related to event loops.)
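To make the contrast concrete, here is a small sketch of the same trivial echo service written both ways (the port numbers are arbitrary and this is illustration, not production code). In the threaded version the per-connection context is just local variables on the stack; in the evented version every step becomes a callback and the connection is context that you carry around by hand, and any real per-request state beyond this only makes the callback version worse.

    import selectors
    import socket
    import threading

    # Threaded: per-connection state is just local variables; blocking calls
    # suspend and resume the context for us.
    def serve_threaded(port=8001):
        listener = socket.create_server(("localhost", port))
        while True:
            conn, _ = listener.accept()
            threading.Thread(target=echo_one, args=(conn,), daemon=True).start()

    def echo_one(conn):
        with conn:
            while True:
                data = conn.recv(4096)        # block here; our context just waits
                if not data:
                    return
                conn.sendall(data)

    # Evented: one loop handles every connection, so each connection's context
    # must be tracked explicitly and every step is written as a callback.
    def serve_evented(port=8002):
        sel = selectors.DefaultSelector()
        listener = socket.create_server(("localhost", port))
        listener.setblocking(False)

        def on_accept():
            conn, _ = listener.accept()
            conn.setblocking(False)
            # 'conn' is context we carry by hand, stashed in the selector
            sel.register(conn, selectors.EVENT_READ, lambda: on_readable(conn))

        def on_readable(conn):
            data = conn.recv(4096)
            if not data:
                sel.unregister(conn)
                conn.close()
                return
            conn.sendall(data)                # cheating: partial writes would need
                                              # yet more manually managed state

        sel.register(listener, selectors.EVENT_READ, on_accept)
        while True:
            for key, _ in sel.select():
                key.data()                    # run the callback stored at register()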
The corollary to this is that the simpler and more regular your contexts are, the less of a pain they're going to be to manage. I think that this goes a fair way to explain the popularity of event loops for problems like serving static resources, as these problems generally involve very little state. Conversely, the worst hell is things with irregular and constantly mutating state, especially complex state; this is the kind of thing where you vanish into a black hole of nested callback closures.
(I have a theory that this also explains why operating system programmers are often much more enamoured of event loops and state machines than regular programmers are; OS programmers are already thinking more or less explicitly about the context that their code runs in, because they always have to bear it in mind.)
2011-10-06
Thinking about event loops versus threads
The inimitable Ted Dziuba recently compared threads against event loops for throughput and ease of use and came down firmly on the side of threads; to summarize his writing, he argues (with math) that threads have the same throughput and are easier to use. While I sort of agree with this, I think that his math makes some assumptions or, alternatively, ignores some constant factors that are potentially important in real life.
(His math implicitly assumes that servicing activity either takes the same amount of CPU in threads as it does with event loops, or that the difference in CPU required is small enough to ignore.)
If you look at them from the right angle, both threads and event loops are switching contexts back and forth between various IO requests; the difference is that threads perform this context switching implicitly for you while in an event loop you perform it explicitly. This gives event loops three potential performance advantages over threads for CPU usage: lower context switch overheads, lower costs of coordination, and aggregation of IO wait requests.
Event loops change context entirely in userspace and generally through very simple code (part of the simplicity is that event loops normally give up on scheduling), while threads have more involved and expensive context switches. However, looking only at the 'context switch' part overstates the event loop advantage; to be fair we need to include the work to manage all of the various callback contexts that event loops breed. My feeling is that most event loop implementations will still come out ahead, especially once multiple threads begin contending for CPU and force the scheduler to become involved.
(In many cases any work the scheduler does to 'fairly' or efficiently decide the next thread to run is almost entirely wasted effort; it just doesn't matter, and where it matters a bit the important thing is to avoid starvation.)
The coordination costs are the locking and mutexing costs that threads pay to talk to each other to exchange information; event loops mostly get this coordination for free. However this is task dependent, since how much coordination you need depends on what you're doing. Some sorts of systems require very high coordination (eg ones where information frequently passes from connection to connection), while others have none (the classic 'serve static resources' case). High coordination systems pay a high price for being threaded since hot locks and frequent locking are both very expensive.
(The counter argument is that a multi-core event loop system has the same coordination needs, although it may be able to get away with less locking if it is sufficiently clever and can partition the work between cores sufficiently carefully. You only get coordination for free if you only have one event loop process or at the least you don't need to coordinate across event loop process boundaries.)
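As an illustration of the coordination difference (the chat-room style structure here is a toy I made up, with connections as plain lists that collect messages): a threaded server has to lock the shared structure on every access, and with enough active threads that lock gets hot, while a single event loop process can touch the same structure freely because only one callback ever runs at a time.

    import threading

    # Toy shared state in the 'high coordination' style: room name -> member
    # connections, where a connection is just a list collecting its messages.
    rooms = {"lobby": [[], [], []]}
    rooms_lock = threading.Lock()

    # Threaded delivery: every message pays the coordination cost of the lock,
    # and under load the lock itself becomes hot and expensive.
    def deliver_threaded(room, message):
        with rooms_lock:
            members = list(rooms.get(room, ()))
        for conn in members:
            conn.append(message)

    # Single event loop delivery: the same structure, no locking at all,
    # because only one callback ever touches it at a time.
    def deliver_evented(room, message):
        for conn in rooms.get(room, ()):
            conn.append(message)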
Aggregation of IO wait requests is the question of whether it is more (CPU) efficient in your operating system to have one thread wait for activity on five hundred IO sources or to have five hundred threads wait for one IO source each. A certain amount depends on the details of the system calls available; for example, a potential factor in threads' favour is that many of their IO waits are implicit in other system calls that they are already doing. If an event loop has to do 'wait for read data to be available; read available data' (as two syscalls), threads have a potential advantage (since they're just doing a single blocking read). Things like asynchronous IO with completion notifications level the playing field again back to the basic question.
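Here is a sketch of the syscall pattern involved (again just an illustration; 'handle' stands in for whatever real work gets done): the event loop services each ready source with at least two system calls, a poll and then the read, while a thread's wait is folded into the single blocking read it was going to do anyway.

    import selectors

    def handle(data):
        pass                                    # stand-in for real per-request work

    # Event loop: one thread waits on all N sources at once, then reads each
    # ready one; that's (at least) two system calls per serviced source.
    def evented_reads(socks):
        sel = selectors.DefaultSelector()
        for s in socks:
            sel.register(s, selectors.EVENT_READ)
        while True:
            for key, _ in sel.select():         # syscall 1: the poll/select wait
                handle(key.fileobj.recv(4096))  # syscall 2: the actual read

    # Threads: each thread's IO wait is implicit in the read it was already
    # doing, a single blocking system call.
    def threaded_reads(sock):
        while True:
            handle(sock.recv(4096))             # one syscall: block until data arrives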
The classical thread performance breakdown under high concurrency is in fact exactly a failure of these assumptions and factors: you have so many (active) threads that the system is spending almost all of its time thrashing back and forth between them and almost none of it actually having the thread code do useful work. A corresponding pragmatic advantage of event loops is that you can build your management data structures for your expected load, using sophisticated high capacity ones where necessary.
(Operating systems have historically not designed their schedulers and other bits for huge numbers of active threads and have thus opted for relatively simple and maintainable data structures that work fine for typical, sane loads.)
One final note: the effect of all of these factors drops as the amount of real computation required to service a particular request rises, and vice versa. This neatly explains where event loops have been hugely popular, namely environments where the per-request computation is effectively nil and there is a large number of requests; this situation maximizes any thread overhead that exists.
(The absolute worst case has lots of requests, very low per-request computation, and high coordination requirements, thereby maximizing every possible thread disadvantage. I believe the classic demo case is simple chat rooms with large numbers of clients in each room.)