2007-02-02
A modern environment's need for broadband
I have an embarrassing confession: my home Internet connection is still a dialup 28.8Kbps PPP link. (Back when I first set it up, this was only slightly behind the times, as 33.6Kbps modems had just come out.)
For a long time this was fine, because all I really did with it was text-based stuff, mostly sshing to my machines at the university; while I did do some web surfing, pretty much all of the sites I visited regularly were mostly text. As time went by and webcomics, streaming Internet radio, big PDFs, and so on accumulated, things got more awkward. But I could still make do, and I'm both stubborn and lazy, so I stuck with my PPP link.
(I run both ends of my PPP link myself, so it does exactly what I want it to. This matters a lot. Also I can defer bandwidth consuming things until I'm at my office machine.)
Then recently I moved to a modern Linux version with a wide package selection that got frequent updates (ie, Fedora Core 6), and it rapidly became obvious that dialup just doesn't cut it any more. Modern environments really need broadband; there are just so many things that assume you have abundant bandwidth.
(Fedora is particularly bad, because the index files yum downloads are over a megabyte per repository. This makes 'yum check-update' one of those things that I have to plan ahead of time.)
I can limp along for now, using various baroque coping strategies, but it's clear that broadband is in my future (and no later than Fedora Core 7; I don't want to think about trying to upgrade without it). And you just can't experiment with the huge package selection in Fedora Extras over a dialup link, which is no fun.
(Due to local circumstances, broadband for me means DSL, which means I have to pick which local DSL provider I want to deal with, since the local pseudo-monopoly provider is not suitable as they have recently introduced a (relatively low) monthly transfer cap for new customers. Which is a massive headache, since there are so many choices and options and whatnot.)
Why don't SQL servers do a lot of caching?
Recently, nothings asked an interesting question in a comment here:
[...] why, under the hood, deep down, is something like memcached necessary? Why isn't the SQL server's cache as effective?
I think that one of the reasons for this is that highly aggressive caching in an SQL server is much harder than it looks.
In database servers, caches have to be transparent; non-transparent caches would implicitly break ACID guarantees. Now contemplate the problem of completely correct cache invalidation that still preserves good cache performance. Clearly it's not good enough to discard a cached query result every time any change is made to a table that the query used, because even modest levels of write traffic will dynamite your cache (unless your database has a highly atypical design with many small tables).
Let's consider a hypothetical query to generate a count of visible comments on a particular blog entry:
SELECT count(*) FROM comments WHERE entry = 27793 AND visible = 'Y';
When a transaction changes the comments table, at a minimum the SQL server must test whether any of the affected rows match the WHERE clause and discard the cached query result if any do. This check is necessarily about as expensive (on a per-row basis) as the initial query was, and the SQL server has to do it for every cached query; there are going to be a lot of those if you're under enough load that you want an aggressive cache. The SQL server can optimize this process to some degree; for example, it can work out common subexpressions in WHERE clauses and merge things so that it only checks them once, not once for every cached query involving them. But I think it's clear that it's not an easy or simple job (and it may burn a lot of CPU).
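To make the cost concrete, here is a minimal sketch (in Python, and not how any real SQL server is implemented) of the bookkeeping a transparent query cache gets forced into: on every row change it has to re-run the WHERE predicate of each cached query for that table against the changed row and throw out anything that matches. All of the names here are made up for illustration.

# A sketch of transparent query-cache invalidation inside a database.
class CachedQuery:
    def __init__(self, sql, predicate, result):
        self.sql = sql              # the original query text
        self.predicate = predicate  # callable: row dict -> bool (the WHERE clause)
        self.result = result        # the cached result

class TransparentQueryCache:
    def __init__(self):
        self.by_table = {}          # table name -> list of CachedQuery

    def remember(self, table, query):
        self.by_table.setdefault(table, []).append(query)

    def row_changed(self, table, row):
        # Called for every row inserted, updated, or deleted in 'table'.
        # This per-row, per-cached-query predicate check is the cost the
        # entry is talking about.
        kept = []
        for q in self.by_table.get(table, []):
            if q.predicate(row):
                continue            # the change may alter this result: drop it
            kept.append(q)
        self.by_table[table] = kept

# The hypothetical comment-count query from above, with a made-up result:
cache = TransparentQueryCache()
cache.remember('comments', CachedQuery(
    "SELECT count(*) FROM comments WHERE entry = 27793 AND visible = 'Y'",
    lambda row: row['entry'] == 27793 and row['visible'] == 'Y',
    42))

# A new visible comment on entry 27793 knocks the cached count out;
# a comment on some other entry leaves it alone.
cache.row_changed('comments', {'entry': 27793, 'visible': 'Y'})
cache.row_changed('comments', {'entry': 555, 'visible': 'Y'})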
All of this is much easier when the application handles cache invalidation itself with a non-transparent cache like memcached; it can just directly shoot down cache entries, because it knows exactly which cache entries a particular operation can invalidate. The application can even organize what it caches to minimize the amount of invalidation it has to do, using high-level knowledge of its own behavior and usage patterns.
(One way to view this is that the application is effectively segmenting one database table into many pseudo-tables in the non-transparent cache, thereby buying itself the ability to do much simpler cache invalidation.)
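Here is a correspondingly minimal sketch of the application-side version, assuming a python-memcached-style client; run_query() is a hypothetical helper standing in for however the application actually talks to the database.

# The application knows which single cache entry a write can affect,
# so invalidation is one targeted delete instead of predicate checks.
import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def comment_count(entry_id):
    key = 'comment-count:%d' % entry_id
    count = mc.get(key)
    if count is None:
        # run_query() is a hypothetical database helper, not a real API.
        count = run_query(
            "SELECT count(*) FROM comments WHERE entry = %s AND visible = 'Y'",
            entry_id)
        mc.set(key, count, 300)     # cache for five minutes
    return count

def add_comment(entry_id, comment):
    run_query("INSERT INTO comments ...", entry_id, comment)
    # Only this entry's count can change, so shoot down just that key.
    mc.delete('comment-count:%d' % entry_id)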
Now, you can make the application give hints to the SQL server about all of this. But the more hints the application has to give the SQL server, the more the application might as well be using its own non-transparent cache to start with.