Why don't SQL servers do a lot of caching?

February 2, 2007

Recently, nothings asked an interesting question in a comment here:

[...] why, under the hood, deep down, is something like memcached necessary? Why isn't the SQL server's cache as effective?

I think that one of the reasons for this is that highly aggressive caching in an SQL server is much harder than it looks.

In database servers, caches have to be transparent; non-transparent caches would implicitly break ACID guarantees. Now contemplate the problem of completely correct cache invalidation that still preserves good cache performance. Clearly it's not good enough to discard a cached query result every time any change is made to a table that the query used, because even modest levels of write traffic will dynamite your cache (unless your database has a highly atypical design with many small tables).

Let's consider a hypothetical query to generate a count of visible comments on a particular blog entry:

SELECT count(*) FROM comments WHERE entry = 27793 AND visible = 'Y';

When a transaction changes the comments table, at a minimum the SQL server must test whether any of the affected rows match the WHERE clause and discard the cached query result if any do. This check is necessarily about as expensive (on a per-row basis) as the initial query was, and the SQL server has to do it for every cached query; there will be a lot of those if you're under enough load to want an aggressive cache in the first place. The SQL server can optimize this process to some degree; for example, it can work out common subexpressions in WHERE clauses and merge things so that it only checks them once, not once for every cached query involving them. But I think it's clear that this is not an easy or simple job (and it may burn a lot of CPU).
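
To make that concrete, here is a minimal sketch (in Python, with invented names and rows represented as dictionaries; no real SQL server is implemented this way) of the per-write check that a transparent cache forces:

  cached_queries = {
      # cache key -> the WHERE-clause predicate, as a function of a row
      "count_visible_27793":
          lambda row: row["entry"] == 27793 and row["visible"] == "Y",
      # ... one entry per cached query result involving 'comments' ...
  }

  def invalidate_for_write(old_row, new_row, cache):
      # Called for every row a transaction changes in 'comments'. If the
      # row matched a cached query's WHERE clause either before or after
      # the change, that cached result may now be wrong and must go.
      for key, predicate in cached_queries.items():
          if ((old_row is not None and predicate(old_row)) or
                  (new_row is not None and predicate(new_row))):
              cache.pop(key, None)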

All of this is much easier when the application handles cache invalidation itself with a non-transparent cache like memcached; it can just directly shoot down cache entries, because it knows exactly which cache entries a particular operation can invalidate. The application can even organize what it caches to minimize the amount of invalidation it has to do, using the high-level knowledge it has about its own behavior and usage patterns.

(One way to view this is that the application is effectively segmenting one database table into many pseudo-tables in the non-transparent cache, thereby buying itself the ability to do much simpler cache invalidation.)
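
For illustration, here is a sketch of the application side of this, written against a python-memcached style client; the key scheme and the db helper methods (query_scalar, execute) are hypothetical:

  import memcache

  mc = memcache.Client(["127.0.0.1:11211"])

  def visible_comment_count(db, entry_id):
      # One cache key per entry: the application's 'pseudo-table'.
      key = "comment-count-%d" % entry_id
      count = mc.get(key)
      if count is None:
          count = db.query_scalar(
              "SELECT count(*) FROM comments"
              " WHERE entry = %s AND visible = 'Y'", entry_id)
          mc.set(key, count)
      return count

  def add_comment(db, entry_id, text):
      db.execute("INSERT INTO comments (entry, text, visible)"
                 " VALUES (%s, %s, 'Y')", entry_id, text)
      # The application knows only this entry's count changed, so it
      # shoots down exactly one key; no general invalidation scan needed.
      mc.delete("comment-count-%d" % entry_id)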

Now, you can make the application give hints to the SQL server about all of this. But the more hints the application has to give the SQL server, the more the application might as well be using its own non-transparent cache to start with.


Comments on this page:

By nothings at 2007-02-02 01:54:38:

But the more hints the application has to give the SQL server, the more the application might as well be using its own non-transparent cache to start with.

See, I argued that you could apply the same logic to indexes, though. An index is a hint that you want a certain kind of query to go fast. Nobody says 'the more you use indices, the more you might as well write your own data structures to accelerate queries'. So I don't think the complexity of server implementation is a fair justification; the whole point of putting it into the server is so every application gets a benefit.

However, if the ACID stuff is that bad, then ok. I imagine, though, that there are other strategies. For instance, you can tag every row that currently has any cached data, and just check for that tag, without having to do a full test. (Somewhat like a 'dirty bit'.) That's one bit per row. (But I have no idea how databases store rows of data, so the tag may not actually be handy just because you're accessing one field of the row.)
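
A rough sketch of this tagging idea (everything here, including how rows are represented, is hypothetical):

  class Row:
      def __init__(self, data):
          self.data = data
          self.cached = False   # the one-bit 'some cached result uses me' tag

  def on_write(row, cache):
      # Fast path: nothing cached depends on this row, so skip all
      # invalidation work.
      if not row.cached:
          return
      # Slow path: some cached result may depend on this row. Without a
      # full predicate test we don't know which, so throw them all away.
      cache.clear()
      row.cached = False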

I also don't know how the ACID stuff really shakes out. In the course of building a web page from 16 selects, is that actually transactional? How? Does the perl/PHP program start a transaction, with the server locking everything it touches? That would lead to deadlock, which could require abandoning the transaction and restarting it... do web pages actually do that? Unless somebody tells me otherwise, my guess would be that in practice 99% of stuff that's done with databases is already giving the client an inconsistent view of the database, because the client isn't asking for one. If that's the case, I imagine there's huge room for sloppy caching stuff that occurs non-transactionally.

(Why do dwiki quote blocks require a space after the '>'? This took me forever to figure out what I was doing wrong.)

By cks at 2007-02-02 09:09:26:

The difference between indexes and cache management is that indexes only have to be set up once, when you define the table, and after that the SQL server does all the work of maintaining them; the indexes are transparent. If the application has to feed the SQL server lots of hints about how to manage the cache, the cache is clearly not transparent.

For instance, you can tag every row that currently has any cached data, and just check for that tag, without having to do a full test.

Then you don't have a really efficient cache, because you're invalidating all queries involving the table every time a transaction modifies it. This is going to happen a lot for a busy database.
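
To picture just how coarse that is, here is a hypothetical sketch in which every cached result for a table is tied to a single per-table generation counter, so any write strands all of them at once:

  table_generation = {"comments": 0}
  cache = {}

  def cache_key(table, query_text):
      # Every cached result embeds the table's current generation.
      return "%s/gen%d/%s" % (table, table_generation[table], query_text)

  def on_any_write(table):
      # One write to the table orphans every cached result for it.
      table_generation[table] += 1

  def cached_result(table, query_text, compute):
      key = cache_key(table, query_text)
      if key not in cache:
          cache[key] = compute()
      return cache[key]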

The problem with ACID is that only the application knows how much staleness it can tolerate for each different query. The database has to treat everything as transactional (because that's how the spec says it works) and therefore provide all the consistency and isolation guarantees, cache included.

By nothings at 2007-02-02 20:33:39:

Ok, sorry, right. If a query is for field foo having the value bar, you have to invalidate the query if any row's .foo changes to or from bar, which is messy.

The thing I described would work for caching the results of a query on a primary key, or on any key which is guaranteed to be unique (to have only one row). (My mistake for overgeneralizing; it's just that this is the default sort of result I tend to think about, since I don't ever use databases and it corresponds strongly to traditional data structure operations. Obviously it's not actually a very interesting case, although perhaps it's still a useful one to consider.)
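
A sketch of this easy case (the db helper methods here are hypothetical):

  row_cache = {}

  def get_by_id(db, row_id):
      # A unique-key lookup caches cleanly: the key identifies exactly
      # one row, so invalidation can be exact.
      if row_id not in row_cache:
          row_cache[row_id] = db.fetch_row(
              "SELECT * FROM comments WHERE id = %s", row_id)
      return row_cache[row_id]

  def update_by_id(db, row_id, changes):
      db.update("comments", row_id, changes)
      # Only this row's cached copy can be stale.
      row_cache.pop(row_id, None)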

There are also lower levels of caching. If a database uses only the underlying file-system cache, or only caches whole pages itself, that caching may perform extremely inefficiently when the data being cached is much smaller than a page. Of course that's a lot easier than trying to cache little fragments of stuff, but again the issue isn't so much what's easier if it can be leveraged for everyone.

I can't reply to the ACID issues since, as I said at length, it doesn't make sense to me how a web page frontend to a database is going to be using it in the first place.

By cks at 2007-02-02 21:17:43:

To simplify slightly, all SQL operations happen within the context of a database transaction. In basic SQL usage the transaction is only one statement long (sometimes called 'implicit transactions'), but you can do multi-statement transactions. (A web application may or may not put all of the SQL queries it makes while building a page into a single transaction; there are arguments either way.)
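
A small illustration of the two styles, using Python's sqlite3 module with a made-up table:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
  conn.execute("INSERT INTO accounts VALUES ('a', 35), ('b', 20)")
  conn.commit()

  # 'Implicit' transaction style: one statement, committed by itself.
  conn.execute("UPDATE accounts SET balance = 20 WHERE name = 'a'")
  conn.commit()

  # Multi-statement transaction: both updates become visible together,
  # or (on error) neither does.
  try:
      conn.execute("UPDATE accounts SET balance = 35 WHERE name = 'b'")
      conn.execute("UPDATE accounts SET balance = 20 WHERE name = 'a'")
      conn.commit()
  except sqlite3.Error:
      conn.rollback()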

There are two ACID problems that this causes. The first is simply that databases are not really ACID but VACID, where the V is 'Visibility': the idea that once a transaction commits, all transactions started after it see the changes it made. Doing this once the database server has a cache requires instant, accurate cache invalidation.

Web applications can often tolerate some lack of visibility. But without information from the application, the database server doesn't know how much staleness is tolerable, or which queries it can (and can't) be relaxed for.

Even if you relax visibility, you get hit by the Consistency requirement in ACID. Imagine a set of transactions like this, done in sequence:

  1. set a.balance to 20 and b.balance to 35.
  2. query b.balance.
  3. set b.balance to 20 and a.balance to 35.
  4. query a.balance and b.balance.

Consistency means that you can't return an answer of '35, 35', because then the query would be seeing only half of a committed transaction. Even with relaxed visibility so you can return stale answers, the answers have to be consistent with each other, which gets you back to the problem of detecting what to invalidate or fence off.
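
A toy sketch (with invented names, and a cache that is never invalidated) of how exactly that impossible '35, 35' answer comes out:

  balances = {}
  cache = {}

  def commit(**updates):            # stands in for a committed transaction
      balances.update(updates)      # note: the cache is never invalidated

  def cached_read(name):
      if name not in cache:
          cache[name] = balances[name]
      return cache[name]

  commit(a=20, b=35)                          # 1.
  cached_read("b")                            # 2. caches b = 35
  commit(b=20, a=35)                          # 3.
  print(cached_read("a"), cached_read("b"))   # 4. prints '35 35'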

You can relax both visibility and consistency. But by then either the application is doing a lot of hinting or what you have is not something that you can call an SQL database without getting laughed out of the room.

By nothings at 2007-02-02 21:34:06:

Didn't MySQL use to (2-4 years ago) have serious limitations in its consistency or transactional support (like not having any, or something)? I seem to recall one group of people mocking it while the vast majority of people just sat down and used it and got stuff done. But maybe it was something a little different.

I dunno. The whole thing just makes me feel like the database model isn't actually the right model, if people are clumsily layering other stuff on top of it.

By cks at 2007-02-02 23:32:08:

I believe that the MySQL limitations were in its support for concurrent updates, especially concurrent updates in multi-statement transactions. I think that all updates were basically visible the moment they were made, whether or not the nominal transaction had committed, and that there was no rollback if you tried to abort such a transaction.

(This makes perfect sense for something aimed at just being a convenient data store, per DatabaseLevels.)

By nothings at 2007-02-02 23:47:32:

Aha, exactly right. Thanks.
