Wandering Thoughts archives

2011-04-15

Why a dynamic website with caching is simpler than a baked site

In a comment on my first entry in this series, nothings wrote:

And if you're writing from scratch, writing a static baked system is surely easier than writing a dynamic system with caching. So: better performance, and easier to write.

Actually, I disagree about which option is simpler.

First off, I think it's clear that a simple dynamic system without caching is easier to write than a static system. Both systems need something to render pages from content text, templates, and so on, but once the dynamic system has that, its job is basically done. Even if you ignore invalidation and tracking dependencies entirely, the static system needs some additional code to walk all pages (and to know what all pages are, something the dynamic system doesn't necessarily need) and deal appropriately with the results.
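To make the comparison concrete, here is a rough sketch of the dynamic side in Python. The layout and helper names (CONTENT_DIR, render_page) are invented for illustration and this is not how DWiki is actually put together, but it shows the point: once you can turn a URL into rendered HTML on demand, you are basically done.

  # A minimal dynamic site: map the request URL to a content file and render
  # it on the spot. No caching anywhere; every request renders from scratch.
  import os
  from wsgiref.simple_server import make_server

  CONTENT_DIR = "content"

  def render_page(text):
      # Stand-in for the real content-text-plus-templates rendering step.
      return "<html><body><pre>%s</pre></body></html>" % text

  def app(environ, start_response):
      path = environ.get("PATH_INFO", "/").strip("/") or "index"
      fname = os.path.join(CONTENT_DIR, path + ".txt")
      if not os.path.isfile(fname):
          start_response("404 Not Found", [("Content-Type", "text/plain")])
          return [b"no such page"]
      with open(fname) as f:
          body = render_page(f.read()).encode("utf-8")
      start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
      return [body]

  if __name__ == "__main__":
      make_server("localhost", 8000, app).serve_forever()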

(The really simple static system generates all pages every time it is run but doesn't bother writing any that are unchanged from the previous pass. This is scalable enough for a small site and sidesteps all issues of tracking what static pages to update when a particular component changes.)
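A matching sketch of that really simple static system, again with invented names and layout: walk every content file, render it, and only write the output when it differs from what's already on disk.

  # Regenerate every page on each run, but skip the write when the rendered
  # output is unchanged, so unmodified pages keep their old mtimes and nothing
  # downstream (rsync, web caches) sees them as updated. Illustrative names.
  import os

  CONTENT_DIR = "content"
  OUTPUT_DIR = "rendered"

  def render_page(text):
      return "<html><body><pre>%s</pre></body></html>" % text

  def bake_all():
      for dirpath, _dirs, files in os.walk(CONTENT_DIR):
          for name in files:
              if not name.endswith(".txt"):
                  continue
              src = os.path.join(dirpath, name)
              rel = os.path.relpath(src, CONTENT_DIR)
              dst = os.path.join(OUTPUT_DIR, rel[:-4] + ".html")
              with open(src) as f:
                  new_html = render_page(f.read())
              if os.path.exists(dst):
                  with open(dst) as f:
                      if f.read() == new_html:
                          continue
              os.makedirs(os.path.dirname(dst), exist_ok=True)
              with open(dst, "w") as f:
                  f.write(new_html)

  if __name__ == "__main__":
      bake_all()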

Once you have a basic dynamic system it's fairly easy to start adding caching to it, even well after the initial design was done. The nature of caching means that you can be selective (only adding caching for a few operations to start with) and you don't have to completely implement fully accurate cache invalidation right away. In fact you can tune the balance of cache expiry, cache invalidation, and cache validation differently for different sorts of caches and objects, depending on what's important and what you have time to implement.
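As a sketch of what adding caching selectively can look like in practice (this is illustrative Python, not DWiki's actual cache code, and the function names are made up): wrap just the expensive operations in a small time-expired cache, give each one its own acceptable staleness, and add a crude invalidation hook that you can call, or not, as your time permits.

  # A small time-expired cache that can be wrapped around individual expensive
  # functions after the fact, without touching their callers. Illustrative only.
  import time
  import functools

  def cached(ttl):
      def decorator(func):
          store = {}                       # args -> (timestamp, result)
          @functools.wraps(func)
          def wrapper(*args):
              now = time.time()
              hit = store.get(args)
              if hit is not None and now - hit[0] < ttl:
                  return hit[1]
              result = func(*args)
              store[args] = (now, result)
              return result
          # Crude but always-correct invalidation: throw the whole cache away.
          wrapper.invalidate = store.clear
          return wrapper
      return decorator

  # Only the operations worth caching get wrapped, each with its own TTL.
  @cached(ttl=60)
  def recent_comments(blogdir):
      ...  # hypothetical expensive scan for a sidebar

  @cached(ttl=10)
  def front_page_html():
      ...  # hypothetical expensive page assembly

When a comment is posted you can call recent_comments.invalidate() immediately or just let the 60-second expiry deal with it; the point is that each cache gets to make that choice separately.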

By contrast, page invalidation and dependency tracking for static systems are harder and more annoying to implement. The correctness demands are usually higher and you generally need a universal system, because you effectively only have a single 'cache'. If and when you make generated static pages permanent unless invalidated, your correctness requirements ramp up; being absolutely correct is generally quite costly and complex, but sadly it's what you need.

(If you don't make generated static pages permanent, you throw away increasing amounts of CPU power doing pointless page re-renders and you slow down updates of genuinely changed pages. This is especially the case if you use comparable staleness guarantees to the dynamic system case.)
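For contrast, here is the shape of the bookkeeping a static system winds up needing (a hedged sketch with made-up names): every output page has to record every source it used while being rendered, and a change to any source has to re-bake every recorded dependent. Miss one dependency and that page is quietly wrong until something else happens to rebuild it.

  # Sketch of dependency tracking for a baked site; names are illustrative.
  from collections import defaultdict

  deps = defaultdict(set)      # source file -> output pages rendered from it

  def note_dependency(output_page, source_file):
      # Must be called for *every* source a page touches while rendering:
      # the entry text, templates, sidebar data, comment files, and so on.
      deps[source_file].add(output_page)

  def pages_to_rebake(changed_source):
      return deps.get(changed_source, set())

  # Editing one entry forces re-baking every view that included it.
  note_dependency("/index.html", "entries/baking.txt")
  note_dependency("/archive/2011/04/index.html", "entries/baking.txt")
  print(sorted(pages_to_rebake("entries/baking.txt")))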

DynamicSimplerThanStatic written at 00:55:15

2011-04-14

More on baking websites to static files and speed

A commentator on my first entry on this asked a good question:

Where do you draw the distinction between a baked site and a cached site? They're both a snapshot of a dynamic site. They both suffer from potentially stale cache. They both require an invalidation mechanism for the publishers.

I think there are two important differences.

First, a baked site is effectively a permanent cache. The word 'permanent' is the important part: it means you absolutely have to get invalidation right, because nothing else will save you if the wrong data gets into the baked site.

(Any permanent cache has much the same problem: invalidation must be completely correct.)

A temporary cache can do invalidation on heuristics, because if the heuristics don't work out, bad data will time out 'soon enough' anyway. The ultimate version of this is to have no invalidation heuristics at all, just timeouts, and to accept temporarily stale pages or data. This makes the problem of cache invalidation (or validation) much simpler, and even this extreme version is enough to survive Slashdot-style load surges, so for many people that's all they need.

Second, a cache still works if there is a cache miss; a baked site generally does not. This means that you have a big hammer to deal with cache problems: you simply flush the entire cache. Your site is suddenly slow until the cache rebuilds, but it still works and more importantly, it is instantly guaranteed correct and current. There is no equivalent with typical implementations of baked sites (although there are implementation tricks that give you this); the software may let you force a full rebuild, but it won't give you a correct site on the spot since 'populating' the 'cache' is an asynchronous process.

This also means that your site still works completely if something didn't make it into the cache or if the cache is malfunctioning for some reason. Pre-baked sites have no similar mechanism; if something doesn't get baked for some reason or gets removed somehow, well, it's a 404 until you (or software) notice and fix it. The advanced version of this is that it's quite easy and natural to deliberately have a partially cached dynamic site, instead of caching everything. There's no such natural equivalent for baked sites (although once again it can be done with implementation tricks).

BakingVersusSpeedII written at 00:12:59

2011-04-13

Some common caching techniques for dynamic websites

I'm not a deep expert in this field, so I'm not going to claim that this is a complete taxonomy of how dynamic sites implement caches. These are just the three sorts of caches that I've seen mentioned fairly frequently. As it happens, DWiki uses all three, so I can give examples.

Query caches cache the results of expensive database queries and other lookups that are (hopefully) frequently used. DWiki uses a query cache to save the result of the filesystem walk it does to determine the order of entries (which is in turn used for things like the 'next' and 'previous' links on entries, the Atom syndication feeds, and so on).
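As a hand-waving sketch of this sort of query cache (not DWiki's actual code; the names and the validation rule are made up): cache the result of the expensive walk, and as a cheap validation step only trust it while the blog directory's modification time hasn't changed.

  # Sketch of a query cache for the 'what order are the entries in' walk,
  # validated against the top-level directory's mtime. Illustrative only;
  # a real system needs a more careful validation rule than this.
  import os

  _order_cache = {}    # blogdir -> (dir mtime when computed, ordered entries)

  def entry_order(blogdir):
      dir_mtime = os.path.getmtime(blogdir)
      hit = _order_cache.get(blogdir)
      if hit is not None and hit[0] == dir_mtime:
          return hit[1]
      entries = []
      for dirpath, _dirs, files in os.walk(blogdir):
          for name in files:
              path = os.path.join(dirpath, name)
              entries.append((os.path.getmtime(path), path))
      order = [path for _mtime, path in sorted(entries, reverse=True)]
      _order_cache[blogdir] = (dir_mtime, order)
      return order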

Fragment caches cache the results of generating fragments of the page or, in general, the output; for example, a blog with a 'N most recent comments' sidebar might cache the rendered version of the sidebar since it's potentially expensive to create while being the same on all pages. DWiki uses a fragment cache to cache the rendered version of the content text of individual entries, for various reasons (including that converting things from DWikiText to HTML is one of the more expensive operations in DWiki).
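A sketch of a fragment cache along these lines (again illustrative, not DWiki's implementation): key the cached HTML on the entry file's path plus its modification time and size, so that editing an entry naturally turns into a cache miss instead of needing explicit invalidation.

  # Sketch of a fragment cache for rendered entry content. Keying on
  # (path, mtime, size) means an edited entry simply misses in the cache;
  # a real cache would also evict old entries. Illustrative names only.
  import os

  _fragment_cache = {}    # (path, mtime, size) -> rendered HTML

  def render_dwikitext(text):
      # Stand-in for the real, relatively expensive DWikiText-to-HTML step.
      return "<div class='wikitext'><pre>%s</pre></div>" % text

  def rendered_entry(path):
      st = os.stat(path)
      key = (path, st.st_mtime, st.st_size)
      html = _fragment_cache.get(key)
      if html is None:
          with open(path) as f:
              html = render_dwikitext(f.read())
          _fragment_cache[key] = html
      return html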

Page caches cache whole pages (or URLs in general), and thus they shed the most load on a cache hit; an active page cache effectively turns your website into something close to a statically rendered site. DWiki uses a very simple brute force page cache when under sufficient load.

(For more details on DWiki's caching, see here.)

The simplest way for a dynamic website to handle cache validation is to not bother; you simply declare that it's acceptable to return stale data or an out of date page for N seconds (or in extreme cases, N minutes) and then do time-based cache expiry. This tends to be perfectly fine for things like blogs, which change rarely and where comments being delayed may be seen as a feature. DWiki uses this approach for its page cache.

(Such a simple cache will not make your site faster in general but it will make it faster under load, or at least under the right sort of load.)
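A sketch of what such a brute force, time-expired page cache can look like (illustrative, not DWiki's actual code): key on the full URL, serve anything younger than N seconds as-is, and otherwise fall through to the normal dynamic rendering.

  # Sketch of a whole-page cache with time-based expiry only. Anything younger
  # than PAGE_TTL seconds is served, stale or not; names are made up.
  import time

  PAGE_TTL = 30                   # acceptable staleness, in seconds
  _page_cache = {}                # full URL -> (timestamp, rendered page)

  def cached_page(url, generate_page):
      now = time.time()
      hit = _page_cache.get(url)
      if hit is not None and now - hit[0] < PAGE_TTL:
          return hit[1]
      page = generate_page(url)   # the normal, uncached dynamic path
      _page_cache[url] = (now, page)
      return page

Since a page cache like this mostly pays off under load, a real version might only start consulting it when the site is busy enough, which is how DWiki uses its own.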

Query caches and page caches are often quite easy to add to a dynamic website; for a query cache you just have to wrap a few of your function calls, and page caches are so simple that they can often be added by external programs.

DynamicSiteCaching written at 00:09:38

2011-04-11

You don't need to bake your site to static files to be fast

Recently (for my value of recently) a bunch of people have been pushing rendering your website to static files as the way to make it stand up to surprise load; for example, Tim Bray's More on Baking (which has links to others). I disagree with this approach, because it's not necessary and it has some significant downsides.

You don't need static files to go fast when you're hit with load; you just need software that doesn't suck. A little caching helps a lot and is generally very easy to add to decent software, and honestly these days any blog or framework with pretensions of quality should have some sort of option for this. As I've discussed before, the load surges from popular links aren't the same as being heavily loaded in general, and thus they can be handled with much simpler techniques.

Since everyone likes anecdotal evidence, I will point to DWiki (the software behind this blog) as an existence proof of my thesis. DWiki is not what you could call small and efficient, and it still manages to hold up against load with only a relatively crude cache. It's not running on a big dedicated server, either; this machine is relatively modest and this blog shares it with lots of other things. I've never been linked to by any of the big traffic sources, but I have been pounded by spambots and the campus search engine without anyone really noticing.

The big downside of static rendering is the problem that Tim Bray glosses over: cache invalidation. Your static rendering is a cache of your real website, so when your real website changes (for example, someone leaves a comment or you modify an entry) you need to invalidate all of the static renderings for all of the URLs where the updated content appears. Tim Bray makes this sound easy because he has cleverly arranged to not have anything that he needs to do cache invalidation on, but he has done so by being aggressively minimalistic (for example, he doesn't really do tagging or categories). This is, to put it one way, very unusual. Most blog software that you want to use is all about having multiple views of the same basic information; you have the entry page and the main page and the category or tag pages and various archival views, and you may have syndication feeds for some or all of them. All of this multiplies the number of URLs involved in your site quite a bit.
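To put rough numbers on that multiplication (the URL layout here is made up for illustration): when a single entry changes, a baked site has to work out every URL whose rendered form includes it and re-bake them all.

  # Sketch of the invalidation fan-out for one changed blog entry. The URL
  # scheme is invented for illustration; real blogs differ in the details.
  def urls_to_rebake(entry):
      urls = {
          "/%s/" % entry["slug"],                        # the entry's own page
          "/",                                           # the front page
          "/archive/%04d/%02d/" % (entry["year"], entry["month"]),
          "/atom/recent.xml",                            # main syndication feed
      }
      for tag in entry["tags"]:
          urls.add("/tag/%s/" % tag)                     # each tag/category view
          urls.add("/tag/%s/atom.xml" % tag)             # and its feed
      return urls

  # A modest entry with two tags already dirties eight URLs.
  entry = {"slug": "baking-vs-speed", "year": 2011, "month": 4,
           "tags": ["web", "performance"]}
  print(sorted(urls_to_rebake(entry)))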

(This URL multiplication also increases the cost of baking your site, of course. If you have a heavily baked site and more than a modest amount of content, you probably aren't going to have many alternate views of your content, or at least not very many alternate views of content that you expect to change very often. For example, you might make all of your archive and category pages just have the titles of entries so that you don't have to re-render them if you modify an entry.)

Ultimately it is the usual tradeoff; baked sites run faster at the cost of more work for you. I think that this is a bad tradeoff for most people, since most people do not have heavily loaded sites and an occasional load surge is quite easy to deal with (provided that you have software that doesn't suck).

PS: possibly I am overly optimistic about the quality of common blogging and framework software.

BakingVersusSpeed written at 22:52:29


