2011-02-26
A belated realization about web spiders and your page cache
Like a lot of other web applications, DWiki has various sorts of caching. One of its caching mechanisms is a simple brute force cache for full pages, intended to deal with Slashdot effect situations; if a page has taken 'too long' to generate it's put into the cache, and then further requests for it are served straight from the cache for a short interval.
Just today, I realized that much of what was getting put into the page cache was actually being inserted pointlessly.
Like most any blog, WanderingThoughts has a lot of virtual pages, which means it has a lot of URLs for web spiders to explore. Because it has so many URLs compared to actual content, a significant amount of my total traffic is web spiders trying to crawl through everything that they can find. Even a vaguely competent web spider is basically never going to re-crawl the same URL within a few seconds or minutes, ie within the time interval where my simple page cache will do any good. The result is straightforward: adding pages that spiders request to the page cache is pointless, because they will never be hit again, or at least not before their cache entries have expired.
Avoiding spiders contaminating your page cache is relatively simple. Because the largest contamination comes from the most active web spiders, you don't have to hunt down all of the spiders active on your site; all you have to do is look at your user agent logs and then make your cache insertion code pass over requests from the most active crawlers that you see. Generally they will jump right out at you.
(Extending this to caches of page components is much more chancy because the possibility of cross-page reuse is much higher.)
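As a concrete illustration, here is a minimal sketch of the insertion-time check (in Python, since that is what DWiki is written in). None of this is DWiki's actual code; the cache interface, the timing threshold, and the crawler list are all assumptions, and in practice you would fill the list in from whatever your own user agent logs show as the heaviest hitters.

    # A sketch only: assumes a WSGI-style environ dict and a cache object
    # with a .store() method. The crawler names are purely illustrative.
    HEAVY_CRAWLERS = ("Googlebot", "bingbot", "Baiduspider", "YandexBot")

    def maybe_cache_page(cache, environ, url, rendered_page, gen_time, threshold=0.75):
        # Only brute-force cache pages that took 'too long' to generate.
        if gen_time < threshold:
            return
        # Skip requests from the most active crawlers; they essentially never
        # re-request the same URL before the cache entry would expire anyway.
        agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot in agent for bot in HEAVY_CRAWLERS):
            return
        cache.store(url, rendered_page)

Note that only insertion is filtered; if a spider happens to request a page that is already cached, serving it from the cache costs you nothing.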
2011-02-23
Handling variations of a page (or, the revenge of REST)
Suppose that you have a web application with users, and different users (or different sorts of users) need to see slightly different versions of various pages and forms. It's both obvious and very tempting to implement these page and form variations with conditional logic in a single URL's view handler; you have to retrieve the user's information anyways (and check access permissions), so you simply put in some checks and switches based on what sort of user they are or what permissions they have.
(This is also the sort of situation where you start reaching for features like on the fly modification of what fields are included in a form, read-only fields, and so on.)
Through recent hard experience, I've come to feel that this is a mistake. What you should really do is put each separate version of the page or form on its own distinct URL.
The problem with having all of the versions of the page accessed through the same URL is that you have no simple, good way to give a single user access to more than one of them. You can't easily handle things like a user who has multiple roles that would get different versions of the page, or allowing administrative users to drop down to simpler, more restricted interfaces (or even to see the simpler interfaces for training purposes).
(I've recently stubbed my toes on both of these cases. Our current workarounds for both of them are unaesthetic and awkward.)
You can solve this several ways, but the easy solution is the RESTful way. Given that you actually have different (albeit closely related) pages, you should give each different page a distinct URL. Among other advantages, this means that users can immediately use as many different versions as they have permissions for just by going to different URLs.
(Under the hood you might still use all of the dynamic form modification and conditional templates and so on that you used to; it's just that now you're making choices based on the URL instead of the user's data. You only use the user's data to ensure that they have access to this version of the page, and you needed to do those access checks anyways.)
What this really drives home for me is that REST is a really good idea. The moment I hid application state in my code instead of exposing it in my URLs, it bit me, just like REST told me it would. I can only hope that I'll learn the next time around.
Sidebar: how to avoid complicating URLs in the rest of your application
So, now you have three or four URLs for different versions of a page instead of a single page that displays differently for various people. How do you link to the page from other pages? In particular, do you have to put conditional logic (or special rendering tags or the like) into your other pages to look up what version of the page the user is entitled to and link to just that version?
My answer is no. Suppose that you have three different versions of a creation page (for three different classes of users), and you put them at /manage/create/sponsor, /manage/create/gradoff, and /manage/create/staff. Set up a fourth page, /manage/create/. What this page does is look up what the user can access, pick the preferred version, and answer with an HTTP redirect to it. Once you have this set up you can just point all of your outside links at /manage/create/ and everything will just work. Knowledge of what the user has access to and where they should wind up is now localized in /manage/create/; no one else has to care.
This scheme gets a bit more complicated if you need to pass components of the URL through, if you have eg /manage/edit/<id> and it needs to be redirected to one of /manage/edit/staff/<id> or /manage/edit/sponsor/<id>. But this can be handled too with a bit more work.
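To make the sidebar concrete, here is a hypothetical sketch of such a dispatcher page, written as a Django-style view. The role names, the per-role URLs, and the user_roles() helper are all made up for illustration; they are not anyone's real application code.

    from django.http import HttpResponseRedirect, HttpResponseForbidden

    # Preference order: a user with several roles gets the first match.
    ROLE_URLS = [
        ("staff",   "/manage/create/staff"),
        ("sponsor", "/manage/create/sponsor"),
        ("gradoff", "/manage/create/gradoff"),
    ]

    def user_roles(user):
        # Placeholder: however your application determines a user's roles.
        return set(user.groups.values_list("name", flat=True))

    def create_dispatch(request):
        roles = user_roles(request.user)
        for role, url in ROLE_URLS:
            if role in roles:
                return HttpResponseRedirect(url)
        # The user has no applicable role, so they get no version of the page.
        return HttpResponseForbidden("You have no access to the creation pages.")

The per-version views still do their own access checks; the dispatcher only decides where to send people, so nothing breaks if someone goes to a specific version's URL directly.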
2011-02-15
More on an advantage of the blog approach to web writing
In light of Phil Hollenback's reply to my earlier entry on this topic, I need to write some more words.
First off, I entirely agree with Phil that a wiki engine can do a decent job of generating blog-style syndication feeds; all you need is to have a full text feed of recently created pages and you're basically done. Category or hierarchy based feeds are useful but optional.
(Making the whole thing efficient is only a small matter of programming and data storage, and for even a medium sized wiki it generally doesn't matter. If your wiki is popular, the information you need to generate a syndication feed will be in RAM; if it is not popular, slow syndication feed generation is usually not really a problem.)
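To illustrate how little machinery the blog-style feed needs, here is a minimal sketch of generating an Atom feed of recently created pages. It is not any particular wiki engine's code; the page attributes (.title, .url, .created, .html) are assumptions about what the engine can hand you.

    from xml.sax.saxutils import escape, quoteattr

    def atom_feed(pages, feed_title, feed_url, maxent=50):
        # pages: objects with .title, .url, .created (an RFC 3339 timestamp
        # string, assumed to be in UTC so it sorts correctly as text) and
        # .html (the rendered full text of the page).
        newest = sorted(pages, key=lambda p: p.created, reverse=True)[:maxent]
        out = ['<?xml version="1.0" encoding="utf-8"?>',
               '<feed xmlns="http://www.w3.org/2005/Atom">',
               '<title>%s</title>' % escape(feed_title),
               '<link href=%s/>' % quoteattr(feed_url),
               '<id>%s</id>' % escape(feed_url),
               '<updated>%s</updated>' % (newest[0].created if newest else '')]
        for p in newest:
            out.extend(['<entry>',
                        '<title>%s</title>' % escape(p.title),
                        '<link href=%s/>' % quoteattr(p.url),
                        '<id>%s</id>' % escape(p.url),
                        '<updated>%s</updated>' % p.created,
                        '<content type="html">%s</content>' % escape(p.html),
                        '</entry>'])
        out.append('</feed>')
        return '\n'.join(out)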
But this is the easy side of things because it is using a wiki engine as a blogging platform, where you will follow the blog approach to web writing and not the wiki approach. The fundamental difference between the blog approach to web writing and the wiki approach is that in the wiki approach you significantly revise old pages.
What I realized in the process of writing my earlier entry is that generating a good syndication feed for revised wiki pages is actually a hard problem, even though it looks relatively simple from the outside.
(Well, a good syndication feed from the perspective of a blog reader, someone who wants to keep track of the new information showing up on the wiki, wherever and however it appears.)
In the abstract, what you did when you revised the page was add new information to it (yes, even if you just deleted text). What the reader wants is to read about this new information (with enough surrounding context for it to make sense). In a sense, they want your page edit recast as a version control system commit message (and for much the same reasons that people want commit messages, not just code diffs). You probably don't want to write this version of your change by hand, because it's basically equivalent to writing a blog entry in addition to your wiki page revision. But automatically extracting it from your change is extremely non-trivial in the general case (even though there are simple cases, like adding a bunch of text to the end of an existing entry).
This is why I think that wikis suffer from more than just having badly implemented syndication feeds today. Giving a wiki a decent blog-style syndication feed is a solvable technical problem (and it has been solved several times over). Getting software to automatically describe changes to existing pages in a useful and general way is deeply challenging, and I don't think it's going to be done any time soon. This means that syndication feeds for page revisions are inherently going to be not as useful as the same information written up by itself from scratch, ie in the blog style (where you don't revise, you write a new entry).
2011-02-12
An advantage of the blog approach to web writing versus the wiki approach
You can argue that in many ways it is basically a tossup between the blog approach and the wiki approach to writing for the web, and that you might as well adopt whichever one fits your particular style of producing content. (Given real blog usability you can certainly argue that a blog's chronological ordering is unimportant and that a wiki's richer internal linking is highly useful.)
However, the blog approach does have at least one significant advantage over the wiki approach: blogs have figured out much better ways for people to keep track of a stream of new information than wikis have. With blogs, every significant addition of information is a new entry and a standalone entity; it appears at the top of a page if you want to visit directly, it shows up as a complete, readable entity in syndication feeds, and you can announce it on Twitter, Facebook, and other social networking sites of your choice.
Wikis, well, have not figured this out, not anywhere near as well. As Aristotle Pagaltzis noted in a comment here, most wikis do not have good support for showing people important changes in a useful way (in syndication feeds or in the wiki itself). I think that this is partly an inherently hard problem. When there is a change, you don't want to see just it, you want to know what it means, and extracting that semi-semantic information from raw text is inherently very hard. With blogs, people generally pre-extract that information for you because they're writing a new entry.
(It doesn't always work even in blogs, because a new entry can be so densely laden with references to other things that you have to follow a lot of links to actually understand it. But at least you have a good chance, especially if you've been following the blog for a while.)
PS: as noted before, this is a difference in the approach, not necessarily in the storage engine. You can write with the blog approach using a wiki engine, and if you try hard you can probably do the reverse as well.
2011-02-05
A side note on Google Chrome and the future of HTML
In light of my earlier grump about the lesson of XHTML, it's struck me that there is an interesting way to look at Google Chrome. With lots of disclaimers, it goes like this:
If major browsers are the only people who really get a 'vote' on the future of HTML (broadly construed), then one of the things that developing Chrome has done for Google is buy it a seat at that table. No one can argue that Chrome is not a relatively major and important browser at this point, which means that Google (as Chrome's developer) is now in a position to be part of the lesson of XHTML in both directions, pushing things forward as well as balking at them.
Given how important HTML is to Google, Google could well consider the entire cost of Chrome's development to be well worth it just for this influence alone.
(One of the disclaimers is that this is a thought exercise. Another is that Google has lots of potential reasons for developing Chrome, some of which I'm sure they've advanced in public already.)
Also, a year or two ago I would have said that the cynical view of this was that none of it mattered because Internet Explorer continued to be the 800 pound gorilla in the room, and IE was paying just as much attention to web standards efforts as always (ie, almost none). However, IE usage has been sliding (partly because of Chrome). With the field becoming more even, web standards efforts might actually become important again.
(There is a decent devil's advocate argument for something like the IE position on paying attention to web standards, but that's another entry.)
2011-02-03
The real lesson of XHTML
Let's say it plainly: XHTML is a failure. It is broadly unsupported, barely used for real, and further standards work has been abandoned.
(Yes, yes, there are lots of people with XHTML badges on their web pages. Many of them fail XHTML validation, and of those that validate almost none are being served as XHTML. See here.)
The real lesson that people learned from XHTML and its failure is simple: web standards are ultimately created by what major browser vendors are willing to implement.
As a consequence, people also learned that there is no point in working on standards that browsers are not going to implement, and not much point in working on standards that you aren't reasonably certain they will implement. Everyone saw the amount of effort and energy poured into XHTML (not just on creating the standard itself but also on advocacy, tools, and so on), and all of it was for nothing in the end.
(The consequences of this lesson learned are pretty predictable, and we are seeing some of them in action today with HTML5.)
This is not a new development in web standards; it has always been this way, right from the start when Mosaic created the <img> tag by fiat and working code. It is just that XHTML made it glaringly obvious, because it was the first major web standard that a major browser vendor completely balked at, plus it was a change with no backwards compatibility path, so this balk could not be papered over.
(You cannot serve the same document in the same way to both MS IE and XHTML capable browsers and have it be W3C-proper XHTML. You need at least server side support to switch the content-type.)
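For illustration, here is a minimal sketch of that server side switch, assuming a WSGI-style environ dict; real content negotiation would parse and weigh the Accept header properly instead of doing a simple substring check.

    def pick_content_type(environ):
        # Browsers that genuinely handle XHTML advertise application/xhtml+xml
        # in their Accept header; IE of that era did not, so it has to be sent
        # the same bytes labelled as text/html.
        accept = environ.get("HTTP_ACCEPT", "")
        if "application/xhtml+xml" in accept:
            return "application/xhtml+xml"
        return "text/html"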
Accepting this is not surrendering to browser vendors; it is acknowledging reality. Web standards are simply not forced standards, or at least not the sort where you can write a specification and magically make people implement it, and they never have been.
(As de facto standards many aspects of the web are forced standards for most people, but this is interoperability and market forces at work.)