Wandering Thoughts archives

2014-03-23

Differences in URL and site layout between static and dynamic websites

One of the big but subtle differences between a statically rendered site and a dynamically rendered one is simply how you design the URL structure for each of them. One example here is dynamically rendered versus statically rendered blogs.

Broadly speaking, in a statically rendered site you want to have a minimum number of URLs and for each chunk of core content to appear in a relatively minimal number of places, because you have to pre-generate every URL. The more you let content and URLs propagate, the more pages you have to re-render any time you change something (or simply in general), even if they contain mostly redundant information or may never be requested or both. This is going to drive you towards very simple site layouts; in a blog you might have only the individual entries, a front page with recent entries, and then a relatively simple scheme for archives.
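
To see why the page count matters, here is an entirely illustrative sketch of the minimal page set for a static blog laid out this way; the entry fields ('year', 'slug') and the pages_to_render() helper are made up for the example, not any particular generator's code.

    def pages_to_render(entries, per_page=20):
        # The minimal static layout: one page per entry, a front page of
        # recent entries, and simple paginated archive pages. Every URL
        # here has to be re-rendered whenever its contents change.
        urls = {'/': entries[:per_page]}
        for e in entries:
            urls['/%s/%s/' % (e['year'], e['slug'])] = [e]
        for i in range(0, len(entries), per_page):
            urls['/archive/%d/' % (i // per_page + 1)] = entries[i:i + per_page]
        return urls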

In a dynamically rendered site more URLs are almost free and so you often casually grow URLs and even entire URL hierarchies that offer alternative ways of accessing your core content. After all, often all you need to procedurally generate an entire new URL hierarchy is a single chunk of parameterized code (and 'pagination' of large results is often provided for free by your framework, adding more URLs). Provided that this generation is reasonably efficient you might as well create as many ways of accessing your core content as you can think of. On a blog you might support looking at things by any combination of date, category, tag, author, and so on. All you need is some dispatch rules and some lookup filtering.
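
To make the 'dispatch rules and lookup filtering' point concrete, here is a minimal sketch of the kind of parameterized code involved. It's plain illustrative Python, not DWiki's actual code; the routes, the get_entries() helper, and the in-memory entry list are all made up.

    import re

    # Hypothetical in-memory entry store standing in for the real lookup.
    ENTRIES = [
        {'title': 'First post', 'year': '2014', 'month': '03',
         'category': 'web', 'tags': ['urls']},
        {'title': 'Second post', 'year': '2014', 'month': '02',
         'category': 'unix', 'tags': ['caching']},
    ]

    def get_entries(**filters):
        # Lookup filtering: keep entries that match every requested field.
        def matches(e):
            for k, v in filters.items():
                if k == 'tag':
                    if v not in e['tags']:
                        return False
                elif e.get(k) != v:
                    return False
            return True
        return [e for e in ENTRIES if matches(e)]

    # One small table of dispatch rules creates whole URL hierarchies:
    # by year, by year and month, by category, by tag.
    ROUTES = [
        (re.compile(r'^/(\d{4})/$'),
         lambda m: get_entries(year=m.group(1))),
        (re.compile(r'^/(\d{4})/(\d{2})/$'),
         lambda m: get_entries(year=m.group(1), month=m.group(2))),
        (re.compile(r'^/category/([\w-]+)/$'),
         lambda m: get_entries(category=m.group(1))),
        (re.compile(r'^/tag/([\w-]+)/$'),
         lambda m: get_entries(tag=m.group(1))),
    ]

    def dispatch(path):
        for pattern, view in ROUTES:
            m = pattern.match(path)
            if m:
                return view(m)
        return None   # no route matched; a real engine would return a 404

    # dispatch('/2014/03/') and dispatch('/tag/urls/') both find 'First post'.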

(You need to create the same core content in both static and dynamic sites; the difference is how many URLs it is visible under.)

The corollary of this is that you may not have a very happy time if you try to go from a dynamic site to a static site while keeping more or less the same URL structure. In a shift like this you probably want to rethink how things are indexed, which basically means rethinking the overall site design and URL structure.

(The corollary to the corollary is that if you're not sure whether you're going to wind up generating things statically or dynamically you should start out by designing your site as a static site, with a URL layout that works for that. As a bonus you'll likely get a simpler, more focused URL structure.)

StaticVsDynamicSiteLayout written at 02:13:27

2014-03-10

The problem of conditional GET and caches for dynamic websites

For reasons beyond the margins of this entry, Aristotle Pagaltzis recently noticed an oddity with syndication feeds for this blog. To paraphrase his message, he made an initial plain feed request with no If-None-Match header, got back something with an ETag header, redid the request as a conditional GET with the same tag in If-None-Match, and got back a different result with a different ETag. On the surface this sounds like my caching is broken, but what is really going on is that the traditional irony of conditional GET for dynamic sites is interacting with the desire to reduce load through caching.
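
(If you want to reproduce this sort of check yourself, it takes nothing more than the Python standard library. The sketch below is illustrative only; the feed URL is a placeholder, not this blog's actual feed URL.)

    import urllib.error
    import urllib.request

    FEED_URL = 'https://example.com/blog/atom.xml'   # placeholder URL

    # First request: a plain GET with no If-None-Match header.
    with urllib.request.urlopen(FEED_URL) as resp:
        first_etag = resp.headers.get('ETag')
        print('first ETag:', first_etag)
    assert first_etag is not None   # this sketch assumes the server sent an ETag

    # Second request: a conditional GET sending back the ETag we just got.
    req = urllib.request.Request(FEED_URL,
                                 headers={'If-None-Match': first_etag})
    try:
        with urllib.request.urlopen(req) as resp:
            # A 200 answer with a different ETag is the oddity described above.
            print('second status:', resp.status,
                  'ETag:', resp.headers.get('ETag'))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print('304 Not Modified: the cached copy is still current')
        else:
            raise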

The dynamic site conditional GET problem is that in many dynamic environments you need to more or less build the entire page in order to determine its ETag and Last-Modified information. If you want to have a full page cache to reduce your load under some circumstances and you don't have explicit cache invalidation (which is very hard in a file based engine), you don't necessarily have fully accurate ETag values; reducing load implies relying on the ETag cached in the page cache even though the actual page and its ETag may have changed since then. If you serve an old cached version of a page, expire it, and then regenerate it, the newly generated version may well be different. This is basically the traditional conflict between a desire for more cache hits and a desire for absolutely current information.
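
As a much simplified sketch of this conflict, consider a page cache that stores the generated body and ETag together and trusts them until a time-based expiry. Everything here (the names, the expiry scheme, the hashing) is an illustrative assumption, not how DWiki actually works:

    import hashlib
    import time

    CACHE_TTL = 300           # seconds; an assumed expiry policy
    _page_cache = {}          # url -> (stored_at, body, etag)

    def render_page(url):
        # Stand-in for the expensive full page generation that you normally
        # need to do before you even know the page's current ETag.
        body = 'rendered contents of %s as of %f' % (url, time.time())
        etag = '"%s"' % hashlib.md5(body.encode()).hexdigest()
        return body, etag

    def serve(url, if_none_match=None):
        entry = _page_cache.get(url)
        if entry is not None and time.time() - entry[0] < CACHE_TTL:
            # Cache hit: we answer with the stored ETag even though the
            # real page (and so its real ETag) may have changed since we
            # cached it.
            _, body, etag = entry
        else:
            body, etag = render_page(url)
            _page_cache[url] = (time.time(), body, etag)
        if if_none_match is not None and if_none_match == etag:
            return 304, '', etag
        return 200, body, etag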

(You can try to track dependency information in the page cache and revalidate it before you use a cache entry and its ETag, but the general problem there is that the more you revalidate the slower a cache hit is. This is especially acute in a file based engine because the validators are harder and often less efficient.)

However this is not a full explanation of what Aristotle Pagaltzis saw (not unless the cache entry for the feed expired between the two requests, which it probably didn't). What is also going on is that DWiki is doing some special hacks in certain circumstances in order to reduce the impact of generating syndication feeds. This is relatively important here because feeds are requested quite often and they're one of the most expensive things to generate (partly because I set the number of entries in feeds quite high).

What I found when I started looking at my conditional GET logs at some point was that I was getting a significant number of requests that were not conditional GETs, i.e. they lacked both an If-None-Match and an If-Modified-Since header. Overcome by grumpiness, I decided that if these people could not be bothered to do conditional GET I was not going to go out of my way to (re)generate current content for them, so what my DWiki setup does is serve them syndication feeds from the page cache for much longer than it does for people who make conditional GETs. This means that if you do what Aristotle did, your first request may get served from an old cache entry but then your second one is recomputed from scratch (now that it's a proper conditional GET). Under the right circumstances this will result in a changed ETag.
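
The policy itself boils down to picking a cache lifetime based on whether the request was a conditional GET at all. A minimal sketch of the idea, with a hypothetical helper and made-up lifetimes rather than the values my setup actually uses:

    # Assumed cache lifetimes; the real numbers are a policy decision.
    CONDITIONAL_TTL = 5 * 60       # clients doing proper conditional GET
    UNCONDITIONAL_TTL = 60 * 60    # clients that never send the headers

    def feed_cache_ttl(headers):
        # 'headers' is a mapping of the request headers. Requests with
        # neither If-None-Match nor If-Modified-Since get served from the
        # page cache for much longer.
        if 'If-None-Match' in headers or 'If-Modified-Since' in headers:
            return CONDITIONAL_TTL
        return UNCONDITIONAL_TTL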

(It looks like roughly one seventh of my successful syndication feed requests are being sent without conditional GET at the moment. This doesn't count things like Googlebot that are getting their requests refused outright.)

Conceptually this is tilting the balance between cache hits and avoiding staleness in the direction of more cache hits under some circumstances. I don't think there's anything wrong with this as long as you're doing it deliberately and with your eyes open (and ideally based on numbers).

With that said, now that I've had my attention drawn to this I'm probably going to rethink how I want to handle caching for various sorts of syndication feed requests. My initial syndication feed caching was set up rather a long time ago and there have been several generations of overall cache improvements since then. It's quite likely that the relative cost of generating syndication feeds has shifted in favour of caching them less and generating them more often.

(One of the things that has happened on Wandering Thoughts is that syndication feeds are requested so often that they're almost always in the page cache. I actually routinely flush them from cache by hand any time I publish or revise an entry, which is probably a warning sign I should have paid attention to some time ago.)

ConditionalGETAndCaching written at 23:08:43

2014-03-02

Googlebot is now aggressively crawling syndication feeds

I'm not sure how long this has been going on (I only noticed it recently) but Googlebot, Google's search crawler, is now aggressively crawling syndication feeds. By 'aggressively crawling' I mean two things. First, it is fetching the feeds multiple times a day; one of my feeds was fetched 46 times in one 24-hour period. Second and worse, it's not using conditional GET.

I've written before about why web spiders should not crawl syndication feeds and I still believe everything I wrote back then (even though I've significantly reduced the number of feeds I advertise since those days). My feed URLs are all marked 'nofollow', a declaration that Googlebot generally respects. And even if Google was going to crawl syndication feeds, the minimum standard is implementing conditional GET instead of repeatedly spamming fetch requests; the latter is the kind of thing that gets you banned here.

I might very reluctantly accept Googlebot crawling a few syndication feed URLs if they properly implemented conditional GET. Then it might be a reasonable move to find updated content (although Googlebot accesses my sitemap much less frequently) and I'd passively go along with the 800 pound gorilla of search traffic. But without conditional GET it's my strong opinion that this is abuse plain and simple, and I have no interest in cooperation.

So, in short: I suggest that you check your syndication feed logs to see if Googlebot is pounding on them too and if it is, block it from accessing those URLs. I doubt Google is going to change its behavior any time soon or even notice, but at least you can avoid donating your site resources to an abusive crawler.

(As I expected, Googlebot is paying absolutely no attention to days of 403 responses on the feed URLs it's trying to fetch. It keeps on trying to fetch them at great volume, to the tune of 245 requests so far today for 11 different URLs.)

Sidebar: Some more details

First, this really is Googlebot; it comes from Google IP address ranges and from specific IPs with crawl-*.googlebot.com reverse DNS such as 66.249.66.130 (there's a verification sketch at the end of this sidebar).

Second, in the past Googlebot has shown signs of supporting conditional GET on syndication feeds. I have historical logs that show Googlebot getting 304's on syndication feed URLs.

Third, based on historical logs I have for my personal website, this appears to have started happening there around January 13th. There are sporadic requests for feed URLs before then, but January 13th is when things light up with multiple requests a day.
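
As a side note to the first point, the standard way to verify this sort of claim is forward-confirmed reverse DNS: look up the PTR record for the IP, check that the name is under googlebot.com or google.com, and then resolve that name back to make sure you get the original IP. A minimal sketch of such a check (illustrative, not necessarily the exact check I did):

    import socket

    def is_real_googlebot(ip):
        # Forward-confirmed reverse DNS: the PTR name must be under
        # googlebot.com or google.com, and resolving that name must give
        # back the original IP address.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:
            return False

    # For example: is_real_googlebot('66.249.66.130')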

GooglebotCrawlingFeeds written at 23:33:34

Cool URL fragments don't change either

I was all set to write an entry about how a limitation of dealing with site changes (from one domain to another, from HTTP to HTTPS, or just URL restructuring) via HTTP redirects was that URL fragments fell off during the redirections. Then I decided to actually check the end URLs I was getting and discovered that I was wrong. Browsers do preserve URL fragments during redirection (although you may not see this if the URL fragment is cut off in the display because the full URL is long). What was really going on in my case is that the site I was dealing with has violated a sub-rule of 'Cool URLs don't change'.

Simply stated, the sub-rule is 'URL fragments are part of the URL'. Let me rephrase that:

If you really care about cool URLs, you can't change any HTML anchors once you create them.

The name of the anchor must remain the same and the meaning (the place it links to) must also stay. This is actually a really high bar, probably an implausibly high one, since HTML anchors are often associated with very specific information that can easily become invalid or otherwise go away (or simply be broken out to a separate page when it becomes too big).

Note that this implies that simply numbering your HTML anchors in sequential order is a terrible thing to do unless you can guarantee that you're not going to introduce or remove any sections, subsections, etc. It's much better to give them some sort of name. Effectively the anchor name should be seen as a unique and stable permanent identifier. Again this is a pretty high bar and is probably going to cause you heartburn if you try to really carry it out.
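
One common way to pick such a name is to derive it from the section's title when the section is first written and then treat the result as frozen, even if the title later changes. A minimal sketch of that sort of scheme (illustrative only, not what any particular tool does):

    import re
    import unicodedata

    def anchor_name(heading):
        # Turn a section heading into a name-like anchor id that can be
        # kept stable as sections are added, removed, or reordered.
        s = unicodedata.normalize('NFKD', heading)
        s = s.encode('ascii', 'ignore').decode('ascii')
        s = re.sub(r'[^A-Za-z0-9]+', '-', s).strip('-').lower()
        return s

    # anchor_name('Cool URL fragments') -> 'cool-url-fragments'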

This somewhat tilts me towards a view that HTML anchors should be avoided. On the other hand, it's often easier to read one large page with lots of information (exactly the situation that calls for HTML anchors and an index at the top) than to keep clicking through a lot of small pages. Today's side moral is that web page design can be hard.

(I'd say that the right answer is small pages with some JavaScript so that one page seamlessly transitions to the next one as you read without you having to do anything, but even that's not a complete solution since you don't get things like 'search in (full) page'.)

I suppose what I really wish is that web servers got URL fragments in the URL of the HTTP request but normally ignored them. Then they could pay attention to the fragment identifier when doing redirects and do the right thing if they had any opinions about it. But this is a wish that's a couple of decades too late; I might as well wish for pervasive encryption while I'm at it.

CoolUrlFragments written at 01:59:45

