Wandering Thoughts archives

2006-04-26

Common web spider mistakes

It turns out that a lot of web spiders crawl my website here, and because I read my web server logs more than the average monkey, I've gotten to see a number of web spider mistakes over and over again. These mistakes are so common that it's a rare day when I don't see at least one of them in the logs.

(These are mistakes as opposed to stupid spider tricks because they are so clearly wrong by the basic rules of HTTP.)

So, stupid web spider mistakes:

  • a leading slash on the URL in an <a href> makes it an absolute path: you resolve it against the site root, not against the current page's path (see the sketch after this list). I am honestly boggled that any software gets this wrong, but some PHP-based thing appears to.

  • trailing slashes on the ends of URLs are not some optional element that you can omit. They have important semantic meaning, and the only thing you get if you leave them off is an HTTP redirect. (In that I am being nice; some places don't even give you the redirect.)

  • although the hostname component of URLs is case-insensitive, the rest of the URL is not. Spiders that lower-case URLs here get 404s, not the pages they are looking for.

  • the query parameter '?foo' is not the same as the query parameter '?foo='. You cannot rewrite the one into the other.

  • URLs that spiders give to web servers should not include fragment identifiers on the end (such as '#comments'). At least not if your spider wants me to give you any pages.
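
(To illustrate the first and last of these, here's a minimal sketch using Python's urllib.parse; the base URL is made up for the example.)

    from urllib.parse import urljoin, urldefrag

    base = "http://example.com/blog/2006/04/entry"

    # A leading slash is an absolute path on the same host; urljoin
    # correctly discards the current page's path when resolving it.
    urljoin(base, "/other/page")   # -> http://example.com/other/page

    # Only slash-less relative links resolve against the page's path.
    urljoin(base, "sibling")       # -> http://example.com/blog/2006/04/sibling

    # Fragments are for the browser alone; strip them before requesting.
    urldefrag("http://example.com/page#comments")[0]
                                   # -> http://example.com/page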

In writing this I actually read the HTTP 1.1 RFC and found out that in theory I should be accepting absolute URLs, with the 'http://hostname/' on the front, although equally in theory no one should be sending them (since our web server is not a proxy). All of the fragment identifier requests in the past 28 days have been for absolute paths, though, not full absolute URLs, and fragments aren't allowed on either in HTTP requests.

Reading the RFC also answers the obvious question about absolute URLs: if a request uses an absolute URL, any Host: header is ignored and you take the hostname from the absolute URL (section 5.2). This both simplifies and complicates my life. (Also, missing Host: headers are an error in HTTP 1.1 requests (section 19.6.1.1). Fortunately, it appears that Apache already checks for that for me.)
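
(Here's a sketch of that host-determination rule in Python; the function and its names are my own invention, not anything Apache or DWiki actually does.)

    from urllib.parse import urlsplit

    def request_host(request_uri, headers):
        # RFC 2616 section 5.2: an absolute request URI wins, and any
        # Host: header must then be ignored.
        parts = urlsplit(request_uri)
        if parts.scheme and parts.netloc:
            return parts.netloc
        host = headers.get("Host")
        if host is None:
            # Per section 19.6.1.1, an HTTP 1.1 request without a
            # Host: header should get a 400 Bad Request.
            raise ValueError("HTTP/1.1 request without a Host: header")
        return host

    request_host("http://example.com/page", {"Host": "ignored.example.org"})
    # -> 'example.com'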

StupidSpiderMistakes written at 02:04:40

2006-04-21

A CSS limitation: it's not supported by lynx

That CSS isn't supported by lynx may sound like a peculiar thing to call a limitation of CSS, but it does have one important consequence: if I want lynx to format something right, I can't use CSS.

(Well, I suppose I can use CSS in addition to non-CSS methods. But I have to have the non-CSS methods.)

As strange as it may sound, I still value looking good in lynx (and links, a related text mode browser). (And in Konqueror/Safari. And even in Internet Explorer, although sometimes it's tempting. I have pretty much given up on Netscape 4, though, except to the extent that it's lynx-compatible.)

One case where this comes up in WanderingThoughts is the day marker strings (the centered bold '2006-04-20' and so on) that mark the start of a day's posts. The morally proper way to do these is to put them in <div class="daymarker"> and then apply the centering and bolding via CSS. But that would look crappy in lynx, so instead I use the older, deprecated way:

<p class="daymarker" align=center> <b> 2006-04-20 </b> </p>

(The class is currently unused.)

These days I'm a web design pragmatist; I'm more interested in looking right than in doing something in the currently 'blessed as proper' way (especially as the currently approved way keeps changing). I prefer to do things in the proper way, because I'm geeky enough that it makes me feel good, but if the proper way conflicts with pragmatics, the proper way loses. For me, intellectual purity is not worth looking ugly.

CSSLimitationsI written at 00:53:19

2006-04-10

The fun and charm of quoting URLs properly

The fun and charm of URL quoting is that you need to do it twice. Differently. That's because there are two different entities involved: browsers and web servers.

Strictly speaking, about the only thing that you have to quote for the browser is quote characters, because otherwise your nice <a href="..."> comes out very confusing. If you are being a good web standards monkey, you need to quote at least ampersands (&'s) as well, because otherwise the browser may take them as entity references. The HTML 4.01 spec in section 5.3.2 recommends also quoting '>', just in case.

(In practice, no browser pays any attention to anything except a truly valid entity reference, because practically everyone except the obsessively standards compliant has unescaped &'s flying around.)
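
(In Python terms, html.escape does this browser-side quoting; the URL here is made up.)

    import html

    url = 'http://example.com/find?q="foo"&lang=en'
    html.escape(url)
    # -> 'http://example.com/find?q=&quot;foo&quot;&amp;lang=en'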

Web servers are startlingly liberal, so the only things you really have to quote are space characters (as either %20 or '+', depending on context) and the percent character itself. RFC 2396 has an additional list or two of stuff that should also be quoted (in sections 2.4.3 and 2.2), like quotes, and some web servers are picky.

(And if you are unlucky enough to deal with a joker who embedded URL component separator characters like '?' or '&' into his paths, you'll have to quote them too.)
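
(The server-side quoting is percent-encoding. In Python that's urllib.parse.quote and quote_plus; a sketch:)

    from urllib.parse import quote, quote_plus

    # quote() leaves '/' alone by default, since it separates path segments.
    quote('/a path/with "quotes"')   # -> '/a%20path/with%20%22quotes%22'

    # In query strings, spaces can become '+' instead, and embedded
    # separators like '&' must be percent-encoded.
    quote_plus('a b&c')              # -> 'a+b%26c'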

You quote things for the browser with entity encoding, so & turns into &amp;. You quote things for the web server with percent-encoded hex character values, so a quote turns into %22 (which the browser passes through untouched). In theory a neurotic application like DWiki that gets handed a URL with a quote should encode it as &quot; so it survives the browser and gets passed as-is to the web server for the web server to puke on if desired; in practice, DWiki just encodes quotes in URLs straight to %22s.
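
(Putting the two layers together, with a hypothetical helper that is not DWiki's actual code: you percent-encode for the web server first and entity-encode second, because the browser undoes its layer first.)

    import html
    from urllib.parse import quote

    def href_for(path):
        # Percent-encode for the web server (quotes -> %22, spaces -> %20),
        # then entity-encode for the browser (any remaining & -> &amp;).
        return html.escape(quote(path, safe="/&?="))

    '<a href="%s">link</a>' % href_for('/a "quoted" path?x=1&y=2')
    # -> '<a href="/a%20%22quoted%22%20path?x=1&amp;y=2">link</a>'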

Also in practice, many browsers will perform all of the necessary percent-encoding for the web server themselves, turning spaces into %20 and so on, and you only need to worry about getting it to the browser. The one gotcha is that browsers often trim trailing spaces, even when they're a necessary part of the URL. Doing more quoting is friendlier to simplistic HTML parsing applications.

(This entry is brought to you by me getting curious about the technical requirements of this all during an online discussion with friends.)

UrlQuoting written at 01:49:00

2006-04-03

An ugly spam attempt

Every so often I take a look at what user agents are visiting WanderingThoughts. Tonight it turned up a doozy: a single visit with a User-Agent of:

<script>window.open('<URL>')</script>

Presumably the intended attack vector was sites that summarize user agent traffic onto a web page without escaping the text; that would make this user-agent string into live JavaScript that would force any visitor's browser to go there.
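
(The defense on the reporting side is the same entity-escaping discussed in the URL quoting entry above; a sketch in Python, with a stand-in URL rather than the spammer's:)

    import html

    ua = "<script>window.open('http://example.com/')</script>"
    html.escape(ua)
    # -> "&lt;script&gt;window.open(&#x27;http://example.com/&#x27;)&lt;/script&gt;"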

The attack is also noteworthy for how brazen it is. The URL in the request is for 'buy4cheap.brinkster.net/buy2/side-search.htm', and the request itself came from 65.182.100.121, aka 'orf-premium12a.brinkster.com'. Most spammers are far less willing to clearly sign their work like that.

(I vacillated between calling this 'clever' or 'ugly'; I am going with 'ugly' because I don't like the implications of what these people are doing, and attempting to inject JavaScript is not a sign of angels.)

UglyWebSpammer written at 03:36:39

Spiders should respect rel="nofollow"

If you're writing a web spider, what should you do when you see a link marked rel="nofollow"?

In theory, you can do nothing different from any other link. It's not a formal specification, and the original description only talks about the resulting link not giving the target any credit.

In practice, on the Internet what people expect is in large part defined by what the 800 pound gorillas do. And both Google and MSN Search consider nofollow to be literal: don't follow this link. In fact Google explicitly documents this behavior; see one of the original postings on nofollow, or Google's description of it in their help pages.

So the real answer is: if you see a rel="nofollow" link, you shouldn't crawl the target.

Since Google (the original creator of nofollow) describes it this way, I will go so far as to say that respecting nofollow requires you to not crawl marked links.

Spider authors should do this not just because it's what people expect, but because it's genuinely useful for guiding spiders around web sites. (Especially dynamic web sites like wikis and blogs, which can have a lot of different ways of viewing more or less the same content.)
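
(A minimal sketch of what respecting nofollow looks like in a spider, using Python's html.parser; the class and its names are mine, not from any real crawler.)

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        # Collect hrefs to crawl, skipping links marked rel="nofollow".
        def __init__(self):
            super().__init__()
            self.to_crawl = []

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            attrs = dict(attrs)
            # rel can hold several space-separated tokens.
            rel = (attrs.get("rel") or "").lower().split()
            if "nofollow" not in rel and "href" in attrs:
                self.to_crawl.append(attrs["href"])

    parser = LinkCollector()
    parser.feed('<a href="/keep">x</a> <a rel="nofollow" href="/skip">y</a>')
    parser.to_crawl   # -> ['/keep']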

RespectTheNofollow written at 02:54:03

2006-04-01

A Firefox CSS irritation

I'm not going to fault Firefox for not supporting the CSS 2.1 'white-space: pre-wrap', no matter how convenient it would be for me if it did. Especially since CSS 2.1 is not yet a standard, merely a late-stage working draft. But I am annoyed that Firefox doesn't support the CSS2 'display: compact', since I could have used it just now.

'display: compact' is classically used (in that it is right there in the CSS2 spec as an example) to create <DL> lists where the <DT> term is on the same line as the start of the <DD> definition (or definitions, since you can have more than one). But with Firefox not supporting this, your only real option for the same visual appearance is a table.

(Please don't suggest floats.)

The Bugzilla bug is #180468, open since 2002. Mozilla not supporting <dl compact> is the impressively ancient #2055 from 1998, marked WONTFIX.

(Firefox is hardly alone in not supporting display: compact, judging from here or here; I believe that this is why the CSS 2.1 working draft quietly drops it. However, support may be on the uptick; the KHTML engine, used by Konqueror and Apple's Safari, seems to support it.)

FirefoxCSSIrritation written at 04:09:06

