Wandering Thoughts archives

2006-02-28

A web spider update: not actually Uptilt's web spider

A while back, I wrote an entry about a bad web spider that at the time appeared to belong to Uptilt Inc. About a week after I published the entry, some of the system administration folks from Uptilt stumbled across it and got in touch with me to look into the whole situation.

In fact they were pretty puzzled about the incident, because (as they put it) Uptilt didn't even do outgoing HTTP, much less have a web crawler; their business is based on email. After I provided some additional specific information to them, they worked out what seems to have happened.

According to them, 64.71.164.96/27 no longer actually belonged to Uptilt. Hurricane Electric had allocated it to them in November 2005, but when they ramped up operations from it in December they found that a lot of their email from it was getting blocked; upon investigation, they found that the subnet had been used by New Horizons, a well-known spammer, since 2004 or so (see eg the SPEWS listing). So Uptilt asked HE for a new, clean netblock and told HE to take back 64.71.164.96/27. However, neither the ARIN WHOIS information nor some of Uptilt's own records got updated at that time.

Once the Uptilt Inc people worked out what was going on, they got in touch with HE to get the WHOIS information corrected (I expect they also made sure all of their internal records got corrected). Unfortunately, the updated WHOIS information is now completely generic, just showing Hurricane Electric's /18 with no delegation information. Also, the Uptilt people were unable to get HE to tell them who the netblock is now assigned to.

There's a lesson in here about making sure that records, even your own records, are up to date. I've certainly seen similar things happen with internal records here. (In fact back in August I wrote about the accuracy problems of non-essential information.)

UptiltUpdate written at 02:03:42; Add Comment

2006-02-18

Stupid web spider tricks

In the spirit of earlier entrants, but not as bad, here are some stupid web spider tricks.

The first stupid trick: crawling 'Add Comment' pages. Not only are the 'Add Comment' links marked nofollow (so good little spiders shouldn't be going there), but it's also a great way to make me wonder if you're a would-be comment spammer and pay close attention to every CSpace page you hit. CSpace gets sufficiently few page views at the moment that I can read all of the server logs, so I will notice.

(All sorts of web spiders seem to find the 'Add Comment' links especially tasty for some reason; it's quite striking. I'm pretty sure they're the most common nofollow links for web spiders to crawl.)

The second stupid trick: including a URL explaining your spider, but having that URL be a '403 permission denied' error page. Fortunately for my irritation level, I could find a copy in Google's cache (pick the cached version of the obvious web page), and it more or less explained what the web spider was doing.

Thus, today's entrant is the 'findlinks' web spider, from various 139.18.2.* and 139.18.13.* IP addresses (which belong to uni-leipzig.de) plus a few hits from 80.237.144.96 (which doesn't seem to). The spider seems to be a distributed one, where any client machine that uses the software can crawl you. (I'm not sure I like distributed crawlers.)

On a side note, I derive a certain amount of amusement from seeing English Apache error messages on a foreign language website.

(Other information on the findlinks spider: in this huge database of spiders or here.)

StupidSpiderTricks written at 02:35:29; Add Comment

2006-02-15

Fun with control characters and the web

Yesterday, I broke WanderingThoughts' syndication feeds (and made the main page not validate). I did this by accidentally putting a ^D character into an entry; one of the editors I use makes this unfortunately easy to do by accident, and hard to spot.

Neither HTML 4.01 Transitional nor XML (which the Atom syndication format is a dialect of) allows control characters, apart from tabs and linefeeds. A lot of things are forgiving of crappy HTML, but things that eat Atom feeds are usually more picky; for example, the LiveJournal version of my feed stopped updating entirely. (That's how I noticed it.)

(Technically this is not the full story for XML; there are a number of other invalid characters and a great big character set swamp. So far I have been madly ducking it.)

Clearly I'd like to avoid having this happen in the future, but there's a problem: since DWiki pages are edited through the filesystem, DWiki is in a position of having pages with bad characters thrust down its throat at page rendering time, so it has to do something with them. What's the best way to communicate the problem while still producing valid (and ideally useful) output?

This isn't entirely new, as DWikiText already has a number of ways for me to screw it up; for example I could put in an invalid macro. The basic principle DWiki uses is 'the rendering must go on': errors should do as little damage as possible, and never kill the entire page. Usually they produce literal text; of course the problem here is that the literal text is the problem.

(Aborting things to report errors is appropriate for situations when you're showing the error reports to the author. When you're showing them to random people, it makes far less sense.)

This pretty much leads to the answer: stray control characters should produce something like '{control character elided}' in both regular HTML and Atom feeds (it's technically challenging to render it in bold or the like). This keeps things valid, doesn't hide the problem like just deleting the characters would, and doesn't totally destroy the page. Now I just have to code it. Efficiently, since DWikiText rendering is a hot path.
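
As a rough illustration, here's a minimal Python sketch of the sort of substitution I have in mind; the function name and the exact set of characters are my own guesses, not DWiki's actual code, and a precompiled regexp should keep it cheap enough for the hot path:

import re

# Stray control characters to elide; tab, linefeed, and carriage return
# are fine and get left alone.
_BAD_CTRL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def elide_control_chars(text):
    # Replace each bad character with a visible marker, so the page and
    # the Atom feed stay valid without silently hiding the problem.
    return _BAD_CTRL.sub('{control character elided}', text)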

(A lesser way would be to rewrite control characters to things like '^D', but this is a) somewhat more complicated and b) not as noticeable.)

CharacterProblems written at 03:15:45; Add Comment

2006-02-13

The problem with <pre>

Generally, <pre> is a fine thing, and it's become the de-facto way of writing any number of 'computer' things: code, Unix sessions, even equations, as various WanderingThoughts entries illustrate. But there's a problem with it: <pre> text doesn't line-wrap.

The consequence is that if you write a long <pre> line, the browser will happily force a line wider than the browser window, making the reader either widen their browser (if they can widen it enough) or scroll. For WanderingThoughts it's even worse, because a CSS irritation forces me to lay it out using a table. Text inside a table cell is wrapped not at the browser width but at the table cell's width, and a single long line widens the entire cell, forcing all the text in it to be that wide. The net result is that if you don't (or can't) make your browser wide enough, you can't read anything.

Sometimes this is what's required and damn the torpedoes (and I'll 'line-wrap' by hand to try to avoid too-long lines). But there's a surprisingly large number of times when what you really want is just monospaced text with forced line breaks where the raw text has line breaks; extra line-wrapping doesn't actually hurt (especially if it's clear).

(I might as well admit that this is part of the 'personal aesthetic reasons' I alluded to in my comment on this entry; I browse with a fairly narrow browser window.)

In DWikiText my solution has been to write 'manual' <pre> text using _ and [[...|]], more or less like this:

_[[... ... ...|]]_ \\
_[[... ... ...|]]_

But this is awkward, doesn't clearly show automatically wrapped lines, and compresses whitespace; plus, it requires manual work. Ideally, DWikiText would have a formatting option that makes it easy to do the right thing. (After all, one of the reasons <pre> text gets used here so much is that it's so easy.)

It's possible to use CSS to get most of the way to what I want (not all the way; there's no way to preserve whitespace without also disallowing automatic line-wrapping). The CSS that I've come up with so far is a <div> with 'font-family: monospace; margin-left: 1em; text-indent: -1em;', and then each line inside it is another <div> (empty lines have to be forced with <br>). This causes wrapped lines to be indented by a bit, to make them stand out. Of course this looks pretty bad if you aren't using CSS, and I still value readability in lynx.

(It's still a temptation to implement it in DWiki.)
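
If I ever give in to that temptation, here's a rough Python sketch of what such a helper could generate, using the CSS above; the names are made up and this is not actual DWiki code:

_WRAP_STYLE = 'font-family: monospace; margin-left: 1em; text-indent: -1em;'

def _escape(s):
    return s.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')

def wrapped_mono(text):
    # Each non-empty line becomes its own <div>, so the browser is free
    # to line-wrap it; the negative text-indent makes wrapped
    # continuation lines stand out. Empty lines are forced with <br>.
    out = []
    for line in text.splitlines():
        if line.strip():
            out.append('<div>%s</div>' % _escape(line))
        else:
            out.append('<br>')
    return '<div style="%s">\n%s\n</div>' % (_WRAP_STYLE, '\n'.join(out))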

PreProblem written at 01:52:07; Add Comment

2006-02-11

The return of how to get your web spider banned

Today's entrant is 'Uptilt Inc', uptilt.com, aka 64.71.164.96/27. Let's see how they did on my scale:

Important update: it turns out I was fooled by out of date WHOIS information and Uptilt Inc isn't involved; for the full story, see UptiltUpdate. Remember that when you read the historical references to them in the rest of this entry.

  • 1,159 requests in one day.
  • 25+ requests for several URLs that are permanent redirections. The redirected-to pages haven't changed recently, either.

  • They had the generic user-agent string "NutchCVS/0.8-dev (Nutch; [...])". At least it included a URL to the Nutch page. (It of course did not include a URL to any of their own pages.)

  • They did frequently fetch robots.txt: 28 times in one day, in fact.

  • None of the 10 different IP addresses in 64.71.164.96/27 that hit us have reverse DNS. (In fact, nothing in the subnet has reverse DNS.)
  • The subnet has no useful contact information, apart from the fact that Hurricane Electric says it belonged to an 'Uptilt Inc'. There is an uptilt.com, but to make you wonder it lives in a different subnet and its WHOIS data has a different physical address. However, the uptilt.com website says Uptilt Inc's headquarters is at the same address as HE has for the owners of 64.71.164.96/27.

In short: even more searching than last time.

  • Of course, www.uptilt.com has no information on any spidering activity they may be doing. Instead, it has lots of information on them being a "leading provider of Marketing Automation software solutions", and their subsidiary emaillabs.com being a "leading provider of advanced email marketing solutions".
  • They lose points for having prominent links to a website called 'crm.uptilt.com', which doesn't exist. Some of the links to their privacy policy and so on don't work either.
  • Since around here 'email marketing' tends to be spelled S-P-A-M, I wasn't exactly encouraged to send them any email about their spider. These days if you're involved in 'email marketing', I feel that you had better bend over backwards to reassure people that you're not a spammer and you understand all the rules and so on.

Overall score: BANNED. Since they use a generic user agent string (even though the spider does check robots.txt), their subnet now resides in our permanent kernel-level IP blocks alongside our first contestant.

(We actually banned them a bit under two weeks ago, but I've only gotten around to writing this up now. The kernel IP block counters show that they've tried to drop by a few times since their ban.)

HowToGetYourSpiderBannedII written at 00:26:28; Add Comment

2006-02-08

The id attribute considered dangerous

I've been doing various overhauls of DWiki recently (a faster and better DWikiText to HTML renderer written by Daniel Martin (and hacked by me, so blame me for any problems) went in tonight, for example). As part of this, I've been looking at the HTML DWiki generates, running it past HTML Tidy and the W3C validator and so on. It's been a learning experience.

One of the things I've learned is that almost everywhere DWiki code and templates were giving <div>s an id="..." attribute, it was the wrong thing to do. In something like DWiki, id attributes are dangerous.

The problem for a dynamic, template driven site is that every (hard-coded) id is a promise that that particular element will only ever appear once in a page. Any page. Every page. Better not forgetfully reuse an id-containing template twice in some page, even if it's the structurally right thing to do.

(For instance, the <div> around all of the comments on a page used to have an id. That seemed harmless until I thought about a) possibly showing comments by default on pages and then b) the blog view, which would show multiple pages in their normal rendering glommed into one bigger page. Kaboom!)
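
To make the failure concrete, here's a tiny sketch with made-up fragments (not DWiki's actual templates) showing why a reused fragment is safe with class but not with a hard-coded id:

# Hypothetical per-page fragment; with class, repeating it is fine.
COMMENTS_FRAGMENT = '<div class="comments">%s</div>'
# The old version was effectively '<div id="comments">%s</div>', which
# promises the element appears at most once per page.

def render_blog_view(pages):
    # The blog view gloms several pages into one bigger page, so any
    # per-page fragment shows up once per entry in the output.
    return '\n'.join(COMMENTS_FRAGMENT % comments for comments in pages)

print(render_blog_view(['comments for entry one', 'comments for entry two']))
# With the id version, the output would contain id="comments" twice,
# which is invalid HTML; the class version validates.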

In DWiki's case, using id wasn't getting me anything I couldn't get with class; CSS styling can use either. This made my use of id more or less gratuitous, which just made my mistakes with it all the more wince-inducing. (The height of hubris was having a renderer, not even a template, put an id in its output.)

(I'm sure that advanced CSS tricks care about the difference, but DWiki's use of CSS is pretty simple.)

After a bunch of revisions, I now have id attributes in only five places, and at least two of them (the page header and the syndication feed information) are probably still mistakes. (They persist because I'd have to revise the CSS too, and I've already made enough big wrenching changes today.)

PS: changing the DWikiText renderer caused some fiddly little changes in the HTML of every entry in CSpace's syndication feeds. If your feed reader exploded as a result of this, you may want to get a better one. (With a very few minor exceptions, there should be no visual differences.)

IdConsideredDangerous written at 02:05:27; Add Comment

2006-02-06

More on simple markup languages

In a reply to my WhySimpleMarkup post, Chris Wage wrote, in part:

I like the idea of a simple markup language, but the reality is that they are implemented in an obscure and often counter-intuitive fashion.

I'll unfortunately agree with this; it's one reason why I created my own for DWiki. Pretty much all of the existing wikitext dialects I looked at struck me as ugly to see, tiresome to write, or both. Looking good as plain text and being easy to write in were explicit goals for DWikiText, and I left features out to achieve this. (Well, I think I've achieved it.)

(I'm not convinced that it's possible to be attractive looking, easy to write ordinary things in, and have a complete set of text formatting options. There are only so many characters to go around, at least until we start using Unicode (and down that road lies Perl 6).)

However, I disagree with Chris Wage about editors replacing simple markup languages. I feel that playing with any sort of HTML editing environment is actually make-work, even if it's faster than writing HTML by hand. And I don't think it is faster in many cases, because the editors are designed to be novice-friendly instead of fast for people who do this all the time.

One source of the disagreement may be that I don't think of simple markup languages as a way of making it easy for novices; I think of them as a way of streamlining the work of experts. I can write HTML by hand; I just don't want to bother.

(This means that I don't really care about standardization either, unless it doesn't cost me very much. If your goal is making it real easy for casual people to make changes in any wiki they run across, you may feel differently. Since DWiki doesn't have web-based editing, I'm already a heathen in that respect.)

PS: you can see how the plain text source of this looks with the 'View Source' link in the Page Tools entry at the bottom of here, and make your own decision about how pretty or ugly it is.

WhySimpleMarkupII written at 01:57:46; Add Comment

2006-02-04

Why simple markup languages make sense

I'm a big fan of simple markup languages for writing web pages (or in fact any sort of document; it just happens that web sites are pretty much everything I write these days). Recently I figured out a good expression of why:

Simple markup languages are the same idea as high level programming languages: less make-work and more of what actually matters.

(And as an added bonus, less interruption of my writing to sprinkle HTML all over.)

Languages like C and Java have a bunch of necessary repetitive tedium that you wind up doing over and over again to pacify the language, to the extent that automating these chores is a major IDE industry. One reason that good high level languages are such a pleasure to program in is that they do away with all of this tedium and let you think about your program, instead of yet another set of canned getter and setter methods.

Compared to HTML, simple markup languages have the same effect: fewer <p>s and <a href="...">s and more of your actual writing. At least for me, this results in a better focus on just writing, and thus better writing (and more of it).

As far as I'm concerned, one of the big wins of wikis is how they streamline writing for the web by using simple markup languages. (Of course they then often throw away a bunch of this advantage because browsers make bad editors.)

(You can argue that the real answer is 'get a good HTML editor'. One of the reasons I think that this is not a real answer is that the more streamlined and unobtrusive the editor is in adding HTML to your plain writing, the more what you write looks like a simple markup language to start with.)

WhySimpleMarkup written at 06:16:33; Add Comment

2006-02-02

The rise of wikiblogs

Recently I read Doc Searls' The Chronological Web, where he argues (as the CentreSource blog entry that led me to his entry puts it) that most organizations need a blog. But what really made me sit up was this bit towards the end:

This helps, for example, when we talk to civilians who are new to the Web and want to "put up a website". Very often what they really need is a blog. [...] Updating a "site" is a chore. [...] Blogs are written, not constructed. Updating them can be as easy as writing an email. Yet there's nothing about a blog that excludes static pages.

This is about as concise a summary of the appeal of what I call wikiblogs as I could ask for. You get a blog, you get easily made static pages, and you hopefully get a decent management and authoring environment for it all. (You usually also get a very clear separation between content and visual design, often with your choice of skins.)

The other thing you get with a wikiblog is a blurring of the distinction between blog entries and 'static' pages (which may not be all that static). This is good for both sides, as each has things the other side can profitably steal. (The blog side may feel it has little to take from boring static pages, to which I have a simple reply: non-crappy navigation.)

(Another possible effect of blurring the distinction is to make people think of all of their site as more a publishing environment, a part of the 'Live Web' in Doc Searls' terms, than a part of the heavyweight 'Static Web'.)

If you want examples of wikiblogs in action, see Ian Bicking or Martin Fowler (or look for people using packages like Blosxom and PyBlosxom). (Disclaimer: selections not comprehensive.)

Also see the WikiPedia page on bliki, which has more additional links than I can shake a stick at.

WikiBlogs written at 03:01:42; Add Comment


