Wandering Thoughts archives

2016-08-24

Blindly trying to copy a web site these days is hazardous

The other day, someone pointed a piece of software called HTTrack at Wandering Thoughts. HTTrack is a piece of free software that makes offline copies of things, so I presume that this person wanted an offline copy for some reason. I don't think it went as they intended.

The basic numbers are there in the logs. Over the course of a bit over 18 hours, they made 72,393 requests and received just over 193 MBytes of data. Needless to say, Wandering Thoughts does not have that many actual content pages; at the moment there are a bit over 6400 pages that my sitemap generation code considers to be 'real', some of them with partially duplicated content. How did 6400 pages turn into 72,000? Through what I call 'virtual directories', where various sorts of range based and date based views and so on are layered on top of an underlying directory structure. These dynamic pages multiply like weeds.
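To put those numbers in perspective, here is a small back-of-the-envelope calculation. It is only a sketch: the inputs are the rounded figures quoted above ('a bit over' 18 hours, 'just over' 193 MBytes, 'a bit over' 6400 pages), it assumes a MByte is 10^6 bytes, and the function name is made up for illustration.

```python
def crawl_stats(requests, hours, mbytes, real_pages):
    """Rough per-request figures for a crawl, from rounded log totals."""
    return {
        # Average request rate over the whole crawl.
        "requests_per_second": requests / (hours * 3600),
        # Average response size, assuming 1 MByte = 10**6 bytes.
        "bytes_per_request": mbytes * 1_000_000 / requests,
        # How many URLs were fetched for each 'real' page.
        "requests_per_real_page": requests / real_pages,
    }


# Approximate figures from the logs, as quoted in the text.
stats = crawl_stats(requests=72_393, hours=18, mbytes=193, real_pages=6_400)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the crawl works out to roughly one request a second, and somewhat more than eleven fetched URLs for every real page.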

(I'm reasonably sure that 72,000 URLs don't cover them all by now, although I could be wrong. The crawl does seem to have gotten every real page, so maybe it actually got absolutely everything.)

Dynamic views of things are not exactly uncommon in modern software, and that means that blindly trying to copy a web site is very hazardous to your bandwidth and disk space (and it is likely to irritate the target a lot). You can no longer point a simple crawler (HTTrack included) at a site or a URL hierarchy and say 'follow every link', because it's very likely that you're not going to achieve your goals. Even if you do get 'everything', you're going to wind up with a sprawling mess that has tons of duplicated content.
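As an illustration of what 'not blindly following every link' can look like on the crawler side, here is a minimal sketch of a link extractor that skips links marked rel="nofollow". It assumes the target site marks its dynamic views that way, which not every site does. The class and function names are hypothetical, and this is not how HTTrack or any particular crawler works.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collect href values from <a> tags that are not marked nofollow."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        # rel is a space-separated token list; a missing rel becomes [].
        rel = (attr.get("rel") or "").lower().split()
        if href and "nofollow" not in rel:
            self.links.append(href)


def followable_links(html_text):
    """Return the links in html_text that a polite crawler may follow."""
    parser = FollowableLinks()
    parser.feed(html_text)
    return parser.links


# Made-up example page: one real page link, one nofollow range view.
page = '<a href="/blog/a">a</a> <a rel="nofollow" href="/blog/?range=1-10">r</a>'
print(followable_links(page))
```

A real crawler would also want to honor robots.txt and rate limits; this only shows the nofollow part.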

(Of course HTTrack doesn't respect nofollow, and it also lies in its User-Agent by claiming to be running on Windows 98. For these and other reasons, I've now set things up so that it will be refused service on future visits. In fact I'm in a sufficiently grumpy mood that anything claiming to still be using Windows 98 is now banned, at least temporarily. If people are going to lie in their User-Agent, please make it more plausible. In fact, according to the SSL Server Test, Windows 98 machines can't even establish a TLS connection to this server. Well, I'm assuming that based on the fact that Windows XP fails, as the SSL Server Test doesn't explicitly cover Windows 98.)
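The kind of User-Agent check described above can be sketched in a few lines. This is not DWiki's actual code; the function name, the list of banned substrings, and the sample User-Agent strings are all assumptions made for illustration.

```python
# Substrings that get a request refused, compared case-insensitively.
# Both entries are illustrative, based on the description in the text.
BANNED_SUBSTRINGS = ("httrack", "windows 98")


def should_refuse(user_agent):
    """Return True if the User-Agent header matches a banned substring."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BANNED_SUBSTRINGS)


# Hypothetical sample headers.
print(should_refuse("Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"))
print(should_refuse("Mozilla/5.0 (X11; Linux x86_64)"))
```

Substring matching on User-Agent is crude and trivially evaded by anyone who changes the header, but that is more or less the point being made above: it only catches clients that don't bother.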

PS: DWiki and this host didn't even notice the load from the HTTrack copy. We found out about it more or less by coincidence; a university traffic monitoring system noticed a suspiciously high number of sessions from a single remote IP to the server and sent in a report.

web/SiteCopiesAreHazardous written at 22:18:27

more, less, and a story of typical Unix fossilization

It all started on Twitter:

@palecur: I know enough about Unix to get along but you will never convince me of a meaningful difference between 'less' and 'more'

@thatcks: In the genius Unix tradition, the answer is that less is more.

(Sadly, this is true at about 3 to 4 levels. It's a long story.)

In the beginning, by which we mean V7, Unix didn't have a pager at all. That was okay; Unix wasn't very visual in those days, partly because it was still sort of the era of the hard copy terminal. Then along came Berkeley and BSD. People at Berkeley were into CRT terminals, and so BSD Unix gave us things like vi and the first pager program, more (which showed up quite early, in 3BSD, although this isn't as early as vi, which appears in 2BSD). Calling a pager more is a little bit odd but it's a Unix type of name and from the beginning more prompted you with '--More--' at the bottom of the screen.

All of the Unix vendors that based their work on BSD Unix (like Sun and DEC) naturally shipped versions of more along with the rest of the BSD programs, and so more spread around the BSD side of things. However, more was by no means the best pager ever; as you might expect, it was actually a bit primitive and lacking in features. So fairly early on Mark Nudelman wrote a pager with somewhat more features and it wound up being called less as somewhat of a joke. When less was distributed via Usenet's net.sources in 1985 it became immediately popular, as everyone could see that it was clearly nicer than more, and pretty soon it was reasonably ubiquitous on Unix machines (or at least ones that had some degree of access to stuff from Usenet). In 4.3 BSD, more itself picked up the 'page backwards' feature that had motivated Mark Nudelman to write less, cf the 4.3BSD manpage, but this wasn't the only attraction of less. And this is where we get into Unix fossilization.

In a sane world, Unix vendors would have either replaced their version of more with the clearly superior less or at least updated their version of more to the 4.3 BSD version. Maybe less wouldn't have replaced more immediately, but certainly over say the next five years, when it kept on being better and most people kept preferring it when they had a choice. This would have been Unix evolving to pick a better alternative. In this world, basically neither happened. Unix fossilized around more; no one was willing to outright replace more and even updating it to the 4.3 BSD version was a slow thing (which of course drove more and more people to less). Eventually the Single Unix Specification came along and standardized more with more features than it originally had but still with a subset of less's features (which had kept growing).

This entire history has led to a series of vaguely absurd outcomes on various modern Unixes. On Solaris derivatives more is of course the traditional version with source code that can probably trace itself all the way back to 3BSD, carefully updated to SUS compliance. Solaris would never dream of changing what more is, not even if the replacement is better. Why, it might disturb someone.

(I am not a fan of Solaris's long standing refusal to touch anything. Well, Solaris before Oracle took it over. I haven't looked at Solaris 11, just at Solaris 10 and derivatives like Illumos.)

Oddly, FreeBSD has done the most sensible thing; they've outright replaced more with less. There is a /usr/bin/more but it's the same binary as less and as you can see the more manpage is just the less manpage. OpenBSD has done the same thing but has a specific manpage for more instead of just giving you the less manpage.

On Linux, more is part of the util-linux package but its manpage outright tells you to use less instead:

more is a filter for paging through text one screenful at a time. This version is especially primitive. Users should realize that less(1) provides more(1) emulation plus extensive enhancements.

Given the comments in the manpage, it appears that this version of more is directly derived from the source code of one of the BSD versions. It might even have fewer changes from the original than the Solaris version.

So, now you can see why I say that less is more, or more, or both, at several levels. less is certainly more than more, and sometimes less literally is more (or rather more is less, to put it the right way around).

unix/MoreAndUnixFossilization written at 00:49:57



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.