2013-11-24
Baidu's web spider ignores robots.txt (at least sometimes)
On my personal site I've had an entry in my
robots.txt to totally disallow Baidu's web spider for a long time,
for various reasons (including that it doesn't respect nofollow). Recently I was looking at my logs for another
reason, and, well, imagine my surprise when I saw requests from
something with a user-agent of Baiduspider/2.0. Further investigation
showed that this has been going on for several months, although the
request volume was not high. Worse, not only is Baidu's spider crawling
when it shouldn't be, it seems to request robots.txt only very
occasionally (and it usually requests robots.txt with a blandly
non-robot user-agent, or at least I assume that the fetches of
robots.txt from Baidu's IP address range are from their robot).
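(For reference, such a robots.txt entry is the standard full-disallow form; Baidu's spider conventionally identifies itself as 'Baiduspider':)

```
User-agent: Baiduspider
Disallow: /
```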
All of this seems to have started when I switched my personal site to
all-HTTPS, but that's not an excuse for Baidu
(or anyone else). Yes, there are redirections from the HTTP version
involved, but things still work (and I actually wound up making an
exemption for robots.txt). The plain
fact is that Baidu is flagrantly ignoring robots.txt and not even
fetching it.
I don't tolerate this sort of web spider behavior. As a result, on my personal site Baidu is now blocked at the web server level (based on both IP address and user-agent) and I've just added similar blocks for it here on Wandering Thoughts. I'm aware that Baidu doesn't care about a piddly little site like mine blocking them, but I do, so I'm doing it no matter how quixotic it feels.
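As a sketch of what a web server level User-Agent block can look like (assuming Apache with mod_rewrite; this is illustrative, not my exact configuration):

```
# Illustrative Apache sketch: refuse requests whose User-Agent
# mentions Baiduspider (case-insensitive match).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule ^ - [F]
```

An IP address based block would be done separately (for example with Require or Deny directives, depending on your Apache version).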
(I'm writing this entry about the situation because Baidu's behavior makes me genuinely angry, if only a small amount of angry. And bad behavior by a major search engine should be called out.)
2013-11-22
My hack use for Chrome's Incognito mode
These days I have a script that starts Chrome in Incognito mode (or at least opens another window if Chrome is already running, which I am starting to arrange). It looks like this:
#!/bin/sh
exec google-chrome --incognito --new-window "$@"
(Note that --new-window is an undocumented option and thus may change
someday.)
I don't do this because I like Chrome or because I need anonymity that often (and anyway my testing Firefox is about equally anonymous). I do this because Chrome has an extremely valuable option for Incognito mode: you can turn off specific extensions in it (more exactly, extensions don't run in Incognito mode unless you specifically enable them there). I use this to turn off NotScripts and FlashBlock and all of the other Chrome extensions that protect me from the unfiltered Internet in its natural crud-infested state.
In other words, Chrome's Incognito mode has become my 'just make this stuff work, I don't care any more' browser (especially as Chrome integrates its own supported version of Flash so I don't have to worry about that part either). And just like my testing Firefox, I don't have to worry very much about being contaminated by cookies and whatnot because Chrome throws them all away afterwards.
(Note that Chrome only discards cookies et al from your Incognito windows when all of them have been closed. If you keep one around for some reason, perhaps because it is running some important internal app for you, those cookies will live on for the duration.)
I would prefer to do this in Firefox, and in theory I could do it with another profile for my testing Firefox (one that didn't have the relevant extensions installed). However I've never gotten alternate profiles to work very well in Firefox and I'd still have the Flash issues. In the meantime Chrome Incognito is convenient for this.
(I've even modified my custom environment so that I can dump a URL into
'ichrome' as easily as I can into my regular Firefox session. It's a
bit sad that I use it (or need to use it) that often, but I do and it is
oh so convenient.)
PS: writing this up was inspired by this tweet by @saintaardvark, which led to me discovering that I'm not alone in using Chrome Incognito for this.
2013-11-09
Google Feedfetcher is still fetching feeds and a User-Agent caution
Feedfetcher was Google's feed fetching backend for Google Reader, which as you may remember was shut down on July 1st this year (to generally mixed feelings). At the time of that shutdown Google was pretty definite about how the service was gone, its data was not being retained, and there would be no recovery or resumption possible. One would normally expect that the feed fetching backend would also be shut down at the same time.
Well, no, of course not. This is Google, after all, the new home
of 'we don't care because we don't have to' (cf). Google Feedfetcher
is still pulling my feeds more than four months after the shutdown
of Google Reader. In fact it's worse than that; the claimed readership
numbers listed in its User-Agent have barely budged from the time
when Google Reader was running (this is what is known as a flat-out
lie). As irritating things involving Google go, this is a drop in
the bucket. Still, I've recently decided that I've had enough, so
I've blocked their user-agent. It turns out that this exposes a
little issue that you may want to think about when you create
User-Agent strings.
Here is the User-Agent header for Google Feedfetcher:
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 445 subscribers; feed-id=1422824070729197911)
Here is the User-Agent header for Feedly:
Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)
If you block Google Feedfetcher using a case-independent match you'll
probably also block Feedly unless your User-Agent parser is really
smart. It would be easy to miss this when you set up blocks unless you
make a habit of monitoring what they match (and I suspect that many
people don't do that, any more than they have a fancy User-Agent
parser instead of a general regexp engine).
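As a concrete sketch of the problem (is_blocked is an illustrative name, not anything from my actual blocking setup), a naive case-insensitive match on the obvious pattern hits both User-Agent strings:

```shell
# Hypothetical sketch: a case-insensitive regexp match on
# 'feedfetcher-google' matches Google's own User-Agent and also
# Feedly's 'like FeedFetcher-Google' tagline.
is_blocked() {
	# succeed (exit 0) if the User-Agent matches the block pattern
	printf '%s\n' "$1" | grep -qi 'feedfetcher-google'
}

ua_google='Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 445 subscribers; feed-id=1422824070729197911)'
ua_feedly='Feedly/1.0 (+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)'

is_blocked "$ua_google" && echo "blocked: Google Feedfetcher"
is_blocked "$ua_feedly" && echo "blocked: Feedly as well"
```

(A case-sensitive match on Google's exact 'Feedfetcher-Google' capitalization happens to miss Feedly here, since Feedly writes it 'FeedFetcher-Google', but anchoring your pattern to the start of the header is the more robust fix.)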
By the way, if this happens I would argue that it is more or less
Feedly's fault here. There are quite a lot of feed fetchers that do not
feel the need to drop Google Feedfetcher's name in their User-Agent
header and the way that Feedly is doing this, combined with Google's own
User-Agent formatting, makes it very easy for a match to hit both. If
Feedly wants to communicate the similarity to webmasters reading their
logs they could have used a different phrasing that would not run this
risk.
(Of course I rather suspect that Feedly actively wanted their feed fetcher to be mistaken for Google Feedfetcher by automated code; it's just that when they planned this they expected it to be a good thing.)
2013-11-03
Wikitext needs a better way of writing tables
For the most part good wikitext dialects do a pretty good job of letting you write formatting stuff in a way that looks and feels very natural. Markdown is an excellent example here; much of its formatting looks basically how you'd write it in a plain text document such as a plaintext README or email message (and this is by deliberate design). I will semi-modestly claim that DWikiText (my wikitext dialect here in DWiki) does likewise, again by design, although it probably isn't quite as good here as Markdown is.
But there is one exception to this. Namely, I've yet to see a natural looking syntax for tables in wikitext. The qualification is important because there sort of is a natural syntax for tables in plain text; if you need to present such a table you generally line up the columns with enough whitespace between them to make it plain that this is a table and maybe indent the whole thing a bit to make it stand out from the regular flow of text. The problem is that this syntax is relatively obvious to humans but generally is terribly ambiguous and hard to process for computers.
Past that I've seen a number of syntaxes that look semi-okay, generally involving explicitly drawing table and cell borders in some way, but they all seem relatively awkward and unnatural to me (this includes the syntax in DWikiText). None of them look like something that I'd put in ordinary plain text and that would be immediately obvious to a reader. Possibly the problem is simply too hard and the real solution is some form of formatting that's not natural but is clear and easy to read and write.
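To make the contrast concrete, here is the kind of whitespace-aligned table that looks natural in plain text, followed by the same table in a generic explicit-border wikitext style (an illustrative sketch; the details vary from dialect to dialect, and this is not DWikiText's actual syntax):

```
Host     Role        Years
apps0    webserver   10
comps1   fileserver   4

| Host   | Role       | Years |
| apps0  | webserver  | 10    |
| comps1 | fileserver | 4     |
```

The first form is what people actually write in README files and email; the second is what computers can parse unambiguously, and it's the gap between the two that I find unsatisfying.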
(I suspect that one doesn't want to borrow, eg, the tbl syntax. It may
have seen a lot of use but I don't think it's exactly the clearest thing
to read or write.)