2014-03-02
Googlebot is now aggressively crawling syndication feeds
I'm not sure how long this has been going on (I only noticed it recently) but Googlebot, Google's search crawler, is now aggressively crawling syndication feeds. By 'aggressively crawling' I mean two things. First, it is fetching the feeds multiple times a day; one of my feeds was fetched 46 times in one 24-hour period. Second and worse, it's not using conditional GET.
I've written before about why web spiders should not crawl syndication feeds and I still believe everything I wrote back then (even though I've significantly reduced the number of feeds I advertise since those days). My feed URLs are all marked 'nofollow', a declaration that Googlebot generally respects. And even if Google was going to crawl syndication feeds, the minimum standard is implementing conditional GET instead of repeatedly spamming fetch requests; the latter is the kind of thing that gets you banned here.
I might very reluctantly accept Googlebot crawling a few syndication feed URLs if it properly implemented conditional GET. Then it might be a reasonable way to find updated content (although Googlebot accesses my sitemap much less frequently) and I'd passively go along with the 800-pound gorilla of search traffic. But without conditional GET it's my strong opinion that this is abuse, plain and simple, and I have no interest in cooperation.
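To make 'conditional GET' concrete, here's a minimal sketch in Python of what honouring it looks like on the server side (an illustration, not DWiki's actual code; render_feed() is an assumed helper that returns the feed body and its modification time):

from email.utils import formatdate, parsedate_to_datetime

def feed_app(environ, start_response):
    # Hypothetical WSGI handler for a feed URL.
    feed_body, feed_mtime = render_feed()  # assumed helper: (bytes, unix time)
    ims = environ.get('HTTP_IF_MODIFIED_SINCE')
    if ims:
        try:
            if parsedate_to_datetime(ims).timestamp() >= feed_mtime:
                # Unchanged since the client's copy: a 304 costs almost nothing.
                start_response('304 Not Modified', [])
                return [b'']
        except (TypeError, ValueError):
            pass  # malformed date: fall through to a full response
    start_response('200 OK',
                   [('Content-Type', 'application/atom+xml'),
                    ('Last-Modified', formatdate(feed_mtime, usegmt=True))])
    return [feed_body]

A crawler that sends If-Modified-Since gets a handful of header bytes back most of the time; one that doesn't forces a full feed render and transfer on every poll.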
So, in short: I suggest that you check your syndication feed logs to see if Googlebot is pounding on them too and if it is, block it from accessing those URLs. I doubt Google is going to change its behavior any time soon or even notice, but at least you can avoid donating your site resources to an abusive crawler.
(As I expected, Googlebot is paying absolutely no attention to days of 403 responses on the feed URLs it's trying to fetch. It keeps on trying to fetch them at great volume, to the tune of 245 requests so far today for 11 different URLs.)
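For illustration, here's roughly what the blocking looks like in Apache with mod_rewrite (a sketch, not my exact configuration; adjust the query string and path tests to match your own feed URLs):

# Refuse Googlebot on syndication feed URLs only, with a 403.
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{QUERY_STRING} ^atom$ [OR]
RewriteCond %{REQUEST_URI} /feeds/
RewriteRule ^ - [F]

Here the query string test matches bare '?atom' feed URLs and the '/feeds/' path is a placeholder; the [OR] joins the two URL tests, and both are ANDed with the user-agent check.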
Sidebar: Some more details
First, this really is Googlebot; it comes from Google IP address ranges and from specific IPs with crawl-*.googlebot.com reverse DNS such as 66.249.66.130.
Second, in the past Googlebot has shown signs of supporting conditional GET on syndication feeds. I have historical logs that show Googlebot getting 304s on syndication feed URLs.
Third, based on historical logs I have for my personal website, this appears to have started happening there around January 13th. There are sporadic requests for feed URLs before then, but January 13th is when things light up with multiple requests a day.
Cool URL fragments don't change either
I was all set to write an entry about how one limitation of dealing with site changes (from one domain to another, from HTTP to HTTPS, or just URL restructuring) via HTTP redirects was that URL fragments fell off during the redirections. Then I decided to actually check the end URLs I was getting and discovered that I was wrong. Browsers do preserve URL fragments during redirection; when a redirect's Location header has no fragment of its own, browsers carry the original fragment over to the new URL (although you may not see this if the URL fragment is cut off because the full URL is long). What was really going on in my case is that the site I was dealing with had violated a sub-rule of 'Cool URLs don't change'.
Simply stated, the sub-rule is 'URL fragments are part of the URL'. Let me rephrase that:
If you really care about cool URLs, you can't change any HTML anchors once you create them.
The name of the anchor must remain the same and the meaning (the place it links to) must also stay. This is actually a really high bar, probably an implausibly high one, since HTML anchors are often associated with very specific information that can easily become invalid or otherwise go away (or simply be broken out to a separate page when it becomes too big).
Note that this implies that simply numbering your HTML anchors in sequential order is a terrible thing to do unless you can guarantee that you're not going to introduce or remove any sections, subsections, etc. It's much better to give them some sort of name. Effectively the anchor's name should be seen as a unique and stable permanent identifier. Again this is a pretty high bar and is probably going to cause you heartburn if you try to really carry it out.
This somewhat tilts me towards a view that HTML anchors should be avoided. On the other hand, it's often easier to read one large page with lots of information (exactly the situation that calls for HTML anchors and an index at the top) than to keep clicking through a lot of small pages. Today's side moral is that web page design can be hard.
(I'd say that the right answer is small pages with some JavaScript so that one page seamlessly transitions to the next one as you read without you having to do anything, but even that's not a complete solution since you don't get things like 'search in (full) page'.)
I suppose what I really wish is that web servers got URL fragments in the URL of the HTTP request but normally ignored them. Then they could pay attention to the fragment identifier when doing redirects and do the right thing if they had any opinions about it. But this is a wish that's a couple of decades too late; I might as well wish for pervasive encryption while I'm at it.
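You can see the current state of affairs in miniature with Python's standard URL handling: the fragment is split off entirely on the client side, and the request target that actually goes on the wire has no trace of it (the URL here is made up):

from urllib.parse import urlsplit, urlunsplit

parts = urlsplit('https://example.org/docs/page#section-3')
print(parts.fragment)  # -> section-3 (known only to the client)
# The request target sent to the server omits the fragment entirely:
print(urlunsplit(('', '', parts.path, parts.query, '')))  # -> /docs/page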
2014-02-26
Saying goodbye to the PHP pokers the easy way
If you have a public web site or a web app, you almost certainly have people trying drive-by PHP exploits against you whether or not your site shows any sign of using PHP. The people (or software) behind these don't care; they seem to operate by taking one of your URLs and slapping the page name (and sometimes query parameters) of a vulnerable bit of PHP on the end, then seeing if it works. I see requests like:
GET /~cks/space/blog/linux/images/stories/food.php?rf
POST /~cks/space/blog/linux/index.php?option=com_jce&task=plugin&plugin=imgmanager&file=imgmanager&version=1576&cid=20
POST /~cks/space/blog/linux//components/com_jnews/includes/openflashchart/php-ofc-library/ofc_upload_image.php?name=guys.php
GET /~cks/space/blog/linux//components/com_jnews/includes/openflashchart/tmp-upload-images/guys.php?rf
If you have anything other than a static site, these requests are
at least annoying (in that they're forcing your code to run just
to give the attacker a 'no such URL' answer). If you log potential
security issues (such as odd POST content-types or the like) they
can also make your logs nag at you. Recently I got irritated at
these people and decided to make them go away the easy way.
The easy way here is to have your web server handle refusing the requests instead of letting them go all the way to your actual app code. Front end web servers generally have highly developed and very CPU-efficient ways of doing this (exactly how varies with the web server), plus this means your app code won't be logging any errors because it's never going to see the requests in the first place. In my case this host runs Apache and so the simplest way is a RewriteRule:
RewriteRule ^.*\.php$ - [F,L]
No fuss, no muss, no CPU consumption from my Rube Goldberg stack, and no more log messages.
(Arguably this generates the wrong HTTP error code, if you think that matters, since it generates a 403 instead of the theoretically more correct 404.)
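If that bothers you, mod_rewrite can produce the 404 directly; when the R flag is given an error status code, Apache drops the substitution and aborts the request with that status:

# Variant that answers 404 instead of 403
RewriteRule ^.*\.php$ - [R=404,L]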
Of course you can only do this trick if you can guarantee that
you'll never use a URL ending in .php. This isn't necessarily
something you can assert for a general use web program (cf), but it often is something you can say
about your particular site. It's certainly something I can say about
here; even though I theoretically could create a perfectly
valid URL ending in .php (although it wouldn't be a PHP page), I'm
never going to.
(And if I do, I can change or remove my RewriteRule.)
2014-02-22
A subtle advantage of generating absolute path URLs during HTML rendering
If you're writing a multi-page web application of some sort, sooner
or later you'll want to turn some abstract name for another page
into the URL for that page, or more exactly into a URL that you can
put into a link on the current page. For a non-hypothetical example
you might be writing a wiki or a blog engine and linking one entry
to another one. When you're doing this, a certain sort of person
will experience a little voice of temptation urging them to be
clever and generate relative paths in those URLs. After all if
you're rendering /a/path/page1 and linking to /a/path/page2 you
can simply generate a '<a href="page2">' for your link instead
of putting the whole absolute path in.
(And this sort of cleverness appeals to any number of programmers.)
The obvious reason not to do this is that it's more work. Your code almost certainly already has to be able to generate the absolute URLs for pages, while converting those absolute URLs to relative ones will take additional code. So let's assume that you have a library that will do this for free. Generating relative URLs is still a bad idea because of what it does to your (potential) caching.
An HTML fragment with absolute path URLs is page-independent; it can be included as-is anywhere on your site and it will still work. But an HTML fragment with relative path URLs is page-dependent. It works only on a specific page and can't be reused elsewhere, or at least it can only be reused in certain select other pages, not on any arbitrary page. Relative path URLs require more cache entries; instead of caching 'HTML fragment X', you have to cache 'HTML fragment X in the context of directory Y' (and repeat for all the different Ys you have). Some web apps have a lot of such directories and thus would need a huge number of such cache entries. Which is rather wasteful, to put it one way.
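To make the page-dependence concrete, here's a small Python illustration (not DWiki's code) of what a relativizing helper would produce for the same target URL in two different containing directories; a cache keyed only on the fragment itself can't hold both results:

import posixpath

# The same absolute target yields different HTML depending on
# which page's directory it is relativized against.
target = '/a/path/page2'
print(posixpath.relpath(target, '/a/path'))       # -> page2
print(posixpath.relpath(target, '/b/elsewhere'))  # -> ../../a/path/page2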
This is one of those fortuitous design decisions that I stumbled into back at the start of writing DWiki. I made it due to laziness (I didn't want to write something to relativize links, however nifty it would have been) but it turned out to be an excellent idea due to the needs of caching.
(Note that in most blog engines, one sort of 'HTML fragments' that you will be reusing is blog entries or at least their leadin text. Blogs typically have lots of places where entries appear.)
2014-02-17
File based engines and the awkward problem of special URLs
I was recently asked a good question on Twitter:
@thatcks Do you publish feed URLs on your blog besides '/blog/?atom'? My reader of choice sadly has issues re: dropping the GET param.
The answer is unfortunately not. So, you might reasonably wonder, why do syndication feeds here use a query parameter? The answer is that DWiki (the engine behind Wandering Thoughts) is a file and directory based engine and when you build such an engine, you wind up with a URL namespace problem.
Put simply, when you're presenting a view of a directory hierarchy the user de facto owns the URL namespace. They create valid URLs by creating files and directories, and it's within their power and their right to create even ones with awkward names. If you add your own names to this namespace (for example a 'blog/atom' URL for the blog's Atom syndication feed) you're at risk of colliding with a name the user is creating in the directory hierarchy. Collisions are generally bad, especially when you haven't told the user what happens when one occurs.
I think that there are three main things you can do here. First, you can simply reserve some names in the namespace, ie you tell the user 'you can't create a file or directory called 'atom', that's a reserved name'. There are several versions of name reservation but I think that they're all unappetising for various reasons. Reserved names also give you problems if you want to add features that require new ones, since the user may already be using the name you want to take over.
(This is a familiar issue for programming languages; adding new reserved keywords for things like new syntax is fraught with peril and the possibility that old programs suddenly won't work with your new version because they're using what is now a reserved keyword as a variable name.)
The second and related approach is to fence off certain classes of names as invalid for the user and thus available for your program's synthetic URLs. This can work reasonably well if you create rules that match user expectations and have solid, appealing reasons. For example, DWiki won't serve files with names that start in '.' or end in '~', and so both categories of names are available for synthetic URLs. The drawback of this is that the resulting synthetic URLs generally look ugly; you would have 'blog/.atom' or 'blog/atom~'.
(DWiki uses a few such synthetic URLs but all for transient things, not for any URL that users will ever want to bookmark or pass around.)
The third approach is to take your program's synthetic URLs completely out of the user's namespace, such as by making them query parameters. Even if a user creates a file with a '?' in its name, it simply won't be represented as a URL with a query parameter; to be done right, the '?' will have to be %-encoded in the URL. This approach has two virtues. First, it's simple. Second, it can be applied to any regular URL whether it's a directory or a file, and it doesn't require turning a file into a pseudo-directory (eg going from 'blog/anentry' to 'blog/anentry/commentfeed', which raises the question of what 'blog/anentry/' should mean). DWiki takes this approach, and so syndication feeds and in fact all alternate views of directories or files are implemented as query parameters.
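As an illustration of the general shape of this (a sketch, not DWiki's actual dispatch code, with assumed render_* helpers):

from urllib.parse import parse_qs

def dispatch(environ):
    # The path always names something in the user's file hierarchy.
    path = environ['PATH_INFO']
    # Alternate views hang off query parameters, which no file name
    # in the hierarchy can collide with. keep_blank_values makes a
    # bare '?atom' (no '=value') show up in the parsed result.
    params = parse_qs(environ.get('QUERY_STRING', ''), keep_blank_values=True)
    if 'atom' in params:
        return render_atom_feed(path)  # assumed: Atom view of this path
    return render_page(path)           # assumed: the normal HTML view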
(From the right perspective, a syndication feed is just an alternate view of a directory hierarchy. Or at least that's my story and I'm sticking to it.)
2014-02-16
Why comments aren't immediately visible on entries here
Recently, a commenter on this entry left a comment with a good question that was unrelated to the entry. GlacJAY asked:
Why do I need one more click to see the comments?
The most useful answer is that things remain this way as a deliberate design decision that I've made because of how I want Wandering Thoughts to operate and come across to readers. I could sugar coat this, but I should be honest: the entries are what I really want people to read, not the comments. I see comments as an optional supplement for the entries, similar in spirit to footnotes.
Making it take an extra click to read comments for many URLs is a conscious way of de-emphasising the comments in favour of the entry text. I want you to read the entry text; then you can go on to read comments if you find the idea interesting enough. If I embedded comments on the main entry page, there are some entries (often entries that I care relatively strongly about) where the comments section would come to dominate the overall entry simply because of the relative volumes of text (eg this recent one). I very much don't want that. My writing is the important thing here as far as I'm concerned (and yes I'm biased).
(Related to this, I consider it a feature that you can't start reading an entry and then trivially skip down to the comments partway through. There is at least a little bit of a roadblock.)
This is not a blog design decision that works everywhere and for everyone. Some people want their entries to be the starting point for discussion and interaction; these people clearly want to make their comments more accessible and more prominent than I do here. I read a number of blogs like that, some of them where the comments section can be as interesting as the blog entries themselves.
(Some people go the other way and don't want on-blog comments at all. I don't feel this way and value comments here, but I do feel that comments are here primarily for me instead of for my readers. Which is a reason I'm willing to de-emphasise them for readers in the way I do.)
PS: that comments are treated this way is also caught up in the history of DWiki's original design and intended purpose (which was not to be a blog engine). But that's another story for another entry.
2014-01-06
Some thoughts on blog front pages in the modern era
Once upon a time, the front page of your blog was how people read and followed it. This drove a great many features of the standard blog front page, things like the tropism towards showing full entries and paged navigation that let you go back to the previous N entries and so on. However, that was a very long time ago and things are almost certainly much different today.
I'm sure that some people still read blogs they follow through the front page (I'll admit that I do that for some blogs for assorted reasons). But I don't think that this is the dominant use any more. My standard belief is that most people come to most blogs through web searches, which will put them on individual entries. For people who follow your blog, I think that most will do so either by syndication feeds or by links from the social web. In this sort of environment, what's the purpose of your blog's front page? I can see at least two: it's where you point people for an overview of your blogging when they follow a link from a social web profile (or a GitHub profile or the like), and it's one of the top places people will look if they read an entry then decide they like your writing in general and want to see more of it.
What this suggests to me is that traditional front pages may be effectively obsolete and in need of being rethought for the modern era. For instance, I can imagine a front page that progressively shortens entries as you go along, with the first entry or two shown in full, the next few entries shown with significant excerpts, and then increasingly minimal entries for at least a few more. Perhaps you should also have a 'greatest hits' section afterwards (or as an explicit sidebar on the front page). I also suspect that there's no real point in paged navigation on the front page any more; instead you might as well end the front page with a link to your full archives. Your front page would still be a starting point for people reading your blog but it would be a different sort of starting point, one more oriented towards a first time (and one time) visitor.
(This uses ideas and practices from Peter Donis and Aristotle Pagaltzis that they mentioned in comments on this entry.)
Of course all of this is rambling theorization so far, uninformed by any attempt at real research. I'm sure that people have written plenty of things about design patterns for modern blogs, and people may even have measured how traffic flows to various places on various sorts of blogs; if I was serious I'd need to find out what the current state of the art is. As usual, I lack the motivation and energy for that sort of large scale design overhaul, among other things.
(That sort of a major redesign is edging close to a 'blow up the world' exercise for me and if I did that a whole lot of things would change, more because it implies a major rethink about how the blog operates than because it would require a major code change.)
2014-01-04
One aspect of partial versus full entries on blog front pages
One of the eternal discussions and differences in blogging is whether your blog's front page has full entries or just some form of excerpts, with readers having to click through to read full entries. There are plenty of blogs that go either way and I expect that there are decent arguments for both positions. Wandering Thoughts is a full-entries blog for a very simple reason: I happen to think that full-entry blogs are simply easier, not at a technical level but at a writing level.
(One argument for partial-entry front pages I've read is the increasing rise of mobile devices and other things with relatively small screens. Partial-entry front pages give users of such devices a relatively compact overview of your writing without a wall of text effect. Similar logic may apply even on full sized displays if you write a lot of long entries.)
With a partial-entry blog, some portion of the front of your entry is effectively an abstract or a teaser for the full entry. You can't simply write whatever first sentence or paragraph you normally would if you were writing a full entry; you need to always keep this additional usage in mind. I know that I've written any number of first paragraphs that would most emphatically not work as this sort of introductory teaser. For recent examples of what I'm talking about, the first paragraph of this entry would probably work fine but I'm pretty sure that the first paragraph of this one doesn't really.
(I'm picking the first paragraph here simply as a common division point and using it as an example. It's not required to always be this, and in fact you probably want to customize the division point on an entry-by-entry basis rather than fit everything into the Procrustean mold of a single teaser paragraph.)
I also don't think that good writing requires you to write first paragraphs that work this way. Sometimes you will be writing in a form where the first paragraph naturally frames your thesis or otherwise is a good introduction, but not always; there are perfectly good forms where this doesn't happen and you can't neatly slice off some reasonable amount of the front and have something that will draw people in. At the very least I believe that even if this way is arguably better writing, it's neither clearly superior nor easy; you will be working harder to carefully craft the lead-in than you would if you wrote the entry without having to consider this, and you are probably not going to get a major overall quality payoff for it.
So the short version is that Wandering Thoughts has full entries on the front page because I don't want to make my writing that much harder by thinking about division points and standalone first paragraphs and so on when I'm writing entries.
(Of course real blog usability suggests that this whole issue may not matter too much. How many visitors even look at your front page anyways? (Perhaps I should generate stats for that someday.))
2013-12-29
Broad thoughts on tags for blog entries
Yes, I know, tags are on my mind lately. In particular I've been thinking about what I want to do with them. Ultimately what it comes down to is supporting real blog usability, specifically both encouraging and rewarding blog visitors for exploring outwards from whatever entry they initially landed on. When I wrote that entry I said that the most powerful way to do that was probably some sort of 'related entries' feature; tags are an obvious way of providing that.
There is an important corollary: for this to work, the tags must not merely lead to related entries; the entries must be related in a way that your visitors are interested in. Some tags will be too general to be useful (these are really broad categories) while others will be too uninteresting or obscure. This means that creating useful tags requires thinking about the relationships that visitors will want to explore; in other words, what it is about any particular entry that people will want to read more of.
(This is one reason that I think tags will be somewhat retrospective; you won't necessarily realize those interesting relationships until you have another entry to relate the first entry to.)
Also, tags aren't enough by themselves because they are too unspecific. There are at least three sorts of more specific relationships that I think will get lost in a general tag cloud and should be handled differently by at least the blog's UI: 'related entries', the more specific form of related entries that is 'entries in a series', and 'this entry is updated by ...'. Related entries is more specific than merely sharing tags; one way I can look at it is entries that share a topic even if they aren't specifically a series. Entries that are one entry in a series should have strong support in the UI for real blog usability because these are the entries that a visitor is most likely to want to read more of if they liked their initial entry.
(So in UI priority it should be 'this entry is updated by ...', 'entries in series', 'related entries', and then general tags, based on what I expect visitors to be most interested in and what's most important.)
In thinking about this I've wound up with the feeling that tags are going to work quite well for certain sorts of entry to entry relationships but not necessarily very well for others. Probably I won't fully understand this until (and if) I implement some sort of tags and other relationships in DWiki and start using them.
(As a result, any scheme I set up in DWiki should be flexible about what sorts of relationships it can associate with an entry or a bunch of entries. I will probably want to use it for more than tags.)
2013-12-27
A reason to keep tags external in 'entry as file' blog engines
In EntryAsFileTagProblem I ran over the problem 'entry as file' blog engines have with tags (because they need efficient two-way queries on the mappings between tags and entries) and suggested that one solution was a completely external mapping file (or files) of tag information. I've since realized that there is an additional reason to like this approach.
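To make 'external mapping file' concrete before I go on, here's a sketch of one possible scheme (my illustration, not something DWiki implements): a flat file with one 'tag entry-path' pair per line, which gives you both directions of the mapping almost for free:

from collections import defaultdict

def load_tags(fname):
    # Each line is 'tag entry-path', eg 'python blog/tech/SomeEntry'.
    tag_to_entries = defaultdict(set)
    entry_to_tags = defaultdict(set)
    with open(fname) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip blank or malformed lines
            tag, entry = fields
            tag_to_entries[tag].add(entry)
            entry_to_tags[entry].add(tag)
    return tag_to_entries, entry_to_tags

Rebuilding both indexes on every load is cheap until you have a lot of entries, and at that point you can cache the result.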
Put simply, having tag/entry mappings in an external file allows you to change the tags associated with an entry without editing the actual entry's file; in particular, you can retrospectively add tags to old entries. This is based on my feelings about two issues (feelings that other people may not share).
First off, I think that a decent amount of tagging is probably going to be done after the initial publication. Tagging is taxonomy and sometimes it's only going to be obvious when you write the second (or third, or whatever) entry that touches on a particular thing. In addition I'm biased against single entry tags (they're not merely pointless but distracting) so I'm not even likely to put in obvious tags unless I'm relatively confident that I'll write at least a second entry with the same tag.
(Fundamentally tagging as exposed in a blog is about luring people to read additional entries by giving them a way to follow interests. If you're interested in a particular tag, you can find and read other entries that have that tag. If there are no other entries when you click through the tag's link, I've wasted your time. I can use single-entry tags internally for tracking or taxonomy purposes, but I shouldn't expose them to visitors.)
Second, I'm strongly biased against modifying entry files after their initial publication; I would like to do it as little as possible. If the master source of tag information is in the file and it's common to modify tags after publication, well, I'm going to have to edit entry files much more than I'd like. Putting the same information into a separate set of files is less problematic this way.
(One issue with editing entry files is that it opens you up to making larger edits than you intend, because the tag metadata is mingled with other metadata and the actual entry text. No matter what you do to a tag metadata file, it only affects tag metadata.)