Wandering Thoughts: Recent Entries

2013-05-23

Why web robots sending Referer headers is wrong

I've written before on my view that web robots of all sorts should never send a Referer header. In those entries I mostly said 'don't do that' without giving a solid philosophical argument about why, so today I feel like changing that.

(Not that a philosophical argument actually matters. Proper behavior on the web is defined by social convention, ie by what lots of other people do and expect, not by arguing with people over what makes sense. Whether or not you agree with a social convention you break it at your peril, and today robots not sending Referer headers is a well established social convention that I will ban you for violating. And anyways the people who should read this never will.)

There are two philosophical reasons why it's wrong for robots to send Referer headers. The first is inherent in what the Referer header means, namely 'I just followed a link from page <X>'. This is a description of human behavior but not really of robot behavior; almost no web robot actually traverses the web in that way, finding links and immediately following them. If you crawl web pages, accumulate links, and then some time later crawl those links, you are not 'following a link' in any conventional sense. Worse, what happens if you discover the same link through multiple source documents? Which document gets 'credit' and appears in Referer?

(Yes, yes, this is not quite the spec definition, which kind of permits the 'I found it here' meaning that robots sometimes use. It is instead the practical definition of the header, as defined by how most everything behaves.)
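To make the 'which document gets credit' problem concrete, here is a minimal sketch of a crawler's link queue. All of the names in it (`LinkQueue`, `build_request`, the `ExampleBot` User-Agent string and its URL) are made up for illustration and don't describe any real robot. It shows a link picking up several source documents before it is crawled, and the simple way out: send no Referer at all.

```python
import urllib.request
from collections import OrderedDict

# Hypothetical User-Agent; a real robot should identify itself and its operator.
USER_AGENT = "ExampleBot/0.1 (+http://example.org/bot.html)"


class LinkQueue:
    """Accumulates links to crawl later, remembering where each was seen."""

    def __init__(self):
        # url -> set of pages that the link was found on
        self.sources = OrderedDict()

    def add(self, url, found_on):
        self.sources.setdefault(url, set()).add(found_on)

    def pending(self):
        return list(self.sources)


def build_request(url):
    """Build the request for a queued URL, deliberately without a Referer."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Once a URL has two or more entries in `sources`, no single Referer value would be an honest one, which is part of why leaving the header out is the simple answer.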

So, you say, you don't care; you want to use Referer as a kind of 'this is what links to you' field for servers. I can summarize a bunch of problems here by saying that the Referer field is a terrible way to communicate this information to web operators, fundamentally because you are trying to use a side effect of HTTP requests to pass on what may be a huge amount of information. If you actually want to be useful you should make this information available on your own web site where people can see and fetch it in bulk.

Finally, the brutal truth is that 'who links to me' is far less interesting than 'who is sending human traffic to me (right now)'. By far the most valuable part of Referer is information on where real (human) visitors are coming from, to the extent that it's possible to find this out. Being read by people is the ultimate purpose of most web pages, which makes what places are the source of traffic and active links something of decided interest to us. And this sort of human behavior has very little to do with either robot behavior or what potential links exist out there in the world. Mingling either your robot's actions or a 'helpful' attempt to tell us about the latter into this information is not doing us any favours; rather the contrary, in fact (this is one large reason that I react angrily to robots sending Referer).

(There is also the inconvenient fact that once you're operating a decent sized site you're not likely to really care about who links to you because there will be far too many links out there, most of them in increasingly obscure and unimportant places. The links you do care about are exactly the links that send you significant traffic.)

WhyNoRefererForRobots written at 00:25:17

2013-05-21

Diffbot's bad Referer header

Today a web spider called 'Diffbot' (run by diffbot.com) made a whole bunch of requests here, all of which failed. They failed because, just as it has repeatedly done in the past, it made them all with a Referer header of 'http://news.google.com/' and this behavior long ago led me to ban it entirely from here.

There are a number of things wrong with this header. The first is that, to steal from the old Trix commercials, 'silly robot, the Referer header is for humans'. I've written about this before at some length and doing it here is generally a good way to get your spider banned.

(I have a philosophical ramble about why this is the correct view, but it's going in another entry.)

The second is that, of course, this Referer value is a flaming lie in two different ways. Diffbot in no way shape or form traveled from news.google.com to the whole collection of URLs here that it attempted to crawl with that Referer header and on top of that, news.google.com does not link to here at all. Diffbot made up the header from whole cloth. I react very badly to web spiders that lie to me at the best of times (even if they aren't spraying junk over my referer logs).

Diffbot and its operators may or may not be legitimate, or at least honest about what they're doing; I have no particular opinions on that. But they are unquestionably operating a web spider that routinely lies. I have no idea why and really, I don't care; I was doing them a favour by letting them crawl me and I can and will withdraw that favour if they irritate me.
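For illustration only, here is one way a site could refuse robots that send a Referer, written as WSGI middleware. This is a sketch under assumptions, not a description of how the ban here is actually implemented: the `ROBOT_MARKERS` list and the `NoRobotReferer` class are hypothetical, and matching on User-Agent substrings is a crude heuristic.

```python
# Hypothetical list of User-Agent substrings that identify robots.
ROBOT_MARKERS = ("bot", "spider", "crawler")


def is_robot(user_agent):
    ua = user_agent.lower()
    return any(marker in ua for marker in ROBOT_MARKERS)


class NoRobotReferer:
    """WSGI middleware: answer 403 to robots that send a Referer header."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        referer = environ.get("HTTP_REFERER", "")
        if referer and is_robot(user_agent):
            body = b"Robots should not send Referer headers.\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else goes through to the real application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application is then `application = NoRobotReferer(application)`.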

(See also my technical requirements for web spiders and my standards for responsible spider behavior.)

(No, I haven't mailed Diffbot's operators about this behavior. Are you kidding? I'm neither crazy nor stupid. On today's Internet, mailing people about issues is for people that you actually trust.)

DiffbotBadReferer written at 23:20:49

2013-04-21

Why a free SSL Certificate Authority is not horrifying

Back in this entry I casually mentioned in passing that there is a CA that will give you completely functional SSL certificates for free. To some people this will be horrifying; after all, as the story goes, SSL certificates are supposed to cost money so that they mean something and verify your identity (well, your website's identity).

The truth of what is going on here is that these free certificates contain exactly as much verification of your identity as everyone else's. In fact they may contain more verification, because this CA actually performs automated tests to verify that you have some control over the domain you want a certificate for; I don't know how much checking other CAs do besides making sure that they can charge your credit card. This particular CA is simply being honest about how much this particular 'service' costs to provide, ie essentially nothing. So they give you basic SSL certificates for free and charge you if you want additional features.

(There are a number of CAs that will give you free but short duration SSL certificates for testing purposes. This CA gives year-long ones and will happily issue you new ones for the next year.)

Given my long-standing irritation with what I've called the SSL CA racket, I'm kind of glad that there is a CA that is willing to be honest about exactly what's going on. If it horrifies people and offends them that such a CA is trusted by browsers, well, good, maybe it will spark a little reflection about what SSL CAs are really providing and not providing.

On a pragmatic basis, given that SSL certificates are a commodity and you can now obtain this commodity for free (which demonstrates its actual natural price) I see no reason to pay for basic SSL certificates any more.

(I continue to not name the SSL CA for a number of reasons including that I don't feel like doing their marketing for them. It isn't difficult to work out what CA it is, either with some web searches or by checking the SSL certificate chain for the website I mentioned in the earlier entry.)
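Checking who issued a website's certificate can be sketched with Python's standard `ssl` module. The helper names `fetch_peer_cert` and `issuer_fields` are made up for this example, and it only reports the immediate issuer of the server's own certificate, not the full chain.

```python
import socket
import ssl


def fetch_peer_cert(host, port=443, timeout=10):
    """Connect to host:port and return the server certificate as a dict."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Passing server_hostname also sends the name to the server via SNI.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def issuer_fields(cert):
    """Flatten the nested 'issuer' tuples from getpeercert() into a dict."""
    fields = {}
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            fields[key] = value
    return fields
```

Something like `issuer_fields(fetch_peer_cert("www.example.org")).get("organizationName")` would then give the issuing organization's name, with the hostname here being a placeholder.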

Sidebar: what I mean by a basic SSL certificate

By a basic SSL certificate I mean one for a single name without wildcards. Single name certificates are slightly inconvenient but my impression is that SNI support is now common enough in both servers and (modern) clients that you can deal with this if you have to.

(I was pleasantly surprised about how few things I tried had problems with SNI after I set it up on various subdomains of my personal domain. Of course smartphones may complicate this pleasant picture.)

SSLFreeCANotHorrifying written at 00:59:51

2013-04-17

Some thoughts on going to HTTPS by default

My Twitter feed recently dropped a link to Tim Bray's Private By Default in front of me so I read it, nodded along in agreement, and started thinking about doing it myself for my personal domain. The technical side was easy and pain-free, since there's a Certificate Authority who'll give you free basic SSL certificates. But that's as far as I've gone due to what I've come to think of as the problem of really committing to HTTPS.

If I was doing this seriously, I would redirect all HTTP traffic to the HTTPS version of my site (because otherwise much of the existing traffic won't shift). But doing that implies an ongoing commitment to HTTPS. If people are using HTTPS URLs I need to keep those URLs working and in turn that means I need a duly CA-approved SSL certificate. Right now I can get such a thing for free but there's no guarantee that this will continue to be the case in the future; at that point, well, I have to cough up some money. And I'm not at all sure that I'm enthused enough about HTTPS everywhere to actually pay for it.
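As a sketch of what 'redirect all HTTP traffic' means mechanically, here is a tiny WSGI application that answers every request with a permanent redirect to the HTTPS version of the same URL. The name `https_redirect_app` is made up, and in practice this would more likely be a line or two of web server configuration than Python code.

```python
from urllib.parse import quote


def https_redirect_app(environ, start_response):
    """WSGI app that permanently redirects every request to https://."""
    host = environ.get("HTTP_HOST") or environ.get("SERVER_NAME", "localhost")
    # Drop an explicit :80 so the https URL uses the default HTTPS port.
    if host.endswith(":80"):
        host = host[:-3]
    path = environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")
    location = "https://" + host + quote(path or "/")
    query = environ.get("QUERY_STRING", "")
    if query:
        location += "?" + query
    start_response("301 Moved Permanently", [
        ("Location", location),
        ("Content-Length", "0"),
    ])
    return [b""]
```

The 301 status is the 'I really mean it' version; using a temporary 302 instead would be less of a commitment, although people would still pick up and save the HTTPS URLs.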

(I agree with all of Tim Bray's arguments for it intellectually. But buying a SSL certificate is not just money, it's also hassle. For that matter, using an SSL certificate is an ongoing hassle if you really care about security because then you get to wade into the great SSL cipher swamp every time a new threat emerges.)

But is this actually a real worry? Presumably I ought to have at least some warning that my next certificate will cost me money; at that point I could start redirecting my HTTPS traffic back to the HTTP version of the site and I should have some amount of time for the redirections to take effect before the certificate expired. In the extreme case I could get the cheapest one-year certificate available to have a full year for the transition (and extremely cheap SSL certificates don't seem likely to go away). Also the HTTPS version of the site wouldn't go away entirely because I'd probably put up a self-signed certificate just to keep the URLs valid (although visitors would get the usual scary browser warnings). How much this affected people in practice would depend on how many saved HTTPS URLs there were for my site out there in the wild.

(In a world of ephemeral social media and search-driven navigation that's probably a good question in general. I have no answers.)

MullingOverHttps written at 01:16:50

2013-04-07

The apparent source of my Firefox memory bloat problems

I recently took another shot at trying to get rid of my long-running Firefox performance problems, which I had narrowed down to garbage collection stalls resulting from memory bloat. The good news is that I seem to have found what was causing my memory problems. The bad news is that it's in extensions that I more or less care about.

The first necessary disclaimer is that I haven't gone through the painstaking work to test extensions in isolation (especially in my normal browsing environment). What I can say is that using just my core extensions of NoScript, FireGestures, It's All Text, the last working version of CookieSafe, and the Mozilla all-JavaScript PDF viewer leaves Firefox's memory usage stable and performance excellent. If I add either or both of Stylish and GreaseMonkey, memory usage climbs slowly but steadily and I see my usual performance issues. Given that GreaseMonkey is a heavily used extension, I suspect that my problems with it are due to either some interaction with my other extensions or with the specific user script that I use. The same may be true for Stylish (although there is one review that suggests other people are having memory problems with it).

(While I haven't seen memory bloat with Status-4-Evar, having it active seems to make Firefox's scrolling somewhat less snappy for me. Without GreaseMonkey and Stylish, the status bar is relatively empty anyways so I'm currently experimenting with disabling S4E.)

Although I called GreaseMonkey and Stylish essential extensions back here, I can in practice live without them. Having mangled Google search results and various badly formatted websites irritates me, but I can sort of live with them (and the cure for the latter is to stop visiting those websites). I wish I didn't have to, so I keep hoping that Firefox will come up with a better solution for whatever is causing these leaks.

(Given that my bloat seemed to involve a lot of compiled JavaScript code sitting around, I'm now wondering if Firefox has something like Java's PermGen issues with loaded code and compiled/JIT'd functions sticking around when they shouldn't.)

MyFirefoxPerformanceIII written at 01:45:42

2013-03-25

Rethinking avoiding Apache

Somewhat recently I wrote about when I'd use a web server other than Apache (despite Apache's temptations). I've recently discovered that I need to change those opinions somewhat; Apache turns out to be much more usable than I expected in a constrained resources situation.

One of my recent hobbies has been testing DWiki in a low-memory virtual machine (as I mentioned once in passing). I did my primary testing using nginx because it had an SCGI gateway, but with that working I decided on a whim to see how Apache plus mod_wsgi would do in the same small VM. To be honest, I expected Apache to explode spectacularly under any sort of real concurrent connection load, driving the virtual machine into the ground in the process.

To my total surprise, this did not happen. Not at all. Instead a more or less stock Ubuntu 12.04 Apache plus mod_wsgi setup handily dealt with all of the load I could throw at it. In my limited testing it was actually slightly faster on average than my nginx setup, dealt better with really extreme numbers of concurrent connections, and still left the machine with free memory. It was also easier to manage than my nginx lashup, which needed a separate system to run and restart the SCGI-based WSGI server that nginx talked to.

Part of this seems to be that Ubuntu 12.04 has sensible (ie small) Apache configuration settings. Another part is that mod_wsgi totally isolates the WSGI serving into separate processes (although they are still Apache processes). But regardless of all of this the whole setup just works and does so in an environment where I had previously expected Apache to be completely unsuitable. I am metaphorically eating my hat right about now.

(If I ever do deploy DWiki into such an environment, Apache plus mod_wsgi is now going to be my first choice. This is not for performance, since I doubt there's any meaningful practical difference, but because it's easier to manage; everything is in one spot and mod_wsgi has good support for easy code reloads.)

Sidebar: a caution about my performance results

Siege, the load tester I was using, reports only the average request time (and the maximum and minimum); it doesn't provide any information about the distribution. It's possible that the distribution of response times is worse with Apache and the average is masking this. To do real testing I'd need to find a more thorough HTTP load tester (well, one with better stats reporting).
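How an average can mask the distribution is easy to show with made-up numbers. The sketch below uses a simple nearest-rank percentile; the `steady` and `spiky` response times are invented for the example and are not measurements of Apache or nginx.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


# Invented response times in milliseconds: same mean, very different shape.
steady = [50] * 100
spiky = [30] * 95 + [430] * 5
```

Both lists average 50 milliseconds, but `summarize(spiky)` shows a 99th percentile of 430 milliseconds where `summarize(steady)` shows 50. A load tester that reports only the mean can't tell these two servers apart.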

RethinkingAvoidingApache written at 22:33:44

2013-03-20

Don't use ab for your web server stress tests (I like siege instead)

Like many other people, I sort of automatically reach for the venerable ab Apache program when I want to do some sort of a web server stress test. I've heard that it has flaws and it's not the best program out there, but surely it's good enough for the basics, right?

Well, no, as I found out recently. I don't know exactly why or what's going on, but ab's concurrency option plain doesn't work; you get nowhere near as much concurrency as you asked for and it claims. Due to my concurrency misunderstanding I got to see this first hand and very vividly. When I ran 'ab -c N' against a test DWiki setup, nowhere near as many worker processes got started and used as there should have been (I believe I asked for 50 concurrent requests and saw only 4 worker processes running, which is very wrong). So my message is simple: do not use ab to test anything you care about. That it's there does not make it worthwhile unless you are very sure that it is not quietly doing something odd on you.
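Counting worker processes is one way to see how much concurrency a load tester really delivers. Another is to have the server side count requests in flight. The following is a minimal in-process sketch of that idea; `InFlightCounter` and `measure_peak` are hypothetical names, and threads stand in here for real requests.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InFlightCounter:
    """Tracks how many requests are being handled at once, and the peak."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.current -= 1
        return False


def measure_peak(concurrency, handler):
    """Run `concurrency` simultaneous calls of handler(counter) and
    return the highest number that were in flight together."""
    counter = InFlightCounter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(handler, counter) for _ in range(concurrency)]
        for future in futures:
            future.result()
    return counter.peak
```

In a real test the `with counter:` block would wrap the server's request handling and `peak` would be logged or exposed somewhere; a peak far below the load tester's claimed concurrency is the sort of thing described above.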

On the other hand I can attest that siege works. When I asked it to make N concurrent requests, well, my worker process count shot right up to what it should have been (in the case of high concurrency, every worker process that I allowed). Siege is also capable of hammering on a fast web server so rapidly that it exhausts your machine's normal range of 28,000 or so local TCP ports. On the one hand this is vaguely annoying. On the other hand I can only describe it as a good problem to have, since it means you are serving requests considerably faster than old sockets can expire out of TIME_WAIT.
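The arithmetic behind running out of local ports can be sketched as follows. The specific numbers are assumptions that are not stated in the entry: a Linux ephemeral port range of 32768 to 61000 (which gives the '28,000 or so' figure) and a 60 second TIME_WAIT period. It also assumes that the client side closes each connection, so the TIME_WAIT sockets pile up on the load testing machine.

```python
# Assumptions, not taken from the entry: an ephemeral port range of
# 32768-61000 and a 60 second TIME_WAIT period.
PORT_LOW, PORT_HIGH = 32768, 61000
TIME_WAIT_SECONDS = 60


def ephemeral_ports(low=PORT_LOW, high=PORT_HIGH):
    """Number of local ports available for outgoing connections."""
    return high - low + 1


def max_sustained_rate(ports, time_wait=TIME_WAIT_SECONDS):
    """New connections per second, to a single server address and port,
    that can be sustained before every local port is tied up in TIME_WAIT."""
    return ports / time_wait
```

Under these assumptions there are 28,233 usable ports and the ceiling is roughly 470 new connections a second to one server IP and port. Sustaining more than that for a minute is what runs the machine out of ports.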

(Siege is not perfect and I have not conducted either an exhaustive test of web server stress testers or a careful validation of the numbers it reports. Plus, if you really care about this you will want not just averages for things like response speeds but also 90th and 99th percentiles and distributions and so on. You may also want a more sophisticated model than just concurrent connections, one that more closely models the real world behavior of people.)

(This elaborates on a tweet I made a while ago.)

AvoidAbUseSiege written at 01:07:17

2013-03-14

What I want out of a web-based syndication feed reader

In light of Google Reader's impending shutdown I've started thinking about what I'd want out of any replacement to it that I switch to. I don't use Google Reader as my primary feed reader (that has always been Liferea); instead, my use is for three somewhat contradictory things:

  • feeds that I want to be able to browse from more than one place.
  • casual reading feeds, where Google Reader's slow expiry of old unread entries is a feature.
  • feeds that I don't want to get lost in the black hole that my Liferea feeds have turned into.

(Unless I really care about a feed, adding it to Liferea usually ensures that I then ignore it; I just have too many things in there. I should probably remove most of my current Liferea feeds but I can't get up the willpower and I can't quite abandon the idea that I'll read those worthwhile entries someday.)

This leads me to think that a number of features are important to me (besides just being web-based in some way, even self-hosted):

  • not a 'river of news' interface where all entries from all feeds are dumped on me at once. A Planet-style interface may work for many people but it doesn't work for my casual reading; I need to be able to pick and choose what I'm going to read at any given point.

  • a notion of unread and read entries where I don't have to read a feed in any specific order; I can skip around, read some entries, and leave others for later (even leave entire feeds for later).

  • unread entries need to expire after a while. Ideally not really fast; say, a month.

  • meaningful visibility of entry contents while I'm browsing things (ie the way Google Reader does it). I don't want to see little snapshots of web pages or anything like that, I want to see some (or all) of the text of an entry.

  • efficient use of space that does not slice things up into a squeezed multi-column layout. I read one entry at a time; I do not need to see two or three columns of them on the screen, forcing the one I want to read into a tiny skinny box.

    (I think I've seen this sort of bad layout called a newspaper like layout, presumably because of a newspaper's multiple columns.)

I'm relatively indifferent to whether or not the feed reader presents entries as simple, readable text (as Google Reader and Liferea do) or whether it makes some attempt to make entries look like they do on the real site (as some other web-based feed readers apparently do). Terrible formatting will just cause me to unsubscribe from a feed, which should be no major loss given what I'm theoretically using this for (mostly).

Unfortunately all of this is a sufficiently complex set of wishes that it implies a web application instead of just a website (although I'm willing to self-host the web app if I can).

(In theory I'd also be happy with a good graphical feed reader program that synced things between multiple machines using some backend. In practice I'm not sure there's any such program whose interface I'd like and that runs on Fedora.)

WebFeedReaderWants written at 00:55:35

2013-02-07

Today's learning experience with CSS: don't be indirect

This is today's learning experience and I will preface it by saying that I am probably doing things wrong and not in the right CSS way. I will present this as a story.

Once upon a time, you write a wikitext to HTML converter and with it some associated CSS. Your wikitext has tables and the tables should be styled in a certain way, so you wrap the entire generated wikitext in a <div class="wikitext"> and write a CSS rule:

.wikitext td { border: 1px; border-style: solid; padding: .3em; }

These tables come out with a nice 1 pixel solid border the way you wanted and also the right padding around everything to look nice.

Your wiki also has some tables that it generates outside of the wikitext. They have HTML like <table class="blogtitles"> and CSS to style them the way you want:

.blogtitles td { padding-bottom: .5em; vertical-align: top; }

.blogtitles td + td { padding-left: 0.5em; }

These tables also come out with the right padding and no border, the way you want them to.

Then much, much later you decide that you want to embed a blogtitles table in the generated wikitext, wrapped in that great big wikitext <div>. You render the whole thing and lo, your blogtitles table comes out looking horrible. For a start, it has borders.

Well, of course it has borders. You said to give it borders: 'every <td> inside a wikitext <div> should have borders' says your CSS, and right there is a (blogtitles) <td> inside a wikitext <div>. Similarly your blogtitles table has all sorts of padding it 'inherited' from (general) wikitext tables. The result of combining the blogtitles CSS with the wikitext tables CSS is probably nothing like what you wanted (and may not look very good).

Your problem (ie, my problem) is that you were indirect when you did not want to be. 'Any <td> inside my <div>' is an indirect way of specifying 'wikitext tables', and as an indirect way it runs the danger of being too general. Which is what happened here. Blogtitles tables are conceptually a completely separate thing and should be styled independently from your regular wikitext tables, but they are being swept up in your dragnet.

The right solution, at least in generated HTML, is to be direct. Generate your wikitext <table>s with an actual class (eg <table class="wikitable">) and then write CSS on that. The CSS doesn't even have to change much. In short, say what you actually mean. You didn't really want to style every <td> inside your wikitext; you wanted to style your wikitables. So you should say this directly (in CSS and in classes) and save yourself a certain amount of hassle and annoyance.
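On the generating side, being direct is a small change. Here is a minimal sketch of a table renderer that always puts an explicit class on the table; `render_table` is a made-up function for illustration, not DWiki's actual code. The matching CSS selector would then be '.wikitable td' instead of '.wikitext td'.

```python
from html import escape


def render_table(rows, css_class="wikitable"):
    """Render rows (lists of cell strings) as an HTML table that
    carries an explicit class to hang CSS on."""
    lines = ['<table class="%s">' % escape(css_class, quote=True)]
    for row in rows:
        cells = "".join("<td>%s</td>" % escape(cell) for cell in row)
        lines.append("<tr>%s</tr>" % cells)
    lines.append("</table>")
    return "\n".join(lines)
```

A blogtitles table generated with `css_class="blogtitles"` is then styled only by its own rules, no matter which <div> it ends up inside.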

(There is probably a really clever way to fix this in CSS that I don't know because I'm mostly CSS-ignorant. Note that I don't consider carefully trying to undo the wikitext table settings to be a clever way.)

The ice is thinner for HTML that isn't automatically generated, because putting classes on things is somewhat more annoying there (especially if you may have a lot of them). I don't pretend to have a nice answer there.

CSSAvoidIndirection written at 00:10:11

2013-02-05

What makes DWiki and other dynamic file based blog engines slow

In yesterday's entry I mentioned that DWiki (the software behind this blog) is pretty much a worst case for a blog engine as far as speed goes. Today I feel like talking about what makes DWiki slow, and by extension the things that can slow down any dynamic file based blog engine. Part of the reason is so that you (if you are considering writing such a thing) can avoid the mistakes that I made.

(Some of the slowness is because chunks of DWiki's code are not exactly the best that they could be, but the issues there are generally dwarfed by the general ones I'm about to discuss.)

For basic background, DWiki is about as pure a dynamic file based blog engine as you could ask for; conceptually it is purely a bunch of views of a filesystem hierarchy (actually of two of them). Each entry and each comment is stored in a separate file in a directory hierarchy (entries are files in category subdirectories and comments are files in a per-entry subdirectory that is itself in a mirror of the entry's regular hierarchy). Entries (and comments) are written and stored in DWiki's wikitext dialect, not HTML, and the time of an entry (or a comment) is the modification time of its file.

This gives DWiki two main slow points. The most obvious one is converting DWikiText to HTML. At the level of a single entry it isn't a terribly bad process, taking about 6 milliseconds to render yesterday's entry (and then about 4 milliseconds to render the sidebar text, which is also wikitext in a file). But at the level of the blog front page this adds up fast; ten entries is already over 60 milliseconds (although per-entry rendering varies by a few milliseconds depending on what's in them). Still, 60 milliseconds is not a terrible killer.

(In retrospect, one of the reasons to use Markdown or some other popular wikitext format is that other people may well write fast HTML converters for you. With a private wikitext, you're on your own.)

The less obvious but much larger slow point is that DWiki has to walk the filesystem any time it needs to know the relationship between entries, or just to find them all. The obvious case is the blog's front page, which needs to find the N most recent entries; in a file based engine like DWiki you do this by walking the filesystem to find all the entry files, stat()ing them to find their timestamp, sorting the list, and taking the top N. More subtly, DWiki also needs to do this walk when displaying individual entries in order to figure out what the next and previous entries are so that it can generate links to them. And if you want to display some sort of calendar of what days or weeks or months have entries? Again you need a walk.
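The front page walk described above can be sketched in a few lines of Python. This is an illustration of the general technique, not DWiki's actual code, and `recent_entries` is a made-up name.

```python
import os


def recent_entries(root, count):
    """Return the `count` most recently modified files under root,
    newest first."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # One lstat() per entry file; this is the cost that grows
            # with the size of the blog.
            found.append((os.lstat(path).st_mtime, path))
    found.sort(reverse=True)
    return [path for mtime, path in found[:count]]
```

Finding the next and previous entries for a single page needs the same full sorted list, which is why displaying individual entries is not any cheaper.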

(Comments are usually less of a problem because the filesystem walks to find them are smaller and more focused. The exception is if you do something crazy like 'show N most recent comments'.)

This filesystem walk is not a big issue for a small blog (which will have a modest number of files). But when your blog gets more and more entries, well, things scale up and slow down. Rendering the front page of WanderingThoughts without any caches currently takes 3,299 lstat()s and scans 18 directories; rendering yesterday's entry takes 3,207 lstat()s and scans 13 directories. This takes a while even if everything is in the kernel's caches.

(You can optimize the walking code as much as you want but you still have to stat() every file no matter what you do. For scale, a raw filesystem walk over all WanderingThoughts entries currently takes about 200 milliseconds with hot kernel caches (in Python, but ls and find take similar amounts of time).)

The way around these problems is to cache or pregenerate this information, which is why if I was doing a file based blog design again there would be an explicit 'publish entry' step (among other changes).
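A 'publish entry' step could look something like the following sketch, which keeps a small JSON index of entries and their timestamps so that the front page never has to walk the filesystem. The function names and the index format are assumptions made up for this example, not a design that DWiki uses.

```python
import json
import os


def publish_entry(index_path, entry_path):
    """Record an entry and its timestamp in the index, newest first."""
    try:
        with open(index_path) as f:
            index = json.load(f)
    except FileNotFoundError:
        index = []
    # Republishing an entry replaces its old record.
    index = [item for item in index if item["path"] != entry_path]
    index.append({"path": entry_path,
                  "mtime": os.lstat(entry_path).st_mtime})
    index.sort(key=lambda item: item["mtime"], reverse=True)
    tmp = index_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(index, f)
    # Swap the new index into place so readers never see a partial file.
    os.replace(tmp, index_path)


def front_page_entries(index_path, count):
    """Read the newest `count` entries from the index, with no walk."""
    with open(index_path) as f:
        return [item["path"] for item in json.load(f)[:count]]
```

The tradeoff is that publishing now needs write access to somewhere, and that editing an entry's file without republishing it leaves the index stale.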

(DWiki is as weirdly limited as it is because its initial design was to run purely read only, with no write access to anything. Comments and on-disk caching still haven't fundamentally changed that attitude.)

Sidebar: two other DWiki performance-related design mistakes

DWikiText allows bare words (in the usual WikiWord format) to be links if and only if the target of the link exists. This turns out to be a bad idea if you want to cache the rendered HTML, because suddenly changes elsewhere in the filesystem (not just changes to the page itself) can invalidate the HTML; a file appearing or disappearing can create or remove a WikiWord link. This adds a couple of extra lstat()s every time DWiki loads a cached HTML rendering.

(This is not just a performance issue. It means that you can't have a simple model of 'compile the HTML of an entry when it's published and you're done'; you have to worry that publishing a new entry will need an old entry to suddenly be regenerated. The headaches are just not worth it; use a wikitext that requires explicit markup for links and then makes them always be links, whether or not the target exists.)
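The extra checking that existence-dependent links force on a cache can be sketched like this. The function names are hypothetical and this is not DWiki's actual caching code; it only shows why the cached HTML has to carry a record of what existed when it was rendered.

```python
import os


def record_link_state(targets):
    """At render time, note whether each potential WikiWord target exists."""
    return {path: os.path.lexists(path) for path in targets}


def cache_still_valid(recorded):
    """A cached rendering is stale if any potential link target has
    appeared or disappeared since it was rendered. This costs one
    lstat() per recorded target on every cache load."""
    return all(os.path.lexists(path) == existed
               for path, existed in recorded.items())
```

With links that are always links, none of this is needed; the rendered HTML depends only on the entry's own file.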

DWiki has an authentication and permission system that controls things like who can see or comment on an entry. Cleverly I made two terrible decisions when designing it; permissions are embedded in the DWikiText markup and permissions can be per file not just per directory hierarchy. In short, DWiki kind of has to render each file to find out if it can render each file. This is saved only by the fact that generally you're going to render a file anyways any time you need to check its permissions (if it's accessible), but if I was doing it again I would not do this; it could be pretty bad if there were a lot of access-restricted pages.

(DWiki caches this permission information along with the rendered HTML, which helps. The actual code model for doing this is in retrospect kind of terrible, partly because it evolved in multiple steps and was never refactored to be sane.)

FileBasedSlowness written at 22:56:51

These are my WanderingThoughts
(About the blog)


This is part of CSpace, and is written by ChrisSiebenmann.
Twitter: @thatcks

* * *

Atom feeds are available; see the bottom of most pages.

This is a DWiki.
(Help)

Categories: links, linux, programming, python, snark, solaris, spam, sysadmin, tech, unix, web


This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.