2013-05-23
Why web robots sending Referer headers is wrong
I've written before on my view that web robots of all sorts should
never send a Referer header. In those entries I mostly said 'don't do
that' without giving a solid philosophical argument about why, so today
I feel like changing that.
(Not that a philosophical argument actually matters. Proper behavior
on the web is defined by social convention, ie by what lots of other
people do and expect, not by arguing with people over what makes
sense. Whether or not you agree with a social convention you break it
at your peril, and today robots not sending Referer headers is a
well established social convention that I will ban you for violating. And anyways the people who should read this never
will.)
There are two philosophical reasons why it's wrong for robots to
send Referer headers. The first is inherent in what the Referer
header means, namely 'I just followed a link from page <X>'. This is a
description of human behavior but not really of robot behavior; almost
no web robot actually traverses the web in that way, finding links and
immediately following them. If you crawl web pages, accumulate links,
and then some time later crawl those links, you are not 'following a
link' in any conventional sense. Worse, what happens if you discover
the same link through multiple source documents? Which document gets
'credit' and appears in Referer?
(Yes, yes, this is not quite the spec definition, which
kind of permits the 'I found it here' meaning that robots sometimes use. It
is instead the practical definition of the header, as defined by how
most everything behaves.)
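As a minimal sketch of what 'never send a Referer' looks like inside a crawler (in Python; the robot name, the contact URL, and the function names are all hypothetical), the fetch code can accept the page a link was found on for its own bookkeeping and simply never turn it into a header:

```python
import urllib.request

# Hypothetical robot identity; substitute your own name and contact URL.
USER_AGENT = "ExampleBot/0.1 (+http://example.org/bot.html)"

def build_request(url, found_on=None):
    """Build a robot's request for url.

    found_on is the page where the link was discovered. It is accepted
    only so callers can keep it for their own records; it is deliberately
    never turned into a Referer header.
    """
    req = urllib.request.Request(url)
    req.add_header("User-Agent", USER_AGENT)
    return req

def fetch(url, found_on=None, timeout=30):
    """Fetch url and return the response body as bytes."""
    req = build_request(url, found_on)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

If the robot's operators want to tell people where links were found, that information can go into the robot's own records (and from there to its own web site) instead of into other people's logs.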
So, you say, you don't care; you want to use Referer as a kind of
'this is what links to you' field for servers. I can summarize a bunch
of problems here by saying that the Referer field is a terrible way
to communicate this information to web operators, fundamentally because
you are trying to use a side effect of HTTP requests to pass on what may
be a huge amount of information. If you actually want to be useful you
should make this information available on your own web site where people
can see and fetch it in bulk.
Finally, the brutal truth is that 'who links to me' is far less
interesting than 'who is sending human traffic to me (right now)'. By
far the most valuable part of Referer is information on where real
(human) visitors are coming from, to the extent that it's possible
to find this out. Being read by people is
the ultimate purpose of most web pages, which makes what places are the
source of traffic and active links something of decided interest to
us. And this sort of human behavior has very little to do with either
robot behavior or what potential links exist out there in the world.
Mixing either your robot's actions or a 'helpful' attempt to tell us
about the latter into this information is not doing us any favours; rather the contrary, in
fact (this is one large reason that I react angrily to robots sending
Referer).
(There is also the inconvenient fact that once you're operating a decent
sized site you're not likely to really care about who links to you
because there will be far too many links out there, most of them in
increasingly obscure and unimportant places. The links you do care about
are exactly the links that send you significant traffic.)
WhyNoRefererForRobots written at 00:25:17
2013-05-21
Diffbot's bad Referer header
Today a web spider called 'Diffbot' (run by diffbot.com) made a whole
bunch of requests here, all of which failed. They failed because, just
as it has repeatedly done in the past, it made them all with a Referer
header of 'http://news.google.com/' and this behavior long ago led me
to ban it entirely from here.
There are a number of things wrong with this header. The first is that,
to steal from the old Trix commercials, 'silly robot, the Referer
header is for humans'. I've written about this before at some length,
and sending Referer here is generally a good way to get your spider banned.
(I have a philosophical ramble about why this is the correct view,
but it's going in another entry.)
The second is that, of course, this Referer value is a flaming lie
in two different ways. Diffbot in no way, shape, or form traveled from
news.google.com to the whole collection of URLs here that it attempted
to crawl with that Referer header, and on top of that, news.google.com
does not link to here at all. Diffbot made up the header from whole
cloth. I react very badly to web spiders that lie to me at the best of
times (even if they aren't spraying junk over my referer logs).
Diffbot and its operators may or may not be legitimate, or at least
honest about what they're doing; I have no particular opinions on
that. But they are unquestionably operating a web spider that
routinely lies. I have no idea why and really, I don't care; I was
doing them a favour by letting them crawl me
and I can and will withdraw that favour if they irritate me.
(See also my technical requirements for web spiders and my standards for responsible spider
behavior.)
(No, I haven't mailed Diffbot's operators about this behavior. Are you
kidding? I'm neither crazy nor stupid. On today's Internet, mailing
people about issues is for people that you actually trust.)
DiffbotBadReferer written at 23:20:49
2013-04-21
Why a free SSL Certificate Authority is not horrifying
Back in this entry I casually mentioned in
passing that there is a CA that will give you completely functional
SSL certificates for free. To some people this will be horrifying;
after all, as the story goes, SSL certificates are supposed to cost
money so that they mean something and verify your identity (well,
your website's identity).
The truth of what is going on here is that these free certificates
contain exactly as much verification of your identity as everyone
else's. In fact they may contain more verification, because this
CA actually performs automated tests to verify that you have some
control over the domain you want a certificate for; I don't know how
much checking other CAs do besides making sure that they can charge
your credit card. This particular CA is simply being honest about
how much this particular 'service' costs to provide, ie essentially
nothing. So they give you basic SSL certificates for free and charge you if you
want additional features.
(There are a number of CAs that will give you free but short duration
SSL certificates for testing purposes. This CA gives year-long ones and
will happily issue you new ones for the next year.)
Given my long-standing irritation with what I've called the SSL CA
racket, I'm kind of glad that there is a CA that is willing to be honest
about exactly what's going on. If it horrifies people and offends
them that such a CA is trusted by browsers, well, good, maybe it will
spark a little reflection about what SSL CAs are really providing and not providing.
On a pragmatic basis, given that SSL certificates are a commodity and you can now obtain this commodity for free
(which demonstrates its actual natural price) I see no reason to pay
for basic SSL certificates any more.
(I continue to not name the SSL CA for a number of reasons including
that I don't feel like doing their marketing for them. It isn't
difficult to work out what CA it is, either with some web searches or by
checking the SSL certificate chain for the website I mentioned in the
earlier entry.)
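For the 'check the certificate' approach, here is a minimal Python sketch using the standard ssl module. It reports only who issued the server's own certificate, not the whole chain, which is usually enough to identify the CA. The function names are hypothetical and you supply the hostname yourself.

```python
import socket
import ssl

def issuer_fields(cert):
    """Flatten the 'issuer' entry of a getpeercert() dict into a plain dict."""
    fields = {}
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            fields[key] = value
    return fields

def fetch_issuer(hostname, port=443, timeout=10):
    """Connect to hostname over TLS and report who issued its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return issuer_fields(tls.getpeercert())
```

The interesting fields in the result are usually organizationName and commonName.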
Sidebar: what I mean by a basic SSL certificate
By a basic SSL certificate I mean one for a single name without
wildcards. Single name certificates are slightly inconvenient but
my impression is that SNI support is now common
enough in both servers and (modern) clients that you can deal with
this if you have to.
(I was pleasantly surprised about how few things I tried had problems
with SNI after I set it up on various
subdomains of my personal domain. Of course smartphones may complicate this
pleasant picture.)
SSLFreeCANotHorrifying written at 00:59:51
2013-04-17
Some thoughts on going to HTTPS by default
My Twitter feed recently dropped a link to Tim Bray's Private By Default
in front of me, so I read it, nodded along in agreement, and
started thinking about doing it myself for my personal domain. The technical side was easy and pain-free,
since there's a Certificate Authority who'll give you free basic SSL
certificates. But that's as far as I've gone due to what I've come to
think of as the problem of really committing to HTTPS.
If I was doing this seriously, I would redirect all HTTP traffic to the
HTTPS version of my site (because otherwise much of the existing traffic
won't shift). But doing that implies an ongoing commitment to HTTPS. If
people are using HTTPS URLs I need to keep those URLs working and in
turn that means I need a duly CA-approved SSL certificate. Right now I
can get such a thing for free but there's no guarantee that this will
continue to be the case in the future; at that point, well, I have to
cough up some money. And I'm not at all sure that I'm enthused enough
about HTTPS everywhere to actually pay for it.
(I agree with all of Tim Bray's arguments for it intellectually. But
buying an SSL certificate is not just money, it's also hassle. For that
matter, using an SSL certificate is an ongoing hassle if you really
care about security because then you get to wade into the great SSL
cipher swamp every time a new threat
emerges.)
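The redirect itself is the easy part. A minimal sketch of it as WSGI middleware in Python follows. The function name is hypothetical, and the default of a permanent (301) redirect is an assumption; a temporary redirect is the more cautious choice if you aren't sure you're committed.

```python
from urllib.parse import quote

def https_redirect(app, status="301 Moved Permanently"):
    """Wrap a WSGI app so plain-HTTP requests are redirected to HTTPS."""
    def middleware(environ, start_response):
        if environ.get("wsgi.url_scheme") == "https":
            return app(environ, start_response)
        # Rebuild the requested URL with an https:// scheme.
        host = environ.get("HTTP_HOST") or environ["SERVER_NAME"]
        # Drop any explicit HTTP port (this simple split assumes a
        # hostname, not an IPv6 literal).
        host = host.split(":", 1)[0]
        path = environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")
        location = "https://" + host + quote(path or "/", safe="/;=,",
                                             encoding="latin-1")
        query = environ.get("QUERY_STRING", "")
        if query:
            location += "?" + query
        start_response(status, [("Location", location),
                                ("Content-Type", "text/plain"),
                                ("Content-Length", "0")])
        return [b""]
    return middleware
```

Undoing this later is the mirror image: the same sort of wrapper on the HTTPS side pointing back at http:// URLs, which only works for as long as browsers accept the HTTPS side's certificate.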
But is this actually a real worry? Presumably I ought to have at least
some warning that my next certificate will cost me money; at that point
I could start redirecting my HTTPS traffic back to the HTTP version of
the site and I should have some amount of time for the redirections to
take effect before the certificate expired. In the extreme case I could
get the cheapest one-year certificate available to have a full year
for the transition (and extremely cheap SSL certificates don't seem
likely to go away). Also the HTTPS version of the site wouldn't go away
entirely because I'd probably put up a self-signed certificate just to
keep the URLs valid (although visitors would get the usual scary browser
warnings). How much this affected people in practice would depend on how
many saved HTTPS URLs there were for my site out there in the wild.
(In a world of ephemeral social media and search-driven navigation
that's probably a good question in general. I have no answers.)
MullingOverHttps written at 01:16:50
2013-04-07
The apparent source of my Firefox memory bloat problems
I recently took another shot at trying to get rid of my long-running
Firefox performance problems, which I had
narrowed down to garbage collection stalls resulting from memory
bloat. The good news is that I seem to have
found what was causing my memory problems. The bad news is that it's
in extensions that I more or less care about.
The first necessary disclaimer is that I haven't gone through the
painstaking work to test extensions in isolation (especially in my
normal browsing environment). What I can say is that using just
my core extensions of NoScript, FireGestures,
It's All Text, the last working version of CookieSafe, and the
Mozilla all-JavaScript PDF viewer leaves Firefox's memory usage
stable and performance excellent. If I add either or both of
Stylish and GreaseMonkey,
memory usage climbs slowly but steadily and I see my usual performance
issues.
Given that GreaseMonkey is a heavily used extension, I suspect
that my problems with it are due to either some interaction with
my other extensions or with the specific user script that I use. The same may be true for
Stylish (although there is one review that suggests other people are
having memory problems with it).
(While I haven't seen memory bloat with Status-4-Evar,
having it active seems to make Firefox's scrolling somewhat less
snappy for me. Without GreaseMonkey and Stylish, the status bar
is relatively empty anyways so I'm currently experimenting with
disabling S4E.)
Although I called GreaseMonkey and Stylish essential extensions back
here, I can in practice live without them. Having
mangled Google search results and various badly formatted websites
irritates me, but I can sort of live with them (and the cure for the
latter is to stop visiting those websites). I wish I didn't have to,
so I keep hoping that Firefox will come up with a better solution for
whatever is causing these leaks.
(Given that my bloat seemed to involve a lot of compiled JavaScript
code sitting around, I'm now wondering if Firefox has something like
Java's PermGen issues with loaded code and compiled/JIT'd functions
sticking around when they shouldn't.)
MyFirefoxPerformanceIII written at 01:45:42
2013-03-25
Rethinking avoiding Apache
Somewhat recently I wrote about when I'd use a web server other
than Apache (despite Apache's temptations). I've recently discovered that I need to change
those opinions somewhat; Apache turns out to be much more usable
than I expected in a constrained resources situation.
One of my recent hobbies has been testing DWiki in a low-memory virtual
machine (as I mentioned once in passing). I did
my primary testing using nginx because it
had an SCGI gateway, but with that working I decided on a whim to see
how Apache plus mod_wsgi would
do in the same small VM. To be honest, I expected Apache to explode
spectacularly under any sort of real concurrent connection load, driving
the virtual machine into the ground in the process.
To my total surprise, this did not happen. Not at all. Instead a more
or less stock Ubuntu 12.04 Apache plus mod_wsgi setup handily dealt
with all of the load I could throw at it. In my limited testing it was
actually slightly faster on average than my nginx setup, dealt better
with really extreme numbers of concurrent connections, and still left
the machine with free memory. It was also easier to manage than my nginx
lashup, which needed a separate system to run and restart the SCGI-based
WSGI server that nginx talked to.
Part of this seems to be that Ubuntu 12.04 has sensible (ie small)
Apache configuration settings. Another part is that mod_wsgi totally
isolates the WSGI serving into separate processes (although they are
still Apache processes). But regardless of all of this the whole setup
just works and does so in an environment where I had previously expected
Apache to be completely unsuitable. I am metaphorically eating my hat
right about now.
(If I ever do deploy DWiki into such an environment, Apache plus
mod_wsgi is now going to be my first choice. Not for performance, where
I doubt there's any meaningful practical difference, but because it's
easier to manage because everything is in one spot and mod_wsgi has
good support for easy code reloads.)
Sidebar: a caution about my performance results
Siege, the load tester I was using, reports only the average request
time (and the maximum and minimum); it doesn't provide any information
about the distribution. It's possible that the distribution of response
times is worse with Apache and the average is masking this. To do real
testing I'd need to find a more thorough HTTP load tester (well, one
with better stats reporting).
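To illustrate how an average can mask the distribution, here is a small Python sketch with two made-up sets of response times that have the same mean but very different tails. The numbers are invented for the example and are not measurements from the Apache or nginx tests; the percentile function uses the simple nearest-rank definition.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples):
    """Return the mean alongside a few percentiles and the maximum."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }

# Invented response times in seconds, 100 requests each.
steady = [0.10] * 100                 # every request takes 0.10s
spiky = [0.05] * 95 + [1.05] * 5      # mostly fast, a few very slow

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, summarize(samples))
```

Both sets average out to a tenth of a second, but in the second one request in twenty takes over a second, which only shows up once you look at the 99th percentile or the maximum.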
RethinkingAvoidingApache written at 22:33:44
2013-03-20
Don't use ab for your web server stress tests (I like siege instead)
Like many other people, I sort of automatically reach for the venerable
ab Apache program when I want to do some sort of a web server stress
test. I've heard that it has flaws and it's not the best program out
there, but surely it's good enough for the basics, right?
Well, no, as I found out recently. I don't know exactly why or what's
going on, but ab's concurrency option plain doesn't work; you get
nowhere near as much concurrency as you asked for and as it claims. Due to
my concurrency misunderstanding
I got to see this first hand and very vividly. When I ran 'ab -c N'
against a test DWiki setup, nowhere near as many worker processes got
started and used as there should have been (I believe I asked for 50
concurrent requests and saw only 4 worker processes running, which is
very wrong). So my message is simple: do not use ab to test anything
you care about. That it's there does not make it worthwhile unless
you are very sure that it is not quietly doing something odd on you.
On the other hand I can attest that siege works. When I asked it to make N
concurrent requests, well, my worker process count shot right up to
what it should have been (in the case of high concurrency, every worker
process that I allowed). Siege is also capable of hammering on a fast
web server so rapidly that it exhausts your machine's normal range of
28,000 or so local TCP ports. On the one hand this is vaguely annoying.
On the other hand I can only describe it as a good problem to have,
since it means you are serving requests considerably faster than old
sockets can expire out of TIME_WAIT.
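To put a rough number on 'considerably faster', here is the arithmetic as a short Python sketch. It assumes a local port range of 32768 to 61000 (which matches the '28,000 or so' figure) and a 60 second TIME_WAIT; both values are assumptions about the test machine rather than measurements.

```python
def max_sustained_rate(port_low, port_high, time_wait_seconds):
    """New connections per second to a single destination that can be
    sustained before every local port is sitting in TIME_WAIT."""
    ports = port_high - port_low + 1
    return ports / time_wait_seconds

# Assumed values: port range 32768-61000, 60 second TIME_WAIT.
rate = max_sustained_rate(32768, 61000, 60)
print("about %.0f new connections per second" % rate)
```

Under those assumptions the limit works out to a bit over 470 new connections a second, sustained for at least a minute, from one client machine to one server address and port.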
(Siege is not perfect and I have not conducted either an exhaustive test
of web server stress testers or a careful validation of the numbers
it reports. Plus, if you really care about this you will want not
just averages for things like response speeds but also 90th and 99th
percentiles and distributions and so on. You may also want a more
sophisticated model than just concurrent connections, one that more
closely models the real world behavior of people.)
(This elaborates on a tweet I made a while ago.)
AvoidAbUseSiege written at 01:07:17
2013-03-14
What I want out of a web-based syndication feed reader
In light of Google Reader's impending shutdown
I've started thinking about what I'd want out of any replacement
to it that I switch to. I don't use Google Reader as my primary
feed reader (that has always been Liferea); instead, my use is for
three somewhat contradictory things:
- feeds that I want to be able to browse from more than one place.
- casual reading feeds, where Google
Reader's slow expiry of old unread entries is a feature.
- feeds that I don't want to get lost in the black hole that my Liferea
feeds have turned into.
(Unless I really care about a feed, adding it to Liferea usually ensures
that I then ignore it; I just have too many things in there. I should
probably remove most of my current Liferea feeds but I can't get up
the willpower and I can't quite abandon the idea that I'll read those
worthwhile entries someday.)
This leads me to think that a number of features are important to me
(besides just being web-based in some way, even self-hosted):
I'm relatively indifferent to whether or not the feed reader presents
entries as simple, readable text (as Google Reader and Liferea do) or
whether it makes some attempt to make entries look like they do on the
real site (as some other web-based feed readers apparently do). Terrible
formatting will just cause me to unsubscribe from a feed, which should
be no major loss given what I'm theoretically using this for (mostly).
Unfortunately all of this is a sufficiently complex set of wishes that
it implies a web application instead of just a website (although I'm
willing to self-host the web app if I can).
(In theory I'd also be happy with a good graphical feed reader program
that synced things between multiple machines using some backend. In
practice I'm not sure there's any such program whose interface I'd like
and that runs on Fedora.)
WebFeedReaderWants written at 00:55:35
2013-02-07
Today's learning experience with CSS: don't be indirect
This is today's learning experience and I will preface it by saying that
I am probably doing things wrong and not in the right CSS way. I will
present this as a story.
Once upon a time, you write a wikitext to HTML converter and with it
some associated CSS. Your wikitext has tables and the tables should
be styled in a certain way, so you wrap the entire generated wikitext
in a <div class="wikitext"> and write a CSS rule:
.wikitext td { border: 1px; border-style: solid; padding: .3em; }
These tables come out with a nice 1 pixel solid border the way you
wanted and also the right padding around everything to look nice.
Your wiki also has some tables that it generates outside of the
wikitext. They have HTML like <table class="blogtitles"> and
CSS to style them the way you want:
.blogtitles td { padding-bottom: .5em; vertical-align: top; }
.blogtitles td + td { padding-left: 0.5em; }
These tables also come out with the right padding and no border, the
way you want them to.
Then much, much later you decide that you want to embed a blogtitles
table in the generated wikitext, wrapped in that great big wikitext
<div>. You render the whole thing and lo, your blogtitles table comes
out looking horrible. For a start, it has borders.
Well, of course it has borders. You said to give it borders: 'every <td>
inside a wikitext <div> should have borders' says your CSS, and right
there is a (blogtitles) <td> inside a wikitext <div>. Similarly your
blogtitles table has all sorts of padding it 'inherited' from (general)
wikitext tables. The result of combining the blogtitles CSS with the
wikitext tables CSS is probably nothing like what you wanted (and may
not look very good).
Your problem (ie, my problem) is that you were indirect when you did
not want to be. 'Any <td> inside my <div>' is an indirect way of
specifying 'wikitext tables', and as an indirect way it runs the danger
of being too general. Which is what happened here. Blogtitles tables
are conceptually a completely separate thing and should be styled
independently from your regular wikitext tables, but they are being
swept up in your dragnet.
The right solution, at least in generated HTML, is to be direct.
Generate your wikitext <table>s with an actual class (eg <table
class="wikitable">) and then write CSS on that. The CSS doesn't even
have to change much. In short, say what you actually mean. You
didn't really want to style every <td> inside your wikitext; you wanted
to style your wikitables. So you should say this directly (in CSS and in
classes) and save yourself a certain amount of hassle and annoyance.
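On the generating side, being direct is a one line change in whatever emits the table. A minimal Python sketch (the function name and its input format are hypothetical, not the converter's actual code):

```python
from html import escape

def render_wikitable(rows):
    """Render rows (lists of cell strings) as an HTML table that carries
    its own class, so the CSS can target it directly."""
    out = ['<table class="wikitable">']
    for row in rows:
        cells = "".join("<td>%s</td>" % escape(cell) for cell in row)
        out.append("<tr>%s</tr>" % cells)
    out.append("</table>")
    return "\n".join(out)

print(render_wikitable([["name", "size"], ["a<b", "10"]]))
```

The matching CSS selector then becomes '.wikitable td' instead of '.wikitext td', with the same declarations as before, and a blogtitles table embedded in the wikitext no longer matches it.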
(There is probably a really clever way to fix this in CSS that I don't
know because I'm mostly CSS-ignorant. Note that I don't consider
carefully trying to undo the wikitext table settings to be a clever
way.)
The ice is thinner for HTML that isn't automatically generated, because
putting classes on things is somewhat more annoying there (especially if
you may have a lot of them). I don't pretend to have a nice answer there.
CSSAvoidIndirection written at 00:10:11
2013-02-05
What makes DWiki and other dynamic file based blog engines slow
In yesterday's entry I mentioned that DWiki
(the software behind this blog) is pretty much a worst case for a blog
engine as far as speed goes. Today I feel like talking about what makes
DWiki slow, and by extension the things that can slow down any dynamic
file based blog engine. Part of the reason is so that
you (if you are considering writing such a thing) can avoid the mistakes
that I made.
(Some of the slowness is because chunks of DWiki's code are not exactly
the best that they could be, but the issues there are generally dwarfed
by the general ones I'm about to discuss.)
For basic background, DWiki is about as pure a dynamic file based blog
engine as you could ask for; conceptually it is purely a bunch of views
of a filesystem hierarchy (actually of two of them). Each entry and
each comment is stored in a separate file in a directory hierarchy
(entries are files in category subdirectories and comments are files
in a per-entry subdirectory that is itself in a mirror of the entry's
regular hierarchy). Entries (and comments) are written and stored in
DWiki's wikitext dialect, not HTML, and the time of an entry (or a
comment) is the modification time of its file.
This gives DWiki two main slow points. The most obvious one is
converting DWikiText to HTML. At the level of a single entry it
isn't a terribly bad process, taking about 6 milliseconds to render
yesterday's entry (and then about 4 milliseconds to render the
sidebar text, which is also wikitext in a file). But at the level of
the blog front page this adds up fast; ten entries is already
over 60 milliseconds (although per-entry rendering varies by a few
milliseconds depending on what's in them). Still, 60 milliseconds is not
a terrible killer.
(In retrospect, one of the reasons to use Markdown or some other popular
wikitext format is that other people may well write fast HTML converters
for you. With a private wikitext, you're on your own.)
The less obvious but much larger slow point is that DWiki has to walk
the filesystem any time it needs to know the relationship between
entries, or just to find them all. The obvious case is the blog's front
page, which needs to find the N most recent entries; in a file based
engine like DWiki you do this by walking the filesystem to find all the
entry files, stat()ing them to find their timestamp, sorting the list,
and taking the top N. More subtly, DWiki also needs to do this walk when
displaying individual entries in order to figure out what the next and
previous entries are so that it can generate links to them. And if you
want to display some sort of calendar of what days or weeks or months
have entries? Again you need a walk.
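The walk described above can be sketched in a few lines of Python. This is an illustration of the general approach and not DWiki's actual code; the function names are hypothetical and it treats every file under the root as an entry.

```python
import os

def all_entries_by_time(root):
    """Walk root and return every file path, oldest first by mtime."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)              # one lstat() per entry file
            found.append((st.st_mtime, path))
    found.sort()
    return [path for mtime, path in found]

def recent_entries(root, count):
    """The count most recent entries, newest first (the front page)."""
    return all_entries_by_time(root)[::-1][:count]

def neighbours(root, entry_path):
    """The (previous, next) entries in time order for a single entry.
    Note that this needs exactly the same full walk as the front page."""
    entries = all_entries_by_time(root)
    i = entries.index(entry_path)
    prev_entry = entries[i - 1] if i > 0 else None
    next_entry = entries[i + 1] if i + 1 < len(entries) else None
    return prev_entry, next_entry
```

Both functions cost one lstat() per entry in the whole blog no matter how few entries they actually return, which is why displaying a single entry is not much cheaper than displaying the front page.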
(Comments are usually less of a problem because the filesystem walks
to find them are smaller and more focused. The exception is if you do
something crazy like 'show N most recent comments'.)
This filesystem walk is not a big issue for a small blog (which will
have a modest number of files). But when your blog gets more and more
entries, well, things scale up and slow down. Rendering the front page
of WanderingThoughts without any caches currently takes 3,299
lstat()s and scans 18 directories; rendering yesterday's entry
takes 3,207 lstat()s and scans 13 directories. This takes a while
even if everything is in the kernel's caches.
(You can optimize the walking code as much as
you want but you still have to stat() every file no matter what you
do. For scale, a raw filesystem walk over all WanderingThoughts
entries currently takes about 200 milliseconds with hot kernel caches
(in Python, but ls and find take similar amounts of time).)
The way around these problems is to cache or pregenerate this
information, which is why if I was doing a file based blog design again
there would be an explicit 'publish entry' step
(among other changes).
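As a sketch of what an explicit publish step could buy you, here is a hypothetical Python version that records each entry in a small time-ordered index file when it's published, so that the front page reads one file instead of walking the filesystem. The JSON index format and all of the names are assumptions made for illustration; this is not a description of DWiki or of any specific redesign.

```python
import json
import os
import time

def load_index(index_path):
    """Return the list of {"path", "time"} records, oldest first."""
    try:
        with open(index_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return []

def publish_entry(index_path, entry_path, when=None):
    """Record entry_path in the index at publish time."""
    when = time.time() if when is None else when
    # Drop any earlier record of this entry, then add the new one.
    index = [item for item in load_index(index_path)
             if item["path"] != entry_path]
    index.append({"path": entry_path, "time": when})
    index.sort(key=lambda item: item["time"])
    tmp = index_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(index, f)
    os.replace(tmp, index_path)   # atomic swap; readers never see a partial file

def front_page(index_path, count):
    """Newest count entry paths, read from the index with no walk."""
    if count <= 0:
        return []
    return [item["path"] for item in load_index(index_path)[-count:]][::-1]
```

Previous and next links fall out of the same index, since an entry's neighbours are just the records on either side of it.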
(DWiki is as weirdly limited as it is because its initial design was
to run purely read only, with no write access to anything. Comments
and on-disk caching still haven't fundamentally changed that attitude.)
Sidebar: two other DWiki performance-related design mistakes
DWikiText allows bare words (in the usual WikiWord format) to be links
if and only if the target of the link exists. This turns out to be a bad
idea if you want to cache the rendered HTML, because suddenly changes
elsewhere in the filesystem (not just changes to the page itself) can
invalidate the HTML; a file appearing or disappearing can create or
remove a WikiWord link. This adds a couple of extra lstat()s every
time DWiki loads a cached HTML rendering.
(This is not just a performance issue. It means that you can't have a
simple model of 'compile the HTML of an entry when it's published and
you're done'; you have to worry that publishing a new entry will need an
old entry to suddenly be regenerated. The headaches are just not worth
it; use a wikitext that requires explicit markup for links and then
makes them always be links, whether or not the target exists.)
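To make the cost of conditional WikiWord links concrete, here is a minimal Python sketch of the bookkeeping a cache needs in order to stay correct. It is an illustration with hypothetical function names, not DWiki's actual code: at render time you record whether each possible link target existed, and every time you load the cached HTML you have to check all of them again.

```python
import os

def record_link_state(candidate_paths):
    """At render time, remember whether each possible WikiWord target
    existed. Targets that do not exist matter as much as ones that do."""
    return {path: os.path.lexists(path) for path in candidate_paths}

def cache_is_valid(recorded):
    """A cached rendering is usable only if no recorded target has
    appeared or disappeared since it was rendered. This costs one
    lstat() per recorded target on every cache load."""
    return all(os.path.lexists(path) == existed
               for path, existed in recorded.items())
```

With explicit link markup that always produces a link, neither function is needed; the cached HTML depends only on the entry's own file.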
DWiki has an authentication and permission system that controls things
like who can see or comment on an entry. Cleverly I made two terrible
decisions when designing it; permissions are embedded in the DWikiText
markup and permissions can be per file not just per directory
hierarchy. In short, DWiki kind of has to render each file to find
out if it can render each file. This is saved only by the fact that
generally you're going to render a file anyways any time you need to
check its permissions (if it's accessible), but if I was doing it
again I would not do this; it could be pretty bad if there were a lot of
access-restricted pages.
(DWiki caches this permission information along with the rendered HTML,
which helps. The actual code model for doing this is in retrospect kind
of terrible, partly because it evolved in multiple steps and was never
refactored to be sane.)
FileBasedSlowness written at 22:56:51
These are my WanderingThoughts
This is part of CSpace, and is written by ChrisSiebenmann.
Twitter: @thatcks