2013-10-27
Some things I've learned from transitioning a website to HTTPS
A while back I first added a HTTPS version of my personal site alongside the existing HTTP version and then decided that I was going to actively migrate it to HTTPS. The whole thing has been running for a few months now, so it seems about time to write up some things I've learned from it.
The first set of lessons I learned was about everything on my side,
especially my own code. The first layer of problems was code and the
like with explicit 'http:' bits in it; it was amazing and depressing
how many places I was just automatically doing that (you could call
this 'HTTP blindness' if you wanted a trendy term for it). The more
subtle problem areas were things like caches, where the HTTP version
of a page might be different from the HTTPS version yet I was storing
both under the same cache key. I also ran into a situation where I
wanted to generate output for a HTTP URL request but use the
'canonical' HTTPS URLs for links embedded in the result; this required
adding a feature to DWiki.
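To make the cache issue concrete, here is a rough sketch of the kind of fix involved (the names are made up and this isn't DWiki's actual code); the point is simply that the scheme has to become part of the cache key and that generated links can be forced to the canonical HTTPS form:

    # Hypothetical illustration, not DWiki's real code.
    def cache_key(environ, path):
        # Include the request scheme so the HTTP and HTTPS renderings of
        # a page don't overwrite each other in the cache.
        scheme = environ.get('wsgi.url_scheme', 'http')
        return "%s:%s" % (scheme, path)

    def canonical_url(path, host="example.org"):
        # Always generate HTTPS links in page output, even when the page
        # itself was requested over HTTP.
        return "https://%s%s" % (host, path)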
(I also found a certain amount of other software that didn't cope well.
For example, the Fedora 19 version of mod_wsgi doesn't seem to cope
with a single WSGI application group that's served over both HTTP and
HTTPS; the HTTPS environment variable latches to its initial value and
never changes.)
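(For completeness, a WSGI application normally works out the request scheme along these lines; this is a generic sketch, not a fix for that particular mod_wsgi issue:

    def request_scheme(environ):
        # Prefer wsgi.url_scheme if the server provides it; otherwise
        # fall back to the CGI-style HTTPS variable.
        scheme = environ.get('wsgi.url_scheme')
        if scheme:
            return scheme
        return 'https' if environ.get('HTTPS', 'off').lower() in ('on', '1') else 'http'

)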
Once I had my own code working I got to find out all sorts of depressing things about how other people's code deals with such a transition. In no particular order:
- While search engines did eventually switch over to returning HTTPS
results and to crawling only the HTTPS version of my site, it
took a surprisingly long time (and the switch may not be complete
even now; it's hard to tell).
- Many syndication feed fetchers have not changed to the HTTPS version;
they still request a HTTP URL then get redirected. I will reluctantly
concede that there are sensible reasons for this behavior. It does mean that the HTTP redirects
will probably live on forever.
- There are a certain number of syndication feed fetchers that still
don't deal with HTTPS feeds or at least with redirections to them.
Yes, really, in 2013. Unfortunately two of these are FeedBurner
and the common Planet software, both of which I at least sort of
care about. This led to the 'generate HTTP version but use the
canonical HTTPS links' situation for my software.
- Some web spiders don't follow redirects for robots.txt. I decided
not to redirect for that URL alone rather than block the spiders
outright in the server configuration, partly because the former was a
bit easier than the latter; there's a sketch of the idea just after
this list. (I already totally ban the spiders in robots.txt, which is
one reason I wanted them to see it.)
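As an illustration of that robots.txt exemption: if you were doing the HTTP to HTTPS redirection in a WSGI layer instead of in your web server's configuration, it might look something like this (purely a sketch; the host name and details are made up, and query strings are ignored for brevity):

    def https_redirector(app, host='example.org'):
        # Redirect all HTTP requests to HTTPS except for /robots.txt, so
        # that spiders which won't follow redirects for robots.txt still
        # get to see it.
        def middleware(environ, start_response):
            path = environ.get('PATH_INFO', '/') or '/'
            if environ.get('wsgi.url_scheme') == 'https' or path == '/robots.txt':
                return app(environ, start_response)
            location = 'https://%s%s' % (host, path)
            start_response('301 Moved Permanently',
                           [('Location', location),
                            ('Content-Type', 'text/plain')])
            return [b'Moved to ' + location.encode('ascii')]
        return middleware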
Despite all of this, the process has been relatively straightforward and mostly without problems. To the extent that there were problems, I'm more or less glad to know about them (and to fix my code; it was always broken, I just didn't realize it).
2013-10-19
I should never have allowed 'outside' content to break my layout
Here is a lesson that only sank into my head very recently: you should never allow user contributed content to break your layout. Or, really, any outside content; by this I mean things that show up basically outside of your control and get dropped into your pages.
You might wonder how on earth you stumble into this problem in the first place. In my case what happened here on Wandering Thoughts was comments with preformatted text that had lines too long for the width of the browser window. This has been an infrequent but long-standing problem, but I never did anything about it (at least at the layout level). That was a mistake.
Your site is, well, your site. It's your responsibility to make it so that your site looks right and works right; as a corollary, it's basically your fault if it doesn't. Where the content that's doing it comes from is at one level irrelevant because you're still the site owner. It's not as if my visitors are going to go 'well, it's totally not Chris's fault that looking at the comments on this entry makes the whole thing unreadable'. Instead their reaction is more likely to be to close the window and leave WT.
(And I don't get to blame the people leaving the comments with the wide preformatted lines, either. I'm sure the lines look fine to them on their displays. How they look on other displays is, again, my problem.)
The other way to put this is that outside content is not sacred and its nominal integrity is not a higher priority than your site's overall layout. It may feel bad to mangle the layout and readability of outside content but it is generally the lesser evil.
How you make sure that outside content won't break your layout depends
on a bunch of things. How I finally dealt with the long <pre> line
problem here is that I forced all <pre> blocks in comments to be
linewrapped via a CSS setting of 'white-space: pre-wrap'. This can
sort of break the formatting of such comments, but I consider this a
lesser evil (and they still render correctly if the browser window is
wide enough).
(It turns out that I couldn't make the CSS overflow property do
what I wanted it to do, perhaps because of my table-based layout.)
(As is sometimes the case, I'm writing this down now partly in the hopes of getting it embedded in my brain so that next time around I remember it and follow through.)
2013-10-02
What your User-Agent header should include and why
I wound up having a discussion about this in the context of a feed reader and it caused me to have a realization or two, so I've decided to write up my views on this. All of this is mostly from the perspective of a website operator; there are other perspectives.
There are three different cases: when you are writing a user agent,
when you are writing a web robot, and when you are writing a web robot
library (which will be used by possibly many web robot operators). The
easiest case is when you're writing a client that will be directly used
by real people. Here your User-Agent should identify the software
by name and by a URL to your project site and give a general version
number. It should not identify the user, either directly by name or
indirectly by including additional client fingerprint information such
as the platform it's running on. As a side note, your project site
should include enough information to convince a suspicious website
operator that it is a real client that gets used by real people.
(Some people will object to the version number but I think it's important to include because it lets me either tell people to upgrade because the upgrade fixes a problem or tell you that your latest code has some problem. If you leave the version number out all I can possibly report to your project is 'some version of your software does this bad thing'.)
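To put the client case in concrete terms, this is roughly what I'd like to see a feed reader or similar program do (the names and URLs here are invented for illustration; any HTTP library works the same way):

    import requests

    # Identify the software by name, version, and project URL, but say
    # nothing about the individual user or their platform.
    USER_AGENT = "ExampleFeedReader/1.4 (+https://feedreader.example.org/)"

    resp = requests.get("https://example.org/some/feed/",
                        headers={"User-Agent": USER_AGENT})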
This is completely different for web robots. For web robots the
User-Agent header must contain a clear identification of both your
robot and of who is responsible for its operation, ie the URL of a web
page describing who you are, what you do, and so on. There should be
readable English on the page and a method of contacting you privately
(such as email or a contact form). It is vaguely customary to include
a version number but as a website operator I don't care in the least;
you might as well always use '/1.0' if you feel a version number is
required.
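In other words, a robot's User-Agent should look something like this (again, the name and URL are invented):

    # The URL should point at a page explaining who runs the robot, what
    # it does, and how to contact the operator privately.
    ROBOT_USER_AGENT = "ExampleCrawler/1.0 (+https://crawler.example.org/about.html)"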
Including this information in your User-Agent is to your benefit
because it encourages website operators to investigate and perhaps
report some crawling program instead of blocking you out of hand (either by user-agent or by source IPs, or perhaps
both). I have much harsher reactions to anonymous robots than I do to
ones that are willing to identify themselves. Note that if you're a
company running software from your servers that is poking my websites,
you're a robot operator. At one level I don't care exactly why you're
running the software or how many users it is helping, I still expect it
to identify the specific party responsible for itself. Fail to do this
and I reach for the block tools.
(And yes, this very much applies to feed reader aggregator sites.)
If you're writing a web robot library you need to somehow force its
users to add such a clear identification of themselves into the
User-Agent (although including your library's project URL is nice, it
is not an identification of the responsible party for the robot that
is hitting my site). I'd put this into the library's configuration as
a mandatory field or make it an optional setting but with the default
value of something like 'UNCONFIGURED, BLOCK THIS ROBOT'. Note that if
you supply 'sensible' default values, many of your library's users will
never change them.
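A sketch of what I mean for a robot library (hypothetical names, not any real library's API):

    UNSET_AGENT = "UNCONFIGURED, BLOCK THIS ROBOT"

    class RobotFetcher(object):
        def __init__(self, user_agent=UNSET_AGENT):
            # user_agent must identify the robot's operator; the library
            # only appends its own token, which is not identification.
            self.user_agent = "%s examplerobotlib/1.0" % user_agent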
(If you're writing a web library for use by real clients I wouldn't
bother having any default User-Agent or putting your library's
identification in. Just provide an API for supplying the user agent
information and document what's a good idea to put in there. Make using
the API mandatory because otherwise people won't. Putting your library
information in as well is okay and potentially useful, but your library
information alone in the User-Agent is completely useless to website
operators because it tells us nothing about what is visiting.)
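And the corresponding sketch for a client library (again hypothetical names): make the caller's identification mandatory and only append your own token to it:

    class HTTPClient(object):
        def __init__(self, user_agent):
            if not user_agent:
                raise ValueError("user_agent must identify your client")
            # The library's own token goes after the caller's
            # identification; on its own it would tell website operators
            # nothing useful.
            self.user_agent = "%s exampleclientlib/2.0" % user_agent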