Wandering Thoughts archives

2013-10-27

Some things I've learned from transitioning a website to HTTPS

A while back I first added an HTTPS version of my personal site alongside the existing HTTP version and then decided that I was going to actively migrate it to HTTPS. The whole thing has been running for a few months now, so it seems about time to write up some things I've learned from it.

The first set of lessons I learned was about everything on my side, especially my own code. The first layer of problems was code and the like with explicit 'http:' bits in it; it was amazing and depressing how many places I was just automatically doing that (you could call this 'HTTP blindness' if you wanted a trendy term for it). The more subtle problem areas were things like caches, where an HTTP version of a page might be different from an HTTPS version yet I was storing them under the same cache key. I also ran into a situation where I wanted to generate output for an HTTP URL request but use the 'canonical' HTTPS URLs for links embedded in the result; this required adding a feature to DWiki.
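To illustrate both of those fixes, here is a minimal Python sketch. The function names and the 'example.org' host are hypothetical, not DWiki's actual code; the point is only that the cache key includes the request scheme and that embedded links are always generated in their canonical HTTPS form.

	# Hypothetical sketch; none of these names come from DWiki itself.
	def cache_key(environ, path):
	    # Include the request scheme so the HTTP and HTTPS renderings of a
	    # page are cached separately instead of overwriting each other.
	    scheme = environ.get('wsgi.url_scheme', 'http')
	    return "%s:%s" % (scheme, path)

	def canonical_url(path, host="example.org"):
	    # Links embedded in generated pages always use the canonical HTTPS
	    # form, even when the page itself is being served over plain HTTP.
	    return "https://%s%s" % (host, path)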

(I also found a certain amount of other software that didn't cope well. For example, the Fedora 19 version of mod_wsgi doesn't seem to cope with a single WSGI application group that's served over both HTTP and HTTPS; the HTTPS environment value latches to one value and never changes.)

Once I had my own code working I got to find out all sorts of depressing things about how other people's code deals with such a transition. In no particular order:

  • While search engines did eventually switch over to returning HTTPS results and to crawling only the HTTPS version of my site, it took a surprisingly long time (and the switch may not be complete even now; it's hard to tell).

  • Many syndication feed fetchers have not changed to the HTTPS version; they still request an HTTP URL and then get redirected. I will reluctantly concede that there are sensible reasons for this behavior. It does mean that the HTTP redirects will probably live on forever.

  • There are a certain number of syndication feed fetchers that still don't deal with HTTPS feeds, or at least with redirections to them. Yes, really, in 2013. Unfortunately, two of these are FeedBurner and the common Planet software, both of which I at least sort of care about. This led to the 'generate the HTTP version but use the canonical HTTPS links' situation for my software.

  • Some web spiders don't follow redirects for robots.txt. I decided not to redirect for that URL alone rather than block the spiders outright in the server configuration, partly because the former was a bit easier than the latter (there's a sketch of the idea after this list).

    (I already totally ban the spiders in robots.txt, which is one reason I wanted them to see it.)
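For what it's worth, here is one way to express that robots.txt exemption, sketched as a small piece of WSGI middleware rather than as my actual Apache configuration (all of the names here are hypothetical, and query strings are ignored for simplicity):

	# Hypothetical sketch: redirect HTTP requests to HTTPS, except for
	# /robots.txt, which some spiders won't follow redirects for. The
	# real exemption lives in the web server configuration instead.
	def https_redirector(app, canonical_host="example.org"):
	    def middleware(environ, start_response):
	        path = environ.get('PATH_INFO', '/')
	        if environ.get('wsgi.url_scheme') == 'http' and path != '/robots.txt':
	            location = "https://%s%s" % (canonical_host, path)
	            start_response('301 Moved Permanently',
	                           [('Location', location),
	                            ('Content-Type', 'text/plain')])
	            return [b'Moved to ' + location.encode('ascii')]
	        return app(environ, start_response)
	    return middleware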

Despite all of this, the process has been relatively straightforward and mostly without problems. To the extent that there were problems, I'm more or less glad to know about them (and to fix my code; it was always broken, I just didn't realize it).

HTTPSTransitionLessonsLearned written at 02:30:49; Add Comment

2013-10-19

I should never have allowed 'outside' content to break my layout

Here is a lesson that only sank into my head very recently: you should never allow user-contributed content to break your layout. Or, really, any outside content; by this I mean things that show up basically outside of your control and get dropped into your pages.

You might wonder how on earth you stumble into this problem in the first place. In my case, what happened here on Wandering Thoughts was comments with preformatted text that had lines too long for the width of your browser window. This has been an infrequent but long-standing problem, but I never did anything about it (at least at the layout level). That was a mistake.

Your site is, well, your site. It's your responsibility to make it so that your site looks right and works right; as a corollary, it's basically your fault if it doesn't. Where the content that's doing it comes from is at one level irrelevant because you're still the site owner. It's not as if my visitors are going to go 'well, it's totally not Chris's fault that looking at the comments on this entry makes the whole thing unreadable'. Instead their reaction is more likely to be to close the window and leave WT.

(And I don't get to blame the people leaving the comments with the wide preformatted lines, either. I'm sure the lines look fine to them on their displays. How they look on other displays is, again, my problem.)

The other way to put this is that outside content is not sacred and its nominal integrity is not a higher priority than your site's overall layout. It may feel bad to mangle the layout and readability of outside content, but it is generally the lesser evil.

How you make sure that outside content won't break your layout depends on a bunch of things. How I finally dealt with the long <pre> line problem here is that I forced all <pre> blocks in comments to be line-wrapped via a CSS setting of 'white-space: pre-wrap'. This can sort of break the formatting of such comments, but I consider this a lesser evil (and they still render correctly if the browser window is wide enough).

(It turns out that I couldn't make the CSS overflow property do what I wanted it to do, perhaps because of my table-based layout.)

(As is sometimes the case, I'm writing this down now partly in the hopes of getting it embedded in my brain so that next time around I remember it and follow through.)

UserContentAndLayout written at 00:58:58; Add Comment

2013-10-02

What your User-Agent header should include and why

I wound up having a discussion about this in the context of a feed reader and it caused me to have a realization or two, so I've decided to write up my views on the subject. All of this is mostly from the perspective of a website operator; there are other perspectives.

There are three different cases: when you are writing a user agent, when you are writing a web robot, and when you are writing a web robot library (which will be used by possibly many web robot operators). The easiest case is when you're writing a client that will be directly used by real people. Here your User-Agent should identify the software by name, point at a URL for your project site, and give a general version number. It should not identify the user, either directly by name or indirectly by including additional client fingerprint information such as the platform it's running on. As a side note, your project site should include enough information to convince a suspicious website operator that it is a real client that gets used by real people.

(Some people will object to the version number but I think it's important to include because it lets me either tell people to upgrade because the upgrade fixes a problem or tell you that your latest code has some problem. If you leave the version number out all I can possibly report to your project is 'some version of your software does this bad thing'.)
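As a concrete illustration, a desktop feed reader's User-Agent might look something like the following (the software name and URL are invented), set explicitly on each request rather than left at the HTTP library's default:

	# Hypothetical example of a good client User-Agent: software name,
	# project URL, and a general version number, but nothing that
	# identifies the individual user or their platform.
	import urllib.request

	USER_AGENT = "ExampleFeedReader/2.1 (+https://example.org/feedreader)"

	def fetch(url):
	    # Set the User-Agent explicitly; library defaults usually only
	    # identify the HTTP library, which tells a site operator nothing.
	    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
	    with urllib.request.urlopen(req) as resp:
	        return resp.read()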

This is completely different for web robots. For web robots the User-Agent header must contain a clear identification of both your robot and of who is responsible for its operation, i.e. the URL of a web page describing who you are, what you do, and so on. There should be readable English on the page and a method of contacting you privately (such as email or a contact form). It is vaguely customary to include a version number, but as a website operator I don't care in the least; you might as well always use '/1.0' if you feel a version number is required.

Including this information in your User-Agent is to your benefit because it encourages website operators to investigate and perhaps report some crawling program instead of blocking you out of hand (either by user-agent or by source IPs, or perhaps both). I have much harsher reactions to anonymous robots than I do to ones that are willing to identify themselves. Note that if you're a company running software from your servers that is poking my websites, you're a robot operator. At one level I don't care exactly why you're running the software or how many users it is helping; I still expect it to identify the specific party responsible for itself. Fail to do this and I reach for the block tools.

(And yes, this very much applies to feed reader aggregator sites.)
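To make that concrete, a robot's User-Agent might look something like this (the names, URL, and address are all invented for the example):

	# Hypothetical User-Agent for a web robot: it names the robot and, more
	# importantly, points at a page describing who runs it and how to
	# contact them. The version number is largely decorative.
	ROBOT_USER_AGENT = ("ExampleCrawler/1.0 "
	                    "(+https://example.com/about-our-crawler; "
	                    "contact: crawler@example.com)")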

If you're writing a web robot library you need to somehow force its users to add such a clear identification of themselves into the User-Agent (although including your library's project URL is nice, it is not an identification of the responsible party for the robot that is hitting my site). I'd put this into the library's configuration as a mandatory field or make it an optional setting but with the default value of something like 'UNCONFIGURED, BLOCK THIS ROBOT'. Note that if you supply 'sensible' default values, many of your library's users will never change them.
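Here is a rough sketch of the second option in a hypothetical Python fetching library (this is not any real library's API); the identification defaults to something a website operator will block on sight, so unconfigured robots stand out:

	# Hypothetical robot-library API that refuses to hide who is crawling.
	# The identification has a deliberately unusable default, so operators
	# who never configure it are easy to spot and block.
	UNCONFIGURED = "UNCONFIGURED, BLOCK THIS ROBOT"

	class RobotFetcher:
	    def __init__(self, identification=UNCONFIGURED, library="examplebot/1.0"):
	        # identification should name the operator and point at a page
	        # describing the robot, eg "FooCrawler (+https://foo.example/bot)".
	        self.user_agent = "%s %s" % (identification, library)

	    def headers(self):
	        return {"User-Agent": self.user_agent}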

(If you're writing a web library for use by real clients, I wouldn't bother having any default User-Agent or putting your library's identification in. Just provide an API for supplying the user agent information and document what's a good idea to put in there. Make using the API mandatory because otherwise people won't. Putting your library information in as well is okay and potentially useful, but your library information alone in the User-Agent is completely useless to website operators because it tells us nothing about what is visiting.)

UserAgentContentsView written at 01:07:28; Add Comment

