Wandering Thoughts

2019-10-14

Googlebot is both quite fast and very determined to crawl your pages

I recently added support to DWiki (the engine behind Wandering Thoughts) to let me more or less automatically generate 'topic' index pages, such as the one on my Prometheus entries. As you can see on that page, the presentation I'm using has links to entries and links to the index page for the days they were posted on. I'm not sure that the link to the day is particularly useful but I feel the page looks better that way, rather than just having a big list of entry titles, and this way you can see how old any particular entry is.

The first version of the code had a little bug that generated bad URLs for the target of those day index page links. The code was only live for about two hours before I noticed and fixed it, and the topic pages didn't appear in the Atom syndication feed, just in the page sidebar (which admittedly appears on every page). Despite that short time being live, in that time Googlebot crawled at least one of the topic pages and almost immediately began trying to crawl the bad day index page URLs, all of which generated 404s.

You can probably guess what happened next. Despite always getting 404s, Googlebot continued trying to crawl various of those URLs for about two weeks afterward. At this point I don't have complete logs, but for the logs that I do have it appears that Googlebot only tried to crawl each URL once; there just were a bunch of them. However, I know that its initial crawling attempts were more aggressive than the tail-off I have in the current logs, so I suspect that each URL was tried at least twice before Googlebot gave up.

(I was initially going to speculate about various things that this might be a sign of, but after thinking about it more I've realized that there really is no way for me to have any good idea of what's going on. So many things could factor into Googlebot's crawling decisions, and I have no idea what is 'normal' for its behavior in general or its behavior on Wandering Thoughts specifically.)

PS: The good news is that Googlebot does appear to eventually give up on bad URLs, or at least bad URLs that have never been valid in the past. This is what you'd hope, but with Googlebot you never know.

GoogleCrawlingPersistence written at 23:15:31; Add Comment

2019-10-05

The wikitext problem with new HTML elements such as <details>

I recently wrote about my interest in HTML5's <details> element. One of the obvious potential places to use <details> (when it becomes well supported) is here on Wandering Thoughts; not only is it the leading place where I create web content, but I also love parenthetical asides (perhaps a little too much) and <details> would be one way to make some of them less obtrusive. Except that there is a little problem in the way, which is that Wandering Thoughts isn't written in straight HTML but instead in a wikitext dialect.

When you have a wiki, or in general any non-HTML document text that is rendered down to HTML, using new HTML elements is necessarily a two-step process. First, you have to figure out what you're going to sensibly use them for, which is the step everyone has to do. But then you have a second step of figuring out how to represent this new HTML element in your non-HTML document text, ideally in a non-hacky way that reflects the resulting HTML structure and requirements (for example, that <details> is an inline 'flow' element, not a block element, which actually surprised me when I looked it up just now).

Some text markup languages allow you to insert arbitrary HTML, which works but is a very blunt hammer; you're basically going to be writing a mix of the markup language and HTML. There probably are markup languages that have extra features to improve this, such as letting you tell them something about the nesting rules and so on for the new HTML elements you're using. My wikitext dialect deliberately has no HTML escapes at all, so I'd have to add some sort of syntax for <details> (or any other new HTML element) before I could use it.

(Life is made somewhat simpler because <details> is a flow element, so it doesn't need any new wikitext block syntax and block parsing. Life is made more difficult because you're going to want to be able to put a lot of content with a lot of markup, links, and so on inside the <details>, which means that certain simplistic approaches aren't good answers in the way they are for, for example, <ABBR>.)

At a sufficiently high level, this is a general tradeoff between having a single general purpose syntax as HTML does (okay, it has a few) and having a bunch of specialized syntaxes. The specialized syntaxes of wikitext have various advantages (for instance, it's a lot faster and easier for me to write this entry in DWikiText than it would be in HTML), but they also lack the easy, straightforward extensibility of the general purpose syntax. If you have a different syntax for everything, adding a new thing needs a new syntax. With HTML, you just need a name (and the semantics).

('Syntax' is probably not quite the right word here.)

HTMLDetailsWikiProblem written at 18:44:04; Add Comment

2019-10-01

My interest in and disappointment about HTML5's new <details> element

Because I checked out from paying attention to HTML's evolution years ago, it took me until very recently to hear about the new <details> element from HTML5. Put simply and bluntly, it's the first new HTML element I've heard of that actually sounds interesting to me. The reason for this is straightforward; it solves a problem that previously might have taken Javascript or at least complex CSS, namely the general issue of having some optional information on a web page that you can reveal or hide.

(That's the surface reason. The deeper reason is that it's the only new HTML5 tag that I've heard of that has actual browser UI behavior associated with it, instead of just semantic meaning.)
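(For anyone else who has checked out of HTML's evolution, a minimal use of <details> looks something like the following; the summary and content text here are just made-up examples:)

<details>
  <summary>More details</summary>
  <p>Additional content that stays hidden until you activate the summary.</p>
</details>

The browser itself supplies the disclosure control and the show and hide behavior, with no Javascript required.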

Now that I've heard of it, I've started to notice people using it (and I've also started to assume that if I click on the browser UI associated with it, something will actually happen; unfortunately Firefox's current rendering doesn't make it obvious). And when I look around, there are all sorts of things that I might use <details> for, both here on Wandering Thoughts and elsewhere, because optional or additional information is everywhere if you look for it.

(Here on Wandering Thoughts, one form of 'optional' information is comments on blog entries. Currently these live behind a link that you have to click and that loads a separate page, but <details> would let them be inline in the page and revealed more readily. Of course there are various sorts of tradeoffs on that.)

I was all set to make this a very enthusiastic entry, but then I actually looked at the browser compatibility matrix from MDN and discovered that there is a little problem; <details> is not currently supported in Microsoft Edge (or IE). Edge may not be as popular as it used to be, but I'm not interested in cutting off its users from any of my content (and we can't do that at work). This can be fixed with a Javascript polyfill, but that would require adding Javascript and I'm not that interested.

Given that Edge doesn't support it yet and that IE is out there, it will probably be years before I can assume that <details> just works. Since the 'just works' bit is what makes it attractive to me, I sadly don't think I'm going to be using it any time soon. Oh well.

(HTML5 has also added a number of important input types; I consider these separate from new elements, partly because I had already somewhat heard about them.)

HTMLDetailsNotYet written at 23:24:06; Add Comment

2019-09-20

Modernizing (a bit) some of our HTML form <input> elements

We have a Django web app for handling requests for Unix accounts, which has some HTML forms (in fact it's basically half HTML form filling). These forms (and all of the app's HTML) were put together years ago and only looked at on the desktop at the time. Recently, I poked around the app's forms on the work iPad to see how it would go. Even after I fixed the traditional viewport issue (see the comments), there were little irritations; for example, when you entered your desired Unix login, the iPad wanted to capitalize the first letter as part of its general auto-capitalization. Our Unix logins have to be all lower case, so this was a point of friction. Naturally I wondered if it was possible to improve the experience.
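(The traditional viewport issue is dealt with by the standard meta tag in the page's <head>; a minimal version looks like the following, although the exact content value here is the commonly recommended one rather than a quote from our actual templates:)

<meta name="viewport" content="width=device-width, initial-scale=1">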

(Our Django web app was written in 2011, and we first got a work iPad in mid-2014. In 2011, not checking phone and tablet browser behavior was not crazy; today, it probably is and we likely should pay more attention to how all of our sites look on them.)

Unsurprisingly, there are ways to improve the situation by simple changes to our HTML, although not all of them were completely successful. You can find a variety of people writing about this online, and also the MDN page on <input type="text"> form fields, although it doesn't mention the autocapitalize attribute. The short version is that we apparently want to set all of the following attributes for a text field that is supposed to be a login:

<input type="text" autocapitalize="none"
   autocorrect="off" spellcheck="false" ...>

Spellchecking is obviously not applicable; very few logins are dictionary words. Autocorrection is similarly probably not desirable, and autocapitalization is what we started out not wanting. The autocorrect attribute is a Safari extension, but apparently Android may want you to use autocomplete instead (which is a nominally standard attribute with all sorts of possible values).

The form also has a field for people's names. I set this to 'autocapitalize="words"', and should probably also set it 'autocomplete="name"' now that I've read about it. On my iOS devices, some combination of the attributes we're using (possibly including a 'name="name"' attribute from Django) causes Safari to be willing to autocomplete it from your contacts, which is handy if your contacts include yourself.
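(Put together, the relevant attributes on the name field come out looking something like this; it's a sketch rather than the exact HTML that Django generates for us:)

<input type="text" name="name" autocapitalize="words"
   autocomplete="name" ...>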

My less successful experiment was setting a 'pattern=...' and a 'title=...' attribute on the field for your login. What I wanted was for the browser to automatically react with a helpful error message when you entered an invalid character, but my flailing around so far hasn't produced this. We have some additional client side validation, but it involves a server check and so only triggers when the field is de-focused; faster feedback would be nice. However, my initial reading left me with the impression that doing a good job of this across both desktop and mobile browsers would require more JavaScript than I wanted to write.
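(What I was trying looked roughly like the following; the pattern here is a simplified stand-in for our real login rules, not the actual rule we use:)

<input type="text" autocapitalize="none" autocorrect="off"
   spellcheck="false" pattern="[a-z][a-z0-9]*"
   title="Logins must be all lower case letters and digits" ...>

In theory the browser refuses to submit the form when the field doesn't match the pattern and uses the title text as part of its error message; what it doesn't do, at least by default, is complain the moment you type an invalid character.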

(MDN has a useful page on general form validation, which I haven't read all of.)

All of this (including writing this entry) has done a good job of showing me how ignorant I am about modern HTML. Things have definitely changed here over the last decade or so, which is good to see even if it leaves me well behind the times.

PS: Django is already setting appropriate 'type=...' values on things like the field for your email address, or that would be another obvious and necessary change to make here.

ModernizingSomeInputElements written at 23:08:14; Add Comment

2019-09-18

Firefox, DNS over HTTPS, and us

The news of the time interval is that Mozilla will soon start rolling out DNS over HTTPS for US users, where by 'rolling out' Mozilla means 'enabling by default'. To their minimum credit, Mozilla says that they will explicitly notify people of this change and give them the opportunity to opt out. I hope and assume that this will work much like how Mozilla rolled out various tracking protection measures, including with how thoroughly informative that was.

(Clearly notifying people and giving them the chance to opt out is the obvious right thing to do, but Mozilla's track record on doing the obvious right thing is somewhat mixed.)

Since we're not in the US, this doesn't immediately affect people here; however, I have to assume that Mozilla is going to start rolling DNS over HTTPS out more broadly than just the USA. Given things like GDPR, Mozilla may not push this to Europe any time soon, but there probably aren't many roadblocks for rolling it out in Canada. My overall views on this remain unchanged; there are tradeoffs in either direction, and I have no idea what the right choice is in general.

For my department in particular, Firefox switching to DNS over HTTPS presents a potential problem because we have a split horizon DNS setup where some names resolve to different IPs internally than they do externally. According to Mozilla's blog post, the Firefox DoH implementation has some heuristics to detect a split horizon DNS environment, but from the vague descriptions we have so far it's not clear if they would reliably trigger for our users. If people here wind up with Firefox configured to use DNS over HTTPS and Mozilla's split horizon DNS heuristics don't trigger, they won't be able to connect to some of our hosts. We could theoretically say that this is people's fault in the same way that setting their machine to always use one of the public resolvers is, but this is the wrong answer, since Mozilla will have made this setting for them.

Mozilla currently supports a way for networks to explicitly disable DNS over HTTPS, by making your local resolver return NXDOMAIN for a canary domain. This is easy to do in Unbound, which we use on our local OpenBSD resolvers (see here or here). We could preemptively deploy this, but I tentatively think that we should wait to see if the Firefox split horizon detection heuristics work in our environment. Working heuristics would be the best answer for various reasons (including that Mozilla may find too many people abusing the canary domain and start paying less attention to it).
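(For reference, the canary domain is use-application-dns.net, and the Unbound version of 'return NXDOMAIN for it' is roughly a one-line local-zone entry in the server section of unbound.conf:)

server:
  local-zone: "use-application-dns.net." always_nxdomain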

For work, there's probably no point in adding DNS over HTTPS to our local resolving DNS servers, even once it's supported on the OpenBSD version of Unbound. As far as I know, people here would have to specifically configure their Firefox to talk to our servers, and then their configuration would break when they moved outside of our network and could no longer reach our resolving DNS servers.

For my own personal use, I may eventually add DNS over HTTPS support to my resolving Unbound instances, because apparently DoH is the only way to get encrypted SNI. Unfortunately it also apparently normally requires DNSSEC, so unless I can get my Unbound to lie about that (or Firefox to not care), I may be out of luck. I do wish I could tell Firefox that a resolver on localhost was trusted even without DoH, but I suspect that I can't.

(This does raise long term issues about encrypted SNI support for our users, but perhaps in the long term people will come up with answers. Hopefully ones that don't involve DNSSEC.)

FirefoxDNSOverHTTPSAndUs written at 23:37:31; Add Comment

2019-08-28

Allowing some Alias directives to override global Redirects in Apache

When I wrote Apache, Let's Encrypt, and site-wide reverse proxies and HTTP redirections, I confidently asserted that there was no way to override a Redirect for just some URLs, so that you could Alias the /.well-known/acme-challenge/ URL path off to somewhere while still redirecting the entire site to somewhere else. It turns out that there is a way of doing this under some circumstances, and these circumstances are useful for common Let's Encrypt configurations.

The magic trick is that if you put your Redirect directive inside a <Directory> directive, it only applies to URLs that resolve to paths inside that directory hierarchy. URLs that resolve to elsewhere, for example because they have been remapped by an Alias, are not affected and are passed through unaffected. This is extremely useful because in common configurations for Let's Encrypt clients, the challenge directory is often mapped to a common outside location in the filesystem, such as /var/run/acme/acme-challenge. So, for a virtual host you can set a DocumentRoot to some suitable spot that's not used for anything else and then wrap the site-wide redirect inside a <Directory> directive for your DocumentRoot, like this:

DocumentRoot /some/stub
<Directory /some/stub>
  Redirect permanent / https://..../
</Directory>

(It seems common to supply the Alias and <Directory> directives for the Let's Encrypt stuff in a general configuration snippet that's applied to all virtual hosts. Doing this globally is one reason to make them all go to a common spot in the filesystem.)
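(A sketch of such a snippet, using the example filesystem location from above; the details will vary with your Let's Encrypt client and your Apache setup:)

Alias /.well-known/acme-challenge/ /var/run/acme/acme-challenge/
<Directory /var/run/acme/acme-challenge>
  Require all granted
</Directory>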

The stub DocumentRoot probably has to exist (and have permissions that allow Apache access), but it doesn't have to have anything useful in it. It's there purely to confine the Redirect away from the Alias.

(I stumbled over this trick somewhere on the Internet, but I can't find where any more.)

PS: I don't think you need to specify any AllowOverride or Options settings in your <Directory>, because they're all surplus if you're not doing anything with the stub directory tree except the Redirect. Our <Directory> sections tend to have these even when the entire site is being proxied or redirected, but that's because we're creatures of habit here.

ApacheAliasOverRedirectTrick written at 00:13:00; Add Comment

2019-08-24

Apache, Let's Encrypt, and site-wide reverse proxies and HTTP redirections

Back in the days before Let's Encrypt, life was simple if you had an entire virtual host that wanted to be redirected somewhere (perhaps from its HTTP version to its HTTPS one) or served through a reverse proxy (which is our solution to various traditional problems with a shared webserver), since both of these were single directives in Apache. Then along came Let's Encrypt, where the simplest and easiest way to authenticate your control over a website is through their HTTP challenge, which requires specially handling random URLs under /.well-known/acme-challenge/. Now you want to reverse proxy or redirect everything but the Let's Encrypt challenge directory, and that is not entirely easy.

The easiest case is a reverse proxy, because there's a ProxyPass directive to say 'don't proxy this':

ProxyPass /.well-known/acme-challenge/ !
ProxyPassReverse /.well-known/acme-challenge/ !

(I'm not sure if you need the ProxyPassReverse rule.)

These come before your regular ProxyPass, because the first match wins as far as proxying goes. With reverse proxying not being done for this, your Alias directive can take over.
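(Put together, the relevant portion of such a virtual host winds up looking roughly like this; the backend URL and the challenge directory location are stand-ins, not anything we actually use:)

# the exclusions come first; the first matching ProxyPass wins
ProxyPass /.well-known/acme-challenge/ !
ProxyPassReverse /.well-known/acme-challenge/ !
Alias /.well-known/acme-challenge/ /var/run/acme/acme-challenge/
ProxyPass / http://backend.example.com/
ProxyPassReverse / http://backend.example.com/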

Redirection is unfortunately not so simple. The straightforward way to redirect an entire site is 'Redirect / ...', and once you put this in the configuration there is, as far as I can tell, no way to override it for some URLs. Redirect specifically acts before Alias and almost anything else, and can't be turned off for a subset of your URL space through a specific <Location> block.

If you only want to redirect the root of your site and have random URLs error out, you can use a restricted RedirectMatch that doesn't match the Let's Encrypt challenge path (or much of anything else):

RedirectMatch temp "^/$" https://<whatever>/

Apache doesn't appear to have any general support for negated regular expressions (unless it's hiding in the depths of PCRE), so you can't easily write a RedirectMatch directive that matches everything except the Let's Encrypt challenge path. You can do a more general version, for example one that skips all paths that start with '/.':

RedirectMatch temp "^/([^.].*|$)" https://<whatever>/$1

(As a disclaimer, I haven't tested this regular expression.)

If you want a HTTP redirection that specifically excludes only the Let's Encrypt challenge directory, then apparently you need to switch from plain Redirect and company to doing your redirection through mod_rewrite:

RewriteEngine on
RewriteCond %{REQUEST_URI} !^/\.well-known/acme-challenge/ [NC]
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=302]

(RewriteCond specifically supports negating regular expression matches.)

In theory if you're just redirecting from the HTTP to the HTTPS version of your site, you might let the Let's Encrypt challenge get redirected too. In practice I would be wary of a chicken and egg problem, where you might not be able to get Let's Encrypt to accept the redirection unless you already have a valid TLS certificate for your HTTPS site. Of course in that case you could just temporarily shut down the redirection entirely, since without a valid TLS certificate the HTTPS version is not too usable. But that requires manual action.

(Perhaps and hopefully there are other solutions that I'm missing here.)

ApacheLetsEncryptVsRedirect written at 23:06:30; Add Comment

2019-08-20

Saying goodbye to Flash (in Firefox, and in my web experience)

Today, for no specific reason, I finally got around to removing the official Adobe-provided Linux Flash plugin packages from my office workstation. I was going to say that I did it on both my home and my office machine, but it turns out that I apparently removed it on my home machine some time ago; the flash-plugin package was only lingering on my work machine. This won't make any difference to my experience of the web in Firefox, because some time ago I disabled Flash in Firefox itself, setting the plugin to never activate. Until I walked away from Chrome, Chrome, with its bundled version of Flash, was what I reached for when I needed Flash.

I kept the plugin around for so long partly because for a long time, getting Flash to work was one of the painful bits of browsing the web on Linux. Adobe's Linux version of Flash was behind the times (and still is), for a long time it was 32-bit only, and over the years it required a variety of hacks to get it connected to Firefox (cf, and, and, and so on). Then, once I had Flash working, I needed more things to turn it off when I didn't want it to play. After all of this, having an officially supplied 64-bit Adobe Flash package that just worked (more or less) seemed like sort of a miracle, so I let it sit around even well after I'd stopped using it.

Now, though, the web has moved on. The last website that I cared about that used Flash moved to HTML5 video more than a year ago, and as mentioned I haven't used Flash in Firefox for far longer than that. Actively saying goodbye by removing the flash-plugin package seemed about time, and after all of the hassles Flash has put me through over the years, I'm not sad about it.

(Flash's hassles haven't just been in the plugin. I've had to use a few Flash-heavy websites over the years, including one that I at least remember as being implemented entirely as a Flash application, and the experience was generally not a positive one. I'm sure you can do equally terrible things in HTML5 with JavaScript and so on, but I think you probably have to do more work and that hopefully makes people less likely to do it.)

Flash is, unfortunately, not the last terrible thing that I sort of need in my browsers. Some of our servers have IPMI BMCs that require Java for their KVM over IP stuff, specifically Java Web Start. I actually keep around a Java 7 install just for them, although the SSL ciphers they support are getting increasingly ancient and hard to talk to with modern browsers.

(I normally say TLS instead of SSL, but these are so old that I feel I should call what they use 'SSL'.)

PS: I'm aware that there is (or was) good web content done in Flash and much of that content is now in the process of being lost, and I do think that that is sad. But for me it's kind of an abstract sadness, since I never really interacted with that corner of the web, and also I'm acclimatized to good things disappearing from the web in general.

FlashGone written at 23:53:26; Add Comment

2019-07-12

Browsers can't feasibly stop web pages from talking to private (local) IP addresses

I recently read Jeff Johnson's A problem worse than Zoom (via), in which Johnson says:

[...] The major browsers I've tested — Safari, Chrome, Firefox — all allow web pages to send requests not only to localhost but also to any IP address on your Local Area Network! Can you believe that? I'm both astonished and horrified.

(Johnson mostly means things with private IP addresses, which is the only sense of 'on your local and private network' that can be usefully determined.)

This is a tempting and natural viewpoint, but unfortunately this can't be done in practice without breaking things. To understand this, I'll outline a series of approaches and then explain why they fail or cause problems.

To start with, a browser can't refuse to connect to private IP addresses unless the URL was typed in the URL bar because there are plenty of organizations that use private IP addresses for their internal web sites. Their websites may link to each other, load resources from each other, put each other in iframes, and in general do anything you don't want an outside website to do to your local network, and it is far too late to tell everyone that they can't do this all of a sudden.

It's not sufficient for a browser to just block access by explicit IP address, to stop web pages from poking URLs like 'http://192.168.10.10/...'. If you control a domain name, you can make hosts in that domain have arbitrary IP addresses, including private IP addresses and 127.0.0.1. Some DNS resolvers will screen these out except for 'internal' domains where you've pre-approved them, but a browser can't assume that it's always going to be behind such a DNS resolver.
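(In Unbound, for example, this screening is done with the private-address and private-domain settings, roughly as follows; the internal domain here is a made-up example:)

server:
  private-address: 10.0.0.0/8
  private-address: 172.16.0.0/12
  private-address: 192.168.0.0/16
  private-address: 127.0.0.0/8
  private-domain: "corp.example.com"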

(Nor can the browser implement such a resolver itself, because it doesn't know what the valid internal domains even are.)

To avoid this sort of DNS injection, let's say that the browser will only accept private IP addresses if they're the result of looking up hosts in top level domains that don't actually exist. If the browser looks up 'nasty.evil.com' and gets a private IP address, it's discarded; the browser only accepts it if it comes from 'good.nosuchtld'. Unfortunately for this idea, various organizations like to put their internal web sites into private subdomains under their normal domain name, like '<host>.corp.us.com' or '<host>.internal.whoever.net'. Among other reasons to do this, this avoids problems when your private top level domain turns into a real top level domain.

So let's use a security zone model. The browser will divide websites and URLs into 'inside' and 'outside' zones, based on what IP address the URL is loaded from (something that the browser necessarily knows at the time it fetches the contents). An 'inside' page or resource may refer to outside things and include outside links, but an outside page or resource cannot do this with inside resources; going outside is a one-way gate. This looks like it will keep internal organizational websites on private IP addresses working, no matter what DNS names they use. (Let's generously assume that the browser manages to get all of this right and there are no tricky cases that slip by.)

Unfortunately this isn't sufficient to keep places like us working. We have a 'split horizon' DNS setup, where the same DNS name resolves to different IP addresses depending on whether you're inside or outside our network perimeter, and we also have a number of public websites that actually live in private IP address space but that are NAT'd to public IPs by our external firewall. These websites are publicly accessible, get linked to by outside things, and may even have their resources loaded by outside public websites, but if you're inside our network perimeter and you look up their name, you get a private IP address and you have to use this IP address to talk to them. This is exactly an 'outside' host referring to an 'inside' resource, which would be blocked by the security zone model.

If browsers were starting from scratch today, there would probably be a lot of things done differently (hopefully more securely). But they aren't, and so we're pretty much stuck with this situation.

BrowsersAndLocalIPs written at 21:49:48; Add Comment

2019-07-07

Straightforward web applications are now very likely to be stable in browsers

In response to my entry on how our goals for our web application are to not have to touch it, Ross Hartshorn left a comment noting:

Hi! Nice post, and I sympathize. However, I can't help thinking that, for web apps in particular, it is risky to have the idea of software you don't have to touch anymore (except for security updates). The browsers which are used to access it also change. [...]

I don't think these are one-off changes, I think it's part of a general trend. If it's software that runs on your computer, you can just leave it be. If it's a web app, a big part of it is running on someone else's computer, using their web browser (a piece of software you don't control). You will need to update it from time to time. [...]

This is definitely true in a general, abstract sense, and in the past it has been true in a concrete sense, in that some old web applications could break over time due to the evolution of browsers. However, this hasn't really been an issue for simple web applications (ones just based around straight HTML forms), and these days I think that even straightforward web applications are going to be stable over browser evolution.

The reality of the web is that there is a huge overhang of old straightforward HTML, and there has been for some time; in fact, for a long time now, at any given point in time most of the HTML in existence is 'old' to some degree. Browsers go to great effort to not break this HTML, for the obvious reason, and so any web application built around basic HTML, basic forms, and the like has been stable (in browsers) for a long time now. The same is true for basic CSS, which has long since stopped being in flux and full of quirks. If you stick to HTML and CSS that is at least, say, five years old, everything just works. And you can do a great deal with that level of HTML and CSS.

(One exhibit for this browser stability is DWiki, the very software behind this blog, which has HTML and CSS that mostly fossilized more than a decade ago. This includes the HTML form for leaving comments.)

Augmenting your HTML and CSS with Javascript has historically been a bit more uncertain and unstable, but my belief is that even that has now stopped. Just as with HTML and CSS, there is a vast amount of old(er) Javascript on the web and no one wants to break it by introducing incompatible language changes in browsers. Complex Javascript that lives on the bleeding edge of browsers is still something that needs active maintenance, but if you just use some simple Javascript to do straightforward progressive augmentation, I think that you've been perfectly safe for some time and are going to be safe well into the future.

(This is certainly our experience with our web application.)

Another way to put this is that the web has always had some stable core, and this stable core has steadily expanded over time. For some time now, that stable core has been big enough to build straightforward web applications. It's extremely unlikely that future browsers will roll back very much of this stable core, if anything; it would be very disruptive and unpopular.

(You don't have to build straightforward web applications using the stable core; you can make your life as complicated as you want to. But you're probably not going to do that if you want an app that you can stop paying much attention to.)

WebAppsAndBrowserStability written at 23:23:22; Add Comment
