Wandering Thoughts

2017-03-19

Using Firefox's userContent.css for website-specific fixes

In a comment on my entry on using pup to fix a Twitter issue, Chris Povirk suggested using Firefox's userContent.css feature to fix Twitter by forcing the relevant bit of Twitter's HTML to actually display. This is actually a pretty good idea, except for one problem; if you write normal CSS there, it will apply to all sites. Anywhere there is some bit of HTML that matches your CSS selector, Firefox will apply your userContent.css CSS fixup. In the case of this Twitter issue, this is probably reasonably safe because it's not likely that anyone else is going to use a CSS class of 'twitter-timeline-link', but in other cases it's not so safe and in general it makes me twitchy.

Luckily, it turns out that there is a way around this (as I found out when I did some research for this entry). In Firefox it's possible to write CSS that's restricted to a single site or even a single URL (among some other options), using a special Mozilla extension to CSS (it's apparently been proposed as a standard feature but delayed repeatedly). In fact if you use Stylish or some other extensions you've already seen this used, because Stylish relies on this Mozilla CSS feature to do its site-specific rules.

You do this with a @-moz-document CSS rule (see here for Mozilla's full list of their CSS extensions). The Stylish userstyles wiki has some examples of what you can do (and also); these go all the way to regular expression matches (and also). If we want to restrict something to a domain, for example twitter.com, the matching operation we want is domain(twitter.com) or perhaps url-prefix(https://twitter.com/).

So in this particular case, the overall userContent.css I want is:

@-moz-document domain(twitter.com) {
  .twitter-timeline-link {
    display: inline !important;
  }
}

(I'm willing to have this affect any Twitter subdomains as well as twitter.com itself.)
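If I ever want the narrower version that leaves Twitter subdomains alone, my understanding is that switching the match to url-prefix() would do it. A sketch (untested beyond reading Mozilla's documentation):

@-moz-document url-prefix(https://twitter.com/) {
  .twitter-timeline-link {
    display: inline !important;
  }
}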

This appears to work on Twitter, and I'm prepared to believe that it doesn't affect any other website without bothering to try to construct a test case (partly because it certainly seems to work for things like Stylish). I don't know if using just userContent.css is going to have the memory leak I see with Stylish, but I guess I'm going to find out, since I've now put this Twitter fix in my brand new userContent.css file.

An extension like Stylish appears to have only a few advantages over just modifying userContent.css, but one of them is a large one; Stylish can add and change its CSS modifications on the fly, whereas with userContent.css you appear to have to quit and restart your Firefox session to pick up changes. For me this is an acceptable tradeoff to avoid memory leaks, because in practice I modified my overrides only rarely even when I was using Stylish.

FirefoxPerSiteUserCSS written at 01:53:29; Add Comment

2017-03-12

CSS, <pre>, and trailing whitespace lead to browser layout weirdness

Today someone left a comment on this entry of mine about Python which consisted of a block of code. The comment looked normal in my feed reader, but when I looked at it in my browser (for neurotic reasons) I got a surprise; the new comment was forcing my Firefox to display the page really widely, with even regular text out of my viewport. This was very surprising because I theoretically made that impossible years ago, by forcing all <pre> blocks in comments to have a CSS white-space: pre-wrap setting. At first I thought that my CSS had broken at some point, but with Firefox's style debugging tools I could see the actual CSS being applied and it had the proper white-space setting.

More experimentation showed that things were even weirder than it had initially looked. First, the behavior depended on how wide my Firefox window was; if it dropped below the critical width for my responsive design here to show the sidebar as a sidebar, the problem went away. Second, the exact behavior depended on the browser; in Chrome, the overall page gained a horizontal scrollbar but no actual content extended out the right side of the browser's viewport (ie, its visible window area).

(I've fixed how the live version of the page renders, but you can see the original version preserved here. Feel free to play around with your browser's tools to see if you can work out why this is happening, and I'd love to know what other browsers beyond Firefox and Chrome do with it.)

Eventually (and more or less by luck), I stumbled over what was causing this (although I still don't know why). The root cause is that the <pre> block has a huge area of whitespace at the end of almost every line. Although it looks like the widest <pre> line is 76 characters long, all but the last line are actually 135 characters long, padded out with completely ordinary spaces.

The MDN writeup of white-space contains a hint as to why this is happening, when it says that for pre-wrap 'sequences of white space are preserved'. This is what you need in preformatted text for many purposes, but it appears to mean that the really long runs of trailing whitespace in these lines are being treated as single entities that force the content width to be very wide. Firefox doesn't visibly wrap these lines anywhere and has the whitespace forcing the surrounding boxes to be wide, while Chrome merely had it widen the overall page width without expanding the content boxes. My Chrome at the right width will force the longest line or two of the <pre> content to wrap.

My fix for now was to use magic site admin powers to edit the raw comment to trim off the trailing whitespace. In theory one possible CSS-level fix for this is to also set the word-break CSS property to 'break-all' for <pre> elements, which appears to make Firefox willing to break things in the middle of this sort of whitespace. However this also makes Firefox willing to break <pre> elements in the middle of non-whitespace words, which I find both ugly and unreadable. What I really want is a setting for word-break that means 'try not to break words in the middle, but do it if you have to in order to not run over'.
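To be concrete, here is a sketch of what that would-be fix looks like, using a bare pre selector for simplicity; the white-space setting is what comments here already get, and the word-break line is the addition with the drawback I just described:

pre {
  white-space: pre-wrap;
  word-break: break-all;
}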

(Maybe in another ten years CSS will have support for that. Yes, there's overflow-wrap for regular text, in theory, but it doesn't seem to do anything here. Possibly this is because Firefox doesn't feel that the large chunk of whitespace is actually overflowing its containing box but instead it's growing the containing box. CSS makes my head hurt.)

CSSPreLayoutTrailingWhitespace written at 01:47:25; Add Comment

2017-03-11

Your live web server probably has features you don't know about

As has become traditional, I'll start with my tweet:

TIL that Ubuntu gives you an Apache /cgi-bin/ script alias that maps to /usr/lib/cgi-bin. Do you know what's in that dir on your web server?

Most web servers have configuration files and configuration processes that are sufficiently complicated that almost no one writes configurations for them from scratch. For their own reasons, these servers simply require you to specify too many things; all of the modules you want loaded, all of the decisions about character set mappings, all of the sensible defaults that must actually be explicitly specified in a configuration file somewhere, and so on. Instead, to configure many web servers we start from vendor-supplied configuration files; generally the vendor is our OS vendor. In turn the OS vendor's configuration is generally derived from a combination of the standard or example configuration file in the upstream source plus a number of things designed to make an installed web server package work 'sensibly' out of the box.

Very frequently, this configuration contains things that you may not have expected. My discovery today was one of them. From the perspective of Ubuntu (and probably Debian), this configuration makes a certain amount of sense; it creates an out of the box feature that just works and that can be used by other packages that need to provide CGI-BINs that will 'just work' on a stock Ubuntu Apache without further sysadmin intervention. This is the good intention that is generally behind all of these surprises. In practice, though, this makes things work in the simple case at the cost of giving people surprises in the complex one.

(I suspect that Apache is especially prone to this because Apache configuration is so complex and baroque, or at least it is as usually presented. Maybe there is a small, simple hand-written configuration hiding deep inside all of the standard mess.)

I don't have any great fixes for this situation. We're probably never going to hand write our Apache configurations from the ground up so that we know and understand everything in them. This implies that we should at least scan through enabled modules and configuration snippets to see if anything jumps out at us.
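As a sketch of what that minimal scan might look like on an Ubuntu or Debian style Apache (I believe the stock /cgi-bin/ snippet is called serve-cgi-bin, but check the grep output on your own system rather than trusting my memory):

# see which configuration snippets and modules are actually enabled
ls /etc/apache2/conf-enabled /etc/apache2/mods-enabled

# look for script aliases and other things you didn't put there yourself
grep -ri scriptalias /etc/apache2

# if the /cgi-bin/ alias comes from serve-cgi-bin.conf, Debian and Ubuntu
# let you turn it off with:
a2disconf serve-cgi-bin
systemctl reload apache2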

This issue is related to but not quite the same as web server configurations that cross over between virtual hosts. In the crossover case we wanted Apache UserDirs, but only on one virtual host instead of all of them; in today's /cgi-bin/ script alias case, we didn't want this on any virtual host.

(Looking back at that entry I see that back then I was already planning to audit all the Apache global settings as a rainy day project. I guess I should move it up the priority list a bit.)

WebServerUnanticipatedFeatures written at 00:07:30; Add Comment

2017-02-27

The conflict between wildcard TLS certificates and Certificate Transparency

Certificate Transparency is an increasingly important part of the modern TLS world, especially for website certificates (which I believe are still the dominant use of TLS certificates). One part of Certificate Transparency is monitoring for certificates issued for your own sites and domains, but that's not the only use; another one is looking for certificates issued to suspicious names. For instance, a bunch of people would probably be interested if someone issued a certificate for www.really-its-paypal-login.com or geeglemail.com or any number of other TLS certificate names like that.

(This is not just of interest to the real Paypal and Google, it's also of interest to things like email spam filtering systems, intrusion detection systems, and the systems that help provide browser warnings of suspicious websites.)

The examples I've shown here put the alarming name in the domain name itself, but that doesn't have to be the case. Often it's going to be easier to get an alarming name into a subdomain instead of a domain; for a start, you don't have to worry about a domain registrar alerting the moment something that matches *paypal* shows up in a new domain registration. When an attacker embeds the alarming name as a subdomain, one of the few ways that outside people can spot it is when the TLS certificate information shows up in the Certificate Transparency logs, because the TLS certificate exposes the full host name.

Well, at least until wildcard certificates come along. When combined with CT, the effect of wildcard certificates is to hide from scrutiny all of the names that can be put into the wildcarded portion. People monitoring the CT logs no longer see 'login.paypal.really.somedom.wat'; all they see is '*.somedom.wat' or '*.really.somedom.wat', which of course means that they basically see nothing.

(There are good aspects of this as well as bad ones, since CT with full host names exposes internal host names that you may not want to have known for various reasons.)

As a result, I'm not particularly surprised that Let's Encrypt doesn't support wildcard certificates. Let's Encrypt is intended for public hosts, and with automated issuance I feel that Certificate Transparency is especially important in case something goes wrong. Not issuing wildcard certificates maximizes public visibility into what LE is actually doing and issuing.

With all of this said, Let's Encrypt's FAQ says that their primary reason for not issuing wildcard certificates is the question of automated issuance (which I suspect partly means automated proving of control), not any philosophical reason. It's possible that LE would decide they had philosophical reasons too if people came up with a good technical solution; I guess we may find out someday.

CertificateTransparencyVsWildcardCerts written at 21:57:12; Add Comment

2017-02-19

Using pup to deal with Twitter's increasing demand for Javascript

I tweeted:

.@erchiang 's pup tool just turned a gnarly HTML parsing hassle into a trivial shell one liner. Recommended. https://github.com/ericchiang/pup

I like pup so much right now that I want to explain this and show you what pup let me do easily.

I read Twitter through a moderately Rube Goldberg environment (to the extent that I read it at all these days). Choqok, my Linux client, doesn't currently support new Twitter features like long tweets and quoted tweets; the best it can do is give me a link to read the tweet on Twitter's website. Twitter itself is increasingly demanding that you have JavaScript on in order to make their site work, which I refuse to turn on for them. The latest irritation is a feature that Twitter calls 'cards'. Cards basically embed a preview of the contents of a link in the tweet; naturally they don't work without JavaScript, and naturally Twitter is turning an increasing number of completely ordinary links into cards, which means that I don't see them.

(This includes the Github link in my tweet about pup. Good work, Twitter.)

If you look at the raw HTML of a tweet, the actual link URL shows up in a number of places (well, the t.co shortened version of it, at least). In a surprise to me, one of them is in an actual <a> link in the Tweet text itself; unfortunately, that link is deliberately hidden with CSS and I don't currently have a viable CSS modification tool in my browser that could take that out. If we want to extract this link out of the HTML, the easiest place is in a <div> that has the link mentioned as a data-card-url attribute:

<div class="js-macaw-cards-iframe-container initial-card-height card-type-summary"
[...]
data-card-url="https://t.co/LEqaB79Lbg"
[...]

All we have to do is go through the HTML, find that attribute, and extract its value. There are many ways to do this, some better than others; you might use curl, grep, and sed, or you might write a program in the language of your choice to fetch the URL and parse through the HTML with your language's HTML parsing tools.
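For example, a rough curl plus grep plus sed version might look something like this (a sketch, and a fragile one, since it's pattern matching against raw HTML):

curl -s "$URL" |
  grep -o 'data-card-url="[^"]*"' |
  sed -e 's/^data-card-url="//' -e 's/"$//' |
  head -1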

This is where Eric Chiang's pup tool comes in. Pup is essentially jq for HTML, which means that it can be inadequately described as a structured, HTML-parsing version of grep and sed (see also). With pup, this problem turns into a shell one-liner:

wcat "$URL" | pup 'div[data-card-url] attr{data-card-url}'

The real script that uses this is somewhat more than one line, because it actually gets the URL from my current X selection and then invokes Firefox on it through remote control.
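A minimal sketch of that sort of wrapper, with xsel standing in for however you read the X selection and a plain firefox invocation standing in for the remote control part:

#!/bin/sh
# grab the tweet URL from the current X selection
url="$(xsel -o)"
# pull the real link out of the card <div>
real="$(wcat "$url" | pup 'div[data-card-url] attr{data-card-url}' | head -1)"
# hand it to the already-running Firefox
[ -n "$real" ] && exec firefox "$real"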

I've had pup sitting around for a while, but this is the first time I've used it. Now that I've experienced how easy pup makes it to grab things out of HTML, I suspect it's not going to be the last time. In fact I have a hand-written HTML parsing program for a similar job that I could replace with a similar pup one-liner.

(I'm not going to do so right now because the program works fine now. But the next time I have to change it, I'll probably just switch over to using pup. It's a lot less annoying to evolve and modify a shell script than it is to keep fiddling with and rebuilding a program.)

PS: via this response to my tweet, I found out about jid, which is basically an interactive version of jq. I suspect that this is going to be handy in the future.

PPS: That the URL is actually in a real <a> link in the HTML does mean that I can turn off CSS entirely (via 'view page in no style', which I have as a gesture in FireGestures because I use it frequently). This isn't all that great, though, because a de-CSS'd Tweet page has a lot of additional cruft on it that you have to scroll through to get to the actual tweet text. But at least it's an option.

Sidebar: Why I don't have CSS mangling in my Firefox

The short version is that both GreaseMonkey and Stylish leak memory on me. I would love to find an addon that doesn't leak memory and enables this kind of modification (here I'd like to strip a 'u-hidden' class from an <a href=...> link), but I haven't yet.
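For illustration, the sort of CSS override such an addon would need to apply, assuming the u-hidden class works by hiding the element, is something like:

a.u-hidden {
  display: inline !important;
}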

PupFixingTwitterMess written at 01:37:09; Add Comment

2017-02-17

robots.txt is a hint and a social contract between sites and web spiders

I recently read the Archive Team's Robots.txt is a suicide note (via), which strongly advocates removing your robots.txt. As it happens, I have a somewhat different view (including about how sites don't crash under load any more; we have students who beg to differ).

The simple way to put it is that the things I add to robots.txt are hints to web spiders. Some of the time they are a hint that crawling the particular URL hierarchy will not be successful anyways, for example because the hierarchy requires authentication that the robot doesn't have. We have inward facing websites with sections that provide web-based services to local users, and for that matter we have a webmail system. You can try to crawl those URLs all day, but you're not getting anywhere and you never will.

Some of the time my robots.txt entries are a hint that if you crawl this anyways and I notice, I will use server settings to block your robot from the entire site, including content that I was letting you crawl before then. Presumably you would like to crawl some of the content instead of none of it, but if you feel otherwise, well, crawl away. The same is true of signals like Crawl-Delay; you can decide to ignore these, but if you do our next line of defense is blocking you entirely. And we will.
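To make this concrete, a hypothetical robots.txt in this spirit (the paths here are made up):

User-agent: *
Disallow: /webmail/
Disallow: /internal/
Crawl-delay: 10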

(There are other sorts of hints, and for complex URL structures some of them are delivered through nofollow. Beyond not irritating me, there are good operational reasons to pay attention to this.)

This points to the larger scale view of what robots.txt is, which is a social contract between sites and web spiders. Sites say 'respect these limits and we will (probably) not block you further'. As a direct consequence of this, robots.txt is also one method to see whether a web spider is polite and well behaved or whether it is rude and nasty. A well behaved web spider respects robots.txt; a nasty one does not. Any web spider that is crawling URLs that are blocked in a long-standing robots.txt is not a nice spider, and you can immediately proceed to whatever stronger measures you feel like using against such things (up to and including firewall IP address range bans, if you want).

By the way, it is a feature that robots identify themselves when matching robots.txt. An honest and polite web spider is in a better position to know what it is than a site that has to look at the User-Agent and other indicators, especially because people do dangerous things with their user-agent strings. If I ban a bad robot via server settings and you claim to be sort of like that bad robot for some reason, I'm probably banning you too as a side effect, and I'm unlikely to care if that's a misfire; by and large it's your problem.

(With all of this said, the Archive Team has a completely sensible reason for ignoring robots.txt and I broadly support them doing so. They will run into various sorts of problems from time to time as a result of this, but they know what they're doing so I'm sure they can sort the problems out.)

RobotsTxtHintAndSocialContract written at 23:16:33; Add Comment

2017-01-12

Modern ad networks are why adblockers are so effective

My entry on web adblockers and the Usenet killfile problem sparked a discussion on lobste.rs, and as part of that discussion I came to an obvious-in-retrospect realization about why adblockers are so effective and are going to stay that way. Namely, it's because of the needs of modern ad networks.

The modern ad-laden web page is assembled on the fly in the browser, through a variety of methods for getting client-side included content; JavaScript, embedded iframes, images loaded from outside domains, and so on. This client-side assembly is essentially forced on the web ad ecology because ads are both widely distributed and heavily centralized. The ecology is widely distributed in that ads appear on a huge number of websites, large and small; it is heavily centralized because modern ad networks want to be as big as possible, which means being able to rapidly place ads on a lot of websites.

If an ad network wants to be big, it generally has to have a lot of websites that will display its ads. It can't afford to focus all its efforts on a small number of sites and work deeply with them; instead it needs to spread widely. If you want to spread widely, especially to small sites, you need to make it easy for those websites to put your ads on their pages, something simple for both them and you. Having the ads added to web pages by the browser instead of the web server is by far the easiest and most reliable approach for both the ad network and the websites.

(Adding the ads in the client is also kind of forced if you want both sophisticated real time ad bidding systems and web servers that perform well. If the web server's page assembly time budget is, say, 30 milliseconds, it isn't going to be calling out to even a single ad network API to do a 100 millisecond 'bid this ad slot out and give me an ad to serve' operation.)

When you leave it up to the web browser to add the ads to the web page, you make it possible for the web browser to turn around and not actually do that, fundamentally enabling adblockers. When you operate at scale with simple-to-add, simple-to-support snippets of HTML, JavaScript, or CSS that create the hooks for this assembly process, you need exactly the generic 'this thing will be an ad' markers that make it easy for adblockers to block this content en masse across many sites.
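As a purely schematic illustration (the domain, class name, and filter rules are invented, not taken from any real ad network):

<div class="ad-slot">
  <script async src="https://ads.example-network.com/show.js"></script>
</div>

Once a snippet like this is pasted into enough pages, a filter list can block it everywhere with a couple of generic rules, something like '||ads.example-network.com^' to stop the script from loading and '##.ad-slot' to hide the container.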

Big sites can do artisanal content management systems that integrate ads into their native page content in their backends, serving the entire thing to you as one mostly indistinguishable mass (and they're big enough to make it worthwhile for ad networks to do custom things for them). But this is never going to happen for a lot of smaller sites, which leaves ad networks creating the environment that allows adblockers to flourish.

AdblockersEnabledByAdtech written at 00:40:58; Add Comment

2017-01-07

How ready my Firefox extensions are for Firefox Electrolysis

Firefox Electrolysis is Mozilla's push to improve Firefox by making it multiprocess, but this requires a fundamental change in how Firefox extensions interact with Firefox. Mozilla is strongly pushing Electrolysis in 2017 and as part of that is working on deprecating the old (current) Firefox extensions API. Their current schedule entirely abandons old extensions by roughly December of this year (2017), with Firefox 57. Mozilla has made available an extension, Add-on Compatibility Reporter, that can tell you if your extensions are compatible with the modern way of doing things. This is a lot more convenient than going through arewee10syet, so I've decided to write down the state of my regular set of extensions (note that I now use uBlock Origin) and my essential set for basic profiles.

In my essential set, things are simple. FireGestures and uBlock Origin are compatible, but Self-Destructing Cookies is not. That's not great news; SDC is an excellent low-friction way of defending myself against the obnoxious underbrush of cookies. I can't see any sign on SDC's page that an update to add Electrolysis compatibility is in progress, although it might be something that's quietly being worked on.

In my main browser with my regular set of extensions, well, things get mixed:

  • NoScript is compatible (as are FireGestures and uBlock Origin). In less core extensions, so are HTTPS Everywhere and even CipherFox (which I could really uninstall or disable at this point without missing anything).

  • my current cookie management extension, CS Lite Mod, is not compatible but I already knew it was obsolete, not getting updates, and going to have to be replaced someday. It's not clear if there's a good Electrolysis compatible cookie blocking extension yet, though (especially one that doesn't leak memory, which has been a problem in my earlier attempts to find a replacement).

  • FlashStopper is not compatible. Stopping video autoplay on Youtube is not really something I consider negotiable, but in theory NoScript might start working for this. In addition, the 'development channel' part of the addon's page suggests that a compatible version is in progress (see also).

  • It's All Text is not compatible. That's going to hurt a bunch, especially for writing comments on Wandering Thoughts. There's an open issue for it but it's been open since 2015 and apparently the original developer doesn't do much with Firefox any more (see also).

  • Open in Browser is not compatible. I like OiB but it's not a core part of my browser experience the way some of my extensions are. There's an open issue in the project about this.

Without a cookie management extension and with uncertainty about others updating in time (especially since I generally follow Firefox's development version, effectively Firefox Nightly), my most likely action is to simply not update to a version of Firefox that drops support for old extensions.

(The Firefox release calendar suggests that the development version will stop supporting old extensions sometime in June or July, so I really don't have all that much time left.)

FirefoxElectrolysisMyExtensions written at 01:55:04; Add Comment

2017-01-06

Web adblockers and the potential for recreating the Usenet killfile problem

Here is a rambling thought.

Back in the days of Usenet, most Usenet readers supported 'killfiles' for filtering your view of a newsgroup. As newsgroup after newsgroup descended into noise, the common reaction of people was to get more and more elaborate killfiles so they could preserve what they could. The long term problem with this was that new readers of a newsgroup generally had no killfiles, so they generally took one look at the unfiltered version and left.

If you've recently compared the versions of the web you see with and without your adblocker, you may be thinking that this last bit sounds familiar. Increasingly, the raw web is simply an unpleasant place, with more and more things shoving their way in front of your face. Although there are other reasons to block ads, such as keeping your machine safe and reducing data usage, my belief is that a lot of people turn to adblockers in large part to get this clutter out of their face.

So, what happens if adblocking becomes more and more common over time? I suspect that one response from websites will be to run more ads than ever before in an attempt to generate more revenue from the steadily decreasing number of people who are actually seeing ads. If this happens, the result will be to make the raw, adblocker-free Internet an increasingly shitty place. Generally this will be the version of the Internet that new people are exposed to, since new people are quite likely to start out without an adblocker in their browser.

(Browser vendors or system vendors preinstalling adblockers would be a drastic change and would probably provoke lawsuits and other explosions.)

At this point I run out of sensible speculation, so I'm writing this mostly to note the similarity I see in the mechanisms involved. In the spirit of fairness, here's some differences as well:

  • people don't necessarily have good alternatives to ad-laden websites, so maybe they'll just live with the terrible experience (certainly plenty of websites seem to be betting on this).

  • it's getting so that everyone knows about adblockers and it's generally quite easy to install one and start getting a good Internet experience (unlike the experience with Usenet killfiles, which were as if everyone had to write their own adblocker rules).

And, of course, the web could always explode, rendering the whole issue moot.

AdblockersKillfileProblem written at 00:12:19; Add Comment

2016-12-28

The HTML IMG attributes and styling that I think I want

Once upon a long time ago, I put very basic support for inline images into DWikiText (the wikitext dialect used here). At the time I felt I had both simple needs and a simple understanding of the IMG element, so I just specified the image's width and height and called it a day, putting them in as width= and height= attributes in the IMG element.

Much later on, on my personal site, I found that I wanted to include some images and make them as large as would fit in my layout. So I did a hack; I carefully worked out the rough largest image width that would fit in my own browser window at the relatively narrow width I normally keep it at, resized my images to be that wide, and called it a day. This worked for a while (as far as I could see) but then I got an iPhone and looked at my own site with it. The results were not appealing; with my nominally responsive CSS my fixed-width images somehow had the paradoxical effect of forcing the text content to be narrower (and in a smaller font size) than it could be.

So I did some fiddling to see if I could do better. I decided that I wanted my inlined images to have the following behavior:

  1. never push the natural (text) content width wider (much like how I now want all <pre> content to behave)
  2. shrink down in size instead of getting truncated as the content width shrinks
  3. preserve their aspect ratio as the content width shrinks or grows
  4. don't enlarge themselves beyond their full size, because my inlined images don't necessarily look good zoomed up

Based on both experimentation and reading the MDN page on <img>, what I seem to want is <img> tags that specify a width= value (in pixels) but not a height=, combined with the CSS style of 'max-width: 100%'. The CSS max-width gets me my first two things, specifying width appears to mean that browsers won't enlarge the image, and not specifying height appears to make browsers preserve the image aspect ratio if and as they shrink the image. Specifying height as well as width caused at least some browsers to not preserve the aspect ratio, which sort of vaguely makes sense if I squint at it enough.

(You can also put in invalid values for height, like 'x', and get the same effect.)
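As a concrete sketch of this combination (the filename, alt text, 600 pixel width, and the .content selector are all placeholders):

<style>
  .content img { max-width: 100%; }
</style>

<p class="content">
  <img src="photo.jpg" width="600" alt="A photo">
</p>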

This feels a bit like a hack, especially the bit about omitting height=, but it also appears to work. Probably there are some less than desirable effects on page layout on slow networks, but I'll live with them unless I can find a better way.

(Some sources suggest that I should set the CSS height to 'auto' as well. The whole area of scaling images to fit in content areas appears to be rather complicated, based on some Internet searches, or perhaps most everyone is over-engineering it with Javascript and lots of CSS and so on. I'm pretty ignorant about the modern state of CSS, so I'm definitely working by superstition.)

HTMLImageSetupIWant written at 01:46:31; Add Comment

