Wandering Thoughts archives

2016-05-15

It's time for me to upgrade my filtering HTTP proxy

I've been using a filtering HTTP proxy for a very long time now in order to filter out various sorts of things I didn't want from my browsing experience (most prominently cookies). I probably haven't been doing this for quite as long as filtering proxies have been available, but I now suspect that it's actually close, because it turns out that the filtering proxy I use was last updated in 1998. Increasingly, this long stasis in my filtering proxy is kind of a problem that I should deal with.

I haven't stuck with Junkbuster because I think it's some paragon of perfection here. Instead I've stuck with it because it still works without problems and, more importantly, I have a moderate collection of non-default filtering rules that I would have to copy over to any new filtering tool that I adopted (probably Privoxy, which I'm actually already using for filtering stuff related to syndication feed reading). But even though it still works, using Junkbuster is not without problems. Really, there are two big ones.

First, Junkbuster is HTTP/1.0 only, which means that all of my interactions with HTTP websites are constrained down to that. In practice this probably just costs me some amount of latency and speed. I'm not going to say that this is unimportant, but I can't say I've really noticed it. Still, I'd kind of like to have HTTP/1.1 available, if only to be nicer to websites out there by reusing the same connection instead of opening new one after new one.

More importantly (and more relevantly), Junkbuster is very definitely IPv4 only. This means that all of my regular HTTP browsing is IPv4 only, since it all goes through Junkbuster. Even if a site offers IPv6, I'll ignore that. As a result I don't actually use IPv6 all that much even when I have it available, and as a result of that I don't necessarily notice if my IPv6 connectivity breaks for some reason. I would like to change this, which definitely means a new proxy.

A mitigating factor is that all of this is irrelevant for HTTPS connections. Those are not proxied through anything for the obvious reasons, which means that they get HTTP/1.1 (or HTTP/2) and IPv6 support (and also that I have to rely purely on the protections of my browser addons). Over time I expect more and more browsing I do to be HTTPS browsing, but I also don't expect HTTP browsing to go away any time soon; there are still quite a lot of sites that are HTTP-based and they're probably going to stay that way for, oh, the next decade or more.

(As is traditional, I'm writing this entry partly to motivate myself into actually doing this at some point. Since nothing is really broken today, the work required is not entirely attractive; it's basically a bunch of work for very little or no visible improvement. Although probably I can simplify or eliminate a bunch of my current filtering rules; it's not as if I pay them much attention, so a bunch are likely to be long obsolete.)

web/ProxyUpgradeTime written at 01:08:40; Add Comment

2016-05-14

IPv6 is the future of the Internet

I say, have said, and will say a lot of negative things about IPv6 deployment and usability. I'm on record as believing that large scale IPv6 usage will cause lots of problems in the field, with all sorts of weird failures and broken software (and some software that is not broken as such but is IPv4 only), and that in practice lots of people will be very slow to update to IPv6 and there will be plenty of IPv4 only places for, oh, the next decade or more.

But let me say something explicitly: despite all that, I believe that IPv6 is the inevitable future of the Internet. IPv6 solves real problems, those problems are getting more acute over time, the deployment momentum is there, and and sooner or later people will upgrade. I don't have any idea of how soon this will happen ('not soon' is probably still a good bet), but over time it's clear that more and more traffic on the Internet will be IPv6, despite all of the warts and pain involved. The transition will be slow, but at this point I believe it's long since become inevitable.

(Whether different design and deployment decisions could have made it happen faster is an academically interesting question but probably not one that can really be answered today, although I have my doubts.)

This doesn't mean that I'm suddenly going to go all in on moving to IPv6. I still have all my old cautions and reluctance about that. I continue to think that the shift will be a bumpy road and I'm not eager to rush into it. But I do think that I should probably be working more on it than I currently am. I would like not to be on the trailing edge, and sooner or later there are going to be IPv6 only services that I want to use.

(IPv6 only websites and other services are probably inevitable but I don't know how soon we can expect them. Anything popular will probably be a sign of the trailing edge, but I wouldn't be surprised to see a certain sort of tech-oriented website go IPv6 only earlier than that as a way of making a point.)

As a result, I now feel that I should be working to move my software and my environment towards using IPv6, or at least being something that I can make IPv6 enabled. In part this means looking at programs and systems I'm using that are IPv4 only and considering what to do about them. Hopefully it will also mean making a conscious effort not to write IPv4 only code in the future, even if that code is easier.

(I would say 'old programs', but I have recently written something that's sort of implicitly IPv4 only because it contains embedded assumptions about eg doing DNS blocklist lookups.)

Probably I should attempt to embark on another program of learning about IPv6. I've tried that before, but it's proven to have the same issue for me as learning computer languages; without an actual concrete project, I just can't feel motivated about learning the intricacies of IPv6 DHCP and route discovery and this and that and the other. But probably I can look into DNS blocklists in the world of IPv6 and similar things; I do have a project that could use that knowledge.

tech/IPv6IsTheFuture written at 00:20:40; Add Comment

2016-05-13

You can call bind() on outgoing sockets, but you don't want to

It started with some tweets and some data by Julia Evans. In the data she mentions:

and here are a few hundred lines of strace output. What's going on? it is running bind() all the time, but it's making outgoing HTTP connections. that makes no sense!!

It turns out that this is valid behavior according to the Unix API, but you probably don't want to do this for a number of reasons.

First off, let's note more specifically what Erlang is doing here. It is not just calling bind(), it is calling bind() with no specific port and address picked:

bind(18, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16 <unfinished ...>

Those arguments are INADDR_ANY and the magic port (port 0) that tells bind() that it can pick any ephemeral port. What this bind() does is assign the socket a local ephemeral port (from whatever the ephemeral port range is). Since we specified the local address as INADDR_ANY, the socket remains unbound to any specific local IP; the local IP will only be chosen when we connect() the socket to some address.

(This is visible in anything that exposes the Unix socket API and has a getsockname() operation. I like using Python, since I can do all of this from an interactive REPL.)

There really isn't very much point in doing this for sockets that you're going to use for outgoing connections; about all it achieves is letting you know your local port before you make the connection, instead of only afterwards. In exchange for this minor advantage you make one extra system call and also increase your chances of running out of ephemeral ports under load, because you're putting an extra constraint on the kernel's port allocation.

In general, IP requires each connected socket to have a unique tuple of (local IP, local port, remote IP, report port). When you leave an outgoing socket unbound until you connect(), the kernel has the maximum freedom to find a local port that makes the tuple unique, because all it needs is one of the four things to be unique, not necessarily the local port number. If you're connecting to different ports on a remote server, the same port on different remote servers, or whatever, it may be able to reuse a local port number that's already been used for something else. By contrast, if you bind() before you connect and use INADDR_ANY, the kernel pretty much has the minimum freedom; it must ensure that the local port alone is completely unique, so that no matter what you then do with listen() and connect() later you'll never collide with an existing tuple.

(See this article for a discussion of how the Linux kernel does this, and in fact this entire issue.)

Some Unixes may frontload all of the checks necessary into bind(), but at least some of them defer some checks to connect(), even for pre-bound sockets. This is probably a sensible decision, especially since a normal connect() can fail because of ephemeral port exhaustion.

I'm sure there's some advantage to this 'bind before connect' approach, but I'm honestly hard pressed to think of any.

(There are situations where you want to bind to a specific IP address, but that's not what's happening here.)

(I sort of always knew that it was valid to bind() before calling connect() but I didn't know the details, so writing this has been useful. For instance, before I started writing I thought maybe the bind() picked an IP address as well as the ephemeral port, which turns out not to be the case; it leaves the IP address unbound. Which is really not that surprising once I think about it, since that's what you often do with servers; they listen to a specific port on INADDR_ANY. All of which goes to show that sometimes it's easier for me to experiment and find out things rather than reason through them from first principles.)

unix/BindingOutgoingSockets written at 02:08:18; Add Comment

2016-05-11

We're never going to be able to have everyone use two factor authentication

Every so often I think about two-factor authentication here or talk to people about it. In the process of doing this, I've come around to an important realization:

We're never going to be able to have all our users use two factor authentication.

The largest fundamental issue is cost. If we require universal two factor authentication, the department needs to provide 2FA tokens to everyone (and then manage them, of course). Current typical prices are $50 US one time costs plus we pay for all management, or $33/year (and someone else worries about management). At that cost level, we're looking at tens of thousands of dollars of cost. The budget for that is simply not there. Even with probably unpopular moves like charging graduate students a '2FA token deposit' or the like, we still have a not insignificant population that the department would have to cover the costs for directly (either directly out of base budget or by forcing professors to pay for postdocs, visitors, etc who are associated with them).

(We cannot assume that all people have smartphones, either, and delegate our 2FA authentication to smartphone apps.)

I presented this as a pure cost issue, but of course it's more than that; it's a cost versus benefits tradeoff here. If we were guarding extremely important data or systems it might be a cost that the department either was willing to incur or had no choice about, but the blunt reality is that we're not. And in practice moving everyone to 2FA would provide only a modest increase in security over what we currently have, since (as far as we know) account credentials are not getting compromised left and right. With high expenses and only a modest security increase, the tradeoff is not in favour of universal 2FA.

This doesn't mean that we'll never have any 2FA. Instead, what it means is that any two factor authentication deployment is going to be a selective one, where some accounts will be protected by it but many others won't be. A selective deployment puts various constraints on what software we can use and how we can deploy things. It also suggests that we may want to be able to use more than one 2FA system. Some people are likely to already have central university-issued 2FA tokens, some people will have smartphones that they can use for 2FA, and some people may have locally purchased 2FA tokens like Yubikeys. It would be good if we could accommodate them all, although this may not be realistic for various reasons.

sysadmin/UsNeverEntirely2FA written at 23:29:41; Add Comment

The difference between 'Delete' and 'Destroy' in X window managers

If you use a certain sort of window manager (generally an old school one), you may have discovered that there are two operations the window manager supports to tell X applications to close themselves. Names for these operations vary, but I will go with 'Delete' and 'Destroy' for them because this is fvwm's terminology. Perhaps you've wondered why there are two ways to do this, or what the difference is. As you might guess, the answers are closely tied to each other.

The one way to describe the difference between the two operations is who takes action. When your window manager performs a 'Delete' operation, what it is really doing is sending a message to the application behind the selected window saying 'please close yourself' (specifically it is sending a WM_DELETE_WINDOW message). This message and the whole protocol around it is not built in to the X server and the wire protocol; instead it is an additional system added on top.

(Potentially confusingly, this 'please close your window' stuff is also called a protocol. People who want to see the details can see the XSetWMProtocols() manpage and read up on client messages. See eg this example, and also the ICCCM.)

On the other hand, the 'Destroy' operation talks directly to the X server to say either 'destroy this window' or more commonly 'terminate the connection of the client responsible for this window'. Unlike 'Delete', this requires no cooperation from the client and will work even if the client has hung or is deciding to ignore your request to close a window (perhaps the client believes it's just that special).

Generally there are big side effects of a 'Destroy'. Since most programs only create a single connection to the X server, having the server close this connection will close all of the program's windows and generally cause it to immediately exit. Even if only a single window is destroyed, this usually causes the program to get X errors and most programs exit the moment they get an X error, which of course closes all of the program's windows.

How programs react to being asked politely to close one of their windows varies, but usually if a program has multiple windows it won't exit entirely, just close the specific window you asked it to. Partly this is because the 'close window' button on the window frame is actually doing that 'Delete' operation, and very few people are happy if a program exits entirely just because you clicked the little X for one of its windows.

Because 'Delete' is a protocol that has to be handled by the client and some clients are just that old, there are or at least were a few core X clients that didn't support it. And if you write an X client from scratch using low-level libraries, you have to take explicit steps to support it and you may not have bothered.

(To be friendly, fvwm supports a 'Close' operation that internally checks to see whether a window supports 'Delete' and uses that if possible; for primitive clients, it falls back to 'Destroy'. I suspect that many window managers support this or just automatically do it, but I haven't looked into any to be sure.)

Sidebar: 'Delete' and unresponsive clients

It would be nice if your window manager could detect that it's trying to send a 'Delete' request to a client that theoretically supports it but isn't responding to it, and perhaps escalate to a Destroy operation. Unfortunately I don't know if the window manager gets enough information from the X server to establish that the client is unresponsive, as opposed to just not closing the window, and there are legitimate situations where the client may not close the window itself right away, or ever.

(Consider you trying to 'Delete' a window with unsaved work. The program involved probably wants to pop up a 'do you want to save this?' dialog, and if you then ignore the dialog everything will just sit there. And if you click on 'oops, cancel that' the whole situation will look much like the program is ignoring your 'Delete' request.)

I believe that some window managers do attempt to detect unresponsive clients, but at most they pop up a dialog offering you the option to force-close the window and/or client. Others, such as fvwm, just leave it entirely to you.

unix/XDeleteVersusDestroy written at 03:19:16; Add Comment

2016-05-09

An Apache trick: using directories to create redirections

Suppose, not entirely hypothetically, that you're using Apache ProxyPass directives as a reverse proxy to map /someurl/ on your website to another web server. You generally have to redirect the URL with the trailing slash, but of course you would like a request for a plain '/someurl' to also work right instead of just getting a 404 status response. Here, 'work right' means that you'll generate a redirection from '/someurl' to '/someurl/'.

It's certainly possible to do this with one of the various Apache directives for explicit redirections (I'd use RedirectMatch). But often there's an easier way: use a real filesystem directory. If somedir is a directory in the Apache document root and you send a request for '/somedir', Apache's default behavior is to send exactly the redirection we want. And Apache doesn't care if the '/somedir/' URL is being diverted somewhere other than the filesystem (via ProxyPass or other directives); it will still send that redirection regardless.

So we can just do 'mkdir /docroot/someurl' and everything will work. The directory contents don't matter; I tend to put a README file there with a note about how the directory is just there for the redirection and actual contents in it will be ignored.

(This redirection trick happens automatically if you're using .htaccess files in a directory to control, say, internal redirections. However there are various reasons for not using .htaccess files, including centralizing your configuration in one easily visible place instead of spraying it all over the filesystem in a bunch of nominally invisible files.)

Back when I wrote my entry on ProxyPass, I theorized about using this trick. I can now say that we've actually done this in our configuration and it works fine.

(This trick is a very old one, of course; I'm sure people have been doing it in Apache for ages. I just feel like writing it down explicitly for various reasons.)

web/ApacheDirectoryRedirectTrick written at 22:01:33; Add Comment

You can't use expvar.Func to expose a bunch of expvar types

Suppose, hypothetically, that you have a collection of statistics that are each one of the expvar package's types. You want to put them in a namespace (so they are all 'mything.var1', 'mything.var2' and so on), but you'd like to avoid the tedious clutter of registering each expvar variable by hand with .Set(). So you have a clever idea.

First, you will embed all of the variables in a structure:

var somestats struct {
   var1, var2, var3   expvar.Int
   var4, var5, var6   expvar.Int
}

Then you will write a Stats() function that simply hands back the structure and then register it through expvar.Func():

func ReportStats() interface{} {
   return somestats
}

Unfortunately, this will not work (at least today). As I mentioned in my first set of notes on the expvar package, expvar.Func turns what your function returns into JSON by using json.Marshal, and this only returns exported fields. None of the expvar variable types have any exported fields, and so as a result expvar.Func() will convert them all to empty JSON (or possibly malfunction). You just can't get there from here.

This is kind of a bug (at a minimum, expvar.Func should document this restriction), but it's unlikely to change (apart from the documentation being updated). Beyond it not working today, there's no way to have a simple ReportStats function like this that work safely, and since you can't do this there's little to no point in making expvar variable types JSON-serializable through json.Marshal.

(To make this work, each expvar type would implement a MarshalJSON() method that did the appropriate thing. In fact, since expvar.String is really MarshalJSON() in disguise, you could just make one call the other.)

Sidebar: Why clear and complete documentation matters

Here is a question: is it deliberate that the 'thing to JSON' function for expvar.Var is not called MarshalJSON, or is it a historical accident? You can certainly argue that because my pattern above is fundamentally wrong, it's a feature that it doesn't work at all. Thus the choice of not using MarshalJSON in the expvar.Var interface could be entirely deliberate and informed, since it makes a broken thing (and all its variants) simply not work. Or this could be ultimately a mistake on the order of using String() in the expvar.Var interface, and so something that would be corrected in a hypothetical Go 2 (which is allowed to change APIs).

Without better documentation, people who come along to Go later just don't know. If you want to preserve the intent of the original designers of the language and the standard library, it really helps to be clear on what that intent is and perhaps the logic behind it. Otherwise it's hard to tell deliberate decisions from accidents.

programming/GoExpvarFuncLimit written at 02:45:15; Add Comment

2016-05-08

Issues in fair share scheduling of RAM via resident set sizes

Yesterday I talked about how fair share allocation of things needs a dynamic situation and how memory was not necessarily all that dynamic and flow-based. One possible approach to fair share allocation of memory is to do it on Resident Set Size. If you look at things from the right angle, RSS is sort of a flow in that the kernel and user programs already push it back and forth dynamically.

(Let's ignore all of the complications introduced on modern systems by memory sharing.)

While there has been some work on various fair share approaches to RSS, I think that one issue limiting the appeal here is that significantly constraining RSS often has significant undesirable side effects. Every program has a 'natural' RSS, which is the RSS at which it only infrequently or rarely has to ask for something that's been removed from its set of active memory. If you clamp a program's RSS below this value (and actually evict things from RAM), the program will start trying to page memory back in at a steadily increasing rate. Eventually you can clamp the program's RSS so low that it makes very little forward progress in between all of the page-ins of things it needs.

Up until very recently, all of this page-in activity had another serious effect: it ate up a lot of IO bandwidth to your disks. More exactly, it tended to eat up your very limited random IO capacity, since these sort of page-ins are often random IO. So if you pushed a program into having a small enough RSS, the resulting paging would kill the ability of pretty much all programs to get IO done. This wasn't technically swap death, but it might as well have been. To escape this, the kernel probably needs to limit not just the RSS but also the paging rate; a program that was paging heavily would wind up going to sleep more and more of the time in order to keep its load impact down.

(These days a SSD based system might have enough IO bandwidth and IOPS to not care about this.)

All of this is doable but it's also complicated, and it doesn't get you the sort of more or less obviously right results that fair share CPU scheduling does. I suspect that this has made fair share RSS allocation much less attractive than simpler things like CPU scheduling.

tech/FairShareRSSProblems written at 01:15:58; Add Comment

2016-05-07

'Fair share' scheduling pretty much requires a dynamic situation

When I was writing about fair share scheduling with systemd the other day, I rambled in passing about how I wished Linux had fair share memory allocation. Considering what fair share memory allocation would involve set off a cascade of actual thought, and so today I have what is probably an obvious observation.

In general what we mean by 'fair share' scheduling or allocation is something where your share of a resource is not statically assigned but is instead dynamically determined based on how much other people also want. Rather than saying that you get, say, 25% of the network bandwidth, we say that you get 1/Nth of it where N is how many consumers want network bandwidth. Fair share scheduling is attractive both because it's 'fair' (no one gets to over-consume a resource), it doesn't require setting hard caps or allocations in advance, and it responds to usage on the fly.

But given this, fair share scheduling really needs to be about something dynamic, something that can easily be adjusted on the fly from moment to moment and where current decisions are in no way permanent. Put another way, fair share scheduling wants to be about dividing up flows; the flow of CPU time, the flow of disk bandwidth, the flow of network bandwidth, and so on. Flows are easy to adjust; you (the fair share allocator) just give the consumers more or less this time around. If more consumers show up, the part of the flow that everyone gets becomes smaller; if consumers go away, the remaining ones get a larger share of the flow. The dynamic nature of the resource (or of the use of the resource) means that you can always easily reconsider and change how much of it the consumers get.

If you don't have something that's dynamic like this, well, I don't think that fair share scheduling or allocation is going to be very practical. If adjusting current allocations is hard or ineffective (or even just slow), you can't respond very well to consumers coming and going and thus the 'fair share' of the resource changing.

The bad news here is pretty simple: memory is not very much of a flow. Nor is, say, disk space. With relatively little dynamic flow nature to allocations of these things, they don't strike me as things where fair share scheduling is going to be very successful.

tech/FairShareAllocationAndFlows written at 01:27:40; Add Comment

2016-05-06

A weird little Firefox glitch with cut and paste

Yesterday I wrote about Chrome's cut and paste, so to be fair today is about my long standing little irritation with Firefox's cut and paste. Firefox doesn't have Chrome's problem; I can cut and paste from xterm to it without issues. It's going the other way that there's a little issue under what I've now determined are some odd and very limited circumstances.

The simplest way to discuss this is to show you the minimal HTML to reproduce this issue. Here it is:

<html> <body>
<div>word1
  word2</div>
</body> </html>

If you put this in a .html file, point Firefox at the file, double click on 'word2', and try to paste it somewhere, you will discover that Firefox has put a space in front of it (at least on X). If you take out the whitespace before 'word2' in the HTML source, the space in the paste goes away. No matter how many spaces are before 'word2', you only get one in the pasted version; however, if you put a real hard tab before word2, you get a tab instead of a space.

(You can add additional ' wordN' lines, and they'll get spaces before the wordN when pasted. Having only word2 is just the minimal version.)

You might wonder how I noticed such a weird thing. The answer is that this structure is present on Wandering Thoughts entry pages, such as the one for this entry. If you look up at the breadcrumbs at the top (and at the HTML source), the name of the page is structured like this. As it happens, I do a lot of selecting the filenames of existing entries when I'm writing new entries (many of my entries refer back to other entries), so I hit this all the time.

(Ironically I would not have hit this issue if I didn't care about making the HTML generated by DWiki look neat. The breadcrumbs are autogenerated, so there's no particular reason to indent them in the HTML; it just makes the HTML look better.)

This entry is also an illustration of the use of writing entries at all. Firefox has been doing this for years and years, and for those years and years I just assumed it was something known because I never bothered to narrow down exactly when it happened. Writing this entry made me systematically investigate the issue and even narrow down a minimal reproduction so I can file a Firefox bug report. I might even get a fixed version someday.

PS: If you use Firefox on Windows or Mac, I'd be interested to know if this cut and paste issue happens on them or if it's X-specific.

web/FirefoxCutAndPasteBug written at 00:58:35; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.