Wandering Thoughts

2018-02-05

I should remember that sometimes C is a perfectly good option

Recently I found myself needing a Linux command that reported how many CPUs are available for you to use. On Linux, the official way to do this is to call sched_getaffinity and count how many 1 bits are set in the CPU mask that you get back. My default tool for this sort of thing these days is Go and I found some convenient support for this (in the golang.org/x/sys/unix package), so I wrote the obvious Go program:

package main
import (
    "fmt"
    "os"
    "golang.org/x/sys/unix"
)

func main() {
    var cpuset unix.CPUSet
    err := unix.SchedGetaffinity(0, &cpuset)
    if err != nil {
        fmt.Printf("numcpus: cannot get affinity: %s\n", err)
        os.Exit(1)
    }
    fmt.Printf("%d\n", cpuset.Count())
}

This compiled, ran on most of our machines, and then reported an 'invalid argument' error on some of them. After staring at strace output for a while, I decided that I needed to write a C version of this so I understood exactly what it was doing and what I was seeing. I was expecting this to be annoying (because it would involve writing code to count bits), but it turns out that there's a set of macros for this so the code is just:

#define _GNU_SOURCE
#include    <sched.h>
#include    <unistd.h>
#include    <stdio.h>
#include    <stdlib.h>

#define MAXCPUS 0x400

int main(int argc, char **argv) {
    cpu_set_t *cpuset;
    cpuset = CPU_ALLOC(MAXCPUS);

    if (sched_getaffinity(0, CPU_ALLOC_SIZE(MAXCPUS), cpuset) < 0) {
        fprintf(stderr, "numcpus: sched_getaffinity: %m\n");
        exit(1);
    }
    printf("%d\n", CPU_COUNT(cpuset));
}

(I think I have an unnecessary include file in there but I don't care. I spray standard include files into my C programs until the compiler stops complaining. Also, I'm using a convenient glibc printf() extension since I'm writing for Linux.)

This compiled, worked, and demonstrated that what I was seeing was indeed a bug in the x/sys/unix package. I don't blame Go for this, by the way. Bugs can happen anywhere, and they're generally more likely to happen in my code than in library code (that's one reason I like to use library code whenever possible).

The Go version and the C version are roughly the same number of lines and wound up being roughly as complicated to write (although the C version fails to check for an out of memory condition that's extremely unlikely to ever happen).

The Go version builds to a 64-bit Linux binary that is 1.1 Mbytes on disk. The C version builds to a 64-bit Linux binary that is 5 Kbytes on disk.

(This is not particularly Go's fault, lest people think that I'm picking on it. The Go binary is statically linked, for example, while the C version is dynamically linked; statically linking the C version results in an 892 Kbyte binary. Of course, in practice it's a lot easier to dynamically link and run a program written in C than in anything else because glibc is so pervasive.)

When I started writing this entry, I was going to say that what I took from this is that sometimes C is the right answer. Perhaps it is, but that's too strong a conclusion for this example. Yes, the C version is the same size in source code and much smaller as a binary (and that large Go binary does sort of offend my old time Unix soul). But if the Go program had worked I wouldn't have cared enough about its size to write a C version, and if the CPU_SET macros didn't exist with exactly what I needed, the C version would certainly have been more annoying to write. And there is merit in focusing on a small set of tools that you like and know pretty well, even if they're not the ideal fit for every situation.

But still. There is merit in remembering that C exists and is perfectly useful and many things, especially low level operating system things, are probably quite direct to do in C. I could probably write more C than I do, and sometimes it might be no more work than doing it in another language. And I'd get small binaries, which a part of me cares about.

(At the same time, these days I generally find C to be annoying. It forces me to care about things that I mostly don't want to care about any more, like memory handling and making sure that I'm not going to blow my foot off.)

PS: I'm a little bit surprised and depressed that the statically linked C program is so close to the Go program in size, because the Go program includes a lot of complex runtime support in that 1.1 Mbytes (including an entire garbage collector). The C program has no such excuses.

CSometimesGoodAnswer written at 23:34:16; Add Comment

2018-01-28

Adding 'view page in no style' to the WebExtensions API in Firefox Quantum

After adding a 'view page in no style' toggle to the Firefox context menu, I still wasn't really satisfied. Having it in the context menu is better than nothing, but I'm very used to using a gesture for it and I really wanted that in Firefox Quantum (even if I'm not using Firefox 57+ today, I'm going to have to upgrade someday). My first hope was that the WebExtensions API exposed some way to call Firefox's sendAsyncMessage(), so I skimmed through the Firefox WebExtensions API documentation. Unsurprisingly, Firefox does not expose such a powerful internal API to WebExtensions, but I did find browser.tabs.toggleReaderMode(). At this point I had a dangerous thought: how difficult would it be to add a new WebExtensions API to Firefox?

It turns out that it's not that difficult and here is how I did it, building on the base of my context menu hack. Since I already have the core code to toggle 'view page in no style' on and off, all I need is a new API function. The obvious place to start is with how toggleReaderMode is specified and implemented, because it should be at least very close to what I need.

Unfortunately there are a lot of mentions of 'toggleReaderMode' in various files, so we need to look around and guess a bit:

; cd mozilla-central
; rg toggleReaderMode
[...]
browser/components/extensions/schemas/tabs.json
991:        "name": "toggleReaderMode",
[...]

The WebExtensions API has to be defined somewhere and looking at tabs.json pretty much confirms that this is it, as you can see from the full listing. The relevant JSON definition starts out:

{
  "name": "toggleReaderMode",
  "type": "function",
  "description": "Toggles reader mode for the document in the tab.",
  "async": true,
  "parameters": [
  [...]

It's notable that this doesn't seem to define any function name to be called when this API point is invoked, which suggests that the function just has the same name. So we can search for some function of that name:

; rg toggleReaderMode
[...]
browser/components/extensions/ext-tabs.js
1045:        async toggleReaderMode(tabId) {
[...]

This is very likely to be what we want, given the file path, and indeed it is. Looking at the full source to toggleReaderMode() shows that it works roughly how I would expect, in that it winds up calling sendAsyncMessage("Reader:ToggleReaderMode"). This means that all we need to turn it into our new toggleNoStyleMode() API is the trivial modification of changing the async message sent, and of course the name.

First, the implementation, more or less blindly copied from toggleReaderMode:

// <cks>: toggle View Page in No Style
async toggleNoStyleMode(tabId) {
  let tab = await promiseTabWhenReady(tabId);
  tab = getTabOrActive(tabId);

  tab.linkedBrowser.messageManager.sendAsyncMessage("PageStyle:Toggle");
},

toggleReaderMode() has code to deal with the possibility that the tab can't be put in Reader mode, which we've taken out; as before, we toggle things unconditionally without caring if it's applicable. This is probably not what you should do for a proper API, but this is a hack.

Having implemented the API function, we now need to make it part of the actual WebExtensions API by changing tabs.json. We use a straight copy of toggleReaderMode's API specification with the name and description changed (the latter just for neatness):

{
  "name": "toggleNoStyleMode",
  "type": "function",
  "description": "Toggles view page in no style mode for the document in the tab.",
  "async": true,
  "parameters": [
    {
      "type": "integer",
      "name": "tabId",
      "minimum": 0,
      "optional": true,
      "description": "Defaults to the active tab of the $(topic:current-window)[current window]."
    }
  ]
},

With my Firefox Quantum rebuilt with these changes included, I could now add a user script to Foxy Gestures to test and use this. Following the examples of common user scripts, especially the 'Go to URL' example, the necessary JavaScript code is:

 executeInBackground(() => {
    getActiveTab(tab => browser.tabs.toggleNoStyleMode(tab.id));
 });

(Since the tabId argument is optional and probably defaults to what I want, I could probably simplify this down to just calling browser.tabs.ToggleNoStyleMode() with no argument and without the getActiveTab() dance. But I'm writing all of this mostly by superstition, so carefully copying an existing working example doesn't hurt.)

Actually doing all of this and testing it immediately showed me a significant limitation of 'view page in no style' in Firefox Quantum, which is that when Firefox disables CSS for a page this way, it really disables all CSS. Including and especially the CSS that Foxy Gestures has to inject in order to show mouse trails for mouse gestures that you're in the process of making. The gestures still work, but I have to make them blindly. This is probably okay for my usage, but it's an unfortunate limitation in general. Perhaps Mozilla would accept a bug report that View → Page Style → No Style also affects addons.

(If I want another option, uMatrix allows you to disable CSS temporarily, but it takes a lot more work. And in Quantum it might still affect addon-injected CSS; I haven't experimented.)

(This and previous entries extend, at great length, my tweets. They definitely took more time to write than it took me to actually do both hacks.)

PS: I don't know how WebExtensions permissions are set up and controlled, but apparently however it works, Foxy Gestures didn't need any new permissions to access my new API. The MDN page on WebExtensions permissions suggests that because I put my new API in browser.tabs, it's available without any special permissions.

FirefoxNewWebExtsAPI written at 22:50:42; Add Comment

Adding 'view page in no style' to Firefox Quantum's context menu

Yesterday I covered the limitations of Firefox's Reader mode, why I like 'view page in no style' enough to make it a personal FireGestures user script gesture, and how that's impossible in Firefox 57+ because WebExtensions doesn't expose an API for it to addons. If I couldn't have this accessible through Foxy Gestures, I decided that I cared enough to hack my personal Firefox build to add something to toggle 'view page in no style' to the popup context menu (so that it'd be accessible purely through the mouse in my setup). Today I'm going to write up more or less how I did it, as a guide to anyone else who wants to do such modifications.

An important but not widely known thing about Firefox that makes this much more feasible than it seems is that a great deal of Firefox's UI (and a certain amount of its actual functionality) is implemented in JavaScript and various readable configuration files (XUL and otherwise), not in C++ code. This makes it much easier to modify and add to these things relatively blindly, and as a bonus you're far less likely to crash Firefox or otherwise cause problems if you get something wrong. I probably wouldn't have attempted this hack if it had involved writing or modifying C++ code, but as it was I was pretty confident I could do it all in JavaScript.

First we need to find out how and where Firefox handles viewing pages in no style. When I'm spelunking in the Firefox code base, my approach is to start from a string that I know is used by the functionality (for example in a menu entry that involves whatever it is that I want) and work backward. Here the obvious string to look for is 'No Style'.

; cd mozilla-central
; rg 'No Style'
[...]
browser/locales/en-US/chrome/browser/browser.dtd
708:<!ENTITY pageStyleNoStyle.label "No Style">

(rg is ripgrep, which has become my go-to tool for this kind of thing. You could use any recursive grep command.)

Firefox tends not to directly use strings in the UI; instead it adds a layer of indirection in order to make translation easier. Looking at browser.dtd shows that it just defines some text setup stuff, with no actual code, so we want to find where pageStyleNoStyle is used. Some more searching finds browser/base/content/browser-menubar.inc, where we can find the following fairly readable configuration text:

<menupopup onpopupshowing="gPageStyleMenu.fillPopup(this);">
  <menuitem id="menu_pageStyleNoStyle"
            label="&pageStyleNoStyle.label;"
            accesskey="&pageStyleNoStyle.accesskey;"
            oncommand="gPageStyleMenu.disableStyle();"
            type="radio"/>
[...]

The important bit here is oncommand, which is the actual JavaScript that gets run to disable the page style. Unfortunately this just runs a function, so we need to search the codebase again to find the function, which is in browser/base/content/browser.js:

 disableStyle() {
   let mm = gBrowser.selectedBrowser.messageManager;
   mm.sendAsyncMessage("PageStyle:Disable");
 },

Well, that's not too helpful, since all it does is send a message that's handled by something else. We need to find where this message is handled, which we can do by searching for 'PageStyle:Disable'. That turns up browser/base/content/tab-content.js, where we find:

var PageStyleHandler = {
 init() {
   addMessageListener("PageStyle:Switch", this);
   addMessageListener("PageStyle:Disable", this);
   [...]

 receiveMessage(msg) {
   switch (msg.name) {
   [...]
     case "PageStyle:Disable":
       this.markupDocumentViewer.authorStyleDisabled = true;
       break;
   [...]

Since I wanted my new context menu entry to toggle this setting, I decided that the simple way was to add a new message for this. That needs one line in init() to register the message:

   addMessageListener("PageStyle:Toggle", this); // <cks>

and then a new case in receiveMessage() to handle it:

     // <cks>
     case "PageStyle:Toggle":
       this.markupDocumentViewer.authorStyleDisabled = !this.markupDocumentViewer.authorStyleDisabled;
       break;

(I tend to annotate my own additions and modifications to Firefox's code so that I can later clearly identify that they're my work. Possibly this case should have a longer comment so that I can later remember why it's there and what might make it unneeded in the future.)

We can definitely disable the page style by setting authorStyleDisabled to true, but it's partly a guess that we can re-enable the current style by resetting authorStyleDisabled to false. However, it's a well informed guess since Firefox 56 and before worked this way (which I know from my FireGestures user script). It's worth trying, though, because duplicating what PageStyle:Switch does would be much more complicated.

Next I need to hook this new functionality up to the context menu, which means that I have to find the context menu. Once again I'll start from some text that I know appears there:

; rg 'Save Page As'
[...]
browser/locales/en-US/chrome/browser/browser.dtd
567:<!ENTITY savePageCmd.label            "Save Page As…">

There's a number of uses of savePageCmd in the Firefox source code, because there's a number of places where you can save the page, but the one I want is in browser/base/content/browser-context.inc (which we can basically guess from the file's name, if nothing else). Here's the full menu item where it's used:

<menuitem id="context-savepage"
          label="&savePageCmd.label;"
          accesskey="&savePageCmd.accesskey2;"
          oncommand="gContextMenu.savePageAs();"/>

At this point I had a choice in how I wanted to implement my new context menu item. As we can see from inspecting the oncommand for this entry (and others), the proper way is to add a new toggleNoStyle() function to gContextMenu that sends our new PageStyle:Toggle message. The hack way is to simply write the necessary JavaScript inline in the oncommand for our new menu entry. Let's do this the proper way, which means we need to find gContextMenu and friends.

Searching for savePageAs and hidePlugin (from another context menu entry) says that they're defined in browser/base/content/nsContextMenu.js. So I added, right after hidePlugin(),

 // <cks> Toggle view page in no style
 toggleNoStyle() {
   let mm = gBrowser.selectedBrowser.messageManager;
   mm.sendAsyncMessage("PageStyle:Toggle");
 },

(This is simply disableStyle() modified to send a different asynchronous message. Using gBrowser here may or may not be entirely proper, but this is a hack and it seems to work. Looking more deeply at other code in nsContextMenu.js suggests that perhaps I should be using this.browser.messageManager instead, and indeed that works just as well as using gBrowser. I'm preserving my improper code here so you can see my mis-steps as well as the successes.)

Now I can finally add in a new context menu entry to invoke this new gContextMenu function. Since this is just a hack, I'm not going to define a new DTD entity for it so it can be translated; I'm just going to stick the raw string in, and it's not going to have an access key. I'm putting it just before 'Save Page As' for convenience and so I don't have to worry that it's in a section of the context menu that might get disabled on me. The new menu item is thus just:

<menuitem id="context-pagestyletoggle"
          label="Toggle Page Style"
          oncommand="gContextMenu.toggleNoStyle();"/>

(I deliberately made the menu string short so that it wouldn't force the context menu to be wider than it already is. Probably I could make it slightly more verbose without problems.)

After rebuilding my local Firefox Quantum tree and running it, I could test that I had a new context menu item and that it did in fact work. I even opened up the browser console to verify that my various bits of JavaScript code didn't seem to be reporting any errors.

(There were lots of warnings and errors from other code, but that's not my problem.)

This is a hack (for reasons beyond a hard-coded string). I've made no attempt to see if the current page has a style that we can or should disable; the menu entry is unconditionally available and it's up to me to use it responsibly and deal with any errors that come up. It's also arguably in the wrong position in the context menu; it should probably really go at the bottom. I just want it more accessible than that, so I don't have to move my mouse as far down.

(Not putting it at the bottom also means that I don't need to worry about how to properly interact with addons that also add context menu entries. Probably there is some easy markup for that in browser-context.inc, but I'm lazy.)

PS: My first implementation of this used direct JavaScript in my oncommand. I changed it to the more proper approach for petty reasons, namely that putting it in this entry would have resulted in an annoyingly too-wide line of code.

FirefoxNoStyleInContext written at 20:28:55; Add Comment

2018-01-18

Why Go cares about the difference between unsafe.Pointer and uintptr

Go has two things that are more or less the representation of an untyped pointer; uintptr and unsafe.Pointer (which, contrary to appearances, is a built-in type). On the surface this is a little bit odd, because unsafe.Pointer and uintptr can be converted back and forth between each other. Why not have only one representation of a pointer? What's the difference between the two?

The superficial difference is that you can do arithmetic on an uintptr but not on an unsafe.Pointer (or any other Go pointer). The important difference is explicitly pointed out by the unsafe package's documentation:

A uintptr is an integer, not a reference. Converting a Pointer to a uintptr creates an integer value with no pointer semantics. Even if a uintptr holds the address of some object, the garbage collector will not update that uintptr's value if the object moves, nor will that uintptr keep the object from being reclaimed.

Although unsafe.Pointers are generic pointers, the Go garbage collector knows that they point to Go objects; in other words, they are real Go pointers. Through internal magic, the garbage collector can and will use them to keep live objects from being reclaimed and to discover further live objects (if the unsafe.Pointer points to an object that has pointers of its own). Due to this, a lot of the restrictions on what you can legally do with unsafe.Pointers boil down to 'at all times, they must point to real Go objects'. If you create an unsafe.Pointer that doesn't, even for a brief period of time, the Go garbage collector may choose that moment to look at it and then crash because it found an invalid Go pointer.

By contrast, a uintptr is merely a number. None of this special garbage collection magic applies to objects 'referred to' by a uintptr because as just a number, a uintptr doesn't refer to anything. In turn this leads to many of the careful limitations on the various ways that you can turn an unsafe.Pointer into a uintptr, manipulate it, and turn it back. The basic requirement is that you do this manipulation in such a way that the compiler and the runtime can shield the temporary non-pointerness of your unsafe.Pointer from the garbage collector so this temporary conversion will be atomic with respect to garbage collection.

(I think that my use of unsafe.Pointer in my entry on copying blobs of memory into Go structures is safe, but I admit I'm now not completely sure. I believe that there is some magic going on with cgo, since it can safely manufacture unsafe.Pointers that point to C memory, not Go memory.)

PS: As of Go 1.8, all Go pointers must always be valid (I believe including unsafe.Pointers), even if a garbage collection isn't running at the time. If you ever have an invalid pointer stored in a variable or a field, your code can crash merely by updating the field to a perfectly valid value, including nil. See for example this educational Go bug report.

(I was going to try to say something about the internal magic that allows the garbage collector to cope with untyped unsafe.Pointer pointers, but I'm convinced I don't understand enough about it to even say what sort of magic it uses.)

GoUintptrVsUnsafePointer written at 20:19:27; Add Comment

2018-01-02

Some notes on relative imports and vendor/ in Go

For reasons beyond the scope of this entry, I've become interested in ways to work with multiple packages in Go without doing it in the normal way, with a $GOPATH/src and packages and programs that can be go get'd from Github or any of the other places that the go tool supports. The short version of what I'm interested in is that I'd like to create self contained program source code trees that are still modularized into multiple Go packages, instead of throwing everything into 'package main'.

(If you put everything in main, you can put all your source code in one directory and then tell people to just cd there and do 'go build' to build it.)

Go has two plausible mechanisms for this, vendoring and the rarely used relative import path (which is more properly a filesystem path). I'll start with the latter. Relative import paths come from 'go help packages':

An import path that is a rooted path or that begins with a . or .. element is interpreted as a file system path and denotes the package in that directory.

(Italics mine.)

Suppose that you have a program and you want to have two sub-packages, a and b. Then you can put together a directory tree like this:

program/
   main.go
   a/a.go
   b/b.go

Your main.go will use 'import "./a" to make a relative import from the current directory, while your a.go would use 'import "../b" to reach package b (if it needed to). You have to put a.go and b.go into appropriate subdirectories, even if each package only has a single file, because even this sort of import applies to the entire directory.

However, relative imports turn out to have one drawback, and that's that you can't use them in a package or program that has its source code under your $GOPATH. If you do, you'll get an error message:

main.go:3:8: local import "./a" in non-local package

(Note that you get this error even if you are in the directory itself and just running 'go build', to build the current directory. It's possible that this is new behavior in Go 1.10, but if so, well, you're going to have to use Go 1.10 or later at some point.)

The other plausible mechanism is vendoring. Given the same main program and two sub-packages a and b, construct your directory tree as:

program/
   main.go
   vendor/
      a/a.go
      b/b.go

Now everything simply does 'import "a"' or 'import "b", which works in both main.go and a.go.

The drawback of vendoring is the exact reverse of relative imports; instead of not working under your $GOPATH, it only works under your $GOPATH. This unfortunately makes vendoring not entirely useful for my particular purpose here, because if I'm going to require a $GOPATH, I might as well use something approximating a proper Go package hierarchy:

src/cslab/program
             cmd/program/main.go
             a/a.go
             b/b.go

(The whole cmd/program subdirectory tree feels a bit excessive here, but I'm not sure if there's a generally accepted better way.)

With this setup I can refer to things as 'import "cslab/program/a"' and so on. There's relatively little need to hide them in a vendor/ subdirectory just so I can confuse everyone with 'import "a"'.

In fact, I think this potential confusion from vendoring makes it better to use relative imports if you have to pick one of these two options, because at least when people see them it's fairly clear what they mean.

(It would be useful if the vendor/ subdirectory worked even outside $GOPATH, but it doesn't currently so even if this was accepted as a bug by the Go people it would be some time before a fix appeared and was more or less universally usable.)

GoOddImportsNotes written at 03:57:49; Add Comment

2017-12-06

In practice, Go's slices are two different data structures in one

As I've seen them in Go code and used them myself, Go's slices are generally used in two pretty distinctly separate situations. As a result, I believe that many people have two different conceptual models of slices and their behavior, depending on which situation they're using slices in.

The first model and use of slices is as views into a concrete array (or string) that you already have in your code. You're taking an efficient reference to some portion of the array and saying 'here, deal with this chunk' to some piece of code. This is the use of slices that is initially presented in A Tour of Go here and that is implicitly used in, for example, io.Reader and io.Writer, both of which are given a reference to an underlying concrete byte array.

The second model and use of slices is as dynamically resizable arrays. This is the usage where, for example, you start with 'var accum []string', and then add things to it with 'accum = append(accum, ...)'. In general, any code using append() is using slices this way, as is code that uses explicit slice literals ('[]string{a, b, ..}'). Dynamically resizable arrays are a very convenient thing, so this sort of slice shows up in lots of Go code.

(Part of this is that Go's type system strongly encourages you to use slices instead of arrays, especially in arguments and return values.)

Slices as dynamically resizable arrays actually have an anonymous backing store behind them, but you don't normally think about it; it's materialized, managed, and deallocated for you by the runtime and you can't get a direct reference to it. As we've seen, it's easy to not remember that the second usage of slices is actually a funny, GC-driven special case of the first sort of use. This can lead to leaking memory or corrupting other slices.

(It's not quite fair to call the anonymous backing array an implementation detail, because Go explicitly documents it in the language specification. But I think people are often going to wind up working that way, with the slice as the real thing they deal with and the backing array just an implementation detail. This is especially tempting since it works almost all of the time.)

This distinct split in usage and conceptual model (and the glitches that result at the edges of it) are why I've wound up feeling that in practice, Go's slices are two different data structures in one. The two concepts may be implemented with the same language features and runtime mechanisms, but people treat them differently and have different expectations and beliefs about them.

GoSlicesTwoViews written at 00:26:37; Add Comment

2017-12-04

Some notes on using Go to check and verify SSH host keys

For reasons beyond the scope of this entry, I recently wrote a Go program to verify the SSH host keys of remote machines, using the golang.org/x/crypto/ssh package. In the process of doing this, I found a number of things in the package's documentation to be unclear or worth noting, so here are some notes about it.

In general, you check the server's host key by setting your own HostKeyCallback function in your ClientConfig structure. If you only want to verify a single host key, you can use FixedHostKey(), but if you want to check the server key against a number of them, you'll need to roll your own callback function. This includes the case where you have both a RSA and an ed25519 key for the remote server and you don't necessarily know which one you'll wind up verifying against.

(You can and should set your preferred order of key types in HostKeyAlgorithms in your ClientConfig, but you may or may not wish to accept multiple key types if you have them. There are potential security considerations because of how SSH host key verification works, and unless you go well out of your way you'll only verify one server host key.)

Although it's not documented that I can see, the way you compare two host keys to see if they're the same is to .Marshal() them to bytes and then compare the bytes. This is what the code for FixedHostKey() does, so I consider it official:

type fixedHostKey struct {
  key PublicKey
}

func (f *fixedHostKey) check(hostname string, remote net.Addr, key PublicKey) error {
  if f.key == nil {
    return fmt.Errorf("ssh: required host key was nil")
  }
  if !bytes.Equal(key.Marshal(), f.key.Marshal()) {
    return fmt.Errorf("ssh: host key mismatch"
  }
  return nil
}

In a pleasing display of sanity, your HostKeyCallback function is only called after the crypto/ssh package has verified that the server can authenticate itself with the asserted host key (ie, that the server knows the corresponding private key).

Unsurprisingly but a bit unfortunately, crypto/ssh does not separate out the process of using the SSH transport protocol to authenticate the server's host keys and create the encrypted connection from then trying to use that encrypted connection to authenticate as a particular user. This generally means that when you call ssh.NewClientConn() or ssh.Dial(), it's going to fail even if the server's host key is valid. As a result, you need your HostKeyCallback function to save the status of host key verification somewhere where you can recover it afterward, so you can distinguish between the two errors of 'server had a bad host key' and 'server did not let us authenticate with the "none" authentication method'.

(However, you may run into a server that does let you authenticate and so your call will actually succeed. In that case, remember to call .Close() on the SSH Conn that you wind up with in order to shut things down neatly; otherwise you'll have some resource leaks in your Go code.)

Also, note that it's possible for your SSH connection to the server to fail before it gets to host key authentication and thus to never have your HostKeyCallback function get called. For example, the server might not offer any key types that you've put in your HostKeyAlgorithms. As a result, you probably want your HostKeyCallback function to have to affirmatively set something to signal 'server's keys passed verification', instead of having it set a 'server's keys failed verification' flag.

(I almost made this mistake in my own code, which is why I'm bothering to mention it.)

As a cautious sysadmin, it's my view that you shouldn't use ssh.Dial() but should instead net.Dial() the net.Conn yourself and then use ssh.NewClientConn(). The problem with relying on ssh.Dial() is that you can't set any sort of timeout for the SSH authentication process; all you have control over is the timeout of the initial TCP connection. You probably don't want your check of SSH host keys to hang if the remote server's SSH daemon is having a bad day, which does happen from time to time. To avoid this, you need to call .SetDeadline() with an appropriate timeout value on the net.Conn after it's connected but before you let the crypto/ssh code take it over.

The crypto/ssh package has a convenient function for iteratively parsing a known_hosts file, ssh.ParseKnownHosts(). Unfortunately this function is not suitable for production use by itself, because it completely gives up the moment it encounters even a single significant error in your known_hosts file. This is not how OpenSSH ssh behaves, for example; by and large ssh will parse all valid lines and ignore lines with errors. If you want to duplicate this behavior, you'll need to split your known_hosts file up into lines with bytes.Split(), then feed each non-blank, non-comment line to ParseKnownHosts (if you get an io.EOF error here, it means 'this line isn't formatted like a SSH known hosts line'). You'll want to think about what you do about errors; I accumulate them all, report up to the first N of them, and then only abort if there's been too many.

(In our case we want to keep going if it looks like we've only made a mistake in a line or two, but if looks like things are badly wrong we're better off giving up entirely.)

Sidebar: Collecting SSH server host keys

If all you want to do is collect SSH server host keys for hosts, you need a relatively straightforward variation of this process. You'll repeatedly connect to the server with a different single key type in HostKeyAlgorithms each time, and your HostKeyCallback function will save the host key it gets called with. If I was doing this, I'd save the host key in its []byte marshalled form, but that's probably overkill.

GoSSHHostKeyCheckingNotes written at 23:41:41; Add Comment

2017-11-24

Shooting myself in the foot by using exec in a shell script

Once upon a time, we needed a simple shell script that did some setup stuff and then ran a program that did all of the work; this shell script was going to run from crontab once a day. Because I'm sometimes very me, I bummed the shell script like so:

#!/bin/sh
setup
yadda

exec realprog --option some-args

Why not use exec? After all, it's one less process, even if it doesn't really matter.

Later, we needed to run the program a second time with some different arguments, because we'd added some more things for it to process. So we updated the script:

#!/bin/sh
setup
yadda

exec realprog --option some-args
exec realprog --switch other-args

Yeah, there's a little issue here. It probably looks obvious written up here, but I was staring at this script just last week, knowing that for some reason it hadn't generated the relatively infrequent bit of email we'd expected, and I didn't see it then; my eyes and my mind skipped right over the implications of the repeated exec. I only realized what I was seeing today, when I took another, more determined look and this time it worked.

(The implication is that anything after the first exec will never get run, because the first exec replaces the shell that's running the shell script with realprog.)

This was a self-inflicted injury. The script in no way needed the original exec; putting it in there anyway in the name of a trivial optimization led directly to the error in the updated script. If I hadn't tried to be clever, we'd have been better off.

(I'm not going to call this premature optimization, because it wasn't. I wasn't trying to optimize the script, not particularly; I was just being (too) clever and bumming my shell script for fun because I could.)

PS: In the usual way of things, I felt silly for not having spotted the repeat exec issue last week. But we see what we're looking for, and I think that last week my big question before I checked the script was 'have we updated the script to work on the other-args'. We had a line to run the program on them, so the answer was 'yes' and I stopped there. Today I knew things had once again not happened, so I asked myself 'what on earth could be stopping the second command from running?' and the light suddenly dawned.

BourneMyTooCleverExec written at 00:24:13; Add Comment

2017-11-23

Understanding a tricky case of Bourne shell redirection and command parsing

I recently saw this tweet by Dan Bornstein (via) that poses a Bourne shell puzzle, or more specifically a Bash puzzle:

Shell fun time! Do you know what the following script does? Did you have to run it?

#!/bin/bash
function boop {
  echo "out $1"
  echo "err $1" 1>&2
}

 x='2';   boop A >& $x
 x='2';   boop B >& "$x"
 x=' 2 '; boop C >& $x
 x=' 2 '; boop D >& "$x"

There's a number of interesting things going on here, including a Bash feature I didn't know about and surprising (to me) difference between Bourne shells, and even between Bash run as bash and run as /bin/sh. So let's talk about what's going on and where Bourne shells differ in their interpretations of this (and why).

(If you want a version of this tweet that you can test in other Bourne shells, change the 'function boop' to 'boop()'.)

To start with, we can mostly ignore the specifics of boop; its job is to produce some output to standard output and to standard error. The '1>&2' just says 'redirect file descriptor 1 to file descriptor 2' (technically it means 'make file descriptor 1 be a duplicate of file descriptor 2'). For standard output (fd 1), the number is optional, so this can also be written as '>&2', which should look sort of familiar given the last four lines.

Now we need to break down each of the last four lines, one by one, starting with the first.

x='2';   boop A >& $x

After expanding $x, this is simply 'boop A >& 2', and so all Bourne shells redirect boop's standard output to standard error. The only mild surprise here is that Bash processes redirections after doing variable expansions, which isn't really a surprise if you read the very large Bash manpage (because it documents the order; see here). I believe that this order is in fact the historical Bourne shell behavior and it also seems to be the POSIX-required order.

x='2';   boop B >& "$x"

After variable expansion, this is 'boop B >& "2"'. In some languages, putting the 2 in quotes would cause the language to see it purely as a string, blocking its interpretation as a file descriptor number. In the Bourne shell, this is not what quotes do; they mostly just block word splitting, which there isn't any of here anyway. So this is the same as 'boop B >& 2', which is the same as the first line and has the same effect.

Now things start getting both interesting and different between shells.

x=' 2 '; boop C >& $x

In Bash, invoked in full Bash mode (and not as /bin/sh), this expands $x without any particular protection from the usual word splitting on whitespace. The application of word splitting trims off the spaces at the start and the end, leaving this as basically 'boop C >& 2 ', which is simply the same as the first two lines. Dash behaves the same way as Bash does.

(The value of $x has spaces in it, but in general when $x is used unquoted those spaces get removed by word splitting after the expansion happens. This reinterpretation of $x's value is required partly because the Bourne shell doesn't have lists.)

If you run Bash as /bin/sh (or in POSIX mode in general), it doesn't do word splitting on the expanded value of $x in this context. This leaves it with what is effectively boop C >& ' 2 '. Bash then interprets this as meaning to write standard output and standard error to a file called ' 2 ', for reasons I'll cover later.

(This mode difference is documented in Chet Ramey's page on Bash's POSIX mode; see number 11.)

Most other shells (apart from Bash and Dash) interpret this as an error, reporting some variant of 'illegal file descriptor name'. These shells don't do word splitting here either, and without word splitting and whitespace trimming we don't have a digit, we have a digit surrounded by spaces, which is not a valid file descriptor number.

(Not doing word splitting here appears to be the correct POSIX behavior, based on a careful reading of portions of 2.7 Redirection. See the third-last paragraph.)

x=' 2 '; boop D >& "$x"

At this point, the expansion of $x is fully protected from having its whitespace stripped; we wind up with effectively boop D >& ' 2 '. Bash in both normal and /bin/sh mode writes both standard output and standard error to a file called ' 2 ', which is probably a good part of the surprise that Dan Bornstein had in mind. Bash interprets things this way because it has a specific feature of its own for redirecting standard output and standard error. Bash recommends writing this as '&>word' but accepts '>&word' as a synonym, with the dry note that if word is a number, 'other redirection operators apply'. Here word is no longer a number (it's a number with spaces on either side), so Bash's special >& feature takes over.

(This interpretation of >&word is permitted by the Single Unix Standard, which says that the behavior in this case is unspecified. Since it's unspecified, Bash is free to do whatever it wants.)

Every other Bourne shell that I have handy to test with (Dash, FreeBSD /bin/sh, OmniOS /bin/sh, FreeBSD pdksh, and official versions of ksh) report the same 'illegal file descriptor name' as most of them did for the third line (and for the same reason; ' 2 ' is not a file descriptor number). This too is allowed by the Single Unix Standard; since the behavior is unspecified, we're allowed to make it an error.

Sidebar: The oddity of Dash's behavior

I was going to say that Dash is not POSIX compliant here, but that's wrong. POSIX specifically says that '>& word' where the word is not a plain number is unspecified, so basically anything goes in both the third and the fourth line. However, Dash is inconsistent and behaves oddly. The most clear oddity is the results of the following:

x=' 2 1 '; echo hi >& $x

This produces a 'hi' on standard error and nothing else; the trailing '1' in $x has simply vanished.

BourneRedirectionAndQuoting written at 00:40:34; Add Comment

2017-11-20

Major version changes are often partly a signal about support

Recently (for some value of recently) I read The cargo cult of versioning (via). To simplify, one of the things that it advocates is shifting the major version number to being part of the package name:

Next, move the major version to part of the name of a package. "Rails 5.1.4" becomes "Rails-5 (1, 4)". By following Rich Hickey's suggestion above, we also sidestep the question of what the default version should be. There's just no way to refer to a package without its major version.

I am biased to see versioning as ultimately a social thing. Looking at major version number changes through that lens, I feel that one of their social uses is being overlooked by this suggested change.

Many projects take what is effectively the default approach of having a single unified home page, a single repository, and so on, not separate home pages, repositories, and so on for each major version. In these setups, incrementing the major version number of your project and advertising the shift is a strong social signal that past major versions are no longer being supported unless you say otherwise. If you move from mypackage version 1.2.3 to mypackage version 2.0.0, you're signalling not just an API change but also that by default there will not be a mypackage 1.2.4. In fact you're further signalling that people should not build (new) projects using mypackage version 1.2.3; if they do so anyway, they're going to extra work to fish an old tag out of your repository or an old tarball out of your 'archived past versions' download area.

As I see it, this works because people think of there being a single project 'mypackage', not two separate projects 'mypackage version 1' and 'mypackage version 2'. Since there is only one project, it's relatively clear that by default it only has one actively developed major version, the latest, since I don't think people expect most projects to be working in several directions at once.

(Where projects have diverged from this model the result has often been confusion and annoyance. The case I'm closest to is Python, where having both 'Python 2' and 'Python 3' has caused all sorts of problems. But I'm not sure that calling them 'Python-2' and 'Python-3' would have helped. Incompatible major version changes in major projects appear to be a problem, period.)

I initially thought that you'd have problems getting this clear signal in a world where things were 'mypackage-1' and 'mypackage-2' instead of 'mypackage version ...'. Now, though, I'm not so sure. You'd probably have to be more explicit on your project web pages, which might not be such a bad thing, but it might be clear enough to people in general.

MajorVersionSupportSignal written at 23:35:08; Add Comment

(Previous 10 or go back to October 2017 at 2017/10/18)

Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.