Wandering Thoughts archives

2015-09-27

Wikitext not having errors creates a backtracking problem

In the past I've written about some of the pain points of parsing wikitext and called out how there aren't conventional parsing errors in running wikitext, just things that turn out to be plain text instead of complete wikitext markup. Some of the consequences of this may not be obvious; in fact, they weren't obvious to me until I tried to make an overly ambitious change to DWiki's markup-to-HTML conversion code.

The obvious problem that 'no errors' creates is that you must either be prepared to implicitly close off incomplete markup, do lookahead to verify that you seem to have a complete entity before committing to it, or both. If your markup denotes links as '[[ ... ]]', you probably want to look ahead for a ']]' before you start processing a '[[' as a link. Unfortunately doing lookahead correctly is quite hard if your wikitext permits various sorts of nested constructs. Consider DWikiText, which also has '(( ... ))' to quote uninterpreted text in a monospaced font, and then parsing the following:

This example [[looks like ((it has an ending ]])) but it doesn't.

Purely textual lookahead for a ']]' gets fooled here. So let's assume we're going to get fooled sooner or later and handle this better. Rather than trying to rely on fallible lookahead, if we reach the end of a paragraph with an unclosed entity we'll go back to the start of the entity and turn it into plain text.
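To make the failure concrete, here is a minimal sketch in Go (my own illustration, not DWiki's actual Python code) of the naive textual lookahead being fooled by exactly this input:

```go
package main

import (
	"fmt"
	"strings"
)

// hasNaiveClose reports whether purely textual lookahead finds a ']]'
// somewhere after the first '[['. It knows nothing about (( ... ))
// literal-text regions, which is exactly the problem: the ']]' it
// finds may be inside one.
func hasNaiveClose(text string) bool {
	start := strings.Index(text, "[[")
	if start < 0 {
		return false
	}
	return strings.Contains(text[start+2:], "]]")
}

func main() {
	text := "This example [[looks like ((it has an ending ]])) but it doesn't."
	// The ']]' that lookahead finds sits inside the (( ... )) construct,
	// so treating the '[[' as a real link is wrong; textual lookahead
	// says yes anyway.
	fmt.Println(hasNaiveClose(text)) // true: fooled
}
```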

Unfortunately this has problems too, because something retroactively becoming plain text may change the meaning of other text after that point. Consider this contrived example:

Lorem ipsum ((this *should be emphasis* because the '((' isn't closed and thus is plain text.

If you start out parsing the (( as real, the *'s are plain text. But once the (( is just plain text, they should be creating italics for emphasis. To really retroactively change the (( to plain text, you may need to backtrack all text processing since then and redo it. And backtracking is something conventional parsing technology is generally not designed for; in fact, conventional parsing technology usually avoids it like the plague (along with aggressive lookahead).
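A toy illustration of this retroactive reinterpretation, in Go rather than DWiki's actual Python (all names and the HTML output are my invention): a single-pass parser for just '(( ... ))' and '* ... *' that, on reaching the end with an unclosed '((', demotes that '((' to plain text and reparses from scratch.

```go
package main

import (
	"fmt"
	"strings"
)

// render handles just two constructs: (( ... )) literal text and
// * ... * emphasis. It parses in a single pass, but if a '((' is still
// unclosed at end of text it backtracks: it restarts the whole parse
// with that '((' demoted to plain text, which can change how all the
// text after it (here, the *'s) is interpreted.
func render(text string) string {
	return parse(text, map[int]bool{})
}

func parse(text string, demoted map[int]bool) string {
	var out strings.Builder
	inLiteral := false
	literalStart := -1
	emphasis := false
	for i := 0; i < len(text); i++ {
		switch {
		case i+1 < len(text) && text[i:i+2] == "((" && !demoted[i] && !inLiteral:
			inLiteral, literalStart = true, i
			out.WriteString("<code>")
			i++
		case i+1 < len(text) && text[i:i+2] == "))" && inLiteral:
			inLiteral = false
			out.WriteString("</code>")
			i++
		case text[i] == '*' && !inLiteral:
			if emphasis {
				out.WriteString("</em>")
			} else {
				out.WriteString("<em>")
			}
			emphasis = !emphasis
		default:
			out.WriteByte(text[i])
		}
	}
	if inLiteral {
		// Unclosed '((': mark it as plain text and redo the whole parse.
		demoted[literalStart] = true
		return parse(text, demoted)
	}
	return out.String()
}

func main() {
	fmt.Println(render("Lorem ipsum ((this *should be emphasis* because it is unclosed."))
	// Lorem ipsum ((this <em>should be emphasis</em> because it is unclosed.
}
```

Even in this toy, the only honest way to fix up the *'s is to throw away everything produced since the '((' and start over, which is the backtracking that conventional parsing machinery avoids.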

(I think the lookahead situation gets somewhat better if you look ahead in the token stream instead of in plain text, but it's still not great. You're basically parsing ahead of your actual parse, and you'd better keep both in sync. Backtracking your actual parsing is probably better.)

All of this has caused me to feel that parsing running wikitext in a single pass is not the best way to do it. Instead I have a multi-pass approach in mind (and have for some time), although I'm not entirely convinced it's right either. I probably won't know unless (and until) I actually implement it, which is probably unlikely.

(An alternate approach would be to simply have backtracking in a conventional recursive descent parser; every time you hit a 'parse error', the appropriate construct being parsed would turn its start token into plain text and continue the parsing from there. Unfortunately this feels like it could be vulnerable to pathological behavior, which is a potential issue for a parser that may be handed user-controlled input in the form of eg comments.)

PS: How I stubbed my toe on this issue was basically trying to do this sort of 'convert back to plain text' for general unclosed font changes in DWikiText. When I did this outside of a limited context, it blew up in my face.

WikitextNoErrorsBacktracking written at 02:10:30

2015-09-22

The Go 'rolling errors' pattern, in function call form

One of the small annoyances of Go's explicit error returns is that checking the error at every step becomes tedious when all of the error handling is actually the same. You wind up with the classic repetitive pattern of, say:

s.f1, err = strconv.ParseUint(fields[1], 10, 64)
if err != nil {
   return nil, err
}
s.f2, err = strconv.ParseUint(fields[2], 10, 64)
if err != nil {
   return nil, err
}
[... repeat ...]

Of course, any good lazy programmer who is put into this starting situation is going to come up with a way to aggregate that error handling together. Go programmers are no exception, which has led to what I'll call a generic 'rolling errors' set of patterns. The basic pattern, as laid out in Rob Pike's Go blog entry Errors are values, is that as you do a sequence of operations you keep an internal marker of whether errors have occurred; at the end of processing, you check it and handle any error then.

Rob Pike's examples all use auxiliary storage for this internal marker (in one example, in a closure). I'm a lazy person so I tend to externalize this auxiliary storage as an extra function argument, which makes the whole thing look like this:

func getInt(field string, e error) (uint64, error) {
   i, err := strconv.ParseUint(field, 10, 64)
   if err != nil {
      return i, err
   }
   return i, e
}

func .... {
   [...]

   var err error
   s.f1, err = getInt(fields[1], err)
   s.f2, err = getInt(fields[2], err)
   s.f3, err = getInt(fields[3], err)

   if err != nil {
      return nil, err
   }
   [...]
}

This example code does bring up something you may want to think about in 'rolling errors' handling, which is what operations you want to do once you hit an error and which error you want to return. Sometimes the answer is clearly 'stop doing operations and return the first error'; other times, as with this code, you may decide that any of the errors is okay to return and it's simpler if the code keeps on doing operations (it may even be better).

(In retrospect I could have made this code just as simple while still stopping on the first error, but that didn't occur to me when I put this into a real program. In this case these error conditions are never expected to happen, since I'm parsing what should be numeric fields in a system-generated file.)

As an obvious corollary, this 'rolling errors' pattern doesn't require using error itself. You can use it with any running or accumulated status indicator, including a simple boolean.

(Sometimes you don't need the entire infrastructure of error to signal problems. If this seems crazy, consider the case of subtracting two accumulating counters from each other to get a delta over a time interval where a counter might roll over and make this delta invalid. You generally don't need details or an error message here, you just want to know if the counter rolled over or not and thus whether or not you want to disregard this delta.)
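The counter-rollover case might look like this rolling-boolean sketch (entirely my own illustration):

```go
package main

import "fmt"

// delta returns now - prev for a monotonically increasing counter, and
// flips *invalid if the counter appears to have rolled over (or been
// reset). No error value or message is needed; at the end you check one
// boolean and disregard the whole set of deltas if any counter went
// backwards.
func delta(prev, now uint64, invalid *bool) uint64 {
	if now < prev {
		*invalid = true
		return 0
	}
	return now - prev
}

func main() {
	var invalid bool
	d1 := delta(100, 150, &invalid)
	d2 := delta(9000, 20, &invalid) // counter rolled over
	if invalid {
		fmt.Println("discarding this interval", d1, d2)
	}
}
```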

GoRollingErrors written at 00:23:46

2015-09-14

A caution about cgo's error returns for errno

Go's cgo system for calling C functions offers a very convenient feature. As the documentation puts it:

Any C function (even void functions) may be called in a multiple assignment context to retrieve both the return value (if any) and the C errno variable as an error [...]

Reading this, you may be tempted to write more or less standard Go error-handling code like the following:

kcid, err := C.kstat_chain_update(t.kc)
if err != nil {
   return err
}

This code is a potential mistake. Unless the documentation for the C function you're calling explicitly says so, there is no guarantee that errno is zero on success. If the function succeeds but errno is non-zero, cgo will dutifully generate a non-nil error return from it and your Go code will bail out with an error that isn't actually one.

This is not cgo's fault. Cgo has no magic knowledge of which C function return values are and aren't errors, so all it can do is exactly what it said it was going to do: if errno is non-zero, you get an error version of it. This is just a C API issue (one that ultimately comes about because errno is both an implicit return and global state). You'd never write code like this in Go, where 'only return a non-nil error on actual errors' is well established, but we're stuck with the C API that we actually have instead of the Go-like one we'd like. So we have to deal with it, which means checking return values explicitly.

(In this case the real 'there has been an error' marker is a kcid return value of -1. I actually hit an irregular test failure when my code was just checking err, which is how I re-stubbed my toe on this particular C API issue.)
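The safe pattern is to test the function's real error marker first and only then consult err. Here is a pure-Go analogue (the -1 convention is from kstat_chain_update as described above; the stand-in function and its behavior are my sketch, since real cgo code can't run without the C library):

```go
package main

import (
	"errors"
	"fmt"
)

// chainUpdate stands in for C.kstat_chain_update: it can hand back a
// non-nil errno-derived error even on success, because C code often
// leaves a stale errno lying around.
func chainUpdate(ok bool) (int, error) {
	if !ok {
		return -1, errors.New("EAGAIN")
	}
	return 42, errors.New("ENOTTY") // success, but with a stale errno
}

func main() {
	kcid, err := chainUpdate(true)
	// Wrong: 'if err != nil { return }' would bail out here even though
	// the call succeeded. Right: check the real error marker first.
	if kcid == -1 {
		fmt.Println("real error:", err)
		return
	}
	fmt.Println("success, ignoring stale errno:", kcid)
}
```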

PS: The ultimate cause of this is that C code often doesn't explicitly set errno to zero on success but instead leaves it alone, which means errno can wind up set from whatever internal system call or library routine last failed and set it. There are many possibilities for how this can happen; a classic one is seeing ENOTTY from something checking whether the file descriptor it is writing to is a TTY and so should be line-buffered.

(In my case I saw EAGAIN, which I believe was the OmniOS kernel telling kstat_chain_update() that the buffer it had been given wasn't large enough, please try again with a bigger one.)

GoCgoErrorReturns written at 23:45:13

2015-09-09

Some notes on my experience using Go's cgo system

Cgo is the Go compiler suite's bridge to C. I recently used it to write a Go package that gives you access to Solaris kstat kernel statistics, so I want to write down some notes on the whole thing before they fall out of my memory. On the whole, using cgo was a pleasant enough experience. I don't have very much experience in FFIs so I can't say how cgo compares to others, but cgo makes C and Go seem like a reasonably natural fit. It often really does seem like you're just using another Go package.

(With that said, there's a lot more use of unsafe.Pointer() than you'll find almost anywhere else.)

At the mechanical level, the most annoying thing to deal with was C unions. Go has no equivalent and cgo basically leaves you on your own to read or set union fields. I wound up just writing some trivial C functions to extract union fields for me and then had my Go code call them, rather than wrestle with casts and unsafe.Pointer() and so on in Go code; the C functions were both short and less error prone for me to write.

A C function was also my solution to needing to do pointer arithmetic. In C, a common approach is to define a field as 'struct whatever *ptr;' and then say it actually points to an array of those structs, with the length of the array given by some other field. You access the elements of the array by doing things like incrementing ptr or indexing off it. Well, in Go that doesn't work; if you want to increment ptr to the next struct, you're going to have to throw in explicit C.sizeof invocations and so on. I decided it was simpler to do it in C instead:

kstat_named_t *get_nth_named(kstat_t *ks, uint_t n) {
   kstat_named_t *knp;
   if (!ks || !ks->ks_data || n >= ks->ks_ndata)
       return NULL;
   knp = (kstat_named_t *)ks->ks_data;
   return knp + n;
}
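For the record, the equivalent arithmetic can be done from Go with unsafe pointer conversions. This pure-Go sketch (my own; the struct is invented in place of kstat_named_t, since real cgo types need the C headers) shows the conversion dance that made the C helper look pleasant by comparison:

```go
package main

import (
	"fmt"
	"unsafe"
)

// named stands in for a C struct like kstat_named_t.
type named struct {
	value uint64
	flags uint32
}

// nthNamed does in Go what the C helper does with 'knp + n': advance a
// base pointer by n * sizeof(struct). Go has no pointer arithmetic, so
// every step is an explicit trip through unsafe.Pointer and uintptr
// (kept in a single expression, per the unsafe package's rules).
func nthNamed(base *named, n uintptr) *named {
	return (*named)(unsafe.Pointer(uintptr(unsafe.Pointer(base)) + n*unsafe.Sizeof(*base)))
}

func main() {
	arr := []named{{1, 0}, {2, 0}, {3, 0}}
	fmt.Println(nthNamed(&arr[0], 2).value) // 3
}
```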

Typecasts are another one of the irritations of cgo. Cgo makes every C type into a Go type, and boy does a lot of C turn out to have a lot of different integer types. In C they mostly convert into each other without explicit casts; in Go they are all fully separate types and you must explicitly cast them around in order to interact with each other and with native Go integer types. This can get especially annoying in things like for loop indexing, because if you write 'for i := 0; i < CFIELD; i++' the compiler will object that i is a different type than CFIELD. This resulted in a for loop that looks like this:

for i := C.uint_t(0); i < k.ksp.ks_ndata; i++ {
   ....
}

(I wrote more about the mechanics in getting C-compatible structs in Go, copying memory into Go structs, and cgo's string functions explained.)

At the design level, my biggest problem was handling C memory lifetime issues correctly. Part of this was figuring out where the C library had to be using dynamic allocation (and when it got freed), and part of it was working out what it was safe for Go structures to hold references to and when those references might become invalid because of some call I made to the C library API. Working this out is vital because of the impact of coupling Go and C memory lifetimes together, plus these memory lifetime issues are likely to have an effect on your package API. What operations can callers do or not do after others? What precautions do you need to take inside your package to try to avoid dereferencing now-free C memory if callers get the lifetime rules wrong? What things can you not expose because there's no way to guard against 'use after free' errors? And so on.

(runtime.SetFinalizer() can help with this by letting you clean up C memory when Go memory is going away, but it's not a complete cure.)

Not all uses of cgo will run into memory lifetime problems. Many are probably self-contained, where all of your interaction with C code is inside one function and when it returns you're done and can free up everything.

GoCgoExperienceNotes written at 02:03:53

2015-09-07

Getting gocode based autocompletion working for Go in GNU Emacs

The existing guides and documentation for this are terrible and incomplete for someone who is not already experienced with GNU Emacs Lisp packages (which describes me), so here is what worked for me. I'm going to assume that your $GOPATH is $HOME/go and that $GOPATH/bin is on your $PATH.

  • get gocode itself:
    go get github.com/nsf/gocode
    

    To work in GNU Emacs, gocode needs an auto-completion package; it recommends auto-complete, so that's what I decided to use. If you have that already you're done, but I didn't. At this point you might be tempted to go to the auto-complete website and try to follow directions from there, but you actually don't want to do this because there's an easier way to install it.

  • The easiest way to install auto-complete and its prerequisites is through MELPA, which is an additional package repo for Emacs Lisp packages on top of the default ELPA. To enable MELPA, you need to add a stanza to your .emacs following its getting started guide, generally:

    (require 'package)
    (add-to-list 'package-archives
      '("melpa" . "https://melpa.org/packages/"))
    (package-initialize)
    

  • make sure you have a $HOME/.emacs.d directory. You probably do.

  • (re)start Emacs and run M-x list-packages. Navigate to auto-complete and get it installed. If you're running GNU Emacs in X, you can just click on its name and then on the [Install] button; if you're running in a terminal window, navigating to each thing and then hitting Return on it does the same. This will install auto-complete and its prerequisite package popup, the latter of which is not mentioned on the auto-complete site.

    It's possible to install auto-complete manually, directly from the site or more accurately from the github repo's release page. Do it from MELPA instead; it's easier and less annoying. If you install manually you'll have to use MELPA to install popup itself.

  • Set up the .emacs stanza for gocode:

    (add-to-list 'load-path "~/go/src/github.com/nsf/gocode/emacs")
    (require 'go-autocomplete)
    (require 'auto-complete-config)
    (ac-config-default)
    

    This deliberately uses the go-autocomplete.el from gocode's Go package (and uses it in place), instead of one you might get through eg MELPA. I like this because it means that if (and when) I update gocode, I automatically get the correct and latest version of its Emacs Lisp as well.

Restarting GNU Emacs should then get you autocompletion when writing Go code. You may or may not like how it works and want to keep it; I haven't made up my mind yet. Its usage in X appears to be pretty intuitive but I haven't fully sorted out how it works in text mode (the major way seems to be hitting TAB to cycle through possible auto-completions it offers you).

(Plenty of people seem to like it, though, and I decided I wanted to play with the feature since I've never used a smart IDE-like environment before.)

See also Package management in Emacs: The Good, the Bad, and the Ugly and Emacs: How to Install Packages Using ELPA, MELPA, Marmalade. There are probably other resources too; my Emacs inexperience is showing here.

(As usual, I've written this because if I ever need it again I'll hate myself for not having written it down, especially since the directions here are the result of a whole bunch of missteps and earlier inferior attempts. The whole messy situation led to a Twitter rant.)

GoGocodeEmacsAutocomplete written at 21:34:18

2015-09-03

How I've decided to coordinate multiple git repos for a single project

I'm increasingly using git for my own projects (partly because I keep putting them on Github), and this has brought up a problem. On the one hand, I like linear VCS histories (even if they're lies); I don't plan on having branches be visible in the history of my own repos unless it's clearly necessary. On the other hand, I routinely have multiple copies of my repos spread across multiple machines. In theory I always keep all repos synchronized with each other before I start working in one and make commits. In practice, well, not necessarily, and the moment I screw that up a straightforward git pull/push workflow to propagate changes around creates merges.

My current solution goes like this. First, I elect one repo as the primary repo; this is the repo which I use to push changes to Github, for example. To avoid merge commits ever appearing in it, I set it to only allow fast-forward merges when I do 'git pull', with:

git config pull.ff only

This ensures that if the primary repo and a secondary repo wind up with different changes, a pull from the secondary into the primary will fail instead of throwing me into creating a merge commit that I don't want. To avoid creating merge commits when I pull the primary into secondaries, all other repos are set to rebase on pulls following my standard recipe. This is exactly what I want; if I pull new changes from the primary into a secondary, any changes in the secondary are rebased on top of the primary's stuff and linear history is preserved. I can then turn around and pull the secondary's additional changes back into the primary as a fast-forward.
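Concretely, the per-repo setup is a pair of one-time commands (this sketch assumes the repo-wide 'pull.rebase' setting for the secondaries; a per-branch setting such as 'branch.master.rebase' works too):

```shell
# In the primary repo: pulls may only fast-forward, never merge.
git config pull.ff only

# In every secondary repo: pulls rebase local commits on top of what
# was just fetched, keeping the history linear.
git config pull.rebase true
```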

If I use 'git push' to move commits from one repo to another I'm already safe by default, because git push normally refuses to do anything except fast-forward updates of the remote. If it complains, the secondary repo involved needs a rebase. I can either do the rebase with 'git pull' in the secondary repo, or in the primary repo I can push to the remote tracking branch in the secondary with 'git push <machine>:<directory> master:origin/master' and then do a 'git rebase' on the secondary.

(Using a push from the primary usually means that my ssh activity flows the right way. And if I'm pushing frequently I should configure a remote for the secondary or something. I'm not quite hep on git repo remotes and remote tracking branches just yet, though, so that's going to take a bit of fumbling around when I get to it.)

GitMultiRepoWorkflow written at 00:55:03

