2015-09-27
Wikitext not having errors creates a backtracking problem
In the past I've written about some of the pain points of parsing wikitext and called out how there aren't conventional parsing errors in running wikitext, just things that turn out to be plain text instead of complete wikitext markup. Some of the consequences of this may not be obvious; in fact they weren't obvious to me until I tried to make an overly ambitious change to DWiki's markup-to-HTML conversion code.
The obvious problem that 'no errors' creates is that you will have to either accept closing off incomplete markup or do lookahead to verify that you seem to have a complete entity, or both. If your markup denotes links as '[[ ... ]]', you probably want to look ahead for a ']]' before you start processing a '[[' as a link. Unfortunately doing lookahead correctly is quite hard if your wikitext permits various sorts of nested constructs. Consider DWikiText, which also has '(( ... ))' to quote uninterpreted text in a monospaced font, and then parsing the following:
This example [[looks like ((it has an ending ]])) but it doesn't.
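As a concrete illustration, here is a toy Go sketch, purely illustrative and not DWiki's actual code, of what naive textual lookahead makes of that line:

package main

import (
    "fmt"
    "strings"
)

func main() {
    text := "This example [[looks like ((it has an ending ]])) but it doesn't."
    start := strings.Index(text, "[[")
    // Naive lookahead: scan forward from the '[[' for a matching ']]'.
    end := strings.Index(text[start+2:], "]]")
    // This "succeeds", but the ']]' it found is inside the (( ... ))
    // literal span, so the '[[' is not actually a complete link.
    fmt.Println("naive lookahead found ']]':", end >= 0)
}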
Purely textual lookahead for a ']]' gets fooled here. So let's assume we're going to get fooled sooner or later and handle this better. Rather than trying to rely on fallible lookahead, if we reach the end of a paragraph with an unclosed entity we'll go back to the start of the entity and turn it into plain text.
Unfortunately this has problems too, because something retroactively becoming plain text may change the meaning of other text after that point. Consider this contrived example:
Lorem ipsum ((this *should be emphasis* because the '((' isn't closed and thus is plain text.
If you start out parsing the (( as real, the *'s are plain text. But once the (( is just plain text, they should be creating italics for emphasis. To really retroactively change the (( to plain text, you may need to backtrack all text processing since then and redo it. And backtracking is something conventional parsing technology is generally not designed for; in fact, conventional parsing technology usually avoids it like the plague (along with aggressive lookahead).
(I think the lookahead situation gets somewhat better if you look ahead in the token stream instead of in plain text, but it's still not great. You're basically parsing ahead of your actual parse, and you'd better keep both in sync. Backtracking your actual parsing is probably better.)
All of this has caused me to feel that parsing running wikitext in a single pass is not the best way to do it. Instead I have a multi-pass approach in mind (and have for some time), although I'm not entirely convinced it's right either. I probably won't know unless (and until) I actually implement it, which is probably unlikely.
(An alternate approach would be to simply have backtracking in a conventional recursive descent parser; every time you hit a 'parse error', the appropriate construct being parsed would turn its start token into plain text and continue the parsing from there. Unfortunately this feels like it could be vulnerable to pathological behavior, which is a potential issue for a parser that may be handed user-controlled input in the form of eg comments.)
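To make that alternative a bit more concrete, here's a toy sketch of the 'turn the start token into plain text and continue' behavior for just the '[[ ... ]]' case. The token strings are made up for the example, and it deliberately sidesteps the question of re-scanning text that has already been processed:

package main

import (
    "fmt"
    "strings"
)

// parse is a toy version of the 'demote and continue' idea: real links
// become link tokens, while an unclosed '[[' becomes literal text and
// parsing resumes from just after it.
func parse(s string) []string {
    var out []string
    for len(s) > 0 {
        i := strings.Index(s, "[[")
        if i < 0 {
            out = append(out, "text:"+s)
            break
        }
        if i > 0 {
            out = append(out, "text:"+s[:i])
        }
        rest := s[i+2:]
        j := strings.Index(rest, "]]")
        if j < 0 {
            // 'Parse error': turn the start token into plain text
            // and continue parsing from just after it.
            out = append(out, "text:[[")
            s = rest
            continue
        }
        out = append(out, "link:"+rest[:j])
        s = rest[j+2:]
    }
    return out
}

func main() {
    fmt.Println(parse("a [[link that never closes, with *emphasis* after"))
}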
PS: How I stubbed my toe on this issue was basically trying to do this sort of 'convert back to plain text' for general unclosed font changes in DWikiText. When I did this outside of a limited context, it blew up in my face.
2015-09-22
The Go 'rolling errors' pattern, in function call form
One of the small annoyances of Go's explicit error returns is that checking the error at every step gets quite repetitive when all of the error handling is actually the same. You wind up with the classic annoying pattern of, say:
s.f1, err = strconv.ParseUint(fields[1], 10, 64)
if err != nil {
    return nil, err
}
s.f2, err = strconv.ParseUint(fields[2], 10, 64)
if err != nil {
    return nil, err
}
[... repeat ...]
Of course, any good lazy programmer who is put into this starting situation is going to come up with a way to aggregate that error handling together. Go programmers are no exception, which has led to what I'll call a generic 'rolling errors' set of patterns. The basic pattern, as laid out in Rob Pike's Go blog entry Errors are values, is that as you do a sequence of operations you keep an internal marker of whether errors have occurred; at the end of processing, you check it and handle any error then.
Rob Pike's examples all use auxiliary storage for this internal marker (in one example, in a closure). I'm a lazy person so I tend to externalize this auxiliary storage as an extra function argument, which makes the whole thing look like this:
func getInt(field string, e error) (uint64, error) {
    i, err := strconv.ParseUint(field, 10, 64)
    if err != nil {
        return i, err
    }
    return i, e
}

func .... {
    [...]
    var err error
    s.f1, err = getInt(fields[1], err)
    s.f2, err = getInt(fields[2], err)
    s.f3, err = getInt(fields[3], err)
    if err != nil {
        return nil, err
    }
    [...]
}
This example code does bring up something you may want to think about in 'rolling errors' handling, which is what operations you want to do once you hit an error and which error you want to return. Sometimes the answer is clearly 'stop doing operations and return the first error'; other times, as with this code, you may decide that any of the errors is okay to return and it's simpler if the code keeps on doing operations (it may even be better).
(In retrospect I could have made this code just as simple while still stopping on the first error, but it didn't occur to me when I put this into a real program. In this case these error conditions are never expected to happen, since I'm parsing what should be numeric fields that are in a system generated file.)
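For illustration, here is a sketch of an equally simple version that does stop on the first error; this is not what my real code does, just the variant the parenthetical above alludes to:

func getInt(field string, e error) (uint64, error) {
    if e != nil {
        // An earlier call already failed; skip the work and
        // keep passing the first error along.
        return 0, e
    }
    return strconv.ParseUint(field, 10, 64)
}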
As an obvious corollary, this 'rolling errors' pattern doesn't
require using error itself. You can use it with any running or
accumulated status indicator, including a simple boolean.
(Sometimes you don't need the entire infrastructure of error to
signal problems. If this seems crazy, consider the case of subtracting
two accumulating counters from each other to get a delta over a
time interval where a counter might roll over and make this delta
invalid. You generally don't need details or an error message here,
you just want to know if the counter rolled over or not and thus
whether or not you want to disregard this delta.)
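As a hedged sketch of that counter case (the names and types here are hypothetical):

// delta returns cur - prev and clears *ok if the counter rolled over
// (or was reset), in which case the delta is meaningless.
func delta(cur, prev uint64, ok *bool) uint64 {
    if cur < prev {
        *ok = false
        return 0
    }
    return cur - prev
}

Used in the rolling pattern:

ok := true
dpkts := delta(now.packets, before.packets, &ok)
dbytes := delta(now.bytes, before.bytes, &ok)
if !ok {
    // some counter rolled over; disregard this sample entirely
}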
2015-09-14
A caution about cgo's error returns for errno
Go's cgo system for calling C functions offers a very convenient feature. As the documentation puts it:
Any C function (even void functions) may be called in a multiple assignment context to retrieve both the return value (if any) and the C errno variable as an error [...]
Reading this, you may be tempted to write more or less standard Go error-handling code like the following:
kcid, err := C.kstat_chain_update(t.kc)
if err != nil {
    return err
}
This code is a potential mistake. Unless the documentation for the
C function you're calling says so explicitly, there is no guarantee
that errno is zero on success. If the function returns success
but errno is non-zero, cgo will dutifully generate a non-nil
error return from it and then your Go code will bail out with an
error that isn't.
This is not cgo's fault. Cgo has no magic knowledge of what C
function return values are and aren't errors, so all it can do is
exactly what it said it was going to do; if errno is non-zero,
you get an error version of it. This is just a C API issue (that
ultimately comes about because errno is both an implicit return
and global state). You'd never write code like this in Go, where
'only return non-nil error on actual errors' is well established,
but we're stuck with the C API that we actually have instead of the
Go-like one we'd like. So we have to deal with it, which means
checking return values explicitly.
(In this case the real 'there has been an error' marker is a kcid
return value of -1. I actually hit an irregular test failure when
my code was just checking err, which is how I re-stubbed my toe
on this particular C API issue.)
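In other words, the safer version of the earlier code is a sketch along these lines, where err is only consulted after the function's own return value says there was an error:

kcid, err := C.kstat_chain_update(t.kc)
if kcid == -1 {
    // Only now is err (ie the saved errno) meaningful.
    return err
}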
PS: the ultimate cause of this is that C code often doesn't explicitly
set errno to zero on success but instead leaves it alone, which
means errno can wind up set from whatever internal system call or
library routine last failed and set it. There are many possibilities
for how this can happen; a classical one is seeing ENOTTY from
something checking to see if the file descriptor it is writing to
is a TTY and so should be in line-buffered mode.
(In my case I saw EAGAIN, which I believe was the OmniOS kernel telling kstat_chain_update()
that the buffer it had been given wasn't large enough, please try
again with a bigger one.)
2015-09-09
Some notes on my experience using Go's cgo system
Cgo is the Go compiler suite's bridge to C. I recently used it to write a Go package that gives you access to Solaris kstat kernel statistics, so I want to write down some notes on the whole thing before they fall out of my memory. On the whole, using cgo was a pleasant enough experience. I don't have very much experience in FFIs so I can't say how cgo compares to others, but cgo makes C and Go seem like a reasonably natural fit. It often really does seem like you're just using another Go package.
(With that said, there's a lot more use of unsafe.Pointer() than
you'll find almost anywhere else.)
At the mechanical level, the most annoying thing to deal with was
C unions. Go has no equivalent and cgo basically leaves you on your
own to read or set union fields. I wound up just writing some trivial
C functions to extract union fields for me and then had my Go code
call them, rather than wrestle with casts and unsafe.Pointer()
and so on in Go code; the C functions were both short and less error
prone for me to write.
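For illustration, such a helper can be as small as this sketch; the union member name follows <kstat.h>, but treat the specifics as illustrative rather than as my package's actual code:

uint64_t get_named_ui64(kstat_named_t *kn) {
    /* Pull the uint64 member out of the value union so the Go
       side never has to touch the union itself. */
    return kn->value.ui64;
}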
A C function was also my solution to needing to do pointer arithmetic.
In C, a common approach is to define a field as 'struct whatever
*ptr;' and then say it actually points to an array of those
structs, with the length of the array given by some other field.
You access the elements of the array by doing things like incrementing
ptr or indexing off it. Well, in Go that doesn't work; if you
want to increment ptr to the next struct, you're going to have
to throw in explicit C.sizeof invocations and so on. I decided
it was simpler to do it in C instead:
kstat_named_t *get_nth_named(kstat_t *ks, uint_t n) {
    kstat_named_t *knp;

    if (!ks || !ks->ks_data || n >= ks->ks_ndata)
        return NULL;
    knp = (kstat_named_t *)ks->ks_data;
    return knp + n;
}
Typecasts are another one of the irritations of cgo. Cgo makes every
C type into a Go type, and boy does a lot of C turn out to have a
lot of different integer types. In C they mostly convert into each
other without explicit casts; in Go they are all fully separate
types and you must explicitly cast them around in order to interact
with each other and with native Go integer types. This can get
especially annoying in things like for loop indexing, because if
you write 'for i := 0; i < CFIELD; i++' the compiler will object
that i is a different type than CFIELD. This resulted in a for
loop that looks like this:
for i := C.uint_t(0); i < k.ksp.ks_ndata; i++ {
    ....
}
(I wrote more about the mechanics in getting C-compatible structs
in Go, copying memory into Go structs, and cgo's string functions explained.)
At the design level, my biggest problem was handling C memory lifetime issues correctly. Part of this was figuring out where the C library had to be using dynamic allocation (and when it got freed), and part of it was working out what it was safe for Go structures to hold references to and when those references might become invalid because of some call I made to the C library API. Working this out is vital because of the impact of coupling Go and C memory lifetimes together, plus these memory lifetime issues are likely to have an effect on your package API. What operations can callers do or not do after others? What precautions do you need to take inside your package to try to avoid dereferencing now-free C memory if callers get the lifetime rules wrong? What things can you not expose because there's no way to guard against 'use after free' errors? And so on.
(runtime.SetFinalizer()
can help with this by letting you clean up C memory when Go memory
is going away, but it's not a complete cure.)
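For illustration, here is a sketch of the finalizer approach for a hypothetical wrapper type that owns some C-allocated memory:

/*
#include <stdlib.h>
*/
import "C"

import (
    "runtime"
    "unsafe"
)

// wrapper is a hypothetical Go value that owns a C allocation.
type wrapper struct {
    cptr unsafe.Pointer
}

func newWrapper(p unsafe.Pointer) *wrapper {
    w := &wrapper{cptr: p}
    // Free the C memory when w becomes unreachable. This is not a
    // complete cure: finalizers run at the garbage collector's
    // convenience, and anything that has copied w.cptr out can
    // still use it after the free.
    runtime.SetFinalizer(w, func(w *wrapper) {
        C.free(w.cptr)
        w.cptr = nil
    })
    return w
}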
Not all uses of cgo will run into memory lifetime problems. Many are probably self-contained, where all of your interaction with C code is inside one function and when it returns you're done and can free up everything.
2015-09-07
Getting gocode based autocompletion working for Go in GNU Emacs
The existing guides and documentation for this are terrible and
incomplete for someone who is not already experienced with GNU Emacs
Lisp packages (which describes me), so here is what worked for me. I'm
going to assume that your $GOPATH is $HOME/go and that $GOPATH/bin
is on your $PATH.
- get gocode itself:

    go get github.com/nsf/gocode

- To work in GNU Emacs, gocode needs an auto-completion package; it recommends auto-complete, so that's what I decided to use. If you have that already you're done, but I didn't. At this point you might be tempted to go to the auto-complete website and try to follow directions from there, but you actually don't want to do this because there's an easier way to install it.

- The easiest way to install auto-complete and its prerequisites is through MELPA, which is an additional package repo for Emacs Lisp packages on top of the default ELPA. To enable MELPA, you need to add a stanza to your .emacs following its getting started guide, generally:

    (require 'package)
    (add-to-list 'package-archives
                 '("melpa" . "https://melpa.org/packages/"))
    (package-initialize)

- make sure you have a $HOME/.emacs.d directory. You probably do.

- (re)start Emacs and run M-x list-packages. Navigate to auto-complete and get it installed. If you're running GNU Emacs in X, you can just click on its name and then on the [Install] button; if you're running in a terminal window, navigating to each thing and then hitting Return on it does the same. This will install auto-complete and its prerequisite package popup, the latter of which is not mentioned on the auto-complete site.

  It's possible to install auto-complete manually, directly from the site or more accurately from the github repo's release page. Do it from MELPA instead; it's easier and less annoying. If you install manually you'll have to use MELPA to install popup itself.

- Set up the .emacs stanza for gocode:

    (add-to-list 'load-path "~/go/src/github.com/nsf/gocode/emacs")
    (require 'go-autocomplete)
    (require 'auto-complete-config)
    (ac-config-default)

  This deliberately uses the go-autocomplete.el from gocode's Go package (and uses it in place), instead of one you might get through eg MELPA. I like this because it means that if (and when) I update gocode, I automatically get the correct and latest version of its Emacs Lisp as well.
Restarting GNU Emacs should then get you autocompletion when writing Go code. You may or may not like how it works and want to keep it; I haven't made up my mind yet. Its usage in X appears to be pretty intuitive but I haven't fully sorted out how it works in text mode (the major way seems to be hitting TAB to cycle through possible auto-completions it offers you).
(Plenty of people seem to like it, though, and I decided I wanted to play with the feature since I've never used a smart IDE-like environment before.)
See also Package management in Emacs: The Good, the Bad, and the Ugly and Emacs: How to Install Packages Using ELPA, MELPA, Marmalade. There are probably other resources too; my Emacs inexperience is showing here.
(As usual, I've written this because if I ever need it again I'll hate myself for not having written it down, especially since the directions here are the result of a whole bunch of missteps and earlier inferior attempts. The whole messy situation led to a Twitter rant.)
2015-09-03
How I've decided to coordinate multiple git repos for a single project
I'm increasingly using git for my own projects (partly because I
keep putting them on Github),
and this has brought up a problem. On the one hand, I like linear
VCS histories (even if they're lies); I don't plan on having branches
be visible in the history of my own repos unless it's clearly
necessary. On the other hand, I routinely have multiple copies of
my repos spread across multiple machines. In theory I always keep
all repos synchronized with each other before I start working in
one and make commits. In practice, well, not necessarily, and the
moment I screw that up a straightforward git pull/push workflow
to propagate changes around creates merges.
My current solution goes like this. First, I elect one repo as the
primary repo; this is the repo which I use to push changes to Github,
for example. To avoid merge commits ever appearing in it, I set it
to only allow fast-forward merges when I do 'git pull', with:
git config pull.ff only
This ensures that if the primary repo and a secondary repo wind up with different changes, a pull from the secondary into the primary will fail instead of throwing me into creating a merge commit that I don't want. To avoid creating merge commits when I pull the primary into secondaries, all other repos are set to rebase on pulls following my standard recipe. This is exactly what I want; if I pull new changes from the primary into a secondary, any changes in the secondary are rebased on top of the primary's stuff and linear history is preserved. I can then turn around and pull the secondary's additional changes back into the primary as a fast-forward.
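(The blunt, all-branches version of that rebase-on-pull setting is just:

git config pull.rebase true

although the recipe may reasonably be more selective about which branches it applies to.)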
If I use 'git push' to move commits from one repo to another I'm
already safe by default, because git push normally refuses to do
anything except fast-forward updates of the remote. If it complains,
the secondary repo involved needs a rebase. I can either do the
rebase with 'git pull' in the secondary repo, or in the primary
repo I can push to the remote tracking branch in the secondary with
'git push <machine>:<directory> master:origin/master' and then
do a 'git rebase' on the secondary.
(Using a push from the primary usually means that my ssh activity flows the right way. And if I'm pushing frequently I should configure a remote for the secondary or something. I'm not quite hep on git repo remotes and remote tracking branches just yet, though, so that's going to take a bit of fumbling around when I get to it.)
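(When I do get to it, the basic form of configuring such a remote is something like this, with a hypothetical remote name and repo path:

git remote add thatmachine thatmachine:/path/to/repo

after which 'git fetch thatmachine' and 'git push thatmachine ...' work without spelling out the full location each time. But I haven't actually fumbled through this yet.)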