2015-05-29
I don't commit changes in my working repos
I build a number of open source projects from source for various reasons of my own, using clones of the upstream master repos. With some of them, I make changes for my own reasons, often changes I will never be attempting to push upstream. Whenever I make changes this way, I do not commit my changes to the repos. This is the case whether the project uses Git or Mercurial.
The plain and simple truth is that if you're going to perpetually
carry modest changes to an upstream project that changes frequently
(and that you're going to re-pull frequently), not committing your
changes is by far the easiest way to operate. If you have uncommitted
changes, almost all of the time a simple 'git pull' or 'hg pull
-u' will quietly get you up to date. If this doesn't work you can
stash your changes in various ways, update to the latest version
from a pristine setup, and then re-patch your changes. You spend a
minimum amount of time interacting with the version control system,
the actual checked-in state of your repo exactly matches the master
repo, and any time you want to know what you changed you can check
'git status' or 'hg diff' and so on. It's also really easy to
temporarily or permanently set aside a particular change or set of
them.
If you commit the changes, well, you have a bunch more work and your repo has now diverged from master. Every time you pull an upstream update, you must either merge or rebase your commits; even if the VCS will do this for you automatically this time, the merges or rebases are still there in repo history. If you do merges, any bisection is probably going to be more exciting. If you do rebases, your repo steadily accumulates a bunch of clutter in the corners. In either case monitoring the state of your changes is somewhat less straightforward, as is manipulating them.
It's my view that all of the stuff you have to do if you commit your changes is essentially make-work. There are cases where it's (potentially) useful, but those cases have been few and far between for me, and in the meantime I'd be running a lot of VCS commands that I currently don't have to.
(I find the situation unfortunate but as far as I know neither Git nor Mercurial has extensions or features that would improve the situation. Git autorebase is sort of what I want, but not quite. Life is complicated by the fact that upstreams sometimes screw up their repos so that my local mirror repo diverges from their new reality, in which case I want to abandon my repo and just re-clone from upstream (and apply my changes again).)
(I mentioned this yesterday but I suspect it's a sufficiently odd way of working that I wanted to explain it directly.)
2015-05-28
I don't find Github pull requests an easy way to submit patches
(This is a grumble. You've been warned.)
A lot of open source projects are hosted on Github these days and many of them prefer to get patch submissions in the form of Github pull requests. On the one hand, I admire Github for making the process of pull requests relatively painless for the maintainer(s), at least in my brief experience. On the other hand, I don't find pull requests to be very attractive from the other side, as a (potential) submitter of changes, and I suspect that a lot of other potential change submitters are going to feel the same way.
In the moderately old days, submitting changes was really easy. You started with a current, checked-out copy of the project's source repo and you carried your relevant changes on top of it as uncommitted changes (in a CVS or SVN based project you didn't have any choice about that, and it was usually the easiest approach anyway). When you wanted to submit your change, you ran '<whatever> diff' to get a unified diff relative to the latest master, dumped the output in a file, and emailed the result off to the project with an explanatory note. If you were running your modifications yourself, you could do this from your live working repo.
Creating a Github pull request is not so simple in practice. As far as I can tell, the workflow I'd want to use is something like the following:
- fork a copy of the project into my own Github account.
- 'git clone' that copy to a scratch repo locally. Create a new branch in the repo and switch to it.
- 'git diff' my working repo with the uncommitted change, then use patch to apply the diff to the scratch repo.
- 'git commit' my change in the scratch repo (on my branch) and push it to my Github fork.
- go to Github and submit a pull request from my Github fork to the master.
- after the pull request is merged, delete my Github fork and my local scratch repo.
If I was being thorough, I should actually build the project out of this scratch repo and maybe do a test install (even though I've already built and installed from my working repo).
I have no idea what I'd do to keep clean commits if the master repo
moves forward between my pull submission and when it gets merged.
Maybe Github manages this transparently; maybe I need to update my
scratch repo, 'git rebase' to get a new clean commit (maybe on a
new branch), and push it back to Github to make a new pull request.
It's a fun new frontier of extra work.
(None of this is difficult with the 'email a patch' approach. You
'git pull' to update your working repo, maybe fiddle with your
change, then send another email with a newly generated 'git diff'
if you think you need to or get asked for it.)
Note that there are some simplifications to this that could be done if I contributed a lot to specific projects, which is what I suspect Github pull requests are good for. But I rather feel that they're not so good for essentially one-off contributions from people, which is the category I'm most likely to fall into.
So I'd sure like it if Github based projects still made it easy to send them patches by email (and mentioned this in any 'how to contribute' documentation they have). Unfortunately patches by email don't integrate very well with Github issues (of course), while Github pull requests work great there. I'm sure that this is a not insignificant factor pushing projects towards pull requests.
2015-05-14
In Go, you need to always make sure that your goroutines will finish
Yesterday I described an approach to writing lexers in Go that pushed the actual lexing into a separate goroutine, so that it could run as straight-line code that simply consumed input and produced a stream of tokens (which were sent to a channel). Effectively we're using a goroutine to implement what would be a generator in some other languages. But because we're using goroutines and channels, there's something important we need to do: we need to make sure the lexer is always run to completion, so that its goroutine will actually finish.
Right now you may be saying 'well of course the lexer will always be run to the end of the input, that's what the parser does'. But not so fast; what happens if the parser runs into a parse error because of a syntax error or the like? The natural thing to do in the parser is to immediately error out without looking at any further tokens from the lexer, which means that the actual lexer goroutine will stall as it sits there trying to send the next token into its communication channel, a channel that will never be read from because it's been abandoned by the parser.
The answer here is that the parser must do something to explicitly run the lexer to completion or otherwise cause it to exit, even if the tokens the lexer is producing will never be used. In some environments having the lexer process all of the remaining input is okay because it will always be small (and thus fast), but if you're lexing large bodies of text you'll want to arrange some sort of explicit termination signal via another channel or something.
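As a concrete sketch of the 'explicit termination signal' version, here is roughly the shape I have in mind. The quit channel and all of the names here are my own invention for illustration, not code from any real lexer; the point is that with the select, the lexer's channel send can never block forever once the parser has closed quit.

    package main

    import "fmt"

    type token struct {
        text string
    }

    // lex sends tokens until it runs out of input or is told to stop.
    // The select means a send can't block forever: once the parser
    // closes quit, the quit case becomes ready and the goroutine exits.
    func lex(words []string, tokens chan<- token, quit <-chan struct{}) {
        defer close(tokens)
        for _, w := range words {
            select {
            case tokens <- token{w}:
            case <-quit:
                return
            }
        }
    }

    func main() {
        tokens := make(chan token)
        quit := make(chan struct{})
        go lex([]string{"a", "!syntax-error!", "c"}, tokens, quit)

        for tok := range tokens {
            if tok.text == "!syntax-error!" {
                // The 'parse error' case: we stop consuming tokens, but
                // we close quit first so the lexer goroutine can notice
                // and finish instead of stalling on its next send.
                close(quit)
                break
            }
            fmt.Println("parsed", tok.text)
        }
    }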
This is an important way in which goroutines and channels aren't a perfect imitation of generators. In typical languages with generators, abandoning a generator results in it getting cleaned up via garbage collection; you can just walk away without doing anything special. In Go with goroutines, this isn't the case; you need to consider goroutine termination conditions and generally make sure that termination always happens.
You might think that this is a silly bug and of course anyone who uses goroutines like this will handle it as a matter of course. If so, I regret to inform you that I didn't come up with this realization on my own; instead Rob Pike taught it to me with his bugfix to Go's standard text/template package. If Rob Pike can initially overlook this issue in his own code in the standard library, anyone can.
2015-05-13
Go goroutines as a way to capture and hold state
The traditional annoyance when writing lexers is that lexers have internal state (at least their position in the stream of text), but wind up returning tokens to the parser at basically random points in their execution. This means holding the state somewhere and writing the typical start/stop style of code that you find at the bottom of a pile of subroutine calls; your 'get next token' entry point gets called, you run around a bunch of code, you save all your state, and you return the token. Manual state saving and this stuttering style of code execution don't lend themselves to clear logic.
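To show what I mean, here's a deliberately trivial sketch of the explicit-state style. The names and the 'lexing' are invented for illustration; a real lexer tracks much more than a position.

    package main

    import "fmt"

    // lexer holds the state that must survive between calls.
    type lexer struct {
        input string
        pos   int // our position, saved across calls
    }

    // NextToken is the 'get next token' entry point: every call has to
    // pick up from the saved position, do a little work, save the
    // position again, and return. Control flow starts over each time.
    func (l *lexer) NextToken() (string, bool) {
        for l.pos < len(l.input) && l.input[l.pos] == ' ' {
            l.pos++ // skip spaces
        }
        if l.pos >= len(l.input) {
            return "", false
        }
        start := l.pos
        for l.pos < len(l.input) && l.input[l.pos] != ' ' {
            l.pos++
        }
        return l.input[start:l.pos], true
    }

    func main() {
        l := &lexer{input: "set retries 3"}
        for tok, ok := l.NextToken(); ok; tok, ok = l.NextToken() {
            fmt.Println(tok)
        }
    }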
Some languages have ways around this structure. In languages with generators, your lexer can be a generator that yields tokens. In lazy evaluation languages your lexer turns into a stream transformation from raw text to tokens (and the runtime keeps this memory and execution efficient, only turning the crank when it needs the next token).
In Rob Pike's presentation on lexing in Go, he puts the lexer
code itself into its own little goroutine. It produces tokens by
sending them to a channel; your parser (running separately) obtains
tokens by reading the channel. There are two ways I could put what
Rob Pike's done here. The first is to say that you can use
goroutines to create generators, with a channel send and receive
taking the place of a yield operation. The second is that
goroutines can be used to capture and hold state. Just as with
ordinary threads, goroutines turn
asynchronous code with explicitly captured state into synchronous
code with implicitly captured state and thus simplify code.
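As a toy illustration of this shape (my own invented names and trivialized 'lexing', not Rob Pike's actual code), the goroutine version looks something like this:

    package main

    import (
        "fmt"
        "strings"
    )

    type token struct {
        kind string // "word" or "eof"
        text string
    }

    // lex runs in its own goroutine. Its position in the input is just
    // ordinary local state; it 'yields' tokens to the parser by sending
    // them down the channel, blocking until the parser wants the next one.
    func lex(input string, tokens chan<- token) {
        for _, w := range strings.Fields(input) {
            tokens <- token{"word", w}
        }
        tokens <- token{"eof", ""}
        close(tokens)
    }

    func main() {
        tokens := make(chan token)
        go lex("set output /dev/null", tokens)
        // The 'parser' just receives a token whenever it needs one; the
        // lexer's state lives implicitly in its goroutine, not in a struct.
        for tok := range tokens {
            fmt.Printf("%s: %q\n", tok.kind, tok.text)
        }
    }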
(I suppose another way of putting it is that goroutines can be used for coroutines, although this feels kind of obvious to say.)
I suspect that this use for goroutines is not new for many people (and it's certainly implicit in Rob Pike's presentation), but I'm the kind of person who sometimes only catches on to things slowly. I've read so much about goroutines for concurrency and parallelism that the nature of what Rob Pike (and even I) were doing here didn't really sink in until now.
(I think it's possible to go too far overboard here; not everything needs to be a coroutine or works best that way. When I started with my project I thought I would have a whole pipeline of goroutines; in the end it turned out that having none was the right choice.)