My workflow for testing Github pull requests
Every so often a Github-based project I'm following has a pending pull request that might solve a bug or otherwise deal with something I care about, and it needs some testing by people like me. The simple case is when I am not carrying any local changes; it is adequately covered by part of Github's Checking out pull requests locally documentation (skip to the bit where they talk about 'git fetch'). A more elaborate version is:
git fetch origin pull/<ID>/head:origin/pr/<ID>
git checkout pr/<ID>
That creates a proper remote branch and then a local branch that tracks it, so I can add any local changes to the PR that I turn out to need and then keep track of them relative to the upstream pull request. If the upstream PR is rebased, well, I assume I get to delete my remote branch and then re-fetch it and probably do other magic. I'll cross that bridge when I reach it.
The not so simple case is when I am carrying local changes on top of the upstream master. In the fully elaborate case I actually have two repos, the first being a pure upstream tracker and the second being a 'build' repo that pulls from the first repo and carries my local changes. I need to apply some of my local changes on top of the pull request while skipping others (in this case, because some of them are workarounds for the problem the pull request is supposed to solve), and I want to do all of this work on a branch so that I can cleanly revert back to 'all of my changes on top of the real upstream master'.
The workflow I've cobbled together for this is:
- Add the Github master repo if I haven't already done so:
git remote add github https://github.com/zfsonlinux/zfs.git
- Edit .git/config to add a new 'fetch =' line so that we can also fetch pull requests from the github remote, where they will get mapped to the remote branches github/pr/NNN. This will look like:
fetch = +refs/pull/*/head:refs/remotes/github/pr/*
(This comes from here.)
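Assuming the remote was added with the URL above, the whole remote section of .git/config would wind up looking something like this (the first 'fetch =' line is the stock one that 'git remote add' writes; only the second one is added by hand):

```
[remote "github"]
	url = https://github.com/zfsonlinux/zfs.git
	fetch = +refs/heads/*:refs/remotes/github/*
	fetch = +refs/pull/*/head:refs/remotes/github/pr/*
```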
- Pull down all of the pull requests with 'git fetch github'.
I think an alternative to configuring and fetching all pull requests is the limited version I did in the simple case (changing origin to github in both occurrences), but I haven't tested this. At the point that I have to do this complicated dance I'm in a 'swatting things with a hammer' mode, so pulling down all PRs seems perfectly fine. I may regret this later.
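For what it's worth, the untested limited variant would presumably have this shape (NNN = 123 here is a made-up example number). To keep the sketch self-contained and honest, a throwaway local repo stands in for Github, with a hand-made refs/pull/123/head ref; against the real remote, only the 'git fetch github pull/123/head:github/pr/123' line matters:

```shell
set -e
work="$(mktemp -d)"

# stand-in "upstream" repo with a fake pull request ref
git init -q -b master "$work/upstream"
git -C "$work/upstream" -c user.email=t@example.com -c user.name=t \
    commit -q --allow-empty -m 'PR 123: the fix'
git -C "$work/upstream" update-ref refs/pull/123/head master

# local repo with the stand-in as the 'github' remote
git init -q -b master "$work/local"
git -C "$work/local" remote add github "$work/upstream"

# the single-PR fetch itself; with the real remote this would be
#   git fetch github pull/123/head:github/pr/123
git -C "$work/local" fetch -q github pull/123/head:github/pr/123
git -C "$work/local" log --oneline github/pr/123
```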
- Create a branch from master that will be where I build and test the pull request (plus my local changes):
git checkout -b pr-NNN
It's vitally important that this branch start from master and thus already contain my local changes.
- Do an interactive rebase relative to the upstream pull request:
git rebase -i github/pr/NNN
This incorporates the pull request's changes 'below' my local changes to master, and with -i I can drop conflicting or unneeded local changes. Effectively it is much like what happens when you do a regular 'git pull --rebase' on master; the changes in github/pr/NNN are being treated as upstream changes and we're rebasing my local changes on top of them.
- Set the upstream of the pr-NNN branch to the actual Github pull request:
git branch -u github/pr/NNN
This makes 'git status' report things like 'Your branch is ahead of ... by X commits', where X is the number of local commits I've added.
If the pull request is refreshed, my current guess is that I will
have to fully discard my local
pr-NNN branch and restart from
fetching the new PR and branching off
master. I'll undoubtedly
find out at some point.
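Spelled out, the restart this guess implies would presumably look like the following (untested, and NNN stands for the pull request number):

```shell
# hypothetical restart after the upstream PR is refreshed -- a guess
git checkout master
git branch -D pr-NNN           # throw away the old build/test branch
git fetch github               # picks up the refreshed refs/pull/NNN/head
git checkout -b pr-NNN         # re-branch from master, with local changes
git rebase -i github/pr/NNN    # redo the rebase, dropping unneeded commits
git branch -u github/pr/NNN
```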
Initially I thought I should be able to use a sufficiently clever invocation of 'git rebase' to copy some of my local commits from master on to a new branch that was based on the Github pull request. With work I could get the rebasing to work right; however, it always wound up with me on (and changing) the master branch, which is not what I wanted. Based on this very helpful page on what 'git rebase' is really doing, what I want is apparently impossible without explicitly making a new branch first (and that new branch must already include my local changes so they're what gets rebased, which is why we have to branch from master).
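The branch-first point can be illustrated in a throwaway repo; here 'pr-change' is a local stand-in for github/pr/NNN (nothing below touches a real remote). Branching from master first and then rebasing moves only the new branch, leaving master untouched:

```shell
set -e
cd "$(mktemp -d)"
git init -q -b master repo && cd repo
git config user.email t@example.com
git config user.name t

# "upstream" base commit
echo base > file && git add file
git commit -q -m 'upstream: base'

# stand-in for the pull request branch
git branch pr-change
git checkout -q pr-change
echo fix > fix && git add fix
git commit -q -m 'pr: the fix'

# my local change, carried on top of master
git checkout -q master
echo mine > mine && git add mine
git commit -q -m 'local: my change'

# branch from master FIRST, then rebase on to the PR branch;
# the local commit is replayed on top of 'pr: the fix'
git checkout -q -b pr-test
git rebase -q pr-change
git log --oneline pr-test
git log --oneline master
```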
This is probably not the optimal way to do this, but having hacked my way through today's git adventure game I'm going to stop now. Feel free to tell me how to improve this in comments.
(This is the kind of thing I write down partly to understand it and partly because I would hate to have to derive it again, and I'm sure I'll need it in the future.)
Sidebar: Why I use two repos in the elaborate case
In the complex case I want to both monitor changes in the Github master repo and have strong control over what I incorporate into my builds. My approach is to routinely do 'git pull' in the pure tracking repo and read 'git log' for new changes. When it's time to actually build, I 'git pull' (with rebasing) from the tracking repo into the build repo and then proceed. Since I'm pulling from the tracking repo, not the upstream, I know exactly what changes I'm going to get in my build repo and I'll never be surprised by a just-added upstream change.
In theory I'm sure I could do this in a single repo with various tricks, but doing it in two repos is much easier for me to keep straight and reliable.
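As a sketch, the two-repo arrangement might be set up like this. A throwaway local repo stands in for the real Github upstream so the sketch is self-contained; in reality the tracking repo would be cloned from the Github URL:

```shell
set -e
work="$(mktemp -d)"

# stand-in "upstream" (really the Github repo)
git init -q -b master "$work/upstream"
git -C "$work/upstream" -c user.email=t@example.com -c user.name=t \
    commit -q --allow-empty -m 'upstream: initial'

# repo 1: pure tracking clone of the upstream
git clone -q "$work/upstream" "$work/tracking"

# repo 2: build repo, which pulls only from the tracking repo
git clone -q "$work/tracking" "$work/build"

# routine monitoring: update the tracker and read what's new
git -C "$work/tracking" pull -q
git -C "$work/tracking" log --oneline -3

# when it's time to build: pull (with rebasing) from the tracker
git -C "$work/build" pull -q --rebase
```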
A cynical view on needing SSDs in all your machines in the future
Let's start with my tweets:
@thatcks: Dear Firefox Nightly: doing ten+ minutes of high disk IO on startup before you even start showing me my restored session is absurd.
@thatcks: Clearly the day is coming when using a SSD is going be not merely useful but essential to get modern programs to perform decently.
I didn't say this just because programs are going to want to do more and more disk IO over time. Instead, I said it because of a traditional developer behavior, namely that developers mostly assess how fast their work is based on how it runs on their machines and developer machines are generally very beefy ones. At this point it's extremely likely that most developer machines have decently fast SSDs (and for good reason), which means that it's actually going to be hard for developers to notice they've written code that basically assumes a SSD and only runs acceptably on it (either in general or when some moderate corner case triggers).
SSDs exacerbate this problem by being not just fast in general but especially hugely faster at random IO than traditional hard drives. If you accidentally write something that is random IO heavy (or becomes so under some circumstances, perhaps as you scale the size of the database up) but only run it on a SSD based system, you might not really notice. Run that same thing on a HD based one (with a large database) and it will grind to a halt for ten minutes.
(Today I don't think we have profiling tools for disk IO the way we do for CPU usage by code, so even if a developer wanted to check for this their only option is to find a machine with a HD and try things out. Perhaps part of the solution will be an 'act like a HD' emulation layer for software testing that does things like slowing down random IO. Of course it's much more likely that people will just say 'buy SSDs and stop bugging us', especially in a few years.)