Wandering Thoughts archives


My workflow for testing Github pull requests

Every so often a Github-based project I'm following has a pending pull request that might solve a bug or otherwise deal with something I care about, and it needs some testing by people like me. The simple case is when I am not carrying any local changes; it is adequately covered by part of Github's 'Checking out pull requests locally' (skip to the bit where they talk about 'git fetch'). A more elaborate version is:

git fetch origin pull/<ID>/head:refs/remotes/origin/pr/<ID>
git checkout pr/<ID>

That creates a proper remote branch and then a local branch that tracks it, so I can add any local changes to the PR that I turn out to need and then keep track of them relative to the upstream pull request. If the upstream PR is rebased, well, I assume I get to delete my remote branch and then re-fetch it and probably do other magic. I'll cross that bridge when I reach it.
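As a rough illustration of the simple case, here is a sketch that runs entirely in a throwaway directory instead of against Github. The 'upstream' repository, PR number 1, and the commit subjects are all made up; the fake pull request is just a ref created at refs/pull/1/head, which is the layout Github uses. The last part is an assumption about one way to handle a rebased PR: a leading '+' on the refspec forces the remote branch to be overwritten on re-fetch.

```shell
#!/bin/sh
# Sandbox sketch; every repo name, PR number and commit here is invented.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
sandbox=$(mktemp -d)
cd "$sandbox"

# Fake upstream: one commit on master plus a pull request ref.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
echo base > file.txt; git add file.txt; git commit -q -m 'base'
git checkout -q --detach
echo fix > fix.txt; git add fix.txt; git commit -q -m 'PR: fix v1'
git update-ref refs/pull/1/head HEAD
git checkout -q master
cd ..

# The simple case: fetch the PR into a remote branch, then check out a
# local branch of the same name that tracks it.
git clone -q upstream clone
cd clone
git fetch -q origin pull/1/head:refs/remotes/origin/pr/1
git checkout -q pr/1
first=$(git rev-parse origin/pr/1)

# Simulate the PR being rebased (force-pushed) upstream.
cd ../upstream
git checkout -q --detach refs/pull/1/head
git commit -q --amend -m 'PR: fix v2'
git update-ref refs/pull/1/head HEAD
git checkout -q master
cd ../clone

# The leading '+' lets the fetch overwrite origin/pr/1 even though the
# new PR head is not a descendant of the old one.
git fetch -q origin +pull/1/head:refs/remotes/origin/pr/1
second=$(git rev-parse origin/pr/1)
git status -sb | head -n 1
```

After the forced re-fetch the local pr/1 branch still points at the old PR commit, so any local changes on it would still need to be moved onto the new origin/pr/1 by hand.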

The not so simple case is when I am carrying local changes on top of the upstream master. In the fully elaborate case I actually have two repos, the first being a pure upstream tracker and the second being a 'build' repo that pulls from the first repo and carries my local changes. I need to apply some of my local changes on top of the pull request while skipping others (in this case, because some of them are workarounds for the problem the pull request is supposed to solve), and I want to do all of this work on a branch so that I can cleanly revert back to 'all of my changes on top of the real upstream master'.

The workflow I've cobbled together for this is:

  • Add the Github master repo if I haven't already done so:
    git remote add github https://github.com/zfsonlinux/zfs.git

  • Edit .git/config to add a new 'fetch =' line so that we can also fetch pull requests from the github remote, where they will get mapped to the remote branches github/pr/NNN. This will look like:
    [remote "github"]
       fetch = +refs/pull/*/head:refs/remotes/github/pr/*

    (This comes from here.)

  • Pull down all of the pull requests with 'git fetch github'.

    I think an alternate to configuring and fetching all pull requests is the limited version I did in the simple case (changing origin to github in both occurrences), but I haven't tested this. At the point that I have to do this complicated dance I'm in a 'swatting things with a hammer' mode, so pulling down all PRs seems perfectly fine. I may regret this later.

  • Create a branch from master that will be where I build and test the pull request (plus my local changes):
    git checkout -b pr-NNN master

    It's vitally important that this branch start from master and thus already contain my local changes.

  • Do an interactive rebase relative to the upstream pull request:
    git rebase -i github/pr/NNN

    This incorporates the pull request's changes 'below' my local changes to master, and with -i I can drop conflicting or unneeded local changes. Effectively it is much like what happens when you do a regular 'git pull --rebase' on master; the changes in github/pr/NNN are being treated as upstream changes and we're rebasing my local changes on top of them.

  • Set the upstream of the pr-NNN branch to the actual Github pull request branch:
    git branch -u github/pr/NNN

    This makes 'git status' report things like 'Your branch is ahead of ... by X commits', where X is the number of local commits I've added.
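The steps above can be tried end to end in a sandbox. This is only a sketch under stated assumptions: a local 'upstream' repository stands in for the Github repo, PR number 1 and the two local commits ('local: workaround' and 'local: tuning') are invented, 'git config --add' is used in place of hand-editing .git/config, and GIT_SEQUENCE_EDITOR with sed stands in for deleting the unwanted line in the interactive rebase editor.

```shell
#!/bin/sh
# Sandbox sketch; repo names, PR number and commits are all hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
work=$(mktemp -d)
cd "$work"

# Stand-in for the Github repo: master plus a fake PR at refs/pull/1/head.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
echo base > file.txt; git add file.txt; git commit -q -m 'base'
git checkout -q --detach
echo 'real fix' > fix.txt; git add fix.txt; git commit -q -m 'PR: real fix'
git update-ref refs/pull/1/head HEAD
git checkout -q master
cd ..

# The build repo: upstream master plus two local commits on master.
git clone -q upstream build
cd build
echo workaround > workaround.txt; git add workaround.txt
git commit -q -m 'local: workaround'
echo tuning > tuning.txt; git add tuning.txt
git commit -q -m 'local: tuning'

# Add the remote, add the extra 'fetch =' line, and fetch all PRs.
git remote add github "$work/upstream"
git config --add remote.github.fetch \
    '+refs/pull/*/head:refs/remotes/github/pr/*'
git fetch -q github

# Branch from master so the branch already contains the local commits.
git checkout -q -b pr-1 master

# Rebase onto the PR; the sed call deletes the workaround's line from the
# todo list, which is what dropping it by hand in the editor would do.
GIT_SEQUENCE_EDITOR="sed -i.bak '/workaround/d'" git rebase -q -i github/pr/1

# Make the PR branch the upstream so 'git status' reports ahead/behind.
git branch -q -u github/pr/1
git status -sb | head -n 1
```

At the end pr-1 holds the PR's commit with only the 'tuning' commit on top, while master still carries both local commits on top of the real upstream master.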

If the pull request is refreshed, my current guess is that I will have to fully discard my local pr-NNN branch and restart from fetching the new PR and branching off master. I'll undoubtedly find out at some point.

Initially I thought I should be able to use a sufficiently clever invocation of 'git rebase' to copy some of my local commits from master on to a new branch that was based on the Github pull request. With work I could get the rebasing to work right; however, it always wound up with me on (and changing) the master branch, which is not what I wanted. Based on this very helpful page on what 'git rebase' is really doing, what I want is apparently impossible without explicitly making a new branch first (and that new branch must already include my local changes so they're what gets rebased, which is why we have to branch from master).
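For comparison, one alternative that is not part of the workflow above is to branch directly from the pull request and copy over only the wanted local commits with 'git cherry-pick'. The sketch below is an assumption about how that could look, in a single throwaway repository: 'fake-pr' stands in for github/pr/NNN, and the commits are invented. It leaves master untouched, at the cost of having to name each commit to copy.

```shell
#!/bin/sh
# Sandbox sketch of a cherry-pick alternative; all names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
scratch=$(mktemp -d)
cd "$scratch"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master
echo base > file.txt; git add file.txt; git commit -q -m 'base'

# Stand-in for github/pr/NNN: a branch holding the pull request's commit.
git checkout -q -b fake-pr
echo fix > fix.txt; git add fix.txt; git commit -q -m 'PR: real fix'

# Local changes carried on master.
git checkout -q master
echo workaround > workaround.txt; git add workaround.txt
git commit -q -m 'local: workaround'
echo tuning > tuning.txt; git add tuning.txt
git commit -q -m 'local: tuning'
master_before=$(git rev-parse master)

# Branch from the PR, then copy only the wanted local commit. Naming a
# branch picks just its tip commit, here 'local: tuning'.
git checkout -q -b pr-test fake-pr
git cherry-pick master >/dev/null
git log --format=%s
```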

This is probably not the optimal way to do this, but having hacked my way through today's git adventure game I'm going to stop now. Feel free to tell me how to improve this in comments.

(This is the kind of thing I write down partly to understand it and partly because I would hate to have to derive it again, and I'm sure I'll need it in the future.)

Sidebar: Why I use two repos in the elaborate case

In the complex case I want to both monitor changes in the Github master repo and have strong control over what I incorporate into my builds. My approach is to routinely do 'git pull' in the pure tracking repo and read 'git log' for new changes. When it's time to actually build, I 'git pull' (with rebasing) from the tracking repo into the build repo and then proceed. Since I'm pulling from the tracking repo, not the upstream, I know exactly what changes I'm going to get in my build repo and I'll never be surprised by a just-added upstream change.

In theory I'm sure I could do this in a single repo with various tricks, but doing it in two repos is much easier for me to keep straight and reliable.
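A small sandbox sketch of the two-repo arrangement may make the benefit concrete. All of the names here are made up: 'upstream' stands in for the Github master repo, 'tracking' is the pure tracker, and 'build' carries one local commit. Upstream change B lands after the tracking repo was last pulled, so it does not reach the build repo.

```shell
#!/bin/sh
# Sandbox sketch; the repos, files and commit subjects are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
top=$(mktemp -d)
cd "$top"

git init -q upstream
(cd upstream && git symbolic-ref HEAD refs/heads/master &&
 echo base > file.txt && git add file.txt && git commit -q -m 'base')

# The pure tracking repo, and the build repo that pulls from it.
git clone -q upstream tracking
git clone -q tracking build
(cd build && echo local > local.txt && git add local.txt &&
 git commit -q -m 'local: my change')

# Upstream gains change A; pull it into the tracking repo and review it.
(cd upstream && echo a > a.txt && git add a.txt &&
 git commit -q -m 'upstream: A')
(cd tracking && git pull -q --ff-only && git log --oneline)

# Change B arrives upstream afterwards and has not been looked at yet.
(cd upstream && echo b > b.txt && git add b.txt &&
 git commit -q -m 'upstream: B')

# Building: pull (with rebasing) from the tracking repo, not upstream.
cd build
git pull -q --rebase
git log --oneline
```

The build repo ends up with A underneath the local commit and no trace of B, because it only ever sees what the tracking repo has already pulled.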

programming/GithubPRTestingWorkflow written at 23:08:58

A cynical view on needing SSDs in all your machines in the future

Let's start with my tweets:

@thatcks: Dear Firefox Nightly: doing ten+ minutes of high disk IO on startup before you even start showing me my restored session is absurd.
@thatcks: Clearly the day is coming when using a SSD is going to be not merely useful but essential to get modern programs to perform decently.

I didn't say this just because programs are going to want to do more and more disk IO over time. Instead, I said it because of a traditional developer behavior, namely that developers mostly assess how fast their work is based on how it runs on their machines and developer machines are generally very beefy ones. At this point it's extremely likely that most developer machines have decently fast SSDs (and for good reason), which means that it's actually going to be hard for developers to notice they've written code that basically assumes a SSD and only runs acceptably on it (either in general or when some moderate corner case triggers).

SSDs exacerbate this problem by being not just fast in general but especially hugely faster at random IO than traditional hard drives. If you accidentally write something that is random IO heavy (or becomes so under some circumstances, perhaps as you scale the size of the database up) but only run it on a SSD based system, you might not really notice. Run that same thing on a HD based one (with a large database) and it will grind to a halt for ten minutes.

(Today I don't think we have profiling tools for disk IO the way we do for CPU usage by code, so even if a developer wanted to check for this their only option is to find a machine with a HD and try things out. Perhaps part of the solution will be an 'act like a HD' emulation layer for software testing that does things like slowing down random IO. Of course it's much more likely that people will just say 'buy SSDs and stop bugging us', especially in a few years.)

tech/CynicalSSDInevitability written at 01:20:14
