Wandering Thoughts

2017-04-27

Understanding Git's model versus understanding its magic

In a comment on my entry on coming to a better understanding of what git rebase does, Ricky suggested I might find Understanding Git Conceptually to be of interest. This provides me with an opportunity to talk about what I think my problem with mastering Git is.

It's worth quoting Charles Duan here:

The conclusion I draw from this is that you can only really use Git if you understand how Git works. Merely memorizing which commands you should run at what times will work in the short run, but it’s only a matter of time before you get stuck or, worse, break something.

I actually feel that I have a relatively good grasp of the technical underpinnings of Git, what many people would call 'how Git works'. To wave my hands a bit, Git is a content addressable store that is used to create snapshots of trees, which are then threaded together in a sequence with commits, and so on and so forth. This lets me nod and go 'of course' about any number of apparently paradoxical things, such as git repositories with multiple initial commits. I don't particularly have this understanding because I worked for it; instead, I mostly have it because I happened to be standing around in the right place at the right time to see Git in its early days.

(There are bits of Git whose technicalities I understand less well, such as the index. I have probably read a description of the guts of the index at least a few times, but I couldn't tell you off the top of my head how even a simple version of the index works at a mechanical level. It turns out to be covered in this StackOverflow answer; the short version is that the index is a composite of a directory file and a bunch of normal object blobs.)

But in practice Git layers a great deal of magic on top of this technical model of its inner workings. Branches are references to commits (ie, heads) and git advances the reference when you make commits under the right circumstances; simple. Except that some branches have 'upstreams' and are 'remote tracking branches' and so on. All of these pieces of magic are not intrinsic to the technical model (partly because the technical model is a strictly local one), but they are very important for working with Git in many real situations.

It is this magic that I haven't mastered and internalized. For example, I understand what 'git fetch' does to your repository, and I can see why you would want it to update certain branch references so you can find the newly imported commits. But I have to think about why 'git fetch' will update certain branches and not others, and I don't know off the top of my head the settings that control this or how you change them.
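
(As a reference for my future self: the settings in question turn out to be the remote's fetch refspec plus the branch's upstream configuration. A freshly cloned repository usually has something like this in .git/config:

[remote "origin"]
    fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
    remote = origin
    merge = refs/heads/master

The fetch refspec is what makes 'git fetch' update all of the origin/* remote tracking branches, and the branch.master settings are what make origin/master the upstream of master; you change the latter with 'git branch --set-upstream-to'.)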

It's possible that Git has general patterns in this sort of magic, the way it has general patterns at its technical level. If it does, I have not yet understood enough of the magic to have noticed the general patterns. My personal suspicion is that general patterns do not necessarily exist at this layer, because the commands and operations I think of as part of this layer are actually things that have accreted into Git over time and were written by different people.

(At one point Git had a split between 'porcelain' and 'plumbing', where porcelain was the convenient user interface and was at least partially developed by different people than the core 'plumbing'. And bits of porcelain were developed by different people who had their own mental models for how their particular operation should behave, with git rebase's lack of an option for the branch name of the result being an example.)

In a way my understanding of Git's internals has probably held me back with Git in general, because it's helped encourage me to have a lackadaisical attitude about learning the rest of it. The result is that I make little surgical strikes on manpages and problems, and once I feel I've solved them well enough I go away again. In this I've been mirroring one of the two ways that I approach new programming languages. I've likely reached the point in Git where I should switch over to thoroughly slogging through some parts of it; one weakness that's become obvious in writing this entry is basically everything to do with remote repositories.

GitCoreVersusMagic written at 01:10:48

2017-04-26

Coming to a better understanding of what git rebase does

Although I've used it reasonably regularly, git rebase has so far been a little bit magical to me, as you may be able to tell from my extensive explanation to myself of using it to rebase changes on top of an upstream rebase. In my grand tradition, I'm going to write down what I hope is a better understanding of what it does and how its arguments interact with that.

What git rebase does is that it takes a series of commits, replays them on top of some new commit, and then gives the resulting top commit a name so that you can use it. When you use the three argument form with --onto, you are fully specifying all of these. Take this command:

git rebase --onto muennich/master old-muennich master

--onto names the new commit everything will be put onto (usually it's a branch, as it is here), the series of commits that will be replayed is old-muennich..master, and the new name is also master. You don't get a choice about the new name; git rebase always makes the result into your branch, discarding the branch's old value.

(As far as I can tell there's no technical reason why git rebase couldn't let you specify the branch name of the result; it's just not in the conceptual model the authors have of how it should work. If you need this, you need to manually create a new branch beforehand.)
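
For example, if you want the old version of master to stay findable under another name after the rebase, you can bookmark it first (old-master here is a name I've made up):

git branch old-master master
git rebase --onto muennich/master old-muennich master

Afterward old-master still points to the original chain of commits, while master has been moved to the rebased ones.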

The minimal version has no arguments:

git rebase

This only works on branches with an upstream. It replays your commits from the current branch on top of the current (ie new) upstream, and it determines the range of commits to rebase roughly by finding the closest common ancestor of your commits and the upstream:

A -> B -> C -> D               [origin/master]
      \-> local-1 -> local-2   [master]

In this bad plain text diagram, the upstream added C and D while you have local-1 and local-2. The common point is B, and so B..master describes the commits that will be put on top of origin/master and then your master branch will be switched to them (well, the new version of them).
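
You can ask Git for this common point directly with git merge-base; in the diagram above, this prints the hash of B, which is (more or less) what a plain 'git rebase' computes for itself:

git merge-base master origin/master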

A rebase is conceptually a push to cherry-pick's pull. In cherry picking, you start on the new clean branch and pull in changes from elsewhere. In rebasing, you start on your 'dirty' local branch and push its changes on top of some other (clean) branch. You then keep the name of your local branch but not its old origin point.
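
To make the duality concrete for the diagram above, the cherry-pick version would be (with B's actual hash substituted in, and where new-master is a name I've invented):

git checkout -b new-master origin/master
git cherry-pick B..master

while the rebase version is just running 'git rebase origin/master' on master. Both produce the same new commits on top of origin/master; the difference is which branch name winds up pointing at them.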

If you use the one or two argument form of git rebase, you're explicitly telling rebase what to consider the 'upstream' for both determining the common ancestor commit and for what to put your changes on top of. If I'm understanding this correctly, the following commands are both equivalent to a plain 'git rebase' on your master branch:

git rebase origin/master
git rebase origin/master master

Based on the diagrams in the git-rebase manpage, it looks like the one and two argument forms are most useful for cases where you have multiple local branches and want to shuffle around the relationship between them.

In general the git-rebase manpage has helpful examples combined with extensive ASCII diagrams. If I periodically read it carefully whenever I'm confused, it will probably all sink in eventually.

(Of course, the git manual page that I actually should read carefully several times until it all sinks in and sticks is the one on specifying revisions and ranges for Git. I sort of know what a number of the different forms mean, but in practice it's one part folklore to one part actual knowledge.)

GitRebaseUnderstanding written at 02:01:28

2017-04-23

How I rebased changes on top of other rebased changes in Git

A while ago I wrote an entry on some git repository changes that I didn't know how to do well. One of them was rebasing my own changes on top of a repository that itself had been rebased; in the comments, Aristotle Pagaltzis confirmed that his StackOverflow answer about this was exactly what I wanted. Since I've now actually gone through this process for the first time, I want to write down the details for myself, with commentary to explain how and why everything works. Much of this commentary will seem obvious to people who use Git a lot, but it reflects some concerns and confusions that I had at the time.

First, the repositories involved. rc is the master upstream repository for Byron Rakitzis's Unix reimplementation of Tom Duff's rc shell. It is not rebased; infrequent changes flow forward as normal for a public Git repo. What I'm going to call muennich-rc is Bert Münnich's collection of interesting modifications on top of rc; it is periodically rebased, either in response to changes in rc or just as Bert Münnich does development on it. Finally I have my own repository with my own local changes on top of muennich-rc. When muennich-rc rebases, I want to rebase my own changes on top of that rebase.

I start in my own repository, before fetching anything from upstream:

  1. git branch old-me

    This creates a branch that captures the initial state of my tree. It's not used in the rebasing process; instead it's a safety measure so that I can reset back to it if necessary without having to consult something like the git reflog. Because I've run git branch without an additional argument, old-me is equivalent to master until I do something to change master.

  2. git branch old-muennich muennich/master.

    muennich/master is the upstream for muennich-rc. Creating a branch captures the (old) top commit for muennich-rc that my changes are on top of.

    Because both old-me and old-muennich have been created as plain ordinary git branches, not upstream tracking branches, their position won't change regardless of fetching and other changes during the rebase. I'm really using them as bookmarks for specific commits instead of actual branches that I will add commits on top of.

    (I'm sure this is second nature to experienced Git people, but when I made old-muennich I had to pause and convince myself that what commit it referred to wasn't going to change later, the way that master changes when you do a 'git pull'. Yes, I know, 'git pull' does more than 'git fetch' does and the difference is important here.)

  3. git fetch

    This pulls in the upstream changes from muennich-rc, updating muennich/master to refer to the current top commit of muennich-rc. It's now possible to do things like 'git diff old-muennich muennich/master' to see any differences between the old muennich-rc and the newly updated version.

    (Because I did git fetch instead of git pull or anything else, only muennich/master changed. In particular, master has not changed and is still the same as old-me.)

  4. git rebase --onto muennich/master old-muennich master

    This does all the work (well, I had to resolve and merge some conflicts). What it means is 'take all of the commits that go from old-muennich to master and rebase them on top of muennich/master; afterward, set the end result to be master'.

    (If I omitted the old-muennich argument, I would be trying to rebase both my local changes and the old upstream changes from muennich-rc on top of the current muennich-rc. Depending on the exact changes involved in muennich-rc's rebasing, this could have various conflicts and bad effects (for instance, reintroducing changes that Bert Münnich had decided to discard). There is a common ancestor in the master rc repository, but there could be a lot of changes between there and here.)

    The local changes that I added to the old version of muennich-rc are exactly the commits from old-muennich to master (ie, they're what would be shown by 'git log old-muennich..master', per the git-rebase manpage), so I'm putting my local commits on top of muennich/master. Since the current muennich/master is the top of the just-fetched new version of muennich-rc, I'm putting my local commits on top of the latest upstream rebase. This is exactly what I want to do; I'm rebasing my commits on top of an upstream rebase.

  5. After the dust has settled, I can get rid of the two branches I was using as bookmarks:

    git branch -D old-me
    git branch -D old-muennich
    

    I have to use -D because as far as git is concerned these branches both have unmerged changes. They're unmerged because these branches have both been orphaned by the combination of the muennich-rc rebase and my rebase.
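
Had step 4 gone badly, the bookmarks would have been my escape hatch. In the middle of a conflicted rebase I could have bailed out with:

git rebase --abort

and if I'd decided after the fact that I didn't like the result, I could have put master back to its original state (while on master) with:

git reset --hard old-me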

Because I don't care (much) about the old version of my changes that are on top of the old version of muennich-rc, doing a rebase instead of a cherry-pick is the correct option. Following my realization on cherry-picking versus rebasing, there are related scenarios where I might want to cherry-pick instead, for example if I wasn't certain that I liked some of the changes in the rebased muennich-rc and I might want to fall back to the old version. Of course in this situation I could get the same effect by keeping the two branches after the rebase instead of deleting them.

GitRebaseOnRebase written at 01:02:10

2017-03-20

My theory on why Go's gofmt has wound up being accepted

In Three Months of Go (from a Haskeller's perspective) (via), Michael Walker makes the following observation in passing:

I do find it a little strange that gofmt has been completely accepted, whereas Python’s significant whitespace (which is there for exactly the same reason: enforcing readable code) has been much more contentious across the programming community.

As it happens, I have a theory about this: I think it's important that gofmt only has social force. By this I mean that you can write Go code in whatever style and indentation you want, and the Go compiler will accept it (in some styles you'll have to use more semicolons than in others). This is not the case in Python, where the language itself flatly insists that you use whitespace in roughly the correct way. In Go, the only thing 'forcing' you to put your code through gofmt is the social expectations of the Go community. This is a powerful force (especially when people learning Go also learn 'run your code through gofmt'), but it is a soft force as compared to the hard force of Python's language specification, and so I think people are more accepting of it. Many of the grumpy reactions to Python's indentation rules seem to be not because the formatting it imposes is bad but because people reflexively object to being forced to do it.

(This also means that Go looks more conventional as a programming language; it has explicit block delimiters, for example. I think that people often react to languages that look weird and unconventional.)

There is an important practical side effect of this that is worth noting, which is that your pre-gofmt code can be completely sloppy. You can just slap some code into the file with terrible indentation or no indentation at all, and gofmt will fix it all up for you. This is not the case in Python; because whitespace is part of the grammar, your Python code must have good indentation from the start and cannot be fixed later. This makes it easier to write Go code (and to write it in a wide variety of editors that don't necessarily have smart indentation support and so on).
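
As a small illustration of this, inside an otherwise valid Go file gofmt will happily accept:

func add(a,b int)int{
return a+b}

and rewrite it into the canonical form:

func add(a, b int) int {
        return a + b
}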

The combination of these two gives the soft force of gofmt a great deal of power over the long term. It's quite convenient to be able to scribble sloppily formatted code down and then have gofmt make it all nice for you, but if you do this you must go along with gofmt's style choices even if you disagree with some of them. You can hold out and stick to your own style, but you're doing things the hard way as well as the socially disapproved way, and in my personal experience sooner or later it's not worth fighting Go's city hall any more. The lazy way wins out and gofmt notches up another quiet victory.

(It probably also matters that a number of editors have convenient gofmt integration. I wouldn't use it as a fixup tool as much as I do if I had to drop the file from my editor, run gofmt by hand, and then reload the now-changed file. And if it was less of a fixup tool, there would be less soft pressure of 'this is just the easiest way to fix up code formatting so it looks nice'; I'd be more likely to format my Go code 'correctly' in my editor to start with.)

GoWhyGofmtAccepted written at 01:01:12

2017-03-03

Why exposing only blocking APIs is ultimately a bad idea

I recently read Marek's Socket API thoughts, which mulls over a number of issues and ends with the remark:

But nonetheless, I very much like the idea of only blocking API's being exposed to the user.

This is definitely an attractive idea. All of the various attempts at select() style APIs have generally not gone well, high level callbacks give you 'callback hell', and it would be conceptually nice to combine cheap concurrency with purely blocking APIs to have our cake and eat it too. It's no wonder this idea comes up repeatedly and I feel the tug of it myself.

Unfortunately, I've wound up feeling that it's fundamentally a mistake. While superficially attractive, attempting to do this in the real world is going to wind up with an increasingly ugly mess in practice. For the moment let's set aside the issue that cheap concurrency is fundamentally an illusion and assume that we can make the illusion work well enough here. This still leaves us with the select() problem: sooner or later the result of one IO will make you want to stop doing another waiting IO. Or more generally, sooner or later you'll want to stop doing some bit of blocking IO as the result of other events and processing inside your program.

When all IO is blocking, separate IO must be handled by separate threads and thus you need to support external (cross-thread) cancellation of in-flight blocked IO out from underneath a thread. The moment you have this sort of unsynchronized and forced cross-thread interaction, you have a whole collection of thorny concurrency issues that we have historically not been very good at dealing with. It's basically guaranteed that people will write IO handling code with subtle race conditions and unhandled (or mishandled) error conditions, because (as usual) they didn't realize that something was possible or that their code could be trying to do thing X right as thing Y was happening.

(I'm sure that there are API design mistakes that can and will be made here, too, just as there have been a series of API design mistakes around select() and its successors. Even APIs are hard to get completely right in the face of concurrency issues.)

There is no fix for this that I can see for purely blocking APIs. Either you allow external cancellation of blocked IO, which creates the cross-thread problems, or you disallow it and significantly limit your IO model, creating real complications as well as limiting what kind of systems your APIs can support.

(For the people who are about to say 'but Go makes it work', I'm afraid that Go doesn't. It chooses to limit what sort of systems you can build, and I'm not just talking about the memory issues.)

PS: I think it's possible to sort of square the circle here, but the solution must be deeply embedded into the language and its runtime. The basic idea is to create a CSP like environment where waiting for IO to complete is a channel receive or send operation, and may be mixed with other channel operations in a select. Once you have this, you have a relatively clean way to cancel a blocked IO; the thread performing the IO simply uses a multi-select, where one channel is the IO operation and another is the 'abort the operation' channel. This doesn't guarantee that everyone will get it right, but it does at least reduce your problem down to the existing problem of properly handling channel operation ordering and so on. But this is not really a 'only blocking API' as we normally think of it and, as mentioned, it requires very deep support in the language and runtime (since under the hood this has to actually be asynchronous IO and possibly involve multiple threads).
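
Here is a minimal sketch of the surface shape of this pattern in today's Go. Note that it only abandons waiting for the IO; the Read itself still blocks its goroutine, which is exactly why the real thing needs deep runtime support:

import (
        "errors"
        "io"
)

func readWithAbort(r io.Reader, abort chan struct{}) ([]byte, error) {
        type result struct {
                data []byte
                err  error
        }
        results := make(chan result, 1)
        go func() {
                // This Read cannot actually be interrupted; we can
                // only stop waiting for its result.
                buf := make([]byte, 4096)
                n, err := r.Read(buf)
                results <- result{buf[:n], err}
        }()
        select {
        case res := <-results:
                return res.data, res.err
        case <-abort:
                return nil, errors.New("read aborted")
        }
}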

This is also going to sometimes be somewhat of a lie, because on many systems there is a certain amount of IO that is genuinely synchronous and can't be interrupted at all, despite you putting it in a multi-channel select statement. Many Unixes don't really support asynchronous reads and writes from files on disk, for example.

PureBlockingAPIsWhyBad written at 23:53:52

2017-02-16

Waiting for a specific wall-clock time in Unix

At least on Unix systems, time is a subtle but big pain for programmers. The problem is that because the clock can jump forward, stand still (during leap seconds), or even go backwards, your expectations about what subtracting and adding times does can wind up being wrong under uncommon or rare circumstances. For instance, you can write code that assumes that subtracting a time in the past from now() can never give you a negative number. This assumption recently led to a Cloudflare DNS outage during a leap second, as covered in Cloudflare's great writeup of this incident.

The solution to this is a new sort of time. Instead of being based on wall-clock time, it is monotonic; it always ticks forward and ticks at a constant rate. Changes in wall-clock time don't affect the monotonic clock, whether those are leap seconds, large scale corrections to the clock, or simply your NTP daemon running the clock a little bit slow or fast in order to get it to the right time. Monotonic clocks are increasingly supported by Unix systems and more and more programming environments are either supporting them explicitly or quietly supporting them behind the scenes. All of this is good and fine and all that, and it's generally just what you want.

I have an unusual case, though, where I'd actually like the reverse functionality. I have a utility that wants to wait until a specific wall-clock time. If the system's wall-clock time is adjusted, I'd like my waiting to immediately be updated to reflect that and my program woken up if appropriate. Until I started writing this entry, I was going to say that this is impossible, but now I believe that it's possible in POSIX. Well, in theory it's possible in POSIX; in practice it's not portable to at least one major Unix OS, because FreeBSD doesn't currently support the necessary features.

On a system that supports this POSIX feature, you have two options: just sleeping, or using timers. Sleeping is easier; you use clock_nanosleep using the CLOCK_REALTIME clock with the TIMER_ABSTIME flag. The POSIX standard (and Linux) specify that if the wall-clock time is changed, you still get woken up when appropriate. With timers, you use a similar but more intricate process. You create a CLOCK_REALTIME timer with timer_create and then use timer_settime to set a TIMER_ABSTIME wait time. When the timer expires, you get signalled in whatever way you asked for.

In practice, though, this doesn't help me. Not only is this clearly not supported on every Unix, but as far as I can see Go doesn't expose any API for clock_nanosleep or equivalent functionality. This isn't terribly surprising, since sleeping in Go is already deeply intertwined with its multi-threaded runtime. Right now my program just approximates what I want by waking up periodically in order to check the clock; this is probably the best I can do in general for a portable program, even outside of Go.
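
The periodic approximation is at least simple. Here's a minimal sketch of it in Go, where the one-minute cap on each sleep is an arbitrary tradeoff between how fast clock changes are noticed and how often the program pointlessly wakes up:

import "time"

// waitUntil returns once the wall clock reads target or later,
// re-checking at least once a minute so that clock adjustments
// are noticed reasonably promptly.
func waitUntil(target time.Time) {
        for {
                left := target.Sub(time.Now())
                if left <= 0 {
                        return
                }
                if left > time.Minute {
                        left = time.Minute
                }
                time.Sleep(left)
        }
}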

(If I was happy with a non-portable program that only worked on Linux, probably the easiest path would be to use Python with the ctypes module to directly call clock_nanosleep with appropriate arguments. I'm picking Python here because I expect it's the easiest language for writing reasonably general time parsing code. Anyways, I already know Python and I've never used the ctypes module, so it'd be fun.)

Sidebar: The torture case here is DST transitions

I started out thinking that DST transitions would be a real problem, since either an hour disappears or happens twice. For example, if I say 'wait until 2:30 am' on the night of a transition into DST, I probably want my code to wake up again when the wall-clock time ticks from 2 am straight to 3 am. Similarly, on a transition out of DST, should I say 'wake up at 2:10 am', I probably don't want my code waking up at the second 1:10 am.

However, the kernel actually deals in UTC time, not local time. In practice all of the complexity is in the translation from your (local time) time string into UTC time, and in theory a fully timezone and DST aware library could get this (mostly) right. For '2:30 am during the transition into DST', it would probably return an error (since that time doesn't actually exist), and for '2:10 am during the transition out of DST' it should return a UTC time that is an hour later than you'd innocently expect.

(This does suggest that parsing such times is sort of current-time dependent. Since there are two '1:30 am' times on the transition out of DST, which one you want depends in part on what time it is now. If the transition hasn't happened yet, you probably want the first one; if the transition has happened but it's not yet the new 1:30 am yet, you probably want the second.)

WallclockSleepDesire written at 01:33:44

2017-02-14

Does CR LF as a line ending cause extra problems with buffers?

In reaction to my entry about why having CR LF as your line ending is a mistake, Aristotle Pagaltzis raised the issue of CR LF's consequences for buffers in a comment:

When you said “state machine” in the context of network protocols, I thought you were going to talk about buffers. That’s an even more painful consequence than just the complexity of scanning for a sequence. [...]

My first reaction was that I didn't think a multi-byte line ending sequence causes extra problems, because dealing with line oriented input through buffering already gives you enough of them. Any time you read input in buffers but want to produce output in lines, you need to deal with the problem that a line may not end in the current buffer. This is especially common if you're reading through input in fixed-size chunks; you would have to be very lucky to always have a line end right at the end of every 4k block (or 16k block or whatever). Sooner or later a block boundary will happen in the middle and there you are. So you have to be prepared to glue lines together across buffers no matter what.

This is too simple a view, though, once you (ie, I) think about it more. When your line ending is a single byte, you have an unambiguous situation within a single buffer; either the line definitely ends in the buffer or it doesn't. Your check for the line ending is 'find occurrence of byte <X>' and once this fails you'll never have to re-check the current buffer's contents. This is not true with a multi-byte line ending, because the line ending CR LF sequence may be split over a buffer boundary. This means that you can no longer scan each buffer independently. Either you need to scan them together so that such split CR LF sequences are fused back together, or you need to remember that the last byte in the current buffer is a CR and look for a bare LF at the start of the next buffer.
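
Here is a sketch in Go of the second approach, where the only extra cross-buffer state is a flag for a trailing CR (on top of the partial-line accumulation that you need anyway):

import "bytes"

type crlfScanner struct {
        partial   []byte // line fragment carried over from earlier buffers
        pendingCR bool   // did the previous buffer end with a bare CR?
}

func (s *crlfScanner) feed(buf []byte, emit func(line []byte)) {
        if len(buf) == 0 {
                return
        }
        if s.pendingCR && buf[0] == '\n' {
                // A CR LF pair was split across the buffer boundary;
                // the CR is the last byte of the saved fragment.
                emit(s.partial[:len(s.partial)-1])
                s.partial = nil
                buf = buf[1:]
        }
        s.pendingCR = false
        for {
                i := bytes.Index(buf, []byte("\r\n"))
                if i < 0 {
                        break
                }
                emit(append(s.partial, buf[:i]...))
                s.partial = nil
                buf = buf[i+2:]
        }
        s.partial = append(s.partial, buf...)
        s.pendingCR = len(buf) > 0 && buf[len(buf)-1] == '\r'
}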

Of course, CR LF line endings aren't the only case in modern text processing where you have multi-byte sequences. A great deal of modern text is encoded in UTF-8, and many UTF-8 codepoints are multi-byte sequences; if you want to recognize such a codepoint in buffers of UTF-8 text, you have the same problem that the UTF-8 encoding may start at the end of one buffer and finish in the start of the next. It feels like there ought to be a general way of dealing with this that could then be trivially applied to the CR LF case.

(As Aristotle Pagaltzis kind of mentions later in his comment, this is going to involve storing state somewhere, either explicitly in a data structure or implicitly in the call stack of a routine that's pulling in the next buffer's worth of data.)

CRLFAndBuffering written at 00:56:35

2017-01-26

Things that make Go channels expensive to implement

Recently, fchan: Fast Channels in Go made the rounds (via). I read it with some interest, because I'm always drawn to interesting high-performance concurrent things like this, but my first reaction was that they'd started with an artificially inexpensive scenario. That got me thinking about what features make Go channels expensive (ie slower), regardless of the specifics of the implementation.

Generally, what makes concurrent operations intrinsically expensive is the need for cross-thread locking and coordination. So we can look for all of the places that require coordination or locks, as opposed to simple operations:

  • A sender may have to suspend, as opposed to either the channel being unbounded or sends failing immediately if the channel is full. Suspending means enqueuing, scheduling, and coordination to make sure that wakeups are not lost.

    (The mere potential of even a single waiting sender means that receivers must be prepared to wake senders, as opposed to passively pulling messages from a queue and possibly suspending until some show up.)

  • There may be multiple senders and thus multiple suspended senders. A new receiver and the overall runtime must insure that exactly one of them is woken to continue on (no more, no less).

  • There may be multiple receivers and thus multiple suspended receivers. A new sender must insure that exactly one of them is woken to receive its message.

    (You can cover all of this waiting sender and receiver stuff with a single lock on the channel as a whole, but now you have a single lock that everyone who touches the channel will be contending over, and it may be held over more than very short lengths of code.)

  • A receiver may be waiting in a select for multiple channels to become ready. A sender that wants to wake this receiver must insure that its send is the single event that wakes the receiver; all other channel events must be locked out and prevented from happening.

  • A sender may be waiting in a select for multiple channels to become ready. A receiver that wants to wake this sender to get its message must insure that its event is the one that wins the race; all other channel events must be locked out and prevented from happening.

  • When a goroutine performs a select on multiple channels, it must initially lock all the channels before determining their readiness because the language spec specifically says that the winning channel is chosen via a uniform pseudo-random selection. The Go runtime is not free to lock channels one after another until it finds the first ready one, take that, and stop there; it must lock everything before picking one ready channel.

    (This channel locking must also be performed in a consistent order so that selects in different goroutines with overlapping or identical channel lists don't deadlock against each other.)

In the fully general case you have one goroutine sleeping in select on multiple channels and another running goroutine starting a select that could succeed immediately by waking the first goroutine. The running goroutine must take some pains to insure that the sleeping goroutine is not actually woken out from underneath it by a third goroutine. There are quite a few locks and atomic operations flying around in the process.
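
The pseudo-random selection requirement is easy to see from the outside. In this little program both channels are ready when the select runs, and over many runs you'll see each case chosen roughly half the time, which is exactly the behavior the runtime must pay for by locking everything up front:

package main

import "fmt"

func main() {
        a := make(chan string, 1)
        b := make(chan string, 1)
        a <- "from a"
        b <- "from b"
        // Both cases are ready; the runtime must pick one uniformly
        // at random rather than always taking the first.
        select {
        case msg := <-a:
                fmt.Println(msg)
        case msg := <-b:
                fmt.Println(msg)
        }
}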

(Part of the result of this is that the implementation of select in the runtime is reasonably involved. select.go is actually surprisingly readable, although it's not easy going. It's best read along with chan.go. As a bonus it contains at least one neat hack.)

PS: The implementation of select is so complicated that recently a long-standing and extremely intricate race deep inside the runtime was fixed in this commit (the commit discussion has an illuminating explanation of what the code is doing in general). The fix was rather simple but the journey there was clearly an epic one.

GoChannelsExpensiveFeatures written at 01:07:04

2017-01-14

My picks for mind-blowing Git features

It started on Twitter:

@tobyhede: What git feature would you show someone who has used source control (but not git) that would blow their mind?

@thatcks: Sysadmins: git bisect. People w/ local changes: rebase. Devs: partial/selective commits & commit reordering.

Given that at different times I fall into all three of these groups, I kind of cheated in my answer. But I'll stand by it anyways, and since Twitter forces a distinct terseness on things, I'm going to expand on why these things are mind-blowing.

If you use some open source package and you can compile it, git bisect (plus some time and work) generally gives you the superpower of being able to tell the developers 'this specific change broke a thing that matters to me', instead of having to tell them just 'it broke somewhere between vN and vN+1'. Being able to be this specific to developers drastically increases the chances that your bug will actually get fixed. You don't have to know how to program to narrow down your bug report, just be able to use git bisect, compile the package, and run it to test it.

(If what broke is 'it doesn't compile any more', you can even automate this.)
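
A sketch of the automated version, assuming the project builds with a plain make and with v1.0 standing in for your last known good version; 'git bisect run' treats a zero exit status as 'good' and a non-zero one as 'bad':

git bisect start
git bisect bad
git bisect good v1.0
git bisect run make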

If you carry local modifications in your copy of an upstream project, changes that will never be integrated and that you have no intention of feeding upstream, git rebase is so much your friend that I wrote an entire entry about how and why. In the pre-git world, at best you wound up with a messy tangle of branches and merges that left the history of your local repository increasingly different from the upstream one; at worst your local changes weren't even committed to version control, just thrown on top of the upstream as patches and changes that tools like svn attempted to automatically merge into new upstream commits when you did svn up.

If you're developing changes, well, in theory you're disciplined and you use feature branches and do one thing at a time and your diffs are always pure. In practice I think that a lot of the time this is not true, and at that point git's ability to do selective commits, reorder commits, and so on will come along and save your bacon; you can use them to sort out the mess and create a series of clean commits. In the pre-git, pre-selective-commit era things were at least a bunch more work and perhaps more messy. Certainly for casual development people probably just made big commits with random additional changes in them; I know that I certainly did (and I kept doing it even in git until recently because I didn't have the right tools to make this easy).

(Of course this wasn't necessarily important for keeping track of your local changes, because before git you probably weren't committing them in the first place.)

PS: There is one git feature that blows my mind on a technical level because it is just so neat and so clever. But that's going to be another entry, and also it's technically not an official git feature.

(My line between 'official git feature' and 'neat addon hack' is whether the hack in question ships with git releases as an official command.)

GitMindblowingFeatures written at 01:06:55

2017-01-08

One downside of a queued IO model is memory consumption for idle connections

One of the common models for handling asynchronous IO is what I'll call the queued IO model, where you put all of your IO operations in some sort of a queue and then as things become ready, the OS completes various ones and hands them back to you. Sometimes this queue is explicitly exposed and sometimes, as in Go, the queue is implicit in a collection of threads all doing (what they see as) blocking IO operations. The queued IO model is generally simple and attractive, either in threaded form (in Go) or in explicit form where you pass operations that you'd like to do to the OS and it notifies you when various ones finish.

Recently I wound up reading Evan Klitzke's Goroutines, Nonblocking I/O, And Memory Usage, which pointed out a drawback to this model that hadn't been obvious to me before. That drawback is memory usage for pending operations, especially reads, in a situation where you have a significant number of idle connections. Suppose that you have 1,000 connections where you're waiting for the client to send you something. In a queued IO model the normal way to operate is to queue 1,000 read operations, and each of these queued read operations must come with an allocated buffer for the operating system to write the read data into. If only (say) 5% of those connections are active at any one time, you have quite a lot of memory tied up in buffers that are just sitting around inactive. In a select() style model that exposes readiness before you perform the IO, you can only allocate buffers when you're actually about to read data.
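
In Go terms, the per-connection pattern looks something like the following sketch, where process() is a stand-in for whatever you do with the data:

import "net"

func handleConn(conn net.Conn, process func([]byte)) {
        // This buffer lives for the whole life of the connection,
        // including all the time the connection sits idle in Read.
        buf := make([]byte, 16*1024)
        for {
                n, err := conn.Read(buf)
                if err != nil {
                        return
                }
                process(buf[:n])
        }
}

With 1,000 mostly idle connections, that's 1,000 of these buffers (16 MB here) tied up no matter how few of the connections are doing anything.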

Writes often pre-compute and pre-allocate the data to be written, in which case this isn't much of an issue for them; the buffer for the data to be written has to be allocated beforehand either way. But in situations where the data to be written could be generated lazily on the fly, the queued IO model can once again force extra memory allocations where you have to allocate and fill buffers for everything, not just the connections that are ready to have more data pushed to them.

All of this may be obvious to people already, but it was surprising to me so I feel like writing it down, especially how it extends from Go style 'blocking IO with threads' to the general model of queuing up asynchronous IO operations for the kernel to complete for you as it can.

(Of course there are reasons to want a select() like interface beyond this issue, such as the cancellation problem.)

QueuedIOMemoryUsageDownside written at 00:52:23
