Understanding Git's model versus understanding its magic
In a comment on my entry on coming to a better understanding of what git rebase does, Ricky suggested I might find Understanding Git Conceptually to be of interest. This provides me with an opportunity to talk about what I think my problem with mastering Git is.
It's worth quoting Charles Duan here:
The conclusion I draw from this is that you can only really use Git if you understand how Git works. Merely memorizing which commands you should run at what times will work in the short run, but it’s only a matter of time before you get stuck or, worse, break something.
I actually feel that I have a relatively good grasp of the technical underpinnings of Git, what many people would call 'how Git works'. To wave my hands a bit, Git is a content addressable store that is used to create snapshots of trees, which are then threaded together in a sequence with commits, and so on and so forth. This lets me nod and go 'of course' about any number of apparently paradoxical things, such as git repositories with multiple initial commits. I don't particularly have this understanding because I worked for it; instead, I mostly have it because I happened to be standing around in the right place at the right time to see Git in its early days.
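As a small concrete illustration of the content-addressable store part of this, object ids are derived purely from content (a quick sketch; this uses git's default SHA-1 object format):

```shell
# The same bytes always hash to the same object id, in any repository;
# trees and commits are in turn stored under hashes of their contents.
echo 'hello' | git hash-object --stdin
# ce013625030ba8dba906f756967f9e9ca394464a
```

This is why, for example, two unrelated repositories that contain an identical file store it as literally the same blob object.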
(There are bits of git that I understand less about the technicalities, like the index. I have probably read a description of the guts of the index at least a few times, but I couldn't tell you off the top of my head how even a simple version of the index works at a mechanical level. It turns out to be covered in this StackOverflow answer; the short version is that the index is a composite of a directory file and a bunch of normal object blobs.)
But in practice Git layers a great deal of magic on top of this technical model of its inner workings. Branches are references to commits (ie, heads) and git advances the reference when you make commits under the right circumstances; simple. Except that some branches have 'upstreams' and are 'remote tracking branches' and so on. All of these pieces of magic are not intrinsic to the technical model (partly because the technical model is a strictly local one), but they are very important for working with Git in many real situations.
It is this magic that I haven't mastered and internalized. For example, I understand what 'git fetch' does to your repository, and I can see why you would want it to update certain branch references so you can find the newly imported commits. But I have to think about why 'git fetch' will update certain branches and not others, and I don't know off the top of my head the settings that control this or how you change them.
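For what it's worth, as far as I can tell the main setting involved is the remote's fetch refspec; a quick sketch with throwaway repositories (the repository names here are invented):

```shell
# A default clone fetches everything under the remote's refs/heads into
# remote-tracking refs; the refspec is what controls this.
git init -q --bare upstream.git
git clone -q upstream.git work 2>/dev/null
cd work
git config --get remote.origin.fetch
# +refs/heads/*:refs/remotes/origin/*
# Narrowing the refspec narrows what 'git fetch' will update:
git config remote.origin.fetch '+refs/heads/master:refs/remotes/origin/master'
git config --get remote.origin.fetch
```

(Which upstream a plain 'git pull' or 'git rebase' then uses for a given branch is separate, per-branch configuration.)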
It's possible that Git has general patterns in this sort of magic, the way it has general patterns at its technical level. If it does, I have not yet understood enough of the magic to have noticed the general patterns. My personal suspicion is that general patterns do not necessarily exist at this layer, because the commands and operations I think of as part of this layer are actually things that have accreted into Git over time and were written by different people.
(At one point Git had a split between 'porcelain' and 'plumbing', where porcelain was the convenient user interface and was at least partially developed by different people than the core 'plumbing'. And bits of porcelain were developed by different people who had their own mental models for how their particular operation should work, git rebase's lack of an option for the branch name of the result being an example.)
In a way my understanding of Git's internals has probably held me back with Git in general, because it's helped to encourage me to have a lackadaisical attitude about learning Git in general. The result is that I make little surgical strikes on manpages and problems, and once I feel I've solved them well enough I go away again. In this I've been mirroring one of the two ways that I approach new programming languages. I've likely reached the point in Git where I should switch over to thoroughly slogging through some parts of it; one weakness that's become obvious in writing this entry is basically everything to do with remote repositories.
Coming to a better understanding of what git rebase does
Although I've used it reasonably regularly,
git rebase has so far
been a little bit magical to me, as you may be able to tell from
my extensive explanation to myself of using it to rebase changes
on top of an upstream rebase. In my grand
tradition, I'm going to write down what I hope is a better understanding
of what it does and how its arguments interact with that.
The simple version of what git rebase does is that it takes a series of commits, replays them on top of some new commit, and then gives the resulting top commit a name so that you can use it. When you use the three argument form with --onto, you are fully specifying all of these.
Take this command:
git rebase --onto muennich/master old-muennich master
Here --onto names the new commit everything will be put onto (usually it's a branch, as it is here), the series of commits that will be replayed is old-muennich..master, and the new name is also master. You don't get a choice about the new name; git rebase always makes your new rebase into your branch, discarding the old value of the branch.
(As far as I can tell there's no technical reason why git rebase couldn't let you specify the branch name of the result; it's just not in the conceptual model the authors have of how it should work. If you need this, you need to manually create a new branch beforehand.)
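To make this concrete, here's a sketch with a disposable repository (all of the names and commits are invented for the example):

```shell
set -e
git init -qb master demo && cd demo
git config user.email you@example.com && git config user.name you
echo A >A && git add A && git commit -qm A
echo B >B && git add B && git commit -qm B
git branch old-up                      # bookmark: the old upstream tip
echo C >C && git add C && git commit -qm C   # our local commit on master
# Simulate the upstream rebasing: B is recreated as B' directly on A.
git checkout -q -b new-up master~2
echo B2 >B && git add B && git commit -qm "B'"
# Replay only old-up..master (just C) on top of new-up; the result is
# unconditionally made the new master, whether we like it or not.
git rebase -q --onto new-up old-up master
git log --format=%s    # C, B', A
```

Afterward master is A, B', C'; the old A, B, C chain is only reachable through the reflog (or a bookmark branch you made beforehand).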
The minimal version has no arguments at all, just a plain 'git rebase'.
This only works on branches with an upstream. It replays your commits from the current branch on top of the current (ie new) upstream, and it determines the range of commits to rebase roughly by finding the closest common ancestor of your commits and the upstream:
A -> B -> C -> D               [origin/master]
      \-> local-1 -> local-2   [master]
In this bad plain text diagram, the upstream added C and D while you have local-1 and local-2. The common point is B, and so B..master describes the commits that will be put on top of origin/master, and your master branch will be switched to them (well, the new version of them).
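This can be reproduced with throwaway repositories (the names are invented; the commits mirror the diagram above):

```shell
set -e
git init -qb master origin-repo && cd origin-repo
git config user.email you@example.com && git config user.name you
echo A >A && git add A && git commit -qm A
echo B >B && git add B && git commit -qm B
cd .. && git clone -q origin-repo work 2>/dev/null && cd work
git config user.email you@example.com && git config user.name you
echo l1 >l1 && git add l1 && git commit -qm local-1
echo l2 >l2 && git add l2 && git commit -qm local-2
# The upstream moves ahead with C and D.
cd ../origin-repo
echo C >C && git add C && git commit -qm C
echo D >D && git add D && git commit -qm D
cd ../work && git fetch -q
# master has an upstream (origin/master), so a bare 'git rebase' works;
# it finds the common point B itself and replays B..master.
git rebase -q
git log --format=%s    # local-2, local-1, D, C, B, A
```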
A rebase is conceptually a push to cherry-pick's pull. In cherry picking, you start on the new clean branch and pull in changes from elsewhere. In rebasing, you start on your 'dirty' local branch and push its changes on top of some other (clean) branch. You then keep the name of your local branch but not its old origin point.
If you use the one or two argument form of git rebase, you're explicitly telling rebase what to consider the 'upstream', both for determining the common ancestor commit and for what to put your changes on top of. If I'm understanding this correctly, the following commands are both equivalent to a plain 'git rebase' on your master branch:

git rebase origin/master
git rebase origin/master master
Based on the diagrams in the git-rebase manpage, it looks like the one and two argument forms are most useful for cases where you have multiple local branches and want to shuffle around the relationship between them.
In general the git-rebase manpage has helpful examples combined with extensive ASCII diagrams. If I periodically read it carefully whenever I'm confused, it will probably all sink in eventually.
(Of course, the git manual page that I actually should read carefully several times until it all sinks in and sticks is the one on specifying revisions and ranges for Git. I sort of know what a number of the different forms mean, but in practice it's one part folklore to one part actual knowledge.)
How I rebased changes on top of other rebased changes in Git
A while ago I wrote an entry on some git repository changes that I didn't know how to do well. One of them was rebasing my own changes on top of a repository that itself had been rebased; in the comments, Aristotle Pagaltzis confirmed that his Stackoverflow answer about this was exactly what I wanted. Since I've now actually gone through this process for the first time, I want to write down the details for myself, with commentary to explain how and why everything works. Much of this commentary will seem obvious to people who use Git a lot, but it reflects some concerns and confusions that I had at the time.
First, the repositories involved. rc is the master upstream repository for Byron Rakitzis's Unix reimplementation of Tom Duff's rc shell. It is not rebased; infrequent changes flow forward as normal for a public Git repo. What I'm going to call muennich-rc is Bert Münnich's collection of interesting modifications on top of rc; it is periodically rebased, either in response to changes in rc or just as Bert Münnich does development on it. Finally I have my own repository with my own local changes on top of muennich-rc. When muennich-rc rebases, I want to rebase my own changes on top of that rebase.
I start in my own repository, before fetching anything from upstream:
git branch old-me
This creates a branch that captures the initial state of my tree. It's not used in the rebasing process; instead it's a safety measure so that I can reset back to it if necessary without having to consult something like the git reflog. Because I've run 'git branch' without an additional argument, old-me is equivalent to master until I do something to change master.
git branch old-muennich muennich/master

Because old-me and old-muennich have been created as plain ordinary git branches, not upstream tracking branches, their position won't change regardless of fetching and other changes during the rebase. I'm really using them as bookmarks for specific commits instead of actual branches that I will add commits on top of.
(I'm sure this is second nature to experienced Git people, but when I made old-muennich I had to pause and convince myself that the commit it referred to wasn't going to change later, the way that master changes when you do a 'git pull'. Yes, I know, 'git pull' does more than 'git fetch' does and the difference is important here.)
git fetch muennich

This pulls in the upstream changes from muennich-rc, updating what muennich/master refers to to be the current top commit of muennich-rc. It's now possible to do things like 'git diff old-muennich muennich/master' to see any differences between the old muennich-rc and the newly updated version.
(Because I did a git fetch instead of a git pull or anything else, only muennich/master changed. In particular, master has not changed and is still the same as old-me.)
git rebase --onto muennich/master old-muennich master

This does all the work (well, I had to resolve and merge some conflicts). What it means is 'take all of the commits that go from old-muennich to master and rebase them on top of muennich/master; afterward, set the end result to be master'.
(If I omitted the old-muennich argument, I would be trying to rebase both my local changes and the old upstream changes from muennich-rc on top of the current muennich-rc. Depending on the exact changes involved in muennich-rc's rebasing, this could have various conflicts and bad effects (for instance, reintroducing changes that Bert Münnich had decided to discard). There is a common ancestor in the master rc repository, but there could be a lot of changes between there and here.)
The local changes that I added to the old version of muennich-rc are exactly the commits from old-muennich to master (ie, they're what would be shown by 'git log old-muennich..master', per the git-rebase manpage), so I'm putting my local commits on top of muennich/master. Since the current muennich/master is the top of the just-fetched new version of muennich-rc, I'm putting my local commits on top of the latest upstream rebase. This is exactly what I want to do; I'm rebasing my commits on top of an upstream rebase.
After the dust has settled, I can get rid of the two branches I was using as bookmarks:

git branch -D old-me
git branch -D old-muennich
I have to use -D because as far as git is concerned these branches both have unmerged changes. They're unmerged because these branches have both been orphaned by the combination of the muennich-rc rebase and my rebase.
Because I don't care (much) about the old version of my changes that are on top of the old version of muennich-rc, doing a rebase instead of a cherry-pick is the correct option. Following my realization on cherry-picking versus rebasing, there are related scenarios where I might want to cherry-pick instead, for example if I wasn't certain that I liked some of the changes in the rebased muennich-rc and I might want to fall back to the old version. Of course in this situation I could get the same effect by keeping the two branches after the rebase instead of deleting them.
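To write the whole procedure down in one place, here's a runnable sketch with throwaway repositories standing in for muennich-rc and my clone (an --amend stands in for a real upstream rebase; all names are invented):

```shell
set -e
# Stand-in for muennich-rc: an upstream that rebases.
git init -qb master muennich && cd muennich
git config user.email u@example.com && git config user.name u
echo base >rc && git add rc && git commit -qm rc-base
echo feat >feat && git add feat && git commit -qm feature
cd .. && git clone -q -o muennich muennich mine && cd mine
git config user.email me@example.com && git config user.name me
echo mine >mine && git add mine && git commit -qm my-change
# The upstream "rebases": its tip commit is rewritten in place.
cd ../muennich && git commit -q --amend -m feature-v2 && cd ../mine
# The recipe from this entry:
git branch old-me                         # safety bookmark
git branch old-muennich muennich/master   # bookmark the old upstream tip
git fetch -q muennich
git rebase -q --onto muennich/master old-muennich master
git log --format=%s    # my-change, feature-v2, rc-base
git branch -D old-me old-muennich
```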
My theory on why Go's gofmt has wound up being accepted
In Three Months of Go (from a Haskeller's perspective) (via), Michael Walker makes the following observation in passing:
I do find it a little strange that gofmt has been completely accepted, whereas Python’s significant whitespace (which is there for exactly the same reason: enforcing readable code) has been much more contentious across the programming community.
As it happens, I have a theory about this: I think it's important that gofmt only has social force. By this I mean that you can write Go code in whatever style and indentation you want, and the Go compiler will accept it (in some styles you'll have to use more semicolons than in others). This is not the case in Python, where the language itself flatly insists that you use whitespace in roughly the correct way. In Go, the only thing 'forcing' you to put your code through gofmt is the social expectations of the Go community.
This is a powerful force (especially when people learning Go also
learn 'run your code through
gofmt'), but it is a soft force as
compared to the hard force of Python's language specification, and
so I think people are more accepting of it. Many of the grumpy
reactions to Python's indentation rules seem to be not because the
formatting it imposes is bad but because people reflexively object
to being forced to do it.
(This also means that Go looks more conventional as a programming language; it has explicit block delimiters, for example. I think that people often react to languages that look weird and unconventional.)
There is an important practical side effect of this that is worth noting, which is that your pre-gofmt code can be completely sloppy.
You can just slap some code into the file with terrible indentation
or no indentation at all, and
gofmt will fix it all up for you.
This is not the case in Python; because whitespace is part of the
grammar, your Python code must have good indentation from the
start and cannot be fixed later. This
makes it easier to write Go code (and to write it in a wide variety
of editors that don't necessarily have smart indentation support
and so on).
The combination of these two gives the soft force of gofmt a great deal of power over the long term. It's quite convenient to be able to scribble sloppily formatted code down and then have gofmt make it all nice for you, but if you do this you must go along with gofmt's style choices even if you disagree with some of them. You can hold out and stick to your own style, but you're doing things the hard way as well as the socially disapproved way, and in my personal experience sooner or later it's not worth fighting Go's city hall any more. The lazy way wins out and gofmt notches up another quiet victory.
(It probably also matters that a number of editors have convenient
gofmt integration. I wouldn't use it as a fixup tool as much as
I do if I had to drop the file from my editor, run
gofmt by hand,
and then reload the now-changed file. And if it was less of a fixup
tool, there would be less soft pressure of 'this is just the easiest
way to fix up code formatting so it looks nice'; I'd be more likely
to format my Go code 'correctly' in my editor to start with.)
Why exposing only blocking APIs is ultimately a bad idea
I recently read Marek's Socket API thoughts, which mulls over a number of issues and ends with the remark:
But nonetheless, I very much like the idea of only blocking API's being exposed to the user.
This is definitely an attractive idea. All of the various attempts at select() style APIs have generally not gone well, high level callbacks give you 'callback hell', and it would be conceptually nice to combine cheap concurrency with purely blocking APIs to have our cake and eat it too. It's no wonder this idea comes up repeatedly, and I feel the tug of it myself.
Unfortunately, I've wound up feeling that it's fundamentally a mistake. While superficially attractive, attempting to do this in the real world is going to wind up with an increasingly ugly mess in practice. For the moment let's set aside the issue that cheap concurrency is fundamentally an illusion, and assume that we can make the illusion work well enough here. This still leaves us with a fundamental problem: sooner or later the result of one IO will make you want to stop doing another waiting IO. Or more generally, sooner or later you'll want to stop doing some bit of blocking IO as the result of other events and processing inside your program.
When all IO is blocking, separate IO must be handled by separate threads and thus you need to support external (cross-thread) cancellation of in-flight blocked IO out from underneath a thread. The moment you have this sort of unsynchronized and forced cross-thread interaction, you have a whole collection of thorny concurrency issues that we have historically not been very good at dealing with. It's basically guaranteed that people will write IO handling code with subtle race conditions and unhandled (or mishandled) error conditions, because (as usual) they didn't realize that something was possible or that their code could be trying to do thing X right as thing Y was happening.
(I'm sure that there are API design mistakes that can and will be made here, too, just as there have been a series of API design mistakes with select() and its successors. Even these APIs are hard to get completely right in the face of concurrency issues.)
There is no fix for this that I can see for purely blocking APIs. Either you allow external cancellation of blocked IO, which creates the cross-thread problems, or you disallow it and significantly limit your IO model, creating real complications as well as limiting what kind of systems your APIs can support.
PS: I think it's possible to sort of square the circle here, but the solution must be deeply embedded into the language and its runtime. The basic idea is to create a CSP-like environment where waiting for IO to complete is a channel receive or send operation, and may be mixed with other channel operations in a select. Once you have this, you have a relatively clean way to cancel a blocked IO; the thread performing the IO simply uses a multi-select, where one channel is the IO operation and another is the 'abort the operation' channel. This doesn't guarantee that everyone will get it right, but it does at least reduce your problem down to the existing problem of properly handling channel operation ordering and so on. But this is not really an 'only blocking API' as we normally think of it and, as mentioned, it requires very deep support in the language and runtime (since under the hood this has to actually be asynchronous IO and possibly involve multiple threads).

This is also going to sometimes be somewhat of a lie, because on many systems there is a certain amount of IO that is genuinely synchronous and can't be interrupted at all, despite you putting it in a multi-channel select statement. Many Unixes don't really support asynchronous reads and writes from files on disk, for example.
Waiting for a specific wall-clock time in Unix
At least on Unix systems, time is a subtle but big pain for
programmers. The problem is that because the clock can jump forward,
stand still (during leap seconds), or even go backwards, your
expectations about what subtracting and adding times does can wind
up being wrong under uncommon or rare circumstances. For instance,
you can write code that assumes that the difference between a time
in the past and
now() can be at most zero. This assumption recently
led to a Cloudflare DNS outage during a leap second, as covered
in Cloudflare's great writeup of this incident.
The solution to this is a new sort of time. Instead of being based on wall-clock time, it is monotonic; it always ticks forward and ticks at a constant rate. Changes in wall-clock time don't affect the monotonic clock, whether those are leap seconds, large scale corrections to the clock, or simply your NTP daemon running the clock a little bit slow or fast in order to get it to the right time. Monotonic clocks are increasingly supported by Unix systems and more and more programming environments are either supporting them explicitly or quietly supporting them behind the scenes. All of this is good and fine and all that, and it's generally just what you want.
I have an unusual case, though, where I'd actually like the reverse functionality. I have a utility that wants to wait until a specific wall-clock time. If the system's wall-clock time is adjusted, I'd like my waiting to immediately be updated to reflect that and my program woken up if appropriate. Until I started writing this entry, I was going to say that this is impossible, but now I believe that it's possible in POSIX. Well, in theory it's possible in POSIX; in practice it's not portable to at least one major Unix OS, because FreeBSD doesn't currently support the necessary features.
On a system that supports this POSIX feature, you have two options: just sleeping, or using timers. Sleeping is easier; you use clock_nanosleep with the CLOCK_REALTIME clock and the TIMER_ABSTIME flag. The POSIX standard (and Linux) specify that if the wall-clock time is changed, you still get woken up when appropriate. With timers, you use a similar but more intricate process. You create a CLOCK_REALTIME timer with timer_create and then use timer_settime to set a TIMER_ABSTIME wait time. When the timer expires, you get signalled in whatever way you asked for.
In practice, though, this doesn't help me. Not only is this clearly
not supported on every Unix, but as far as I can see Go doesn't
expose any API for
clock_nanosleep or equivalent functionality.
This isn't terribly surprising, since sleeping in Go is already
deeply intertwined with its multi-threaded runtime. Right now my
program just approximates what I want by waking up periodically in
order to check the clock; this is probably the best I can do in
general for a portable program, even outside of Go.
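As a sketch of that periodic-check approach (in shell rather than my actual program; wait_until and the ten-second recheck interval are my own inventions here):

```shell
# Sleep in short chunks, rechecking the wall clock each time; a large
# clock adjustment is then noticed within one interval instead of never.
wait_until() {
    target=$1    # target time, as seconds since the epoch
    while :; do
        now=$(date +%s)
        [ "$now" -ge "$target" ] && return 0
        left=$(( target - now ))
        [ "$left" -gt 10 ] && left=10    # recheck at least every 10 seconds
        sleep "$left"
    done
}
wait_until "$(date +%s)" && echo reached
```

The shorter the recheck interval, the faster you notice clock changes, at the cost of more pointless wakeups; that tradeoff is unavoidable in this scheme.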
(If I was happy with a non-portable program that only worked on Linux, probably the easiest path would be to use Python with the ctypes module to directly call clock_nanosleep with appropriate arguments. I'm picking Python here because I expect it's the easiest language for easy and reasonably general time parsing code. Anyways, I already know Python and I've never used the ctypes module, so it'd be fun.)
Sidebar: The torture case here is DST transitions
I started out thinking that DST transitions would be a real problem, since either an hour disappears or happens twice. For example, if I say 'wait until 2:30 am' on the night of a transition into DST, I probably want my code to wake up again when the wall-clock time ticks from 2 am straight to 3 am. Similarly, on a transition out of DST, should I say 'wake up at 2:10 am', I probably don't want my code waking up at the second 1:10 am.
However, the kernel actually deals in UTC time, not local time. In practice all of the complexity is in the translation from your (local time) time string into UTC time, and in theory a fully timezone and DST aware library could get this (mostly) right. For '2:30 am during the transition into DST', it would probably return an error (since that time doesn't actually exist), and for '2:10 am during the transition out of DST' it should return a UTC time that is an hour later than you'd innocently expect.
(This does suggest that parsing such times is sort of current-time dependent. Since there are two '1:30 am' times on the transition out of DST, which one you want depends in part on what time it is now. If the transition hasn't happened yet, you probably want the first one; if the transition has happened but it's not yet the new 1:30 am yet, you probably want the second.)
Does CR LF as a line ending cause extra problems with buffers?
When you said “state machine” in the context of network protocols, I thought you were going to talk about buffers. That’s an even more painful consequence than just the complexity of scanning for a sequence. [...]
My first reaction was that I didn't think a multi-byte line ending sequence causes extra problems, because dealing with line oriented input through buffering already gives you enough of them. Any time you read input in buffers but want to produce output in lines, you need to deal with the problem that a line may not end in the current buffer. This is especially common if you're reading through input in fixed-size chunks; you would have to be very lucky to always have a line end right at the end of every 4k block (or 16k block or whatever). Sooner or later a block boundary will happen in the middle and there you are. So you have to be prepared to glue lines together across buffers no matter what.
This is too simple a view, though, once you (ie, I) think about it more. When your line ending is a single byte, you have an unambiguous situation within a single buffer; either the line definitely ends in the buffer or it doesn't. Your check for the line ending is 'find occurrence of byte <X>' and once this fails you'll never have to re-check the current buffer's contents. This is not true with a multi-byte line ending, because the line ending CR LF sequence may be split over a buffer boundary. This means that you can no longer scan each buffer independently. Either you need to scan them together so that such split CR LF sequences are fused back together, or you need to remember that the last byte in the current buffer is a CR and look for a bare LF at the start of the next buffer.
Of course, CR LF line endings aren't the only case in modern text processing where you have multi-byte sequences. A great deal of modern text is encoded in UTF-8, and many UTF-8 codepoints are multi-byte sequences; if you want to recognize such a codepoint in buffers of UTF-8 text, you have the same problem that the UTF-8 encoding may start at the end of one buffer and finish in the start of the next. It feels like there ought to be a general way of dealing with this that could then be trivially applied to the CR LF case.
(As Aristotle Pagaltzis kind of mentions later in his comment, this is going to involve storing state somewhere, either explicitly in a data structure or implicitly in the call stack of a routine that's pulling in the next buffer's worth of data.)
Things that make Go channels expensive to implement
fchan: Fast Channels in Go made the rounds (via).
I read it with some interest, because I'm always interested in
interesting high-performance concurrent things like this, but my first
reaction was that they'd started with an artificially inexpensive
scenario. That got me thinking about what features make Go channels
expensive (ie slower), regardless of the specifics of the implementation.
Generally, what makes concurrent operations intrinsically expensive is the need for cross-thread locking and coordination. So we can look for all of the places that require coordination or locks, as opposed to simple operations:
- A sender may have to suspend, as opposed to either the channel
being unbounded or sends failing immediately if the channel is full.
Suspending means enqueuing, scheduling, and coordination to make
sure that wakeups are not lost.
(The mere potential of even a single waiting sender means that receivers must be prepared to wake senders, as opposed to passively pulling messages from a queue and possibly suspending until some show up.)
- There may be multiple senders and thus multiple suspended senders.
A new receiver and the overall runtime must insure that exactly
one of them is woken to continue on (no more, no less).
- There may be multiple receivers and thus multiple suspended
receivers. A new sender must insure that exactly one of them is
woken to receive its message.
(You can cover all of this waiting sender and receiver stuff with a single lock on the channel as a whole, but now you have a single lock that everyone who touches the channel will be contending over, and it may be held over more than very short lengths of code.)
- A receiver may be waiting in a select for multiple channels to become ready. A sender that wants to wake this receiver must insure that its send is the single event that wakes the receiver; all other channel events must be locked out and prevented from happening.
- A sender may be waiting in a select for multiple channels to become ready. A receiver that wants to wake this sender to get its message must insure that its event is the one that wins the race; all other channel events must be locked out and prevented from happening.
- When a goroutine performs a select on multiple channels, it must initially lock all the channels before determining their readiness, because the language spec specifically says that the winning channel is chosen via a uniform pseudo-random selection. The Go runtime is not free to lock channels one after another until it finds the first ready one, take that, and stop there; it must lock everything before picking one ready channel.
(This channel locking must also be performed in a consistent order so that
selects in different goroutines with overlapping or identical channel lists don't deadlock against each other.)
In the fully general case you have one goroutine sleeping in a select on multiple channels and another running goroutine starting a select that could succeed immediately by waking the first goroutine. The running goroutine must take some pains to insure that the sleeping goroutine is not actually woken out from underneath it by a third goroutine. There are quite a few locks and atomic operations flying around in the process.
(Part of the result of this is that the implementation of select in the runtime is reasonably involved. The runtime's select.go is actually surprisingly readable, although it's not easy going. It's best read along with chan.go. As a bonus it contains at least one neat hack.)
PS: The implementation of
select is so complicated that recently
a long-standing and extremely intricate race deep inside the runtime
was fixed in this commit
(the commit discussion has an illuminating explanation of what the
code is doing in general). The fix was rather simple but the
journey there was clearly an epic one.
My picks for mind-blowing Git features
It started on Twitter:
@tobyhede: What git feature would you show someone who has used source control (but not git) that would blow their mind?
@thatcks: Sysadmins: git bisect. People w/ local changes: rebase. Devs: partial/selective commits & commit reordering.
Given that at different times I fall into all three of these groups, I kind of cheated in my answer. But I'll stand by it anyways, and since Twitter forces a distinct terseness on things, I'm going to expand on why these things are mind-blowing.
If you use some open source package and you can compile it, git bisect (plus some time and work) generally gives you the superpower of being able to tell the developers 'this specific change broke a thing that matters to me', instead of having to tell them just 'it broke somewhere between vN and vN+1'. Being able to be this specific to developers drastically increases the chances that your bug will actually get fixed. You don't have to know how to program to narrow down your bug report, just be able to use git bisect, compile the package, and run it to test it.
(If what broke is 'it doesn't compile any more', you can even automate this.)
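A toy demonstration of the automated version (the 'bug' marker and the test command are invented for the example; in real life your test would be 'does it compile' or 'does the thing still work'):

```shell
set -e
git init -qb master proj && cd proj
git config user.email you@example.com && git config user.name you
for i in 1 2 3 4 5; do
    echo "version $i" >prog
    if [ "$i" -ge 4 ]; then echo "bug" >>prog; fi   # breakage lands in commit 4
    git add prog && git commit -qm "commit $i"
done
# Mark the current tip bad and the very first commit good, then let
# bisect drive: exit status 0 means 'good', non-zero means 'bad'.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
git bisect run sh -c '! grep -q bug prog' | grep 'is the first bad commit'
git bisect reset
```

Bisect needs only log2(N) test runs, which is what makes this practical even across hundreds of commits.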
If you carry local modifications in your copy of an upstream project,
changes that will never be integrated and that you have no intention
of feeding upstream,
git rebase is so much your friend that I
wrote an entire entry about how and
why. In the pre-git world, at best you wound up with a messy tangle
of branches and merges that left the history of your local repository
increasingly different from the upstream one; at worst your local
changes weren't even committed to version control, just thrown on
top of the upstream as patches and changes that various tools
attempted to automatically merge into new upstream commits when you upgraded.
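The rebase workflow can be sketched with a toy pair of repositories (all the repository and file names here are invented for illustration; 'upstream' stands in for the real project you track):

```shell
set -e
# An 'upstream' project with one commit...
git init -q upstream
git -C upstream config user.email u@example.com
git -C upstream config user.name Upstream
echo v1 > upstream/file
git -C upstream add file
git -C upstream commit -qm upstream-1
# ...and your local clone carrying a committed local modification.
git clone -q upstream local
git -C local config user.email l@example.com
git -C local config user.name Local
echo "local tweak" >> local/file
git -C local commit -qam local-change
# Upstream moves on.
echo v2 > upstream/other
git -C upstream add other
git -C upstream commit -qm upstream-2
# Instead of merging, replay your local change on top of the new upstream:
git -C local fetch -q origin
git -C local rebase -q origin/HEAD
git -C local log --oneline
```

After the rebase your history is the upstream history plus your local change sitting cleanly on top, rather than an accumulating tangle of merge commits.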
If you're developing changes, well, in theory you're disciplined and you use feature branches and do one thing at a time and your diffs are always pure. In practice I think that a lot of the time this is not true, and at that point git's ability to do selective commits, reorder commits, and so on will come along and save your bacon; you can use them to sort out the mess and create a series of clean commits. In the pre-git, pre-selective-commit era things were at least a bunch more work and perhaps more messy. Certainly for casual development people probably just made big commits with random additional changes in them; I know that I certainly did (and I kept doing it even in git until recently because I didn't have the right tools to make this easy).
(Of course this wasn't necessarily important for keeping track of your local changes, because before git you probably weren't committing them in the first place.)
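The simplest form of sorting out such a mess can be sketched like this (a toy example with invented file names; here the two unrelated changes happen to be in separate files, so plain `git add` per file is enough, while `git add -p` does the same thing hunk by hunk within a file and `git rebase -i` lets you reorder the resulting commits):

```shell
set -e
# A working tree with two unrelated changes mixed together.
git init -q split-demo
git -C split-demo config user.email you@example.com
git -C split-demo config user.name You
echo "new feature" > split-demo/feature.c
echo "bug fix"     > split-demo/bugfix.c
# Stage and commit each change separately instead of one big blob commit.
git -C split-demo add feature.c
git -C split-demo commit -qm "add feature"
git -C split-demo add bugfix.c
git -C split-demo commit -qm "fix bug"
git -C split-demo log --oneline
```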
PS: There is one git feature that blows my mind on a technical level because it is just so neat and so clever. But that's going to be another entry, and also it's technically not an official git feature.
(My line between 'official git feature' and 'neat addon hack' is whether the hack in question ships with git releases as an official command.)
One downside of a queued IO model is memory consumption for idle connections
One of the common models for handling asynchronous IO is what I'll call the queued IO model, where you put all of your IO operations in some sort of a queue and then as things become ready, the OS completes various ones and hands them back to you. Sometimes this queue is explicitly exposed and sometimes, as in Go, the queue is implicit in a collection of threads all doing (what they see as) blocking IO operations. The queued IO model is generally simple and attractive, either in threaded form (in Go) or in explicit form where you pass operations that you'd like to do to the OS and it notifies you when various ones finish.
Recently I wound up reading Evan Klitzke's Goroutines, Nonblocking
I/O, And Memory Usage, which
pointed out a drawback to this model that hadn't been obvious to
me before. That drawback is memory usage for pending operations,
especially reads, in a situation where you have a significant number
of idle connections. Suppose that you have 1,000 connections where
you're waiting for the client to send you something. In a queued
IO model the normal way to operate is to queue 1,000 read operations,
and each of these queued read operations must come with an allocated
buffer for the operating system to write the read data into. If
only (say) 5% of those connections are active at any one time, you
have quite a lot of memory tied up in buffers that are just sitting
around inactive. In a
select() style model that exposes readiness
before you perform the IO, you can only allocate buffers when you're
actually about to read data.
Writes often pre-compute and pre-allocate the data to be written, in which case this isn't much of an issue for them; the buffer for the data to be written has to be allocated beforehand either way. But in situations where the data to be written could be generated lazily on the fly, the queued IO model can once again force extra memory allocations where you have to allocate and fill buffers for everything, not just the connections that are ready to have more data pushed to them.
All of this may be obvious to people already, but it was surprising to me so I feel like writing it down, especially how it extends from Go style 'blocking IO with threads' to the general model of queuing up asynchronous IO operations for the kernel to complete for you as it can.
(Of course there are reasons to want a
select() like interface
beyond this issue, such as the cancellation problem.)