The downsides of processing files using too large a buffer size
I was recently reading this entry by Ben Boyter, part of which is his discussion of some attempts to optimize file IO. In these attempts he varied his buffer sizes, from reading the entire file more or less at once to reading in smaller buffers. As I thought about this, I had a belated realization about file buffer sizes when you're processing the result as a stream.
My normal inclination when picking read buffer sizes for file IO is to pick a large number. Classically this has two good effects; it reduces system call overhead (because you make fewer of them) and it gets the operating system to do IO to the underlying disks in larger chunks, which is often better. However, there is a hidden drawback to large buffer sizes, namely that reading data into a buffer is a synchronous action as far as your program is concerned; under normal circumstances, the operating system can't give your program back control until it's put the very last byte of the buffer into place. If you ask for 16 Kb, your program can start work once byte 16,384 has shown up; if you ask for 1 Mbyte, you get to wait until byte 1,048,576 has shown up, which is generally going to take longer. The more you try to read at once, the longer you're going to stall.
On the surface this looks like it reduces the time to process the start of the file but not necessarily the time to process the end (because to get to the end of a 1 Mbyte file, you still need to wait for byte 1,048,576 to show up). However, reading data is not necessarily a synchronous action all the way to the disk. If you're reading data sequentially, all OSes are going to start doing readahead. This readahead means that you're effectively doing asynchronous disk reads that at least partially overlap with your program's work; while your program is processing its current buffer of data, the OS is issuing readaheads and may be able to satisfy your program's next read by just copying things around in RAM, instead of waiting for the disk.
If you attempt to read the entire file before processing any of it, you don't get any of these benefits. If you read in quite large buffers, you probably only get moderate benefits; you're still waiting for relatively large read operations to finish before you can start processing data, and the OS may not be willing to do enough readahead to cover the next full buffer. For good results, you don't want your buffer sizes to be too large, although I don't know what a good size is these days.
(Because I like ZFS and the normal ZFS block size for many files is 128 Kb, I think that 128 Kb is a good starting point for a read buffer size. If you strongly care about this, you may want to benchmark on your specific environment, because it's going to depend on how much readahead your OS is willing to do for you.)
PS: This also depends on your processing of the file not taking too long. If you can only process the file at a rate far lower than the speed of your IO, IO time has a relatively low impact on things and so it may not matter much how you read the file.
(In retrospect this feels like a reasonably obvious thing, but it didn't occur to me until now and as mentioned I've tended to reflexively do read IO in quite large buffer sizes. I'm probably going to be changing that in the future, at least for programs that process what they read.)
What I use Github for and how I feel about it
In light of recent events (or at least rumours) of Microsoft buying Github, I've been driven to think about my view of Github and how I'd feel about it changing or disappearing. This really comes in two sides, those of someone who has repos on Github and those of someone who uses other people's repos on Github, and today I feel like writing about the first (because it's simpler).
Some people probably use Github as the center of their own work, and to a certain extent Github tries hard to make that inevitable if you have repositories that are popular enough to attract activity from other people (because they'll interact with your Github presence unless you work hard to prevent that). In my case, I don't have things set up that way, at least theoretically. Github doesn't host the master copy of any of my repositories; instead I maintain the master copies on my own machines and treat the Github version as a convenient publicly visible version (one that presents, more or less, what I want people to be using). If Github disappeared tomorrow, I could move the public version to another place (such as Gitlab or Bitbucket), or perhaps finally get around to setting up my own Git publishing system.
Well, except for the bit where most or all of my public projects currently list their Github URL in the README and so on, and I have places (such as my Firefox addon's page) that explicitly list the Github URLs. All of those would have to be updated, which starts to point out the problem; those updates would have to propagate through to any users of my software somehow. The reality is that I've been sort of lazy in my README links and so on; they tend to point only to Github, not to Github plus anywhere else. What they should really do is point to Github plus some page that I run (and perhaps additional public Git repo URLs if I establish them on Gitlab or wherever).
There are some additional things that I'd lose, too. To start with, any issues that people have filed and pull requests that people have made (although I think that one can get copies of those, and perhaps I should). I'd also lose knowledge of people's forks of my Github repos and the ability to look at any changes that they may have made to them, changes that either show me popular modifications or things that perhaps I should adopt.
All of these things make Github sticky in a soft way. It's not that you can't extract yourself from Github or maintain a presence apart from it; it's that Github has made its embrace inviting and easy to take advantage of. It's very easy to slide into Github tacitly being your open source presence, where people go to find you and your stuff. If I wanted to change this (which I currently don't), I'm honestly not sure how I'd make it clear on my Github presence that people should now look elsewhere.
I don't regret having drifted into using Github this way, because to be honest I probably wouldn't have public repositories and a central point for them without Github or some equivalent. At the same time I'm aware that I drift into bad habits because they're easy and it's possible that Github is one such bad habit. Am I going to go to the effort of changing this? Certainly not right away (especially to my own infrastructure). Probably, like many people who have their code on Github, I'm going to wait and see and above all hope that I don't actually have to do anything.
(I'm also not convinced that there is any truly safe option for having other people host the public side of my repositories. Sourceforge serves as a cautionary example of what can happen to such places, and it's not like Gitlab, Bitbucket, and so on are obviously safer than Github is or was; they're just not (currently) owned by Microsoft. The money to pay for all of the web servers and disk space and so on has to come from somewhere, and I'm probably going to be a freeloader on any of them.)
Some notes on Go's runtime.KeepAlive() function
I was recently reading go101's "Type-Unsafe Pointers" (via) and ran across a usage of an interesting new runtime package function, runtime.KeepAlive(). I was initially puzzled by how it was used, and then me being me I had to poke into how it worked. What runtime.KeepAlive() does is that it keeps a variable 'alive', which means that it (and what it refers to) will not be garbage collected and any finalizers it has won't be run.

The documentation has an example of its use. My initial confusion was why the use of runtime.KeepAlive() was so late in the code; I had sort of expected it to be used early, the way finalizers are set up early, but then I realized what it is really doing. In short, runtime.KeepAlive() is using the variable. A variable is obviously alive right up to the end of its last use, so if you use a variable late, Go must keep it alive all the way there.
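The pattern can be sketched like this; the resource type and the use function are hypothetical stand-ins for this example, not anything from the runtime documentation:

```go
package main

import (
	"fmt"
	"runtime"
)

// resource is a hypothetical wrapper whose finalizer releases something
// external (a file descriptor, C memory, and so on).
type resource struct{ fd int }

func use(fd int) { fmt.Println("using fd", fd) }

func main() {
	r := &resource{fd: 3}
	runtime.SetFinalizer(r, func(r *resource) { fmt.Println("finalizing", r.fd) })

	fd := r.fd
	// From here on nothing refers to r itself, only to the copied-out fd.
	// Without the KeepAlive below, the garbage collector would be free to
	// run r's finalizer even while use(fd) is still working.
	use(fd)

	// This counts as a use of r, so r is guaranteed to stay alive (and
	// its finalizer un-run) until this point; that's why it goes *late*.
	runtime.KeepAlive(r)
}
```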
At one level, there's nothing magical about runtime.KeepAlive(); any use of the variable would do to keep it alive. At another level, there is an important bit of magic about runtime.KeepAlive(), which is that Go guarantees that this use of your variable will not be cleverly optimized away just because the compiler can see that nothing actually really depends on your 'use'. There are various other ways of using a variable, but even reasonably clever ones are vulnerable to compiler optimization, and aggressively clever ones have the downside that they may accidentally defeat Go's reasonably clever escape analysis, forcing what would otherwise be a local stack variable to be allocated on the heap instead.
The other special magic trick in runtime.KeepAlive() is in how it's implemented, which is that it doesn't actually do anything. In particular, it doesn't make a function call. Instead, much like unsafe.Pointer, it's a compiler builtin, set up in ssa.go. When your code uses runtime.KeepAlive(), the Go compiler sets up an OpKeepAlive SSA operation, and then the rest of the compiler knows that this is a use of the variable and keeps it alive through to that point.
(Reading this ssa.go initialization function was interesting. Unsurprisingly, it turns out that there are a number of nominal package function calls that are mapped directly to instructions that will be placed inline in your code, like runtime.KeepAlive(). Some of these are platform-dependent, including a bunch of math functions.)

That runtime.KeepAlive() is special magic has one direct consequence, which is that you can't take its address. If you try, Go will report:

./tst.go:20:22: cannot take the address of runtime.KeepAlive
I don't know if Go will too-cleverly optimize away a function that only exists to call runtime.KeepAlive(), but hopefully you're never going to need to call it that way.

PS: Although it's tempting to say that one should never need runtime.KeepAlive() on a stack allocated local variable (including arguments) because the stack isn't cleaned up until the function returns, I think that this is a dangerous assumption. The compiler could be sufficiently clever to either reuse a stack slot for two different variables with non-overlapping lifetimes, or simply tell garbage collection that it's done with something (for example by overwriting the pointer to the object with nil).
Bad versions of packages in the context of minimal version selection
Recently, both Sam Boyer and Matt Farina have made the point that Go's proposed package versioning lacks an explicit way for packages to declare known version incompatibilities with other packages. Suppose that you have a package A and it uses package X, initially at v1.5.0. The package X people release v1.6.0, which is fine, and then v1.7.0, in which they introduce an API behavior change that is incompatible with how your package uses the API (Matt Farina's post has a real world example of this). By the strict rules of semantic versioning this is a no-no, but in real life it happens for all sorts of reasons. People would like the ability to have their own package say 'I'm not compatible with v1.7.0 (and later versions)', which Russ Cox's proposal doesn't provide.
The first thing to note is that in a minimal version selection environment, this incompatibility doesn't even come up as long as you're only building the package or something using the package that has no other direct or indirect dependencies on package X. If you're only using package A, package A says it wants X@v1.5.0 and that's what MVS picks. MVS will never advance to the incompatible version v1.7.0 on its own; it must be forced to do so. Even if you're also using package B and B requires X@v1.6.0, you're still okay; MVS will advance the version of X but only to v1.6.0, the new minimal version.
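The selection in this example can be sketched in a few lines of Go. mvsPick is a toy name, and it compares version strings lexically, which happens to work for the vX.Y.0 versions here but is not real semver comparison:

```go
package main

import "fmt"

// mvsPick sketches the heart of minimal version selection for one
// dependency: out of everyone's *minimum* requirements, take the
// highest, and never go beyond it.
func mvsPick(minimums []string) string {
	pick := ""
	for _, v := range minimums {
		if v > pick {
			pick = v
		}
	}
	return pick
}

func main() {
	// Package A wants X@v1.5.0 and package B wants X@v1.6.0; MVS settles
	// on v1.6.0, the smallest version satisfying both, and will never
	// wander up to an unrequested v1.7.0 on its own.
	fmt.Println(mvsPick([]string{"v1.5.0", "v1.6.0"})) // v1.6.0
}
```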
(This willingness to advance the version at all is a pragmatic tradeoff. We can't be absolutely sure that v1.6.0 is really API compatible with A's required X@v1.5.0, but requiring everyone to use exactly the same version of a package is a non-starter in practice. In order to make MVS useful at all, we have to hope that advancing the version here is safe enough (by default, and if we lack other information).)
So this problem with incompatible package versions only comes up in MVS if you also have another package B that explicitly requires X@v1.7.0. The important thing here is that this version incompatibility is not a solvable situation. We cannot build a system that works; package A doesn't work with v1.7.0 while package B only works with v1.7.0 and we need both. The only question is whether MVS or an MVS-like algorithm will actually tell us about this problem, aborting the build, or whether it will build a system that doesn't work (if we're lucky the system will fail our tests).
To me, this changes how critical the problem is to address. Failure to build a working system where it's possible would be one thing, but we don't have that; instead we merely have the question of whether you're going to get told up front that what you want isn't possible.
The corollary to this is that when package A publishes information that it's incompatible with X version v1.7.0, it's doing so almost entirely as a service for other people, not something it needs for itself. Since A's manifest only requires X@v1.5.0, MVS will generally use v1.5.0 when building A alone (let's assume that none of A's other dependencies also use X and will someday advance to requiring X@v1.7.0). It's only when A gets bundled together with B that problems happen, and so this is mostly when A's information about version incompatibility is useful. Should this information be published in a machine readable form? Well, I think it would be nice, but it depends on what else we have to give up for it.
(The developers of A may want to leave themselves a note about the situation in their version manifest, of course, just so that no developer accidentally tries advancing X's version and then gets surprised by the results.)
PS: There is also an argument that such incompatible version blocks should only be advisory warnings or the like. As the person building the overall system, you may actually know that the end result will work anyway; perhaps you've taken steps to compensate for the API incompatibility in your own code. Since the failure is an overall system failure, package A can't necessarily be absolutely sure about things.
(Things might be easier to implement as advisory warnings. One approach would be to generate the MVS versions as usual, then check to see if anyone declared an incompatibility with the concrete versions chosen. Resolving the situation, if it's even possible, would be up to you.)
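That advisory approach might look something like the following sketch, where all of the types and names are hypothetical simplifications of what a real tool would use:

```go
package main

import "fmt"

// checkIncompat runs after MVS has produced concrete versions: given the
// chosen version for each module and the versions various packages have
// declared themselves incompatible with, it reports any collisions as
// advisory warnings rather than hard failures.
func checkIncompat(chosen map[string]string, declared map[string][]string) []string {
	var warnings []string
	for mod, bad := range declared {
		for _, v := range bad {
			if chosen[mod] == v {
				warnings = append(warnings,
					fmt.Sprintf("chosen %s@%s is declared incompatible", mod, v))
			}
		}
	}
	return warnings
}

func main() {
	chosen := map[string]string{"X": "v1.7.0"}           // what MVS picked
	declared := map[string][]string{"X": {"v1.7.0"}}     // package A's declaration
	fmt.Println(checkIncompat(chosen, declared))
}
```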
'Minimal version selection' accepts that semantic versioning is fallible
Go has been quietly wrestling with package versioning for a long time. Recently, Russ Cox brought forward a proposal for package versioning; one of the novel things about it is what he calls 'minimal version selection', which I believe has been somewhat controversial.
In package management and versioning, the problem of version selection is the problem of what version of package dependencies you'll use. If your package depends on another package A, and you say your minimum version of A is 1.1.0, and package A is available in 1.0.0, 1.1.0, 1.1.5, 1.2.0, and 2.0.0, version selection is picking one of those versions. Most package systems will pick the highest version available within some set of semantic versioning constraints; generally this means either 1.1.5 or 1.2.0 (but not 2.0.0, because the major version change is assumed to mean API incompatibilities exist). In MVS, you short-circuit all of this by picking the minimum version allowed; here, you would pick 1.1.0.
People have had various reactions to MVS, but as a grumpy sysadmin my reaction is positive, for a simple reason. As I see it, MVS is a tacit acceptance that semantic versioning is not perfect and fails often enough that we can't blindly rely on it. Why do I say this? Well, that's straightforward. The original version number (our minimum requirement) is the best information we have about what version the package will definitely work with. Any scheme that advances the version number is relying on that new version to be sufficiently compatible with the original version that it can be substituted for it; in other words, it's counting on people to have completely reliably followed semantic versioning.
The reality of life is that this doesn't happen all of the time. Sometimes mistakes are made; sometimes people have a different understanding of what semantic versioning means because semantic versioning is ultimately a social thing, not a technical one. In an environment where semver is not infallible (ie, in the real world), MVS is our best option to reliably select package versions with the highest likelihood of working.
(Some package management systems arrange to also record one or more 'known to work' package version sets. I happen to think that MVS is more straightforward than such two-sided schemes for various reasons, including practical experience with some Rust stuff.)
I understand that MVS is not very aesthetic. People really want semver to work and to be able to transparently take advantage of it working (and I agree that it would be great if it did work). But as a grumpy sysadmin, I have seen a non-zero amount of semver not working in these situations, and I would rather have things that I can build reliably even if they are not using all of the latest sexy bits.
Sorting out some of my current views on operator overloading in general
Operator overloading is a somewhat controversial topic in programming
language design and programming language comparisons. To somewhat
stereotype both sides, one side thinks that it's too often abused to
create sharp-edged surprises where familiar operators do completely
surprising things (such as
<< in C++ IO). The other side thinks that
it's a tool that can be used to create powerful advantages when done
well, and that its potential abuses shouldn't cause us to throw it out
In general, I think that operator overloading can be used for at least three things:
- implementing the familiar arithmetic operations on additional types of numbers or very strongly number-like things, where the new implementations respect the traditional arithmetic properties of the operators; for example, complex numbers.
- implementing these operations on things which already use these operators in their written notation, even if how the operators are used doesn't (fully) preserve their usual principles. Matrix multiplication is not commutative, for example, but I don't think many people would argue against using * for it in a programming language.
- using these operators simply for convenient, compact notation in ways that have nothing to do with arithmetic, mathematical notation, or their customary uses in written form for the type of thing you're dealing with.
I don't think anyone disagrees with the use of operator overloading for the first case. I suspect that there is some but not much disagreement over the second case. It's the third case that I think people are likely to be strongly divided over, because it's by far the most confusing one. As an outside reader of the code, even once you know the types of objects involved, you don't know anything about what's actually happening; you have to read the definition of what that type does with that operator. This is the 'say what?' experience of << in C++ IO and % with Python strings.
Languages are partly a cultural thing, not purely a technical one, and operator overloading (in its various sorts) can be a better or a worse fit for different languages. Operator overloading probably would clash badly with Go's culture, for example, even if you could find a good way to add it to the language (and I'm not sure you could without transforming Go into something relatively different).
(Designing operator overloading into your language pushes its culture in one direction but doesn't necessarily dictate where you wind up in the end. And there are design decisions that you can make here that will influence the culture, for example requiring people to define all of the arithmetic operators if they define any of them.)
Since I'm a strong believer in both the pragmatic effects and aesthetic power of syntax, I believe that even operator overloading purely to create convenient notation for something can be a good use of operator overloading in the right circumstances and given the right language culture. Generally the right circumstances are going to be when the operator you're overloading has some link to what the operation is doing. I admit that I'm biased here, because I've used the third sort of operator overloading from time to time in Python and I think it made my code easier to read, at least for me (and it certainly made it more compact).
(For example, I once implemented '-' for objects that were collections of statistics, most (but not all) of them time-dependent. Subtracting one object from another gave you an object that had the delta from one to the other, which I then processed to print out the deltas.)
In thinking about this now, one thing that strikes me is that an advantage of operators over function calls is that operators tend to be written with whitespace, whereas function calls often run everything together in a hard to read blur. We know that whitespace helps readability, so if we're going to lean heavily on function calls in a language (including in the form of method calls), perhaps we should explore ways of adding whitespace to them. But I'm not sure whitespace alone is quite enough, since operators are also distinct from letters.
(I believe this is where a number of functional languages poke their heads up.)
Go and the pragmatic problems of having a Python-like with statement
In a comment on my entry on finalizers in Go, Aneurin Price asked:
So there's no deterministic way to execute some code when an object goes out of scope? Does Go at least have something like Python's "with" statement? [...]
For those who haven't seen it, the Python with statement is used like this:

    with open("output.txt", "w") as fp:
        ... do things with fp ...
    # fp is automatically closed by
    # the time we get here.

with gives you reliable and automatic cleanup of the file or whatever resource you're working with inside the with block. Your code doesn't have to know anything or do anything; all of the magic is encapsulated inside with and things that speak its protocol.
Naturally, Go has no equivalent; sure, we have defer, but it's not anywhere near the same thing. In my opinion this is the right call for Go, because of two issues you would have if you tried to have something like Python's with in Go.

The obvious issue is that you would need some sort of protocol to handle initialization and cleanup, which would be a first for Go. You need the protocol because a big point of Python's with is that it magically handles everything for you without you having to remember to write any extra code; it's part of the point that using with is easier and shorter than trying to roll your own version (which encourages people to use it). If you're willing to write extra code, Go has everything today in the form of defer.
But beyond that there is a broader philosophical issue that's exposed by Aneurin Price's first question. In a language like Go where your local data may escape into functions you call, what does it mean for something to go out of scope? One answer is that things only go out of scope when there's no remaining reference to them. Unfortunately I believe that this is more or less impossible to implement efficiently without either going to Rust's extremes of ownership tracking in the language or forcing a reference counting garbage collector (where you know immediately when something is no longer referenced). This leaves you with the finalizer problem, where you're not actually cleaning up the resource promptly.
The other answer is that 'going out of scope' simply means 'execution reaches the end of the relevant block'. As in Python, you always invoke the cleanup actions at this point regardless of whether your resource may have escaped into things you've called and thus may still be alive somewhere. This implicit, hidden cleanup is a potentially dangerous trap for your code; if you forget and pass the resource to something that retains a reference to it, you may get explosions (much) later when that now-dead resource is used. If you're in luck, this use is deterministic so you can find it in tests. If you're unlucky, this use only happens in, say, an error path.
Using defer instead of an implicit cleanup doesn't stop this problem from happening, but it makes explicit what's going on. When you write or see a defer fp.Close(), you're pointedly reminded that at the end of the function, the resource will be dead. There is no implicit magic, only explicit actions, and hopefully this creates enough warning and awareness.

Given Go's design goals, being explicit here as part of the language design makes complete sense to me. You can still get it wrong, but at least the wrongness is more visible.
(I don't think being explicit is necessarily better in general than Python's implicit magic. Go and Python are different languages with different goals; what's appropriate for one is not necessarily appropriate for the other. Python has both language features and cultural features that make with a good thing for it.)
Using Go finalizers can be a better option than not using them
Go has finalizers, which let you have some code be invoked just as an object is about to be garbage collected. However, plenty of people don't like them and the usual advice is to completely avoid them (for example). Recently, David Crawshaw wrote The Tragedy of Finalizers (via), in which he points out various drawbacks of finalizers and shows a case where relying on them causes failures. I more or less agree with all of this, but at the same time, I've used finalizers myself in a Go package for access to Solaris/Illumos kstats and I'll defend that usage.
What I use finalizers for is to avoid an invisible leak if people don't use my API correctly. In theory when you call my package you get back a magic token, which holds the only reference to some C-allocated memory. When you're done with the token, you're supposed to call a method to close it down, which will free the C-allocated memory. In practice, well, people make API usage and object lifetime mistakes. Without a finalizer, if a token went out of scope and was lost to garbage collection we'd permanently leak that C-allocated memory. As with all memory and resource leaks of this nature, this would be an especially annoying and pernicious leak because it would be completely invisible from the Go level. None of the usual Go level memory leak tools would help you at all (and I suspect that the usual C leak finding tools would have serious problems due to the presence of Go).
At one level, using a finalizer here is a pragmatic decision; it protects people using my package from certain usage errors that would cause problems that are hard to deal with. At another level, though, I can argue that using finalizers here is actually within the broad spirit of Go. As a garbage collected language, Go has essentially made a decision that explicitly managing object lifetimes is too hard, too much work, and too error-prone. It's a bit peculiar to be perfectly fine with this for memory, but not fine with this for other resources for anything other than purely pragmatic reasons.
(At the same time, those pragmatic reasons are real; as David Crawshaw explains, relying on memory garbage collection to garbage collect other resources before you run out of them is at best dangerous. Even my case is a bit dubious, since C-allocated memory doesn't apply pressure to the Go garbage collector.)
David Crawshaw followed up his article with Sharp-Edged Finalizers in Go, where he advocated using finalizers in this situation to force panics when people fail to use your APIs correctly. You can do this, but it feels somewhat un-Go-like to me. As a result I think you should only resort to this if the consequences of not using your API correctly are quite severe (for example, potential data loss because you forgot to commit a database transaction and then check for errors in it).
As a general note, I wouldn't say that my sort of use of finalizers is intended to avoid resource leaks as such. You will have a resource leak in practice from the time when you stop needing the resource (the kstat token, the open file, or what have you) until the Go garbage collection calls your finalizer (if it ever does), because the resource is still there but neither in use nor wanted. What finalizers do is make that leak be theoretically a temporary one, instead of definitely permanent. In other words, it's a recoverable leak instead of an unrecoverable one.
(This has been on my mind for a while, but David Crawshaw's articles provide a convenient prompt and I hadn't thought of using finalizers to force a hard error in this situation.)
Frequent versus infrequent developers (in languages and so on)
Yesterday I mentioned the phrase 'infrequent developer' in an aside in my entry. Today I'm writing about what I mean by that and by its opposite, the frequent developer, and why I care about this.
What I'm calling frequent developers here in the context of, say, a language (such as Go) are people who routinely work with code or programs written in that language. When you're a frequent developer, you naturally develop expertise in that language's operation and often a development environment for it, because you use it often. You know the commands, you remember their options (or at least the ones that you need), you've run into some of the somewhat obscure corners and things that can go wrong. You know your way around things. You'll naturally learn and master even relatively complex procedures.
For a frequent developer, setting up and running some special piece of software to help work on the language is both okay and perfectly sensible. It may take a bit more time to learn and operate, but you use things frequently enough that the extra overhead is only a small portion of the time you spend dealing with the language. It's worth setting up caches and CI and so on, because you'll get enough benefit out of them. You are well up the XKCD 'is it worth the time' table. Frequent developers tend to accumulate a halo of tools that make their lives easier and often improve their results; they know about the linters, the checkers, the formatters, and so on.
An infrequent developer is someone who does not fit this profile. Sure, they have some software written in Go, or Python, or using Django, or whatever, but mostly it sits there working and they don't have to think about it very often. They only modify it or rebuild it or update its dependencies or the like once in a while. Since they're only occasional users of a language environment, infrequent developers generally don't maintain expertise in the finer details of the language's operation, although they can probably remember (or look up) how to do the common things and the basics. They won't remember how to deal with the unusual cases, and in fact may never have run into them. Complex procedures will probably have to be re-learned nearly every time they're needed (or re-Googled for).
Since infrequent developers spend relatively little time dealing with the language, setting up and running additional pieces of software is a much higher overhead for them and is generally not worth it if they have a choice. They get hit on both sides compared to frequent developers; they're less familiar with the software so working on it takes longer, and they use the language much less so the same amount of absolute time spent on additional software is proportionally much higher. Infrequent developers object strongly to things like 'just run this caching proxy, it only takes a bit of time to manage'. Overheads that are small to frequent developers loom very big for infrequent ones. Infrequent developers usually do not have the halo of tools that frequent developers do, and mostly stick to the basics (and as a result they miss out on various things).
It's quite easy and natural for a language community to think first and foremost about frequent developers. Frequent developers are your most active and best users, and generally they are the ones that talk to you most, have the most to say, and are the best informed about the current state of affairs and what their options are. But at the same time, focusing on frequent developers is a limited point of view and will cause you to miss what causes pain for infrequent developers. Worse, it can cause you to design only for frequent developers.
If you're only thinking about frequent developers, it's easy to create a system that assumes that of course people will set up this or that software, or that some particular pain point doesn't really matter because everyone will have tools that cover it over, or that a complex procedure is the right answer because of the power it exposes. To pick on something other than Go, it won't matter that your language refuses to mix spaces and tabs because everyone can just run an editor plugin to fix it automatically (or to automatically indent only with spaces).
(As far as complex procedures go, well, Git is famously full of them. And I say this as someone who considers himself in the 'frequent developer' camp with git, including having tools for dealing with it.)
As I mentioned in my aside yesterday, I have wound up feeling that the perspective of these infrequent developers is often overlooked and not widely heard from. I think that this is not a great thing; to summarize why, I think there are probably more infrequent developers for any popular language than you might think.
(The perspective of infrequent developers is similar to beginners in the language, but I don't think it's quite the same and I'm not sure that being beginner friendly will make you friendly to infrequent developers too.)
A sysadmin's perspective on Go vendoring and vgo
One big thing in the Go world lately has been Russ Cox's writings on adding package versioning to the core of Go through what is currently being called Versioned Go, vgo for short. His initial plans were for vgo to completely drop Go's current vendoring feature. If you wanted to capture a local copy of your external dependencies, you would have to set up your own proxy server (per his article, vgo would come with one). According to the vgo & vendoring golang-dev thread, opinions have since changed on this and the Go team accepts that some form of vendoring will stay. My interest in vendoring is probably different from what normal Go developers care about, so I want to explain my usage case, why vendoring is important to us, and why the initial proxy solution would not have made me very happy.
We are never going to be doing ongoing Go development, with a nice collection of Go programs and tooling that we work on regularly and build frequently. Instead, we're going to have a few programs written in Go because Go is the right language for them (enough so to overcome our usual policy against it). If we're going to have local software in a compiled language, we need to be able to rebuild it on demand, just in case (otherwise it's a ticking time bomb). More specifically, we want people who aren't Go specialists to be able to reliably rebuild the program by following some simple and robust process. The closer the process is to 'copy this directory tree to /tmp, cd there, and run a standard command', the better.
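Concretely, the ideal rebuild procedure looks something like the following sketch. The program name and paths here are hypothetical, and it assumes a tree that carries its own dependencies, so only a stock Go toolchain is needed:

```shell
# Hypothetical example: rebuild a fully self-contained Go program.
# No network access, proxies, or extra tools should be required.
cp -a /our/src/ourprog /tmp/ourprog
cd /tmp/ourprog
go build
```

The appeal is that every step is generic; nothing in it requires remembering anything specific about Go's ecosystem.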
Today you can get most of the way there with vendoring, but as I discovered this only works if you're working from within a $GOPATH. This is less than ideal because it means that the build instructions are more involved than 'cd here and run go build'. However, setting up a $GOPATH is a lot better than having to find and run an entire proxy just to satisfy vgo. Running a proxy makes sense if you routinely build Go programs (and running it in that case is not a big deal), but we're only likely to be building this program (or any Go program) once every few years. Adding an entire daemon that we have to run in order to do our builds would not make us happy, and even a magic local cache would be kind of a pain (especially if we had to manually populate and maintain the cache directory).
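For the record, the $GOPATH workaround looks roughly like this sketch (the import path and program name are hypothetical; the vendored dependencies are assumed to live in a vendor/ subdirectory of the package):

```shell
# Hypothetical GOPATH-era build: the source has to sit at the right
# place under $GOPATH/src before 'go build' will honor ./vendor.
mkdir -p /tmp/build/src/example.org
cp -a /our/src/ourprog /tmp/build/src/example.org/ourprog
export GOPATH=/tmp/build
cd /tmp/build/src/example.org/ourprog
go build
```

This works, but the mkdir/export/cd dance is exactly the kind of Go-specific incantation that an infrequent developer will have forgotten by the next time they need it.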
The good news for me is that Russ Cox's posting in golang-dev is pretty much everything I want here. It appears to let me create entirely self contained directory trees (of source code, with no magic binary files) that include the full external dependencies and that can be built with a simple standard command with no setup required.
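(For readers coming to this later: what eventually shipped as Go modules works much this way. As a hedged sketch of the modern workflow, assuming a tree that already has a go.mod file:)

```shell
# Done once, by whoever maintains the program: copy the dependency
# source into the tree itself.
go mod vendor    # populates ./vendor from the module requirements

# Done by anyone, later: with a vendor/ directory present, a modern
# 'go build' uses it automatically, with no network and no $GOPATH.
go build
```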
(This entry was basically overtaken by events. When Russ Cox published his series of articles, my immediate reaction was that I hated losing vendoring and I was going to object loudly in an entry. Now that the Go team has already had enough feedback to change their minds, the entry is less objecting and more trying to explain why I care about this and describe our somewhat unusual perspective on things as what I'll call 'infrequent developers', a perspective that I think is often not widely heard from.)