2011-07-28
Another reason why version control systems should support history rewriting
In Wait, Not That Bit!, Greg Wilson writes about the problem of making a bunch of unrelated changes to a single file and then having to commit a big bullet list of changes. The rest of the entry and the comments then turned into a discussion of how to split existing changes into multiple commits and how to test the resulting separate commits.
However, I'd like to note that there is a fundamental conflict inherent in this workflow. We want VCS commits to be very easy and lightweight so that developers will actually do them (and do them frequently), instead of developing and checkpointing things outside the VCS because it's more convenient. At the same time we want each VCS commit to be for a single separate change, and we want the change (and the commit) to pass tests. These goals are in conflict, and the discussion in Greg Wilson's entry is one sign of it; all of the proposed solutions involve a developer who has a finished chunk of code going through more work before they can capture it in their VCS. What happened to 'commit early, commit often'?
(Among other things, my strong opinion is that as a developer I want to be able to snapshot the code the moment that it actually works. Working code is precious and fragile. Changing the code without a snapshot after it works is an invitation to accidents, mistakes, damaged code and heartburn.)
The reality is that you want to do two sorts of commits here; as you develop you want to capture state (especially 'okay, this code works, make sure I don't lose it'), and once you're done you want to capture distinct changes. Or rather, once you're done you want to turn your captured state into a series of separate changes (and test them one by one). This sounds more or less exactly like 'rewriting history'; you start out with a history that is a series of state snapshots (the degenerate case is a single snapshot) and rewrite it into a history of separate changes. Then you publish only the second, proper history.
Once you're going to be rewriting history, it should be supported in your VCS for reasons that I've written about before.
(You can try to argue that your VCS should be used only for the final commits and the state snapshots should be handled through some other mechanism or program. I feel that this is a mistake; among other things, this other mechanism is effectively a version control system itself and that means that someone had to write it, duplicating much or all of the work of writing your main VCS.)
2011-07-24
On documenting (or not documenting) binary protocols
In an aside in SaneBinaryProtocols, I noted that part of a sane binary protocol was documenting it, something that the people responsible for the sendmail milter protocol had failed to do. Today I have an aside on that general issue.
The traditional excuse for not documenting your protocol, at least in the open source world, is that you don't want to commit to supporting it over the long term; by not documenting it you preserve your freedom to change it without having to worry about backwards compatibility. The problem is that in practice this doesn't actually work.
If you have your own clients and servers out there in the field, you already have programs that are speaking the old version of your protocol. Unless you can force people to upgrade everything at the same time (which is often quite unpopular), you are in practice stuck with backwards compatibility anyways so that your new server can be talked to by old clients (and sometimes so that your new clients can talk to old servers).
(Failure to have backwards compatibility simply invites people not to upgrade to your new release, which is often not one of the goals that you have.)
Documenting your protocol and having other people write clients and servers against it may make protocol upgrades take somewhat longer to become pervasive out there in the world. But it does not intrinsically create a problem for you where none existed before, unless you were already annoying people by insisting on lock-step upgrades.
Also, in real life people will reverse engineer your protocol anyways if it is at all useful, regardless of what you say, and will write clients and servers against their reverse engineered version. You can yank the rug out from underneath them by changing the protocol with no backwards compatibility if you want to, but again this just encourages people not to upgrade. (It also encourages people to add backwards compatibility to third party clients and servers, making them more attractive than your own versions.)
(Yes, sometimes abrupt protocol changes with no backwards compatibility are justified, such as if you discover that the previous version had a security vulnerability that allowed bad things.)
2011-07-12
Some thoughts on creating simple and sane binary protocols
The best way to create a new simple and sane binary protocol is to not do so; create a text based protocol instead. Text based protocols have any number of advantages; they're easier to write handlers for, they can be debugged and tested by hand, semi-smart proxies are easy to write, it's easy to use network monitoring tools to trace them in live systems, and so on. And protocols like HTTP and (E)SMTP prove that they are viable at large scale and high traffic volumes. Really, your situation is probably not an exception that requires an 'efficient' binary protocol.
But suppose that you've determined that you need a (new) binary protocol for some reason. Because you're nice, you want to make one that irritates programmers as little as possible, ie that is as easy as possible to write protocol encoders and decoders for. Having looked at a number of binary protocols and just recently written a codec for sendmail's milter protocol, I have a few opinions on what you should do.
(Beyond the obvious one of 'document it', which the sendmail people skipped.)
First, wrap your various structures, bitstreams, or whatever in a simple packet format. The important bit of such a format is that every packet has a common fixed-size header that includes the packet size, followed by the variable-sized packet data. Having the size up front allows the decoder to know very early on if it has all of the data that it needs for the packet; this simplifies further decoding and enables various sorts of error checks. You want the packet header to be fixed size so that it is easy to unconditionally read and decode.
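To make this concrete, here is a minimal Python sketch of reading such a packet from a socket. The 4-byte big-endian length header and the size limit are assumptions made up for illustration, not something from any real protocol:

    import struct

    MAX_PACKET = 1 << 20  # an arbitrary sanity limit; pick one that fits your protocol

    HEADER = struct.Struct(">I")  # fixed-size header: a 4-byte big-endian packet length

    def read_exactly(sock, n):
        # Read exactly n bytes, failing loudly on EOF in mid-packet.
        data = b""
        while len(data) < n:
            chunk = sock.recv(n - len(data))
            if not chunk:
                raise EOFError("connection closed in mid-packet")
            data += chunk
        return data

    def read_packet(sock):
        # Because the size comes first in a fixed-size header, we know
        # immediately how much more data to read and can sanity-check
        # the size before we decode anything else.
        (length,) = HEADER.unpack(read_exactly(sock, HEADER.size))
        if length > MAX_PACKET:
            raise ValueError("implausible packet size: %d" % length)
        return read_exactly(sock, length)

Note that nothing past the header gets decoded until we know we have the whole packet in hand.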
Second, build your messages out of as few primitive field types as possible and make those primitive types as simple as possible to decode and encode. In my view, the simplest field types are fixed sized fields, then (fixed-size) length plus data, and then bringing up the rear are delimited fields (where there is some end marker that you have to scan for). If you create complex encoded field types, expect programmers to hate you.
(In general, creating a field type that can't be encoded with either memcpy() or a printf-like formatter is probably a mistake.)
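As a rough illustration of the decoding differences, here are Python sketches of the three sorts of fields; the 2-byte big-endian length and the NUL delimiter are my own assumptions, not from any particular protocol:

    import struct

    def take_fixed(data, pos, size):
        # Fixed-size field: a plain slice; we never look at the data itself.
        return data[pos:pos+size], pos + size

    def take_lendata(data, pos):
        # Length plus data: a fixed-size length, then that many bytes.
        (n,) = struct.unpack_from(">H", data, pos)
        return data[pos+2:pos+2+n], pos + 2 + n

    def take_delimited(data, pos, delim=b"\0"):
        # Delimited field: we have to scan for the end marker, and a
        # malformed message may not have one (index() then raises ValueError).
        end = data.index(delim, pos)
        return data[pos:end], end + len(delim)

Each decoder returns the field value and the new position so that they can be chained; notice that the delimited decoder is the only one that has to scan the data itself to find out where the field ends, and the only one that can fail to find an end at all.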
Finally, have only a single field that determines the rest of the message's format, and put this field at a fixed early point in the packet. In other words, you have a fixed set of structures (or messages) that are encoded into your binary protocol and then some marker of which message this is. Avoid variable format messages, where how you decode the message depends in part on the message contents; for example, a specification like 'if field A has value X, field B is omitted' creates a variable format message. Variable format messages require conditional encoding and decoding, which complicates everyone's life. By contrast a fixed format message can be decoded to a list of field values based only on knowing the field types and their order (and it can be encoded from such a list in the same way).
(If you have to have variable format messages, the closer you stick to this approach the better. Recursive sub-messages are one obvious approach.)
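To sketch what fixed format messages buy you, here is a Python decoder driven entirely by a table of per-message field types; the message types and their fields are invented for illustration:

    import struct

    # A hypothetical table mapping the message type byte (the single field
    # that determines the format) to an ordered tuple of field types.
    FIELDS = {
        0x01: ("fixed4", "lendata"),             # say, a version plus a name
        0x02: ("lendata", "lendata", "fixed4"),  # say, a key, a value, and flags
    }

    def decode(payload):
        # Turn a packet payload into (msgtype, [field values]). Since every
        # message has a fixed format, no decoding decision ever depends on
        # a field's value; everything comes from the table.
        (msgtype,) = struct.unpack_from(">B", payload, 0)
        pos, values = 1, []
        for ftype in FIELDS[msgtype]:
            if ftype == "fixed4":
                value, pos = payload[pos:pos+4], pos + 4
            else:  # "lendata"
                (n,) = struct.unpack_from(">H", payload, pos)
                value, pos = payload[pos+2:pos+2+n], pos + 2 + n
            values.append(value)
        return msgtype, values

Encoding is just the mirror image: walk the same table and emit each field from a list of values. This is exactly the sort of thing that is easy to write once, generically.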
A simple protocol like this can be described in a way that enables quite simple and relatively annoyance free encoding and decoding in modern high level languages. But that's another entry, since this one is already long enough.