2014-03-26
Why people keep creating new package managers
Matt Simmons recently wrote Just what we need ... another package manager, in which he has an unhappy reaction to yet another language introducing yet another package manager. As a sysadmin I've long agreed with him for all sorts of reasons. Packaging and managing language 'packages' is an ongoing problem and in our environment it also causes user heartburn when we have to turn down requests to install language packages through the language's own mechanisms.
(We have a strong policy that we only install official distribution packages in order to keep our own sanity. This works for us but not necessarily for other people.)
But at the same time I have a lot of sympathy for the language people. Let's look at the problem from their perspective. Languages need:
- package management everywhere they run, possibly including Windows
(which has no native package management system) and almost certainly
including the Macs that many developers will be using (which also
lack a native packaging system).
- something which doesn't force package contributors to learn more
than one packaging system, because most people won't and languages
want a healthy, thriving ecology of public packages. Ideally the
one packaging system will be a simple, lightweight, and low
friction one in order to encourage people to make and publish
packages.
- for developers of language packages to not have to deal with the
goat rodeos and political minefields that are the various
distribution packaging processes, because making them do so
is a great way of losing developers and not having packages.
- some relatively handy way to install and update packages that are not
in the official distribution repositories. No language can really
tolerate having its package availability held hostage to
the whims of distributions because it basically guarantees an out
of date and limited package selection.
(The interests of languages, developers of language packages, and distributions are basically all at odds here.)
- support for installing packages in non-default, non-system locations, ideally on both a 'per developer' and a 'per encapsulated environment' basis.
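To make that last point concrete, here is a minimal sketch, using Python purely as an illustration (other languages' tooling draws similar distinctions), of the three different kinds of install destination involved:

    import site
    import sys
    import sysconfig

    # Three destinations a language package can land in: the system-wide
    # location (normally owned by the distribution), a per-developer
    # location under the user's home directory, and whatever encapsulated
    # environment (virtualenv or the like) is currently active.
    print("system-wide packages:", sysconfig.get_paths()["purelib"])
    print("per-user packages:   ", site.getusersitepackages())
    print("active environment:  ", sys.prefix)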
From the language's perspective it would be nice if package management for the language could optionally be done the same way regardless of what host you're on. In other words, developers should be able to use the same commands to install and set up packages on their development Macs as they do on testing VMs running some Linux distribution (or even FreeBSD), and possibly also on the production systems.
(In the modern lightweight world many small companies will not have actual sysadmins and developers will be setting up the production machines too for a while. Sysadmins do not like this but it is a reality. And languages are not designed for or used by sysadmins, they are mostly designed for developers, so it is not surprising that they are driven by the needs of developers.)
It's theoretically possible for a language's package system to meet all of these needs while still enabling distribution packages and doing as much as possible with the core distribution packaging system, either explicitly or behind convenient cross-platform cover scripts. However, there are plenty of cases (like non-system installation of packages) that simply aren't handled by the distribution packaging system, and beyond that there are significant difficulties on both the technical and political levels. It is simply much easier for a language to roll its own packaging system.
(Ideally it will support creating distribution packages and using the distribution packaging mechanisms as well. Don't hold your breath; it's not going to be a language priority.)
2014-03-21
Thinking about when rsync's incremental mode doesn't help
I mentioned recently that I had
seen cases where rsync's incremental mode didn't speed it up to any
significant degree. Of course there's an obvious way to create such a
situation, namely erasing and replacing all of the files involved, but
that wasn't it for us. Our case was more subtle and it's taken me a
while to understand why it happened. Ultimately it comes down to having
a subtly wrong mental model of what takes time in rsync.
Our specific situation was replicating a mail spool from one machine to
another. There were any number of medium and large inboxes on the mail
spool, but for the most part they were just getting new messages; as
far as we know no one did any major inbox reorganization that would
have changed their entire inbox. Naively you'd think that an rsync
incremental transfer here could go significantly faster than a full
copy; after all, most of what you need to transfer is just the new
messages added to the end of most mailboxes.
What I'm quietly overlooking here is the cost of finding out what
needs to be transferred, and in turn the reason for this is that I've
implicitly assumed that sending things over the network is (very)
expensive in comparison to reading them off the disk. This is an easy
bias to pick up when you work with rsync, because rsync's entire
purpose is optimizing network transmission and when you use it you
normally don't really think about how it's finding out the differences.
What's going on in our situation is that when rsync sees a changed
file it has to read the entire file and compute block checksums (on
both sides). It doesn't matter if you've just appended one new email
message to a 100 Mbyte file for a measly 5 Kbyte addition at the end;
rsync still has to read it all. If you have a bunch of midsized
to large files (especially if they're fragmented, as mail inboxes
often are), simply reading through all of the changed files can take a
significant amount of time.
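To illustrate, here is a simplified sketch in Python of what has to happen for every changed file (real rsync uses a rolling weak checksum plus a strong checksum and negotiates its own block size, but the essential cost is the same):

    import hashlib

    BLOCK_SIZE = 64 * 1024  # illustrative; rsync chooses its own block size

    def block_checksums(path):
        # Checksum every block of the file. Even if only the last 5 Kbytes
        # of a 100 Mbyte mailbox are new, the whole file still gets read.
        sums = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                sums.append(hashlib.md5(block).hexdigest())
        return sums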
In a way this is a variant of Amdahl's law. With a lot of slightly
changed files an rsync incremental transfer may speed up the network
IO and reduce it to nearly nothing but it can't do much about the
disk IO. Reading lots of data is reading lots of data, whether or
not you send it over the network; you only get a big win out of not
sending it over the network if the network is slow compared to the
disk IO. The closer disk and network IO speeds are to each other,
the less you can save here (and the more that disk IO speeds will
determine the minimum time that an rsync can possibly take).
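As a crude back-of-the-envelope sketch (all of these numbers are invented for illustration, and the timing model is deliberately simple):

    total_gb  = 100   # size of the changed mailbox files, in GB
    new_gb    = 0.5   # genuinely new data appended to them, in GB
    disk_gb_s = 0.1   # ~100 MB/s of disk read bandwidth
    net_gb_s  = 0.1   # roughly gigabit network, also about 100 MB/s

    full_copy = total_gb / min(disk_gb_s, net_gb_s)
    # An incremental transfer still reads everything from disk on both
    # ends; it only avoids sending the unchanged data over the network.
    incremental = max(total_gb / disk_gb_s, new_gb / net_gb_s)

    print("full copy:   ~%d seconds" % full_copy)
    print("incremental: ~%d seconds" % incremental)

With disk and network speeds this close, the incremental transfer takes essentially as long as a full copy; only when the network is much slower than the disks does skipping the unchanged data pay off.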
The corollary is that the real savings come from doing less disk IO as
well as less network IO. This is where and why things like ZFS snapshots
and incremental 'zfs send' can win big, because to a large extent they
have very efficient ways of knowing the differences that need to be
sent.
PS: I'm also making another assumption, namely that CPU usage is free and is not a limiting factor. This is probably true for rsync checksum calculations on modern server hardware, but you never know (and our case was actually on really old SPARC hardware so it might actually have been a limiting factor).
2014-03-08
Why I think 10G-T will be the dominant form of 10G Ethernet
Today there are basically two options for 10G Ethernet products and interfaces, 10G-T (standard Ethernet ports and relatively normal Ethernet cables) and SFP+ (pluggable modules mostly using fiber). Historically SFP+-based products have been the dominant ones and some places have very large deployments of them, while 10G-T seems to have only recently become readily available. Despite this I believe that 10G-T is going to be the winning 10G format. There are two major 10G-T advantages that I think are going to drive this.
The first advantage is that 10G-T ports are simpler, smaller, and cheaper (at least potentially). SFP+ ports intrinsically require additional physical modules with their own circuitry plus a mechanical and electronic assembly to plug them into. This adds cost and it also adds physical space (especially depth) over what an Ethernet RJ45 connector and its circuitry require. In addition, 10G-T is pretty much just an RJ45 connector and a chipset, and the hardware world is very good at driving down the price of chipsets over time. SFP+s do not have this simplicity and as such I don't think they can tap quite the same price reduction power.
The second advantage is that 10G-T ports are backwards compatible with slower Ethernet while SFP+ ports talk only with other SFP+ ports. The really important aspect of this is that it's safe for manufacturers to replace 1G Ethernet ports with 10G-T Ethernet ports on servers (and on switches, for that matter). You can then buy such a 10G-T equipped server and drop it into your existing 1G infrastructure without any hassle. The same is not true if the manufacturer replaced 1G ports with SFP+ ports; suddenly you would need SFP+ modules (and cables) and a bunch of SFP+ switch ports that you probably don't have right now.
In short, going from 1G to 10G-T is no big deal, while going from 1G to SFP+ is a big, serious commitment where a bunch of things change.
This matters because server makers and their customers (ie, us) like 'no big deal' shifts but are very reluctant to make big serious commitments. That 10G-T is no big deal means that server makers can shift to offering it and people can shift to buying it. This drives a virtuous circle where more volume drives down the cost of 10G-T chipsets and hardware, which puts them in more places, which drives adoption of 10G-T as more and more equipment is 10G-T capable and so on and so forth. This is exactly the shift that I think will drive 10G-T to dominance.
I don't expect 10G-T to become dominant by replacing existing or future enterprise SFP+ deployments. I expect 10G-T to become dominant by replacing everyone's existing 1G deployments and eventually becoming as common as 1G is today. Enterprises are big, but the real volume is outside of them.
By the way: this is not a theoretical pattern. This is exactly the adoption shift that I got to watch with 1G Ethernet. Servers started shipping with some or all 1G ports instead of 100M ports, this drove demand for 1G switch ports, then switches started getting more and more 1G ports, and eventually we reached the point we're at today where random cheap hardware probably has a 1G port because why not; volume has driven the extra chipset cost to basically nothing.
Update: The reddit discussion of this entry has a bunch of interesting stuff about various aspects of this and 10G Ethernet in general. I found it usefully educational.