2012-11-27
You don't necessarily know what matters for performance
One of the obvious lessons I could learn from the switch issue at the heart of our recent disk performance problem is that low level details matter, sometimes a lot. While this is true, I think it's a superficial thing to take away from this learning experience. The lesson I really want to take away is a more general one:
I don't know in advance what matters for performance.
Until I had this experience you probably could have sat me down with the specifications for both switches, asked me if there were any differences that mattered for our iSCSI performance, and I would have gone through them and confidently told you no. Obviously I would have been wrong.
What this convinces me of is that it is in practice impossible to fully understand system performance in advance. This is not an issue of incomplete knowledge (the specifications would have given me enough knowledge); it is instead an aspect of fragile complexity. The systems we use and build have enough interactions between their parts that we simply can't hold a complete picture in our heads, even if we have all the information we need available (which is often not the case). We can reason (and measure) backwards from actual performance to see how it arises, but in practice we cannot reason forward from specifications and components to reliably predict system performance. When we try to do this, we always simplify the picture of the system in order to make it tractable to reason about. Often this simplification works, but not always.
Reasoning backwards from performance to cause works better, but even it isn't foolproof. We're just as eager to create simplifications when we work backwards as when we work forwards. Fundamentally, humans are rationalizing animals; we want to tell ourselves stories that make order out of chaos and we're not too picky about what those stories look like. So the real validation is always measuring (with as few preconceptions as possible).
PS: a great example of this is the performance effects of shouting at your disk drives.
2012-11-22
Optional features are in practice not optional to understand
In a comment on yesterday's entry, Aristotle Pagaltzis suggested that Markdown might be acceptable to my coworkers in part because:
It is just HTML! You can type HTML and it will come out verbatim. It's not really a separate markup language (simple or not) so much as a shorthand notation for HTML.
So you don't have to use Markdown's syntax instead of HTML. Forget how to write a link in Markdown? Just write it using HTML syntax. Will work either way.
[...]
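(As a concrete illustration, consider a hypothetical snippet; the URL is made up, and both lines come out as the same link once Markdown has processed them:

    See the [local documentation](http://example.org/docs) for details.
    See the <a href="http://example.org/docs">local documentation</a> for details.

The first line uses Markdown's own link syntax and the second is plain HTML passed through verbatim.)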
This is a valuable feature and drastically reduces the learning curve for Markdown for writing things yourself; effectively you can learn bits and pieces of Markdown as you decide they're more convenient than straight HTML. Unfortunately this doesn't help persuade my coworkers to accept it because of a semi-paradox that optional features are in practice not optional.
The easy way to show this is to ask whether or not my coworkers ever need to learn Markdown and if so, when they need to learn it. To start with, if we never use Markdown syntax at all the answer is clearly no. But at that point Markdown isn't doing anything at all (except perhaps getting in the way because we have to escape things so that it doesn't process them); we might as well write HTML files and not process them through Markdown.
But suppose we use some Markdown. More specifically, let's suppose that I write in Markdown (as I'd like to) and my coworkers write in HTML (as they'd like to). Can they avoid learning Markdown? The answer is not really; the moment they want to touch something that I've worked on, they need to learn Markdown. In fact Markdown has a ratchet effect in this situation, in that the moment anyone introduces any bit of Markdown everyone winds up needing to know it in order to be able to understand and work on that bit of the text. If you do not learn Markdown, you're confined to the (probably shrinking) HTML-only content.
This is what makes optional features not optional to understand; they are only optional as long as they're not being used at all. The moment that they get into your codebase (or document base, or anything), you can no longer understand and work on the full codebase without learning them. Either you segment the codebase and responsibility for it or everyone gets to learn the optional feature.
(I have to thank my coworkers for patiently getting this idea through my thick skull.)
As suggested by my use of 'codebase' here, extending this to programming language features (and the use of entire programming languages) is left as an exercise for the reader. Really, there are lots of manifestations of this in development.
2012-11-13
A potential path to IPv6
For reasons beyond the scope of this entry, I wound up thinking about the (potential) transition to IPv6 again today. In the past I've been solidly gloomy on the prospects for a transition to IPv6, in large part because IPv6 doesn't benefit the people who have to do the work; the people who benefit from a transition are the people who don't already have IPv4 addresses, not the people who do. But today a potential path out of this occurred to me.
First off, let's assume that people start running out of assignable IPv4 addresses, as seems quite likely. When this happens, what we'll probably start seeing (and what has apparently started in some places) is consumer IPv6-only deployments that talk to the IPv4 world through large 'carrier grade' NAT systems. This only really allows outgoing connections (at least for IPv4 stuff), but for many ISPs that's a feature; if you want to run a server you can pay them for the privilege.
(If all of the software works fine most consumers won't care or notice.)
Of course even carrier grade NAT is not perfect or completely transparent. This is likely to result in an experience that's worse than directly accessing the same web site or other resource over native IPv6. With a worse experience for IPv4 than IPv6, if you want to do a good job serving these consumers you have an incentive (possibly a big one) to provide native IPv6 connectivity for your own services.
In other words, now you have a positive economic incentive to add native IPv6 support to your systems. Positive economic incentives (aka 'making more money') make the world go around, because they give organizations a concrete reward for the work they do.
(I will note in passing that the consumer ISPs involved have their own economic incentive to encourage IPv6 connectivity; the more native IPv6 their customers do, the less they need those expensive carrier grade NAT boxes.)
Where are these consumers? I think they're likely to be in what we today consider the peripheries of Internet usage, the places where it is only a recent development. North America generally has plenty of IPv4 addresses for its population, while places like Asia, Africa, and the Middle East are probably under-supplied. If this path to IPv6 is accurate, I'd expect to see serious IPv6 usage first show up in companies that are targeting these 'peripheral' users. And I wouldn't expect these companies to be based in North America.
(In fact I wouldn't be surprised if North America and western Europe wind up lagging the rest of the world in any IPv6 transition. The bigger question is if that will matter much.)
(All of this was in part sparked by (re)reading Avery Pennarun's entry on IPv6 from early 2011, due to reading this more recent entry of his.)
2012-11-06
Why I want to do full end-to-end performance tests
In light of the difficulty of doing real random IO tests, you might ask why we need to do that at all. Micro-benchmarks are easier and can be used to identify specific problems; as a commenter noted, I could use some block IO testing to test for our recent problem. The simple summary of why is that well done end to end performance testing is comprehensive in a way that micro-benchmarks are not.
The advantage of good end to end performance tests is that they find all problems, even problems that you don't know about and didn't foresee, as well as emergent problems that are the result of several layers combining together. By contrast, performance tests of specific components and other micro-benchmarks are more like negative tests; they're great for ruling out specific problems that you can foresee, but they do not necessarily test for things that you haven't foreseen. A micro-benchmark certainly can find a new problem, but it won't necessarily do so; it depends on whether or not the problem manifests symptoms that intersect with the micro-benchmark.
Of course in theory this is true of end to end performance tests too. The advantage of end to end tests is that they have what I'll call a greater testing surface. Because they go from the top of your system to the bottom (just as your real IO does), they touch a lot of components, and a significant problem with any one of them should manifest in the test results. If it doesn't, either it's not actually significant for your actual production IO load or your end to end tests aren't quite good enough.
Conversely the drawback of end to end tests is that while they may tell you that there's a problem they probably won't tell you where it is; since they touch so many components, you can't necessarily distinguish which component has a problem. That's one of the times when you turn to micro-benchmarks and specific fine-grained tests.
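As an illustration, a fine-grained random read test could be as simple as the following sketch (in Python 3; the file path, block size, and read count are all hypothetical, and for honest numbers the test file would have to be much larger than RAM, or opened with O_DIRECT, so that reads aren't just satisfied from the page cache):

    import os, random, time

    PATH = "/path/to/big/testfile"   # hypothetical; should be much larger than RAM
    BLOCKSIZE = 4096                 # bytes read per IO
    COUNT = 1000                     # number of random reads to time

    fd = os.open(PATH, os.O_RDONLY)
    nblocks = os.fstat(fd).st_size // BLOCKSIZE

    latencies = []
    for _ in range(COUNT):
        # pick a random block-aligned offset and time a single read of it
        offset = random.randrange(nblocks) * BLOCKSIZE
        start = time.time()
        os.pread(fd, BLOCKSIZE, offset)
        latencies.append(time.time() - start)
    os.close(fd)

    latencies.sort()
    print("average: %.2f ms" % (sum(latencies) * 1000 / COUNT))
    print("99th percentile: %.2f ms" % (latencies[int(COUNT * 0.99)] * 1000))

Even a sketch like this mostly tells you 'there is (or isn't) a problem at the block IO layer'; it says nothing about the filesystem, NFS, or application layers that a full end to end test would also exercise.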
(At this point you can note that an end to end random IO test is itself a kind of micro-benchmark. This is completely true; random read IO is just one component of our IO load. But I already have tests for sequential IO and I'm not as worried about random write IO and filesystem operations right now.)