2014-02-24
Nerving myself up to running experimental setups in production
One of the things that I want to do is move towards gathering OS level performance metrics for our systems, ideally for basically any performance stat that we can collect. All of the IO stats for all disks? Lots of stats for NFS mounts? CPU and memory utilization? Network link utilization and error counts? Bring them on, because the modern view is that you never know when this stuff will be useful or show you something interesting. The good news is that this is not a novel idea and there's a decent number of systems out there for doing all of the pieces of this sort of thing (collecting the stats on machines, forwarding them to a central place, aggregating and collating everything, graphing and querying them, etc). The bad news, in a sense, is that I don't know what we're doing here.
Like many places, we like everything we run in production to be fully baked. We work out all of the pieces in advance with whatever experimentation is needed, test it all, document it, and then put the finalized real version into production. We don't like to be constantly changing, adjusting, and rethinking things that are in production; that's a sign that we screwed up in the pre-production steps. Unfortunately it's become obvious to me that I can't make this approach work for the whole stats gathering project.
Oh, I can build a test stats collection server and some test machines to feed it data and make sure that all of the basic bits work, and I can test the 'production' version with less important and more peripheral production machines. But it's become obvious to me that really working out the best way to gather and present stats is going to take putting a stats-gathering system on real production servers and then seeing what explodes and what doesn't work for us (and what does). I simply don't think I can build a fully baked system that's ready to deploy onto our production servers in a final, unchanging configuration; I just don't know enough and I can't learn with just an artificial test environment. Instead we're going to have to put a half-baked, tentative setup on to production servers and then evolve it. There are going to be changes on the production machines, possibly drastic ones. We won't have nice build instructions and other documentation until well after the fact (once all the dust settles and we fully understand things).
As mentioned, this is not how we want to do production systems. But it's how we're going to have to do this one and I have to live with that. More than that, I have to embrace it. I have to be willing to stop trying to polish a test setup and just go, just put things on (some of) the production servers and see if it all works and then change it.
(I've sold my co-workers on this. Now I have to sell myself on it too (and stop using any number of ways to duck out of actually doing this), which is part of what this entry is about.)
The origins of DWiki and its drifting purpose
One of the interesting things about writing Wandering Thoughts has been getting a vivid and personal experience with what happens when some code you've written gets repurposed for something rather different than what it was originally designed for. Because, you see, DWiki (the wiki engine behind the blog) was not originally intended to be a blog engine and what it was originally designed for shaped it in a number of ways that still show today.
(I alluded to this when I talked about why comments aren't immediately visible on entries.)
Put simply, I originally designed DWiki as yet another attempt to build a local sysadmin documentation wiki that my co-workers would use. We hadn't shown much enthusiasm for writing HTML pages and I didn't think I could get my co-workers to edit things through the web, but I figured I at least had a shot if I gave them simple and minimal markup that they could edit by going 'cd /some/directory; vi file'. This idea never went anywhere but once I had the core wiki engine I added enough extra features to make it able to create a blog, and then I decided I might as well use the features and write one.
(From the right perspective a blog is just a paged time-based view over a directory hierarchy. So are Atom syndication feeds.)
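To make that perspective concrete, here is a minimal sketch in Python. It is not DWiki's actual code; the function name blog_pages, the per_page parameter, and the choice of file modification time as the entry's date are all assumptions made for illustration.

```python
from pathlib import Path


def blog_pages(root, per_page=10):
    """Return a list of pages; each page is a list of file paths.

    Hypothetical illustration: every regular file under `root` counts as
    an entry, and its modification time is assumed to be its date.
    """
    # Collect every file anywhere in the directory hierarchy.
    entries = [p for p in Path(root).rglob("*") if p.is_file()]
    # Time-based view: newest entries first.
    entries.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    # Paged view: cut the sorted list into fixed-size chunks.
    return [entries[i:i + per_page] for i in range(0, len(entries), per_page)]
```

Under the same assumptions, a syndication feed would just be the first page of this list rendered in a different output format.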
One feature that this original purpose strongly affected is how comments are displayed. To put it one way, if you're creating a sysadmin documentation wiki, input from outsiders is not a primary source of content. It's a potential source of feedback to us, but it's definitely not on par with the (theoretical) stuff we were going to be writing. So I decided that (by default) comments would get a secondary position; if you were just browsing the wiki, you'd have to go out of your way to see the comments. As a wiki, if people left comments with seriously worthwhile feedback we'd fold that feedback into the main page.
(Adding comments was also a sop to the view that all true wikis are web-editable by outsiders. I wasn't going to make the wiki itself web-editable, but this way I could say that we were wiki-like in that we were still allowing outsiders to have a voice.)
Another thing that this original purpose strongly affected was DWiki's choice of text formatting characters, especially its choice of _ as the 'typewriter text' formatting character. If you're writing about sysadmin things it's quite common to want to set text in typewriter text to denote (Unix) commands so you want a nice convenient character sequence for it; _ looks like a great choice because almost nothing you write about is going to have actual underscores (they're very uncommon in Unix command lines). When I instead started using DWiki to write more and more about code, this turned into a terrible decision since _ is an extremely common character in identifiers.
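The collision is easy to see with a deliberately naive sketch in Python. This is not DWiki's real parser; the TT_RE pattern, the render_tt function, and the use of a tt tag for output are all assumptions made up for illustration.

```python
import re

# Hypothetical rule: anything between a pair of underscores becomes
# typewriter text. Illustration only, not DWiki's actual markup rules.
TT_RE = re.compile(r"_([^_]+)_")


def render_tt(text):
    """Replace _..._ spans with <tt>...</tt>."""
    return TT_RE.sub(r"<tt>\1</tt>", text)


# Writing about Unix commands: works as intended.
print(render_tt("run _ls -l /tmp_ to list files"))
# prints: run <tt>ls -l /tmp</tt> to list files

# Writing about code: underscores inside identifiers pair up by accident.
print(render_tt("call read_config() then parse_args()"))
# prints: call read<tt>config() then parse</tt>args()
```

A real parser can be smarter about word boundaries than this one, but any such rule has to guess whether an underscore is markup or part of an identifier.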
(Another choice that looked sensible for writing about Unix commands but turned out to be bad for writing about code is using ((...)) for a block of typewriter text with no further formatting. The problem is that when you're writing about code you often wind up wanting to write about things with (...) on the end and that confuses the text parser.)
PS: In hindsight I can see all sorts of problems with my idea of a sysadmin documentation wiki. Even if I'd tried to market it better to my co-workers I suspect that it wouldn't have worked, especially as something that was publicly visible.