Nerving myself up to running experimental setups in production
One of the things that I want to do is move towards gathering OS level performance metrics for our systems, ideally for basically any performance stat that we can collect. All of the IO stats for all disks? Lots of stats for NFS mounts? CPU and memory utilization? Network link utilization and error counts? Bring them on, because the modern view is that you never know when this stuff will be useful or show you something interesting. The good news is that this is not a novel idea and there's a decent number of systems out there for doing all of the pieces of this sort of thing (collecting the stats on machines, forwarding them to a central place, aggregating and collating everything, graphing and querying them, etc). The bad news, in a sense, is that I don't know what we're doing here.
Like many places, we like everything we run in production to be fully baked. We work out all of the pieces in advance with whatever experimentation is needed, test it all, document it, and then put the finalized real version into production. We don't like to be constantly changing, adjusting, and rethinking things that are in production; that's a sign that we screwed up in the pre-production steps. Unfortunately it's become obvious to me that I can't make this approach work for the whole stats gathering project.
Oh, I can build a test stats collection server and some test machines to feed it data and make sure that all of the basic bits work, and I can test the 'production' version with less important and more peripheral production machines. But it's become obvious to me that really working out the best way to gather and present stats is going to take putting a stats-gathering system on real production servers and then seeing what explodes and what doesn't work for us (and what does). I simply don't think I can build a fully baked system that's ready to deploy onto our production servers in a final, unchanging configuration; I just don't know enough and I can't learn with just an artificial test environment. Instead we're going to have to put a half-baked, tentative setup on to production servers and then evolve it. There are going to be changes on the production machines, possibly drastic ones. We won't have nice build instructions and other documentation until well after the fact (once all the dust settles and we fully understand things).
As mentioned, this is not how we want to do production systems. But it's how we're going to have to do this one and I have to live with that. More than that, I have to embrace it. I have to be willing to stop trying to polish a test setup and just go, just put things on (some of) the production servers and see if it all works and then change it.
(I've sold my co-workers on this. Now I have to sell myself on it too (and stop using any number of ways to duck out of actually doing this), which is part of what this entry is about.)