Wandering Thoughts archives

2018-12-17

My current trick for keeping reasonably ready virtual machine images

A lot of my use of virtualization is to test and develop our Ubuntu systems and changes to them. If I have a significant configuration change for one of our Exim machines, for example, I will usually spin up a test VM, build a version of that machine, and test the configuration there before I risk deploying it onto the real machine. Similarly, I prototype a lot of the build and operating procedures for machines on VMs, because if something goes wrong I can pave and rebuild easily (or simply revert to a snapshot).

For a long time, I redid the Ubuntu install on these VMs from scratch every time I needed to change what I was doing with one of them, which was reasonably frequently. (Or at least I mostly did; sometimes I was lazy and reused an existing Ubuntu install that didn't seem too mutated from our baseline, and sometimes I was wrong about how mutated it was.) Years ago I toyed with the idea of preconfigured virtual machine templates, but I never went anywhere with it for various reasons, including how many potential image variants there were.

Recently I not so much figured something out as got lazy. Most of the VMs I build are of the latest version of Ubuntu (currently 18.04), and there's definitely one portion of our install system that changes extremely rarely, which is the initial install from our CD image. So at some point, I started just making a snapshot of that initial install in every test machine VM. This still means I have to go through our postinstall process, but that's mostly hands-off and, in particular, I don't have to interact with the VM on the console; I can do it all through SSH.

(The CD stage of the install requires configuring networking and partitioning the local disk before Ubuntu can be installed and brought up on our network. After that it's deliberately accessible via SSH.)

This doesn't help if I need something other than Ubuntu 18.04, but in that case I do 'dd if=/dev/zero of=/dev/sda' and start from scratch with a CD install. I don't bother deleting the snapshot; I'll revert back to it when I go back to wanting this VM to be an 18.04 machine. Fortunately the CD install stage doesn't take too long.

(Most of our Exim machines are still running 16.04, so testing an Exim change is one recent reason I needed a 16.04 VM. I just recycled that one, in fact.)
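The snapshot-and-revert routine above can also be driven from the command line instead of the UI. What follows is only a rough sketch, assuming VMware Workstation's `vmrun` utility is on the PATH; the snapshot name `post-cd-install` and the `.vmx` path are made-up placeholders, and nothing here is how I actually do it today. The helpers only build the command lines, so you can look at them before running anything.

```python
import shlex

VMRUN = "vmrun"  # assumption: VMware Workstation's vmrun is on PATH
BASE_SNAPSHOT = "post-cd-install"  # hypothetical snapshot name


def vmrun_cmd(action, vmx_path, snapshot_name):
    """Build a vmrun argv list for a snapshot operation on a Workstation VM."""
    if action not in ("snapshot", "revertToSnapshot"):
        raise ValueError(f"unsupported action: {action}")
    # '-T ws' tells vmrun that the host product is Workstation.
    return [VMRUN, "-T", "ws", action, vmx_path, snapshot_name]


def snapshot_base_install(vmx_path, name=BASE_SNAPSHOT):
    """Command to freeze the VM right after the CD install stage."""
    return vmrun_cmd("snapshot", vmx_path, name)


def revert_to_base_install(vmx_path, name=BASE_SNAPSHOT):
    """Command to go back to the frozen post-CD-install state."""
    return vmrun_cmd("revertToSnapshot", vmx_path, name)


if __name__ == "__main__":
    vmx = "/vms/test1/test1.vmx"  # hypothetical path to the VM's .vmx file
    # Print the commands rather than run them; pass the lists to
    # subprocess.run(cmd, check=True) once they look right.
    print(shlex.join(snapshot_base_install(vmx)))
    print(shlex.join(revert_to_base_install(vmx)))
```

Since each test VM has its own IP address baked into its snapshot, this would be run once per VM with that VM's own `.vmx` file.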

PS: I can't keep a generic image and clone it to make new VMs, because our install CD sets the machine up with a specific IP address (whatever you enter). It's easier to keep frozen snapshots for several VMs, each with its own IP address in it, than to fix that somehow after I clone the image into a VM.

Sidebar: Something I should try out with VMware Workstation

Normally I think of VMware Workstation snapshots as happening in a straight line of time, but if I remember right you can actually fork them into a tree. If this works well, the obvious approach to dealing with different Ubuntu versions would be to keep forked snapshots of each. However, even if this is something you can do reasonably in the VMware UI, I wonder what sort of performance impact it has on disk IO in the VM. I don't want to slow down my VMs too much for something that I only use occasionally.

sysadmin/KeepingReadyVMImages written at 22:11:56

Exploring casual questions with our new metrics system

On Mastodon a while back, I said:

What's surprised me about having our new metrics system is how handy it is to be able to answer casual questions, like 'does this machine's CPU get used much', and how often I poke around with that sort of little question or small issue I'm curious about.

(I was expecting us to only really care about metrics when trying to diagnose problems or crises.)

Putting together some sort of metrics and performance statistics system for our servers has been one of my intentions for many years now (see, for example, this 2014 entry, or this one on our then goals). Over all of that time, I assumed that what mattered for this hypothetical system, and what we'd use it for, was answering questions about problems (often serious ones), partly to make sure we actually understood the problem, along with things like checking for changes in things we think are harmless. Recently we actually put together such a system, based around Prometheus and Grafana, and my experience with it so far has been rather different from what I expected.

Over and over again, I've turned to our metrics system to answer relatively casual or small questions where it's simply useful to have the answers, not important or critical. Sometimes it's because we have questions such as how heavily a compute machine's CPU or memory is used; sometimes it has let me confirm an explanation for a little mystery. Some of the time I don't even have a real question; I'm just curious about what's going on with a machine or a service. For instance, I've looked into what our Amanda servers are doing during backups and turned up interesting patterns in disk IO, as well as confirming and firming up some vague theories we had about how they performed and what their speed limits were.

(And just looking at systems has turned up interesting information, simply because I was curious or trying to put together a useful dashboard.)

The common element in all of this is that having a metrics system now makes asking questions and getting answers a pretty easy process. It took a lot of work to get to this point, but now that I've reached it I can plug PromQL queries into Prometheus or look at the dashboards I've built up and pull out a lot with low effort. Since it only takes a little effort to look, I wind up looking a fair bit, even for casual curiosities that we would never have bothered exploring before.
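To give a flavour of how little effort this is, here is a minimal sketch of the 'does this machine's CPU get used much' question as a PromQL query sent to Prometheus's HTTP API. It rests on assumptions: that CPU metrics come from node_exporter under the `node_cpu_seconds_total` name, and that the server URL and instance label look like the placeholders used here. It only builds the query and the URL; fetching is left out so it runs without a Prometheus server.

```python
from urllib.parse import urlencode


def cpu_busy_query(instance, window="1h"):
    """PromQL for the fraction of CPU time not spent idle, averaged
    over all of the machine's CPUs across the given time window.

    Assumes node_exporter's node_cpu_seconds_total metric.
    """
    return (
        '1 - avg(rate(node_cpu_seconds_total{mode="idle",'
        f'instance="{instance}"}}[{window}]))'
    )


def instant_query_url(base_url, promql):
    """URL for Prometheus's instant query endpoint, /api/v1/query."""
    params = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


if __name__ == "__main__":
    # Hypothetical instance label and Prometheus server address.
    query = cpu_busy_query("compute1:9100", window="1d")
    print(query)
    print(instant_query_url("http://prometheus.example.org:9090", query))
```

The same query text can be pasted straight into the Prometheus web UI or a Grafana panel, which is how I mostly poke at things; the URL form is just there for when you want the answer from a script.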

I didn't see this coming at all, not over all of the time that I've been circling around system performance stats and metrics and so on. Perhaps this is to be expected; our focus from the start has been on looking for problems and dealing with them, and when people talk about metrics systems it's mostly about how their system let them see or figure out something important about their environment.

(This focus is natural, since 'it solved our big problem' is a very good argument for why you want a metrics system and why investing the time to set one up was a smart decision.)

PS: This is of course yet another example of how reducing friction increases use and visibility and so on. When it is easy to do something, you often wind up doing it more often, as I've seen over and over again.

sysadmin/MetricsExploringCasualThings written at 00:56:28

