Using our metrics system when I test systems before deployment

August 27, 2021

Years ago I wrote that I should document my test plans for our systems and their results, and I've somewhat managed to actually do that (and the documentation has since been used later on). Recently it struck me that our metrics system has a role to play in this.

To start with, if I add my test system to our metrics system (even with a hack), the metrics system will faithfully capture all sorts of performance information for it over the test period. This information isn't necessarily as fine-grained as I could gather (it doesn't go down to second-by-second data), but it's far broader and more comprehensive than what I would gather by hand. If I have questions about some aspect of the system's performance when I write up test plan results, it's quite likely that I can get answers for them on the spot by looking in Prometheus (without having to re-run tests while keeping an eye on the metric I've realized is interesting).

(As a corollary of this, looking at metrics provides an opportunity to see if anything is glaringly wrong, such as a surprisingly slow disk.)
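As an illustration of going back to Prometheus after the fact, here is a minimal sketch that builds a range query URL for a test window using Prometheus's HTTP API (/api/v1/query_range). The server URL, the host name 'testhost', the dates, and the choice of the node_exporter metric node_load1 are all assumptions made up for the example, not details of our actual setup.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def range_query_url(base_url, promql, start, end, step="60s"):
    """Build a Prometheus /api/v1/query_range URL covering a test window."""
    params = {
        "query": promql,
        "start": start.timestamp(),  # Unix timestamps are accepted by the API
        "end": end.timestamp(),
        "step": step,                # resolution of the returned points
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical test window and host; substitute what the test notes say.
start = datetime(2021, 8, 20, 14, 0, tzinfo=timezone.utc)
end = datetime(2021, 8, 20, 16, 0, tzinfo=timezone.utc)
url = range_query_url(
    "http://prometheus.example.org:9090",    # assumed server address
    'node_load1{instance="testhost:9100"}',  # assumed metric and instance label
    start,
    end,
)
print(url)
```

The resulting URL can be fetched with any HTTP client, or the same PromQL expression and time range can be typed into the Prometheus web interface.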

In addition, if I'm testing a new replacement for an existing server, having metrics from both systems gives me some opportunity to compare the performance of the two systems. This comparison will always be somewhat artificial (the test system is not under real load, and I may have to do some artificial things to the production system as part of testing), but it can at least tell me about relatively obvious things, and it's easy to look at graphs and make comparisons.
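For a rough numeric version of this sort of comparison, a sketch like the following averages each server's values out of a range query result. It assumes the JSON shape that Prometheus returns for range queries (a list of series, each with a 'metric' label set and 'values' as pairs of timestamp and string value). The instance names and numbers in the sample are invented for the example, and an average like this is no less artificial than eyeballing the graphs.

```python
from statistics import mean


def mean_by_instance(response):
    """Return {instance: mean value} for a query_range style response."""
    result = {}
    for series in response["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        # Sample values arrive as strings, so convert them before averaging.
        values = [float(v) for _ts, v in series["values"]]
        if values:
            result[instance] = mean(values)
    return result


# Invented sample data in the shape of a Prometheus range query response.
sample = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"instance": "oldserver:9100"},
             "values": [[1629468000, "0.5"], [1629468060, "1.5"]]},
            {"metric": {"instance": "newserver:9100"},
             "values": [[1629468000, "0.25"], [1629468060, "0.75"]]},
        ],
    },
}

averages = mean_by_instance(sample)
for instance, value in sorted(averages.items()):
    print(f"{instance}: {value:.2f}")
```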

Our current setup keeps metrics for as long as possible (and doesn't downsample them, which I maintain is a good thing). To the extent that we can keep on doing this, having metrics from the servers when I was testing them will let us compare their performance in testing to their performance when they (or some version of them) are in production. This might turn up anomalies, and generally I'd expect it to teach us about what to look for in the next round of testing.

To get all of this, it's not enough to just add test systems to our metrics setup (although that's a necessary prerequisite). I'll also need to document things so we can find them later in the metrics system. At a minimum I'll need the name used for the test system and the dates it was in testing while being monitored. Ideally I'll also have information on the dates and times when I ran various tests, so I don't have to stare at graphs of metrics and reverse engineer what I was doing at the time. A certain amount of this is information that I should already be capturing in my notes, but I should be more systematic about recording timestamps from 'date' and so on.
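One way to be more systematic about timestamps is to have a small wrapper record them around each test. The following is only a sketch of the idea; the timed_test helper, the record fields, and the host and test names are all hypothetical, and running 'date' by hand before and after a test gets the same information.

```python
import json
from contextlib import contextmanager
from datetime import datetime, timezone


def utc_now():
    """Current UTC time as an ISO 8601 string, to one second resolution."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


@contextmanager
def timed_test(log, host, description):
    """Append a record with the start and end times of one test run."""
    record = {"host": host, "test": description, "start": utc_now()}
    try:
        yield record
    finally:
        # Record the end time even if the test itself fails partway through.
        record["end"] = utc_now()
        log.append(record)


test_log = []
with timed_test(test_log, "testhost", "sequential read of a large file"):
    pass  # run the actual test here

# These records can be pasted into the test plan writeup.
print(json.dumps(test_log, indent=2))
```

Recording the times in UTC sidesteps later questions about which time zone the notes were in when matching them up against graphs.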

Written on 27 August 2021.
