Wandering Thoughts archives


How we choose our time intervals in our Grafana dashboards

In a comment on my entry on our Prometheus and Grafana setup, trallnag asked a good question:

Would you mind sharing your concrete approach to setting the time intervals for functions like rate() and increase()?

This is a good question, because trallnag goes on to cover why this is an issue you may want to think about:

I tend to switch between using $__interval, completely fixed values like 5m or a Grafana interval variable with multiple interval to choose from. None are perfect and all fail in certain circumstances, ranging from missing spikes with $__interval to under or oversampling with custom intervals.

The very simple answer is that so far I've universally used $__interval, which is Grafana's templating variable for 'whatever the step is on this graph given the time scale you're currently covering'. Using $__interval means that your graph is (theoretically) continuous but without oversampling; every moment in time is used for one and only one graph point.

The more complete answer is that we use $__interval but often tell Grafana that there is a minimum interval for the query that is usually slightly larger than how often we generate the metric. When you use rate(), increase(), and their kin, you need to make sure that your interval always has at least two metric points, otherwise they give you no value and your graphs look funny. Since we're using variable intervals, we have to set the minimum interval.
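To make the "at least two metric points" requirement concrete, here is a minimal Python sketch of a simplified rate calculation. It is not how Prometheus implements rate(): it ignores counter resets and the extrapolation Prometheus does toward the window edges. The `simple_rate` helper, the 60-second scrape interval, and the sample values are all assumptions made up for illustration.

```python
from typing import Optional, Sequence, Tuple

# (timestamp in seconds, counter value); a hypothetical sample type
Sample = Tuple[float, float]


def simple_rate(samples: Sequence[Sample], end: float, window: float) -> Optional[float]:
    """Per-second increase across the samples that fall in (end - window, end].

    Simplified on purpose: no counter-reset handling and no extrapolation,
    unlike the real Prometheus rate().
    """
    in_window = [(t, v) for t, v in samples if end - window < t <= end]
    if len(in_window) < 2:
        # Fewer than two points: there is nothing to compute a rate from,
        # so the graph gets no value for this step.
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter scraped every 60 seconds, growing by 120 per scrape.
samples = [(60.0 * i, 120.0 * i) for i in range(11)]

# A 30 second window only ever catches one sample, so there is no rate.
too_short = simple_rate(samples, end=600.0, window=30.0)

# A 150 second window catches several samples and yields a rate.
long_enough = simple_rate(samples, end=600.0, window=150.0)
```

Here `too_short` comes out as `None`, which corresponds to the gaps you see in a graph when the interval is too small. Setting a minimum interval on the query keeps the window from shrinking that far.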

In a few graphs I've experimented with combining rate() and irate() with an or clause:

rate( ...[$__interval] ) or
   irate( ...[4m] )

The idea here is that if the interval is too short to get two metric points, the rate() will generate nothing and we fall through to irate(), which will give us the rate across the two most recent metric points (see rate() versus irate()). Unfortunately, this is both annoying to write (since you have to repeat your metric condition) and inefficient (since Prometheus will always evaluate both the rate() and the irate()), so I've mostly abandoned it.

The high-level answer is that we use $__interval because I don't have a reason to make things more complicated. Our Grafana dashboards are for overviews (even detailed overviews), not narrow troubleshooting, and I feel that for this a continuous graph is generally the most useful. It's certainly the easiest to make work at both small and large timescales (including ones like 'the last week'). We're also in the position where we don't care specifically about the rate of anything over a fixed interval (eg, 'error rate in the last 5 minutes should be under ...'), and probably don't care about momentary spikes, especially when we're using a large time range with a dashboard.

(Over a small time range, a continuous graph of rate() will show you all of the spikes and dips. Or you can go into Grafana's 'Explore' and switch to irate() over a fixed, large enough interval.)

If we wanted to always see short spikes (or dips) even on dashboards covering larger time ranges, we'd have to use the more complicated approach I covered in using Prometheus subqueries to look for spikes in rates. There's no clever choice of interval in Grafana that will get you out of this for all time ranges and situations, and Prometheus currently has no way to find these spikes or dips short of writing out the subquery. Going down this road also requires figuring out if you care about spikes, dips, or both, and if it's both how to represent them on a dashboard graph without overloading it (and yourself).
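As a rough illustration of why a large interval hides short spikes, and of the general idea behind looking at short-window rates inside a larger interval, here is a small Python sketch. The numbers are invented, and this only mimics the arithmetic; it is not PromQL and it is an assumption about the shape of the subquery approach, not a copy of it.

```python
# Hypothetical per-second rates, one value per minute for an hour:
# a steady 10/s with a single one-minute spike to 500/s.
per_minute_rates = [10.0] * 60
per_minute_rates[30] = 500.0

# One graph point covering the whole hour behaves roughly like an average
# over the hour, so the spike is smeared out.
hour_average = sum(per_minute_rates) / len(per_minute_rates)

# The alternative idea: compute rates over short windows first, then take
# the maximum (for spikes) or minimum (for dips) across the larger interval.
hour_peak = max(per_minute_rates)
hour_trough = min(per_minute_rates)
```

The hour-wide average lands a bit above 18/s, which gives no hint that the rate briefly hit 500/s, while the peak value preserves it. This also shows the presentation problem: you now have up to three series (average, peak, trough) competing for space on one graph.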

(Also, the metrics we generally graph with rate() are things that we expect to periodically have short term spikes (often to saturation, for things like CPU usage and network bandwidth). A dashboard calling out that these spikes happened would likely be too noisy to be useful.)

PS: This issue starts exposing a broader issue of what your Grafana dashboards are for, but that's another entry.

sysadmin/GrafanaOurIntervalSettings written at 22:06:10

Our problem installing an old Ubuntu kernel set of packages

On Twitter, I said:

It has been '0' days since I've wound up hating Debian's choice to sign package metadata instead of packages (or perhaps 'in addition to' these days). Why? Because it makes it much more difficult to support 'install a package, satisfying dependencies from this directory of debs'.

Naturally there is a story here.

We have some Linux fileservers running Ubuntu, and we are very controlled about upgrading their kernel versions (partly because of mysterious crashes). We have a new kernel version that's proven on our test fileserver and our most recently built fileserver (which is itself a story), and we're looking at upgrading the other fileservers to that kernel. However, this kernel is not the most recent Ubuntu kernel; it's sufficiently old that it's no longer in the official Ubuntu package repositories.

We have our own local Ubuntu mirror, where we never delete packages, and it has all of the many linux-* packages and meta-packages required. However, we can't just do 'apt-get install linux-generic=...' and get all of those packages. Because these older Linux kernels aren't in the Ubuntu official package repositories, they're not in the official repository index files. Because these index files are signed, our mirror can't just rebuild them to reflect the full set of packages we have available. Although we have these files available on our mirror, we can't use them, at least not easily.

Similarly, I suspect that this fundamental assumption of signed index files (or at least of the existence of index files) is part of why, as far as I know, no dpkg frontend has an option to just get packages and dependencies from a directory you supply. You can 'dpkg -i *.deb' for everything in a directory, but that requires you to carefully curate the directory to have absolutely everything required, and Ubuntu kernels come in a rather large number of packages.

(If there is a command line frontend that supports this, I would like to know about it. I don't count dropping .debs into /var/cache/apt/archives for apt, although I've read that it actually works.)

You don't really have this problem on RPM based systems like CentOS. Since all RPM packages themselves are signed, signed metadata isn't as important and tools like yum and dnf are generally happy to work with a pool of RPMs in a directory.

(Note that unsigned repository metadata opens you up to some attacks, so you definitely want to sign it if possible. It's also safe to generate your own local unsigned repository metadata, since you generally trust yourself.)

(See also my wish to be able to easily update the packages on an Ubuntu ISO image, which also runs into this issue.)

linux/UbuntuOldPackageProblem written at 00:00:03
