How we choose our time intervals in our Grafana dashboards
In a comment on my entry on our Prometheus and Grafana setup, trallnag asked a good question:
Would you mind sharing your concrete approach to setting the time intervals for functions like rate() and increase()?
This is a good question, because trallnag goes on to cover why this is an issue you may want to think about:
I tend to switch between using $__interval, completely fixed values like 5m or a Grafana interval variable with multiple interval to choose from. None are perfect and all fail in certain circumstances, ranging from missing spikes with $__interval to under or oversampling with custom intervals.
The very simple answer is that so far I've universally used $__interval, which is Grafana's templating variable for 'whatever the step is on this graph given the time scale you're currently covering'. Using $__interval means that your graph is (theoretically) continuous but without oversampling; every moment in time is used for one and only one graph point.
The more complete answer is that we use $__interval, but we often tell Grafana that there is a minimum interval for the query, one that is usually slightly larger than how often we generate the metric. When you use rate(), increase(), and their kin, you need to make sure that your interval always covers at least two metric points; otherwise they give you no value and your graphs look funny. Since we're using variable intervals, we have to set the minimum interval.
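As an illustrative sketch (the metric name and the scrape interval here are assumptions, not necessarily our actual setup), a Grafana panel query might look like:

```
# Grafana substitutes $__interval with the current graph step.
rate(node_network_receive_bytes_total[$__interval])
```

with the panel's 'Min interval' (also called 'Min step') set to something like 20s if the metric is scraped every 15 seconds, so that the range always spans at least two metric points.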
In a few graphs I've experimented with combining rate( ...[$__interval] ) or irate( ...[4m] ). The idea here is that if the interval is too short to cover two metric points, rate() will generate nothing and we fall through to irate(), which will give us the rate across the two most recent metric points. Unfortunately, this is both annoying to write (since you have to repeat your metric condition) and inefficient (since Prometheus will always evaluate both the rate() and the irate()), so I've mostly abandoned it.
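Spelled out, this fallback pattern looks something like the following (the metric name is a placeholder):

```
# If the range holds fewer than two points, rate() returns nothing
# for a series and PromQL's 'or' falls through to the irate() side.
rate(some_metric_total[$__interval])
  or
irate(some_metric_total[4m])
```

PromQL's 'or' works per series, which is why both sides have to be evaluated every time, and why the metric selector has to be written out twice.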
The high level answer is that we use $__interval because I don't have a reason to make things more complicated. Our Grafana dashboards are for overviews (even detailed overviews), not narrow troubleshooting, and I feel that for this a continuous graph is generally the most useful. It's certainly the easiest to make work at both small and large timescales (including ones like 'the last week'). We're also in the position where we don't care specifically about the rate of anything over a fixed interval (eg, 'error rate in the last 5 minute should be under ...'), and probably don't care about momentary spikes, especially when we're using a large time range with a dashboard.
(Over a small time range, a continuous graph of rate() will show you all of the spikes and dips. Or you can go into Grafana's 'Explore' and use irate() over a fixed, large enough interval.)
If we wanted to always see short spikes (or dips) even on dashboards covering larger time ranges, we'd have to use the more complicated approach I covered in using Prometheus subqueries to look for spikes in rates. There's no clever choice of interval in Grafana that will get you out of this for all time ranges and situations, and Prometheus currently has no way to find these spikes or dips short of writing out the subquery. Going down this road also requires figuring out if you care about spikes, dips, or both, and if it's both how to represent them on a dashboard graph without overloading it (and yourself).
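The subquery approach can be sketched like this (the metric name and the one-minute resolution are assumptions for illustration):

```
# The highest one-minute rate seen within each graph step, which
# surfaces short spikes even when $__interval is large.
max_over_time(rate(some_metric_total[1m])[$__interval:1m])
```

The corresponding min_over_time() version would find dips, and graphing both at once is where the dashboard starts getting overloaded.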
(Also, the metrics we generally graph with rate() are things that we expect to periodically have short term spikes (often to saturation, for things like CPU usage and network bandwidth). A dashboard calling out that these spikes happened would likely be too noisy to be useful.)
PS: This issue starts exposing a broader issue of what your Grafana dashboards are for, but that's another entry.
Our problem installing an old Ubuntu kernel set of packages
On Twitter, I said:
It has been '0' days since I've wound up hating Debian's choice to sign package metadata instead of packages (or perhaps 'in addition to' these days). Why? Because it makes it much more difficult to support 'install a package, satisfying dependencies from this directory of debs'.
Naturally there is a story here.
We have some Linux fileservers running Ubuntu, and we are very controlled about upgrading their kernel versions (partly because of mysterious crashes). We have a new kernel version that's proven on our test fileserver and our most recently built fileserver (which is itself a story), and we're looking at upgrading the other fileservers to that kernel. However, this kernel is not the most recent Ubuntu kernel; it's sufficiently old that it's no longer in the official Ubuntu package repositories.
We have our own local Ubuntu mirror, where we never delete packages, and it has all of the many linux-* packages and meta-packages required. However, we can't just do 'apt-get install linux-generic=...' and get all of those packages. Because these older Linux kernels aren't in the Ubuntu official package repositories, they're not in the official repository index files. Because these index files are signed, our mirror can't just rebuild them to reflect the full set of packages we have available. Although we have these files available on our mirror, we can't use them, at least not directly.
Similarly, I suspect this fundamental assumption of signed index files (or at least the existence of index files) is part of why I don't think any dpkg frontend has an option to just get packages and dependencies from a directory you supply. You can 'dpkg -i *.deb' for everything in a directory, but that requires you to carefully curate the directory to have absolutely everything required, and Ubuntu kernels come in a rather large number of packages.
(If there is a command line frontend that supports this, I would like to know about it. I don't count dropping .debs into /var/cache/apt/archives for apt, although I've read that it actually works.)
You don't really have this problem on RPM based systems like CentOS. Since all RPM packages themselves are signed, signed metadata isn't as important and tools like dnf are generally happy to work with a pool of RPMs in a directory.
(Note that unsigned repository metadata opens you up to some attacks, so you definitely want to sign it if possible. It's also safe to generate your own local unsigned repository metadata, since you generally trust yourself.)
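One hedged sketch of this local-metadata workaround (the paths are placeholders, and this assumes the dpkg-dev tools are installed) is to generate your own unsigned Packages index for a directory of debs and point apt at it as a trusted local repository:

```
# Build an uncompressed Packages index for the debs in this directory
# (dpkg-scanpackages is part of dpkg-dev; --multiversion keeps all
# versions of a package instead of only the newest).
cd /srv/local-debs
dpkg-scanpackages --multiversion . > Packages

# /etc/apt/sources.list.d/local-debs.list
# [trusted=yes] tells apt to accept this repository without signatures.
deb [trusted=yes] file:/srv/local-debs ./

# Then 'apt-get update' and install with dependency resolution:
apt-get install linux-generic=<version>
```

Since you generated this metadata yourself on your own machine, the lack of a signature is a much smaller issue than trusting unsigned metadata from over the network.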
(See also my wish to be able to easily update the packages on an Ubuntu ISO image, which also runs into this issue.)