Wandering Thoughts archives

2020-08-20

What you're looking for with a Grafana dashboard affects its settings

Recently I wrote about how we choose our time intervals in our Grafana dashboards, where the answer is that we mostly use $__interval because for our purposes it's the best option. But this raises the question of what our purpose with our dashboards actually is. Put another way, why don't we care about seeing brief spikes in our dashboards?

Broadly speaking, I think that dashboards can be there to look for signs of obvious issues, to look for signs of subtle issues, or to diagnose problems in detail (when you already know there's an issue and you're trying to understand what's going on). Pretty much all of our dashboards are for some combination of the first and the last, and we don't normally go looking for subtle issues.

(The flipside of looking for signs of obvious issues is reassuring you that there are no obvious issues right now. From a cynical perspective, this may be the purpose of a lot of overview dashboards.)

When you're looking for obvious issues, broad overviews are generally fine. If you have periodic very short usage spikes but nothing shows up on a larger scale, you almost certainly don't have an obvious issue. Similarly, showing very short usage spikes on a broad overview graph isn't necessarily useful unless you believe those spikes are the sign of a larger issue. As a result, you might as well use $__interval even though it makes short-term spikes disappear when you're looking at longer time periods.
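To put rough (and purely illustrative) numbers on that, assuming a graph panel that plots on the order of a thousand points:

  30 minute range:  $__interval is 1800s / 1000 points, rounded up to the
                    minimum interval (say 15s), so a 30 second spike fills
                    one or two whole points and stands out.
  7 day range:      $__interval is 604800s / 1000 points, about 10 minutes,
                    so the same 30 second spike is averaged into a single
                    10 minute point and shrinks to roughly 1/20th of its
                    true height.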

When you're trying to diagnose problems in detail, you already know something is going on and you're probably looking at fine time scales around specific times of interest. At fine time scales, a properly set up Grafana dashboard will show you all of the information available, including fine-grained spikes, because covering only a small time range means it uses a very short $__interval. This is certainly my experience with our dashboards, where I often wind up looking at only five or ten minute time windows in order to try to really understand what was going on at some point.

Looking for subtle issues is an interesting challenge in dashboard design. I suspect it's hard to do without knowing a fair bit about how your environment is supposed to behave (or at least believing that you do). At this point it's not something that I'm doing very much of in our dashboard design (although I've sort of done some of it).

(See also the problem of paying too much attention to our dashboards.)

DashboardsWhatForAndSettings written at 23:57:28

2020-08-16

"It works on my laptop" is a blame game

There is an infamous dialog between developers and operations teams (eg) where the core of the exchange is the developer saying "it works on my laptop" and then the operations team saying "well, pack up your laptop, it's going into production". Sometimes this is reframed as the developer saying "it works on my laptop, deploy it to production". One of many ways to understand this exchange is as a game of who is to blame for production issues.

When the developer says "well it works on my laptop", they're implicitly saying "you operations people screwed up when deploying it". When the operations people say "well pack up your laptop", they're implicitly saying in return "no we didn't, you screwed it up one way or another; either it didn't work or you didn't prepare it for deployment". The developer is trying to push blame to operations and operations is trying to push blame back.

(This exchange is perpetually darkly funny to system administrators because we often feel that we're taking the fall for what are actually other people's problems, and in this exchange the operations people get to push back.)

But the important thing here is that this is a social problem, just like any blame game. Sometimes this is because people higher up will punish someone (implicitly or explicitly) for the issue, and sometimes this is because incentives aren't aligned (which can lead to DevOps as a way to deal with the blame problem).

(This isn't the only thing that DevOps can be for.)

Playing the blame game in real life instead of in funny Internet jokes isn't productive; it's a problem. If your organization is having this dialog for real, it has multiple issues and you're probably going to get caught in the fallout.

(I almost wrote 'you have multiple issues', but it's not your problem, it's the organization's. Unless you're very highly placed, you can't fix these organizational problems, because they point to deep cultural issues in how developers and system administrators view each other, interact with each other, and probably how they're rewarded.)

Realizing this makes the "it works on my laptop" thing a little less funny and amusing to me, and a bit sadder and darker than it was before.

BlameAndWorksOnMyLaptop written at 00:00:42

2020-08-07

How we choose our time intervals in our Grafana dashboards

In a comment on my entry on our Prometheus and Grafana setup, trallnag asked a good question:

Would you mind sharing your concrete approach to setting the time intervals for functions like rate() and increase()?

This is a good question, because trallnag goes on to cover why this is an issue you may want to think about:

I tend to switch between using $__interval, completely fixed values like 5m or a Grafana interval variable with multiple interval to choose from. None are perfect and all fail in certain circumstances, ranging from missing spikes with $__interval to under or oversampling with custom intervals.

The very simple answer is that so far I've universally used $__interval, which is Grafana's templating variable for 'whatever the step is on this graph given the time scale you're currently covering'. Using $__interval means that your graph is (theoretically) continuous but without oversampling; every moment in time is used for one and only one graph point.
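As a concrete illustration (the metric here is just a stand-in node_exporter one, not anything specific from our dashboards), a typical graph expression looks like:

  rate(node_network_receive_bytes_total[$__interval])

Grafana substitutes the current graph step for $__interval when it runs the query, so every plotted point summarizes exactly one step's worth of samples whether the dashboard is showing the last hour or the last month.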

The more complete answer is that we use $__interval but often tell Grafana that there is a minimum interval for the query that is usually slightly larger than how often we generate the metric. When you use rate(), increase(), and their kin, you need to make sure that your interval always has at least two metric points, otherwise they give you no value and your graphs look funny. Since we're using variable intervals, we have to set the minimum interval.
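Concretely, the combination looks something like this sketch, where the 15 second scrape interval and 30 second minimum are illustrative numbers rather than our exact settings and the counter name is made up:

  # Grafana query options: Min interval = 30s
  # (Prometheus scrapes this metric every 15s, so a window of at least
  #  30s always contains two samples for rate() to work with.)
  rate(some_requests_total[$__interval])

With the minimum set, Grafana never substitutes anything shorter than 30s for $__interval, no matter how narrow the time range or how wide the graph.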

In a few graphs I've experimented with combining rate() and irate() with an or clause:

rate( ...[$__interval] ) or
   irate( ...[4m] )

The idea here is that if the interval is too short to get two metric points, the rate() will generate nothing and we fall through to irate(), which will give us the rate across the two most recent metric points (see rate() versus irate()). Unfortunately, this is both annoying to write (since you have to repeat your metric condition) and inefficient (since Prometheus will always evaluate both the rate() and the irate()), so I've mostly abandoned it.

The high level answer is that we use $__interval because I don't have a reason to make things more complicated. Our Grafana dashboards are for overviews (even detailed overviews), not narrow troubleshooting, and I feel that for this a continuous graph is generally the most useful. It's certainly the easiest to make work at both small and large timescales (including ones like 'the last week'). We're also in the position where we don't care specifically about the rate of anything over a fixed interval (eg, 'the error rate in the last 5 minutes should be under ...'), and probably don't care about momentary spikes, especially when we're using a large time range with a dashboard.

(Over a small time range, a continuous graph of rate() will show you all of the spikes and dips. Or you can go into Grafana's 'Explore' and switch to irate() over a fixed, large enough interval.)

If we wanted to always see short spikes (or dips) even on dashboards covering larger time ranges, we'd have to use the more complicated approach I covered in using Prometheus subqueries to look for spikes in rates. There's no clever choice of interval in Grafana that will get you out of this for all time ranges and situations, and Prometheus currently has no way to find these spikes or dips short of writing out the subquery. Going down this road also requires figuring out whether you care about spikes, dips, or both, and if it's both, how to represent them on a dashboard graph without overloading it (and yourself).

(Also, the metrics we generally graph with rate() are things that we expect to periodically have short term spikes (often to saturation, for things like CPU usage and network bandwidth). A dashboard calling out that these spikes happened would likely be too noisy to be useful.)
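For the record, the subquery approach looks something like the following sketch; the metric and the one minute inner resolution are illustrative choices, not a recommendation:

  # Plot the highest one-minute rate seen within each graph step,
  # instead of the average rate over the whole step.
  # (This assumes $__interval is at least 1m; on narrow time ranges
  #  you would clamp it with a Grafana minimum interval.)
  max_over_time( rate(node_network_receive_bytes_total[1m])[$__interval:1m] )

A min_over_time() version would catch dips instead, and graphing either of them alongside the plain rate() is exactly where the 'how do you show this without overloading the graph' problem comes in.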

PS: This starts exposing the broader issue of what your Grafana dashboards are for, but that's another entry.

GrafanaOurIntervalSettings written at 22:06:10

2020-08-03

Exim's change to 'taint' some Exim variables is going to cause us pain

Exim is a very flexible mail system (aka an MTA, Mail Transfer Agent), to the extent that in its raw state Exim is more a mailer construction kit than a mailer (if you want a simple mailer, consider Postfix). You can use this power for a lot of things, like building simple mailing lists, where a mailing list is created by putting a file of addresses in a specific directory (the name of the file being the name of the mailing list).

This flexibility and power can create security issues, for example when you directly use information from the incoming mail message (information that's under the control of the sender) to open a file in a directory. If this isn't carefully controlled, an attacker who knows enough about your Exim configuration could possibly make you open files you don't intend to, like '../../../etc/passwd'.

(This is a standard risk when using information that's ultimately provided by an attacker.)

For a long time, Exim left it up to whoever wrote your Exim configuration file to worry about this. It was on them to do input validation to make sure that /cs/lists/$local_part would never have anything dangerous in it. Recently the Exim developers decided that this was not sufficient and introduced the idea of 'tainted data', which isn't allowed to be used in various places (especially, as part of a filename that will be opened or accessed). Things that are under the control of a potential attacker, such as the local part or the domain of an address, are tainted.

Unfortunately, there are a lot of places where it's traditionally been natural to use the Exim $local_part string variable as part of file access, and $local_part is now tainted, which makes those uses forbidden. Specifically, we have various places in the Exim configurations for several mail machines that use it. These uses are safe in our environment because we only make use of $local_part after it's been verified to exist in our generated list of valid local addresses, but Exim isn't smart enough to know that they're safe. Instead there are various ways to de-taint strings (eg, also, also, also), which from one perspective are a set of artificial hoops you now have to jump through to pacify Exim. Some of these options for de-tainting are backward compatible with versions of Exim before tainting was introduced, but generally the compatible ways are more awkward than the best modern ways.

People, us included, who upgrade to a version of Exim that includes tainting will have to go through their Exim configuration files and revise them to de-taint various things the configuration needs to use. For us, this has to happen for any upgrade of our mail machines to Ubuntu 20.04; 20.04 has a version of Exim with tainting, while the Exim versions in 18.04 and 16.04 are pre-tainting ones. This means that upgrading any of our mail machines to Ubuntu 20.04 needs configuration changes, and some of these configuration changes may not be backward compatible. I think I can find all of the places where our Exim configurations might use tainted data, but I'm not completely confident of that; if I miss one, we're going to experience Exim errors and failures to properly process some email in production.

This is going to be a little bit painful. I'm not looking forward to it, especially as it is yet another case of 'do more work to wind up in exactly the same place'.

(There's an obvious better way for the Exim people to have done this transition to tainted data, but it would have been slower and meant that Exim remained insecure by default for longer.)

PS: We're at least better off than the people on CentOS using EPEL, who apparently got a 'tainted data' version of Exim just dropped on them as a regular package update (cf).

EximTaintingPain written at 23:19:49

