2020-03-19
Make sure to keep useful labels in your Prometheus alert rules
Suppose, not entirely hypothetically, that you have some metrics that
are broken out across categories but what you care about is the total
number of things together. For example, you're monitoring some OpenBSD
firewalls and you care about the total number of PF states, but your
metrics break them down by protocol (this information is available
in 'pfctl -ss' output). So your alert rule is going to be something like:
- alert: TooManyStates
  expr: sum( pfctl_protocol_entries ) by (server) > 80000
  ....
Congratulations, you may have just aimed a gun at your own foot.
If you have additional labels on that pfctl_protocol_entries
metric that you may want to use in the alert that will result from
this (perhaps the datacenter or some other metadata), you've just
lost them. When you said 'sum(...) by (server)', Prometheus
faithfully did what you said; it summed everything by the server
and as part of that threw away all other labels, because you told
it all that mattered was the 'server' label.
There are two ways around this. The obvious, simple way that you
may reach for in your haste to fix this issue is to add the additional
metadata label or labels that you care about to the 'by()'
expression, so you have, eg, 'sum(...) by (server, datacenter)'.
The problem with this is that you're playing whack-a-mole, having
to add each additional label to the list of labels as you remember
them (or discover problems because they're missing). The better
way is to be explicit about what you want to ignore:
sum( pfctl_protocol_entries ) without (proto)
This will automatically pass through all other labels, including ones that you add six months from now as part of a metrics reorganization (long after you forgot that 'sum(..) by (...)' special case in one of your alert rules).
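To make the difference concrete, here is a toy model in Go of how the two forms decide which labels survive. This is only a sketch under my own assumptions, not how Prometheus implements aggregation: the sample type, the sumBy and sumWithout helpers, the label values ('fw1', 'east'), and the numbers are all made up for illustration, and the model ignores the metric name entirely.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is a hypothetical stand-in for one series value with its labels.
type sample struct {
	labels map[string]string
	value  float64
}

// groupKey renders the labels that survive aggregation as a sorted "k=v,k=v" string.
func groupKey(labels map[string]string, keep func(name string) bool) string {
	var parts []string
	for name, val := range labels {
		if keep(name) {
			parts = append(parts, name+"="+val)
		}
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// contains reports whether name is in names.
func contains(names []string, name string) bool {
	for _, n := range names {
		if n == name {
			return true
		}
	}
	return false
}

// sumBy mimics 'sum(...) by (names)': only the listed labels survive.
func sumBy(samples []sample, names ...string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		key := groupKey(s.labels, func(n string) bool { return contains(names, n) })
		out[key] += s.value
	}
	return out
}

// sumWithout mimics 'sum(...) without (names)': every label except the listed ones survives.
func sumWithout(samples []sample, names ...string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		key := groupKey(s.labels, func(n string) bool { return !contains(names, n) })
		out[key] += s.value
	}
	return out
}

// demoSamples returns made-up per-protocol state counts for one firewall.
func demoSamples() []sample {
	return []sample{
		{map[string]string{"server": "fw1", "datacenter": "east", "proto": "tcp"}, 50000},
		{map[string]string{"server": "fw1", "datacenter": "east", "proto": "udp"}, 40000},
	}
}

func main() {
	fmt.Println(sumBy(demoSamples(), "server"))     // the datacenter label is gone
	fmt.Println(sumWithout(demoSamples(), "proto")) // the datacenter label is kept
}
```

Both calls produce the same total, but only the 'without' version still carries the datacenter label for the alert to use.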
After this experience, I've come to think that doing aggregation using 'by (...)' in your alert rules (or recording rules) is potentially dangerous and ought to at least be scrutinized carefully and probably commented. Sometimes there are good reasons for it where you want to narrow down to a known set of common labels or the like, but otherwise it is a potential trap even if it works for your setup today.
Sorting out Go's 'for ... = range ..' and when it copies things
I recently read "Some tricks and tips for using for range in GoLang", where it said, somewhat in passing:
[...] As explained earlier, when the loop begins, it will copy the original array to a new one and loop through the elements, hence when appending elements to the original array, the copied array actually doesn't change.
My eyebrows went up because I'd forgotten this little bit of Go, and I promptly scuttled off to the official specification to read and understand the details. So here are some notes, because the issues behind this turn out to be more interesting than I expected.
Let's start with the basic form, which is 'for ... := range a { ... }'. The expression to the right of the 'range' is called the range expression. The specification says (emphasis mine):
The range expression x is evaluated once before beginning the loop, with one exception: if at most one iteration variable is present and len(x) is constant, the range expression is not evaluated.
Obviously if the range expression is a function call, the function
call must be made (once) and then the return value used in the range
expression. However, in Go even evaluating an expression that's a
single variable produces a copy of the value of that variable (in
the abstract; in the concrete the compiler may optimize this out).
So when you write 'for a, b := range c', Go (nominally) evaluates
c and uses the resulting copy of c's current value.
(Among other consequences, this means that assigning a different
value to c itself inside the loop doesn't change what the loop
does; c's value is frozen at the start, when it's evaluated.)
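As a minimal sketch of that frozen value, the following reassigns the ranged-over variable inside the loop. The function name visited and the sample values are mine, purely for illustration.

```go
package main

import "fmt"

// visited ranges over the slice c and reassigns c itself inside the loop.
// It returns the values the loop actually saw.
func visited() []int {
	c := []int{1, 2, 3}
	var seen []int
	for _, v := range c {
		// This changes the variable c, but the loop keeps using the
		// value c had when the range expression was evaluated.
		c = []int{100, 200, 300, 400, 500}
		seen = append(seen, v)
	}
	return seen
}

func main() {
	fmt.Println(visited()) // prints [1 2 3]
}
```

The loop runs three times over the original values, even though c is a different, five-element slice from the first iteration onward.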
As the additional bit of the specification explains, this doesn't
happen if you use at most one iteration variable and you're ranging
over one of the small number of things where len(x) is a constant
(the rules for this are somewhat legalistic). If you use two
iteration variables, you always evaluate the range expression and
make a copy, which is another reason for Go to prefer the single
variable version (to go with nudging you to not copy actual values
unless necessary).
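One way to see the 'not evaluated' exception in action is to range over a nil pointer to an array using only the index. This is a small sketch with a made-up function name (countIndexes); since len(p) is a constant for a pointer to an array, the loop never has to look at what p points to.

```go
package main

import "fmt"

// countIndexes ranges over a nil pointer to a [5]int using a single
// iteration variable and returns how many iterations ran.
func countIndexes() int {
	var p *[5]int // nil, and never dereferenced below
	n := 0
	for i := range p {
		_ = i
		n++
	}
	return n
}

func main() {
	fmt.Println(countIndexes()) // prints 5
}
```

Switching the loop to 'for i, v := range p' would need the array elements, and so would panic on the nil pointer at run time.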
However, things get tricky if you use pointers. Here:
a := [5]int{1, 2, 3, 4, 5}
for _, v := range a {
    a[3] = 10
    fmt.Println("Pass 1:", v)
}

// reset our mutation
a[3] = 4

// loop via pointer:
b := &a
for _, v := range b {
    b[3] = 10
    fmt.Println("Pass 2:", v)
}
In the second loop, what gets copied when the range expression is
evaluated is the pointer, not the array it points to (note that b
is not a slice, it's a pointer to an array). Go's implicit dereferencing
of pointers means that the code for the two loops looks exactly the
same, although they behave differently (the first prints the original
array values before the mutation in the loop, the second mutates
'a[3]' before printing it).
On the one hand, this may be confusing. On the other hand, this provides a way to effectively sidestep all sorts of range expression copying (if you don't want it); all you have to do is pointerize your range expression, and almost nothing will care about the difference. Fortunately often you don't care about the copying to begin with, because making copies of strings, slices, and maps doesn't require copying the underlying data. The only thing that you can range over that's expensive to copy is an actual array, and directly using actual arrays in Go is relatively rare (especially when using real arrays can cause interesting errors).
If you do a 'copying' range over anything other than a real array (which is copied) or a string (which is immutable), you can still mutate the values from what you're ranging over in your range loop in a way that future iterations of your range loop will or at least may see. Probably you don't want to do this.
(This is the consequence of ranging over slices and maps not making a copy of the underlying data. Because your range copies the slice itself, shrinking or enlarging the original slice won't change the number of iterations. You can potentially change the number of iterations of a map inside of the loop, though.)
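Here is a small sketch of those three cases. The function names (mutateAhead, growInside, shrinkMap) and the values are mine, chosen only to illustrate the behavior described above.

```go
package main

import "fmt"

// mutateAhead overwrites the next element on each pass. Later iterations
// see the new values because the copied slice shares the underlying array.
func mutateAhead() []int {
	s := []int{1, 2, 3}
	var seen []int
	for i, v := range s {
		if i+1 < len(s) {
			s[i+1] = v * 10
		}
		seen = append(seen, v)
	}
	return seen
}

// growInside appends to the slice on every pass. The number of iterations
// is fixed by the slice value copied when the loop started.
func growInside() (iterations, finalLen int) {
	s := []int{1, 2, 3}
	for range s {
		s = append(s, 0)
		iterations++
	}
	return iterations, len(s)
}

// shrinkMap deletes every other key during the first pass, so the entries
// that have not been reached yet are never produced.
func shrinkMap() int {
	m := map[string]int{"a": 1, "b": 2, "c": 3}
	n := 0
	for k := range m {
		for other := range m {
			if other != k {
				delete(m, other)
			}
		}
		n++
	}
	return n
}

func main() {
	fmt.Println(mutateAhead()) // prints [1 10 100]
	fmt.Println(growInside())  // prints 3 6
	fmt.Println(shrinkMap())   // prints 1
}
```

The slice loops always run three times, whatever happens to the slice variable, while the map loop ends early once its remaining entries are gone.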
Probably I don't need to care about this range copying, at least from an efficiency perspective (I had better remember its other consequences). My Go code (and Go in general) only very rarely uses fixed size arrays, which are the only expensive thing to copy. Copying slices and maps is pretty close to free, and those are usually what I range over (apart from channels, which I consider a special case).