2023-06-03
Unix is not POSIX
GNU Grep's defense of its decision to ruin and then drop the 'fgrep' and 'egrep' commands is that these commands aren't in POSIX. There are a number of problems with this, but one of them is that Unix is not POSIX (and conversely, POSIX is not Unix). In practice, POSIX only partially overlaps with modern Unixes, so the fact that something isn't in POSIX isn't any sort of reason for a Unix not to have it (and never has been). Unixes have always had many things that aren't in POSIX.
(One obvious case where POSIX was less than Unix from the start is its omission of egrep and fgrep, which had been in all of the Unix families for years at the time POSIX was being written. The Rationale section of POSIX grep makes it clear that the people writing POSIX were fully aware of this and didn't care.)
POSIX is primarily a subset of Unix as of the late 1980s, with a certain amount of additional invention (some of which was good and some of which has been carefully dumped in a hole and forgotten). Not everything that we consider to be Unix is in POSIX, sometimes for good reason and sometimes not (this is true for commands, C library functions, the filesystem layout, and behavior). Since POSIX was mostly documenting (some) existing practices, it periodically made compromises or picked winners between the two dominant strands of Unix at the time (BSD and System V), and not all of those choices have proven to be particularly popular or durable.
(For example, your 'du' command almost certainly doesn't conform to POSIX in practice, because POSIX du reports in 512-byte units by default, and it turns out no one likes that once they have a choice.)
What people consider to be Unix and expect from any Unix system goes well beyond what POSIX requires. A system limited to things that appear in POSIX and that behave as POSIX specifies would likely be thin gruel for most people (as well as behave in surprising ways). Consider, for example, the limited list of POSIX commands (which also includes any number of commands that no common Unix implements today), or you can peruse the full Single Unix Specification for C functions and other items of interest.
(Also, a POSIX compliant system doesn't have to look anything like you'd expect from a Unix system. There's no requirement for POSIX commands to be in /bin or /usr/bin, for example (cf), and not really many requirements for what files and directories must be present. To the best of my knowledge, this was a deliberate choice on POSIX's part for multiple reasons, including allowing non-Unix systems to be POSIX compliant (cf).)
2023-06-02
GNU Grep versus the (Linux) open source ecology
One of the changes in GNU Grep 3.8 was, to quote this release notice (also the GNU Grep 3.8 release NEWS):
The egrep and fgrep commands, which have been deprecated since release 2.5.3 (2007), now warn that they are obsolescent and should be replaced by grep -E and grep -F.
GNU Grep's fgrep and egrep commands were already shell scripts that ran 'grep -F' or 'grep -E', so this change amounted to adding an echo to them (to standard error). Many Linux distributions immediately reverted this change (for example, Debian), but Fedora did not and so Fedora 38 eventually shipped with Grep 3.8. Fedora 38 also shipped with any number of open source packages that contain installed scripts that use 'fgrep' and 'egrep' (cf what I found on my machine), and likely more of its packages use those commands in their build scripts.
(There are reports of build failures in Gentoo (via).)
Since adding warnings and other new messages is a breaking API change, all of these packages are now broken in Fedora and by extension any other Linux distribution that packages them, uses GNU Grep 3.8 or later, and hasn't reverted this change. Some of them are only slightly broken; others, which either inspect their standard error or operate in a context where other programs expect to see (or not see) certain things, are more seriously affected. To repair this breakage, all of these packages need to be changed to use 'grep -F' and 'grep -E' instead of fgrep and egrep.
This change is pointless make-work inflicted on the broad open source ecology by GNU Grep. GNU Grep's decision to cause these long-standing commands to emit new messages requires everyone else to go through making changes in order to return to the status quo. This is exactly the same kind of make-work as other pointless API changes, and just like them it's hostile to the broad open source ecology.
(It's also hostile to actual people, but that's another topic.)
You may be tempted to say 'but it's a small change'. There are two answers. First, a small change multiplied by a large number of open source projects is a lot of work overall. Second, that this is a make-work change at all is GNU Grep deciding that other projects don't matter that much. This decision is hostile to the wider open source ecology as a matter of principle. It's especially hostile given that any number of open source projects are at best dormant, although still perfectly functional, and thus not likely to make any changes, and other open source projects will likely tell GNU Grep to get bent and not change (after all, even Linux distributions are rejecting this GNU Grep change).
Due to how Linux distribution packaging generally works, it would actually have been less harmful for the overall Linux distribution ecology if GNU Grep had simply dropped their 'fgrep' and 'egrep' cover scripts. If they had done so, Linux distributions would most likely have shipped their own cover scripts (without warnings) as additional packages; instead, GNU Grep has forced Linux distributions to patch GNU Grep itself.
PS: While GNU Grep is in theory not Linux specific, in practice only Linux uses GNU Grep. Other open source Unixes have their own versions of the grep suite, and this GNU Grep change isn't going to encourage them to switch.
(I had a string of Fediverse reactions to this change when I upgraded to Fedora 38 on my work machine. Also, when GNU Grep released 3.8 last fall I wrote about how we're stuck with egrep and fgrep.)
2023-06-01
Capturing data you need later when using bpftrace
When using bpftrace, it's pretty common that not all of the data you want to report on is available in one spot, at least when you have to trace kernel functions instead of tracepoints. When this comes up, there is a common pattern that you can use to temporarily capture the data for later use. To summarize this pattern, it's to save the information in an associative array that's indexed by the thread id to create a per-thread variable. If you have more than one piece of information to save, you use more than one associative array.
Let's start with the simplest case; let's suppose that you need both a function's argument (available when it's entered) and its return value (so you can, for example, report only on successful calls). Then the pattern looks like this:
kprobe:afunction
{
    // record argument into @arg0
    // under our thread id (tid)
    @arg0[tid] = (struct something *)arg0;
}

// only act if we have the argument
// recorded
kretprobe:afunction
/@arg0[tid] != 0/
{
    $arg = @arg0[tid];
    printf(...., $arg);  // or whatever
    // clean up recorded argument
    delete(@arg0[tid]);
}
This example shows all of the common pieces. At the start, we capture the function argument we care about into an associative array that's indexed by the current thread ID (using the tid builtin variable); then, provided that we have a recorded argument, we use it when the function returns. At the end, we clean up our associative array by deleting our entry from it; if we didn't do this, we might have an ever-growing associative array (or arrays) as different threads called the function we're tracing. Incidentally, one time we might invoke the kretprobe probe without the argument recorded is if we start tracing while an existing invocation of the function is in flight (which may be especially likely for functions that take a while, such as handling an NFS request and reply).
(This pattern is so common it's mentioned in the documentation as a per-thread variable. Note that the documentation's example delete()s the per-thread entry just as I do here.)
The reason we didn't use a simple global variable, as I did when I was recording ZFS's idea of available memory (in another bpftrace trick) is that multiple threads may be calling this function at the same time, and if they are, using a single global variable is obviously going to give us bad results.
Another case that often comes up is that the function we want to trace directly or indirectly calls another function that looks up important information, for example to map some opaque identifier into a more useful piece of data (a string, a structure) and return it. A variant of this is where the function will generate the information we want through a process that we can't hook into, but will then call another function to validate it or act on it, at which point we can grab the data. The full version of this pattern looks something like this:
// set a marker so we know to save info
kprobe:afunction
{
    @aflag[tid] = 1;
}

// if we're marked, save the information
kprobe:subfunction
/@aflag[tid] != 0/
{
    @magicarg[tid] = arg0;
}

// if we have saved information, use it
// and clear it
kretprobe:afunction
/@magicarg[tid] != 0/
{
    .... do whatever ...
    delete(@magicarg[tid]);
}

// clear the marker
kretprobe:afunction
/@aflag[tid] != 0/
{
    delete(@aflag[tid]);
}
One reason we need to set a marker and only save the subfunction's information if we're marked is that the marker is our guarantee that the saved information will be cleared later. If we unconditionally saved the information whenever subfunction() was called but only cleared it when afunction() returned, that would lead to a slow growth of dead @magicarg entries whenever subfunction() is called from anywhere other than afunction().
A variant on this is if our 'subfunction' is actually a peer function to our function of interest (and gets called before it), with both being called from a containing function. The pattern here is more elaborate; the containing function sets the marker and must clean up everything, with the subfunction and our function saving and using the information.
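A minimal sketch of this variant might look like the following; 'containerfunc', 'peerfunc', and 'ourfunc' are placeholder names in the same spirit as 'afunction' and 'subfunction' above:

// the containing function sets the marker ...
kprobe:containerfunc
{
    @aflag[tid] = 1;
}

// ... the peer function saves the information if we're marked ...
kprobe:peerfunc
/@aflag[tid] != 0/
{
    @magicarg[tid] = arg0;
}

// ... our function of interest uses the saved information ...
kprobe:ourfunc
/@magicarg[tid] != 0/
{
    .... do whatever with @magicarg[tid] ...
}

// ... and the containing function cleans up everything.
kretprobe:containerfunc
/@aflag[tid] != 0/
{
    delete(@aflag[tid]);
    delete(@magicarg[tid]);
}

Having the containing function's kretprobe delete both entries is what keeps the maps from slowly growing if peerfunc() or ourfunc() don't get called on some code path (deleting a map entry that was never created is harmless).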
Sidebar: Tracking currently active requests/etc in bpftrace
In DTrace, the traditional way to keep a running count of something (such as how many threads were active inside afunction()) was to use a map with a fixed key that was incremented with sum(1) and decremented with sum(-1) (see map functions), with the decrement generally guarded so that you knew a matching increment had been done. Although I haven't tested it, the bpftrace documentation on the ++ and -- operators seems to imply that these are safe to use on at least maps with keys (including constant keys), and perhaps global variables in general. Even if you have to use maps, this is at least clearer than the sum() version.
(You'll want to guard the decrement even if you use --.)
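To make the shape of this concrete, here's a minimal sketch of the guarded sum() counting pattern, again using 'afunction' as a placeholder name; the same structure works with ++ and -- in place of sum() if they're as safe as the documentation implies:

// count a thread in, and mark that we did so
kprobe:afunction
{
    @active["afunction"] = sum(1);
    @entered[tid] = 1;
}

// only count a thread out if we counted it in
kretprobe:afunction
/@entered[tid] != 0/
{
    @active["afunction"] = sum(-1);
    delete(@entered[tid]);
}

bpftrace prints @active when the script exits, or you can print() it from an interval probe to watch the count over time.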
2023-05-31
DNSSEC failures are how you get people to disable DNSSEC
The news of the time interval is that the people in charge of the New Zealand country zones (things directly under .nz) fumbled a DNSSEC key (KSK) rollover in such a way as to break DNSSEC resolution for those domains (see DNSSEC chain validation issue for .nz domains, this news article, and more). The suggested resolution to return these domains to working DNSSEC was for all of the people running DNSSEC validating resolvers to flush the zone information for everything under .nz. Or you could wait for things to time out in a day or two.
You know what else you could do in your DNSSEC validating resolver to fix this and other future DNSSEC 'we shot ourselves in the foot' moments? That's right: you could disable DNSSEC validation entirely. The corollary is that every prominent DNSSEC failure is another push for people operating resolvers to give up on the whole set of complexity and hassles.
Some people are required to operate DNSSEC validating resolvers, and others are strongly committed to it (and are so far willing to pay the costs of doing so in staff time, people's complaints, and so on). But other people are not so committed, and so the more big DNSSEC failures there are, the more of them are going to solve the problem once and for all by dropping out. And then DNSSEC becomes that much harder to adopt widely even if you think it's a good idea.
(As for whether DNSSEC is a useful idea, see for example this RIPE86 slide deck by Geoff Huston, via, also.)
An additional contributing factor to this dynamic is that attacks that are (or would be) stopped by DNSSEC seem relatively uncommon these days. In practice, for almost all people and almost all of the time, it seems to be that a DNSSEC validation failure happens because a zone operator screwed up. This gives us the security alert problem, where the typical person's experience is dominated by false positives that just get in their way.
PS: At this point it's probably too late to fix the core problem, since DNSSEC is already designed and deployed, and my impression is that it has low protocol agility (the ability to readily change). Exhorting people to not screw up things like DNSSEC KSK rollovers clearly hasn't worked, so the only real solution would be better ways to automatically recover from them. Maybe there are practical changes that could be made to resolving DNS servers to work around the issue, for example heuristics that trigger automatically flushing and re-fetching zones.
2023-05-30
Some tricks for getting the data you need when using bpftrace
When I talked about drgn versus bpftrace, I mentioned that one issue with bpftrace is that it doesn't have much access to global variables in the kernel (and things that they point to); at the moment it seems that bpftrace can only access (some) global variables in the main kernel, and not global variables in modules. However, often the information you may want to get is in module global variables, for example the NFS locks that the kernel NFS server is tracking or important state variables for changes in the ZFS ARC target size. When you want to get at these, you need to resort to a number of tricks, which all boil down to one idea: you find a place where what you want to know is exposed as a function argument or a function return value, because bpftrace has access to both of those.
(All of this means that you're going to need to read the kernel source, specifically the kernel source for the version of the kernel you're using, since the internal kernel structure changes over time.)
If you're really lucky, a function or kernel tracepoint that you already want to track will be passed the information you're interested in. This is unfortunately relatively rare, probably because there's usually no point in passing in an argument that's already available as a global variable.
Sometimes, you'll be able to find something that is called once on each item in a complex global data structure, which will let you indirectly see that global data structure. This was the case with bpftrace dumping of NFS lock clients, which also illustrates that you may need to do something to trigger this traversal (here, reading from /proc/locks). In general, files in /proc often have a kernel function that will produce one line of them and are given as an argument something they're reporting about.
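As a sketch of the shape of this (the function name, its arguments, and the structure here are all made up; the real ones depend on your kernel version and what you're after), suppose that each line of some /proc file is produced by a kernel function that gets handed the item being reported on:

// Hypothetical: each line of /proc/<something> is produced by
// 'show_one_item()', which gets the item it reports on as its
// second argument.
kprobe:show_one_item
{
    // with kernel BTF we can cast to the real structure type
    // and pull fields out of it
    $item = (struct some_item *)arg1;
    printf("%s\n", str($item->name));
}

With something like this running, you trigger the traversal by reading the /proc file (with cat or similar) in another window, and the probe fires once for each item.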
Some kernel code is generalized by calling a function to obtain information that's effectively from a global variable (or something close to it). For example, ZFS on Linux has an idea of 'memory available to ZFS' that's a critical input to decisions on the ZFS ARC size, and this number is obtained by calling the function 'arc_available_memory()'. If we want to know this value in other functions (for example, the ZFS functions that decide about shrinking the ARC target size), we can capture the information for later use:
kretprobe:arc_available_memory
{
    $rv = (int64) retval;
    @arc_available_memory = $rv;
}
Here I'm capturing this information in a global bpftrace value, because it truly is a global piece of information. ZFS may call this function in many contexts, not just when thinking about shrinking the ARC target size, but all we care about is having it available later so the extra times we'll update our bpftrace global generally don't matter.
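As an illustration of using the captured value later, here's a sketch that assumes your ZFS version has an 'arc_reduce_target_size()' function that's called with the amount to shrink the ARC target by (the function name and its argument are assumptions here; check your ZFS source):

// report the last 'available memory' figure ZFS computed
// whenever it decides to reduce the ARC target size
// (note: @arc_available_memory reads as 0 until the kretprobe
// above has fired at least once)
kprobe:arc_reduce_target_size
{
    printf("reducing ARC target by %lld, arc_available_memory was %lld\n",
           (int64)arg0, @arc_available_memory);
}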
There are two unfortunate limitations of this approach, due to how the kernel is structured. First, some of what look like function calls in the kernel source code are actually #define'd macros in the kernel header files; you obviously can't hook into these with bpftrace. Second, some functions are inlined into their callers, often because they've specifically been marked as 'always inline'. These functions can't be traced either, which can be a pity because they're often exactly the sort of access functions that'd give us useful information.
(There are some general bpftrace techniques for picking up information that you want, but they're for another entry.)
PS: I believe that bpftrace can access CPU registers (and thus the stack) and can insert tracepoints inside functions, not just at their start. In theory with enough work this would allow you to get access to any value ever explicitly materialized at some point in a function (either in a register or in a local on the stack). In practice, this would be at best a desperation move; you'd have to disassemble code in your specific kernel to determine instruction offsets and other critical information in order to pull this off.
PPS: In theory with sufficient work you might be able to get access to module global variables in bpftrace. Their addresses are in /proc/kallsyms and I think you might be able to insert that address into a bpftrace script, then cast it to the relevant (pointer) type and dereference it. But this is untested and again I wouldn't want to do this in anything real.
2023-05-29
System administration's long slow march to configuration automation
Dan Luu recently asked about past and current computing productivity improvements that were so good that it was basically impossible for them not to get adopted. In my reply, I nominated configuration management (here):
From a system administration perspective, the move from hand-crafting systems to automating their setup (or as much of it as possible) feels both transformative and so obviously compelling to practitioners that you hardly have to sell the idea.
(The end point of that today is containerization and k8s/etc, but I'm not sure today's endpoint will persist.)
Although I claimed that you hardly had to sell the idea, in retrospect this is a bit overblown. Automated configuration management (at least of Unix machines) has spent a very long time slowly cooking away and slowly being adopted more and more broadly and more and more commonly. To see some of this, we can look at the initial release dates in Wikipedia's comparison of open-source configuration management. The oldest one listed, CFEngine, goes back to 1993, then there are some from the late 1990s, and then a flowering starting in the 2000s. If we take formal package management and automated installers as part of this, then those were present too by the late 1990s (in Linux distributions and elsewhere). And all of these ideas predate formal open source systems for them; people were trying to do package management and configuration management for Unix systems in the late 1980s, using hand-crafted and sometimes very complex systems (interested parties can trawl through old Usenix and LISA proceedings from that era, where various ones got written up).
One of the reasons that these things cropped up and keep cropping up is that the idea is so obviously appealing. Who among us really wants to manage systems and packages and so on by hand, especially across more than one machine? Very few system administrators actually like logging in to machine after machine to do something to them, and we famously script anything that moves, so the idea of automation effectively sells itself.
But despite all of this, automated configuration management as a practice didn't spread all that rapidly. For example, my memory is that the idea of 'pets versus cattle' and the related idea of being able to readily rebuild your machines only really took hold in the field when virtual machines and VM images started to become a thing in the late 2000s or early 2010s. Certainly many of the configuration management systems listed in Wikipedia date from around then (although Wikipedia's list may be subject to survivorship bias on the grounds that most people are interested in still-viable systems).
I'm not sure that anyone in the 2000s or even the late 1990s would have argued against the abstract idea of automated configuration management. However, I suspect that many people would (and did) argue or feel that either it wasn't necessary for them in their particular situation (for example, because they only had a few machines, and perhaps those machines had critical state such as filesystem data), or that the existing configuration management systems didn't really fit their needs and environment, or that the existing systems would be too much work to adopt relative to the potential payoff. Even people who wound up with a decent number of systems could be in a situation where they'd evolved partial local solutions that worked well enough, because they'd started out too small to use configuration management and then scaled up bit by bit, without ever hitting a cut-over point where their local tools fell over.
So my more nuanced view is that we've wound up in a situation where the appeal of automating system setup and operation is obvious and widely accepted, but the implementation of it still isn't. And where the implementation is widely accepted it's partly because people are using larger scale systems that don't give them a choice, like more or less immutable containers that must be built by automation and deployed through systems.
(Perhaps this mirrors the state of other things, like Continuous Integration (CI) build systems.)
2023-05-28
My current editor usage (as of mid 2023)
I use three (Unix) editors on a regular basis, and there's a story or two in that and how my editor usage has shifted over time. For me, the big shift has been that vim has become my default editor, the editor I normally use unless there's some special circumstance. One way to put it is that vim has become my editing path of least resistance. This shift isn't something that I would have predicted years ago (back then I definitely didn't particularly like vim), but looking back from today it feels almost inevitable.
Many years ago I wrote Why vi has become my sysadmin's editor, where I talked about how I used vi a lot as a sysadmin because it was always there (at the time it really was more or less vi, not vim, cf). Why vim has become my default editor is kind of an extension of that. Because I was using vi(m) all of the time for sysadmin things, learning new vim tricks (like windows or multi-file changes) had a high payoff since I could use them everywhere, any time I was using vim (and they pulled me into being a vim user, not a vi user). As I improved my vim skills and used it more and more, vim usage became more and more reflexive and vim was generally sufficient for the regular editing I wanted to do and usually the easiest thing to use. Then of course I also learned new vim tricks as part of regular editing, improving my sysadmin vim usage as well, all in a virtuous cycle of steadily increasing usage.
My vim setup is almost completely stock, because I work in too many different environments to try to keep a vim configuration in sync across them. If I customized my own vim environment very much, I would lose the virtuous cycle of going back and forth between my vim environment and the various other standard setups where I'm using vim because it's there and it works. I do customize my vim environment slightly, but it's mostly to turn off irritations.
My second most frequently used editor is my patched version of an X11 version of Rob Pike's sam editor. Sam is the editor that my exmh environment invokes when I use it to reply to email, and I still read and reply to much of my email in exmh. In theory it wouldn't be too hard to make exmh use vim instead (in an xterm); in practice, I like sam and I like still using it here. However, when I write or reply to email from the command line with NMH commands, I edit that email in vim. I sometimes use sam for other editing, but not very often, and sometimes I'm sad about this shift. I like sam; I just wish I liked it enough to stubbornly stick with it against the vim juggernaut.
My third most frequently used editor is GNU Emacs. GNU Emacs is what I use if I'm doing something that benefits from a superintelligent editor, and Magit is extremely compelling all by itself, especially with Magit's excellent support for selective commits. Apart from Magit, my major use for GNU Emacs is working with Go or Python, where I've gone through the effort to set up intelligent LSP-based support for them (Go version, Python version), as well as various additional tweaks and hacks (for example). If I had cause to do significant editing of C code, I'd probably also do it in GNU Emacs because I have an existing C auto-indentation setup that I like (I preserved it after blowing up my ancient Emacs setup). I still consider GNU Emacs to be my editor of choice for serious code editing (more or less regardless of the language), for various reasons, but I don't do very much programming these days. If I had to read and trace my way through Rust code, I might try doing it in GNU Emacs just because I have the Rust LSP server installed and I know how to jump around in lsp-mode.
(Today I mostly use GNU Emacs over X, because all of this LSP intelligence really wants graphics and various other things in order to look nice. GNU Emacs in a black and white xterm terminal window is a shadow of itself, at least in my configuration.)
My use of GNU Emacs stems from history. I used to use GNU Emacs a lot, so I built up a great deal of familiarity with editing in it and customizing it (my vim familiarity is much more recent). I use GNU Emacs enough to keep that familiarity alive, so it keeps being a decent editing environment for me. The same is true of my sam usage; there was a time when I used sam much more than I do now and I still retain a lot of the knowledge from then.
I'm sentimentally fond of sam, even if I don't use it broadly; it still feels right when I edit messages in it. I'm not sure I'm fond of either vim or GNU Emacs (any more than I am of most software), but vim has come to feel completely natural and GNU Emacs is an old friend even if I don't see it all that often. I feel no urge to try to make vim replace GNU Emacs by adding plugins, for various reasons including how I feel about doing that with vim (also).
(This expands on a Fediverse post, which was sparked by Jaana Dogan's mention of Acme, which linked to Russ Cox's video tour of Acme.)
2023-05-27
How I set up a server for testing new Grafana versions and other things
I mentioned yesterday that you probably should have a server that you can test Grafana upgrades on, and having one is useful for experiments. There are a couple of ways to set up such a server, and as it happens our environment is built in such a way as to make it especially easy. Although our Prometheus server and Grafana run on the same machine and so Grafana could access Prometheus as 'localhost:9090', when I set this up I decided that Grafana should instead access Prometheus through our reverse proxy Apache setup, using the server's public name.
(When I set this up, I think I had ideas of being able to watch for potential Grafana query problems by looking at the Apache log and seeing the queries it was sending to Prometheus. Then Grafana switched from querying Prometheus with GET and query parameters to using POST and a POST body, a change that's better for Grafana but which does limit what we can now get from Apache logs.)
This makes setting up a testing Grafana server basically trivial. We (I) install a random (virtual) machine, follow the steps to set up Apache and Grafana as if it were our production metrics server, and copy the production metrics server's current grafana.db to it (while Grafana isn't running). When I restart Grafana, it will come up with all of our dashboards, talking to our production Prometheus data source, since it's using the public name. This gives me a way to directly compare new versions of Grafana against the production version, including trying to update old panel types to new panel types and comparing the results.
(In our environment we have few enough changes to the production grafana.db that I can just copy the file around more or less any time I want to; I don't need to shut down the production Grafana to save a safe copy.)
This would still be relatively simple if I'd used 'localhost:9090' as the URL for our Prometheus data source instead of its public URL, since you can change the URL of a data source. I'd just have to remember to modify the copied database that way every time I copied or re-copied it. If our Prometheus server wasn't accessible at all off the machine (either on its own or through a reverse proxy Apache), I would probably enable my testing by ssh'ing from the test server to the production server to port forward 'localhost:9090'. I couldn't leave this running all the time, but there's generally no reason to leave the test Grafana server running unless I'm specifically interested in something.
(This is far easier to do with virtual machines than with physical ones, since these days starting up and shutting them down is as simple as 'virsh start <x>' and 'virsh shutdown <x>'.)
PS: Another trick you can play with a Grafana testing server is to keep multiple copies of your grafana.db around and swap them in and out of being the live one depending on what you want to do. For example, you might have one with a collection of test dashboards (to look into things like how to display status over time in the latest Grafana), and another that's a current copy of your production Grafana's database.
2023-05-26
In practice, Grafana has not been great at backward compatibility
We started our Prometheus and Grafana based metrics setup in late 2018. Although many of our Grafana dashboards weren't created immediately, the majority of them were probably built by the middle of 2019. Based on release history, we probably started somewhere around v6.4.0 and had many dashboards done by the time v7.0.0 came out. We're currently frozen on v8.3.11, having tried v8.4.0 and rejected it and all subsequent versions. The reason for this is fairly straightforward; from v8.4.0 onward, Grafana broke too many of our dashboards. The breakage didn't start in 8.4, to be honest. For us, things started to degrade from the change between the 7.x series and 8.0, but 8.4 was the breaking point where too much was off or not working.
(I've done experiments with Grafana v9.0 and onward, and it had more issues on top of those in the latest 8.x releases. In one way this isn't too surprising, since it is a new major release.)
I've encountered issues in several areas in Grafana during upgrades. Grafana's handling of null results from Prometheus queries has regressed more than once while we've been using it. Third party panels that we use have been partially degraded or sometimes completely broken (cf). Old panel types sprouted new bugs; new panel types that were supposed to replace them had new bugs, or sometimes lacked important functionality that the old panel types had. Upgrading (especially automatically) from old panel types to their nominally equivalent new panel types didn't always carry over all of your settings (for settings the new panel type supported, which wasn't always all of them).
Grafana is developed and maintained by competent people. That these backward compatibility issues happen anyway tells me that broad backward compatibility is not a priority in Grafana development. This is a perfectly fair thing; the Grafana team is free to pick their priorities (for example, not preserving compatibility for third party panels if they feel the API being used is sub-par and needs to change). But I'm free to quietly react to them, as I have by freezing on 8.3.x, the last release where things worked well enough.
I personally think that Grafana's periodic lack of good backward compatibility is not a great thing. Dashboards are not programs, and I can't imagine that many places want them to be in constant development. I suspect that there are quite a lot of places that want to design and create their dashboards and then have them just keep working until the metrics they draw on change (forcing the dashboards to change to keep up). Having to spend time on dashboards simply to keep them working as they are is not going to leave people enthused, especially if the new version doesn't work as well as the old version.
The corollary of this is that I think you should maintain a testing Grafana server, kept up to date with your primary server's dashboards, where you can apply Grafana updates to test them to see if anything you care about is broken or sufficiently different to cause you problems. You should probably also think about what might happen if you have to either freeze your version of Grafana or significantly rebuild your dashboards to cope with a new version. If you allow lots of people to build their own dashboards, perhaps you want to consider how to reach out to them to get them to test their dashboards or let them know of issues you've found and the potential need to update their dashboards.
(I didn't bother filing bug reports about the Grafana issues that I encountered, because my experience with filing other Grafana issues was that doing so didn't produce results. I'm sure that there are many reasons for this, including that Grafana probably gets a lot of issues filed against it.)
2023-05-25
That people produce HTML with string templates is telling us something
A while back I had a hot take on the Fediverse:
Another day, another 'producing HTML with string templates/interpolation is wrong' article. People have been writing these articles for a decade or more and people doing web development have kept voting with their feet, which is why we have string templates everywhere. At this point, maybe people should consider writing about why things have worked out this way.
(I don't think it's going to change, either. No one has structured HTML creation that's as easy as string templates.)
One of my fundamental rules of system design is when people keep doing it wrong, the people are right and your system or idea is wrong. A corollary to this is that when you notice this happening, a productive reaction is to start asking questions about why people do it the 'wrong' way. Despite what you might expect from its title, Hugo Landau's Producing HTML using string templates has always been the wrong solution (via) actually has some ideas and pointers to ideas, for instance this quote from Using type inference to make web templates robust against XSS:
Strict structural containment is a sound, principled approach to building safe templates that is a great approach for anyone planning a new template language, but it cannot be bolted onto existing languages though because it requires that every element and attribute start and end in the same template. This assumption is violated by several very common idioms, such as the header-footer idiom in ways that often require drastic changes to repair.
Another thing to note here is that pretty much every programming language has a way to format strings, and many of them have ways to have multi-line strings. This makes producing HTML via string formatting something that scales up (and down) very easily; you can use the same idiom to format a small snippet as you would a large block. Even Go's html/template package doesn't scale down quite that far, although it comes close. String templating is often very close to string formatting and so probably fits naturally into how programmers are acclimatized to approach things.
(Hugo Landau classifies Go's html/template as 'context aware autoescaping HTML string templating', and considers it not as good as what the quote above calls 'strict structural containment' that works on the full syntax tree of the HTML document.)
I don't have any particular answers to why string templating has been enduringly popular so far (although I can come up with theories, including that string templating is naturally reusable to other contexts, such as plain text). But that it has suggests that people see string templating as having real advantages over their alternatives and those advantages keep being compelling, including in new languages such as Go (where the Go authors created html/template instead of trying to define a 'strict structural containment' system). If people want to displace string templating, figuring out what those current advantages are and how to duplicate them in alternatives seems likely to be important.
(I'll pass on the question of how important it is to replace the most powerful context aware autoescaping HTML string templating with something else.)