Wandering Thoughts archives

2024-03-08

A realization about shell pipeline steps on multi-core machines

Over on the Fediverse, I had a realization:

This is my face when I realize that on a big multi-core machine, I want to do 'sed ... | sed ... | sed ...' instead of the nominally more efficient 'sed -e ... -e ... -e ...' because sed is single-threaded and if I have several costly patterns, multiple seds will parallelize them across those multiple cores.

Even when doing on the fly shell pipelines, I've tended to reflexively use 'sed -e ... -e ...' when I had multiple separate sed transformations to do, instead of putting each transformation in its own 'sed' command. Similarly I sometimes try to cleverly merge multi-command things into one command, although usually I don't try too hard. In a world where you have enough cores (well, CPUs), this isn't necessarily the right thing to do. Most commands are single threaded and will use only one CPU, but every command in a pipeline can run on a different CPU. So splitting up a single giant 'sed' into several may reduce a single-core bottleneck and speed things up.

(Giving sed multiple expressions is especially single threaded because sed specifically promises that they're processed in order, and sometimes this matters.)
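As a concrete sketch (the patterns and file names here are made up), the two forms look like this; in the second one, each potentially costly substitution gets its own sed process and thus potentially its own CPU:

# everything in one single-threaded sed
sed -e 's/alpha/beta/g' -e 's/slow[0-9]*pattern/X/g' -e 's/gamma/delta/g' <input >output

# one sed per transformation, so they can run on separate cores
sed -e 's/alpha/beta/g' <input | sed -e 's/slow[0-9]*pattern/X/g' | sed -e 's/gamma/delta/g' >output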

Whether this actually matters may vary a lot. In my case, it only made a trivial difference in the end, partly because only one of my sed patterns was CPU-intensive (but that pattern alone made sed use all the CPU it could get and made it the bottleneck in the entire pipeline). In some cases adding more commands may add more in overhead than it saves from parallelism. There are no universal answers.

One of my lessons learned from this is that if I'm on a machine with plenty of cores and doing a one-time thing, it probably isn't worth my while to carefully optimize how many processes are being run as I evolve the pipeline. I might as well jam in more pipeline steps whenever and wherever they're convenient. If it's easy to move one step closer to the goal with one more pipeline step, do it. Even if it doesn't help, it probably won't hurt very much.

Another lesson learned is that I might want to look for single threaded choke points if I've got a long-running shell pipeline. These are generally relatively easy to spot; just run 'top' and look for what's using up all of one CPU (on Linux, this is 100% CPU time). Sometimes this will be as easy to split as 'sed' was, and other times I may need to be more creative (for example, if zcat is hitting CPU limits, maybe pigz can help a bit).

(If I have the fast disk space, possibly uncompressing the files in place in parallel will work. This comes up in system administration work more than you'd think, since we often want to search and process log files and they're usually stored compressed.)
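As an illustration of the kind of substitution involved (the file names and patterns are made up, and how much pigz helps with decompression will vary):

# zcat is single threaded and can become the choke point
zcat access-log.gz | sed -e 's/slow[0-9]*pattern/X/g' | grep -c 'some thing'

# pigz -dc is a drop-in replacement that may offload a bit of the work
pigz -dc access-log.gz | sed -e 's/slow[0-9]*pattern/X/g' | grep -c 'some thing'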

programming/ShellPipelineStepsAndCPUs written at 22:27:42

2024-03-07

Some notes about the Cloudflare eBPF Prometheus exporter for Linux

I've been a fan of the Cloudflare eBPF Prometheus exporter for some time, ever since I saw their example of per-disk IO latency histograms. And the general idea is extremely appealing; you can gather a lot of information with eBPF (usually from the kernel), and the ability to turn it into metrics is potentially quite powerful. However, actually using it has always been a bit arcane, especially if you were stepping outside the bounds of Cloudflare's canned examples. So here's some notes on the current version (which is more or less v2.4.0 as I write this), written in part for me in the future when I want to fiddle with eBPF-created metrics again.

If you build the ebpf_exporter yourself, you want to use their provided Makefile rather than try to do it directly. This Makefile will give you the choice to build a 'static' binary or a dynamic one (with 'make build-dynamic'); the static build is the default. I put 'static' into quotes because of the glibc NSS problem; if you're on a glibc-using Linux, your static binary will still depend on your version of glibc. However, it will contain a statically linked libbpf, which will make your life easier. Unfortunately, building a static version is impossible on some Linux distributions, such as Fedora, because Fedora just doesn't provide static versions of some required libraries (as far as I can tell, libelf.a). If you have to build a dynamic executable, a normal ebpf_exporter build will depend on the libbpf shared library you can find in libbpf/dest/usr/lib. You'll need to set LD_LIBRARY_PATH so that this copy of libbpf.so is found at runtime.

(You can try building with the system libbpf, but it may not be recent enough for ebpf_exporter.)
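Putting the build choices together, a minimal sketch (this assumes the ebpf_exporter binary ends up at the top of your repository checkout and that the bundled libbpf lands in libbpf/dest/usr/lib, as described above):

# the default 'static' build, with libbpf statically linked
make

# or a dynamic build, which needs the bundled libbpf.so at runtime
make build-dynamic
export LD_LIBRARY_PATH=$PWD/libbpf/dest/usr/lib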

To get metrics from eBPF with ebpf_exporter, you need an eBPF program that collects the metrics and then a YAML configuration that tells ebpf_exporter how to handle what the eBPF program provides. The original version of ebpf_exporter had you specify eBPF programs in text in your (YAML) configuration file and then compiled them when it started. This approach has fallen out of favour, so now eBPF programs must be pre-compiled to special .o files that are loaded at runtime. I believe these .o files are relatively portable across systems; I've used ones built on Fedora 39 on Ubuntu 22.04. The simplest way to build either a provided example or your own one is to put it in the examples directory and then do 'make <name>.bpf.o'. Running 'make' in the examples directory will build all of the standard examples.
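For instance, using the exporter's biolatency example as a stand-in for your own program:

cd examples
make biolatency.bpf.o    # build just this one eBPF program
make                     # or build all of the standard examples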

To run an eBPF program or programs, you copy their <name>.bpf.o and <name>.yaml to a configuration directory of your choice, specify this directory with the ebpf_exporter '--config.dir' argument, and then use '--config.names=<name>,<name2>,...' to say what programs to run. The suffixes of the YAML configuration file and the eBPF object file are always fixed.
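Put concretely, a sketch of running one program (the configuration directory is an arbitrary choice of mine, biolatency is again just an example name, and I'm assuming the binary is in the current directory):

mkdir -p /etc/ebpf_exporter
cp examples/biolatency.bpf.o examples/biolatency.yaml /etc/ebpf_exporter/
./ebpf_exporter --config.dir=/etc/ebpf_exporter --config.names=biolatency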

The repository has some documentation on the YAML (and eBPF) that you have to write to get metrics. However, it is probably not sufficient to explain how to modify the examples or especially to write new ones. If you're doing this (for example, to revive an old example that was removed when the exporter moved to the current pre-compiled approach), you really want to read over existing examples and then copy their general structure more or less exactly. This is especially important because the main ebpf_exporter contains some special handling for at least histograms that assumes things are being done as in their examples. When reading examples, it helps to know that Cloudflare has a bunch of helpers that are in various header files in the examples directory. You want to use these helpers, not the normal, standard bpf helpers.

(However, although not documented in bpf-helpers(7), '__sync_fetch_and_add()' is a standard eBPF thing. It is not so much documented as mentioned in some kernel BPF documentation on arrays and maps and in bpf(2).)

One source of (e)BPF code that's generally similar to what you'll write for ebpf_exporter, and that you can copy from, is bcc/libbpf-tools (in the <name>.bpf.c files). An eBPF program like runqlat.bpf.c will need restructuring to be used as an ebpf_exporter program, but it will show you what you can hook into with eBPF and how. Often these examples will be more elaborate than you need for ebpf_exporter, with more options and the ability to narrowly select things; you can take all of that out.

(When setting up things like the number of histogram slots, be careful to copy exactly what the examples do in both your .bpf.c and in your YAML, mysterious '+ 1's and all.)

linux/EbpfExporterNotes written at 23:01:56

2024-03-06

Where and how Ubuntu kernels get their ZFS modules

One of the interesting and convenient things about Ubuntu for people like us is that they provide pre-built and integrated ZFS kernel modules in their mainline kernels. If you want ZFS on your (our) ZFS fileservers, you don't have to add any extra PPA repositories or install any extra kernel module packages; it's just there. However, this leaves us with a little mystery, which is how the ZFS modules actually get there. The reason this is a mystery is that the ZFS modules are not in the Ubuntu kernel source, or at least not in the package source.

(One reason this matters is that you may want to see what patches Ubuntu has applied to their version of ZFS, because Ubuntu periodically backports patches to specific issues from upstream OpenZFS. If you go try to find ZFS patches, ZFS code, or a ZFS changelog in the regular Ubuntu kernel source, you will likely fail, and this will not be what you want.)

Ubuntu kernels are normally signed in order to work with Secure Boot. If you use 'apt source ...' on a signed kernel, what you get is not the kernel source but a 'source' that fetches specific unsigned kernels and does magic to sign them and generate new signed binary packages. To actually get the kernel source, you need to follow the directions in Build Your Own Kernel to get the source of the unsigned kernel package. However, as mentioned this kernel source does not include ZFS.

(You may be tempted to fetch the Git repository following the directions in Obtaining the kernel sources using git, but in my experience this may well leave you hunting around in confusion trying to find the branch that actually corresponds to even the current kernel for an Ubuntu release. Even if you have the Git repository cloned, downloading the source package can be easier.)

How ZFS modules get into the built Ubuntu kernel is that during the package build process, the Ubuntu kernel build downloads or copies a specific zfs-dkms package version and includes it in the tree that kernel modules are built from, which winds up including the built ZFS kernel modules in the binary kernel packages. Exactly what version of zfs-dkms will be included is specified in debian/dkms-versions, although good luck finding an accurate version of that file in the Git repository on any predictable branch or in any predictable location.

(The zfs-dkms package itself is the DKMS version of kernel ZFS modules, which means that it packages the source code of the modules along with directions for how DKMS should (re)build the binary kernel modules from the source.)

This means that if you want to know what specific version of the ZFS code is included in any particular Ubuntu kernel and what changed in it, you need to look at the source package for zfs-dkms, which is called zfs-linux and has its Git repository here. Don't ask me how the branches and tags in the Git repository are managed and how they correspond to released package versions. My current view is that I will be downloading specific zfs-linux source packages as needed (using 'apt source zfs-linux').

The zfs-linux source package is also used to build the zfsutils-linux binary package, which has the user space ZFS tools and libraries. You might ask if there is anything that makes zfsutils-linux versions stay in sync with the zfs-dkms versions included in Ubuntu kernels. The answer, as far as I can see, is no. Ubuntu is free to release new versions of zfsutils-linux and thus zfs-linux without updating the kernel's dkms-versions file to use the matching zfs-dkms version. Sufficiently cautious people may want to specifically install a matching version of zfsutils-linux and then hold the package.

I was going to write something about how you get the ZFS source for a particular kernel version, but it turns out that there is no straightforward way. Contrary to what the Ubuntu documentation suggests, if you do 'apt source linux-image-unsigned-$(uname -r)', you don't get the source package for that kernel version; you get the source package for the current version of the 'linux' kernel package, at whatever is the latest released version. Similarly, while you can inspect that source to see what zfs-dkms version it was built with, 'apt source zfs-dkms' will only give you (easy) access to the current version of the zfs-linux source package. If you ask for an older version, apt will probably tell you it can't find it.

(Presumably Ubuntu has old source packages somewhere, but I don't know where.)
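With the caveat that this gets you the current package versions (not necessarily the ones matching your running kernel), the commands involved look roughly like this; grepping debian/dkms-versions is my guess at the convenient way to inspect it:

# the current 'linux' kernel source package, whatever version that is
apt source linux-image-unsigned-$(uname -r)
grep -i zfs linux-*/debian/dkms-versions

# the current ZFS source package (it builds zfs-dkms and zfsutils-linux)
apt source zfs-linux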

linux/UbuntuKernelsZFSWhereFrom written at 22:59:21

2024-03-05

A peculiarity of the X Window System: Windows all the way down

Every window system has windows, as an entity. Usually we think of these as being used for, well, windows and window like things; application windows, those extremely annoying pop-up modal dialogs that are always interrupting you at the wrong time, even perhaps things like pop-up menus. In its original state, X has more windows than that. Part of how and why it does this is that X allows windows to nest inside each other, in a window tree, which you can still see today with 'xwininfo -root -tree'.

One of the reasons that X has copious nested windows is that X was designed with a particular model of writing X programs in mind, and that model made everything into a (nested) window. Seriously, everything. In an old fashioned X application, windows are everywhere. Buttons are windows (or several windows if they're radio buttons or the like), text areas are windows, menu entries are each a window of their own within the window that is the menu, visible containers of things are windows (with more windows nested inside them), and so on.

This copious use of windows allows a lot of things to happen on the server side, because various things (like mouse cursors) are defined on a per-window basis, and also windows can be created with things like server-set borders. So the X server can render sub-window borders to give your buttons an outline and automatically change the cursor when the mouse moves into and out of a sub-window, all without the client having to do anything. And often input events like mouse clicks or keys can be specifically tied to some sub-window, so your program doesn't have to hunt through its widget geometry to figure out what was clicked. There are more tricks; for example, you can get 'enter' and 'leave' events when the mouse enters or leaves a (sub)window, which programs can use to highlight the current thing (ie, subwindow) under the cursor without the full cost of constantly tracking mouse motion and working out what widget is under the cursor every time.

The old, classical X toolkits like Xt and the Athena widget set (Xaw) heavily used this 'tree of nested windows' approach, and you can still see large window trees with 'xwininfo' when you apply it to old applications with lots of visible buttons; one example is 'xfontsel'. Even the venerable xterm normally contains a nested window (for the scrollbar, which I believe it uses partly to automatically change the X cursor when you move the mouse into the scrollbar). However, this doesn't seem to be universal; when I look at one Xaw-based application I have handy, it doesn't seem to use subwindows despite having a list widget of things to click on. Presumably in Xaw and perhaps Xt it depends on what sort of widget you're using, with some widgets using sub-windows and some not. Another program, written using Tk, does use subwindows for its buttons (with them clearly visible in 'xwininfo -tree').
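If you want to poke around at this yourself, the commands involved are simple (xfontsel and xterm above are merely convenient old-style examples):

# the entire window tree of your display
xwininfo -root -tree

# the tree under a single application's window; you'll be asked to click on it
xwininfo -tree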

This approach fell out of favour for various reasons, but certainly one significant one is that it's strongly tied to X's server side rendering. Because these subwindows are 'on top of' their parent (sub)windows, they have to be rendered individually; otherwise they'll cover what was rendered into the parent (and naturally they clip what is rendered to them to their visible boundaries). If you're sending rendering commands to the server, this is just a matter of what windows they're for and what coordinates you draw at, but if you render on the client, you have to ship over a ton of little buffers (one for each sub-window) instead of one big one for your whole window, and in fact you're probably sending extra data (the parts of all of the parent windows that get covered up by child windows).

So in modern toolkits, the top level window and everything in it is generally only one X window with no nested subwindows, and all buttons and other UI elements are drawn by the client directly into that window (usually with client side drawing). The client itself tracks the mouse pointer and sends 'change the cursors to <X>' requests to the server as the pointer moves in and out of UI elements that should have different mouse cursors, and when it gets events, the client searches its own widget hierarchy to decide what should handle them (possibly including client side window decorations (CSD)).

(I think toolkits may create some invisible sub-windows for event handling reasons. Gnome-terminal and other Gnome applications appear to create a 1x1 sub-window, for example.)

As a side note, another place you can still find this many-window style is in some old fashioned X window managers, such as fvwm. When fvwm puts a frame around a window (such as the ones visible on windows on my desktop), the specific elements of the frame (the title bar, any buttons in the title bar, the side and corner drag-to-resize areas, and so on) are all separate X sub-windows. One thing I believe this is used for is to automatically show an appropriate mouse cursor when the mouse is over the right spot. For example, if your mouse is in the right side 'grab to resize right' border, the mouse cursor changes to show you this.

(The window managers for modern desktops, like Cinnamon, don't handle their window manager decorations like this; they draw everything as decorations and handle the 'widget' nature of title bar buttons and so on internally.)

unix/XWindowsAllTheWayDown written at 21:26:30

2024-03-04

An illustration of how much X cares about memory usage

In a comment on yesterday's entry talking about X's server side graphics rendering, B.Preston mentioned that another reason for this was to conserve memory. This is very true. In general, X is extremely conservative about requiring memory, sometimes to what we now consider extreme lengths, and there are specific protocol features (or limitations) related to this.

The modern approach to multi-window graphics rendering is that each window renders into a buffer that it owns (often with hardware assistance) and then the server composites (appropriate parts of) all of these buffers together to make up the visible screen. Often this compositing is done in hardware, enabling you to spin a cube of desktops and their windows around in real time. One of the things that clients simply don't worry about (at least for their graphics) is what happens when someone else's window is partially or completely on top of their window. From the client's perspective, nothing happens; they keep drawing into their buffer and their buffer is just as it was before, and all of the occlusion and stacking and so on are handled by the composition process.

(In this model, a client program's buffer doesn't normally get changed or taken away behind the client's back, although the client may flip between multiple buffers, only displaying one while completely repainting another.)

The X protocol specifically does not require such memory consuming luxuries as a separate buffer for each window, and early X implementations did not have them. An X server might have only one significant-sized buffer, that being screen memory itself, and X clients drew right on to their portion of the screen (by sending the X server drawing commands, because they didn't have direct access to screen memory). The X server would carefully clip client draw operations to only touch the visible pixels of the client's window. When you moved a window to be on top of part of another window, the X server simply threw away (well, overwrote) the 'under' portion of the other window. When the window on top was moved back away again, the X server mostly dealt with this by sending your client a notification that parts of its window had become visible and the client should repaint them.

(X was far from alone with this model, since at the time almost everyone was facing similar or worse memory constraints.)

The problem with this 'damage and repaint' model is that it can be janky; when a window is moved away, you get an ugly result until the client has had the time to do a redraw, which may take a while. So the X server had some additional protocol level features, called 'backing store' and 'save-under(s)'. If a given X server supported these (and it didn't have to), the client could request (usually during window creation) that the server maintain a copy of the obscured bits of the new window when it was covered by something else ('backing store') and separately that when this window covered part of another window, the obscured parts of that window should be saved ('save-under', which you might set for a transient pop-up window). Even if the server supported these features in general it could specifically stop doing them for you at any time it felt like it, and your client had to cope.

(The X server can also give your window backing store whether or not you asked for it, at its own discretion.)

All of this was to allow an X server to flexibly manage the amount of memory it used on behalf of clients. If an X server had a lot of memory, it could give everything backing store; if it started running short, it could throw some or all of the backing store out and reduce things down to (almost) a model where the major memory use was the screen itself. Even today you can probably arrange to start an X server in a mode where it doesn't have backing store (the '-bs' command line option, cf Xserver(1), which you can try in Xnest or the like today, and also '-wm'). I have a vague memory that back in the day there were serious arguments about whether or not you should disable backing store in order to speed up your X server, although I no longer have any memory about why that would be so (but see).
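If you feel like experimenting, a sketch of how you might poke at this (the display number is arbitrary, and I haven't verified that every Xnest build accepts these server options):

# a nested X server with backing store support disabled
Xnest :1 -bs &

# see what it reports for backing store and save-unders
xdpyinfo -display :1 | grep -iE 'backing|save-under'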

As far as I know all X servers normally operate with backing store these days. I wouldn't be surprised if some modern X clients would work rather badly if you ran them on an X server that had backing store forced off (much as I suspect that few modern programs will cope well with PseudoColor displays).

PS: Now that I look at 'xdpyinfo', my X server reports 'options: backing-store WHEN MAPPED, save-unders NO'. I suspect that this is a common default, since you don't really need save-unders if everything has backing store enabled when it's visible (well, in X mapped is not quite 'visible', cf, but close enough).

unix/XServerBackingStoreOptional written at 22:02:53

2024-03-03

X graphics rendering as contrasted to Wayland rendering

Recently, Thomas Adam (of fvwm fame) pointed out on the FVWM mailing list (here, also) a difference between X and Wayland that I'd been vaguely aware of before but hadn't actually thought much about. Today I feel like writing it down in my own words for various reasons.

X is a very old protocol (dating from the mid to late 1980s), and one aspect of that is that it contains things that modern graphics protocols don't. From a modern point of view, it isn't wrong to describe X as several protocols in a trenchcoat. Two of the largest such protocols are one for what you could call window management (including event handling) and a second one for graphics rendering. In the original vision of X, clients used the X server as their rendering engine, sending a series of 2D graphics commands to the server to draw things like lines, rectangles, arcs, and text. In the days of 10 Mbit/second local area networks and also slow inter-process communication on your local Unix machine, this was a relatively important part of both X's network transparency story and X's performance in general. We can call this server (side) rendering.

(If you look at the X server drawing APIs, you may notice that they're rather minimal and generally lack features that you'd like to do modern graphics. Some of this was semi-fixed in X protocol extensions, but in general the server side X rendering APIs are rather 1980s.)

However, X clients didn't have to do their rendering in the server. Right from the beginning they could render to a bitmap on the client side and then shove the bitmap over to the server somehow (the exact mechanisms depend on what X extensions are available). Over time, more and more clients started doing more and more client (side) rendering, where they rendered everything under their own control using their own code (well, realistically a library or a stack of them, especially for complex things like rendering fonts). Today, many clients and many common client libraries are entirely or almost entirely using client side rendering, in part to get modern graphics features that people want, and these days clients even do client side (window) decoration (CSD), where they draw 'standard' window buttons themselves.

(This tends to make window buttons not so standard any more, especially across libraries and toolkits.)

As a protocol designed relatively recently, Wayland is not several protocols in a trenchcoat. Instead, the (core) Wayland protocol is only for window management (including event handling), and it has no server side rendering. Wayland clients have to do client side rendering in order to display anything, using whatever libraries they find convenient for this. Of course this 'rendering' may be a series of OpenGL commands that are drawn on to a buffer that's shared with the Wayland server (what is called direct rendering (cf), which is also the common way to do client side rendering in X), but this is in some sense a detail. Wayland clients can simply render to bitmaps and then push those bitmaps to a server, and I believe this is part of how waypipe operates under the covers.

(Since Wayland was more or less targeted at environments with toolkits that already had their own graphics rendering APIs and were already generally doing client side rendering, this wasn't seen as a drawback. My impression is that these non-X graphics APIs were already in common use in many modern clients, since it includes things like Cairo. One reason that people switched to such libraries and their APIs even before Wayland is that the X drawing APIs are, well, very 1980s, and don't have a lot of features that modern graphics programming would like. And you can draw directly to a Wayland buffer if you want to, cf this example.)

One implication of this is that some current X programs are much easier to port (or migrate) to Wayland than others. The more an X program uses server side X rendering, the more it can't simply be re-targeted to Wayland, because it needs a client side library to substitute for the X server side rendering functionality. Generally such programs are either old or were deliberately written to be minimal X clients that didn't depend on toolkits like Gtk or even Cairo.

(Substituting in a stand alone client side drawing library is probably not a small job, since I don't think any of them so far are built to be API compatible with the relevant X APIs. It also means taking on additional dependencies for your program, although my impression is that some basic graphics libraries are essentially standards by now.)

unix/XRenderingVsWaylandRendering written at 22:56:12

2024-03-02

Something I don't know: How server core count interacts with RAM latency

When I wrote about how the speed of improvement in servers may have slowed down, I didn't address CPU core counts, which is one area where the numbers have been going up significantly. Of course you have to keep those cores busy, but if you have a bunch of CPU-bound workloads, the increased core count is good for you. Well, it's good for you if your workload is genuinely CPU bound, which generally means it fits within per-core caches. One of the areas I don't know much about is how the increasing CPU core counts interact with RAM latency.

RAM latency (for random requests) has been relatively flat for a while (it's been flat in time, which means that it's been going up in cycles as CPUs got faster). Total memory access latency has apparently been 90 to 100 nanoseconds for several memory generations (although individual DDR5 memory module access is apparently only part of this, also). Memory bandwidth has been going up steadily between the DDR generations, so per-core bandwidth has gone up nicely, but this is only nice if you have the kind of sequential workloads that benefit from it. As far as I know, the kind of random access that you get from things like pointer chasing is all dependent on latency.

(If the total latency has been basically flat, this seems to imply that bandwidth improvements don't help too much. Presumably they help for successive non-random reads, and my vague impression is that reading data from successive addresses from RAM is faster than reading random addresses (and not just because RAM typically transfers an entire cache line to the CPU at once).)

So now we get to the big question: how many memory reads can you have in flight at once with modern DDR4 or DDR5 memory, especially on servers? Where the limit is presumably matters since if you have a bunch of pointer-chasing workloads that are limited by 'memory latency' and you run them on a high core count system, at some point it seems that they'll run out of simultaneous RAM read capacity. I've tried to do some reading and gotten confused, which may be partly because modern DRAM is a pretty complex thing.

(I believe that individual processors and multi-socket systems have some number of memory channels, each of which can be in action simultaneously, and then there are memory ranks (also) and memory banks. How many memory channels you have depends partly on the processor you're using (well, its memory controller) and partly on the motherboard design. For example, 4th generation AMD Epyc processors apparently support 12 memory channels, although not all of them may be populated in a given memory configuration (cf). I think you need at least N (or maybe 2N) DIMMs for N channels. And here's a look at AMD Zen4 memory stuff, which doesn't seem to say much on multi-core random access latency.)

tech/ServerCPUDensityAndRAMLatency written at 22:54:58

2024-03-01

Options for your Grafana panels when your metrics change names

In an ideal world, your metrics never change their names; once you put them into a Grafana dashboard panel, they keep the same name and meaning forever. In the real world, sometimes a change in metric name is forced on you, for example because you might have to move from collecting a metric through one Prometheus exporter to collecting it with another exporter which naturally gives it a different name. And sometimes a metric will be renamed by its source.

In a Prometheus environment, the very brute force way to deal with this is either a recording rule (creating a duplicate metric with the old name) or renaming the metric during ingestion. However I feel that this is generally a mistake. Almost always, your Prometheus metrics should record the true state of affairs, warts and all, and it should be on other things to sort out the results.

(As part of this, I feel that Prometheus metric names should always be honest about where they come from. There's a convention that the name of the exporter is at the start of the metric name, and so you shouldn't generate your own metrics with someone else's name on them. If a metric name starts with 'node_*', it should come from the Prometheus host agent.)

So if your Prometheus metrics get renamed, you need to fix this in your Grafana panels (which can be a pain but is better in the long run). There are at least three approaches I know of. First, you can simply change the name of the metric in all of the panels. This keeps things simple but means that your historical data stops being visible on the dashboards. If you don't keep historical data for very long (or don't care about it much), this is fine; pretty soon the metric's new name will be the only one in your metrics database. In our case, we keep years of data and do want to be able to look back, so this isn't good enough.

The second option is to write your queries in Grafana as basically 'old_name or new_name'. If your queries involve rate() and avg() and other functions, this can be a lot of (manual) repetition, but if you're careful and lucky you can arrange for the old and the new query results to have the same labels as Grafana sees them, so your panel graphs will be continuous over the metrics name boundary.
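As a sketch with made-up metric names (and assuming the old and new metrics wind up with compatible labels):

rate(old_exporter_net_bytes_total[5m]) or rate(new_exporter_net_bytes_total[5m])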

The third option is to duplicate the query and then change the name of the metric (or the metrics) in the new copy of the query. This is usually straightforward and easy, but it definitely gives you graphs that aren't continuous around the name change boundary. The graphs will have one line for the old metric and then a new second line for your new metric. One advantage of separate queries is that you can someday turn the old query off in Grafana without having to delete it.

sysadmin/GrafanaMetricsNameChangeOptions written at 23:33:03

2024-02-29

The speed of improvement in servers may have slowed down

One of the bits of technology news that I saw recently was that AWS was changing how long it ran servers, from five years to six years. Obviously one large motivation for this is that it will save Amazon a nice chunk of money. However, I suspect that one enabling factor for this is that old servers are more similar to new servers than they used to be, as part of what could be called the great slowdown in computer performance improvement.

New CPUs and to a lesser extent memory are somewhat better than they used to be, both on an absolute measure and on a performance per watt basis, but the changes aren't huge the way they used to be. SATA SSD performance has been more or less stagnant for years; NVMe performance has improved, but from a baseline that was already very high, perhaps higher than many workloads could take advantage of. Network speeds are potentially better but it's already hard to truly take advantage of 10G speeds, especially with ordinary workloads and software.

(I don't know if SAS SSD bandwidth and performance has improved, although raw SAS bandwidth has and is above what SATA can provide.)

For both AWS and people running physical servers (like us) there's also the question of how many people need faster CPUs and more memory, and related to that, how much they're willing to pay for them. It's long been observed that a lot of what people run on servers is not a voracious consumer of CPU and memory (and IO bandwidth). If your VPS runs at 5% or 10% CPU load most of the time, you're probably not very enthused about paying more for a VPS with a faster CPU that will run at 2.5% almost all of the time.

(Now that I've written this it strikes me that this is one possible motivation for cloud providers to push 'function as a service' computing, because it potentially allows them to use those faster CPUs more effectively. If they're renting you CPU by the second and only when you use it, faster CPUs likely mean more people can be packed on to the same number of CPUs and machines.)

We have a few uses for very fast single-core CPU performance, but other than those cases (and our compute cluster) it's hard to identify machines that could make much use of faster CPUs than they already have. It would be nice if our fileservers had U.2 NVMe drives instead of SATA SSDs but I'm not sure we'd really notice; the fileservers only rarely see high IO loads.

PS: It's possible that I've missed important improvements here because I'm not all that tuned in to this stuff. One possible area is PCIe lanes directly supported by the system's CPU(s), which enable all of those fast NVMe drives, multiple 10G or faster network connections, and so on.

tech/ServersSpeedOfChangeDown written at 22:43:13

2024-02-28

Detecting absent Prometheus metrics without knowing their labels

When you have a Prometheus setup, one of the things you sooner or later worry about is important metrics quietly going missing because they're not being reported any more. There can be many reasons for metrics disappearing on you; for example, a network interface you expect to be at 10G speeds may not be there at all any more, because it got renamed at some point, so now you're not making sure the new name is at 10G.

(This happened to us with one machine's network interface, although I'm not sure exactly how except that it involves the depths of PCIe enumeration.)

The standard Prometheus feature for this is the 'absent()' function, or sometimes absent_over_time(). However, both of these have the problem that because of Prometheus's data model, you need to know at least some unique labels that your metrics are supposed to have. Without labels, all you can detect is the total disappearance of the metric, when nothing at all is reporting it. If you want to be alerted when some machine stops reporting a metric, you need to list all of the sources that should have the metric (following a pattern we've seen before):

absent(metric{host="a", device="em0"}) or
 absent(metric{host="b", device="eno1"}) or
 absent(metric{host="c", device="eth2"})

Sometimes you don't know all of the label values that your metric will be present with (or it's tedious to list all of them and keep them up to date), and it's good enough to get a notification if a metric disappears when it was previously there (for a particular set of labels). For example, you might have an assortment of scripts that push their success results somewhere, and you don't want to have to keep a list of all of the scripts, but you do want to detect when a script stops reporting its metrics. In this case we can use 'offset' to check current metrics against old metrics. The simplest pattern is:

your_metric offset 1h
  unless your_metric

If the metric was there an hour ago and isn't there now, this will generate the metric as it was an hour ago (with the labels it had then), and you can use that to drive an alert (or at least a notification). If there are labels that might naturally change over time in your_metric, you can exclude them with 'unless ignoring (...)' or use 'unless on (...)' for a very focused result.
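For example, if a label such as 'instance' can legitimately change over time (the label name here is only an illustration):

your_metric offset 1h
  unless ignoring (instance) your_metric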

As written this has the drawback that it only looks at what versions of the metric were there exactly an hour ago. We can do better by using an *_over_time() function, for example:

max_over_time( your_metric[4h] ) offset 1h
  unless your_metric

Now if your metric existed (with some labels) at any point between five hours ago and one hour ago, and doesn't exist now, this expression will give you a result and you can alert on that. Since we're using *_over_time(), you can also leave off the 'offset 1h' and just extend the time range, and then maybe extend the other time range too:

max_over_time( your_metric[12h] )
  unless max_over_time( your_metric[20m] )

This expression will give you a result if your_metric has been present (with a given set of labels) at some point in the last 12 hours but has not been present within the last 20 minutes.

(You'd pick the particular *_over_time() function to use depending on what use, if any, you have for the value of the metric in your alert. If you have no particular use for the value (or you expect the value to be a constant), either max or min are efficient for Prometheus to compute.)

All of these clever versions have a drawback, which is that after enough time has gone by they shut off on their own. Once the metric has been missing for at least an hour or five hours or 12 hours or however long, even the first part of the expression has nothing and you get no results and no alert. So this is more of a 'notification' than a persistent 'alert'. That's unfortunately the best you can really do. If you need a persistent alert that will last until you take it out of your alert rules, you need to use absent() and explicitly specify the labels you expect and require.

sysadmin/PrometheusAbsentMetricsAndLabels written at 22:18:09

