Wandering Thoughts


My home DSL link really is fast enough to make remote X acceptable

A few years ago I wrote about how my home internet link had gradually gotten fast enough that I could toy around with running even VMWare Workstation over remote X. At the time (and afterward) I thought that that was kind of nice in theory, but I never really tested how far this would go and how it would feel to significantly use remote X for real (even when I missed various aspects of remote X). Recently, world and local events have made for an extended period of working from home, which means that I now very much miss some aspects of my work X environment and have been strongly motivated to see if I can use them over remote X. Because I'm lazy, I've been doing all of this over basic SSH X forwarding (with compression turned on) instead of anything more advanced that would require more work on my part.

I was going to say that I started with things that are fundamentally text based, but that's not really true. Even X programs that render text are no longer 'text based' in the sense of sending only 'draw this character' requests to the server, because modern X fonts are rendered in the client and sent to the server as bitmaps. Given font anti-aliasing for LCD displays, they may not even be black and white bitmaps. Still, programs like exmh and sam only really do relatively simple graphics, and not necessarily very often. All of this runs well enough that I'm mostly happy to keep on using it instead of other options. Given past experiences I wasn't really surprised by this.

What I have recently been surprised with is running VMWare Workstation remotely from my office machine, because what I was doing (from remote) reached the point where I wanted to spin up a test virtual machine and we didn't build a virtual machine host designed for remote use before we left the office. Back several years ago in the original entry, I didn't try to seriously use VMWare Workstation to get real work done; now I have, and it works decently (certainly enough for me to be productive with it). It doesn't even seem to saturate my DSL link or suffer too much when other things are using the link.

Of course, running X remotely over a DSL link that's only medium fast doesn't measure up to running it over a 1G Ethernet network, much less the local machine. I can certainly feel the difference (mostly in latency and responsiveness). But it's much more usable than I might have expected, and I've had to change my work habits less than I feared.

(I'm not sure if using SSH's built-in compression is a good idea in general these days, but in a quick experiment it appears to drastically reduce the total data sent from the remote VMWare Workstation to my home machine.)

PS: There are more sophisticated ways of doing remote X than just 'ssh -X' that are said to perform better. If we keep on working remotely for long enough, I will probably wind up exploring some of them.

tech/HomeInternetAcceptableX written at 21:45:26


It's worth documenting the obvious (before it stops being obvious)

I often feel a little bit silly when I write entries about things like making bar graphs in Grafana or tags for Grafana dashboard variables because when I write them up it's all pretty straightforward and even obvious. This is an illusion. It's all straightforward and obvious to me right now because I've been in the middle of doing this with Grafana, and so I have a lot of context and contextual knowledge. Not only do I know how to do things, I also know what they're called and roughly where to find information about them in Grafana's official documentation. All of this is going to fade away over time, as I stop making and updating our Grafana dashboards.

Writing down these obvious things has two uses. First and foremost, I'll have specific documentation for when I want to do this again in six months or a year or whatever (provided that I can remember that I wrote some entries on this and that I haven't left out crucial context, which I've done in the past). Second, actually writing down my own documentation forces me to understand things more thoroughly and hopefully helps fix them more solidly in my mind, so perhaps I won't even need my entries (or at least not need them so soon).

There's a lot of obvious things and obvious context that we don't document explicitly (in our worklog system or otherwise), which I've noticed before. Some of those obvious things don't really need to be documented because we do them all of the time, but I'm sure there's other things I'm dealing with right now that I won't be in six months. And even for the things that we do all the time, maybe it wouldn't hurt to explicitly write them up once (or every so often, or at least re-check the standard 'how we do X' documentation every so often).

(Also, just because we do something all the time right now doesn't mean we always will. What we do routinely can shift over time, and we won't even necessarily directly notice the shift; it may just slowly be more and more of this and less of that. Or perhaps we'll introduce a system that automates a lot of something we used to do by hand.)

The other side of this, and part of why I'm writing this entry, is that I shouldn't feel silly about documenting the obvious, or at least I shouldn't let that feeling stop me from doing it. There's value in doing it even if the obvious remains obvious to me, and I should keep on doing a certain amount of it.

(Telling myself not to feel things is probably mostly futile. Humans are not rational robots, no matter how much we tell ourselves that we are.)

sysadmin/DocumentTheObvious written at 21:37:13

Notes on Grafana 'value groups' for dashboard variables

Suppose, not hypothetically, that you have some sort of Grafana overview dashboard that can show you multiple hosts at once in some way. In many situations, you're going to want to use a Grafana dashboard variable to let you pick some or all of your hosts. If you're getting the data for what hosts should be in your list from Prometheus, often you'll want to use label_values() to extract the data you want. For example, suppose that you have a label field called 'cshost' that is your local short host name for a host. Then a plausible Grafana query for 'all of our hosts' for a dashboard variable would be:

label_values( node_load1, cshost )

(Pretty much every Unix that the Prometheus host agent runs on will supply a load average, although they may not supply other metrics.)

However, if you have a lot of hosts, this list can be overwhelming, and you may also have sub-groupings of hosts (such as all SLURM nodes) that you want to make it convenient to narrow down to. To support this, Grafana has a dashboard variable feature called value groups or just 'tags'. Value groups are a bit confusing and aren't as well documented as dashboard variables as a whole.

There are two parts to setting up a value group; you need a query that will give Grafana the names of all of the different groups (aka tags), and then a second query that will tell Grafana which hosts are in a particular group. Suppose that we have a metric to designate which classes a particular host is in:

cslab_class{ cshost="cpunode2", class="comps" }    1
cslab_class{ cshost="cpunode2", class="slurmcpu" } 1
cslab_class{ cshost="cpunode2", class="c6220" }    1

We can use this metric for both value group queries. The first query is to get all the tags, which are all the values of class:

label_values( cslab_class, class )

Note that we don't have to de-duplicate the result; Grafana will do that for us (although we could do it ourselves if we wanted to make a slightly more complex query).

The second query is to get all of the values for a particular group (or tag), which is to say the hosts for a specific class. In this query, we have a special Grafana-provided $tag variable that refers to the current class, so our query is now for the cshost label for things with that class:

label_values( cslab_class{ class="$tag" }, cshost )

It's entirely okay for this query to return some additional hosts (values) that aren't in our actual dashboard variable; Grafana will quietly ignore them for the most part.
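To make the two queries concrete, here is a small Python model of what they return for the sample cslab_class series above. This is only a sketch of the semantics, not how Grafana or Prometheus implement label_values(). The label_values() helper, the hosts_for_tag() wrapper, and the extra 'apps1' host are all invented for illustration, and the results are sorted only to make them predictable.

```python
# Hypothetical in-memory stand-in for the cslab_class series shown above.
# The "apps1" entry is an invented extra host, added only for illustration.
SERIES = [
    {"cshost": "cpunode2", "class": "comps"},
    {"cshost": "cpunode2", "class": "slurmcpu"},
    {"cshost": "cpunode2", "class": "c6220"},
    {"cshost": "apps1", "class": "comps"},
]


def label_values(series, label, **matchers):
    """Return the sorted, de-duplicated values of `label` over the series
    whose labels match every key=value pair in `matchers`."""
    return sorted({
        s[label]
        for s in series
        if all(s.get(k) == v for k, v in matchers.items())
    })


# First query: every group (tag) name, like label_values(cslab_class, class).
tags = label_values(SERIES, "class")


def hosts_for_tag(tag):
    """Second query: the hosts in one group, with $tag already filled in."""
    # "class" is a Python keyword, so it is passed through a dict.
    return label_values(SERIES, "cshost", **{"class": tag})
```

The set comprehension is what stands in for the de-duplication that Grafana does for us: 'comps' appears on two series but only once in the list of tags.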

Although you'll often want to use the same metric in both queries, it's not required. Both queries can be arbitrary and don't have to be particularly related to each other. Obviously, the results from the second query do have to exactly match the values you have in the dashboard variable itself. Unfortunately you don't have regexp rewriting for your results the way you do for the main dashboard variable query, so with Prometheus you may need to do some rewriting in the query itself using label_replace(). Also, there's no preview of what value groups (tags) your query generates, or what values are in what groups; you have to go play around with the dashboard to see what you get.

sysadmin/GrafanaVariableGroups written at 00:49:43


I set up Python program options and arguments in a separate function

Pretty much every programming language worth using has a standard library or package for parsing command line options and arguments, and Python is no exception; the standard for doing it is argparse. Argparse handles a lot of the hard work for you, but you still have to tell it what your command line options are, provide help text for things, and so on. In my own Python programs, I almost always do this setup in a separate function that returns a fully configured argparse.ArgumentParser instance.

My standard way of writing all of it looks like this:

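import argparse

def setup():
  p = argparse.ArgumentParser(usage="...")
  # ... p.add_argument() calls for the options and arguments go here ...
  return p

def main():
  p = setup()
  opts = p.parse_args()
  # ... the rest of the program, using opts ...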

I don't like putting all of this directly in my main() because in most programs I write, this setup work is long and verbose enough to obscure the rest of what main() is doing. The actual top level processing and argument handling is the important thing in main(), not the setup of options, so I want all of the setup elsewhere where it's easy to skip over. In theory I could put it at the module level, not in a function, but I have a strong aversion to running code at import time. Among other issues, if I got something wrong I would much rather have the stack trace clearly say that it's happening in setup() than something more mysterious.

Putting it in a function that's run explicitly can have some advantages in specialized situations. For instance, it's much more natural to use complex logic (or run other functions) to determine the default arguments for some command line options. For people who want to write tests for this sort of thing, having all of the logic in a function also makes it possible to run the function repeatedly and inspect the resulting ArgumentParser object.
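As a minimal sketch of both points, here is a setup function whose default comes from another function, written so that a test can build the parser and inspect it. The default_jobs() helper, the '--jobs' option, and the 'files' argument are all made up for this example; they aren't from any real program of mine.

```python
import argparse
import os


def default_jobs():
    # Hypothetical helper: work out a default at run time instead of
    # hard-coding it in the add_argument() call.
    return os.cpu_count() or 1


def setup():
    p = argparse.ArgumentParser(description="Example program (illustrative only).")
    p.add_argument("-j", "--jobs", type=int, default=default_jobs(),
                   help="number of parallel jobs (default: %(default)s)")
    p.add_argument("files", nargs="*", help="files to process")
    return p


def main(argv=None):
    # Passing argv explicitly makes main() callable from tests too;
    # with argv=None, argparse falls back to sys.argv[1:].
    p = setup()
    opts = p.parse_args(argv)
    print(opts.jobs, opts.files)
```

Because setup() returns a fresh ArgumentParser each time, a test can call it repeatedly, check defaults with get_default(), and feed it synthetic argument lists through parse_args() without touching sys.argv.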

(I think it's widely accepted that you shouldn't run much or any code at import time by putting it in the top level. But setting up an ArgumentParser may look very much like setting up a simple Python data structure like a map or a list, even though it's not really.)

python/ArgparseSetupWhere written at 00:22:07


The Prometheus host agent's CPU utilization metrics can be a bit weird

Among other metrics, the Prometheus host agent collects CPU time statistics on most platforms (including OpenBSD, although it's not listed in the README). This is the familiar division into 'user time', 'system time', 'idle time', and so on, exposed on a per-CPU basis on all of the supported platforms (all of which appear to get this from the kernel on a per-CPU basis). We use this in our Grafana dashboards, in two forms. In one form we graph a simple summary of non-idle time, which is produced by subtracting the rate() of idle time from 1, so we can see what hosts have elevated CPU usage; in the other we use a stacked graph of all non-idle time, so we can see what a specific host is spending its CPU time on. Recently, the summary graph showed that one of our OpenBSD L2TP servers was quite busy but our detailed graph for its CPU time wasn't showing all that much; this led me to discover that currently (as of 1.0.0-rc.0), the Prometheus host agent doesn't support OpenBSD's 'spinning' CPU time category.

However, the discovery of this discrepancy and its cause made me wonder about an assumption we've implicitly been making in these graphs (and in general), which is that all of the CPU times really do sum up to 100%. Specifically, we sort of assume that a sum of the rate() of every CPU mode for a specific CPU should be 1 under normal circumstances:

sum( rate( node_cpu_seconds_total[1m] ) ) without (mode)

The great thing about a metrics system with a flexible query language is that we don't have to wonder about this; we can look at our data and find out, using Prometheus subqueries. We can look at this for both individual CPUs and the host overall; often, the host overall is more meaningful, because that's what we put in graphs. The simple way to explore this is to look at max_over_time() or min_over_time() for your systems for this over some suitable time interval. The more complicated way is to start looking at the standard deviation, standard variance, and other statistical measures (although at that point you might want to consider trying to visualize a histogram of this data to look at the distribution too).
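The arithmetic behind this check is simple enough to sketch in Python. This is a simplified model with invented counter values, not what Prometheus actually does; the real rate() also handles counter resets and extrapolates to the edges of the time range. The function names here are made up for the example.

```python
def mode_rates(earlier, later, interval):
    """Per-second rate of each CPU mode between two counter samples."""
    return {mode: (later[mode] - earlier[mode]) / interval for mode in later}


def total_and_busy(earlier, later, interval):
    rates = mode_rates(earlier, later, interval)
    total = sum(rates.values())   # ideally 1.0 for a single CPU
    busy = 1.0 - rates["idle"]    # the 'summary' non-idle figure
    return total, busy


# Invented counter values (in seconds) for one CPU, sampled 60 seconds apart.
t0 = {"user": 100.0, "system": 50.0, "idle": 1000.0}
t1 = {"user": 112.0, "system": 56.0, "idle": 1039.0}
total, busy = total_and_busy(t0, t1, 60.0)
```

With these numbers the reported modes only sum to about 0.95, so '1 minus idle' says the CPU is about 35% busy while stacking the user and system rates only accounts for 30%. That sort of gap between the two graphs is what an unreported CPU time category looks like.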

(You can also simply graph the data and look at how noisy it is.)

Now that I've looked at this data for our systems, I can say that while CPU times usually sum up to very close to 100%, they don't always do so. Over a day, most servers have an average sum just under 100%, but there are a decent number of servers (and individual CPUs) where it's under 99%. Individual CPUs can average out as low as 97%. If I look at the maximums and minimums, it's clear that there are real bursts of significant inaccuracies both high and low; over the past day, one CPU on one server saw a total sum of 23.7 seconds in a one-minute rate(), and some dipped as low as 0.6 second (which is 40% of that CPU's utilization just sort of vanishing for that measurement).

Some of these are undoubtedly due to scheduling anomalies with the host agent, where the accumulated CPU time data it reports is not really collected at the time that Prometheus thinks it is, and things either undershoot or overshoot. But I'm not sure that Linux and other Unixes really guarantee that these numbers always add up right even at the best of times. There are always things that can go on inside the kernel, and on multiprocessor systems (which is almost all of them today) there's always a tradeoff over how accurate you are at the cost of how much locking and synchronization.

On a large scale basis this probably doesn't matter. But if I'm looking at data from a system on a very fine timescale because I'm trying to look into a brief anomaly, I probably want to remember that this sort of thing is possible. At that level, those nice CPU utilization graphs may not be quite as trustworthy as they look.

(These issues aren't unique to Prometheus; they're going to happen in anything that collects CPU utilization from a Unix kernel. It's just that Prometheus and other metrics systems immortalize the data for us, so that we can go back and look at it and spot these sorts of anomalies.)

sysadmin/PrometheusCPUStatsCaution written at 01:53:21


OpenBSD's 'spinning' CPU time category

Unix systems have long had a basic breakdown of what your CPU (or CPUs) was spending its time doing. The traditional division is user time, system time, idle time, and 'nice' time (which is user time for tasks that have their scheduling priority lowered through nice(1) or the equivalent), and then often 'interrupt' time, for how much time the system spent in interrupt handling. Some Unixes have added 'iowait', which is traditionally defined as 'the system was idle but one or more processes were waiting for IO to complete'. OpenBSD doesn't have iowait, but current versions have a new time category, 'spinning'.

The 'spinning' category was introduced in May of 2018, in this change:

Stopping counting and reporting CPU time spent spinning on a lock as system time.

Introduce a new CP_SPIN "scheduler state" and modify userland tools to display the % of timer a CPU spents spinning.

(This is talking about a kernel lock.)

Since this dates from May 2018, I believe it's in everything from OpenBSD 6.4 onward. It's definitely in OpenBSD 6.6. This new CPU time category is supported in OpenBSD's versions of top and systat, but it is not explicitly broken out by vmstat; in fact vmstat's 'sy' time is actually the sum of OpenBSD 'system', 'interrupt', and 'spinning'. Third party tools may or may not have been updated to add this new category.

(I don't know why OpenBSD hasn't updated vmstat. Perhaps they consider its output frozen for some reason, even though it's hiding information by merging everything together into 'sy'. The vmstat manpage is not technically lying about what 'sy' is, since all of true system time, interrupts, and spinning time are forms of time spent in the kernel, but the breakdown between those three can be important. And it means that you can't directly compare vmstat 'sy' to the system time you get from top or systat.)

Our experience is that under some loads, it's possible for a current SMP OpenBSD machine to spend quite appreciable amounts of time in this 'spinning' state. Specifically we've seen our dual CPU OpenBSD L2TP server spend roughly 33% of its time this way while people were apparently trying to push data through it as fast as they could go (which didn't actually go all that fast, perhaps because of all of that spinning).

Marking whether or not the current CPU is spinning on a kernel lock is handled in sys/kern/kern_lock.c, modifying a per-CPU scheduler state spc_spinning field. Tracking and accounting for this is handled in sys/kern/kern_clock.c, in handling the 'statistics clock' (look for use of CP_SPIN). User programs find out all of this through sysctl(2), specifically KERN_CPTIME2 and friends. All of OpenBSD's CPU time categories are found in sys/sched.h.

PS: If you're using a program that doesn't currently support the 'spinning' category, you can reverse engineer the spinning value by adding up all of the other ones and looking for what's missing. Normally, you would expect that all of the categories of CPU time add up to more or less 100%; if you have all but one of them, you can work backward to the missing one based on that. This may not be completely precise, but at least it will pick up large gaps.
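As a rough sketch of that subtraction in Python (the function name and the sample figures are invented for illustration, and this assumes your tool reports each category as a fraction of one CPU's time):

```python
def missing_fraction(known_fractions):
    """Estimate the unreported share of CPU time, given the fractions
    (0.0 to 1.0) of every category that the tool does report."""
    gap = 1.0 - sum(known_fractions.values())
    # A slightly negative gap is just measurement noise, so clamp it to zero.
    return max(gap, 0.0)


# Invented figures for a tool that reports everything except 'spinning'.
reported = {"user": 0.10, "nice": 0.0, "system": 0.30,
            "interrupt": 0.05, "idle": 0.22}
spin_estimate = missing_fraction(reported)
```

Here the reported categories add up to 0.67, so the estimate for the missing category is about 0.33. Small estimates should be treated with suspicion, since the categories don't always add up to exactly 100% even when nothing is missing.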

unix/OpenBSDCpuSpinTime written at 00:50:58


Any KVM over IP systems need to be on secure networks

In response to my entry wishing we had more servers with KVM over IP (among other things) now that we're working remotely, Ruben Greg raised a very important issue in a comment:

KVM over IP: Isnt this a huge security risk? Especially given the rare updates or poor security of these devices.

This is a very important issue if you're using KVM over IP, for two reasons. To start with, most KVM over IP implementations have turned out to have serious security issues, both in their web interface and often in their IPMI implementations. And beyond their generally terrible firmware security record, gaining access to a server's KVM over IP using stolen passwords or whatever generally gives an attacker full control over the server. It's almost as bad as letting them into your machine room so they can sit in front of the server and use its physical console.

(The KVM over IP and IPMI management systems are generally just little Linux servers running very old and very outdated software and kernels. Often it's not very good software either, and not necessarily developed with security as a high priority.)

If you use either KVM over IP or basic IPMI, you very much need to put the management interfaces on their own locked down network (or networks), and restrict (and guard) access to that network carefully. It's very much not safe to just give your KVM over IP interfaces some extra IPs on your regular network, unless you already have a very high level of trust in everyone who has access to that network. How you implement these restrictions will depend on your local networking setup (and on how system administrators work), and also on how your specific KVM over IP systems do their magic for console access and so on.

(I know that some KVM over IP systems want to make additional TCP connections between the management interface and your machine, and they don't necessarily document what ports they use and so on. As a result, I wouldn't try to do this with just SSH port forwarding unless I had a lot of time to devote to trying to make it work. It's possible that modern HTML5-based KVM over IP systems have gotten much better about this; I haven't checked our recent SuperMicro servers (which are a few years old by now anyway).)

PS: This security issue is why you should very much prefer KVM over IP (and IPMI) setups that use a dedicated management port, not ones that share a management port with the normal host. The problem with the latter is that an attacker who has root level access to the host can always put the host on your otherwise secure KVM over IP management network through the shared port, and then go attack your other KVM over IP systems.

sysadmin/KVMOverIPSecurity written at 01:34:08


The problem of your (our) external mail gateway using internal DNS views

Suppose, not hypothetically, that you have an external mail gateway (your external MX, where incoming email from the Internet is handed to you). This external MX server is a standard server and so you install it through your standard install process. As part of that standard install, it gets your normal /etc/resolv.conf, which points it to your local DNS resolvers. If you have a split horizon DNS setup, your local, internal DNS resolvers will naturally provide the internal view, complete with internal only hosts and entire zones for purely internal 'sandbox' networks (in our case all under a .sandbox DNS namespace).

Now you have a potential problem. If you do nothing special with your external MX, it will accept SMTP envelope sender addresses (ie MAIL FROM addresses) that exist only in your internal DNS zones. After all, as far as it is concerned they resolve as perfectly good DNS names with A records, and that's good enough to declare that the host exists and the mail should be accepted. You might think that no one would actually send email with such an envelope sender, and this is partially correct. People in the outside world are extremely unlikely to do this. However, people setting up internal hosts and configuring mailers in straightforward ways are extremely likely to send email to your external MX, because that's where your domain's MX points. If their internal machine is trying to send email to 'owner@your.domain', by default it will go through your external MX.

(If the email is handled purely locally and doesn't bounce, things may go okay. If someone tries to forward their email to elsewhere, it's probably not going to work.)

Fortunately it turns out that I already thought of this long ago (possibly after an incident); the Exim configuration on our external MX specifically rejects *.sandbox as an envelope sender address. This still lets through internal only names that exist in our regular public domains, and there are some of those. This is probably not important enough to try to fix.
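The logic of that check is roughly the following, sketched in Python rather than in Exim's configuration language (this is not our actual Exim configuration, and the function and constant names are made up). Note that it can only catch whole internal namespaces like .sandbox; it has no way to know about internal-only names inside a public domain.

```python
# Internal-only DNS namespaces; '.sandbox' is the one mentioned in the text.
INTERNAL_SUFFIXES = (".sandbox",)


def is_internal_sender(address):
    """True if the envelope sender's domain is under an internal-only namespace."""
    _, _, domain = address.rpartition("@")
    # Normalize case and drop any trailing dot from a fully qualified name.
    domain = domain.rstrip(".").lower()
    return any(domain == suffix[1:] or domain.endswith(suffix)
               for suffix in INTERNAL_SUFFIXES)
```

The match is on a whole DNS label, so a hypothetical host called 'notsandbox' would not be rejected, and the empty envelope sender used for bounces is not treated as internal.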

Fixing this in general is not straightforward and simple, because you probably don't already have a DNS resolver that provides an external view of the world (since you don't normally need such a thing). If I had to come up with a general fix, I would probably set up a local resolving DNS server on the external mail gateway (likely using Unbound) and have that provide a public DNS view instead of the internal one. Of course this might have side effects if used on a system wide level, which is probably the only way to really do it.

sysadmin/ExternalMXInternalDNS written at 00:40:16


How we set up our ZFS filesystem hierarchy in our ZFS pools

Our long standing practice here, predating even the first generation of our ZFS fileservers, is that we have two main sorts of filesystems, home directories (homedir filesystems) and what we call 'work directory' (workdir) filesystems. Homedir filesystems are called /h/NNN (for some NNN) and workdir filesystems are called /w/NNN; the NNN is unique across all of the different sorts of filesystems. Users are encouraged to put as much stuff as possible in workdirs and can have as many of them as they want, which mattered a lot more in the days when we used Solaris DiskSuite and had fixed-sized filesystems.

(This creates filesystems called things like /h/281 and /w/24.)

When we moved from DiskSuite to ZFS, we made the obvious decision to keep these user-visible filesystem names and the not entirely obvious decision that these filesystem names should work even on the fileservers themselves. This meant using the ZFS mountpoint property to set the mount point of all ZFS homedir and workdir filesystems, which works (and worked fine). However, this raised another question, that of what the actual filesystem name inside the ZFS pool should look like (since it no longer has to reflect the mount point).

There are a number of plausible answers here. For example, because our 'NNN' numbers are unique, we could have made all filesystems be simply '<pool>/NNN'. However, for various reasons we decided that the ZFS pool filesystem should reflect the full name of the filesystem, so /h/281 is '<pool>/h/281' instead of '<pool>/281' (among other things, we felt that this was easier to manage and work with). This created the next problem, which is that if you have a ZFS filesystem of <pool>/h/281, <pool>/h has to exist in some form. I suppose that we could have made these just be subdirectories in the root of the pool, but instead we decided to make them be empty and unmounted ZFS filesystems that are used only as containers:

zfs create -o mountpoint=none fs11-demo-01/h
zfs create -o mountpoint=none fs11-demo-01/w

We create these in every pool as part of our pool setup automation, and then we can make, for example, fs11-demo-01/h/281, which will be mounted everywhere as /h/281.
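The naming rule is mechanical enough to sketch in a few lines of Python. This is an illustration of the mapping, not our actual automation; the function names are invented, and the command is returned as an argument list rather than run.

```python
def dataset_name(pool, mountpoint):
    """Map a user-visible mount point like /h/281 to '<pool>/h/281'."""
    kind, number = mountpoint.strip("/").split("/")
    if kind not in ("h", "w"):
        raise ValueError("expected /h/NNN or /w/NNN, got %r" % mountpoint)
    return "%s/%s/%s" % (pool, kind, number)


def create_command(pool, mountpoint):
    """Argument list for creating a filesystem mounted at its user-visible name."""
    return ["zfs", "create", "-o", "mountpoint=" + mountpoint,
            dataset_name(pool, mountpoint)]
```

For the pool fs11-demo-01 and /h/281, this gives the equivalent of 'zfs create -o mountpoint=/h/281 fs11-demo-01/h/281', which only works once the fs11-demo-01/h container filesystem exists.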

(Making these be real ZFS filesystems means that they can have properties that will be inherited by their children; this theoretically enables us to apply some ZFS properties only to a pool's homedir or workdir filesystems. Probably the only useful one here is quotas.)

solaris/ZFSOurContainerFilesystems written at 23:47:32

Why we use 1U servers, and the two sides of them

Every so often I talk about '1U servers' and sort of assume that people know both what '1U' means here and what sort of server I mean by this. The latter is somewhat of a leap, since there are two sorts of server that 1U servers can be, and the former requires some hardware knowledge that may be getting less and less common in this age of the cloud.

In this context, the 'U' in 1U (or 2U, 3U, 4U, 5U, and so on) stands for a rack unit, a measure of server height in a standard server rack. Because racks have a standard width and a standard maximum depth, height is the only important variation in size for rack mounted servers. A 1U server is thus the smallest practical standalone server that you can get.

(Some 1U servers are shorter than others, and sometimes these short servers cause problems with physical access. They don't really save you any space because you generally can't put things behind them.)

In practice, there are two sorts of 1U servers, each with a separate audience. The first sort of 1U server is for people who have a limited amount of rack space and so want to pack as much computing into it as they can; these are high powered servers, densely packed with CPUs, memory, and storage, and are correspondingly expensive. The second sort of 1U server is for people who have a limited amount of money and want to get as many physical servers for it as possible; these servers have relatively sparse features and are generally not powerful, but they are the most inexpensive decently made rack mount servers you can buy.

(I believe that the cheapest servers are 1U because that minimizes the amount of sheet metal and so on involved. The motherboard, RAM, and a few 3.5" HDs can easily fit in the 1U height, and apparently it's not a problem for the power supply either. CPUs tend to be cooled using heatsinks with forced fan airflow over them, and often not very power hungry to start with. You generally get space for one or two PCIe cards mounted sideways on special risers, which is important if you want to add, say, 10G-T networking to your inexpensive 1U servers.)

We aren't rack space constrained, so our 1U servers are the inexpensive sort. We've had various generations of these servers, mostly from Dell; our 'current' generation are Dell R230s. That we buy 1U servers on price, to be inexpensive, is part of why our servers aren't as remote operation resilient as I'd now like.

(We have a few 1U servers that are more the 'dense and powerful' style than the 'inexpensive' style; they were generally bought for special purposes. I believe that some of them are from Supermicro.)

sysadmin/WhyWeUse1UServers written at 00:10:33



This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.