Wandering Thoughts

2020-03-30

It's worth documenting the obvious (before it stops being obvious)

I often feel a little bit silly when I write entries about things like making bar graphs in Grafana or tags for Grafana dashboard variables because when I write them up it's all pretty straightforward and even obvious. This is an illusion. It's all straightforward and obvious to me right now because I've been in the middle of doing this with Grafana, and so I have a lot of context and contextual knowledge. Not only do I know how to do things, I also know what they're called and roughly where to find information about them in Grafana's official documentation. All of this is going to fade away over time, as I stop making and updating our Grafana dashboards.

Writing down these obvious things has two uses. First and foremost, I'll have specific documentation for when I want to do this again in six months or a year or whatever (provided that I can remember that I wrote some entries on this and that I haven't left out crucial context, which I've done in the past). Second, actually writing down my own documentation forces me to understand things more thoroughly and hopefully helps fix them more solidly in my mind, so perhaps I won't even need my entries (or at least not need them so soon).

There are a lot of obvious things and obvious context that we don't document explicitly (in our worklog system or otherwise), which I've noticed before. Some of those obvious things don't really need to be documented because we do them all of the time, but I'm sure there are other things I'm dealing with right now that I won't be in six months. And even for the things that we do all the time, maybe it wouldn't hurt to explicitly write them up once (or every so often, or at the very least re-check the standard 'how we do X' documentation from time to time).

(Also, just because we do something all the time right now doesn't mean we always will. What we do routinely can shift over time, and we won't even necessarily directly notice the shift; it may just slowly be more and more of this and less of that. Or perhaps we'll introduce a system that automates a lot of something we used to do by hand.)

The other side of this, and part of why I'm writing this entry, is that I shouldn't feel silly about documenting the obvious, or at least I shouldn't let that feeling stop me from doing it. There's value in doing it even if the obvious remains obvious to me, and I should keep on doing a certain amount of it.

(Telling myself not to feel things is probably mostly futile. Humans are not rational robots, no matter how much we tell ourselves that we are.)

DocumentTheObvious written at 21:37:13

Notes on Grafana 'value groups' for dashboard variables

Suppose, not hypothetically, that you have some sort of Grafana overview dashboard that can show you multiple hosts at once in some way. In many situations, you're going to want to use a Grafana dashboard variable to let you pick some or all of your hosts. If you're getting the data for what hosts should be in your list from Prometheus, often you'll want to use label_values() to extract the data you want. For example, suppose that you have a label field called 'cshost' that is your local short host name for a host. Then a plausible Grafana query for 'all of our hosts' for a dashboard variable would be:

label_values( node_load1, cshost )

(Pretty much every Unix that the Prometheus host agent runs on will supply a load average, although they may not supply other metrics.)
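(As an aside, a minimal sketch of how such a dashboard variable typically gets used in a panel query is below; the variable name 'host', and the assumption that it's set to multi-value so that Grafana expands it into a regular expression alternation, are illustrative choices rather than anything specific to these dashboards.)

node_load1{ cshost=~"$host" }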

However, if you have a lot of hosts, this list can be overwhelming, and you may also have sub-groupings of hosts (such as all of your SLURM nodes) that you want to be able to narrow down to conveniently. To support this, Grafana has a dashboard variable feature called value groups, or just 'tags'. Value groups are a bit confusing and aren't as well documented as dashboard variables as a whole.

There are two parts to setting up a value group; you need a query that will give Grafana the names of all of the different groups (aka tags), and then a second query that will tell Grafana which hosts are in a particular group. Suppose that we have a metric to designate which classes a particular host is in:

cslab_class{ cshost="cpunode2", class="comps" }    1
cslab_class{ cshost="cpunode2", class="slurmcpu" } 1
cslab_class{ cshost="cpunode2", class="c6220" }    1

We can use this metric for both value group queries. The first query is to get all the tags, which are all the values of class:

label_values( cslab_class, class )

Note that we don't have to de-duplicate the result; Grafana will do that for us (although we could do it ourselves if we wanted to make a slightly more complex query).

The second query is to get all of the values for a particular group (or tag), which is to say the hosts for a specific class. In this query, we have a special Grafana-provided $tag variable that refers to the current class, so our query is now for the cshost label for things with that class:

label_values( cslab_class{ class="$tag" }, cshost )

It's entirely okay for this query to return some additional hosts (values) that aren't in our actual dashboard variable; Grafana will quietly ignore them for the most part.

Although you'll often want to use the same metric in both queries, it's not required. Both queries can be arbitrary and don't have to be particularly related to each other. Obviously, the results from the second query do have to exactly match the values you have in the dashboard variable itself. Unfortunately you don't have regexp rewriting for your results the way you do for the main dashboard variable query, so with Prometheus you may need to do some rewriting in the query itself using label_replace(). Also, there's no preview of what value groups (tags) your query generates, or what values are in what groups; you have to go play around with the dashboard to see what you get.
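(To give a concrete flavour of that label_replace() rewriting, here's a rough sketch that turns a hypothetical fully qualified 'host' label into the short cshost form that the dashboard variable uses. The metric name, the 'host' label, and the '.example.org' domain are all invented for illustration, and exactly how you embed this in the value group query is something you'll have to experiment with.)

label_replace( some_class_metric{ class="$tag" }, "cshost", "$1", "host", "(.+)\\.example\\.org" )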

GrafanaVariableGroups written at 00:49:43

2020-03-28

The Prometheus host agent's CPU utilization metrics can be a bit weird

Among other metrics, the Prometheus host agent collects CPU time statistics on most platforms (including OpenBSD, although it's not listed in the README). This is the familiar division into 'user time', 'system time', 'idle time', and so on, exposed on a per-CPU basis on all of the supported platforms (all of which appear to get this data from the kernel on a per-CPU basis). We use this in our Grafana dashboards, in two forms. In one form we graph a simple summary of non-idle time, which is produced by subtracting the rate() of idle time from 1, so we can see which hosts have elevated CPU usage; in the other we use a stacked graph of all non-idle time, so we can see what a specific host is spending its CPU time on. Recently, the summary graph showed that one of our OpenBSD L2TP servers was quite busy but our detailed graph for its CPU time wasn't showing all that much; this led me to discover that currently (as of 1.0.0-rc.0), the Prometheus host agent doesn't support OpenBSD's 'spinning' CPU time category.
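(For the record, a minimal sketch of that 'subtract idle from 1' summary for a host is below, assuming the standard host agent metric name and a one-minute rate() interval.)

1 - avg( rate( node_cpu_seconds_total{ mode="idle" }[1m] ) ) without (cpu, mode)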

However, the discovery of this discrepancy and its cause made me wonder about an assumption we've implicitly been making in these graphs (and in general), which is that all of the CPU times really do sum up to 100%. Specifically, we sort of assume that a sum of the rate() of every CPU mode for a specific CPU should be 1 under normal circumstances:

sum( rate( node_cpu_seconds_total[1m] ) ) without (mode)

The great thing about a metrics system with a flexible query language is that we don't have to wonder about this; we can look at our data and find out, using Prometheus subqueries. We can look at this for both individual CPUs and the host overall; often, the host overall is more meaningful, because that's what we put in graphs. The simple way to explore this is to look at max_over_time() or min_over_time() of this sum for your systems over some suitable time interval. The more complicated way is to start looking at the standard deviation, standard variance, and other statistical measures (although at that point you might want to consider trying to visualize a histogram of this data to look at the distribution too).
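(To be concrete, a minimal sketch of the sort of subquery I mean for the host-wide version is below; the one-minute rate() interval and the one-day subquery range are just illustrative choices, and the per-CPU version simply leaves 'cpu' out of the without() clause.)

max_over_time( ( sum( rate( node_cpu_seconds_total[1m] ) ) without (mode, cpu) )[1d:1m] )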

(You can also simply graph the data and look how noisy it is.)

Now that I've looked at this data for our systems, I can say that while CPU times usually sum up to very close to 100%, they don't always do so. Over a day, most servers have an average sum just under 100%, but there are a decent number of servers (and individual CPUs) where it's under 99%. Individual CPUs can average out as low as 97%. If I look at the maximums and minimums, it's clear that there are real bursts of significant inaccuracies both high and low; over the past day, one CPU on one server saw a total sum of 23.7 seconds in a one-minute rate(), and some dipped as low as 0.6 second (which is 40% of that CPU's utilization just sort of vanishing for that measurement).

Some of these are undoubtedly due to scheduling anomalies with the host agent, where the accumulated CPU time data it reports is not really collected at the time that Prometheus thinks it is, and things either undershoot or overshoot. But I'm not sure that Linux and other Unixes really guarantee that these numbers always add up right even at the best of times. There are always things that can go on inside the kernel, and on multiprocessor systems (which is almost all of them today) there's always a tradeoff between how accurate the accounting is and how much locking and synchronization that accuracy costs.

On a large scale basis this probably doesn't matter. But if I'm looking at data from a system on a very fine timescale because I'm trying to look into a brief anomaly, I probably want to remember that this sort of thing is possible. At that level, those nice CPU utilization graphs may not be quite as trustworthy as they look.

(These issues aren't unique to Prometheus; they're going to happen in anything that collects CPU utilization from a Unix kernel. It's just that Prometheus and other metrics systems immortalize the data for us, so that we can go back and look at it and spot these sorts of anomalies.)

PrometheusCPUStatsCaution written at 01:53:21

2020-03-26

Any KVM over IP systems need to be on secure networks

In response to my entry wishing we had more servers with KVM over IP (among other things) now that we're working remotely, Ruben Greg raised a very important issue in a comment:

KVM over IP: Isnt this a huge security risk? Especially given the rare updates or poor security of these devices.

This is a very important issue if you're using KVM over IP, for two reasons. To start with, most KVM over IP implementations have turned out to have serious security issues, both in their web interface and often in their IPMI implementations. And beyond their generally terrible firmware security record, gaining access to a server's KVM over IP using stolen passwords or whatever generally gives an attacker full control over the server. It's almost as bad as letting them into your machine room so they can sit in front of the server and use its physical console.

(The KVM over IP and IPMI management systems are generally just little Linux servers running very old and very outdated software and kernels. Often it's not very good software either, and not necessarily developed with security as a high priority.)

If you use either KVM over IP or basic IPMI, you very much need to put the management interfaces on their own locked down network (or networks), and restrict (and guard) access to that network carefully. It's very much not safe to just give your KVM over IP interfaces some extra IPs on your regular network, unless you already have a very high level of trust in everyone who has access to that network. How you implement these restrictions will depend on your local networking setup (and on how your system administrators work), and also on how your specific KVM over IP systems do their magic for console access and so on.

(I know that some KVM over IP systems want to make additional TCP connections between the management interface and your machine, and they don't necessarily document what ports they use and so on. As a result, I wouldn't try to do this with just SSH port forwarding unless I had a lot of time to devote to trying to make it work. It's possible that modern HTML5-based KVM over IP systems have gotten much better about this; I haven't checked our recent SuperMicro servers (which are a few years old by now anyway).)

PS: This security issue is why you should very much prefer KVM over IP (and IPMI) setups that use a dedicated management port, not ones that share a management port with the normal host. The problem with the latter is that an attacker who has root level access to the host can always put the host on your otherwise secure KVM over IP management network through the shared port, and then go attack your other KVM over IP systems.

KVMOverIPSecurity written at 01:34:08

2020-03-25

The problem of your (our) external mail gateway using internal DNS views

Suppose, not hypothetically, that you have an external mail gateway (your external MX, where incoming email from the Internet is handed to you). This external MX server is a standard server and so you install it through your standard install process. As part of that standard install, it gets your normal /etc/resolv.conf, which points it to your local DNS resolvers. If you have a split horizon DNS setup, your local, internal DNS resolvers will naturally provide the internal view, complete with internal only hosts and entire zones for purely internal 'sandbox' networks (in our case all under a .sandbox DNS namespace).

Now you have a potential problem. If you do nothing special with your external MX, it will accept SMTP envelope sender addresses (ie MAIL FROM addresses) that exist only in your internal DNS zones. After all, as far as it is concerned they resolve as perfectly good DNS names with A records, and that's good enough to declare that the host exists and the mail should be accepted. You might think that no one would actually send email with such an envelope sender, and this is partially correct. People in the outside world are extremely unlikely to do this. However, people setting up internal hosts and configuring mailers in straightforward ways are extremely likely to send email to your external MX, because that's where your domain's MX points. If their internal machine is trying to send email to 'owner@your.domain', by default it will go through your external MX.

(If the email is handled purely locally and doesn't bounce, things may go okay. If someone tries to forward their email to elsewhere, it's probably not going to work.)

Fortunately it turns out that I already thought of this long ago (possibly after an incident); the Exim configuration on our external MX specifically rejects *.sandbox as an envelope sender address. This still lets through internal only names that exist in our regular public domains, and there are some of those. This is probably not important enough to try to fix.

Fixing this in general is not straightforward and simple, because you probably don't already have a DNS resolver that provides an external view of the world (since you don't normally need such a thing). If I had to come up with a general fix, I would probably set up a local resolving DNS server on the external mail gateway (likely using Unbound) and have that provide a public DNS view instead of the internal one. Of course this might have side effects if used on a system wide level, which is probably the only way to really do it.

ExternalMXInternalDNS written at 00:40:16

2020-03-23

Why we use 1U servers, and the two sides of them

Every so often I talk about '1U servers' and sort of assume that people know both what '1U' means here and what sort of server I mean by this. The latter is somewhat of a leap, since there are two quite different sorts of 1U server, and the former requires some hardware knowledge that may be getting less and less common in this age of the cloud.

In this context, the 'U' in 1U (or 2U, 3U, 4U, 5U, and so on) stands for a rack unit, a measure of server height in a standard server rack. Because racks have a standard width and a standard maximum depth, height is the only important variation in size for rack-mounted servers. A 1U server is thus the smallest practical standalone server that you can get.

(Some 1U servers are shorter than others, and sometimes these short servers cause problems with physical access. They don't really save you any space because you generally can't put things behind them.)

In practice, there are two sorts of 1U servers, each with a separate audience. The first sort of 1U server is for people who have a limited amount of rack space and so want to pack as much computing into it as they can; these are high powered servers, densely packed with CPUs, memory, and storage, and are correspondingly expensive. The second sort of 1U server is for people who have a limited amount of money and want to get as many physical servers for it as possible; these servers have relatively sparse features and are generally not powerful, but they are the most inexpensive decently made rack mount servers you can buy.

(I believe that the cheapest servers are 1U because that minimizes the amount of sheet metal and so on involved. The motherboard, RAM, and a few 3.5" HDs can easily fit in the 1U height, and apparently it's not a problem for the power supply either. CPUs tend to be cooled with heatsinks that have forced fan airflow over them, and are often not very power hungry to start with. You generally get space for one or two PCIe cards mounted sideways on special risers, which is important if you want to add, say, 10G-T networking to your inexpensive 1U servers.)

We aren't rack space constrained, so our 1U servers are the inexpensive sort. We've had various generations of these servers, mostly from Dell; our 'current' generation are Dell R230s. That we buy 1U servers on price, to be inexpensive, is part of why our servers aren't as remote operation resilient as I'd now like.

(We have a few 1U servers that are more the 'dense and powerful' style than the 'inexpensive' style; they were generally bought for special purposes. I believe that some of them are from Supermicro.)

WhyWeUse1UServers written at 00:10:33

2020-03-20

Wishing for a remote resilient server environment (now that it's too late)

Due to world and local events, we have all abruptly been working from home and will be for some time. The process has made me wish for a number of differences in our server environment to make it what I'll call more remote resilient, more able to cope with life when you don't have sysadmins down the hall during the hours of official support.

One big ticket thing I now wish for is some kind of virtual machine host for test machines, one that can be used remotely over some combination of SSH and a remote graphics or desktop protocol as opposed to basically being a local desktop setup. We're accustomed to installing and testing things on spare physical hardware or (for me) VMWare Workstation on my office desktop, but that's no longer an option now that we're not in the office (and we can't do it on our home desktops; our install setup and various other aspects of our environment assume we're on the local network, with high bandwidth and low latency). I'm sure we could set this up on a spare server, but of course that requires some work in the office.

(Since this is only for testing builds and software and so on, it doesn't have to be the kind of full scale VM environment that we'd need for production use.)

In great timing, we had our first ever Dell 1U server PSU failure on Monday, taking down our CUPS print server until someone could come in and swap the hardware around. These days I like redundant power supplies but we tend to only consider them for very important machines, like our ZFS fileservers. That's a sensible choice for normal times when sysadmins are in the office and we can swap hardware relatively rapidly, but these are not normal times; having dual power supplies in anything that's a singleton server would clearly create more remote resilience. I don't know if there are 1U or 2U servers with dual power supplies that are only moderately more expensive than basic 1U servers, but if there are perhaps we should consider getting some in our next round of hardware purchases.

(We probably can't afford to make almost all of our servers have dual PSUs, but it certainly would be nice if we could.)

The final thing I'm rather missing is pervasive support for remote management that goes all the way up to KVM over IP and using remote CD (or DVD) images. We have a serial console server, but that only gets some things and you can't remotely install (or reinstall) a machine through it (plus, our 'serial consoles' are not the machine's real console). Our old SunFire 1U servers had full scale KVM over IP and it was very great, but since then only higher end machines like our ZFS fileservers have it; our basic Dell 1U servers don't. KVM over IP is generally an extra cost feature in one way or another (either in the form of more expensive servers or as an explicit license) and we've traditionally not paid that cost, but it does cost us remote resilience in various ways.

RemoteResilientSetupWish written at 22:41:56

2020-03-19

Make sure to keep useful labels in your Prometheus alert rules

Suppose, not entirely hypothetically, that you have some metrics that are broken out across categories but what you care about is the total number of things together. For example, you're monitoring some OpenBSD firewalls and you care about the total number of PF states, but your metrics break them down by protocol (this information is available in 'pfctl -ss' output). So your alert rule is going to be something like:

- alert: TooManyStates
  expr: sum( pfctl_protocol_entries ) by (server) > 80000
  ....

Congratulations, you may have just aimed a gun at your own foot. If you have additional labels on that pfctl_protocol_entries metric that you may want to use in the alert that will result from this (perhaps the datacenter or some other metadata), you've just lost them. When you said 'sum(...) by (server)', Prometheus faithfully did what you said; it summed everything by the server and as part of that threw away all other labels, because you told it all that mattered was the 'server' label.

There are two ways around this. The obvious, simple way that you may reach for in your haste to fix this issue is to add the additional metadata label or labels that you care about to the 'by()' expression, so you have, eg, 'sum(...) by (server, datacenter)'. The problem with this is that you're playing whack-a-mole, having to add each additional label to the list of labels as you remember them (or discover problems because they're missing). The better way is to be explicit about what you want to ignore:

sum( pfctl_protocol_entries ) without (proto)

This will automatically pass through all other labels, including ones that you add in six months from now as part of a metrics reorganization (long after you forgot that 'sum(..) by (...)' special case in one of your alert rules).
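(Putting this back into the alert rule, a minimal sketch of the corrected version with the same threshold looks like:)

- alert: TooManyStates
  expr: sum( pfctl_protocol_entries ) without (proto) > 80000
  ....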

After this experience, I've come to think that doing aggregation using 'by (...)' in your alert rules (or recording rules) is potentially dangerous and ought to at least be scrutinized carefully and probably commented. Sometimes there are good reasons for it where you want to narrow down to a known set of common labels or the like, but otherwise it is a potential trap even if it works for your setup today.

PrometheusKeepLabelsAlerts written at 23:51:33

2020-03-15

Why the choice of DNS over HTTPS server needs to be automatic (a sysadmin view)

At a general level, what DNS servers you should (and sometimes can) use depends on what network you're connected to. If you're connected to a network that just gives you general Internet access, then you (just) need a DNS server that gives you general public IP addresses. If you're connected to some sort of internal network, your network may have both special DNS names that aren't visible in the public DNS and special resolution for DNS names that are visible in the public DNS (often called split horizon DNS); resolving these names properly requires using the network's local DNS server. This is independent of whether you talk to those DNS servers with plain DNS, DNS over HTTPS the protocol, or DNS over TLS. We can call the public Internet and every different internal network environment a (network) world, and say that each world potentially needs a different DNS server setup.

A machine that doesn't move between network worlds can afford to have its DNS server settings configured manually, including manually setting it to use DNS over HTTPS to a local DNS server. A machine that moves between network worlds could theoretically be manually reconfigured after every move, but in practice these reconfigurations need to happen automatically (otherwise, someday they won't be done and then various problems appear). The generally available state of the art for automatic reconfiguration of DNS servers based on the network you're connected to is DHCP for IPv4 and DHCP6 or NDP for IPv6 (probably NDP, since I believe that Android devices still refuse to do DHCP6, so they can only find your DNS server through NDP); all of these give you plain old DNS servers.

(You can't automatically bootstrap to DNS over HTTPS or DNS over TLS with just this basic information, because normally all you get is the IP address and verifying TLS certificates requires you to know the name of what you're connecting to.)

The current state of the art of Firefox's use of DNS over HTTPS (the second and more common meaning of 'DNS over HTTPS') is mixed. Firefox doesn't have any way of switching DNS over HTTPS settings around based on the network world it's connected to, and while it attempts to work out when to use DNS over HTTPS, those heuristics face an impossible job. Essentially, Firefox is assuming that the majority of computers are connected mostly to the general Internet world. Sooner or later everyone will have to explicitly signal to Firefox that their internal networks are not in this general Internet world (using Firefox's canary domain for this). Even then, Firefox has no automatic way of reconfiguring which DNS over HTTPS server it uses; instead, all it has is enabling or disabling DoH (the protocol).

Until we get some form of support in Firefox for automatically changing its DNS over HTTPS server as it moves between network worlds, setting up your own local DoH server has only moderate practical use unless you're dealing only with machines that never move out of your local network world. If you truly need a local DNS server (as opposed to it being an efficiency or privacy thing), people can't use it all of the time, so when their machine moves outside your world their DNS over HTTPS settings have to change. But there's no way of doing that automatically today, so things are guaranteed to go wrong sooner or later.

(If you have some machines that don't move out of your local network world, you can set up a DNS over HTTPS server for them. Since this potentially enables encrypted SNI, there's some use for this. If you have no such machines, there's no point; your nice local DoH server will sit unused.)

PS: Using a local DNS over HTTPS server still allows you to keep track of local DNS lookups for things like malware detection, although you may not be able to use your fancy expensive IDS any more (at least until you can arrange to have your DNS server provide a feed of lookups to the IDS instead of having the IDS snoop on network traffic). Firefox using external DNS over HTTPS servers makes this worse than Firefox using external DNS servers in general, because you previously could snoop on that traffic without having to hunt down the owners of the devices to get them to change their DNS configuration to the right one.

DNSOverHTTPSVsNetworks written at 01:20:54

2020-03-04

Unix's iowait% is a narrow and limited measure that can be misleading

For a long time, iowait% has been one of my standard Unix system performance indicators to check, both for having visible iowait% and for not having it. As I interpreted it, a machine with high or appreciable iowait% clearly had potential IO issues, while a machine with low or no iowait% was fine as far as IO went, including NFS (this is on display in, for example, my entry on the elevated load average of our main IMAP server). Unfortunately, I've recently realized that the second half of this is not actually correct. To put it simply, programs waiting for IO may only create iowait% when the machine is otherwise idle.

Suppose, as a non-hypothetical example, that you have a busy IMAP server with lots of sessions from people who are spread all over your fleet of NFS fileservers, some of which are lightly loaded and fast and some of which are not, along with a constant background noise of random attackers on the Internet trying password guessing through SMTP authentication attacks and so on. With a lot of uncorrelated processes, it's quite possible that something will be runnable most of the time when (some) IMAP sessions are stalled waiting on NFS IO from your most heavily loaded fileserver. Since there are running processes, your waiting processes may well not show up as a visible iowait%, fooling you into thinking that everything is fine as far as IO goes.

In general, a high iowait% is a sign that your entire system is stalled on IO, but a low iowait% isn't necessarily a sign that no important processes are stalled on IO. The situation isn't symmetrical. In an ironic twist given what I wrote recently about it, I now think that an inexplicably high load average is probably a good signal that you have some processes stalling on IO while others are running fine (so that these stalls don't show up as iowait%), at least on Unixes where waiting on IO is reflected in the load average.

(The usual vmstat output reports a 'blocked' count, but that's an instantaneous number and may not fully capture things. The load average is gathered continuously and so will reflect more of the overall situation.)
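(If you're collecting this through the Prometheus host agent, a minimal sketch of the iowait% figure I'd look at alongside node_load1 is below; the one-minute rate() interval is just an illustrative choice, and on Linux the 'iowait' mode is the relevant one.)

avg( rate( node_cpu_seconds_total{ mode="iowait" }[1m] ) ) without (cpu) * 100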

Now that I've realized this, I'm going to have to be much more careful about seeing a low iowait% and concluding that the system is fine as far as IO goes. Unfortunately I'm not sure if there's any good metrics for this that are widely available and easily worked out, especially for NFS (where you don't generally have a 'utilization' percentage in the way you usually do for local disks).

(There's a practical problem with iowait% on modern Linux systems, but that needs another entry.)

IowaitIsNarrow written at 23:33:32
