2024-03-15
The problem of using basic Prometheus to monitor DNS query results
Suppose that you want to make sure that your DNS servers are working correctly, for both your own zones and for outside DNS names that are important to you. If you have your own zones you may also care that outside people can properly resolve them, perhaps both within the organization and genuine outsiders using public DNS servers. The traditional answer to this is the Blackbox exporter, which can send the DNS queries of your choice to the DNS servers of your choice and validate the result. Well, more or less.
What you specifically do with the Blackbox exporter is configure some modules and then provide those modules with targets to check (through your Prometheus configuration). When you're probing DNS, the module's configuration specifies all of the parameters of the DNS query and its validation. This means that if you are checking N different DNS names to see if they give you a SOA record (or an A record or an MX record), you need N different modules. Quite reasonably, the metrics Blackbox generates when you check a target don't (currently) include the actual DNS name or query type you're asking about. This matters because it makes it difficult to write a generic alert that produces a specific message along the lines of 'asking for the X type of record for host Y failed'.
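To make this concrete, here's a sketch of what a single such module could look like, using the Blackbox exporter's documented DNS probe options (the module name, DNS name, and validation regexp here are made-up illustrations):

modules:
  dns_example_org_soa:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.org"
      query_type: "SOA"
      # fail the probe if the answer section has no SOA record
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - ".*IN.*SOA.*"

Checking another DNS name or query type means cloning the whole module under a new name with a different query_name or query_type.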
You can somewhat get around this by encoding this information into the names of your Blackbox modules and then doing various creative things in your Prometheus configuration. However, you still have to write all of the modules out, even though many of them may be basically cut and paste versions of each other with only the DNS names changed. This has a number of issues, including that it's a disincentive to doing relatively comprehensive cross checks. (I speak from experience with our Prometheus setup.)
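To illustrate the sort of creative things I mean: if your module names follow a fixed pattern such as 'dns_<name>_soa', you can recover the DNS name from the module URL parameter during relabeling. The following is a sketch along the lines of the standard Blackbox relabeling setup; the 'probed_name' label is my own invention, and the recovered value is whatever you encoded into the module name, so it may need further massaging:

scrape_configs:
  - job_name: dns-checks
    metrics_path: /probe
    params:
      module: [dns_example_org_soa]
    static_configs:
      - targets: [dns-server-1]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # where the Blackbox exporter runs
      # recover the DNS name that was encoded into the module name
      - source_labels: [__param_module]
        regex: "dns_(.*)_soa"
        target_label: probed_name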
There is a third party dns_exporter that can be set up in a more flexible way, where all parts of the DNS check can be provided by Prometheus (although it exposes some metrics that risk label cardinality explosions). However, this still leaves you listing in your Prometheus configuration a cross-matrix of every DNS name you want to query and every DNS server you want to query it against. What you avoid is needing to configure a bunch of Blackbox modules (although what you lose is the ability to verify that the queries returned specific results).
To do better, I think we'd need to write a custom program (perhaps run through the script exporter) that contained at least some of this knowledge, such as what DNS servers to check. Then our Prometheus configuration could just say 'check this DNS name against the usual servers' and the script would know the rest. Unfortunately you probably can't reuse any of the current Blackbox code for this, even if you wrote the core of this script in Go.
(You could make such a program relatively generic by having it take the list of DNS servers to query from a configuration file. You might want to make it support multiple lists of DNS servers, each of them named, and perhaps set various flags on each server, and you can get quite elaborate here if you want to.)
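As a hypothetical sketch of such a configuration file (everything here is invented for illustration; the 192.0.2.x addresses are from the documentation-only range):

server-lists:
  internal:
    - server: 192.0.2.10
    - server: 192.0.2.11
      flags: [tcp-only]
  public:
    - server: 8.8.8.8
    - server: 1.1.1.1
defaults:
  query-type: SOA
  timeout: 2s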
(This elaborates on a Fediverse post of mine.)
2024-03-14
You might want to think about if your system serial numbers are sensitive
Recently, a commentator on my entry about what's lost when running the Prometheus host agent as a non-root user on Linux pointed out that if you do this, one of the things omitted (that I hadn't noticed) is part of the system DMI information. Specifically, you lose various serial numbers and the 'product UUID', which is potentially another unique identifier for the system, because Linux makes the /sys/class/dmi/id files with these readable only by root (this appears to have been the case since support for these was added to /sys in 2007). This got me thinking about whether serial numbers are something we should consider sensitive in general.
My tentative conclusion is that for us, serial numbers probably aren't sensitive enough to do anything special about. I don't think any of our system or component serial numbers can be used to issue one time license keys or the like, and while people could probably do some mischief with some of them, this is likely a low risk thing in our academic environment.
(Broadly we don't consider any metrics to be deeply sensitive, or to put it another way we wouldn't want to collect any metrics that are because in our environment it would take a lot of work to protect them. And we do collect DMI information and put it into our metrics system.)
This doesn't mean that serial numbers have no sensitivity even for us; I definitely do consider them something that I generally wouldn't (and don't) put in entries here, for example. Depending on the vendor, revealing serial numbers to the public may let the public do things like see your exact system configuration, when it was delivered, and other potentially somewhat sensitive information. There's also more of a risk that bored Internet people will engage in even minor mischief.
However, your situation is not necessarily like ours. There are probably plenty of environments where serial numbers are potentially more sensitive or more dangerous if exposed (especially if exposed widely). And in some environments, people run semi-hostile software that would love to get its hands on a permanent and unique identifier for the machine. Before you gather or expose serial number information (for systems or for things like disks), you might want to think about this.
At the same time, having relatively detailed hardware configuration information can be important, as in the war story that inspired me to start collecting this information in our metrics system. And serial numbers are a great way to disambiguate exactly which piece of hardware was being used for what, when. We deliberately collect disk drive serial number information from SMART, for example, and put it into our metrics system (sometimes with amusing results).
2024-03-11
Why we should care about usage data for our internal services
I recently wrote about some practical-focused thoughts on usage data for your services. But there's a broader issue about usage data for services and having or not having it. My sense is that for a lot of sysadmins, building things to collect usage data feels like accounting work and likely to lead to unpleasant and damaging things, like internal chargebacks (which can create various problems of their own). However, I think we should strongly consider routinely gathering this data anyway, for fundamentally the same reasons that you should collect information on what TLS protocols and ciphers are being used by your people and software.
We periodically face decisions both obvious and subtle about what to do about services and the things they run on. Do we spend the money to buy new hardware, do we spend the time to upgrade the operating system or the version of the third party software, do we need to closely monitor this system or service, does it need to be optimized or be given better hardware, and so on. Conversely, maybe this is now a little-used service that can be scaled down, dropped, or simplified. In general, the big question is do we need to care about this service, and if so how much. High level usage data is what gives you most of the real answers.
(In some environments one fate for narrowly used services is to be made the responsibility of the people or groups who are the service's big users, instead of something that is provided on a larger and higher level.)
Your system and application metrics can provide you some basic information, like whether your systems are using CPU and memory and disk space, and perhaps how that usage is changing over a relatively long time base (if you keep metrics data long enough). But they can't really tell you why that is happening or not happening, or who is using your services, and deriving usage information from things like CPU utilization requires either knowing things about how your systems perform or assuming them (eg, assuming you can estimate service usage from CPU usage because you're sure it uses a visible amount of CPU time). Deliberately collecting actual usage gives you direct answers.
Knowing who is using your services and who is not also gives you the opportunity to talk to both groups about what they like about your current services, what they'd like you to add, what pieces of your service they care about, what they need, and perhaps what's keeping them from using some of your services. If you don't have usage data and don't actually ask people, you're flying relatively blind on all of these questions.
Of course collecting usage data has its traps. One of them is that what usage data you collect is often driven by what sort of usage you think matters, and in turn this can be driven by how you expect people to use your services and what you think they care about. Or to put it another way, you're measuring what you assume matters and you're assuming what you don't measure doesn't matter. You may be wrong about that, which is one reason why talking to people periodically is useful.
PS: In theory, gathering usage data is separate from the question of whether you should pay attention to it, where the answer may well be that you should ignore that shiny new data. In practice, well, people are bad at staying away from shiny things. Perhaps it's not a bad thing to have your usage data require some effort to assemble.
(This is partly written to persuade myself of this, because maybe we want to routinely collect and track more usage data than we currently do.)
2024-03-09
Some thoughts on usage data for your systems and services
Some day, you may be called on by decision makers (including yourself) to provide some sort of usage information for things you operate so that you can make decisions about them. I'm not talking about system metrics such as how much CPU is being used (although for some systems that may be part of higher level usage information, for example for our SLURM cluster); this is more on the level of how much things are being used, by who, and perhaps for what. In the very old days we might have called this 'accounting data' (and perhaps disdained collecting it unless we were forced to by things like chargeback policies).
In an ideal world, you will already be generating and retaining the sort of usage information that can be used to make decisions about services. But internal services aren't necessarily automatically instrumented the way revenue generating things are, so you may not have this sort of thing built in from the start. In this case, you'll generally wind up hunting around for creative ways to generate higher level usage information from the low level metrics and logs that you do have. When you do this, my first suggestion is to write down how you generated your usage information. This probably won't be the last time you need to generate usage information, and also if decision makers (including you in the future) have questions about exactly what your numbers mean, you can go back to look at exactly how you generated them to provide answers.
(Of course, your systems may have changed around by the next time you need to generate usage information, so your old ways don't work or aren't applicable. But at least you'll have something.)
My second suggestion is to look around today to see if there's data you can easily collect and retain now that will let you provide better usage information in the future. This is obviously related to keeping your logs longer, but it also includes making sure that things make it to your logs (or at least to your retained logs, which may mean setting things to send their log data to syslog instead of keeping their own log files). At this point I will sing the praises of things like 'end of session' summary log records that put all of the information about a session in a single place instead of forcing you to put the information together from multiple log lines.
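(As a made-up illustration of what I mean, a single record along the lines of:

session-end user=jdoe service=imap duration=542s bytes-in=10234 bytes-out=48211

is far easier to turn into usage data later than reassembling the same information from separate login and logout log lines.)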
(When you've just been through the exercise of generating usage data is an especially good time to do this, because you'll be familiar with all of the bits that were troublesome or where you could only provide limited data.)
Of course there are privacy implications of retaining lots of logs and usage data. This may be a good time to ask around to get advance agreement on what sort of usage information you want to be able to provide and what sort you definitely don't want to have available for people to ask for. This is also another use for arranging to log your own 'end of session' summary records, because if you're doing it yourself you can arrange to include only the usage information you've decided is okay.
2024-03-01
Options for your Grafana panels when your metrics change names
In an ideal world, your metrics never change their names; once you put them into a Grafana dashboard panel, they keep the same name and meaning forever. In the real world, sometimes a change in metric name is forced on you, for example because you might have to move from collecting a metric through one Prometheus exporter to collecting it with another exporter which naturally gives it a different name. And sometimes a metric will be renamed by its source.
In a Prometheus environment, the very brute force way to deal with this is either a recording rule (creating a duplicate metric with the old name) or renaming the metric during ingestion. However I feel that this is generally a mistake. Almost always, your Prometheus metrics should record the true state of affairs, warts and all, and it should be on other things to sort out the results.
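For illustration, part of the temptation is that such a recording rule is trivial to write; here's a sketch, with made-up metric names:

groups:
  - name: compat-renames
    rules:
      # republish the new metric under its old name; per the above,
      # I think this is generally a mistake
      - record: old_metric_name
        expr: new_metric_name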
(As part of this, I feel that Prometheus metric names should always be honest about where they come from. There's a convention that the name of the exporter is at the start of the metric name, and so you shouldn't generate your own metrics with someone else's name on them. If a metric name starts with 'node_', it should come from the Prometheus host agent.)
So if your Prometheus metrics get renamed, you need to fix this in your Grafana panels (which can be a pain but is better in the long run). There are at least three approaches I know of. First, you can simply change the name of the metric in all of the panels. This keeps things simple but means that your historical data stops being visible on the dashboards. If you don't keep historical data for very long (or don't care about it much), this is fine; pretty soon the metric's new name will be the only one in your metrics database. In our case, we keep years of data and do want to be able to look back, so this isn't good enough.
The second option is to write your queries in Grafana as basically 'old_name or new_name'. If your queries involve rate() and avg() and other functions, this can be a lot of (manual) repetition, but if you're careful and lucky you can arrange for the old and the new query results to have the same labels as Grafana sees them, so your panel graphs will be continuous over the metric name boundary.
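For instance, with made-up metric names, a rate() based query in this style would look like:

rate(old_metric_name[5m]) or rate(new_metric_name[5m])

Since 'or' only takes results from its right side when the left side has no matching series, the new metric simply takes over at the point where the old one stops existing.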
The third option is to duplicate the query and then change the name of the metric (or the metrics) in the new copy of the query. This is usually straightforward and easy, but it definitely gives you graphs that aren't continuous around the name change boundary. The graphs will have one line for the old metric and then a new second line for your new metric. One advantage of separate queries is that you can someday turn the old query off in Grafana without having to delete it.
2024-02-28
Detecting absent Prometheus metrics without knowing their labels
When you have a Prometheus setup, one of the things you sooner or later worry about is important metrics quietly going missing because they're not being reported any more. There can be many reasons for metrics disappearing on you; for example, a network interface you expect to be at 10G speeds may not be there at all any more, because it got renamed at some point, so now you're not making sure the new name is at 10G.
(This happened to us with one machine's network interface, although I'm not sure exactly how except that it involves the depths of PCIe enumeration.)
The standard Prometheus feature for this is the 'absent()' function, or sometimes absent_over_time(). However, both of these have the problem that because of Prometheus's data model, you need to know at least some unique labels that your metrics are supposed to have. Without labels, all you can detect is the total disappearance of a metric, when nothing at all is reporting it. If you want to be alerted when some machine stops reporting a metric, you need to list all of the sources that should have the metric (following a pattern we've seen before):
absent(metric{host="a", device="em0"}) or absent(metric{host="b", device="eno1"}) or absent(metric{host="c", device="eth2"})
Sometimes you don't know all of the label values that your metric will be present with (or it's tedious to list all of them and keep them up to date), and it's good enough to get a notification if a metric disappears when it was previously there (for a particular set of labels). For example, you might have an assortment of scripts that report their success results somewhere, and you don't want to have to keep a list of all of the scripts, but you do want to detect when a script stops reporting its metrics. In this case we can use 'offset' to check current metrics against old metrics. The simplest pattern is:
your_metric offset 1h unless your_metric
If the metric was there an hour ago and isn't there now, this will generate the metric as it was an hour ago (with the labels it had then), and you can use that to drive an alert (or at least a notification). If there are labels that might naturally change over time in your_metric, you can exclude them with 'unless ignoring (...)' or use 'unless on (...)' for a very focused result.
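For example, here's a sketch that assumes a hypothetical 'instance' label that might legitimately change over time:

your_metric offset 1h unless ignoring (instance) your_metric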
As written this has the drawback that it only looks at what versions of the metric were there exactly an hour ago. We can do better by using an *_over_time() function, for example:
max_over_time( your_metric[4h] ) offset 1h unless your_metric
Now if your metric existed (with some labels) at any point between five hours ago and one hour ago, and doesn't exist now, this expression will give you a result and you can alert on that. Since we're using *_over_time(), you can also leave off the 'offset 1h' and just extend the time range, and then maybe extend the other time range too:
max_over_time( your_metric[12h] ) unless max_over_time( your_metric[20m] )
This expression will give you a result if your_metric has been present (with a given set of labels) at some point in the last 12 hours but has not been present within the last 20 minutes.
(You'd pick the particular *_over_time() function to use depending on what use, if any, you have for the value of the metric in your alert. If you have no particular use for the value (or you expect the value to be a constant), either max or min are efficient for Prometheus to compute.)
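Put together as an alert rule, this could look something like the following sketch (the metric name is made up, and I'm assuming it carries 'script' and 'host' labels):

- alert: MetricGoneMissing
  expr: max_over_time( your_metric[12h] ) unless max_over_time( your_metric[20m] )
  annotations:
    summary: "{{ $labels.script }} on {{ $labels.host }} has stopped reporting your_metric"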
All of these clever versions have a drawback, which is that after enough time has gone by they shut off on their own. Once the metric has been missing for at least an hour or five hours or 12 hours or however long, even the first part of the expression has nothing and you get no results and no alert. So this is more of a 'notification' than a persistent 'alert'. That's unfortunately the best you can really do. If you need a persistent alert that will last until you take it out of your alert rules, you need to use absent() and explicitly specify the labels you expect and require.
2024-02-27
Our probably-typical (lack of) machine inventory situation
As part of thinking about how we configure machines to monitor and what to monitor on them, I mentioned in passing that we don't generate this information from some central machine inventory because we don't have a single source of truth for a machine inventory. This isn't to say that we don't have any inventory of our machines; instead, the problem is that we have too many inventories, each serving somewhat different purposes.
The core reason that we have wound up with many different lists of machines is that we use many different tools and systems that need lists of machines, and each of them has a different input format and input sources. It's technically possible to generate all of these different lists of machines from some single master source, but by and large you then get to build, manage, and maintain both the software for the master source and the software to extract and reformat all of the machine lists for the various programs that need them. In many cases (certainly in ours), this adds extra work over just maintaining N lists of machines for N programs and subsystems.
(It also generally means maintaining a bespoke custom system for your environment, which is a constant ongoing expense in various ways.)
So we have all sorts of lists of machines, for a broad view of what a machine is. Here's an incomplete list:
- DNS entries (all of our servers have static IPs), but not all DNS entries still exist as hardware, much less hardware that is turned on. In addition, we have DNS entries for various IP aliases and other things that aren't unique machines.
(We'd have more confusion if we used virtual machines, but all of our production machines are on physical hardware.)
- NFS export permissions for hosts that can do NFS mounts from our fileservers, but not all of our active machines can do this and there are some listed host names that are no longer turned on or perhaps not even still in DNS.
(NFS export permissions aren't uniform between hosts; some have extra privileges.)
- Hosts that we have established SSH host keys for. This includes hosts that aren't currently in service and may never be in service again.
- Ubuntu machines that are updated by our bulk updates system, which is driven by another 'list of machines' file that is also used for some other bulk operations. But this data file omits various machines we don't manage that way (or at best only belatedly includes them), and while it tracks some machine characteristics it doesn't have all of them.
(And sometimes we forget to add machines to this data file, which we at least get a notification about. Well, for Ubuntu machines.)
- Unix machines that we monitor in various ways in our Prometheus system. These machines may be ping'd, have their SSH port checked to see if it answers, run the Prometheus host agent, and run additional agents to export things like GPU metrics, depending on what the machine is.
Not all turned-on machines are monitored by Prometheus for various reasons, including that they are test or experimental machines. And temporarily turned off machines tend to be temporarily removed to reduce alert and dashboard noise.
- Our console server has a whole configuration file of what machines have a serial console and how they're configured and connected up. Turned-off machines that are still connected to the console server remain in this configuration file, and they can then linger even after being de-cabled.
- We mostly use 'smart' PDUs that can selectively turn outlets off, which means that we track what machine is on what PDU port. This is tracked both in a master file and in the PDU configurations (they have menus that give text labels to ports).
- A 'server inventory' of where servers are physically located and other basic information about the server hardware, generally including a serial number. Not all racked physical servers are powered on, and not all powered on servers are in production.
- Some degree of network maps, to track what servers are connected to what switches for troubleshooting purposes.
- Various forms of server purchase records with details about the physical hardware, including serial numbers, which we have to keep in order to be able to get rid of the hardware later. This doesn't include the current host name (if any) that the hardware is currently being used for, or where the hardware is (currently) located.
If we assigned IPs to servers through DHCP, we'd also have DHCP configuration files. These would have to track servers by another identity, their Ethernet address, which would in turn depend on what networking the server was using. If we switched a server from 1G networking to 10G networking by putting a 10G card in it, we'd have to change the DHCP MAC information for the server but nothing else about it would change.
There's also confusion over what exactly 'a machine' is, partly because different pieces care about different aspects. We assign DNS host names to roles, not to physical hardware, but the role is implemented in some chunk of physical hardware and sometimes the details of that hardware matter. This leads to more potential confusion in physical hardware inventories, because sometimes we want to track that a particular piece of hardware was 'the old <X>' in case we have to fall back to that older OS for some reason.
(And sometimes we have pre-racked spare hardware for some important role and so what hardware is live in that role and what is the spare can swap around.)
We could put all of this information in a single database (probably in multiple tables) and then try to derive all of the various configuration files from it. But it clearly wouldn't be simple (and some of it would always have to be manually maintained, such as the physical location of hardware). If there is off the shelf open source software that will do a good job of handling this, it's quite likely that setting it up (and setting up our inventory schema) would be fairly complex.
Instead, the natural thing to do in our environment when you need a new list of machines for some purpose (for example, when you're setting up a new monitoring system) is to set up a new configuration file for it, possibly deriving the list of machines from another, existing source. This is especially natural if the tool you're working with already has its own configuration file format.
(If our lists of machines had to change a lot it might be tempting to automatically derive some of the configuration files from 'upstream' data. But generally they don't, which means that manual handling is less work because you don't have to build an entire system to handle errors, special exceptions, and so on.)
2024-02-22
A recent abrupt change in Internet SSH brute force attacks against us
It's general wisdom in the sysadmin community that if you expose a SSH port to the Internet, people will show up to poke at it, and by 'people' I mean 'attackers that are probably mostly automated'. For several years, the pattern to this that I've noticed was an apparent combination of two activities. There was a constant background pitter-patter of various IPs each making a probe once a minute or less (but for tens of minutes or longer), and then periodic bursts where a single IP would be more active, sometimes significantly so.
(Although I can't be sure, I think the rate of both the background probes and the periodic bursts was significantly up compared to how it was a couple of years ago. Unfortunately making direct comparisons is a bit difficult due to Grafana Loki issues.)
Then there came this past Tuesday, and I noticed something that I reported on the Fediverse:
This is my system administrator's "what is wrong" face when Internet ssh authentication probes against our systems seem to have fallen off a cliff, as reported by system logs. We shouldn't be seeing only two in the last hour.
(The nose dive seems to have started at 6:30 am Eastern and hit 'basically nothing' by 9:30 am.)
After looking at this longer, the pattern I'm now seeing on our systems is basically that the background low-volume probes seem to have gone away. Every so often some attacker will fire up a serious bulk probe, making (for example) 400 attempts over half an hour (often for a random assortment of nonexistent logins); rarely there will be a burst where a dozen IPs each make an attempt or two and then stop (there are some signs that a lot of the IPs are Tor exit nodes). But for a lot of the time, there's nothing. We can go an hour or three with absolutely no probes at all, which never used to happen; previously a typical baseline rate of probes was around a hundred an hour.
Since the higher-rate SSH probes get through fine, this doesn't seem to be anything in our firewalls or local configurations (I initially wondered about things like a change in logging that came in with an Ubuntu package update). Instead it seems to be a change in attacker behavior, and since it took about two hours to take full effect on Tuesday morning, I wonder if it was something getting progressively shut down or reoriented.
2024-02-09
Compatibility lingers long after it's needed (until it gets noticed)
We have a system for propagating login and password information around our fleet. In this system, all information about user logins flows out from our 'password master' machine, and each other machine can filter and transform that global login information as the machine merges it into the local /etc/passwd. Normal machines use the login information more or less as-is, but unusual ones can do things like set the shells of all non-staff accounts to a program that just prints out 'only staff can log in to this machine' and logs them out. All of this behavior is controlled by a configuration file that tells the program what to do, by matching characteristics of logins and then applying transformations based on what matched. This system has existed for a very long time, probably since we started significantly using Ubuntu sometime in late 2006 or 2007.
Because this system is so old, it once existed in a world where we had a bunch of Solaris servers that users logged in to and the password master machine itself was a Solaris machine. These Solaris machines had quite different paths both for some user shells, like Bash, and 'administrative' shells like the program that told people this was a staff machine or their account was suspended (this was back in the days when you could reasonably use shells for that sort of thing). When we propagated login entries from these Solaris machines to our new Ubuntu machines, we needed to change these Solaris paths to Ubuntu paths, and by 'we' I mean that our password merging and mangling program did. For reasons beyond the scope of this entry, these Solaris path rewritings are specified as transformations in the configuration file, although in practice we applied them all of the time.
We long ago stopped having Solaris login servers or using a Solaris machine as the password master (that ended at the start of 2010, which is later than I expected and had vaguely remembered; at that point our Ubuntu environment was several years old). At the point where our password master became an Ubuntu server, all of that remapping of Solaris shell paths was unnecessary. However, our configuration files for password mangling have faithfully preserved those boilerplate directives for the Solaris shell path rewriting:
@hdir: newhomedir /u fixlocalshell fixadmshell
@all: fixadmshell
These 'fixlocalshell' and 'fixadmshell' directives are the lingering remains of that Solaris compatibility. They've been unneeded for more than a decade, but we never really noticed them and so they stayed. They would still be an ignored layer of now-unneeded compatibility if I hadn't wound up re-working some of the documentation for the program today, and in the process realized that we could and should take them out.
(We should remove them from the configuration file because they're confusing noise, especially if you don't work with this program very often and so you have to try to remember what all of the directives do.)
Are there other places with lingering pieces of compatibility with Solaris and other now-gone things in our environment? Probably. We don't particularly look for these things, and often our eyes probably just pass over them as a background thing that we're accustomed to. It's how things are done, and we don't think too much about it on a day to day basis (in other words, it's sort of a superstition).
2024-02-05
We might want to regularly keep track of how important each server is
Today we had a significant machine room air conditioning failure in our main machine room, one that certainly couldn't be fixed on the spot ('glycol all over the roof' is not a phrase you really want to hear about your AC's chiller). To keep the machine room's temperature down, we had to power off as many machines as possible without too badly affecting the services we offer to people here, which are rather varied. Some choices were obvious; all of our SLURM nodes that were in the main machine room got turned off right away. But others weren't things we necessarily remembered right away or we weren't clear if they were safe to turn off and what effects it would have. In the end we took several rounds of turning servers off, looking at what was left, spotting remaining machines, and turning more things off, and we're probably not done yet.
(We have secondary machine room space and we're probably going to have to evacuate servers into it, too.)
One thing we could do to avoid this flailing in the future is to explicitly (try to) keep track of which machines are important and which ones aren't, to pre-plan which machines we could shut down if we had a limited amount of cooling or power. If we documented this, we could avoid having to wrack our brains at the last minute and worry about dependencies or uses that we'd forgotten. Of course documentation isn't free; there's an ongoing amount of work to write it and keep it up to date. But possibly we could do this work as part of deploying machines or changing their configurations.
(This would also help identify machines that we didn't need any more but hadn't gotten around to taking out of service, which we found a couple of in this iteration.)
Writing all of this just in case of further AC failures is probably not all that great a choice of where to spend our time. But writing down this sort of thing can often help to clarify how your environment is connected together in general, including things like what will probably break or have problems if a specific machine (or service) is out, and perhaps which people depend on what service. This can be valuable information in general. The machine room archaeology of 'what is this machine, why is it on, and who is using it' can be fun occasionally, but you probably don't want to do it regularly.
(Will we actually do this? I suspect not. When we deploy and start using a machine its purpose and so on feel obvious, because we have all of the context.)