Our probably-typical (lack of) machine inventory situation

February 27, 2024

As part of thinking about how we configure machines to monitor and what to monitor on them, I mentioned in passing that we don't generate this information from some central machine inventory because we don't have a single source of truth for a machine inventory. This isn't to say that we don't have any inventory of our machines; instead, the problem is that we have too many inventories, each serving somewhat different purposes.

The core reason that we have wound up with many different lists of machines is that we use many different tools and systems that need lists of machines, and each of them has a different input format and input sources. It's technically possible to generate all of these different lists for different programs and tools from some single master source, but by and large you then get to build, manage, and maintain both the software for the master source and the software to extract and reformat all of the machine lists for the various programs that need them. In many cases (certainly in ours), this adds extra work over just maintaining N lists of machines for N programs and subsystems.

(It also generally means maintaining a bespoke custom system for your environment, which is a constant ongoing expense in various ways.)
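
To make the extraction work concrete, here's a minimal sketch in Python of deriving two different tool-specific lists from one master machine list. The master file format, field names, host names, and port are all invented for illustration; a real version would also need error handling and support for special exceptions.

```python
# Hypothetical sketch: derive per-tool machine lists from one master source.
# The master format and all field names here are invented.

MASTER = """\
apps1 class=server monitor=yes nfs=yes
apps2 class=server monitor=yes nfs=no
testbox class=test monitor=no nfs=no
"""

def parse_master(text):
    """Parse 'name key=value ...' lines into a list of dicts."""
    machines = []
    for line in text.splitlines():
        name, *attrs = line.split()
        machines.append({"name": name, **dict(a.split("=") for a in attrs)})
    return machines

def prometheus_targets(machines):
    # One tool wants 'host:port' target strings for monitored machines.
    return [m["name"] + ":9100" for m in machines if m["monitor"] == "yes"]

def nfs_export_hosts(machines):
    # Another tool wants a plain list of hosts allowed to NFS mount.
    return [m["name"] for m in machines if m["nfs"] == "yes"]

machines = parse_master(MASTER)
print(prometheus_targets(machines))  # ['apps1:9100', 'apps2:9100']
print(nfs_export_hosts(machines))    # ['apps1']
```

Even this toy version shows the shape of the problem: every consumer needs its own extraction and reformatting function, and the master format has to anticipate every attribute any consumer will ever want.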

So we have all sorts of lists of machines, under a broad view of what counts as a machine. Here's an incomplete list:

  • DNS entries (all of our servers have static IPs), but not every DNS entry still corresponds to hardware that exists, much less hardware that is turned on. In addition, we have DNS entries for various IP aliases and other things that aren't unique machines.

    (We'd have more confusion if we used virtual machines, but all of our production machines are on physical hardware.)

  • NFS export permissions for hosts that can do NFS mounts from our fileservers, but not all of our active machines can do this, and some of the listed host names belong to machines that are no longer turned on or perhaps are no longer even in DNS.

    (NFS export permissions aren't uniform between hosts; some have extra privileges.)

  • Hosts that we have established SSH host keys for. This includes hosts that aren't currently in service and may never be in service again.

  • Ubuntu machines that are updated by our bulk updates system, which is driven by another 'list of machines' file that is also used for some other bulk operations. But this data file omits various machines we don't manage that way (or at best only belatedly includes them), and while it tracks some machine characteristics it doesn't have all of them.

    (And sometimes we forget to add machines to this data file, which we at least get a notification about. Well, for Ubuntu machines.)

  • Unix machines that we monitor in various ways in our Prometheus system. These machines may be ping'd, have their SSH port checked to see if it answers, run the Prometheus host agent, and run additional agents to export things like GPU metrics, depending on what the machine is.

    Not all turned-on machines are monitored by Prometheus for various reasons, including that they are test or experimental machines. And temporarily turned off machines tend to be temporarily removed to reduce alert and dashboard noise.

  • Our console server has a whole configuration file of what machines have a serial console and how they're configured and connected up. Turned-off machines that are still connected to the console server remain in this configuration file, and they can then linger even after being de-cabled.

  • We mostly use 'smart' PDUs that can selectively turn outlets off, which means that we track what machine is on what PDU port. This is tracked both in a master file and in the PDU configurations (they have menus that give text labels to ports).

  • A 'server inventory' of where servers are physically located and other basic information about the server hardware, generally including a serial number. Not all racked physical servers are powered on, and not all powered on servers are in production.

  • Some degree of network maps, to track what servers are connected to what switches for troubleshooting purposes.

  • Various forms of server purchase records with details about the physical hardware, including serial numbers, which we have to keep in order to be able to get rid of the hardware later. This doesn't include the host name (if any) that the hardware is currently being used for, or where the hardware is currently located.
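
The Prometheus list above is a good example of how per-tool these formats get. One common shape for such a list is a file-based service discovery target file; here's a hedged sketch, with the host names, port, and labels entirely invented:

```yaml
# Hypothetical Prometheus file_sd target file (names and labels invented).
# Temporarily turned-off machines get removed from a file like this to
# reduce alert and dashboard noise.
- targets:
    - apps1.example.org:9100
    - apps2.example.org:9100
  labels:
    class: server
- targets:
    - gpu1.example.org:9100
  labels:
    class: gpu
```

Nothing in this file is shared with, say, the NFS export permissions or the console server configuration, even though many of the same machines appear in all three.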

If we assigned IPs to servers through DHCP, we'd also have DHCP configuration files. These would have to track servers by another identity, their Ethernet address, which would in turn depend on what networking the server was using. If we switched a server from 1G networking to 10G networking by putting a 10G card in it, we'd have to change the DHCP MAC information for the server but nothing else about it would change.
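
A sketch of what such an entry might look like (in ISC dhcpd style, with an invented host name, MAC address, and IP) shows how the Ethernet address becomes the identity key:

```
# Hypothetical ISC dhcpd host entry; the hardware (MAC) address is the key.
host apps1 {
  hardware ethernet 00:11:22:33:44:55;  # must change if we swap in a 10G card
  fixed-address 192.0.2.10;             # the assigned IP itself doesn't change
}
```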

There's also confusion over what exactly 'a machine' is, partly because different pieces care about different aspects. We assign DNS host names to roles, not to physical hardware, but the role is implemented in some chunk of physical hardware and sometimes the details of that hardware matter. This leads to more potential confusion in physical hardware inventories, because sometimes we want to track that a particular piece of hardware was 'the old <X>' in case we have to fall back to that older OS for some reason.

(And sometimes we have pre-racked spare hardware for some important role and so what hardware is live in that role and what is the spare can swap around.)

We could put all of this information in a single database (probably in multiple tables) and then try to derive all of the various configuration files from it. But it clearly wouldn't be simple (and some of it would always have to be manually maintained, such as the physical location of hardware). If there is off-the-shelf open source software that will do a good job of handling this, it's quite likely that setting it up (and working out our inventory schema) would be fairly complex.
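
A minimal sketch of the multiple-tables shape, using SQLite with entirely invented table and column names: because host names attach to roles rather than to physical hardware, the two naturally live in separate tables joined by a serial number.

```python
import sqlite3

# Hypothetical inventory schema sketch; all names here are invented.
# Host names attach to roles, and roles are implemented by pieces of
# hardware, so they are separate tables joined on a serial number.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE hardware (serial TEXT PRIMARY KEY, rack TEXT, model TEXT);
CREATE TABLE roles (hostname TEXT PRIMARY KEY, serial TEXT REFERENCES hardware);
""")
db.execute("INSERT INTO hardware VALUES ('SN1234', 'rack 3', 'generic 1U')")
db.execute("INSERT INTO roles VALUES ('apps1', 'SN1234')")

# Deriving any one of the many machine lists then becomes a query:
rows = db.execute(
    "SELECT r.hostname, h.rack FROM roles r JOIN hardware h USING (serial)"
).fetchall()
print(rows)  # [('apps1', 'rack 3')]
```

The schema itself is the easy part; the ongoing cost is writing and maintaining the query-and-reformat step for every program that needs its own machine list.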

Instead, the natural thing to do in our environment when you need a new list of machines for some purpose (for example, when you're setting up a new monitoring system) is to set up a new configuration file for it, possibly deriving the list of machines from another, existing source. This is especially natural if the tool you're working with already has its own configuration file format.

(If our lists of machines had to change a lot it might be tempting to automatically derive some of the configuration files from 'upstream' data. But generally they don't, which means that manual handling is less work because you don't have to build an entire system to handle errors, special exceptions, and so on.)
