Wandering Thoughts archives

2024-03-27

Some questions to ask about what silencing alerts means

A common desired feature for an alert notification system is that you can silence (some) alert notifications for a while. You might silence alerts about things that are under planned maintenance, or do it generally in the dead of night for things that aren't important enough to wake someone. This sounds straightforward but in practice my simple description here is under-specified and raises some questions about how things behave (or should behave).

The simplest implementation of silencing alert notifications is for the alerting system to go through all of its normal process for sending notifications but not actually deliver the notifications; the notifications are discarded, diverted to /dev/null, or whatever. In the view of the overall system, the alert notifications were successfully delivered, while in your view you didn't get emailed, paged, notified in some chat channel, or whatever.

However, there are a number of situations where you may not want to discard alert notifications this way, but instead defer them until after the silence has ended. Here are some cases:

  • If an alert starts during the silence and is still in effect when the silence ends, many people will want to get an alert notification about it at (or soon after) the end of the silence. Otherwise, you have to remember to look at dashboards or other sources of alert information to see what current problems you have.

  • If an alert started before the silence and ends (resolves) during the silence, some people will want to get an alert notification about the alert having been resolved at the end of the silence. Otherwise you're once again left to look at your dashboards to notice that some things cleaned up during the silence.

    (This assumes you normally send notifications about resolved alerts, which not everyone does.)

  • If an alert both starts and ends during the silence, most people will say that you shouldn't get an alert notification about it afterward. Otherwise silences would simply defer alert notifications about things like planned maintenance, not eliminate them. However, some people would like to get some sort of summary or general notification about alerts that came up and got resolved during the silence.

    (This is perhaps especially likely for the 'silence in the depths of the night' or 'silence over the weekend' sorts of schedule based silencing. You may still want to know that things happened, just not bother people with them on the spot.)

Whether you want post-silence alert notifications in some or all of these situations will depend in part on what you use alert notifications for (or how the designers of your system expect this to work). In some environments, an alert notification is in effect a message that says 'go look at your dashboards', so you don't need this at the end of a planned maintenance since you're probably already doing that. In other environments, the alert notification is either the primary signal that something is wrong or the primary source of information for what to do about it (by carrying links to runbooks, suggested remediations, relevant dashboards, and so on). Getting an alert notification for 'new' alerts is then vital because that's primarily how you know you have to do something and maybe know what to do.

(And in some environments, getting alert notifications about resolved alerts is the primary method people use to track outstanding alerts, making those important.)

sysadmin/AlertSilencingQuestions written at 23:25:22; Add Comment

2024-03-26

How I would automate monitoring DNS queries in basic Prometheus

Recently I wrote about the problem of using basic Prometheus to monitor DNS query results, which comes about primarily because the Blackbox exporter requires a configuration stanza (a module) for every DNS query you want to make and doesn't expose any labels for what the query type and name are. In a comment, Mike Kohne asked if I'd considered using a script to generate the various configurations needed for this, where you want to check N DNS queries across M different DNS servers. I hadn't really thought about it and we're unlikely to do it, but here is how I would if we did.

The input for the generation system is a list of DNS queries we want to confirm work, which is at least a name and a DNS query type (A, MX, SOA, etc), possibly along with an expected result, and a list of the DNS servers that we want to make these queries against. A full blown system would allow multiple groups of queries and DNS servers, so that you can query your internal DNS servers for internal names as well as external names you want to always be resolvable.

First, I'd run a completely separate Blackbox instance for this purpose, so that its configuration can be entirely script-generated. For each DNS query to be made, the script will work out the Blackbox module's name and then put together the formulaic stanza, for example:

autodns_a_utoronto_something:
  prober: dns
  dns:
    query_name: "utoronto.example.com"
    query_type: "A"
    validate_answer_rrs:
      fail_if_none_matches_regexp:
        - ".*\t[0-9]*\tIN\tA\t.*"

Then your generation program combines all of these stanzas together with some stock front matter and you have this Blackbox instance's configuration file. It only needs to change if you add a new DNS name to query.

The other thing the script generates is a list of scrape targets and labels for them in the format that Prometheus file discovery expects. Since we're automatically generating this file we might as well put all of the smart stuff into labels, including specifying the Blackbox module. This would give us one block for each module that lists all of the DNS servers that will be queried for that module, and the labels necessary. This could be JSON or YAML, and in YAML form it would look like (for one module):

- labels:
    # Hopefully you can directly set __param_module in
    # a label like this.
    __param_module: autodns_a_utoronto_something
    query_name: utoronto.example.com
    query_type: A
    [... additional labels based on local needs ...]
  targets:
  - dns1.example.org:53
  - dns2.example.org:53
  - 8.8.8.8:53
  - 1.1.1.1:53
  [...]

(If we're starting with data in a program it's probably better to generate JSON. Pretty much every language can create JSON by now, and it's a more forgiving format than trying to auto-generate YAML even if the result is less readable. But if I was going to put the result in a version control repository, I'd generate YAML.)
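
Putting the two pieces together, the generation script itself can be quite small. Here is a minimal sketch in Python; the query list, DNS server list, and output file names are invented for illustration (a real version would read them from somewhere), but the module naming and stanza format follow the 'autodns_...' scheme above:

#!/usr/bin/env python3
# Sketch of generating the Blackbox modules and the Prometheus file
# discovery targets described above. The queries, servers, and file
# names here are illustrative assumptions.
import json

QUERIES = [
    ("utoronto.example.com", "A"),
    ("example.org", "MX"),
]
SERVERS = [
    "dns1.example.org:53",
    "dns2.example.org:53",
    "8.8.8.8:53",
    "1.1.1.1:53",
]

def module_name(qname, qtype):
    # Turn a query into a predictable module name, eg
    # 'autodns_a_utoronto_example_com'.
    return "autodns_%s_%s" % (qtype.lower(), qname.replace(".", "_"))

MODULE_STANZA = """\
  {module}:
    prober: dns
    dns:
      query_name: "{qname}"
      query_type: "{qtype}"
      validate_answer_rrs:
        fail_if_none_matches_regexp:
          - ".*\\t[0-9]*\\tIN\\t{qtype}\\t.*"
"""

def write_blackbox_config(fname):
    # The 'stock front matter' here is just the top-level modules: key.
    with open(fname, "w") as f:
        f.write("modules:\n")
        for qname, qtype in QUERIES:
            f.write(MODULE_STANZA.format(module=module_name(qname, qtype),
                                         qname=qname, qtype=qtype))

def write_scrape_targets(fname):
    # Prometheus file discovery accepts JSON as well as YAML.
    blocks = [{"labels": {"__param_module": module_name(qname, qtype),
                          "query_name": qname,
                          "query_type": qtype},
               "targets": SERVERS}
              for qname, qtype in QUERIES]
    with open(fname, "w") as f:
        json.dump(blocks, f, indent=2)

if __name__ == "__main__":
    write_blackbox_config("autodns-blackbox.yml")
    write_scrape_targets("autodns-targets.json")

Rerun by hand (or from cron) whenever the query or server lists change, this rewrites both files; you then reload the Blackbox instance, and Prometheus picks up the changed targets file on its own.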

More elaborate breakdowns are possible, for example to separate external DNS servers from internal ones, and other people's DNS names from your DNS names. You'll get an awful lot of stanzas with various mixes of labels, but the whole thing is being generated automatically and you don't have to look at it. In our local configuration we'd wind up with at least a few extra labels and a more complicated set of combinations.

We need the query name and query type available as labels because we're going to write one generic alert rule for all of these Blackbox modules, something like:

- alert: DNSGeneric
  expr: probe_success{probe=~"autodns_.*"} == 0
  for: 5m
  annotations:
    summary: "We cannot get the {{$labels.query_type}} record for {{$labels.query_name}} from the DNS server ..."

(If Blackbox put these labels in DNS probe metrics we could skip adding them in the scrape target configuration. We'd also be able to fold a number of our existing DNS alerts into more generic ones.)

If you go the extra distance to have some DNS lookups require specific results (instead of just 'some A record' or 'some MX record'), then you might need additional labels to let you write a more specific alert rule.

For us, both generated files would be relatively static. As a practical matter we don't add extra names to check or DNS servers to test against very often.

We could certainly write such a configuration file generation system and get more comprehensive coverage of our DNS zones and various nameservers than we currently have. However, my current view is that the extra complexity almost certainly wouldn't be worth it in terms of detecting problems and maintaining the system. We'd make more queries against more DNS servers if it was easier to do so, as it would be with such a generation system, but those queries would almost never detect anything we didn't already know.

sysadmin/PrometheusAutomatingDNSChecks written at 23:06:45; Add Comment

2024-03-25

Options for diverting alerts in Prometheus

Suppose, not hypothetically, that you have a collection of machines and some machines are less important than others or are of interest only to a particular person. Alerts about normal machines should go to everyone; alerts about the special machines should go elsewhere. There are a number of options to set this up in Prometheus and Alertmanager, so today I want to run down a collection of them for my own future use.

First, you have to decide the approach you'll use in Alertmanager. One option is to specifically configure an early Alertmanager route that knows the names of these machines. This is the most self-contained option, but it has the drawback that Alertmanager routes can often intertwine in complicated ways that are hard to keep track of. For instance, you need to keep your separate notification routes for these machines in sync.

(I should write down in one place the ordering requirements for routes in our Alertmanager configuration, because several times I've made changes that didn't have the effect I wanted because I had the route in the wrong spot.)

The other Alertmanager option is to set up general label-based markers for alerts that should be diverted and rely on Prometheus to get the necessary label on to the alerts about these special machines. My view is that you're going to want to have such 'testing' alerts in general, so sooner or later you're going to wind up with this in your Alertmanager configuration.

Once Prometheus is responsible for labeling the specific alerts that should be diverted, you have some options:

  • The Prometheus alert rule can specifically add the appropriate label. This works great if it's a testing alert rule that you always want to divert, but less well if it's a general alert that you only want to divert some of the time.

  • You can arrange for metrics from the specific machines to have the special label values necessary. This has three problems. First, it creates additional metrics series if you change how a machine's alerts are handled. Second, it may require ugly contortions to pull some scrape targets out to different sections of a static file, so you can put different labels on them. And lastly, it's error-prone, because you have to make sure all of the scrape targets for the machine have the label on them.

    (You might even be doing special things in your alert rules to create alerts for the machine out of metrics that don't come from scraping it, which can require extra work to add labels to them.)

  • You can add the special label marker in Prometheus alert relabeling, by matching against your 'host' label and creating a new label. This will be something like:

    - source_labels: [host]
      regex: vmhost1
      target_label: send
      replacement: testing
    

    You'll likely want to do this at the end, or at least after any other alert label canonicalization you're doing to clean up host names, map service names to hosts, and so on.

Now that I've sat down and thought about all of these options, the one I think I like the best is alert relabeling. Alert relabeling in Prometheus puts this configuration in one central place, instead of spreading it out over scrape targets and alert rules, and it does so in a setting that doesn't have quite as many complex ordering issues as Alertmanager routes do.

(Adding labels in alert rules is still the right answer if the alert itself is in testing, in my view.)

sysadmin/PrometheusAlertDiversionOptions written at 22:58:26; Add Comment

2024-03-24

Platform peculiarities and Python (with an example)

I have a long standing little Python tool to turn IP addresses into verified hostnames and report what's wrong if it can't do this (doing verified reverse DNS lookups is somewhat complicated). Recently I discovered that socket.gethostbyaddr() on my Linux machines was only returning a single name for an IP address that was associated with more than one. A Fediverse thread revealed that this reproduced for some people, but not for everyone, and that it also happened in other programs.

The Python socket.gethostbyaddr() documentation doesn't discuss specific limitations like this, but the overall socket documentation does say that the module is basically a layer over the platform's C library APIs. However, it doesn't document exactly what APIs are used, and in this case it matters. Glibc on Linux says that gethostbyaddr() is deprecated in favour of getnameinfo(), so a C program like CPython might reasonably use either to implement its gethostbyaddr(). The C gethostbyaddr() supports returning multiple names (at least in theory), but getnameinfo() specifically does not; it only ever returns a single name.

In practice, the current CPython on Linux will normally use gethostbyaddr_r() (see Modules/socketmodule.c's socket_gethostbyaddr()). This means that CPython isn't restricted to returning a single name and instead inherits whatever peculiarities glibc has (or another libc has, for people on Linux distributions that use an alternative libc). On glibc, it appears that this behavior depends on what NSS modules you're using; the default glibc 'dns' NSS module doesn't seem to normally return multiple names this way, even for glibc APIs where this is possible.

Given all of this, it's not surprising that the CPython documentation doesn't say anything specific. There's not very much specific it can say, since the behavior varies in so many peculiar ways (and has probably changed over time). However, this does illustrate that platform peculiarities are visible through CPython APIs, for better or worse (and, like me, you may not even be aware of those peculiarities until you encounter them). If you want something that is certain to bypass platform peculiarities, you probably need to do it yourself (in this case, probably with dnspython).
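
As a rough sketch of the difference (assuming dnspython 2.x and its dns.resolver.resolve() API), you can compare what the platform gives you with what the DNS itself says:

import socket

import dns.resolver
import dns.reversename

addr = "8.8.8.8"    # an example IP; substitute one you care about

# What the platform's C library APIs give CPython; on my Linux
# machines this only ever produces one name.
name, aliases, _ = socket.gethostbyaddr(addr)
print("gethostbyaddr:", [name] + aliases)

# Asking the DNS directly gets you every PTR record there is.
rname = dns.reversename.from_address(addr)
answer = dns.resolver.resolve(rname, "PTR")
print("dnspython:", sorted(str(r.target).rstrip(".") for r in answer))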

(The Go documentation for a similar function does specifically say that if it uses the C library it returns at most one result, but that's because the Go authors know their function calls getnameinfo() and as mentioned, that can only return one name (at most).)

python/PythonAndPlatformPeculiarities written at 22:53:03; Add Comment

2024-03-23

The many possible results of turning an IP address into a 'hostname'

One of the things that you can do with the DNS is ask it to give you the DNS name for an IP address, in what is called a reverse DNS lookup. A full and careful reverse DNS lookup is more complex than it looks and has more possible results than you might expect. As a result, it's common for system administrators to talk about validated reverse DNS lookups versus plain or unvalidated reverse DNS lookups. If you care about the results of the reverse DNS lookup, you want to validate it, and this validation is where most of the extra results come into play.

(To put the answer first, a validated reverse DNS lookup is one where the name you got from the reverse DNS lookup also exists in DNS and lists your initial IP address as one of its IP addresses. This means that the organization responsible for the name agrees that this IP is one of the IPs for that name.)

The result of a plain reverse DNS lookup can be zero, one, or even many names, or a timeout (which is in effect zero results but which takes much longer). Returning more than one name from a reverse DNS lookup is uncommon and some APIs for doing this don't support it at all, although DNS does. However, you cannot trust the name or names that result from reverse DNS, because reverse DNS lookups are done using a completely different set of DNS zones than domain names use, and as a result can be controlled by a completely different person or organization. I am not Google, but I can make reverse DNS for an IP address here claim to be a Google hostname.

(Even within an organization, people can make mistakes with their reverse DNS information, precisely because it's less used than the normal (forward) DNS information. If you have a hostname that resolves to the wrong IP address, people will notice right away; if you have an IP address that resolves to the wrong name, people may not notice for some time.)

So for each name you get in the initial reverse DNS lookup, there are a number of possibilities:

  • The name is actually an IP address (generally IPv4) in text form. People really do this even if they're not supposed to, and your DNS software probably won't screen these out.

  • The name is the special DNS name used for that IP address's reverse DNS lookup (or at least some IP's lookup). It's possible for such names to also have IP addresses, and so you may want to explicitly screen them out and not consider them to be validated names.

  • The name is for a private or non-global name or zone. People do sometimes leak internal DNS names into reverse DNS records for public IPs.
  • The name is for what should be a public name but it doesn't exist in the DNS, or it doesn't have any IP addresses associated with it in a forward lookup.

    In both of these cases we can say the name is unknown. If you don't treat 'the name is an IP address' specially, such a name will also turn up as unknown here if you make a genuine DNS query.

  • The name exists in DNS with IP addresses, but the IP address you started with is not among the IP addresses returned for it in a forward lookup. We can say that the name is inconsistent.

  • The name exists in DNS with IP addresses, and one of those IP addresses is the IP address you started with. The name is consistent and the reverse DNS lookup is valid; the IP address you started with is really called that name.

(There may be a slight bit of complexity in doing the forward DNS lookup.)
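
To make the validation concrete, here's a minimal Python sketch of it using the standard socket module. It deliberately refuses to validate a 'name' that is really an IP address in text form, for reasons covered below, and it doesn't try to separate 'unknown' from 'inconsistent' names:

import ipaddress
import socket

def is_ip_literal(name):
    # A 'name' that is actually an IP address in text form.
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False

def validated_names(ip):
    # The names from reverse DNS that also list ip as one of their
    # IP addresses in a forward lookup.
    try:
        name, aliases, _ = socket.gethostbyaddr(ip)
    except OSError:
        return []           # no reverse DNS result at all (or a timeout)
    valid = []
    for cand in [name] + aliases:
        if is_ip_literal(cand):
            continue        # not a real name, never valid
        try:
            infos = socket.getaddrinfo(cand, None)
        except OSError:
            continue        # unknown: the name doesn't resolve
        if ip in {ai[4][0] for ai in infos}:
            valid.append(cand)   # consistent: the name maps back to ip
        # otherwise the name is inconsistent for this IP
    return valid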

If a reverse DNS lookup for an IP address gave you more than one name, you may only care whether there is one valid name (which gives you a name for the IP), you may want to know all of the valid names, or you may want to check that all names are valid and consider it an error if any of them aren't. It depends on why you're doing the reverse DNS lookup and validation. And you might also care about why a name doesn't validate for an IP address, or that an IP address has no reverse DNS lookup information.

Of course if you're trying to find the name for an IP address, you don't necessarily have to use a reverse DNS lookup. In some sense, the 'name' or 'names' for an IP address are whatever DNS names point to it as (one of) their IP address(es). If you have an idea what those names might be, you can just directly check them all to see if you find the IP you're curious about.

If you're writing code that validates IP address reverse DNS lookups, one reason to specifically check for and care about a name that is an IP address is that some languages have 'name to IP address' APIs that will helpfully give you back an IP address if you give them one in text form. If you don't check explicitly, you can look up an IP address, get the IP address in text form, feed it into such an API, get the IP address back again, and conclude that this is a validated (DNS) name for the IP.

It's extremely common for IP addresses to have names that are unknown or inconsistent. It's also pretty common for IP addresses to not have any names, and not uncommon for reverse DNS lookups to time out because the people involved don't operate DNS servers that return timely answers (for one reason or another).

PS: It's also possible to find out who an IP address theoretically belongs to, but that's an entirely different discussion (or several of them). Who an IP address belongs to can be entirely separate from what its proper name is. For example, in common colocation setups and VPS services, the colocation provider or VPS service will own the IP, but its proper name may be a hostname in the organization that is renting use of the provider's services.

tech/DNSIpLookupsManyPossibilities written at 23:07:31; Add Comment

2024-03-22

The Linux kernel.task_delayacct sysctl and why you might care about it

If you run a recent enough version of iotop on a typical Linux system, it may nag at you to the effect of:

CONFIG_TASK_DELAY_ACCT and kernel.task_delayacct sysctl not enabled in kernel, cannot determine SWAPIN and IO %

You might wonder whether you should turn on this sysctl, how much you care, and why it was defaulted to being disabled in the first place.

This sysctl enables (Task) Delay accounting, which tracks things like how long things wait for the CPU or wait for their IO to complete on a per-task basis (which in Linux means 'thread', more or less). General system information will provide you an overall measure of this in things like 'iowait%' and pressure stall information, but those are aggregates; you may be interested in knowing things like how much specific processes are being delayed or are waiting for IO.

(Also, overall system iowait% is a conservative measure and won't give you a completely accurate picture of how much processes are waiting for IO. You can get per-cgroup pressure stall information, which in some cases can come close to a per-process number.)

In the context of iotop specifically, the major thing you will miss is 'IO %', which is the percent of the time that a particular process is waiting for IO. Task delay accounting can give you information about per-process (or task) run queue latency but I don't know if there are any tools similar to iotop that will give you this information. There is a program in the kernel source, tools/accounting/getdelays.c, that will dump the raw information on a one-time basis (and in some versions, compute averages for you, which may be informative). The (current) task delay accounting information you can theoretically get is documented in comments in include/uapi/linux/taskstats.h, or this version in the documentation. You may also want to look at include/linux/delayacct.h, which I think is the kernel internal version that tracks this information.

(You may need the version of getdelays.c from your kernel's source tree, as the current version may not be backward compatible to your kernel. This typically comes up as compile errors, which are at least obvious.)

How you can access this information yourself is sort of covered in Per-task statistics interface, but in practice you'll want to read the source code of getdelays.c or the Python source code of iotop. If you specifically want to track how long a task spends delaying for IO, there is also a field for it in /proc/<pid>/stat; per proc(5), field 42 is delayacct_blkio_ticks. As far as I can tell from the kernel source, this is the same information that the netlink interface will provide, although it only has the total time waiting for 'block' (filesystem) IO and doesn't have the count of block IO operations.
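
As a small illustration, here's a Python sketch that reads that field for a process and converts it to seconds (the field is counted in clock ticks, and you can't naively split the whole stat line because the comm field can contain spaces):

import os
import sys

def blkio_delay_seconds(pid):
    # Field 42 of /proc/<pid>/stat is delayacct_blkio_ticks (see proc(5)).
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # The comm field (field 2) can contain spaces and ')', so split at
    # the last ')' and count from field 3 onward.
    rest = data.rsplit(")", 1)[1].split()
    ticks = int(rest[42 - 3])
    return ticks / os.sysconf("SC_CLK_TCK")

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print("%d: %.2f seconds delayed on block IO" % (pid, blkio_delay_seconds(pid)))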

Task delay accounting can theoretically be requested on a per-cgroup basis (as I saw in a previous entry on where the Linux load average comes from), but in practice this only works for cgroup v1. This (task) delay accounting has never been added to cgroup v2, which may be a sign that the whole feature is a bit neglected. I couldn't find much about why delay accounting was changed (in 2021) to default to being off. The commit that made this change seems to imply it was defaulted to off on the assumption that it wasn't used much. Also see this kernel mailing list message and this reddit thread.

Now that I've discovered kernel.task_delayacct and played around with it a bit, I think it's useful enough for us for diagnosing issues that we're going to turn it on by default until and unless we see problems (performance or otherwise). Probably I'll stick to doing this with an /etc/sysctl.d/ drop-in file, because I think that gets activated early enough in boot to cover most processes of interest.

(As covered somewhere, if you turn delay accounting on through the sysctl, it apparently only covers processes that were started after the sysctl was changed. Processes started before have no delay accounting information, or perhaps only 'CPU' delay accounting information. One such process is init, PID 1, which will always be started before the sysctl is set.)

PS: The per-task IO delays do include NFS IO, just as iowait does, which may make it more interesting if you have NFS clients. Sometimes it's obvious which programs are being affected by slow NFS servers, but sometimes not.

linux/TaskDelayAccountingNotes written at 23:09:37; Add Comment

2024-03-21

Reading the Linux cpufreq sysfs interface is (deliberately) slow

The Linux kernel has a CPU frequency (management) system, called cpufreq. As part of this, Linux (on supported hardware) exposes various CPU frequency information under /sys/devices/system/cpu, as covered in Policy Interface in sysfs. Reading these files can provide you with some information about the state of your system's CPUs, especially their current frequency (more or less). This information is considered interesting enough that the Prometheus host agent collects (some) cpufreq information by default. However, there is one little caution: apparently the kernel deliberately slows down reading this information from /sys (as I learned recently). A comment in the relevant Prometheus code says that this delay is 50 milliseconds, but that comment dates from 2019 and may be out of date now (I wasn't able to spot the slowdown in the kernel code itself).

On a machine with only a few CPUs, reading this information is probably not going to slow things down enough that you really notice. On a machine with a lot of CPUs, the story can be very different. We have one AMD 512-CPU machine, and on this machine reading every CPU's scaling_cur_freq one at a time takes over ten seconds:

; cd /sys/devices/system/cpu/cpufreq
; time cat policy*/scaling_cur_freq >/dev/null
10.25 real 0.07 user 0.00 kernel

On a 112-CPU Xeon Gold server, things are not so bad at 2.24 seconds; a 128-Core AMD takes 2.56 seconds. A 64-CPU server is down to 1.28 seconds, a 32-CPU one 0.64 seconds, and on my 16-CPU and 12-CPU desktops (running Fedora instead of Ubuntu) the time is reported as '0.00 real'.
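
Here's a quick Python equivalent of that timing, if you want to check your own machines:

import glob
import time

files = sorted(glob.glob("/sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq"))
start = time.monotonic()
for fn in files:
    with open(fn) as f:
        f.read()
elapsed = time.monotonic() - start
if files:
    print("%d policies in %.2f seconds (%.1f ms each)" %
          (len(files), elapsed, elapsed * 1000 / len(files)))
else:
    print("no cpufreq policies found")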

This potentially matters on high-CPU machines where you're running any sort of routine monitoring that tries to read this information, including the Prometheus host agent in its default configuration. The Prometheus host agent reduces the impact of this slowdown somewhat, but it's still noticeably slower to collect all of the system information if we have the 'cpufreq' collector enabled on these machines. As a result of discovering this, I've now disabled the Prometheus host agent's 'cpufreq' collector on anything with 64 cores or more, and we may reduce that in the future. We don't have a burning need to see CPU frequency information and we would like to avoid slow data collection and occasional apparent impacts on the rest of the system.

(Typical Prometheus configurations magnify the effect of the slowdown because it's common to query ('scrape') the host agent quite often, for example every fifteen seconds. Every time you do this, the host agent re-reads these cpufreq sysfs files and hits this delay.)

PS: I currently have no views on how useful the system's CPU frequencies are as a metric, and how much they might be perturbed by querying them (although the Prometheus host agent deliberately pretends it's running on a single-CPU machine, partly to avoid problems in this area). If you do, you might either universally not collect CPU frequency information or take the time impact to do so even on high-CPU machines.

linux/CpufreqSlowToRead written at 23:09:03; Add Comment

2024-03-20

When I reimplement one of my programs, I often wind up polishing it too

Today I discovered a weird limitation of some IP address lookup stuff on the Linux machines I use (a limitation that's apparently not universal). In response to this, I rewrote the little Python program that I had previously been using for looking up IP addresses as a Go program, because I was relatively confident I could get Go to work (although it turns out I couldn't use net.LookupAddr() and had to do something slightly more complicated). I could have made the Go program a basically straight port of the Python one, but as I was writing it, I couldn't resist polishing off some of the rough edges and adding missing features (some of which the Python program could have had, and some which would have been awkward to add).

This isn't even the first time this particular program has been polished as part of re-doing it; it was one of the Python programs I added things to when I moved them to Python 3 and the argparse package. That was a lesser thing than the Go port and the polishing changes were smaller, but they were still there.

This 'reimplementation leads to polishing' thing is something I've experienced before. It seems that more often than not, if I'm re-doing something I'm going to make it better (or at least what I consider better), unless I'm specifically implementing something with the goal of being essentially an exact duplicate but in a faster environment (which happened once). It doesn't have to be a reimplementation in a different language, although that certainly helps; I've re-done Python programs and shell scripts and had it lead to polishing.

One trigger for polishing is writing new documentation and code comments. In a pattern that's probably familiar to many programmers, when I find myself about to document some limitation or code issue, I'll frequently get the urge to fix it instead. Or I'll write the documentation about the imperfection, have it quietly nibble at me, and then go back to the code so I can delete that bit of the documentation after all. But some of what drives this polishing is the sheer momentum of having the code open in my editor and already changing or writing it.

Why doesn't this happen when I write the program the first time? I think part of it is that I understand the problem and what I want to do better the second time around. When I'm putting together the initial quick utility, I have no experience with it and I don't necessarily know what's missing and what's awkward; I'm sort of building a 'minimum viable product' to deal with my immediate need (such as turning IP addresses into host names with validation of the result). When I come back to re-do or re-implement some or all of the program, I know both the problem and my needs better.

programming/ReimplementationPolish written at 23:10:44; Add Comment

2024-03-19

About DRAM-less SSDs and whether that matters to us

Over on the Fediverse, I grumbled about trying to find SATA SSDs for server OS drives:

Trends I do not like: apparently approximately everyone is making their non-Enterprise ($$$) SATA SSDs be kind of terrible these days, while everyone's eyes are on NVMe. We still use plenty of SATA SSDs in our servers and we don't want to get stuck with terrible slow 'DRAM-less' (QLC) designs. But even reputable manufacturers are nerfing their SATA SSDs into these monsters.

(By the '(QLC)' bit I meant SATA SSDs that were both DRAM-less and used QLC flash, which is generally not as good as other flash cell technology but is apparently cheaper. The two don't have to go together, but if you're trying to make a cheap design you might as well go all the way.)

In a reply to that post, @cesarb noted that the SSD DRAM is most important for caching internal metadata, and shared links to Sabrent's "DRAM & HMB" and Phison's "NAND Flash 101: Host Memory Buffer", both of which cover this issue from the perspective of NVMe SSDs.

All SSDs need to use (and maintain) metadata that tracks things like where logical blocks are in the physical flash, what parts of physical flash can be written to right now, and how many writes each chunk of flash has had for wear leveling (since flash can only be written to so many times). The master version of this information must be maintained in flash or other durable storage, but an old-fashioned conventional SSD with DRAM had some amount of DRAM that was used in large part to cache this information for fast access and perhaps fast bulk updating before it was flushed to flash. A DRAMless SSD still needs to access and use this metadata, but it can only hold a small amount of it in the controller's internal memory, which means it must spend more time reading and re-reading bits of metadata from flash and may not have as comprehensive a view of things like wear leveling or the best ready-to-write flash space.

Because they're PCIe devices, DRAMless NVMe SSDs can borrow some amount of host RAM from the host (your computer), much like some or perhaps all integrated graphics 'cards' (which are also nominally PCIe devices) borrow host RAM to use for GPU purposes (the NVMe "Host Memory Buffer (HMB)" of the links). This option isn't available to SATA (or SAS) SSDs, which are entirely on their own. The operating system generally caches data read from disk and will often buffer data written before sending it to the disk in bulk, but it can't help with the SSD's internal metadata.

(DRAMless NVMe drives with a HMB aren't out of the woods, since I believe the HMB size is typically much smaller than the amount of DRAM that would be on a good NVMe drive. There's an interesting looking academic article from 2020, HMB in DRAM-less NVMe SSDs: Their usage and effects on performance (also).)

How much the limited amount of metadata affects the drive's performance depends on what you're doing, based on both anecdotes and Sabrent's and Phison's articles. It seems that the more internal metadata whatever you're doing needs, the worse off you are. The easily visible case is widely distributed random reads, where a DRAMless controller will apparently spend a visible amount of time pulling metadata off the flash in order to find where those random logical blocks are (enough so that it clearly affects SATA SSD latency, per the Sabrent article). Anecdotally, some DRAMless SATA SSDs can experience terrible write performance under the right (or wrong) circumstances and actually wind up performing worse than HDDs.

Our typical server doesn't need much disk space for its system disk (well, the mirrored pair that we almost always use); even a generous Ubuntu install barely reaches 30 GBytes. With automatic weekly TRIMs of all unused space (cf), the SSDs will hopefully easily be able to find free space during writes and not feel too much metadata pressure then, and random reads will hopefully mostly be handled by Linux's in RAM disk cache. So I'm willing to believe that a competently implemented DRAMless SATA SSD could perform reasonably for us. One of the problems with this theory is finding such a 'competently implemented' SATA SSD, since the reason that SSD vendors are going DRAMless on SATA SSDs (and even NVMe drives) is to cut costs and corners. A competent, well performing implementation is a cost too.

PS: I suspect there's no theoretical obstacle to a U.2 form factor NVMe drive being DRAMless and using a Host Memory Buffer over its PCIe connection. In practice U.2 drives are explicitly supposed to be hot-swappable and I wouldn't really want to do that with a HMB, so I suspect DRAM-less NVMe drives with HMB are all M.2 in practice.

(I also have worries about how well the HMB is protected from stray host writes to that RAM, and how much the NVMe disk is just trusting that it hasn't gotten corrupted. Corrupting internal flash metadata through OS faults or other problems seems like a great way to have a very bad day.)

tech/SSDsUnderstandingDramless written at 23:15:41; Add Comment

2024-03-18

Sorting out PIDs, Tgids, and tasks on Linux

In the beginning, Unix only had processes and processes had process IDs (PIDs), and life was simple. Then people added (kernel-supported) threads, so processes could be multi-threaded. When you add threads, you need to give them some user-visible identifier. There are many options for what this identifier is and how it works (and how threads themselves work inside the kernel). The choice Linux made was that threads were just processes (that shared more than usual with other processes), and so their identifier was a process ID, allocated from the same global space of process IDs as regular independent processes. This has created some ambiguity in what programs and other tools mean by 'process ID' (including for me).

The true name for what used to be a 'process ID', which is to say the PID of the overall entity that is 'a process with all its threads', is a TGID (Thread or Task Group ID). The TGID of a process is the PID of the main thread; a single-threaded program will have a TGID that is the same as its PID. You can see this in the 'Tgid:' and 'Pid:' fields of /proc/<PID>/status. Although some places will talk about 'pids' as separate from 'tids' (eg some parts of proc(5)), the two types are both allocated from the same range of numbers because they're both 'PIDs'. If I just give you a 'PID' with no further detail, there's no way to know if it's a process's PID or a task's PID.

In every /proc/<PID> directory, there is a 'task' subdirectory; this contains the PIDs of all tasks (threads) that are part of the thread group (ie, have the same TGID). All PIDs have a /proc/<PID> directory, but for convenience things like 'ls /proc' only list the PIDs of processes (which you can think of as TGIDs). The /proc/<PID> directories for other tasks aren't returned by the kernel when you ask for the directory contents of /proc, although you can use them if you access them directly (and you can also access or discover them through /proc/<PID>/task). I'm not sure which information in the /proc/<PID> directories for tasks is specific to the task itself and which is the total across all tasks in the TGID. The proc(5) manual page sometimes talks about processes and sometimes about tasks, but I'm not sure that's comprehensive.

(Much of the time when you're looking at what is actually a TGID, you want the total information across all threads in the TGID. If /proc/<PID> always gave you only task information even for the 'process' PID/TGID, multi-threaded programs could report confusingly low numbers for things like CPU usage unless you went out of your way to sum /proc/<PID>/task/* information yourself.)
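
A small Python sketch that shows the relationship, using the /proc/<PID>/status fields and the task subdirectory discussed above:

import os
import sys

def pid_and_tgid(pid):
    # The Pid: and Tgid: fields of /proc/<pid>/status.
    fields = {}
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return int(fields["Pid"]), int(fields["Tgid"])

def tasks(pid):
    # All tasks (threads) in this PID's thread group.
    return sorted(int(t) for t in os.listdir("/proc/%d/task" % pid))

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    p, tgid = pid_and_tgid(pid)
    print("PID %d is in thread group (TGID) %d with tasks %s" % (p, tgid, tasks(pid)))

Run it against the PID of a thread that isn't the group leader and you'll see Pid and Tgid differ.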

Various tools will normally return the PID (TGID) of the overall process, not the PID of a random task in a multi-threaded process. For example 'pidof <thing>' behaves this way. Depending on how the specific process works, this may or may not be the 'main thread' of the program (some multi-threaded programs more or less park their initial thread and do their main work on another one created later), and the program may not even have such a thing (I believe Go programs mostly don't, as they multiplex goroutines on to actual threads as needed).

If a tool or system offers you the choice to work on or with a 'PID' or a 'TGID', you are being given the choice to work with a single thread (task) or the overall process. Which one you want depends on what you're doing, but if you're doing things like asking for task delay information, using the TGID may better correspond to what you expect (since it will be the overall information for the entire process, not information for a specific thread). If a program only talks about PIDs, it's probably going to operate on or give you information about the entire process by default, although if you give it the PID of a task within the process (instead of the PID that is the TGID), you may get things specific to that task.

In a kernel context such as eBPF programs, I think you'll almost always want to track things by PID, not TGID. It is PIDs that do things like experience run queue scheduling latency, make system calls, and incur block IO delays, not TGIDs. However, if you're selecting what to report on, monitor, and so on, you'll most likely want to match on the TGID, not the PID, so that you report on all of the tasks in a multi-threaded program, not just one of them (unless you're specifically looking at tasks/threads, not 'a process').

(I'm writing this down partly to get it clear in my head, since I had some confusion recently when working with eBPF programs.)

linux/PidsTgidsAndTasks written at 21:59:58; Add Comment

