Wandering Thoughts

2018-10-15

Garbage collection and the underappreciated power of good enough

Today I read The success of Go heralds that of Rust (via). In it the author argues that Rust is poised to take over from Go because Rust is a more powerful language environment. As one of their arguments against Go, the author says:

All the remaining issues with Go stem from three design choices:

  • It’s garbage collected, rather than having compile time defined lifetimes for all its resources. This harms performance, removes useful concepts (like move semantics and destructors) and makes compile-time error checking less powerful.

[...]

Fix those things, and Go essentially becomes the language of the future. [...]

I think the author is going to be disappointed about what happens next, because in my opinion they've missed something important about how people use programming languages and what matters to people.

To put it simply, good garbage collection is good enough in practice for almost everyone, and it's a lot easier to live with than the fully correct approach of carefully specifying lifetimes for all resources. And Go has good garbage collection, because the Go people considered it a critical issue and spent a lot of effort on it. This means that Go's garbage collection is not an issue in practice for most people, regardless of what the author thinks, and thus that Rust's compile-time lifetimes are not a compelling enough advantage to draw many current (or future) Go users over to Rust.

There are people who need the guarantees and minimal memory usage that lifetimes for memory give you, and there are programs that blow up under any particular garbage collection scheme. But most people do not have such needs or such programs. Most programs are perfectly fine with using somewhat more memory than they strictly need and spending some extra CPU time on garbage collection, and most people are fine with this because they have other priorities than making their programs run as fast and with as little resource use as possible.

(These people are not being 'lazy'. They are optimizing what matters to them in their particular environment, which may well be things like speed to deployment.)

The underappreciated power of 'good enough' is that good enough is sufficient for most people and most programs; really, it's almost always sufficient. People don't always stop at good enough, but they often do, especially when achieving better than good enough is significantly harder. This is not a new thing and we have seen this over and over again in programming languages; just look at how people have flocked to Perl, Python, PHP, Ruby, and JavaScript, despite there being more powerful alternatives.

(Good enough changes over time and over scale. What is good enough at small scale is not good enough at larger scale, but many times you never need to handle that larger scale.)

What matters to most people most of the time is not perfection, it is good enough. They may go beyond that, but you should not count on it, and especially you should not count on perfection pulling lots of people away from good enough. We have plenty of proof that it doesn't.

programming/GarbageCollectionGoodEnough written at 00:06:10

2018-10-13

Getting a CPU utilization breakdown in Prometheus's query language, PromQL

A certain amount of Prometheus's query language is reasonably obvious, but once you start getting into the details and the clever tricks you wind up needing to wrap your mind around how PromQL wants you to think about its world. Today I want to tackle one apparently obvious thing, which is getting a graph (or numbers) of CPU utilization.

Prometheus's host agent (its 'node exporter') gives us per-CPU, per-mode usage stats as a running counter of seconds spent in that mode (which is basically what the Linux kernel gives us). A given data point of this looks like:

node_cpu_seconds_total{cpu="1", instance="comps1:9100", job="node", mode="user"} 3632.28

Suppose that we want to know how our machine's entire CPU state breaks down over a time period. Our starting point is the rate over non-idle CPU modes:

irate(node_cpu_seconds_total {mode!="idle"} [1m])

(I'm adding some spaces here to make things wrap better here on Wandering Thoughts; in practice, it's conventional to leave them all out.)

Unfortunately this gives us the rates of individual CPUs (expressed as time in that mode per second, because rate() gives us a per-second rate). No problem, let's sum that over everything except the CPUs:

sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)

If you do this on a busy system with multiple CPUs, you will soon observe that the numbers add up to more than 1 second. This is because we're summing over multiple CPUs; if each of them is in user mode for all of the time, the summed rate of user mode is however many CPUs we have. In order to turn this into a percentage, we need to divide by how many CPUs the machine has. We could hardcode this, but we may have different numbers of CPUs on different machines. So how do we count how many CPUs we have in a machine?

As a stand-alone expression, counting CPUs is (sort of):

count(node_cpu_seconds_total) without (cpu)

Let's break this down, since I breezed over 'without (cpu)' before. This takes our per-CPU, per-host node_cpu_seconds_total Prometheus metric, and counts up how many things there are in each distinct set of labels when you ignore the cpu label. This doesn't give us a CPU count number; instead it gives us a CPU count per CPU mode:

{instance="comps1:9100", job="node", mode="user"} 32

Fortunately this is what we want in the full expression:

(sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)

Our right side is a vector, and when you divide one vector by another in PromQL, the division happens between matching elements (ie, elements with the same set of labels). On the left we have labels and values like this:

{instance="comps1:9100", job="node", mode="user"} 2.9826666666675776

And on the right we have a matching set of labels, as we saw, that gives us the number '32'. So it all works out.

In general, when you're doing this sort of cross-metric operation you need to make it so that the labels come out the same on each side. If you try too hard to turn your CPU count into a pure number, well, it can work if you get the magic right but you probably want to go at it the PromQL way and match the labels the way we have.
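(As a sketch of an obvious extension that isn't part of the walkthrough above: if you'd rather have conventional 0 to 100 percentages instead of 0 to 1 fractions, you can just scale the whole thing:

((sum(irate(node_cpu_seconds_total {mode!="idle"} [1m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu)) * 100

Nothing about the label matching changes; multiplying a vector by a scalar leaves the labels alone.)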

(I'm writing this down today because while it all seems obvious and clear to me now, that's because I've spent much of the last week immersed in Prometheus and Grafana. Once we get our entire system set up, it's quite likely that I'll not deal with Prometheus for months at a time and thus will have forgotten all of this 'obvious' stuff by the next time I have to touch something here.)

PS: The choice of irate() versus rate() is a complicated subject that requires an entry of its own. The short version is that if you are looking at statistics over a short time range with a small query step, you probably want to use irate() with a range selector that is normally a couple of times your basic sampling interval.
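(To make that a bit more concrete with a sketch, assuming a 15 second scrape interval, which is my assumption here rather than something from this entry: for a zoomed-in graph of the last hour or so I would use

irate(node_cpu_seconds_total {mode!="idle"} [1m])

inside the sum() from before, while for a graph covering a day or more I would probably switch to something like

rate(node_cpu_seconds_total {mode!="idle"} [5m])

with everything else about the full expression unchanged.)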

sysadmin/PrometheusCPUStats written at 21:47:03

How Prometheus's query steps (aka query resolution) work

Prometheus and the combination of Prometheus and Grafana have many dark corners and barely explained things that you seem to be expected to just understand. One of them is what is variously called query resolution or query steps (in, for example, the Grafana documentation for using Prometheus). Here is what I think I understand about this area, having poked at a number of things and scrutinized the documentation carefully.

In general, when you write a simple Prometheus PromQL query, it is evaluated at some point in time (normally the current instant, unless you use an offset modifier). This includes queries with range vector selectors; the range vector selector chooses how far back to go from the current instant. This is the experience you will get in Prometheus's expression browser console. However, something different happens when you want to graph something, either directly in Prometheus's expression browser or through Grafana, because in order to graph things we need multiple points spread over time, and that means we have to somehow pick which points.

In a Prometheus graphing query, there is a range of time you're covering and then there is the query step. How Prometheus appears to work is that your expression is repeatedly evaluated at instants throughout the time range, starting at the first instant of the time range and then moving forward by the query step until things end. The query step or query resolution (plus the absolute time range) determines how many points you will get back. The HTTP API documentation for range queries makes this more or less explicit in its example; in a query against a 30-second range with a query step of 15 seconds, there are three data points returned, one at the start time, one in the middle, and one at the end time.
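(To make the shape of this concrete, a range query made directly against Prometheus's HTTP API looks something like this; the times here are made-up Unix timestamps and the expression is just 'up':

http://localhost:9090/api/v1/query_range?query=up&start=1539000000&end=1539000030&step=15s

The expression is evaluated at the start time and then every step seconds afterwards until the end time is reached, which for this 30-second range and 15-second step gives the three data points mentioned above.)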

A range query's query step is completely independent from any range durations specified in the PromQL expression it evaluates. If you have 'rate(http_requests_total[5m])', you can evaluate this at a query step of 15 seconds and Prometheus doesn't care either way. What happens is that every 15 seconds, you look back 5 minutes and take the rate between then and now. It is rather likely that this rate won't change much on a 15 second basis, so you'll probably get a smooth result. On the other hand, if you use a very large query step with this query, you may see your graphs go very jagged and spiky because you're sampling very infrequently. You may also get surprisingly jagged and staircased results if you have very small query steps.

The Prometheus expression browser's graph view will tell you the natural query step in the top right hand corner (it calls this the query resolution), and it will also let you manually set the query step without changing anything else about the query. This is convenient for getting a handle on what happens to a graph of your data as you change the resolution of a given expression. In Grafana, you have to look at the URL you can see in the editor's query inspector; you're looking for the '&step=<something>' at the end. In Grafana, the minimum step is (or can be) limited in various ways, both for the entire query (in the data source 'Options') and in the individual metrics queries ('Min step', which the Grafana documentation for using Prometheus grumbles about).

This unfortunately means that there is no universal range duration that works across all time ranges for Prometheus graphs. Instead the range duration you want is quite dependent on both the query resolution and how frequently your data updates; roughly speaking, I think you want the maximum of the query resolution and something slightly over your metric's minimum update period. Unfortunately I don't believe you can express this in Grafana. This leaves you deciding in advance on the primary usage of your graphs, especially in Grafana; you want to decide whether you are mostly going to look at large time ranges with large query steps or small time ranges with fine-grained query steps.

(You can get very close to generating the maximum of two times here, but then you run aground on a combination of the limitations of Grafana's dashboard variables and what manipulations you can do in PromQL.)

(This is one of those entries that I write partly for myself in the future, where I am unfortunately probably going to need this.)

sysadmin/PrometheusQuerySteps written at 02:15:48

2018-10-12

My unusual use for Firefox's Private Browsing mode

I've written before about how I use my browsing history to keep track of what I've read, although that only partly works in this era of websites making visited links look the same as unvisited ones. However, there is a little problem with using the 'visited' status to keep track of what I've read, and that is that visiting a web page doesn't necessarily correspond with actually reading it. Specifically, sometimes I run across a potentially interesting link but I'm not sure I have the time to read it right now (or if it's actually going to be interesting). If I follow the link to check it out, it's now a visited link that's saved in my browser history, and if I close the window I've lost track of the fact that I haven't actually read the page.

For years I used various hacky workarounds to take quick peeks at pages to see if I wanted to actually read them. Recently it occurred to me that I could use Firefox's Private Browsing mode to conveniently deal with this issue. If I'm reading a page or an aggregator site in my main Firefox instance and I'm not sure about a link, I can use the context menu's 'Open Link in New Private Window' option and there I go; I can check it out without it being marked as 'visited' and thus 'read' in my history. If I decide that I want to read the page for real, I can re-open the link normally.

(Reading the whole thing in Private Browsing mode is somewhat dangerous, because any links that I follow will themselves be in Private Browsing mode and thus my 'I have read these' history status for them will be lost when I close things down. So I usually switch once I've read enough to know that I want to read this for real.)

In retrospect it's a bit amusing that it took me this long to wake up to this use of Private Browsing. Until now I've spent years ignoring the feature on the grounds that I had other browsers that I used for actual private browsing (ones with far less risk of an information leak in either direction). Even though I was carefully doing things to not record certain web page visits in my main browser's history, using Private Browsing for this never occurred to me, perhaps because I wasn't doing this because I wanted private browsing.

(I think I woke up partly as part of walking away from Chrome, which I sometimes used for such quick peeks.)

One of the gesture actions that Foxy Gestures can do is open links in new private browsing windows, although I don't have a gesture assigned to that action. However I'm not sure I use this enough to make coming up with (and memorizing) a new gesture worth it. And on that note, I wish that you could attach meanings to 'middle-clicking' links with modifier keys held down, so I could make a special variant of middle-clicking do 'open in new private browser window' instead of 'open in new regular browser window'.

(Firefox turns out to have a number of behaviors here, but I'm not sure any of them are clearly documented.)

web/FirefoxMyPrivateBrowsing written at 01:24:57

2018-10-11

Some notes on Prometheus's Blackbox exporter

To make a long story short, I'm currently enthusiastically experimenting with Prometheus. As part of this I'm trying out Prometheus's support for 'black box' status and health checks, where you test services and so on from the outside (instead of the 'white box' approach of extracting health metrics from them directly). The Prometheus people don't seem to be too enthusiastic about black box metrics, so it's perhaps not surprising that the official Prometheus blackbox exporter is somewhat underdocumented and hard to understand.

The three important components in setting up blackbox metrics are targets, modules, and probers. A prober is the low level mechanism for making a check, such as making a HTTP request or a TCP connection; the very limited set of probers is built into the code of the blackbox exporter (and Prometheus is probably unenthused about adding more). A module specifies a collection of parameters for a specific prober that are used together to check a target. More than one module may use the same prober, presumably with different parameters. Modules are specified in the blackbox exporter's configuration file. Finally, a target is whatever you are checking with a module and its prober, and it comes from your Prometheus configuration.

The names of probers are fixed, because they're built into the code of the blackbox exporter. The names of modules are arbitrary; you may call them whatever you want and find convenient. Although the official examples give modules names that are related to their prober (such as http_2xx and imap_starttls), this doesn't matter and doesn't influence the prober's behavior, such as what port the TCP prober connects to. This was quite puzzling to me for a long time because it was far from obvious where the TCP prober got the port to connect to from (and it isn't documented).

When Prometheus makes a blackbox check, the blackbox exporter is passed the module and the target in the URL of the request:

http://localhost:9115/probe?target=TARGET&module=MODULE

The target is the only per-target parameter that is passed in to the blackbox exporter, so everything that the prober allows you to vary or specify on a per-target basis is encoded into it (and all other prober configuration comes from the specific module you use). How things are encoded in the target and what you can put there depends on the specific prober.

For the icmp prober, the target is a host name or an IP address. No further per-target things can be provided, and the module parameters are sort of minimal too.

For the dns prober, the target is a host name or IP address of the DNS server to query, plus the port, formatted as 'host:port' (and so normally 'host:53'). What DNS query to make is set in the module's parameters, as is what reply to expect, whether to use TCP or UDP, and so on. There is no way to specify these as part of the target, so if you want to query different DNS names, you need different modules. This is not particularly scalable if you want to query the same DNS server for several names, but then I suspect that the Prometheus people would tell you to write a script for that sort of thing.

(It turns out that if you leave off ':port', it defaults to 53, but I had to read the code to find this out.)
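(As a sketch of what a dns module might look like in the blackbox exporter's configuration file; this is my illustration based on my reading of the example configuration, so don't treat the exact field names as gospel:

modules:
  dns_www_check:
    prober: dns
    timeout: 5s
    dns:
      query_name: "www.example.org"
      query_type: "A"
      transport_protocol: "udp"

Every additional DNS name you want to check needs another module along these lines.)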

For the http prober, the target is the full URL to be requested (or, as an undocumented feature, you can leave off the 'http://' at the start of the URL). What to require from the result is configured through the module's parameters, as are various aspects of TLS. As with DNS probes, if you want to check that some URLs return a 2xx status and some URLs redirect, you will need two separate modules. The http prober automatically chooses HTTP or HTTPS based on the scheme of the target, as you'd expect, which is why a single http prober based module can be used for URLs from either scheme. Under normal circumstances, using HTTPS URLs automatically verifies the certificate chain through whatever system certificate store Go is using on your machine.

For the tcp prober, the target is the host:port to connect to. As we should expect by now, everything else is configured through the module's parameters, including everything to do with TLS; this means that unlike HTTP, you need different modules for checking non-TLS connections and TLS connections. The tcp prober lets the module control whether or not to do TLS on connection (normally with server certificate verification), and you can set up a little chat dialog to test the service you're connecting to (complete with switching to TLS at some suitable point in the dialog). Contrary to the documentation, the expect: strings in the chat dialog are regular expressions, not plain strings.

Much of Prometheus's official blackbox checking would be more flexible if you could pass optional additional parameters outside of the target; the obvious case is DNS checks.

Sidebar: Understanding blackbox relabeling

The Prometheus configuration examples for the blackbox exporter contain a great deal of magic use of relabel_configs. Perhaps what it is doing and what is required is obvious to experienced Prometheus people, but in any case I am not one right now.

The standard example from the README is:

metrics_path: /probe
params:
  module: [AMODULE]
static_configs:
  - targets:
     - ATARGET
relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9115

The __address__ label starts out being set to ATARGET, our blackbox exporter target, because that is what we told Prometheus; if we were using an ordinary exporter, this would be the host and port that Prometheus was going to scrape. Since the blackbox exporter is special, we must instead turn it into the target= parameter of the URL we will scrape, which we do by relabeling it to __param_target. We also save it into the instance label, which will propagate through to the final metrics that come from this (so that we can later find things by our targets). Finally, we set the __address__ label to the actual host:port to scrape from, because if we didn't Prometheus wouldn't even talk to the blackbox exporter.

We also need a module= URL parameter. Here, that comes from the module: in the params section; during relabeling it is __param_module, and it will be set to AMODULE.

sysadmin/PrometheusBlackboxNotes written at 01:12:39

2018-10-10

Even systemd services and dependencies are not self-documenting

I tweeted:

I'm sure that past-me had a good reason for configuring my Wireguard tunnel to only start during boot after the VMWare modules had been loaded. I just wish he'd written it down for present-me.

Systemd units are really easy to write, straightforward to read, and quite easy to hack on and modify. But, just like everything else in system administration, they aren't really self-documenting. Systemd units will generally tell you clearly what they're doing, but they won't (and can't) tell you why you set them up that way, and one of the places where this can be very acute is in what their dependencies are. Sometimes those dependencies are entirely obvious, and sometimes they are sort of obvious and also sort of obviously superstitious. But sometimes, as in this case, they are outright mysterious, and then your future self (if no one else) is going to have a problem.

(Systemd dependencies are often superstitious because systemd generally still lacks clear documentation for standard dependencies, of the form 'depend on this if you want to be started only when <X> is ready'. Admittedly, some of this is because the systemd people disagree with everyone else about how to handle certain sorts of issues, like services that want to activate only when networking is nicely set up and the machine has all its configured static IP addresses or has acquired its IP address via DHCP.)

Dependencies are also dangerous for this because it is so easy to add another one. If you're in a hurry and you're slapping dependencies on in an attempt to get something to work right, this means that adding a comment to explain yourself adds proportionally much more work than it would if you already had to do a fair bit of work to add the dependency itself. Since it's so much extra work, it's that much more tempting to not write a comment explaining it, especially if you're in a hurry or can talk yourself into believing that it's obvious (or both). I'm going to have to be on the watch for this, and in general I should take more care to document my systemd dependency additions and other modifications in the future.
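(As an illustration of the kind of comment I mean, a hypothetical dependency override might look like this; the unit name and the wording are made up for the example:

[Unit]
# 2018-10: we must start after VMWare because <the actual reason goes here,
# with a pointer to a ticket or worklog entry if there is one>.
After=vmware.service

The comment is the part that future-you will actually thank you for.)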

(This is one of the things that version controlled configuration files are good for. Sooner or later you'll have to write a commit message for your change, and when you do hopefully you'll get pushed to explain it.)

As for this particular case, I believe that what happened is that I added the VMWare dependency back when I was having mysterious Wireguard issues on boot because, it eventually turned out, I had forgotten to set some important .service options. When I was working on the issue, one of my theories was that Wireguard was setting up its networking, then VMWare's own networking stuff was starting up and taking Wireguard's interface down because the VMWare code didn't recognize this 'wireguard' type interface. So I set a dependency so that Wireguard would start after VMWare, and then when I found the real problem I never went back to remove the spurious dependency.

(I uncovered this issue today as part of trying to make my machine boot faster, which is partially achieved now.)

linux/DocumentStartupDependencies written at 01:37:27

2018-10-09

Something systemd is missing for diagnosing oddly slow boots

In yesterday's entry, I mentioned that my office machine has always been oddly slow to boot (and this goes back years, as far back as when I switched to systemd's networkd). Over the years I have made various attempts to figure out why this was so and what I could do about it, generally not getting anywhere with them. As part of this, I have of course poked around at what systemd-analyze could tell me, looking at things like 'systemd-analyze blame', 'systemd-analyze critical-path', and 'systemd-analyze plot'. None of them have given me any useful understanding of what's going on.

(One of the things that drives this is that I have a very similar Fedora system at home, and it has always booted visibly faster. This has driven me not just with the general question of 'why does my office machine boot what feels like slowly' but also the specific question of 'what makes my office machine slower than my home machine'. However I've just realized that I've never systematically explored the details of this difference, for example by carefully comparing 'systemd-analyze blame' on both machines.)

Based on 'systemd-analyze blame', I suspect that some of the delay is happening inside the startup of specific units and even specific programs (for example, starting up ZFS seems to take significantly longer on my office machine than at home). Diagnosing this is something that systemd can only partially help with, although it could do more than it currently does. It would be quite helpful to get the timings of each of the separate Exec steps in a single .service unit, so that I could see, for example, which bit of Unbound's multi-step .service is making it take over 12 seconds to start.

(In Unbound's case I suspect that it's 'unbound-anchor' that's so slow, but it would be nice to know for sure. This information may inconsistently be available in the giant blob of stuff you get from 'systemd-analyze dump'; it is on one machine and it isn't on the other. It may depend on how long ago the system was booted.)

However, there's something important that systemd could do that it doesn't, which is record what it was that finally allowed (or caused) a service to be started. Every boot-time unit has a set of dependencies and preconditions, and it only starts when all of them are ready; the whole thing forms an ordered graph. But right now, as far as I can tell, systemd does not explicitly record the information required to reconstruct the critical path through the graph for any given service. To put it another way, you can't reliably ask 'why didn't this service start before then', at least not easily. In an ideal world you could load up a plot of your entire boot, click on any unit, and have at least its critical predecessor shown.

(In theory 'systemd-analyze critical-chain' might answer this question. In practice it gives me answers that are clearly off; for example, it currently claims that one of the services on the critical chain for getty@tty1.service started more than ten seconds after it says that the getty did. Also, the manual page specifically mentions various cautions.)

In a related issue, I also wish that systemd-analyze would give you the unit start time and duration information in text form, instead of only embedded in an SVG through 'systemd-analyze plot'. This would make it easier to pick through and do my own analysis of this sort of thing, and the information is clearly available. It's just that systemd-analyze has apparently decided that the only form we could want it in is an SVG.

Sidebar: Deciding on what you care about for startup duration

If you run 'systemd-analyze critical-chain', it gives you the critical chain for multi-user.target by default. But is this actually what you care about? In my case, I've decided that one thing that I care about is when I can log in to my just rebooted machine, which seems to happen well before multi-user.target is considered officially finished. That's why I've also been looking at getty@tty1.service, which is a reasonable indication of reaching this point.
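(Concretely, that means running something like 'systemd-analyze critical-chain getty@tty1.service' rather than the bare command, with the same caveats about its accuracy that I mentioned above.)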

linux/SystemdBootTimingWish written at 00:03:41

2018-10-07

It's good to check systemd for new boot-time things every so often

I tweeted:

Oh. The reason my office workstation always took a somewhat unusually long time to (re)boot is that I had an ancient, nonexistent iSCSI target configured, so finding all disk devices had to wait for the iSCSI connection attempt to time out. That was silly.

This turns out to not quite be the case. While my ancient iSCSI target configuration had been there for a very long time, I only installed the iSCSI initiator programs recently, on September 4th, when I seem to have installed a bunch of QEMU-related packages. The dependencies pulled in the Fedora iSCSI initiator package, and that package decided that it should enable itself by default.

(Specifically it enabled the iscsi.service unit, which attempts to automatically login to all suitably configured iSCSI targets. Also, I'm not sure my diagnosis in the tweet is correct, although I saw it delaying the boot in general; I just jumped to that explanation when I made the tweet.)

What this experience taught me is that it's a good idea to go look at what boot-time things are enabled every so often, to spot new ones that I don't want. On Fedora, clearly a certain number of things feel free to activate themselves on install, on the assumption that they're harmless (I may disagree). Checking for this every so often is a good idea.

I think that the easiest and most official way is to go look for recently changed things in /etc/systemd/system and important .wants subdirectories of this, especially multi-user.target.wants and sockets.target.wants. Conveniently, enabling units creates symlinks and the modification time of those symlinks is when the unit was enabled. A quick 'ls -lt' will thus turn up any recent new activations.
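(Concretely, something along these lines is what I have in mind; the directories are the ones mentioned above and the head count is arbitrary:

cd /etc/systemd/system
ls -lt multi-user.target.wants sockets.target.wants | head -20

Anything with a recent date that you don't remember enabling yourself is worth a second look.)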

On the one hand, this feels sort of like a hack. On the other hand, it's a perfectly good Unix way of answering this question in a system where state is represented in the filesystem in straightforward ways. There's no need to provide an API and a command and so on when existing Unix tools can already answer the question perfectly well.

Now that I've stubbed my toe on this, I'm going to see if I can remember to check for new boot-time services every so often. I've already turned up one surprise on my office workstation, where sssd.service got enabled on August 8th (which was when I upgraded the machine from Fedora 27 to Fedora 28, although this didn't happen on my home machine when I upgraded it). This is probably harmless, so I've let it be so far.

(Since the iSCSI stuff has only been there for a month or so, it's not the long term cause of my slower than expected boots (although it may have been delaying them lately). But that's a discussion for another entry.)

PS: This isn't specific to systemd, of course. Any init system is subject to having new or updated packages decide that they absolutely should be enabled. Systemd does give you a somewhat richer set of paths to this due to socket activation, and any dependency-based system can hide extra activations because they can happen concurrently and not obviously delay your boot (most of the time).

linux/SystemdCheckNewServices written at 22:48:06

2018-10-06

A deep dive into the OS memory use of a simple Go program

One of the enduring mysteries of actually using Go programs is understanding how much OS-level memory they use, as opposed to the various Go-level memory metrics exposed by runtime.MemStats. OS level memory use matters because it influences things like how much real memory your program needs and how likely it is to be killed by the OS in a low-memory situation, but there has always been a disconnect between OS level information and Go level information. After researching enough to write about how Go doesn't free heap memory back to the OS, I got sufficiently curious to really dig down into the details of a very simple program and now I'm going to go through them. All of this is for Go 1.11; other Go versions have had different behavior.

Our very simple program is going to do nothing except sit there so that we can examine its memory use:

package main
func main() {
    var i uint64
    for {
        i++
    }
}

(It turns out that we could use time.Sleep() to pause without dragging in extra complications, because it's actually handled directly in the runtime, despite it nominally being in the time package.)

This simple looking program already has a complicated runtime environment, with several system goroutines operating behind the scenes. It also has more memory use than you probably expect. Here's what its memory map looks like on my 64-bit Linux machine:

0000000000400000    316K r-x-- memdemo
000000000044f000    432K r---- memdemo
00000000004bb000     12K rw--- memdemo
00000000004be000    124K rw---   [ bss ]
000000c000000000  65536K rw---   [ anon ]
00007efdfc10c000  35264K rw---   [ anon ]
00007ffc088f1000    136K rw---   [ stack ]
00007ffc08933000     12K r----   [ vvar ]
00007ffc08936000      8K r-x--   [ vdso ]
ffffffffff600000      4K r-x--   [ vsyscall ]
 total           101844K

The vvar, vdso, and vsyscall mappings come from the Linux kernel; the '[ stack ]' mapping is the standard process stack created by the Linux kernel, and the first four mappings are all from the program itself (the actual compiled machine code, the read-only data, plain data, and then the zero'd data respectively). Go itself has allocated the two '[ anon ]' mappings in the middle, which are most of the program's memory use; we have one 64 MB mapping at 0x00c000000000 and one 34.4 MB mapping at 0x7efdfc10c000.

(The addresses for some of these mappings will vary from run to run.)

As described in Allocator Wrestling (see also, and), Go allocates heap memory (including the memory for goroutine stacks) in chunks of memory called spans that come from arenas. Arenas are 64 MB in size and are allocated at fixed locations; on 64-bit Linux, they start at 0x00c000000000. So this is our 64 MB mapping; it is the program's first arena, the only one necessary, which handles all normal Go memory allocation.

If we run our program under strace -e trace=%memory, we'll discover that the remaining mysterious mapping actually comes from a number of separate mmap() calls that the Linux kernel has merged together into one memory area. Here is the trace for our program:

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfe33c000
mmap(0xc000000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(0xc000000000, 67108864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(NULL, 33554432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfc33c000
mmap(NULL, 2162688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfc12c000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfc11c000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfc10c000

So we have, in order, a 256 KB allocation, the 64 MB arena allocated at its fixed address, a 32 MB allocation, a slightly over 2 MB allocation, and two 64 KB allocations. Everything except the arena allocation is allocated at successively lower addresses next to each other and gets merged together into the single mapping starting at 0x7efdfc10c000. All of these allocations are internal allocations from the Go runtime, and I'm going to run down them in order.

The initial 256 KB allocation is for the first chunk of the Go runtime's area for persistent allocations. These are runtime things that will never be freed up and which can be (and are) allocated outside of the regular heap arenas. Various things are allocated in persistent allocations, and the persistent allocator mostly works in 256 KB chunks that it gets from the OS. Our first mmap() is thus the runtime starting to allocate from this area, which causes the allocator to get its first chunk from the OS. The memory for these persistent allocator chunks is mostly recorded in runtime.MemStats.OtherSys, although it's not the only thing that falls into that category and some persistent allocations are in different categories.

The 32 MB allocation immediately after our first arena is for the heap allocator's "L2" arena map. As the comments in runtime/malloc.go note, most 64-bit architectures (including Linux) have only a single large L2 arena map, which has to be allocated when the first arena is allocated. The next allocation, which is 2112 KB or 2 MB plus 64 KB, turns out to be for the heapArena structure for our newly allocated arena. It has two fields; the .bitmap field is 2 MB in size, and the .spans field is 64 KB (in 8192 8-byte pointers). This explains the odd size requested.

(If I'm reading the code correctly, the L2 arena map isn't accounted for in any runtime.MemStats value; this may be a bug. The heapArena structure is accounted for in runtime.MemStats.GCSys.)

The final two 64 KB allocations are for the initial version of a data structure used to keep track of all spans (set up in recordspan()) and the allocation for a data structure (gcBits) that is used in garbage collection (set up in newArenaMayUnlock()). The span tracking structure is accounted for in runtime.MemStats.OtherSys, while the gcBits stuff is in runtime.MemStats.GCSys.

As your program uses more memory, I believe that in general you can expect more arenas to be allocated from the OS, and with each arena you'll also get another heapArena structure. I believe that the L2 arena map is only allocated once on 64-bit Unix. You will probably periodically have larger span data structures and more gcBits structures allocated, and you will definitely periodically have new 256 KB chunks allocated for persistent allocations.

(There are probably other sources of allocations from the OS in the Go runtime. Interested parties can search through the source code for calls to sysAlloc(), persistentalloc(), and so on. In the end everything apart from arenas comes from sysAlloc(), but there are often layers of indirection.)
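(If you want to line the OS level numbers up against what the runtime itself reports, a minimal sketch like the following, which is my own illustration rather than part of the exploration above, will print the relevant runtime.MemStats totals:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    // These are all byte counts of memory obtained from the OS.
    fmt.Println("Sys:     ", m.Sys)
    fmt.Println("HeapSys: ", m.HeapSys)
    fmt.Println("GCSys:   ", m.GCSys)
    fmt.Println("OtherSys:", m.OtherSys)
}

Sys is the sum of all of the '...Sys' fields and should roughly line up with the anonymous mappings we saw earlier, although it won't include things like the program text.)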

PS: If you want to track down this sort of thing yourself, the easiest way to do it is to run your test program under gdb, set a breakpoint on runtime.sysAlloc, and then use where every time the breakpoint is hit. On most Unixes, this is the only low level runtime function that allocates floating anonymous memory with mmap(); you can see this in, for example, the Linux version of low level memory allocation.

programming/GoProgramMemoryUse written at 21:42:04

Go basically never frees heap memory back to the operating system

Over on Reddit's r/golang, I ran into an interesting question about Go's memory use as part of this general memory question:

[...] However Go is not immediately freeing the memory, at least from htop's perspective.

What can I do to A) gain insight on when this memory will be made available to the OS, [...]

The usual question about memory usage in Go programs is when things will be garbage collected (which can be tricky). However, this person wants to know when Go will return free memory back to the operating system. This is a good question partly because programs often don't do very much of this (or really we should say the versions of malloc() that programs use don't do this), for various reasons. Somewhat to my surprise, it turns out that Go basically never returns memory address space to the OS, as of Go 1.11. In htop, you can expect normal Go programs to only ever be constant sized or grow, never to shrink.

(The qualification about Go 1.11 is important, because Go's memory handling changes over time. Back in 2014 with Go 1.5 or so, Go processes used a huge amount of virtual memory, but that's changed since then.)

The Go runtime itself initially allocates memory in relatively decent sized chunks of memory called 'spans', as discussed in the big comment at the start of runtime/malloc.go (and see also this and this (also)); spans are at least 8 KB, but may be larger. If a span has no objects allocated in it, it is an idle span; how many bytes are in idle spans is in runtime.MemStats.HeapIdle. If a span is idle for sufficiently long, the Go runtime 'releases' it back to the OS, although this doesn't mean what you think. Released spans are a subset of idle spans; when a span is released, it still counts as idle.

(In theory the number of bytes of idle spans released back to the operating system is runtime.MemStats.HeapReleased, but you probably want to read the comment about this in the source code of runtime/mstats.go.)

Counting released spans as idle sounds peculiar until you understand something important; Go doesn't actually give any memory address space back to the OS when a span is released. Instead, what Go does is to tell the OS that it doesn't need the contents of the span's memory pages any more and the OS can replace them with zero bytes at its whim. So 'released' here doesn't mean 'return the memory back to the OS', it means 'discard the contents of the memory'. The memory itself remains part of the process and counts as part of the process size (it may or may not count as part of the resident set size, depending on the OS), and Go can immediately use such a released idle span again if it wants to, just as it can a plain idle span.

(On Unix, releasing pages back to the OS consists of calling madvise() (Linux, FreeBSD) on them with either MADV_FREE or MADV_DONTNEED, depending on the specific Unix. On Windows, Go uses VirtualFree() with MEM_DECOMMIT. On versions of Linux with MADV_FREE, I'm not sure what happens to your RSS after doing it; some sources suggest that your RSS doesn't go down until the kernel starts actually reclaiming the pages from you, which may be some time later.)

As far as I can tell from inspecting the current runtime code, Go only very rarely returns memory that it has used back to the operating system by calling munmap() or the Windows equivalent. In particular, once Go has used memory for regular heap allocations it will never be returned to the OS even if Go has plenty of released idle memory that's been untouched for a very long time (as far as I can tell). As a result, the process virtual size that you see in tools like htop is basically a high water mark, and you can expect it to never go down. If you want to know how much memory your Go program is really using, you need to carefully look at the various bits and pieces in runtime.MemStats, perhaps exported through net/http/pprof.
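(A minimal sketch of the kind of inspection I mean, which is my own illustration rather than anything from the Reddit thread:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    // HeapIdle minus HeapReleased is roughly the idle heap memory that
    // hasn't even been 'released' in Go's limited sense of the word.
    fmt.Println("HeapSys:     ", m.HeapSys)
    fmt.Println("HeapInuse:   ", m.HeapInuse)
    fmt.Println("HeapIdle:    ", m.HeapIdle)
    fmt.Println("HeapReleased:", m.HeapReleased)
}

In a real program you would probably expose these through net/http/pprof or your metrics system instead of printing them.)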

programming/GoNoMemoryFreeing written at 00:04:48
