A deep dive into the OS memory use of a simple Go program

October 6, 2018

One of the enduring mysteries of actually using Go programs is understanding how much OS-level memory they use, as opposed to the various Go-level memory metrics exposed by runtime.MemStats. OS level memory use matters because it influences things like how much real memory your program needs and how likely it is to be killed by the OS in a low-memory situation, but there has always been a disconnect between OS level information and Go level information. After researching enough to write about how Go doesn't free heap memory back to the OS, I got sufficiently curious to really dig down into the details of a very simple program and now I'm going to go through them. All of this is for Go 1.11; other Go versions have had different behavior.
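
For comparison, the Go-level side of the picture comes from runtime.ReadMemStats(). A minimal sketch of dumping the fields that matter most here (all of them documented in the runtime package) looks like this:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    // Sys is the total bytes the runtime has obtained from the OS;
    // the other fields break that total down by what it is used for.
    fmt.Println("Sys:     ", m.Sys)
    fmt.Println("HeapSys: ", m.HeapSys)
    fmt.Println("StackSys:", m.StackSys)
    fmt.Println("GCSys:   ", m.GCSys)
    fmt.Println("OtherSys:", m.OtherSys)
}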

Our very simple program is going to do nothing except sit there so that we can examine its memory use:

package main

func main() {
    var i uint64
    for {
        i++
    }
}

(It turns out that we could use time.Sleep() to pause without dragging in extra complications, because it's actually handled directly in the runtime, despite it nominally being in the time package.)
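
(If you want the sleeping variant, a minimal version of it looks like the following; I'd expect it to show essentially the same mappings as the busy-loop version, although the numbers in this article come from the program above.)

package main

import "time"

func main() {
    // Sleep 'forever' instead of busy-looping; time.Sleep is handled
    // directly in the runtime, so it pulls in little extra machinery.
    for {
        time.Sleep(time.Hour)
    }
}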

This simple-looking program already has a complicated runtime environment, with several system goroutines operating behind the scenes. It also has more memory use than you probably expect. Here's what its memory map looks like on my 64-bit Linux machine:

0000000000400000    316K r-x-- memdemo
000000000044f000    432K r---- memdemo
00000000004bb000     12K rw--- memdemo
00000000004be000    124K rw---   [ bss ]
000000c000000000  65536K rw---   [ anon ]
00007efdfc10c000  35264K rw---   [ anon ]
00007ffc088f1000    136K rw---   [ stack ]
00007ffc08933000     12K r----   [ vvar ]
00007ffc08936000      8K r-x--   [ vdso ]
ffffffffff600000      4K r-x--   [ vsyscall ]
 total           101844K

The vvar, vdso, and vsyscall mappings come from the Linux kernel, as does the '[ stack ]' mapping, which is the standard process stack. The first four mappings are all from the program itself: the compiled machine code, the read-only data, the plain data, and the zeroed (bss) data respectively. Go itself has allocated the two '[ anon ]' mappings in the middle, which account for most of the program's memory use; we have one 64 MB mapping at 0x00c000000000 and one roughly 34.4 MB mapping at 0x7efdfc10c000.

(The addresses for some of these mappings will vary from run to run.)
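
If you'd rather have the program report on itself instead of using an external tool, one option on Linux is to dump its own /proc/self/maps, which lists the same mappings in a rawer format. A minimal sketch:

package main

import (
    "fmt"
    "io/ioutil"
)

func main() {
    // /proc/self/maps lists this process's memory mappings on Linux,
    // in a rawer form than the listing above.
    data, err := ioutil.ReadFile("/proc/self/maps")
    if err != nil {
        panic(err)
    }
    fmt.Print(string(data))
}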

As described in Allocator Wrestling, Go allocates heap memory (including the memory for goroutine stacks) in chunks of memory called spans that come from arenas. Arenas are 64 MB in size and are allocated at fixed locations; on 64-bit Linux, they start at 0x00c000000000. So this is our 64 MB mapping; it is the program's first arena, the only one necessary here, and it handles all normal Go memory allocation.
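
You can see this arena placement from inside a program by printing the address of something that's been heap-allocated; a quick sketch (the exact address will vary, but on 64-bit Linux it should land just above 0x00c000000000):

package main

import "fmt"

func main() {
    // The pointer escapes to the heap because it's passed to fmt.Printf,
    // so its address should fall inside the first arena, which starts at
    // 0x00c000000000 on 64-bit Linux.
    p := new([4096]byte)
    fmt.Printf("%p\n", p)
}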

If we run our program under strace -e trace=%memory, we'll discover that the remaining mysterious mapping actually comes from a number of separate mmap() calls that the Linux kernel has merged together into one memory area. Here is the trace for our program:

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfe33c000
mmap(0xc000000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(0xc000000000, 67108864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc000000000
mmap(NULL, 33554432, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfc33c000
mmap(NULL, 2162688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfc12c000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfc11c000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efdfc10c000

So we have, in order, a 256 KB allocation, the 64 MB arena allocated at its fixed address, a 32 MB allocation, a slightly over 2 MB allocation, and two 64 KB allocations. Everything except the arena allocation is allocated at successively lower addresses next to each other and gets merged together into the single mapping starting at 0x7efdfc10c000. All of these allocations are internal allocations from the Go runtime, and I'm going to run down them in order.

The initial 256 KB allocation is for the first chunk of the Go runtime's area for persistent allocations. These are runtime things that will never be freed up and which can be (and are) allocated outside of the regular heap arenas. Various things are allocated in persistent allocations, and the persistent allocator mostly works in 256 KB chunks that it gets from the OS. Our first mmap() is thus the runtime starting to allocate from this area, which causes the allocator to get its first chunk from the OS. The memory for these persistent allocator chunks is mostly recorded in runtime.MemStats.OtherSys, although it's not the only thing that falls into that category and some persistent allocations are in different categories.

The 32 MB allocation immediately after our first arena is for the heap allocator's "L2" arena map. As the comments in runtime/malloc.go note, most 64-bit architectures (including Linux) have only a single large L2 arena map, which has to be allocated when the first arena is allocated. The next allocation, which is 2112 KB (2 MB plus 64 KB), turns out to be for the heapArena structure for our newly allocated arena. It has two fields; the .bitmap field is 2 MB in size, and the .spans field is 64 KB (8192 8-byte pointers). This explains the odd size requested.

(If I'm reading the code correctly, the L2 arena map isn't accounted for in any runtime.MemStats value; this may be a bug. The heapArena structure is accounted for in runtime.MemStats.GcSys.)
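
Both of these sizes can be reconstructed from the constants involved, assuming the usual Go 1.11 values on 64-bit Linux (48-bit heap addresses, 64 MB arenas, the runtime's 8 KB pages, and 8-byte pointers). A sketch of the arithmetic:

package main

import "fmt"

func main() {
    const (
        heapAddrBits = 48       // usable heap address bits on 64-bit Linux
        arenaBytes   = 64 << 20 // one arena is 64 MB (2^26 bytes)
        pageSize     = 8 << 10  // the runtime's page size is 8 KB
        ptrSize      = 8        // bytes per pointer on 64-bit
    )

    // The L2 arena map has one pointer for every possible arena in the
    // address space: 2^(48-26) entries of 8 bytes each.
    l2Entries := 1 << (heapAddrBits - 26)
    fmt.Printf("L2 arena map: %d MB\n", l2Entries*ptrSize>>20) // 32 MB

    // heapArena.bitmap holds two bits per pointer-sized word of the arena,
    // which works out to one byte per 32 bytes of arena memory.
    bitmap := arenaBytes / 32
    // heapArena.spans holds one *mspan pointer per 8 KB page of the arena.
    spans := (arenaBytes / pageSize) * ptrSize
    fmt.Printf("heapArena: %d KB + %d KB = %d KB\n",
        bitmap>>10, spans>>10, (bitmap+spans)>>10) // 2048 + 64 = 2112 KB
}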

The final two 64 KB allocations are for the initial version of a data structure used to keep track of all spans (set up in recordspan()) and the allocation for a data structure (gcBits) that is used in garbage collection (set up in newArenaMayUnlock()). The span tracking structure is accounted for in runtime.MemStats.OtherSys, while the gcBits stuff is in runtime.MemStats.GcSys.

As your program uses more memory, I believe that in general you can expect more arenas to be allocated from the OS, and with each arena you'll also get another heapArena structure. I believe that the L2 arena map is only allocated once on 64-bit Unix. You will probably periodically have larger span data structures and more gcBits structures allocated, and you will definitely periodically have new 256 KB chunks allocated for persistent allocations.

(There are probably other sources of allocations from the OS in the Go runtime. Interested parties can search through the source code for calls to sysAlloc(), persistentalloc(), and so on. In the end everything apart from arenas comes from sysAlloc(), but there are often layers of indirection.)

PS: If you want to track down this sort of thing yourself, the easiest way to do it is to run your test program under gdb, set a breakpoint on runtime.sysAlloc, and then use where every time the breakpoint is hit. On most Unixes, this is the only low level runtime function that allocates floating anonymous memory with mmap(); you can see this in, for example, the Linux version of low level memory allocation.
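
A sketch of such a gdb session, using the program from this article; depending on your gdb version the Go symbol name may need quoting:

$ gdb ./memdemo
(gdb) break runtime.sysAlloc
(gdb) run
    ... each time the breakpoint is hit ...
(gdb) where
(gdb) continue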
