Decoding the Linux kernel's page allocation failure messages
One of our machines has recently started logging cryptic but verbose kernel messages about page allocation failures. Naturally I went digging to try to find out just what the bits mean, how alarmed I should be, and what was probably causing the failures. I now know a certain amount about decoding the cryptic reports, so that's what I'm going to cover in this entry.
Here's what I know so far about decoding, in the form of a walk-through of a sample message:
imap: page allocation failure. order:4, mode:0xc0d0
The 'imap:' bit is the command that the process that had the failed allocation was running. The 'order:4' bit tells you how many pages were requested, but indirectly. The kernel code always requests pages in powers of two (so one page, two contiguous pages, four contiguous pages, and so on); the 'order' is that power. So the order 4 request here is for 2^4 pages, which is 16 pages or 64 Kb. Memory fragmentation means that it's quite possible for high-order requests to fail even if there are plenty of pages free, while an order:0 failure means that your machine is totally out of memory with not even a single page free.
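The order-to-size mapping is easy to work out by hand, but for the record here's a minimal Python sketch of it (assuming the usual 4 Kb x86 page size):

```python
PAGE_SIZE = 4096  # 4 Kb pages, the normal size on x86

def order_to_bytes(order):
    """Bytes of contiguous memory requested by an allocation of this order."""
    return (1 << order) * PAGE_SIZE

for order in range(5):
    print(order, 1 << order, order_to_bytes(order) // 1024, "Kb")
# order:4 is 16 contiguous pages, ie 64 Kb
```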
The 'mode' is the flags passed to the kernel memory allocator. If you
have kernel source handy they are covered in include/linux/gfp.h,
but you need to know a certain amount of kernel internals to
understand them. You may see this particular 0xc0d0 mode a lot
because it's the normal flags used when kernel code allocates
an array of structures for kernel purposes (ie, with
GFP_KERNEL); this sort of large block allocation is one of the more
likely ones to fail.
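If you don't want to hand-decode the mode against gfp.h, a little table-driven decoder does it for you. This sketch uses the individual __GFP_* bit values as they were in 2.6.32-era include/linux/gfp.h (only the low-order flags are listed here, and the values have been renumbered in later kernels, so check the gfp.h for your kernel before trusting it):

```python
# __GFP_* bit values from 2.6.32-era include/linux/gfp.h (partial list;
# later kernels renumbered these, so verify against your own gfp.h).
GFP_FLAGS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
    0x100: "__GFP_COLD",
    0x200: "__GFP_NOWARN",
    0x400: "__GFP_REPEAT",
    0x800: "__GFP_NOFAIL",
    0x1000: "__GFP_NORETRY",
    0x4000: "__GFP_COMP",
    0x8000: "__GFP_ZERO",
}

def decode_mode(mode):
    """Return the names of all known GFP flag bits set in 'mode'."""
    return [name for bit, name in sorted(GFP_FLAGS.items()) if mode & bit]

print(decode_mode(0xc0d0))
# ['__GFP_WAIT', '__GFP_IO', '__GFP_FS', '__GFP_COMP', '__GFP_ZERO']
# __GFP_WAIT | __GFP_IO | __GFP_FS (0xd0) is GFP_KERNEL.
```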
Pid: 15273, comm: imap Not tainted 2.6.32-41-server #88-Ubuntu
So this happened to PID 15273, an imap process.
Call Trace: [<ffffffff810fcd27>] __alloc_pages_slowpath+0x4a7/0x590 [<ffffffff810fcf89>] __alloc_pages_nodemask+0x179/0x180 [<ffffffff81130177>] alloc_pages_current+0x87/0xd0 [<ffffffff810fbe7e>] __get_free_pages+0xe/0x50 [<ffffffff81139de8>] __kmalloc+0x128/0x1a0
The early part of the kernel stack call trace is generally not useful because it's all memory allocation functions, as we see here.
[<ffffffffa0158d9e>] xs_setup_xprt+0xae/0x1c0 [sunrpc] [<ffffffffa0159411>] xs_setup_tcp+0x21/0x2c0 [sunrpc] [<ffffffffa0153ccb>] xprt_create_transport+0x5b/0x290 [sunrpc] [<ffffffffa01538db>] rpc_create+0x5b/0x1d0 [sunrpc]
Here we see what kernel code actually tried to do the failing memory
allocation. xs_setup_xprt is part of the kernel NFS RPC code, and I
believe that in recent kernels it has been reformed so that it doesn't
try to make large contiguous allocations.
(I'm eliding the rest of the kernel call stack because it's not quite interesting enough.)
Now the kernel prints a whole bunch of information about the
memory state of the machine. If you want (or need) to read the
kernel code involved, the overall function is
in lib/show_mem.c, with most of the report coming from
show_free_areas() in mm/page_alloc.c.
Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 0 CPU 1: hi: 186, btch: 31 usd: 0 Node 0 Normal per-cpu: CPU 0: hi: 186, btch: 31 usd: 0 CPU 1: hi: 186, btch: 31 usd: 0
What I believe this shows is that each CPU can maintain a private list of
free pages in each zone, which is useful because pages can normally
be allocated from this list without taking any global locks. The most
important number here is 'usd:', how many pages are currently in each
CPU's per-zone list.
Because this machine has a DMA32 zone, we know it's running a 64-bit kernel. A 32-bit kernel would have a HighMem zone and no DMA32 zone.
active_anon:79266 inactive_anon:31437 isolated_anon:0 active_file:260339 inactive_file:255635 isolated_file:31 unevictable:0 dirty:38 writeback:1 unstable:1 free:7660 slab_reclaimable:320826 slab_unreclaimable:18290 mapped:19846 shmem:7 pagetables:11033 bounce:0
This is global memory usage information, reporting how many pages are
in each of various states. If you don't already know a lot about the
Linux virtual memory system the most useful number is
free; here it's
telling us that we have a decent number of free pages.
Node 0 DMA free:15900kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15336kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
This is giving us more or less the same memory usage information but
on a per-zone basis. There are some new bits of information, though;
present: tells us how large the zone is, and the min:, low:, and
high: numbers report various (low) watermarks for page allocation and page
reclaims. How the watermarks affect whether or not you get your memory
is so complicated that I am going to have to write another entry on it,
but it turns out that they can matter a lot (and they are probably why
this allocation failed).
I believe that it's quite normal for the DMA zone to be basically unused, as it is here; it's only 16 MBytes and it's the last zone used for allocations.
lowmem_reserve: 0 3511 4016 4016
This is the 'lowmem reservation' level for the DMA zone. When the kernel is handling an allocation that started in another zone (either DMA32 or Normal on this machine), the DMA zone must have at least a certain number of pages free in order to be a fallback candidate for the allocation (sort of, it actually gets quite complicated). Here the numbers are 0 pages free for DMA allocations, 3511 pages free for DMA32 allocations, and 4016 pages free for Normal (or HighMem) allocations. Since the DMA zone has less than 4016 pages, it's perhaps not too surprising that it's mostly free right now.
(See the discussion of
lowmem_reserve_ratio in the documentation
for virtual memory sysctls for more
information on this.)
Node 0 DMA32 free:12560kB min:7076kB low:8844kB high:10612kB active_anon:267460kB inactive_anon:75996kB active_file:935288kB inactive_file:938308kB unevictable:0kB isolated(anon):0kB isolated(file):124kB present:3596192kB mlocked:0kB dirty:128kB writeback:4kB mapped:68660kB shmem:16kB slab_reclaimable:1185640kB slab_unreclaimable:51468kB kernel_stack:4808kB pagetables:30040kB unstable:4kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve: 0 0 505 505 Node 0 Normal free:2180kB min:1016kB low:1268kB high:1524kB active_anon:49604kB inactive_anon:49752kB active_file:106068kB inactive_file:84232kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:0kB dirty:24kB writeback:0kB mapped:10724kB shmem:12kB slab_reclaimable:97664kB slab_unreclaimable:21692kB kernel_stack:2352kB pagetables:14092kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve: 0 0 0 0
The same report for the other two zones. Now, notice something interesting: this machine has a tiny Normal zone and a much larger DMA32 zone. This is because it has only 4 GB of memory; as mentioned before, this means that almost all of its memory goes into the DMA32 zone instead of Normal.
(You might wonder why any memory goes into the Normal zone at all on a machine with only 4 GB of memory. The answer is 'PC hardware craziness'; the hardware needs some address space below 4 GB for PCI devices, so some of your 4 GB of RAM gets remapped above the 4 GB address space boundary.)
Node 0 DMA: 3*4kB 2*8kB 4*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15900kB Node 0 DMA32: 2360*4kB 80*8kB 21*16kB 7*32kB 4*64kB 1*128kB 2*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 12560kB Node 0 Normal: 377*4kB 14*8kB 17*16kB 9*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2180kB
We have finally reached the really interesting bit. This is a report of how many chunks of each order are free in each zone (helpfully expressed in the actual allocation sizes instead of the order number, so you have to remember to map back and forth). From this we can see that the DMA32 and DMA zones could easily satisfy an order 4 (64 Kb) allocation, but the Normal zone is entirely out of everything larger than 32 Kb (which is not really surprising since it only has 512 MBytes of memory total).
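As a cross-check, the per-order chunk counts really do add up to the 'free:' figures reported earlier; for the DMA zone:

```python
# Free chunk counts per order for the DMA zone, from the report:
# 3*4kB 2*8kB 4*16kB 2*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB
dma_counts = [3, 2, 4, 2, 2, 0, 1, 0, 1, 1, 3]

total_kb = sum(n * (4 << order) for order, n in enumerate(dma_counts))
print(total_kb)  # 15900, matching the DMA zone's 'free:' figure

# An order 4 (64 Kb) request needs a free chunk of order 4 or higher:
order4_or_bigger = sum(dma_counts[4:])
print(order4_or_bigger)  # 8 such chunks free in the DMA zone
```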
This gives me enough information to understand why this allocation failed, but explaining the exact details is sufficiently complex (and involves deep kernel internals) that I have to put it in another entry. The short version is that when you do high-order allocations there is a very complex interaction between the low water mark for a zone and how many pages are free at various orders that can rule zones out when you might not expect it. This can especially happen when a zone has a significant amount of its free memory in order 0 pages, as is the case for DMA32 here.
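To give a preview of that interaction, here's a Python sketch of the order loop in the 2.6.32-era zone_watermark_ok() (simplified; the real function also folds in lowmem_reserve and allocation flags), applied to the DMA32 numbers above:

```python
def watermark_ok(free_pages, mark, order, free_per_order):
    """Sketch of 2.6.32's zone_watermark_ok() order loop: the zone must
    stay above a per-order shrinking watermark after discounting pages
    that are only free in chunks too small for the request."""
    free = free_pages - (1 << order) + 1
    min_pages = mark
    if free <= min_pages:
        return False
    for o in range(order):
        # Pages free at order o can't satisfy a request of a higher order.
        free -= free_per_order[o] << o
        # The watermark requirement halves at each higher order.
        min_pages >>= 1
        if free <= min_pages:
            return False
    return True

# DMA32 zone from the report: free:12560kB, min:7076kB, 4 Kb pages,
# per-order chunk counts 2360*4kB 80*8kB 21*16kB 7*32kB 4*64kB ...
dma32 = [2360, 80, 21, 7, 4, 1, 2, 0, 1, 0, 0]
print(watermark_ok(12560 // 4, 7076 // 4, 4, dma32))
# False: once the 2360 order 0 pages are discounted, DMA32 falls below
# its (halved) watermark and the order 4 request is refused.
```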
530984 total pagecache pages 14979 pages in swap cache Swap cache stats: add 800917, delete 785938, find 6804111/6850366 Free swap = 881852kB Total swap = 975864kB
This is information about the state of the swap system, produced by
show_swap_cache_info() in mm/swap_state.c. It's probably
mostly of interest to see if you've exhausted your swap space at the
time of the allocation failure.
1048576 pages RAM 34079 pages reserved 277863 pages shared 958709 pages non-shared
At the end of the report the kernel prints out some overall memory state information.
On a side note, you might wonder why the OOM killer didn't trigger here (it didn't). The short answer is that it's not triggered for higher order allocations, only for allocations of 32 Kb and smaller (order 3 or below). So this allocation request is just large enough to not trigger OOM processing. All things considered that's probably a good thing.
(Going through the OOM processing might have changed the memory situation enough for the allocation to succeed but it might also have started killing processes that we definitely didn't want to die, especially when the system has a decent amount of free memory.)