Why our server had its page allocation failure
In the previous entry I went through the kernel messages printed when one of our Linux servers had a page allocation failure. Now it's time to explain why our server had that failure despite what looks like plenty of memory. To refresh, the kernel was trying to allocate a 64 Kbyte ('order 4') chunk of contiguous memory in the Normal zone. There was no such chunk in the server's small Normal zone, but there were several such chunks free in the DMA32 zone (and also larger chunks free in DMA32 that could have been split).
First off, we can rule out allocation from the small DMA zone entirely. As far as I can tell, general memory will almost never be allocated from the DMA zone because most of it is effectively reserved for allocations that actually require DMA-capable memory. This is not a big loss since it's only 16 MBytes of RAM. What matters in our case is the state of the DMA32 zone, and in particular two bits of its state:
Node 0 DMA32 free:12560kB min:7076kB low:8844kB high:10612kB [...]
Node 0 DMA32: 2360*4kB 80*8kB 21*16kB 7*32kB 4*64kB 1*128kB 2*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 12560kB
'min:' is the minimum low water mark for most non-urgent memory allocation requests, and the second line reports how many chunks of each allocation size (or order) are free.
At first glance it looks like everything should be fine here, because there is a free 64 Kb chunk and the zone has more free memory than any of the low water marks (especially 'min:'), but it turns out that the kernel does something tricky for higher-order allocations (allocations that aren't for a single 4 Kb page). To simplify and generalize, the kernel has decided that when checking free memory limits, if you're asking for a higher-order page the free memory in lower-order pages shouldn't count towards the amount of memory it considers free (presumably because such 'free memory' can't satisfy your request). At the same time it has to reduce the minimum amount of free memory required to avoid absurd results.
(One little thing is that the kernel check is made based on what the free memory would be after your allocation is made. This probably won't matter for small requests but might matter if you ask for an order 10 allocation of four megabytes.)
So for higher-order allocations only memory available at that order and higher counts, and the starting low water mark is divided by two for every order above 0, ie for an order 4 request like ours the water marks wind up divided by 16 (which conveniently is 2^order). In theory in our situation this means that the kernel would consider there to be 1920 Kb free in DMA32 (well, 1856 Kb after we take off our allocation) instead of 12560 Kb and the minimum low water mark would be 442 Kb instead of 7076 Kb. This still looks like our allocation request should pass muster.
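To make that arithmetic concrete, here is a small Python sketch of the direct, single-computation version of the check, using the DMA32 numbers above. This is my reconstruction of the logic, not the kernel's actual code (which works in pages, not kilobytes):

```python
# Free chunks in DMA32, as (count, size-in-kB) pairs taken from the
# kernel's 'Node 0 DMA32: 2360*4kB 80*8kB ...' report line.
free_chunks = [(2360, 4), (80, 8), (21, 16), (7, 32), (4, 64),
               (1, 128), (2, 256), (0, 512), (1, 1024), (0, 2048), (0, 4096)]

order = 4          # a 64 kB request is order 4 (4 kB << 4)
min_kb = 7076      # the zone's 'min:' water mark

# Only memory free at our order or above counts as free...
usable_kb = sum(count * size for count, size in free_chunks
                if size >= (4 << order))
# ...and the check is made against free memory after our allocation.
usable_kb -= 4 << order

# The water mark is halved once per order, ie divided by 2^order.
scaled_min_kb = min_kb >> order

print(usable_kb, scaled_min_kb)   # 1856 442
print(usable_kb > scaled_min_kb)  # True: computed this way, we pass
```

Computed directly like this, the request comfortably clears the scaled water mark, which is why the failure is surprising.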
However, the kernel doesn't actually implement the check this way in a single computation. Instead it does it iteratively, using a loop that is done for each order up to (but not including) the order of your request. In pseudo-code:
	for each order starting with 0 up to (our order - 1):
		free memory -= free memory for the current order
		minimum memory = minimum memory / 2
		if free memory <= minimum memory:
			return failure
The problem is that this iterative approach causes an early failure if a significant amount of the free memory in a zone is in very low order pages, because you can lose a lot of free memory while the current minimum memory requirement only drops by a bit (well, by half). In our situation, much of the free memory in DMA32 is in order 0 pages so the first pass through the loop gives us a new free memory of 3056 Kbytes (12560 Kb minus 9440 Kb of order-0 pages and our 64 Kb request) but a minimum memory requirement of 3538 Kb (the initial 7076 Kb divided by two) and the code immediately declares that there is not enough memory in this zone.
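The iterative check can be sketched in Python as follows (again a simplified model working in kilobytes, not the kernel's actual page-based C code), run against our DMA32 numbers:

```python
# Free memory per order in kB, from the DMA32 report:
# 2360*4kB 80*8kB 21*16kB 7*32kB 4*64kB 1*128kB 2*256kB 0*512kB 1*1024kB ...
free_kb_per_order = [2360 * 4, 80 * 8, 21 * 16, 7 * 32, 4 * 64,
                     1 * 128, 2 * 256, 0 * 512, 1 * 1024, 0 * 2048, 0 * 4096]

def watermark_ok(order, min_kb, free_kb_per_order):
    """Simplified model of the kernel's iterative watermark check."""
    # The check is made against free memory after our allocation.
    free_kb = sum(free_kb_per_order) - (4 << order)
    for o in range(order):
        # Memory free at this (lower) order can't satisfy our request,
        # so stop counting it as free...
        free_kb -= free_kb_per_order[o]
        # ...but the water mark only comes down by half per iteration.
        min_kb //= 2
        if free_kb <= min_kb:
            return False
    return True

# Our order-4 (64 kB) request fails on the very first iteration:
# dropping the 9440 kB of order-0 pages leaves 3056 kB 'free', but
# the water mark has only fallen to 3538 kB.
print(watermark_ok(4, 7076, free_kb_per_order))  # False
```

The same numbers that pass the direct computation fail here, because almost all of DMA32's free memory disappears from the running total in the first iteration while the water mark has only been halved once.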
(People who want to read the gory details can see them in zone_watermark_ok() in mm/page_alloc.c in the Ubuntu 10.04 kernel source; the function has been renamed in current 3.5-rcN kernels.)
I'm reluctant to declare this behavior a bug; the kernel memory people may well consider it working as designed that a zone with a disproportionate amount of its free memory in low-order pages is very reluctant to allocate higher-order chunks, even more reluctant than you might think. However, I do think that the current code is at least very unclear about whether this is intentional (or simply an accident of the current implementation) and what the actual logic is.
(I personally would prefer the direct computation logic. As it stands, you have to know and then explain (and simulate) the actual kernel code in order to understand why this allocation failed; there is no general rule that's simple to state.)