Wandering Thoughts archives

2009-12-25

Linux's non-strict overcommit is the right default

I've written before about Linux's overcommit settings and the general background of virtual memory limits. Several years of experience running various general-use systems with strict overcommit and paying attention to the numbers have convinced me of something: non-strict overcommit is the right default on Linux, and probably on any modern Unix in general.

I have come to this view through experimental evidence, namely that all of our user login machines generally run with significant amounts of committed address space, yet they have lots of free memory and very little swap used. When they run alarmingly close to their strict overcommit limits (and they have), they are never under memory pressure; they are hitting an artificial limit.

Clearly a lot of programs do not use anywhere near all of their committed address space, and you can get a general estimate of the magnitude of the non-use by comparing the numbers from free with the Committed_AS number from /proc/meminfo. Although right now is not the best time to run the numbers on our machines (everyone is on vacation and so usage is low), the results for our login servers range from 64% to 34% of the committed address space being actually in use (at most); even most of our compute servers are using noticeably less than their committed address space.

(Feel free to run this test on your own machines and see what the results are. Note that free's numbers are not really the amount of user space memory used, and generally are higher. I'm not sure if it's possible to get actual numbers for just user space memory usage.)
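
(If you'd rather script the comparison than eyeball it, here is a minimal sketch of the arithmetic. Note that subtracting Buffers and Cached the way free's '-/+ buffers/cache' line does is still only an approximation of user space usage, per the caveat above.)

  # Compare (at most) used RAM plus used swap against Committed_AS,
  # all taken from /proc/meminfo; values there are in kB.
  meminfo = {}
  for line in open("/proc/meminfo"):
      name, value = line.split(":")
      meminfo[name] = int(value.split()[0])

  used_ram = (meminfo["MemTotal"] - meminfo["MemFree"]
              - meminfo["Buffers"] - meminfo["Cached"])
  used_swap = meminfo["SwapTotal"] - meminfo["SwapFree"]
  committed = meminfo["Committed_AS"]

  print("committed: %d MB, in use at most: %d MB (%.0f%%)" %
        (committed / 1024, (used_ram + used_swap) / 1024,
         100.0 * (used_ram + used_swap) / committed))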

The reality is that configuring our login servers for true strict overcommit would require heroic amounts of swap space, which has its own downsides. As it is, while we have strict overcommit turned on, we have the settings tuned to significantly overcommit real memory and we have too much swap configured (6 GB of swap on machines with 8 GB of memory), all in order to not run into what are now almost entirely artificial limits on committed address space.
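
(For concreteness, with vm.overcommit_memory set to 2 the kernel's limit on committed address space is swap plus vm.overcommit_ratio percent of RAM; it shows up as CommitLimit in /proc/meminfo. A quick sketch of the arithmetic for machines like ours; the non-default ratio at the end is purely an illustration, not necessarily what we actually set.)

  ram_gb, swap_gb = 8, 6
  overcommit_ratio = 50          # the kernel default for vm.overcommit_ratio

  commit_limit_gb = swap_gb + ram_gb * overcommit_ratio / 100.0
  print(commit_limit_gb)         # 10.0 GB with the default ratio;
                                 # cranking the ratio up to 200 gives 22 GB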

It's not really surprising that this is the case, because strict overcommit has to be intensely pessimistic since it is making such a strong promise. Very few programs fork and then immediately scribble all over the copy-on-write virtual memory shared between parent and child, but strict overcommit has to assume that they all will, because it's promised that they can. Similarly, very few programs scribble on all of the code pages in shared libraries that they've mapped read-write (which is necessary in order to do relocation fixups), and in fact these days many may never touch any. But again, strict overcommit must assume that they will and it has to deliver on those promises. And so on, in all of the ways that programs may put claims on physical memory.
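
(You can watch this accounting happen. Here is a toy sketch where Committed_AS jumps by roughly the size of the parent's buffer the moment it forks, even though the child never touches the buffer at all; the sizes and sleeps are arbitrary.)

  import os, time

  def committed_kb():
      for line in open("/proc/meminfo"):
          if line.startswith("Committed_AS:"):
              return int(line.split()[1])

  buf = bytearray(512 * 1024 * 1024)     # ~512 MB of committed address space
  before = committed_kb()

  pid = os.fork()
  if pid == 0:                           # child: does nothing with buf at all
      time.sleep(2)
      os._exit(0)

  time.sleep(1)                          # let the child get established
  print("Committed_AS grew by ~%d MB" % ((committed_kb() - before) / 1024))
  os.waitpid(pid, 0)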

(See also this comment, especially its discussion of the various things that malloc does in the name of speed and efficient real memory use.)

(This entry was sparked by reading Jeff Davis's blog entry on the Linux OOM killer.)

NonStrictOvercommitDefault written at 01:38:45

2009-12-16

How Linux software RAID is making me grumpy right now

This weekend, one of my machines sent me email to report:

WARNING: mismatch_cnt is not 0 on /dev/md0
WARNING: mismatch_cnt is not 0 on /dev/md3

What this means (as opposed to what it says) is that a software RAID data scrub has detected some number of inconsistencies between the mirrors for two of my software RAID devices.

(I believe that the kernel also notices this under some other circumstances, but I can't follow the code well enough to be sure or to tell what they are. The mismatch_cnt it is talking about is the one found in /sys/block/mdN/md. You can read the full discussion about it in Documentation/md.txt.)

Let me inventory the obvious failures here.

  • Fedora's raid-check script doesn't bother to tell you what mismatch_cnt is, apart from 'not zero'. Since this is both volatile (it's only in kernel memory, so it gets reset on reboot) and a measure of how much inconsistency was found, sysadmins would kind of like to have it recorded for posterity. Speaking for myself, I would really like to know if my arrays are progressively getting more and more inconsistent every week, or if it seems to have happened once and then stopped. (Recording it yourself is simple enough; see the sketch after this list.)

  • The software RAID code does not log any messages when it detects inconsistencies. If you do not know to look at mismatch_cnt and naively just watch syslog or the kernel messages, you are out of luck.

  • Worse, the software RAID code doesn't tell you where the errors are. What do they affect? You have no way of knowing short of duplicating the scrub's work yourself in order to actually find the sector numbers.

    (I have read of people who shut down the software RAID device, directly mount each side's filesystem read-only, and diff -r them. People with LVM on software RAID are plain out of luck.)
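
(As mentioned above, recording mismatch_cnt for posterity yourself is simple enough to script; here is a minimal sketch in Python, with the log file location purely an illustration.)

  import glob, time

  LOG = "/var/log/mismatch_cnt.log"        # made-up location; pick your own

  with open(LOG, "a") as log:
      stamp = time.strftime("%Y-%m-%d %H:%M:%S")
      for path in sorted(glob.glob("/sys/block/md*/md/mismatch_cnt")):
          dev = path.split("/")[3]              # eg 'md0'
          count = open(path).read().strip()     # volatile: resets on reboot
          log.write("%s %s mismatch_cnt=%s\n" % (stamp, dev, count))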

The lack of information about where the errors are is extremely bad, because there is no actual repair process for this problem. The software RAID 'repair' operation is not a repair, it is a resync; if there is an inconsistency, it picks one side of the mirror (somehow) and force-updates the other to match it. There is no certainty that it will pick the right one.
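
(Both operations are driven through the same sysfs knob described in Documentation/md.txt; a minimal sketch, assuming /dev/md0. Writing 'check' only scrubs and counts mismatches into mismatch_cnt, while writing 'repair' does the blind pick-a-side resync just described.)

  def start_scrub(dev="md0", action="check"):
      # 'action' is one of md's sync_action keywords, eg 'check' or 'repair'
      with open("/sys/block/%s/md/sync_action" % dev, "w") as f:
          f.write(action + "\n")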

Therefore, if this happens to you, you are best off doing nothing until you can specifically identify what was damaged (if anything) and then either try to recover data from the other mirror or restore things from backups. I foresee a very long downtime with a live CD in my future. Or some kernel hacks. Or both.

The final failure is what may have caused this inconsistency. According to Neil Brown (in a message quoted here), under some circumstances the software RAID code can write inconsistent data to the two sides of the mirror because it allows the page to be changed between when it is written to one side and when it is written to the other. According to his message, this should be harmless because the newly-dirty page will be rewritten at some point. Other reports suggest strongly that this is not the case and that the inconsistencies can persist in real files.

I am frankly dumbfounded that any software RAID implementation allows inconsistent data to be written to different sides of its mirrors. It strikes me as an utterly basic correctness invariant that a RAID-1 pair is always in sync (apart from in-flight writes, etc etc) in the absence of disk errors and abnormal shutdowns.

SoftwareRaidFail written at 01:58:15

2009-12-11

A wish for KVM virtualization: simple bridged networking

I have a sad confession: despite running Fedora 11 on a machine that's fully capable of hardware virtualization, I am still using VMWare (and living with various bits of pain). While there are side reasons, the major one is that as far as I can see, KVM doesn't have simple bridged networking that needs no host-side changes, and VMWare does.

(Now, I freely admit that I may well be missing something in KVM and its setup. I certainly hope so; I would like to stop routinely using VMWare.)

My most common use for virtualization is to bring up test servers that are part of our overall environment. As you might expect, our environment does not expect servers to be living behind NAT gateways or the like; it expects them to live on distinct and fully reachable IP addresses. In short, they need to be bridged onto the same network that my host machine is on.

In VMWare this is reasonably simple in the GUI but apparently requires ugly kernel hacks behind the scenes. It does not require any changes to my host's own networking; VMWare's ugly kernel hacks make it magically work.

In KVM, it appears that you must create a bridge and then change your host's networking to use the bridge instead of the normal network interface. This is terribly invasive, because my host networking is quite complex; I have VLANs, firewall rules, and policy-based routing all attached to my host network. It's not even clear exactly what has to move; for example, do the VLANs stay attached to the real network interface or move to the bridge?

(If you create the bridge and just attach it to your normal network interface without changing your host's networking, your host networking stops working. Or at least this is what happened when I tried it.)
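
(For the record, my understanding of the minimum host-side conversion is roughly the following; this is a sketch in Python for consistency, using the standard bridge-utils and iproute2 commands, and the interface name and IP address are just examples.)

  import subprocess

  def run(*cmd):
      subprocess.check_call(cmd)

  run("brctl", "addbr", "br0")               # create the bridge
  run("brctl", "addif", "br0", "eth0")       # enslave the real interface
  run("ip", "addr", "flush", "dev", "eth0")  # the host IP has to move off eth0...
  run("ip", "addr", "add", "192.0.2.10/24", "dev", "br0")  # ...onto the bridge
  run("ip", "link", "set", "br0", "up")
  # and then the default route, VLANs, firewall rules, and policy routing
  # all have to be re-pointed at br0, which is exactly the invasive part.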

Now, I can see why the kernel people consider this the technically correct solution. But, much as with Xen, I'm not willing to make that big a commitment to KVM when VMware can be used without it, especially when I'm not certain that I'll like KVM in the first place. So I really wish that KVM did have simple bridged networking, because then I could actually try it out.

(VirtualBox does seem to have simple bridged networking, but then I'm not sure that it's any friendlier to the kernel than VMWare is and I know that VMWare has a better interface.)

Sidebar: why a virtual machine network is not an option

In theory, the correct way to deal with this issue is to create a new public but virtual network for my virtual machines to sit on that my host machine 'routes' to. This keeps potentially troublesome virtual machines off our physical network while still giving them reachable IP addresses.

As theoretically elegant as this solution is, it's unworkable in practice, especially when more than one sysadmin wants to use virtualization (we use our workstations as host machines). First, adding a new reachable network is not a trivial operation in our environment, even assuming that I wanted my host machine to be configured as a (full) router. Second, this essentially precludes using public Internet IP addresses, and some of the testing that we do needs machines with such IP addresses.

KVMSimpleBridgingWish written at 02:18:01

