Wandering Thoughts archives

2013-12-23

A good reason to use write-intent bitmaps

Modern versions of Linux's software RAID support what the documentation calls 'write-intent bitmaps' (see also the md(4) manpage). To summarize the documentation, this bitmap keeps track of which parts of the RAID array are potentially 'dirty', ie where different components of the array may be out of sync with each other. An array with this information can often be resynchronized much faster after a crash or after a situation where one component drops out temporarily. However, the drawback of a write-intent bitmap is some amount of extra synchronous writes that will probably require seeks, since the relevant portion of the bitmap must be marked as dirty and then flushed to disk before the real write happens.
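
You can check whether an array already has one; a quick sketch, assuming /dev/md0 is the array in question:

# an array with a bitmap shows a 'bitmap: ...' line in /proc/mdstat
grep -A 3 '^md0' /proc/mdstat

# or ask mdadm directly; look for 'Intent Bitmap'
mdadm --detail /dev/md0 | grep -i bitmap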

Now, here's a serious question: how many writes do your systems typically do to their arrays, and how performance-critical are the speeds of those writes? On most of our systems the answers are 'not many' and 'not very', because we're using RAID-1 to make sure systems can ride out a single disk failure.

As far as I can tell, the great advantage of write-intent bitmaps is that if your system abruptly crashes while it's active (so your arrays have to resync), you have a much better chance of surviving if a 'good' source drive for the resync has a read error (whether latent or live). In an ideal world Linux's software RAID would be smart enough in this situation to pick the data from another drive, since it's probably good; however, I'm not sure if it is that smart right now and I don't want to have to trust its smarts. Write-intent bitmaps improve this situation because you're resyncing much less data; instead of a read error anywhere on your entire disk being really dangerous, now you only have to dread a read error in the hopefully small amount being resynced.

Based on this logic I'm going to be turning on write-intent bitmaps on most of our systems, because frankly very few of them are write-intensive. Even on my desktop(s) I think I'd rather live with some write overhead in order to cut down the risk of catastrophic data loss.
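
Adding an internal bitmap to an existing array can be done on the fly. A sketch, again assuming /dev/md0 is the array (it stays available while the bitmap is created):

# add an internal write-intent bitmap to a running array
mdadm --grow --bitmap=internal /dev/md0

# and if you change your mind later
mdadm --grow --bitmap=none /dev/md0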

(All of this came to mind after a recent power failure at home that required resyncing a more-than-a-TB array. I was suddenly nervous when I started to think about it.)

PS: yes, I know about ZFS on Linux and it would definitely do this much better. But it's not ready yet.

Sidebar: about that overhead

How much the write-intent bitmap overhead affects you depends on IO patterns and how much disk space a given bit in the bitmap covers. The best case is probably repeatedly rewriting the same area of the array, where the bits will get set and then just sit there. The worst case is writing once to spots all over the array, requiring one bit to be marked for every real write you do.
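
If you want to change that granularity, you can pick the bitmap chunk size yourself when you add the bitmap. A sketch, again for /dev/md0 (older versions of mdadm may want the chunk size in plain kilobytes instead of with a suffix):

# larger chunks mean fewer bitmap updates, but more data to resync per dirty bit
mdadm --grow --bitmap=none /dev/md0
mdadm --grow --bitmap=internal --bitmap-chunk=128M /dev/md0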

Write-intent bitmaps on SSD-based arrays are much cheaper but probably not totally free, since they require an extra disk cache flush.

SoftwareRaidWriteIntentBitmap written at 11:34:16

2013-12-21

If you're using Linux's magic SysRq, increase the log level right away

Here is an important thing about using magic SysRq in practice that I recently learned the hard way:

If you're about to use magic SysRq to diagnose a problem, start by using it to increase the log level.

Specifically, invoke magic SysRq and use 9 to set kernel console logging verbosity to maximum. What this does is ensure that all kernel messages go to the console, including all messages produced by magic SysRq operations.

(I believe you can also do this through dmesg with 'dmesg -n debug', which will be most useful if done in advance, say at boot time. Your distribution may have some magic settings file that controls this for you, and for that matter it can probably be set as a kernel command line parameter. And of course all of your servers should have magic SysRq enabled.)
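
As a sketch of the non-keyboard ways to do all of this (the paths and values here are the standard kernel ones, but check your distribution):

# raise the console log level right now; roughly what SysRq 9 does
dmesg -n debug
# or via the SysRq trigger file itself
echo 9 >/proc/sysrq-trigger

# check the current console log level (the first field)
cat /proc/sys/kernel/printk

# make sure magic SysRq is enabled in the first place
echo 1 >/proc/sys/kernel/sysrq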

If you are naive like the me of a few days ago, you might innocently believe that of course magic SysRq messages will always be shown on the console because it would be kind of stupid to do otherwise. I am sad to say that this is not the case, at least for serial consoles. At the default kernel log levels for at least Ubuntu 12.04, it's quite possible for information dumps from some useful magic SysRq operations (like 'show memory state' and 'show task state') to not be displayed on the serial console, although they are making it into the kernel log buffer (and may be saved somewhere if your system is not too broken).

Speaking from personal experience, this is very frustrating. It's also dangerous because the situation looks just like a kernel that's in so much trouble that various magic SysRq operations aren't completing; as such it can make you more inclined to force-reboot a system that doesn't actually need it.

(Discovering this after the fact is also frustrating if you then discover that the system only managed to write partial logs and you've thus missed capturing potentially vital debugging information.)

I maintain that this is a kernel bug, but even if I could manage to convince the kernel developers of that it would be something like half a decade or more before the fix made it into all of the kernels we and other people will be using. In the mean time, remember that the first magic SysRq operation you want to do is 9 for maximum console logging.

PS: if you're using serial consoles see this caution about serial consoles and magic SysRq, although I suppose the situation may have changed in the last six years (I haven't tested since back then).

MagicSysrqIncreaseLogLevel written at 01:37:56

2013-12-13

Using cgroups to limit something's RAM consumption (a practical guide)

Suppose, not entirely hypothetically, that you periodically rebuild Firefox from source and that you've discovered that parts of this build process kill your machine due to memory (over)use. You would like to fix this by somehow choking off the total amount of RAM that the whole Firefox build process uses. The relatively simple tool to use for this is a cgroup.

There is probably lots of documentation on the full details of cgroups floating around. This is a practical guide instead.

First we need a cgroup that will actually apply memory limits. Ad-hoc cgroups are created on the fly with cgcreate (which is run as root):

cgcreate -t cks -a cks -g memory,cpu,blkio:confine

I'm doing some overkill here; in theory we only need to limit memory usage. But who cares. As I found out, it's important to specify both -t and -a here; -a lets us set limits, -t lets us actually put something into the new cgroup.

The easiest way to set limits is by writing to files in /sys/fs/cgroup/controller/path. Here we have controllers for memory, cpu, and blkio and our path under them is confine, so:

cd /sys/fs/cgroup
# This is 3 GB
echo 3221225472 >memory/confine/memory.limit_in_bytes

# only gets half of contended CPU and disk bandwidth
# (in theory)
echo 512 >cpu/confine/cpu.shares
echo 500 >blkio/confine/blkio.weight

(Since we gave ourselves permissions with -a, we can set all of these limits directly without being root.)

What parameters controllers take is established partly by poking around in /sys/fs/cgroup, partly by experimentation, partly by Internet searches, and sometimes from the official kernel documentation. Where limits exist (and work) they may have side effects; for example, limiting total RAM here is going to force a memory-hungry program to swap, using up a bunch of disk IO bandwidth.
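
For instance, a quick way to see what knobs the memory controller actually exposes for our new cgroup, and what they're currently set to:

ls /sys/fs/cgroup/memory/confine/
cat /sys/fs/cgroup/memory/confine/memory.limit_in_bytes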

(If you want this cgroup and its settings to persist over reboot you can make a suitable entry in /etc/cgconfig.conf. On Fedora you may also need to make sure that the cgconfig service is enabled.)
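
As a sketch, an /etc/cgconfig.conf entry mirroring the settings above might look something like this (check cgconfig.conf(5) for the exact syntax your version expects):

group confine {
    perm {
        task  { uid = cks; gid = cks; }
        admin { uid = cks; gid = cks; }
    }
    memory { memory.limit_in_bytes = 3221225472; }
    cpu    { cpu.shares = 512; }
    blkio  { blkio.weight = 500; }
}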

Finally we need to actually run our make or whatever so that it is put into our new 'confine' cgroup and it and its children have their total RAM usage limited the way we want. This is done on the fly with cgexec (run as ourselves):

cgexec -g memory,cpu,blkio:confine --sticky make

You don't need --sticky in various common situations, for example if you're not running the cgroups automatic classification daemon. But I don't think it does any harm to supply it, and anyway you may well want to wrap this magic command up in a script so you don't have to remember it.
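
A minimal wrapper along those lines (the script name here is just for illustration):

#!/bin/sh
# confine: run a command, and everything it spawns, inside the 'confine' cgroup
exec cgexec -g memory,cpu,blkio:confine --sticky "$@"

Then the build becomes just 'confine make' (or whatever your usual build command is).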

You can check to see that cgexec is properly putting things into cgroups by looking at /proc/pid/cgroup to see what cgroups a suitable process is part of. In this case you would expect to see memory:/confine among the list. Testing whether your actual cgroup controller settings are working and doing what you want is beyond the scope of this entry.
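
For example, from a shell that was started via cgexec ($$ is the shell's own PID; the hierarchy numbers and exact controller groupings will vary from system to system):

cat /proc/$$/cgroup
# you're looking for lines ending in ':/confine', eg something like
#   4:memory:/confine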

The good news is that this seems to work for me. My Firefox build process has been significantly tamed.

(I've looked at fair share scheduling with cgroups before, which certainly helped here. People have written all of this information down in bits and pieces and partial explanations and Stack Overflow answers and so on, but since I put it together I want to write it down all in one place for later use. (I'm sure there'll be later use.))

CgroupsForMemoryLimiting written at 02:00:17

2013-12-07

Things get weird with read-only NFS mounts and atime on Linux

It all started with this tweet by Matt Simmons which led to a mailing list question and a second message with context:

Nevertheless, I tested it and unless I messed up my test, an NFS mount with -o ro, you read a file on the mounted FS, and the access time is updated.

For the test the server was a NetApp, the client was Linux.

There is a mount flag -o noatime that does what I want. But I would argue that this is not right. The simplest behavior - nothing is ever written period - should be what you get by default, and then there could be a flag that enables exceptional behavior, that is updating the access time.

Actually, things here are much more interesting and odd than you might think. Given that mounting the filesystem ro on the client doesn't quite mean what you might think on NFS (and given how atime works on NFS), the ordinary behavior makes complete sense. It's not that the client is sending atime updates to the server despite the NFS mount being read-only; it's that the server is updating atimes itself when the client does reads. The weird thing is what seems to happen when the client mounts the filesystem with noatime.

(Having the filesystem mounted with noatime on the server should turn off atime updates regardless of what the setting is on your client.)

It turns out that this is an illusion. Because NFS is a stateless protocol, servers do not send any sort of notification to clients when a file's attributes change; instead NFS clients have to check for this by issuing an NFS GETATTR operation every time they need to know. Because a Unix system looks at file attributes quite frequently, NFS clients normally keep caches of recent enough GETATTR results and expire them periodically. On Linux, when you mount an NFS filesystem without noatime, the NFS client code decides that you really want to know about atime updates and so it often deliberately bypasses this GETATTR cache so it can pull in the latest atime update from the server. When you mount the same NFS filesystem with noatime, this special bypass doesn't happen and you get the full attribute cache effects, which means that you don't see server updates to atime until the cache entry for your file expires (or is invalidated for some reason, or is evicted).

(Attribute caching is covered in the nfs(5) manpage; see the description of ac, actimeo, and so on.)
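
As a sketch of what these options look like in practice (server:/export and /mnt/export are placeholders):

# read-only, no atime chasing on the client, 60-second attribute cache
mount -t nfs -o ro,noatime,actimeo=60 server:/export /mnt/export

Setting actimeo=0 disables attribute caching entirely, at the cost of an extra GETATTR every time the client needs fresh attributes.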

So what's really happening here is that with a noatime NFS mount the atime is still being updated on the server but you won't see that update on your client for some amount of time (how long depends on various things). If you check the atime immediately after reading from a file, this will look like the atime isn't being updated at all. You could see the true state of affairs by looking at the atime on either the server or on another client that was mounting the filesystem without noatime.

The corollary of this is that mounting your NFS filesystems with noatime will reduce the number of NFS requests you make to the server (although I don't know by how much). In some situations this may be a quite useful feature, especially if you've already turned off atime updates on the server itself.

NFSReadonlyAtime written at 03:18:07

