2013-12-23
A good reason to use write-intent bitmaps
Modern versions of Linux's software RAID support what the documentation
calls 'write-intent bitmaps' (see
also the md(4) manpage). To summarize the documentation, this
bitmap keeps track of which parts of the RAID array are potentially
'dirty', ie where different components of the array may be out of
sync with each other. An array with this information can often be
resynchronized much faster after a crash or after a situation
where one component drops out temporarily. However, the drawback of a
write-intent bitmap is a certain number of extra synchronous writes,
which will probably require seeks, since the relevant portion of the
bitmap must be marked as dirty and then flushed to disk before the
real write happens.
Now, here's a serious question: how many writes do your systems typically do to their arrays, and how performance-critical are the speeds of those writes? On most of our systems both answers are 'not very', because we're using RAID-1 to make sure systems can ride out a single disk failure.
As far as I can tell, the great advantage of write-intent bitmaps is that if your system abruptly crashes while it's active (so your arrays have to resync), you have a much better chance of surviving if a 'good' source drive for the resync has a read error (whether latent or live). In an ideal world Linux's software RAID would be smart enough in this situation to pick the data from another drive, since it's probably good; however, I'm not sure if it's that smart right now and I don't want to have to trust its smarts. Write-intent bitmaps improve this situation because you're resyncing much less data; instead of a read error anywhere on your entire disk being really dangerous, you now only have to dread a read error in the hopefully small amount being resynced.
Based on this logic I'm going to be turning on write-intent bitmaps on most of our systems, because frankly very few of them are write-intensive. Even on my desktop(s) I think I'd rather live with some write overhead in order to cut down the risk of catastrophic data loss.
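For the record, turning one on is a single mdadm operation and can normally be done on a live array. Here's a sketch, with /dev/md0 standing in for whatever your actual array is:
mdadm --grow --bitmap=internal /dev/md0
# /proc/mdstat should now show a 'bitmap: ...' line for the array
cat /proc/mdstat
# if you change your mind later, the bitmap can be removed again:
mdadm --grow --bitmap=none /dev/md0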
(All of this came to mind after a recent power failure at home that required resyncing a more-than-a-TB array. I was suddenly nervous when I started to think about it.)
PS: yes, I know about ZFS on Linux and it would definitely do this much better. But it's not ready yet.
Sidebar: about that overhead
How much the write-intent bitmap overhead affects you depends on IO patterns and how much disk space a given bit in the bitmap covers. The best case is probably repeatedly rewriting the same area of the array, where the bits will get set and then just sit there. The worst case is writing once to spots all over the array, requiring one bit to be marked for every real write you do.
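If you want to know or control that granularity, it's the bitmap 'chunk' size. As a sketch (the units and defaults here are as I understand current mdadm; check your mdadm manpage):
cat /proc/mdstat
# the bitmap line reports the chunk size, something like
#   'bitmap: 0/30 pages [0KB], 65536KB chunk'
# a chunk size can be asked for when the bitmap is added, eg
# (in KB in the mdadm versions I'm familiar with):
mdadm --grow --bitmap=internal --bitmap-chunk=65536 /dev/md0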
Write-intent bitmaps on SSD-based arrays are much cheaper but probably not totally free, since they require an extra disk cache flush.
2013-12-21
If you're using Linux's magic SysRq, increase the log level right away
Here is an important thing about using magic SysRq in practice that I recently learned the hard way:
If you're about to use magic SysRq to diagnose a problem, start by using it to increase the log level.
Specifically, invoke magic SysRq and use 9 to set kernel console
logging verbosity to maximum. What this does is ensure that all
kernel messages go to the console, including all messages produced
by magic SysRq operations.
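There are several ways to do this depending on how you're talking to the machine; here's a sketch (the serial console details in particular depend on your setup):
# on a directly attached keyboard: hold Alt plus SysRq (PrintScreen),
# then press 9
# on many serial console setups: send a serial break, then press 9
# from a still-working root shell:
echo 9 >/proc/sysrq-trigger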
(I believe you can also do this through dmesg with 'dmesg -n debug',
which will be most useful if done in advance, say at boot time. Your
distribution may have some magic settings file that controls this for
you and for that matter it can probably be set as a kernel command line
parameter. And of course all of your servers should have magic SysRq
enabled.)
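As a sketch of those non-SysRq routes (check the exact spelling of the kernel parameters against your kernel's documentation):
dmesg -n debug             # or a numeric level such as 'dmesg -n 8'
sysctl -w kernel.sysrq=1   # enable magic SysRq now; put it in
                           # /etc/sysctl.conf or /etc/sysctl.d to persist
# candidate kernel command line parameters include 'loglevel=8',
# 'ignore_loglevel', and 'sysrq_always_enabled'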
If you are naive like the me of a few days ago, you might innocently believe that of course magic SysRq messages will always be shown on the console because it would be kind of stupid to do otherwise. I am sad to say that this is not the case, at least for serial consoles. At the default kernel log levels for at least Ubuntu 12.04, it's quite possible for information dumps from some useful magic SysRq operations (like 'show memory state' and 'show task state') to not be displayed on the serial console, although they are making it into the kernel log buffer (and may be saved somewhere if your system is not too broken).
Speaking from personal experience, this is very frustrating. It's also dangerous because the situation looks just like a kernel that's in so much trouble that various magic SysRq operations aren't completing; as such it can make you more inclined to force-reboot a system that doesn't actually need it.
(Discovering this after the fact is also frustrating if you also discover that the system only managed to write partial logs and you've thus missed capturing potentially vital debugging information.)
I maintain that this is a kernel bug, but even if I could manage to
convince the kernel developers of that it would be something like half
a decade or more before the fix made it into all of the kernels we and
other people will be using. In the mean time, remember that the first
magic SysRq operation you want to do is 9 for maximum console logging.
PS: if you're using serial consoles see this caution about serial consoles and magic SysRq, although I suppose the situation may have changed in the last six years (I haven't tested since back then).
2013-12-13
Using cgroups to limit something's RAM consumption (a practical guide)
Suppose, not entirely hypothetically, that you periodically rebuild Firefox from source and that you've discovered that parts of this build process kill your machine due to memory (over)use. You would like to fix this by somehow choking off the total amount of RAM that the whole Firefox build process uses. The relatively simple tool to use for this is a cgroup.
There is probably lots of documentation on the full details of cgroups floating around. This is a practical guide instead.
First we need a cgroup that will actually apply memory limits.
Ad-hoc cgroups are created on the fly with cgcreate (which is
run as root):
cgcreate -t cks -a cks -g memory,cpu,blkio:confine
I'm doing some overkill here; in theory we only need to limit memory
usage. But who cares. As I found out, it's important to specify both
-t and -a here; -a lets us set limits, -t lets us actually
put something into the new cgroup.
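You can see the effects of -t and -a by looking at the ownership of the files that cgcreate made (paths here assume the usual /sys/fs/cgroup controller mounts):
ls -ld /sys/fs/cgroup/memory/confine
ls -l /sys/fs/cgroup/memory/confine/tasks
# -a (admin) makes the control files owned by cks, which is what lets
# us set limits below; -t (task) makes the tasks file owned by cks,
# which is what lets us put processes into the cgroup.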
The easiest way to set limits is by writing to files in
/sys/fs/cgroup/controller/path. Here we have controllers
for memory, cpu, and blkio and our path under them is confine,
so:
cd /sys/fs/cgroup
# This is 3 GB
echo 3221225472 >memory/confine/memory.limit_in_bytes
# only gets half of contended CPU and disk bandwidth
# (in theory)
echo 512 >cpu/confine/cpu.shares
echo 500 >blkio/confine/blkio.weight
(Since we gave ourselves permissions with -a, we can set all
of these limits directly without being root.)
What parameters controllers take is established partly by poking
around in /sys/fs/cgroup, partly by experimentation, partly by
Internet searches, and sometimes from the official kernel
documentation.
Where limits exist (and work) they may have side effects; for
example, limiting total RAM here is going to force a memory-hungry
program to swap, using up a bunch of disk IO bandwidth.
(If you want this cgroup and its settings to persist over reboot you can
make a suitable entry in /etc/cgconfig.conf. On Fedora you may also
need to make sure that the cgconfig service is enabled.)
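As a sketch of what such an entry might look like, going from cgconfig.conf(5) (I haven't verified this exact stanza):
group confine {
    perm {
        task  { uid = cks; gid = cks; }
        admin { uid = cks; gid = cks; }
    }
    memory { memory.limit_in_bytes = 3221225472; }
    cpu    { cpu.shares = 512; }
    blkio  { blkio.weight = 500; }
}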
Finally we need to actually run our make or whatever so that it is
put into our new 'confine' cgroup and it and its children have their
total RAM usage limited the way we want. This is done on the fly with
cgexec (run as ourselves):
cgexec -g memory,cpu,blkio:confine --sticky make
You don't need --sticky in various common situations, for example if
you're not running the cgroups automatic classification daemon. But I
don't think it does any harm to supply it and anyways you may well
want to wrap this magic command up in a script so you don't have to
remember it.
You can check to see that cgexec is properly putting things into
cgroups by looking at /proc/pid/cgroup to see what cgroups a
suitable process is part of. In this case you would expect to see
memory:/confine among the list. Testing whether your actual cgroup
controller settings are working and doing what you want is beyond the
scope of this entry.
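For instance, one quick check is to start a shell in the cgroup and look at its own entry (the hierarchy numbers and exactly which controllers are mounted together will vary):
cgexec -g memory,cpu,blkio:confine --sticky /bin/bash
cat /proc/$$/cgroup
# expect to see lines along the lines of '4:memory:/confine',
# '3:cpu:/confine', and '2:blkio:/confine'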
The good news is that this seems to work for me. My Firefox build process has been significantly tamed.
(I've looked at fair share scheduling with cgroups before, which certainly helped here. People have written all of this information down in bits and pieces and partial explanations and Stack Overflow answers and so on, but since I put it together I want to write it down all in one place for later use. (I'm sure there'll be later use.))
2013-12-07
Things get weird with read-only NFS mounts and atime on Linux
It all started with this tweet by Matt Simmons which led to a mailing list question and a second message with context:
Nevertheless, I tested it and unless I messed up my test, an NFS mount with -o ro, you read a file on the mounted FS, and the access time is updated.
For the test the server was a NetApp, the client was Linux.
There is a mount flag -o noatime that does what I want. But I would argue that this is not right. The simplest behavior - nothing is ever written period - should be what you get by default, and then there could be a flag that enables exceptional behavior, that is updating the access time.
Actually things here are much more interesting and odd than you
might think. In light of the fact that mounting the filesystem
ro on the client doesn't quite mean what you might think on NFS (and how atime works on NFS), the ordinary
behavior makes complete sense. It's not that the client is sending atime
updates to the server despite the NFS mount being read-only, it's that
the server is updating atimes when the client does reads. The weird
thing is what seems to happen when the client mounts the filesystem
with noatime.
(The server mounting the filesystem with noatime should turn off atime
updates regardless of what the setting is on your client.)
It turns out that this is an illusion. NFS is a stateless protocol; the
server does not send any sort of notification to clients when a file's
attributes change. Instead NFS clients have to check for this by issuing
a NFS GETATTR operation every time they need to know. Because a
Unix system looks at file attributes quite frequently, NFS clients
normally keep caches of recent enough GETATTR results and expire
them periodically. On Linux, when you mount a NFS filesystem without
noatime the NFS client code decides that you really want to know about
atime updates and so it often deliberately bypasses this GETATTR cache
so it can pull in the latest atime update from the server. When you
mount the same NFS filesystem with noatime this special bypass doesn't
happen and you get the full attribute cache effects, which means that
you don't see server updates to atime until the cache entry for your
file expires (or is invalidated for some reason, or is evicted).
(Attribute caching is covered in the nfs(5) manpage; see the
description of ac, actimeo, and so on.)
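For illustration, the knobs involved look like this (server:/export and /mnt are made-up names; nfs(5) has the real details of the options):
mount -t nfs -o ro,noatime server:/export /mnt
# or keep atime handling but shrink the attribute cache timeouts so
# attribute changes (atime included) are noticed within a few seconds:
mount -t nfs -o ro,actimeo=3 server:/export /mnt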
So what's really happening here is that with a noatime NFS mount the
atime is still being updated on the server but you won't see that update
on your client for some amount of time (how long depends on various
things). If you check the atime immediately after reading from a file,
this will look like the atime isn't being updated at all. You could see
the true state of affairs by looking at the atime on either the server
or on another client that was mounting the filesystem without noatime.
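A minimal way of seeing this for yourself, with placeholder paths:
# on a client that has the filesystem mounted noatime:
cat /mnt/somefile >/dev/null
stat -c '%x' /mnt/somefile     # may report a stale atime for a while
# on the server (or on a client mounted without noatime):
stat -c '%x' /export/somefile  # shows the freshly updated atime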
The corollary of this is that mounting your NFS filesystems with
noatime will reduce the number of NFS requests you make to the
server (although I don't know by how much). In some situations this
may be a quite useful feature, especially if you've already turned off
atime updates on the server itself.