2010-04-23
What per-partition disk IO stats you get from the Linux kernel
A while back I wrote about what disk IO stats you get from the Linux kernel. At the time I talked only about full devices, not partitions, but recently I've become interested in the subject of what IO stats are maintained for partitions (because sometimes you're interested in per-partition information).
The answer depends on what 2.6 kernel version you have. In kernels before 2.6.25, you only have counts of read and write IOs issued and sectors read and written (what I called rio, rsect, wio, and wsect in my original entry). From 2.6.25 onwards, all stats are accurately maintained for partitions as well as whole disks, including time-based statistics and in-flight counts; this goes by the term 'extended partition statistics'.
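As a quick illustration, on a 2.6.25 or later kernel you can pull a partition's line straight out of /proc/diskstats and get the same full set of fields that whole disks get (sda1 here is just a stand-in partition name):

grep ' sda1 ' /proc/diskstats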
(At this point I throw brickbats at the kernel's Documentation/iostats.txt, which claims to have been last updated in 2003 and spends a bunch of time talking about the pre-2.6.25 situation, but then has a little note at the end to the effect that all of this is now inapplicable since 2.6.25.)
These days disk stats also show up in sysfs, which is more convenient
if you only want stats for a single entity; the format is the same as
/proc/diskstats with the first three fields dropped. Since about
2.6.32, sysfs (but not /proc/diskstats) will also give you separate
stats for how many read and write IOs are in flight, as well as the
merged numbers. This shows up in the inflight sysfs entry, which has
two numbers; the first number is reads and the second is writes.
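To make that concrete (sda and sda1 here are just stand-in device and partition names, and I've left out the output since the numbers will vary), you read them like so:

cat /sys/block/sda/sda1/stat
cat /sys/block/sda/inflight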
Where all of this is done in the source code has changed since my previous entry. Data is displayed in block/genhd.c (and also fs/partitions/check.c for per-partition sysfs stuff) and is now maintained in block/blk-core.c and block/blk-merge.c, using various bits that are found in include/linux/genhd.h.
I have not checked the kernel code to see how much you can trust these stats for LVM or software RAID devices (or how this has changed over time).
2010-04-18
I think it's time to turn off automatic periodic ext3 fscks
I've noticed a pattern lately: I have a number of infrequently rebooted machines, and every time I reboot one of those machines I wind up sitting and drumming my fingers as the machine cheerfully announces 'filesystem X hasn't been checked in N days, checking for you'. It takes a while, because these filesystems are often kind of big and kind of full of things.
This is not ext3's fault; it is faithfully doing what it is configured
to do, and even with all of the improvements you can stuff into it,
fsck can only go so fast because your disks only go so fast. But,
however much I don't like saying this, I think it means that it's time
to stop having systems automatically do periodic ext3 fscks. When I
reboot a machine under controlled circumstances, I almost always want it
to come back up as soon as possible; I do not want to sit there for ten
or twenty minutes as it grinds through a filesystem check. While I like
the reassurance of periodic fscks of my filesystems, I don't like them
quite that much.
There are two periodic checks that ext3 does, one based on how recently
the filesystem was checked and one based on how many times it's been
mounted. You can check the state of both of these with 'dumpe2fs -h';
the time based check is in the 'Last checked', 'Check interval', and
'Next check after' fields, and the mount count based check is in the
'Mount count' and 'Maximum mount count' fields.
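For example, you can pull out just those fields with something like this (the 2>/dev/null silences dumpe2fs's version banner; /dev/whatever is a placeholder, as usual):

dumpe2fs -h /dev/whatever 2>/dev/null | egrep -i 'mount count|check'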
(Checking is worthwhile, because some Linux distributions seem to turn off these checks by default; our Red Hat Enterprise Linux machines have both turned off, for example.)
Disabling either or both checks is done by tune2fs; 'tune2fs -c -1'
(or 0) will turn off mount count based checks, and 'tune2fs -i 0' will
disable the time based checks. The last time I tried to do anything with
tune2fs, it had to be used on an unmounted filesystem, but that may
have changed by now. So in summary (and for my future reference), you
want to do:
tune2fs -c -1 -i 0 /dev/whatever
While it may sound alarming to turn off these automatic periodic checks, I should point out explicitly what my experience shows: these automatic checks are happening only very infrequently. If you only reboot machines once or twice a year (or even less frequently), you are only getting very infrequent checks from these 'periodic' checks. If filesystem corruption is a significant concern for you, you are better off explicitly scheduling and performing more frequent checks (or at least more predictable ones). That way you know that your filesystems have all been checked within, say, the last three months.
(I suppose the straightforward way to do this is to actually set a time based check interval and then reboot your machines at slightly more than that time interval, so you might set 85 days as the check interval and then reboot your machines every 90 days. My understanding is that the state of the art of doing this without reboots involves LVM, snapshots, and fsck'ing the snapshot to see if anything comes up, but I have not looked into this very much.)
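If you decide to go the interval-setting route, tune2fs handles that too; assuming the 85-day figure above and the same placeholder device as before, it would be something like:

tune2fs -i 85d /dev/whatever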
2010-04-06
How not to set up IP aliases on Ubuntu (and probably Debian)
Suppose that you need some IP aliases on an Ubuntu machine. So you go to
/etc/network/interfaces and slavishly make yourself some, copying the
main stanza a number of times to make entries that look like this:
auto eth0:0
iface eth0:0 inet static
    address 128.100.1.A
    network 128.100.1.0
    netmask 255.255.255.0
    broadcast 128.100.1.255
    gateway 128.100.1.254
(repeat for each additional IP alias, increasing the eth0:N number and
replacing A with B and so on.)
What's wrong here is the additional gateway statements for each IP
alias; you do not want to specify gateways for IP aliases. The
problem with all of these gateway statements is that they create
multiple default routes:
$ ip route list | fgrep default
default via 128.100.1.254 dev eth0 src 128.100.1.A metric 100
default via 128.100.1.254 dev eth0 src 128.100.1.B metric 100
default via 128.100.1.254 dev eth0 src 128.100.1.C metric 100
default via 128.100.1.254 dev eth0 metric 100
(You have to use 'ip route list' to see this; 'netstat -nr' will tell
you that you have multiple default routes but not how they differ.)
These routes differ only in that three of the four specify that the
local IP address is something besides the machine's primary IP address
(the 'src <IP>' bit).
When you have multiple default routes with the same metric, Linux picks which one to use semi-randomly (and it will change which one it uses from time to time). Since different default routes come with different local IP addresses, your outgoing connections (and UDP requests) will periodically come from a different IP address. This is comedy gold, especially when combined with a cautiously configured firewall that only passes outbound traffic from some (but not all) of those IP addresses.
Troubleshooting this is part of where the comedy gold comes in; things
will work sometimes and not at other times, with the problem coming and
going randomly (in reality it comes and goes as the machine chooses
different default routes to use, with different local IP addresses).
You can have a 'telnet outside-host port' command fail and then your
TCP-based traceroute succeed and look fine, for example.
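The fix is simply to leave the gateway line out of the alias stanzas (only the primary interface's stanza needs one). Using the same example addresses as before, a corrected alias stanza is just:

auto eth0:0
iface eth0:0 inet static
    address 128.100.1.A
    network 128.100.1.0
    netmask 255.255.255.0
    broadcast 128.100.1.255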
(This happened to us on an Ubuntu 8.04 system. Since Ubuntu and Debian use basically the same system for handling network configuration, I suspect that it would also happen on a Debian machine. It may also happen in other distributions, depending on what they do when you give an IP alias a gateway.)