2014-10-29
Quick notes on the Linux iptables 'ipset' extension
For a long time Linux's iptables firewall had an annoying lack in that it had no way to do efficient matching against a set of IP addresses. If you had a lot of IP addresses to match things against (for example if you were firewalling hundreds or thousands of IP addresses and IP address ranges off from your SMTP port), you needed one iptables rule for each entry and then they were all checked sequentially. This didn't make your life happy, to put it one way. In modern Linuxes, ipsets are finally the answer to this; they give you support for efficient sets of various things, including random CIDR netblocks.
(This entry suggests that ipsets only appeared in mainline Linux kernels as of 2.6.39. Ubuntu 12.04, 14.04, Fedora 20, and RHEL/CentOS 7 all have them while RHEL 5 appears to be too old.)
To work with ipsets, the first thing you need is the user level tool for
creating and manipulating them. For no particularly sensible reason your
Linux distribution probably doesn't install this when you install the
standard iptables stuff; instead you'll need to install an additional
package, usually called ipset. Iptables itself contains the code to
use ipsets, but without ipset to create the sets you can't actually
install any rules that use them.
(I wish I was kidding about this but I'm not.)
The basic use of ipsets is to make a set, populate it, and match against it. Let's take an example:
ipset create smtpblocks hash:net counters
ipset add smtpblocks 27.112.32.0/19
ipset add smtpblocks 204.8.87.0/24
iptables -A INPUT -p tcp --dport 25 -m set --match-set smtpblocks src -j DROP
(Both entries are currently on the Spamhaus EDROP list.)
Note that the set must exist before you can add iptables rules that
refer to it. The ipset manpage has a long discussion of the various
types of sets that you can use and the iptables-extensions manpage has
a discussion of --match-set and the SET target for adding entries
to sets from iptables rules. The hash:net I'm using here holds random
CIDR netblocks (including /32s, ie single hosts) and is set to have
counters.
It would be nice if there was a simple command to get just a listing
of the members of an ipset. Unfortunately there isn't; plain 'ipset
list' insists on outputting a few lines of summary information before
it lists the members. Since I don't know if those lines are constant,
I'm using 'ipset list -o save | grep "^add "', which is ugly but seems
likely to keep working forever.
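For example, with the smtpblocks set from above this looks something like the following (a sketch; the exact trailing fields, such as the counter values, may vary with your ipset version):
# list just the members of the smtpblocks set, in save format
ipset list -o save smtpblocks | grep '^add '
# which should print something like:
#   add smtpblocks 27.112.32.0/19 packets 0 bytes 0
#   add smtpblocks 204.8.87.0/24 packets 0 bytes 0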
Unfortunately I don't think there's an officially supported and
documented ipset command for adding multiple entries into a set at
once in a single command invocation; instead you're apparently expected
to run 'ipset add ...' repeatedly. You can abuse the 'ipset restore'
command for this if you want to by creating appropriately formatted
input; check the output of 'ipset save' to see what it needs to look
like. This may even be considered a stable interface by the ipset
authors.
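As a sketch of that abuse (the netblocks here are purely illustrative documentation ranges, not real spam sources):
# bulk-load entries by feeding 'ipset restore' save-format lines;
# use 'ipset restore -exist' if some entries may already be present
ipset restore <<'EOF'
add smtpblocks 192.0.2.0/24
add smtpblocks 198.51.100.0/24
EOF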
Ipset syntax and usage appears to have changed over time, so old discussions of it that you find online may not work quite as written (and someday these notes may be out of date that way as well).
PS: I can sort of see a lot of clever uses for ipsets, but I've only
started exploring them right now and my iptables usage is fairly basic
in general. I encourage you to read the ipset manpage and go wild.
Sidebar: how I think you're supposed to use list sets
As an illustrated example:
ipset create spamhaus-drop hash:net counters
ipset create spamhaus-edrop hash:net counters
[... populate both from spamhaus ...]
ipset create spamhaus list:set
ipset add spamhaus spamhaus-drop
ipset add spamhaus spamhaus-edrop
iptables -A INPUT -p tcp --dport 25 -m set --match-set spamhaus src -j DROP
This way your iptables rules can be indifferent about exactly what goes into the 'spamhaus' ipset, although of course this will be slightly less efficient than checking a single merged set.
2014-10-16
Don't use dd as a quick version of disk mirroring
Suppose, not entirely hypothetically, that you initially set up a
server with one system disk but have come to wish that it had a
mirrored pair of them. The server is in production and in-place
migration to software RAID requires a downtime or two, so as a cheap 'in case of emergency' measure
you stick in a second disk and then clone your current system disk
to it with dd (remember to fsck the root filesystem afterwards).
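For concreteness, the sort of thing I mean looks like this (the device names are hypothetical; this assumes sda is the current system disk, sdb is the fresh second disk with the root filesystem as the first partition, and note that it overwrites sdb entirely):
# clone the whole system disk to the second disk
dd if=/dev/sda of=/dev/sdb bs=1M
# then check the cloned root filesystem
fsck -f /dev/sdb1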
(This has a number of problems if you ever actually need to boot from the second disk, but let's set them aside for now.)
Unfortunately, on a modern Linux machine you have just armed a time
bomb that is aimed at your foot. It may never go off, or it may go
off more than a year and a half later (when you've forgotten all
about this), or it may go off the next time you reboot the machine.
The problem is that modern Linux systems identify their root
filesystem by its UUID, not its disk location, and because you
cloned the disk with dd you now have two different filesystems
with the same UUID.
(Unless you do something to manually change the UUID on the cloned
copy, which you can. But you have to remember that step. On extN
filesystems, it's done with tune2fs's -U argument; you probably
want '-U random'.)
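Assuming the same hypothetical device names as above, that step is a one-liner:
# give the cloned root filesystem a new random UUID so it no
# longer collides with the original
tune2fs -U random /dev/sdb1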
Most of the time, the kernel and initramfs will probably see your
first disk first and inventory the UUID on its root partition first
and so on, and thus boot from the right filesystem on the first
disk. But this is not guaranteed. Someday the kernel may get around
to looking at sdb1 before it looks at sda1, find the UUID it's
looking for, and mount your cloned copy as the root filesystem
instead of the real thing. If you're lucky, the cloned copy is so
out of date that things fail explosively and you notice immediately
(although figuring out what's going on may take a bit of time and
in the meantime life can be quite exciting). If you're unlucky,
the cloned copy is close enough to the real root filesystem that
things mostly work and you might only have a few little anomalies,
like missing log files or mysteriously reverted package versions
or the like. You might not even really notice.
(This is the background behind my recent tweet.)
2014-10-10
Where your memory can be going with ZFS on Linux
If you're running ZFS on Linux, its memory use is probably at least a concern. At a high level, there are at least three different places that your RAM may be being used or held down with ZoL.
First, it may be in ZFS's ARC, which is the ZFS equivalent of the buffer
cache. A full discussion of what is included in the ARC and how you
measure it and so on is well beyond the scope of this entry, but the
short summary is that the ARC includes data from disk, metadata from
disk, and several sorts of bookkeeping data. ZoL reports information
about it in /proc/spl/kstat/zfs/arcstats, which is exactly the
standard ZFS ARC kstats. What ZFS considers to be the total current
(RAM) size of the ARC is size. ZFS on Linux normally limits the
maximum ARC size to roughly half of memory (this is c_max).
(Some sources will tell you that the ARC size in kstats is c.
This is wrong. c is the target size; it's often but not always
the same as the actual size.)
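As a quick sketch of pulling these numbers out (this assumes the usual arcstats layout of 'name type data' triples, one per line):
# print the actual ARC size, target size, and maximum size in bytes
awk '$1 == "size" || $1 == "c" || $1 == "c_max" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats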
Next, RAM can be in slab allocated ZFS objects and data
structures that are not counted as part of the ARC for one reason
or another. It used to be that ZoL handled all slab allocation
itself and so all ZFS slab things were listed in /proc/spl/kmem/slab,
but the current ZoL development version now lets the native kernel slab
allocator handle most slabs for objects that aren't bigger than
spl_kmem_cache_slab_limit bytes, which defaults to 16K. Such
native kernel slabs are theoretically listed in
/proc/slabinfo but are unfortunately normally subject to SLUB
slab merging, which often means that they get
merged with other slabs and you can't actually see how much memory
they're using.
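If you're running such a version, you can at least check the cutoff on your system (assuming the spl module exposes the parameter through sysfs, as loadable module parameters normally are):
# show the size limit below which ZoL slabs go to the native allocator
cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit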
As far as slab objects that aren't in the ARC, I believe that
zfs_znode_cache slab objects (which are znode_ts) are not
reflected in the ARC size. On some machines active znode_t
objects may be a not insignificant amount of memory. I don't know
this for sure, though, and I'm somewhat reasoning from behavior
we saw on Solaris.
Third, RAM can be trapped in unused objects and space in slabs. One way that unused objects use up space (sometimes a lot of it) is that slabs are allocated and freed in relatively large chunks (at least one 4KB page of memory and often bigger in ZoL), so if only a few objects in a chunk are in use the entire chunk stays alive and can't be freed. We've seen serious issues with slab fragmentation on Solaris and I'm sure ZoL can have this too. It's possible to see the level of wastage and fragmentation for any slab that you can get accurate numbers for (ie, not any that have vanished into SLUB slab merging).
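As a rough sketch of spotting object-level wastage in native kernel slabs (the column positions are taken from the /proc/slabinfo header; this doesn't capture internal fragmentation within pages):
# print slabs where some allocated objects are sitting unused
awk 'NR > 2 && $3 > $2 { printf "%-28s %6d of %6d objects unused\n", $1, $3 - $2, $3 }' /proc/slabinfo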
(ZFS on Linux may also allocate some memory outside of its slab allocations, although I can't spot anything large and obvious in the kernel code.)
All of this sounds really abstract, so let me give you an example. On one of my machines with 16 GB and actively used ZFS pools, things are currently reporting the following numbers:
- the ARC is 5.1 GB, which is decent. Most of that is not actual file
data, though; file data is reported as 0.27 GB, then there's 1.87
GB of ZFS metadata from disk and a bunch of other stuff.
- 7.55 GB of RAM is used in active slab objects. 2.37 GB of that is
reported in /proc/spl/kmem/slab; the remainder is in native Linux
slabs in /proc/slabinfo. The znode_t slab is most of the SPL slab
report, at 2 GB used. (This machine is using a hack to avoid the
SLUB slab merging for native kernel ZoL slabs, because I wanted to
look at memory usage in detail.)
- 7.81 GB of RAM has been allocated to ZoL slabs in total. This means that there is a few hundred MB of space wasted at the moment.
If znode_t objects are not in the ARC, the ARC and active
znode_t objects account for almost all of the slab space
between the two of them; 7.1 GB out of 7.55 GB.
I have seen total ZoL slab allocated space be as high as 10 GB (on this 16 GB machine) despite the ARC only reporting a 5 GB size. As you can see, this stuff can fluctuate back and forth during normal usage.
Sidebar: Accurately tracking ZoL slab memory usage
To accurately track ZoL memory usage you must defeat SLUB slab
merging somehow. You can turn it off entirely with the slub_nomerge
kernel parameter or hack the spl ZoL kernel module to defeat it
(see the sidebar here).
Because you can set spl_kmem_cache_slab_limit as a module
parameter for the spl ZoL kernel module, I believe that you can
set it to zero to avoid having any ZoL slabs be native kernel slabs.
This avoids SLUB slab merging entirely and also makes it so that
all ZoL slabs appear in /proc/spl/kmem/slab. It may be somewhat
less efficient.
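A sketch of what setting this might look like (the file name is arbitrary, and this assumes your distribution reads /etc/modprobe.d and that your spl module accepts the parameter):
# force all ZoL slabs into the SPL allocator so they all appear
# in /proc/spl/kmem/slab
echo 'options spl spl_kmem_cache_slab_limit=0' > /etc/modprobe.d/zol-slabs.conf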
2014-10-09
How /proc/slabinfo is not quite telling you what it looks like
The Linux kernel does a lot (although not all) of its interesting
internal memory allocations through a slab allocator. For quite a while
it's exposed per-type details of this process in /proc/slabinfo;
this is very handy to get an idea of just what in your kernel is
using up a bunch of memory. Today I was exploring this because I
wanted to look into ZFS on Linux's memory usage
and wound up finding out that on modern Linuxes it's a little bit
misleading.
(By 'found out' I mean that DeHackEd on the #zfsonlinux IRC channel explained it to me.)
Specifically, on modern Linux the names shown in slabinfo are
basically a hint because the current slab allocator in the kernel merges multiple slab types together
if they are sufficiently similar. If five different subsystems all
want to allocate (different) 128-byte objects with no special
properties, they don't each get separate slab types with separate
slabinfo entries; instead they are all merged into one slab type
and thus one slabinfo entry. That slabinfo entry normally shows
the name of one of them, probably the first to be set up, with no
direct hint that it also includes the usage of all the others.
(The others don't appear in slabinfo at all.)
Most of the time this is a perfectly good optimization that cuts
down on the number of slab types and enables better memory sharing
and reduced fragmentation. But it does mean that you can't tell the
memory used by, say, btree_node apart from ip_mrt_cache
(on my machine, both are one of a lot of slab types that are actually
all mapped to the generic 128-byte object). It can also leave you
wondering where your slab types actually went, if you're inspecting
code that creates a certain slab type but you can't find it in
slabinfo (which is what happened to me).
The easiest way to see this mapping is to look at /sys/kernel/slab;
all those symlinks are slab types that may be the same thing. You
can decode what is what by hand, but if you're going to do this
regularly you should get a copy of tools/vm/slabinfo.c from the
kernel source and compile it; see the kernel SLUB documentation for details.
You want 'slabinfo -a' to report the mappings.
(Sadly slabinfo is underdocumented. I wish it had a manpage or
at least a README.)
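A sketch of building and using it (this assumes an unpacked kernel source tree in linux/ and that slabinfo.c builds standalone):
cd linux/tools/vm
gcc -o slabinfo slabinfo.c
# report which slab names are actually aliases merged together
./slabinfo -a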
If you need to track the memory usage of specific slab types, perhaps
because you really want to know the memory usage of one subsystem,
the easiest way is apparently to boot with the slub_nomerge
kernel command line argument. Per the kernel parameter documentation,
this turns off all slab merging, which may result in you having a
lot more slabs than usual.
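Checking whether your current kernel was booted this way is trivial:
# verify that slub_nomerge made it onto the kernel command line
grep slub_nomerge /proc/cmdline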
(On my workstation, slab merging condenses 110 different slabs into 14 actual slabs. On a random server, 170 slabs turn into 35 and a bunch of the pre-merger slabs are probably completely unused.)
Sidebar: disabling this merging in kernel code
The SLUB allocator does not directly expose a way of disabling this
merging when you call kmem_cache_create() in that there's no
'do not merge, really' flag to the call. However, it turns out that
supplying at least one of a number of SLUB debugging flags will
disable this merging. On a kernel built without
CONFIG_DEBUG_KMEMLEAK, using SLAB_NOLEAKTRACE appears
to have absolutely no other effects from what I can tell.
Both Fedora 20 and Ubuntu 14.04 build their kernels without this
option.
(I believe that most Linux distributions put a copy of the kernel
build config in /boot when they install kernels.)
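If so, checking for the option is a one-liner (the path assumes the usual distribution naming convention):
# see whether the running kernel was built with kmemleak support
grep CONFIG_DEBUG_KMEMLEAK /boot/config-$(uname -r)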
This may be handy if you have some additional kernel modules that you want to be able to track memory use for specifically even though a number of their slabs would normally get merged away, and you're compiling from source and willing to make some little modifications to it.
You can see the full set of flags that force never merging in the
#define for SLUB_NEVER_MERGE in mm/slub.c. On a quick look,
none of the others are both harmless and always defined as a
non-zero value. It's possible that SLAB_DEBUG_FREE also does
nothing these days; if used it will make your slabs only mergeable
with other slabs that also specify it (which no slabs in the main
kernel source do). That would cause slabs from your code to potentially
be merged together but they wouldn't merge with anyone else's slabs,
so at least you could track your subsystem's memory usage.
Disclaimer: these ideas have been at most compile-tested, not run live.
2014-10-03
Why people are almost never going to be reporting bugs upstream
In comments on my entry about CentOS bug reporting, opk wrote:
If it is essentially an upstream bug and not packaging I tend to think it's far better to wade into the upstream swamps as you call it. I once packaged something for Debian and mainly gave it up because of the volume of bug reports that were basically for upstream but I had to verify, reproduce, and forward them.
Then Pete left a comment that nicely summarizes the problems with opk's suggestion:
[...] But of course you have to commit to running versions with debugging and then of course there's "the latest" even for the 7. Due to the critical nature of my NM use, I had difficulties experimenting with it.
The reality is that upstream bug reports aren't going to work for almost everyone unless the project has a very generous upstream. The problem is simple: almost all Linux distributions both use old versions of packages and patch them. If your version is patched or not current or both, almost every open source project is going to immediately ask 'can you reproduce this with an unmodified current version?'
I won't go so far as to say that this request is a non-starter, because in theory it can be done. For some projects it is good enough to download the current version (perhaps the current development version) and compile it yourself to be installed in an alternate location (or just run from where it was compiled). Other projects can be rebuilt into real distribution packages and then installed on your system without blowing up the world. And of course if this bug is absolutely critical to you, maybe you're willing to blow up the world just to be able to submit a bug report.
What all of this amounts to is too much work, especially for the payoff most people are likely to get. The reality is that you're unlikely to benefit much from reporting any bug, and you're especially unlikely to benefit from upstream bug fixes unless you're willing to permanently run the upstream version (because if you're not, your distribution has to pick up and possibly backport the upstream bug fix if one is made).
(Let's skip the question of how many bug reporters even have the expertise to go through the steps necessary to try out the upstream version.)
Because reporting bugs upstream is so much work, in practice almost no one is going to do it no matter what you ask (or at least they aren't going to file useful ones). The direct corollary is that a policy of 'report bugs upstream' is in practice a policy of 'don't file bug reports'.
The one semi-exception to all of this is when your distribution package is an unmodified upstream version that the upstream (still) supports. At that point it makes sense to put a note in your bug tracker to explain this and say that upstream will take reports without problems. You're still asking bug reporters to do more work (now they have to go deal with the upstream bug reporting system too), but at least it's a pretty small amount of work.
2014-10-02
The problem with making bug reports about CentOS bugs
I mentioned yesterday that I had not made any sort of bug report about our NetworkManager race bug that we found on CentOS 7. The reason why is pretty simple: where can I report it that will do any good?
I can't report it to Red Hat as a bug against Red Hat Enterprise 7. Red Hat does take public bug reports against RHEL, at least the last time I looked, but I'm not running real RHEL, I'm running CentOS. Even if I could, reinstalling a machine with real RHEL 7 simply to be able to make an 'official' bug report for this issue is not an effective or efficient use of my time. By now we're never going to use NetworkManager even if it's fixed; we don't actually need it and the risks of another bug existing are too high.
(I actually did go through this exercise for one bug, but that was that the RHEL/CentOS 7 version of systemd winds up putting kernel messages in syslog under the wrong facility. This is both potentially likely to get fixed and something that rather matters to us. And that was a straightforward and easily demonstrated bug, which makes bug reports trivial to file (and it still wound up taking up a bunch of time and talking with Red Hat because it initially looked like rsyslog was at fault).)
There seems to be no point in reporting this to CentOS because it's not a CentOS bug as such. CentOS simply rebuilds the upstream RHEL RPMs, so there is exactly nothing they can or will do to fix this bug (assuming it's not specific to the CentOS rebuild of NM, and I have no reason to believe that it is). The corollary to this is that the only bugs I suspect are worth reporting to or against CentOS are basically packaging bugs.
(With that said, people do seem to report a lot of bugs in the CentOS bug tracker.)
Even if I felt like wading into the upstream swamps of NetworkManager,
we're not using anything like the current upstream version of NM
(RHEL 7 and thus CentOS 7 is absolutely guaranteed to
be behind the NM times, and RHEL and thus CentOS almost certainly
patches NM a bunch). As a result upstream would quite rightly
basically laugh at me (perhaps politely, via a 'try to reproduce
this with current NM').
(Having written that, I've just discovered that RHEL/CentOS 7 has a more recent version of NM than Fedora 20 does. CentOS 7 ships 0.9.9.1, apparently pulled from git on 2014-03-26, while Fedora 20 has 0.9.9.0 apparently from 2013-10-03. No, don't ask me. Possibly Red Hat felt that it was really important to use as fixed up a version of NM as possible before a RHEL release.)
Once upon a time this was just how CentOS had to be. But now that Red Hat has kind of taken over CentOS, it strikes me as a rather inefficient way to operate; RHEL is basically passing up the bug finding work that CentOS users are doing. With that said, it seems that Red Hat may unofficially accept 'RHEL' bug reports that actually happen with CentOS, but if so this is not documented anywhere that I can casually dig up (outside of some CentOS bugs with replies that ask people to re-file the bug in the RHEL bugzilla).
(And this lack of documentation is likely causing people other than me to not even bother filing bugs.)
PS: also, I hope it's obvious to people that a setup that routinely causes bug reporters to have to refile their reports in another bug reporting system is a hostile one that implicitly discourages bug reports. It's hoop-jumping in splendid form. If this is the real CentOS procedure, it should be changed (especially now that Red Hat is so involved in CentOS).