Bashisms in #!/bin/sh scripts are not necessarily bugs
In the wake of Shellshock, any number
of people have cropped up in any number of places to say that you should
always be able to change a system's
/bin/sh to something other than
Bash because Bashisms in scripts that are specified to use #!/bin/sh
are a bug. It is my heretical view that these people are wrong in general
(although potentially right in specific situations).
First, let us get a trivial case out of the way: a Unix distribution
is fully entitled to assume that you have not changed non-adjustable
things. If a distribution ships with
/bin/sh as Bash and does not have
a supported way to change it to some other shell, then the distribution
is fully entitled to write its own
#!/bin/sh shell scripts so that
they use Bashisms. This may be an unwise choice on the distribution's
part, but it's not a bug unless they have an official policy that all
of their shell scripts should be POSIX-only.
(Of course the distribution may act on RFEs that their
#!/bin/sh scripts not use Bashisms. But
that's different from it being a bug.)
Next, let's talk about user scripts. On a system where /bin/sh is
officially always Bash, ordinary people are equally entitled to assume
that their systems have not been manually mangled into unofficial states.
As a result they are also entitled to write their #!/bin/sh scripts
with Bashisms in them, because these scripts work properly on all
officially supported system configurations. As with distributions,
this may not be a wise choice (since it may cause pain if and when
they ever move those scripts to another Unix system) but it is not a
bug. The only case when it even approaches being a bug is when the
distribution has officially included large warnings saying '/bin/sh is
currently Bash but it may be something else someday, you should write
your /bin/sh shell scripts to POSIX only, and here is a tool to help'.
There are some systems where this is the case and has historically been
the case, and on those systems you can say that people using Bashisms
in #!/bin/sh scripts clearly have a bug by the system's official
policy. There are also quite a number of systems where this is not
and has not been the case, where the official
/bin/sh is Bash and always
has been. On those systems, Bashisms in
#!/bin/sh scripts are not a bug.
(By the way, only relatively recently have you been able to count on
/bin/sh being POSIX compatible; see here. Often it's had very few guarantees.)
And as a pragmatic matter,
a system with only Bash as
/bin/sh is likely to have plenty of
/bin/sh shell scripts with Bashisms in them even if the official
policy is that you should only use POSIX features in such scripts.
This is a straightforward application of one of my aphorisms of
system administration (and perhaps also this
one). These scripts have a nominal bug, but
of course people are not going to be happy if you break them.
System metrics need to be documented, not just to exist
As a system administrator, I love systems that expose metrics (performance, health, status, whatever they are). But there's a big caveat to that, which is that metrics don't really exist until they're meaningfully documented. Sadly, documenting your metrics is much less common than simply exposing them, perhaps because it takes much more work.
At the best of times this forces system administrators and other bystanders to reverse engineer your metrics from your system's source code or from programs that you or other people write to report on them. At the worst this makes your metrics effectively useless; sysadmins can see the numbers and see them change, but they have very little idea of what they mean.
(Maybe sysadmins can dump them into a stats tracking system and look for correlations.)
Forcing people to reverse engineer the meaning of your stats has two bad effects. The obvious one is that people almost always wind up duplicating this work, which is just wasted effort. The subtle one is that it is terribly easy for a mistake about what a metric means to become, essentially, superstition that everyone knows and spreads. Because people are reverse engineering things in the first place, it's very easy for mistakes and misunderstandings to happen; then people write the mistake down or embody it in a useful program and pretty soon it is being passed around the Internet since it's one of the few resources on the stats that exist. One mistake will be propagated into dozens of useful programs, various blog posts, and so on, and through the magic of the Internet many of these secondary sources will come off as unhesitatingly authoritative. At that point, good luck getting any sort of correction out into the Internet (if you even notice that people are misinterpreting your stats).
At this point some people will suggest that sysadmins should avoid doing anything with stats that they reverse engineer unless they are absolutely, utterly sure that they're correct. I'm sorry, life doesn't work this way. Very few sysadmins reverse engineer stats for fun; instead, we're doing it to solve problems. If our reverse engineering solves our problems and appears sane, many sysadmins are going to share their tools and what they've learned. It's what people do these days; we write blog posts, we answer questions on Stackoverflow, we put up Github repos with 'here, these are the tools that worked for me'. And all of those things flow around the Internet.
(Also, the suggestion that people should not write tools or write up documentation unless they are absolutely sure that they are correct is essentially equivalent to asking people not to do this at all. To be absolutely sure that you're right about a statistic, you generally need to fully understand the code. That's what they call rather uncommon.)
Phish spammers are apparently exploiting mailing list software
One of the interesting things I've observed recently through my sinkhole SMTP server is a small number of phish spams that have been sent to me by what is clearly mailing list software; the latest instance was sent by a Mailman installation, for example. Although I initially thought all three of the emails I've spotted were from one root cause, it turns out that there are several different things apparently going on.
In one case, the phish spammer clearly seems to have compromised a
legitimate machine with mailing list software and then used that
software to make themselves a phish spamming mailing list. It's
easy to see the attraction of this; it makes the phish spammer much
more efficient in that it takes them less time to send stuff to
more people. In an interesting twist, the Received headers of the
email I got say that the spammer initially sent it with the envelope
sender email@example.com (which matched their From: address)
and then the mailing list software rewrote the envelope sender.
In the most clear-cut case, the phish spammer seems to have sent out their spam through a commercial site that advertises itself as (hosted) 'Bulk Email Marketing Software'. This suggests that the phish spammer was willing to spend some money on their spamming, or at least burned a stolen credit card (the website advertises fast signups, which mean that credit cards mean basically nothing). I'm actually surprised that this doesn't happen more often, given that my impression is that the spam world is increasingly commercialized and phish spammers now often buy access to compromised machines instead of compromising the machines themselves. If you're going to spend money one way or another and you can safely just buy use of a commercial spam operation, well, why not?
(I say 'seems to' because the domain I got it from is not quite the same as the commercial site's main domain, although there are various indications tying it to them. If the phish spammer is trying to frame this commercial site, they went to an unusually large amount of work to do so.)
The third case is the most interesting to me. It uses a domain that was registered two days before it sent the phish spam and that domain was registered by an organization called 'InstantBulkSMTP'. The sending IP, 22.214.171.124, was apparently also assigned on the same day. The domain has now disappeared but the sending IP now has DNS that claims it is 'mta1.strakbody.com' and the website for that domain is the control panel for something called 'Interspire Email Marketer'. So my operating theory is that it's somewhat like the second case; a phish spammer found a company that sets up this sort of stuff and paid them some money (or gave them a bad credit card) for a customized service. The domain name they used was probably picked to be useful for the phish spam target.
(The domain was 'titolaricartasi.info' and the phish target was cartasi.it. Google Translate claims that 'titolari' translates to 'holders'.)
PS: All of this shows the hazards of looking closely at spam. Until I started writing this entry, I had thought that all three cases were the same and were like the first one, ie phish spammers exploiting compromised machines with mailing list managers. Then things turned out to be more complicated and my nice simple short blog entry disappeared in a puff of smoke.
Thinking about how to create flexible aggregations from causes
Every so often I have programming puzzles that I find frustrating, not so much because I can't solve them as such but because I feel that there must already be a solution for them if I could both formulate the problem right and then use that formulation to search existing collections of algorithms and such. Today's issue is a concrete problem I am running into with NFS activity monitoring.
Suppose that you have a collection of specific event counters, where you know that userid W on machine X working against filesystem Y (in ZFS pool Z) did N NFS operations per second. My goal is to show aggregate information about the top sources of operations on the server, where a source might be one machine, one user, one filesystem, one pool, or some combination of these. This gives me two problems.
The first problem is efficiently going 'upwards' to sum together various specific event counters into more general categories (with the most general one being 'all NFS operations'). This feels like I want some sort of clever tree or inverted tree data structure, but I could just do it by brute force since I will probably not be dealing with too many specific event counters at any one time (from combinations we can see that each 4-element specific initial event maps to 16 categories; this is amenable to brute force on modern machines).
The second problem is going back 'down' from a category sum to the most specific cause possible for it so that we can report only that. The easiest way to explain this is with an example; if we have (user W, machine X, fs Y, pool Z) with 1000 operations and W was the only user to do things from that machine or on that filesystem, we don't want a report that lists every permutation of the machine and filesystem (eg '1000 from X', '1000 against Y', '1000 from X against Y', etc). Instead we want to report only that 1000 events came from user W on machine X doing things to filesystem Y.
If I wind up with a real tree, this smells like a case of replacing nodes that have only one child with their child (with some special cases around the edges). If I wind up with some other data structure, well, I'll have to figure it out then. And a good approach for this might well influence what data structure I want to use for the first problem.
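Since the data structure question is open, here is a brute-force Python sketch of both directions; the function names and the dict-of-tuples representation are my own choices, not a settled design. Each specific (user, machine, filesystem, pool) counter is summed into all 16 categories, and then each category sum is collapsed down to the most specific cause that fully accounts for it.

```python
from collections import defaultdict
from itertools import product

# Each specific event is keyed by a (user, machine, filesystem, pool)
# tuple; None in a position means 'any'.

def aggregate(events):
    # 'Upwards': every specific counter contributes to all 16
    # categories made by keeping or wildcarding each element.
    sums = defaultdict(int)
    for key, ops in events.items():
        for mask in product((True, False), repeat=4):
            cat = tuple(k if keep else None for k, keep in zip(key, mask))
            sums[cat] += ops
    return dict(sums)

def most_specific(sums):
    # 'Downwards': drop any category that a strictly more specific
    # category fully accounts for (same total), so that e.g. 1000 ops
    # that all came from (W, X, Y, Z) are reported only once.
    def refines(a, b):
        return a != b and all(y is None or x == y for x, y in zip(a, b))
    return {cat: n for cat, n in sums.items()
            if not any(refines(other, cat) and m == n
                       for other, m in sums.items())}

events = {("W", "X", "Y", "Z"): 1000}
print(most_specific(aggregate(events)))
```

With only user W active, this reports just the fully specific (W, X, Y, Z) source instead of all 16 permutations; the quadratic scan in most_specific is exactly the kind of brute force that should be fine at realistic counter counts.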
If all of this sounds like I haven't even started trying to write some code to explore this problem, that would be the correct impression. One of my coding hangups is that I like to have at least some idea of how to solve a problem before I start trying to tackle it; this is especially the case if my choice of language isn't settled and I might want to use a different solution depending on the language I wind up in.
(There are at least three candidate languages for what I want to do here, including Go if I need raw speed to make a brute force approach feasible.)
Where your memory can be going with ZFS on Linux
If you're running ZFS on Linux, its memory use is probably at least a concern. At a high level, there are at least three different places that your RAM may be being used or held down with ZoL.
First, it may be in ZFS's ARC, which is the ZFS equivalent of the buffer
cache. A full discussion of what is included in the ARC and how you
measure it and so on is well beyond the scope of this entry, but the
short summary is that the ARC includes data from disk, metadata from
disk, and several sorts of bookkeeping data. ZoL reports information
about it in
/proc/spl/kstat/zfs/arcstats, which is exactly the
standard ZFS ARC kstats. What ZFS considers to be the total current
(RAM) size of the ARC is
size. ZFS on Linux normally limits the
maximum ARC size to roughly half of memory (this is tunable via the
zfs_arc_max module parameter).
(Some sources will tell you that the ARC size in kstats is c.
This is wrong; c is the target size, which is often but not always
the same as the actual size.)
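To make this concrete, here is a small Python sketch that parses kstat-style name/type/data text of the sort arcstats exposes and pulls out both numbers. The sample text and its values are invented; on a real system you would read /proc/spl/kstat/zfs/arcstats itself, whose leading header lines can differ from this sample.

```python
# Parse kstat-style 'name type data' text like arcstats. The sample
# below is made up for illustration.
SAMPLE = """\
name                            type data
size                            4    5497558138
c                               4    8589934592
c_max                           4    8589934592
"""

def parse_arcstats(text):
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        name, _ktype, data = fields
        try:
            stats[name] = int(data)
        except ValueError:
            continue          # skip the header and non-numeric lines
    return stats

stats = parse_arcstats(SAMPLE)
# 'size' is the actual ARC size in bytes; 'c' is only the target size.
print("size:", stats["size"], "c:", stats["c"])
```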
Next, RAM can be in slab allocated ZFS objects and data
structures that are not counted as part of the ARC for one reason
or another. It used to be that ZoL handled all slab allocation
itself and so all ZFS slab things were listed in /proc/spl/kmem/slab,
but the current ZoL development version now lets the native kernel slab
allocator handle most slabs for objects that aren't bigger than
spl_kmem_cache_slab_limit bytes, which is normally 16K by
default. Such native kernel slabs are theoretically listed in
/proc/slabinfo but are unfortunately normally subject to SLUB
slab merging, which often means that they get
merged with other slabs and you can't actually see how much memory
they're using.
As far as slab objects that aren't in the ARC, I believe that
zfs_znode_cache slab objects (which are
znode_ts) are not
reflected in the ARC size. On some machines active znode_t
objects may be a not insignificant amount of memory. I don't know
this for sure, though, and I'm somewhat reasoning from behavior
we saw on Solaris.
Third, RAM can be trapped in unused objects and space in slabs. One way that unused objects use up space (sometimes a lot of it) is that slabs are allocated and freed in relatively large chunks (at least one 4KB page of memory and often bigger in ZoL), so if only a few objects in a chunk are in use the entire chunk stays alive and can't be freed. We've seen serious issues with slab fragmentation on Solaris and I'm sure ZoL can have this too. It's possible to see the level of wastage and fragmentation for any slab that you can get accurate numbers for (ie, not any that have vanished into SLUB slab merging).
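As a rough illustration of that kind of wastage, the dead space in a slab can be estimated from the allocated-versus-active object counts that slab reporting exposes. This is a hedged sketch with made-up numbers; on a real system the counts would come from slab reporting such as /proc/slabinfo or /proc/spl/kmem/slab.

```python
# Dead space in a slab: object slots that are allocated to the slab's
# pages but not currently in use. Since slabs are allocated and freed
# in whole chunks, these slots hold memory without doing anything.
def slab_waste_bytes(active_objs, total_objs, objsize):
    return (total_objs - active_objs) * objsize

# A badly fragmented slab: 1000 one-KB slots allocated, only 50 in use.
print(slab_waste_bytes(50, 1000, 1024), "bytes wasted")
```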
(ZFS on Linux may also allocate some memory outside of its slab allocations, although I can't spot anything large and obvious in the kernel code.)
All of this sounds really abstract, so let me give you an example. On one of my machines with 16 GB and actively used ZFS pools, things are currently reporting the following numbers:
- the ARC is 5.1 GB, which is decent. Most of that is not actual file
data, though; file data is reported as 0.27 GB, then there's 1.87
GB of ZFS metadata from disk and a bunch of other stuff.
- 7.55 GB of RAM is used in active slab objects. 2.37 GB of that is
in SPL slabs listed in /proc/spl/kmem/slab; the remainder is in
native Linux slabs in /proc/slabinfo. The znode_t slab is most of
the SPL slab report, at 2 GB used.
(This machine is using a hack to avoid the SLUB slab merging for native kernel ZoL slabs, because I wanted to look at memory usage in detail.)
- 7.81 GB of RAM has been allocated to ZoL slabs in total. This means that there are a few hundred MB of space wasted at the moment.
- Since znode_t objects are not in the ARC, the ARC and active
znode_t objects account for almost all of the slab space
between the two of them; 7.1 GB out of 7.55 GB.
I have seen total ZoL slab allocated space be as high as 10 GB (on this 16 GB machine) despite the ARC only reporting a 5 GB size. As you can see, this stuff can fluctuate back and forth during normal usage.
Sidebar: Accurately tracking ZoL slab memory usage
To accurately track ZoL memory usage you must defeat SLUB slab
merging somehow. You can turn it off entirely with the slub_nomerge
kernel parameter or hack the
spl ZoL kernel module to defeat it
(see the sidebar here).
Because you can set
spl_kmem_cache_slab_limit as a module
parameter for the
spl ZoL kernel module, I believe that you can
set it to zero to avoid having any ZoL slabs be native kernel slabs.
This avoids SLUB slab merging entirely and also makes it so that
all ZoL slabs appear in
/proc/spl/kmem/slab. It may be somewhat less efficient, though.
/proc/slabinfo is not quite telling you what it looks like
The Linux kernel does a lot (although not all) of its interesting
internal memory allocations through a slab allocator. For quite a while
it's exposed per-type details of this process in /proc/slabinfo;
this is very handy to get an idea of just what in your kernel is
using up a bunch of memory. Today I was exploring this because I
wanted to look into ZFS on Linux's memory usage
and wound up finding out that on modern Linuxes it's a little bit misleading.
(By 'found out' I mean that DeHackEd on the #zfsonlinux IRC channel explained it to me.)
Specifically, on modern Linux the names shown in /proc/slabinfo are
basically a hint, because the current slab allocator in the kernel merges multiple slab types together
if they are sufficiently similar. If five different subsystems all
want to allocate (different) 128-byte objects with no special
properties, they don't each get separate slab types with separate
slabinfo entries; instead they are all merged into one slab type
and thus one
slabinfo entry. That
slabinfo entry normally shows
the name of one of them, probably the first to be set up, with no
direct hint that it also includes the usage of all the others.
(The others don't appear in
slabinfo at all.)
Most of the time this is a perfectly good optimization that cuts
down on the number of slab types and enables better memory sharing
and reduced fragmentation. But it does mean that you can't tell the
memory used by, say,
btree_node apart from the other slab types it has been merged with
(on my machine, it is one of a lot of slab types that are actually
all mapped to the generic 128-byte object). It can also leave you
wondering where your slab types actually went, if you're inspecting
code that creates a certain slab type but you can't find it in
slabinfo (which is what happened to me).
The easiest way to see this mapping is to look at /sys/kernel/slab;
all those symlinks are slab types that may be the same thing. You
can decode what is what by hand, but if you're going to do this
regularly you should get a copy of
tools/vm/slabinfo.c from the
kernel source and compile it; see the kernel SLUB documentation for details.
You want 'slabinfo -a' to report the mappings.
(slabinfo is underdocumented. I wish it had a manpage or
at least a README.)
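Here's a hedged Python sketch of doing the decoding by hand, by grouping slab names that resolve to the same merge target. The fake directory and the names in it (including the ':t-0000128' style target, which mimics SLUB's internal naming) are invented for illustration; on a real system you would point root at /sys/kernel/slab.

```python
import os
import tempfile
from collections import defaultdict

# Group slab names by the merge target they resolve to, the way
# /sys/kernel/slab represents merged slabs as symlinks to one real
# slab.
def merged_groups(root):
    groups = defaultdict(list)
    for name in os.listdir(root):
        target = os.path.realpath(os.path.join(root, name))
        groups[target].append(name)
    # Only groups with several names are interesting: those are merges.
    return {t: sorted(ns) for t, ns in groups.items() if len(ns) > 1}

# Build a tiny fake /sys/kernel/slab for demonstration.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, ":t-0000128"))                # the real slab
os.symlink(":t-0000128", os.path.join(root, "btree_node"))
os.symlink(":t-0000128", os.path.join(root, "uid_cache"))  # invented name

for target, names in merged_groups(root).items():
    print(os.path.basename(target), "<-", names)
```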
If you need to track the memory usage of specific slab types, perhaps
because you really want to know the memory usage of one subsystem,
the easiest way is apparently to boot with the slub_nomerge
kernel command line argument. Per
the kernel parameter documentation,
this turns off all slab merging, which may result in you having a
lot more slabs than usual.
(On my workstation, slab merging condenses 110 different slabs into 14 actual slabs. On a random server, 170 slabs turn into 35 and a bunch of the pre-merger slabs are probably completely unused.)
Sidebar: disabling this merging in kernel code
The SLUB allocator does not directly expose a way of disabling this
merging when you call
kmem_cache_create() in that there's no
'do not merge, really' flag to the call. However, it turns out that
supplying at least one of a number of SLUB debugging flags will
disable this merging and, on a kernel built without SLUB debugging
support, appears to have absolutely no other effects from what I can tell.
Both Fedora 20 and Ubuntu 14.04 build their kernels without this support.
(I believe that most Linux distributions put a copy of the kernel
build config in
/boot when they install kernels.)
This may be handy if you have some additional kernel modules that you want to be able to track memory use for specifically even though a number of their slabs would normally get merged away, and you're compiling from source and willing to make some little modifications to it.
You can see the full set of flags that force never merging in the
kernel source file mm/slub.c. On a quick look,
none of the others are either harmless or always defined as a
non-zero value. It's possible that
SLAB_DEBUG_FREE also does
nothing these days; if used it will make your slabs only mergeable
with other slabs that also specify it (which no slabs in the main
kernel source do). That would cause slabs from your code to potentially
be merged together but they wouldn't merge with anyone else's slabs,
so at least you could track your subsystem's memory usage.
Disclaimer: these ideas have been at most compile-tested, not run live.
Simple web application environments and per-request state
One of the big divides in web programming environments (which are somewhat broader than web frameworks) is between environments that only really have per-request state and every new request starts over with a blank slate and environments with state that persists from request to request. CGI is the archetype of per-request state, but PHP is also famous for it. Many more advanced web environments have potential or actual shared state; sometimes this is an explicit feature of the environment over simpler ones.
(One example of a persistent state environment is Node and I'd expect the JVM to generally be another one.)
I have nothing in particular against environments with persistent state and sometimes they're clearly needed (or at least very useful) for doing powerful web applications. But I think it's clear that web environments without it are simpler to program and thus are easier to write simple web things in.
Put simply, in an environment with non-persistent state you can be sloppy. You can change things. You can leave things sitting around the global environment. You can be casual about cleaning up bits and pieces. And you know that anything you do will be wiped away at the end of the request and the next one will start from scratch. An environment with persistent state allows you to do some powerful things but you have to be more careful. It's very easy to 'leak' things into the persistent environment and to modify things in a way that unexpectedly changes later requests, and it can also be easy to literally leak memory or other resources that would have been automatically cleaned up in a per-request environment.
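A minimal Python sketch of the sort of sloppiness involved; the handler and its global are invented for illustration, not drawn from any real framework.

```python
# A module-level global: recreated every request in a CGI-style
# environment, but carried from request to request in a persistent one.
_seen = {}

def handle_request(user, page):
    # Sloppy: stash per-request data in the global environment.
    _seen.setdefault(user, []).append(page)
    return len(_seen[user])

# In a per-request world each call starts from a blank slate and would
# always return 1. In a persistent process, state quietly accumulates:
print(handle_request("alice", "/a"))
print(handle_request("alice", "/b"))
```

In the non-persistent world this code is perfectly fine, which is exactly why it's so easy to write; in a persistent process it both changes behavior across requests and leaks memory.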
(At this point the pure functional programmers are smugly mentioning the evils of mutable state.)
Speaking from personal experience, keeping track of the state you're changing is hard and it's easy to do something you don't realize. DWiki started out running in a purely non-persistent environment; when I also started running it in a semi-persistent one I found any number of little surprises and things I was doing to myself. I suspect I'd find more if I ran it for a long time in a fully persistent environment.
As a side note, there are some relatively obvious overall advantages to building a web application that doesn't require persistent state even if the underlying web environment you're running in supports it. This may make it useful to at least test your application in an environment that explicitly lacks it, just to make sure that everything still works right.
Why blocking writes are a good Unix API (on pipes and elsewhere)
One of the principles of good practical programming is that when your program can't make forward progress, it should do nothing rather than, say, continue to burn CPU while it waits for something to do. You want your program to do what work it can and then generally go to sleep, and thus you want APIs that encourage this to happen by default.
Now consider a chain of programs (or processes or services), each one feeding the next. In a multi-process environment like this you usually want something that gets called 'backpressure', where if any one component gets overloaded or can't make further progress it pushes back on the things feeding it so that they stop in turn (and so on back up the chain until everything quietly comes to a stop, not burning CPU and so on).
(You also want an equivalent for downstream services, where they process any input they get (if they can) but then stop doing anything if they stop getting any input at all.)
I don't think it's a coincidence that this describes classic Unix
blocking IO to both pipes and files. Unix's blocking writes do
backpressure pretty much exactly the way you want to happen; if any
stage in a pipeline stalls for some reason, pretty soon all processes
involved in it will block and sleep in
write()s to their output
pipe. Things like disk IO speed limits or slow processing or whatever
will naturally do just what you want. And the Unix 'return what's
available' behavior on reads does the same thing for the downstream
of a stalled process; if the process wrote some output you can
process it, but then you'll quietly go to sleep as you block for
more input.
And this is why I think that Unix having blocking pipe writes by default is not just a sensible API decision but a good one. This decision makes pipes just work right.
(Having short reads also makes the implementation of pipes simpler,
because you don't have complex handling in the situation where eg
process B is doing a
read() of 128 megabytes while process A is doing a
write() of 64 megabytes to it. The kernel can make this
work right, but it needs to go out of its way to do so.)
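One way to see the finite pipe buffer that makes this backpressure work is to switch a pipe's write end to non-blocking and count how much fits before the kernel would have put a blocking writer to sleep. A small Python sketch (the 4096-byte chunk size is arbitrary, and the exact capacity varies by system):

```python
import os

# The kernel buffers a finite amount of data in a pipe; once it's
# full, a blocking write() sleeps, which is the backpressure at work.
r, w = os.pipe()
os.set_blocking(w, False)

total = 0
try:
    while True:
        total += os.write(w, b"x" * 4096)   # arbitrary chunk size
except BlockingIOError:
    pass                                    # the buffer is now full

print("pipe filled after", total, "bytes")
os.close(r)
os.close(w)
```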
Why it's sensible for large writes to pipes to block
Back in this entry I said that large writes to pipes blocking instead of immediately returning with a short write was a sensible API decision. Today let's talk about that, by way of talking about how deciding the other way would be a bad API.
Let's start with a question: in a typical Unix pipeline program like
grep, what would be the sensible reactions to trying to write a large
amount of data returning a short write indicator? This is clearly not
an error that should cause the program to abort (or even to print a
warning); instead it's a perfectly normal thing if you're producing
output faster than the other side of the pipe can consume it. For most
programs, that means the only thing you can really do is pause until you
can write more to the pipe. The conclusion is pretty straightforward;
in a hypothetical world where such too-large pipe writes returned short
write indicators instead of blocking, almost all programs would either
wrap their writes in code that paused and retried them or arrange to set
a special flag on the file descriptor to say 'block me until everything
is written'. Either or both would probably wind up being part of stdio.
If everything is going to have code to work around or deal with something, this suggests that you are picking the wrong default. Thus large writes to pipes blocking by default is the right API decision because it means everyone can write simpler and less error-prone code at the user level.
(There are a number of reasons this is less error-prone, including both
programs that don't usually expect to write to pipes (but you tell them
to write to
/dev/stdout) and programs that usually do short writes
that don't block and so don't handle short writes, resulting in silently
not writing some amount of their output some of the time.)
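To make the hypothetical concrete, here is a Python sketch of the retry wrapper that almost every program would need in that world; the write_all name is my own invention, and it uses select() to sleep until the pipe is writable rather than spinning.

```python
import os
import select

# The wrapper everyone would write if large pipe writes returned
# short counts instead of blocking: retry, sleeping in select()
# until the pipe can accept more, rather than burning CPU.
def write_all(fd, data):
    view = memoryview(data)
    while len(view) > 0:
        select.select([], [fd], [])   # sleep until fd is writable
        n = os.write(fd, view)
        view = view[n:]               # drop what was written, retry rest

r, w = os.pipe()
write_all(w, b"hello")
got = os.read(r, 16)
print(got)
```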
There's actually a reason why this is not merely a sensible API but a good one, but that's going to require an additional entry rather than wedging it in here.
Sidebar: This story does not represent actual history
The description I've written above more or less requires that there is
some way to wait for a file descriptor to become ready for IO, so that
when your write is short you can find out when you can usefully write
more. However there was no such mechanism in early Unixes;
select() only appeared in UCB BSD (and
poll() and friends are even later).
This means that having nonblocking pipe writes in V7 Unix would have
required an entire set of mechanisms that only appeared later, instead
of just a 'little' behavior change.
(However I do suspect that the Bell Labs Unix people actively felt
that pipe writes should block just like file writes blocked until
complete, barring some error. Had they felt otherwise, the Unix API
would likely have been set up somewhat differently and V7 might
have had some equivalent of select().)
If you're wondering how V7 could possibly not have something like
select(), note that V7 didn't have any networking (partly because
networks were extremely new and experimental at the time). Without
networking and the problems it brings, there's much less need (or use)
for select().
Making bug reports is exhausting, frustrating, and stressful
I've danced around this subject before when I've written about bug reports (and making bug reports), but I want to come out and say it explicitly: far too often, making bug reports is an exhausting experience that is frequently frustrating and stressful.
This is not because the tools for doing it are terrible, although that doesn't help. It is because the very frequent result of trying to make a bug report is having to deal with people who don't believe you, who don't take you seriously, and who often don't read, consider, and investigate what you wrote. Some of the time it involves arguing with people who disagree with you, people who feel that what you are reporting is in fact not a bug or at best a trivial issue. The crowning frustration on top of all of these experiences is that after all of your effort and the stress of arguing with people, the bug will often not be fixed in any useful fashion. By the way, that 'deal with' is often actually 'argue with' (which is about as much fun as you'd expect).
(A contributing factor to the stress is often that you really need a fix or a workaround for the bug.)
Whether or not they can articulate it, everyone who's made enough bug reports knows this in their gut. In my opinion it's a fairly big reason why a lot of people burn out on making bug reports and stop doing it; it's not that they're making carefully considered cost/benefit calculations (no matter what I've written before about this), it's that they have absolutely no desire to put themselves through the whole exercise again. The frequently low cost/benefit ratio is a post-facto rationalization that people would reach for much less if the whole experience was actually a pleasant one.
There is a really important corollary for this: if you're tempted to urge someone to make a bug report, especially a bug report that you reasonably expect may be rejected, you should understand that you're trying to get them to put themselves through an unpleasant experience.
(I think this is a big part of why I have a very strong urge to bite the heads off of people who respond to me to suggest that I should file bug reports.)