Wandering Thoughts archives


System metrics need to be documented, not just to exist

As a system administrator, I love systems that expose metrics (performance, health, status, whatever they are). But there's a big caveat to that, which is that metrics don't really exist until they're meaningfully documented. Sadly, documenting your metrics is much less common than simply exposing them, perhaps because it takes much more work.

At the best of times this forces system administrators and other bystanders to reverse engineer your metrics from your system's source code or from programs that you or other people write to report on them. At the worst this makes your metrics effectively useless; sysadmins can see the numbers and see them change, but they have very little idea of what they mean.

(Maybe sysadmins can dump them into a stats tracking system and look for correlations.)

Forcing people to reverse engineer the meaning of your stats has two bad effects. The obvious one is that people almost always wind up duplicating this work, which is just wasted effort. The subtle one is that it is terribly easy for a mistake about what a metric means to become, essentially, superstition that everyone knows and spreads. Because people are reverse engineering things in the first place, it's very easy for mistakes and misunderstandings to happen; then people write the mistake down or embody it in a useful program and pretty soon it is being passed around the Internet, since it's one of the few resources on the stats that exist. One mistake will be propagated into dozens of useful programs, various blog posts, and so on, and through the magic of the Internet many of these secondary sources will come off as unhesitatingly authoritative. At that point, good luck getting any sort of correction out into the Internet (if you even notice that people are misinterpreting your stats).

At this point some people will suggest that sysadmins should avoid doing anything with stats that they reverse engineer unless they are absolutely, utterly sure that they're correct. I'm sorry, life doesn't work this way. Very few sysadmins reverse engineer stats for fun; instead, we're doing it to solve problems. If our reverse engineering solves our problems and appears sane, many sysadmins are going to share their tools and what they've learned. It's what people do these days; we write blog posts, we answer questions on Stackoverflow, we put up Github repos with 'here, these are the tools that worked for me'. And all of those things flow around the Internet.

(Also, the suggestion that people should not write tools or write up documentation unless they are absolutely sure that they are correct is essentially equivalent to asking people not to do this at all. To be absolutely sure that you're right about a statistic, you generally need to fully understand the code. That's what they call rather uncommon.)

sysadmin/StatsNeedDocumentation written at 01:24:54; Add Comment


Phish spammers are apparently exploiting mailing list software

One of the interesting things I've observed recently through my sinkhole SMTP server is a small number of phish spams that have been sent to me by what is clearly mailing list software; the latest instance was sent by a Mailman installation, for example. Although I initially thought that all three of the emails I've spotted stemmed from one root cause, it turns out that there are several different things apparently going on.

In one case, the phish spammer clearly seems to have compromised a legitimate machine with mailing list software and then used that software to make themselves a phish spamming mailing list. It's easy to see the attraction of this; it makes the phish spammer much more efficient in that it takes them less time to send stuff to more people. In an interesting twist, the Received headers of the email I got say that the spammer initially sent it with the envelope address of service@paypal.com.au (which matched their From:) and then the mailing list software rewrote the envelope sender.

In the most clear-cut case, the phish spammer seems to have sent out their spam through a commercial site that advertises itself as (hosted) 'Bulk Email Marketing Software'. This suggests that the phish spammer was willing to spend some money on their spamming, or at least burned a stolen credit card (the website advertises fast signups, which means that credit cards mean basically nothing). I'm actually surprised that this doesn't happen more often, given that my impression is that the spam world is increasingly commercialized and phish spammers now often buy access to compromised machines instead of compromising the machines themselves. If you're going to spend money one way or another and you can safely just buy use of a commercial spam operation, well, why not?

(I say 'seems to' because the domain I got it from is not quite the same as the commercial site's main domain, although there are various indications tying it to them. If the phish spammer is trying to frame this commercial site, they went to an unusually large amount of work to do so.)

The third case is the most interesting to me. It uses a domain that was registered two days before it sent the phish spam, and that domain was registered by an organization called 'InstantBulkSMTP'. The sending IP was also apparently assigned on the same day. The domain has now disappeared, but the sending IP now has DNS that claims it is 'mta1.strakbody.com' and the website for that domain is the control panel for something called 'Interspire Email Marketer'. So my operating theory is that it's somewhat like the second case: a phish spammer found a company that sets up this sort of stuff and paid them some money (or gave them a bad credit card) for a customized service. The domain name they used was probably picked to be useful for the phish spam target.

(The domain was 'titolaricartasi.info' and the phish target was cartasi.it. Google Translate claims that 'titolari' translates to 'holders'.)

PS: All of this shows the hazards of looking closely at spam. Until I started writing this entry, I had thought that all three cases were the same and were like the first one, ie phish spammers exploiting compromised machines with mailing list managers. Then things turned out to be more complicated and my nice simple short blog entry disappeared in a puff of smoke.

spam/PhishViaMailingLists written at 01:36:45; Add Comment


Thinking about how to create flexible aggregations from causes

Every so often I have programming puzzles that I find frustrating, not so much because I can't solve them as such but because I feel that there must already be a solution for them if I could both formulate the problem right and then use that formulation to search existing collections of algorithms and such. Today's issue is a concrete problem I am running into with NFS activity monitoring.

Suppose that you have a collection of specific event counters, where you know that userid W on machine X working against filesystem Y (in ZFS pool Z) did N NFS operations per second. My goal is to show aggregate information about the top sources of operations on the server, where a source might be one machine, one user, one filesystem, one pool, or some combination of these. This gives me two problems.

The first problem is efficiently going 'upwards' to sum together various specific event counters into more general categories (with the most general one being 'all NFS operations'). This feels like I want some sort of clever tree or inverted tree data structure, but I could just do it by brute force since I will probably not be dealing with too many specific event counters at any one time (from combinations we can see that each specific 4-element event maps to 16 categories; this is amenable to brute force on modern machines).
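To make the brute force version concrete, here's a sketch in Python (one of the candidate languages; the field names and dict-of-tuples representation are just my illustration, not a settled design). A category is a 4-tuple where None is a wildcard, so each specific event feeds 2**4 = 16 categories:

```python
from collections import Counter

def aggregate(events):
    """Sum specific (user, machine, fs, pool) counters into every
    category. A category replaces some subset of the four fields
    with None as a wildcard; the all-None category is 'all NFS
    operations'.

    events: dict mapping (user, machine, fs, pool) -> operation count.
    """
    totals = Counter()
    for key, ops in events.items():
        # Each bit of the mask decides whether to wildcard that field.
        for mask in range(2 ** len(key)):
            category = tuple(
                None if (mask >> i) & 1 else field
                for i, field in enumerate(key)
            )
            totals[category] += ops
    return totals
```

With, say, two events that share everything but the user, the per-machine and per-filesystem categories come out as the sum of both, which is what you'd want from the 'upwards' direction.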

The second problem is going back 'down' from a category sum to the most specific cause possible for it so that we can report only that. The easiest way to explain this is with an example; if we have (user W, machine X, fs Y, pool Z) with 1000 operations and W was the only user to do things from that machine or on that filesystem, we don't want a report that lists every permutation of the machine and filesystem (eg '1000 from X', '1000 against Y', '1000 from X against Y', etc). Instead we want to report only that 1000 events came from user W on machine X doing things to filesystem Y.

If I wind up with a real tree, this smells like a case of replacing nodes that have only one child with their child (with some special cases around the edges). If I wind up with some other data structure, well, I'll have to figure it out then. And a good approach for this might well influence what data structure I want to use for the first problem.
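One way to sketch the 'downwards' collapse without committing to a tree yet: call a category redundant whenever filling in one of its wildcard fields with a concrete value reproduces the same total, because then the more specific category tells the whole story. This is exploratory code under that assumption, not something I've settled on:

```python
def most_specific(totals):
    """Drop every category whose whole count is explained by a more
    specific one.

    A category is redundant when some refinement of it (a category
    that matches it everywhere it is concrete and is concrete in at
    least one field where it has a None wildcard) carries the same
    total. totals: dict mapping category tuples -> operation count.
    """
    keep = {}
    for cat, ops in totals.items():
        redundant = False
        for other, other_ops in totals.items():
            if other == cat or other_ops != ops:
                continue
            agrees = all(c is None or c == o for c, o in zip(cat, other))
            narrower = any(c is None and o is not None
                           for c, o in zip(cat, other))
            if agrees and narrower:
                redundant = True
                break
        if not redundant:
            keep[cat] = ops
    return keep
```

In the example from above, this keeps only '(user W, machine X, fs Y, pool Z): 1000' and suppresses all of the equal-valued generalizations. It's quadratic in the number of categories, which is another place brute force may or may not hold up.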

If all of this sounds like I haven't even started trying to write some code to explore this problem, that would be the correct impression. One of my coding hangups is that I like to have at least some idea of how to solve a problem before I start trying to tackle it; this is especially the case if my choice of language isn't settled and I might want to use a different solution depending on the language I wind up in.

(There are at least three candidate languages for what I want to do here, including Go if I need raw speed to make a brute force approach feasible.)

programming/MostSpecificCauseProblem written at 02:01:18; Add Comment


Where your memory can be going with ZFS on Linux

If you're running ZFS on Linux, its memory use is probably at least a concern. At a high level, there are at least three different places where your RAM may be used or held down by ZoL.

First, it may be in ZFS's ARC, which is the ZFS equivalent of the buffer cache. A full discussion of what is included in the ARC and how you measure it and so on is well beyond the scope of this entry, but the short summary is that the ARC includes data from disk, metadata from disk, and several sorts of bookkeeping data. ZoL reports information about it in /proc/spl/kstat/zfs/arcstats, which is exactly the standard ZFS ARC kstats. What ZFS considers to be the total current (RAM) size of the ARC is size. ZFS on Linux normally limits the maximum ARC size to roughly half of memory (this is c_max).

(Some sources will tell you that the ARC size in kstats is c. This is wrong. c is the target size; it's often but not always the same as the actual size.)
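As an illustration of pulling these numbers out, here's a small parser for the kstat text format that arcstats uses. The two header lines and the 'name type data' column layout match what I see in ZoL, but treat the exact format as an assumption:

```python
def parse_arcstats(text):
    """Parse kstat output like /proc/spl/kstat/zfs/arcstats into a
    dict of name -> integer value.

    The first two lines are the kstat header (which we skip); each
    remaining line is three whitespace-separated fields:
    'name  type  data'.
    """
    stats = {}
    for line in text.splitlines()[2:]:
        fields = line.split()
        if len(fields) == 3:
            name, _ktype, data = fields
            stats[name] = int(data)
    return stats
```

Then 'parse_arcstats(open("/proc/spl/kstat/zfs/arcstats").read())' gives you a dict where stats["size"] is the actual ARC size and stats["c_max"] is the limit.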

Next, RAM can be in slab allocated ZFS objects and data structures that are not counted as part of the ARC for one reason or another. It used to be that ZoL handled all slab allocation itself and so all ZFS slab things were listed in /proc/spl/kmem/slab, but the current ZoL development version now lets the native kernel slab allocator handle most slabs for objects that aren't bigger than spl_kmem_cache_slab_limit bytes, which defaults to 16K. Such native kernel slabs are theoretically listed in /proc/slabinfo but are unfortunately normally subject to SLUB slab merging, which often means that they get merged with other slabs and you can't actually see how much memory they're using.

As far as slab objects that aren't in the ARC, I believe that zfs_znode_cache slab objects (which are znode_ts) are not reflected in the ARC size. On some machines active znode_t objects may be a not insignificant amount of memory. I don't know this for sure, though, and I'm somewhat reasoning from behavior we saw on Solaris.

Third, RAM can be trapped in unused objects and space in slabs. One way that unused objects use up space (sometimes a lot of it) is that slabs are allocated and freed in relatively large chunks (at least one 4KB page of memory and often bigger in ZoL), so if only a few objects in a chunk are in use the entire chunk stays alive and can't be freed. We've seen serious issues with slab fragmentation on Solaris and I'm sure ZoL can have this too. It's possible to see the level of wastage and fragmentation for any slab that you can get accurate numbers for (ie, not any that have vanished into SLUB slab merging).

(ZFS on Linux may also allocate some memory outside of its slab allocations, although I can't spot anything large and obvious in the kernel code.)

All of this sounds really abstract, so let me give you an example. On one of my machines with 16 GB and actively used ZFS pools, things are currently reporting the following numbers:

  • the ARC is 5.1 GB, which is decent. Most of that is not actual file data, though; file data is reported as 0.27 GB, then there's 1.87 GB of ZFS metadata from disk and a bunch of other stuff.

  • 7.55 GB of RAM is used in active slab objects. 2.37 GB of that is reported in /proc/spl/kmem/slab; the remainder is in native Linux slabs in /proc/slabinfo. The znode_t slab is most of the SPL slab report, at 2 GB used.

    (This machine is using a hack to avoid the SLUB slab merging for native kernel ZoL slabs, because I wanted to look at memory usage in detail.)

  • 7.81 GB of RAM has been allocated to ZoL slabs in total. This means that a few hundred MB of space is wasted at the moment.

If znode_t objects are not in the ARC, the ARC and active znode_t objects account for almost all of the slab space between the two of them; 7.1 GB out of 7.55 GB.

I have seen total ZoL slab allocated space be as high as 10 GB (on this 16 GB machine) despite the ARC only reporting a 5 GB size. As you can see, this stuff can fluctuate back and forth during normal usage.

Sidebar: Accurately tracking ZoL slab memory usage

To accurately track ZoL memory usage you must defeat SLUB slab merging somehow. You can turn it off entirely with the slub_nomerge kernel parameter or hack the spl ZoL kernel module to defeat it (see the sidebar here).

Because you can set spl_kmem_cache_slab_limit as a module parameter for the spl ZoL kernel module, I believe that you can set it to zero to avoid having any ZoL slabs be native kernel slabs. This avoids SLUB slab merging entirely and also makes it so that all ZoL slabs appear in /proc/spl/kmem/slab. It may be somewhat less efficient.

linux/ZFSonLinuxMemoryWhere written at 01:24:46; Add Comment


How /proc/slabinfo is not quite telling you what it looks like

The Linux kernel does a lot (although not all) of its interesting internal memory allocations through a slab allocator. For quite a while it's exposed per-type details of this process in /proc/slabinfo; this is very handy to get an idea of just what in your kernel is using up a bunch of memory. Today I was exploring this because I wanted to look into ZFS on Linux's memory usage and wound up finding out that on modern Linuxes it's a little bit misleading.

(By 'found out' I mean that DeHackEd on the #zfsonlinux IRC channel explained it to me.)

Specifically, on modern Linux the names shown in slabinfo are basically a hint because the current slab allocator in the kernel merges multiple slab types together if they are sufficiently similar. If five different subsystems all want to allocate (different) 128-byte objects with no special properties, they don't each get separate slab types with separate slabinfo entries; instead they are all merged into one slab type and thus one slabinfo entry. That slabinfo entry normally shows the name of one of them, probably the first to be set up, with no direct hint that it also includes the usage of all the others.

(The others don't appear in slabinfo at all.)

Most of the time this is a perfectly good optimization that cuts down on the number of slab types and enables better memory sharing and reduced fragmentation. But it does mean that you can't tell the memory used by, say, btree_node apart from ip_mrt_cache (on my machine, both are one of a lot of slab types that are actually all mapped to the generic 128-byte object). It can also leave you wondering where your slab types actually went, if you're inspecting code that creates a certain slab type but you can't find it in slabinfo (which is what happened to me).

The easiest way to see this mapping is to look at /sys/kernel/slab; all those symlinks are slab types that may be the same thing. You can decode what is what by hand, but if you're going to do this regularly you should get a copy of tools/vm/slabinfo.c from the kernel source and compile it; see the kernel SLUB documentation for details. You want 'slabinfo -a' to report the mappings.
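If you don't want to compile slabinfo.c, a quick way to get the same mapping is to note that merged slab names all resolve (through those symlinks) to one shared cache directory, so grouping by resolved path shows you the aliases. This is a sketch based on my understanding of the /sys/kernel/slab layout:

```python
import os
from collections import defaultdict

def slab_aliases(root="/sys/kernel/slab"):
    """Group slab names by the real cache directory they resolve to.

    Names that resolve to the same path are merged aliases of one
    underlying SLUB cache; a group of one is an unmerged slab.
    """
    groups = defaultdict(list)
    for name in os.listdir(root):
        real = os.path.realpath(os.path.join(root, name))
        groups[real].append(name)
    return {real: sorted(names) for real, names in groups.items()}
```

Any group with more than one name is a set of slab types that slabinfo is silently reporting as a single entry.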

(Sadly slabinfo is underdocumented. I wish it had a manpage or at least a README.)

If you need to track the memory usage of specific slab types, perhaps because you really want to know the memory usage of one subsystem, the easiest way is apparently to boot with the slub_nomerge kernel command line argument. Per the kernel parameter documentation this turns off all slab merging, which may result in you having a lot more slabs than usual.

(On my workstation, slab merging condenses 110 different slabs into 14 actual slabs. On a random server, 170 slabs turn into 35 and a bunch of the pre-merger slabs are probably completely unused.)

Sidebar: disabling this merging in kernel code

The SLUB allocator does not directly expose a way of disabling this merging when you call kmem_cache_create() in that there's no 'do not merge, really' flag to the call. However, it turns out that supplying at least one of a number of SLUB debugging flags will disable this merging and on a kernel built without CONFIG_DEBUG_KMEMLEAK using SLAB_NOLEAKTRACE appears to have absolutely no other effects from what I can tell. Both Fedora 20 and Ubuntu 14.04 build their kernels without this option.

(I believe that most Linux distributions put a copy of the kernel build config in /boot when they install kernels.)

This may be handy if you have some additional kernel modules that you want to be able to track memory use for specifically even though a number of their slabs would normally get merged away, and you're compiling from source and willing to make some little modifications to it.

You can see the full set of flags that force never merging in the #define for SLUB_NEVER_MERGE in mm/slub.c. On a quick look, none of the others are either harmless or always defined as a non-zero value. It's possible that SLAB_DEBUG_FREE also does nothing these days; if used it will make your slabs only mergeable with other slabs that also specify it (which no slabs in the main kernel source do). That would cause slabs from your code to potentially be merged together but they wouldn't merge with anyone else's slabs, so at least you could track your subsystem's memory usage.

Disclaimer: these ideas have been at most compile-tested, not run live.

linux/SlabinfoSlabMerging written at 00:19:11; Add Comment


Simple web application environments and per-request state

One of the big divides in web programming environments (which are somewhat broader than web frameworks) is between environments that only really have per-request state, where every new request starts over with a blank slate, and environments with state that persists from request to request. CGI is the archetype of per-request state, but PHP is also famous for it. Many more advanced web environments have potential or actual shared state; sometimes this is an explicit feature of the environment over simpler ones.

(One example of a persistent state environment is Node and I'd expect the JVM to generally be another one.)

I have nothing in particular against environments with persistent state and sometimes they're clearly needed (or at least very useful) for doing powerful web applications. But I think it's clear that web environments without it are simpler to program and thus are easier to write simple web things in.

Put simply, in an environment with non-persistent state you can be sloppy. You can change things. You can leave things sitting around the global environment. You can be casual about cleaning up bits and pieces. And you know that anything you do will be wiped away at the end of the request and the next one will start from scratch. An environment with persistent state allows you to do some powerful things but you have to be more careful. It's very easy to 'leak' things into the persistent environment and to modify things in a way that unexpectedly changes later requests, and it can also be easy to literally leak memory or other resources that would have been automatically cleaned up in a per-request environment.
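A toy Python illustration of the kind of sloppiness involved (the handler and its 'scratch' list are made up for the example):

```python
# A module-level list that a CGI-style program could safely treat as
# per-request scratch space. Under a persistent environment it is
# never reset, so it silently accumulates across requests.
seen_headers = []

def handle_request(headers):
    """Toy request handler that stashes the request's headers
    'temporarily' and never clears them."""
    seen_headers.extend(headers)
    return len(seen_headers)
```

In a fresh-process environment every request sees only its own headers; in a persistent one, the second request also sees the first request's, which is exactly the sort of surprise that only shows up when you change environments.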

(At this point the pure functional programmers are smugly mentioning the evils of mutable state.)

Speaking from personal experience, keeping track of the state you're changing is hard, and it's easy to change something without realizing it. DWiki started out running in a purely non-persistent environment; when I also started running it in a semi-persistent one I found any number of little surprises and things I was doing to myself. I suspect I'd find more if I ran it for a long time in a fully persistent environment.

As a side note, there are some relatively obvious overall advantages to building a web application that doesn't require persistent state even if the underlying web environment you're running in supports it. This may make it useful to at least test your application in an environment that explicitly lacks it, just to make sure that everything still works right.

web/NonpersistentStateSimple written at 00:41:09; Add Comment


Why blocking writes are a good Unix API (on pipes and elsewhere)

One of the principles of good practical programming is that when your program can't make forward progress, it should do nothing rather than, say, continue to burn CPU while it waits for something to do. You want your program to do what work it can and then generally go to sleep, and thus you want APIs that encourage this to happen by default.

Now consider a chain of programs (or processes or services), each one feeding the next. In a multi-process environment like this you usually want something that gets called 'backpressure', where if any one component gets overloaded or can't make further progress it pushes back on the things feeding it so that they stop in turn (and so on back up the chain until everything quietly comes to a stop, not burning CPU and so on).

(You also want an equivalent for downstream services, where they process any input they get (if they can) but then stop doing anything if they stop getting any input at all.)

I don't think it's a coincidence that this describes classic Unix blocking IO to both pipes and files. Unix's blocking writes do backpressure pretty much exactly the way you want it to happen; if any stage in a pipeline stalls for some reason, pretty soon all processes involved in it will block and sleep in write()s to their output pipe. Things like disk IO speed limits or slow processing or whatever will naturally do just what you want. And the Unix 'return what's available' behavior on reads does the same thing for the downstream of a stalled process; if the process wrote some output you can process it, but then you'll quietly go to sleep as you block for input.
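You can see the mechanism directly: a pipe has finite capacity, and once it's full a writer either blocks (the default) or, with the file descriptor set non-blocking, gets told to back off. This little sketch flips a pipe's write end to non-blocking purely so it can observe the 'pipe full' point instead of going to sleep there:

```python
import os

def fill_pipe(wfd):
    """Write into a pipe until it is full, returning the byte count.

    With the fd set non-blocking, a full pipe raises BlockingIOError
    (EAGAIN) from os.write(). With the default blocking behavior we
    would instead simply sleep inside os.write() at that point, which
    is the backpressure described above.
    """
    os.set_blocking(wfd, False)
    total = 0
    while True:
        try:
            total += os.write(wfd, b"x" * 4096)
        except BlockingIOError:
            return total
```

Reading from the other end frees space in the pipe buffer, at which point a sleeping (blocking) writer would be woken up and the pipeline would start moving again.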

And this is why I think that Unix having blocking pipe writes by default is not just a sensible API decision but a good one. This decision makes pipes just work right.

(Having short reads also makes the implementation of pipes simpler, because you don't have complex handling in the situation where eg process B is doing a read() of 128 megabytes while process A is trying to write() 64 megabytes to it. The kernel can make this work right, but it needs to go out of its way to do so.)

unix/BlockingWritesAndBackpressure written at 00:22:09; Add Comment


Why it's sensible for large writes to pipes to block

Back in this entry I said that large writes to pipes blocking instead of immediately returning with a short write was a sensible API decision. Today let's talk about that, by way of talking about how deciding the other way would be a bad API.

Let's start with a question: in a typical Unix pipeline program like grep, what would be the sensible reactions to trying to write a large amount of data returning a short write indicator? This is clearly not an error that should cause the program to abort (or even to print a warning); instead it's a perfectly normal thing if you're producing output faster than the other side of the pipe can consume it. For most programs, that means the only thing you can really do is pause until you can write more to the pipe. The conclusion is pretty straightforward; in a hypothetical world where such too-large pipe writes returned short write indicators instead of blocking, almost all programs would either wrap their writes in code that paused and retried them or arrange to set a special flag on the file descriptor to say 'block me until everything is written'. Either or both would probably wind up being part of stdio.
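The retry wrapper that almost every program would need in that hypothetical world is easy to sketch, which is rather the point; it's pure boilerplate. (On real Unix the os.write() calls here block for you, so this loop mostly matters for very large buffers that get written in pieces.)

```python
import os

def write_all(fd, data):
    """Keep calling write() until all of data has been written.

    This is the loop everyone would reimplement (or stdio would
    grow) if large pipe writes returned short counts by default
    instead of blocking.
    """
    view = memoryview(data)
    while len(view):
        written = os.write(fd, view)
        view = view[written:]
```

Pushing this loop into the kernel's default blocking behavior is exactly what spares user-level code from carrying it around.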

If everything is going to have code to work around or deal with something, this suggests that you are picking the wrong default. Thus large writes to pipes blocking by default is the right API decision because it means everyone can write simpler and less error-prone code at the user level.

(There are a number of reasons this is less error-prone, including both programs that don't usually expect to write to pipes (but that you've told to write to /dev/stdout) and programs whose writes are usually small enough to never come back short, which therefore don't handle short writes and would silently lose some amount of their output some of the time.)

There's actually a reason why this is not merely a sensible API but a good one, but that's going to require an additional entry rather than wedging it in here.

Sidebar: This story does not represent actual history

The description I've written above more or less requires that there is some way to wait for a file descriptor to become ready for IO, so that when your write is short you can find out when you can usefully write more. However there was no such mechanism in early Unixes; select() only appeared in UCB BSD (and poll() and friends are even later). This means that having nonblocking pipe writes in V7 Unix would have required an entire set of mechanisms that only appeared later, instead of just a 'little' behavior change.

(However I do suspect that the Bell Labs Unix people actively felt that pipe writes should block just like file writes blocked until complete, barring some error. Had they felt otherwise, the Unix API would likely have been set up somewhat differently and V7 might have had some equivalent of select().)

If you're wondering how V7 could possibly not have something like select(), note that V7 didn't have any networking (partly because networks were extremely new and experimental at the time). Without networking and the problems it brings, there's much less need (or use) for a select().

unix/BlockingLargePipeWrites written at 01:03:58; Add Comment


Making bug reports is exhausting, frustrating, and stressful

I've danced around this subject before when I've written about bug reports (and making bug reports), but I want to come out and say it explicitly: far too often, making bug reports is an exhausting experience that is frequently frustrating and stressful.

This is not because the tools for doing it are terrible, although that doesn't help. It is because the very frequent result of trying to make a bug report is having to deal with people who don't believe you, who don't take you seriously, and who often don't read, consider, and investigate what you wrote. Some of the time it involves arguing with people who disagree with you, people who feel that what you are reporting is in fact not a bug or at best a trivial issue. The crowning frustration on top of all of these experiences is that after all of your effort and the stress of arguing with people, the bug will often not be fixed in any useful fashion. By the way, that 'deal with' is often actually 'argue with' (which is about as much fun as you'd expect).

(A contributing factor to the stress is often that you really need a fix or a workaround for the bug.)

Whether or not they can articulate it, everyone who's made enough bug reports knows this in their gut. In my opinion it's a fairly big reason why a lot of people burn out on making bug reports and stop doing it; it's not that they're making carefully considered cost/benefit calculations (no matter what I've written before about this), it's that they have absolutely no desire to put themselves through the whole exercise again. The frequently low cost/benefit ratio is a post-facto rationalization that people would reach for much less if the whole experience was actually a pleasant one.

There is a really important corollary for this: if you're tempted to urge someone to make a bug report, especially a bug report that you reasonably expect may be rejected, you should understand that you're trying to get them to put themselves through an unpleasant experience.

(I think this is a big part of why I have a very strong urge to bite the heads off of people who respond to me to suggest that I should file bug reports.)

tech/BugReportsExhausting written at 02:51:43; Add Comment


Why people are almost never going to be reporting bugs upstream

In comments on my entry about CentOS bug reporting, opk wrote:

If it is essentially an upstream bug and not packaging I tend to think it's far better to wade into the upstream swamps as you call it. I once packaged something for Debian and mainly gave it up because of the volume of bug reports that were basically for upstream but I had to verify, reproduce, and forward them.

Then Pete left a comment that nicely summarizes the problems with opk's suggestion:

[...] But of course you have to commit to running versions with debugging and then of course there's "the latest" even for the 7. Due to the critical nature of my NM use, I had difficulties experimenting with it.

The reality is that upstream bug reports aren't going to work for almost everyone unless the project has a very generous upstream. The problem is simple: almost all Linux distributions both use old versions of packages and patch them. If your version is patched or not current or both, almost every open source project is going to immediately ask 'can you reproduce this with an unmodified current version?'

I won't go so far as to say that this request is a non-starter, because in theory it can be done. For some projects it is good enough to download the current version (perhaps the current development version) and compile it yourself to be installed in an alternate location (or just run from where it was compiled). Other projects can be rebuilt into real distribution packages and then installed on your system without blowing up the world. And of course if this bug is absolutely critical to you, maybe you're willing to blow up the world just to be able to submit a bug report.

What all of this adds up to is too much work, especially for the payoff most people are likely to get. The reality is that you're unlikely to benefit much from reporting any bug, and you're especially unlikely to benefit from upstream bug fixes unless you're willing to permanently run the upstream version (because if you're not, your distribution has to pick up and possibly backport the upstream bug fix if one is made).

(Let's skip the question of how many bug reporters even have the expertise to go through the steps necessary to try out the upstream version.)

Because reporting bugs upstream is so much work, in practice almost no one is going to do it no matter what you ask (or at least they aren't going to file useful ones). The direct corollary is that a policy of 'report bugs upstream' is in practice a policy of 'don't file bug reports'.

The one semi-exception to all of this is when your distribution package is an unmodified upstream version that the upstream (still) supports. At that point it makes sense to put a note in your bug tracker to explain this and say that upstream will take reports without problems. You're still asking bug reporters to do more work (now they have to go deal with the upstream bug reporting system too), but at least it's a pretty small amount of work.

linux/NoUpstreamBugReports written at 23:56:29; Add Comment
