The difference in available pool space between 'zfs list' and 'zpool list'
For a while I've noticed that 'zpool list' would report that our pools had more available space than 'zfs list' did, and I've vaguely wondered why. We recently had a very serious issue due to a pool filling up, so suddenly I became very interested in the whole issue and did some digging. It turns out that there are two sources of the difference, depending on how your vdevs are set up.
For raidz vdevs, the simple version is that 'zpool list' reports more or less the raw disk space before the raidz overhead, while 'zfs list' applies the standard estimate that you expect (ie, that N disks worth of space will vanish for a raidz level of N). Given that raidz overhead is variable in ZFS, it's easy to see why the two commands behave this way.
In addition, in general ZFS reserves a certain amount of pool space for various reasons, for example so that you can remove files even when the pool is 'full' (since ZFS is a copy on write system, removing files requires some new space to record the changes). This space is sometimes called 'slop space'. According to the code this reservation is 1/32nd of the pool's size. In my actual experimentation on our OmniOS fileservers this appears to be roughly 1/64th of the pool and definitely not 1/32nd of it, and I don't know why we're seeing this difference.
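To make the arithmetic concrete, here's a small sketch of the two reservation figures. This is my own illustration, not the actual Illumos code; the kernel expresses the reservation as a right shift of the pool size, where a shift of 5 gives the documented 1/32nd and a shift of 6 would match the roughly 1/64th behavior we're actually observing.

```python
# A sketch of ZFS slop space arithmetic, not the actual kernel code.
# The reservation is the pool size shifted right; shift 5 = 1/32nd
# (what the code says), shift 6 = 1/64th (what we appear to observe).
def slop_space(pool_bytes, shift=5):
    return pool_bytes >> shift

TIB = 1 << 40
pool = 10 * TIB
full_reservation = slop_space(pool)     # 1/32nd of the pool
half_reservation = slop_space(pool, 6)  # 1/64th of the pool
```

On a 10 TiB pool the difference between the two figures is around 160 GiB, which is large enough to notice when you're wondering where your space went.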
(I found out all of this from a Ben Rockwood blog entry and then found the code in the current Illumos codebase to see what the current state was (or is).)
The actual situation with what operations can (or should) use what space is complicated. Roughly speaking, user level writes and ZFS operations like 'zfs create' and 'zfs snapshot' that make things should use the 1/32nd reserved space figure, file removes and 'neutral' ZFS operations should be allowed to use half of the slop space (running the pool down to 1/64th of its size), and some operations (like 'zfs destroy') have no limit whatsoever and can theoretically run your pool permanently and unrecoverably out of space.
The final authority is the Illumos kernel code and its comments. These days it's on Github, so I can just link to the two most relevant bits: spa_misc.c's discussion of slop space, and dsl_synctask.h's discussion of space checks.
(What I'm seeing with our pools would make sense if everything was actually being classified as an 'allowed to use half of the slop space' operation. I haven't traced the Illumos kernel code at this level so I have no idea how this could be happening; the comments certainly suggest that it isn't supposed to be.)
(This is the kind of thing that I write down so I can find it later, even though it's theoretically out there on the Internet already. Re-finding things on the Internet can be a hard problem.)
In Go I've given up and I'm now using standard packages
In my Go programming, I've come around to an attitude that I'll summarize as 'there's no point in fighting city hall'. What this means is that I'm now consciously using standard packages that I don't particularly like just because they are the standard packages.
I'm on record as disliking the standard flag package, for example, and while I still believe in my reasons for this, I've decided that it's simply not worth going out of my way over it. The flag package works and it's there. Similarly, I don't think that the log package is necessarily a great solution for emitting messages from Unix style command line utilities, but in my latest Go program I used it anyway. It was there and it wasn't worth the effort to code up my own functions and so on.
Using log is also standard Go practice, so it's going
to be both familiar to and expected by anyone who might look at my code
someday. There's a definite social benefit to doing things the standard
way for anything that I put out in public, much like most everyone uses
gofmt on their code.
In theory I could find and use some alternate getopt package (these days the go-to place to find one would be godoc.org). In practice I find using external packages too much of a hassle unless I really need them. This is an odd thing to say about Go, considering that it makes them so easy and accessible, but depending on external packages comes with a whole set of hassles and concerns right now. I've seen a bit too much breakage to want that headache without a good reason.
(This may not be a rational view for Go programming, given that Go deliberately makes using people's packages so easy. Perhaps I should throw myself into using lots of packages just to get acclimatized to it. And in practice I suspect most packages don't break or vanish.)
PS: note that this is different from the people who say you should, eg, stick with the standard testing package for your testing because you don't really need anything more than what it provides, or stick with the standard library's HTTP stuff rather than getting a framework. As mentioned, I still think that flag is not the right answer; it's just not wrong enough to be worth fighting city hall over.
Sidebar: Doing standard Unix error and warning messages with log
Here's what I do:
    log.SetPrefix("<progname>: ")
    log.SetFlags(0)
If I was doing this better I would derive the program name from
os.Args instead of hard-coding it, but if I did that I'd have to
worry about various special cases and no, I'm being lazy here.
The clarity drawback of allowing comparison functions for sorting
I've written before about my unhappiness that Python 3 dropped support for using a comparison function. Well, let me take that back a bit, because I've come around to the idea that there are some real drawbacks to supporting a comparison function here. Not drawbacks in performance (which are comparatively unimportant here) but drawbacks in code clarity.
DWiki's code is sufficiently old that it uses only sort comparison functions, simply because, well, that's what I had (or at least that's what I was used to). As a result, in two widely scattered spots in different functions its code base contains the following:

    def func1(...):
        ....
        dl.sort(lambda x,y: cmp(y.timestamp, x.timestamp))
        ....

    def func2(...):
        ....
        coms.sort(lambda x,y: cmp(x.time, y.time))
        ....
Apart from the field name, did you see the difference there? I didn't today while I was doing some modernization in DWiki's codebase and converted both of these to the '.sort(key=lambda x: x.FIELD)' form. The difference is that the first is a reverse sort, not a forward sort, because it flips x and y in the cmp().
(This code predates .sort() having a reverse= argument, or at least my general awareness and use of it.)
And that's the drawback of allowing or using a sort comparison function: it's not as clear as directly saying what you mean. Small things in the comparison function can have big impacts and they're easy to overlook. By contrast, my intentions and what's going on are clearly spelled out when these things are rewritten into the modern form:
    dl.sort(key=lambda x: x.timestamp, reverse=True)
    coms.sort(key=lambda x: x.time)
Anyone, a future me included, is much less likely to miss the difference in sort order when reading (or skimming) this code.
I now feel that in practice you want to avoid using a comparison function as much as possible, even if one exists, for exactly this reason. Try very hard to directly say what you mean instead of hiding it inside your cmp function unless there's no way out.
A direct corollary of this is that sorting interfaces should
try to let you directly express as much as possible instead of
forcing you to resort to tricks.
(Note that there are some cases where you must use a comparison function in some form (see especially the second comment).)
PS: I still disagree with Python 3 about removing the cmp argument entirely. It hasn't removed the ability to have custom sort functions; it's just forced you to write a lot more code to enable them and the result is probably even less efficient than before.
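To illustrate the 'a lot more code' part: in Python 3 you have to wrap an old-style comparison function with functools.cmp_to_key, which builds a key object around every element. This is my own sketch, not DWiki code:

```python
from functools import cmp_to_key

# An old-style comparison function: negative, zero, or positive.
def num_cmp(x, y):
    return (x > y) - (x < y)

data = [3, 1, 2]
# Forward sort via the wrapped comparison function.
data.sort(key=cmp_to_key(num_cmp))                     # [1, 2, 3]
# The old flipped-arguments trick still works, with exactly the same
# clarity problem as before.
data.sort(key=cmp_to_key(lambda x, y: num_cmp(y, x)))  # [3, 2, 1]
```

Note that the flipped-arguments version is just as easy to misread as the old cmp() form, which is the whole point of this entry.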
Exim's (log) identifiers are basically unique on a given machine
Exim gives each incoming email message an identifier; these look like '1XgWdJ-00020d-7g'. Among other things, this identifier is used for all log messages about the particular email message. Since Exim normally splits information about each message across multiple lines, you routinely need to reassemble or at least match multiple lines for a single message. As a result of this need to aggregate multiple lines, I've quietly wondered for a long time just how unique these log identifiers were. Clearly they weren't going to repeat over the short term, but if I gathered tens or hundreds of days of logs for a particular system, would I find repeats?
The answer turns out to be no. Under normal circumstances Exim's message IDs here will be permanently unique on a single machine, although you can't count on global uniqueness across multiple machines (although the odds are pretty good). The details of how these message IDs are formed are in the Exim documentation's chapter 3.4. On most Unixes and with most Exim configurations they are a per-second timestamp, the process PID, and a final subsecond timestamp, and Exim takes care to guarantee that the timestamps will be different for the next possible message with the same PID.
(Thus a cross-machine collision would require the same message time down to the subsecond component plus the same PID on both machines. This is fairly unlikely but not impossible. Exim has a setting that can force more cross-machine uniqueness.)
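As an illustration of the structure, here's a sketch of pulling an ID apart in Python. I'm assuming the base 62 digit ordering of 0-9, then A-Z, then a-z; check the Exim documentation before relying on this:

```python
import string

# Sketch of decoding an Exim message ID like '1XgWdJ-00020d-7g',
# assuming base 62 digits in the order 0-9, A-Z, a-z.
B62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

def b62decode(s):
    n = 0
    for ch in s:
        n = n * 62 + B62.index(ch)
    return n

tpart, pidpart, subsec = "1XgWdJ-00020d-7g".split("-")
timestamp = b62decode(tpart)   # seconds since the epoch
pid = b62decode(pidpart)       # PID of the receiving Exim process
```

Decoded this way, the example ID above works out to a timestamp in October 2014, which is at least plausible.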
This means that aggregation of multi-line logs can be done with
simple brute force approaches that rely on ID uniqueness. Heck, to
group all the log lines for a given message together you can just
sort on the ID field, assuming you do a stable sort so that things
stay in timestamp order when the IDs match.
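A sketch of what I mean, using some hypothetical log lines where the ID is the third whitespace-separated field:

```python
# Stable-sorting hypothetical Exim log lines by message ID; because
# Python's sort is stable, lines with the same ID keep their original
# (timestamp) order.
lines = [
    "2014-10-21 12:00:01 1XgWdJ-00020d-7g <= sender@example.com ...",
    "2014-10-21 12:00:02 1XgWdK-00020f-Ab <= other@example.org ...",
    "2014-10-21 12:00:03 1XgWdJ-00020d-7g => recipient@example.net ...",
]

def msgid(line):
    return line.split()[2]

grouped = sorted(lines, key=msgid)
# All '1XgWdJ-00020d-7g' lines are now adjacent, still in time order.
```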
(As they say, this is relevant to my interests and I finally wound up looking it up today. Writing it down here ensures I don't have to try to remember where I found it in the Exim documentation the next time I need it.)
PS: like many other uses of Unix timestamps, all of this uniqueness potentially goes out the window if you allow time on your machine to actually go backwards. On a moderate volume machine you'd still have to be pretty unlucky to have a collision, though.
Some numbers on our inbound and outbound TLS usage in SMTP
As a result of POODLE,
it's suddenly rather interesting to find out the volume of SSLv3
usage that you're seeing. Fortunately for us, Exim directly logs
the SSL/TLS protocol version in a relatively easy to search for format; it's recorded as the 'X=...' parameter for both inbound and outbound email. So here are some statistics, first from our external MX gateway for inbound messages and then from our other servers for outbound messages.
Over the past 90 days, we've received roughly 1.17 million external email messages. 389,000 of them were received with some version of SSL/TLS. Unfortunately our external mail gateway currently only supports up to TLS 1.0, so the only split I can report is that only 130 of these messages were received using SSLv3 instead of TLS 1.0. 130 messages is low enough for me to examine the sources by hand; the only particularly interesting and eyebrow-raising ones were a couple of servers at a US university and a .nl ISP.
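The tallying itself is straightforward. This is a sketch of the general approach rather than my actual scripts, and it assumes the 'X=<protocol>:<cipher>' form of the log field; the exact format may vary with your Exim version and configuration:

```python
import collections
import re

# Tally SSL/TLS protocol versions from Exim log lines, assuming the
# 'X=<protocol>:<cipher>' field format (eg 'X=TLSv1:AES256-SHA:256').
XRE = re.compile(r'\bX=([^:\s]+)')

def tally(lines):
    counts = collections.Counter()
    for line in lines:
        m = XRE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "... <= a@example.com H=... X=TLSv1:AES256-SHA:256 ...",
    "... <= b@example.org H=... X=SSLv3:DES-CBC3-SHA:168 ...",
    "... <= c@example.net H=... (no TLS) ...",
]
```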
(I'm a little bit surprised that our Exim doesn't support higher TLS versions, to be honest. We're using Exim on Ubuntu 12.04, which I would have thought would support something more than just TLS 1.0.)
On our user mail submission machine, we've delivered to 167,000 remote addresses over the past 90 days. Almost all of them, 158,000, were done with SSL/TLS. Only three of them used SSLv3 and they were all to the same destination; everything else was TLS 1.0.
(It turns out that very few of our user submitted messages were received with TLS, only 0.9%. This rather surprises me but maybe many IMAP programs default to not using TLS even if the submission server offers it. All of these small number of submissions used TLS 1.0, as I'd hope.)
Given that our Exim version only supports TLS 1.0, these numbers are more boring than I was hoping they'd be when I started writing this entry. That's how it goes sometimes; the research process can be disappointing as well as educating.
(I did verify that our SMTP servers really only do support up to TLS 1.0 and it's not just that no one asked for a higher version than that.)
One set of numbers I'd like to get for our inbound email is how TLS usage correlates with spam score. Unfortunately our inbound mail setup makes it basically impossible to correlate the bits together, as spam scoring is done well after TLS information is readily available.
Sidebar: these numbers don't quite mean what you might think
I've talked about inbound message deliveries and outbound destination
addresses here because that's what Exim logs information about, but
of course what is really encrypted is connections. One (encrypted)
connection may deliver multiple inbound messages and certainly may
be handed multiple
RCPT TO addresses in the same conversation.
I've also made no attempt to aggregate this by source or destination,
so very popular sources or destinations (like, say, Gmail) will
influence these numbers quite a lot.
All of this means that these numbers can't be taken as an indication of how many sources or destinations do TLS with us. All I can talk about is message flows.
(I can't even talk about how many outgoing messages are completely protected by TLS, because to do that I'd have to work out how many messages had no non-TLS deliveries. This is probably possible with Exim logs, but it's more work than I'm interested in doing right now. Clearly what I need is some sort of easy to use Exim log aggregator that will group all log messages for a given email message together and then let me do relatively sophisticated queries on the result.)
Revisiting Python's string concatenation optimization
Back in Python 2.4, CPython introduced an optimization for string concatenation that was designed to reduce memory churn in this operation and I got curious enough about this to examine it in some detail. Python 2.4 is a long time ago and I recently was prompted to wonder what had changed since then, if anything, in both Python 2 and Python 3.
To quickly summarize my earlier entry,
CPython only optimizes string concatenations by attempting to grow
the left side in place instead of making a new string and copying
everything. It can only do this if the left side string only has
(or clearly will have) a reference count of one, because otherwise
it's breaking the promise that strings are immutable. Generally
this requires code of the form 'avar = avar + ...' or 'avar += ...'.
As of Python 2.7.8, things have changed only slightly. In particular
concatenation of Unicode strings is still not optimized; this
remains a byte string only optimization. For byte strings there are two
cases. Strings under somewhat less than 512 bytes can sometimes be grown
in place by a few bytes, depending on their exact sizes. Strings over
that can be grown if the system realloc() can find empty space after them.
(As a trivial case, CPython also optimizes concatenating an empty string to something by just returning the other string with its reference count increased.)
In Python 3, things are more complicated but the good news is that
this optimization does work on Unicode strings. Python 3.3+ has a
complex implementation of (Unicode) strings, but it does attempt
to do in-place resizing on them under appropriate circumstances.
The first complication is that internally Python 3 has a hierarchy
of Unicode string storage and you can't do an in-place concatenation
of a more complex sort of Unicode string into a less complex one.
Once you have compatible strings in this sense, in terms of byte
sizes the relevant sizes are the same as for Python 2.7.8; Unicode
string objects that are less than 512 bytes can sometimes be grown
by a few bytes while ones larger than that are at the mercy of the system realloc(). However, how many bytes a Unicode string takes
up depends on what sort of string storage it is using, which I think
mostly depends on how big your Unicode characters are (see this
section of the Python 3.3 release notes and PEP 393 for the gory details).
So my overall conclusion remains as before; this optimization is
chancy and should not be counted on. If you are doing repeated
concatenation you're almost certainly better off using ''.join() on a list; if you think you have a situation that's otherwise, you
should benchmark it.
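A minimal benchmarking sketch, for illustration; the actual numbers will depend on your Python version, your platform realloc(), and your string sizes:

```python
import timeit

# Repeated '+=' concatenation versus ''.join() on a list; the results
# are identical, only the performance differs.
def concat(pieces):
    s = ""
    for p in pieces:
        s += p
    return s

def join(pieces):
    return "".join(pieces)

pieces = ["x" * 16] * 1000
assert concat(pieces) == join(pieces)
t_concat = timeit.timeit(lambda: concat(pieces), number=200)
t_join = timeit.timeit(lambda: join(pieces), number=200)
```

I deliberately don't assert which one wins here, because whether the in-place growth optimization fires is exactly the chancy part.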
(In Python 3, the place to start is
Objects/unicodeobject.c. You'll probably also want to read
Include/unicodeobject.h and PEP 393 to understand this, and
then see Objects/obmalloc.c for the small object allocator.)
Sidebar: What the funny 512 byte breakpoint is about
Current versions of CPython 2 and 3 allocate 'small' objects using an internal allocator that I think is basically a slab allocator. This allocator is used for all overall objects that are 512 bytes or less and it rounds object size up to the next 8-byte boundary. This means that if you ask for, say, a 41-byte object you actually get one that can hold up to 48 bytes and thus can be 'grown' in place up to this size.
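The rounding arithmetic itself is simple; this is my sketch of it, not CPython's actual code:

```python
# Small objects (512 bytes or less) are rounded up to the next
# multiple of 8, so the slack is how far an object can 'grow' in place.
def rounded_size(n):
    return (n + 7) // 8 * 8

slack = rounded_size(41) - 41   # a 41-byte request gets 48 bytes
```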
Vegeta, a tool for web server stress testing
Standard stress testing tools like siege (or the venerable ab, which you shouldn't use) are all systems that do N
concurrent requests at once and see how your website stands up to
this. This model is a fine one for putting a consistent load on
your website for a stress test, but it's not actually representative
of how the real world acts. In the real world you generally don't
have, say, 50 clients all trying to repeatedly make and re-make one
request to you as fast as they can; instead you'll have 50 new
clients (and requests) show up every second.
(I wrote about this difference at length back in this old entry.)
Vegeta is an HTTP load and stress testing tool that I stumbled over at some point. What really attracted my attention is that it uses an 'N requests a second' model instead of the concurrent request model. As a bonus it will also report not just average performance but also outliers, in the form of the 90th and 99th percentiles. It's written in Go, which some of my readers may find annoying but which I rather like.
I gave it a try recently and, well, it works. It does what it says it does, which means that it's now become my default load and stress testing tool; 'N new requests a second' is a more realistic and thus interesting test than 'N concurrent requests' for my software (especially here, for obvious reasons).
(I may still do N concurrent requests tests as well, but it'll probably mostly be to see if there are issues that come up under some degree of consistent load and if I have any obvious concurrency race problems.)
Note that as with any HTTP stress tester, testing with high load levels may require a fast system (or systems) with plenty of CPUs, memory, and good networking if applicable. And as always you should validate that vegeta is actually delivering the degree of load that it should be, although this is actually reasonably easy to verify for a 'N new request per second' tester.
(Barring errors, N new requests a second over an M second test run
should result in N*M requests made and thus appearing in your server
logs. I suppose the next time I run a test with vegeta I should
verify this myself in my test environment. In my usage so far I
just took it on trust that vegeta was working right, which in
light of my ab experience may be a little bit optimistic.)
During your crisis, remember to look for anomalies
This is a war story.
Today I had one of those valuable learning experiences for a system administrator. What happened is that one of our old fileservers locked up mysteriously, so we power cycled it. Then it locked up again. And again (and an attempt to get a crash dump failed). We thought it might be hardware related, so we transplanted the system disks into an entirely new chassis (with more memory, because there was some indications that it might be running out of memory somehow). It still locked up. Each lockup took maybe ten or fifteen minutes from the reboot, and things were all the more alarming and mysterious because this particular old fileserver only had a handful of production filesystems still on it; almost all of them had been migrated to one of our new fileservers. After one more lockup we gave up and went with our panic plan: we disabled NFS and set up to do an emergency migration of the remaining filesystems to the appropriate new fileserver.
Only as we started the first filesystem migration did we notice that one of the ZFS pools was completely full (so full it could not make a ZFS snapshot). As we were freeing up some space in the pool, a little light came on in the back of my mind; I remembered reading something about how full ZFS pools on our ancient version of Solaris could be very bad news, and I was pretty sure that earlier I'd seen a bunch of NFS write IO at least being attempted against the pool. Rather than migrate the filesystem after the pool had some free space, we selectively re-enabled NFS fileservice. The fileserver stayed up. We enabled more NFS fileservice. And things stayed happy. At this point we're pretty sure that we found the actual cause of all of our fileserver problems today.
What this has taught me is during an inexplicable crisis, I should try to take a bit of time to look for anomalies. Not specific anomalies, but general ones; things about the state of the system that aren't right or don't seem right.
(There is a certain amount of hindsight bias in this advice, but I want to mull that over a bit before I write more about it. The more I think about it, the more complicated real crisis response becomes.)
My experience doing relatively low level X stuff in Go
Today I wound up needing a program that spoke the current Firefox remote control protocol instead of the old '-remote' based protocol that Firefox Nightly just removed. I had my
choice between either adding a bunch of buffer mangling to a very old
C program that already did basically all of the X stuff necessary or
trying to do low-level X things from a Go program. The latter seemed
much more interesting and so it's what I did.
(The old protocol was pretty simple but the new one involves a bunch of annoying buffer packing.)
Remote controlling Firefox is done through X properties, which is a relatively low level part of the X protocol (well below the usual level of GUIs and toolkits like GTK and Qt). You aren't making windows or drawing anything; instead you're grubbing around in window trees and getting obscure events from other people's windows. Fortunately Go has low level bindings for X in the form of Andrew Gallant's X Go Binding and his xgbutil packages for them (note that the XGB documentation you really want to read is for xgb/xproto). Use of these can be a little bit obscure so it very much helped me to read several examples (for both xgb and xgbutil).
All told the whole experience was pretty painless. Most of the stumbling blocks I ran into were because I don't really know X programming and because I was effectively translating from an older X API (Xlib) that my original C program was using to XCB, which is what XGB's API is based on. This involved a certain amount of working out what old functions that the old code was calling actually did and then figuring out how to translate them into XGB and xgbutil stuff (mostly the latter, because xgbutil puts a nice veneer over a lot of painstaking protocol bits).
(I was especially pleased that my Go code for the annoying buffer packing worked the first time. It was also pretty easy and obvious to write.)
One of the nice little things about using Go for this is that XGB turns out to be a pure Go binding, which means it can be freely cross compiled. So now I can theoretically do Firefox remote control from essentially any machine I remotely log into around here. Someday I may have a use for this, perhaps for some annoying system management program that insists on spawning something to show me links.
(Cross machine remote control matters to me because I read my email on a remote machine with a graphical program, and of course I want to click on links there and have them open in my workstation's main Firefox.)
Interested parties who want either a functional and reasonably commented example of doing this sort of stuff in Go or a program to do lightweight remote control of Unix Firefox can take a look at the ffox-remote repo. As a bonus I have written down in comments what I now know about the actual Firefox remote control protocol itself.
dd as a quick version of disk mirroring
Suppose, not entirely hypothetically, that you initially set up a
server with one system disk but have come to wish that it had a
mirrored pair of them. The server is in production and in-place
migration to software RAID requires a downtime or two, so as a cheap 'in case of emergency' measure
you stick in a second disk and then clone your current system disk
to it with
dd (remember to
fsck the root filesystem afterwards).
(This has a number of problems if you ever actually need to boot from the second disk, but let's set them aside for now.)
Unfortunately, on a modern Linux machine you have just armed a time
bomb that is aimed at your foot. It may never go off, or it may go
off more than a year and a half later (when you've forgotten all
about this), or it may go off the next time you reboot the machine.
The problem is that modern Linux systems identify their root
filesystem by its UUID, not its disk location, and because you
cloned the disk with
dd you now have two different filesystems
with the same UUID.
(Unless you do something to manually change the UUID on the cloned
copy, which you can. But you have to remember that step. On extN
filesystems, it's done with tune2fs's -U argument; you probably want to use '-U random'.)
Most of the time, the kernel and initramfs will probably see your
first disk first and inventory the UUID on its root partition first
and so on, and thus boot from the right filesystem on the first
disk. But this is not guaranteed. Someday the kernel may get around
to looking at
sdb1 before it looks at
sda1, find the UUID it's
looking for, and mount your cloned copy as the root filesystem
instead of the real thing. If you're lucky, the cloned copy is so
out of date that things fail explosively and you notice immediately
(although figuring out what's going on may take a bit of time and
in the mean time life can be quite exciting). If you're unlucky,
the cloned copy is close enough to the real root filesystem that
things mostly work and you might only have a few little anomalies,
like missing log files or mysteriously reverted package versions
or the like. You might not even really notice.
(This is the background behind my recent tweet.)