Wandering Thoughts

2014-10-25

The difference in available pool space between zfs list and zpool list

For a while I've noticed that 'zpool list' reports more available space for our pools than 'zfs list' does, and I've vaguely wondered why. We recently had a very serious issue due to a pool filling up, so suddenly I became very interested in the whole question and did some digging. It turns out that there are two sources of the difference, depending on how your vdevs are set up.

For raidz vdevs, the simple version is that 'zpool list' reports more or less the raw disk space before the raidz overhead, while 'zfs list' applies the standard estimate that you expect (ie that N disks' worth of space will vanish for a raidz level of N). Given that raidz overhead is variable in ZFS, it's easy to see why the two commands are behaving this way.
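
As a concrete illustration with made-up numbers, here's the back-of-the-envelope arithmetic for a hypothetical pool with a single raidz2 vdev of eleven 2 TB disks:

# A sketch of the two space views; the disk count and sizes are invented.
DISKS = 11
RAIDZ_LEVEL = 2     # raidz2: two disks' worth of space go to parity
DISK_TB = 2.0

zpool_view = DISKS * DISK_TB                 # raw space, before parity
zfs_view = (DISKS - RAIDZ_LEVEL) * DISK_TB   # the standard estimate

print("zpool list shows roughly %.0f TB" % zpool_view)
print("zfs list shows roughly %.0f TB" % zfs_view)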

In addition, in general ZFS reserves a certain amount of pool space for various reasons, for example so that you can remove files even when the pool is 'full' (since ZFS is a copy-on-write system, removing files requires some new space to record the changes). This space is sometimes called 'slop space'. According to the code, this reservation is 1/32nd of the pool's size. In my actual experimentation on our OmniOS fileservers it appears to be roughly 1/64th of the pool and definitely not 1/32nd, and I don't know why we're seeing this difference.

(I found out all of this from a Ben Rockwood blog entry and then found the code in the current Illumos codebase to see what the current state was (or is).)

The actual situation with what operations can (or should) use what space is complicated. Roughly speaking, user-level writes and ZFS operations like 'zfs create' and 'zfs snapshot' that make things must respect the full 1/32nd slop reservation, file removes and 'neutral' ZFS operations are allowed to use half of the slop space (running the pool's free space down to 1/64th of its size), and some operations (like 'zfs destroy') have no limit whatsoever and can theoretically run your pool permanently and unrecoverably out of space.
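
To make these thresholds concrete, here's a little sketch of the arithmetic with a made-up pool size, using the 1/32nd figure from the code (a spa_slop_shift of 5):

POOL_SIZE = 10 * 2**40   # a hypothetical 10 TiB pool
SPA_SLOP_SHIFT = 5       # ie slop is 1/32nd of the pool

slop = POOL_SIZE >> SPA_SLOP_SHIFT

# 'making' operations (user writes, zfs create, ...) stop here:
normal_limit = POOL_SIZE - slop
# removes and 'neutral' operations can eat half the slop:
neutral_limit = POOL_SIZE - slop // 2
# 'zfs destroy' and company: no limit at all.

print("slop space: %d bytes" % slop)
print("normal operations stop at: %d bytes used" % normal_limit)
print("removes stop at: %d bytes used" % neutral_limit)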

The final authority is the Illumos kernel code and its comments. These days it's on Github so I can just link to the two most relevant bits: spa_misc.c's discussion of spa_slop_shift and dsl_synctask.h's discussion of zfs_space_check_t.

(What I'm seeing with our pools would make sense if everything was actually being classified as an 'allowed to use half of the slop space' operation. I haven't traced the Illumos kernel code at this level, so I have no idea how this could be happening; the comments certainly suggest that it isn't supposed to be.)

(This is the kind of thing that I write down so I can find it later, even though it's theoretically out there on the Internet already. Re-finding things on the Internet can be a hard problem.)

solaris/ZFSSpaceReportDifference written at 02:05:09

2014-10-24

In Go I've given up and I'm now using standard packages

In my Go programming, I've come around to an attitude that I'll summarize as 'there's no point in fighting city hall'. What this means is that I'm now consciously using standard packages that I don't particularly like just because they are the standard packages.

I'm on record as disliking the standard flag package, for example, and while I still believe in my reasons for this I've decided that it's simply not worth going out of my way over it. The flag package works and it's there. Similarly, I don't think that the log package is necessarily a great solution for emitting messages from Unix style command line utilities but in my latest Go program I used it anyways. It was there and it wasn't worth the effort to code warn() and die() functions and so on.

Besides, using flag and log is standard Go practice so it's going to be both familiar to and expected by anyone who might look at my code someday. There's a definite social benefit to doing things the standard way for anything that I put out in public, much like most everyone uses gofmt on their code.

In theory I could find and use some alternate getopt package (these days the go-to place to find one would be godoc.org). In practice I find using external packages too much of a hassle unless I really need them. This is an odd thing to say about Go, considering that it makes them so easy and accessible, but depending on external packages comes with a whole set of hassles and concerns right now. I've seen a bit too much breakage to want that headache without a good reason.

(This may not be a rational view for Go programming, given that Go deliberately makes using people's packages so easy. Perhaps I should throw myself into using lots of packages just to get acclimatized to it. And in practice I suspect most packages don't break or vanish.)

PS: note that this is different from the people who say you should, eg, use the testing package for your testing because you don't really need anything more than what it provides, or stick with the standard library's HTTP stuff rather than getting a framework. As mentioned, I still think that flag is not the right answer; it's just not wrong enough to be worth fighting city hall over.

Sidebar: Doing standard Unix error and warning messages with log

Here's what I do:

log.SetPrefix("<progname>: ")  // standard Unix 'progname: message' prefix
log.SetFlags(0)                // turn off the default date/time stamp

If I was doing this better I would derive the program name from os.Args[0] instead of hard-coding it, but if I did that I'd have to worry about various special cases and no, I'm being lazy here.
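
For what it's worth, the non-lazy version is only a line longer; a sketch, assuming the usual os and path/filepath imports:

prog := filepath.Base(os.Args[0])  // strip any leading directory
log.SetPrefix(prog + ": ")
log.SetFlags(0)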

programming/GoUsingStandardPackages written at 01:15:57

2014-10-23

The clarity drawback of allowing comparison functions for sorting

I've written before about my unhappiness that Python 3 dropped support for using a comparison function. Well, let me take that back a bit, because I've come around to the idea that there are some real drawbacks to supporting a comparison function here. Not drawbacks in performance (which are comparatively unimportant here) but drawbacks in code clarity.

DWiki's code is sufficiently old that it uses only .sort() cmp functions simply because, well, that's what I had (or at least that's what I was used to). As a result, in two widely scattered spots in different functions its code base contains the following lines:

def func1(...):
    ....
    dl.sort(lambda x,y: cmp(y.timestamp, x.timestamp))
    ....

def func2(...):
    ....
    coms.sort(lambda x,y: cmp(x.time, y.time))
    ....

Apart from the field name, did you see the difference there? I didn't today while I was doing some modernization in DWiki's codebase and converted both of these to the '.sort(key=lambda x: x.FIELD)' form. The difference is that the first is a reverse sort, not a forward sort, because it flips x and y in the cmp().

(This code predates .sort() having a reverse= argument or at least my general awareness and use of it.)

And that's the drawback of allowing or using a sort comparison function: it's not as clear as directly saying what you mean. Small things in the comparison function can have big impacts and they're easy to overlook. By contrast, my intentions and what's going on are clearly spelled out when these things are rewritten into the modern form:

dl.sort(key=lambda x: x.timestamp, reverse=True)
coms.sort(key=lambda x: x.time)

Anyone, a future me included, is much less likely to miss the difference in sort order when reading (or skimming) this code.

For exactly this reason, I now feel that in practice you want to avoid using a comparison function as much as possible, even when one is available. Try very hard to directly say what you mean instead of hiding it inside your cmp function unless there's no way out. A direct corollary of this is that sorting interfaces should try to let you directly express as much as possible instead of forcing you to resort to tricks.

(Note that there are some cases where you must use a comparison function in some form (see especially the second comment).)

PS: I still disagree with Python 3 about removing the cmp argument entirely. It hasn't removed the ability to have custom sort functions; it's just forced you to write a lot more code to enable them, and the result is probably even less efficient than before.
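
For the record, Python 3's escape hatch here is functools.cmp_to_key(), which wraps every element in a little proxy object that implements the comparison operators; that's the extra code and extra overhead I mean. A sketch, redoing the first sort from above (dl is from that example):

from functools import cmp_to_key

def cmp(a, b):
    # Python 3 also dropped the builtin cmp(), so rebuild it.
    return (a > b) - (a < b)

dl.sort(key=cmp_to_key(lambda x, y: cmp(y.timestamp, x.timestamp)))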

python/SortCmpFunctionClarityIssue written at 00:14:32

2014-10-22

Exim's (log) identifiers are basically unique on a given machine

Exim gives each incoming email message an identifier; these look like '1XgWdJ-00020d-7g'. Among other things, this identifier is used for all log messages about the particular email message. Since Exim normally splits information about each message across multiple lines, you routinely need to reassemble or at least match multiple lines for a single message. As a result of this need to aggregate multiple lines, I've quietly wondered for a long time just how unique these log identifiers were. Clearly they weren't going to repeat over the short term, but if I gathered tens or hundreds of days of logs for a particular system, would I find repeats?

The answer turns out to be no. Under normal circumstances Exim's message IDs will be permanently unique on a single machine, although you can't count on global uniqueness across multiple machines (though the odds are pretty good). The details of how these message IDs are formed are in the Exim documentation's chapter 3.4. On most Unixes and with most Exim configurations they are a per-second timestamp, the process PID, and a final subsecond timestamp, and Exim takes care to guarantee that the timestamps will be different for the next possible message with the same PID.

(Thus a cross-machine collision would require the same message time down to the subsecond component plus the same PID on both machines. This is fairly unlikely but not impossible. Exim has a setting that can force more cross-machine uniqueness.)
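
You can see the structure for yourself, because the first component is just a base-62 encoding of the Unix time. Here's a little sketch that decodes it (assuming a normal Unix Exim using the documented base-62 digit set):

import string
import time

# Exim's base-62 digits, per the documentation: 0-9, then A-Z, then a-z.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

def exim_id_time(msgid):
    # The first dash-separated component is the time in seconds.
    secs = 0
    for ch in msgid.split("-")[0]:
        secs = secs * 62 + BASE62.index(ch)
    return secs

print(time.ctime(exim_id_time("1XgWdJ-00020d-7g")))
# prints a time in October 2014, about when this entry was written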

This means that aggregation of multi-line logs can be done with simple brute force approaches that rely on ID uniqueness. Heck, to group all the log lines for a given message together you can just sort on the ID field, assuming you do a stable sort so that things stay in timestamp order when the IDs match.
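
As an illustration, here's a minimal brute-force aggregator; the log path and the exact field layout are assumptions, so adjust both for your system:

from collections import defaultdict

# Exim main log lines normally look like 'YYYY-MM-DD HH:MM:SS id ...',
# so the message ID is the third field when it's present at all.
def group_by_id(logfile):
    groups = defaultdict(list)
    for line in logfile:
        fields = line.split()
        if len(fields) >= 3 and len(fields[2]) == 16 and fields[2].count("-") == 2:
            groups[fields[2]].append(line)
    return groups

with open("/var/log/exim4/mainlog") as f:   # path varies by system
    for msgid, lines in group_by_id(f).items():
        print("".join(lines))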

(As they say, this is relevant to my interests and I finally wound up looking it up today. Writing it down here ensures I don't have to try to remember where I found it in the Exim documentation the next time I need it.)

PS: like many other uses of Unix timestamps, all of this uniqueness potentially goes out the window if you allow time on your machine to actually go backwards. On a moderate volume machine you'd still have to be pretty unlucky to have a collision, though.

sysadmin/EximLogIdUniqueness written at 00:20:33

2014-10-20

Some numbers on our inbound and outbound TLS usage in SMTP

As a result of POODLE, it's suddenly rather interesting to find out the volume of SSLv3 usage that you're seeing. Fortunately for us, Exim directly logs the SSL/TLS protocol version in a relatively easy-to-search-for format; it's recorded as the 'X=...' parameter for both inbound and outbound email. So here are some statistics, first from our external MX gateway for inbound messages and then from our other servers for external deliveries.

Over the past 90 days, we've received roughly 1.17 million external email messages. 389,000 of them were received with some version of SSL/TLS. Unfortunately our external mail gateway currently only supports up to TLS 1.0, so the only split I can report is that only 130 of these messages were received using SSLv3 instead of TLS 1.0. 130 messages is low enough for me to examine the sources by hand; the only particularly interesting and eyebrow-raising ones were a couple of servers at a US university and a .nl ISP.
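
Getting numbers like these out of the logs is straightforward brute force. Here's a sketch of the counting; the log path is an assumption, and the regex is deliberately loose because the exact 'X=' value depends on the TLS library (eg 'TLSv1:AES256-SHA:256' with OpenSSL):

import re
from collections import Counter

proto_re = re.compile(r' X=([A-Za-z0-9._-]+):')

counts = Counter()
with open("/var/log/exim4/mainlog") as f:   # path varies by system
    for line in f:
        m = proto_re.search(line)
        if m:
            counts[m.group(1)] += 1

for version, n in counts.most_common():
    print("%s: %d" % (version, n))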

(I'm a little bit surprised that our Exim doesn't support higher TLS versions, to be honest. We're using Exim on Ubuntu 12.04, which I would have thought would support something more than just TLS 1.0.)

On our user mail submission machine, we've delivered to 167,000 remote addresses over the past 90 days. Almost all of them, 158,000, were done with SSL/TLS. Only three of them used SSLv3 and they were all to the same destination; everything else was TLS 1.0.

(It turns out that very few of our user-submitted messages were received with TLS, only 0.9%. This rather surprises me, but maybe many IMAP programs default to not using TLS even if the submission server offers it. All of this small number of submissions used TLS 1.0, as I'd hope.)

Given that our Exim version only supports TLS 1.0, these numbers are more boring than I was hoping they'd be when I started writing this entry. That's how it goes sometimes; the research process can be disappointing as well as educational.

(I did verify that our SMTP servers really only do support up to TLS 1.0 and it's not just that no one asked for a higher version than that.)

One set of numbers I'd like to get for our inbound email is how TLS usage correlates with spam scores. Unfortunately our inbound mail setup makes it basically impossible to correlate the two bits of information, as spam scoring is done well after the point where the TLS information is available.

Sidebar: these numbers don't quite mean what you might think

I've talked about inbound message deliveries and outbound destination addresses here because that's what Exim logs information about, but of course what is really encrypted is connections. One (encrypted) connection may deliver multiple inbound messages and certainly may be handed multiple RCPT TO addresses in the same conversation. I've also made no attempt to aggregate this by source or destination, so very popular sources or destinations (like, say, Gmail) will influence these numbers quite a lot.

All of this means that these numbers can't be taken as an indication of how many sources or destinations do TLS with us. All I can talk about is message flows.

(I can't even talk about how many outgoing messages are completely protected by TLS, because to do that I'd have to work out how many messages had no non-TLS deliveries. This is probably possible with Exim logs, but it's more work than I'm interested in doing right now. Clearly what I need is some sort of easy-to-use Exim log aggregator that will group all the log messages for a given email message together and then let me do relatively sophisticated queries on the result.)

spam/CSLabTLSUsage2014-10 written at 23:27:51

Revisiting Python's string concatenation optimization

Back in Python 2.4, CPython introduced an optimization for string concatenation that was designed to reduce memory churn in this operation and I got curious enough about this to examine it in some detail. Python 2.4 is a long time ago and I recently was prompted to wonder what had changed since then, if anything, in both Python 2 and Python 3.

To quickly summarize my earlier entry, CPython only optimizes string concatenation by attempting to grow the left side in place instead of making a new string and copying everything. It can only do this if the left-side string has (or clearly will have) a reference count of one, because otherwise it would be breaking the promise that strings are immutable. Generally this requires code of the form 'avar = avar + ...' or 'avar += ...'.
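
One crude way to watch for the optimization is to check whether a string's id() changes across a +=; if the string was grown in place, it won't. This is a CPython implementation detail and nothing to rely on, but it makes the behaviour visible:

avar = "a" * 100
before = id(avar)
avar += "more"
# often True in CPython (for byte strings in 2.x), but never guaranteed
print(id(avar) == before)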

As of Python 2.7.8, things have changed only slightly. In particular, concatenation of Unicode strings is still not optimized; this remains a byte-string-only optimization. For byte strings there are two cases. Strings somewhat under 512 bytes can sometimes be grown in place by a few bytes, depending on their exact sizes. Strings over that can be grown if the system realloc() can find empty space after them.

(As a trivial case, CPython also optimizes concatenating an empty string to something by just returning the other string with its reference count increased.)

In Python 3, things are more complicated but the good news is that this optimization does work on Unicode strings. Python 3.3+ has a complex implementation of (Unicode) strings, but it does attempt to do in-place resizing on them under appropriate circumstances. The first complication is that internally Python 3 has a hierarchy of Unicode string storage and you can't do an in-place concatenation of a more complex sort of Unicode string into a less complex one. Once you have compatible strings in this sense, in terms of byte sizes the relevant sizes are the same as for Python 2.7.8; Unicode string objects that are less than 512 bytes can sometimes be grown by a few bytes while ones larger than that are at the mercy of the system realloc(). However, how many bytes a Unicode string takes up depends on what sort of string storage it is using, which I think mostly depends on how big your Unicode characters are (see this section of the Python 3.3 release notes and PEP 393 for the gory details).

So my overall conclusion remains as before; this optimization is chancy and should not be counted on. If you are doing repeated concatenation you're almost certainly better off using .join() on a list; if you think you have a situation that's otherwise, you should benchmark it.
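
If you want to check this for yourself, a quick and dirty benchmark along these lines will do; the absolute numbers will vary with your platform and CPython version:

import timeit

def concat(n=100000):
    s = ""
    for _ in range(n):
        s += "x"
    return s

def join(n=100000):
    parts = []
    for _ in range(n):
        parts.append("x")
    return "".join(parts)

print("+=   : %.3f" % timeit.timeit(concat, number=10))
print("join : %.3f" % timeit.timeit(join, number=10))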

(In Python 3, the place to start is PyUnicode_Append() in Objects/unicodeobject.c. You'll probably also want to read Include/unicodeobject.h and PEP 393 to understand this, and then see Objects/obmalloc.c for the small object allocator.)

Sidebar: What the funny 512 byte breakpoint is about

Current versions of CPython 2 and 3 allocate 'small' objects using an internal allocator that I think is basically a slab allocator. This allocator is used for all objects that are 512 bytes or less overall, and it rounds object sizes up to the next 8-byte boundary. This means that if you ask for, say, a 41-byte object you actually get one that can hold up to 48 bytes and thus can be 'grown' in place up to this size.

python/ExaminingStringConcatOptII written at 00:37:10

2014-10-19

Vegeta, a tool for web server stress testing

Standard stress testing tools like siege (or the venerable ab, which you shouldn't use) are all systems that do N concurrent requests at once and see how your website stands up to this. This model is a fine one for putting a consistent load on your website for a stress test, but it's not actually representative of how the real world acts. In the real world you generally don't have, say, 50 clients all trying to repeatedly make and re-make one request to you as fast as they can; instead you'll have 50 new clients (and requests) show up every second.

(I wrote about this difference at length back in this old entry.)

Vegeta is an HTTP load and stress testing tool that I stumbled over at some point. What really attracted my attention is that it uses an 'N requests a second' model instead of the concurrent-request model. As a bonus, it reports not just average performance but also the outliers, in the form of 90th and 99th percentile latencies. It's written in Go, which some of my readers may find annoying but which I rather like.

I gave it a try recently and, well, it works. It does what it says it does, which means that it's now become my default load and stress testing tool; 'N new requests a second' is a more realistic and thus interesting test than 'N concurrent requests' for my software (especially here, for obvious reasons).

(I may still do N concurrent requests tests as well, but it'll probably mostly be to see if there are issues that come up under some degree of consistent load and if I have any obvious concurrency race problems.)

Note that as with any HTTP stress tester, testing with high load levels may require a fast system (or systems) with plenty of CPUs, memory, and good networking if applicable. And as always you should validate that vegeta is actually delivering the degree of load that it should be, although this is actually reasonably easy to verify for a 'N new request per second' tester.

(Barring errors, N new requests a second over an M second test run should result in N*M requests made and thus appearing in your server logs. I suppose the next time I run a test with vegeta I should verify this myself in my test environment. In my usage so far I just took it on trust that vegeta was working right, which in light of my ab experience may be a little bit optimistic.)
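
The bucketing itself is trivial. Here's a sketch that counts an access log's requests per second so you can eyeball the delivered rate; the timestamp parsing assumes common-log-format style '[...]' timestamps and will need adjusting for your server:

from collections import Counter

counts = Counter()
with open("access.log") as f:
    for line in f:
        try:
            # '[18/Oct/2014:02:03:33 -0400]' -> one bucket per second
            stamp = line.split("[", 1)[1].split("]", 1)[0]
        except IndexError:
            continue
        counts[stamp] += 1

for stamp, n in sorted(counts.items()):
    print("%s %d" % (stamp, n))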

web/VegetaLoadTesting written at 02:03:33

2014-10-18

During your crisis, remember to look for anomalies

This is a war story.

Today I had one of those valuable learning experiences for a system administrator. What happened is that one of our old fileservers locked up mysteriously, so we power cycled it. Then it locked up again. And again (and an attempt to get a crash dump failed). We thought it might be hardware related, so we transplanted the system disks into an entirely new chassis (with more memory, because there were some indications that it might be running out of memory somehow). It still locked up. Each lockup took maybe ten or fifteen minutes from the reboot, and things were all the more alarming and mysterious because this particular old fileserver only had a handful of production filesystems still on it; almost all of them had been migrated to one of our new fileservers. After one more lockup we gave up and went with our panic plan: we disabled NFS and set up to do an emergency migration of the remaining filesystems to the appropriate new fileserver.

Only as we started the first filesystem migration did we notice that one of the ZFS pools was completely full (so full it could not make a ZFS snapshot). As we were freeing up some space in the pool, a little light came on in the back of my mind; I remembered reading something about how full ZFS pools on our ancient version of Solaris could be very bad news, and I was pretty sure that earlier I'd seen a bunch of NFS write IO at least being attempted against the pool. Rather than migrate the filesystem after the pool had some free space, we selectively re-enabled NFS fileservice. The fileserver stayed up. We enabled more NFS fileservice. And things stayed happy. At this point we're pretty sure that we found the actual cause of all of our fileserver problems today.

(Afterwards I discovered that we had run into something like this before.)

What this has taught me is that during an inexplicable crisis, I should try to take a bit of time to look for anomalies. Not specific anomalies, but general ones: things about the state of the system that aren't right or don't seem right.

(There is a certain amount of hindsight bias in this advice, but I want to mull that over a bit before I write more about it. The more I think about it, the more complicated real crisis response becomes.)

sysadmin/CrisisLookForAnomalies written at 00:54:50

2014-10-17

My experience doing relatively low level X stuff in Go

Today I wound up needing a program that spoke the current Firefox remote control protocol instead of the old -remote based protocol that Firefox Nightly just removed. I had my choice between either adding a bunch of buffer mangling to a very old C program that already did basically all of the X stuff necessary or trying to do low-level X things from a Go program. The latter seemed much more interesting and so it's what I did.

(The old protocol was pretty simple but the new one involves a bunch of annoying buffer packing.)

Remote controlling Firefox is done through X properties, which is a relatively low level part of the X protocol (well below the usual level of GUIs and toolkits like GTK and Qt). You aren't making windows or drawing anything; instead you're grubbing around in window trees and getting obscure events from other people's windows. Fortunately Go has low level bindings for X in the form of Andrew Gallant's X Go Binding and his xgbutil packages for them (note that the XGB documentation you really want to read is for xgb/xproto). Use of these can be a little bit obscure so it very much helped me to read several examples (for both xgb and xgbutil).

All told the whole experience was pretty painless. Most of the stumbling blocks I ran into were because I don't really know X programming and because I was effectively translating from an older X API (Xlib) that my original C program was using to XCB, which is what XGB's API is based on. This involved a certain amount of working out what old functions that the old code was calling actually did and then figuring out how to translate them into XGB and xgbutil stuff (mostly the latter, because xgbutil puts a nice veneer over a lot of painstaking protocol bits).

(I was especially pleased that my Go code for the annoying buffer packing worked the first time. It was also pretty easy and obvious to write.)

One of the nice little things about using Go for this is that XGB turns out to be a pure Go binding, which means it can be freely cross compiled. So now I can theoretically do Firefox remote control from essentially any machine I remotely log into around here. Someday I may have a use for this, perhaps for some annoying system management program that insists on spawning something to show me links.

(Cross machine remote control matters to me because I read my email on a remote machine with a graphical program, and of course I want to click on links there and have them open in my workstation's main Firefox.)

Interested parties who want either a functional and reasonably commented example of doing this sort of stuff in Go or a program to do lightweight remote control of Unix Firefox can take a look at the ffox-remote repo. As a bonus I have written down in comments what I now know about the actual Firefox remote control protocol itself.

programming/GoLowLevelX written at 00:54:09

2014-10-16

Don't use dd as a quick version of disk mirroring

Suppose, not entirely hypothetically, that you initially set up a server with one system disk but have come to wish that it had a mirrored pair of them. The server is in production and in-place migration to software RAID requires a downtime or two, so as a cheap 'in case of emergency' measure you stick in a second disk and then clone your current system disk to it with dd (remember to fsck the root filesystem afterwards).

(This has a number of problems if you ever actually need to boot from the second disk, but let's set them aside for now.)

Unfortunately, on a modern Linux machine you have just armed a time bomb that is aimed at your foot. It may never go off, or it may go off more than a year and a half later (when you've forgotten all about this), or it may go off the next time you reboot the machine. The problem is that modern Linux systems identify their root filesystem by its UUID, not its disk location, and because you cloned the disk with dd you now have two different filesystems with the same UUID.

(Unless you do something to manually change the UUID on the cloned copy, which you can. But you have to remember that step. On extN filesystems, it's done with tune2fs's -U argument; you probably want '-U random'.)
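
If you're worried that you may have already armed this particular time bomb somewhere, checking for duplicate UUIDs is easy to script. Here's a sketch that parses blkid output (run it as root; it assumes the usual 'device: UUID="..." ...' output format):

import re
import subprocess
from collections import defaultdict

out = subprocess.check_output(["blkid"]).decode()
seen = defaultdict(list)
for line in out.splitlines():
    m = re.match(r'(\S+):.* UUID="([^"]+)"', line)
    if m:
        seen[m.group(2)].append(m.group(1))

for fsuuid, devs in seen.items():
    if len(devs) > 1:
        print("duplicate UUID %s on: %s" % (fsuuid, ", ".join(devs)))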

Most of the time, the kernel and initramfs will probably see your first disk first and inventory the UUID on its root partition first and so on, and thus boot from the right filesystem on the first disk. But this is not guaranteed. Someday the kernel may get around to looking at sdb1 before it looks at sda1, find the UUID it's looking for, and mount your cloned copy as the root filesystem instead of the real thing. If you're lucky, the cloned copy is so out of date that things fail explosively and you notice immediately (although figuring out what's going on may take a bit of time and in the mean time life can be quite exciting). If you're unlucky, the cloned copy is close enough to the real root filesystem that things mostly work and you might only have a few little anomalies, like missing log files or mysteriously reverted package versions or the like. You might not even really notice.

(This is the background behind my recent tweet.)

linux/DDMirroringDanger written at 02:13:02
