Wandering Thoughts archives

2012-02-26

What information I want out of ZFS tools and libraries

Back in comments on my observation that Solaris 11 is closed source, Joshua M. Clulow noted that the Illumos people are working on making a better (and presumably public) version of libzfs, the nominal interface for dealing with ZFS. Although I've moved slowly on this, I think it's time to write down my thoughts about what I want for dealing with ZFS.

First off, my needs are probably somewhat unusual. I don't actually want to do anything to ZFS through libzfs; I just want to extract information. I also mostly don't care if I get an actual C-level API or simply some tools that give me information; either is about as convenient to me, since I'm actually going to consume the information in a non-C environment (either shell scripts or Python, depending on just what we're doing).

What I do need is three things: a stable and documented interface, information in a form that I can easily and reliably parse and interpret, and complete information (not just things that have been cooked into some user-friendly form that elides details). The output of the current zpool and zfs commands is none of these three; the exact output is neither stable nor documented, it's very hard to parse, and it's not complete. What we currently get through (ab)using Solaris's current libzfs is complete and easy to 'parse' (C structures are easy to deal with in one sense), but it's not stable or documented.

(I have a moderate bias towards a stable C API for libzfs because at this point I'd rather roll my own information extraction stuff than trust ZFS's own commands, and it's harder to cheat or omit things in a C API. And I don't have to worry that people will feel that, eg, XML is the perfect output format.)

Currently, we need two sorts of information; we need configuration information and pool state information. Configuration information covers things like what disks the pool uses and how it's organized, what filesystems there are, what snapshots there are, and so on. We use this both passively (we periodically record basic information about all pools for tracking purposes) and actively (knowing what disks are in use and how is a vital part of our spares system). Pool state information covers the health of disks in the pool and the state of things like resilvers and scrubs; we use this both for ongoing health monitoring and as part of our spares system.

(We don't currently need to extract performance data but we might at some point in the future.)

As for what specific pieces of configuration and state information we want, the likely answer is 'all of it'. If ZFS tracks it at all, I'm at least potentially interested in it.

Sidebar: how to test a proposed ZFS API

My rather obvious advice to anyone designing a public API for getting ZFS information is to test it by rewriting the information display portions of zpool and zfs using only the public API. If you can't do it at all, the API has obviously failed. However, if the API doesn't give you any extra information over what those two commands need today, it also fails, because neither command displays most of the available information about configuration and state.

Generally you should be able to use the API to write an absurdly more verbose version of zpool status, one that will deluge you in a pile of detailed information.

solaris/ZFSInformationDesire written at 22:15:29

How much spam is forged as being from who it's sent to?

After doing the stats for the most popular sender domains for spam and discovering that the most popular thing was to use our domains, I was left with a very related question: how much spam is forged to come from the victim themselves?

As near as I can tell, almost all of the spam that's forged as being from our domains is in fact forged as coming from the victim themselves (or, for multi-recipient messages, as coming from the first recipient). Based on our current set of 45 days of logfiles, that's about 8.3% of all messages that got spam-tagged. I suppose that this makes sense; after all, there's no need to take the risk of making up addresses on the remote system when you already have some, ie the ones you're sending spam to.

(As before, I checked only high-rated spam.)
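The check described above can be sketched roughly as follows. This is only an illustration of the test, not the actual analysis code; the function name is made up, and it assumes the sender and recipient addresses have already been pulled out of the logs and that a case-insensitive comparison is acceptable.

```python
def forged_from_self(sender, recipients):
    """Return True if a message claims to come from its own first recipient.

    'sender' is the claimed sender address and 'recipients' is the list of
    recipient addresses in order. Both are assumed to be plain addresses
    already extracted from a log line (hypothetical inputs).
    """
    if not sender or not recipients:
        return False
    # Assumption: addresses are compared case-insensitively.
    return sender.strip().lower() == recipients[0].strip().lower()
```

For a multi-recipient message this only looks at the first recipient, which matches the pattern noted above.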

The obvious corollary question to ask is how many non-spam messages match this criterion. The answer appears to be that almost none do, which is not really surprising. Given ad-hoc mailing lists and the like, it's possible for legitimate email to loop around in this way or for people to copy themselves when they're sending email through an outside SMTP server, but it's probably not going to be very common in most user populations.

For a while, I've believed that spammers like forging system addresses, especially postmaster. This turns out to be wrong; vanishingly little (high-scoring) spam is sent as from anyone's postmaster, and none is forged as from our postmaster address. Virus spammers may do that, but viruses are still very rare in our mail stream. I admit that this surprises me.

(Working with the logfiles for our spam filtering and tagging system has shown me that I need a specialized matching and extracting program that works with log lines of the form 'key=value key=value key=value ...', especially with some keys repeated several times. Awk is not a really good fit for these files. Creative use of tr can help when I only want a single field, but things fall down when I want several.)
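A minimal sketch of the sort of matching and extracting program I mean, in Python. It assumes that fields are separated by whitespace and that values never contain whitespace themselves, which may not be true of real log lines; the function names and the field names in the usage example are made up for illustration.

```python
from collections import defaultdict


def parse_kv_line(line):
    """Parse a 'key=value key=value ...' line into a dict of lists.

    Each key maps to a list of its values in order of appearance, so
    repeated keys are preserved rather than overwritten.
    """
    fields = defaultdict(list)
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            # Not a key=value token; skip it (assumption: such tokens
            # are noise rather than part of a preceding value).
            continue
        fields[key].append(value)
    return dict(fields)


def extract(line, keys):
    """Return the value lists for several keys at once, in the order asked.

    Keys that do not appear in the line yield an empty list.
    """
    fields = parse_kv_line(line)
    return [fields.get(key, []) for key in keys]
```

With a hypothetical line such as 'from=a@example.org to=b@example.com to=c@example.com score=9.5', calling extract(line, ["from", "to"]) gives back both the sender and every recipient in one pass, which is the case where tr and awk start to get awkward.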

spam/ForgedFromSelf-2012-02-26 written at 01:45:49
