Wandering Thoughts

2015-05-25

Email providers cannot stop spam by scanning outgoing email

One of the things that Amazon SES advertises that it (usually) does is that it scans the outgoing email that people send through it to block spam. This sounds great and certainly should mean that Amazon SES emits very low levels of spam, right? Well, no, not so fast. Unfortunately, no outgoing mail scanning on a service like this can eliminate spam. All it can do is stop certain sorts of obvious spam. This is intrinsic in the definition of 'spam' and the limitations of what a mail sending system like Amazon SES does.

Essentially perfect content scanning can tell you two things: whether the email has markers of known types of spam, such as phish, advance fee fraud, malware distribution, and so on, and whether the email will be scored as spam by however many spam scoring systems you can get your hands on the rules for. These are undeniably useful things to know (provided that you act on them), but messages that fail these tests are far from the only sorts of spam. In particular, basically all sorts of advertising and marketing emails cannot be blocked by such a system because what makes these messages spam is not their content, it's that they are unsolicited (cf, cf).

The only way to even theoretically tell whether a message is solicited or unsolicited is to control not just the sending of outgoing email but the process of choosing destination email addresses. If you only scan messages but don't control addresses, you have very little choice but to believe the sender when they tell you 'honest, all of these addresses want this email'. And then the marketing department of everyone and sundry descends on Amazon SES with their list of leads and prospects and people to notify about their very special whatever it is that of course everyone will be interested in, and then Amazon SES is sending spam.

(Or the marketing people buy 'qualified email addresses' from spam providers because why not, you could get lucky.)

There is absolutely nothing content filtering can do about this. Nothing. You could have a strong AI reading the messages and it wouldn't be able to stop all of the UBE.

(I wrote a version of this as a comment reply on my Amazon SES entry but I've decided it's an important enough point to state and elaborate in an entry.)

spam/OutgoingScanningLimitation written at 00:15:37; Add Comment

2015-05-24

A mod_wsgi problem with serving both HTTP and HTTPS from the same WSGI app

This is kind of a warning story. It may not be true any more (I believe that I ran into this back in 2013, probably with a 3.x version of mod_wsgi), but it's probably representative of the kind of things that you can run into with Python web apps in an environment that mixes HTTP and HTTPS.

Once upon a time I tried converting my personal site from lighttpd plus a CGI based lashup for DWiki to Apache plus mod_wsgi serving DWiki as a WSGI application. At the time I had not yet made the decision to push all (or almost all) of my traffic from HTTP to HTTPS; instead I decided to serve both HTTP and HTTPS alongside each other. The WSGI configuration I set up for this was what I felt was pretty straightforward. Outside of any particular virtual host stanza, I defined a single WSGI daemon process for my application and said to put everything in it:

WSGIDaemonProcess cspace2 user=... processes=15 threads=1 maximum-requests=500 ...
WSGIProcessGroup cspace2

Then in each of the HTTP and HTTPS versions of the site I defined appropriate Apache stuff to invoke my application in the already defined WSGI daemon process. This was exactly the same in both sites, because the URLs and everything were the same:

WSGIScriptAlias /space ..../cspace2.wsgi
<Directory ...>
   WSGIApplicationGroup cspace2
   ...

(Yes, this is by now old syntax and may have been old even back at the time; today you'd specify the process group and/or the application group in the WSGIScriptAlias directive.)

This all worked and I was happy. Well, I was happy for a while. Then I noticed that sometimes my HTTPS site was serving URLs that had HTTP URLs in links and vice versa. In fact, what was happening is that some of the time the application was being accessed over HTTPS but thought it was using HTTP, and sometimes it was the other way around. I didn't go deep into diagnosis because other factors intervened, but my operating hypothesis was that when a new process was forked off and handled its first request it then latched whichever of HTTP or HTTPS the request had been made through and used that for all of the remaining requests it handled.

(This may have been related to my mistake about how a WSGI app is supposed to find out about HTTP versus HTTPS.)
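
To illustrate the kind of latching I mean, here is a deliberately simplified and hypothetical sketch (this is not what DWiki actually does): if an application computes something like its base URL only once per process, from whatever first request a freshly forked daemon process happens to see, every later request that process serves inherits that scheme regardless of how it actually arrived.

# Hypothetical sketch of per-process 'latching' of the URL scheme.
_base_url = None    # module-level cache, shared by every request in this process

def get_base_url(environ):
    global _base_url
    if _base_url is None:
        scheme = environ.get('wsgi.url_scheme', 'http')
        _base_url = "%s://%s" % (scheme, environ.get('HTTP_HOST', 'localhost'))
    return _base_url

def application(environ, start_response):
    body = "All links on this page will start with %s\n" % get_base_url(environ)
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [body]

With maximum-requests=500 in the picture, each process would even appear to 'fix itself' every so often as it got recycled and latched whatever its next first request happened to be, which fits the intermittent behavior I was seeing.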

This taught me a valuable lesson about mixing WSGI daemon processes and so on across different contexts, which is that I probably don't want to do that. It's tempting, because it reduces the number of total WSGI related processes that are rattling around my systems, but even apart from Unix UID issues it's clear that mod_wsgi has a certain amount of mixture across theoretically separate contexts. Even if this is a now-fixed mod_wsgi issue, well, where there's one issue there can be more. As I've found out myself, keeping things carefully separate is hard work and is prone to accidental slipups.

(It's also possible that this is a mod_wsgi configuration mistake on my part, which I can believe; I'm not entirely sure I understand the distinction between 'process group' and 'application group', for example. The possibility of such configuration mistakes is another reason to keep things as separate as possible in the future.)

python/ModWsgiDualSchemaProblem written at 01:04:59; Add Comment

2015-05-23

The right way for your WSGI app to know if it's using HTTPS

Suppose that you have a WSGI application that's running under Apache, either directly as a CGI-BIN through some lashup or perhaps through an (old) version of mod_wsgi (such as Django on an Ubuntu 12.04 host, which has mod_wsgi version 3.3). Suppose that you want to know if you're being invoked via a HTTPS URL, either for security purposes or for your own internal reasons (for example, you might need separate page caches for HTTP versus HTTPS requests). What is the correct way to do this?

If you're me, for a long time you do the obvious thing; you look at the HTTPS environment variable that your WSGI application inherits from Apache (or the web server of your choice, if you're also running things under an alternative). If it has the value on or sometimes 1, you've got a HTTPS connection; if it doesn't exist or has some other value, you don't.

As I learned recently by reading some mod_wsgi release notes, this is in practice wrong (and probably wrong even in theory). What I should be doing is checking wsgi.url_scheme from the (WSGI) environment to see if it was "https" or "http". Newer versions of mod_wsgi explicitly strip the HTTPS environment variable and anyways, as the WSGI PEP makes clear, including a HTTPS environment variable was always a 'maybe' thing.
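
In code the difference is small. A minimal sketch, assuming environ is the per-request WSGI environment dictionary your application was called with:

def is_https_old_way(environ):
    # Relies on the optional, CGI-inherited HTTPS variable, which newer
    # versions of mod_wsgi deliberately strip out.
    return environ.get('HTTPS', '').lower() in ('on', '1')

def is_https_right_way(environ):
    # wsgi.url_scheme is a mandatory part of the WSGI environment, so
    # this works regardless of what the web server passes along.
    return environ.get('wsgi.url_scheme') == 'https'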

(You can argue that mod_wsgi is violating the spirit of the 'should' in the PEP here, but I'm sure it has its reasons for this particular change.)

Not using wsgi.url_scheme was always kind of conveniently lazy; I was pretending that WSGI was still basically a CGI-BIN environment when it's not really. I always should have been preferring wsgi.* environment variables where they were available, and wsgi.url_scheme has always been there. But I change habits slowly when nothing smacks me over the nose about them.

(This may have been part of a mod_wsgi issue I ran into at one point, but that's another entry.)

python/WSGIandCheckingHTTPS written at 00:35:54; Add Comment

2015-05-22

Unsurprisingly, Amazon is now running a mail spamming service

I recently got email from an amazonses.com machine, cheerfully sending me a mailing list message from some random place that desperately wanted me to know about their thing. It was, of course, spam, which means that Amazon is now in the business of running a mail spamming service. Oh, Amazon doesn't call what they're running a mail spamming service, but in practice that's what it is.

For those that have not run into it, amazonses.com is 'Amazon Simple Email Service', where Amazon carefully sends out email for you in a way that is designed to get as much of it delivered as possible and to let you wash annoying people who complain out of your lists as effectively as possible (which probably includes forwarding complaints from those people to you, which is something that has historically caused serious problems for people who file complaints due to spammer retaliation). I translate from the marketing language on their website, of course.

In the process of doing this amazonses.com sends from their own IP address space, using their own HELO names, their own domain name, and completely opaque sender envelope address information. Want to get some email sent through amazonses.com but not the email from spammers you've identified? You're plain out of luck at the basic SMTP level; your only option is to parse the actual message during the DATA phase and look for markers. Of course this helps spammers, since they get a free ride on the fact that you may not be able to block amazonses.com email in general.

I'm fairly sure that Amazon does not deliberately want to run a mail spamming service. It's just that, as usual, not running a mail spamming service would cost them too much money and too much effort and they are in a position to not actually care. So everyone else gets to lose. Welcome to the modern Internet email environment, where receiving email from random strangers to anything except disposable email addresses gets to be more and more of a problem every year.

(As far as I can tell, Amazon does not even require you to use their own mailing list software, managed by Amazon so that Amazon can confirm subscriptions and monitor things like that. You're free to roll your own mail blast software and as far as I can tell my specific spammer did.)

spam/AmazonSpammingService written at 01:04:18; Add Comment

2015-05-21

It's time for me to stop using lighttpd

There's another SSL configuration vulnerability going around; this one is called Logjam (also). Part of the suggested fixes for it is to generate your own strong Diffie-Hellman group instead of using one of the default groups, and of course another fix is yet more SSL parameter fiddling. There have been quite a lot of SSL/TLS related issues lately, and many of them have required SSL parameter fiddling at least in the short term.
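
Generating your own Diffie-Hellman group is at least mechanically simple; the fiddly part is getting your particular web server and version to actually use it. A sketch, with a made-up path:

# Generate a custom 2048-bit DH group (this can take a while):
openssl dhparam -out /etc/ssl/private/dhparams.pem 2048

How you then point the web server at it varies; I believe sufficiently recent Apache versions will pick up DH parameters appended to the end of the SSLCertificateFile, while other servers have explicit directives for it.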

I've had a long-standing flirtation with lighttpd and my personal site has used it since the start. But this latest SSL issue has crystallized something I've been feeling for a while, which is that lighttpd has not really been keeping up with the SSL times. Lighttpd can't configure or de-configure a number of things that people want to change; for example, it has no option to disable TLS v1.0 or SSL compression (although the latter is probably off in OpenSSL by now). OCSP stapling? You can forget it (from all appearances). In general, the last release of lighttpd 1.4.x was a year ago, which is an eternity in SSL best practices land.

For a while now I've been telling people when they asked me that I couldn't recommend lighttpd for new deployments if they cared about SSL security at all. Since I increasingly care about SSL myself, it's really time for me to follow my own advice and move away from lighttpd to something else (Apache is the most likely candidate, despite practical annoyances in my environment). It'll be annoying, but in the long run it will be good for me. I'll have a SSL configuration that I have much more trust in and that is much better supported by common resources like Mozilla's SSL configuration generator and configuration guidelines.

There's certainly a part of me that regrets this, since lighttpd is a neat little idea and Apache is kind of a hulking monstrosity. But in practice, what matters on the Internet is that unmaintained software decays. Lighttpd is in practice more or less unmaintained, while Apache is very well maintained (partly because so many people use it).

(Initially I was going to write that dealing with Logjam would push me over the edge right away, but it turns out that the Logjam resources page actually has settings for lighttpd for once.)
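
(Those lighttpd settings are presumably along the general lines of pointing it at your own DH group and pinning the cipher list, something like the following, assuming a 1.4.x release new enough to have ssl.dh-file:

ssl.dh-file = "/etc/ssl/private/dhparams.pem"
ssl.cipher-list = "..."     # your preferred strong cipher list goes here

I haven't verified this against the Logjam page's exact recommendations.)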

web/AbandoningLighttpd written at 01:08:44; Add Comment

2015-05-20

On the modern web, ISPs are one of your threats

Once upon a time, it was possible to view the Internet as a generally benevolent place as far as your traffic was concerned. Both passive eavesdroppers and man in the middle attacks were uncommon and generally took aggressive attackers to achieve (although it could be done). Eavesdropping attacks were things you mostly worried about on (public) wifi or unusual environments like conference networks.

I am afraid that those days are long over now. On the modern Internet, ISPs themselves are one of your threats (both your ISP and other people's ISPs). ISPs routinely monitor traffic, intercept traffic, modify traffic on the fly both for outgoing requests (eg) and for incoming replies from web servers ('helpfully' injecting hostile JavaScript and HTML into pages is now commonplace), and do other malfeasance. To a certain extent this is more common on mobile Internet than on good old fashioned fixed Internet, but this is not particularly reassuring; an increasing amount of traffic is from mobile devices, and ISPs are or will be adding this sort of stuff to fixed Internet as well because it makes them more money and they like cash.

(See for example the catalog of evil things various ISPs are doing laid out in We're Deprecating HTTP And It's Going To Be Okay (via). Your ISP is no longer your friend.)

The only remedy that the Internet has for this today is strong encryption, with enough source authentication that ISPs cannot shove themselves in the middle without drastic actions. This is fundamentally why it's time for HTTP-only software to die; the modern Internet strongly calls for HTTPS.

This is a fundamental change in the Internet and not a welcome one. But reality is what it is and we get to deal with the Internet we have, not the Internet we used to have and we'd like to still have. And when we're building things that will be used on today's Internet it behooves us to understand what sort of a place we're really dealing with and work accordingly, not cling to a romantic image from the past of a friendlier place.

(If we do nothing and keep naively building for a nicer Internet that no longer exists, it's only going to get worse.)

web/ISPsAreThreats written at 02:07:09; Add Comment

2015-05-19

Converting filesystems from ext3 to ext4, and concerns attached to it (plus bad news for me)

Yesterday I covered why I basically need to move from ext3 to ext4 and I said that the mechanisms of this transition raise a few issues. So let's talk about them, and in the process I'll wind up with disappointing news for my own case. Actually, let's lead with the disappointing news:

An ext3 filesystem converted to ext4 almost certainly won't support ext4's nanosecond file timestamps; it will only have ext3 one-second ones.

On the surface the mechanics of converting from ext3 to ext4 are simple and relatively straightforward, with two levels. The first level is simply to mount filesystems as ext4 instead of ext3. This enables backwards-compatible filesystem features and is fully reversible (at least according to the sources I've read). As a result it ought to be as safe as ext4 itself, and these days that's pretty safe. However this only gives you features like delayed allocation (which some people consider a non-feature for complex reasons).

(Checksumming the journal may be one of the things you get here; it's not clear to me.)
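
That first level really is just a one-word change per filesystem in /etc/fstab (the device and mount point here are made up):

# Before:
#   /dev/vg0/home   /home   ext3   defaults,noatime   1 2
# After:
/dev/vg0/home   /home   ext4   defaults,noatime   1 2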

To fully convert a filesystem to ext4, one must also use tune2fs to enable the ext4-specific filesystem features that affect the on disk format of the filesystem. Various sources (eg the ext4 wiki) say that this is:

tune2fs -O extents,uninit_bg,dir_index /dev/<WHAT>

(In theory dir_index is also an ext3 feature, but on the other hand at least some of my ext3 filesystems don't have it. They may have started out as ext2 filesystems.)

However, this is not the full set of feature differences between ext3 and ext4 today. The Fedora 21 /etc/mke2fs.conf adds huge_file, flex_bg, dir_nlink, and extra_isize, and the tune2fs manpage says that all of them can be set on an existing filesystem (although flex_bg and extra_isize probably have no meaningful effect). Per the tune2fs manpage and various instructions on ext3 to ext4 conversions, setting uninit_bg requires an e2fsck and setting dir_index is helped by doing an e2fsck -D.
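
Put together, the usual instructions amount to something like the following sketch, done with the filesystem unmounted and with current backups (the device name is made up):

umount /home
tune2fs -O extents,uninit_bg,dir_index /dev/vg0/home
# setting uninit_bg means the filesystem must be checked before reuse,
# and -D rebuilds the directory indexes for dir_index:
e2fsck -fD /dev/vg0/home
mount /home     # with the fstab entry now saying ext4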

Now we get to my concerns. The first concern is straightforward; at least some 'how to convert from ext3 to ext4' sources contain prominent 'back up your data just in case' warnings. These sources are also relatively old ones and I suspect that they date from the days when ext4 and this process were new and thus uncertain. Notably, the ext4 wiki itself doesn't have any such cautions these days. Still, it's a bit alarming.

The second concern is whether this is actually going to do me any good. My entire purpose for doing this is to get support for sub-second file timestamps, because that's basically the only user visible ext3 versus ext4 difference (and software is noticing). The existing documentation is not clear on whether ext4 automatically starts doing this on an existing ext3 filesystem, if you have to add the extra_isize feature, or if a converted ext3 filesystem simply can't do this because its existing (and future) inodes are too small to hold the extra data.

Unfortunately the news here is not good. Additional sources (this description of timestamps in ext4, this SA answer) say that the high resolution timestamps definitely go in the extra space in normal ext4 inodes, space that my ext3 filesystem inodes just don't have (ext3 defaults to 128 byte inodes, ext4 to 256 byte ones). Converting a filesystem from ext3 to ext4 does not enlarge its inodes, which means that a normal ext3 filesystem converted to ext4 will never support high resolution timestamps.

Given this, there's not much point in putting myself through this in-place conversion effort (in fact converting my filesystems would be silently misleading). The only way I'd ever get better timestamps would be to make new ext4 filesystems from scratch, copy all the current data into them, and then replace my existing filesystem mounts with mounts of the new ext4 filesystems. Among other things, this is a pain in the rear.

(It's achievable, at least in theory, as my LVM pool has plenty of unused space. It would mean that I'd only 'convert' a couple of filesystems, the ones most likely to run into high resolution file timestamp issues.)

PS: You can see how big your ext3 inodes are with 'tune2fs -l /dev/<WHAT>'; it's reported near the end. This will also let you see what features are set on your filesystem (and thus let you know that your ext3 filesystems are without dir_index).
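
For example, with a made-up device name:

tune2fs -l /dev/vg0/home | egrep -i 'inode size|filesystem features'

An inode size of 128 bytes means you're stuck with one-second timestamps even after conversion; 256 bytes (the current ext4 default) has room for the extra timestamp data.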

linux/Ext3ToExt4Limitation written at 01:14:56; Add Comment

2015-05-17

Why I'm interested in converting my ext3 filesystems to ext4

My home machine has a sufficiently old set of filesystems that many of my actively used filesystems are still ext3, not ext4, including both my home directory and where I keep code. Normally this isn't something that I particularly think or worry about; it's not like ext4 is a particularly radical advance from ext3 (certainly not the same sort of jump that was ext2 to ext3, where you got fast crash recovery). As a sysadmin I'm generally cautious with filesystem choices anyways (at least when I'm not being radical); I used ext2 over ext3 for years after the latter came out, for example, on the principle that I'd let other people find the problems.

It turns out that there is one important thing that ext4 has and ext3 does not: ext4 has sub-second file timestamps, while ext3 only does timestamps to the nearest second. Modern machines are fast enough that nearest second timestamps are increasingly not really good enough when building software or otherwise doing things that care about relative file timestamps and 'is X more recent than Y'. Oh, sure, it works most of the time, but every so often things go wrong or you find assumptions buried in other people's software.

Most people don't notice these things because most people are now using filesystems that support sub-second file timestamps (which is almost all modern Linux filesystems). What this tells me is that I'm increasingly operating in an unusual and effectively unsupported environment by continuing to use ext3. As time goes by, more and more software is likely to assume sub-second file timestamps basically by default (because the authors have never run it on a system without them) and not work quite right in various ways. I can fight a slow battle against what is effectively a new standard of sub-second file timestamps, or I can give in and convert my ext3 filesystems to ext4. It's not like ext4 is exactly a new filesystem these days, after all (Wikipedia dates it to 2008).
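
A quick way to see which world you're in (the path is made up and the output format is GNU coreutils stat):

touch /home/you/testfile
stat -c '%y' /home/you/testfile

On ext3 the fractional seconds always come back as '.000000000'; on a filesystem with sub-second timestamps you'll normally see real digits there.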

The mechanics of this conversion raise a few issues, but that's something for another entry.

linux/Ext3ToExt4WhyConvert written at 23:51:58; Add Comment

A bit more on the ZFS delete queue and snapshots

In my entry on ZFS delete queues, I mentioned that a filesystem's delete queue is captured in snapshots and so the space used by pending deletes is held by snapshots. A commentator then asked:

So in case someone uses zfs send/receive for backup he accidentially stores items in the delete queue?

This is important enough to say explicitly: YES. Absolutely. Since it's part of a snapshot, the delete queue and all of the space it holds will be transferred if you use zfs send to move a filesystem snapshot elsewhere for whatever reason. Full backups, incremental backups, migrating a filesystem, they all copy all of the space held by the delete queue (and then keep it allocated on the received side).

This has two important consequences. The first is that if you transfer a filesystem with a heavy space loss due to things being held in the delete queue for whatever reason, you can get a very head-scratching result. If you don't actually mount the received dataset you'll wind up with a dataset that claims to have all of its space consumed by the dataset, not snapshots, but if you 'zfs destroy' the transfer snapshot the dataset promptly shrinks. Having gone through this experience myself, this is a very WAT moment.

The second important consequence is that apparently the moment you mount the received dataset, the current live version will immediately diverge from the snapshot (because ZFS wakes up, says 'ah, a delete queue with no live references', and applies all of those pending deletes). This is a problem if you're doing repeated incremental receives, because the next incremental receive will tell you 'filesystem has diverged from snapshot, you'll have to tell me to force a rollback'. On the other hand, if ZFS space accounting is working right this divergence should transfer a bunch of the space the filesystem is consuming into the usedbysnapshots category. Still, this must be another head-scratching moment, as just mounting a filesystem suddenly causes a (potentially big) swing in space usage and a divergence from the snapshot.
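
If you want to watch this happen, the 'usedby*' space breakdown is the thing to look at before and after mounting the received dataset (the dataset name is made up):

zfs get used,usedbydataset,usedbysnapshots somepool/received-fs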

(I have not verified this mounting behavior myself, but in retrospect it may be the cause of some unexpected divergences we've experienced while migrating filesystems. Our approach was always just to use 'zfs recv -F ...', which is perfectly viable if you're really sure that you're not blowing your own foot off.)

solaris/ZFSDeleteQueueSnapshots written at 00:45:15; Add Comment

2015-05-15

Your Illumos-based NFS fileserver may be 'leaking' deleted files

By now you may have guessed the punchline of my sudden interest in ZFS delete queues: we had a problem with ZFS leaking space for deleted files that was ultimately traced down to an issue with pending deletes that our fileserver wasn't cleaning up when it should have been.

As a well-debugged filesystem, ZFS should not outright leak pending deletions, where there are no remaining references anywhere yet the files haven't been cleaned up (well, more or less; snapshots come into the picture, as mentioned). However it's possible for both user-level and kernel-level things to hold references to now-deleted files in the traditional way and thus keep them from being actually removed. User-level things holding open files should be visible in, eg, fuser, and anyways this is a well-known issue that savvy people will immediately ask you about. Kernel level things may be less visible, and there is at least one in mainline Illumos and thus OmniOS r151014 (the current release as I write this entry).

Per George Wilson on the illumos-zfs mailing list here, Delphix found that the network lock manager (the nlockmgr SMF service) could hold references to (deleted) files under some circumstances (see the comment in their fix). Under the right circumstances this can cause significant space lossage over time; we saw loss rates of 5 GB a week. This is worked around by restarting nlockmgr; this restart drops the old references and thus allows ZFS to actually remove the files and free up potentially significant amounts of your disk space. Rebooting the whole server will do it too, for obvious reasons, but is somewhat less graceful.

(Restarting nlockmgr is said to be fully transparent to clients, but we have not attempted to test that. When we did our nlockmgr restart we did as much as possible to make any locking failures a non-issue.)
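
For the record, the restart itself is a single SMF operation; on a stock install the full FMRI should be:

svcadm restart svc:/network/nfs/nlockmgr:default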

As far as I know there is no kernel-level equivalent of fuser, so that you could list, say, all currently active kernel-level references to files in a particular filesystem (never mind finding out what kernel subsystem is holding such references). I'd love to be wrong here; it's an annoying gap in Illumos's observability.

solaris/ZFSDeleteQueueNLMLeak written at 23:29:05; Add Comment

The ZFS delete queue: ZFS's solution to the pending delete problem

Like every other always-consistent filesystem, ZFS needs a solution to the Unix pending delete problem (files that have been deleted on the filesystem but that are still in use). ZFS's solution is implemented with a type of internal ZFS object called the 'ZFS delete queue', which holds a reference to any and all ZFS objects that are pending deletion. You can think of it as a kind of directory (and technically it's implemented with the same underlying storage as directories are, namely a ZAP store).

Each filesystem in a ZFS pool has its own ZFS delete queue object, holding pending deletes for objects that are in (or were originally in) that filesystem. Also, each snapshot has a ZFS delete queue as well, because the current state of a filesystem's ZFS delete queue is captured as part of making a snapshot. This capture of delete queues in snapshots has some interesting consequences; the short version is that once a delete queue with entries is captured in a snapshot, the space used by those pending deleted objects cannot be released until the snapshot itself is deleted.

(I'm not sure that this space usage is properly accounted for in the 'usedby*' space usage properties, but I haven't tested this specifically.)

There is no simple way to find out how big the ZFS delete queue is for a given filesystem. Instead you have to use the magic zdb command to read it out, using 'zdb -dddd DATASET OBJNUM' to dump details of individual ZFS objects so that you can find out how many ZAP entries a filesystem's 'ZFS delete queue' object has; the number of current ZAP entries is the number of pending deletions. See the sidebar for full details, because it gets long and tedious.

(In some cases it will be blatantly obvious that you have some sort of problem because df and 'zfs list' and so on report very different space numbers than eg du does, and you don't have any of the usual suspects like snapshots.)

Things in the ZFS delete queue still count in and against per-user and per-group space usage and quotas, which makes sense because they're still not quite deleted. If you use 'zfs userspace' or 'zfs groupspace' for space tracking and reporting purposes this can result in potentially misleading numbers, especially if pending deletions are 'leaking' (which can happen). If you actually have and enforce per-user or per-group quotas, well, you can wind up with users or groups that are hitting quota limits for no readily apparent reason.
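
A quick way to see the per-user and per-group numbers that include these pending deletions (the dataset name is made up):

zfs userspace -o used,quota,name somepool/home/fs
zfs groupspace -o used,quota,name somepool/home/fs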

(Needing to add things to the ZFS delete queue has apparently caused problems on full filesystems at least in the past, per this interesting opensolaris discussion from 2006.)

Sidebar: A full example of finding how large a ZFS delete queue is

To dump the ZFS delete queue for a filesystem, first you need to know what its object number is; this is usually either 2 (for sufficiently old filesystems) or 3 (for newer ones), but the sure way to find out is to look at the ZFS master node for the filesystem (which is always object 1). So to start with, we'll dump the ZFS master node to find out the object number of the delete queue.

# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         1    1    16K     1K     8K     1K  100.00  ZFS master node
[...]
        microzap: 512 bytes, 3 entries

                DELETE_QUEUE = 2 
[...]

The object number of this filesystem's delete queue is 2 (it's an old filesystem, having been originally created on Solaris 10). So we can dump the ZFS delete queue:

# zdb -dddd fs3-corestaff-01/h/281 2
Dataset [...]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         2    2    16K    16K   144K   272K  100.00  ZFS delete queue
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 16
        Fat ZAP stats:
[...]
                ZAP entries: 5
[...]
                3977ca = 3766218 
                3977da = 3766234 
                397a8b = 3766923 
                397a87 = 3766919 
                397840 = 3766336 

(The final list here is the ZAP entries themselves, going from some magic key (on the left) to the ZFS object numbers on the right. If we wanted to, we could use these object numbers to inspect (or even read out) the actual things that are pending deletion. This is probably most useful to find out how large they are and thus how much space they should be consuming.)

There are two different forms of ZAPs and zdb reports how many entries they have somewhat differently. In the master node we saw a 'microzap', used when the ZAP is and always has been small. Here we see a 'Fat ZAP', which is what a small ZAP turns into if at some point it grows big enough. Once the ZFS delete queue becomes a fat ZAP it stays that way even if it later only has a few entries, as we see here.

In this case the ZFS delete queue for this filesystem holds only five entries, which is not particularly excessive or alarming. Our problem filesystem had over ten thousand entries by the time we resolved the issue.

PS: You can pretty much ignore the summary line with its pretty sizes; as we see here, they have very little to do with how many delete queue entries you have right now. A growing ZFS delete queue size may be a problem indicator, but here the only important thing in the summary is the type field, which confirms that we have the right sort of objects both for the ZFS master node and the ZFS delete queue.

PPS: You can also do this exercise for snapshots of filesystems; just use the full snapshot name instead of the filesystem.

(I'm not going to try to cover zdb usage details at all, partly because I'm just flailing around with it. See Ben Rockwood's zdb: Examining ZFS At Point-Blank Range for one source of more information.)

solaris/ZFSDeleteQueue written at 23:28:11; Add Comment

The pending delete problem for Unix filesystems

Unix has a number of somewhat annoying filesystem semantics that tend to irritate designers and implementors of filesystems. One of the famous ones is that you can delete a file without losing access to it. On at least some OSes, if your program open()s a file and then tries to delete it, either the deletion fails with 'file is in use' or you immediately lose access to the file; further attempts to read or write it will fail with some error. On Unix your program retains access to the deleted file and can even pass this access to other processes in various ways. Only when the last process using the file closes it will the file actually get deleted.
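
A minimal Python demonstration of these semantics:

import os, tempfile

# Make a file, keep it open, then delete it by name.
fd, path = tempfile.mkstemp()
os.write(fd, b"still here\n")
os.unlink(path)              # the name is now gone from the filesystem...
os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 100))      # ...but the open descriptor still reads the data
os.close(fd)                 # only now can the space actually be freed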

This 'use after deletion' presents Unix and filesystem designers with the problem of how you keep track of this in the kernel. The historical and generic kernel approach is to keep both a link count and a reference count for each active inode; an inode is only marked as unused and the filesystem told to free its space when both counts go to zero. Deleting a file via unlink() just lowers the link count (and removes a directory entry); closing open file descriptors is what lowers the reference count. This historical approach ignored the possibility of the system crashing while an inode had become unreachable through the filesystem and was only being kept alive by its reference count; if this happened the inode became a zombie, marked as active on disk but not referred to by anything. To fix it you had to run a filesystem checker, which would find such no-link inodes and actually deallocate them.

(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)

Obviously this is not suitable for any sort of journaling or 'always consistent' filesystem that wants to avoid the need for a fsck after unclean shutdowns. All such filesystems must keep track of such 'deleted but not deallocated' files on disk using some mechanism (and the kernel has to support telling filesystems about such inodes). When the filesystem is unmounted in an orderly way, these deleted files will probably get deallocated. If the system crashes, part of bringing the filesystem up on boot will be to apply all of the pending deallocations.

Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.

(Asynchronous deallocation is popular because it means your process can unlink() a big file without having to stall while the kernel frantically runs around finding all of the file's data blocks and then marking them all as free. Given that finding out what a file's data blocks are often requires reading things from disk, such deallocations can be relatively slow under disk IO load (even if you don't have other issues there).)

PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).

unix/UnixPendingDeleteProblem written at 01:02:45; Add Comment
