Email providers cannot stop spam by scanning outgoing email
One of the things that Amazon SES advertises is that it (usually) scans the outgoing email that people send through it to block spam. This sounds great and certainly should mean that Amazon SES emits very low levels of spam, right? Well, no, not so fast. Unfortunately, no outgoing mail scanning on a service like this can eliminate spam. All it can do is stop certain sorts of obvious spam. This is intrinsic in the definition of 'spam' and the limitations of what a mail sending system like Amazon SES does.
Essentially perfect content scanning can tell you two things: whether the email has markers of known types of spam, such as phish, advance fee fraud, malware distribution, and so on, and whether the email will be scored as spam by however many spam scoring systems you can get your hands on the rules for. These are undeniably useful things to know (provided that you act on them), but messages that fail these tests are far from the only sorts of spam. In particular, basically all sorts of advertising and marketing emails cannot be blocked by such a system because what makes these messages spam is not their content, it's that they are unsolicited (cf, cf).
The only way to even theoretically tell whether a message is solicited or unsolicited is to control not just the sending of outgoing email but the process of choosing destination email addresses. If you only scan messages but don't control addresses, you have very little choice but to believe the sender when they tell you 'honest, all of these addresses want this email'. And then the marketing department of everyone and sundry descends on Amazon SES with their list of leads and prospects and people to notify about their very special whatever it is that of course everyone will be interested in, and then Amazon SES is sending spam.
(Or the marketing people buy 'qualified email addresses' from spam providers because why not, you could get lucky.)
There is absolutely nothing content filtering can do about this. Nothing. You could have a strong AI reading the messages and it wouldn't be able to stop all of the UBE.
(I wrote a version of this as a comment reply on my Amazon SES entry but I've decided it's an important enough point to state and elaborate in an entry.)
A mod_wsgi problem with serving both HTTP and HTTPS from the same WSGI app
This is kind of a warning story. It may not be true any more (I believe that I ran into this back in 2013, probably with a 3.x version of mod_wsgi), but it's probably representative of the kind of things that you can run into with Python web apps in an environment that mixes HTTP and HTTPS.
Once upon a time I tried converting my personal site from lighttpd plus a CGI based lashup for DWiki to Apache plus mod_wsgi serving DWiki as a WSGI application. At the time I had not yet made the decision to push all (or almost all) of my traffic from HTTP to HTTPS; instead I decided to serve both HTTP and HTTPS alongside each other. The WSGI configuration I set up for this was what I felt was pretty straightforward. Outside of any particular virtual host stanza, I defined a single WSGI daemon process for my application and said to put everything in it:
WSGIDaemonProcess cspace2 user=... processes=15 threads=1 maximum-requests=500 ...
WSGIProcessGroup cspace2
Then in each of the HTTP and HTTPS versions of the site I defined appropriate Apache stuff to invoke my application in the already defined WSGI daemon process. This was exactly the same in both sites, because the URLs and everything were the same:
WSGIScriptAlias /space ..../cspace2.wsgi
<Directory ...>
  WSGIApplicationGroup cspace2
  ...
(Yes, this is what is by now old syntax and may have been old even back at the time; today you'd specify the process group and/or the application group in the WSGIScriptAlias directive.)
This all worked and I was happy. Well, I was happy for a while. Then I noticed that sometimes my HTTPS site was serving URLs that had HTTP URLs in links and vice versa. In fact, what was happening is that some of the time the application was being accessed over HTTPS but thought it was using HTTP, and sometimes it was the other way around. I didn't go deep into diagnosis because other factors intervened, but my operating hypothesis was that when a new process was forked off and handled its first request it then latched whichever of HTTP or HTTPS the request had been made through and used that for all of the remaining requests it handled.
(This may have been related to my mistake about how a WSGI app is supposed to find out about HTTP versus HTTPS.)
This taught me a valuable lesson about mixing WSGI daemon processes and so on across different contexts, which is that I probably don't want to do that. It's tempting, because it reduces the number of total WSGI related processes that are rattling around my systems, but even apart from Unix UID issues it's clear that mod_wsgi has a certain amount of mixture across theoretically separate contexts. Even if this is a now-fixed mod_wsgi issue, well, where there's one issue there can be more. As I've found out myself, keeping things carefully separate is hard work and is prone to accidental slipups.
(It's also possible that this is a mod_wsgi configuration mistake on my part, which I can believe; I'm not entirely sure I understand the distinction between 'process group' and 'application group', for example. The possibility of such configuration mistakes is another reason to keep things as separate as possible in the future.)
The right way for your WSGI app to know if it's using HTTPS
Suppose that you have a WSGI application that's running under Apache, either directly as a CGI-BIN through some lashup or perhaps through an (old) version of mod_wsgi (such as Django on an Ubuntu 12.04 host, which has mod_wsgi version 3.3). Suppose that you want to know if you're being invoked via a HTTPS URL, either for security purposes or for your own internal reasons (for example, you might need separate page caches for HTTP versus HTTPS requests). What is the correct way to do this?
If you're me, for a long time you do the obvious thing; you look
at the HTTPS environment variable that your WSGI application
inherits from Apache (or the web server of your choice, if you're
also running things under an alternate). If it has the value
'on' or sometimes '1', you've got a HTTPS connection; if it
doesn't exist or has some other value, you don't.
As I learned recently by reading some mod_wsgi release notes,
this is in practice wrong (and probably wrong even in theory). What
I should be doing is checking wsgi.url_scheme from the (WSGI)
environment to see if it is "https". Newer versions
of mod_wsgi explicitly strip the HTTPS environment variable,
and anyways, as the WSGI PEP makes clear, including a
HTTPS environment variable was always only a 'should', not a requirement.
(You can argue that mod_wsgi is violating the spirit of the 'should' in the PEP here, but I'm sure it has its reasons for this particular change.)
Not using wsgi.url_scheme was always kind of conveniently lazy;
I was pretending that WSGI was still basically a CGI-BIN environment
when it's not really. I always should have been preferring WSGI
environment values where they were available, and wsgi.url_scheme
has always been there. But I change habits slowly when nothing
smacks me over the nose about them.
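As a concrete sketch (a minimal WSGI application of my own invention, not anyone's production code), the right check looks like this:

```python
def app(environ, start_response):
    # wsgi.url_scheme is the WSGI-defined way to learn the URL scheme;
    # unlike a CGI-style HTTPS variable, it is guaranteed to be present.
    is_https = environ.get('wsgi.url_scheme') == 'https'
    body = b'served over HTTPS' if is_https else b'served over HTTP'
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [body]
```

The same environ dictionary is handed to you by any WSGI server, so this works identically under mod_wsgi, gunicorn, or a CGI adapter.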
(This may have been part of a mod_wsgi issue I ran into at one point, but that's another entry.)
Unsurprisingly, Amazon is now running a mail spamming service
I recently got email from an
amazonses.com machine, cheerfully
sending me a mailing list message from some random place that
desperately wanted me to know about their thing. It was, of course,
spam, which means that Amazon is now in the business of running a
mail spamming service. Oh, Amazon doesn't call what they're running
a mail spamming service, but in practice that's what it is.
For those that have not run into it, amazonses.com is 'Amazon Simple Email Service', where Amazon carefully sends out email for you in a way that is designed to get as much of it delivered as possible and to let you wash annoying people who complain out of your lists as effectively as possible (which probably includes forwarding complaints from those people to you, which is something that has historically caused serious problems for people who file complaints due to spammer retaliation). I translate from the marketing language on their website, of course.
In the process of doing this amazonses.com sends from their own
IP address space, using their own
HELO names, their own domain
name, and completely opaque sender envelope address information.
Want to get some email sent through amazonses.com but not the email
from spammers you've identified? You're plain out of luck at the
basic SMTP level; your only option is to parse the actual message
in the SMTP DATA phase and look for markers. Of course this helps
spammers, since they get a free ride on the fact that you may not
be able to block amazonses.com email in general.
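As an illustration of what that DATA-phase parsing could look like (a sketch with a made-up marker domain, not a filter we actually run), you wind up matching message headers against markers for senders you have identified:

```python
from email.parser import BytesParser
from email.policy import default

# Hypothetical marker for a sender we've identified as a spammer; the
# envelope and HELO are all generic amazonses.com values, so headers
# inside the message itself are all we have to go on.
BLOCKED_MARKERS = ('bounces.example-spammer.com',)

def should_reject(raw_message: bytes) -> bool:
    msg = BytesParser(policy=default).parsebytes(raw_message)
    headers = ' '.join(str(msg.get(h, ''))
                       for h in ('From', 'Return-Path', 'List-Unsubscribe'))
    return any(marker in headers for marker in BLOCKED_MARKERS)
```

Note that this happens only after you've accepted the full message data, which is exactly the problem: you can no longer refuse the spammer cheaply at connection or envelope time.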
I'm fairly sure that Amazon does not deliberately want to run a mail spamming service. It's just that, as usual, not running a mail spamming service would cost them too much money and too much effort and they are in a position to not actually care. So everyone else gets to lose. Welcome to the modern Internet email environment, where receiving email from random strangers to anything except disposable email addresses gets to be more and more of a problem every year.
(As far as I can tell, Amazon does not even require you to use their own mailing list software, managed by Amazon so that Amazon can confirm subscriptions and monitor things like that. You're free to roll your own mail blast software and as far as I can tell my specific spammer did.)
It's time for me to stop using lighttpd
There's another SSL configuration vulnerability going around; this one is called Logjam (also). Part of the suggested fixes for it is to generate your own strong Diffie-Hellman group instead of using one of the default groups, and of course another fix is yet more SSL parameter fiddling. There have been quite a lot of SSL/TLS related issues lately, and many of them have required SSL parameter fiddling at least in the short term.
I've had a long-standing flirtation with lighttpd and my personal site has used it since the start. But this latest SSL issue has crystallized something I've been feeling for a while, which is that lighttpd has not really been keeping up with the SSL times. Lighttpd cannot configure or de-configure a number of things that people want; for example, it has no option to disable TLS v1.0 or SSL compression (although the latter is probably off in OpenSSL by now). OCSP stapling? You can forget it (from all appearances). In general, the last release of lighttpd 1.4.x was a year ago, which is an eternity in SSL best practices land.
For a while now I've been telling people when they asked me that I couldn't recommend lighttpd for new deployments if they cared about SSL security at all. Since I care increasingly much about SSL myself, it's really time for me to follow my own advice and move away from lighttpd to something else (Apache is the most likely candidate, despite practical annoyances in my environment). It'll be annoying, but in the long run it will be good for me. I'll have a SSL configuration that I have much more trust in and that is much better supported by common resources like Mozilla's SSL configuration generator and configuration guidelines.
There's certainly a part of me that regrets this, since lighttpd is a neat little idea and Apache is kind of a hulking monstrosity. But in practice, what matters on the Internet is that unmaintained software decays. Lighttpd is in practice more or less unmaintained, while Apache is very well maintained (partly because so many people use it).
(Initially I was going to write that dealing with Logjam would push me over the edge right away, but it turns out that the Logjam resources page actually has settings for lighttpd for once.)
On the modern web, ISPs are one of your threats
Once upon a time, it was possible to view the Internet as a generally benevolent place as far as your traffic was concerned. Both passive eavesdroppers and man in the middle attacks were uncommon and took generally aggressive attackers to achieve (although it could be done). Eavesdropping attacks were things you mostly worried about on (public) wifi or unusual environments like conference networks.
The only remedy that the Internet has for this today is strong encryption, with enough source authentication that ISPs cannot shove themselves in the middle without drastic actions. This is fundamentally why it's time for HTTP-only software to die; the modern Internet strongly calls for HTTPS.
This is a fundamental change in the Internet and not a welcome one. But reality is what it is and we get to deal with the Internet we have, not the Internet we used to have and we'd like to still have. And when we're building things that will be used on today's Internet it behooves us to understand what sort of a place we're really dealing with and work accordingly, not cling to a romantic image from the past of a friendlier place.
(If we do nothing and keep naively building for a nicer Internet that no longer exists, it's only going to get worse.)
Converting filesystems from ext3 to ext4, and concerns attached to it (plus bad news for me)
Yesterday I covered why I basically need to move from ext3 to ext4 and I said that the mechanisms of this transition raise a few issues. So let's talk about them, and in the process I'll wind up with disappointing news for my own case. Actually, let's lead with the disappointing news:
An ext3 filesystem converted to ext4 almost certainly won't support ext4's nanosecond file timestamps; it will only have ext3 one-second ones.
On the surface the mechanics of converting from ext3 to ext4 are simple and relatively straightforward, with two levels. The first level is simply to mount filesystems as ext4 instead of ext3. This enables backwards-compatible filesystem features and is fully reversible (at least according to the sources I've read). As a result it ought to be as safe as ext4 itself, and these days that's pretty safe. However this only gives you features like delayed allocation (which some people consider a non-feature for complex reasons).
(Checksumming the journal may be one of the things you get here; it's not clear to me.)
To fully convert a filesystem to ext4, one must also use tune2fs
to enable the ext4-specific filesystem features that affect the on
disk format of the filesystem. Various sources (eg the ext4 wiki)
say that this is:
tune2fs -O extents,uninit_bg,dir_index /dev/<WHAT>
(In theory dir_index is also an ext3 feature, but on the other hand at least some of my ext3 filesystems don't have it. They may have started out as ext2 filesystems.)
However, this is not the full set of feature differences between
ext3 and ext4 today. The Fedora 21 default ext4 features also include
huge_file, flex_bg, dir_nlink, and extra_isize, and
the tune2fs manpage says that all of them can be set on an existing
filesystem (although flex_bg and extra_isize probably have
no meaningful effect). Per the tune2fs manpage and various
instructions on ext3 to ext4 conversions, setting uninit_bg
should be followed by a run of e2fsck, and setting dir_index is
helped by doing an 'e2fsck -D' run.
Now we get to my concerns. The first concern is straightforward; at least some 'how to convert from ext3 to ext4' sources contain prominent 'back up your data just in case' warnings. These sources are also relatively old ones and I suspect that they date from the days when ext4 and this process was new and thus uncertain. Notably, the ext4 wiki itself doesn't have any such cautions these days. Still, it's a bit alarming.
The second concern is whether this is actually going to do me any good. My entire purpose for doing this is to get support for sub-second file timestamps, because that's basically the only user visible ext3 versus ext4 difference (and software is noticing). The existing documentation is not clear on whether ext4 automatically starts doing this on an existing ext3 filesystem, if you have to add the extra_isize feature, or if a converted ext3 filesystem simply can't do this because its existing (and future) inodes are too small to hold the extra data.
Unfortunately the news here is not good. Additional sources (this description of timestamps in ext4, this SA answer) say that the high resolution timestamps definitely go in the extra space in normal ext4 inodes, space that my ext3 filesystem inodes just don't have (ext3 defaults to 128 byte inodes, ext4 to 256 byte ones). Converting a filesystem from ext3 to ext4 does not enlarge its inodes, which means that a normal ext3 filesystem converted to ext4 will never support high resolution timestamps.
Given this, there's not much point in putting myself through this in-place conversion effort (in fact converting my filesystems would be silently misleading). The only way I'd ever get better timestamps would be to make new ext4 filesystems from scratch, copy all the current data into them, and then replace my existing filesystem mounts with mounts of the new ext4 filesystems. Among other things, this is a pain in the rear.
(It's achievable, at least in theory, as my LVM pool has plenty of unused space. It would mean that I'd only 'convert' a couple of filesystems, the ones most likely to run into high resolution file timestamp issues.)
PS: You can see how big your ext3 inodes are with 'tune2fs -l
/dev/<WHAT>'; it's reported near the end. This will also let
you see what features are set on your filesystem (and thus let
you know whether your ext3 filesystems are without dir_index).
Why I'm interested in converting my ext3 filesystems to ext4
My home machine has a sufficiently old set of filesystems that many of my actively used filesystems are still ext3, not ext4, including both my home directory and where I keep code. Normally this isn't something that I particularly think or worry about; it's not like ext4 is a particularly radical advance from ext3 (certainly not the same sort of jump that was ext2 to ext3, where you got fast crash recovery). As a sysadmin I'm generally cautious with filesystem choices anyways (at least when I'm not being radical); I used ext2 over ext3 for years after the latter came out, for example, on the principle that I'd let other people find the problems.
It turns out that there is one important thing that ext4 has and ext3 does not: ext4 has sub-second file timestamps, while ext3 only does timestamps to the nearest second. Modern machines are fast enough that nearest second timestamps are increasingly not really good enough when building software or otherwise doing things that care about relative file timestamps and 'is X more recent than Y'. Oh, sure, it works most of the time, but every so often things go wrong or you find assumptions buried in other people's software.
Most people don't notice these things because most people are now using filesystems that support sub-second file timestamps (which is almost all modern Linux filesystems). What this tells me is that I'm increasingly operating in an unusual and effectively unsupported environment by continuing to use ext3. As time goes by, more and more software is likely to assume sub-second file timestamps basically by default (because the authors have never run it on a system without them) and not work quite right in various ways. I can fight a slow battle against what is effectively a new standard of sub-second file timestamps, or I can give in and convert my ext3 filesystems to ext4. It's not like ext4 is exactly a new filesystem these days, after all (Wikipedia dates it to 2008).
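You can probe whether a filesystem is actually giving you sub-second timestamps from Python (a quick check I'd run by hand, nothing official):

```python
import os
import tempfile

# Create a scratch file and look at the sub-second part of its mtime.
# os.stat() exposes nanoseconds via st_mtime_ns (Python 3.3+); on an
# ext3 filesystem the fractional part is always zero, while ext4 with
# 256-byte inodes normally records a real sub-second component.
fd, path = tempfile.mkstemp()
os.close(fd)
st = os.stat(path)
frac_ns = st.st_mtime_ns % 1_000_000_000
print('sub-second mtime part (ns):', frac_ns)
os.unlink(path)
```

(Run this in a directory on the filesystem you care about; tempfile defaults to /tmp, which may well be tmpfs and thus always have nanosecond timestamps.)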
The mechanics of this conversion raise a few issues, but that's something for another entry.
A bit more on the ZFS delete queue and snapshots
In my entry on ZFS delete queues, I mentioned that a filesystem's delete queue is captured in snapshots and so the space used by pending deletes is held by snapshots. A commentator then asked:
So in case someone uses zfs send/receive for backup he accidentially stores items in the delete queue?
This is important enough to say explicitly: YES. Absolutely.
Since it's part of a snapshot, the delete queue and all of the space
it holds will be transferred if you use
zfs send to move a
filesystem snapshot elsewhere for whatever reason. Full backups,
incremental backups, migrating a filesystem, they all copy all of
the space held by the delete queue (and then keep it allocated on
the received side).
This has two important consequences. The first is that if you
transfer a filesystem with a heavy space loss due to things being
held in the delete queue for whatever reason,
you can get a very head-scratching result. If you don't actually
mount the received dataset you'll wind up with a dataset that claims
to have all of its space consumed by the dataset, not snapshots,
but if you 'zfs destroy' the transfer snapshot the dataset promptly
shrinks. Having gone through this experience myself, I can say this
is a very puzzling thing to stare at.
The second important consequence is that apparently the moment you
mount the received dataset, the current live version will immediately
diverge from the snapshot (because ZFS wakes up, says 'ah, a delete
queue with no live references', and applies all of those pending
deletes). This is a problem if you're doing repeated incremental
receives, because the next incremental receive will tell you
'filesystem has diverged from snapshot, you'll have to tell me to
force a rollback'. On the other hand, if ZFS space accounting is
working right this divergence should transfer a bunch of the space
the filesystem is consuming into the 'usedbysnapshots' category.
Still, this must be another head-scratching moment, as just mounting
a filesystem suddenly caused a (potentially big) swing in space
usage and a divergence from the snapshot.
(I have not verified this mounting behavior myself, but in retrospect
it may be the cause of some unexpected divergences we've experienced
while migrating filesystems. Our approach was always just to use
'zfs recv -F ...', which is perfectly viable if you're really
sure that you're not blowing your own foot off.)
Your Illumos-based NFS fileserver may be 'leaking' deleted files
By now you may have guessed the punchline of my sudden interest in ZFS delete queues: we had a problem with ZFS leaking space for deleted files that was ultimately traced down to an issue with pending deletes that our fileserver wasn't cleaning up when it should have been.
As a well-debugged filesystem, ZFS should not outright leak pending
deletions, where there are no remaining references anywhere yet the
files haven't been cleaned up (well, more or less; snapshots come
into the picture, as mentioned). However it's
possible for both user-level and kernel-level things to hold
references to now-deleted files in the traditional way and thus
keep them from being actually removed. User-level things holding
open files should be visible in, eg,
fuser, and anyways this is
a well-known issue that savvy people will immediately ask you
about. Kernel level things may be less visible, and there is at
least one in mainline Illumos and thus OmniOS r151014 (the current
release as I write this entry).
Per George Wilson on the illumos-zfs mailing list here, Delphix
found that the network lock manager (the
nlockmgr SMF service)
could hold references to (deleted) files under some circumstances
(see the comment in their fix).
Under the right circumstances this can cause significant space
lossage over time; we saw loss rates of 5 GB a week. This is worked
around by restarting
nlockmgr; this restart drops the old references
and thus allows ZFS to actually remove the files and free up
potentially significant amounts of your disk space. Rebooting
the whole server will do it too, for obvious reasons, but is somewhat
more drastic. Restarting nlockmgr is said to be fully transparent to clients,
but we have not attempted to test that. When we did our
restart we did as much as possible to make any locking failures a
non-issue.
As far as I know there is no kernel-level equivalent of fuser,
so that you could list, eg, all currently active kernel level
references to files in a particular filesystem (never mind what
kernel subsystem is holding such references). I'd love to be wrong
here; it's an annoying gap in Illumos's observability.
The ZFS delete queue: ZFS's solution to the pending delete problem
Like every other always-consistent filesystem, ZFS needs a solution to the Unix pending delete problem (files that have been deleted on the filesystem but that are still in use). ZFS's solution is implemented with a type of internal ZFS object called the 'ZFS delete queue', which holds a reference to any and all ZFS objects that are pending deletion. You can think of it as a kind of directory (and technically it's implemented with the same underlying storage as directories are, namely a ZAP store).
Each filesystem in a ZFS pool has its own ZFS delete queue object, holding pending deletes for objects that are in (or were originally in) that filesystem. Also, each snapshot has a ZFS delete queue as well, because the current state of a filesytem's ZFS delete queue is captured as part of making a snapshot. This capture of delete queues in snapshots has some interesting consequences; the short version is that once a delete queue with entries is captured in a snapshot, the space used by those pending deleted objects cannot be released until the snapshot itself is deleted.
(I'm not sure that this space usage is properly accounted for in
the 'usedby*' space usage properties, but I haven't tested this.)
There is no simple way to find out how big the ZFS delete queue is
for a given filesystem. Instead you have to use the magic zdb
command to read it out, using 'zdb -dddd DATASET OBJNUM' to dump
details of individual ZFS objects so that you can find out how many
ZAP entries a filesystem's 'ZFS delete queue' object has; the number
of current ZAP entries is the number of pending deletions. See the
sidebar for full details, because it gets long and tedious.
(In some cases it will be blatantly obvious that you have some
sort of problem because
df and '
zfs list' and so on report
very different space numbers than eg
du does, and you don't
have any of the usual suspects like snapshots.)
Things in the ZFS delete queue still count in and against per-user
and per-group space usage and quotas, which makes sense because
they're still not quite deleted. If you use '
zfs userspace' or
'zfs groupspace' for space tracking and reporting purposes this
can result in potentially misleading numbers, especially if pending
deletions are 'leaking' (which can happen).
If you actually have and enforce per-user or per-group quotas, well,
you can wind up with users or groups that are hitting quota limits
for no readily apparent reason.
(Needing to add things to the ZFS delete queue has apparently caused problems on full filesystems at least in the past, per this interesting opensolaris discussion from 2006.)
Sidebar: A full example of finding how large a ZFS delete queue is
To dump the ZFS delete queue for a filesystem, first you need to know what its object number is; this is usually either 2 (for sufficiently old filesystems) or 3 (for newer ones), but the sure way to find out is to look at the ZFS master node for the filesystem (which is always object 1). So to start with, we'll dump the ZFS master node to find out the object number of the delete queue.
# zdb -dddd fs3-corestaff-01/h/281 1
Dataset [....]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         1    1    16K     1K     8K     1K  100.00  ZFS master node
[...]
        microzap: 512 bytes, 3 entries
        DELETE_QUEUE = 2
[...]
The object number of this filesystem's delete queue is 2 (it's an old filesystem, having been originally created on Solaris 10). So we can dump the ZFS delete queue:
# zdb -dddd fs3-corestaff-01/h/281 2
Dataset [...]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         2    2    16K    16K   144K   272K  100.00  ZFS delete queue
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 16
        Fat ZAP stats:
[...]
                ZAP entries: 5
[...]
                3977ca = 3766218
                3977da = 3766234
                397a8b = 3766923
                397a87 = 3766919
                397840 = 3766336
(The final list here is the ZAP entries themselves, going from some magic key (on the left) to the ZFS object numbers on the right. If we wanted to, we could use these object numbers to inspect (or even read out) the actual things that are pending deletion. This is probably most useful to find out how large they are and thus how much space they should be consuming.)
There are two different forms of ZAPs and
zdb reports how many
entries they have somewhat differently. In the master node we saw
a 'microzap', used when the ZAP is and always has been small. Here
we see a 'Fat ZAP', which is what a small ZAP turns into if at some
point it grows big enough. Once the ZFS delete queue becomes a fat
ZAP it stays that way even if it later only has a few entries, as
we see here.
In this case the ZFS delete queue for this filesystem holds only five entries, which is not particularly excessive or alarming. Our problem filesystem had over ten thousand entries by the time we resolved the issue.
PS: You can pretty much ignore the summary line with its pretty sizes;
as we see here, they have very little to do with how many delete queue
entries you have right now. A growing ZFS delete queue size may be
a problem indicator,
but here the only important thing in the summary is the object type,
which confirms that we have the right sort of objects both for the ZFS
master node and the ZFS delete queue.
PPS: You can also do this exercise for snapshots of filesystems; just use the full snapshot name instead of the filesystem.
(I'm not going to try to cover
zdb usage details at all, partly
because I'm just flailing around with it. See Ben Rockwood's zdb:
Examining ZFS At Point-Blank Range for one
source of more information.)
The pending delete problem for Unix filesystems
Unix has a number of somewhat annoying filesystem semantics that
tend to irritate designers and implementors of filesystems. One of
the famous ones is that you can delete a file without losing access
to it. On at least some OSes, if your program
open()s a file and
then tries to delete it, either the deletion fails with 'file is
in use' or you immediately lose access to the file; further attempts
to read or write it will fail with some error. On Unix your program
retains access to the deleted file and can even pass this access
to other processes in various ways. Only when the last process using
the file closes it will the file actually get deleted.
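This behavior is easy to demonstrate on any Unix filesystem (here in Python):

```python
import os
import tempfile

# Demonstrate Unix 'use after deletion': unlinking removes the name
# immediately, but the open file descriptor keeps the inode alive.
fd, path = tempfile.mkstemp()
os.write(fd, b'still here')
os.unlink(path)                      # the name is gone ...
assert not os.path.exists(path)
os.lseek(fd, 0, os.SEEK_SET)
assert os.read(fd, 100) == b'still here'  # ... but the data isn't
os.close(fd)                         # only now is the space freed
```

This is exactly the pattern many Unix programs use for anonymous scratch files: create, unlink, and keep working with the descriptor, secure in the knowledge that the space goes away when the process exits.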
This 'use after deletion' presents Unix and filesystem designers
with the problem of how you keep track of this in the kernel. The
historical and generic kernel approach is to keep both a link count
and a reference count for each active inode; an inode is only marked
as unused and the filesystem told to free its space when both counts
go to zero. Deleting a file via
unlink() just lowers the link
count (and removes a directory entry); closing open file descriptors
is what lowers the reference count. This historical approach ignored
the possibility of the system crashing while an inode had become
unreachable through the filesystem and was only being kept alive
by its reference count; if this happened the inode became a zombie,
marked as active on disk but not referred to by anything. To fix
it you had to run a filesystem checker, which would
find such no-link inodes and actually deallocate them.
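A toy model of the two-counter scheme (my own sketch, not actual kernel code) makes the bookkeeping concrete:

```python
class Inode:
    """Toy model of the historical kernel approach: storage is freed
    only when both the link count (names in directories) and the
    reference count (open file descriptors) have reached zero."""

    def __init__(self):
        self.links = 1      # created with one directory entry
        self.refs = 0       # no open file descriptors yet
        self.freed = False

    def _maybe_free(self):
        if self.links == 0 and self.refs == 0:
            self.freed = True

    def open(self):
        self.refs += 1

    def close(self):
        self.refs -= 1
        self._maybe_free()

    def unlink(self):
        self.links -= 1
        self._maybe_free()
```

An inode that is open()ed and then unlink()ed stays allocated until the final close(); a crash in that window is exactly what leaves a zombie inode on disk.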
(When Sun introduced NFS they were forced to deviate slightly from this model, but that's an explanation for another time.)
Obviously this is not suitable for any sort of journaling or 'always
consistent' filesystem that wants to avoid the need for a fsck
after unclean shutdowns. All such filesystems must keep track of
such 'deleted but not deallocated' files on disk using some mechanism
(and the kernel has to support telling filesystems about such
inodes). When the filesystem is unmounted in an orderly way, these
deleted files will probably get deallocated. If the system crashes,
part of bringing the filesystem up on boot will be to apply all of
the pending deallocations.
Some filesystems will do this as part of their regular journal; you journal, say, 'file has gone to 0 reference count', and then you know to do the deallocation on journal replay. Some filesystems may record this information separately, especially if they have some sort of 'delayed asynchronous deallocation' support for file deletions in general.
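A sketch of that journaling approach (invented names, in-memory only, nothing like any real filesystem's code) might look like:

```python
# Toy sketch of journaling pending deallocations: the 'pending free'
# record is written before the actual deallocation happens, so a
# crash in between leaves behind a record that replay can finish.
journal = []                 # stand-in for persistent journal records
allocated = {7, 12, 31}      # example set of live inode numbers

def delete_file(inode):
    journal.append(('pending-free', inode))   # journaled first
    # -- a crash here leaves the record behind for replay --
    allocated.discard(inode)                  # the (slow) deallocation
    journal.remove(('pending-free', inode))   # record retired on success

def replay_journal():
    # Run at mount time after an unclean shutdown: finish any
    # deallocations that were journaled but never completed.
    for record in list(journal):
        kind, inode = record
        if kind == 'pending-free':
            allocated.discard(inode)
            journal.remove(record)
```

The ordering is the whole point: because the record hits the journal before the space is touched, replay can always finish the job, and a deallocation is never half-done in a way the filesystem can't see.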
(Asynchronous deallocation is popular because it means your process
can unlink() a big file without having to stall while the kernel
frantically runs around finding all of the file's data blocks and
then marking them all as free. Given that finding out what a file's
data blocks are often requires reading things from disk, such deallocations can be relatively
slow under disk IO load (even if you don't have other issues there).)
PS: It follows that a failure to correctly record pending deallocations or properly replay them is one way to quietly lose disk space on such a journaling filesystem. Spotting and fixing this is one of the things that you need a filesystem consistency checker for (whether it's a separate program or embedded into the filesystem itself).