2018-05-27
ZFS pushes file renamings and other metadata changes to disk quite promptly
One of the general open questions on Unix is when changes like
renaming or creating files are actually durably on disk. Famously,
some filesystems on some Unixes have been willing to delay this for
an unpredictable amount of time unless you did things like fsync()
the containing directory of your renamed file, not just fsync()
the file itself. As it happens, ZFS's design means that it offers
some surprisingly strong guarantees about this; specifically, ZFS
persists all metadata changes to disk no later than the next
transaction group commit. In ZFS today, a transaction group commit
generally happens every five seconds, so if you do something like
rename a file, your rename will be fully durable quite soon even if
you do nothing special.
However, this doesn't mean that if you create a file, write data
to the file, and then rename it (with no other special operations)
that in five or ten seconds your new file is guaranteed to be present
under its new name with all the data you wrote. Although metadata
operations like creating and renaming files go to ZFS right away
and then become part of the next txg commit, the kernel generally
holds on to written file data for a while before pushing it out.
You need some sort of fsync() in there to force the kernel to
commit your data, not just your file creation and renaming. Because
of how the ZFS intent log works, you don't need to do anything more
than fsync() your file here; when you fsync() a file, all pending
metadata changes are flushed out to disk along with the file data.
(In a 'create new version, write, rename to overwrite current
version' setup, I think you want to fsync() the file twice, once
after the write and then once after the rename. Otherwise you haven't
necessarily forced the rename itself to be written out. You don't
want to do the rename before the first fsync(), because then I think
that a crash at just the wrong time could give you an empty new file.
But the ice is thin here in portable code, including code that wants
to be portable to different filesystem types.)
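As a rough illustration of that 'write, fsync(), rename, fsync() again' sequence, here is a minimal Python sketch. The function name replace_file_durably and the temporary file prefix are made up for this example, and the second fsync() only pushes out the rename if the filesystem behaves the way ZFS is described as behaving above; on other filesystems you would probably also want to fsync() the containing directory.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Write data to a new file and rename it over path, fsync()ing twice.

    Hypothetical helper for illustration only. It assumes a Unix system
    and that fsync() on the file also flushes the pending rename, as
    described for ZFS; it does not fsync() the containing directory.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the new version in the same directory so that the rename
    # stays within one filesystem. Note that mkstemp() creates the file
    # with mode 0600, which may not match the file being replaced.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # First fsync(): force the file data out before the rename,
            # so a crash can't leave an empty file under the final name.
            os.fsync(f.fileno())
            os.replace(tmp_path, path)
            # Second fsync(): on ZFS, this should also push out the
            # pending rename along with everything else.
            os.fsync(f.fileno())
    except BaseException:
        # Clean up the temporary file if we failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Calling replace_file_durably("settings.conf", b"new contents\n") would then either leave the old version in place or leave the complete new version, at least under the assumptions above.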
My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.
Most modern web spiders are parasites
Once upon a time, it was possible to believe that most web spiders hitting your site were broadly beneficial to the (open) web and to people in general. Oh, sure, there were always bad ones (including spammers scraping the web for addresses to spam), but you could at least believe that bad or selfish spiders were the exception. It's my view that these days are over and that on the modern web, most spiders crawling your site are parasites.
My criteria for whether something is or isn't a parasite are a bit of a hand wave; to steal some famous words, ultimately I know it when I see it. Broadly and generally, web spiders are parasites when they don't gather information to serve the general public, they don't make the web better by their presence, and they don't even do something that we'd consider generally useful even for a somewhat restricted group of people (such as the people on a chat channel). There are all sorts of parasites, of course; some are actively evil and are trying to do things that will do you harm (such as harvest email addresses to spam), while others are simply selfish.
What's a selfish, parasitic web spider? As an example, there are multiple companies that crawl the web looking for mentions of brands and then sell information about this to the brands and various other interested people. There are 'sentiment analysis' and 'media monitoring' firms that try to crawl your pages and analyze what you say about products; several of them came up recently. There are companies that perhaps maybe might tell you something about the network of links and connections between sites, but you have to register first and perhaps that means you have to pay them money to get anything useful. At one point there were companies trying to gather up web pages so they could sell a plagiarism analysis service to universities and other people. And so on and so forth, at nearly endless length if you actually look at your web server logs and then start investigating.
(The individual parasitic web spiders don't necessarily crawl at high volume, although some of them certainly will try if you let them, but there are a lot of different ones overall. It's somewhat depressing how many of them seem to be involved in the general Internet ad business, if you construe it somewhat broadly.)
(I wrote about some of my own attitudes on this long ago, in The limits of web spider tolerance. Things have not gotten better since then.)