Wandering Thoughts archives


ZFS pushes file renamings and other metadata changes to disk quite promptly

One of the general open questions on Unix is when changes like renaming or creating files are actually durably on disk. Famously, some filesystems on some Unixes have been willing to delay this for an unpredictable amount of time unless you did things like fsync() the containing directory of your renamed file, not just fsync() the file itself. As it happens, ZFS's design means that it offers some surprisingly strong guarantees about this; specifically, ZFS persists all metadata changes to disk no later than the next transaction group commit. In ZFS today, a transaction group commit generally happens every five seconds, so if you do something like rename a file, your rename will be fully durable quite soon even if you do nothing special.

However, this doesn't mean that if you create a file, write data to the file, and then rename it (with no other special operations), your new file is guaranteed to be present under its new name with all the data you wrote within five or ten seconds. Although metadata operations like creating and renaming files go to ZFS right away and then become part of the next txg commit, the kernel generally holds on to written file data for a while before pushing it out. You need some sort of fsync() in there to force the kernel to commit your data, not just your file creation and renaming. Because of how the ZFS intent log works, you don't need to do anything more than fsync() your file here; when you fsync() a file, all pending metadata changes are flushed out to disk along with the file data.

(In a 'create new version, write, rename to overwrite current version' setup, I think you want to fsync() the file twice, once after the write and then once after the rename. Otherwise you haven't necessarily forced the rename itself to be written out. You don't want to do the rename before a fsync(), because then I think that a crash at just the wrong time could give you an empty new file. But the ice is thin here in portable code, including code that wants to be portable to different filesystem types.)
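To make the pattern above concrete, here is a sketch in Python of the 'create new version, write, rename to overwrite' sequence with the two fsync() calls, plus the directory fsync() that portable code wants for filesystems without ZFS's guarantees. The `.new` temporary-file naming is my own illustrative choice, not anything from the entry:

```python
import os

def durable_replace(path, data):
    # Write the new version to a temporary file, then atomically
    # rename it over the current version, fsync()ing after the
    # write and again after the rename.
    tmp = path + ".new"  # hypothetical temp-file naming scheme
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # force the file data out before the rename
        os.rename(tmp, path)  # atomic replacement of the old version
        os.fsync(fd)          # on ZFS this also forces out the rename itself
    finally:
        os.close(fd)
    # Portable code (for filesystems other than ZFS) also fsync()s
    # the containing directory to make the rename durable:
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

On ZFS alone the second fsync() of the file should suffice, since fsync() flushes pending metadata changes along with the data; the directory fsync() is the belt-and-suspenders step for other filesystem types.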

My impression is that ZFS is one of the few filesystems with such a regular schedule for committing metadata changes to disk. Others may be much more unpredictable, and possibly may reorder the commits of some metadata operations in the process (although by now, it would be nice if everyone avoided that particular trick). In ZFS, not only do metadata changes commit regularly, but there is a strict time order to them such that they can never cross over each other that way.

solaris/ZFSWhenMetadataSynced written at 22:47:51

Most modern web spiders are parasites

Once upon a time, it was possible to believe that most web spiders hitting your site were broadly beneficial to the (open) web and to people in general. Oh, sure, there were always bad ones (including spammers scraping the web for addresses to spam), but you could at least believe that bad or selfish spiders were the exception. It's my view that these days are over and that on the modern web, most spiders crawling your site are parasites.

My criteria for whether something is or isn't a parasite are a bit of a hand wave; to steal some famous words, ultimately I know it when I see it. Broadly and generally, web spiders are parasites when they don't gather information to serve the general public, they don't make the web better by their presence, and they don't even do something that we'd consider generally useful for a somewhat restricted group of people (such as the people on a chat channel). There are all sorts of parasites, of course; some are actively evil and are trying to do things that will do you harm (such as harvesting email addresses to spam), while others are simply selfish.

What's a selfish, parasitic web spider? As an example, there are multiple companies that crawl the web looking for mentions of brands and then sell information about this to the brands and various other interested people. There are 'sentiment analysis' and 'media monitoring' firms that try to crawl your pages and analyze what you say about products; several of them came up recently. There are companies that perhaps maybe might tell you something about the network of links and connections between sites, but you have to register first and perhaps that means you have to pay them money to get anything useful. At one point there were companies trying to gather up web pages so they could sell a plagiarism analysis service to universities and other people. And so on and so forth, at nearly endless length if you actually look at your web server logs and then start investigating.

(The individual parasitic web spiders don't necessarily crawl at high volume, although some of them certainly will try if you let them, but there are a lot of different ones overall. It's somewhat depressing how many of them seem to be involved in the general Internet ad business, if you construe it somewhat broadly.)

(I wrote about some of my own attitudes on this long ago, in The limits of web spider tolerance. Things have not gotten better since then.)

web/WebSpidersMostlyParasites written at 01:01:22
