2015-01-31
The problem with punishing people over policy violations
Back in my entry on why user-hostile policies are a bad thing I said that I believed threatening to punish people was generally not effective from a business perspective. I owe my readers an explanation for that, because on the surface it seems like an attractive idea.
The problem with punishing people is that practically by definition a meaningful punishment must hurt, and generally it can't hurt retroactively. However, when you hurt people and especially when you hurt people's future with you (through bad performance reviews because of policy violations, docking their future pay, and so on), the people involved may decide to react to the hurt by just quitting and finding another job.
This means that any time you are contemplating punishing someone in a meaningful way, you must ask yourself whether whatever they did is bad enough to risk losing them over it (or bad enough that you should lose them over it). Sometimes the answer will be yes because what they did was really, really bad; sometimes the answer will be yes because they're easy to replace. But if it was not a really bad thing and they would be disruptive to lose and a pain to replace, well, do you want to run that risk?
Obviously, the worse your punishment is, the higher the chance of this happening. In particular, if your punishment means that they'll wind up noticeably underpaid relative to their counterparts elsewhere (whether through denial of promotion, denial of performance raises, or something else), you'd better hope that they really love working for you.
(You can always hope that they'd have a hard time finding another job (or at least another job that's as attractive as yours even after you punish them), so that they don't really have any choice but to suck it up and take it. But for high-demand professionals this is probably not very likely. And even if it's the case now, you've armed a ticking time bomb; I suspect that you're going to lose them as soon as they can go.)
(This is separate from the additional problems of punishing people at universities, where I was more focused on removal of computer or network access than a larger view of punishments in general.)
Upgrades and support periods
Suppose, hypothetically, that you are a vendor and you want to push people to upgrade more frequently. No problem, you say, you will just reduce the support period for your old releases. This is a magic trick that will surely cause everyone to upgrade at least as fast as you want them to, basically at a pace that you chose, right?
Well, no, obviously not. There are clearly at least two forces operating here. On the one hand you have people's terror of lack of support; this pushes them to upgrade. On the other hand, you have people's 'terror' of the work and risk involved in upgrades; this pushes them to not upgrade. Pushing ever-shorter support periods from the vendor side can only get you so far because the other force is pushing back against you, and after a certain point people simply don't move any more. Once you've hit that point you can reduce your support period all you want, but it won't have any effect.
Generally I think there will be diminishing returns from shorter and shorter support periods as you push more and more people to their limit of terror and they say 'well, to hell with it then'. I also suspect that this is neither a linear decay nor a smooth one; there are probably inflection points where a whole lot of people will drop out at once.
Aggressively lowering your support periods will have one effect, though: you can persuade people to totally abandon your system and go find another one that isn't trying to drag them around through terror. This is a win only if you don't want users.
(By the way, the rapidly upgrading products in the real world that do this at scale don't do it by having short support periods.)
2015-01-28
A thought about social obligations to report bugs
One of the things that people sometimes say is that you have a social obligation to report bugs when you find them. This seems most common in the case of open source software, although I've also read about it for, eg, developers on closed source platforms. Let's set aside all of the possible objections to this for the moment, because I want to point out an important issue here that I feel doesn't get half as much attention as it should.
If users have a social obligation to report bugs, projects have a mirror social obligation to make reporting bugs a pleasant or at least not unpleasant experience.
Put flatly, this is only fair. If you are going to say that people need to go out of their way to do something for you (in the abstract and general sense), I very strongly reject the idea that you get to require them to go through unpleasant things or get abused in the process. If you try to require that, you are drastically enlarging the scope of the social obligation you are trying to drop on people, and this is inequitable. You're burdening people all out of proportion for what they are doing.
As a corollary to this, if you want to maintain that users of any particular project (especially your project) have a social obligation to report bugs to 'pay for' the software, you have the obligation of 'paying for' their bug reports by making that project's bug reporting a pleasant process. If you create or tolerate an unpleasant bug reporting process or environment while putting pressure on people to report bugs, you are what I can only describe as an asshole.
(You're also engaged in something that is both ineffective and alienating, but I'm not talking about practicalities here, I'm talking about what's just. If we're all in this together, being just is for everyone to try to make everyone else's life better. Projects make the life of users better by developing software, users make projects better by doing good bug reports, and projects make the life of users better by making bug reports as pleasant as possible.)
(This is probably one of the cases where either I've convinced you by the end of the thesis or you're never going to be convinced, but sometimes I write words anyways.)
2015-01-19
Why user-hostile policies are a bad thing and a mistake
One reasonable reaction to limited email retention policies being user-hostile is to say basically 'so what'. It's not really nice that policies make work for users, but sometimes that's just life; people will cope. I feel that this view is a mistake.
The problem with user-hostile policies is that users will circumvent them. Generously, let's assume that you enacted this policy to achieve some goal (not just to say that you have a policy and perhaps point to a technical implementation as proof of it). What you really want is not for the policy to be adhered to but to achieve your goal; the policy is just a tool in getting to the goal. If you enact a policy and then your users do things that defeat the goals of the policy, you have not actually achieved your overall goal. Instead you've made work, created resentment, and may have deluded yourself into thinking that your goal has actually been achieved because after all the policy has been applied.
(Clearly you won't have inconvenient old emails turn up because you're deleting all email after sixty days, right?)
In extreme cases, a user-hostile policy can actually move you further away from your goal. If your goal is 'minimal email retention', a policy that winds up causing users to automatically archive all emails locally because that's the most convenient way to handle things is actually moving you backwards. You were probably better off letting people keep as much email on the server as they wanted, because at least they were likely to delete some of it.
By the way, I happen to think that threatening punishment to people who take actions that go against the spirit or even the letter of your policy is generally not an effective thing from a business perspective in most environments, but that's another entry.
(As for policies for the sake of having policies, well, I would be really dubious of the idea that saying 'we have an email deletion policy so there's only X days of email on the mail server' will do you much good against either attackers or legal requests. To put it one way, do you think the police would accept that answer if they thought you had incriminating email and might have saved it somewhere?)
2015-01-09
Why filesystems need to be where data is checksummed
Allegedly (and I say this because I have not looked for primary sources) some existing Linux filesystems are adding metadata checksums and then excusing their lack of data checksums by saying that if applications care about data integrity the application will do the checksumming itself. Having metadata checksums is better than having nothing and adding data checksums to existing filesystems is likely difficult, but this does not excuse their views about who should do what with checksums.
There are at least two reasons why filesystems should do data checksums. The first is that data checksums exist not merely to tell applications (and ultimately the user) when data becomes corrupt, but also to do extremely important things like telling which side of a RAID mirror is the correct side. Applications definitely do not have access to low-level details of things like RAID data, but the filesystem is at least in the right general area to be asking the RAID system 'do you happen to have any other copies of this logical block?' or the like.
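To make the first reason concrete, here is a minimal sketch of how a filesystem with data checksums can pick the good side of a mirror. This is Python with entirely hypothetical helpers; real filesystems (ZFS, for example) do this down in the kernel and the details differ a lot.

    import hashlib

    def read_mirrored_block(read_copy_funcs, expected_sha256):
        """Return the first mirror copy whose contents match the checksum
        the filesystem recorded for this block, plus the list of copies
        that were corrupt (so they could be rewritten from the good copy).

        read_copy_funcs: hypothetical callables that each return the bytes
        of one mirror copy of the block; they stand in for asking the RAID
        layer 'do you have another copy of this logical block?'.
        """
        bad_copies = []
        for i, read_copy in enumerate(read_copy_funcs):
            data = read_copy()
            if hashlib.sha256(data).hexdigest() == expected_sha256:
                return data, bad_copies
            bad_copies.append(i)
        raise IOError("no mirror copy matches the recorded checksum")

An application sitting on top of read() never gets the chance to make this choice; by the time it sees any data at all, something lower down has already picked a side for it.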
The second reason is that a great many programs would never be rewritten to verify checksums. Not only would this require a massive amount of coding, it would require a central standard so that applications can interoperate in generating and checking these checksums, finding them, and so on and so forth. On Unix, for example, this would need support not just from applications like Firefox, OpenOffice, and Apache but also common programs like grep, awk, perl, and gcc. The net result would be that a great deal of file IO on Unix would not be protected by checksums.
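To get a feel for the scale of the problem, here is roughly what every single one of those programs would have to do. The '.sha256' sidecar file convention below is invented purely for illustration; there is no such standard, which is exactly the point.

    import hashlib

    def write_with_checksum(path, data):
        # Write the data, then record its checksum in a sidecar file that
        # every other program would have to know to look for.
        with open(path, "wb") as f:
            f.write(data)
        with open(path + ".sha256", "w") as f:
            f.write(hashlib.sha256(data).hexdigest())

    def read_and_verify(path):
        # Read the data back and refuse to use it if it no longer matches.
        with open(path, "rb") as f:
            data = f.read()
        with open(path + ".sha256") as f:
            expected = f.read().strip()
        if hashlib.sha256(data).hexdigest() != expected:
            raise IOError(path + ": data does not match recorded checksum")
        return data

Multiply this by every program that reads or writes files, plus agreement on where the checksums live and what to do when they're missing, and you can see why it's never going to happen.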
(Let's skip lightly over any desire to verify that executables and shared libraries are intact before you start executing code from them, because you just can't do that without the kernel being very closely involved.)
When you are looking at a core service that should touch absolutely everything that does some common set of operations, the right place to put this service is in a central place so that it's implemented once and then used by everyone. The central place here is the kernel (where all IO passes through one spot), which in practice means in the filesystem.
(Perhaps this is already obvious to everyone; I'd certainly like to think that it is. But if there are filesystem developers out there who are seriously saying that data checksums are the job of applications instead of the filesystem, well, I don't know what to say. Note that I consider 'sorry, we can't feasibly add data checksums to our existing filesystem' to be a perfectly good reason for not doing so.)
2015-01-06
Choices filesystems make about checksums
If you are designing integrity checksums into a new filesystem or trying to add them to an existing one, there are some broad choices you have to make about them. These choices will determine both how easy it is to add checksums (especially to existing filesystems) and how much good your checksums do. Unfortunately these two things pull in opposite directions.
Two big choices are: do you have checksums for just filesystem metadata or both data and metadata, and are your checksums 'internal' (stored with the object that they are a checksum of) or 'external' (stored not with the object but with references to it). I suppose you can also do checksums of just data and not metadata, but I don't think anyone does that yet (partly because in most filesystems the metadata is data too, as it has things like names and access permissions that your raw bits make much less sense without).
The best option is to checksum everything and to use external checksums. The appeal of checksumming everything is hopefully obvious. The advantage of external checksums is that they tell you more than internal checksums do. Internal checksums cover 'this object has been corrupted after being written' while external checksums also cover 'this is the wrong object', ie they let you check and verify the structure of your filesystem. With internal checksums you know that you are looking at, say, an intact directory, but you don't know if it's actually the directory you think you're looking at.
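A toy illustration of the difference, using CRC32 and invented structures just to show the shape of the two checks:

    import zlib

    def crc(data):
        return zlib.crc32(data) & 0xffffffff

    # Internal: the object carries a checksum of itself. This catches a
    # block that was corrupted after being written, but a block that is
    # intact yet *wrong* (stale, or belonging to something else entirely)
    # still passes, because it is self-consistent.
    def verify_internal(block, crc_stored_in_block):
        return crc(block) == crc_stored_in_block

    # External: whatever points at the object (a directory entry, an
    # indirect block) records the object's checksum. Now a wrong-but-intact
    # block fails too, because it doesn't match what its parent expected
    # to find there.
    def verify_external(block, crc_stored_in_parent):
        return crc(block) == crc_stored_in_parent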
On the other hand, the easiest option to add to an existing filesystem is internal checksums of metadata only. To do this all you need to do is either find or claim some unused space for a single checksum in existing metadata structures like directory disk blocks or just add a checksum on the end of them as a new revision, which you can sometimes arrange so that almost no existing code cares and no existing on-disk data is invalidated. Doing only metadata is simpler because internal checksums present a problem for on-disk data, as there simply isn't any spare room in existing data blocks; they're all full of, well, user file data. In general adding internal checksums to data blocks means that, say, 4K of user data may no longer fit in a single on disk data block, which in practice will perturb a lot of assumptions made by user code.
(Almost all user code assumes that writing data in some power of two size is basically optimal and as a result does it all over the place. There are all sorts of bad things that happen if this is not the case.)
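Some rough arithmetic on why squeezing an internal checksum into data blocks hurts; the 4 KB block and 32-bit CRC here are just illustrative numbers, not any particular filesystem's layout.

    BLOCK = 4096        # on-disk block size (illustrative)
    CRC_SIZE = 4        # a 32-bit checksum stored inside each data block

    payload = BLOCK - CRC_SIZE   # only 4092 bytes of user data fit per block
    # A program that writes in 4096-byte chunks -- a very common pattern --
    # now straddles two on-disk blocks on every write, turning what used to
    # be a clean aligned write into a read-modify-write of the second block.
    print(payload)               # 4092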
There are two problems with external checksums that give you big heartburn if you try to add them to existing filesystems. The first is that you have to store a lot more checksums. As an example, consider a disk block of directory entries, part of a whole directory. With internal checksums this disk block needs a single checksum for itself, while with external checksums it needs one checksum per directory entry it contains (to let you validate that the inode the directory entry is pointing to is the file you think it is).
(Another way to put this is that any time a chunk of metadata points to multiple sub-objects, external checksums require you to find room for one checksum per sub-object, while internal checksums just require you to find room for one, for the chunk of metadata itself. It's extremely common for a single chunk of metadata to point to multiple sub-objects because this is an efficient use of space; directory blocks contain multiple directory entries per block, files have indirect blocks that point to multiple data blocks, and so on.)
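To put some made-up but plausible numbers on that, compare the checksum space needed for one directory block under the two schemes:

    BLOCK = 4096          # directory block size (illustrative)
    DIRENT = 64           # average on-disk directory entry size (illustrative)
    CSUM = 32             # a 256-bit checksum

    entries = BLOCK // DIRENT            # 64 directory entries per block
    internal = CSUM                      # one checksum for the whole block
    external = entries * CSUM            # one per entry: 2048 bytes

    print(internal, external)            # 32 bytes versus 2048 bytes

The exact ratio depends on sizes I've invented here, but the shape holds in general: external checksum space multiplies with the number of references, internal checksum space doesn't.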
The second is that you are going to have to update more checksums when things change. With external checksums, any time an object changes all references to it need to have their checksums updated to its new value, and then all references to the references probably need their checksums updated in turn, and so on until you get to the top of the tree. External checksums are a natural fit for copy on write filesystems (which are already changing all references up the tree) and probably a terrible fit for any filesystem that does in-place updates. And unfortunately (for checksums) most common filesystems today do in-place updates for various reasons.
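Here's a toy sketch of that ripple effect, treating the filesystem as a tree where every parent records the checksums of its children (roughly the Merkle-tree shape that copy-on-write filesystems end up with). Changing one leaf forces new versions of every ancestor up to the root:

    import hashlib
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Node:
        data: bytes          # this block's own contents
        children: tuple = () # (child_checksum, child_node) pairs

    def checksum(node):
        h = hashlib.sha256(node.data)
        for child_sum, _child in node.children:
            h.update(child_sum.encode())
        return h.hexdigest()

    def cow_update(node, path, new_data):
        """Rewrite the leaf reached by 'path' (a list of child indexes) and
        return a brand new root: every ancestor is copied so that it records
        the new checksum of its changed child. Nothing is modified in place."""
        if not path:
            return Node(new_data, node.children)
        i = path[0]
        new_child = cow_update(node.children[i][1], path[1:], new_data)
        kids = list(node.children)
        kids[i] = (checksum(new_child), new_child)
        return Node(node.data, tuple(kids))

An in-place-update filesystem would have to do the same walk up the tree on every write without getting the rest of copy-on-write's machinery in exchange, which is why this is such an awkward retrofit.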
PS: the upshot of this is that on the one hand I sympathize a fair bit with filesystems like ext4 and XFS that are apparently adding metadata checksums (that sound like they're internal ones) because they have a really hard job and it's better than nothing, but on the other hand I still want more.
2015-01-04
What makes a 'next generation' or 'advanced' modern filesystem, for me
Filesystems have been evolving in fits and starts for roughly as long as there have been filesystems, and I doubt that is going to stop any time soon. These days there are a number of directions that filesystems seem to be moving in, but I've come around to the view that one of them is of particular importance and is the defining characteristic of what I wind up calling 'modern', 'advanced', or 'next generation' filesystems.
By now, current filesystems have mostly solved the twin problems of performance and resilience in the face of crashes (although performance may need some re-solving in the face of SSDs, which change various calculations). Future filesystems will likely make incremental improvements, but I can't currently imagine anything drastically different.
Instead, the next generation frontier is in resilience to disk problems and improved recovery from them. At the heart of this are two things. First, a steadily increasing awareness that when you write something to disk (either HD or SSD), you are not absolutely guaranteed to either get it back intact or get an error. Oh, the disk drive and everything involved will try hard, but there are a lot of things that can go wrong, especially over long periods of time. Second, the rate at which these problems happen has not really been going down over time. Instead it has actually been going up, because the most common models are based on a chance of error per so much data, and the amount of data we store and use has kept going up and up.
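Some hedged back-of-the-envelope numbers: consumer drive spec sheets commonly quote an unrecoverable read error rate on the order of one per 10^14 bits read. Treat the figures below as illustrative arithmetic, not a measurement.

    URE_RATE = 1e-14      # unrecoverable read errors per bit read
                          # (a commonly quoted consumer-drive spec figure)
    drive_bytes = 8e12    # one full read of an 8 TB drive
    bits_read = drive_bytes * 8

    expected_errors = bits_read * URE_RATE
    print(expected_errors)   # ~0.64 -- very roughly a coin flip's chance of
                             # hitting a bad read every time you read the
                             # whole drive

The quoted per-bit rate hasn't improved much over the years while drive sizes have exploded, which is the 'going up, not down' effect in practice.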
The pragmatic result is that an increasing number of people are starting to worry about quiet data loss, feel that the possibility of it goes up over time, and want to have some way to deal with it and fix things. It doesn't help that we're collectively storing more and more important things on disks (hopefully with backups, yes yes) instead of on other media.
The dominant form that meeting this need is taking right now is checksums of everything on disk and filesystems that are aware of what's really happening in volume management. The former creates resilience (at least you can notice that something has gone wrong) and the latter aids recovery from it (since disk redundancy is one source of intact copies of the corrupted data, and a good idea anyways since whole disks can die).
(In this entry I'm talking only about local filesystems. There is a whole different evolutionary process going on in multi-node filesystems and multi-node object stores (which may or may not have a more or less POSIX filesystem layer on top). And I'm not even going to think about various sorts of distributed databases that hold increasingly large amounts of data for large operations.)
PS: Part of my bias here is that resilience is what I've come to personally care about. One reason for this is that other filesystem attributes are pragmatically good enough and not subject to obvious inefficiencies or marvelous improvements (except for performance through SSDs), and another reason is that storage is now big enough and cheap enough that it's perfectly reasonable to store extra data (sometimes a lot of extra data, eg disk mirrors) to help ensure that you can get your files back later.