2011-05-24
Proper SQL database design versus the real world
If you read up on SQL databases, there are all sorts of rules that you should follow in order to design your schema and set up your database the right way, so that everything is efficient and so on. This is the domain of normalization, carefully considered indexes, and query plans; sometimes it is also the domain of carefully minimized field sizes.
I do not want to knock these rules and the people who advocate following them, at least not too hard. There are sensible reasons for things like normalization, and indexes et al are important for good performance, even if all of this makes your life harder when you are designing the database and writing queries against it.
Well. Sort of. The thing is, all of this SQL hairshirtism only really matters if you have a significantly sized database or are operating in a demanding environment, or both. Especially these days, most SQL databases are small and are not operating in demanding environments. Often the entire database will fit in RAM.
(Indexes still matter a bit for an in-RAM database, but nowhere near as much as for larger databases which have to go to disk a lot.)
This means that a lot of SQL optimizations and database design rules simply do not matter for a lot of databases and a lot of programmers. Going out of your way to do them is yet another case of misplaced optimization; you might as well simply design whatever SQL schema is simplest for your application and not worry about it. So my view today is that before you worry about any of these issues, you should ask yourself how many records your database is going to hold, how big its total size will be, and other sizing questions. Almost all of the time I think the answer will be 'small enough that you don't have to worry about it'.
(There once was a day, back in the time of small machines, when even a relatively small SQL database needed to worry about these things. That day is long gone.)
As a side note, another form of SQL hairshirtism is trying very hard to do as much work as possible in SQL instead of in your program because it is 'more efficient' to let the database server do all the work. In real life, it is often more efficient to do it in whatever way is fastest to program. Sometimes doing queries or updates in SQL can save you a bunch of annoying coding, but other times you are better off doing the work in code instead of trying to play in the SQL Turing tar pit (also).
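(To make the size of that choice concrete, here is a minimal sketch using Python's standard sqlite3 module and a made-up little orders table; getting per-customer totals either way is a handful of lines, which is exactly why 'whichever is fastest for you to write' is usually the right tiebreaker.)

    import sqlite3

    # A made-up, tiny 'orders' table just to illustrate the tradeoff.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)])

    # Doing the work in SQL: the database does the grouping and summing.
    totals_in_sql = dict(conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"))

    # Doing the work in code: pull the raw rows and total them yourself.
    totals_in_code = {}
    for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
        totals_in_code[customer] = totals_in_code.get(customer, 0) + amount

    print(totals_in_sql, totals_in_code)   # both give the same totals

For a database of this size either version is effectively instantaneous; the only cost that really differs is your programming time.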
2011-05-18
Why open source projects should use 'git rebase' or the equivalent
One of those 'vigorous debates' in version control is whether you should make frequent use of rebasing changes in order to present a clean version history or whether you should preserve the original, real development history of changes, warts and merges and all. As it happens I am a somewhat involved bystander in this, so I have a grumpy sysadmin's answer: if you are an open source project of moderate size or larger, you should absolutely rebase patches. In fact it would be good if you went further than that.
Why is simple: git bisect and its equivalents in other DVCSes. The more
reliably bisection works on your history, the more outside people (like
me) can send you problem reports about your new release or beta that
include the magic phrase 'and this specific change is where it broke'.
(In many cases, this phrase drastically reduces how much time you have to spend debugging and fixing problems.)
When it works, bisection is marvelous. But, speaking from personal experience, when I am trying to bisect through strange code and I hit an unbuildable or untestable tree, I pretty much give up on the spot. I simply don't know enough about your project to deal with the issues of bisecting through and around a bad tree (especially given the paradox of too detailed bug reports).
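(For what it's worth, git does give you tools for this: 'git bisect run' will drive the whole search from a small script, and exiting with code 125 tells it to skip a revision that can't be built or tested. Here is a minimal sketch of such a script, where 'make' and './run-tests' are stand-ins for whatever a project's real build and test commands are.)

    #!/usr/bin/env python3
    # Sketch of a driver script for 'git bisect run'; 'make' and './run-tests'
    # are placeholders for a project's real build and test commands.
    import subprocess
    import sys

    # Exit code 125 tells 'git bisect run' to skip a revision that doesn't
    # even build, instead of marking it good or bad.
    if subprocess.run(["make"]).returncode != 0:
        sys.exit(125)

    # Otherwise report the test result: 0 means this revision is good,
    # anything else (here 1) means it is bad.
    sys.exit(0 if subprocess.run(["./run-tests"]).returncode == 0 else 1)

Of course, writing even this much requires knowing how your project builds and tests itself, which is exactly what an outside bug reporter often doesn't.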
When I'm bisecting, having a project history that includes every partially done modification, misstep, checkpoint, and failed approach that a developer ever made is not a feature. Even if the tree builds and is testable, the presence of partially complete modifications may mean that my bisection confidently declares that the problem changeset is halfway through the development of a modification. This is technically correct but almost certainly not useful, because what you want me to tell you is which fully developed change created the problem.
(Now that I think about it, part of the problem with bisection here is that a binary search is not quite the right model for what you want to do, at least in the general case. But that's another entry, once I've thought about this a bit more.)
So: once your project is large enough that it's helpful to have outside people bisecting things to find where problems were introduced, you want a clean history to enable this as much as possible. Thus you want to rebase, and ideally you would have a rule that all changes committed to the master tree must leave it buildable and testable.
2011-05-06
A realization about cache entry lifetime and validation
In the process of some recent thinking about caches, I had one of those obvious realizations about the relationship between cache entry lifetimes and how correct you need to be with cache (in)validation.
Under many circumstances, there is an inverse relationship: the shorter your entries are potentially valid for, the looser and less rigorously correct you can be about validating or invalidating them. Conversely, the longer the entries live, the more careful you need to be with validation; the ultimate version of this is that entries never time out on their own, so you need to be absolutely correct.
(And the ultimate version of short lifetimes is to have a very short lifetime and no (in)validation at all, where entries only vanish when they time out. This works very well in some situations.)
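A minimal sketch of that extreme in Python (the class and its names are mine, not from any particular library): entries are never explicitly invalidated, they just quietly expire.

    import time

    class TTLCache:
        # The 'very short lifetime, no (in)validation at all' approach:
        # a cached value is served until it expires, then recomputed.
        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.entries = {}                # key -> (expiry time, value)

        def get(self, key, compute):
            now = time.monotonic()
            hit = self.entries.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]                # fresh enough, even if it's a 'mistake'
            value = compute()                # any cached mistake dies with the old entry
            self.entries[key] = (now + self.ttl, value)
            return value

    # With a TTL of a few seconds, a wrong entry can only mislead callers
    # briefly, so skipping invalidation entirely is often good enough.
    cache = TTLCache(ttl_seconds=5)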
What this is about, of course, is how fast cached mistakes will get 'fixed' by the entry disappearing anyways. When they will disappear fast anyways, you can afford more mistakes than when they will disappear only slowly or not at all. Mistakes may be accidents or they may be things that are deliberately incomplete because being complete is too hard, too complex, or too time-consuming.
This relationship is not completely linear; at a certain point you hit an entry lifetime that is more or less functionally equivalent to 'forever'. Where this point is depends on your specific application and circumstances, including how picky your users are.
2011-05-04
Why xterm's cut and paste model is non-standard and limited
In yesterday's entry I made the offhand
comment that xterm's cut and paste was non-standard. Since this caused
some confusion, it's time to amplify on that and, as a bonus, explain
why xterm's model only works in rather limited circumstances.
People came up with a standard model of cut, paste, and selections very early in the history of windowed GUIs (it may date to the Xerox PARC work, I'm not sure), which goes something like this. First, there is always a selection although it may be zero width, in which case you can call it the cursor or the insertion point. If you type or paste text, the text replaces the current selection (when the current selection is zero width, this has the effect of inserting it where the selection is). You cut and paste by making a selection, invoking the 'copy' (or 'cut') operation, probably changing the insertion point and maybe the current window, and then invoking the 'paste' operation. Let us call this the clipboard model of cut & paste, after the idea that there is a clipboard that holds the 'cut' text. You can see this model even in a lot of graphical programs on Unix, and it is all over the Mac (from the early days), Windows, and most other GUI systems.
The xterm model of cut & paste is much simpler to explain and faster
to use: you make selections with the left mouse button (and extend them
with the right mouse button) and you paste them with the middle mouse
button (where by 'paste' xterm actually means 'you send them to xterm
as keyboard input', although some other X programs following this model
can do a real paste).
The problem with an xterm-like copy and paste model is that it only
works well in an environment where you don't cut text, only
copy it, and where you don't use 'paste' operations to overwrite
text. The second is easy to see; in an environment where selecting text
immediately makes it what gets pasted, you simply can't do a 'select
text, overwrite it by pasting with copied text from elsewhere' operation
because the moment you select the text you want to be overwritten you
lose the original text you were going to paste. The first is an issue
of data loss: if you actually cut text instead of just copying it, that
text now exists only in your clipboard, and so you don't want it casually
discarded or overwritten; you want discarding it to require an exceptional,
explicit step, such as another 'copy' operation. Overwriting cut text
simply by making a new selection is far too easy and leads directly to
losing the cut text.
(In theory, xterm doesn't have this problem because you can always go
back and re-select the text you were going to copy. In practice there
are a lot of ways to lose the text, such as closing the xterm with it
or clearing the screen inside the xterm, and periodically this bites
people.)
These limitations clearly make the xterm model not suitable as a general
cut & paste model (you can argue about the 'overwrite selection'
feature, but easily losing cut text is clearly bad), however convenient
it is for an xterm-like situation. The standard model of cut & paste
is slower but does not have these limitations, so if you have a cut
operation (not just a 'copy' one), I think that you really do need the standard model.
(You can probably speed up the standard model by allowing chorded mouse button operations, but most interface designers don't seem to like those. I happen to disagree with them provided that the chorded mouse button operations are standard across a bunch of programs, in line with my long-standing beliefs.)
Sidebar: X's split-brain approach to this
In order to deal with the problems of the xterm approach to cut and
paste, X actually has two different selection models: a selection-based
one and a clipboard-based one. Most of the time you don't notice
the difference, but there are a few programs that only read from the
clipboard and ignore the current selection (and explicit 'paste'
operations in programs like Firefox generally use the clipboard as the
source, because they have to).