2011-01-31
A little advantage of Django's automatic primary keys
If you don't specify manual primary keys on your schema tables, Django will automatically give you one in the standard form of an auto-incrementing integer. I believe that this is standard ORM behavior and you'll find it in pretty much any ORM interface. After coming to my SQL foreign key realization I've realized that there's an additional advantage to having this sort of primary key (over and above the ability to fix oopses in what is normally that table's primary key).
The advantage is simple: the primary key is unlikely to ever repeat, even after you delete entries.
(I think that technically you could have ID counter rollover, but this would require a lot of row inserts even with a 32-bit ID field.)
That the primary key doesn't repeat means that the primary key will uniquely identify a particular record even after the record is deleted. This is an extremely valuable property for historical records, such as audit log information. All you have to do is explicitly record the primary key value and you're done, you're now pretty much guaranteed to be able to accurately identify just what object you were talking about.
This property is not shared by many explicit primary key fields. While they might be unique in the live data, there's no guarantee that they will not repeat over time. For example, a simple account request system might use the requested login as the primary key on the grounds that you can't allow two requests for the same login. But then you have a request for a login name that is denied and deleted, and later you have another request for the same login; now any audit data or the like might be referring to two different requests.
So now I have two reasons to be happy that I've switched my Django schemas to not using explicit primary keys and let Django handle it for me.
(All of this is probably making SQL people twitch. Certainly it's doing that to the 'normalized SQL database design' portion of my brain.)
2011-01-29
Two things I have discovered about Django schema design
I've now gotten far enough into my application to have stubbed my toes on a few beginner's mistakes in Django and ORM schema design. Fortunately I've done this during development instead of production, so it's merely a valuable learning experience.
First, I should never make schema fields into explicit primary keys, no matter how attractive it seems and how SQL-proper it is. Especially I should not do this when the field is shown to or used by humans. Otherwise, sooner or later I will discover that either there's a typo in that field in some record or that there's some unanticipated need to change it after all. And then of course foreign keys make everyone else get heartburn, at least if you're working through Django's admin interface.
Letting Django have its invisible 'id' primary key field is much, much simpler and far more future proof, no matter how much it offends the SQL-aware portion of my brain.
(A related tragic mistake is directly linking audit records to the object that they're reporting on as a foreign key relationship, thereby insuring that either you can never delete the master object or that when you delete the master object you automatically lose all audit records associated with it.)
The second is that you cannot have an optional character or text field that is also unique (for instance, an optional login). Even if your database allows this in general (not all SQL databases allow UNIQUE fields to have several rows be NULL in the field), Django's admin interface will turn an unset field into an empty string, not a NULL, which really does have to be unique in the field.
Django's documentation covers that you really don't want to try to have fields that can be both blank and null (and what happens with character fields), but the documentation doesn't quite spell out the full consequences. Learn from my example; you really don't want to do that, even if it looks like you're getting away with it.
(If you are a particularly optimistic person, you can view this as Django saving you from making unportable assumptions about database behavior.)
2011-01-26
The various ways of writing a modern Python web app
As the flipside version of last entry, suppose that you want to write some sort of web thing in Python and you want to know what to use to do it. Broadly speaking, you have the following main options:
- Write it as a straight CGI.
This is probably the simplest approach for small stuff if you don't already have a WSGI server or a framework environment set up that you can just plug small applications into. Python's core library has plenty of batteries for this environment.
- Write your own WSGI stack from the ground up.
This will be a great learning experience and it's not that much work (the standard's CGI-based WSGI server is under a hundred lines of Python code, and the code works fine in production with a minor change to gracefully catch network write errors).
- Write a WSGI app for an existing WSGI server environment.
WSGI apps are good for straightforward things and WSGI is more flexible and less annoying than CGI, easier to plug into various setups, and performs (much) better under significant load. You might even be able to find some WSGI middleware that does stuff you want.
(I don't know what WSGI server environment you should use. The extent of my knowledge is that the Django people recommend either mod_wsgi or flup, per here.)
- Write an application for your favorite framework.
This will be the easiest way to do significant work with forms, database tables, and so on, since the framework has already built much of the low level code and infrastructure for you.
I've used most of these at various points.
We have a number of simple Python applications that are just CGIs and static HTML files, such as our self-serve network registration system. While CGIs don't perform as well as the other alternatives, this is perfectly fine for lightly used services and the CGI performance disadvantages are somewhat overdramatized anyways; DWiki started out as entirely a CGI and did fine, and even today it spends much of its time running as a CGI. And DWiki is much bigger (and slower to start) than any sane CGI that you'll write.
However: if you're writing a CGI but not using things like Python's
cgi module, you might as
well write your app as a WSGI app even if you run it as a CGI with the
standard's simple WSGI gateway. It's more future proof, less annoying,
and you're not giving up anything.
For DWiki, I wound up writing a WSGI stack from the ground up. This was a great learning experience and probably made a lot more sense in 2006 than it does today. It's not worth it if you don't want a learning experience; use one of the existing, well regarded WSGI server sets and get on with what's actually important about your application.
Our new account request system is being written as a Django-based application because it involves a lot of non-trivial work with forms, things like dynamically displaying with preloaded values from data, multiple levels of validation, redisplay with error messages, and so on. I started out trying to write it as a CGI and soon came to the conclusion that this was a non-starter; using Django has made it feasibly simple.
(While I'm using Django's ORM and a sqlite database backend, this is just because Django makes it easiest to handle data this way, not because it's actually necessary for our problem.)
I think that DWiki is a good illustration of an application where you should choose WSGI over a framework. DWiki is too large to run as a CGI if it's under decent load, but it also does very little that a framework would be much help with; it has simple, minimal form handling and no interest in locking data up inside a database.
2011-01-25
The modern Python web application stack (as I understand it)
Someone recently asked me what the modern Python equivalent of CGI was. The answer started out simple but the more I wrote the more complicated it got, until it wound up here.
At their base, modern Python web applications are written to an interface called WSGI, the 'Web Server Gateway Interface', a specification for the Python environment that web applications run in. WSGI is fairly low level (roughly on the level of CGI) and generally good, although it has a number of annoying corner cases that everyone ignores in practice. Frameworks such as Django are generally WSGI applications, although if you're using a framework you usually get to ignore the WSGI level entirely because the framework hides it (and you want this hiding).
(Mechanically, a 'WSGI application' is a callable Python object that your server will call with a WSGI environment on each request. The WSGI specification describes the environment your callable object is handed, what you can do with it, and how you return the headers and text of your reply.)
To actually deploy and run a WSGI app, you need something that provides a WSGI environment; call this a 'WSGI server'. In theory your webserver could do this, translating directly from HTTP to WSGI and back, but in practice this is relatively rare (although Apache's mod_wsgi is apparently well regarded). So you usually need a level of indirection; your webserver will go from HTTP to some protocol and speak that protocol to your WSGI server, which then maps it to WSGI and runs your application. Popular protocols for this are SCGI and FastCGI; SCGI is much simpler but also less popular.
(You are not restricted to these two choices; any way of talking to your web server will do. If you want to, you can use CGI (ie, running your WSGI server as a CGI-BIN, which means starting it up from scratch on every request). If you are sufficiently crazy you can add an extra level of indirection, having your web server run CGIs which talk SCGI to your SCGI-based WSGI server.)
Today I would expect that a common Python web application stack thus looks something like this:
- the world talks HTTP to Apache or lighttpd or nginx
- your web server (whatever it is) talks FastCGI to your WSGI server (which is running as a separate, independent process or set of processes)
- your WSGI server constructs a WSGI environment from the HTTP request and calls your framework of choice as a WSGI application
- your framework calls your code
Results trickle back in reverse.
WSGI is not a completely comprehensive standard; the most important
thing it omits (as outside of its scope) is how WSGI applications get
created and configured. The WSGI specification says that WSGI servers
are handed a Python callable and they call it with a WSGI environment
on each request, but it doesn't say how the callable object is created;
that's up to your server. Each WSGI server may have a different answer
to this, and if you're writing one from scratch (for any environment)
you can pick whichever method is most convenient for you, down to
hard-coding the creation of your application callable in your main()
function.
(By now there may be a de facto Python standard on how WSGI servers are supposed to do this; I haven't been paying close attention for a while.)
The WSGI specification itself is short and includes quite readable example code for simple WSGI apps and a CGI-based WSGI server (well, most of it). WSGI is really not that complicated (although there are subtle corners).
2011-01-20
Something I've worked out about Django's admin interface
I'm not very far along in my adventures with Django so far (which means that I spend a lot of time trying to make Django do things in wrong and non-Django ways), but I've played around enough with Django's admin interface to have worked out one important thing: I could fritter away endless amounts of time fine-tuning it to get everything just right.
I'm the kind of person who already fiddles endlessly with the fine details of the output formats of my programs; Django's admin interface raises this to new and exciting levels, since it's not just the output that I'm fiddling with. In some ways it's a close cousin to a Turing tar pit, in that the admin interface is not a general purpose environment so its limitations and features present you with the puzzle of how you can contort it into doing what you want, especially when what you want seems so simple and easy to describe.
(Sure, you could go in with chainsaw of general Python code, but this is sort of like the various '<whatever> golf' hacker games; half the challenge is to achieve your goal in the shortest and simplest way possible.)
In some Django applications the admin interface matters, but not in this one, at least not as I'm currently designing it. Although it's superficially tempting to have sponsors use the admin interface to approve pending account requests (and maybe to create new ones), it feels like the wrong way to do it and a bad fit. There are all sorts of aspects of the approval process that don't fit how the admin interface normally operates; I think I'd basically wind up committing an ever uglier pile of hacks just to get a nice looking UI.
(But I do like the admin UI. It has all sorts of nice little bits, and it makes me sad that my version is not going to be half as good.)
Thus, before I let my fiddling urges run away with me I have to remember that the admin interface doesn't really matter; it'll only get used by core system staff to do infrequent database-level operations like adding a new account sponsor. Fiddling around with the admin interface does not get my application actually written. The only thing that does that is sitting down and grinding out various forms.
(I don't regret the fiddling I did with the admin interface; I think I got a decently usable result, one that's somewhat better than a raw view of my schema, and I got a fair amount of Django experience in the process.)
2011-01-18
What I am trying to do with Django
To follow up on the last entry, what I am using Django for is (re)writing our account request handling system. The simple version of how people get accounts here is a three step process. The would-be user tells us at least their desired Unix login, their name, their email address, and which professor is sponsoring them. We ask the professor if they actually want to sponsor the person's account; if the professor says yes, we create the account and email the requester with details.
(There are a few non-professor account sponsors for things like staff accounts. Professors can sponsor accounts for whoever they want, including people not otherwise associated with the university.)
What I want to automate is the process of submitting requests and having them approved (which seems like a great fit for a simple application built on a modern web framework); we'll continue to do the actual account creation by hand, using a set of scripts we have for it. So far, this has a pretty straightforward two-table and two-form application design; one table for submitted account requests, one table for sponsors, one form for submitting an account request, and a second form for sponsors to approve or reject accounts. Of course, now we get to the complications.
The big complication is that the current 'sponsors' information bundles three or four separate things together: what name people ask to sponsor their account, who actually approves the account, and what home directory new users should be put into (and what Unix group they should be assigned to). The name is usually a professor's, but it can also be a generic thing like 'Professional Masters Student' or 'Graduate Chair'; this means that the same person may have several sponsor entries that they approve accounts for. Home directories are complicated because some professors (and special sponsors) have their own home directories for sponsored accounts, but others put new accounts in the general home directories for their research group.
(DRY suggests that it would be a bad idea to manually replicate a research group's home directory information into the sponsor entries for each of its professors. The OO way out of this is different from the SQL way out.)
Then there are the workflow complications:
- Points of Contact can approve
accounts in place of one of their professors. I don't know how
to cleanly represent this in a schema at all if I want to reuse
the same form that sponsors use.
(Besides, I already have the case that one person can approve requests for multiple 'sponsors' entries.)
- the mass intake of new graduate students is handled differently.
The Graduate Office prepares a list of new students and who is
theoretically supervising them, then the supervisors approve their
new students, and finally we email all of the approved people to
ask them to basically come pick their login.
This creates a couple of schema complications. First, an account request's approval status is different from whether or not it is 'complete' (has enough information to be created). New grad student accounts start out both unapproved and incomplete (since we don't know what login the new grad student wants), become approved but incomplete, and are finally completed when the new grad student picks a login.
Second, there needs to be some way for new grad students to access their approved but incomplete account request so that they can fill in their desired login name, and some sort of authentication for this access.
(I just realized that this implies that the login cannot be the primary key on the 'requests' table, although it still has to be unique.)
- sometimes sponsors just outright make new accounts for people,
including picking their login (this is most common with new
administrative staff). Making them first fill in the request
form then immediately approve it is kind of silly; they should
be able to fill in a preapproved request.
- oh yeah, we need an audit trail for when various things happened and who did them. Should this audit trail simply be text messages, or should I try to give it more structure?
(So far I am assuming that core staff will use the general Django administrative interface to do things like add new sponsors.)
All of these complications leave me looking at a scheme where either the tables are multiplying and cross-connecting, or things are mutating into objects that look less and less like anything with a good SQL representation.
(Talking to the duck here has already been useful in making me realize a few things about the problem.)
Sidebar: the OO way versus the SQL way of handling home directories
The OO way is that the 'sponsors' object has both a 'group' field and a 'homedirs' field that can be empty. The 'group' field points to an object for the research group, which has a 'homedirs' field of its own. If sponsors.homedirs is non-empty, we use that; otherwise, we use sponsors.group.homedirs (which must be non-empty).
The SQL way is probably to have a separate mapping table that translates entities to homedirs. Both groups and sponsors have entries in this table (we require that their names be non-overlapping, which is not a problem in practice). Rows in the 'sponsors' table have a foreign key that points to an entry in the mapping table, either the sponsor's group's entry or the sponsor's individual entry.
(The SQL mapping table approach is roughly how the current system handles this.)
2011-01-17
Wrestling with how to design a schema for a Django app
I've finally gotten to do something with Django, and naturally this has left me wrestling with schema design. This is the first time I've used an ORM and I find that I'm being pulled two ways, neither of which I think are right.
(I'll pause here to note that the data I have to keep track of in this application is tiny, on the order of a few hundred rows in total. I'm using Django's ORM support because this gives me a bunch of functionality for free, I have to keep track of the data somehow, I don't feel like writing yet another set of file parsing and loading and saving code, and using an SQLite backend makes configuration trivial.)
On the one hand are my SQL instincts. With them, I see the problem through the lens of database tables and joins and I think about things like how to normalize everything. I'm pretty sure that this view is wrong in the end, partly because I don't have direct access to SQL joins and so on and partly because it feels like complete overkill to normalize a table that has perhaps 50 rows.
On the other hand are my Python object design instincts; with them I see how the objects would look if I wasn't using an ORM at all, just writing a straight Python program that was loading all of the data from files in some convenient format and then manipulating it (and saving it) without having to think like a database at all. These objects are full of dicts and arrays and names, with 'this defaults to that if not specified' behavior that's easy in code but stupid in databases. They relate to each other in a spiderweb of perfectly sensible relationships that have nothing to do with set operations.
I'm totally sure that the object oriented view is just as wrong as the pure SQL one, because it turns Django's ORM layer into a dumb object store (well, a set of hash tables more or less). I could use the ORM this way if I wanted to; at my tiny data sizes, brute force will work perfectly well.
Somewhere in the middle is an ORM schema design that is sensible both from a Python perspective and a database perspective. I just have to find out what it is, which probably means thinking about the problem more. (Maybe I should write it up for WanderingThoughts, as a form of talking to the duck. That could get me back to first principles, instead of thinking about how to solve in it Python or SQL.)
PS: I consider all of this a valuable learning experience, not something to get frustrated over. I'm certainly glad to be doing a project with a modern web framework instead of banging HTML and CGIs together by hand.