Your databases always have a schema

September 21, 2018

Over on Mastodon, I had a database opinion:

Obvious database hot take: your stored data always has a schema, unless your code neither reads nor writes it as anything except an opaque blob. What NoSQL changes is how many places that schema exists and how easy it is to have multiple schemas for your data.

(SQL allows multiple schemas too. You just change what fields mean in your code, reuse them, extend them, etc etc.)

One schema that your data has is implicit in what data fields your code reads and writes, what it puts in them (both at the mechanical level of data types and at a higher level), and its tacit knowledge of how those fields relate to each other. If you're actually doing anything with your data, this schema necessarily exists. Of course, it may be a partial schema; your stored data may include fields that are neither read nor written by your code, and even for fields that you read and write, your current code may not reflect the full requirements and restrictions that you have in mind for the data.

SQL databases make part of your schema explicit and enforced in the database, but it is only the part that can be sensibly described and checked there. Generally there are plenty of constraints on your data that are not checked at the SQL level for various reasons (and they may not even be checked in your code). As a result, you can bypass the nominal (SQL) schema of your database by reusing and repurposing database fields in ways that SQL doesn't check for or enforce. This in-SQL schema is in addition to the implicit schema that's in your code.

(You can tell that your code's implicit schema exists even with an SQL database, even if your code auto-reads the SQL schemas, by asking what happens if the DBAs decide to rename a bunch of tables and a bunch of fields in those tables, maybe dropping some and adding others. It is extremely likely that the entire program will explode until someone fixes the code to match the new database arrangement. In other words, you have two copies of your schema, one in the database and one in your code, and those copies had better agree.)

Since your schema lives partly in your code, different pieces of code can have different schema for the same data. Given that you can bypass the SQL schema, this is true whether or not you're using a NoSQL database with no schema; NoSQL just makes it easier and perhaps more likely to happen. In some ways NoSQL is more honest than SQL, because it tells you straight up that it's entirely up to your code to have and maintain a schema. Certainly in NoSQL your code is the only place with a schema, and so you have a chance to definitely only have one schema for your data instead of two.

On the other hand, one advantage of SQL is that you have a central point that explicitly documents and enforces at least some of your schema. You don't have to try to reverse engineer even the basics of your schema out of your code, and you know that there is at least basic agreement about data facts on ground (for example, what tables and what fields there are, what basic types can go in those fields, and perhaps what relationships there definitely are between various fields via constraints and foreign keys).

(I've been thinking this thought for some time and was pushed over the edge today by reading yet another article about how SQL databases were better than NoSQL ones partly because they mean you have a schema for your data. As mentioned, I think that there are advantages to having your schema represented in your database, but it is absolutely not the case that NoSQL has no schema or that SQL means you only have one schema.)

Written on 21 September 2018.
« Why I mostly don't use ed(1) for non-interactive edits in scripts
Some differences between various versions of ed(1) »

Page tools: View Source, Add Comment.
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Sep 21 20:58:49 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.