Your databases always have a schema

September 21, 2018

Over on Mastodon, I had a database opinion:

Obvious database hot take: your stored data always has a schema, unless your code neither reads nor writes it as anything except an opaque blob. What NoSQL changes is how many places that schema exists and how easy it is to have multiple schemas for your data.

(SQL allows multiple schemas too. You just change what fields mean in your code, reuse them, extend them, etc etc.)

One schema that your data has is implicit in what data fields your code reads and writes, what it puts in them (both at the mechanical level of data types and at a higher level), and its tacit knowledge of how those fields relate to each other. If you're actually doing anything with your data, this schema necessarily exists. Of course, it may be a partial schema; your stored data may include fields that are neither read nor written by your code, and even for fields that you read and write, your current code may not reflect the full requirements and restrictions that you have in mind for the data.

SQL databases make part of your schema explicit and enforced in the database, but it is only the part that can be sensibly described and checked there. Generally there are plenty of constraints on your data that are not checked at the SQL level for various reasons (and they may not even be checked in your code). As a result, you can bypass the nominal (SQL) schema of your database by reusing and repurposing database fields in ways that SQL doesn't check for or enforce. This in-SQL schema is in addition to the implicit schema that's in your code.

(You can tell that your code's implicit schema exists even with an SQL database, even if your code auto-reads the SQL schemas, by asking what happens if the DBAs decide to rename a bunch of tables and a bunch of fields in those tables, maybe dropping some and adding others. It is extremely likely that the entire program will explode until someone fixes the code to match the new database arrangement. In other words, you have two copies of your schema, one in the database and one in your code, and those copies had better agree.)

Since your schema lives partly in your code, different pieces of code can have different schema for the same data. Given that you can bypass the SQL schema, this is true whether or not you're using a NoSQL database with no schema; NoSQL just makes it easier and perhaps more likely to happen. In some ways NoSQL is more honest than SQL, because it tells you straight up that it's entirely up to your code to have and maintain a schema. Certainly in NoSQL your code is the only place with a schema, and so you have a chance to definitely only have one schema for your data instead of two.

On the other hand, one advantage of SQL is that you have a central point that explicitly documents and enforces at least some of your schema. You don't have to try to reverse engineer even the basics of your schema out of your code, and you know that there is at least basic agreement about data facts on ground (for example, what tables and what fields there are, what basic types can go in those fields, and perhaps what relationships there definitely are between various fields via constraints and foreign keys).

(I've been thinking this thought for some time and was pushed over the edge today by reading yet another article about how SQL databases were better than NoSQL ones partly because they mean you have a schema for your data. As mentioned, I think that there are advantages to having your schema represented in your database, but it is absolutely not the case that NoSQL has no schema or that SQL means you only have one schema.)


Comments on this page:

By Greg A. Woods at 2018-09-22 16:38:02:

One other advantage to having a schema in the DB is that you can much more easily manage migration of your overall data schema over time as your code changes.

Code and data are necessarily separate, at least they are if you consider that your system survives over more than one execution and especially if your system evolves to meet changing needs, expectations, and understandings.

I've found in systems that I've worked on recently, where code and DB schemas are derived from one shared definition, that evolution of the schema is far safer and more reliable if the persistent store (the DB) contains a recorded schema as well, one that matches the current state of the data at any given time. With some care in defining conversion code to support any previously released version of the schema this means we can take any new version of the code and have it successfully run on any old DB.

Furthermore we've found that the persistent DB schema is also useful in helping manage optional features in library code shared by several applications. Here we can use the presence of a table, or a field in a certain table, to enable a feature in the library code, and to allow this feature to be enabled and disabled by the evolution of the application and specific instances of the DB.

I don't see how these two things could be done reliably if there were not also a persistent schema associated with the persistent data. I guess in theory the only persistent schema strictly needed would be a single value, a version ID. However all the other traditional features of an SQL schema make the whole system far more robust and manageable, and flexible.

By Hal at 2019-02-25 11:23:45:

I always thought every application should have a rigid database with a well planned schema until I ran into a coordinating system that supported a large number of commercial applications at various release levels as well as various computer systems using various operating systems. In the part where the coordinating system sends instructions to the other systems the database has an XML datatype column. Each combination of operating system / application / release has its own XML schema that is defined within the code for that particular combination. You have to purchase these modules separately. When the base code calls the specific system it uses the module which interprets the XML and sends the correct commands to the other systems. Now you can support each type of combination independently and coordinate everything from one consistent source.

By Salient Panda at 2019-02-25 21:11:03:

This is analogous to the comparison of strongly typed vs loosely typed languages. It's a question of how many tools are helping you ensure that you've fully considered your mental model and have correctly translated it into working software.

Written on 21 September 2018.
« Why I mostly don't use ed(1) for non-interactive edits in scripts
Some differences between various versions of ed(1) »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri Sep 21 20:58:49 2018
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.