Wandering Thoughts archives

2018-09-21

Your databases always have a schema

Over on Mastodon, I had a database opinion:

Obvious database hot take: your stored data always has a schema, unless your code neither reads nor writes it as anything except an opaque blob. What NoSQL changes is how many places that schema exists and how easy it is to have multiple schemas for your data.

(SQL allows multiple schemas too. You just change what fields mean in your code, reuse them, extend them, etc etc.)

One schema that your data has is implicit in what data fields your code reads and writes, what it puts in them (both at the mechanical level of data types and at a higher level), and its tacit knowledge of how those fields relate to each other. If you're actually doing anything with your data, this schema necessarily exists. Of course, it may be a partial schema; your stored data may include fields that are neither read nor written by your code, and even for fields that you read and write, your current code may not reflect the full requirements and restrictions that you have in mind for the data.

SQL databases make part of your schema explicit and enforced in the database, but it is only the part that can be sensibly described and checked there. Generally there are plenty of constraints on your data that are not checked at the SQL level for various reasons (and they may not even be checked in your code). As a result, you can bypass the nominal (SQL) schema of your database by reusing and repurposing database fields in ways that SQL doesn't check for or enforce. This in-SQL schema is in addition to the implicit schema that's in your code.

(You can tell that your code's implicit schema exists even with an SQL database, even if your code auto-reads the SQL schemas, by asking what happens if the DBAs decide to rename a bunch of tables and a bunch of fields in those tables, maybe dropping some and adding others. It is extremely likely that the entire program will explode until someone fixes the code to match the new database arrangement. In other words, you have two copies of your schema, one in the database and one in your code, and those copies had better agree.)

Since your schema lives partly in your code, different pieces of code can have different schema for the same data. Given that you can bypass the SQL schema, this is true whether or not you're using a NoSQL database with no schema; NoSQL just makes it easier and perhaps more likely to happen. In some ways NoSQL is more honest than SQL, because it tells you straight up that it's entirely up to your code to have and maintain a schema. Certainly in NoSQL your code is the only place with a schema, and so you have a chance to definitely only have one schema for your data instead of two.

On the other hand, one advantage of SQL is that you have a central point that explicitly documents and enforces at least some of your schema. You don't have to try to reverse engineer even the basics of your schema out of your code, and you know that there is at least basic agreement about data facts on ground (for example, what tables and what fields there are, what basic types can go in those fields, and perhaps what relationships there definitely are between various fields via constraints and foreign keys).

(I've been thinking this thought for some time and was pushed over the edge today by reading yet another article about how SQL databases were better than NoSQL ones partly because they mean you have a schema for your data. As mentioned, I think that there are advantages to having your schema represented in your database, but it is absolutely not the case that NoSQL has no schema or that SQL means you only have one schema.)

programming/DatabasesAlwaysSchemas written at 20:58:49; Add Comment

Why I mostly don't use ed(1) for non-interactive edits in scripts

One of the things that is frequently said about ed(1) is that it remains useful for non-interactive modifications to files, for example as part of shell scripts. I even mentioned this as a good use of ed today in my entry on why ed is not a good (interactive) editor today, and I stand by that. But, well, there is a problem with using ed this way, and that problem is why I only very rarely actually use ed for scripted modifications to files.

The fundamental problem is that non-interactive editing with ed has no error handling. This is perfectly reasonable, because ed was originally written for interactive editing and in interactive editing the human behind the keyboard does the error handling, but when you apply this model to non-interactive editing it means that your stream of ed commands is essentially flying blind. If the input file is in the state that you expected it to be, all will go well. If there is something different about the input file, so that your line numbers are off, or a '/search/' address doesn't match what you expect (or perhaps at all), or any number of other things go wrong, then you can get a mess, sometimes a rapidly escalating one, and then you will get to the end of your ed commands and 'w' the resulting mess into your target file.

As a result of this, among other issues, ed tends to be my last resort for non-interactive edits in scripts. I would much rather use sed or something else that is genuinely focused on stream editing if I can, or put together some code in a language where I can include explicit error checking so I'll handle the situation where my input file is not actually the way I thought it was going to be.

(If I did this very often, I would probably dust off my Perl.)

If I was creating an ideal version of ed for non-interactive editing, I would definitely have it include some form of conditionals and 'abort with a non-zero exit status if ...' command. Perhaps you'd want to model a lot of this on what sed does here with command blocks, b, t (and T in GNU sed), and so on, but I can't help but think that there has to be a more readable and clear version with things like relatively explicit if conditions.

(I have a long standing sed script that uses some clever tricks with b and the pattern space and so on. I wrote it in sed to deliberately explore these features and it works, but it's basically a stunt and I would probably be better off if I rewrote the script in a language where the actual logic was not hiding in the middle of a Turing tarpit.)

PS: One place this comes up, or rather came up years ago and got dealt with then, is in what diff format people use for patch. In theory you can use ed scripts; in practice, everyone considers those to be too prone to problems and uses other formats. These days, about the only thing I think ed format diffs are used for is if you want to see a very compact version of the changes. Even then I'm not convinced by their merits against 'diff -u0', although we still use ed format diffs in our worklogs out of long standing habit.

Sidebar: Where you definitely need ed instead of sed

The obvious case is if you want to move text around (or copy it), especially if you need to move text backwards (to earlier in the file). As a stream editor, sed can change lines and it can move text to later in the file if you work very hard at it, but it can never move text backward. I think it's also easier to delete a variable range of lines in ed, for example 'everything from a start line up to but not including an end marker'.

Ed will also do in-place editing without the need to write to a temporary file and then shuffle the temporary file into place. I'm neutral on whether this is a feature or not, and you can certainly get ed to write your results to a new file if you want to.

sysadmin/EdScriptErrorProblem written at 00:18:31; Add Comment


Page tools: See As Normal.
Search:
Login: Password:
Atom Syndication: Recent Pages, Recent Comments.

This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.