2018-09-21
Your databases always have a schema
Over on Mastodon, I had a database opinion:
Obvious database hot take: your stored data always has a schema, unless your code neither reads nor writes it as anything except an opaque blob. What NoSQL changes is how many places that schema exists and how easy it is to have multiple schemas for your data.
(SQL allows multiple schemas too. You just change what fields mean in your code, reuse them, extend them, etc etc.)
One schema that your data has is implicit in what data fields your code reads and writes, what it puts in them (both at the mechanical level of data types and at a higher level), and its tacit knowledge of how those fields relate to each other. If you're actually doing anything with your data, this schema necessarily exists. Of course, it may be a partial schema; your stored data may include fields that are neither read nor written by your code, and even for fields that you read and write, your current code may not reflect the full requirements and restrictions that you have in mind for the data.
SQL databases make part of your schema explicit and enforced in the database, but it is only the part that can be sensibly described and checked there. Generally there are plenty of constraints on your data that are not checked at the SQL level for various reasons (and they may not even be checked in your code). As a result, you can bypass the nominal (SQL) schema of your database by reusing and repurposing database fields in ways that SQL doesn't check for or enforce. This in-SQL schema is in addition to the implicit schema that's in your code.
(You can tell that your code's implicit schema exists even with an SQL database, even if your code auto-reads the SQL schemas, by asking what happens if the DBAs decide to rename a bunch of tables and a bunch of fields in those tables, maybe dropping some and adding others. It is extremely likely that the entire program will explode until someone fixes the code to match the new database arrangement. In other words, you have two copies of your schema, one in the database and one in your code, and those copies had better agree.)
Since your schema lives partly in your code, different pieces of code can have different schema for the same data. Given that you can bypass the SQL schema, this is true whether or not you're using a NoSQL database with no schema; NoSQL just makes it easier and perhaps more likely to happen. In some ways NoSQL is more honest than SQL, because it tells you straight up that it's entirely up to your code to have and maintain a schema. Certainly in NoSQL your code is the only place with a schema, and so you have a chance to definitely only have one schema for your data instead of two.
On the other hand, one advantage of SQL is that you have a central point that explicitly documents and enforces at least some of your schema. You don't have to try to reverse engineer even the basics of your schema out of your code, and you know that there is at least basic agreement about data facts on ground (for example, what tables and what fields there are, what basic types can go in those fields, and perhaps what relationships there definitely are between various fields via constraints and foreign keys).
(I've been thinking this thought for some time and was pushed over the edge today by reading yet another article about how SQL databases were better than NoSQL ones partly because they mean you have a schema for your data. As mentioned, I think that there are advantages to having your schema represented in your database, but it is absolutely not the case that NoSQL has no schema or that SQL means you only have one schema.)
Why I mostly don't use ed(1)
for non-interactive edits in scripts
One of the things that is frequently
said about
ed(1)
is that it remains useful for non-interactive modifications
to files, for example as part of shell scripts. I even mentioned
this as a good use of ed
today in my entry on why ed
is not
a good (interactive) editor today,
and I stand by that. But, well, there is a problem with using ed
this way, and that problem is why I only very rarely actually use
ed
for scripted modifications to files.
The fundamental problem is that non-interactive editing with ed
has no error handling. This is perfectly reasonable, because ed
was originally written for interactive editing and in interactive
editing the human behind the keyboard does the error handling, but
when you apply this model to non-interactive editing it means that
your stream of ed
commands is essentially flying blind. If the
input file is in the state that you expected it to be, all will go
well. If there is something different about the input file, so that
your line numbers are off, or a '/search/
' address doesn't match
what you expect (or perhaps at all), or any number of other things
go wrong, then you can get a mess, sometimes a rapidly escalating
one, and then you will get to the end of your ed
commands and
'w
' the resulting mess into your target file.
As a result of this, among other issues, ed
tends to be my last
resort for non-interactive edits in scripts. I would much rather
use sed
or something else that is genuinely focused on stream
editing if I can, or put together some code in a language where I
can include explicit error checking so I'll handle the situation
where my input file is not actually the way I thought it was going
to be.
(If I did this very often, I would probably dust off my Perl.)
If I was creating an ideal version of ed
for non-interactive
editing, I would definitely have it include some form of conditionals
and 'abort with a non-zero exit status if ...' command. Perhaps
you'd want to model a lot of this on what sed
does here with
command blocks, b
, t
(and T
in GNU sed), and so on, but I
can't help but think that there has to be a more readable and clear
version with things like relatively explicit if conditions.
(I have a long standing sed
script that uses some clever tricks
with b
and the pattern space and so on. I wrote it in sed
to
deliberately explore these features and it works, but it's basically
a stunt and I would probably be better off if I rewrote the script
in a language where the actual logic was not hiding in the middle
of a Turing tarpit.)
PS: One place this comes up, or rather came up years ago and got
dealt with then, is in what diff
format people use for patch
.
In theory you can use ed scripts; in practice, everyone considers
those to be too prone to problems and uses other formats. These
days, about the only thing I think ed format diffs are used for is
if you want to see a very compact version of the changes. Even then
I'm not convinced by their merits against 'diff -u0
', although
we still use ed format diffs in our worklogs
out of long standing habit.
Sidebar: Where you definitely need ed
instead of sed
The obvious case is if you want to move text around (or copy it),
especially if you need to move text backwards (to earlier in the
file). As a stream editor, sed
can change lines and it can move
text to later in the file if you work very hard at it, but it can
never move text backward. I think it's also easier to delete a
variable range of lines in ed, for example 'everything from a start
line up to but not including an end marker'.
Ed will also do in-place editing without the need to write to a
temporary file and then shuffle the temporary file into place. I'm
neutral on whether this is a feature or not, and you can certainly
get ed
to write your results to a new file if you want to.