Why having CR LF as your line ending is a mistake
In my entry on what we still use ASCII CR for today, I mentioned in passing that it was unfortunate that protocols like HTTP had continued to specify that their line ending was CR LF instead of plain LF, and called it a mistake. Aneurin Price disagreed with this view, citing the history that CR LF was there first as a line ending. This history is absolutely true, but it doesn't change that CR LF is a mistake today and pretty much always was. In fact, we can be more general. The mistake is not specifically CR LF; the mistake is making any multi-byte sequence be your line ending.
The moment you introduce a multi-byte line ending sequence you
require every piece of code that wants to recognize line endings
to use some sort of state machine, because you have to recognize a
sequence. A CR by itself is not a line ending, and a LF by itself
is theoretically not a line ending; only a CR LF combined is a line
ending, and you must recognize that somehow. This state machine
may be as (apparently) simple as using a library call 'find the
sequence \r\n' instead of a library call 'find the byte \n' (or
\r on old Macs), or it may be more elaborate when you are attempting
to read an IO stream character by character and stop the moment you
hit end-of-line. But you always need that state machine in some
form, and with it you need state.
If you have a single byte line terminator, life is much easier. You read until you find the byte, or you scan until you find the byte, and you are done. No state is needed to recognize your end of line marker.
(There's also no ambiguity about what you should do when you see just one byte of the line terminator, and thus no disagreement and different behavior between implementations. Such differences definitely exist in handling CR LF and they lead to various sorts of problems in practice.)
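To make the difference concrete, here is a minimal Python sketch of reading one line a byte at a time under each convention. The function names are made up for illustration, and the handling of a bare CR or a bare LF in the CR LF reader (keep it as ordinary data) is just one of several policies an implementation could pick, which is exactly the ambiguity described above.

```python
import io

def read_line_lf(stream):
    """Read one line ended by a single LF byte; no extra state is needed."""
    line = bytearray()
    while True:
        b = stream.read(1)
        if b == b"" or b == b"\n":
            return bytes(line)
        line += b

def read_line_crlf(stream):
    """Read one line ended by CR LF, remembering whether we just saw a CR."""
    line = bytearray()
    saw_cr = False  # the extra state that a two-byte terminator forces on us
    while True:
        b = stream.read(1)
        if b == b"":
            # End of input: a dangling CR is kept as data (a policy choice).
            if saw_cr:
                line += b"\r"
            return bytes(line)
        if saw_cr:
            if b == b"\n":
                return bytes(line)
            # The earlier CR was not part of a line ending; keep it as data
            # (again a policy choice, other code may do something different).
            line += b"\r"
            saw_cr = False
        if b == b"\r":
            saw_cr = True
        else:
            # Note that a bare LF lands here and is treated as ordinary data.
            line += b

data = io.BytesIO(b"one\r\ntwo\rstill two\r\n")
first = read_line_crlf(data)   # b"one"
second = read_line_crlf(data)  # b"two\rstill two"
```

The LF reader has one test per byte and nothing to remember between bytes. The CR LF reader has to carry saw_cr from byte to byte and has to decide what a lone CR or lone LF means.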
The decision by Unix and Mac OS to have a single character represent logical end of line in their standard text format, regardless of how many ASCII characters had to be printed to the terminal to actually achieve a proper newline, is the correct one. It simplifies and slightly speeds up a huge amount of code, at the minor cost (on Unix) of requiring some more smarts inside the kernel.
(This is also the right place to put the smarts, since far more text is processed on typical systems than is ever printed out to the terminal. The place to pay the cost is at the low-frequency and central spot of actually displaying text to the user, not the high-frequency and widely spread spot of everything that processes text line by line.)
PS: The relevant Wikipedia page credits the idea of using a single character for logical end of line and converting it on output to Multics, which picked LF for this job for perfectly reasonable reasons. See the Newline history section.
How you can abruptly lose your filesystem on a software RAID mirror
We almost certainly just completely lost a software RAID mirror with no advance warning (we'll know for sure when we get a chance tomorrow to power-cycle the machine in the hopes that this revives a drive). This comes as very much a surprise to us, as we thought that this was not supposed to be possible short of a simultaneous two-drive failure out of the blue, which should be an extremely rare event. So here is what happened, as best we can reconstruct right now.
In December, both sides of the software RAID mirror were operating
normally (at least as far as we know; unfortunately the filesystem
we've lost here is /var). Starting around January 4th, one of the
two disks began sporadically returning read errors to the software RAID
code, which caused the software RAID to redirect reads to the other
side of the mirror but not otherwise complain to us about the
read errors beyond logging some kernel messages. Since these read
errors did not show up anywhere that our software RAID monitoring
looks, it never sent us email about them.
(It's possible that SMART errors were also reported on the drive, but we don't know; smartd monitoring turns out not to be installed by default on CentOS 7 and we never noticed that it was missing until it was too late.)
On the morning of January 27th, the other disk failed outright in a way that caused Linux to mark it as dead. The kernel software RAID code noticed this, of course, and duly marked it as failed. This transferred all IO load to the first disk, the one that had been seeing periodic errors since January 4th. It immediately fell over too; although the kernel has not marked it as explicitly dead, it now fails all IO. Our mirrored filesystem is dead unless we can somehow get one or the other of the drives to talk to us.
The fatal failure here is that nothing told us about the software RAID code having to redirect reads from one side of the mirror to the other due to IO errors. Sure, this information shows up in kernel messages, but so does a ton of other unstructured crap; the kernel message log is the unstructured dumping ground for all sorts of things and as a result, almost nothing attempts to parse it for information (at least not in a standard, regular installation).
Well, let me amend that. It appears that this information is actually
available through sysfs, but nothing actually monitors it (in
particular, mdadm doesn't). There is an errors file in
/sys/block/mdNN/md/dev-sdXX/ that contains a persistent counter
of corrected read errors (this information is apparently stored in
the device's software RAID superblock), so things like mdadm's
monitoring could track it and tell you when there were problems.
It just doesn't.
(So if you have software RAID arrays, I suggest that you put together
something that monitors all of your errors files for increases
and alerts you prominently.)
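As a rough starting point, here is a minimal Python sketch of such a monitor. It assumes that each errors file holds a single integer and that the files are found under /sys/block/md*/md/dev-*/; the function names and the idea of keeping the previous counts in a JSON state file are my own illustration, not something any existing tool does.

```python
import glob
import json
import os

def read_error_counts(sys_block="/sys/block"):
    """Return {path: count} for every per-device software RAID errors file."""
    counts = {}
    pattern = os.path.join(sys_block, "md*", "md", "dev-*", "errors")
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                counts[path] = int(f.read().strip())
        except (OSError, ValueError):
            # Unreadable or not a plain integer; skip it rather than guess.
            continue
    return counts

def find_increases(previous, current):
    """Return {path: (old, new)} for every counter that has grown."""
    increases = {}
    for path, new in current.items():
        old = previous.get(path, 0)  # unseen devices count as starting at 0
        if new > old:
            increases[path] = (old, new)
    return increases

def check(state_file, sys_block="/sys/block"):
    """Compare current counts to the saved ones, save the new ones, report growth."""
    try:
        with open(state_file) as f:
            previous = json.load(f)
    except (OSError, ValueError):
        previous = {}  # first run, or a damaged state file
    current = read_error_counts(sys_block)
    with open(state_file, "w") as f:
        json.dump(current, f)
    return find_increases(previous, current)
```

One way to use this would be to call check() from a cron job with a state file path of your choosing and print a line for each entry it returns, so that cron's usual email delivers the alert. Because a device that has never been seen before is treated as starting from zero, the first run will report every device that already has a non-zero count, which seems like the right thing to be told about.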