Why having CR LF as your line ending is a mistake

January 30, 2017

In my entry on what we still use ASCII CR for today, I mentioned in passing that it was unfortunate that protocols like HTTP had continued to specify that their line ending was CR LF instead of plain LF, and called it a mistake. Aneurin Price disagreed with this view, citing the history that CR LF was there first as a line ending. This history is absolutely true, but it doesn't change that CR LF is a mistake today and pretty much always was. In fact, we can be more general. The mistake is not specifically CR LF; the mistake is making any multi-byte sequence be your line ending.

The moment you introduce a multi-byte line ending sequence, you require every piece of code that wants to recognize line endings to use some sort of state machine, because you have to recognize a sequence. A CR by itself is not a line ending, and an LF by itself is theoretically not a line ending; only a CR LF combined is a line ending, and you must recognize that somehow. This state machine may be as (apparently) simple as using a library call 'find the sequence \r\n' instead of a library call 'find the byte \n' (or \r on old Macs), or it may be more elaborate when you are attempting to read an IO stream character by character and stop the moment you hit end-of-line. But you always need that state machine in some form, and with it you need state.
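
To make this concrete, here is a minimal sketch in Python of reading one CR LF terminated line (the read_line_crlf name and the byte-at-a-time reading from a binary stream are just for illustration; real code might bury the same logic inside a buffered search). The saw_cr flag is the state that any version of this code must carry in some form:

    def read_line_crlf(stream):
        # Read from a binary stream until a CR LF sequence. The saw_cr
        # flag is the unavoidable state: a CR on its own means nothing
        # until we see what byte follows it.
        line = bytearray()
        saw_cr = False
        while True:
            b = stream.read(1)
            if not b:
                if saw_cr:
                    line += b"\r"     # dangling CR at EOF is just data
                return bytes(line)    # EOF with no full terminator
            if saw_cr and b == b"\n":
                return bytes(line)    # complete CR LF; the line is done
            if saw_cr:
                line += b"\r"         # that CR was just data after all
                saw_cr = False
            if b == b"\r":
                saw_cr = True
            else:
                line += b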

If you have a single-byte line terminator, life is much easier. You read until you find the byte, or you scan until you find the byte, and you are done. No state is needed to recognize your end-of-line marker.
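
The equivalent sketch for a single-byte terminator has no flag and no special cases; every byte is either the terminator or part of the line. (Over an in-memory buffer the whole job collapses to a single stateless call such as buf.find(b"\n").)

    def read_line_lf(stream):
        # With a one-byte terminator there is nothing to remember.
        line = bytearray()
        while True:
            b = stream.read(1)
            if not b or b == b"\n":
                return bytes(line)
            line += b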

(There's also no ambiguity about what you should do when you see just one byte of the line terminator, and thus no disagreement or differing behavior between implementations. Such differences definitely exist in handling CR LF, and they lead to various sorts of problems in practice.)
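
Python happens to make this easy to demonstrate with its real bytes APIs: given the same bytes, a strict 'only CR LF ends a line' reading and a tolerant 'CR, LF, or CR LF all end lines' reading (which is what bytes.splitlines() implements) disagree about how many lines there even are:

    data = b"one\r\ntwo\nthree\rfour\r\n"

    # Strict: only the full CR LF sequence ends a line, so the bare LF
    # and the bare CR are just data inside the second line.
    data.split(b"\r\n")    # [b'one', b'two\nthree\rfour', b'']

    # Tolerant: CR, LF, and CR LF all count as line endings.
    data.splitlines()      # [b'one', b'two', b'three', b'four']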

The decision by Unix and Mac OS to have a single character represent logical end of line in their standard text format, regardless of how many ASCII characters had to be printed to the terminal to actually achieve a proper newline, is the correct one. It simplifies a huge amount of code and quietly speeds it up a little, at the minor cost (on Unix) of requiring some more smarts inside the kernel.
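
As an illustration of where those kernel smarts live: on Unix the translation is done by the tty driver and is controlled by the termios ONLCR output flag. A quick Python check, assuming a Unix system with a terminal attached:

    import sys
    import termios

    # On a Unix tty, the ONLCR output flag tells the kernel's tty
    # driver to turn each LF that programs write into CR LF on its way
    # to the terminal, so programs only ever deal with plain \n.
    if sys.stdout.isatty():
        oflag = termios.tcgetattr(sys.stdout.fileno())[1]
        print("kernel maps LF to CR LF:", bool(oflag & termios.ONLCR))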

(This is also the right place to put the smarts, since far more text is processed on typical systems than is ever printed out to the terminal. The place to pay the cost is at the low-frequency and central spot of actually displaying text to the user, not the high-frequency and widely spread spot of everything that processes text line by line.)

PS: The relevant Wikipedia page credits Multics with the idea of using a single character for logical end of line and converting it on output; Multics picked LF for this job for perfectly sensible reasons. See the Newline history section.


Comments on this page:

By Ricky at 2017-01-31 09:42:13:

Realistically speaking, the line ending for Macs hasn't been '\r' for over 15 years — Mac OS X introduced '\n' as the default line ending, and as far as I can tell, most applications now use that. Am I missing something?

By cks at 2017-01-31 10:09:45:

My mistake in writing the entry; I should have checked the current state of affairs and qualified things. I'm going to stick a little update in the entry.

When you said “state machine” in the context of network protocols, I thought you were going to talk about buffers. That’s an even more painful consequence than just the complexity of scanning for a sequence. If you read a file in chunks, or you receive packets from the network, or do any chunked I/O, a multi-byte end of line means you can’t just process each chunk individually. A chunk might end in CR, which then needs to be accounted for when the next chunk/packet/bufferful/etc is processed: does that one start with LF or not?

And in case you don’t happen to have anywhere to put the state, you get performance implications instead, e.g. a readline function that blocks at the end of one chunk until the next one has been read – where a single-byte end of line would have allowed the line to be returned from the first chunk with no further I/O delay.
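
A sketch of what that carried state looks like in practice (Python again; the split_crlf_chunks name is made up, and the chunks argument stands in for whatever produces your packets or bufferfuls):

    def split_crlf_chunks(chunks):
        # Split an iterable of arbitrary byte chunks into CR LF
        # terminated lines. 'pending' is the cross-chunk state; it can
        # end in a dangling CR whose meaning depends on the next chunk.
        pending = b""
        for chunk in chunks:
            pending += chunk
            while True:
                i = pending.find(b"\r\n")
                if i < 0:
                    break
                yield pending[:i]
                pending = pending[i + 2:]
        if pending:
            yield pending                 # unterminated trailing data

    # A CR LF split across two chunks is handled only because of the
    # carried state, and note that the first line cannot be produced
    # until the second chunk arrives:
    #   list(split_crlf_chunks([b"hello\r", b"\nworld\r\n"]))
    #   -> [b'hello', b'world']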
