What we still use ASCII CR for today (on Unix)

January 28, 2017

I recently read Things Every Hacker Once Knew, which is really mostly about the somewhat less grand topic of ASCII, RS-232, and serial terminals (via, and also). Part of the article is a writeup of all of the ASCII control characters, covering their original purposes and what they're still used for today, if anything. It has this to say about CR (aka Ctrl-M, C \r, decimal byte 13 (hex 0x0d, octal 015)):

CR (Carriage Return)
It is now possible that the reader has never seen a typewriter, so this needs explanation: "carriage return" is the operation of moving your print head or cursor to the left margin. Windows, other non-Unix operating systems, and some Internet protocols (such as SMTP) tend to use CR-LF as a line terminator, rather than bare LF. Pre-Unix MacOS used a bare CR.

This description may make it sound as if CR is no longer used on Unix except for careful compatibility with old protocols like SMTP and newer ones like HTTP. That is misleading, because CR is still in active use on Unix today.

(Sadly, HTTP and other new(ish) protocols continue to specify that 'lines' in the protocol are terminated with CR LF instead of plain LF. This is generally an annoying mistake that simply complicates everyone's life, but that's another rant.)

You see, printing a CR has an extremely useful property: it painlessly resets the cursor to the start of the line but doesn't advance it to the next line. So if you print something without a newline, print a CR, and then carefully print again just so, you will overwrite your original output with new output on the same line. This is the traditional and frequently used low-rent way of creating a constantly updated line for program progress, the current status of something, or basically anything where you want frequent updates but not to scroll things madly the way you would if you printed each update as a new line.
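As a rough illustration, here is a minimal Python sketch of the CR trick. The names status_frame and show_progress are made up for this example, and the space padding is just one way of doing the 'carefully print again just so' part: it blanks out leftover characters when the new status is shorter than the old one.

```python
import sys
import time


def status_frame(text: str, prev_len: int = 0) -> str:
    """Return what to print so `text` overwrites a previous status line.

    The leading CR moves the cursor back to column 0 without advancing
    to the next line. Trailing spaces cover any leftover characters if
    the previous status (of length prev_len) was longer than `text`.
    """
    padding = " " * max(0, prev_len - len(text))
    return "\r" + text + padding


def show_progress(total: int, delay: float = 0.0, out=sys.stdout) -> None:
    """Print a single, constantly updated progress line (hypothetical helper)."""
    prev_len = 0
    for done in range(total + 1):
        text = f"processed {done}/{total}"
        out.write(status_frame(text, prev_len))
        out.flush()  # no newline is written, so flush explicitly
        prev_len = len(text)
        time.sleep(delay)
    out.write("\n")  # finally move on to a fresh line


if __name__ == "__main__":
    show_progress(20, delay=0.05)
```

The explicit flush matters because standard output is usually line-buffered on a terminal, and this output deliberately contains no newlines until the very end.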

Of course this has limits. The big limit is that what you want to print and over-print can't be longer than one row in your (emulated) terminal. If your terminal is, say, 60 columns wide and you print a 70-character status, the last ten characters or so will overflow onto the next physical line, along with the cursor, and then your CR will only return the cursor to the start of that second line. If you write another 70-character status update, you'll advance yet another line, and so on.
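One hedged way to stay inside that limit is to clip the status to the terminal width before printing it. The sketch below assumes Python's shutil.get_terminal_size() reports a usable width; the helper name fit_to_row is invented for illustration, and leaving the last column free is a conservative choice rather than a requirement. It also counts characters with len(), which is only right when every character occupies one column.

```python
import shutil


def fit_to_row(text: str, columns: int) -> str:
    """Clip `text` so that it occupies at most columns - 1 characters.

    Staying short of the last column avoids depending on how a given
    terminal behaves when the final cell of a row is written.
    """
    limit = max(0, columns - 1)
    if len(text) <= limit:
        return text
    if limit <= 3:
        return text[:limit]
    # Mark the truncation so the reader can tell something was cut.
    return text[: limit - 3] + "..."


if __name__ == "__main__":
    # Falls back to 80x24 if the width cannot be determined.
    columns = shutil.get_terminal_size(fallback=(80, 24)).columns
    status = "x" * 70
    print("\r" + fit_to_row(status, columns), end="", flush=True)
    print()
```

This only looks at the width once per call, so it does nothing for the case where the terminal is resized between one status update and the next.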

(Dealing with multi-line status updates requires going to full cursor addressing using curses(3) or the like. This is a lot more complicated, which is why people really like to stick to just printing CRs for as long as possible and thus why some things will explode if you run them in too-narrow terminals or resize the terminal on them as they're running.)

As a side note, this isn't the only way to do same-line status updates; you can also backspace over what you've printed by printing some number of Ctrl-Hs. The Ctrl-H trick tends to be what gets used if you just want to update a bit of status at the end of a line, eg updating the percentage in a message like 'current progress on frobnicating things: XX%'. The CR trick usually gets used when counting how many Ctrl-Hs to print (and printing them all) gets annoying. However, Ctrl-H has a quiet advantage; it often does a better job of handling overly-long status lines, because in many (emulated) terminals enough Ctrl-Hs will back up to previous lines. If you print 90 normal characters and then 90 Ctrl-Hs, you usually wind up with the cursor where you started no matter what the width of the terminal is.
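The Ctrl-H variant can be sketched in the same style. The helper name backspace_update is hypothetical; the one thing the sketch relies on is that a backspace only moves the cursor left and erases nothing, so a shorter replacement has to blank out the leftover tail itself.

```python
import sys
import time


def backspace_update(old: str, new: str) -> str:
    """Return the characters that replace a trailing `old` with `new`.

    One BS (Ctrl-H) is emitted per character of `old`, which moves the
    cursor back to where `old` started; `new` is then printed over it.
    """
    out = "\b" * len(old) + new
    extra = len(old) - len(new)
    if extra > 0:
        # Blank the leftover tail, then step back over the blanks.
        out += " " * extra + "\b" * extra
    return out


if __name__ == "__main__":
    sys.stdout.write("current progress on frobnicating things: ")
    shown = ""
    for pct in (0, 25, 50, 75, 100):
        field = f"{pct:3d}%"  # fixed width, so no leftover tail here
        sys.stdout.write(backspace_update(shown, field))
        sys.stdout.flush()
        shown = field
        time.sleep(0.1)
    sys.stdout.write("\n")
```

Using a fixed-width field for the percentage keeps every update the same length, which is the easy case; the blanking branch only comes into play when the new text is shorter than the old.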

(Reading the description of DEL in the article might make you think you could print DELs instead of BSs, with the extra advantage that this would not merely move the cursor back but also erase that pesky existing status for you. In practice (emulated) terminals generally don't respond at all to having DEL printed out to them; it gets ignored and does nothing.)


Comments on this page:

By Zev Weiss at 2017-01-28 02:30:47:

I think there are also a lot of implicit CRs going on all the time in most "cooked" terminal modes (I'm not an expert in TTY stuff, but I think the source of that may be here?). You can see it (or its absence) in action by switching to raw mode and writing CR-less LFs:

$ stty raw && printf 'abc\ndef\nxyz\r\n123\r\n' && stty sane
abc
   def
      xyz
123
$

By Aneurin Price at 2017-01-28 13:36:03:

"Sadly, HTTP and other new(ish) protocols continue to specify that 'lines' in the protocol are terminated with CR LF instead of plain LF. This is generally an annoying mistake that simply complicates everyone's life, but that's another rant."

Hmm. I think this is a bit backwards, really.

The Unix habit of pretending that LF means CRLF is a historical accident that comes from a time when it was perfectly reasonable to think "well, people rarely want to use LF on its own, so we can save one byte per line by inserting a CR automatically so it never needs to actually be there".

Granted, the assumption that people rarely want to use LF on its own is at least accurate, so it's better than the bafflingly insane decision made by Apple to do it the other way around, but it's still an overcomplication that's decades past the point that its flimsy justification had any use at all.

Nowadays it's no more than an annoying mistake that simply complicates everyone's life, but everyone just has to deal with it because Unix is everywhere and it will never change.

By Aneurin Price at 2017-01-28 13:42:54:

"I think there are also a lot of implicit CRs going on all the time in most "cooked" terminal modes"

Yes, the way Unix handles newlines is that the tty will implicitly insert CRs when it gets LFs. This makes things a bit of a mess really, but like seemingly everything in modern operating systems it's there for backwards compatibility with past mistakes that we just have to live with.

In some sense the mistake is that “end of logical line” was ever defined based on any combination of CR and LF, which are essentially bytecode for typewriters. There should have been one single control character with that specific meaning instead. (Denotational vs operational.) Depending on how you look at it, you could even argue that this character already exists: 0x1E, “Record Separator”. But I’ve never seen any software use that for anything at all. And maybe end-of-line should be a distinct control character anyway.

Another subtle difference in this area is that DOS likes to consider its CR+LF as a separator while Unix prefers to think of its LF as a terminator, which makes a difference at the end of a text file. Unix tools consider DOS text to have an incomplete last line while DOS tools consider Unix text to have an extraneous empty line at the end. Again a single control character with a single shared definition would have been nice.

By cks at 2017-01-30 21:43:25:

Regardless of its history, I maintain that using CR LF for end of line is a technical mistake. Why doesn't fit within the sensible margins of a comment, so I wrote WhyCRLFIsAMistake to cover it.

(It turns out that using LF as logical end of line goes back to 1964, with Multics.)

Written on 28 January 2017.

Last modified: Sat Jan 28 00:16:58 2017