Does CR LF as a line ending cause extra problems with buffers?

February 14, 2017

In reaction to my entry about why having CR LF as your line ending is a mistake, Aristotle Pagaltzis raised the issue of CR LF's consequences for buffers in a comment:

When you said “state machine” in the context of network protocols, I thought you were going to talk about buffers. That’s an even more painful consequence than just the complexity of scanning for a sequence. [...]

My first reaction was that I didn't think a multi-byte line ending sequence causes extra problems, because dealing with line-oriented input through buffering already gives you enough of them. Any time you read input in buffers but want to produce output in lines, you need to deal with the problem that a line may not end in the current buffer. This is especially common if you're reading through input in fixed-size chunks; you would have to be very lucky to always have a line end right at the end of every 4k block (or 16k block or whatever). Sooner or later a block boundary will fall in the middle of a line and there you are. So you have to be prepared to glue lines together across buffers no matter what.
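As a minimal sketch of that gluing for the single-byte case, assuming the input arrives as an iterable of bytes chunks and that lines end in a bare LF (the function name lines_from_chunks is made up for illustration):

```python
def lines_from_chunks(chunks):
    """Yield LF-terminated lines (without the LF) from an iterable of bytes chunks."""
    pending = b""  # start of a line that did not end in an earlier chunk
    for chunk in chunks:
        start = 0
        while True:
            nl = chunk.find(b"\n", start)
            if nl == -1:
                # No more line endings here; carry the remainder forward.
                pending += chunk[start:]
                break
            # Glue any carried-over partial line onto this one.
            yield pending + chunk[start:nl]
            pending = b""
            start = nl + 1
    if pending:
        # Input ended without a final LF; hand back the partial line.
        yield pending
```

Each chunk is scanned exactly once and on its own terms; the only thing carried between chunks is the unfinished line itself.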

This is too simple a view, though, once you (ie, I) think about it more. When your line ending is a single byte, you have an unambiguous situation within a single buffer; either the line definitely ends in the buffer or it doesn't. Your check for the line ending is 'find occurrence of byte <X>' and once this fails you'll never have to re-check the current buffer's contents. This is not true with a multi-byte line ending, because the line ending CR LF sequence may be split over a buffer boundary. This means that you can no longer scan each buffer independently. Either you need to scan them together so that such split CR LF sequences are fused back together, or you need to remember that the last byte in the current buffer is a CR and look for a bare LF at the start of the next buffer.
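Here is a hedged sketch of the second approach, remembering a trailing CR and checking for a bare LF at the start of the next buffer. It assumes strict CR LF line endings (a lone CR or a lone LF is treated as ordinary data) and bytes chunks as input; the function name crlf_lines_from_chunks is invented for this example.

```python
def crlf_lines_from_chunks(chunks):
    """Yield CR LF-terminated lines (without the CR LF) from bytes chunks."""
    pending = b""  # unfinished line so far, possibly ending in a CR
    for chunk in chunks:
        start = 0
        # The extra state: did the data so far end in a CR that might be
        # the first half of a CR LF split across the chunk boundary?
        if pending.endswith(b"\r") and chunk[:1] == b"\n":
            yield pending[:-1]
            pending = b""
            start = 1
        while True:
            end = chunk.find(b"\r\n", start)
            if end == -1:
                # May leave a trailing CR in pending; the next chunk decides
                # whether it was a line ending or just data.
                pending += chunk[start:]
                break
            yield pending + chunk[start:end]
            pending = b""
            start = end + 2
    if pending:
        # Input ended mid-line (possibly on a dangling CR).
        yield pending
```

Compared with the single-byte case, the scan of a chunk can no longer conclude "no line ending here" on its own; a chunk that ends in CR leaves the question open until the next chunk arrives.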

Of course, CR LF line endings aren't the only case in modern text processing where you have multi-byte sequences. A great deal of modern text is encoded in UTF-8, and many Unicode codepoints are encoded as multi-byte sequences in UTF-8; if you want to recognize such a codepoint in buffers of UTF-8 text, you have the same problem that its encoding may start at the end of one buffer and finish at the start of the next. It feels like there ought to be a general way of dealing with this that could then be trivially applied to the CR LF case.
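For the UTF-8 half of this, one existing answer is an incremental decoder that holds on to an incomplete sequence between calls. Python's standard library has one in the codecs module; the sketch below wraps it in a made-up helper, decode_utf8_chunks, and assumes bytes chunks as input. It is only an illustration of the state-carrying idea, not a claim that this is the general mechanism.

```python
import codecs


def decode_utf8_chunks(chunks):
    """Yield decoded text from bytes chunks, tolerating split UTF-8 sequences."""
    # The decoder object is where the cross-buffer state lives: it keeps
    # the leading bytes of an unfinished sequence until the rest shows up.
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True makes the decoder raise UnicodeDecodeError if the input
    # ended partway through a multi-byte sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

For example, if the two bytes that encode 'ï' are split between two chunks, the first call returns only the text before it and the second call returns the 'ï' along with whatever follows.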

(As Aristotle Pagaltzis kind of mentions later in his comment, this is going to involve storing state somewhere, either explicitly in a data structure or implicitly in the call stack of a routine that's pulling in the next buffer's worth of data.)

Written on 14 February 2017.

Last modified: Tue Feb 14 00:56:35 2017