TCP/IP and a consequence of reliable delivery guarantees

November 14, 2019

I recently read My hardest bug to debug (via), which discusses an interesting and hard-to-find bug that caused an industrial digital camera used for barcode scanning to hang. The process of diagnosis (and the lessons learned from it) is interesting, so I urge you to go read the article now before reading further here, because I have to spoil the actual bug.

(Really, go read the article. This is your last chance.)

One part of the control system worked by making a TCP connection to the camera, doing some initial setup, and then leaving the connection open so that it could later send any setting changes to the camera without having to re-open a connection. It turned out that the camera had an undocumented behavior of sending scan results over this TCP connection (as well as making them available in other ways). The control system didn't expect this return traffic, so it never listened for responses on the TCP connection. The article ends with, in part:

I still don't understand how this caused the camera to lock up. We were receiving the TCP results via Telnet but we weren't reading the stream. Did it just build up in some buffer? How did this cause the camera to lock up? I still can't answer these questions.

The most likely guess is that yes, the sent data built up in a collection of buffers on both the receiver and the sender. This would cause the hang because eventually the camera software attempted to send more data to the network, and the OS on the camera put the software to sleep because there wasn't any more buffer room.

While you might blame common networking APIs a bit here, in large part this is a deep but in some sense straightforward consequence of TCP/IP promising reliable delivery of your bytes in a world of finite and often limited resources. A properly operating receiving system cannot throw away bytes after it's ACK'd them, so it must buffer them; a sensible system will have a limit on how many bytes it will buffer before it stops accepting any more. Similarly, a properly operating sending system can't generally throw away bytes after accepting them from an application, so if the receiving system isn't accepting more bytes the sending system has to (eventually) stop accepting them from the application. All of this is familiar in general as backpressure. When the backpressure propagates all the way back to the sending application, the application either stalls or gets a 'no more buffer space' error from the operating system.
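You can watch this happen on a single machine. The following is a minimal sketch, not a reconstruction of the camera's actual software: it opens a TCP connection over loopback, never reads on the receiving side, and counts how many bytes a non-blocking sender can push before the operating system refuses more. The function name fill_until_backpressure and its parameters are made up for illustration, and the number you get depends on your OS and its socket buffer tuning.

```python
import socket


def fill_until_backpressure(chunk_size=4096, max_bytes=64 * 1024 * 1024):
    """Send to a peer that never reads; return bytes accepted before the OS says no."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()  # connected, but we never call recv() on it

    # Non-blocking mode turns "no buffer room" into an error instead of a stall.
    sender.setblocking(False)

    chunk = b"x" * chunk_size
    sent = 0
    try:
        while sent < max_bytes:  # safety cap so the loop always terminates
            try:
                sent += sender.send(chunk)
            except BlockingIOError:
                # Sender and receiver buffers are full: backpressure has
                # reached the application.
                break
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return sent


if __name__ == "__main__":
    print("bytes accepted before backpressure:", fill_until_backpressure())
```

The total is finite and usually not very large, which is the point: the buffers in the kernel on both ends are all the slack there is.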

(Where the APIs come in is that when there's no more buffer space, they generally opt to have the application's attempt to send more data just stall instead of producing an error. This is generally easier for many applications to handle, but sometimes it's not what you want.)
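To illustrate the stalling behavior, here is a hedged sketch in Python using ordinary blocking sockets over loopback. A plain blocking sendall() to a peer that never reads would simply hang forever, so this version sets a socket timeout purely so that the stall becomes observable. The name blocking_send_stalls, the payload size, and the timeout are assumptions for the demonstration; the payload just needs to be larger than the combined socket buffers on your system.

```python
import socket


def blocking_send_stalls(payload_size=64 * 1024 * 1024, timeout=0.5):
    """Return True if a large send to a non-reading peer stalls past `timeout`."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()  # accepted, but never read from

    # Without a timeout, sendall() would sleep indefinitely once the
    # buffers filled up, which is the default behavior described above.
    sender.settimeout(timeout)
    try:
        sender.sendall(bytes(payload_size))
        return False  # everything fit in the buffers (unexpected at this size)
    except socket.timeout:
        return True  # the send stalled waiting for buffer room
    finally:
        sender.close()
        receiver.close()
        listener.close()


if __name__ == "__main__":
    print("send stalled:", blocking_send_stalls())
```

A program that never sets a timeout and never checks whether its peer is reading has no way to notice this condition; from the outside it just looks hung.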

TCP's reliable delivery guarantee means that you can only send so much data to something that isn't dealing with it. You cannot just send data forever and have it vanish into the void because no one is dealing with it; that wouldn't be reliable delivery. After all, the receiving application might wake up and start reading that accumulated data some day, and if it does, all the data had better be there for it.
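The other half of the guarantee can be sketched the same way. In this illustrative Python example (the name send_then_drain is hypothetical), the sender fills the buffers while the receiver ignores the connection, and only afterward does the receiver start reading. Everything the OS accepted from the sender should still be there.

```python
import socket


def send_then_drain(chunk_size=4096, max_bytes=64 * 1024 * 1024):
    """Fill the buffers with nobody reading, then read late; return (sent, received)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()

    # Phase 1: the receiver is "asleep"; send until the OS refuses more.
    sender.setblocking(False)
    chunk = b"x" * chunk_size
    sent = 0
    while sent < max_bytes:  # safety cap
        try:
            sent += sender.send(chunk)
        except BlockingIOError:
            break

    # Signal end of data so the receiver's read loop can finish.
    sender.shutdown(socket.SHUT_WR)

    # Phase 2: the receiver finally wakes up and reads until end of stream.
    received = 0
    while True:
        data = receiver.recv(65536)
        if not data:
            break
        received += len(data)

    sender.close()
    receiver.close()
    listener.close()
    return sent, received


if __name__ == "__main__":
    sent, received = send_then_drain()
    print("sent:", sent, "received:", received)
```

The two counts match because every byte the sender's OS accepted was held somewhere until the receiver asked for it. That holding is exactly what takes up the buffer space.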


Last modified: Thu Nov 14 23:54:10 2019