There are two facets to dd usage

March 18, 2023

Recently I shared a modern Unix superstition on the Fediverse:

Is it superstition that I do 'dd if=... bs=<whatever> | cat >/dev/null' instead of just having 'of=/dev/null' because I'm cautious about some version of dd optimizing that command a little too much? Probably.

There are various things you could say about this, but thinking about it has made me realize that in practice, there are two facets to dd, what you could call two usage cases, and they're somewhat in conflict with each other.

The first facet is dd as a way to copy data around. If you view dd this way, it's fine if some combination of dd, your C library, and the kernel optimizes how this data copying is done. For example, if dd is copying a file to or from a network socket, in many cases it would be desirable to directly connect the file and the network socket inside the kernel so that the data doesn't have to pass through user space. If you're using dd to copy data, you generally don't care exactly how it happens; you just want the result.
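As a rough illustration of that kind of in-kernel shortcut (this is not how any dd does it), here is a minimal Python sketch. It assumes a Unix-like system; the names send_file_over_socket and demo are made up for this example, and the socket pair merely stands in for a real network connection. Python's socket.sendfile() uses os.sendfile() where the platform supports it and otherwise falls back to an ordinary read-and-send loop.

```python
import os
import socket
import tempfile
import threading


def send_file_over_socket(path, sock):
    """Send the whole file at `path` over `sock` and return the bytes sent.

    socket.sendfile() hands the copy to the kernel via os.sendfile() when
    it can, so the data need not pass through a user-space buffer. Where
    that is unavailable it quietly falls back to a plain send() loop.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo(size=65536):
    """Hypothetical demo: push a temporary file through a local socket pair."""
    payload = b"x" * size
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    a, b = socket.socketpair()
    received = bytearray()

    def reader():
        # Drain the receiving end so the sender never blocks on a full buffer.
        while True:
            chunk = b.recv(8192)
            if not chunk:
                break
            received.extend(chunk)

    t = threading.Thread(target=reader)
    t.start()
    try:
        sent = send_file_over_socket(path, a)
    finally:
        a.close()  # signals end-of-stream to the reader
        t.join()
        b.close()
        os.unlink(path)
    return sent, bytes(received) == payload
```

The caller only asks for "this file to that socket"; whether the bytes ever visit user space is left to the library and the kernel. That is fine for copying, and it is exactly the freedom that the second facet can't tolerate.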

(Dd traditionally has some odd behavior around block sizes, but many people using dd to copy data don't actually want this behavior or care about it.)

The second facet is dd as a way to cause specific IO to happen. If you view dd this way, it is absolutely not safe for the collective stack to optimize how the data is copied. You want dd to do exactly the IO that you asked for, and not change it. If you read from a file and write to /dev/null, you don't want dd to connect the file and /dev/null in the kernel and then have the kernel optimize this into doing no IO. Reading the file (or the disk) was the entire point.
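To make "do exactly the IO that I asked for" concrete, here is a minimal Python sketch of an explicit read-and-discard loop. It is an illustration, not a replacement for dd; the name read_blocks and the default block size are assumptions made for this example. It issues one read() system call per block and throws the data away in user space, so there is no output side for anything to optimize away.

```python
import os
import tempfile


def read_blocks(path, block_size=65536):
    """Read `path` one block at a time and discard the data.

    Returns (bytes_read, read_calls). os.open()/os.read() are used
    directly so that no user-level buffering changes the size of the
    reads that are issued.
    """
    total = 0
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            block = os.read(fd, block_size)
            calls += 1
            if not block:
                # A zero-length read means end of file.
                break
            total += len(block)
    finally:
        os.close(fd)
    return total, calls
```

Note that this only guarantees that the reads are requested from the kernel. The kernel may still satisfy them from the page cache rather than from the disk, so forcing real device IO takes additional measures that this sketch does not attempt.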

My impression is that historically, dd originated in the first usage case; it was created around the time of V5 Unix in order to "convert and copy a file", in the words of the V6 dd manual page. System administrators later pressed it into use for the second facet, because it allowed for relatively precise control and it seemed like a safe command that was unlikely to choke on odd sources of input or output or do anything unpredictable with the data it read and wrote.

You can criticize this, but Unix didn't and still doesn't have a standard tool that's explicitly about performing certain IOs. Maybe it should have one, since dd can be awkward to use for highly specific IO. Also, at the time that system administrators started assuming that dd would perform their IO as 'written', I don't think anyone expected the degree of cleverness that modern Unix utilities and kernels exhibit (for example, GNU coreutils cat and GNU grep have apparently optimized the case of their output being /dev/null for a long time).


Comments on this page:

I use sum FILENAME to, e.g., read FILENAME from slow storage into the page cache for later use.

It is quite fast, because it uses a simple, old checksum algorithm by default, and it needs to read the data to compute the checksum.

I do not throw the output away, to guard against implementations that are too clever.

A POSIX-compliant dd must be implemented as a blockwise read-write loop.

From 193.219.181.219 at 2023-03-21 10:52:13:

"I use sum FILENAME to, e.g., read FILENAME from slow storage into the page cache for later use."

My usual tool for that is pv (because of the progress bar), but I recently discovered vmtouch for that purpose. It's not necessarily better, I just find it cool – e.g. its ability to show what parts of a file are currently in the page cache, as well as drop the specified files from cache.

This confusion is a natural result of making no concrete decisions with regards to whether these programs should be used for results or effects.

"Also, at the time that system administrators started assuming that dd would perform their IO as 'written', I don't think anyone expected the degree of cleverness that modern Unix utilities and kernels exhibit"

Before GNU, did anyone expect them to work at all, without callously truncating lines deemed to be too long, alongside other atrocities?
