The downsides of processing files using too large a buffer size
I was recently reading this entry by Ben Boyter, part of which is his discussion of some attempts to optimize file IO. In these attempts he varied his buffer sizes, from reading the entire file more or less at once to reading in smaller buffers. As I thought about this, I had a belated realization about file buffer sizes when you're processing the result as a stream.
My normal inclination when picking read buffer sizes for file IO is to pick a large number. Classically this has two good effects: it reduces system call overhead (because you make fewer of them) and it gets the operating system to do IO to the underlying disks in larger chunks, which is often better. However, there is a hidden drawback to large buffer sizes, namely that reading data into a buffer is a synchronous action as far as your program is concerned; under normal circumstances, the operating system can't give your program back control until it's put the very last byte of the buffer into place (or hit the end of the file). If you ask for 16 KB, your program can start work once byte 16,384 has shown up; if you ask for 1 MB, you get to wait until byte 1,048,576 has shown up, which is generally going to take longer. The more you try to read at once, the longer you're going to stall.
On the surface this looks like it reduces the time to process the start of the file but not necessarily the time to process the end (because to get to the end of a 1 MB file, you still need to wait for byte 1,048,576 to show up). However, reading data is not necessarily a synchronous action all the way to the disk. If you're reading data sequentially, essentially every modern OS is going to start doing readahead. This readahead means that you're effectively doing asynchronous disk reads that at least partially overlap with your program's work; while your program is processing its current buffer of data, the OS is issuing readaheads and may be able to satisfy your program's next read by just copying things around in RAM, instead of waiting for the disk.
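To put rough numbers on this overlap, here is a toy model, not a measurement. It assumes constant read and processing throughput, perfect readahead (the OS always has the next buffer ready if the disk can keep up), and a file size that is a whole number of buffers. The function name and the throughput figures are made up for illustration.

```python
import math

def pipeline_times(file_size, buf_size, read_bps, process_bps):
    """Return (first_data_delay, overlapped_total, serial_total) in seconds.

    Toy model only: constant throughput, perfect readahead overlap, and the
    last buffer is treated as a full one.
    """
    chunks = math.ceil(file_size / buf_size)
    read_t = buf_size / read_bps       # time to fill one buffer from disk
    proc_t = buf_size / process_bps    # time for the program to process one buffer
    # Two-stage pipeline: wait for the first buffer, then every later step
    # costs as much as the slower stage, then process the final buffer.
    overlapped = read_t + (chunks - 1) * max(read_t, proc_t) + proc_t
    # No overlap at all: every buffer is read, then processed, in turn.
    serial = chunks * (read_t + proc_t)
    return read_t, overlapped, serial

MIB = 1024 * 1024
# Hypothetical 64 MiB file, with disk and program both running at 200 MiB/s.
for buf in (16 * 1024, 128 * 1024, MIB, 64 * MIB):
    first, overlapped, serial = pipeline_times(64 * MIB, buf, 200 * MIB, 200 * MIB)
    print(f"{buf:>9} bytes: first data after {first:.5f}s, "
          f"overlapped {overlapped:.3f}s, no overlap {serial:.3f}s")
```

In this model, a buffer the size of the whole file makes the overlapped and non-overlapped totals identical, which is the "read everything first" case. Real readahead is limited and real throughput is not constant, so treat the output as a sketch of the shape of the effect rather than a prediction.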
If you attempt to read the entire file before processing any of it, you don't get any of these benefits. If you read in quite large buffers, you probably only get moderate benefits; you're still waiting for relatively large read operations to finish before you can start processing data, and the OS may not be willing to do enough readahead to cover the next full buffer. For good results, you don't want your buffer sizes to be too large, although I don't know what a good size is these days.
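As a minimal sketch of the stream-processing pattern in question, the following reads a file in moderate-sized chunks and processes each chunk as it arrives, rather than reading the whole file first. Hashing stands in for "processing" here, and both the function name and the 128 KiB default are assumptions for illustration, not recommendations.

```python
import hashlib

def stream_digest(path, buf_size=128 * 1024):
    """Return the SHA-256 hex digest of a file, read buf_size bytes at a time."""
    h = hashlib.sha256()
    # buffering=0 opens a raw file object, so each read() below is roughly
    # one read system call asking for up to buf_size bytes.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(buf_size)
            if not chunk:          # empty bytes means end of file
                break
            h.update(chunk)        # process this chunk before asking for the next
    return h.hexdigest()
```

While `h.update()` is running on one chunk, the OS is free to read ahead for the next one. The read-everything alternative would be a single `f.read()` with no size, which hands nothing back until the whole file is in memory.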
(Because I like ZFS and the normal ZFS block size for many files is 128 KB, I think that 128 KB is a good starting point for a read buffer size. If you strongly care about this, you may want to benchmark in your specific environment, because it's going to depend on how much readahead your OS is willing to do for you.)
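If you do want to benchmark, a rough starting point might look like the sketch below. The function names and candidate sizes are hypothetical, and `work` is a placeholder for whatever your program really does with each chunk.

```python
import time

def time_read(path, buf_size, work=len):
    """Read path in buf_size chunks, call work(chunk) on each, return elapsed seconds."""
    start = time.perf_counter()
    # Unbuffered raw reads, so buf_size is what actually gets requested from the OS.
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(buf_size):
            work(chunk)
    return time.perf_counter() - start

def compare_buffer_sizes(path, sizes=(16 * 1024, 128 * 1024, 1024 * 1024)):
    """Return a dict mapping each buffer size to the seconds one pass took."""
    return {size: time_read(path, size) for size in sizes}
```

One large caveat: after the first pass the file is probably sitting in the OS's cache, so later passes measure copying from RAM rather than disk reads and readahead. For meaningful numbers you would need to use a file larger than memory, or drop or bypass the cache between runs in whatever way your OS provides.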
PS: This also depends on your processing of the file not taking too long. If you can only process the file at a rate far lower than the speed of your IO, IO time has a relatively low impact on things and so it may not matter much how you read the file.
(In retrospect this feels like a reasonably obvious thing, but it didn't occur to me until now and as mentioned I've tended to reflexively do read IO in quite large buffer sizes. I'm probably going to be changing that in the future, at least for programs that process what they read.)