The basics of 4K sector hard drives (aka 'Advanced Format' drives)

April 15, 2013

Modern hard drives have two sector sizes: the physical sector size and the logical one. The physical sector size is the unit the hard drive actually reads and writes in; the logical sector size is the unit you can ask it to read or write (and, I believe, the unit that logical block addresses count in). The physical sector size is always equal to or larger than the logical one. Writing to only part of a physical sector requires the drive to do a read-modify-write cycle.

In the beginning, basically all drives had a 512 byte sector size (for both physical and logical, which weren't really split back then). Today it's difficult or impossible to find a current SATA drive that is not an 'Advanced Format' drive with 4096 byte physical sectors. To date I believe that all 4k drives have a 512 byte logical sector size (call this 4k/512), but in the future that may change so that we see 4k/4k drives.

(At this point I have no idea if vendors want to move to a 4k logical sector size. If they don't move, life gets simpler for a lot of people, us included.)

The main issue for 4k/512 drives is partial writes. If you're waiting for the write to complete, a partial write apparently costs you one extra rotational latency. If you're not waiting, e.g. if you're just writing to the drive's write cache (at a volume where it doesn't fill up), you're probably still going to lose overall IOPS.
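
To put a rough number on 'one rotational latency', here is a minimal sketch that assumes the penalty is one full platter revolution. The function name and the RPM figures are illustrative assumptions, not measurements of any particular drive.

```python
def rotation_ms(rpm):
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


# Common spindle speeds, used here purely as examples.
for rpm in (5400, 7200, 15000):
    print(f"{rpm} RPM: about {rotation_ms(rpm):.2f} ms per revolution")
```

Under that assumption, a partial write you wait for on a 7200 RPM drive would cost roughly an extra 8 ms, which is a lot next to an ordinary write.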

(The other problem with partial writes is that if things go wrong they can corrupt the data in the rest of the physical sector, data which you didn't think you were writing.)

There are two ways to get partial writes. The first is that your OS simply writes things smaller than the physical block size (perhaps it uses the logical block size for something, or it just assumes that sectors are 512 bytes and that it can write single ones). The other is unaligned large writes, where you may be issuing writes that are multiples of the physical block size but the starting position is not lined up with the start of a physical block. Since most filesystems today normally write in 4k blocks or larger, unaligned writes are the most common problem. The extra bonus for unaligned writes is that they give you two partial writes, one at the start and a second at the end, both of which cost you time, IOPS, or both.

(Aligned large writes that are not multiples of the physical block size will also cause partial writes at the end, but I think that this is relatively uncommon today.)
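
The different kinds of partial writes are easy to see with a little arithmetic. What follows is a minimal sketch, not anything a drive or OS actually runs; `partial_sectors` is a hypothetical helper and the 4096 byte physical sector size is an assumption.

```python
PHYSICAL = 4096  # assumed physical sector size in bytes


def partial_sectors(offset, length, physical=PHYSICAL):
    """Return the physical sector numbers that a write only partly covers.

    offset and length are in bytes. Each sector returned would need a
    read-modify-write cycle on the drive.
    """
    if length <= 0:
        return []
    end = offset + length            # one past the last byte written
    first = offset // physical       # physical sector holding the first byte
    last = (end - 1) // physical     # physical sector holding the last byte
    partial = []
    # The first sector is partial if the write starts mid-sector, or if
    # the write ends before that sector does.
    if offset % physical != 0 or end < (first + 1) * physical:
        partial.append(first)
    # A distinct last sector is partial if the write stops before its end.
    if last != first and end % physical != 0:
        partial.append(last)
    return partial


print(partial_sectors(0, 512))            # a single 512 byte write
print(partial_sectors(63 * 512, 8192))    # unaligned 8k write
print(partial_sectors(0, 6144))           # aligned, not a multiple of 4k
print(partial_sectors(2048 * 512, 8192))  # aligned 8k write
```

The unaligned 8k write touches three physical sectors and only partly covers the first and the last, which is the 'two partial writes' case.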

Getting writes to be aligned requires that everything in the chain, from basic partitioning (BIOS (MBR) or GPT, take your pick) up through internal OS partitioning and on-disk filesystem data structures, be on 4k (or larger) boundaries. This is often not the case for existing legacy partitioning. Frequently the original (and existing) partitioning tools rounded things up (or down) to 'cylinder' boundaries based on nominal disk geometries that were entirely imaginary, which made the resulting boundaries essentially arbitrary.

(There was a day when disk geometries were real and meaningful, but that was more than a decade ago for most machines.)
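
Checking whether a partition start is aligned is simple arithmetic on its starting sector. This is a sketch under assumptions: `is_aligned` is a hypothetical helper, the sector sizes are the 4k/512 case, and the start sectors are illustrative (63 is the traditional start for old 'cylinder'-style partitioning with 63 sectors per track, 2048 is the 1 MiB start that current tools commonly use).

```python
LOGICAL = 512    # assumed logical sector size in bytes
PHYSICAL = 4096  # assumed physical sector size in bytes


def is_aligned(start_lba, logical=LOGICAL, physical=PHYSICAL):
    """True if a partition starting at start_lba (counted in logical
    sectors) begins exactly on a physical sector boundary."""
    return (start_lba * logical) % physical == 0


for start in (63, 64, 2048):
    state = "aligned" if is_aligned(start) else "NOT aligned"
    print(f"start sector {start}: {state}")
```

The same check has to hold at every layer. An aligned partition doesn't help if something inside it then starts its own data at an odd offset.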

Modern disk drives advertise both their physical and logical block sizes (in disk inquiry data). Unfortunately this information may or may not properly propagate up through a complex storage stack (which may involve hardware or software RAID, SAN controllers, logical volume management, virtualization, and so on). The good news is that most modern software aligns things on 4k or larger boundaries regardless of what block size the underlying storage claims to have, so you have at least some chance of having everything work out. The bad news is that you're probably not using all-modern software.
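
As one way to see what a drive (or whatever sits on top of it) is advertising, here is a sketch that assumes a Linux system, where the kernel exposes both sizes under `/sys/block/<device>/queue/`. The `block_sizes` helper and its `sysfs_root` parameter are my own hypothetical names; other operating systems report this information differently.

```python
from pathlib import Path


def block_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) block sizes in bytes for a block device.

    Assumes Linux sysfs: reads logical_block_size and
    physical_block_size from <sysfs_root>/<device>/queue/.
    """
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical


if __name__ == "__main__":
    # Report every block device the kernel knows about, if any.
    root = Path("/sys/block")
    if root.is_dir():
        for dev in sorted(p.name for p in root.iterdir()):
            try:
                logical, physical = block_sizes(dev)
            except (OSError, ValueError):
                continue  # no usable queue information for this device
            print(f"{dev}: logical {logical}, physical {physical}")
```

Keep in mind that this only shows what the layer you're asking reports. A RAID controller or a virtual disk may well claim 512 byte physical blocks on top of 4k drives.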

(This is the kind of thing that I write to get everything fixed in my head, since we're now seriously looking into how badly 4k sector drives are going to impact our fileserver environment.)

Note that some vendors make drives with the same model number that can have different physical block sizes. I have a pair of Seagate 500 GB SATA drives (with the same model number, ST500DM002), bought at the same time from the same vendor, one of which turns out to have 4k sectors and one of which has 512 byte sectors as I expected. Fortunately the difference is basically harmless for what I'm using them for.

(Seagate documents this possibility in a footnote on their technical PDF for the drive series, if you read the small print.)

Written on 15 April 2013.
Last modified: Mon Apr 15 23:45:35 2013