Wandering Thoughts archives

2013-04-30

The two stories of RISC

I said this implicitly in my entry on ARM versus other RISCs but I've realized that I want to say it explicitly: there are (or were) effectively two stories of RISC development. You could see RISC as a way of designing simple chips, or you could see RISC as a way of designing fast chips.

In the story of simple chips, designing a RISC instruction set architecture with its simple and regular set of operations and registers and so on is a great way of designing a simple chip. You don't need complex instruction decoding, you don't need microcode, you don't need all sorts of irregularities in this and that, and so on. You throw out complex instructions that would take a significant amount of silicon in favour of simple ones (and anyway, the early RISC studies suggested that those instructions didn't get used much). The result would be good enough and, more importantly, you could actually design and build it.

In the story of fast chips, all of the ISA and implementation simplicity was there to let you design a small chip that you could make go fast (and then scale up easily to make it go even faster). CPU cycle time was a holy thing and every instruction and feature had to be usable by the compiler in practice to make things go faster, not just locally (in code where it could be used) but globally (looking at total execution time across all programs, however you measured that). Various RISC research had shown that you could throw out a lot of the CISC complexity and push a lot of things to the compiler without slowing code down in practice, so that's what you did. Designing one of these RISC chips was actually a pretty big amount of work, not so much for the raw chip design as for everything around it: the compilers, the simulation environment, and so on.

(These RISC chips were almost invariably built in conjunction with their compilers. The chip and the compiler were two parts of a single overall system.)

Almost all of the RISC chips that people like me have heard of (and lusted after) were designed under the fast chips story; this includes MIPS, the DEC Alpha, the Sun SPARC, IBM's Power (later PowerPC, aka PPC), Intel's Itanium, and HP's PA-RISC, among others. This is the sort of RISC that I learned about from John Mashey's comp.arch posts in the late 1980s and early 1990s and what I still reflexively think of as 'RISC' today. They were what went into Unix servers and workstations (and then later into Macs), and the loss of their elegant, nice architectures to the brute force and money of x86 made many Unix people sad.

(I'd say that I've gotten over my own sadness, but that's not quite what I did. I still don't particularly like the x86 architecture, I just ignore it because my machines are cheap and run fast.)

As I discovered when I researched my entry on ARM, ARM chips are the other story, the unsexy story, the story of simple chips. Unix people like me didn't (and often still don't) really pay them much attention because they were never really server or workstation chips; they didn't appear in machines that we really cared about. Of course the punchline is that they turned out to be the more important sort of RISC chips.

TwoRISCStories written at 23:00:14

2013-04-29

My view of ARM versus other RISCs

Way back in my second entry on x86 winning against RISC a commentator asked:

And now you need to write about how ARM is up-ending this orderly structure of the universe.

Writing about this is hampered by my lack of knowledge of the details of ARM history, but after some reading I have my theory: ARM has been successful where other RISCs weren't because from the first it was targeted differently.

(This is not a matter of architecture, as I initially thought before I started reading. The original ARM ISA was a bit odd but no more so than other RISCs.)

The simple version is that all other RISCs saw themselves as competing for the performance crown (against each other and then x86); they quite carefully and quite consciously engineered for performance and then tried to sell their CPUs on that basis. This was a sensible decision because performance was where the money was (and also because there clearly was a lot of room to improve CPU performance). It just happened that Intel was able to spend enough money scaling x86 up to crush everyone else with good enough performance for not too much money (and with the other advantages x86 gave them).

Acorn (where ARM started) doesn't seem to have seen itself as building a high-performance CPU. Instead it wanted to build a CPU that met its needs as far as features went, performed well enough, and was simple enough that a small company could design it (after all, early RISCs were designed by a class of graduate students). This gave ARM different design priorities and, just as importantly, meant that Acorn (and later ARM Ltd) didn't spend huge amounts of money on R&D efforts to crank performance up to compete with other CPUs (a race that they would have been doomed to lose). Free from chasing performance at all costs, ARM was both able and willing to adapt its design for people who had other needs.

(I have no particular insight about why ARM won out over similar higher end low-power CPU efforts of the early 1990s. However, two things seem worth noting there. First, all of the really big designs seem to have been RISC, which makes sense; if you're building a new embedded CPU, you want something simple. Second, both the Intel i960 and the AMD 29k were made by companies that were also chasing the CPU performance crown; among other things this probably drained their design teams of top talent.)

It's worth noting that part of this difference is a difference in business priorities. One reason that ARM is so widely used is that it is dirt cheap and one fundamental reason that it's dirt cheap is that ARM Ltd licenses designs instead of selling chips. This means that ARM Ltd has made a fraction of the profit that, say, Intel has made from their CPUs. Licensing widely and cheaply is an excellent way to spread your CPUs around but a terrible way to make a lot of money.

(Of course losing the CPU performance race was an even worse way of making money for all of the other RISC companies. But if the RISC revolution had actually worked out, one or more of those companies could have been an Intel or a mini-Intel.)

ARMvsRISC written at 01:24:50

2013-04-19

How I want storage systems to handle disk block sizes

What I mean by a storage system here is anything that exports what look like disks through some mechanism, whether that's iSCSI, AoE, FibreChannel, a directly attached smart controller of some sort, or something I haven't heard of. As I mentioned last entry, I have some developing opinions on how these things should handle the current minefield of logical and physical block sizes.

First off, modern storage systems have no excuse for not knowing about logical block size versus physical block size. The world is no longer a simple place where all disks can be assumed to have 512 byte physical sectors and you're done. So the basic behavior is to pass through the logical and physical block sizes of the underlying disk that you're exporting. If you're exporting something aggregated from multiple disks, you should obviously advertise the largest block size used by any part of the underlying storage.

(If the system has complex multi-layered storage it should try hard to propagate all of this information up through the layers.)
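
As a small illustration of that rule, here's a minimal sketch in Python (with made-up per-disk numbers) of working out what an aggregate of several disks should advertise:

    # Each underlying disk's (logical, physical) block sizes in bytes;
    # these particular numbers are made up for illustration.
    disks = [(512, 512), (512, 4096), (512, 4096)]

    def advertised_block_sizes(disks):
        # Advertise the largest logical and physical block sizes used
        # by any part of the underlying storage.
        logical = max(l for l, p in disks)
        physical = max(p for l, p in disks)
        return logical, physical

    print(advertised_block_sizes(disks))   # (512, 4096)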

You should also provide the ability to explicitly configure what logical and physical block sizes a particular piece of storage advertises. You should allow physical block sizes to be varied both up and down from their true value, and logical block sizes to be varied up (and down, if you can make that work). It may not be obvious why people need all of this, so let me mention some scenarios:

  • you may want to bump the physical block size of all your storage to 4kb regardless of the actual disks used so that your filesystems et al will be ready and optimal when you start replacing your current 512 byte disks with 4kb disks. (Possibly) wasting a bit of space now beats copying terabytes of data later.

  • similarly you may be replacing 512 byte disks with 4kb disks (because they're all that you can get) but your systems really don't deal well with this so you want to lie to them about it. There are other related scenarios that I'll leave to your imagination.

  • you may want to set a 4 kb logical sector size to see how your software copes with it in various ways. At some point setting it will also become a future-proofing step (just as setting a 4 kb physical block size is today).

It would be handy if storage systems had both global and per-whatever settings for these. Global settings are both easier and less error prone for certain things; with a global setting, for example, I can make sure that I never accidentally advertise a disk as having 512 byte physical sectors.

(Why this now matters very much is the subject for a future entry.)

SANAdvertisingBlocksizes written at 02:19:21

2013-04-18

How SCSI devices tell you their logical and physical block sizes

Since I spent today looking this up and working it all out, I might as well write all of this down.

Old SCSI had no distinction between logical and physical size; it just had the block size. Modern SCSI has redefined that old plain block size to be the logical block size and then added an odd way of encoding the physical block size. This information is reported through the SCSI operation READ CAPACITY (16), which unlike its stunted older brother READ CAPACITY (10) is not actually a SCSI command of its own; instead it's a sub-option of the general SERVICE ACTION IN command. This may assist you in finding it in code and/or documentation.

(SERVICE ACTION IN is SCSI opcode 0x9E and READ CAPACITY (16) is sub-action 0x10. Nice code will have some #defines or the like for these; other code, well, may not. See the discussion of finding SCSI opcodes and so on in this entry.)

The logical block size is returned as a big endian byte count in response bytes 8 through 11 (counting from 0; 0 through 7 are the device's size in logical blocks, again big endian). The size of physical blocks is reported by giving the 'logical blocks per physical block exponent' in the low order four bits of byte 13. If it is set to some non-zero value N, there are 2^N logical blocks per physical block; for 4k sector disks with 512 byte logical blocks the magic exponent is thus 3.
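
To make the layout concrete, here is a minimal sketch in Python of decoding a READ CAPACITY (16) response buffer; actually issuing the SERVICE ACTION IN command is left out, and the sample response bytes are made up:

    import struct

    # SERVICE ACTION IN is SCSI opcode 0x9E and READ CAPACITY (16) is
    # service action 0x10 (see above); this sketch only decodes the
    # response buffer that the command returns.
    def decode_read_capacity_16(resp):
        # Bytes 0 through 7, big endian: the device's size in logical
        # blocks (strictly, the SCSI field holds the address of the
        # last logical block, i.e. the block count minus one).
        last_lba = struct.unpack_from('>Q', resp, 0)[0]
        # Bytes 8 through 11, big endian: the logical block size in bytes.
        logical_size = struct.unpack_from('>I', resp, 8)[0]
        # Low order four bits of byte 13: logical blocks per physical
        # block, as a power of two exponent.
        exponent = resp[13] & 0x0F
        physical_size = logical_size << exponent
        return last_lba + 1, logical_size, physical_size

    # Made-up example response for a 4k/512 disk: 512 byte logical
    # blocks and an exponent of 3, ie 2**3 = 8 logical blocks per
    # physical block.
    resp = bytearray(32)
    struct.pack_into('>Q', resp, 0, 976773167)   # last logical block address
    struct.pack_into('>I', resp, 8, 512)         # logical block size in bytes
    resp[13] = 3
    print(decode_read_capacity_16(resp))         # (976773168, 512, 4096)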

There is no guarantee that code that uses READ CAPACITY (16) either sets or reads this exponent. My impression is that RC (16) and its use in code predate at least the need to think about the difference, and perhaps even the actual definition of the field (as opposed to it just being marked 'reserved').

Note that some code may talk about or #define 'READ CAPACITY' when it means READ CAPACITY (10). You should ignore this code because no one wants to use RC (10) any more. If there's code that is carefully handling a device capacity case of '0xffffffff', you're reading the wrong code. Yes, this can be confusing.

(One of the problems with READ CAPACITY (10) is that the (logical block) size of the device is limited to a 32-bit field. With 512 byte blocks this translates to a disk size of about 2 TB. It follows that if some old system can't deal with 2 TB SCSI disks, it's extremely likely that it also has no idea of physical block size versus logical block size.)

I'm developing opinions on how storage systems should handle all of this, but that's going to have to wait for another entry.

SCSIBlocksizesDiscovery written at 00:25:44

2013-04-15

The basics of 4K sector hard drives (aka 'Advanced Format' drives)

Modern hard drives have two sector sizes, the physical sector size and the logical one. The physical sector size is what the hard drive actually reads and writes in; the logical sector size is what you can ask it to read or write in (and, I believe, what logical block addresses are counted in). The physical sector size is always equal to or larger than the logical one. Writing to only part of a physical sector requires the drive to do a read-modify-write cycle.

In the beginning, basically all drives had a 512 byte sector size (for both physical and logical, which weren't really split back then). Today it's difficult or impossible to find a current SATA drive that is not an 'Advanced Format' drive with 4096 byte physical sectors. To date I believe that all 4k drives have a 512 byte logical sector size (call this 4k/512), but in the future that may change so that we see 4k/4k drives.

(At this point I have no idea if vendors want to move to a 4k logical sector size. If they don't move life gets simpler for a lot of people, us included.)

The main issue for 4k/512 drives is partial writes. If you're waiting for the write to complete, a partial write apparently costs you one rotational latency in extra time. If you're not waiting, eg if you're just writing to the drive's write cache (at a volume where it doesn't fill up), you're probably still going to lose overall IOPs.

(The other problem with partial writes is that if things go wrong they can corrupt the data in the rest of the physical sector, data which you didn't think you were writing.)

There are two ways to get partial writes. The first is that your OS simply writes things smaller than the physical block size (perhaps it uses the logical block size for something or just assumes that sectors are 512 bytes and that it can write single ones). The other is unaligned large writes, where you may be issuing writes that are multiples of the physical block size but the starting position is not lined up with the start of physical blocks. Since most filesystems today normally write in 4k blocks or larger, unaligned writes are the most common problem. The extra bonus for unaligned writes is that they give you two partial writes, one at the start and a second at the end, both of which cost you time, IOPs, or both.

(Aligned large writes that are not multiples of the physical block size will also cause partial writes at the end, but I think that this is relatively uncommon today.)
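
As a rough illustration of both cases (the offsets here are made up), a minimal check in Python for whether a write will cause partial physical sector writes might look like this:

    PHYSICAL = 4096   # physical sector size in bytes (a 4k/512 drive)

    def partial_writes(offset, length, physical=PHYSICAL):
        # A write touches a partial physical sector at its start if the
        # starting offset isn't on a physical sector boundary, and at
        # its end if it stops partway through a physical sector.
        start_partial = offset % physical != 0
        end_partial = (offset + length) % physical != 0
        return start_partial, end_partial

    # An aligned 4 KB write needs no read-modify-write:
    print(partial_writes(8192, 4096))    # (False, False)
    # The same 4 KB write starting at logical sector 1 (offset 512
    # bytes) gets partial writes at both the start and the end:
    print(partial_writes(512, 4096))     # (True, True)
    # A single 512 byte logical sector write is one partial write:
    print(partial_writes(0, 512))        # (False, True)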

Getting writes to be aligned requires that everything in the chain from basic partitioning (BIOS or GPT, take your pick) up through internal OS partitioning and on-disk filesystem data structures be on 4k (or larger) boundaries. This is often not the case for existing legacy partitioning; frequently the original partitioning tools rounded things up (or down) to 'cylinder' boundaries based on nominal disk geometries that were entirely imaginary, leaving partitions at essentially arbitrary offsets.

(There was a day when disk geometries were real and meaningful, but that was more than a decade ago for most machines.)

Modern disk drives advertise both their physical and logical block sizes (in disk inquiry data). Unfortunately this information may or may not properly propagate up through a complex storage stack (which may involve hardware or software RAID, SAN controllers, logical volume management, virtualization, and so on). The good news is that most modern software aligns things on 4k or larger boundaries regardless of what block size the underlying storage claims to have, so you have at least some chance of having everything work out. The bad news is that you're probably not using all-modern software.
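
For example, on Linux one place the advertised sizes surface is in sysfs; here's a minimal sketch (assuming a Linux system and a disk called 'sda') of reading what the kernel believes:

    def block_sizes(device):
        # The kernel exposes the advertised sizes, in bytes, under
        # /sys/block/<device>/queue/. What you see here is whatever
        # has propagated up through the storage stack, which may or
        # may not match the disks underneath.
        base = '/sys/block/%s/queue/' % device
        with open(base + 'logical_block_size') as f:
            logical = int(f.read())
        with open(base + 'physical_block_size') as f:
            physical = int(f.read())
        return logical, physical

    # A 4k/512 'Advanced Format' drive should show up as (512, 4096).
    print(block_sizes('sda'))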

(This is the kind of thing that I write to get everything fixed in my head, since we're now seriously looking into how badly 4k sector drives are going to impact our fileserver environment.)

Note that some vendors make drives with the same model number that can have different physical block sizes. I have a pair of Seagate 500 GB SATA drives (with the same model number, ST500DM002), bought at the same time from the same vendor, one of which turns out to have 4k sectors and one of which has 512 byte sectors as I expected. Fortunately the difference is basically harmless for what I'm using them for.

(Seagate documents this possibility in a footnote on their technical PDF for the drive series, if you read the small print.)

AdvancedFormatDrives written at 23:45:35

2013-04-12

My view on software RAID and the RAID write hole

The old issue of software RAID versus hardware RAID came up recently on Twitter, prompting Chris Cowley to write Stop the hate on Software RAID, which in turn prompted a small lobste.rs discussion where people pointed to the RAID 5 write hole as a reason to prefer hardware RAID over software RAID. I've written several entries about how I favour software RAID but I've never talked about the write hole.

(For now let's ignore some other issues with RAID 5 or pretend that we're talking about RAID 6 instead, which also has this write hole issue.)

I'll start by being honest even if it's painful: hardware RAID has an advantage here. Yes, you can (and should) put your software RAID system on a UPS (or two) and so on, but there are simply more parts that can fail abruptly when you're dealing with a full server than when you're dealing with an on-card battery. This doesn't mean either that hardware RAID is risk free (hardware RAID cards fail too) or that software RAID is particularly risky (abrupt crashes of this sort are extreme outliers in most environments), but it does mean that hardware RAID is less risky in this specific respect.

This is where we get into tradeoffs. Hardware RAID has both drawbacks and risks of its own (relative to software RAID). When building any real system you have to assess the relative importance and real world chances of these risks (and how successfully you feel that you can mitigate them), because real systems are almost always a balance between (potential) problems. My personal view is that, in general, abrupt system halts are vanishingly rare in properly designed systems. This makes the RAID write hole essentially a non-issue for software RAID.

(Of course there are all sorts of cautions here. For example, if you're operating enough systems the vanishingly rare can start happening more often than you want.)

Thus my overall feeling is (and remains) that most people and most systems are better off with software RAID than with hardware RAID. In practice I think you are much more likely to get bitten by various issues with hardware RAID than you are to blow things up by hitting the software RAID write hole with a system crash or power loss event.

(By the way, if you're seriously worried about the RAID write hole you'll want to carefully verify that your disks actually write data when they tell you that they have. This is probably much less of a risk if you buy expensive 'enterprise' SAS drives, of course.)

SoftwareRAIDAndRAIDWriteHole written at 00:24:19

