Wandering Thoughts archives

2009-09-28

Two ends of hardware acceleration

One of the ways that you can categorize hardware acceleration is to say that there's a continuum between two sorts of it: doing something instead of having the CPU do it, and doing something that the CPU can't even come close to doing fast enough.

If you have a choice, you obviously want to be on the latter end of the scale. Ideally you'll have a fairly solid proof that the best software implementation on a general CPU can't possibly be fast enough because the problem demands hardware characteristics that general CPUs simply don't have (very low-latency access to a lot of memory, for example, as you'd need with extensive lookup tables). This gives you a reasonable amount of confidence that Moore's Law as applied to general CPU performance is not about to eat your lunch in a few years.
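
To illustrate the kind of workload I mean, here is a minimal C sketch of a latency-bound table lookup loop, where each lookup depends on the previous result so the CPU spends its time waiting on memory rather than computing; the table size, constants, and step count are made up for the example.

    /* A minimal sketch of a latency-bound lookup-table workload: each lookup
       depends on the previous result, so the CPU mostly sits waiting on
       memory. The table size, constants, and step count are arbitrary. */
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE (64 * 1024 * 1024)   /* much larger than any CPU cache */
    #define STEPS      (10 * 1000 * 1000)

    int main(void) {
        unsigned *table = malloc(TABLE_SIZE * sizeof *table);
        if (!table)
            return 1;

        /* Fill the table so that successive lookups jump all over memory
           (an odd multiplier makes this a permutation of the indices). */
        for (size_t i = 0; i < TABLE_SIZE; i++)
            table[i] = (unsigned)((i * 2654435761u + 12345u) % TABLE_SIZE);

        /* The dependent chain: the next index comes out of the current
           lookup, so loads cannot be overlapped or usefully prefetched. */
        unsigned idx = 0;
        for (long i = 0; i < STEPS; i++)
            idx = table[idx];

        printf("final index: %u\n", idx);
        free(table);
        return 0;
    }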

Life at the other end of the scale is much more difficult, because you run into the hardware RAID problem, namely that you need to find people for whom the problem is important and who are also CPU constrained. (It is a tragic mistake to merely find people with your problem; to put it one way, there are a lot more people with slow disks than people who will pay much money to speed them up.)

On a side note, sometimes doing it instead of the CPU can be a sales pitch in its own right, but you have to be in a special circumstance. The best example of this is hardware cryptographic modules for signing things, where the attraction is that the CPU (and its buggy, vulnerable software) gets nowhere near your signing keys.

HardwareAccelerationRange written at 22:49:52

What I think about why graphics cards keep being successful

Graphics cards are the single most pervasive and successful sort of hardware accelerator in the computer world; they are a shining exception to how hardware acceleration has generally been bad. Given my views, I'm interested in figuring out why graphics cards are such an exception.

Here's my current thinking on why graphics cards work, in point form (and in no particular order):

  • avid users (ie, gamers) are CPU constrained during operation as well as graphics constrained.
  • avid users will pay significant amounts of money for graphics cards, and will do so on a regular basis.

  • there is essentially no maximum useful performance limit; so far, people and programs can always use more graphics power.

  • GPUs have found various ways of going significantly faster than the CPU, ways that the CPU currently cannot match, including:
    • significant parts of the problem they're addressing are naturally (and often embarrassingly) parallel; this makes it relatively simple to speed things up by just throwing more circuitry at the problem.
    • they have almost always used high speed memory interfaces (or highly parallel ones), getting around the memory speed performance limit.

  • while GPUs have problems with the costs of having the CPU actually talk to them, they have found a number of ways to amortize that overhead and work around it.

    (For example, these days you rarely do individual graphics operations one by one; instead you batch them up and do them in bulk. There's a toy cost model of this sketched just after this list.)

  • GPU vendors are successful enough to spend a lot of money on hardware design.
  • GPU vendors iterate products rapidly, often faster than CPU vendors.
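
To make the amortization point concrete, here is a toy cost model in C; all of the numbers are invented for illustration and are not measurements of any real GPU, driver, or bus. Each submission to the card pays a fixed per-call overhead plus a small per-item cost, so issuing operations one at a time is dominated by overhead while a single big batch pays it only once.

    /* A toy cost model of per-call overhead versus batching. The costs are
       invented for illustration; they are not measurements of any real
       GPU, driver, or bus. */
    #include <stdio.h>

    #define PER_CALL_OVERHEAD_US 10.0   /* fixed cost to talk to the card at all */
    #define PER_ITEM_COST_US      0.05  /* incremental cost per operation in a call */

    /* Cost of submitting 'items' operations spread over 'calls' submissions. */
    static double submit_cost_us(long items, long calls) {
        return calls * PER_CALL_OVERHEAD_US + items * PER_ITEM_COST_US;
    }

    int main(void) {
        long items = 100000;

        double one_by_one = submit_cost_us(items, items); /* one call per item */
        double batched    = submit_cost_us(items, 1);     /* one call for everything */

        printf("one by one: %.0f us\n", one_by_one);
        printf("batched:    %.0f us\n", batched);
        printf("ratio:      %.1fx\n", one_by_one / batched);
        return 0;
    }

This is the same logic that pushes real graphics APIs toward things like vertex buffers and command buffers, where a single call submits a large pile of work.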

I think that many of these reasons can be inverted to explain why hardware acceleration is a hard problem, but that's another entry.

WhyGraphicsCardsWork written at 00:42:14

2009-09-19

Why I am not a fan of hardware acceleration

I am generally not a fan of hardware accelerators of various sorts (the stereotypical example is hardware RAID cards). One of the reasons why is that historically (with one important exception), hardware accelerators generally just haven't been very good.

By not very good I mean that, well (generally):

  • they haven't accelerated things very much.
  • they haven't sped up actual important bottlenecks, except in rare circumstances.
  • they almost invariably cost a significant amount of money.
  • even when they manage to get past all of this, they rarely stay fast, compared to the state of the art in machines and software.
  • they often stop accelerating anything useful after a while.

(Consider the current usefulness of a hypothetical world's fastest MD5 checksum offload engine, or just of a TCP offload engine that doesn't handle packets with the current set of options that everyone is using because they weren't common when it was designed.)

This litany of bad outcomes hasn't happened because everyone who's tried to make hardware accelerators has been incompetent; far from it. Instead there are a number of sound reasons for all of these outcomes, and they make this a hard problem. We can see how hard it is by seeing how little success the field has had, despite a great deal of effort poured into it over the years.

(By the way, the important exception that I'm thinking of is graphics cards. I happen to think that they have a lot of characteristics that make them the exception that proves the rule.)

BadHardwareAcceleration written at 01:47:57

2009-09-13

How modern CPUs are like (modern) disks

Once upon a time, hard disk transfer rates were an issue of serious concern. It mattered a great deal how fast your disks and their IO channels could run, and changing technologies could have significant performance effects; IDE versus SCSI and so on really made a difference.

For a lot of people, those days are long over. Disk interconnects are essentially irrelevant (for this) and streaming read and write bandwidth has become, if not irrelevant, then generally not important. What matters, what limits performance, is seek time. Your disk could transfer data at a rate of a gigabyte a second and your practical performance might not go up at all, because you can still only do 100 to 150 reads a second.

(Hence the growing popularity of SSDs; they may or may not improve your read and write data rates, but they drive seek time basically to zero.)
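
As a back-of-the-envelope illustration (the 4 KB read size, 120 seeks a second, and 1 GB/s interconnect are just assumed round numbers), here is the arithmetic in C showing why seek time, not interface bandwidth, is what caps random-read performance:

    /* Back-of-the-envelope arithmetic: effective throughput of small random
       reads on a seek-limited disk. All the figures are assumed round
       numbers, not the specs of any particular drive. */
    #include <stdio.h>

    int main(void) {
        double seeks_per_sec = 120.0;    /* roughly 100 to 150 for a spinning disk */
        double read_size_kb  = 4.0;      /* a typical small random read */
        double interface_mb  = 1024.0;   /* a hypothetical 1 GB/s interconnect */

        double effective_mb = seeks_per_sec * read_size_kb / 1024.0;

        printf("seek-limited random reads: %.2f MB/s\n", effective_mb);
        printf("interface bandwidth:       %.0f MB/s (almost entirely unused)\n",
               interface_mb);
        return 0;
    }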

Modern CPUs are just like this. In many situations their performance limit is not how fast they can execute instructions, it is memory bandwidth; the ability to run code very fast doesn't mean very much if you can't get data to and from that code. Among other odd effects, this has made an increasing amount of computation effectively free if you are already accessing memory or copying it around.

(Years ago this started happening to TCP and UDP checksumming, where it took no extra time to compute the checksum at the time when you copied the data between buffers in memory.)
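
As a sketch of what that looks like, here is a simplified C version of the copy-and-checksum idea (not any particular kernel's actual routine); the copy loop accumulates the ones'-complement sum as it moves the data, so the extra additions ride along with memory traffic that was happening anyway.

    /* A simplified sketch of copying a buffer while accumulating an Internet
       (ones'-complement) checksum, in the spirit of copy-and-checksum
       routines; not any particular kernel's actual implementation. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Copy 'len' bytes from src to dst and return the 16-bit checksum of the
       data; len is assumed to be even to keep the sketch short. */
    static uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len) {
        uint32_t sum = 0;

        for (size_t i = 0; i < len; i += 2) {
            /* The loads and stores are memory traffic we were paying for
               anyway; the extra add is nearly free next to them. */
            uint16_t word = (uint16_t)((src[i] << 8) | src[i + 1]);
            dst[i]     = src[i];
            dst[i + 1] = src[i + 1];
            sum += word;
        }

        /* Fold the carries back in (ones'-complement addition). */
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void) {
        uint8_t src[8] = { 0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46, 0x40, 0x00 };
        uint8_t dst[8];

        uint16_t csum = copy_and_checksum(dst, src, sizeof src);
        printf("checksum: 0x%04x, copy ok: %d\n",
               (unsigned)csum, memcmp(dst, src, sizeof src) == 0);
        return 0;
    }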

One of the important consequences of this is what it does to would-be hardware accelerators for various tasks. If what you are attempting to accelerate involves reading or copying data, well, you are competing with this effect; you need either a job that the CPU can't do very fast but that you can for some reason, or a way of having much higher memory bandwidth than the CPU does. Or both.

Even if you have a job that's currently CPU-bound instead of memory-bound, the speedup that your accelerator can get simply by doing the work faster than the CPU is limited by how close the CPU is to hitting memory bandwidth. The other way to put it is that the most time your accelerator can cut is however much time the CPU is leaving the memory system idle. If the memory system is already 70% busy (30% idle), you can never cut the job's time by more than 30%; once you've shaved off that 30%, your accelerator is running the memory system at full bandwidth, and that's it.

(This is just Amdahl's law applied, of course.)
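
Here is that bound as a minimal worked example in C, using the 70% busy figure from above: if the memory system is busy for a fraction b of the original run and your accelerator can only shrink the rest, that memory time is the floor, so the most you can cut the total time by is 1 - b (a maximum speedup factor of 1/b).

    /* A worked version of the memory-bandwidth speedup bound. If the memory
       system is busy for fraction 'busy' of the original run and the
       accelerator can only shrink the remainder, that memory time is the
       floor on total time. */
    #include <stdio.h>

    int main(void) {
        double busy = 0.70;                /* memory system busy fraction (the example above) */
        double idle = 1.0 - busy;

        double max_time_cut = idle;        /* at most the idle time can disappear */
        double max_speedup  = 1.0 / busy;  /* original time divided by the floor */

        printf("maximum time reduction: %.0f%%\n", max_time_cut * 100.0);
        printf("maximum speedup factor: %.2fx\n", max_speedup);
        return 0;
    }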

HowCPUsAreLikeDisks written at 01:19:29

