How modern CPUs are like (modern) disks
Once upon a time, hard disk transfer rates were an issue of serious concern. It mattered a great deal how fast your disks and their IO channels could run, and changing technologies could have significant performance effects; IDE versus SCSI and so on really made a difference.
For a lot of people, those days are long over. Disk interconnects are essentially irrelevant (for this purpose) and streaming read and write bandwidth has become, if not irrelevant, then generally unimportant. What matters, what limits performance, is seek time. Your disk could transfer data at a gigabyte a second and your practical performance might not go up at all, because you can still only do 100 to 150 random reads a second.
(Hence the growing popularity of SSDs; they may or may not improve your read and write data rates, but they drive seek time basically to zero.)
Modern CPUs are just like this. In many situations their performance limit is not how fast they can execute instructions, it is memory bandwidth; the ability to run code very fast doesn't mean very much if you can't get data to and from that code. Among other odd effects, this has made an increasing amount of computation effectively free if you are already reading the data or copying it around.
(Years ago this started happening to TCP and UDP checksumming, where it took no extra time to compute the checksum while you were copying the data between buffers in memory; the copy was already paying the memory cost.)
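To make "checksum during the copy" concrete, here is a minimal Python sketch of a copy loop that folds the Internet-style 16-bit ones' complement checksum (RFC 1071) into the same pass over the data. Python is used purely to show the shape of the fused loop; the real implementations do this in C, where touching each byte only once is what makes the checksum effectively free.

```python
def copy_with_checksum(src: bytes) -> tuple[bytearray, int]:
    """Copy src while accumulating the RFC 1071 checksum in the same pass."""
    dst = bytearray(len(src))
    total = 0
    # Process 16-bit words: copy the bytes and add them to the running
    # sum in the same trip through memory.
    for i in range(0, len(src) - 1, 2):
        dst[i] = src[i]
        dst[i + 1] = src[i + 1]
        total += (src[i] << 8) | src[i + 1]
    if len(src) % 2:  # odd trailing byte, treated as padded with zero
        dst[-1] = src[-1]
        total += src[-1] << 8
    while total >> 16:  # fold carries back in (ones' complement)
        total = (total & 0xFFFF) + (total >> 16)
    return dst, (~total) & 0xFFFF

# The worked example from RFC 1071: bytes 00 01 f2 03 f4 f5 f6 f7
# sum to 0xDDF2 after folding, giving a checksum of 0x220D.
data = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
copy, cksum = copy_with_checksum(data)
```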
One of the important consequences of this is what it does to would-be hardware accelerators for various tasks. If what you are attempting to accelerate involves reading or copying data, well, you are competing with this effect; you need either a job that the CPU can't do very fast but that you can for some reason, or a way of having much higher memory bandwidth than the CPU does. Or both.
Even if you have a job that's currently CPU-bound instead of memory-bound, the speedup your accelerator can get simply by doing it faster than the CPU is limited by how close the CPU already is to saturating memory bandwidth. The other way to put it is that an accelerator that has to move the same data can never finish before the memory traffic does. If the CPU version keeps the memory system 70% busy (30% idle), that memory traffic alone takes 70% of the original run time, so your best case is about a 1.4x speedup (1/0.7); at that point your accelerator is running the memory system at full bandwidth, and that's it.
(This is just Amdahl's law applied, of course.)
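The arithmetic behind that ceiling is short enough to write down: if the CPU version keeps the memory system busy for a fraction b of the run, an accelerator that must move the same data still needs at least that much memory time, so its overall speedup is capped at 1/b. A quick sketch:

```python
def max_accelerator_speedup(mem_busy_fraction: float) -> float:
    """Upper bound on speedup for an accelerator that performs the
    same memory traffic as the CPU version.

    If memory is busy for fraction b of the original run time, the
    accelerated run can never take less than b of that time, so the
    speedup is capped at 1/b (Amdahl's law with memory traffic as
    the serial part)."""
    if not 0.0 < mem_busy_fraction <= 1.0:
        raise ValueError("busy fraction must be in (0, 1]")
    return 1.0 / mem_busy_fraction

# Memory 70% busy: ceiling is 1/0.7, about 1.43x.
# Memory 100% busy: no accelerator can help at all (1.0x).
```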