SIMD - are they a waste of transistors?

Thread Starter

JohnEod

Joined Jun 23, 2015
3
Hello everyone.

I am a software engineer mostly familiar with the x86 architecture. I have always wondered whether SIMD instructions are a waste of transistors that would be better spent on more CPU cores. Even though they can provide a 100% performance boost in a few scenarios, this is very circumstantial and almost useless for most applications.
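To be clear about what I mean by those few scenarios, here is a minimal sketch (function names, sizes and alignment assumptions are purely illustrative) of the kind of loop where SIMD shines: the SSE version does four additions per iteration instead of one.

```c
#include <immintrin.h>  /* SSE intrinsics */

/* Scalar version: one addition per iteration. */
void add_scalar(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SSE version: four additions per iteration. For brevity this assumes n is
   a multiple of 4 and the pointers are 16-byte aligned. */
void add_sse(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));
    }
}
```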


I imagined some reasons behind this choice:
a) The software industry hasn't yet completed the changes required to easily cope with large-scale parallelism (new programming languages that force you to make memory sharing explicit and encourage mostly verifiable models instead, software transactional memory for some cases, and other things).

b) The few scenarios that benefit from SIMD instructions are important enough for consumers (video games and video decoding).

c) Maybe SIMD instructions are really cheap and easy to add? How do they compare with the cost of a reasonably powerful core?

d) Amdahl's law. OK, although problems that fit SIMD instructions are typically parallelizable problems, there can still be situations where it's not worth starting a new thread, and there SIMD helps with Amdahl's bottleneck.
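To put a rough number on d) (the figures below are purely illustrative): Amdahl's law gives the overall speedup as 1 / ((1 - p) + p/s), where p is the fraction of the work that is sped up and s is the factor it is sped up by.

```
speedup = 1 / ((1 - p) + p/s)

p = 0.8, s = 4      (4-wide SIMD on the vectorizable 80%):
                    1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5x

p = 0.8, s -> inf   (unlimited cores on that same 80%):
                    1 / 0.2 = 5x  (the hard ceiling)
```

So even a modest SIMD width recovers half of the theoretical ceiling, without any threading overhead.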



And by the way, I do have two other, somewhat related, questions:
1) Are there manufacturers working on chips supporting software transactional memory that is not limited by cache associativity and never falls back to software (unlike Intel's TSX)? (See the sketch after question 2 for how TSX is used today.)

2) Is there any work on "explicit hardware parallelism" (maybe a better name exists)? For example, a chip would let me declare that two function calls are to be run concurrently on the same core (one processing data while the other waits for data), without having to start an OS thread on another core. Both hardware threads would have to operate on independent data (A cannot write something that B needs) and an error interrupt would be raised if they don't.
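For context on question 1, this is roughly how Intel's TSX (RTM) is used today; a minimal sketch, assuming GCC or Clang with -mrtm, with the fallback path deliberately left out:

```c
#include <immintrin.h>   /* RTM intrinsics: _xbegin(), _xend() */

long counter;

void increment(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        counter++;            /* runs as a hardware transaction */
        _xend();              /* commit */
    } else {
        /* The transaction can abort for many reasons (conflicts, cache
           capacity/associativity overflow, interrupts...), so real code
           always needs a software fallback here, e.g. take a plain lock
           and retry (omitted for brevity). */
    }
}
```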
 

nsaspook

Joined Aug 27, 2009
13,315
C: The chip real estate that SIMD instructions use is small, and they are simply structured, so the added cost of the instructions is almost nothing. But today GPU chips handle most of the core video functions and physics calculations.

1. Don't know.
2. Well, they can't run concurrently on the same core; they can only be scheduled or run in sequence.
https://www.kernel.org/pub/linux/ke...grammingTutorial/BasicsOfSIMDProgramming.html
 

dl324

Joined Mar 30, 2015
16,943
I guess the answer depends on your definition of waste.

At some point, all features in a microprocessor are evaluated on merit. Every feature takes time to design/test, consumes area, consumes power, etc. I know for a fact that companies put new features in microprocessors and don't advertise them until they're convinced they work and add value.

When SIMD was first implemented, no one was thinking about multiple cores.

You should probably post your question in one of the other forums; maybe a computer architect will respond...
 

WBahn

Joined Mar 31, 2012
30,087
There are more applications that benefit from SIMD than you probably realize. Many math and scientific applications use it extensively. Plus, compilers are getting better and better at leveraging SIMD instructions. The effective use of higher-level parallelism is still in its relative infancy -- we just haven't really figured out how to use it well enough. But that will come.
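As an illustration of compilers leveraging SIMD (a minimal sketch; the exact flags and diagnostics differ between compilers), current GCC and Clang will usually auto-vectorize a plain loop like this at -O3, with no intrinsics in the source at all:

```c
/* Compile with e.g.  gcc -O3 -march=native -fopt-info-vec saxpy.c
   or                 clang -O3 -Rpass=loop-vectorize saxpy.c
   and the compiler will typically report that the loop was vectorized
   with SSE/AVX. The restrict qualifier tells it x and y don't overlap. */
void saxpy(float a, const float *x, float *restrict y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```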
 

WBahn

Joined Mar 31, 2012
30,087
SIMD is not a waste, but it's also just not as useful for general number crunching.
http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
That's been pretty much true for most instructions added to a processor. I forget the numbers, but when someone profiled the instruction set on one of Intel's chips years ago (one of the Pentium family, I think) they found that most instructions were almost never used. RISC proponents have been making similar assertions for decades.

But there's more to the story than that. Many of the instructions that got little use could have yielded significant performance gains in lots of software; the compilers just never generated code that used them. This was believed to be due to several factors. Compiler writers are first and foremost concerned with writing compilers that produce correct code. If your compiler is already producing correct code for the current generation of CPU and the next generation is a superset of it, then there is a hesitancy to change the compiler except where it really needs to be changed. In addition, if the compiler supports a broad spectrum of processors, then the focus is likely to be on maximizing the use of instruction subsets that are common to a large number of those processors, which means that taking advantage of later superscalar bells and whistles is going to be low on the priority list.

But it's also the case that many of those old CISC instructions were devised with particular performance-sensitive computing applications in mind and, even if they are used extensively for that, they are still only used by a tiny fraction of code.
 

nsaspook

Joined Aug 27, 2009
13,315
A lot of those old instructions are also there for boot compatibility with BIOS systems and DOS; Virtual 8086 mode is an example. The 'real' instruction set of modern Intel chips is a RISC machine, and you can still see that if you tweak the cache just right to peek under the x86 front-end macro-instruction decoder. The microcoded ability to modify CISC instructions is increasingly being used in modern processors, so it's a possible attack vector during (encrypted) microcode updates.
 

Thread Starter

JohnEod

Joined Jun 23, 2015
3
C: The chip real estate that SIMD instructions use is small, and they are simply structured, so the added cost of the instructions is almost nothing. But today GPU chips handle most of the core video functions and physics calculations.
Thank you for your answer. Would you have some idea of the relative cost of the different parts of modern cores, please?

2. Well, they can't run concurrently on the same core; they can only be scheduled or run in sequence.
https://www.kernel.org/pub/linux/ke...grammingTutorial/BasicsOfSIMDProgramming.html
To be more precise, I was thinking about the GPU model: they have something like one processing unit for ten threads. They execute an instruction for one thread, then while the data for its next instruction is being fetched, they run the same instruction for the other threads. Their model is all about mitigating latency, something that is even more of a problem on GPUs than on CPUs. From the hardware perspective this is sequential of course, but from a software perspective it looks like concurrency.

Intel's hyper-threading and its pipeline seem to achieve something similar, from my understanding: run independent instructions to mitigate latency.

Now all of that is nice, but sometimes the pipeline is full of dependent instructions that cannot be "parallelized". On the other hand, I have always found it ridiculous to have to use complex software primitives to dispatch a small independent work unit to another core while the current core's cache already holds all the data needed and its pipeline only contains dependent instructions.

But now that I think about it, I guess that simply writing those two pieces of code will be enough to have the pipeline run them "in parallel".
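Something like this minimal sketch (names and numbers are illustrative) is what I have in mind: two independent dependency chains written back to back, which an out-of-order core can interleave on its own:

```c
/* Two independent accumulator chains. An out-of-order core can issue an
   addition from each chain every cycle, so this typically runs close to
   twice as fast as summing the same array with a single accumulator,
   where every add has to wait for the previous one to finish. */
double sum_two_chains(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0;   /* independent accumulators */
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];              /* chain A */
        s1 += a[i + 1];          /* chain B */
    }
    return s0 + s1;
}
```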


I guess the answer depends on your definition of waste.
I could rephrase my question such as:

"If I had 100 cores and I were to remove SIMD instructions, how many cores would I have?"


You should probably post your question in one of the other forums; maybe a computer architect will respond...
Thank you, I will consider this.
 

Thread Starter

JohnEod

Joined Jun 23, 2015
3
But now that I think about it, I guess that simply writing those two pieces of code sequentially will be enough to have the pipeline run them "in parallel".

(...)

"If I had 100 cores and I were to remove SIMD instructions, how many cores could I have instead?"
Fixed, sorry for the ambiguity.
 