SIMD - are they a waste of transistors?

Thread Starter

JohnEod

Joined Jun 23, 2015
3
Hello everyone.

I am a software engineer mostly familiar with the x86 architecture. I have always wondered whether SIMD instructions are a waste of transistors that would be better spent on more CPU cores. Even though they can provide a 100% performance boost in a few scenarios, this is very circumstantial and almost useless for most applications.
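To be clear about what I mean by those few scenarios, here is a minimal sketch (function names, sizes and alignment assumptions are purely illustrative) of the kind of loop where SIMD shines: the SSE version does four additions per iteration instead of one.

```c
#include <immintrin.h>  /* SSE intrinsics */

/* Scalar version: one addition per iteration. */
void add_scalar(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* SSE version: four additions per iteration. For brevity this assumes n is
   a multiple of 4 and the pointers are 16-byte aligned. */
void add_sse(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));
    }
}
```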


I imagined some reasons behind this choice:
a) The software industry hasn't yet completed the changes required to easily cope with large-scale parallelism (new programming languages that force you to make memory sharing explicit and encourage mostly verifiable models instead, software transactional memory for some cases, and other things).

b) The few scenarios that benefit from SIMD instructions are important enough for consumers (video games and video decoding).

c) Maybe SIMD instructions are really cheap and easy to add? How do they compare with the cost of a reasonably powerful core?

d) Amdahl's law. OK, although problems that fit SIMD instructions are typically parallelizable problems, there can still be situations where it's not worth starting a new thread, and there SIMD helps with Amdahl's bottleneck.
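To put a rough number on d) (the figures below are purely illustrative): Amdahl's law gives the overall speedup as 1 / ((1 - p) + p/s), where p is the fraction of the work that is sped up and s is the factor it is sped up by.

```
speedup = 1 / ((1 - p) + p/s)

p = 0.8, s = 4      (4-wide SIMD on the vectorizable 80%):
                    1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5x

p = 0.8, s -> inf   (unlimited cores on that same 80%):
                    1 / 0.2 = 5x  (the hard ceiling)
```

So even a modest SIMD width recovers half of the theoretical ceiling, without any threading overhead.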



And by the way, I do have two other, somewhat related, questions:
1) Are there manufacturers working on chips supporting software transactional memory that is not limited by cache associativity and never falls back to software (unlike Intel's TSX)? (See the sketch after question 2 for how TSX is used today.)

2) Is there any work on "explicit hardware parallelism" (maybe a better name exists)? For example, a chip would let me declare that two function calls are to be run concurrently on the same core (one processing data while the other waits for data), without having to start an OS thread on another core. Both hardware threads would have to operate on independent data (A cannot write something that B needs) and an error interrupt would be raised if they don't.
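For context on question 1, this is roughly how Intel's TSX (RTM) is used today; a minimal sketch, assuming GCC or Clang with -mrtm, with the fallback path deliberately left out:

```c
#include <immintrin.h>   /* RTM intrinsics: _xbegin(), _xend() */

long counter;

void increment(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        counter++;            /* runs as a hardware transaction */
        _xend();              /* commit */
    } else {
        /* The transaction can abort for many reasons (conflicts, cache
           capacity/associativity overflow, interrupts...), so real code
           always needs a software fallback here, e.g. take a plain lock
           and retry (omitted for brevity). */
    }
}
```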
 

nsaspook

Joined Aug 27, 2009
13,315
C: The chip real estate that SIMD instructions use is small, and they are simply structured, so the added cost of the instructions is almost nothing. But today GPU chips handle most of the core video functions and physics calculations.

1. Don't know.
2. Well, they can't run concurrently on the same core; they can only be scheduled or run in sequence.
https://www.kernel.org/pub/linux/ke...grammingTutorial/BasicsOfSIMDProgramming.html
 

dl324

Joined Mar 30, 2015
16,943
I guess the answer depends on your definition of waste.

At some point, all features in a microprocessor are evaluated on merit. Every feature takes time to design/test, consumes area, consumes power, etc. I know for a fact that companies put new features in microprocessors and don't advertise them until they're convinced they work and add value.

When SIMD was first implemented, no one was thinking about multiple cores.

You should probably post your question in one of the other forums; maybe a computer architect will respond...
 

WBahn

Joined Mar 31, 2012
30,087
There are more applications that benefit from SIMD than you probably realize. Many math and scientific applications use it extensively. Plus, compilers are getting better and better at leveraging SIMD instructions. The effective use of higher-level parallelism is still in its relative infancy -- we just haven't really figured out how to use it well enough. But that will come.
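As an illustration of compilers leveraging SIMD (a minimal sketch; the exact flags and diagnostics differ between compilers), current GCC and Clang will usually auto-vectorize a plain loop like this at -O3, with no intrinsics in the source at all:

```c
/* Compile with e.g.  gcc -O3 -march=native -fopt-info-vec saxpy.c
   or                 clang -O3 -Rpass=loop-vectorize saxpy.c
   and the compiler will typically report that the loop was vectorized
   with SSE/AVX. The restrict qualifier tells it x and y don't overlap. */
void saxpy(float a, const float *x, float *restrict y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```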
 

WBahn

Joined Mar 31, 2012
30,087
SIMD is not a waste, but it's also just not as useful for general number crunching.
http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
That's been pretty much true for most instructions added to a processor. I forget the numbers, but when someone profiled the instruction set on one of Intel's chips years ago (one of the Pentium family, I think) they found that most instructions were almost never used. RISC proponents have been making similar assertions for decades.

But there's more to the story than that. Many of the instructions that got little use could have yielded significant performance gains in lots of software; the compilers just never generated code that used them. This was believed to be due to several factors. Compiler writers are first and foremost concerned with writing compilers that produce correct code. If your compiler is already producing correct code for the current generation of CPU and the next generation is a superset of it, then there is a hesitancy to change the compiler except where it really needs to be changed. In addition, if the compiler supports a broad spectrum of processors, then the focus is likely to be on maximizing the use of instruction subsets that are common to a large number of those processors, which means that taking advantage of later superscalar bells and whistles is going to be low on the priority list.

But it's also the case that many of those old CISC instructions were devised with particular performance-sensitive computing applications in mind and, even if they are used extensively for that, they are still only used by a tiny fraction of code.
 

nsaspook

Joined Aug 27, 2009
13,315
A lot of those old instructions are also there for boot compatibility with BIOS systems and DOS; Virtual 8086 mode is an example. The 'real' instruction set of modern Intel chips is a RISC machine, and you can still see that if you tweak the cache just right to peek under the x86 front-end macro-instruction decoder. The microcoded ability to modify CISC instructions is increasingly being used in modern processors, so it's a possible attack vector during (encrypted) microcode updates.
 

Thread Starter

JohnEod

Joined Jun 23, 2015
3
C: The chip real estate that SIMD instructions use is small, and they are simply structured, so the added cost of the instructions is almost nothing. But today GPU chips handle most of the core video functions and physics calculations.
Thank you for your answer. Would you have some idea of the relative cost of the different parts of modern cores, please?

2. Well, they can't run concurrently on the same core; they can only be scheduled or run in sequence.
https://www.kernel.org/pub/linux/ke...grammingTutorial/BasicsOfSIMDProgramming.html
To be more precise, I was thinking about the GPU model: they have something like one processing unit for ten threads. They execute an instruction for one thread, then while the data for its next instruction is being fetched, they run the same instruction for the other threads. Their model is all about mitigating latency, something that is even more of a problem on GPUs than on CPUs. From the hardware perspective this is sequential of course, but from a software perspective it looks like concurrency.

Intel's hyper-threading and its pipeline seem to achieve something similar, from my understanding: run independent instructions to mitigate latency.

Now all of that is nice, but sometimes the pipeline is full of dependent instructions that cannot be "parallelized". On the other hand, I have always found it ridiculous to have to use complex software primitives to dispatch a small independent work unit to another core while the current core's cache already holds all the data needed and its pipeline only contains dependent instructions.

But now that I think about it, I guess that simply writing those two pieces of code will be enough to have the pipeline run them "in parallel".
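Something like this minimal sketch (names and numbers are illustrative) is what I have in mind: two independent dependency chains written back to back, which an out-of-order core can interleave on its own:

```c
/* Two independent accumulator chains. An out-of-order core can issue an
   addition from each chain every cycle, so this typically runs close to
   twice as fast as summing the same array with a single accumulator,
   where every add has to wait for the previous one to finish. */
double sum_two_chains(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0;   /* independent accumulators */
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];              /* chain A */
        s1 += a[i + 1];          /* chain B */
    }
    return s0 + s1;
}
```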


I guess the answer depends on your definition of waste.
I could rephrase my question such as:

"If I had 100 cores and I were to remove SIMD instructions, how many cores would I have?"


You should probably post your question in one of the other forums; maybe a computer architect will respond...
Thank you, I will consider this.
 

Thread Starter

JohnEod

Joined Jun 23, 2015
3
But now that I think about it, I guess that simply writing those two pieces of code sequentially will be enough to have the pipeline run them "in parallel".

(...)

"If I had 100 cores and I were to remove SIMD instructions, how many cores could I have instead?"
Fixed, sorry for the ambiguity.
 