In a CPU or GPU, would 1 bad transistor cause failure?

Thread Starter

Obanion

Joined Nov 26, 2009
24
Hey guys,

I was just thinking about this and trying to figure out how these semiconductor giants make products with billions of transistors work (with no defects?). It seems almost unimaginable that you can make a CPU/GPU without a bad transistor somewhere.

Is there some kind of redundancy to mitigate this type of defect? Can CPUs or GPUs have absolutely zero defects? And if there are defects, I'm guessing this is what will cause crashing/blue screen?

I think the concept is pretty interesting, but I haven't been able to find out much about it. Anyone here know?
 

tom66

Joined May 9, 2009
2,595
Yes, silicon dice do have yield issues.

A failure is usually brought on by problems during manufacturing. Once in operation, a processor will keep working for millions of hours without a single problem. Temperature extremes will reduce its lifespan, however.

A single transistor failure in a CPU is likely to cause a failure to start up or operate correctly if it occurs in the logic section (the "brains"), but a failure in the cache or in the rest of the computer's memory might cause only intermittent problems.

A GPU is a bit more tricky. As there is rarely any decision logic, you can get weird "bugs" with GPUs when they overheat. Often this is caused by the solder joints connecting the GPU to the PCB cracking or becoming dislodged; more rarely, it is due to damage to the actual die. This doesn't always result in complete failure; instead, you get things like polygons extending to infinity, colours appearing incorrectly, and glitches on the screen. You can blame the Xbox 360 failures, and partially the PS3 YLODs, on this.
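
To put rough numbers on the original question, the classic Poisson yield model says the chance of a die having zero defects falls off exponentially with die area times defect density, which is why big dies need redundancy or get sold with parts disabled. A minimal Python sketch; the defect density here is an invented figure purely for illustration:

import math

def poisson_yield(die_area_cm2, defects_per_cm2):
    """Expected fraction of dice with zero defects (simple Poisson model)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

# The defect density below is an invented, illustrative figure.
for area in (0.5, 1.0, 2.0, 4.0):                 # die area in cm^2
    y = poisson_yield(area, defects_per_cm2=0.3)
    print(f"{area:4.1f} cm^2 die -> ~{y * 100:.0f}% of dice defect-free")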
 

marshallf3

Joined Jul 26, 2010
2,358
Or, you can do what Intel does. If one of the transistors in the on-die cache is defective, you disable that half of the cache and call it a Celeron.
 

sceadwian

Joined Jun 1, 2009
499
Don't laugh Riffa, Sony does it too. The PS3 Cell chip is actually an 8-core processor: 6 cores are usable by the software, 1 is dedicated exclusively to Sony's hypervisor software, and the 8th core is considered unusable and fused out to increase yield should there be one bad core on a die. The fuses are set AFTER manufacturing and testing, though I have a feeling that if all eight cores test okay, some software might be able to take advantage of it.
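
As a back-of-the-envelope illustration of why fusing off a spare core raises yield, here is a quick binomial sketch in Python; the 92% per-core yield is just an assumed number, not anything Sony published:

from math import comb

def yield_at_least(cores, needed, p_core_good):
    """Probability that at least `needed` of `cores` cores test good,
    assuming independent cores with identical per-core yield."""
    return sum(comb(cores, k) * p_core_good**k * (1 - p_core_good)**(cores - k)
               for k in range(needed, cores + 1))

p = 0.92   # assumed per-core yield, purely for illustration
print(f"all 8 of 8 cores good : {yield_at_least(8, 8, p):.1%}")
print(f"at least 7 of 8 good  : {yield_at_least(8, 7, p):.1%}")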
 

Ghar

Joined Mar 8, 2010
655
One potential issue with manufacturing a semiconductor is mask alignment.
You need to produce many layers exactly lined up to create the transistors and interconnects.
If there's misalignment you can end up with unintended or missing connections. To avoid this there are design rules: you leave enough space around everything so that even with worst-case misalignment nothing fails catastrophically.

Performance may be a bit degraded but it will still work.
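
To picture how such a spacing rule might be checked, here is a toy Python sketch; the rectangle coordinates, the 40 nm rule, and the 15 nm worst-case misalignment are all invented numbers:

import math

def spacing_ok(rect_a, rect_b, rule_nm, misalign_nm):
    """Toy design-rule check: closest distance between two axis-aligned
    rectangles (x0, y0, x1, y1 in nm) must survive worst-case misalignment."""
    ax0, ay0, ax1, ay1 = rect_a
    bx0, by0, bx1, by1 = rect_b
    gap_x = max(bx0 - ax1, ax0 - bx1, 0)
    gap_y = max(by0 - ay1, ay0 - by1, 0)
    return math.hypot(gap_x, gap_y) - misalign_nm >= rule_nm

# Two shapes drawn 60 nm apart, a 40 nm spacing rule, 15 nm worst-case shift:
print(spacing_ok((0, 0, 100, 100), (160, 0, 260, 100),
                 rule_nm=40, misalign_nm=15))       # True (60 - 15 >= 40)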
 

marshallf3

Joined Jul 26, 2010
2,358
It's true of most any CPU of identical speed but different cache sizes. It was originally Intel's idea, and I think they started it with the P-II.

And yes, AMD has sold disabled quad-core Phenoms as triples, and even as duals labeled Athlon 64 7750 or 7850. I've got two of the 7750 BEs and, with the proper BIOS and/or motherboard, you can re-enable the other cores and usually find nothing wrong with them. I proved it could be done but just run them as duals. I've only got one machine that actually benefits from (and can truly take advantage of) a true quad, so it has a real one.

There were two schools of thought on the AMD disabled-core phenomenon:

1) It was actually the same cost, if not cheaper, to make a giant run of quads and then set the number of active cores and the price according to market demand at the time.

2) Some cores were actually a tiny bit sub-par. We've come into an age where if someone can't overclock their chip by some insane amount, they feel they've been cheated.

These 7750 BE chips I've got have been taken to 7 GHz on liquid nitrogen cooling. They were considered a great chip to overclock because, since you were only using about half of the die, heat was easier to deal with. I took one of mine to 3.4 GHz on the stock voltage and cooler but just put it back to stock; only one of the 5 PCs in my main setup ever sees any load, and that's only if I'm dealing with a special case of video encoding and/or decoding that I need to get done in a hurry.
 

Ghar

Joined Mar 8, 2010
655
Binning parts based on performance has been standard practice for a long time, seems pretty natural to do it with something like a CPU as well.
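
A trivial sketch of what speed binning amounts to; the bin thresholds and tested frequencies below are made up purely for illustration:

# Speed-binning sketch: sort tested parts into grades by measured Fmax.
BINS = [(3.2, "3.2 GHz flagship"), (2.8, "2.8 GHz mainstream"), (2.4, "2.4 GHz value")]

def bin_part(measured_fmax_ghz):
    for threshold, label in BINS:
        if measured_fmax_ghz >= threshold:
            return label
    return "reject / salvage"

for fmax in (3.4, 3.0, 2.5, 2.1):
    print(f"tested Fmax {fmax} GHz -> {bin_part(fmax)}")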
 

marshallf3

Joined Jul 26, 2010
2,358
Binning parts based on performance has been standard practice for a long time, seems pretty natural to do it with something like a CPU as well.
Sad but true, as I learned during one summer job while in college. Mercury Marine had a large factory in Stillwater, OK for years; they mainly manufactured key components of the stern drives and some of the engines, including the ones for the 1990 or 1991 Corvette that was current at the time.

I was in the "precision aluminum" department that dealt with 5 of the main stern drive components. Our function was to do a fair amount of the machine work and testing on these parts, and a tested part could meet one of three outcomes: if it spec'd out within X it went into the bin to become a Mercury part; if it didn't make X but could pass Y it became a Mariner part (their lower-end line); and if it couldn't pass either it got melted down again.

A friend of mine had been there for ages and was a senior QC inspector for the engine line. For some reason they were getting a lot of fallout due to some spec involving the cam bearings. After some time, upper management chose to ease up on that spec without GM's approval to improve productivity. My friend was a pro, didn't like the idea, and hinted that if they didn't properly address the problem he was going to somehow make GM aware of what was going on. He was given the choice of shutting up or voluntarily resigning, which he chose. I forget how long it took, but eventually those out-of-spec engines started to exhibit premature failure in that area. Obviously GM figured it out and pulled the contract.
 

BillB3857

Joined Feb 28, 2009
2,571
Many, many years ago, when ICs were first seeing widespread usage, we were seeing a very high failure rate of one particular IC in a Numerical Control system. The usage design was conservative, so the component manufacturer was contacted. After sending several of the failed chips to them, we got a response: during manufacture, a rinse process between the various etchings was not totally effective in neutralizing the etchant. Even though the units tested fine and worked for a period of time, the continuing weak etch slowly killed the chip. A campaign was initiated to inspect every board in every machine that used that particular chip from that particular lot number and replace the chip before failure.
 

t06afre

Joined May 11, 2009
5,934
In the Z80 created by Zilog there are nearly seven hundred well-documented instructions for the machine code programmer to play about with. Much more interesting, though, are the instructions that Zilog left in the Z80 but never bothered to document. Why? Because they couldn't be guaranteed to work in every chip. That's one example of how a failure does not need to take down the whole chip, but I think this is more typical of errors in the production process. If the failure is caused by, say, ESD, the damage is more severe, damaging the whole substrate rather than just a few transistors. Also think about one-time programmable non-volatile memory (OTP NVM): it is actually programmed by causing permanent, controlled damage (fusing) to the chip.
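
To illustrate the "permanent controlled damage" idea in software terms, here is a conceptual Python model of an OTP fuse bank; it is only a mental picture, not how any real fuse macro is driven:

class FuseArray:
    """Conceptual one-time-programmable fuse bank: bits start intact (0)
    and can only ever be blown to 1, never restored."""
    def __init__(self, size):
        self.bits = [0] * size

    def blow(self, index):
        self.bits[index] = 1          # permanent in real silicon

    def read(self):
        return list(self.bits)

fuses = FuseArray(8)
fuses.blow(7)                         # e.g. mark one block as disabled
print(fuses.read())                   # [0, 0, 0, 0, 0, 0, 0, 1]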
 

t06afre

Joined May 11, 2009
5,934
This is very simplified. IC production and PCB production have in common that both use a photolithography process. So picture that you have produced a film for PCB production, but that film has some flaws. In some cases the flaw will cause a broken track; in other cases there will still be a connection. As I have heard, this was the case for the first production batches of the Z80: the mask used in the photolithography process had some flaws. This kind of error may or may not be critical. In this case it was not critical, and Zilog shipped the chip, solving the problem by not documenting the instructions that were prone to fail.
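
One way to picture why a flaw may or may not be critical is the "critical defect size" idea: a spot narrower than the track usually can't open it, and a pinhole narrower than the gap usually can't short neighbouring tracks. The toy Python sketch below uses invented dimensions and is deliberately oversimplified:

# Toy "is this mask flaw critical?" check; all dimensions are invented.
TRACK_WIDTH_UM = 4.0      # width of a drawn track
TRACK_GAP_UM = 6.0        # spacing between neighbouring tracks

def flaw_is_critical(kind, size_um):
    if kind == "spot":        # opaque defect over a track -> possible open
        return size_um >= TRACK_WIDTH_UM
    if kind == "pinhole":     # clear defect between tracks -> possible short
        return size_um >= TRACK_GAP_UM
    raise ValueError(kind)

for kind, size in [("spot", 2.5), ("spot", 5.0), ("pinhole", 3.0), ("pinhole", 7.5)]:
    verdict = "critical" if flaw_is_critical(kind, size) else "probably benign"
    print(f"{kind:8s} {size:4.1f} um -> {verdict}")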
 

kubeek

Joined Sep 20, 2005
5,795
Where's the logic in that? How can an instruction work on some chips but not others, and what would those instructions be?
My guess on this is that, for example, some CPUs can be overclocked to a much higher frequency and still work. The faulty instructions on the Z80 could be those that start to fail first when you clock the chip higher. So they were there in the die, but even at the recommended clock timing they could fail sometimes.
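
Purely to illustrate that guess, here is a toy timing-slack calculation in Python; the path delays, the instruction names, and the single-cycle assumption are all invented:

# Toy timing check: a path must settle within the clock period, or the
# instruction using it becomes flaky on slower dice.  Delays are invented.
CLOCK_MHZ = 4.0
period_ns = 1000.0 / CLOCK_MHZ        # 250 ns per cycle at 4 MHz

paths_ns = {"documented op A": 180.0,
            "documented op B": 210.0,
            "undocumented op X": 255.0}

for instr, delay in paths_ns.items():
    slack = period_ns - delay
    status = "OK" if slack >= 0 else "marginal - may fail on slow dice"
    print(f"{instr:18s} slack {slack:+6.1f} ns  {status}")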
 

marshallf3

Joined Jul 26, 2010
2,358
My guess on this is that, for example, some CPUs can be overclocked to a much higher frequency and still work. The faulty instructions on the Z80 could be those that start to fail first when you clock the chip higher. So they were there in the die, but even at the recommended clock timing they could fail sometimes.
Overclocking wasn't even a word when the Z80 came out.

The intermittent instruction may have been something like a shortcut to adding the contents of two registers under certain conditions that could also be done in a different manner or by adding another line of code. If you never knew it was intended to be there you just didn't use that instruction when you were coding.
 

kubeek

Joined Sep 20, 2005
5,795
Overclocking wasn't even a word when the Z80 came out.

The intermittent instruction may have been something like a shortcut to adding the contents of two registers under certain conditions that could also be done in a different manner or by adding another line of code. If you never knew it was intended to be there you just didn't use that instruction when you were coding.
I meant it in the sense that they wanted to raise the working frequency of the chip to a certain level, so they sacrificed some non-critical instructions that were problematic.
 