How do they begin to conceive of the design of an Apple M1 chip?

Thread Starter

Jennifer Solomon

Joined Mar 20, 2017
112
This system does 1000x more operations than IBM’s Deep Blue, which beat Kasparov in chess in 1997. Deep Blue weighed a literal ton and was the size of a refrigerator, while the M1 is a single chip with 16 billion 5-nm transistors and dozens of layers, has 175 TB of information in its design, and does some 11 trillion operations per second. Deep Blue did 11 billion.

They obviously have a team for each section of the chip, something like 23 sections... but what’s the 10,000-foot view on sitting down and “bettering the previous chip”? The question is generic to CPU design. Are more transistors the primary driver of speed increases? Do they know that every gate in their team’s section has to be high or low at some clock cycle? Do they have more gates doing parallel processes?

Obviously computers are “designing” the new layouts, but the complexity is unfathomable even to keep track of. They’re generating UV light, bouncing it off mirrors, and sending it through a puddle of water to create the lithographic pattern! This stuff doesn’t even seem believable as science fiction at this point.
 

dl324

Joined Mar 30, 2015
16,846
You asked a lot of questions and there are a lot of considerations.

A reasonable start would be to define the architectural requirements. It's unlikely that Apple's chip is a completely new architecture, so they'll have some requirements that existing architectures don't meet, which would make it more advantageous for them to design rather than buy.

Every feature has a cost in die area. More area means more transistors which means higher manufacturing cost and more power dissipation. When I was peripherally involved with microprocessor architectural design, I learned that increased functionality was often weighed against increased power dissipation.
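As a rough, back-of-envelope illustration of that tradeoff (this is not from any actual design flow, and all numbers below are invented), dynamic CMOS power is commonly approximated as P ≈ α·C·V²·f, so a feature that adds switched capacitance shows up directly in the power budget:

```python
# Back-of-envelope sketch of the classic dynamic-power approximation
# P = alpha * C * V^2 * f. All values here are invented for illustration.
def dynamic_power(alpha, cap_farads, vdd_volts, freq_hz):
    """Dynamic switching power in watts."""
    return alpha * cap_farads * vdd_volts**2 * freq_hz

base = dynamic_power(alpha=0.1, cap_farads=1.0e-9, vdd_volts=1.0, freq_hz=3e9)
bigger = dynamic_power(alpha=0.1, cap_farads=1.2e-9, vdd_volts=1.0, freq_hz=3e9)
print(f"baseline: {base:.2f} W, with +20% switched capacitance: {bigger:.2f} W")
# baseline: 0.30 W, with +20% switched capacitance: 0.36 W
```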

Lithography limits how small feature sizes can be, but only critical layers will use EUV, because the throughput of those machines is low compared to what manufacturers were accustomed to with more traditional lithography. EUV was so late that manufacturers had to resort to using multiple masks to expose a single layer (multi-patterning). Many of the layers are probably still using that technique.

As an aside: manufacturing capacity doesn't increase overnight, and it isn't cheap. I've read that auto manufacturers are being limited by semiconductor availability, and that AMD isn't as competitive with Intel as they'd like because they can't get as many parts as they'd like from TSMC.

EDIT: corrected typo
 

Thread Starter

Jennifer Solomon

Joined Mar 20, 2017
112
Thanks for the response.

I’m trying to zoom in just a little more to get a concept of the actual starting points of design. There are 23 different sections, but we’re still dealing primarily with NAND and XOR gates in that small square space, correct? And is this thing doing trillions of instructions per second primarily due to the size and number of the transistors? The transistors and wires are only atoms wide at this point... aren’t we hitting a brick wall pretty soon?
 

kubeek

Joined Sep 20, 2005
5,794
The basic start would be deciding what kind of instruction set they want the processor to use, and whether they want to extend something existing or make their own design. That then defines what registers the core will have, what kind of parallel execution of instructions will be possible in that core, branch prediction, out-of-order execution, optimization on the fly... Then, with such single cores defined, you decide how the multiple cores share common resources: how they communicate with the shared second-level cache, main memory, the GPU, and other peripherals. Add power-dissipation optimization to start and stop all the little parts that are not currently in use, and throttling to keep the average dissipation below some threshold... There is a lot to be considered, and most of it is not really designed from scratch but rather modified and optimized from older architectures.
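To make the parallel/out-of-order execution idea a bit more concrete, here is a toy Python sketch (purely illustrative; real hardware is vastly more complicated). Each "instruction" names its destination register and its source registers, and anything whose sources are ready issues immediately, so independent instructions complete in the same cycle:

```python
# Toy model of the idea behind out-of-order issue: instructions wait
# only on their data dependencies, not on program order.
# Each entry is (destination_register, [source_registers]).
program = [
    ("r1", ["r0"]),          # depends only on r0
    ("r2", ["r0"]),          # independent of r1 -> can issue the same cycle
    ("r3", ["r1", "r2"]),    # must wait for both r1 and r2
]

ready = {"r0"}               # registers whose values are available
pending = list(program)
cycle = 0
while pending:
    cycle += 1
    issued = [ins for ins in pending if all(src in ready for src in ins[1])]
    for dest, _ in issued:
        ready.add(dest)      # results become available for later cycles
    pending = [ins for ins in pending if ins not in issued]
    print(f"cycle {cycle}: issued {[dest for dest, _ in issued]}")
# cycle 1: issued ['r1', 'r2']
# cycle 2: issued ['r3']
```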
 

dl324

Joined Mar 30, 2015
16,846
I’m trying to zoom in just a little more to get a concept of the actual starting points of design. There are 23 different sections, but we’re still dealing primarily with NAND and XOR gates in that small square space, correct?
There are many phases to design. You're talking about the actual layout of the transistors which comes long after the architecture has been developed.

Where I worked, there were several levels of layout hierarchy.

There were some standard cells (inverters, AND/NAND, OR/NOR, XOR/XNOR, flip flops, registers, etc). These cells were placed in logic blocks called FUBs (Functional Unit Blocks). FUBs were placed into blocks called Units and the chip was made up of some number of them. There were also special blocks like SRAM/cache. As I recall, the LLC (Last Level Cache, also called L3 cache) on some Intel processors was larger than the processor die.

In some areas, custom layout is used instead of standard cells to get higher transistor density. Standard cells are designed with routing areas over the cell, and power is connected by abutment to an adjacent cell. Standard cells are organized in rows, and some rows are flipped so that wells can be shared. P and N devices aren't intermixed, because one type will require a well of polarity opposite to the bulk. For example, if the bulk material is P type, then P devices would be drawn in wells of N type (nwells).

Custom cells wouldn't normally leave routing area for the lower-level metals. FUBs would be limited to the lowest 3-4 metal layers. Units would typically use the higher-level metals for interconnect.
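Just to picture that hierarchy (all names and transistor counts below are invented, and this is not any company's actual tool flow), you could model it as nested containers: standard cells inside FUBs, FUBs inside Units, Units making up the chip:

```python
# Hypothetical sketch of the layout hierarchy described above:
# standard cells -> FUBs -> Units -> chip. Names/numbers are invented.
from dataclasses import dataclass, field

@dataclass
class Cell:
    name: str              # e.g. inverter, NAND, flip-flop
    transistors: int

@dataclass
class FUB:                 # Functional Unit Block
    name: str
    cells: list = field(default_factory=list)
    def transistor_count(self) -> int:
        return sum(c.transistors for c in self.cells)

@dataclass
class Unit:
    name: str
    fubs: list = field(default_factory=list)
    def transistor_count(self) -> int:
        return sum(f.transistor_count() for f in self.fubs)

# A tiny made-up "adder" FUB inside an "execution" Unit:
adder = FUB("adder", [Cell("nand2", 4)] * 100 + [Cell("dff", 24)] * 16)
exe = Unit("execution", [adder])
print(exe.transistor_count())   # 784
```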
And is this thing doing trillions of instructions per second primarily due to the size and number of the transistors?
Instructions per second is a speed measure and has more to do with the architecture and transistor speed. No single microprocessor core is capable of a teraflop (yet).

Japan has the fastest supercomputer, rated at a peak of 537,212 TFLOP/s using 7.63 million cores. That works out to about 70 GFLOP/s per core.
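Checking that arithmetic against the quoted figures:

```python
# Per-core throughput from the peak numbers quoted above.
peak_flops = 537_212e12    # 537,212 TFLOP/s
cores = 7.63e6             # 7.63 million cores
print(f"{peak_flops / cores / 1e9:.0f} GFLOP/s per core")  # ~70 GFLOP/s
```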
The transistors and wires are only atoms wide at this point... aren’t we hitting a brick wall pretty soon?
Things aren't that small at the transistor and interconnect level. Gate oxide thickness is down to some small number of atoms, but features associated with the gate are among the smallest.

Process nodes smaller than 5-7nm will likely be using CNT (Carbon NanoTubes) for the transistors.
 

ZCochran98

Joined Jul 24, 2018
303
I've only heard TSMC, IBM, and Intel talking about carbon nanotubes.
GNRFETs (graphene nanoribbon FETs) are still largely experimental. On IEEE there are tons of papers on them dating back to 2008 (or earlier?). They're similar to GFETs (graphene FETs) but far superior: better subthreshold swing, a nonzero bandgap that's tunable by nanoribbon width, lower capacitances, and a few other things. CNFETs are older, which is why TSMC/IBM/Intel talk about them rather than GNRFETs (and GNRFETs still have a number of hurdles to overcome before they can be mass-produced).
It also doesn't help the GNRFET's case that there are quite a number of different physical designs for them (single gate, double gate, triple gate, GNR-on-silicon, GNR-on-oxide, GNR-on-hexagonal BN, and many others).
 

Thread Starter

Jennifer Solomon

Joined Mar 20, 2017
112
So FUBs are the “zoom level” they work at, design-wise.

Thanks again for the info. Some fascinating stuff!
 

Thread Starter

Jennifer Solomon

Joined Mar 20, 2017
112
Interesting stuff...thanks for the info.
 

dl324

Joined Mar 30, 2015
16,846
So FUBs are the “zoom level” they work at, design-wise.
That's just the way work was partitioned at the company I worked for. Layout designers who worked at the FUB level were specialized; they rarely worked below or above that level of integration. FUBs were a manageable amount of data for a designer to lay out and verify. That may also have been a comfortable level of integration for design engineers, because that's the lowest level where simulations could be run, and too much data would mean longer turnaround times.

Turnaround time is the amount of time it takes the engineer to run a simulation, analyze the results, make modifications, and be ready to run simulations again.

Turnaround time for layout designers was the time it took to draw the layout, check design rules and connectivity, make corrections, and be ready to run again. Some layout designers preferred to work on design rule corrections first and connectivity afterward. I think it's better to work on both at the same time, because fixing connectivity errors usually creates new design rule errors, and vice versa.

There were literally hundreds of design rules that layout designers had to follow. Some layout editors were capable of comprehending design rules and assisting layout designers in creating data that was correct by construction; others weren't, and the layout designer had to know the rules. None of our rule-aware layout programs were 100% accurate, because they didn't have the capability of doing the more complex checks robustly.


Some companies do only automatic cell place and route; some do a combination of standard cells and custom layout; and at one time, most or all layout was fully custom. In terms of time to complete layouts, those methods are ranked from fastest to slowest.

In terms of layout density, automatic place and route gave the lowest while full custom gave the highest. Higher density would usually lower manufacturing cost (die area), so there was a constant battle between time to market and manufacturing costs.
 
This system does 1000x more operations than IBM’s Deep Blue, which beat Kasparov in chess in 1997. Deep Blue weighed a literal ton and was the size of a refrigerator, while the M1 is a single chip with 16 billion 5-nm
And power-wise, Deep Blue must have required several kilowatts to run and a few more to cool it down, whereas the M1 can run from a lithium battery pack.
 

kubeek

Joined Sep 20, 2005
5,794
If you want more, you can search for university courses on instruction set architecture and computer architecture. A quick search for "instruction set architecture course" turned up these two, which go into detail on the low-level hardware and then cover the higher-level details of more efficient processing. I'm sure you can find more online if you want to dig into specific parts and problems.
http://www.cs.umd.edu/~meesh/411/CA-online/chapter/computer-architectureintroduction/index.html
https://ocw.mit.edu/courses/electri...nce/6-004-computation-structures-spring-2017/
 

Thread Starter

Jennifer Solomon

Joined Mar 20, 2017
112
Thanks again for the quality replies.

One very puzzling element for me is how lithography renders functional switches from what amounts to a “stenciled” etching. How are moving parts assembled in the etchings at such nano-scales?
 

ZCochran98

Joined Jul 24, 2018
303
Unless you're referring to NEMS or MEMS (nano- and micro-electromechanical systems), which I doubt in the context of this thread (the M1 chip, or processors in general), there are no moving parts. The switching comes entirely from manipulating the conductivity of a transistor's channel via electric fields in the silicon.
 

dl324

Joined Mar 30, 2015
16,846
One very puzzling element for me is how lithography renders functional switches from what amounts to a “stenciled” etching. How are moving parts assembled in the etchings at such nano-scales?
There are no moving parts in microprocessors (when they're designed correctly).

They can make mechanical parts using similar steps. Look up MEMS (MicroElectroMechanical Systems).
 

ZCochran98

Joined Jul 24, 2018
303
It depends on what you're looking for. For general operation, try "MOSFET theory" or "how MOSFETs work." For how they work in digital circuits, try "MOSFET logic gates" or "transistor logic gates." For how they're made, try "MOSFET lithography" or "semiconductor process lithography." Those should get you the results you're looking for.
Lots of stuff out there, but figuring out how to find it...that's the hard part indeed.
 

dl324

Joined Mar 30, 2015
16,846
Is there a specific word/process for this to google on (conductivity-based switching)?
This is how typical MOSFETs are drawn:
[Attached image: standard MOSFET schematic symbols]
From http://staff.utar.edu.my/limsk/Basic Electronics/Chapter 5 MOSFET Theory and Applications.pdf

Enhancement-mode devices are the most common, but the topology for depletion-mode devices is the same.

For an N channel MOSFET, a channel is formed between the source and drain when a positive voltage is applied to the gate. The more positive that voltage, the deeper the channel will be.
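A hedged, textbook-level sketch of that behavior follows, using the long-channel square-law model; the vth and k values below are invented for illustration, and real nanoscale devices need far more elaborate models:

```python
# First-order (square-law) model of an N-channel MOSFET's drain current.
# vth (threshold voltage) and k (transconductance parameter) are
# illustrative values, not from any real process.
def nmos_id(vgs, vds, vth=0.7, k=2e-3):
    """Drain current in amps for gate-source vgs and drain-source vds."""
    if vgs <= vth:
        return 0.0                          # off (ignoring leakage)
    vov = vgs - vth                         # overdrive voltage
    if vds < vov:                           # triode (linear) region
        return k * (vov * vds - vds**2 / 2)
    return 0.5 * k * vov**2                 # saturation region

for vgs in (0.5, 1.0, 1.5, 2.0):
    print(f"Vgs={vgs} V -> Id={nmos_id(vgs, vds=1.8)*1e3:.2f} mA")
# Current rises with gate voltage: the "deeper channel" described above.
```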

In digital circuits, the MOSFET is either on or off, though leakage currents got quite bad before the advent of finFETs. At around the 65 nm node, leakage current (the current conducted when the transistor is off) accounted for a significant portion of total power dissipation.

Since the wafers are either N or P type, one of the devices has to be drawn in a well of the correct polarity. If the substrate is P type, the wells would be N type.
 