Thoughts on comparison circuits?

Thread Starter

xox

Joined Sep 8, 2017
936
Correctness is the most important part -- it doesn't matter how fast, small, or cheap it is if it doesn't work.


It took me a while to figure out your approach and I thought I had done so, but after a closer inspection, I'm not sure. I would really help if you described the idea behind how your circuits work, a description of what the inputs and outputs mean, and truth tables of how they are intended to behave.


I'll try to walk though how my attempt to reverse engineer your circuit lead me to think and you can correct me where I've missed the mark.


My first impression was that your overall circuit is taking two 2-bit inputs, R and L, with the msb each being suffixed with H and the lsb suffixed with L. It is then producing two outputs, LG and RG, in which


LG = L>R?

RG = L<R?


With this supposition in mind, I then concluded (a conclusion that I am seriously questioning, however) that your half comparator answered the question


out = L>R (or L<R)


and that the idea was that you would swap L and R inputs to one of them to get the other, thus giving you both LT and GT outputs. But that doesn't seem to be what's happening since you aren't changing the L/R orientation, but instead taking the ones complement of one or the other. But perhaps this accomplishes the same thing.


My next point of concern is that your half-comparator circuit seems to be symmetric regarding L and R, meaning that they can be swapped without affecting the output. More troubling is that it appears that the bits in each position can be independently swapped without affecting the output. Either of these would render it incapable to properly distinguishing between L and R values properly. But perhaps my visual inspection is missing something.


To see one instance of this, consider just the LL and RL inputs to the half-comparator. The only place these two inputs go are to a 2-input NOR gate and it doesn't matter which one goes to which since NOR is symmetric. Assuming that LL is the lsb of the left number and RL is the lsb of the right, this means that the circuit will produce the same output for L=00, R=01 as it does for L=01, R=00.


I'll try to explore this a bit deeper later.
Yes, the function of the half-comparator is indeed rather unusual! (In retrospect it was a poor naming choice because that isn't what it does at all.) Basically it outputs HI if all of the pins are LO, if only one of the inputs is HI, or if both of the LSB inputs are HI. Otherwise it outputs a LO signal. And of course in operation two of its inputs will be inverted externally (which is probably why at a glance it appears that the bits in each position can be independently swapped without affecting the output). Within the 2-bit comparator the "upper stage" LX pins are inverted, likewise for the "lower stage" RX input pins.

Everything does seem to be working properly but I suspect that there could be some simplification done. I will work on it some more over the week. (Just let me know if you would like a link to simulation. I can wrap the whole circuit into a text file which can then be loaded by the Falstad simulator.)

Just out of curiosity, how few transistors do you think the 2-bit comparator could be implemented with? Right now the number needed to make an N-bit comparator with this design amounts to 60*(N-1) transistors. (Which means a 64-bit comparator would weigh in at 3780 MOSFETs!) That sure seems like a lot to me.
 
Last edited:

WBahn

Joined Mar 31, 2012
32,845
So let's redraw your circuits using traditional logic gate symbols. We don't need to draw the circuit at the transistor level provided we know the transistor-level implementation of each of the gates we are using.

I've marked up your half-comparator circuit, identifying the primitive gates and labeling each with a reference designator. I also assigned labels to the internal nodes.

1761380070501.png

This is to make it easier to verify that the following gate-level schematic matches it.

1761381102206.png


This is the Truth Table that I get for this circuit:

Code:
(L) LH LL (R) RH RL | A B C D F G | out
(0)  0  0 (0)  0  0 | 1 0 1 1 1 0 |  1
(0)  0  0 (1)  0  1 | 0 1 1 0 0 0 |  1
(0)  0  0 (2)  1  0 | 1 0 0 0 1 0 |  1
(0)  0  0 (3)  1  1 | 0 1 0 0 0 1 |  0
(1)  0  1 (0)  0  0 | 0 1 1 0 0 0 |  1
(1)  0  1 (1)  0  1 | 0 1 1 0 0 0 |  1
(1)  0  1 (2)  1  0 | 0 1 0 0 0 1 |  0
(1)  0  1 (3)  1  1 | 0 1 0 0 0 1 |  0
(2)  1  0 (0)  0  0 | 1 0 0 1 0 0 |  1
(2)  1  0 (1)  0  1 | 0 1 0 0 0 1 |  0
(2)  1  0 (2)  1  0 | 1 0 0 0 0 1 |  0
(2)  1  0 (3)  1  1 | 0 1 0 0 0 1 |  0
(3)  1  1 (0)  0  0 | 0 1 0 0 0 1 |  0
(3)  1  1 (1)  0  1 | 0 1 0 0 0 1 |  0
(3)  1  1 (2)  1  0 | 0 1 0 0 0 1 |  0
(3)  1  1 (3)  1  1 | 0 1 0 0 0 1 |  0
So the circuit outputs HI for the following input combinations of <L,R>:

HI: <0,0>, <0,1>, <0,2>, <1,0>, <1,1>, <2,0>

So, assuming that the overall circuit does, indeed, work, I don't understand how. But, perhaps it has to do with the ones comp aspect of the overall circuit, so let's examine the LG output, which I'm assuming means that L is strictly greater than R.

If that's correct, then we want it to be HI for the following <L,R> combinations:

HI: <1,0>, <2,0>, <3,0>, <2,1>, <3,1>, <3,2> // Desired LG output

For half-comparator that drives the LG output, the L value is negated, which means that the above output combinations for the half-adder map to the following inputs:

HI: <3,0>, <3,1>, <3,2>, <2,0>, <2,1>, <1,0> // Actual LG output

And, somewhat surprisingly to me, this is correct. So let's look at the RG output (which, at this point, I'm expecting will also be correct):

For this one, we want it to be HI for the following combinations:

HI: <0,1>, <0,2>, <0,3>, <1,2>, <1,3>, <2,3> // Desired RG output

HI: <0,3>, <0,2>, <0,1>, <1,3>, <1,2>, <2,3> // Actual RG output

And, it is, so I must agree that your solution is functionally correct (though the fact that you can swap values and parts of values still bothers me -- I'll have to examine that more closely).

The next step in the optimization process is to see if we can reduce the transistor count and/or the gate delay (focusing on whichever is more important -- fortunately, digital logic is one of the few areas of engineering where reducing one also reduces the other, though there are plenty of exceptions where we have to trade one for the other).

Good job. I would be interested in a description of how you came up with this solution. That might give me some good hints on figuring out why it actually works.
 
Last edited:

WBahn

Joined Mar 31, 2012
32,845
Ok thanks. So then how are you defining 'chain' here?
The same way I defined it in that same post: "A logic chain is a circuit topology that consists of a bunch of identical blocks in which one feeds the next, along with the next bit(s) (pun intended) of data to be processed. It is like links in a chain, hence the name."

Is that just referring to only one level of logic or something?
What did I say in that same post?

"Another example would be a ripple-carry adder implemented using nothing but half-adders. It can be done, but the resulting circuit is not a chain of half-adders. But if I use three half-adders to make a full-adder, then I can made the overall adders as a chain of full-adders. Of course, once we have our chain of full-adders, we are free to replace the internal implementation of a full-adder with anything we want, as long as we keep it functionally identical, without altering the fact that, at the level of the overall adder, it is implemented as a chain of full-adders."
 
Last edited:

WBahn

Joined Mar 31, 2012
32,845
Yes, the function of the half-comparator is indeed rather unusual! (In retrospect it was a poor naming choice because that isn't what it does at all.) Basically it outputs HI if all of the pins are LO, if only one of the inputs is HI, or if both of the LSB inputs are HI. Otherwise it outputs a LO signal. And of course in operation two of its inputs will be inverted externally (which is probably why at a glance it appears that the bits in each position can be independently swapped without affecting the output). Within the 2-bit comparator the "upper stage" LX pins are inverted, likewise for the "lower stage" RX input pins.
I'll have to ponder that in terms of the conceptual underpinnings of a comparator, but I suspect you have at least hit on the mechanics, if not the supporting proof.

Everything does seem to be working properly but I suspect that there could be some simplification done. I will work on it some more over the week. (Just let me know if you would like a link to simulation. I can wrap the whole circuit into a text file which can then be loaded by the Falstad simulator.)
Yep, it does work. This is one that I'm going to throw out as a challenge to my friends in the 6502 Group. I'll give them the gate-level schematic, explain what it is supposed to do, explain my concerns about the symmetry, and then see if they can determine why it works despite that. Hopefully, by the time we get together to discuss it at our weekly meeting, I will have figured out the answer myself!

Just out of curiosity, how few transistors do you think the 2-bit comparator could be implemented with? Right now the number needed to make an N-bit comparator with this design amounts to 60*(N-1) transistors. (Which means a 64-bit comparator would weigh in at 3780 MOSFETs!) That sure seems like a lot to me.
Yes, I expect that some significant optimization can be done. Your half-comparator has five gate delays, plus another one for the NOT gates at the top level. To go through a 64-bit number (assuming they are, indeed, chainable), that would yield a total delay of 384 gate delays, which is substantial. High speed digital designs frequently try to keep the critical path in a given processing stage to something like 5 to 10 gate delays.

I'm not sure how you are coming up with your transistor count. An N-bit comparator is going to need more than just a bunch of these 2-bit comparators. How are you going to combine them to produce a final output? On the surface, these are not chainable, unless you are feeding <LG,RG> in as the <LH,RH> inputs of the next stage???? If that's what you have in mind (and if it works), then I would agree with your figures. If you can describe how you would chain them, I'll verify whether or not it should work.

Don't be too shocked by high transistor counts -- modern CPUs have tens or even hundreds of billions of transistors. Now you are getting an appreciation for why.

But we can play with this and see if we can either improve this one, or explore some other approaches, to get those counts down.
 

MrAl

Joined Jun 17, 2014
13,704
The same way I defined it in that same post: "A logic chain is a circuit topology that consists of a bunch of identical blocks in which one feeds the next, along with the next bit(s) (pun intended) of data to be processed. It is like links in a chain, hence the name."



What did I say in that same post?

"Another example would be a ripple-carry adder implemented using nothing but half-adders. It can be done, but the resulting circuit is not a chain of half-adders. But if I use three half-adders to make a full-adder, then I can made the overall adders as a chain of full-adders. Of course, once we have our chain of full-adders, we are free to replace the internal implementation of a full-adder with anything we want, as long as we keep it functionally identical, without altering the fact that, at the level of the overall adder, it is implemented as a chain of full-adders."
Hi again,

Thanks, but you didn't answer my question. You are forcing me to infer something important from a description of something else which may or may not be what you really mean.
I am asking for an explicit answer "yes" or "no" because that will nail down what you are saying to one specific thing. That way I can clearly understand your meaning.
I am asking because your definition of chain sounds different than what you are describing by example.

Your definition of a logic chain you said was:
"A logic chain is a circuit topology that consists of a bunch of identical blocks in which one feeds the next, along with the next bit(s) (pun intended) of data to be processed. It is like links in a chain, hence the name."
A "bunch of identical blocks" is not the same as a lot of AND gates forming a chain which was one of your examples. The AND gate is one level, then they are chained, but a bunch of identical blocks can be almost anything as long as there is dependency on previous blocks and I assume combined with more and more bits.

And at the same time you said we cannot use a chain on the individual bits for GT or LT, and my confusion here is that the solution I posted for GT could easily be interpreted as a chain. So in that manner I had chained a bunch of identical blocks in order to get a solution. The only difference is that the blocks themselves might be considered to be chained.
Maybe you do not consider that a chain though:
[A>B]
G3
+ E3' * G2
+ E3' * E2' * G1
+ E3' * E2' * E1' * G0
This looks like a chain because each new line depends on previous equalities combined with the next lower bit.
Perhaps you would rather call this a recursive chain. The blocks are not identical but the structure of the blocks is.
 

WBahn

Joined Mar 31, 2012
32,845
Hi again,

Thanks, but you didn't answer my question. You are forcing me to infer something important from a description of something else which may or may not be what you really mean.
I am asking for an explicit answer "yes" or "no" because that will nail down what you are saying to one specific thing. That way I can clearly understand your meaning.
I am asking because your definition of chain sounds different than what you are describing by example.

Your definition of a logic chain you said was:

A "bunch of identical blocks" is not the same as a lot of AND gates forming a chain which was one of your examples. The AND gate is one level, then they are chained, but a bunch of identical blocks can be almost anything as long as there is dependency on previous blocks and I assume combined with more and more bits.
Uh... why can't a single AND gate constitute a "block"?

Strictly speaking, each block in a chain would only take outputs from the prior block in the chain as its inputs, but in practice this is too restrictive to be really useful. Also, in broader usage, each block does not have to be identical -- think of a processing chain in which each block takes a signal, performs some operation on it (say a band-limited filter), and passes the output to the next block in the chain which doesn't something different to it (say an amplifier). But the context of this discussion is the notion of chaining a bunch of identical things (e.g., a 2-bit comparator comparator) together to make bigger thing (e.g., a 64-bit comparator).

Within that context...

Generic chain:

1761427648317.png
Put anything you want in each block, as long as it is the same. For example, if we put an AND gate in each block:

1761432212994.png
Frequently, the first and/or last block is a chain is different, in case it's inclusion as part of the chain is a matter of interpretation, but the significance between whether it is or isn't rarely matters.

In the case above, we could certainly implement an 8-input AND gate by chaining eight blocks, each containing an AND gate, as shown. The 'in' signal in that case would be tied LO. But we can also recognize that the first block can be simplified by removing the AND gate and tying the output directly to the d0 input.
1761432467479.png
In this case, the first block can no go away completely and we are left with a seven-block chain.

In the case of a ripple-carry adder that does not support carry-in, we can either leave the first block as a full-adder and simply tie the carry-in input LO, or we can modify that block to use a half-adder instead. Whether or not the half-adder is considered to be part of the chain is up for interpretation and depends on the context of the specific discussion involved.

And at the same time you said we cannot use a chain on the individual bits for GT or LT, and my confusion here is that the solution I posted for GT could easily be interpreted as a chain. So in that manner I had chained a bunch of identical blocks in order to get a solution. The only difference is that the blocks themselves might be considered to be chained.
Maybe you do not consider that a chain though:
[A>B]
G3
+ E3' * G2
+ E3' * E2' * G1
+ E3' * E2' * E1' * G0
This looks like a chain because each new line depends on previous equalities combined with the next lower bit.
Perhaps you would rather call this a recursive chain. The blocks are not identical but the structure of the blocks is.
Can you draw this as a chain in which the contents of each block are the same? If so, please do it. I'm not going to spend a bunch of time taking a verbal description and figuring out if it can or can't be -- that's YOUR responsibility.
 

Attachments

Thread Starter

xox

Joined Sep 8, 2017
936
I'll have to ponder that in terms of the conceptual underpinnings of a comparator, but I suspect you have at least hit on the mechanics, if not the supporting proof.

Yep, it does work. This is one that I'm going to throw out as a challenge to my friends in the 6502 Group. I'll give them the gate-level schematic, explain what it is supposed to do, explain my concerns about the symmetry, and then see if they can determine why it works despite that. Hopefully, by the time we get together to discuss it at our weekly meeting, I will have figured out the answer myself!
Neat! I would absolutely love to hear what they have to say about it. I seem to have a penchant for finding "glitches in the matrix" and this appears to be yet another. =p

Yes, I expect that some significant optimization can be done. Your half-comparator has five gate delays, plus another one for the NOT gates at the top level. To go through a 64-bit number (assuming they are, indeed, chainable), that would yield a total delay of 384 gate delays, which is substantial. High speed digital designs frequently try to keep the critical path in a given processing stage to something like 5 to 10 gate delays.

I'm not sure how you are coming up with your transistor count. An N-bit comparator is going to need more than just a bunch of these 2-bit comparators. How are you going to combine them to produce a final output? On the surface, these are not chainable, unless you are feeding <LG,RG> in as the <LH,RH> inputs of the next stage???? If that's what you have in mind (and if it works), then I would agree with your figures. If you can describe how you would chain them, I'll verify whether or not it should work.
The way I am using them is as a simple "stack of chains" of 2-bit comparators (TBCs). First start by constructing a 4-bit comparator (FBC). The top two bits are compared, then the next two, and finally the four outputs are fed into another TBC. Next, just take two FBCs and connect them in the very same way to yet another TBC in order make an 8-bit comparator (EBC). And so on. Like this:


CMOS-8-bit-comparator.png


Hence the number of TBCs required to construct an N-bit comparator is simply N-1 and the number of gate delays is Dlog2(N), where D is the number if delays per TBC. So assuming D=6 an EBC would require 7 TBCs with a overall gate delay of 18, and thus a 64-bit comparator would be composed of 63 TBCs with a fairly decent (?) gate delay of just 36.

Don't be too shocked by high transistor counts -- modern CPUs have tens or even hundreds of billions of transistors. Now you are getting an appreciation for why.

But we can play with this and see if we can either improve this one, or explore some other approaches, to get those counts down.
Yep, "shocked" pretty sums up my reaction. Initially I thought it would be just a few dozen!

And I keep staring at it, trying to see if I can find any further simplifications, but I think my inexperience in CMOS is hindering me somewhat. I just need to keep visualizing how these devices interplay with eachother for a while. Eventually I think something will come to me. For now it all seems just a bit too opaque.
 
Last edited:

WBahn

Joined Mar 31, 2012
32,845
Neat! I would absolutely love to hear what they have to say about it. I seem to have a penchant for finding "glitches in the matrix" and this appears to be yet another. =p
I'll let you know what the reaction is like. They generally love these kinds of problems.

The way I am using them is as a simple "stack of chains" of 2-bit comparators (TBCs). First start by constructing a 4-bit comparator (FBC). The top two bits are compared, then the next two, and finally the four outputs are fed into another TBC. Next, just take two FBCs and connect them in the very same way to yet another TBC in order make an 8-bit comparator (EBC). And so on. Like this:


View attachment 357706


Hence the number of TBCs required to construct an N-bit comparator is simply N-1 and the number of gate delays is Dlog2(N), where D is the number if delays per TBC. So assuming D=6 an EBC would require 7 TBCs with a overall gate delay of 18, and thus a 64-bit comparator would be composed of 63 TBCs with a fairly decent (?) gate delay of just 36.



Yep, "shocked" pretty sums up my reaction. Initially I thought it would be just a few dozen!

And I keep staring at it, trying to see if I can find any further simplifications, but I think my inexperience in CMOS is hindering me somewhat. I just need to keep visualizing how these devices interplay with eachother for a while. Eventually I think something will come to me. For now it all seems just a bit too opaque.
So your topology isn't a chain, but rather a tree, which makes the underlying logic a lot different. Instead of working from one side to the other (msb to lsb, for instance) and combining the cumulative results with the next pair of bits in line, you are splitting the problem into halves with each block taking the results from the two halves of the prior layer and combining them. That gives me a better notion of what it needs to accomplish to work, which will allow me to better tackle the task of understanding why it works, despite the symmetry concerns.

I'm hoping to find some time to play with this over the weekend, but it's definitely a back-burner item on my to-do list. The big head scratcher is that your TBC is used in the first layer to compare two two-bit binary-weighted values. But in subsequent layers, it is being used to compare "values" that are built up based on comparison flags from the prior layer. But I think I've already spotted the key, which is that the values you are building up for the next layer from the flags construct a number in which the msb bits of both "values" comes from the more significant parts of the original value, while the lsb bits come from the less significant portions. With that in mind, the reason why it works at the tree level is actually pretty obvious, though that still leaves the symmetry-related concerns about the bottom level.

Tree topologies are usually good from a speed standpoint -- as you noted, the delay is O(log(n)), which is pretty darn good. Even though the number of transistors would be the same as a chain topology, the chain is generally easier to lay out in a compact way because the routing is very simple and regular, so it will take up less die area (which generally improves the speed, but only marginally). It's also easier to expand to a greater numbers of bits and to accommodate arbitrary widths, whereas a tree likes the number of inputs to be an integer power of two,

As for the number of transistors, here's some ballpark figures (via Copilot, so take them with a grain of salt, but they are in line with my experience) for different kinds of 64-bit adders:

Ripple Carry - 1792
Carry-Lookahead - 6400 to 9600
Kogge-Stone - 12,800+

These are representative of the classic speed/cost tradeoffs common in engineering. A ripple-carry adder is very compact and simple, and extremely slow (but more than fast enough for many applications). The Kogge-Stone topology is extremely fast, but consumes almost an order of magnitude more die area and is a lot more power hungry. The Carry-Lookahead (of which there are a number of variants) are in the middle representing a compromise of the two.
 
Last edited:

Thread Starter

xox

Joined Sep 8, 2017
936
I'm hoping to find some time to play with this over the weekend, but it's definitely a back-burner item on my to-do list. The big head scratcher is that your TBC is used in the first layer to compare two two-bit binary-weighted values. But in subsequent layers, it is being used to compare "values" that are built up based on comparison flags from the prior layer. But I think I've already spotted the key, which is that the values you are building up for the next layer from the flags construct a number in which the msb bits of both "values" comes from the more significant parts of the original value, while the lsb bits come from the less significant portions. With that in mind, the reason why it works at the tree level is actually pretty obvious, though that still leaves the symmetry-related concerns about the bottom level.


Tree topologies are usually good from a speed standpoint -- as you noted, the delay is O(log(n)), which is pretty darn good. Even though the number of transistors would be the same as a chain topology, the chain is generally easier to lay out in a compact way because the routing is very simple and regular, so it will take up less die area (which generally improves the speed, but only marginally). It's also easier to expand to a greater numbers of bits and to accommodate arbitrary widths, whereas a tree likes the number of inputs to be an integer power of two,
Exactly. We're basically just building up smaller binary numbers at each level which are guaranteed to compare in such a way that the previous comparison dominates. I ended up going with a tree topology because I thought the chain approach would likely result in a really slow implementation. (And one slight correction: the gate delay actually works out to be D(log2(N)+1)).

As for the number of transistors, here's some ballpark figures (via Copilot, so take them with a grain of salt, but they are in line with my experience) for different kinds of 64-bit adders:


Ripple Carry - 1792

Carry-Lookahead - 6400 to 9600

Kogge-Stone - 12,800+


These are representative of the classic speed/cost tradeoffs common in engineering. A ripple-carry adder is very compact and simple, and extremely slow (but more than fast enough for many applications). The Kogge-Stone topology is extremely fast, but consumes almost an order of magnitude more die area and is a lot more power hungry. The Carry-Lookahead (of which there are a number of variants) are in the middle representing a compromise of the two.
Well that makes me feel a little better. I'd say 3780 seems about fair to middling.
 

WBahn

Joined Mar 31, 2012
32,845
Exactly. We're basically just building up smaller binary numbers at each level which are guaranteed to compare in such a way that the previous comparison dominates. I ended up going with a tree topology because I thought the chain approach would likely result in a really slow implementation. (And one slight correction: the gate delay actually works out to be D(log2(N)+1)).
I don't recall all of the various complexity metrics, so refresh me as to what D() is, I don't recall ever seeing it before.

Regardless, I think you got it right the first time. Remember that your first layer only needs half as many TBC's as there are bits in the number.

Each of your TBCs compares two bits, so for a N bits where N = 2^k, there are N/2 TBCs in the first layer. Starting from the output layer and calling that layer 1, each successive layer has twice as many TBCs in it, so layer m has 2^(m-1) TBCs. Each layer imposes a delay of D_TBC, so the total delay is

D_tot = (m-1)*D_TBC

where m is the smallest integer such that

2^(m-1) ≥ N/2

2·2^(m-1) = 2^m ≥ N

m ≥ log2(N)

m ≥ log2(N)

m = ceil(log2(N))

The total delay is therefore:

D_tot = ceil(log2(N)) * D_TBC

In your case, D_TBC is 6 gate delays.

When comparing two systems that are in different complexity classes, the multiplying constant becomes irrelevant once N gets large enough for the "faster" algorithm to overtake it. The same is true for additive constants even within the same complexity class. But this is always with the understanding that we are talking about problems that are "big". However, in circuit design, N is usually small enough that the multiplying constants and constant offsets are important -- even critically so. As can the quantization involved.

Well that makes me feel a little better. I'd say 3780 seems about fair to middling.
It will be interesting to see how much better we can do if we bang away at it enough.
 

Thread Starter

xox

Joined Sep 8, 2017
936
I don't recall all of the various complexity metrics, so refresh me as to what D() is, I don't recall ever seeing it before.


Regardless, I think you got it right the first time. Remember that your first layer only needs half as many TBC's as there are bits in the number.


Each of your TBCs compares two bits, so for a N bits where N = 2^k, there are N/2 TBCs in the first layer. Starting from the output layer and calling that layer 1, each successive layer has twice as many TBCs in it, so layer m has 2^(m-1) TBCs. Each layer imposes a delay of D_TBC, so the total delay is


D_tot = (m-1)*D_TBC


where m is the smallest integer such that


2^(m-1) ≥ N/2


2·2^(m-1) = 2^m ≥ N


m ≥ log2(N)


m ≥ log2(N)


m = ceil(log2(N))


The total delay is therefore:


D_tot = ceil(log2(N)) * D_TBC


In your case, D_TBC is 6 gate delays.

You just have to add one to that expression. The FBC for example is going to be log2(4)+1 = 3 levels of D_TBC delays. Whereas a 64-bit comparator would be log2(64)+1 = 7 levels, so (assuming D_TBC=6) 42 gate delays (and not the 36 that I claimed earlier).

When comparing two systems that are in different complexity classes, the multiplying constant becomes irrelevant once N gets large enough for the "faster" algorithm to overtake it. The same is true for additive constants even within the same complexity class. But this is always with the understanding that we are talking about problems that are "big". However, in circuit design, N is usually small enough that the multiplying constants and constant offsets are important -- even critically so. As can the quantization involved.

It will be interesting to see how much better we can do if we bang away at it enough.
Right, yes big-O notation can be somewhat confusing. Thanks for the explanation! I'll work on some ideas for pairing down the transistor count over the course of the week. In the meantime I have a minor home-improvement project that needs tending to. I'll keep you posted. :)
 

WBahn

Joined Mar 31, 2012
32,845
You just have to add one to that expression. The FBC for example is going to be log2(4)+1 = 3 levels of D_TBC delays. Whereas a 64-bit comparator would be log2(64)+1 = 7 levels, so (assuming D_TBC=6) 42 gate delays (and not the 36 that I claimed earlier).
But your four bit comparator only has 2 levels of D_TBC delays.

Look at the diagram for your 8-bit comparator. The signals clearly go through just three TBC blocks, hence three D_TBC delays.

Right, yes big-O notation can be somewhat confusing. Thanks for the explanation! I'll work on some ideas for pairing down the transistor count over the course of the week. In the meantime I have a minor home-improvement project that needs tending to. I'll keep you posted. :)
One place to start is the 1-bit comparator that I showed back near the end of this post. The NOR version produces GT and LT signals with just two gate delays and 12 transistors. Two of those will let you get those signals for two adjacent bits (processed in parallel). Then some output processing logic can produce the TBC output signals with an additional three gate delays 20 more transistors, yielding 5 gate delays total (instead of 6) and 44 transistors (instead of 60).

Another approach is to start with the truth table and then develop the SOP (or POS) forms and implement them with AND-OR circuits, yielding 3 gate delays, but a real quick look shows that each TBC will require 72 transistors. So, once again, we see the tradeoff between speed and space.

But, if we are willing to design out own custom CMOS gate, we can implement a TBC with just 22 transistors and only two gate delays.

All of the above were done very quickly with some doodling on paper, so I might have missed something.

Yet another option to be explored is to not get too wedded to the use of GT/LT signals and look at using the other possible combinations.
 

MrAl

Joined Jun 17, 2014
13,704
Uh... why can't a single AND gate constitute a "block"?

Strictly speaking, each block in a chain would only take outputs from the prior block in the chain as its inputs, but in practice this is too restrictive to be really useful. Also, in broader usage, each block does not have to be identical -- think of a processing chain in which each block takes a signal, performs some operation on it (say a band-limited filter), and passes the output to the next block in the chain which doesn't something different to it (say an amplifier). But the context of this discussion is the notion of chaining a bunch of identical things (e.g., a 2-bit comparator comparator) together to make bigger thing (e.g., a 64-bit comparator).

Within that context...

Generic chain:

View attachment 357699
Put anything you want in each block, as long as it is the same. For example, if we put an AND gate in each block:

View attachment 357704
Frequently, the first and/or last block is a chain is different, in case it's inclusion as part of the chain is a matter of interpretation, but the significance between whether it is or isn't rarely matters.

In the case above, we could certainly implement an 8-input AND gate by chaining eight blocks, each containing an AND gate, as shown. The 'in' signal in that case would be tied LO. But we can also recognize that the first block can be simplified by removing the AND gate and tying the output directly to the d0 input.
View attachment 357705
In this case, the first block can no go away completely and we are left with a seven-block chain.

In the case of a ripple-carry adder that does not support carry-in, we can either leave the first block as a full-adder and simply tie the carry-in input LO, or we can modify that block to use a half-adder instead. Whether or not the half-adder is considered to be part of the chain is up for interpretation and depends on the context of the specific discussion involved.



Can you draw this as a chain in which the contents of each block are the same? If so, please do it. I'm not going to spend a bunch of time taking a verbal description and figuring out if it can or can't be -- that's YOUR responsibility.
Hello,

I am not sure what was so hard to figure out about the Boolean expressions:
G3
+ E3' * G2
+ E3' * E2' * G1
+ E3' * E2' * E1' * G0

Each line is an OR'd block with new inputs G3, G2, G1, G0.
This means the first block is just 1, and the second E3', and the third E3'*E2', etc.
So each block evolves along with the required new input.
+ means OR
* means AND
' means invert the immediately previous outcome (as usual).

The above is an expression for a 4 input greater than unit.
Gn is the greater than Boolean: A*B' for bits n.
En is the equality Boolean: A xor B for bits n.

You may not like to view this as a chain because the blocks are not statically identical.

I don't think whether or not this is a chain is that important though. The important thing is that it is possible to get the outcome we need from these kinds of expressions using the proposed Boolean primitives.
 

WBahn

Joined Mar 31, 2012
32,845
But, if we are willing to design out own custom CMOS gate, we can implement a TBC with just 22 transistors and only two gate delays.

All of the above were done very quickly with some doodling on paper, so I might have missed something.
I sat down and did the custom CMOS. I missed a transistor in my sketch (which was just the top half of one of the outputs), so my count was off by two. Also, I forgot that I needed both outputs, so I'm off by nearly a factor of two due to that.

Here's the circuit:

1761515379939.png
As you can see, it's a total of 40 transistors, but it only has two gate delays.

So, using the tree topology, an 8-bit comparator would require a total of 280 transistors and have 6 gate delays, while a 64-bit comparator would have 2520 transistors and experience 12 gate delays.
 

Thread Starter

xox

Joined Sep 8, 2017
936
I sat down and did the custom CMOS. I missed a transistor in my sketch (which was just the top half of one of the outputs), so my count was off by two. Also, I forgot that I needed both outputs, so I'm off by nearly a factor of two due to that.

Here's the circuit:

View attachment 357736
As you can see, it's a total of 40 transistors, but it only has two gate delays.

So, using the tree topology, an 8-bit comparator would require a total of 280 transistors and have 6 gate delays, while a 64-bit comparator would have 2520 transistors and experience 12 gate delays.
Impressive. That blows mine right out of the water. I don't quite understand how it works either, but I'll study it for a while until something sinks in. You are a true master of the arts, Sir!
 

WBahn

Joined Mar 31, 2012
32,845
Impressive. That blows mine right out of the water. I don't quite understand how it works either, but I'll study it for a while until something sinks in. You are a true master of the arts, Sir!
It's not as hard as it might look. In fact, I learned it during a job interview.

The interview was for a position with in ASIC design house and the school I was at had no courses on IC design, so I couldn't even figure out why they were interviewing there. Since I had zero background in it, I didn't sign up for an interview. But the Career Center always had the option of adding resumes of students to the pile, so they added mine and I was chosen for one of the on-campus interview slots. I figured it was worth going to just to get more experience doing interviews. I spent some time the night before refamiliarizing myself with the behavior of BJTs and MOSFETs as linear components, but that was about it. I was comfortable with digital logic and opamp circuits, and since I figured I had no shot at getting an offer I wasn't willing to spend a whole of time preparing for it.

After some basic questions about linear BJT and MOSFET circuits, I was asked to draw a gate-level implementation of a fairly simple Boolean expression. I've long since forgotten the expression, but it was something like:

Y = (AB + C)'

So, of course, I drew what was obvious.
1761532305323.png
In fact, I think that might have been the question, or at least something extremely close to it. I remember that it involved two gates, one of which had an inverted output (the other didn't) and that none of the inputs were inverted. I don't recall what the two gates were or which was inverted though, but this is a close match for the questions that followed.

Then he asked how many transistors this solution required and, assuming a NAND, NOR, and NOT each had one unit of delay, how much delay did this solution have.

Since my school had no emphasis in ICs, I had only seen and analyzed the basic TTL NAND gate and even that was just once and was four years in the past. I had never physically touched a MOSFET or been exposed to the internals of CMOS logic beyond one part of one lecture about how a NOT gate worked from an analog standpoint and, I'm pretty sure, we also looked at how a NAND and NOR were implemented, but it was in passing and, again, more than four years in the past.

I said that I recalled that a NOT gate had two transistors, but that I didn't recall the internals of a NAND or NOR gates, though I knew I had seen them, and that I had never seen the internals of a CMOS AND gate. I could tell he was a bit frustrated at that, but since he had nothing better to do for the rest of the hour, he drew a NAND and NOR gate on a sheet of paper and said that the AND and OR gates are implemented by following a NAND or NOR gates with a NOT gate. At that point, I immediately said that the circuit would therefore have ten transistors and three units of delay. The speed with which I responded seemed to put things back on track, so he then asked me to draw a direct implementation of this logic in CMOS.

I remember saying something like, "I think I understand each of the words in that sentence, but I have no idea what they mean when strung together that way." I was expecting that to be the end of the interview, but instead he pointed to the transistor circuit for a NAND gate and explained that PFETs were used to pull the output HI while NFETs were used to pull the output LO. Then he explained that a PFET behaves like a closed switch if the gate is taken LO, while an NFET behaves like a closed switch if the gate is taken HI. So, in a NAND gate, the output is pulled HI if either input is LO, since the PFETs are in series, while it is pulled LO only if both inputs are HI, since they are in parallel.

That made things click and I saw how you could implement arbitrary logic in CMOS with that approach. So I drew the Truth Table for the expression:

Code:
A B C | AB AB+C | Y
0 0 0 |  0   0  | 1
0 0 1 |  0   1  | 0
0 1 0 |  0   0  | 1
0 1 1 |  0   1  | 0
1 0 0 |  0   0  | 1
1 0 1 |  0   1  | 0
1 1 0 |  1   1  | 0
1 1 1 |  1   1  | 0
Then I drew a K-map and got the SOP expression for the output:

Y = A'C' + B'C'

I then explained, while drawing it, that the pull-up portion could be implemented with two parallel strings of transistors in which one string would have two transistors, on that was driven by A and the other by C, while the second would also have two transistors, one driven by B and the other by C. Then I paused and said, "But, wait, The each string has a transistor driven by C, so we could make that common to both strings." So I drew:

1761534490447.png

Then, thinking out loud, I said something like, "Okay, for the bottom we want to pull down whenever we aren't pulling up, so anytime C is HI, we want to pull down, regardless of A and B. But if both A and B are HI, we want to pull down regardless of C. So these have to be in series, but then that has to be in parallel with the one driven by C. So let's try this."

1761534856822.png
I was so excited, because I was seeing something fascinating that I had never even imagined before. I said, "Wow! So that's just six transistors and only a single unit of delay!"

I remember actually saying "Wow!" because he then said, "Wow, is right. Can you come down to the Springs sometime this week and interview with the rest of the engineers?"

A week later I had a job offer and I worked for them for fourteen years, leaving as their Senior Engineer. I continued to do consulting for them for many years after that.

The reason I know that the actual problem he gave me didn't have any inverted inputs was that I left the interview thinking that you could implement any arbitrary logic function this way and only have a single unit of delay. It didn't dawn on me until shortly after I started there (in fact, it was when improving the XOR implementation) that, in general, you need to have both the straight and complemented inputs available to drive the pullup and pulldown networks, so you will usually end up with two units of delay.

I've had a few interviews where I walked out of there knowing something really cool that I didn't know when I walked in. Interestingly, I got offers every time (and accepted them) and truly enjoyed most of my time working there.
 

Attachments

WBahn

Joined Mar 31, 2012
32,845
I sat down and did the custom CMOS. I missed a transistor in my sketch (which was just the top half of one of the outputs), so my count was off by two. Also, I forgot that I needed both outputs, so I'm off by nearly a factor of two due to that.

Here's the circuit:

View attachment 357736
As you can see, it's a total of 40 transistors, but it only has two gate delays.

So, using the tree topology, an 8-bit comparator would require a total of 280 transistors and have 6 gate delays, while a 64-bit comparator would have 2520 transistors and experience 12 gate delays.
So, let me explain this particular circuit -- and the reason for its layout -- and how it ties back to the SOP equations, such as might come out of doing a K-map reduction.

The SOP equation for the GT signal can be arrived at a number of ways. One way is to think it through.

A1·A0 represents the value 3 if true. If it greater than B is either B1 or B2 is LO, so

GT = A1 · A0 · (B1' + B0') + ... = (A1 · A0 · B1') + (A1 · A0 · B0') + ....

A1·A0' represents the value 2 if true and it is greater than B if B is either 1 (which is B1'·B0) or 0 (which is B1'·B0'). But, if we think about it, we only need to know B1 since, if B1 is LO, then B is less than 2 while if it is HI, B is at least 2. So these two terms can be combined. If we don't spot this via reasoning, that's fine, because it will come out in the wash since:

A1·A0'·B1'·B0 + A1·A0'·B1'·B0' = A1·A0'·B1'·(B0 + B0') = A1·A0'·B1'·(1) = A1·A0'·B1'

So now we have

GT = (A1 · A0 · B1') + (A1 · A0 · B0') + (A1 · A0' · B1') + ...

Finally, we have A1'·A0, which represents the value 1 and is only greater than B=0, which is B1'·B0'. We don't have to consider A=0 (A1'·A0') since this can't be greater than B. So our full Boolean equation is:

GT = (A1 · A0 · B1') + (A1 · A0 · B0') + (A1 · A0' · B1') + (A1' · A0 · B1' · B0')

At this point, we can see if we can minimize this using a K-map or any other means. Low hanging fruit can be had by looking if there are any pairs of terms that have all but one factor in common, which is what we have with the first and third terms.

(A1 · A0 · B1') + (A1 · A0' · B1') = (A1 · B1') · (A0 + A0') = A1 · B1'

GT = (A1 · B1') + (A1 · A0 · B0') + (A1' · A0 · B1' · B0')

If you think about it, identifying these relationships and combining them this way is all a K-map is doing, it's just letting us do so graphically. The strength of a K-map is that it lets us see overlapping terms in a way that can be difficult to spot just looking at the equation. For instance, consider what would happen if we fleshed out the middle term to include (B1+B1') as follows:

GT = (A1 · B1') + (A1 · A0 · B0') + (A1' · A0 · B1' · B0') + [(A1 · A0 · B0') · (B1 + B1')]

We know we don't need this term, but we also know it doesn't change anything, so it does no harm. But what happens if we expand it?

GT = (A1 · B1') + (A1 · A0 · B0') + (A1' · A0 · B1' · B0') + [(A1 · A0 · B0' · B1) + (A1 · A0 · B0' · B1')]

Now we have the third and the fifth terms that are identical except one has A1' and the other has A1, which will let us combine them by eliminating A1. The fourth term we already know we can erase anytime we want, so let's do that. This leaves us with

GT = (A1 · B1') + (A1 · A0 · B0') + (A0 · B1' · B0')

This is the minimal SOP form for GT. As we will see, it is NOT necessarily the minimal form in some absolute since, it is merely the minimal "SOP" form -- that's a point that many people lose sight of. But we'll come back to that.

Looking at the equation, we see that we want to pull the GT output HI if any of those three terms is true, which means we want three parallel paths so that each one can pull it HI independent of the other two.

For each term, the path requires that the factors being ANDed together all be true, which implies switches in series. Out pullup switches are PFETs and a PFET is turned on when its gate is take LO, so we need to drive the gates with the opposite polarity of the variable shown in the the term. For instance, in the first term we put a PFET driven by A1' in series with a PFET driven by B1. The result is the pullup network in the circuit that was shown. Here it is again, highlighting the three terms.

1761619423234.png

For our pulldown network, we want the complement of the logic for the pullup network, which we can get by simply negating GT.

GT' = [(A1 · B1') + (A1 · A0 · B0') + (A0 · B1' · B0')]'

Applying DeMorgan's Theorem gets us:

GT' = (A1 · B1')' · (A1 · A0 · B0')' · (A0 · B1' · B0')'

Applying DeMorgan's Theorem again to each term gets us:

GT' = (A1' + B1) · (A1' + A0' + B0) · (A0' + B1 + B0)

We can recognize that here we have three factors that are ANDed together, meaning that they have to be in series, and that each factor consists of terms that are ORed together, meaning that they have to be in parallel. Since we are pulling LO, we use NFETs and since NFETs are turned on by a HI signal at the gate, we don't need to invert the driving signals.

We can no see the logic of our pulldown network as follows:

1761620304602.png

This gives us a way to easily translate a Boolean equation into a CMOS logic gate. In general, we need both the uncomplemented and complemented inputs, though in this particular case we only needed the complement of two of the inputs.

If we have our Boolean equation for the function is SOP, we can implement the pullup as parallel strings term by term. To get the pulldown network, we don't need to manipulate the equation, we just apply DeMorgan's visually by placing stuff that is in series in parallel and stuff that is in parallel in series. Whatever signal we drive a PFET with, we drive the corresponding NFET with that same signal -- essentially DeMorgan's cancels out the inversion associated with the PFET.

But this is not the only way to do this. There is nothing magical about starting with the SOP form. If we start with the POS form, then our pullup is a series of parallel sections and our pulldown is a parallel of series sections. But even that implies a constraint that isn't there. We can put both pullup and pulldown into either SOP or POS form. One consideration in making this decision is which choice results in the longest chain being the shortest (the fewest series transistors), as this will result in the smallest worst-case propagation and transition times.

So, now let's get back to this point that minimal SOP is not necessary the minimal solution.

Consider our pullup section again. Each row represents transistors that are being turned on by the same signal. If there are transistors in multiple columns on a given row, they represent factors that are shared by those terms in the SOP equation. Hence, they can be factored out. Since we can order the factors however we want (i.e., since the transistors are in series, order doesn't matter), if two columns share terms at one end or the other, we can put those transistors in parallel and replace them with a single transistor. We can't do this in the middle of a string, as those transistors are not in parallel, only if they are moved to the same end of a string.

Notice that the middle and right strings actually have two transistors in common. Thus we can move both to the same end and replace each pair with a single transistor. In terms of the Boolean expression, this is factoring out common signals.

GT = (A1 · B1') + (A1 · A0 · B0') + (A0 · B1' · B0')

GT = (A1 · B1') + (A0 · B0')·(A1 + B1')

Notice that if our first term had either A0 of B0' in it, we could have factored it out of both of the remaining terms. But, in this case, we don't have that.

Also notice that, in our original equation, we had other options. We could have factored out A1 from the first two terms, or B1' from the first and third. But, had we done either, we wouldn't have been able to factor out a second signal. Thus, determining the optimal combinations to make in order to eliminate as many instances of signals as possible (since each instance represents a transistor) is more an art than a science.

This means that we can reduce out pullup network as follows:

1761627147874.png

This isn't as neat and tidy, but it reduces the pullup network from eight transistors to six. Since there is a comparable reduction in the pulldown network and then the same in the LT circuit, this removes eight transistors from the TBC circuit, taking it from 40 transistors down to 32, a reduction of 20%. This takes the 64 bit comparator from 2520 transistors to 2016, removing 504 transistors.
 

Thread Starter

xox

Joined Sep 8, 2017
936
So, let me explain this particular circuit -- and the reason for its layout -- and how it ties back to the SOP equations, such as might come out of doing a K-map reduction.


The SOP equation for the GT signal can be arrived at a number of ways. One way is to think it through.


A1·A0 represents the value 3 if true. If it greater than B is either B1 or B2 is LO, so


GT = A1 · A0 · (B1' + B0') + ... = (A1 · A0 · B1') + (A1 · A0 · B0') + ....


A1·A0' represents the value 2 if true and it is greater than B if B is either 1 (which is B1'·B0) or 0 (which is B1'·B0'). But, if we think about it, we only need to know B1 since, if B1 is LO, then B is less than 2 while if it is HI, B is at least 2. So these two terms can be combined. If we don't spot this via reasoning, that's fine, because it will come out in the wash since:


A1·A0'·B1'·B0 + A1·A0'·B1'·B0' = A1·A0'·B1'·(B0 + B0') = A1·A0'·B1'·(1) = A1·A0'·B1'


So now we have


GT = (A1 · A0 · B1') + (A1 · A0 · B0') + (A1 · A0' · B1') + ...


Finally, we have A1'·A0, which represents the value 1 and is only greater than B=0, which is B1'·B0'. We don't have to consider A=0 (A1'·A0') since this can't be greater than B. So our full Boolean equation is:


GT = (A1 · A0 · B1') + (A1 · A0 · B0') + (A1 · A0' · B1') + (A1' · A0 · B1' · B0')


At this point, we can see if we can minimize this using a K-map or any other means. Low hanging fruit can be had by looking if there are any pairs of terms that have all but one factor in common, which is what we have with the first and third terms.


(A1 · A0 · B1') + (A1 · A0' · B1') = (A1 · B1') · (A0 + A0') = A1 · B1'


GT = (A1 · B1') + (A1 · A0 · B0') + (A1' · A0 · B1' · B0')


If you think about it, identifying these relationships and combining them this way is all a K-map is doing, it's just letting us do so graphically. The strength of a K-map is that it lets us see overlapping terms in a way that can be difficult to spot just looking at the equation. For instance, consider what would happen if we fleshed out the middle term to include (B1+B1') as follows:


GT = (A1 · B1') + (A1 · A0 · B0') + (A1' · A0 · B1' · B0') + [(A1 · A0 · B0') · (B1 + B1')]


We know we don't need this term, but we also know it doesn't change anything, so it does no harm. But what happens if we expand it?


GT = (A1 · B1') + (A1 · A0 · B0') + (A1' · A0 · B1' · B0') + [(A1 · A0 · B0' · B1) + (A1 · A0 · B0' · B1')]


Now we have the third and the fifth terms that are identical except one has A1' and the other has A1, which will let us combine them by eliminating A1. The fourth term we already know we can erase anytime we want, so let's do that. This leaves us with


GT = (A1 · B1') + (A1 · A0 · B0') + (A0 · B1' · B0')


This is the minimal SOP form for GT. As we will see, it is NOT necessarily the minimal form in some absolute since, it is merely the minimal "SOP" form -- that's a point that many people lose sight of. But we'll come back to that.


Looking at the equation, we see that we want to pull the GT output HI if any of those three terms is true, which means we want three parallel paths so that each one can pull it HI independent of the other two.


For each term, the path requires that the factors being ANDed together all be true, which implies switches in series. Out pullup switches are PFETs and a PFET is turned on when its gate is take LO, so we need to drive the gates with the opposite polarity of the variable shown in the the term. For instance, in the first term we put a PFET driven by A1' in series with a PFET driven by B1. The result is the pullup network in the circuit that was shown. Here it is again, highlighting the three terms.


1761619423234.png


For our pulldown network, we want the complement of the logic for the pullup network, which we can get by simply negating GT.


GT' = [(A1 · B1') + (A1 · A0 · B0') + (A0 · B1' · B0')]'


Applying DeMorgan's Theorem gets us:


GT' = (A1 · B1')' · (A1 · A0 · B0')' · (A0 · B1' · B0')'


Applying DeMorgan's Theorem again to each term gets us:


GT' = (A1' + B1) · (A1' + A0' + B0) · (A0' + B1 + B0)


We can recognize that here we have three factors that are ANDed together, meaning that they have to be in series, and that each factor consists of terms that are ORed together, meaning that they have to be in parallel. Since we are pulling LO, we use NFETs and since NFETs are turned on by a HI signal at the gate, we don't need to invert the driving signals.


We can no see the logic of our pulldown network as follows:


1761620304602.png


This gives us a way to easily translate a Boolean equation into a CMOS logic gate. In general, we need both the uncomplemented and complemented inputs, though in this particular case we only needed the complement of two of the inputs.


If we have our Boolean equation for the function is SOP, we can implement the pullup as parallel strings term by term. To get the pulldown network, we don't need to manipulate the equation, we just apply DeMorgan's visually by placing stuff that is in series in parallel and stuff that is in parallel in series. Whatever signal we drive a PFET with, we drive the corresponding NFET with that same signal -- essentially DeMorgan's cancels out the inversion associated with the PFET.


But this is not the only way to do this. There is nothing magical about starting with the SOP form. If we start with the POS form, then our pullup is a series of parallel sections and our pulldown is a parallel of series sections. But even that implies a constraint that isn't there. We can put both pullup and pulldown into either SOP or POS form. One consideration in making this decision is which choice results in the longest chain being the shortest (the fewest series transistors), as this will result in the smallest worst-case propagation and transition times.


So, now let's get back to this point that minimal SOP is not necessary the minimal solution.


Consider our pullup section again. Each row represents transistors that are being turned on by the same signal. If there are transistors in multiple columns on a given row, they represent factors that are shared by those terms in the SOP equation. Hence, they can be factored out. Since we can order the factors however we want (i.e., since the transistors are in series, order doesn't matter), if two columns share terms at one end or the other, we can put those transistors in parallel and replace them with a single transistor. We can't do this in the middle of a string, as those transistors are not in parallel, only if they are moved to the same end of a string.


Notice that the middle and right strings actually have two transistors in common. Thus we can move both to the same end and replace each pair with a single transistor. In terms of the Boolean expression, this is factoring out common signals.


GT = (A1 · B1') + (A1 · A0 · B0') + (A0 · B1' · B0')


GT = (A1 · B1') + (A0 · B0')·(A1 + B1')


Notice that if our first term had either A0 of B0' in it, we could have factored it out of both of the remaining terms. But, in this case, we don't have that.


Also notice that, in our original equation, we had other options. We could have factored out A1 from the first two terms, or B1' from the first and third. But, had we done either, we wouldn't have been able to factor out a second signal. Thus, determining the optimal combinations to make in order to eliminate as many instances of signals as possible (since each instance represents a transistor) is more an art than a science.


This means that we can reduce out pullup network as follows:


1761627147874.png


This isn't as neat and tidy, but it reduces the pullup network from eight transistors to six. Since there is a comparable reduction in the pulldown network and then the same in the LT circuit, this removes eight transistors from the TBC circuit, taking it from 40 transistors down to 32, a reduction of 20%. This takes the 64 bit comparator from 2520 transistors to 2016, removing 504 transistors.

Interesting. So basically a theoretically optimal implementation is achieved by breaking the circuit down into a combination of pull-up/pull-down networks. As you've pointed out though it is something of a heuristic approach! I will definitely have to study that technique a bit more before I think I could be very proficient at it. I suppose it just boils down to using De Morgan's theorem to put the equations in both SOP and POS form, applying K-map reduction, and then just a matter of choosing the most efficient layout.

I would love to get good enough at this so I can tackle something like a simple 4-bit computer. I have always wanted to understand how all of those things work together; clock signals, program counter, memory fetching, opcode decoding, registers, ALU, etc, and even if it doesn't do much it would at least be satisfying to build something that actually performs a computation. (Maybe that's just the programmer in me.) Although I imagine that the number of transistors needed to put together even the most rudimentary computer would be insanely high. In the thousands, at the very least!
 

WBahn

Joined Mar 31, 2012
32,845
I would love to get good enough at this so I can tackle something like a simple 4-bit computer. I have always wanted to understand how all of those things work together; clock signals, program counter, memory fetching, opcode decoding, registers, ALU, etc, and even if it doesn't do much it would at least be satisfying to build something that actually performs a computation. (Maybe that's just the programmer in me.) Although I imagine that the number of transistors needed to put together even the most rudimentary computer would be insanely high. In the thousands, at the very least!
I would strongly recommend that you look into Nand-to-Tetris (which is a companion project to The Elements of Computing Systems text/course). It is intended as a one-semester course at the sophomore level with the only expectation being a semester's high-level programming experience, preferably in an object-oriented language. It's a progression of twelve very-carefully scoped projects that walk you through building a very simple 16-bit computer starting with nothing but 2-input NAND gates and D-type Flip Flops. The first five projects are the hardware construction (which is all done in HDL and simulation). Then you write an assembler for the very basic (28 instructions) instruction set architecture. That's the first half of the book, which is available for free on the author's website (along with all of the software tools you need). In the second half of the book (which is not free, but even new from the publisher is very reasonably priced), you write a virtual language translator for a stack-oriented virtual machine. That's two projects. Then another two are writing a recursive-descent compiler for a high-level object-oriented language called Jack. Finally, you write the operating system libraries. Everything is kept very minimal with no bells and whistles, so it is quite tractable to do it in one semester. But you will get a good flavor for how the basics of everything works. Want to know how recursive functions calls work under the hood on a computer that has no concept of a call stack? Well, you're going to write the software that generates the necessary assembly code to make that work. Ever wonder how dynamic memory allocation works? Want to know how arrays and string objects work? Well, you're going to write the routines that manage them. Well, you're going to write the memory manager that does it. Ever wonder how text is printed on a screen? Well, you're going to write the code to work with the font table to do it. Would you like to know how multiplication, division, and square-root can be implemented on a processor that can barely add? Well, guess what? Wonder how keyboard input is captured on a process that has no interrupts or how numbers entered via the keyboard are turning into integer values in memory? Well, you guessed it. Like to know how lines and circles are drawn on a screen? That's right, you're going to implement the algorithms to do it.
 

Thread Starter

xox

Joined Sep 8, 2017
936
I would strongly recommend that you look into Nand-to-Tetris (which is a companion project to The Elements of Computing Systems text/course). It is intended as a one-semester course at the sophomore level with the only expectation being a semester's high-level programming experience, preferably in an object-oriented language. It's a progression of twelve very-carefully scoped projects that walk you through building a very simple 16-bit computer starting with nothing but 2-input NAND gates and D-type Flip Flops. The first five projects are the hardware construction (which is all done in HDL and simulation). Then you write an assembler for the very basic (28 instructions) instruction set architecture. That's the first half of the book, which is available for free on the author's website (along with all of the software tools you need). In the second half of the book (which is not free, but even new from the publisher is very reasonably priced), you write a virtual language translator for a stack-oriented virtual machine. That's two projects. Then another two are writing a recursive-descent compiler for a high-level object-oriented language called Jack. Finally, you write the operating system libraries. Everything is kept very minimal with no bells and whistles, so it is quite tractable to do it in one semester. But you will get a good flavor for how the basics of everything works. Want to know how recursive functions calls work under the hood on a computer that has no concept of a call stack? Well, you're going to write the software that generates the necessary assembly code to make that work. Ever wonder how dynamic memory allocation works? Want to know how arrays and string objects work? Well, you're going to write the routines that manage them. Well, you're going to write the memory manager that does it. Ever wonder how text is printed on a screen? Well, you're going to write the code to work with the font table to do it. Would you like to know how multiplication, division, and square-root can be implemented on a processor that can barely add? Well, guess what? Wonder how keyboard input is captured on a process that has no interrupts or how numbers entered via the keyboard are turning into integer values in memory? Well, you guessed it. Like to know how lines and circles are drawn on a screen? That's right, you're going to implement the algorithms to do it.

TBQH it's mostly the hardware aspect that currently interests me. I actually have quite a bit of experience with writing parsers/lexers, virtual machines, memory allocators; just about any kind of programming you can imagine, I have probably done it. I've even implemented a simple graphical operating system from scratch! (In no small part thanks to Ralph Brown's Interrupt List.)

But yeah. The idea of creating a physical computer with just NAND gates and flip flops does sound like a good start for a hardware project. I don't know how practical a 16-bit computer would be for me. Ideally I would like the project to fit on a table (not fill a garage) so if I did end up working on something like that, most likely it would only be 8 or maybe even just 4 bits. The main thing is I would just want it to perform all of the basic functions of a computer.

The biggest mystery is how things like division would have to be implemented in hardware. I mean I can write an algorithm for it, but I really don't want that to have to be "executed" as a set of instructions within the CPU as a normal program would. I will have to figure out some way to translate that to a more "raw" representation. For example I recently had an idea for an integer division algorithm that might be a good candidate (as it just uses shifts, additions, and subtractions). However I would somehow need to convert any loops and comparisons to something which translates to logic gates! Of course that would obviously be one of the last pieces of functionality to be added to the project (if at all). Just getting the darn thing to step through instructions and manipulate data is going to be more than enough stuff to keep me busy for a while!

Anyway I really do appreciate all of the great advice on such low level stuff by the way. I mostly avoided working with logic gates until now and I honestly regret not having looked into it much sooner. Neat stuff!
 
Top