Closing timing around a hard multiplier

Thread Starter

tindel

Joined Sep 16, 2012
756
I've been using a MAX10 device for prototyping a closed loop system. This is my first real FPGA project. I'm using an ADC to measure a value for a control loop. Due to the control loop, I have to do some multiplication. My controller has two stages, each doing multiplication. I've found that I have had to put 2, 3, or 10 cycles of delay between the two stages to get timing to close. But when I look at the skews, this doesn't make sense. There also isn't much time for timing to close (only 239ps, with 2 delay cycles max) which I don't believe is very robust - would prefer 2ns of setup time slack or more.

Input value -> Controller Stage 1 -> Controller Stage 2 -> Output Control value
The controller iteration clock is 50kHz - compared to the main clock of 50MHz.

Below is a picture of their IP block. Note, however, that I'm not using their IP block, only the hard multiplier (using an "out = data_a * data_b;" statement, which synthesizes to the hard multiplier). It appears the hard multiplier is asynchronous, and I'm using an output register on all four stages. When I look at the violations, the failing path is always the data path from Controller Stage 1 to Controller Stage 2.

I think my main question is why isn't my slack improving by 20ns per clock delay if the update rate of my controller is every 20us?

The code is simplified as follows:
Code:
assign dataout = data_A * data_B;
always @(posedge clk or negedge reset_n) begin  // clk is 50MHz
    if (~reset_n)
        out <= 0;
    else if (dlyd_update[DLY_CLK])  // update rate is 50kHz, with 1/50MHz * DLY_CLK cycles
        out <= dataout;
end
[Attachments: 1590846704114.png, 1590848058193.png]
 

Thread Starter

tindel

Joined Sep 16, 2012
756
I tried removing the reset, but it doesn't appear to be the root cause, nor does adding delays increase the slack by 20ns.
 

Analog Ground

Joined Apr 24, 2019
446
It has been a while, but I recall your situation is called a "multicycle path" in timing analysis. There is a way to tell the timing analyzer that the data path between two registers is allowed more than the default number of cycles (which is usually 1). Perhaps you could search the Quartus user guide for this topic. Why don't you use the multiplier IP and the options in the Megafunction Wizard? You can specify input and output registers and tell it to use the hard multipliers. Are you going for max portability?
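A minimal sketch of what such a multicycle path can look like in RTL, assuming a fixed gap of 4 clocks between the two capture enables. The module, the signal names (`update`, `stage1_q`, `stage2_q`, `gain`) and the register filters in the comment are all hypothetical and not taken from the thread. The constraint itself lives in the .sdc file; it is shown here only as a comment, and the filters would have to match the real register names.

```verilog
// Hypothetical two-stage capture: stage 2 samples GAP clocks after stage 1.
// Matching constraints belong in the .sdc file, not in the Verilog, e.g.
// (assumed register names, values must agree with GAP):
//   set_multicycle_path -setup -from [get_registers {*stage1_q*}] -to [get_registers {*stage2_q*}] 4
//   set_multicycle_path -hold  -from [get_registers {*stage1_q*}] -to [get_registers {*stage2_q*}] 3
module multicycle_stages #(
    parameter WIDTH = 16,
    parameter GAP   = 4              // clocks between the two captures, must be >= 2
) (
    input  wire                      clk,
    input  wire                      update,   // one-clock pulse at the loop rate
    input  wire signed [WIDTH-1:0]   data_a,
    input  wire signed [WIDTH-1:0]   data_b,
    input  wire signed [WIDTH-1:0]   gain,
    output reg  signed [3*WIDTH-1:0] stage2_q
);
    reg signed [2*WIDTH-1:0] stage1_q;
    reg [GAP-1:0] dly = {GAP{1'b0}};

    always @(posedge clk) begin
        dly <= {dly[GAP-2:0], update};     // delayed copies of the update pulse
        if (update)
            stage1_q <= data_a * data_b;   // launch register
        if (dly[GAP-1])
            stage2_q <= stage1_q * gain;   // capture register, GAP clocks later
    end
endmodule
```

Without the constraint, the analyzer still treats the stage 1 to stage 2 path as a single-cycle path, no matter how many clocks the enable waits.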
 

andrewmm

Joined Feb 25, 2011
915
tindel said: "I tried removing the reset, but it doesn't appear to be the root cause, nor does adding delays increase the slack by 20ns."
General point on resets: think local, not global.
ONLY reset the things you need to; resets add product terms and routing congestion.
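As an illustration of "local" resets, here is a minimal sketch in which only a control flag is reset and the datapath register is left without one. The module and signal names are hypothetical; the assumption is that downstream logic ignores `out_data` whenever `out_valid` is low, so the data register never needs a known reset value.

```verilog
// Only the control flag is reset; the datapath register is not.
module local_reset_example #(
    parameter WIDTH = 16
) (
    input  wire             clk,
    input  wire             reset_n,
    input  wire             in_valid,
    input  wire [WIDTH-1:0] in_data,
    output reg              out_valid,
    output reg  [WIDTH-1:0] out_data
);
    // Control: needs a known state after reset.
    always @(posedge clk or negedge reset_n) begin
        if (!reset_n) out_valid <= 1'b0;
        else          out_valid <= in_valid;
    end

    // Datapath: no reset; contents only matter while out_valid is high.
    always @(posedge clk) begin
        out_data <= in_data;
    end
endmodule
```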

I only just noticed you're also enabling the registers;
try getting rid of that as well and see what it does.

These MACs are funny things.
The built-in registers of the MAC are required to get speed.
The MAC will easily perform a multiply at 50 MHz,
so there is no need for, nor advantage in, a slow enable.
Inferring MACs is great, but you need to follow the examples exactly,
as the tools can be flaky about extracting a hard MAC from your code.
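A sketch of the kind of coding pattern meant here: a signed multiplier with input and output registers, no reset and no enable, running every clock. The module name and the 18-bit width are assumptions. Whether the registers actually get packed into the hard multiplier depends on the device and tool settings, so the fitter report is the place to confirm it.

```verilog
// Signed multiplier with input and output registers, no reset, no enable.
module mult_registered #(
    parameter WIDTH = 18
) (
    input  wire                      clk,
    input  wire signed [WIDTH-1:0]   data_a,
    input  wire signed [WIDTH-1:0]   data_b,
    output reg  signed [2*WIDTH-1:0] product
);
    reg signed [WIDTH-1:0] a_reg, b_reg;

    always @(posedge clk) begin
        a_reg   <= data_a;          // input registers
        b_reg   <= data_b;
        product <= a_reg * b_reg;   // output register; total latency is 2 clocks
    end
endmodule
```

The slow 50kHz update can then simply sample `product` when it needs it, rather than gating the multiplier registers themselves.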

Remember, adding registers does not change the slack; it changes the delay, though.
Have you simulated to prove the function is working as you expect?

Also remember about the tools doing register push-back / duplication, and the use of IOB registers.

Can you post your full code and test bench to show what you're up to, or make a test case that shows what you have that you can share? The snippet is OK, but open to too many other effects.
 

soyez

Joined Aug 17, 2020
51
Timing closure is the process by which a logic design consisting of primitive elements such as combinational logic gates (AND, OR, NOT, NAND, NOR, etc.) and sequential elements (flip-flops, latches, memories) is modified to meet its timing requirements. Unlike a computer program, where there is no explicit delay attached to a calculation, logic circuits have intrinsic and well-defined delays to propagate inputs to outputs. In simple cases the user can compute the path delay between elements manually; once the design is more than a dozen or so elements, this is impractical. For example, the delay along a path from the output of a D flip-flop, through combinational logic gates, and into the next D flip-flop's input must be less than the period between the clock edges at the two flip-flops, less the setup time of the receiving flip-flop. The path with the largest delay is the critical path; when its delay exceeds what the clock period allows, the path fails timing. The circuit will not function reliably when a path delay exceeds the clock period, so modifying the circuit to remove the timing failure is an important part of the logic design engineer's task.
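The setup check described above can be written out as a worked equation. This is a sketch using generic symbols that are not from the thread: t_co is the launch register's clock-to-output delay, t_data the combinational and routing delay, t_su the capture register's setup time, t_skew the capture clock arrival minus the launch clock arrival, and N the number of clock periods the analyzer allows for the path (1 unless a multicycle constraint says otherwise). Clock uncertainty is ignored. The 20 ns period follows from the 50MHz clock in the thread; the 19.5 ns path delay is a made-up value for illustration only.

```latex
\text{slack}_{\text{setup}} = N \cdot T_{clk} + t_{skew} - \left( t_{co} + t_{data} + t_{su} \right)

% Illustration with T_{clk} = 20\,\text{ns},\ t_{skew} = 0,\ t_{co} + t_{data} + t_{su} = 19.5\,\text{ns} (hypothetical):
N = 1:\quad \text{slack}_{\text{setup}} = 20 - 19.5 = 0.5\,\text{ns}
N = 4:\quad \text{slack}_{\text{setup}} = 4 \cdot 20 - 19.5 = 60.5\,\text{ns}
```

Waiting extra clocks in the RTL before capturing does not change N by itself. The analyzer keeps using N = 1 until it is told otherwise, which is why the reported slack does not grow by 20 ns per added delay cycle.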
 

Thread Starter

tindel

Joined Sep 16, 2012
756
@andrewmm - I know it has been a while, but I did use this advice the last couple of days. Thanks so much for your valuable input.

andrewmm said: "General point, Resets, think local not global, ONLY reset things you need to, resets add product terms / routing congestion."
Do you know of a white paper that talks about the effects of resets on routing? Today I had a setup slack of -0.957ns, and then I reduced the resets to only the needed ones and my slack improved to +0.091ns. That is a 1ns improvement!

andrewmm said: "Remember adding registers does not change the slack , it changes the delay though, have you simulated to prove the function is working as you expect ?"
I found that adding registers did change the slack. I am also now on a different platform than the MAX10, so perhaps there is some variation?

Doing the multiplication is only part of the problem. There's also addition, bit shifts, and maximum/minimum compares going on too. I found that using a shift register/enable to sequence events helped significantly. Separating out the operations allowed me to complete the operation in 4 clock cycles, and close my timing, as shown below.

Description / Slack
one cycle / -27.619 ns
three clock cycles / -5.150 ns
four clock cycles / -0.957 ns
four clock cycles + unneeded resets removed / +0.091 ns
four clock cycles + unneeded resets and enables removed / +0.306 ns (in this section; another section now has negative slack)
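The shift-register sequencing described above could be sketched roughly as follows. This is not the actual design: the module, the signal names, the order of operations (multiply, add, shift, clamp) and the widths are all assumptions chosen to show one operation per clock over four clocks. The inputs are assumed to stay stable while a sequence is running.

```verilog
// Hypothetical 4-step controller stage: multiply, add, shift, clamp.
module stage_sequenced #(
    parameter WIDTH = 16,
    parameter SHIFT = 8
) (
    input  wire                    clk,
    input  wire                    start,    // one-clock pulse per controller update
    input  wire signed [WIDTH-1:0] x,
    input  wire signed [WIDTH-1:0] gain,
    input  wire signed [WIDTH-1:0] offset,   // added at product scale, before the shift
    input  wire signed [WIDTH-1:0] out_min,
    input  wire signed [WIDTH-1:0] out_max,
    output reg  signed [WIDTH-1:0] y,
    output wire                    done
);
    reg [3:0] step = 4'b0;                    // shift register of step enables
    reg signed [2*WIDTH-1:0] prod, sum, shifted;

    always @(posedge clk) begin
        step <= {step[2:0], start};
        if (start)   prod    <= x * gain;          // clock 1: multiply
        if (step[0]) sum     <= prod + offset;     // clock 2: add
        if (step[1]) shifted <= sum >>> SHIFT;     // clock 3: arithmetic shift
        if (step[2]) begin                         // clock 4: min/max clamp
            if (shifted > out_max)      y <= out_max;
            else if (shifted < out_min) y <= out_min;
            else                        y <= shifted[WIDTH-1:0];
        end
    end

    assign done = step[3];                    // high for one clock once y is valid
endmodule
```

Each register-to-register path now contains a single operation, which is the property that shortens the worst-case path.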

Another thing I found that helped a lot was drawing out the circuit that I was designing on the whiteboard.

Anyway - just wanted to let you know that I did not ignore your post, and it was of great value to help me converge on a solution.
 

andrewmm

Joined Feb 25, 2011
915
Well done for getting back.
You're doing all the right things.
Remembering that you're describing logic is always a good step, and easily forgotten.

resets,

https://www.xilinx.com/support/documentation/white_papers/wp272.pdf


Re slack and adding the registers: adding pipelining makes it easier for the tools to find routes and ways of making your logic meet your timing.
So yes, the slack can and does improve as you add pipeline stages, but the point to take from this is that it's because the pipelining makes routing easier.

Also key to remember: the tools run until they meet your timing and size requirements, and then STOP.
So any "slack" you get is just where the tool got to, not the best it can do.
A big difference.
 