Closing timing around a hard multiplier

tindel · May 30, 2020

I've been using a MAX10 device for prototyping a closed loop system. This is my first real FPGA project. I'm using an ADC to measure a value for a control loop. Due to the control loop, I have to do some multiplication. My controller has two stages, each doing multiplication. I've found that I have had to put 2, 3, or 10 cycles of delay between the two stages to get timing to close. But when I look at the skews, this doesn't make sense. There also isn't much margin when timing does close (only 239ps, with 2 delay cycles max), which I don't believe is very robust. I would prefer 2ns of setup slack or more.

Input value -> Controller Stage 1 -> Controller Stage 2 -> Output Control value
The controller iteration clock is 50kHz - compared to the main clock of 50MHz.

Below is a picture of their IP block. Note, however, that I'm not using their IP block, only the hard multiplier via an "out = data_a * data_b;" statement (this synthesizes to the hard multiplier). It appears the hard multiplier is asynchronous, and I'm using an output register on all four stages. When I look at the violations, the failing path is always the data delay from Controller Stage 1 to Controller Stage 2.

I think my main question is: why isn't my slack improving by 20ns per clock cycle of delay, given that my controller only updates every 20us?

The code is simplified as follows:

Code:

assign dataout = data_A * data_B;
always @(posedge clk or negedge reset_n) begin  // clk is 50MHz
    if (~reset_n)
        out <= 0;
    else if (dlyd_update[DLY_CLK])  // update rate is 50kHz, with 1/50MHz * DLY_CLK cycles
        out <= dataout;
end

Deleted member 115935 · May 30, 2020

Get rid of your reset.

tindel · Jun 1, 2020

I tried removing the reset, but it doesn't appear to be the root cause, nor does adding delays increase the slack by 20ns.

Analog Ground · Jun 1, 2020

It has been a while, but I recall your situation is called a "multicycle path" in timing analysis. There is a way to tell the timing analyzer that the data path between two registers is allowed more than the default number of cycles (which is usually 1). Perhaps you could search the Quartus user guide for this topic. Why don't you use the multiplier IP and the options in the Megafunction Wizard? You can specify input and output registers and tell it to use the hard multipliers. Are you going for max portability?

Deleted member 115935 · Jun 1, 2020

tindel said:
I tried removing the reset, but it doesn't appear to be the root cause, nor does adding delays increase the slack by 20ns.

General point on resets: think local, not global.
ONLY reset the things you need to; resets add product terms and routing congestion.

I only just noticed you're also enabling the registers.
Try getting rid of that as well and see what it does.

These MACs are funny things.
The built-in registers of the MAC are required to get speed.
The MAC will easily perform a multiply at 50 MHz,
so there is no need for, nor advantage in, a slow enable.
Inferring MACs is great, but you need to follow the examples exactly,
as the tools can be flaky about extracting a hard MAC from your code.

Remember, adding registers does not change the slack; it changes the delay though.
Have you simulated to prove the function is working as you expect?

Also remember that the tools do register push-back / duplication and may use IOB registers.

Can you post your full code and test bench to show what you're up to, or make a test case that shows what you have that you can share? The snippet is OK, but it is open to too many other effects.
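
For illustration, the kind of fully registered multiply described above could look something like the sketch below. The module name, signal names, and 18-bit width are assumptions, not anything from the original design, and the vendor's own inference templates remain the authority on what actually maps onto the hard multiplier's registers.

```verilog
// Hypothetical sketch: module name, port names, and WIDTH are assumptions.
// No reset and no enable, so the registers are free to be absorbed into
// the multiplier block if the synthesis tool chooses to do so.
module mult_pipelined #(
    parameter WIDTH = 18
) (
    input  wire                      clk,
    input  wire signed [WIDTH-1:0]   data_a,
    input  wire signed [WIDTH-1:0]   data_b,
    output reg  signed [2*WIDTH-1:0] product
);
    // Input registers in front of the multiply.
    reg signed [WIDTH-1:0] a_reg;
    reg signed [WIDTH-1:0] b_reg;

    always @(posedge clk) begin
        a_reg   <= data_a;          // cycle 1: capture operands
        b_reg   <= data_b;
        product <= a_reg * b_reg;   // cycle 2: registered multiply result
    end
endmodule
```

The result appears two clock cycles after the operands, so any downstream update strobe would need to be delayed to match.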

soyez · Sep 7, 2020

Timing closure is the process by which a logic design consisting of primitive elements such as combinatorial logic gates (and, or, not, nand, nor, etc.) and sequential elements (flip-flops, latches, memories) is modified to meet its timing requirements. Unlike in a computer program, where there is no explicit delay to perform a calculation, logic circuits have intrinsic and well-defined delays to propagate inputs to outputs. In simple cases, the user can compute the path delay between elements manually. If the design has more than a dozen or so elements, this is impractical. For example, the delay along a path from the output of a D flip-flop, through combinatorial logic gates, and into the next D flip-flop input must be less than the period between the clock edges that synchronize the two flip-flops. The path with the largest delay is called the critical path. The circuit will not function reliably when the path delay exceeds the clock period, so modifying the circuit to remove the timing failure (by shortening the critical path) is an important part of the logic design engineer's task.
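
As a rough worked form of the requirement described above (the symbol names are illustrative and not taken from any tool's report, and clock skew and uncertainty are ignored), the setup slack of a register-to-register path can be written as:

\[
\text{slack}_{\text{setup}} = N \cdot T_{\text{clk}} - \left( t_{\text{co}} + t_{\text{data}} + t_{\text{su}} \right)
\]

Here $T_{\text{clk}}$ is the clock period (20 ns at 50 MHz), $t_{\text{co}}$ is the clock-to-output delay of the launching flip-flop, $t_{\text{data}}$ is the combined logic and routing delay, $t_{\text{su}}$ is the setup time of the capturing flip-flop, and $N$ is the number of clock cycles the analyzer allows for the path. $N$ is 1 by default. Gating the capturing register with a slow enable does not by itself change $N$; the analyzer only relaxes the path when it is given a multicycle exception.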

tindel · Nov 22, 2020

@andrewmm - I know it has been a while, but I did use this advice the last couple of days. Thanks so much for your valuable input.

andrewmm said:
General point on resets: think local, not global.
ONLY reset the things you need to; resets add product terms and routing congestion.

Do you know of a white paper that talks about the effects of resets on routing? Today I had a setup slack of -0.957ns, and then I reduced the resets to only the needed ones and my slack improved to +0.091ns. That is a 1ns improvement!

andrewmm said:
Remember, adding registers does not change the slack; it changes the delay though.
Have you simulated to prove the function is working as you expect?

I found that adding registers did change the slack. I am also now on a different platform than the MAX10, so perhaps there is some variation?

Doing the multiplication is only part of the problem. There's also addition, bit shifts, and maximum/minimum compares going on too. I found that using a shift register/enable to sequence events helped significantly. Separating out the operations allowed me to complete the operation in 4 clock cycles, and close my timing, as shown below.

Description / Slack
one cycle / -27.619ns
three clock cycles / -5.150ns
four clock cycles / -0.957ns
four clock cycles + unneeded resets removed / +0.091ns
four clock cycles + unneeded resets and enables removed / +0.306ns (in this section; another section now has negative slack)
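
A minimal sketch of the shift-register sequencing idea described above might look like this. It is not the actual design: the module name, signal names, widths, and the exact split into multiply, add, shift, and clamp steps are all assumptions made for illustration.

```verilog
// Hypothetical sketch: names, widths, and the split of operations are assumptions.
module ctrl_stage_seq #(
    parameter W = 16
) (
    input  wire                clk,
    input  wire                update,    // one-cycle pulse at the controller update rate
    input  wire signed [W-1:0] sample,
    input  wire signed [W-1:0] gain,
    input  wire signed [W-1:0] offset,
    input  wire signed [W-1:0] out_max,
    input  wire signed [W-1:0] out_min,
    output reg  signed [W-1:0] out
);
    reg [2:0] step = 3'b000;              // shift register: one bit per later step
    reg signed [2*W-1:0] prod;
    reg signed [2*W-1:0] sum;
    reg signed [2*W-1:0] scaled;

    always @(posedge clk) begin
        step <= {step[1:0], update};      // walk the update pulse through the steps

        if (update)  prod   <= sample * gain;      // step 1: multiply
        if (step[0]) sum    <= prod + offset;      // step 2: add
        if (step[1]) scaled <= sum >>> (W-1);      // step 3: arithmetic right shift
        if (step[2]) begin                         // step 4: clamp to limits
            if (scaled > out_max)      out <= out_max;
            else if (scaled < out_min) out <= out_min;
            else                       out <= scaled[W-1:0];
        end
    end
endmodule
```

Each register-to-register path then contains only one operation, which is the property that shortens the critical path.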

Another thing I found that helped a lot was drawing out the circuit that I was designing on the whiteboard.

Anyway - just wanted to let you know that I did not ignore your post, and it was of great value to help me converge on a solution.

Deleted member 115935 · Nov 23, 2020

Well done for getting back.
You're doing all the right things.
Remembering that you're describing logic is always a good step, and easily forgotten.

Re resets:

https://www.xilinx.com/support/documentation/white_papers/wp272.pdf

Re slack and adding the registers: adding pipelining makes it easier for the tools to find routes and ways of making your logic meet your timing.
So yes, the slack can and does improve as you add pipeline stages, but the point to take from this is that it's because the pipeline stages make routing easier.

Also key to remember: the tools run until they meet your timing and size requirements and then STOP.
So any "slack" you get is just where the tool got to, not the best it can do.
A big difference.
