Collecting registers into memory/FIFO

Thread Starter

LowerVoltage

Joined May 12, 2017
9
I am looking for some general advice on digital circuit design. In my proposed circuit, there are a number of identical nodes (50-100). Each node has an 8-bit output register, and a signal that indicates that the output register is valid. There is a master clock that is supplied to all nodes, and each node may or may not produce an output at each clock.

The goal is to gather all the valid outputs in some kind of memory.

This design will be used in an FPGA interfaced to a computer, and the gathered results are to be transmitted as a block to computer memory.
The speed of the circuit is important, so I am trying to avoid designing some huge mux with many gate delays.

What general design technique is appropriate for this kind of circuit? Please note, I am not asking for Verilog/VHDL code, just the design principles.
 

WBahn

Joined Mar 31, 2012
32,823
Is it important to know which node each piece of data comes from?

How are the nodes being interfaced to the FPGA? I'm assuming that they are being multiplexed somehow.

Do you have the ability to add a bit of logic at each node?

What is the maximum number of nodes that might produce an output on a given clock? All of them? Or just a few of them?

What is the distribution of how frequently a given node produces output?

How is the block of data being transferred to the computer?

How fast is the master clock?
 

Thread Starter

LowerVoltage

Joined May 12, 2017
9
>Is it important to know which node each piece of data comes from?
Yes, this is needed, but it is already included in the output register.

>How are the nodes being interfaced to the FPGA? I'm assuming that they are being multiplexed somehow.
Sorry, I don't understand the question. The nodes are in the FPGA.

>Do you have the ability to add a bit of logic at each node?
Yes, the nodes are not yet designed. They will be part of an FPGA configuration.

>What is the maximum number of nodes that might produce an output on a given clock? All of them? Or just a few of them?
Maximum, all of them. Typically, 5-10.

>What is the distribution of how frequently a given node produces output?
Not known, but probably 5-10% per master clock pulse.

>How is the block of data being transferred to the computer?
That hasn't yet been specified. I'm not very familiar with FPGA interface designs, but it looks like PCIe is the fastest.

>How fast is the master clock?
This will be a compromise between the FPGA cost and desire to get the maximum speed.
 

WBahn

Joined Mar 31, 2012
32,823
With the nodes internal to the FPGA, things become quite a bit easier. I can think of several ways to approach it. One might be to use a token passing scheme.

Basically, on each clock cycle each node latches whether it has valid data or not. You then launch a token down the chain of nodes. The first node either passes the token on if it doesn't have data or it captures the token and outputs its data into a FIFO. It then releases the token and the next node that has data captures it. Once the controller gets the token back, it knows that all of the data from all of the nodes have been put into the FIFO.
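To make the token-passing idea concrete, here is a rough behavioral model in Python. It is only a sketch of the ordering and the FIFO contents for one frame, not a hardware description, and it says nothing about timing. The function name `collect_frame` and the `(valid, data)` pair representation are assumptions made for illustration.

```python
from collections import deque

def collect_frame(node_outputs, fifo):
    """Pass one token down the chain for a single frame.

    node_outputs: list of (valid, data) pairs latched at the clock edge,
                  in chain order (hypothetical representation).
    Returns the number of nodes that captured the token.
    """
    captures = 0
    for valid, data in node_outputs:   # the token visits nodes in chain order
        if valid:                      # a node with data captures the token,
            fifo.append(data)          # writes its data into the FIFO,
            captures += 1              # and then releases the token
        # a node without data simply passes the token along
    return captures                    # the token is back at the controller

fifo = deque()
frame = [(False, 0x00), (True, 0x2A), (False, 0x00), (True, 0x7F)]
n = collect_frame(frame, fifo)
print(n, [hex(d) for d in fifo])
```

Only the nodes with valid data cost a FIFO write; the others just forward the token.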

There are many games you can play to speed up the throughput and to mitigate the effect of bursty data.

You can have multiple FIFOs, each with its own token chain, with each node assigned to one of the chains. You can even make the nodes available to all of the chains so that they dump their data into whichever FIFO is available.

You can have a small FIFO associated with each node so that it can generate new output even if the old output hasn't been exported yet. You can then have these FIFOs claim a disproportionate amount of output FIFO time if they are getting full.
 

Thread Starter

LowerVoltage

Joined May 12, 2017
9
With the nodes internal to the FPGA, things become quite a bit easier. I can think of several ways to approach it....
Thank you for your comments.

The token-passing idea is interesting, but it is serial, and hence slow. Another way to implement it is to "scan" the nodes, which would require more interconnections. Massive ranks of FIFOs increase parallelism, but at the cost of more gates required.

So there are tradeoffs here which I need to consider.
 

WBahn

Joined Mar 31, 2012
32,823
Scanning the nodes will almost certainly be slower than token passing. If you scan the nodes, you have to scan all the nodes. With token passing, the token is passed at the speed of the combinatorial gate delays (it is an unclocked process), thus you only process those nodes that have data. If no nodes have data then your token should pass through all of the nodes and be back to the controller extremely quickly. But one thing that does need to be given some thought is that FPGAs are notorious for static timing hazards and, as a result, logic tends to be very glitchy. Whether the propagation speed even needs to be considered depends on the clock rates in the system, which you still haven't given any hint of. Are we talking about a few kilohertz or a few gigahertz? It makes a difference as to which potential solutions are even worth discussing.
 

Thread Starter

LowerVoltage

Joined May 12, 2017
9
Clock rate will be as high as the FPGA and system design will support. I am guessing 300-500 MHz, but the overall design should allow even higher speeds.

Since the purpose of this design is to replace software, its only advantage is speed. I imagine this is true of most FPGAs these days.
 

WBahn

Joined Mar 31, 2012
32,823
So you are wanting to collect data from up to 100 nodes and transfer it to a computer every two nanoseconds?
 

Thread Starter

LowerVoltage

Joined May 12, 2017
9
I wish! No, the clock I was referring to was the "master clock" that synchronizes the FPGA logic. The frequency at which data is transferred to computer memory is much slower, and depends on how fast the FPGA can generate the data at the 100 or so nodes, and collect it into some form that can be transferred in a single transaction, plus the overhead of the transfer itself.
 

WBahn

Joined Mar 31, 2012
32,823
Let's call one pass at the collection of data from all of the nodes a "frame". What is the minimum acceptable frame rate? Saying, "as fast as possible," isn't a useful specification. Yes, I understand you want it to be as fast as possible, but what is the minimum that is acceptable? If you were contracting with someone to do this for you and they said that the fastest possible frame rate was 100 fps, would you go ahead and pay them to design the system since that would meet an "as fast as possible" spec? Or would you say, "Sorry, if that's the case then there's no point proceeding." What is the lowest frame rate at which you would consider proceeding a viable option?
 

Thread Starter

LowerVoltage

Joined May 12, 2017
9
Let's call one pass at the collection of data from all of the nodes a "frame". ... What is the lowest frame rate at which you would consider proceeding a viable option?
40 nsec/frame (2.5e7 fps). I see why you ask; this is probably not possible without some serious parallelism.
 

WBahn

Joined Mar 31, 2012
32,823
25 MFPS. That's probably going to take some effort, but may be achievable. Let's consider some of the basic parameters that will have to be met.

How much total data, including node identification, node data, and any other information such as any timestamp needed, will each node provide when it has data to offer?

Your node address is going to take about a byte, and if your data is also about a byte, then with every node reporting you need something on the order of 5000 MB/s (100 nodes x 2 bytes x 25 million frames per second), at least for some frames.

You had previously mentioned PCIe, which I have never used. So I did a quick Google search and see that they talk about transfer rates of 250 MB/s per lane for the original version and up to about four times that for v3. At 250 MB/s you would need 20 lanes, and I only saw references to up to 16 lanes. But if you use 2.0 at 500 MB/s per lane you could do it in 16 lanes, and you could do it in 8 lanes on a 3.0 card.

So it's not out of the question. The other end of the equation is whether the receiving computer can process the data fast enough to keep up, but that's a different issue.
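The arithmetic behind those lane counts can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the per-lane rates are the rough figures quoted above (with v3 taken as about 1000 MB/s), the list of link widths is an assumption based on the "up to 16 lanes" observation, and protocol overhead is ignored.

```python
import math

NODES = 100                      # upper end of the 50-100 node estimate
BYTES_PER_NODE = 2               # about 1 byte of address + 1 byte of data
FRAMES_PER_SECOND = 25_000_000   # 40 ns per frame

# Worst case: every node reports in every frame.
peak_mb_s = NODES * BYTES_PER_NODE * FRAMES_PER_SECOND / 1e6
# Typical case from earlier in the thread: about 5-10 nodes per frame.
typical_mb_s = 10 * BYTES_PER_NODE * FRAMES_PER_SECOND / 1e6

# Approximate per-lane rates in MB/s (assumed round numbers).
LANE_MB_S = {"1.x": 250, "2.0": 500, "3.0": 1000}
# Link widths assumed available (nothing wider than x16).
LINK_WIDTHS = (1, 2, 4, 8, 16)

def lanes_needed(rate_mb_s, lane_mb_s):
    """Minimum number of lanes that can carry rate_mb_s."""
    return math.ceil(rate_mb_s / lane_mb_s)

def smallest_link(lanes):
    """Narrowest assumed link width with at least `lanes` lanes, else None."""
    return next((w for w in LINK_WIDTHS if w >= lanes), None)

print("peak MB/s:", peak_mb_s, "typical MB/s:", typical_mb_s)
for gen, lane_rate in LANE_MB_S.items():
    lanes = lanes_needed(peak_mb_s, lane_rate)
    print(gen, "needs", lanes, "lanes, link width:", smallest_link(lanes))
```

At the typical load of around 10 reporting nodes per frame, the same formula gives roughly a tenth of the worst-case rate.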

If you scan the nodes then you need to scan them at about 2.5 GHz. But if you partition them into parallel scan sets you can get that down quite a bit. Let's say that you scan the nodes in a set into a FIFO (or other buffer structure) at 250 MHz. That means that you can service 10 nodes each frame cycle so you would need 10 partitions. If you have 128 nodes you would need 16 partitions with 8 nodes in a set.

This would give you a parallel output of 32 bytes (16 partitions, assuming the 2 bytes per node mentioned earlier).
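Here is the same partition arithmetic as a small Python sketch, in case you want to try other scan clocks or node counts. The function name `partition_plan` is made up for illustration. Note that the bare minimum for 128 nodes works out to 13 partitions; the 16 partitions of 8 nodes used above is a deliberate round-up to an even split.

```python
import math

FRAME_RATE_HZ = 25_000_000    # 40 ns per frame
SCAN_CLOCK_HZ = 250_000_000   # assumed scan clock within each partition
BYTES_PER_NODE = 2            # address byte + data byte

def partition_plan(num_nodes, scan_clock_hz=SCAN_CLOCK_HZ,
                   frame_rate_hz=FRAME_RATE_HZ):
    """Return (nodes serviced per frame by one scanner, minimum partitions)."""
    nodes_per_frame = scan_clock_hz // frame_rate_hz
    partitions = math.ceil(num_nodes / nodes_per_frame)
    return nodes_per_frame, partitions

# Scan rate needed with no partitioning at all: 100 nodes per 40 ns frame.
single_scan_hz = 100 * FRAME_RATE_HZ

print(single_scan_hz)         # 2.5 GHz
print(partition_plan(100))    # 10 nodes per frame, 10 partitions
print(partition_plan(128))    # 10 nodes per frame, at least 13 partitions

# Rounding up to 16 partitions of 8 nodes gives this parallel output width:
output_bytes = 16 * BYTES_PER_NODE
print(output_bytes)
```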

What does the PCIe interface in the FPGA you are thinking of using look like? How do you feed it data? And how many lanes at what speed does it support?
 

Thread Starter

LowerVoltage

Joined May 12, 2017
9
I should have said, the fps rate I was quoting was only for the FPGA computations. The transfer time will be amortized over several frames. Transfer PC -> FPGA will include parameters for computing, say, 100 frames. The transfer out FPGA->PC will include all 100 frames.

At this stage, though, I am just considering what happens inside the FPGA.

Total data per node will be 8-16 bits, so plan for worst-case 16 bits. It may be possible to reduce this with some clever encoding or ordering.
 

WBahn

Joined Mar 31, 2012
32,823
I understand what you are saying about amortizing the transfer costs over many frames, but we still need to consider the total transfer rate of the interface. If it can handle transferring values from ALL of the nodes in EVERY frame, then the low-hanging fruit would say just do that. Even if we want to do better, we would at least know that we have a viable solution almost immediately. On the other hand, if we can't do that, then we KNOW we have to start being clever. The closer the transfer capacity is to the average data rate, the more clever we have to get. If the interface can't handle the time-averaged transfer rate, then we have a fundamental problem that requires a fundamentally different solution, such as using a different interface, a different FPGA that offers higher capacity interfaces, or reducing the data throughput by either compressing it or pruning it somehow.

Interface issues aside, I would first suggest looking at implementing parallel partitions of the nodes each feeding a FIFO along the lines of what I described a few posts back. You can scan all of the nodes in a partition set but only push the data from those that have data into the FIFO.
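As a rough functional model of that arrangement, the Python sketch below splits the nodes into fixed-size partitions, scans every node in each partition, and pushes only the valid outputs into that partition's FIFO. The name `scan_partitions` and the list-based inputs are assumptions for illustration. In hardware the partitions would run in parallel; the loop here only models what ends up in each FIFO after one frame, and it stores only the data word since the node ID is already included in the output register.

```python
from collections import deque

def scan_partitions(valid, data, nodes_per_partition):
    """Model one frame: return one FIFO (deque) per partition.

    valid: list of booleans, one per node, latched at the frame clock.
    data:  list of output words, one per node.
    """
    fifos = []
    for start in range(0, len(valid), nodes_per_partition):
        end = min(start + nodes_per_partition, len(valid))
        fifo = deque()
        for i in range(start, end):    # every node in the set is scanned
            if valid[i]:               # but only valid data is pushed
                fifo.append(data[i])
        fifos.append(fifo)
    return fifos

# 16 nodes in two partitions of 8; nodes 1, 9 and 12 have data this frame.
valid = [i in (1, 9, 12) for i in range(16)]
data = [0x100 + i for i in range(16)]
fifos = scan_partitions(valid, data, 8)
print([list(f) for f in fifos])
```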
 

Thread Starter

LowerVoltage

Joined May 12, 2017
9
Thanks for your ideas. I will try some designs with the parallel FIFOs. I like the idea of using combinatorial logic, but, as you mentioned, it is subject to race conditions, and you have to be very careful.
 

WBahn

Joined Mar 31, 2012
32,823
It's not race conditions. Even if you design a combinatorial circuit that is free of both static and dynamic timing hazards, you will still see glitches. You can even see glitches in signals for which none of the inputs changed. This is a consequence of the nearly universal use of RAM-based lookup tables to implement combinatorial logic in FPGAs.
 