Goto page Previous 1, 2
Andy
Guest
Thu Mar 03, 2011 1:43 am
On Mar 2, 1:31 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
Quote:
As far as I know, yes, the tools are pretty good at optimizing
combinatorial logic. This problem can be pipelined, though, and
as far as I know the tools don't have a way to optimize that.
You would at least have to specify the number of pipeline stages.
It would be nice to have a tool that would optimize the partition
between stages. Even more, given the clock frequency, it would be
nice to have a tool that would find the optimal number of pipeline
stages.
-- glen
Depending on which tools to which you are referring, yes they can be
very good. But there is a large difference between some tools even in
their ability to optimize purely combinatorial circuits, as shown in
Peter's results.
The Synplicity and Mentor tools have the capability to optimize logic
across register boundaries (re-balancing). Some do it over more than
one boundary (moving logic more than one clock cycle forward/back).
You still have to model the pipeline stages, but that can be a simple
register array tacked onto the beginning or end of the logic, and you
just let the tool redistribute the logic amongst them.
Latency often affects other portions of the design, so unless the
entire design is "floated" WRT to latency (clock cycles or pipeline
stages), it makes little sense for a local optimizing routine to pick
the number of stages.
The behavioral C synthesis tools (Catapult-C and others) take a
completely untimed C algorithm in the form of a function, and allow
you to trade resources, clock speed, latency, etc. with the help of
different views including a Gant chart of various resources and their
utilization. They also can synthesize different types of hardware
interfaces, including registers, streaming data, memories (single and
dual port), etc. around the function.
Andy
glen herrmannsfeldt
Guest
Thu Mar 03, 2011 2:42 am
In comp.arch.fpga Andy <jonesandy_at_comcast.net> wrote:
(after I wrote)
Quote:
As far as I know, yes, the tools are pretty good at optimizing
combinatorial logic. This problem can be pipelined, though, and
as far as I know the tools don't have a way to optimize that.
You would at least have to specify the number of pipeline stages.
It would be nice to have a tool that would optimize the partition
between stages. Even more, given the clock frequency, it would be
nice to have a tool that would find the optimal number of pipeline
stages.
Depending on which tools to which you are referring, yes they can be
very good. But there is a large difference between some tools even in
their ability to optimize purely combinatorial circuits, as shown in
Peter's results.
The Synplicity and Mentor tools have the capability to optimize logic
across register boundaries (re-balancing). Some do it over more than
one boundary (moving logic more than one clock cycle forward/back).
You still have to model the pipeline stages, but that can be a simple
register array tacked onto the beginning or end of the logic, and you
just let the tool redistribute the logic amongst them.
That does sound pretty nice of them. Lately I mostly use Xilinx
ISE which, as far as I know, doesn't do that.
Quote:
Latency often affects other portions of the design, so unless the
entire design is "floated" WRT to latency (clock cycles or pipeline
stages), it makes little sense for a local optimizing routine to pick
the number of stages.
I have done systolic array designs that do, as you say, float.
The constraint is on the clock period. Though in addition there
is the question of the number of unit cells that can fit into
one FPGA. Throughput is clock frequency times (cells/FPGA).
Quote:
The behavioral C synthesis tools (Catapult-C and others) take a
completely untimed C algorithm in the form of a function, and allow
you to trade resources, clock speed, latency, etc. with the help of
different views including a Gant chart of various resources and their
utilization. They also can synthesize different types of hardware
interfaces, including registers, streaming data, memories (single and
dual port), etc. around the function.
Are there tools that will convert a sequential C code
dynamic programming algorithm to a systolic array?
That would be a pretty amazing optimization.
-- glen
JustJohn
Guest
Fri Mar 04, 2011 10:08 pm
On Mar 2, 12:38 pm, a s <nospa...@gmail.com> wrote:
Quote:
On Mar 2, 5:52 pm, Gabor <ga...@alacron.com> wrote:
I didn't catch which device you are targeting, but I
decided to try this myself with XST and Spartan 3A,
using Verilog to see if there are any significant
differences in synthesis performance.
I am targeting Virtex4FX.
Here's the code:
module count_bits
#(
parameter IN_WIDTH = 32,
parameter OUT_WIDTH = 6
)
(
input wire [IN_WIDTH-1:0] data_in,
output reg [OUT_WIDTH-1:0] data_out
);
always @*
begin : proc
integer i;
integer sum;
sum = 0;
for (i = 0;i < IN_WIDTH;i = i + 1) sum = sum + data_in[i];
data_out = sum;
end
endmodule
And the results for the 32-bit case (XST)
Number of Slices: 41 out of 1792 2%
Number of 4 input LUTs: 73 out of 3584 2%
which is very close to your original unrolled result.
I get the same results with XST targeting V4.
But that's really interesting how XST produces better results
with Verilog than with VHDL for basically exactly the same input.
Running your module through Synopsys results again
in seemingly "optimum" 57LUTs and 34 slices.
I find it pretty amusing how many options did we come up already
with such a "basic" problem as is counting ones in a word. ;-)
Regards- Hide quoted text -
- Show quoted text -
Eight years ago (Sept/Oct 2003), we went through this exercise in the
thread "Counting Ones" (I was posting as JustJohn back then, not
John_H). See that thread for some ASCII art of the trees. I ended up
with the following VHDL function that produces "optimum" 55 4-input
LUTs for 32-bit vector input. I haven't seen anything better yet. I
liked Andy's recursion suggestion, it'll take some thought to figure
out how to auto-distribute the carry-in bits to the adders.
Yesterday, Gabor posted 35 6-input LUTs.
Gabor, what code did you use?
I think a nice challenge to the C.A.F. group mind is to beat that.
John L. Smith
-- This function counts bits = '1' in a 32-bit word, using a tree
-- structure with Full Adders at leafs for "minimum" logic
utilization.
function vec32_sum2( in_vec : in UNSIGNED ) return UNSIGNED is
type FA_Arr_Type is array ( 0 to 9 ) of UNSIGNED( 1 downto
0 );
variable FA_Array : FA_Arr_Type;
variable result : UNSIGNED( 5 downto 0 );
variable Leaf_Bits : UNSIGNED( 2 downto 0 );
variable Sum3_1 : UNSIGNED( 2 downto 0 );
variable Sum3_2 : UNSIGNED( 2 downto 0 );
variable Sum3_3 : UNSIGNED( 2 downto 0 );
variable Sum3_4 : UNSIGNED( 2 downto 0 );
variable Sum3_5 : UNSIGNED( 2 downto 0 );
variable Sum4_1 : UNSIGNED( 3 downto 0 );
variable Sum4_2 : UNSIGNED( 3 downto 0 );
variable Sum5_1 : UNSIGNED( 4 downto 0 );
begin
for i in 0 to 9 loop
Leaf_Bits := in_vec( 3 * i + 2 downto 3 * i );
case Leaf_Bits is
when "000" => FA_Array( i ) := "00";
when "001" => FA_Array( i ) := "01";
when "010" => FA_Array( i ) := "01";
when "011" => FA_Array( i ) := "10";
when "100" => FA_Array( i ) := "01";
when "101" => FA_Array( i ) := "10";
when "110" => FA_Array( i ) := "10";
when others => FA_Array( i ) := "11";
end case;
end loop;
Sum3_1 := ( "0" & FA_Array( 0 ) ) + ( "0" & FA_Array( 1 ) );
Sum3_2 := ( "0" & FA_Array( 2 ) ) + ( "0" & FA_Array( 3 ) );
Sum3_3 := ( "0" & FA_Array( 4 ) ) + ( "0" & FA_Array( 5 ) );
Sum3_4 := ( "0" & FA_Array( 6 ) ) + ( "0" & FA_Array( 7 ) )
+ ( "00" & in_vec( 30 ) );
Sum3_5 := ( "0" & FA_Array( 8 ) ) + ( "0" & FA_Array( 9 ) )
+ ( "00" & in_vec( 31 ) );
Sum4_1 := ( "0" & Sum3_1 ) + ( "0" & Sum3_2 );
Sum4_2 := ( "0" & Sum3_3 ) + ( "0" & Sum3_4 );
Sum5_1 := ( "0" & Sum4_1 ) + ( "0" & Sum4_2 );
result := ( "0" & Sum5_1 )
+ ( "000" & Sum3_5 );
return result;
end vec32_sum2;
Gabor Sz
Guest
Sat Mar 05, 2011 5:28 am
On Friday, March 4, 2011 3:08:59 PM UTC-5, JustJohn wrote:
[snip]
Quote:
Yesterday, Gabor posted 35 6-input LUTs.
Gabor, what code did you use?
I posted the Verilog, it's just a simple loop.
The difference was targeting V6 in XST 12.1
Looking through the synthesis report it became
apparent that there are some new templates for
"adder tree" that show up in the V6 implementation
but not for V5. This seems to be the reason
for the dramatic reduction going from V5 to V6
while both have 6-input LUT's. Only XST for V6
has the adder tree templates, so it got 65 LUT's
down to 35. I haven't really thought about an
optimal implementation for 6-input LUT's but given
the fact that XST inferred a tree structure, it's
probably pretty close to optimal already.
-- Gabor
JustJohn
Guest
Mon Mar 07, 2011 9:30 am
On Mar 1, 8:41 am, Peter wrote:
Quote:
Andy, I completely agree with what you have written above.
One should strive for maintainable and understandable version.
Although, on my particular case, I have to find a good solution
in terms of LUT resources, because I need 8 instances of
one counters with 64-bit input data. And the device is getting full...
Peter,
It looks like you've solved your problem simply by moving to a
better synthesis tool, so this may not be of interest anymore.
However, in addition to space, finding a more compact implementation
often leads to a speed increase as well, AND power savings.
Additionally, I like this counting bits problem, it turns up often
enough that it deserves some attention. As to maintainability, nothing
promotes that more than good comments. Once the function is written it
can be stuffed away in a package, never dealt with again (If anyone
copies the code, please include credit).
See my other post for the most compact/fastest way to implement the 32-
bit sum using 4-LUTs and taking advantage of carry logic fabric.
Gabor's post of the 35 LUT number when using 6-LUTs got me looking at
that case. Here are the results (Spartan 3 for the 4-LUTs, Spartan 6
for the 6-LUTs, XST for both, I'd be curious if any other synthesis
tool does better). Synthesizers continually improve, but nothing beats
a good look at the problem, as the 6-LUT case illustrates with a
better than 2:1 savings:
vec32_sum2: 4-input LUTs: 53 Slices: 31
vec32_sum3: 6-input LUTs: 15 Slices: 4
Finally, this is a neat problem because it's nice to make the little
things count.
Best regards all,
John L. Smith
Code for 6-LUT based 32 bit sum:
function vec32_sum3( in_vec : in UNSIGNED ) return UNSIGNED is
type Leaf_type is array ( 0 to 63 ) of UNSIGNED( 2 downto
0 );
-- Each ROM entry is sum of address bits:
constant Leaf_ROM : Leaf_type := (
"000", "001", "001", "010", "001", "010", "010", "011",
"001", "010", "010", "011", "010", "011", "011", "100",
"001", "010", "010", "011", "010", "011", "011", "100",
"010", "011", "011", "100", "011", "100", "100", "101",
"001", "010", "010", "011", "010", "011", "011", "100",
"010", "011", "011", "100", "011", "100", "100", "101",
"010", "011", "011", "100", "011", "100", "100", "101",
"011", "100", "100", "101", "100", "101", "101", "110" );
type S3_Type is array ( 0 to 4 ) of UNSIGNED( 2 downto 0 );
variable S3 : S3_Type;
variable result : UNSIGNED( 5 downto 0 );
variable Leaf_Bits : natural range 0 to 63;
variable S4_1 : UNSIGNED( 3 downto 0 );
variable S4_2 : UNSIGNED( 3 downto 0 );
variable S5_1 : UNSIGNED( 4 downto 0 );
begin
-- Form five 3-bit sums using three 6-LUTs each:
for i in 0 to 4 loop
Leaf_Bits := TO_INTEGER( UNSIGNED( in_vec( 6 * i + 5 downto 6 *
i ) ) );
S3( i ) := Leaf_ROM( Leaf_Bits );
end loop;
-- Add two 3-bit sums + leftover leaf bits as a carry in to get 4
bit sums:
S4_1 := ( "0" & S3( 0 ) ) + ( "0" & S3( 1 ) )
+ ( "000" & in_vec( 30 ) );
S4_2 := ( "0" & S3( 2 ) ) + ( "0" & S3( 3 ) )
+ ( "000" & in_vec( 31 ) );
-- Add 4 bit sums to get 5 bit sum:
S5_1 := ( "0" & S4_1 ) + ( "0" & S4_2 );
-- Add leftover 3 bit sum to get 5 bit result:
result := ( "0" & S5_1 )
+ ( "000" & S3( 4 ) );
return result;
end vec32_sum3;
JustJohn
Guest
Tue Mar 08, 2011 7:35 pm
On Mar 6, 11:30 pm, JustJohn <justjohna...@gmail.com> wrote:
Quote:
On Mar 1, 8:41 am, Peter wrote:
Andy, I completely agree with what you have written above.
One should strive for maintainable and understandable version.
Although, on my particular case, I have to find a good solution
in terms of LUT resources, because I need 8 instances of
one counters with 64-bit input data. And the device is getting full...
Peter,
It looks like you've solved your problem simply by moving to a
better synthesis tool, so this may not be of interest anymore.
However, in addition to space, finding a more compact implementation
often leads to a speed increase as well, AND power savings.
Additionally, I like this counting bits problem, it turns up often
enough that it deserves some attention. As to maintainability, nothing
promotes that more than good comments. Once the function is written it
can be stuffed away in a package, never dealt with again (If anyone
copies the code, please include credit).
See my other post for the most compact/fastest way to implement the 32-
bit sum using 4-LUTs and taking advantage of carry logic fabric.
Gabor's post of the 35 LUT number when using 6-LUTs got me looking at
that case. Here are the results (Spartan 3 for the 4-LUTs, Spartan 6
for the 6-LUTs, XST for both, I'd be curious if any other synthesis
tool does better). Synthesizers continually improve, but nothing beats
a good look at the problem, as the 6-LUT case illustrates with a
better than 2:1 savings:
vec32_sum2: 4-input LUTs: 53 Slices: 31
vec32_sum3: 6-input LUTs: 15 Slices: 4
Finally, this is a neat problem because it's nice to make the little
things count.
Best regards all,
John L. Smith
Chagrinned OOPS!, synthesizer was throwing the leaf ROMs into BRAMs
for the 6-LUT case. Actual numbers for the 6-LUT case without BRAMs
are (and this is not a synthesis benchmark, it is an illustraion of
efficient circuit structure expressed via coding style to reduce LUT
usage, applicable across all synthesis tools, thanks for the warning
Philippe, and BTW I AGREE, EULAs forbidding benchmarks are evil):
vec32_sum3: 6-input LUTs: 30 Slices: 4
Still tops 35, although some may consider the code too complex for
marginal improvement.
Regards,
John
Goto page Previous 1, 2