Goto page 1, 2 Next
a s
Guest
Tue Mar 01, 2011 2:59 pm
Dear all,
I have come up with 2 solutions in VHDL, how to count number of bits
in input data.
The thing I don't understand is why the 2 solutions produce different
results, at least with Xilinx ISE and its XST.
There is quite a substantial difference in required number of slices/
LUTs.
1. solution with unrolled loop: 41 slices, 73 LUTs
2. solution with loop: 54 slices, 100 LUTs
The entity of both architectures is the same:
entity one_count is
Port ( din : in STD_LOGIC_vector(31 downto 0);
dout : out STD_LOGIC_vector(5 downto 0)
);
end one_count;
The architecture with an unrolled loop is the following:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
entity one_count is
Port ( din : in STD_LOGIC_vector(31 downto 0);
dout : out STD_LOGIC_vector(5 downto 0)
);
end one_count;
architecture one_count_unrolled_arch of one_count is
signal cnt : integer range 0 to 32;
begin
cnt <= to_integer(unsigned(din( 0 downto 0))) +
to_integer(unsigned(din( 1 downto 1))) +
to_integer(unsigned(din( 2 downto 2))) +
to_integer(unsigned(din( 3 downto 3))) +
to_integer(unsigned(din( 4 downto 4))) +
to_integer(unsigned(din( 5 downto 5))) +
to_integer(unsigned(din( 6 downto 6))) +
to_integer(unsigned(din( 7 downto 7))) +
to_integer(unsigned(din( 8 downto

)) +
to_integer(unsigned(din( 9 downto 9))) +
to_integer(unsigned(din(10 downto 10))) +
to_integer(unsigned(din(11 downto 11))) +
to_integer(unsigned(din(12 downto 12))) +
to_integer(unsigned(din(13 downto 13))) +
to_integer(unsigned(din(14 downto 14))) +
to_integer(unsigned(din(15 downto 15))) +
to_integer(unsigned(din(16 downto 16))) +
to_integer(unsigned(din(17 downto 17))) +
to_integer(unsigned(din(18 downto 1

)) +
to_integer(unsigned(din(19 downto 19))) +
to_integer(unsigned(din(20 downto 20))) +
to_integer(unsigned(din(21 downto 21))) +
to_integer(unsigned(din(22 downto 22))) +
to_integer(unsigned(din(23 downto 23))) +
to_integer(unsigned(din(24 downto 24))) +
to_integer(unsigned(din(25 downto 25))) +
to_integer(unsigned(din(26 downto 26))) +
to_integer(unsigned(din(27 downto 27))) +
to_integer(unsigned(din(28 downto 2

)) +
to_integer(unsigned(din(29 downto 29))) +
to_integer(unsigned(din(30 downto 30))) +
to_integer(unsigned(din(31 downto 31)));
dout <= std_logic_vector(to_unsigned(cnt,6));
end one_count_unrolled_arch ;
And the architecture with a loop is the following:
architecture one_count_loop_arch of one_count_loop is
signal cnt : integer range 0 to 32;
begin
process(din) is
variable tmp : integer range 0 to 32;
begin
tmp := to_integer(unsigned(din(0 downto 0)));
for i in 1 to 31 loop
tmp := tmp + to_integer(unsigned(din(i downto i)));
end loop;
cnt <= tmp;
end process;
dout <= std_logic_vector(to_unsigned(cnt,6));
end one_count_loop_arch ;
I would be really grateful if somebody could point out what I did
wrong with the 2. solution with loop.
It certainly must be my mistake, but I can not find it...
Additionally, I know that this "brute-force" one counting might not be
the optimal approach,
but this is just my first attempt to get the job done. If somebody has
a better solution, I would
appreciate it if you could share it.
Regards,
Peter
Tricky
Guest
Tue Mar 01, 2011 4:33 pm
On Mar 1, 12:59 pm, a s <nospa...@gmail.com> wrote:
Quote:
Dear all,
I have come up with 2 solutions in VHDL, how to count number of bits
in input data.
The thing I don't understand is why the 2 solutions produce different
results, at least with Xilinx ISE and its XST.
There is quite a substantial difference in required number of slices/
LUTs.
1. solution with unrolled loop: 41 slices, 73 LUTs
2. solution with loop: 54 slices, 100 LUTs
The entity of both architectures is the same:
entity one_count is
Port ( din : in STD_LOGIC_vector(31 downto 0);
dout : out STD_LOGIC_vector(5 downto 0)
);
end one_count;
The architecture with an unrolled loop is the following:
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
entity one_count is
Port ( din : in STD_LOGIC_vector(31 downto 0);
dout : out STD_LOGIC_vector(5 downto 0)
);
end one_count;
architecture one_count_unrolled_arch of one_count is
signal cnt : integer range 0 to 32;
begin
cnt <= to_integer(unsigned(din( 0 downto 0))) +
to_integer(unsigned(din( 1 downto 1))) +
to_integer(unsigned(din( 2 downto 2))) +
to_integer(unsigned(din( 3 downto 3))) +
to_integer(unsigned(din( 4 downto 4))) +
to_integer(unsigned(din( 5 downto 5))) +
to_integer(unsigned(din( 6 downto 6))) +
to_integer(unsigned(din( 7 downto 7))) +
to_integer(unsigned(din( 8 downto

)) +
to_integer(unsigned(din( 9 downto 9))) +
to_integer(unsigned(din(10 downto 10))) +
to_integer(unsigned(din(11 downto 11))) +
to_integer(unsigned(din(12 downto 12))) +
to_integer(unsigned(din(13 downto 13))) +
to_integer(unsigned(din(14 downto 14))) +
to_integer(unsigned(din(15 downto 15))) +
to_integer(unsigned(din(16 downto 16))) +
to_integer(unsigned(din(17 downto 17))) +
to_integer(unsigned(din(18 downto 1

)) +
to_integer(unsigned(din(19 downto 19))) +
to_integer(unsigned(din(20 downto 20))) +
to_integer(unsigned(din(21 downto 21))) +
to_integer(unsigned(din(22 downto 22))) +
to_integer(unsigned(din(23 downto 23))) +
to_integer(unsigned(din(24 downto 24))) +
to_integer(unsigned(din(25 downto 25))) +
to_integer(unsigned(din(26 downto 26))) +
to_integer(unsigned(din(27 downto 27))) +
to_integer(unsigned(din(28 downto 2

)) +
to_integer(unsigned(din(29 downto 29))) +
to_integer(unsigned(din(30 downto 30))) +
to_integer(unsigned(din(31 downto 31)));
dout <= std_logic_vector(to_unsigned(cnt,6));
end one_count_unrolled_arch ;
And the architecture with a loop is the following:
architecture one_count_loop_arch of one_count_loop is
signal cnt : integer range 0 to 32;
begin
process(din) is
variable tmp : integer range 0 to 32;
begin
tmp := to_integer(unsigned(din(0 downto 0)));
for i in 1 to 31 loop
tmp := tmp + to_integer(unsigned(din(i downto i)));
end loop;
cnt <= tmp;
end process;
dout <= std_logic_vector(to_unsigned(cnt,6));
end one_count_loop_arch ;
I would be really grateful if somebody could point out what I did
wrong with the 2. solution with loop.
It certainly must be my mistake, but I can not find it...
Additionally, I know that this "brute-force" one counting might not be
the optimal approach,
but this is just my first attempt to get the job done. If somebody has
a better solution, I would
appreciate it if you could share it.
Regards,
Peter
see what you get with this function instead (a function I have used
before):
function count_ones(slv : std_logic_vector) return natural is
varaible n_ones : natural := 0;
begin
for i in slv'range loop
if slv(i) = '1' then
n_ones := n_ones + 1;
end if;
end loop;
return n_ones;
end function count_ones;
.....
inside architecture, no process needed:
dout <= std_logic_vector( to_unsigned(count_ones(din), dout'length) );
The beauty with this is the function will work with a std_logic_vector
of any length.
Andy
Guest
Tue Mar 01, 2011 4:49 pm
A good synthesis tool should be able to optimize either version to the
same implementation. But there are semantic differences that Xilinx
may be getting hung up on.
In the unrolled version, you have a long expression, and there is
freedom within vhdl to evaluate it in different orders or groups of
operations. In the loop version, since you are continually updating
tmp, you are describing an explicitly sequential order in which the
calculation takes place. Like I said, a good synthesis tool should be
able to handle either one equally well, but you get what you pay for
in synthesis tools.
If you are looking for a general solution to the problem for any size
vector, try a recursive function implentation of a binary tree and see
what happens.
Just for kicks, you might also put the whole calculation in the loop
(0 to 31), with temp set to 0 before the loop. Shouldn't make any
difference, but then again, we're already seeing differences where
there should be none.
On the other hand, if what you have works (fits and meets timing), use
the most maintainable, understandable version. It will save you time
(=money) in the long run. It is often interesting to find out what is
"the optimal" way to code something such that it results in the
smallest/fastest circuit. But in the big picture, it most often does
not matter, especially when you write some cryptic code to squeeze the
last pS/LUT out of it, and you had plenty of slack and space to spare
anyway. Nevertheless, knowing how to squeeze every pS/LUT comes in
handy every once in a while.
Andy
a s
Guest
Tue Mar 01, 2011 6:41 pm
Dear Andy, Tricky,
thank you both for your valuable input. Please find my comments below.
On Mar 1, 3:49 pm, Andy <jonesa...@comcast.net> wrote:
Quote:
A good synthesis tool should be able to optimize either version to the
same implementation. But there are semantic differences that Xilinx
may be getting hung up on.
Aha, that's a good thing, it means that I did not make some obvious
mistake. ;-)
Quote:
If you are looking for a general solution to the problem for any size
vector, try a recursive function implentation of a binary tree and see
what happens.
OK, I didn't quite get that but will consider it again.
Quote:
Just for kicks, you might also put the whole calculation in the loop
(0 to 31), with temp set to 0 before the loop. Shouldn't make any
difference, but then again, we're already seeing differences where
there should be none.
Sorry I didn't tell you before. I have already tried that and in this
case XST produces the same result.
Quote:
On the other hand, if what you have works (fits and meets timing), use
the most maintainable, understandable version. It will save you time
(=money) in the long run. It is often interesting to find out what is
"the optimal" way to code something such that it results in the
smallest/fastest circuit. But in the big picture, it most often does
not matter, especially when you write some cryptic code to squeeze the
last pS/LUT out of it, and you had plenty of slack and space to spare
anyway. Nevertheless, knowing how to squeeze every pS/LUT comes in
handy every once in a while.
Andy, I completely agree with what you have written above.
One should strive for maintainable and understandable version.
Although, on my particular case, I have to find a good solution
in terms of LUT resources, because I need 8 instances of
one counters with 64-bit input data. And the device is getting full...
Tricky, your approach does indeed look very neat. I like it.
Although it is far less efficient than mine. For the same input/output
ports, your version with function requires 171 Slices in 313 LUTs.
(The minimum that I get with unrolled version is 41 Slices and 73
LUTs).
johnp
Guest
Tue Mar 01, 2011 11:41 pm
Here's a slightly different approach to your problem.... It tries to
take advantage
of the fact that the LUTs are pretty good as lookup tables. It's in
Verilog, but
you should easily be able to convert it to VHDL.
`define V1
module tst (
input [31:0] data,
output [5:0] cnt
);
`ifdef V1
/*
Device utilization summary:
---------------------------
Selected Device : 3s50pq208-5
Number of Slices: 35 out of 768
4%
Number of 4 input LUTs: 62 out of 1536
4%
Number of IOs: 38
Number of bonded IOBs: 0 out of 124
0%
*/
function [1:0] cnt3;
input [2:0] d_in;
begin
case (d_in)
3'h0: cnt3 = 2'h0;
3'h1: cnt3 = 2'h1;
3'h2: cnt3 = 2'h1;
3'h3: cnt3 = 2'h2;
3'h4: cnt3 = 2'h1;
3'h5: cnt3 = 2'h2;
3'h6: cnt3 = 2'h2;
3'h7: cnt3 = 2'h3;
endcase
end
endfunction
assign cnt = cnt3(data[2:0])
+ cnt3(data[5:3])
+ cnt3(data[8:6])
+ cnt3(data[11:9])
+ cnt3(data[14:12])
+ cnt3(data[17:15])
+ cnt3(data[20:18])
+ cnt3(data[23:21])
+ cnt3(data[26:24])
+ cnt3(data[29:27])
+ cnt3({1'b0, data[31:30]})
;
`endif
`ifdef V2
/*
Selected Device : 3s50pq208-5
Number of Slices: 44 out of 768
5%
Number of 4 input LUTs: 79 out of 1536
5%
Number of IOs: 38
Number of bonded IOBs: 0 out of 124
0%
*/
function [2:0] cnt4;
input [3:0] d_in;
begin
case (d_in)
4'h0: cnt4 = 3'h0;
4'h1: cnt4 = 3'h1;
4'h2: cnt4 = 3'h1;
4'h3: cnt4 = 3'h2;
4'h4: cnt4 = 3'h1;
4'h5: cnt4 = 3'h2;
4'h6: cnt4 = 3'h2;
4'h7: cnt4 = 3'h3;
4'h8: cnt4 = 3'h1;
4'h9: cnt4 = 3'h2;
4'ha: cnt4 = 3'h2;
4'hb: cnt4 = 3'h3;
4'hc: cnt4 = 3'h2;
4'hd: cnt4 = 3'h3;
4'he: cnt4 = 3'h3;
4'hf: cnt4 = 3'h4;
endcase
end
endfunction
assign cnt = cnt4(data[3:0])
+ cnt4(data[7:4])
+ cnt4(data[11:8])
+ cnt4(data[15:12])
+ cnt4(data[19:16])
+ cnt4(data[23:20])
+ cnt4(data[27:24])
+ cnt4(data[31:28])
;
`endif
endmodule
John Providenza
Andy
Guest
Wed Mar 02, 2011 12:36 am
Quote:
If you are looking for a general solution to the problem for any size
vector, try a recursive function implentation of a binary tree and see
what happens.
OK, I didn't quite get that but will consider it again.
The synthesis tool may not be able to figure out that it need not
carry all the bits of the sum through every caculation in the loop. A
binary tree implementation can manage the sum width at every stage.
For a recursive binary tree implementation, define a function that
divides the vector into two ~equal parts (i.e. n/2 and n/2 + n mod 2),
calls itself on each one, and returns the sum of the two results. Stop
recursion when the size of the incoming vector is 1 (just return 1 if
the bit is set, and 0 if not). This is synthesizeable as long as the
recursion is statically bound (which it is, by the size of the
original vector).
It should work out pretty close to what johnp's approach does, except
work for any size input vector.
Andy
glen herrmannsfeldt
Guest
Wed Mar 02, 2011 2:20 am
In comp.arch.fpga Andy <jonesandy_at_comcast.net> wrote:
(snip)
Quote:
The synthesis tool may not be able to figure out that it need not
carry all the bits of the sum through every caculation in the loop. A
binary tree implementation can manage the sum width at every stage.
Is that the same as a Carry Save Adder tree?
Maybe not. The CSA has three inputs and two outputs, so it
isn't exactly binary.
-- glen
Brian Davis
Guest
Wed Mar 02, 2011 2:59 am
Peter wrote:
Quote:
If somebody has a better solution, I would appreciate it if you could share it.
When I looked at this some years back, XST worked well enough
at creating an adder cascade using the old "mask and add" software
trick that I never bothered writing something more optimal:
gen_bcnt32: if ( ALU_WIDTH = 32 ) generate
begin
process(din)
-- multiple variables not needed, but make intermediate steps
visible in simulation
variable temp : unsigned (ALU_MSB downto 0);
variable temp1 : unsigned (ALU_MSB downto 0);
variable temp2 : unsigned (ALU_MSB downto 0);
variable temp3 : unsigned (ALU_MSB downto 0);
variable temp4 : unsigned (ALU_MSB downto 0);
variable temp5 : unsigned (ALU_MSB downto 0);
begin
temp := unsigned(din);
temp1 := (temp AND X"5555_5555") + ( ( temp srl 1) AND
X"5555_5555"); -- 0..2 out x16
temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2) AND
X"3333_3333"); -- 0..4 out x8
temp3 := (temp2 AND X"0707_0707") + ( ( temp2 srl 4) AND
X"0707_0707"); -- 0..8 out x4
temp4 := (temp3 AND X"001f_001f") + ( ( temp3 srl

AND
X"001f_001f"); -- 0..16 out x2
temp5 := (temp4 AND X"0000_003f") + ( ( temp4 srl 16) AND
X"0000_003f"); -- 0..32 out
cnt <= std_logic_vector(temp5(5 downto 0));
end process;
end generate gen_bcnt32;
Brian
glen herrmannsfeldt
Guest
Wed Mar 02, 2011 7:55 am
In comp.arch.fpga Brian Davis <brimdavis_at_aol.com> wrote:
(snip)
Quote:
When I looked at this some years back, XST worked well enough
at creating an adder cascade using the old "mask and add" software
trick that I never bothered writing something more optimal:
(snip)
Quote:
temp1 := (temp AND X"5555_5555") + ( ( temp srl 1) AND
X"5555_5555"); -- 0..2 out x16
temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2) AND
X"3333_3333"); -- 0..4 out x8
OK, that would be a binary tree. I believe the CSA adder tree
is slightly more efficient, though it might depend on the number
of inputs.
The binary tree cascade works especially well on word oriented
machines, and can be easily written in many high-level languages
(with the assumption of knowing the word size).
The first stage of a CSA tree starts with N inputs, and generates
N/3 two bit outputs that are the sums and carries from N/3 full adders.
(If N isn't a multiple of three, then one bit may bypass the stage,
and two bits go into a half adder.)
The next stage takes the N/3 ones and N/3 twos, and generates
N/9 ones, 2N/9 twos, and N/9 fours. You can continue until
there is only one bit of each, or sometimes there are other
optimizations near the end. Last time I did one, I only needed
to know zero, one, two, three, or more than three, which simplifies
it slightly.
It also pipelines well, but then so does the binary tree.
-- glen
a s
Guest
Wed Mar 02, 2011 8:50 am
Andy nailed it again when he said you get what you pay for
regarding synthesis tool. Initially I was synthesising with
ISE12.4 and the results were, hm, inconsistent. After
switching the synthesis tool to SynplifyPro v2010.03 the
results were as expected and, of course, even better than that.
Please see the table below. Tricky's version is denoted "funct",
where there are major differences between the 2 tools:
---------- 32-bit input data --------
unrolled: XST 74 LUTs, 41 slices
unrolled: SynplifyPro 57 LUTs, 34 slices
loop: XST 100 LUTs, 54 slices
loop: SynplifyPro 57 LUTs, 34 slices
funct: XST 317 LUTs, 161 slices
funct: SynplifyPro 58 LUTs, 34 slices
---------- 64-bit input data --------
unrolled: XST 174 LUTs, 96 slices
unrolled: SynplifyPro 129 LUTs, 80 slices
loop: XST 227 LUTs, 121 slices
loop: SynplifyPro 129 LUTs, 80 slices
funct: XST 813 LUTs, 436 slices
funct: SynplifyPro 130 LUTs, 82 slices
Peter
a s
Guest
Wed Mar 02, 2011 9:33 am
Johnp, Brian, thank you too for your input! Much appreciated.
I have ran your code through 2 synthesisers and have updated the table
of required resources.
-------------- 32-bit input data --------------
unrolled: XST 74 LUTs, 41 slices
unrolled: SynplifyPro 57 LUTs, 34 slices
loop: XST 100 LUTs, 54 slices
loop: SynplifyPro 57 LUTs, 34 slices
funct: XST 317 LUTs, 161 slices
funct: SynplifyPro 58 LUTs, 34 slices
JohnpV1: XST 62 LUTs, 35 slices
JohnpV1: SynplifyPro 57 LUTs, 33 slices
JohnpV2: XST 78 LUTs, 43 slices
JohnpV2: SynplifyPro 54 LUTs, 32 slices
Brian: XST 57 LUTs, 39 slices
Brian: SynplifyPro 57 LUTs, 41 slices
The latest 3 pairs of results are interesting because even
XST produces good results, especially in Brian's version
where XST is surprisingly even slightly better. But anyway,
it's not that XST is so clever, it is a clever coding of the design.
Regards,
Peter
Andy
Guest
Wed Mar 02, 2011 6:30 pm
Thanks for publishing your results!
It is interesting how little variance there is in the SynplifyPro
results from widely different RTL implementations. This allows you,
within limits, to code something so that it makes sense functionally
to you and others, and the synthesis tool will still get pretty darn
close to "the optimal" implementation so that it will work even in
demanding environments (speed and/or space constrained). This also
allows reuse of the same code across more environments.
Andy
glen herrmannsfeldt
Guest
Wed Mar 02, 2011 9:31 pm
In comp.arch.fpga Andy <jonesandy_at_comcast.net> wrote:
Quote:
It is interesting how little variance there is in the SynplifyPro
results from widely different RTL implementations. This allows you,
within limits, to code something so that it makes sense functionally
to you and others, and the synthesis tool will still get pretty darn
close to "the optimal" implementation so that it will work even in
demanding environments (speed and/or space constrained). This also
allows reuse of the same code across more environments.
As far as I know, yes, the tools are pretty good at optimizing
combinatorial logic. This problem can be pipelined, though, and
as far as I know the tools don't have a way to optimize that.
You would at least have to specify the number of pipeline stages.
It would be nice to have a tool that would optimize the partition
between stages. Even more, given the clock frequency, it would be
nice to have a tool that would find the optimal number of pipeline
stages.
-- glen
a s
Guest
Wed Mar 02, 2011 10:38 pm
On Mar 2, 5:52 pm, Gabor <ga...@alacron.com> wrote:
Quote:
I didn't catch which device you are targeting, but I
decided to try this myself with XST and Spartan 3A,
using Verilog to see if there are any significant
differences in synthesis performance.
I am targeting Virtex4FX.
Quote:
Here's the code:
module count_bits
#(
parameter IN_WIDTH = 32,
parameter OUT_WIDTH = 6
)
(
input wire [IN_WIDTH-1:0] data_in,
output reg [OUT_WIDTH-1:0] data_out
);
always @*
begin : proc
integer i;
integer sum;
sum = 0;
for (i = 0;i < IN_WIDTH;i = i + 1) sum = sum + data_in[i];
data_out = sum;
end
endmodule
And the results for the 32-bit case (XST)
Number of Slices: 41 out of 1792 2%
Number of 4 input LUTs: 73 out of 3584 2%
which is very close to your original unrolled result.
I get the same results with XST targeting V4.
But that's really interesting how XST produces better results
with Verilog than with VHDL for basically exactly the same input.
Running your module through Synopsys results again
in seemingly "optimum" 57LUTs and 34 slices.
I find it pretty amusing how many options did we come up already
with such a "basic" problem as is counting ones in a word. ;-)
Regards
glen herrmannsfeldt
Guest
Wed Mar 02, 2011 10:53 pm
In comp.arch.fpga a s <nospamas_at_gmail.com> wrote:
(snip)
Quote:
Running your module through Synopsys results again
in seemingly "optimum" 57LUTs and 34 slices.
One should probably also compare propagation delay in addition
to the number of LUTs or slices used. I don't believe it is
large, but there is some tradeoff between the two. Worst
delay would be (N-1) consecutive adders, increasing in width
down the line.
Quote:
I find it pretty amusing how many options did we come up already
with such a "basic" problem as is counting ones in a word.
-- glen
Goto page 1, 2 Next