Max finding

J

Josh Model

Guest
Hi all,

Just wanted to delve into the groups accumulated knowledge. I've got an app
where I must find the index of the maximum of 40 or so (unsorted, of course)
20-bit numbers. --the specifics aren't that important, just helps me think.
It's pretty time critical, with throughput more important than latency,
unless latency becomes absurd, and not area critical.

The "brute force" solution of cycling through the 40 numbers and doing a
compare-store on each is pretty slow for FPGA implementation, and cascading
comparators is pretty ugly.

I did some googling and literature searching and came up with plenty of
CS-style results, but very few hardware results.

This was one...
http://www.fpgacpu.org/usenet/max.html
which was seemed the FPGA equivalent of a very clever IEE paper entitled
"Efficient Parallel Pipelineable VLSI architechture for finding the maximum
binary number" by F. Daneshgaran and K. Yao. which solved the problem using
tristates.

The bitwise parallel-max approach is pretty attactive and reduces the clock
cycles to the bitwidth, e.g. 20 (though I guess you could look at m-bits at
a time, but it becomes more complex) but I think it will get somewhat hung
up on the "check for no candidates" logic.

I was wondering if there were any other "clever" approaches around to this
sort of thing from an FPGA synthesis viewpoint? I'll be the first to admit
to it being a while since I've looked at an algorithms book, so feel free to
club me over the head with any idiocies.

--Josh Model
Asst. Technical Staff
MIT Lincoln Laboratory
 
"Josh Model" <model@ll.mit.edu> wrote in message
news:Mg13b.61$Y5.21@llslave.llan.ll.mit.edu...

I must find the index of the maximum of 40 or so (unsorted, of course)
20-bit numbers. --the specifics aren't that important, just helps me
think.
It's pretty time critical, with throughput more important than latency,
unless latency becomes absurd, and not area critical.
Do you get all 40 numbers on every clock? Or do you get them one at
a time? Or are they all sitting in a memory somewhere?

If the numbers arrive one by one, then keeping track of the largest
(and its index) is pretty simple.

If you need the max of the 40 numbers that you saw most recently,
it's a bit harder because you must keep the numbers in sorted order,
but you also need to know about their arrival order so that you can
throw away the correct value when it reaches the "stale" end of the
list. I've done similar things (on a smaller scale) for median
filters, and it gets kinda messy.

The "brute force" solution of cycling through the 40 numbers and doing a
compare-store on each is pretty slow for FPGA implementation, and
cascading
comparators is pretty ugly.
I don't think it's that ugly, really. A tree of 40 comparator/mux modules
will do it, in a fairly systematic way. (It would be horrific to code in
Verilog, but easy in VHDL.)

I did some googling and literature searching and came up with plenty of
CS-style results, but very few hardware results.
Somewhere in the archives here we have a heapsorter design which does
the right kind of thing... I'll seek it out, if you are interested.

--

Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * Perl * Tcl/Tk * Verification * Project Services

Doulos Ltd. Church Hatch, 22 Market Place, Ringwood, Hampshire, BH24 1AW, UK
Tel: +44 (0)1425 471223 mail: jonathan.bromley@doulos.com
Fax: +44 (0)1425 471573 Web: http://www.doulos.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
Responses *'d

I must find the index of the maximum of 40 or so (unsorted, of course)
20-bit numbers. --the specifics aren't that important, just helps me
think.
It's pretty time critical, with throughput more important than latency,
unless latency becomes absurd, and not area critical.

Do you get all 40 numbers on every clock? Or do you get them one at
a time? Or are they all sitting in a memory somewhere?

If the numbers arrive one by one, then keeping track of the largest
(and its index) is pretty simple.

*Right. we get all 40 numbers (outputs of Multiply accumulates) once every
N clock cycles, with N probably being determined by the update rate at which
we can find the maximum. In the ideal situation, N = 1, and we update the
maximum index every clock cycle.

If you need the max of the 40 numbers that you saw most recently,
it's a bit harder because you must keep the numbers in sorted order,
but you also need to know about their arrival order so that you can
throw away the correct value when it reaches the "stale" end of the
list. I've done similar things (on a smaller scale) for median
filters, and it gets kinda messy.

The "brute force" solution of cycling through the 40 numbers and doing a
compare-store on each is pretty slow for FPGA implementation, and
cascading
comparators is pretty ugly.

I don't think it's that ugly, really. A tree of 40 comparator/mux modules
will do it, in a fairly systematic way. (It would be horrific to code in
Verilog, but easy in VHDL.)
* I think you're right, might not be so bad. If pipelined it would reduce
latency to Ceil(log2(40)) = 6, which is pretty good, and keep throughput at
maximum. The way I see it, each module would be made up (using 20-bit
numbers & 6-bit addresses) a 2-input, 20-bit comparator which serves as the
select for this module, a 20-bit 2:1 mux to pass the greater value, and a
6-bit 2:1 mux to pass the corect index. Cascade these guys in levels and
you'd need 20 + 10 + 5 + 2 + 2 + 1 modules. Still seems sort of inelegant
to me, but it gets the job done, which is the point.


I did some googling and literature searching and came up with plenty of
CS-style results, but very few hardware results.

Somewhere in the archives here we have a heapsorter design which does
the right kind of thing... I'll seek it out, if you are interested.
* I'd be interested if it's easy for you to find -- for my app, I think most
of the overhead of building the heap would be wasted, since I only want the
max.

*Thanks,
*--Josh
 
Jonathan Bromley wrote:

"Josh Model" <model@ll.mit.edu> wrote in message
news:Mg13b.61$Y5.21@llslave.llan.ll.mit.edu...


The "brute force" solution of cycling through the 40 numbers and doing a
compare-store on each is pretty slow for FPGA implementation, and
cascading


comparators is pretty ugly.



I don't think it's that ugly, really. A tree of 40 comparator/mux modules
will do it, in a fairly systematic way. (It would be horrific to code in
Verilog, but easy in VHDL.)


Aw, heck... it's just a bunch of for loops.

40 values, 20 compares, 20 results.
20 values, 10 compares, 10 results.
10 values, 5 compares, 5 results.
5 values, 2 compares, 3 results.
3 values, 1 compare, 2 results.
2 values, 1 compare, final result.

41 20 bit registers in the pipeline and a result in about 5 clocks.

one stage would look like:
for( i=0; i<10; i=i+1)
begin
result_stage3 <= result_stage2[2*i] > result_stage2[2*i+1] ?
result_stage2[2*i] : result_stage2[2*i+1];
index_stage3 <= result_stage2[2*i] > result_stage2[2*i+1] ?
index_stage2[2*i] : index_stage2[2*i+1];
end
 
"Josh Model" <model@ll.mit.edu> wrote in message
news:vE23b.67$Y5.42@llslave.llan.ll.mit.edu...
Responses *'d

I must find the index of the maximum of 40 or so (unsorted, of course)
20-bit numbers. --the specifics aren't that important, just helps me
think.
....
*Right. we get all 40 numbers (outputs of Multiply accumulates) once
every
N clock cycles, with N probably being determined by the update rate at
which
we can find the maximum. In the ideal situation, N = 1, and we update the
maximum index every clock cycle.
What's the operating clock rate? If you are running in the low tens of MHz
just up your internal FPGA frequency. That's the "FPGA way".

Also, can you use any information prior to the MAC to pre-calculate which
result will be the largest? Are the coefficients fixed? You get the idea.

What's feeding data into the FPGA? Can any intelligenge be derived from
this source?

It seems to me that the best you can do is a parallel binary search type
algorithm. It will take several clocks no matter what.

Another option might be (WARNING: IT'S UGLY AND GENERALLY BAD DESIGN) to
hand place all the components and look into a fully combinatorial solution.
Depending on your clock rate you might be able to pull this off.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Martin Euredjian

To send private email:
0_0_0_0_@pacbell.net
where
"0_0_0_0_" = "martineu"
 

Welcome to EDABoard.com

Sponsor

Back
Top