How to start with FPGA as \"coprocessor\"...

T

Thomas Koenig

Guest
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

For this, I would like to assign some work packages to search
for certain numbers to the FPGA, which then processes them and
returns the data, plus an indication that it has finished with
that particular package.

The task at hand is extremely parallel, so FPGAs should be a
good match. However, I have zero actual experience with FPGAs,
and I have no idea how to go about assigning the work packages
and getting back the results.

Any pointers? What sort of board should I look for, and how
should I handle the communication?

(For those who are interested: I want to find numbers other than
zero and one for which the sum of digits in all prime bases up
to 17 is the same, the successor to https://oeis.org/A335839 ,
so to speak).
 
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

....

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack then it is highly
unlikely you will succeed with an FPGA based solution.
What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

Still it will be an interesting learning exercise ;-)

Hans
www.ht-lab.com
 
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

....

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack then it is highly
unlikely you will succeed with an FPGA based solution.
What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

Still it will be an interesting learning exercise ;-)

Hans
www.ht-lab.com
 
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

....

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack then it is highly
unlikely you will succeed with an FPGA based solution.
What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

Still it will be an interesting learning exercise ;-)

Hans
www.ht-lab.com
 
On 5/8/21 5:28 PM, Thomas Koenig wrote:
I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

Out of curiosity, what is the specific issue you encounter using a
\'normal\' CPU ?

As you say:

> The task at hand is extremely parallel,

Typically, assuming a constant-time (of duration ta) atom of work and n
atoms to process over p cpu, the cpu would take a time
t_cpu = ta_cpu * round-up(n/p_cpu)

The only way a FPGA can beat that is if it:
a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu
b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu
c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

Depending on how much you\'re willing to spend (big FPGAs aren\'t cheap),
the first question would be, how big can you get \'p_cpu\' ? Using MPI to
distribute the atoms of work over a lot of cores should not be very
difficult, and a \'lot of cores\' can be obtained easily from cloud
providers nowadays.

FPGAs are not as easy to tryout, today I think it\'s pretty much Amazon
F1 in the cloud - or buying.

That being said, FPGA vendors promote a lot of solutions for this
particular problem, from low-level solutions (e.g. a PCIe core and a lot
of hand-written Verilog/VHDL/...) to high-level solutions (e.g.
<https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,
<https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,
etc.). Those solutions can be with stand-alone FPGAs or with the FPGA
integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among
others).

There\'s also non-vendor solutions, mostly accelerated SoC such as
<https://www.esp.cs.columbia.edu/> or
<https://github.com/google/CFU-Playground> (extension to
<https://github.com/enjoy-digital/litex>) that can help get started.

Cordially,

--
Romain
 
On 5/8/21 5:28 PM, Thomas Koenig wrote:
I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

Out of curiosity, what is the specific issue you encounter using a
\'normal\' CPU ?

As you say:

> The task at hand is extremely parallel,

Typically, assuming a constant-time (of duration ta) atom of work and n
atoms to process over p cpu, the cpu would take a time
t_cpu = ta_cpu * round-up(n/p_cpu)

The only way a FPGA can beat that is if it:
a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu
b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu
c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

Depending on how much you\'re willing to spend (big FPGAs aren\'t cheap),
the first question would be, how big can you get \'p_cpu\' ? Using MPI to
distribute the atoms of work over a lot of cores should not be very
difficult, and a \'lot of cores\' can be obtained easily from cloud
providers nowadays.

FPGAs are not as easy to tryout, today I think it\'s pretty much Amazon
F1 in the cloud - or buying.

That being said, FPGA vendors promote a lot of solutions for this
particular problem, from low-level solutions (e.g. a PCIe core and a lot
of hand-written Verilog/VHDL/...) to high-level solutions (e.g.
<https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,
<https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,
etc.). Those solutions can be with stand-alone FPGAs or with the FPGA
integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among
others).

There\'s also non-vendor solutions, mostly accelerated SoC such as
<https://www.esp.cs.columbia.edu/> or
<https://github.com/google/CFU-Playground> (extension to
<https://github.com/enjoy-digital/litex>) that can help get started.

Cordially,

--
Romain
 
On 5/8/21 5:28 PM, Thomas Koenig wrote:
I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

Out of curiosity, what is the specific issue you encounter using a
\'normal\' CPU ?

As you say:

> The task at hand is extremely parallel,

Typically, assuming a constant-time (of duration ta) atom of work and n
atoms to process over p cpu, the cpu would take a time
t_cpu = ta_cpu * round-up(n/p_cpu)

The only way a FPGA can beat that is if it:
a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu
b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu
c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

Depending on how much you\'re willing to spend (big FPGAs aren\'t cheap),
the first question would be, how big can you get \'p_cpu\' ? Using MPI to
distribute the atoms of work over a lot of cores should not be very
difficult, and a \'lot of cores\' can be obtained easily from cloud
providers nowadays.

FPGAs are not as easy to tryout, today I think it\'s pretty much Amazon
F1 in the cloud - or buying.

That being said, FPGA vendors promote a lot of solutions for this
particular problem, from low-level solutions (e.g. a PCIe core and a lot
of hand-written Verilog/VHDL/...) to high-level solutions (e.g.
<https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,
<https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,
etc.). Those solutions can be with stand-alone FPGAs or with the FPGA
integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among
others).

There\'s also non-vendor solutions, mostly accelerated SoC such as
<https://www.esp.cs.columbia.edu/> or
<https://github.com/google/CFU-Playground> (extension to
<https://github.com/enjoy-digital/litex>) that can help get started.

Cordially,

--
Romain
 
HT-Lab <hans64@htminuslab.com> schrieb:
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

...

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack

Definitely not the right kind of problem.

then it is highly
unlikely you will succeed with an FPGA based solution.

An FPGA would be quite good, IMHO.

What I would need are things like

- an efficient (base 2) popcount operation

- counters in base 3, 5, 7,11, 13 and 17

- adders for all of the bases above

- efficient popcount operations for all of the bases
above

plus handling of numbers in the region of 72 bits.

What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

That is an alternative. I am also looking at that, but
FPGAs seem to be more interesting, at the moment.

> Still it will be an interesting learning exercise ;-)

Certainly.

Therefore, what sort of system should I be looking for? I don\'t
want to spend my whole time writing Linux kernel drivers or
Bluetooth communication drivers for the FPGA :)

So, something that can be interfaced easily with a computer
(either on board or with a host computer running Linux) would
be great.
 
HT-Lab <hans64@htminuslab.com> schrieb:
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

...

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack

Definitely not the right kind of problem.

then it is highly
unlikely you will succeed with an FPGA based solution.

An FPGA would be quite good, IMHO.

What I would need are things like

- an efficient (base 2) popcount operation

- counters in base 3, 5, 7,11, 13 and 17

- adders for all of the bases above

- efficient popcount operations for all of the bases
above

plus handling of numbers in the region of 72 bits.

What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

That is an alternative. I am also looking at that, but
FPGAs seem to be more interesting, at the moment.

> Still it will be an interesting learning exercise ;-)

Certainly.

Therefore, what sort of system should I be looking for? I don\'t
want to spend my whole time writing Linux kernel drivers or
Bluetooth communication drivers for the FPGA :)

So, something that can be interfaced easily with a computer
(either on board or with a host computer running Linux) would
be great.
 
HT-Lab <hans64@htminuslab.com> schrieb:
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

...

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack

Definitely not the right kind of problem.

then it is highly
unlikely you will succeed with an FPGA based solution.

An FPGA would be quite good, IMHO.

What I would need are things like

- an efficient (base 2) popcount operation

- counters in base 3, 5, 7,11, 13 and 17

- adders for all of the bases above

- efficient popcount operations for all of the bases
above

plus handling of numbers in the region of 72 bits.

What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

That is an alternative. I am also looking at that, but
FPGAs seem to be more interesting, at the moment.

> Still it will be an interesting learning exercise ;-)

Certainly.

Therefore, what sort of system should I be looking for? I don\'t
want to spend my whole time writing Linux kernel drivers or
Bluetooth communication drivers for the FPGA :)

So, something that can be interfaced easily with a computer
(either on board or with a host computer running Linux) would
be great.
 
Romain Dolbeau <romain@dolbeau.org> schrieb:
On 5/8/21 5:28 PM, Thomas Koenig wrote:
I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

Out of curiosity, what is the specific issue you encounter using a
\'normal\' CPU ?

It\'s too slow.

I managed to search the number space up to around 2^64 in around
half a CPU year (from which you can tell that one key is to
reduce the search space).

The task at hand is extremely parallel,

Typically, assuming a constant-time (of duration ta) atom of work and n
atoms to process over p cpu, the cpu would take a time
t_cpu = ta_cpu * round-up(n/p_cpu)

The only way a FPGA can beat that is if it:
a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu
b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu
c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

There are things that an FPGA should be able to do better than
a CPU. One example is implementing a base-n counter, which is a
serial operation on a CPU and can easily be done in parallel on
an FPGA.

Depending on how much you\'re willing to spend (big FPGAs aren\'t cheap),
the first question would be, how big can you get \'p_cpu\' ? Using MPI to
distribute the atoms of work over a lot of cores should not be very
difficult, and a \'lot of cores\' can be obtained easily from cloud
providers nowadays.

That is of course a possibility. In the CPU-based approach I
simply used OpenMP with schedule(dynamic). However, for this
kind of hobbyist thing, I\'d rather learn something interesting
than throw money at a cloud provider :)

That being said, FPGA vendors promote a lot of solutions for this
particular problem, from low-level solutions (e.g. a PCIe core and a lot
of hand-written Verilog/VHDL/...) to high-level solutions (e.g.
https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,
https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,
etc.). Those solutions can be with stand-alone FPGAs or with the FPGA
integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among
others).

There\'s also non-vendor solutions, mostly accelerated SoC such as
https://www.esp.cs.columbia.edu/> or
https://github.com/google/CFU-Playground> (extension to
https://github.com/enjoy-digital/litex>) that can help get started.

Thanks for the pointers.

Seems to be rather high-level, and also rather abstract (ok, so these
systems are usually aimed at professionals, not at hobbyists).

I\'ll look around a bit and see if I can find anything that helps me,
but at the moment, I have to say it all looks rather daunting :)
 
Romain Dolbeau <romain@dolbeau.org> schrieb:
On 5/8/21 5:28 PM, Thomas Koenig wrote:
I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

Out of curiosity, what is the specific issue you encounter using a
\'normal\' CPU ?

It\'s too slow.

I managed to search the number space up to around 2^64 in around
half a CPU year (from which you can tell that one key is to
reduce the search space).

The task at hand is extremely parallel,

Typically, assuming a constant-time (of duration ta) atom of work and n
atoms to process over p cpu, the cpu would take a time
t_cpu = ta_cpu * round-up(n/p_cpu)

The only way a FPGA can beat that is if it:
a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu
b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu
c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

There are things that an FPGA should be able to do better than
a CPU. One example is implementing a base-n counter, which is a
serial operation on a CPU and can easily be done in parallel on
an FPGA.

Depending on how much you\'re willing to spend (big FPGAs aren\'t cheap),
the first question would be, how big can you get \'p_cpu\' ? Using MPI to
distribute the atoms of work over a lot of cores should not be very
difficult, and a \'lot of cores\' can be obtained easily from cloud
providers nowadays.

That is of course a possibility. In the CPU-based approach I
simply used OpenMP with schedule(dynamic). However, for this
kind of hobbyist thing, I\'d rather learn something interesting
than throw money at a cloud provider :)

That being said, FPGA vendors promote a lot of solutions for this
particular problem, from low-level solutions (e.g. a PCIe core and a lot
of hand-written Verilog/VHDL/...) to high-level solutions (e.g.
https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,
https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,
etc.). Those solutions can be with stand-alone FPGAs or with the FPGA
integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among
others).

There\'s also non-vendor solutions, mostly accelerated SoC such as
https://www.esp.cs.columbia.edu/> or
https://github.com/google/CFU-Playground> (extension to
https://github.com/enjoy-digital/litex>) that can help get started.

Thanks for the pointers.

Seems to be rather high-level, and also rather abstract (ok, so these
systems are usually aimed at professionals, not at hobbyists).

I\'ll look around a bit and see if I can find anything that helps me,
but at the moment, I have to say it all looks rather daunting :)
 
Romain Dolbeau <romain@dolbeau.org> schrieb:
On 5/8/21 5:28 PM, Thomas Koenig wrote:
I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

Out of curiosity, what is the specific issue you encounter using a
\'normal\' CPU ?

It\'s too slow.

I managed to search the number space up to around 2^64 in around
half a CPU year (from which you can tell that one key is to
reduce the search space).

The task at hand is extremely parallel,

Typically, assuming a constant-time (of duration ta) atom of work and n
atoms to process over p cpu, the cpu would take a time
t_cpu = ta_cpu * round-up(n/p_cpu)

The only way a FPGA can beat that is if it:
a) has a ta_fpga <<< ta_cpu while retaining p_fpga ~= p_cpu
b) has a p_fpga >>> p_cpu while retaining ta_fpga ~= ta_cpu
c) has a ta_fpga <<< ta_cpu and a p_fpga >>> p_cpu (ideal case)

There are things that an FPGA should be able to do better than
a CPU. One example is implementing a base-n counter, which is a
serial operation on a CPU and can easily be done in parallel on
an FPGA.

Depending on how much you\'re willing to spend (big FPGAs aren\'t cheap),
the first question would be, how big can you get \'p_cpu\' ? Using MPI to
distribute the atoms of work over a lot of cores should not be very
difficult, and a \'lot of cores\' can be obtained easily from cloud
providers nowadays.

That is of course a possibility. In the CPU-based approach I
simply used OpenMP with schedule(dynamic). However, for this
kind of hobbyist thing, I\'d rather learn something interesting
than throw money at a cloud provider :)

That being said, FPGA vendors promote a lot of solutions for this
particular problem, from low-level solutions (e.g. a PCIe core and a lot
of hand-written Verilog/VHDL/...) to high-level solutions (e.g.
https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,
https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,
etc.). Those solutions can be with stand-alone FPGAs or with the FPGA
integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among
others).

There\'s also non-vendor solutions, mostly accelerated SoC such as
https://www.esp.cs.columbia.edu/> or
https://github.com/google/CFU-Playground> (extension to
https://github.com/enjoy-digital/litex>) that can help get started.

Thanks for the pointers.

Seems to be rather high-level, and also rather abstract (ok, so these
systems are usually aimed at professionals, not at hobbyists).

I\'ll look around a bit and see if I can find anything that helps me,
but at the moment, I have to say it all looks rather daunting :)
 
On 5/9/21 1:50 PM, Thomas Koenig wrote:
That is of course a possibility. In the CPU-based approach I
simply used OpenMP with schedule(dynamic). However, for this
kind of hobbyist thing, I\'d rather learn something interesting
than throw money at a cloud provider :)

Agreed. But distributed computing can do wonder for many brute-force
mathematical problems (e.g.
<https://en.wikipedia.org/wiki/Lychrel_number#196_palindrome_quest> ;-) ).

Seems to be rather high-level, and also rather abstract (ok, so these
systems are usually aimed at professionals, not at hobbyists).
I\'ll look around a bit and see if I can find anything that helps me,
but at the moment, I have to say it all looks rather daunting :)

It depends where you start... I see two problems to solve to use a FPGA:

(a) implementing the hardware operator(s) in a fast/efficient way;
(b) integrating said operator(s) in an acceleration framework.

Tackling both at once can be daunting. However, tackling only the first
(which is likely the most interesting one and sort-of-research) should
be quite achievable. Once that works, you can figure out how to put them
in \'production\' by having the right set up (number of operators, speed,
memory requirements, etc.).

E.g., if you can express the problem as a (set of) 32x32 -> 32 operators
(with some ad-hoc encoding, presumably), you can easily add dedicated
instructions in a 32-bits softcore to evaluate the performance benefit.
Then you can figure out a way to high-performance. Some softcore might
let you do wider operands/results - 64-bits should not be much of a
problem with 64-bit softcores.

Blowing my own horn here (sorry), but for instance you can have a look
at: <https://github.com/rdolbeau/VexRiscvBPluginGenerator/> which is
designed to easily add integer-pipeline instructions to a VexRiscv
(RV32) core in a Linux-capable Litex SoC. I run a 100 MHz quad-cores on
a ~$90 board; it won\'t be faster than a beefy CPU, but it\'s a cheap way
to start evaluating implementations. There\'s also support for 32x32->64
instructions (from the draft P \'packed simd\' extension to RISC-V [2]),
32x32x32->32 instructions (from the draft Zbt \'bitmanip ternary\'
extension [1]) and you can even do 32x32x32->64 if you really want (by
abusing both systems, for instance to implement a faster Chacha [3], see
e.g.
<https://github.com/rdolbeau/VexRiscvBPluginGenerator/blob/master/data_Chacha64.txt>).
That would the easiest way to prototype some operators, I believe.

Alternatively, VexRiscv has a FPU \'coprocessor\' that can be an
inspiration to implement a dedicated unit with data width up to 64 bits
(but it will be more complicated).

Finally you can go for the full-custom acceleration peripheral; the
CFU-playgrounds is one way, or you can look at the ESP project as they
are more focused on acceleration. But that\'s basically solving (b) along
with (a).

Cordially,

Romain

[1] <https://github.com/riscv/riscv-bitmanip>
[2] <https://github.com/riscv/riscv-p-spec>
[3] <https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant>

--
Romain
 
On 5/9/21 1:50 PM, Thomas Koenig wrote:
That is of course a possibility. In the CPU-based approach I
simply used OpenMP with schedule(dynamic). However, for this
kind of hobbyist thing, I\'d rather learn something interesting
than throw money at a cloud provider :)

Agreed. But distributed computing can do wonder for many brute-force
mathematical problems (e.g.
<https://en.wikipedia.org/wiki/Lychrel_number#196_palindrome_quest> ;-) ).

Seems to be rather high-level, and also rather abstract (ok, so these
systems are usually aimed at professionals, not at hobbyists).
I\'ll look around a bit and see if I can find anything that helps me,
but at the moment, I have to say it all looks rather daunting :)

It depends where you start... I see two problems to solve to use a FPGA:

(a) implementing the hardware operator(s) in a fast/efficient way;
(b) integrating said operator(s) in an acceleration framework.

Tackling both at once can be daunting. However, tackling only the first
(which is likely the most interesting one and sort-of-research) should
be quite achievable. Once that works, you can figure out how to put them
in \'production\' by having the right set up (number of operators, speed,
memory requirements, etc.).

E.g., if you can express the problem as a (set of) 32x32 -> 32 operators
(with some ad-hoc encoding, presumably), you can easily add dedicated
instructions in a 32-bits softcore to evaluate the performance benefit.
Then you can figure out a way to high-performance. Some softcore might
let you do wider operands/results - 64-bits should not be much of a
problem with 64-bit softcores.

Blowing my own horn here (sorry), but for instance you can have a look
at: <https://github.com/rdolbeau/VexRiscvBPluginGenerator/> which is
designed to easily add integer-pipeline instructions to a VexRiscv
(RV32) core in a Linux-capable Litex SoC. I run a 100 MHz quad-cores on
a ~$90 board; it won\'t be faster than a beefy CPU, but it\'s a cheap way
to start evaluating implementations. There\'s also support for 32x32->64
instructions (from the draft P \'packed simd\' extension to RISC-V [2]),
32x32x32->32 instructions (from the draft Zbt \'bitmanip ternary\'
extension [1]) and you can even do 32x32x32->64 if you really want (by
abusing both systems, for instance to implement a faster Chacha [3], see
e.g.
<https://github.com/rdolbeau/VexRiscvBPluginGenerator/blob/master/data_Chacha64.txt>).
That would the easiest way to prototype some operators, I believe.

Alternatively, VexRiscv has a FPU \'coprocessor\' that can be an
inspiration to implement a dedicated unit with data width up to 64 bits
(but it will be more complicated).

Finally you can go for the full-custom acceleration peripheral; the
CFU-playgrounds is one way, or you can look at the ESP project as they
are more focused on acceleration. But that\'s basically solving (b) along
with (a).

Cordially,

Romain

[1] <https://github.com/riscv/riscv-bitmanip>
[2] <https://github.com/riscv/riscv-p-spec>
[3] <https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant>

--
Romain
 
On 5/9/21 1:50 PM, Thomas Koenig wrote:
That is of course a possibility. In the CPU-based approach I
simply used OpenMP with schedule(dynamic). However, for this
kind of hobbyist thing, I\'d rather learn something interesting
than throw money at a cloud provider :)

Agreed. But distributed computing can do wonder for many brute-force
mathematical problems (e.g.
<https://en.wikipedia.org/wiki/Lychrel_number#196_palindrome_quest> ;-) ).

Seems to be rather high-level, and also rather abstract (ok, so these
systems are usually aimed at professionals, not at hobbyists).
I\'ll look around a bit and see if I can find anything that helps me,
but at the moment, I have to say it all looks rather daunting :)

It depends where you start... I see two problems to solve to use a FPGA:

(a) implementing the hardware operator(s) in a fast/efficient way;
(b) integrating said operator(s) in an acceleration framework.

Tackling both at once can be daunting. However, tackling only the first
(which is likely the most interesting one and sort-of-research) should
be quite achievable. Once that works, you can figure out how to put them
in \'production\' by having the right set up (number of operators, speed,
memory requirements, etc.).

E.g., if you can express the problem as a (set of) 32x32 -> 32 operators
(with some ad-hoc encoding, presumably), you can easily add dedicated
instructions in a 32-bits softcore to evaluate the performance benefit.
Then you can figure out a way to high-performance. Some softcore might
let you do wider operands/results - 64-bits should not be much of a
problem with 64-bit softcores.

Blowing my own horn here (sorry), but for instance you can have a look
at: <https://github.com/rdolbeau/VexRiscvBPluginGenerator/> which is
designed to easily add integer-pipeline instructions to a VexRiscv
(RV32) core in a Linux-capable Litex SoC. I run a 100 MHz quad-cores on
a ~$90 board; it won\'t be faster than a beefy CPU, but it\'s a cheap way
to start evaluating implementations. There\'s also support for 32x32->64
instructions (from the draft P \'packed simd\' extension to RISC-V [2]),
32x32x32->32 instructions (from the draft Zbt \'bitmanip ternary\'
extension [1]) and you can even do 32x32x32->64 if you really want (by
abusing both systems, for instance to implement a faster Chacha [3], see
e.g.
<https://github.com/rdolbeau/VexRiscvBPluginGenerator/blob/master/data_Chacha64.txt>).
That would the easiest way to prototype some operators, I believe.

Alternatively, VexRiscv has a FPU \'coprocessor\' that can be an
inspiration to implement a dedicated unit with data width up to 64 bits
(but it will be more complicated).

Finally you can go for the full-custom acceleration peripheral; the
CFU-playgrounds is one way, or you can look at the ESP project as they
are more focused on acceleration. But that\'s basically solving (b) along
with (a).

Cordially,

Romain

[1] <https://github.com/riscv/riscv-bitmanip>
[2] <https://github.com/riscv/riscv-p-spec>
[3] <https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant>

--
Romain
 
On 09/05/2021 09:22, Thomas Koenig wrote:
HT-Lab <hans64@htminuslab.com> schrieb:
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

...

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack

Definitely not the right kind of problem.

OK, I must admit I didn\'t really look closely at the page you gave but I
do know for a lot of numerical intensive calculations a modern PC + Cuda
is not easily beaten by an FPGA especially in terms of cost and
development time.
then it is highly
unlikely you will succeed with an FPGA based solution.

An FPGA would be quite good, IMHO.

What I would need are things like

- an efficient (base 2) popcount operation

This is easy as most processors have a POPCNT instruction so you should
be able to find some efficient RTL code on the web. In most cases it is
just a bunch of counters/adders.

- counters in base 3, 5, 7,11, 13 and 17

This I suspect will be more difficult especially if you have to deal
with large word length, if not then a LUTs+adders could provide a fast
solution.

- adders for all of the bases above

No idea, perhaps converting to base2 (allowing you to instantiate
optimised vendors cores), do all your operations and move back to base
3..17?
- efficient popcount operations for all of the bases
above

plus handling of numbers in the region of 72 bits.

That could be a problem as 72bits adders/popcnt will not be fast, you
will need to heavily pipeline and optimise your design which adds
another level of complexity.

What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

That is an alternative. I am also looking at that, but
FPGAs seem to be more interesting, at the moment.

Still it will be an interesting learning exercise ;-)

Certainly.

Therefore, what sort of system should I be looking for? I don\'t
want to spend my whole time writing Linux kernel drivers or
Bluetooth communication drivers for the FPGA :)

If you looked at Bluetooth I assume the data rate required is not that
high. In this case I would go for a simple UART, you can easily get
1Mbits without much effort. No special drivers are required.
If you need more bandwidth then have a look at the many Future
Technology USB devices like the F232H which are easy to interface and
could give you up to 40MByte/sec transfer speeds. The drivers are freely
available for Windows and Linux. I have used them on a previous project
and they worked without any issue.
For anything higher get a PCIe FPGA development board which normally
come with drivers to fast DMA a block of data to and from the FPGA.


Good luck,
Hans
www.ht-lab.com


So, something that can be interfaced easily with a computer
(either on board or with a host computer running Linux) would
be great.
 
On 09/05/2021 09:22, Thomas Koenig wrote:
HT-Lab <hans64@htminuslab.com> schrieb:
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

...

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack

Definitely not the right kind of problem.

OK, I must admit I didn\'t really look closely at the page you gave but I
do know for a lot of numerical intensive calculations a modern PC + Cuda
is not easily beaten by an FPGA especially in terms of cost and
development time.
then it is highly
unlikely you will succeed with an FPGA based solution.

An FPGA would be quite good, IMHO.

What I would need are things like

- an efficient (base 2) popcount operation

This is easy as most processors have a POPCNT instruction so you should
be able to find some efficient RTL code on the web. In most cases it is
just a bunch of counters/adders.

- counters in base 3, 5, 7,11, 13 and 17

This I suspect will be more difficult especially if you have to deal
with large word length, if not then a LUTs+adders could provide a fast
solution.

- adders for all of the bases above

No idea, perhaps converting to base2 (allowing you to instantiate
optimised vendors cores), do all your operations and move back to base
3..17?
- efficient popcount operations for all of the bases
above

plus handling of numbers in the region of 72 bits.

That could be a problem as 72bits adders/popcnt will not be fast, you
will need to heavily pipeline and optimise your design which adds
another level of complexity.

What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

That is an alternative. I am also looking at that, but
FPGAs seem to be more interesting, at the moment.

Still it will be an interesting learning exercise ;-)

Certainly.

Therefore, what sort of system should I be looking for? I don\'t
want to spend my whole time writing Linux kernel drivers or
Bluetooth communication drivers for the FPGA :)

If you looked at Bluetooth I assume the data rate required is not that
high. In this case I would go for a simple UART, you can easily get
1Mbits without much effort. No special drivers are required.
If you need more bandwidth then have a look at the many Future
Technology USB devices like the F232H which are easy to interface and
could give you up to 40MByte/sec transfer speeds. The drivers are freely
available for Windows and Linux. I have used them on a previous project
and they worked without any issue.
For anything higher get a PCIe FPGA development board which normally
come with drivers to fast DMA a block of data to and from the FPGA.


Good luck,
Hans
www.ht-lab.com


So, something that can be interfaced easily with a computer
(either on board or with a host computer running Linux) would
be great.
 
On 09/05/2021 09:22, Thomas Koenig wrote:
HT-Lab <hans64@htminuslab.com> schrieb:
On 08/05/2021 16:28, Thomas Koenig wrote:
Hi,

I have a certain interest in a mathematical puzzle that I have not
been able to solve using a normal CPU, and I thought that using
an FPGA could work.

...

I think you are moving in the wrong direction, if you can\'t solve it
with some numerical package like numpy/linpack

Definitely not the right kind of problem.

OK, I must admit I didn\'t really look closely at the page you gave but I
do know for a lot of numerical intensive calculations a modern PC + Cuda
is not easily beaten by an FPGA especially in terms of cost and
development time.
then it is highly
unlikely you will succeed with an FPGA based solution.

An FPGA would be quite good, IMHO.

What I would need are things like

- an efficient (base 2) popcount operation

This is easy as most processors have a POPCNT instruction so you should
be able to find some efficient RTL code on the web. In most cases it is
just a bunch of counters/adders.

- counters in base 3, 5, 7,11, 13 and 17

This I suspect will be more difficult especially if you have to deal
with large word length, if not then a LUTs+adders could provide a fast
solution.

- adders for all of the bases above

No idea, perhaps converting to base2 (allowing you to instantiate
optimised vendors cores), do all your operations and move back to base
3..17?
- efficient popcount operations for all of the bases
above

plus handling of numbers in the region of 72 bits.

That could be a problem as 72bits adders/popcnt will not be fast, you
will need to heavily pipeline and optimise your design which adds
another level of complexity.

What you probably want is a fast graphics card + CUDA/OpenCL which will
most likely outperform your FPGA based design.

That is an alternative. I am also looking at that, but
FPGAs seem to be more interesting, at the moment.

Still it will be an interesting learning exercise ;-)

Certainly.

Therefore, what sort of system should I be looking for? I don\'t
want to spend my whole time writing Linux kernel drivers or
Bluetooth communication drivers for the FPGA :)

If you looked at Bluetooth I assume the data rate required is not that
high. In this case I would go for a simple UART, you can easily get
1Mbits without much effort. No special drivers are required.
If you need more bandwidth then have a look at the many Future
Technology USB devices like the F232H which are easy to interface and
could give you up to 40MByte/sec transfer speeds. The drivers are freely
available for Windows and Linux. I have used them on a previous project
and they worked without any issue.
For anything higher get a PCIe FPGA development board which normally
come with drivers to fast DMA a block of data to and from the FPGA.


Good luck,
Hans
www.ht-lab.com


So, something that can be interfaced easily with a computer
(either on board or with a host computer running Linux) would
be great.
 
On 09/05/2021 12:50, Thomas Koenig wrote:
...snip
That is of course a possibility. In the CPU-based approach I
simply used OpenMP with schedule(dynamic). However, for this
kind of hobbyist thing,

Ah, I assumed this was some commercial project, in that case go for it,
FPGA\'s are the best solution :)


I\'d rather learn something interesting
than throw money at a cloud provider :)

That being said, FPGA vendors promote a lot of solutions for this
particular problem, from low-level solutions (e.g. a PCIe core and a lot
of hand-written Verilog/VHDL/...) to high-level solutions (e.g.
https://www.intel.com/content/www/us/en/programmable/documentation/div1537518568620.html>,
https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html>,
etc.). Those solutions can be with stand-alone FPGAs or with the FPGA
integrated in a SoC with normal cores (e.g. Xilinx Zynq families, among
others).

There\'s also non-vendor solutions, mostly accelerated SoC such as
https://www.esp.cs.columbia.edu/> or
https://github.com/google/CFU-Playground> (extension to
https://github.com/enjoy-digital/litex>) that can help get started.

Thanks for the pointers.

Seems to be rather high-level, and also rather abstract (ok, so these
systems are usually aimed at professionals, not at hobbyists).

I\'ll look around a bit and see if I can find anything that helps me,
but at the moment, I have to say it all looks rather daunting :)

Just start small, take one of your required operators, say popcnt,
implement it in VHDL/(S)Verilog (or chisel/Python/C/etc) and simulate
it. Next get a low cost board from eBay, download the free vendor tools
and try to implement it. Depending on the prototype board you can
probably use some switches and 7-segment display for I/O.

Good luck,

Regards,
Hans.
www.ht-lab.com
 

Welcome to EDABoard.com

Sponsor

Back
Top