EDAboard.com | EDAboard.de | EDAboard.co.uk | WTWH Media

elektroda.net NewsGroups Forum Index - FPGA - **Unique uses for the DSP48**

Guest

Fri Jun 28, 2019 1:45 am

I've tried to figure out how to use the Xilinx DSP48s for Galois arithmetic, but they really aren't that useful for that. The new ones can do a 96-bit unary XOR, which can be used for GF(2) matrix multiplication, but the multipliers themselves aren't of much use for Galois math. I wondered what unusual uses (besides FIR filters or integer matrix multipliers) people have used these for. Here are some of mine:

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columsn serially out of P using the pattern matcher)

- Barrel shifters (Not that good for wide buses, though)

- Modulo by a constant (using Barrett's Reduction)

- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)

Guest

Fri Jun 28, 2019 5:45 am

On Thursday, June 27, 2019 at 8:27:03 PM UTC-4, Kevin Neilson wrote:

I've tried to figure out how to use the Xilinx DSP48s for Galois arithmetic, but they really aren't that useful for that. The new ones can do a 96-bit unary XOR, which can be used for GF(2) matrix multiplication, but the multipliers themselves aren't of much use for Galois math. I wondered what unusual uses (besides FIR filters or integer matrix multipliers) people have used these for. Here are some of mine:

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columsn serially out of P using the pattern matcher)

- Barrel shifters (Not that good for wide buses, though)

- Modulo by a constant (using Barrett's Reduction)

- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columsn serially out of P using the pattern matcher)

- Barrel shifters (Not that good for wide buses, though)

- Modulo by a constant (using Barrett's Reduction)

- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)

I'm no expert, but I believe Galois filters use modulo 2 arithmetic without carries, so multipliers are not what you want. Since there is no carry, there is no need to use any special features. The typical fabric logic will do the job quite nicely.

Certainly barrel shifters would be useful. Don't know what Barrett's Reduction is.

I don't think GF(2) would be useful since you can't use a multiplier without the carry as far as I know. Am I mistaken?

--

Rick C.

- Get 1,000 miles of free Supercharging

- Tesla referral code - https://ts.la/richard11209

Guest

Fri Jun 28, 2019 5:45 pm

In article <c159d7b2-da4b-4852-b90f-aa619fc9a1b4_at_googlegroups.com>,

Kevin Neilson <kevin.neilson_at_xilinx.com> wrote:

I've tried to figure out how to use the Xilinx DSP48s for Galois arithmetic, but they really aren't that useful for that. The new ones can do a 96-bit unary XOR, which can be used for GF(2) matrix

multiplication, but the multipliers themselves aren't of much use for Galois math. I wondered what unusual uses (besides FIR filters or integer matrix multipliers) people have used these for. Here are some

of mine:

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columsn serially out of P using the pattern matcher)

- Barrel shifters (Not that good for wide buses, though)

- Modulo by a constant (using Barrett's Reduction)

- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)

multiplication, but the multipliers themselves aren't of much use for Galois math. I wondered what unusual uses (besides FIR filters or integer matrix multipliers) people have used these for. Here are some

of mine:

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columsn serially out of P using the pattern matcher)

- Barrel shifters (Not that good for wide buses, though)

- Modulo by a constant (using Barrett's Reduction)

- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)

I've used one to implement a 3-input median filter (12-bit max inputs).

Used the SIMD modes to evaluate each comparison between the three

inputs.

Regards,

Mark

Guest

Fri Jun 28, 2019 11:45 pm

I have to do a lot of GF(2) multiplications, which end up being a vector times a matrix, with the matrix being sometimes 128x128 bits or bigger. There are no carries, like you say. Mostly it's an AND of the vector and a column of the matrix and then an XOR of that, and that is done for each column.. It can use up a lot of LUTs and the routing can get congested, especially when the synthesizer tries to share terms. The XOR is a tree with 3-4 levels of logic.

I think some processors can do a "carryless" multiply for this, so the columns are added but no carries are taken to the next column. You end up with a result that is wider than the field, so you have to do a reduction stage, but it's still advantageous. If you could disable the carries in the DSP48s, you could use them in the GF(2) multipliers, but unfortunately there's no such option.

Guest

Thu Jul 04, 2019 10:45 am

On 29/06/2019 00:44, Kevin Neilson wrote:

I have to do a lot of GF(2) multiplications, which end up being a vector times a matrix, with the matrix being sometimes 128x128 bits or bigger. There are no carries, like you say. Mostly it's an AND of the vector and a column of the matrix and then an XOR of that, and that is done for each column. It can use up a lot of LUTs and the routing can get congested, especially when the synthesizer tries to share terms. The XOR is a tree with 3-4 levels of logic.

I think some processors can do a "carryless" multiply for this, so the columns are added but no carries are taken to the next column. You end up with a result that is wider than the field, so you have to do a reduction stage, but it's still advantageous. If you could disable the carries in the DSP48s, you could use them in the GF(2) multipliers, but unfortunately there's no such option.

I think some processors can do a "carryless" multiply for this, so the columns are added but no carries are taken to the next column. You end up with a result that is wider than the field, so you have to do a reduction stage, but it's still advantageous. If you could disable the carries in the DSP48s, you could use them in the GF(2) multipliers, but unfortunately there's no such option.

Do you really mean GF(2) ? That is just single bits. Addition is XOR,

multiplication is AND.

If you meant something like GF(2^, which is popular in RAID and other

forward error correction systems on byte-wide data, then I can't think

of any smart way to handle multiplication in an FPGA. Multiplying by 2

is easy enough, but arbitrary multiplication is done using log tables in

software. If there is a nice hardware method, I'd love to know.

Guest

Thu Jul 04, 2019 1:45 pm

On Thu, 04 Jul 2019 11:16:07 +0200, David Brown wrote:

On 29/06/2019 00:44, Kevin Neilson wrote:

I have to do a lot of GF(2) multiplications, which end up being a

vector times a matrix, with the matrix being sometimes 128x128 bits or

bigger. There are no carries, like you say. Mostly it's an AND of the

vector and a column of the matrix and then an XOR of that, and that is

done for each column. It can use up a lot of LUTs and the routing can

get congested, especially when the synthesizer tries to share terms.

The XOR is a tree with 3-4 levels of logic.

I think some processors can do a "carryless" multiply for this, so the

columns are added but no carries are taken to the next column. You end

up with a result that is wider than the field, so you have to do a

reduction stage, but it's still advantageous. If you could disable the

carries in the DSP48s, you could use them in the GF(2) multipliers, but

unfortunately there's no such option.

Do you really mean GF(2) ? That is just single bits. Addition is XOR,

multiplication is AND.

If you meant something like GF(2^, which is popular in RAID and other

forward error correction systems on byte-wide data, then I can't think

of any smart way to handle multiplication in an FPGA. Multiplying by 2

is easy enough, but arbitrary multiplication is done using log tables in

software. If there is a nice hardware method, I'd love to know.

I have to do a lot of GF(2) multiplications, which end up being a

vector times a matrix, with the matrix being sometimes 128x128 bits or

bigger. There are no carries, like you say. Mostly it's an AND of the

vector and a column of the matrix and then an XOR of that, and that is

done for each column. It can use up a lot of LUTs and the routing can

get congested, especially when the synthesizer tries to share terms.

The XOR is a tree with 3-4 levels of logic.

I think some processors can do a "carryless" multiply for this, so the

columns are added but no carries are taken to the next column. You end

up with a result that is wider than the field, so you have to do a

reduction stage, but it's still advantageous. If you could disable the

carries in the DSP48s, you could use them in the GF(2) multipliers, but

unfortunately there's no such option.

Do you really mean GF(2) ? That is just single bits. Addition is XOR,

multiplication is AND.

If you meant something like GF(2^, which is popular in RAID and other

forward error correction systems on byte-wide data, then I can't think

of any smart way to handle multiplication in an FPGA. Multiplying by 2

is easy enough, but arbitrary multiplication is done using log tables in

software. If there is a nice hardware method, I'd love to know.

I think he does mean GF(2).

Here's the user guide:

<https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf>

Allan

Guest

Sat Jul 06, 2019 4:45 pm

I normally work in fields like GF(2^ or GF(2^11) (or, for AES, GF(2^128)) but all the operations decompose into GF(2) operations. For example, to do a x b = c in GF(2^, all these values are 8-bit vectors. You can expand b into an 8x8-bit matrix B in which b is the bottom row, b*alpha is the next row up, and b*alpha**7 is the top row, where alpha is the primitive element of the field. Then you multiply a x B using GF(2) to get the bits of c. So all the operations of a characteristic-2 field (based on a power of 2) can be broken down into GF(2) operations.

I always break down matrices in larger fields to GF(2) matrices to make it easier on the synthesizer. It can get bogged down otherwise.

Guest

Mon Jul 08, 2019 8:45 am

On 06/07/2019 17:38, Kevin Neilson wrote:

I normally work in fields like GF(2^ or GF(2^11) (or, for AES,

GF(2^128)) but all the operations decompose into GF(2) operations.

For example, to do a x b = c in GF(2^, all these values are 8-bit

vectors. You can expand b into an 8x8-bit matrix B in which b is the

bottom row, b*alpha is the next row up, and b*alpha**7 is the top

row, where alpha is the primitive element of the field. Then you

multiply a x B using GF(2) to get the bits of c. So all the

operations of a characteristic-2 field (based on a power of 2) can be

broken down into GF(2) operations.

I always break down matrices in larger fields to GF(2) matrices to

make it easier on the synthesizer. It can get bogged down

otherwise.

GF(2^128)) but all the operations decompose into GF(2) operations.

For example, to do a x b = c in GF(2^, all these values are 8-bit

vectors. You can expand b into an 8x8-bit matrix B in which b is the

bottom row, b*alpha is the next row up, and b*alpha**7 is the top

row, where alpha is the primitive element of the field. Then you

multiply a x B using GF(2) to get the bits of c. So all the

operations of a characteristic-2 field (based on a power of 2) can be

broken down into GF(2) operations.

I always break down matrices in larger fields to GF(2) matrices to

make it easier on the synthesizer. It can get bogged down

otherwise.

Thanks for that information. I did not know GF(2^ calculations could

be done by matrices over GF(2) like that, but it makes sense. I have

only used the field in software - it is key to RAID6 and higher RAID

versions - and there you do the multiplication and division using log

tables.

Guest

Mon Jul 08, 2019 4:45 pm

I could see how the log tables might be better in software, but in hardware a pair of log/antilog lookup tables (and a mod 2**m-1 adder) will normally be bigger and slower than a hardware multiplier, especially as m gets bigger (where the field is GF(2**m)). I do use lookup tables for reciprocals, but above some m, tables become inefficient for that as well.

elektroda.net NewsGroups Forum Index - FPGA - **Unique uses for the DSP48**