EDAboard.com | EDAboard.de | EDAboard.co.uk | WTWH Media

elektroda.net NewsGroups Forum Index - FPGA - **What to do with an improved algorithm?**

Guest

Mon Sep 03, 2018 11:45 am

Hi,

I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in two-thirds of the cycles and using two-thirds of the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike

Guest

Mon Sep 03, 2018 11:45 am

I think the best option is to write an article -- or a patent. Simply because it's an extra opportunity to verify that your approach is correct. If there's a hidden mistake, a student might be unable to see it.

Gene

On 03.09.2018 13:17, Mike Field wrote:

Hi,

I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in two-thirds of the cycles and using two-thirds of the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike


Guest

Mon Sep 03, 2018 10:45 pm

I agree with Gene, plus you might consider publishing the IP as open source code on a website of your own or opencores.org.

--Mike

On Monday, September 3, 2018 at 3:41:02 AM UTC-7, Gene Filatov wrote:

I think the best option is to write an article -- or a patent. Simply because it's an extra opportunity to verify that your approach is correct. If there's a hidden mistake, a student might be unable to see it.

Gene

On 03.09.2018 13:17, Mike Field wrote:

Hi,

I think I've got a really good way to improve a commonly used & well established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds the cycles and using 2/3rds the resources of an standard Xilinx IP block, with comparable timing).

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find anything similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm, and the optimization is explained it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike


Guest

Tue Sep 04, 2018 2:45 pm

On 03/09/2018 11:17, Mike Field wrote:

Hi,

I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in two-thirds of the cycles and using two-thirds of the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike


I'd publish - since you are not already in the IP licensing/patenting groove, I doubt you would make any money from it, but you might gain kudos, which may help your career and business. Xilinx might want to publish it - which might give it a lot more visibility.

If you have a web site, you could put it on that.

MK

Guest

Tue Sep 04, 2018 4:45 pm

On 9/3/2018 6:17 AM, Mike Field wrote:

Hi,

I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in two-thirds of the cycles and using two-thirds of the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike


I am serious in this reply. You may already know this; if so, then I post for other people to know as well.

I would recommend that if you have means to get by in your life as you are today without receiving income from this invention / discovery, then not only give it away to people, but do so in the name of Jesus Christ. Tell the people around you that because of what Jesus has done for you, you are desiring to give to the people here the fruit of what He first gave you (your intellect, abilities, opportunities to learn, guiding your path through life, etc.).

By giving Him the credit for your invention / discovery, and not doing it for money or some other thing here, you will receive a reward in Heaven for your labor / knowledge gift to other people, and that reward in Heaven is like the burning bush, or the fishes and the loaves, meaning it is able to be used without being consumed. In short: rewards in Heaven do not wear out or diminish with use. They remain in their full gift for all time, even under maximum / heavy use.

Remember the Lord in your life, and give things to people citing Him as the reason why you are doing so. You will receive the maximum return on your efforts, and it sounds like you will also help out many people here on this Earth by your gift.

It would be a win-win for everyone in your giving.

--

Rick C. Hodgin

Guest

Tue Sep 04, 2018 4:45 pm

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:

I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in two-thirds of the cycles and using two-thirds of the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike

Licensing and selling IP comes with a bit of a learning curve and requires an investment on your part. As Michael mentions, without some of that framework already in place, a license vetted by an IP attorney, and a good marketing plan, you might not see a return on that investment.

If you want your name more prominently attached to it, I'd suggest posting it on a personal GitHub account rather than opencores.org, which makes you conform to its requirements (such as the Wishbone interface, etc.).

Xilinx always welcomes guest articles on their blogs (although those have been in flux since the recent reorg) and in their e-magazine Xcell Journal (which, again, seems to have been discontinued, with the Xcell Daily Blog archived):

https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive

https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog

https://www.xilinx.com/about/xcell-publications/xcell-journal.html

--Kris

Guest

Tue Sep 04, 2018 6:45 pm

On Tuesday, September 4, 2018 at 7:48:41 AM UTC-7, kkoorndyk wrote:

If you want your name more prominently attached to it, I'd suggest posting it on a personal GitHub account rather than opencores.org, which makes you conform to its requirements (such as the Wishbone interface, etc.).

OpenCores encourages use of the Wishbone interface for SoC components, and they do offer coding guidelines, but there are no requirements for either. For example, in the entire DSP core section there are 38 entries, none of which are marked as Wishbone-compliant.


Guest

Wed Sep 05, 2018 6:45 am

On Wednesday, 5 September 2018 02:48:41 UTC+12, kkoorndyk wrote:

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:

Hi,

I think I've got a really good way to improve a commonly used and well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in two-thirds of the cycles and using two-thirds of the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike

Licensing and selling IP comes with a bit of a learning curve and requires an investment on your part. As Michael mentions, without some of that framework already in place, a license vetted by an IP attorney, and a good marketing plan, you might not see a return on that investment.

If you want your name more prominently attached to it, I'd suggest posting it on a personal GitHub account rather than opencores.org, which makes you conform to its requirements (such as the Wishbone interface, etc.).

Xilinx always welcomes guest articles on their blogs (although those have been in flux since the recent reorg) and in their e-magazine Xcell Journal (which, again, seems to have been discontinued, with the Xcell Daily Blog archived):

https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive

https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog

https://www.xilinx.com/about/xcell-publications/xcell-journal.html

--Kris


I never thought I would agree with Rick, but....

All of that sounds like too much work. So here is a quick summary with C-like pseudo-code. I'll put the HDL code up somewhere soon, once I am happy with it - I am still removing the last rounding errors.

I've been playing with CORDIC and have come up with what looks to be an overlooked optimization. I've done a bit of googling and haven't found anything - maybe it is a novel approach?

I've tested it with 32-bit inputs and outputs, and it is within +/-2, with an average error of around 0.6. I am sure that with a bit more analysis of where the errors are coming from I can make it more accurate.

It has two parts. Both seem quite trivial by themselves, but they complement each other quite nicely.

Scaling Z

---------

The 'z' value that CORDIC uses becomes smaller and smaller as the stages progress.

The core of CORDIC for SIN() and COS() is:

    x = INITIAL;
    y = INITIAL;
    for (i = 0; i < CORDIC_REPS; i++) {
        int64_t tx, ty;
        // divide to scale the current vector
        tx = x >> (i+1);
        ty = y >> (i+1);
        // either add or subtract at right angles to the current vector
        x -= (z > 0) ? ty : -ty;
        y += (z > 0) ? tx : -tx;
        z -= (z > 0) ? angles[i] : -angles[i];
    }

The values in angle[] are all-important, for example:

    angle[0] = 1238021
    angle[1] = 654136
    angle[2] = 332050
    angle[3] = 166670
    angle[4] = 83415
    angle[5] = 41718
    angle[6] = 20860
    angle[7] = 10430
    angle[8] = 5215
    angle[9] = 2607
    angle[10] = 1303
    angle[11] = 652
    angle[12] = 326
    angle[13] = 163
    angle[14] = 81
    angle[15] = 41
    angle[16] = 20
    angle[17] = 10
    angle[18] = 5
    angle[19] = 3
    angle[20] = 1

If you make the following change:

    for (i = 0; i < CORDIC_REPS; i++) {
        int64_t tx, ty;
        // divide to scale the current vector
        tx = x >> (i+1);
        ty = y >> (i+1);
        // either add or subtract at right angles
        x -= (z > 0) ? ty : -ty;
        y += (z > 0) ? tx : -tx;
        z -= (z > 0) ? angles[i] : -angles[i];
        //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        z <<= 1; // Double the result of 'z'
        //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    }

Then you can use all the bits in angle[], because you can scale each entry by 2^i (this is data from a different set of parameters, hence the values and the count are different):

    angle[0] = 1238021
    angle[1] = 1308273
    angle[2] = 1328199
    angle[3] = 1333354
    angle[4] = 1334654
    angle[5] = 1334980
    angle[6] = 1335061
    angle[7] = 1335082
    angle[8] = 1335087
    angle[9] = 1335088
    angle[10] = 1335088
    angle[11] = 1335088
    angle[12] = 1335088
    angle[13] = 1335088
    angle[14] = 1335088
    angle[15] = 1335088
    angle[16] = 1335088
    angle[17] = 1335088
    angle[18] = 1335088
    angle[19] = 1335088
    angle[20] = 1335088
    angle[21] = 1335088
    angle[22] = 1335088
    angle[23] = 1335088
    angle[24] = 1335088
    angle[25] = 1335088
    angle[26] = 1335088
    angle[27] = 1335088
    angle[28] = 1335088
    angle[29] = 1335088

...and angle[i] rapidly becomes a constant value after the first 9 or 10 iterations. This is what you would expect, as the step angle gets smaller and smaller.

Part 2: Add a lookup table
==========================

If you split the input into:

    [2 MSBs]      the quadrant
    [next 9 bits] a lookup table index
    [the rest]    the starting CORDIC Z value, offset by 1<<(num_of_bits-1)

And have a lookup table of 512 x 36-bit values (i.e. a block RAM) which holds the SIN/COS values at the center of each sector, e.g. initial[i] = scale_factor * sin(PI/2.0/1024*(2*i+1));

Because you need both the SIN() and COS() starting points, you can get them both from the same table (which screams "dual-port memory!" to me).

You can then do a standard lookup to get the starting points, putting you 9 cycles into the CORDIC:

    /* Use dual-port memory for this */
    if (quadrant & 1) {
        x = initial[index];
        y = initial[TABLE_SIZE-1-index];
    } else {
        x = initial[TABLE_SIZE-1-index];
        y = initial[index];
    }

    /* Subtract half the sector angle from Z */
    z -= 1 << (CORDIC_BITS-1);

    /* Now do standard CORDIC, with a lot of work already done */
    ...

This removes ~8 cycles of latency.

The end result
==============

If you combine both of these you can get rid of the angles[] table completely - it is now a constant.

    /* Use dual-port memory for this */
    if (quadrant & 1) {
        x = initial[index];
        y = initial[TABLE_SIZE-1-index];
    } else {
        x = initial[TABLE_SIZE-1-index];
        y = initial[index];
    }

    /* Subtract half the sector angle from Z */
    z -= 1 << (CORDIC_BITS-1);

    /* Now do standard CORDIC, with a lot of work already done,
       so fewer repetitions are needed for the same accuracy */
    for (i = 0; i < CORDIC_REPS; i++) {
        int64_t tx, ty;
        // Add rounding and divide to scale the current vector
        tx = x >> (INDEX_BITS+i);
        ty = y >> (INDEX_BITS+i);
        // Either add or subtract at right angles
        x -= (z > 0) ? ty : -ty;
        y += (z > 0) ? tx : -tx;
        z -= (z > 0) ? ANGLE_CONSTANT : -ANGLE_CONSTANT;
        z <<= 1;
    }

Advantages of this method
=========================

If you have fully unrolled it to generate a full value per cycle, you end up with:

- 1 BRAM block used (bad)
- 9 fewer CORDIC stages (good)
- 8 or 9 cycles less latency (good)

For 16-bit values this may only need 5 stages, rather than 14.

If you are trying to minimize area, generating an n-bit value every ~n cycles, you end up with:

- 1 BRAM block used (bad)
- 8 or 9 cycles less latency (good)
- no need for the angles[] table (good)
- fewer levels of logic, for a faster FMAX (good)

For 16-bit values, this could double the number of calculations you can compute at a given clock rate.

You can also tune things somewhat - you can always throw more BRAM blocks at it to reduce the number of CORDIC stages/iterations required, if you have blocks to spare - but one block to remove 9 stages is pretty good.

What do you think?

Guest

Wed Sep 05, 2018 12:45 pm

On 05.09.2018 8:40, Mike Field wrote:

On Wednesday, 5 September 2018 02:48:41 UTC+12, kkoorndyk wrote:

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:

I think I've got a really good way to improve a commonly used & well established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds the cycles and using 2/3rds the resources of an standard Xilinx IP block, with comparable timing).

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find anything similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm, and the optimization is explained it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike

Licensing and selling IP comes with a bit of a learning curve and requires an investment on your part. As Michael mentions, without some of that framework already in place, a license vetted by an IP attorney, and a good marketing plan, you might not see a return on that investment.

If you want your name more prominently attached to it, I'd suggest posting up on a personal Github account rather than opencores.org which makes you conform to their requirements (such as wishbone interface, etc.).

Xilinx always welcomes guest articles on their blogs (although those have been in flux since the recent reorg), and their e-magazine Xcell Journal (again, seems to have been discontinued and the Xcell Daily Blog archived)

https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive

https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog

https://www.xilinx.com/about/xcell-publications/xcell-journal.html

--Kris

I never though I would agree with Rick, but....

All sounds like too much work. So here is a quick summary with C-like pseudo-code. I'll put the HDL code up somewhere soon once I am happy with it. I am removing the last rounding errors.

I've been playing with CORDIC, and have come up with what looks to be an overlooked optimization. I've done a bit of googling, and haven't found anything - maybe it is a novel approach?

I've tested it with 32-bit inputs and outputs, and it is within +/-2, and and average error of around 0.6.I a am sure with a bit more analysis of where the errors are coming from I can get it more accurate.

This has two parts to it, both by themselves seem quite trivial, but complement each other quite nicely.

Scaling Z

---------

1. The 'z' value in CORDIC uses becomes smaller and smaller as stages increase:

The core of CORDIC for SIN() and COS() is:

x = INITIAL;

y = INITIAL;

for(i = 0; i < CORDIC_REPS; i++ ) {

int64_t tx,ty;

// divide to scale the current vector

tx = x >> (i+1);

ty = y >> (i+1);

// Either add or subtract at right angles to the current

x -= (z > 0 ? ty : -ty);

y += (z > 0 ? tx : -tx);

z -= (z > 0 ? angles[i] : -angles[i]);

}

The value for angle[] is all important, for example:

angle[0] = 1238021

angle[1] = 654136

angle[2] = 332050

angle[3] = 166670

angle[4] = 83415

angle[5] = 41718

angle[6] = 20860

angle[7] = 10430

angle[8] = 5215

angle[9] = 2607

angle[10] = 1303

angle[11] = 652

angle[12] = 326

angle[13] = 163

angle[14] = 81

angle[15] = 41

angle[16] = 20

angle[17] = 10

angle[18] = 5

angle[19] = 3

angle[20] = 1

If you make the following change:

for(i = 0; i < CORDIC_REPS; i++ ) {

int64_t tx,ty;

// divide to scale the current vector

tx = x >> (i+1);

ty = y >> (i+1);

// Either add or subtract at right angles

x -= (z > 0 ? ty : -ty);

y += (z > 0 ? tx : -tx);

z -= (z > 0 ? angles[i] : -angles[i]);

//!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

z <<= 1; // Double the result of 'z'

//!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

}

Then you can use all the bits in angle[], because you can scale by 2^i (this is data from a different set of parameters, hence the values and count is different):

angle[0] = 1238021

angle[1] = 1308273

angle[2] = 1328199

angle[3] = 1333354

angle[4] = 1334654

angle[5] = 1334980

angle[6] = 1335061

angle[7] = 1335082

angle[8] = 1335087

angle[9] = 1335088

angle[10] = 1335088

angle[11] = 1335088

angle[12] = 1335088

angle[13] = 1335088

angle[14] = 1335088

angle[15] = 1335088

angle[16] = 1335088

angle[17] = 1335088

angle[18] = 1335088

angle[19] = 1335088

angle[20] = 1335088

angle[21] = 1335088

angle[22] = 1335088

angle[23] = 1335088

angle[24] = 1335088

angle[25] = 1335088

angle[26] = 1335088

angle[27] = 1335088

angle[28] = 1335088

angle[29] = 1335088

...and angle[i] rapidly becomes a constant value after the first 9 or 10 iterations. This is what you would expect, as the angle gets smaller and smaller.

Part 2: Add a lookup table

==========================

If you split the input into:

[2 MSB] quadrant

[next 9 bits] an lookup table index

[the rest] The starting CORDIC Z value, offset by 1<<(num_of_bits-1)

And have a lookup table of 512 x 36-bit values (i.e. a block RAM), which hold the SIN/COS values at the center of the range = e.g. initial[i] = scale_factor * sin(PI/2.0/1024*(2*i+1));

Because you need both the SIN() and COS() starting point, you can get them from the same table (screaming out "dual port memory!" to me)

You can then do a standard lookup to get the starting points, 9 cycles into the CORDIC:

/* Use Dual Port memory for this */

if(quadrant & 1) {

x = initial[index];

y = initial[TABLE_SIZE-1-index];

} else {

x = initial[TABLE_SIZE-1-index];

y = initial[index];

}

/* Subtract half the sector angle from Z */

z -= 1 << (CORDIC_BITS-1);

/* Now do standard CORDIC, with a lot of work already done */

...

This removes ~8 cycles of latency.

The end result

==============

If you combine both of these you can get rid of the angles[] table completely - it is now a constant.

/* Use Dual Port memory for this */

if(quadrant & 1) {

x = initial[index];

y = initial[TABLE_SIZE-1-index];

} else {

x = initial[TABLE_SIZE-1-index];

y = initial[index];

}

/* Subtract half the sector angle from Z */

z -= 1 << (CORDIC_BITS-1);

/* Now do standard CORDIC, with a lot of work already done,

so less repetitions are needed for the same accuracy */

for(i = 0; i < CORDIC_REPS; i++ ) {

int64_t tx,ty;

// Add rounding and divide to scale the current vector

tx = x >> (INDEX_BITS+i);

ty = y >> (INDEX_BITS+i);

// Either add or subtract at right angles

x -= (z > 0 ? ty : -ty);

y += (z > 0 ? tx : -tx);

z -= (z > 0 ? ANGLE_CONSTANT : -ANGLE_CONSTANT);

z <<= 1;

}

Advantages of this method

=========================

If you have fully unrolled it to generate a full value per cycle, you end up with:

- 1 BRAM block used (bad)

- 9 less CORDIC stages (good)

- 8 or 9 cycles less latency (good)

For 16-bit values this may only need 5 stages, rather than 14.

If you are trying to minimize area, generating an n-bit value every ~n cycles you end up with:

- 1 BRAM block used (bad)

- 8 or 9 cycles less latency (good)

- no need for the angles[] table (good)

- Less levels of logic, for faster FMAX (good)

For 16-bit values, this could double the number of calculations you can compute at a given clock rate.

You can also tune things some what - you can always throw more BRAM blocks at it to reduce the number of CORDIC stages/iterations required, if you have blocks to spare - but one block to remove 9 stages is pretty good.

What do you think?

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:

I think I've got a really good way to improve a commonly used & well established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in 2/3rds the cycles and using 2/3rds the resources of an standard Xilinx IP block, with comparable timing).

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find anything similar designs in a Google patent search, or looking through journal articles. Once you are familiar with the original algorithm, and the optimization is explained it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike

Licensing and selling IP comes with a bit of a learning curve and requires an investment on your part. As Michael mentions, without some of that framework already in place, a license vetted by an IP attorney, and a good marketing plan, you might not see a return on that investment.

If you want your name more prominently attached to it, I'd suggest posting up on a personal Github account rather than opencores.org which makes you conform to their requirements (such as wishbone interface, etc.).

Xilinx always welcomes guest articles on their blogs (although those have been in flux since the recent reorg), and their e-magazine Xcell Journal (again, seems to have been discontinued and the Xcell Daily Blog archived)

https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive

https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog

https://www.xilinx.com/about/xcell-publications/xcell-journal.html

--Kris

I never thought I would agree with Rick, but....

All sounds like too much work. So here is a quick summary with C-like pseudo-code. I'll put the HDL code up somewhere soon once I am happy with it. I am removing the last rounding errors.

I've been playing with CORDIC, and have come up with what looks to be an overlooked optimization. I've done a bit of googling, and haven't found anything - maybe it is a novel approach?

I've tested it with 32-bit inputs and outputs, and it is within +/-2, with an average error of around 0.6. I am sure that with a bit more analysis of where the errors are coming from I can make it more accurate.

This has two parts; each seems quite trivial by itself, but they complement each other quite nicely.

Scaling Z

---------

1. The 'z' value CORDIC uses becomes smaller and smaller as the stages increase:

The core of CORDIC for SIN() and COS() is:

x = INITIAL;

y = INITIAL;

for(i = 0; i < CORDIC_REPS; i++ ) {

int64_t tx,ty;

// divide to scale the current vector

tx = x >> (i+1);

ty = y >> (i+1);

// Either add or subtract at right angles to the current

x -= (z > 0 ? ty : -ty);

y += (z > 0 ? tx : -tx);

z -= (z > 0 ? angles[i] : -angles[i]);

}

The values in angle[] are all-important, for example:

angle[0] = 1238021

angle[1] = 654136

angle[2] = 332050

angle[3] = 166670

angle[4] = 83415

angle[5] = 41718

angle[6] = 20860

angle[7] = 10430

angle[8] = 5215

angle[9] = 2607

angle[10] = 1303

angle[11] = 652

angle[12] = 326

angle[13] = 163

angle[14] = 81

angle[15] = 41

angle[16] = 20

angle[17] = 10

angle[18] = 5

angle[19] = 3

angle[20] = 1

If you make the following change:

for(i = 0; i < CORDIC_REPS; i++ ) {

int64_t tx,ty;

// divide to scale the current vector

tx = x >> (i+1);

ty = y >> (i+1);

// Either add or subtract at right angles

x -= (z > 0 ? ty : -ty);

y += (z > 0 ? tx : -tx);

z -= (z > 0 ? angles[i] : -angles[i]);

//!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

z <<= 1; // Double the result of 'z'

//!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

}

Then you can use all the bits in angle[], because you can scale by 2^i (this is data from a different set of parameters, hence the values and count are different):

angle[0] = 1238021

angle[1] = 1308273

angle[2] = 1328199

angle[3] = 1333354

angle[4] = 1334654

angle[5] = 1334980

angle[6] = 1335061

angle[7] = 1335082

angle[8] = 1335087

angle[9] = 1335088

angle[10] = 1335088

angle[11] = 1335088

angle[12] = 1335088

angle[13] = 1335088

angle[14] = 1335088

angle[15] = 1335088

angle[16] = 1335088

angle[17] = 1335088

angle[18] = 1335088

angle[19] = 1335088

angle[20] = 1335088

angle[21] = 1335088

angle[22] = 1335088

angle[23] = 1335088

angle[24] = 1335088

angle[25] = 1335088

angle[26] = 1335088

angle[27] = 1335088

angle[28] = 1335088

angle[29] = 1335088

...and angle[i] rapidly becomes a constant value after the first 9 or 10 iterations. This is what you would expect, as the angle gets smaller and smaller.

Part 2: Add a lookup table

==========================

If you split the input into:

[2 MSB] quadrant

[next 9 bits] a lookup table index

[the rest] The starting CORDIC Z value, offset by 1<<(num_of_bits-1)

And have a lookup table of 512 x 36-bit values (i.e. a block RAM), which holds the SIN/COS values at the center of each range - e.g. initial[i] = scale_factor * sin(PI/2.0/1024*(2*i+1));

Because you need both the SIN() and COS() starting points, you can get them from the same table (which screams "dual port memory!" to me)

You can then do a standard lookup to get the starting points, 9 cycles into the CORDIC:

/* Use Dual Port memory for this */

if(quadrant & 1) {

x = initial[index];

y = initial[TABLE_SIZE-1-index];

} else {

x = initial[TABLE_SIZE-1-index];

y = initial[index];

}

/* Subtract half the sector angle from Z */

z -= 1 << (CORDIC_BITS-1);

/* Now do standard CORDIC, with a lot of work already done */

...

This removes ~8 cycles of latency.

The end result

==============

If you combine both of these you can get rid of the angles[] table completely - it is now a constant.

/* Use Dual Port memory for this */

if(quadrant & 1) {

x = initial[index];

y = initial[TABLE_SIZE-1-index];

} else {

x = initial[TABLE_SIZE-1-index];

y = initial[index];

}

/* Subtract half the sector angle from Z */

z -= 1 << (CORDIC_BITS-1);

/* Now do standard CORDIC, with a lot of work already done,

so fewer iterations are needed for the same accuracy */

for(i = 0; i < CORDIC_REPS; i++ ) {

int64_t tx,ty;

// Add rounding and divide to scale the current vector

tx = x >> (INDEX_BITS+i);

ty = y >> (INDEX_BITS+i);

// Either add or subtract at right angles

x -= (z > 0 ? ty : -ty);

y += (z > 0 ? tx : -tx);

z -= (z > 0 ? ANGLE_CONSTANT : -ANGLE_CONSTANT);

z <<= 1;

}

Advantages of this method

=========================

If you fully unroll it to generate a value every cycle, you end up with:

- 1 BRAM block used (bad)

- 9 fewer CORDIC stages (good)

- 8 or 9 cycles less latency (good)

For 16-bit values this may only need 5 stages, rather than 14.

If you are trying to minimize area, generating an n-bit value every ~n cycles, you end up with:

- 1 BRAM block used (bad)

- 8 or 9 cycles less latency (good)

- no need for the angles[] table (good)

- Fewer levels of logic, for a faster FMAX (good)

For 16-bit values, this could double the number of calculations you can compute at a given clock rate.

You can also tune things somewhat - you can always throw more BRAM blocks at it to reduce the number of CORDIC stages/iterations required, if you have blocks to spare - but one block to remove 9 stages is pretty good.

What do you think?

As far as your "revised" angle[i] converging to a constant is concerned, there's a simple explanation using the first two terms of the Taylor series for the arctan function:

arctan(x) = x - 1/3 * x^3 + o(x^3)

so that

angle[i] = arctan(2^-i) / pi * 2^i = (1/pi) * (1 - 1/3 * 2^-(2*i)) + o(2^-(2*i))

Based on this you can easily say how many stages of the conventional CORDIC algorithm you need to skip (i.e. store the outputs in a lookup table) for a given bit precision.

I don't know the literature well, but I think it would be cool if you

actually write an article detailing your approach!

Gene

Guest

Sat Sep 08, 2018 12:45 am

earlier, I wrote:

If perchance this is related to your recent CORDIC rotator code,

I've seen a number of CORDIC optimization schemes over the years

to reduce the number of rotation stages, IIRC typically either

by a 'jump start' or merging/optimizing rotation stages.

oops, for some reason, when first reading this thread I didn't see the later posts with the explanation... I'd swear they weren't there, but maybe I was just scroll-impaired.

-Brian

Guest

Sat Sep 08, 2018 12:45 am

On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:

The implementation completes the same tasks in 2/3rds the cycles and using 2/3rds the resources of a standard Xilinx IP block, with comparable timing.

If perchance this is related to your recent CORDIC rotator code, I've seen a number of CORDIC optimization schemes over the years to reduce the number of rotation stages, IIRC typically either by a 'jump start' or by merging/optimizing rotation stages.

If I ever manage to find my folder of CORDIC papers, I'll post some links...

-------

some notes on your CORDIC implementation

http://hamsterworks.co.nz/mediawiki/index.php/CORDIC

- instead of quadrant folding, given I & Q you can do octant folding (0-45 degrees) using the top three bits of the phase

- if you pass in the bit widths and stages as generics, you can initialize the constant arctan table on-the-fly in a function using VHDL reals

-Brian

Guest

Mon Sep 10, 2018 4:45 am

On Wednesday, 5 September 2018 23:43:28 UTC+12, Gene Filatov wrote:

I don't know the literature well, but I think it would be cool if you

actually write an article detailing your approach!

Gene

I'll work on writing one up over the next few days, as well as posting a sample implementation.

Mike

Guest

Thu Oct 11, 2018 3:45 am

> What do you think?

That's good stuff. I wonder why you think using a BRAM is bad? It's good to use the hard cores in an FPGA instead of the fabric--it's less power and deterministic routing times.

Is the CORDIC still advantageous in a modern FPGA? The last time I needed to find sine, I used a coarse BRAM lookup that output sine on one port and cos on another. Those were used as derivatives for a 2nd-order Taylor. Two multipliers (more hard cores) are used (using Horner's Rule) for the 1st and 2nd-order interpolations. I don't remember how many digits of accuracy this yields, but the latency is low.

Guest

Thu Oct 11, 2018 2:45 pm

On Wednesday, October 10, 2018 at 9:52:06 PM UTC-4, Kevin Neilson wrote:

What do you think?

That's good stuff. I wonder why you think using a BRAM is bad? It's good to use the hard cores in an FPGA instead of the fabric--it's less power and deterministic routing times.

Is the CORDIC still advantageous in a modern FPGA? The last time I needed to find sine, I used a coarse BRAM lookup that output sine on one port and cos on another. Those were used as derivatives for a 2nd-order Taylor. Two multipliers (more hard cores) are used (using Horner's Rule) for the 1st and 2nd-order interpolations. I don't remember how many digits of accuracy this yields, but the latency is low.


Cordic is useful to compute high-precision atan2. Otherwise, for two 16-bit inputs, you'd need a RAM with 2^32 addresses (maybe 16 times fewer if you take advantage of the symmetries).

Guest

Fri Oct 12, 2018 2:45 am

> Cordic is useful to compute high-precision atan2. Otherwise for 2 16-bit inputs, you'd need a ram with 2^32 addresses (maybe 16 times less if you take advantage of the symmetries).

I think you missed the part about the Taylor interpolation.
