Division Algorithms...

gnuarm.del...@gmail.com

I am looking for an algorithm to calculate a floating point divide. There are a number of options, but the one that is clearest to me and easiest to implement on the hardware I am designing seems to be the Newton-Raphson iterative method. I'm trying to understand how many iterations it will take. The formula in the Wikipedia article assumes an initial estimate for the reciprocal of X0 = 48/17 - 32/17 * D, with the divisor D scaled into [0.5, 1). From that seed it takes 2 iterations to reach 16 bits of accuracy, or 3 iterations for the 24-bit mantissa of an IEEE 754 single precision number. My accuracy requirement falls between those two, but I would like to minimize the number of iterations... ideally one.
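
To make the iteration count concrete, here is a rough Python model of the Newton-Raphson reciprocal iteration X(i+1) = X(i) * (2 - D * X(i)) starting from that linear seed. It is only a double-precision sketch to count bits, not the hardware:

```python
# Rough numerical check of Newton-Raphson reciprocal convergence from the
# linear seed X0 = 48/17 - 32/17 * D, with D scaled into [0.5, 1).
# Double precision stands in for "exact" arithmetic here.
import math
import random

def nr_reciprocal(d, iterations):
    x = 48.0 / 17.0 - 32.0 / 17.0 * d        # linear initial estimate
    for _ in range(iterations):
        x = x * (2.0 - d * x)                # one Newton-Raphson step
    return x

for iterations in range(4):
    worst = 0.0
    for _ in range(100_000):
        d = random.uniform(0.5, 1.0)
        worst = max(worst, abs(nr_reciprocal(d, iterations) * d - 1.0))
    print(f"{iterations} iteration(s): about {-math.log2(worst):.1f} good bits")
```

Run that way it reports roughly 4 good bits from the seed alone, with the count doubling each pass (about 8, 16, and then 30+ bits), which lines up with the 2- and 3-iteration figures above.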

I believe most implementations use a table lookup for the seed value of the reciprocal of the divisor. In the FPGA I am using, a convenient size would be 2k entries (11-bit address) of 18 bits each. I'm wondering whether that is enough more accurate than the estimate above for one iteration to be sufficient, but I'm not clear on how to calculate it. I know the iteration converges quadratically, with the relative error roughly squared on each pass, so an initial value good to about 11 bits should give more than 20 bits after one turn of the Newton-Raphson crank (20 bits being the level of accuracy considered "good enough" for this application). But I'd like something that verifies this rather than relying on a rule of thumb; at some point I will probably have to justify the correctness of the calculations.
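
For the verification, an exhaustive sweep over the table is small enough to just run. The sketch below assumes a particular table construction (divisor mantissa in [1, 2), entry = reciprocal of the interval midpoint rounded to 18 bits) and does the NR step in doubles, so it bounds the math rather than the exact hardware behavior:

```python
# Worst-case error of an assumed 2k x 18 reciprocal seed table, before and
# after one Newton-Raphson step.  The arithmetic here is double precision,
# so this checks the method, not the 18/36-bit datapath rounding.
import math

ADDR_BITS = 11           # 2k entries
ENTRY_BITS = 18          # stored seed width (fraction bits of a value in (0.5, 1))

worst_seed = 0.0
worst_one_pass = 0.0

for index in range(1 << ADDR_BITS):
    # Divisor interval covered by this table slot, mantissa in [1, 2).
    d_lo = 1.0 + index / (1 << ADDR_BITS)
    d_hi = 1.0 + (index + 1) / (1 << ADDR_BITS)
    # Seed = reciprocal of the interval midpoint, rounded to 18 fraction bits.
    x0 = round((2.0 / (d_lo + d_hi)) * (1 << ENTRY_BITS)) / (1 << ENTRY_BITS)
    # The relative error |x0*d - 1| is largest at one of the interval ends.
    for d in (d_lo, d_hi):
        e0 = abs(x0 * d - 1.0)
        x1 = x0 * (2.0 - d * x0)             # one Newton-Raphson step
        e1 = abs(x1 * d - 1.0)
        worst_seed = max(worst_seed, e0)
        worst_one_pass = max(worst_one_pass, e1)

print(f"seed good to about {-math.log2(worst_seed):.1f} bits")
print(f"after one NR pass: about {-math.log2(worst_one_pass):.1f} bits")
```

With those assumptions it comes out to a seed good to about 12 bits and about 24 bits after one pass, which would clear the 20-bit target with some margin.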

Because of the hardware facilities available, the calculations have intermediate mantissas of 18 or 36 bits, with 28 bits stored when a value is in memory or on the stack; the exception is the top of stack (ToS), which is the ALU register and holds more than 36 bits. I don't think an 11-bit seed is quite as good as required, but I'm thinking one crank of the NR machine and I'm "good". However, I need to be able to show how good. Does anyone know of good documentation on this?
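
To show the effect of the limited word lengths, the same sweep can be repeated with the intermediates quantized. The widths below (18-bit divisor and seed, products held to 36 fraction bits) are only my guess at the datapath, not the actual ALU:

```python
# Same sweep as above, but with the NR pass done in assumed fixed-point
# widths: 18-bit divisor (17 fraction bits), 18-bit seed (18 fraction bits),
# products truncated to 36 fraction bits.  Guesses at the datapath, not a spec.
import math

ADDR_BITS = 11

worst = 0.0
for index in range(1 << ADDR_BITS):
    d_lo = 1.0 + index / (1 << ADDR_BITS)
    d_hi = 1.0 + (index + 1) / (1 << ADDR_BITS)
    x0_q = round((2.0 / (d_lo + d_hi)) * (1 << 18))      # 18-bit table entry
    for d in (d_lo, d_hi - 2.0 ** -17):                  # extreme divisors in the slot
        d_q = round(d * (1 << 17))                       # 18-bit divisor
        prod = d_q * x0_q                                # d*x0, 35 fraction bits
        two_minus = (2 << 35) - prod                     # 2 - d*x0, 35 fraction bits
        x1_q = (x0_q * two_minus) >> 17                  # x0*(2 - d*x0), kept to 36 fraction bits
        worst = max(worst, abs(x1_q / float(1 << 36) * d - 1.0))

print(f"after one fixed-point NR pass: about {-math.log2(worst):.1f} good bits")
```

At this precision the truncation costs essentially nothing (the result stays near 24 bits), but the same script can be made bit-exact once the real widths and rounding modes are plugged in.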

Oh, I guess I should mention that later on I might repeat this process for a square root calculation. A measurement using a sensor that is at least temporarily deprecated requires a square root. I managed to show that the sensor was not capable of providing the accuracy at the low end needed to produce meaningful results, but if they find a better sensor they may want to add it as an option, and the square root will be back. No! The HORROR!
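
If it does come back, the same trick extends to square root. The usual divide-free form iterates on the reciprocal square root and multiplies by the argument at the end; the sketch below is that variant with a 12-bit stand-in for a table seed, not anything taken from the actual design:

```python
# Square root via Newton-Raphson on the reciprocal square root:
#   y_{i+1} = y_i * (3 - A * y_i^2) / 2,   then sqrt(A) = A * y.
# The iteration has no divide, which is why it maps onto the same hardware.
# The "seed" here is just the exact value rounded to 12 bits, standing in
# for a table lookup of similar quality.
import math
import random

def nr_sqrt(a, iterations=1, seed_bits=12):
    y = round(a ** -0.5 * (1 << seed_bits)) / (1 << seed_bits)
    for _ in range(iterations):
        y = y * (3.0 - a * y * y) / 2.0      # divide-free NR step for 1/sqrt(a)
    return a * y                             # sqrt(a) = a * (1 / sqrt(a))

worst = 0.0
for _ in range(100_000):
    a = random.uniform(1.0, 4.0)             # mantissa range once the exponent is made even
    worst = max(worst, abs(nr_sqrt(a) / math.sqrt(a) - 1.0))
print(f"one pass from a 12-bit seed: about {-math.log2(worst):.1f} good bits")
```

The error roughly squares on each pass here as well (with an extra factor of 3/2), so the same seed-accuracy argument carries over.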

Whatever... It's just an algorithm...

--

Rick C.

- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
 
Kevin Neilson replied:

You don't mention throughput or latency requirements, just that you want as few iterations as possible. A resource I've used before is the Altera Advanced Synthesis Cookbook. You can find it online, and it has algorithms and source for both "approximate" floating-point divider and square-root modules. Both sacrifice a bit of precision for a low gate count. The divider is pipelined and has a latency of about 5 cycles. Maybe it could be helpful.
 
gnuarm.del...@gmail.com (Rick C.) replied:

I'm not looking for pipelined solutions because speed is not an issue. I don't want to bother with shift-and-subtract division, not because it is slow, but because each step in this design is controlled by a ROM, and while I don't expect the ROM size to be a worry, using 18 locations for every divide is something I would have to keep an eye on.

The floating point NR approach seems to work OK. After working through the lookup table in detail, it looks like one pass of the NR algorithm will be enough, about 5 steps in all. Not bad.
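
For the record, the step sequence I counted is roughly the one modeled below; this is a float-level Python sketch of the idea (seed lookup, one NR pass, multiply by the dividend, fix up the exponent and sign), not a trace of the actual ROM sequence:

```python
# Rough model of one floating point divide N / D with a 2k-entry reciprocal
# table and a single Newton-Raphson pass.  Exponents and signs are handled
# with ordinary floats here; the real design keeps them separately.
def fp_divide(n, d, addr_bits=11):
    assert d != 0.0
    m = abs(d)
    exp = 0
    while m >= 2.0:                          # normalize divisor mantissa into [1, 2)
        m /= 2.0
        exp += 1
    while m < 1.0:
        m *= 2.0
        exp -= 1
    # Step 1: seed lookup from the top mantissa bits (midpoint reciprocal).
    index = int((m - 1.0) * (1 << addr_bits))
    x = 2.0 / (2.0 + (2 * index + 1) / (1 << addr_bits))
    # Steps 2 and 3: one Newton-Raphson pass (two multiplies and a subtract).
    e = 2.0 - m * x
    x = x * e
    # Step 4: multiply by the dividend.  Step 5: restore exponent and sign.
    q = n * x / 2.0 ** exp
    return q if d > 0 else -q

print(fp_divide(355.0, 113.0), 355.0 / 113.0)  # should agree to ~7 significant digits
```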

--

Rick C.

+ Get 1,000 miles of free Supercharging
+ Tesla referral code - https://ts.la/richard11209
 