
Quad-Port BlockRAM in Virtex


Kevin Neilson
Guest

Tue Nov 08, 2016 6:43 pm   



Quote:
I realized that since I'm doing read-modify-writes, I don't even need the extra semaphore RAMs. Because I read each address two cycles before writing it, I can get the semaphores from the data RAMs. When I'm doing a write only, I can precede it with a dummy read to fetch the semaphores.


I added a diagram of the simplified R-M-W quad-port to that link. http://imgur.com/a/NhNr0
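
A minimal sketch of that scheduling idea, assuming a two-cycle BRAM read latency (module and signal names are illustrative, not from the actual design):

Code:
// Every operation is issued as a read first, so the semaphore bit stored
// in the data RAM arrives with the data; a pure write issues a dummy read
// whose payload is discarded. The write-back trails the read by two cycles.
module rmw_scheduler #(
  parameter AW = 9                    // address width (512-deep banks)
) (
  input  wire          clk,
  input  wire          req_valid,
  input  wire [AW-1:0] req_addr,
  output reg           rd_en,         // read strobe to the data banks
  output reg  [AW-1:0] rd_addr,
  output reg           wr_en,         // write strobe, two cycles later
  output reg  [AW-1:0] wr_addr
);
  reg          pend_d1, pend_d2;
  reg [AW-1:0] addr_d1, addr_d2;

  always @(posedge clk) begin
    // issue the (possibly dummy) read
    rd_en   <= req_valid;
    rd_addr <= req_addr;
    // delay line so the write-back lands two cycles after the read
    pend_d1 <= req_valid;  addr_d1 <= req_addr;
    pend_d2 <= pend_d1;    addr_d2 <= addr_d1;
    wr_en   <= pend_d2;    wr_addr <= addr_d2;
  end
endmodule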

Kevin Neilson
Guest

Tue Nov 08, 2016 7:13 pm   



On Saturday, November 5, 2016 at 11:35:20 AM UTC-6, Guy Lemieux wrote:
Quote:
There is a paper that describes your approach, published by my Ph.D. student Ameer Abdelhadi at FPGA2014. He has also extended it to include switched ports, where some ports can dynamically switch between read and write mode, in a follow-up at FCCM2016.

http://ece.ubc.ca/~ameer/publications.html

He has released the designs on GitHub under a permissive open source license.

https://github.com/AmeerAbdelhadi/Switched-Multiported-RAM

Guy


Thanks; I enjoyed looking through the papers. The idea of dynamically switching the write ports to reads is one I might need to use at some point.

The main difference in my diagram is that I implemented part of the I-LVT in the data RAMs. For example, for a 2W/2R memory, you show the I-LVT RAMs as being 1-write, 3-read. My I-LVTs are 1-write, 1-read, with the rest of the I-LVT done in the data RAMs. In my case, I need 69-bit-wide BRAMs, and the BRAMs are 72 bits wide, so I have 3 extra bits. I use one of those bits as the I-LVT ("semaphore") bit, so when I do a read, I don't have to access a separate I-LVT RAM.
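
A hedged sketch of that embedded-semaphore trick for one 2W bank pair (widths taken from this thread; names illustrative). The live bank for an address is the XOR of the two semaphore bits: writer 0 keeps that XOR at 0, writer 1 forces it to 1, each using the other bank's semaphore bit fetched by the earlier read:

Code:
module sem_bank_pair #(
  parameter AW = 9,
  parameter DW = 69
) (
  input  wire          clk,
  // write port 0 owns bank0; write port 1 owns bank1
  input  wire          we0, we1,
  input  wire [AW-1:0] wa0, wa1,
  input  wire [DW-1:0] wd0, wd1,
  input  wire          other_sem0,  // bank1's semaphore at wa0, pre-read
  input  wire          other_sem1,  // bank0's semaphore at wa1, pre-read
  // one shared read port (replicate the pair per read port for 2R)
  input  wire [AW-1:0] ra,
  output wire [DW-1:0] rd,
  output wire [1:0]    sems         // raw semaphores, for later feedback
);
  reg [DW:0] bank0 [0:(1<<AW)-1];   // {semaphore, 69 data bits}: 70 used
  reg [DW:0] bank1 [0:(1<<AW)-1];   // of the 72-bit physical BRAM line

  always @(posedge clk) begin
    // writer 0 makes sem0 ^ sem1 == 0; writer 1 makes it == 1
    if (we0) bank0[wa0] <= {1'b0 ^ other_sem0, wd0};
    if (we1) bank1[wa1] <= {1'b1 ^ other_sem1, wd1};
  end

  reg [DW:0] q0, q1;
  always @(posedge clk) begin
    q0 <= bank0[ra];
    q1 <= bank1[ra];
  end

  wire sel  = q0[DW] ^ q1[DW];          // extracted output selector
  assign rd   = sel ? q1[DW-1:0] : q0[DW-1:0];
  assign sems = {q1[DW], q0[DW]};
endmodule

For a full 2W/2R quad-port, the pair would be duplicated per read port, with each write port writing its own bank in both pairs.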

Ameer Abdelhadi
Guest

Tue Dec 13, 2016 11:37 pm   



On Tuesday, November 8, 2016 at 12:13:55 PM UTC-5, Kevin Neilson wrote:
Quote:
On Saturday, November 5, 2016 at 11:35:20 AM UTC-6, Guy Lemieux wrote: [paper and GitHub links, quoted in full above]

Thanks; I enjoyed looking through the papers. The idea of dynamically switching the write ports to reads is one I might need to use at some point.

The main difference in my diagram is that I implemented part of the I-LVT in the data RAMs. For example, for a 2W/2R memory, you show the I-LVT RAMs as being 1 write, 3 reads. My I-LVTs are 1 write, 1 read, with the rest of the I-LVT done in the data RAMs. In my case, I need 69-wide BRAMs, and the BRAMs are 72 bits wide, so I have an extra 3 bits. I use one of those bits as the I-LVT ("semaphore") bit. When I do a read, I don't have to access a separate I-LVT RAM.


Kevin, the method you mentioned is actually identical to the 2W/2R I-LVT (both binary-coded and thermometer-coded) from our FPGA2014 paper, with ONE modification:
You store the LVT contents in the data banks. When the data banks are read, these LVT bits are read along as metadata, and the output selectors are extracted from them (the XORs in your diagram). This does avoid replicating the LVT BRAMs; however, it incurs other *severe* problems:

1) Two additional cycles in the decision path!
The longest path of our I-LVT method passes through the LVT as follows:
1- reading the I-LVT feedbacks;
2- rewriting the I-LVT;
3- reading the I-LVT to generate the output mux selectors (through the output extraction function).
Even with these three cycles, our I-LVT required fairly complicated bypassing circuitry to deal with hazards as simple as Write-After-Write. (A sketch of this three-cycle path follows the list below.)
Your solution adds two more cycles to the selection path: one to rewrite the data banks with the I-LVT bits, and another to read these bits back (and then extract the selectors). Bypassing such a long decision path requires caching, which increases the BRAM overhead again.

In other words, the read mechanism of both methods is similar, but in your method the output mux selectors are read from the data banks instead of the LVT. Once a write happens, the output selectors see the change only after 5 cycles (LVT feedback read -> LVT rewrite -> LVT read -> data-bank write (selectors) -> data-bank read (selectors)), whereas ours requires only 3.

2) Modularity:
The additional bits can't accommodate bank selectors for every number of write ports. For instance, you mentioned 3 extra bits in each BRAM line; 3 bits can encode binary selectors for at most 2^3 = 8 write ports. With more than 8 write ports, the meta-data would have to spill into additional BRAMs, which increases BRAM consumption again.
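
For concreteness, here is a rough sketch of the three-cycle path in a standalone binary-coded I-LVT with 2 write ports (one selector bit per address; names illustrative). Note the hazard it creates: back-to-back writes to the same address read a stale feedback bit, which is what the bypass circuitry must catch:

Code:
// cycle 1: read the other bank's bit (feedback)
// cycle 2: rewrite own bank so that bit0 ^ bit1 equals the writer ID
// cycle 3: a read XORs both banks to recover the last-writer ID
module ilvt_2w #(
  parameter AW = 9
) (
  input  wire          clk,
  input  wire          we0, we1,
  input  wire [AW-1:0] wa0, wa1,
  input  wire [AW-1:0] ra,
  output reg           sel        // last writer of address ra (0 or 1)
);
  reg b0 [0:(1<<AW)-1];
  reg b1 [0:(1<<AW)-1];

  // cycle 1: feedback reads of the opposite bank
  reg fb0, fb1, we0_q, we1_q;
  reg [AW-1:0] wa0_q, wa1_q;
  always @(posedge clk) begin
    fb0 <= b1[wa0];  we0_q <= we0;  wa0_q <= wa0;
    fb1 <= b0[wa1];  we1_q <= we1;  wa1_q <= wa1;
  end

  // cycle 2: rewrite so the XOR extraction yields the writer ID
  always @(posedge clk) begin
    if (we0_q) b0[wa0_q] <= 1'b0 ^ fb0;
    if (we1_q) b1[wa1_q] <= 1'b1 ^ fb1;
  end

  // cycle 3: extraction read for the output mux selector
  always @(posedge clk)
    sel <= b0[ra] ^ b1[ra];
endmodule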

Anyhow, the I-LVT portion is minor compared to the data banks. For instance, in your diagram you are using 140Kbits for the data banks and only 2Kbits for the LVT. Our I-LVT requires only 2Kbits more (about +1.5%), yet it eliminates the need for the caching your solution requires.

Ameer
http://www.ece.ubc.ca/~ameer/

Ameer Abdelhadi
Guest

Fri Dec 16, 2016 11:43 pm   



On Tuesday, December 13, 2016 at 4:37:42 PM UTC-5, Ameer Abdelhadi wrote:
Quote:
Kevin, the method you mentioned is actually identical to the 2W/2R I-LVT ... [Ameer's full reply, quoted above]


BTW, our design is available online as an open-source library. It's modular, parametrized, optimized for high performance and low resource consumption, fully bypassed, and fully tested, with a run-in-batch manager for simulation and synthesis.

Just download the Verilog, add it to your project, instantiate the module, set your parameters (e.g. #reads, #writes, data width, RAM depth, bypassing...), and you're ready to go!
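
For example, an instantiation might look roughly like this. The module, parameter, and port names below are recalled from the repo and may not match exactly; check the README before use:

Code:
// Hedged sketch: a 2W/2R, 1K-deep, 72-bit-wide multi-ported RAM.
// Per-port addresses and data are concatenated into flat buses.
module quad_port_example (
  input  wire            clk, rst,
  input  wire [1:0]      WEnb,    // one write enable per write port
  input  wire [2*10-1:0] WAddr,   // two 10-bit write addresses
  input  wire [2*72-1:0] WData,   // two 72-bit write words
  input  wire [2*10-1:0] RAddr,   // two 10-bit read addresses
  output wire [2*72-1:0] RData    // two 72-bit read words
);
  mpram #(
    .MEMD   (1024  ),  // memory depth
    .DATAW  (72    ),  // data width
    .nRPORTS(2     ),  // number of read ports
    .nWPORTS(2     ),  // number of write ports
    .TYPE   ("AUTO"),  // implementation choice (e.g. LVT/XOR based)
    .BYP    ("WAW" )   // bypassing level, e.g. write-after-write
  ) quad_port_i (
    .clk  (clk  ),
    .rst  (rst  ),
    .WEnb (WEnb ),
    .WAddr(WAddr),
    .WData(WData),
    .RAddr(RAddr),
    .RData(RData)
  );
endmodule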

Open source libraries:
http://www.ece.ubc.ca/~ameer/opensource.html
https://github.com/AmeerAbdelhadi/

BRAM-based Multi-ported RAM from FPGA'14:
https://github.com/AmeerAbdelhadi/Multiported-RAM
Paper: http://www.ece.ubc.ca/~ameer/publications/Abdelhadi-Conference-2014Feb-FPGA2014-MultiportedRAM.pdf
Slides: http://www.ece.ubc.ca/~ameer/publications/Abdelhadi-Talk-2014Feb-FPGA2014-MultiportedRAM.pdf

Enjoy!

Kevin Neilson
Guest

Wed Dec 21, 2016 2:09 am   



Quote:
Kevin, the method you mentioned is actually identical to the 2W/2R I-LVT (both binary-coded and thermometer-coded) from our FPGA2014 paper, with ONE modification:
You store the LVT contents in the data banks. When the data banks are read, these LVT bits are read along as metadata, and the output selectors are extracted from them (the XORs in your diagram). This does avoid replicating the LVT BRAMs; however, it incurs other *severe* problems:


Ameer,
Thanks for the response. Yes, there may be some latency disadvantages in my approach. For the cache I need for the bypass logic, I use a Xilinx dynamic SRL. It's the same size and speed whether the cache depth is 2 or 32, so making the cache deeper doesn't cost much. (There is more address-comparison logic, though.)
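
A sketch of such a dynamic-SRL cache (illustrative, not the actual design). Xilinx tools map a reset-less shift register with a dynamic read tap onto SRLC32E primitives, which is why depth 2 and depth 32 cost the same LUTs per bit; the per-entry address comparators for hit detection live outside this block, and those do grow with depth:

Code:
module srl_cache #(
  parameter AW    = 9,
  parameter DW    = 72,
  parameter DEPTH = 32           // up to 32 for a single SRL level
) (
  input  wire                     clk,
  input  wire                     push,      // record a recent write
  input  wire [AW-1:0]            push_addr,
  input  wire [DW-1:0]            push_data,
  input  wire [$clog2(DEPTH)-1:0] tap,       // dynamic read position
  output wire [AW-1:0]            tap_addr,  // compare vs. read address
  output wire [DW-1:0]            tap_data
);
  // reset-less addressable shift register: infers SRLC32E on Xilinx
  reg [AW+DW-1:0] srl [0:DEPTH-1];
  integer i;
  always @(posedge clk)
    if (push) begin
      for (i = DEPTH-1; i > 0; i = i-1)
        srl[i] <= srl[i-1];
      srl[0] <= {push_addr, push_data};
    end
  assign {tap_addr, tap_data} = srl[tap];
endmodule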

As for the memory usage, it just depends on what BRAM width you need. If you need a 512-deep by 64-bit-wide BRAM, you have to use a Xilinx simple-dual-port BRAM with a width of 72, so 8 bits of each location are "wasted" and can be used for I-LVT flags. But if you need a 72-bit-wide BRAM for the data, there is no advantage in trying to combine the data and the flags. In my case I happened to need 69 bits and had 3 left over.

I finished the design that uses the quad-port; it's working well and has simplified my algorithm significantly. My clock runs at 360 MHz, which is too fast to use a 2x clock to time-slice the BRAMs, but the I-LVT design works just fine.
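
For readers wondering what that rejected 2x-clock alternative looks like, a rough sketch (illustrative names): the BRAM runs at twice the system clock and serves two ports on alternating fast cycles, which at a 360 MHz system clock would require 720 MHz operation:

Code:
module double_pumped_ram #(
  parameter AW = 9,
  parameter DW = 72
) (
  input  wire          clk2x,      // 2x clock, phase-aligned to the 1x clock
  input  wire          phase,      // 0: port A's slot, 1: port B's slot
  input  wire          we_a, we_b,
  input  wire [AW-1:0] addr_a, addr_b,
  input  wire [DW-1:0] wd_a, wd_b,
  output reg  [DW-1:0] rd_a, rd_b
);
  reg [DW-1:0] mem [0:(1<<AW)-1];
  reg [DW-1:0] q;
  // time-slice: each fast cycle serves one of the two logical ports
  wire          we   = phase ? we_b   : we_a;
  wire [AW-1:0] addr = phase ? addr_b : addr_a;
  wire [DW-1:0] wd   = phase ? wd_b   : wd_a;

  always @(posedge clk2x) begin
    if (we) mem[addr] <= wd;
    q <= mem[addr];
    // capture each port's result one fast cycle after its slot
    if (phase) rd_a <= q;   // port A's read was issued in slot 0
    else       rd_b <= q;   // port B's read was issued in slot 1
  end
endmodule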
Kevin
