
Quad-Port BlockRAM in Virtex


Kevin Neilson
Guest

Fri Oct 23, 2015 9:40 pm   



I think I need a quad-port blockRAM in a Xilinx V7. Having multiple read ports is no problem, but I need two read ports and two write ports. The two write ports are the problem. I can't double the clock speed. To be clear, I need to be able to do two reads and two writes per cycle. (Not writes to the same address.)

The only idea I could come up with is to have four dual-port BRAMs and a semaphore array. Let's call the BRAMs AC, AD, BC, and BD. Writer A writes the same value to address x in AC and AD and simultaneously sets the semaphore of address x to point to 'A'. Now when reader C wants to read address x, it reads AC and BC and the semaphore, sees that semaphore points toward the A side, and uses the value from AC and discards BC. If writer B writes to address x, it writes the value to both BC and BD and sets the semaphore x to point to side B. Reader D reads AD and BD and picks one based on the semaphore bit.

The semaphore itself is complicated. I think it would consist of 2 quad-port RAMs, one bit wide and the depth of AC, each one having 1 write and 3 read ports. This could be distributed RAM. Writer A would read the side B semaphore bit and set its own to the same, and writer B would read the side A bit and set its own to the opposite. Now when reader C or D reads its two copies (A/B) of the semaphore bits using its read ports, it checks whether they are the same (use side A) or opposite (use side B).
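A minimal behavioral sketch of the scheme described above, in Python rather than HDL. It only models which copy a reader selects; it does not model BRAM timing, read latency, or same-cycle collisions. The class and method names are made up for illustration.

```python
class QuadPortRam:
    """Behavioral model: 2 write ports (A, B) and 2 read ports (C, D)."""

    def __init__(self, depth):
        # Four data banks, each named after the writer/reader pair it serves.
        self.bank = {name: [0] * depth for name in ("AC", "AD", "BC", "BD")}
        # One semaphore bit per address for each writer side.
        self.sem_a = [0] * depth
        self.sem_b = [0] * depth

    def write_a(self, addr, value):
        self.bank["AC"][addr] = value
        self.bank["AD"][addr] = value
        # Writer A copies side B's bit: equal bits mean "side A is newest".
        self.sem_a[addr] = self.sem_b[addr]

    def write_b(self, addr, value):
        self.bank["BC"][addr] = value
        self.bank["BD"][addr] = value
        # Writer B inverts side A's bit: differing bits mean "side B is newest".
        self.sem_b[addr] = 1 - self.sem_a[addr]

    def _read(self, addr, reader):
        side = "A" if self.sem_a[addr] == self.sem_b[addr] else "B"
        return self.bank[side + reader][addr]

    def read_c(self, addr):
        return self._read(addr, "C")

    def read_d(self, addr):
        return self._read(addr, "D")


ram = QuadPortRam(2048)
ram.write_a(5, 11)
ram.write_b(5, 33)      # later write from the other side wins
print(ram.read_c(5), ram.read_d(5))
```

Each writer only ever writes its own semaphore array, which is why every array needs just one write port plus three read ports (the other writer and the two readers).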

It's a big mess and uses 4x as many BRAMs as a dual-port RAM. Maybe I need a different solution.

Kevin Neilson
Guest

Fri Oct 23, 2015 10:10 pm   



Update: I found a solution in the "Altera Synthesis Cookbook" and it seems to be the scheme I described above, but implementing the semaphore bits as FFs instead of distributed RAM. I'd need about 2048 semaphore bits, so implementing that in a distributed RAM would probably be advantageous. You can do a 64-bit quad port (1 wr, 3 rd) in a 4-LUT slice, so I'd need 2048/64*4*2 = 256 LUTs to do 2 2048-bit quad-port distributed RAMs. (Add in ~10 slices for 32->1 muxes.)
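The LUT estimate above can be checked with a few lines of arithmetic. The constants are the figures quoted in the post; the variable names are only for illustration.

```python
SEMAPHORE_BITS = 2048       # semaphore depth quoted in the post
BITS_PER_QUAD_PORT = 64     # one 64x1 quad-port distributed RAM (1 wr, 3 rd)
LUTS_PER_QUAD_PORT = 4      # the post's figure: one 4-LUT slice per 64 bits
NUM_SEMAPHORE_RAMS = 2      # one semaphore array per writer side

primitives_per_ram = SEMAPHORE_BITS // BITS_PER_QUAD_PORT
total_luts = primitives_per_ram * LUTS_PER_QUAD_PORT * NUM_SEMAPHORE_RAMS

print(primitives_per_ram)   # 32 primitives per array, hence the 32->1 muxes
print(total_luts)           # 256
```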

Kevin Neilson
Guest

Tue Nov 01, 2016 10:16 pm   



On Friday, October 23, 2015 at 2:10:20 PM UTC-6, Kevin Neilson wrote:
> Update: I found a solution in the "Altera Synthesis Cookbook" and it seems to be the scheme I described above, but implementing the semaphore bits as FFs instead of distributed RAM. I'd need about 2048 semaphore bits, so implementing that in a distributed RAM would probably be advantageous. You can do a 64-bit quad port (1 wr, 3 rd) in a 4-LUT slice, so I'd need 2048/64*4*2 = 256 LUTs to do 2 2048-bit quad-port distributed RAMs. (Add in ~10 slices for 32->1 muxes.)

Update 2: I came up with a better solution than the Altera Cookbook. The semaphore bits are stored partly in a separate blockRAM and partly in the main data blockRAMs. Then there is very little logic out in the fabric--just the muxes for the two read ports. Too bad there isn't an app note on this.

Evgeny Filatov
Guest

Thu Nov 03, 2016 4:17 pm   



On 01.11.2016 23:16, Kevin Neilson wrote:
Quote:
On Friday, October 23, 2015 at 2:10:20 PM UTC-6, Kevin Neilson wrote:
Update: I found a solution in the "Altera Synthesis Cookbook" and it seems to be the scheme I described above, but implementing the semaphore bits as FFs instead of distributed RAM. I'd need about 2048 semaphore bits, so implementing that in a distributed RAM would probably be advantageous. You can do a 64-bit quad port (1 wr, 3 rd) in a 4-LUT slice, so I'd need 2048/64*4*2 = 256 LUTs to do 2 2048-bit quad-port distributed RAMs. (Add in ~10 slices for 32->1 muxes.)

Update 2: I came up with a better solution than the Altera Cookbook. The semaphore bits are stored partly in a separate blockRAM and partly in the main data blockRAMs. Then there is very little logic out in the fabric--just the muxes for the two read ports. Too bad there isn't an app note on this.


Again, why do you need four BRAMs? Perhaps I'm stupid, but I don't see
what can be achieved with four BRAMs that cannot be achieved with two,
if it's correct that "[h]aving multiple read ports is no problem". Or is
it just how you solve the problem of having multiple read ports?

Like, you have two BRAMs A and B, and a semaphore array. The writer A
writes to A and points the semaphore of address x to A. The writer B
does the same for B. You read simultaneously A and B and the semaphore
for address x.

Gene

Kevin Neilson
Guest

Fri Nov 04, 2016 2:35 am   



Quote:
Again, why do you need four BRAMs?

Gene


I need 4 ports (2 wr, 2 rd). Your 2-BRAM solution allows for 2 wr ports, but only 1 rd port. In your solution you read A and B and the semaphore, then mux either A or B to your read data output based on the semaphore. But I need a second read port, so I have to have a second copy of the system you describe.

I drew up a nice diagram with a good solution for doing the semaphores, but I don't know how to post it here.

Evgeny Filatov
Guest

Fri Nov 04, 2016 7:30 am   



On 04.11.2016 3:35, Kevin Neilson wrote:
Quote:
Again, why do you need four BRAMs?

Gene

I need 4 ports (2 wr, 2 rd). Your 2-BRAM solution allows for 2 wr ports, but only 1 rd port. In your solution you read A and B and the semaphore, then mux either A or B to your read data output based on the semaphore. But I need a second read port, so I have to have a second copy of the system you describe.

I drew up a nice diagram with a good solution for doing the semaphores, but I don't know how to post it here.


Thanks for explaining the rationale for using 4 BRAMs.

Your solution would surely be interesting to look at. To post an image,
you can just upload it to any image-hosting website like

http://imgur.com

and post here the link to your image.

My best idea to remove logic from the design would be to append a
timestamp to each writing operation (instead of switching a semaphore).
During the read operation, the data word with the newest timestamp would
be selected. But it would only work for the limited time, until the data
field with the timestamp overflows.
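A rough Python sketch of the timestamp idea, assuming a free-running counter that is stored next to each data word. Only one read port is modeled, and the names are hypothetical. The last two lines show the overflow problem: once the counter wraps, an older word can look newer.

```python
class TimestampRam:
    """Two write banks; the reader picks the word with the larger timestamp."""

    def __init__(self, depth, ts_bits):
        self.ts_mod = 1 << ts_bits          # counter wraps at 2**ts_bits
        self.now = 0
        self.bank_a = [(0, 0)] * depth      # entries are (timestamp, value)
        self.bank_b = [(0, 0)] * depth

    def tick(self):
        self.now = (self.now + 1) % self.ts_mod

    def write_a(self, addr, value):
        self.bank_a[addr] = (self.now, value)

    def write_b(self, addr, value):
        self.bank_b[addr] = (self.now, value)

    def read(self, addr):
        ts_a, val_a = self.bank_a[addr]
        ts_b, val_b = self.bank_b[addr]
        return val_a if ts_a >= ts_b else val_b


ram = TimestampRam(depth=8, ts_bits=2)
ram.tick(); ram.write_a(0, 10)      # timestamp 1
ram.tick(); ram.write_b(0, 20)      # timestamp 2
print(ram.read(0))                  # 20, the newer word
ram.tick(); ram.write_a(0, 30)      # timestamp 3
ram.tick(); ram.write_b(0, 40)      # counter wrapped to 0
print(ram.read(0))                  # 30, stale: the wrapped timestamp looks older
```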

Gene


Guest

Sat Nov 05, 2016 12:33 am   



On Friday, October 23, 2015 at 2:40:46 PM UTC-5, Kevin Neilson wrote:
Quote:
I think I need a quad-port blockRAM in a Xilinx V7. Having multiple read ports is no problem, but I need two read ports and two write ports. The two write ports is the problem. I can't double the clock speed. To be clear, I need to be able to do two reads and two writes per cycle. (Not writes to the same address.)

The only idea I could come up with is to have four dual-port BRAMs and a semaphore array. Let's call the BRAMs AC, AD, BC, and BD. Writer A writes the same value to address x in AC and AD and simultaneously sets the semaphore of address x to point to 'A'. Now when reader C wants to read address x, it reads AC and BC and the semaphore, sees that semaphore points toward the A side, and uses the value from AC and discards BC. If writer B writes to address x, it writes the value to both BC and BD and sets the semaphore x to point to side B. Reader D reads AD and BD and picks one based on the semaphore bit.

The semaphore itself is complicated. I think it would consists of 2 quad-port RAMs, one bit wide and the depth of AC, each one having 1 write and 3 read ports. This could be distributed RAM. Writer A would read the side B semaphore bit and set its own to the same, and writer B would read the side A bit and set its own to the opposite. Now when reader C or D read their two copies (A/B) of the semaphore bits using their read ports, they check if they are the same (use side A) or opposite (use side B).

It's a big mess and uses 4x the BRAMs as a dual-port. Maybe I need a different solution.


There is literature on this subject:
http://fpgacpu.ca/multiport/TRETS2014-LaForest-Article.pdf

Kevin Neilson
Guest

Sat Nov 05, 2016 2:00 am   



Quote:
Your solution would be surely interesting to look at. To post an image,
you can just upload it to any image-hosting website like

http://imgur.com

and post here the link to your image.

My best idea to remove logic from the design would be to append a
timestamp to each writing operation (instead of switching a semaphore).
During the read operation, the data word with the newest timestamp would
be selected. But it would only work for the limited time, until the data
field with the timestamp overflows.

Gene


Thanks. Here's my sketch:

http://imgur.com/a/NhNr0

The timestamp is a nice idea, but, like you said, it would overflow quickly. And you'd have a long carry chain to do the timestamp comparison.

Kevin Neilson
Guest

Sat Nov 05, 2016 2:12 am   



Quote:
There is a literature on this subject:
http://fpgacpu.ca/multiport/TRETS2014-LaForest-Article.pdf


Yes, I did actually find this yesterday when searching again. The design I ended up using (http://imgur.com/a/NhNr0 ) looks like what they have in Fig. 3(a), except I implemented the "live value table" in BRAMs so it's much faster. They have a faster solution in Fig. 4(c), which uses their "XOR-based" design. However, it requires a lot more RAM because you need 6 full data storage units. I used only 4, and then two much smaller RAMs for semaphores (aka Live Value Table), and I also store semaphore copies in the 4 data RAMs.
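The storage comparison above (4 versus 6 full data storage units) matches the usual bank-count formulas for these two families of designs. The sketch below assumes W*R banks for the semaphore/LVT design and W*(W-1+R) banks for the XOR-based design, which is my reading of the cited article; check the formulas against the paper before relying on them.

```python
def lvt_banks(writers, readers):
    # One replicated bank per (writer, reader) pair.
    return writers * readers


def xor_banks(writers, readers):
    # Assumed formula: each writer keeps one bank per reader
    # plus one bank for each of the other writers.
    return writers * (writers - 1 + readers)


print(lvt_banks(2, 2))   # 4
print(xor_banks(2, 2))   # 6
```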


Guest

Sat Nov 05, 2016 4:09 am   



I find this thread very interesting; it discusses quite a few approaches I would not have thought of in the first place...

Maybe a different view-point: As most modern FPGAs support true dual port RAM, with double clock rate you could write to two ports in the first cycle and read from both ports in the second cycle. This would only require 1 BRAM compared to 4 BRAMs (assuming your content fits into 1 BRAM, of course...).

However, you wrote that you cannot double the clock rate (out of curiosity: which clock rates are we talking about?). But maybe you could increase it by 50%? Then you could make a 2/3 clock scheme with 2 BRAMs, with all the writes going to both BRAMs (taking two of the 3 cycles), while the reads for these two transactions (4 in total) are done in the 3rd cycle from both BRAMs. Of course, this only makes sense if you can find a simple clock-domain-crossing solution at system level...
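A small Python sketch of the 1.5x schedule, under the assumption that one window covers 2 system cycles (3 fast cycles) and that all four writes of a window land before its four reads. The function name and the list-based BRAM model are illustrative only; clock-domain crossing is not modeled.

```python
def run_window(bram0, bram1, writes, reads):
    """One window: 2 system cycles = 3 fast cycles on two dual-port BRAMs.

    writes: four (addr, value) pairs; reads: four addresses.
    """
    assert len(writes) == 4 and len(reads) == 4
    # Fast cycles 1 and 2: both ports of both BRAMs are used for writes,
    # two writes per fast cycle, so the two copies stay identical.
    for fast_cycle in range(2):
        for addr, value in writes[2 * fast_cycle: 2 * fast_cycle + 2]:
            bram0[addr] = value
            bram1[addr] = value
    # Fast cycle 3: two reads from each BRAM, four reads in total.
    return [bram0[reads[0]], bram0[reads[1]],
            bram1[reads[2]], bram1[reads[3]]]


bram0, bram1 = [0] * 16, [0] * 16
data = run_window(bram0, bram1,
                  writes=[(1, 10), (2, 20), (3, 30), (4, 40)],
                  reads=[1, 2, 3, 4])
print(data)   # [10, 20, 30, 40]
```

The port budget is what makes this work: 3 fast cycles times 2 ports times 2 BRAMs gives 12 port slots, of which 8 go to the duplicated writes and 4 to the reads.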

Regards,

Thomas
www.entner-electronics.com - Home of EEBlaster and JPEG-Codec

Evgeny Filatov
Guest

Sat Nov 05, 2016 6:04 pm   



On 05.11.2016 3:00, Kevin Neilson wrote:
Quote:

Your solution would be surely interesting to look at. To post an image,
you can just upload it to any image-hosting website like

http://imgur.com

and post here the link to your image.

My best idea to remove logic from the design would be to append a
timestamp to each writing operation (instead of switching a semaphore).
During the read operation, the data word with the newest timestamp would
be selected. But it would only work for the limited time, until the data
field with the timestamp overflows.

Gene

Thanks. Here's my sketch:

http://imgur.com/a/NhNr0

The timestamp is a nice idea, but, like you said, it would overflow quickly. And you'd have a long carry chain to do the timestamp comparison.


Great design! In terms of the referenced article, it combines the good
features of both the LVT/semaphore approach (requires little memory to
store semaphores), and the XOR-based approach (no need for multiport
memory to store semaphores).

I would only suggest that, as discussed on pp. 6-7 of the LaForest
article, it's possible to give the user the impression that there is no
write delay by adding some forwarding circuitry.

Gene

Kevin Neilson
Guest

Sat Nov 05, 2016 7:27 pm   



Quote:
I would only suggest, that like discussed at pp. 6-7 of LaForest
article, it's possible to give user the impression there's no writing
delay by adding some forwarding circuitry.

Gene


I realized that since I'm doing read-modify-writes, I don't even need the extra semaphore RAMs. Since I'm reading each address two cycles before writing, I can get the semaphores from the data RAMs. When I'm doing a write only, I can precede it by a dummy read to get the semaphores.

The Xilinx BRAMs operate at the same speed for write-first and read-first modes, so I probably wouldn't need the forwarding logic. (The setup time is a lot bigger for write-first mode, though.) However, I do need a short "local cache" for when I try to read-modify-write the same location on successive cycles. Because of the read latency, the second read would be of stale data so I have to read from the local cache instead.
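A toy Python model of why the local cache is needed. It assumes a two-cycle gap between issuing a read and writing the modified value back, an increment as the "modify" step, and that a write-back is visible to a read issued in the same cycle. All names are hypothetical, and this is a sketch of the hazard, not of the actual design.

```python
from collections import deque

LATENCY = 2  # assumed cycles from issuing a read to writing the result back


def rmw_increment(ram, addrs, use_cache):
    """Issue one read-modify-write (increment) per cycle through a pipeline."""
    pipe = deque()                        # reads in flight: (addr, value_read)
    recent = deque(maxlen=LATENCY - 1)    # write-backs the in-flight reads missed
    for addr in list(addrs) + [None] * LATENCY:    # None bubbles drain the pipe
        if len(pipe) == LATENCY:
            a, value = pipe.popleft()
            if a is not None:
                if use_cache:
                    for cached_addr, cached_value in recent:
                        if cached_addr == a:
                            value = cached_value   # bypass the stale read data
                ram[a] = value + 1
            recent.append((a, ram[a] if a is not None else None))
        pipe.append((addr, ram[addr] if addr is not None else None))
    return ram


print(rmw_increment([0] * 4, [1, 1, 1], use_cache=False)[1])  # 2: one update lost
print(rmw_increment([0] * 4, [1, 1, 1], use_cache=True)[1])   # 3: all updates kept
```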

Guy Lemieux
Guest

Sat Nov 05, 2016 7:35 pm   



There is a paper that describes your approach, published by my Ph.D. student Ameer Abdelhadi at FPGA2014. At FCCM2016 he extended it to include switched ports, where some ports can dynamically switch between read and write mode.

http://ece.ubc.ca/~ameer/publications.html

He has released the designs on GitHub under a permissive open source license.

https://github.com/AmeerAbdelhadi/Switched-Multiported-RAM

Guy

Guy Lemieux
Guest

Sat Nov 05, 2016 7:40 pm   



My Ph.D. student Ameer added forwarding paths to his version, available on GitHub. See papers at FPGA2014 and FCCM2016.

http://ece.ubc.ca/~ameer/publications.html

https://github.com/AmeerAbdelhadi/Multiported-RAM

Kevin Neilson
Guest

Sat Nov 05, 2016 7:44 pm   



Quote:
Maybe a different view-point: As most modern FPGAs support true dual port RAM, with double clock
However, you wrote that you cannot double the clock rate (out of curiosity: which clock rates are we talking about?). But, maybe you could increase it by 50%? Then you could make a 2/3 clock scheme with 2 BRAMs, with all the writes going to both BRAMs (taking two of the 3 cycles), but the reads for these two transactions (4 in total) are done in the 3rd cycle from both BRAMs. Of course this makes only sense if you can find a simple clock-domain-crossing-solution on system level...

Regards,

Thomas
www.entner-electronics.com - Home of EEBlaster and JPEG-Codec


That's a great idea. It took me a few minutes to work through this, but it seems like it would work. The clock I'm using now is 360MHz, so a 1.5x clock would be 540MHz. That's pushing the edge, but Xilinx says the BRAM will run at 543MHz in a -2 part. The clock-domain crossing shouldn't be a problem. The clocks are "periodic-synchronous", so you have a known setup time. (Assuming you use DLLs to keep them phase-locked.)

Xilinx does have an old app note ( https://www.xilinx.com/support/documentation/application_notes/xapp228.pdf ) on using a 2x clock to make a quad-port. In my case the 2x clock would be 720MHz.
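The clock figures in this post, checked against the quoted 543MHz BRAM limit for a -2 part. The numbers are the ones given above; the variable names are only for illustration.

```python
SYSTEM_CLOCK_MHZ = 360   # current clock, from the post
BRAM_FMAX_MHZ = 543      # quoted BRAM limit for a -2 part

results = {}
for factor in (1.5, 2.0):
    fast_clock = SYSTEM_CLOCK_MHZ * factor
    results[factor] = (fast_clock, fast_clock <= BRAM_FMAX_MHZ)

print(results[1.5])   # (540.0, True): 3MHz of margin
print(results[2.0])   # (720.0, False): the 2x scheme is out of reach
```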
