Quartus II Synthesis - System Memory Issues for Large Stratix 10 Design...

Chris Adams

Guest
Hello,

I have a Stratix 10 design that is based around an IP core generated using Intel's HLS. The core does some simple floating-point operations and by itself uses very few resources (1 DSP, a few hundred flops, etc.).

This core sits inside a generate statement like this:

genvar i;
generate
  for (i = 0; i < SOMEBIGNUMBER; i = i + 1) begin : g_core  // block label is illustrative
    myhlscore u0 (inputs, outputs);
  end
....


The design works and is proven in simulation and in hardware.

The problem comes when I try to increase the value of SOMEBIGNUMBER. Despite there being adequate resources, using values above 200 or so makes the synthesis tool run out of memory.

I cannot alleviate this easily by adding more memory - I already tried synthesizing on a computer with 256GB of memory and 200GB of swap space, and Quartus ate it all up before dying.

I'm using a .ip file from HLS right now. I'm wondering if there is some way to pre-synthesize the module and keep the results, or if there is some way I need to write the generate statement so that it caches less? Perhaps there are some synthesis settings I can change?

Thanks,
C
 
Chris Adams <chris@chrisada.co.uk> wrote:
I'm using a .ip file from HLS right now. I'm wondering if there is some
way to pre-synthesize the module and keep the results, or if there is some way
I need to write the generate statement so that it caches less? Perhaps
there are some synthesis settings I can change?

You can put the module in a design partition. That should mean the tools
can reuse the block again if it's repeated, and won't try to flatten
everything out before routing, which is probably what causes the out of
memory errors.

Design Partition Planner is the tool to do this. I haven't watched it all,
but this video seems to cover how to use partitions:
https://www.youtube.com/watch?v=AW9kev4lM7g

Beware that if you're using a tool that generates Verilog (not sure whether
HLS is one of these) you need to keep the structure through to the Verilog
level. In other words, if your myhlscore() is the same Verilog module on each
iteration of the generate loop that's fine, but if your HLS core is producing
SOMEBIGNUMBER different Verilog modules, or flattening them into one
enormous module, that's going to be a problem. It looks like your HLS is
confined to the inside of the module and your generate is in regular Verilog,
so that's probably OK.

Theo
 
On Saturday, 30 October 2021 at 16:55:47 UTC-4, Theo wrote:
You can put the module in a design partition. That should mean the tools
can reuse the block again if it's repeated, and won't try to flatten
everything out before routing, which is probably what causes the out of
memory errors.

Design Partition Planner is the tool to do this. I haven't watched it all,
but this video seems to cover how to use partitions:
https://www.youtube.com/watch?v=AW9kev4lM7g

Beware that if you're using a tool that generates Verilog (not sure whether
HLS is one of these) you need to keep the structure through to the Verilog
level. In other words, if your myhlscore() is the same Verilog module on each
iteration of the generate loop that's fine, but if your HLS core is producing
SOMEBIGNUMBER different Verilog modules, or flattening them into one
enormous module, that's going to be a problem. It looks like your HLS is
confined to the inside of the module and your generate is in regular Verilog,
so that's probably OK.

Theo

Good idea. We tried using a design partition, but even the elaboration stage uses over 120GB of memory before crashing. I may try setting up an extremely large swap file.

Chris
 
Chris Adams <chris@chrisada.co.uk> wrote:

Good idea. We tried using a design partition, but even the elaboration
stage uses over 120GB of memory before crashing. I may try setting up an
extremely large swap file.

I wonder if the generate statement is causing the trouble, and whether just
having a pile of flat instantiations would be any different?
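
Nobody would want to type a few hundred flat instantiations by hand, so the experiment is easiest to run from a small script. The following is only a rough sketch: the instance names `u_core_<i>` and the indexed `inputs[i]` / `outputs[i]` connections are assumptions, since the real port list of myhlscore is not shown in this thread.

```python
def flat_instances(module: str, count: int) -> str:
    """Return `count` explicit Verilog instantiations of `module`, one per line."""
    lines = []
    for i in range(count):
        # Hypothetical connections: the thread only shows "(inputs, outputs)".
        lines.append(f"{module} u_core_{i} (inputs[{i}], outputs[{i}]);")
    return "\n".join(lines)


if __name__ == "__main__":
    # Paste or `include the printed text in place of the generate loop.
    print(flat_instances("myhlscore", 4))
```

The output replaces the generate loop one-for-one, so comparing elaboration memory between the two forms should show whether the generate construct itself matters.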

One other thing you could try, if the I/O isn't too troublesome, is a tree
of instantiations. Module A2 contains two instances of the module A, module
A4 contains two instances of A2, module A8 two of A4, etc. That way it
keeps the complexity at each level of hierarchy down. If the synthesiser is
blowing up at the elaboration stage, it could help if the elaboration in a
specific module is within limits. I've not tried it though.
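
As a rough illustration of the tree idea, the wrapper modules can be generated rather than written by hand. This is a sketch under stated assumptions: the wrapper names (`A_x2`, `A_x4`, ...) and instance names (`u_lo`, `u_hi`) are made up, and the port lists are left empty because the thread does not give the core's interface, so the real I/O would still have to be threaded through each level.

```python
def tree_modules(leaf: str, levels: int) -> str:
    """Emit wrapper modules <leaf>_x2, <leaf>_x4, ...; each holds two copies of the level below."""
    out = []
    child = leaf
    for level in range(1, levels + 1):
        name = f"{leaf}_x{2 ** level}"
        out.append(
            f"module {name} ();\n"      # ports omitted: interface unknown
            f"  {child} u_lo ();\n"
            f"  {child} u_hi ();\n"
            f"endmodule\n"
        )
        child = name  # the next level up instantiates this wrapper twice
    return "\n".join(out)


if __name__ == "__main__":
    print(tree_modules("A", 3))
```

Calling `tree_modules("A", 3)` yields wrappers A_x2, A_x4 and A_x8, so the top wrapper holds eight copies of A. Counts that are not a power of two would need an extra level that mixes wrapper sizes, which this sketch does not attempt.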

I've had Stratix 10 builds need >16GB but <64GB, for what it's worth, but
these weren't super full designs.

Theo
 
On Tuesday, 9 November 2021 at 15:24:51 UTC-5, Theo wrote:
I wonder if the generate statement is causing the trouble, and whether just
having a pile of flat instantiations would be any different?

One other thing you could try, if the I/O isn't too troublesome, is a tree
of instantiations. Module A2 contains two instances of the module A, module
A4 contains two instances of A2, module A8 two of A4, etc. That way it
keeps the complexity at each level of hierarchy down. If the synthesiser is
blowing up at the elaboration stage, it could help if the elaboration in a
specific module is within limits. I've not tried it though.

I've had Stratix 10 builds need >16GB but <64GB, for what it's worth, but
these weren't super full designs.

Theo

This is also a good idea. We tried this, but it didn't help either.

We were eventually able to get the design to complete map, but it required a 500GB swap file, and took 2 days or so to build!!!

I don't think this is really a good solution; memory access on the system swap is extremely slow.

Anyway the design now fails at the route stage, saying the design cannot be routed... Device resource usage is around 60% following map.

Chris
 
On Tuesday, 14 December 2021 at 08:35:06 UTC-5, Chris Adams wrote:
This is also a good idea. We tried this, but it didn't help either.

We were eventually able to get the design to complete map, but it required a 500GB swap file, and took 2 days or so to build!!!

I don't think this is really a good solution; memory access on the system swap is extremely slow.

Anyway the design now fails at the route stage, saying the design cannot be routed... Device resource usage is around 60% following map.

Chris

Just a quick update - We were able to further reduce synthesis memory usage by lowering the number of cores used for compilation. We went down to 4 cores and memory usage was below 128GB with only a small impact on build performance.
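
For scripted builds, the core limit can be pinned in the project's settings file instead of being set by hand. The following is a minimal sketch, assuming the limit is applied through the `NUM_PARALLEL_PROCESSORS` global assignment in the .qsf file; the helper name and the sample QSF text are hypothetical.

```python
import re

# Matches an existing NUM_PARALLEL_PROCESSORS line (value may be a number or ALL).
_PATTERN = re.compile(
    r"^set_global_assignment[ \t]+-name[ \t]+NUM_PARALLEL_PROCESSORS[ \t]+\S+[ \t]*$",
    re.MULTILINE,
)


def set_parallel_processors(qsf_text: str, cores: int) -> str:
    """Return QSF text with NUM_PARALLEL_PROCESSORS set to `cores`."""
    line = f"set_global_assignment -name NUM_PARALLEL_PROCESSORS {cores}"
    if _PATTERN.search(qsf_text):
        # Overwrite the existing assignment in place.
        return _PATTERN.sub(line, qsf_text)
    # No existing assignment: append one, keeping the file newline-terminated.
    if qsf_text and not qsf_text.endswith("\n"):
        qsf_text += "\n"
    return qsf_text + line + "\n"


if __name__ == "__main__":
    sample = "set_global_assignment -name NUM_PARALLEL_PROCESSORS ALL\n"
    print(set_parallel_processors(sample, 4), end="")
```

Editing the text rather than the file keeps the helper easy to check; a build script would read the .qsf, call the helper, and write the result back before starting the compile.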
 
Chris Adams <chris@chrisada.co.uk> wrote:
Just a quick update - We were able to further reduce synthesis memory
usage by lowering the number of cores used for compilation. We went down
to 4 cores and memory usage was below 128GB with only a small impact on
build performance.

That's good to know, I hadn't thought of that. Makes sense - reduces
parallelism but reduces copies of the working set.

I get the impression Quartus spends a fair chunk of its time not being
parallel (Amdahl's law) - Quartus Pro is a bit better at being parallel, but
I think it still prefers fewer cores with a higher clock.
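
To put a rough number on the Amdahl's law point, the sketch below computes the ideal speedup for a job where only part of the runtime scales with core count. The 50% parallel fraction is purely an illustrative guess, not a measured figure for Quartus.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over one core when only `parallel_fraction` of the runtime scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


if __name__ == "__main__":
    # 0.5 is an assumed parallel fraction, chosen only to show the shape of the curve.
    for n in (1, 4, 16, 64):
        print(n, round(amdahl_speedup(0.5, n), 2))
```

With half the work serial, the speedup can never reach 2x however many cores are added, which would be consistent with dropping to 4 cores costing little build time while saving a lot of memory.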

Theo
 
