embedded RAM vs. registers

alb · Jan 17, 2014

Hi everyone,

I'm trying to optimize the footprint of my firmware on the target device
and I realize there are a lot of parameters which might be stored in the
embedded RAM instead of dedicated registers.

Certainly the RAM access logic will 'eat some space' but lots of flops
will be released. Is there any recommendation on how to optimally use
embedded resources? [1]

The main reason for this optimization is to free some space to include a
function which has been added later in the design phase (ouch!).

Thanks a lot,

Al

[1] I know that, put like this, the question is certainly open to heated
discussion!

--
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

GaborSzakacs · Jan 18, 2014

alb wrote:

> Hi everyone,
>
> I'm trying to optimize the footprint of my firmware on the target device
> and I realize there are a lot of parameters which might be stored in the
> embedded RAM instead of dedicated registers.
>
> Certainly the RAM access logic will 'eat some space' but lots of flops
> will be released. Is there any recommendation on how to optimally use
> embedded resources? [1]
>
> The main reason for this optimization is to free some space to include a
> function which has been added later in the design phase (ouch!).
>
> Thanks a lot,
>
> Al
>
> [1] I know that, put like this, the question is certainly open to heated
> discussion!

It depends on the device you're targeting. To some extent the tools
can make use of embedded RAM without changing your RTL. For example
Xilinx tools allow you to place logic into unused BRAMs, and will
automatically infer SRLs where the design allows it.

I've often used BRAM as a "shadow memory" to keep a copy of internal
configuration registers for readback. That can eliminate a large
mux, at least for all register bits that only change when written.
Read-only bits and self-resetting bits would still need a mux, but
the overall logic could be reduced vs. a complete mux for all bits.

--
Gabor

P.S. - I find your signature more annoying than top posting. In my
opinion the most annoying thing about usenet (besides the text-only
format) is people who think they have been appointed to police the
etiquette of other posters.

alb · Jan 19, 2014

Hi Gabor,

On 1/17/2014 10:53 PM, GaborSzakacs wrote:
[]

>> I'm trying to optimize the footprint of my firmware on the target device
>> and I realize there are a lot of parameters which might be stored in the
>> embedded RAM instead of dedicated registers.
[]
> It depends on the device you're targeting. To some extent the tools
> can make use of embedded RAM without changing your RTL. For example
> Xilinx tools allow you to place logic into unused BRAMs, and will
> automatically infer SRLs where the design allows it.

Uhm, apparently the Microsemi devices I'm using (IGLOO), together with
the toolset (Libero IDE), are not smart enough to take advantage of the
local memory, unless I'm inadvertently asking *not* to use it. To be
honest I have not searched deeply for RAM usage on these devices, but the
handbook does not provide any clue on 'use of RAM without changing RTL'.

> I've often used BRAM as a "shadow memory" to keep a copy of internal
> configuration registers for readback. That can eliminate a large
> mux, at least for all register bits that only change when written.
> Read-only bits and self-resetting bits would still need a mux, but
> the overall logic could be reduced vs. a complete mux for all bits.

I guess I do not completely follow you here, which mux are you talking
about?

Al

p.s.: you are entitled to your own opinion about Usenet and its
users' opinions, no more than I am.

Gabor · Jan 19, 2014

On 1/18/2014 5:23 PM, alb wrote:

> Hi Gabor,
>
> On 1/17/2014 10:53 PM, GaborSzakacs wrote:
> []
>>> I'm trying to optimize the footprint of my firmware on the target device
>>> and I realize there are a lot of parameters which might be stored in the
>>> embedded RAM instead of dedicated registers.
> []
>> It depends on the device you're targeting. To some extent the tools
>> can make use of embedded RAM without changing your RTL. For example
>> Xilinx tools allow you to place logic into unused BRAMs, and will
>> automatically infer SRLs where the design allows it.
>
> Uhm, apparently the Microsemi devices I'm using (IGLOO), together with
> the toolset (Libero IDE), are not smart enough to take advantage of the
> local memory, unless I'm inadvertently asking *not* to use it. To be
> honest I have not searched deeply for RAM usage on these devices, but the
> handbook does not provide any clue on 'use of RAM without changing RTL'.
>
>> I've often used BRAM as a "shadow memory" to keep a copy of internal
>> configuration registers for readback. That can eliminate a large
>> mux, at least for all register bits that only change when written.
>> Read-only bits and self-resetting bits would still need a mux, but
>> the overall logic could be reduced vs. a complete mux for all bits.
>
> I guess I do not completely follow you here, which mux are you talking
> about?

In a system with a processor (external or embedded) you typically have
some form of bus to read and write registers within the FPGA. Normally
you need the outputs of these registers all the time, so you can't just
implement the whole thing as RAM. Now if the CPU wants to be able to
read back the values it wrote, you need a big readback multiplexer
(unless your IGLOO has internal tristate buffers) to select the register
you want to read back. What I do is to have a RAM that keeps a copy of
what was written by the CPU. Then the readback mux defaults to the
output of this (simple single-port) RAM unless the register is read-only
or has some side-effects that could change the register's value when
it's not being written by the CPU. If you have a design with a whole
lot of registers, you can really reduce the size of the readback mux.
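
A minimal behavioural sketch of the readback scheme described above, written in Python rather than RTL. The class name `RegisterBank`, the `volatile` set and the address map are made up for illustration, and the one-cycle read latency of a real synchronous RAM is not modelled.

```python
class RegisterBank:
    """Behavioural model of a CPU-visible register bank with shadow-RAM readback.

    All names are hypothetical; a real design would be written in RTL.
    """

    def __init__(self, num_regs, volatile=()):
        # Flip-flop registers: their outputs are available to the logic all the time.
        self.flops = [0] * num_regs
        # Simple single-port RAM holding a copy of every CPU write.
        self.shadow = [0] * num_regs
        # Addresses that are read-only or can change without a CPU write.
        self.volatile = frozenset(volatile)

    def cpu_write(self, addr, value):
        self.flops[addr] = value
        self.shadow[addr] = value  # the same write strobe updates the copy

    def hw_update(self, addr, value):
        # Side effect from the logic (status bit, self-clearing bit, ...).
        if addr not in self.volatile:
            raise ValueError("only volatile registers may change on their own")
        self.flops[addr] = value

    def cpu_read(self, addr):
        # Readback mux: volatile registers are read from the flops,
        # everything else defaults to the shadow RAM output.
        if addr in self.volatile:
            return self.flops[addr]
        return self.shadow[addr]

    def readback_mux_inputs(self):
        # One mux input per volatile register plus one for the RAM output.
        return len(self.volatile) + 1


bank = RegisterBank(num_regs=64, volatile={0, 1})
bank.cpu_write(10, 0xABCD)   # ordinary configuration register
bank.hw_update(0, 0x0001)    # status register changed by the logic
print(hex(bank.cpu_read(10)), bank.cpu_read(0), bank.readback_mux_inputs())
# prints: 0xabcd 1 3
```

In this example the readback mux shrinks from 64 inputs per data bit to 3, at the cost of one RAM block; how much logic that actually saves depends on the device and the tools.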

Of course you could save even more logic by not having readback for
values that only change when written by the CPU. These become
"write-only" registers, and the software guy then needs to keep his
own "shadow" copy of the values he wrote if he needs to read it
back later.

> Al
>
> p.s.: you are entitled to your own opinion about Usenet and its
> users' opinions, no more than I am.

Someone said, "Opinions are like a**holes. Everyone has one, and they
all stink." In any case I see you removed your signature from the
latest post. ;-)

--
Gabor

alb · Jan 20, 2014

Hi Gabor,

On 1/19/2014 4:30 AM, Gabor wrote:
[]

>> I guess I do not completely follow you here, which mux are you
>> talking about?
>
> In a system with a processor (external or embedded) you typically
> have some form of bus to read and write registers within the FPGA.
> Normally you need the outputs of these registers all the time, so
> you can't just implement the whole thing as RAM.

I follow you if you talk about 'state registers', which of course are
needed to keep the current state of the logic, but there are lots of
'configuration registers' which do not need constant access to
their values.

A simple example would be the configuration of a UART: you do not need
to know *constantly* that you need a parity bit or two stop bits. This
type of 'memory' can go in a RAM. Would you agree?

> Now if the CPU wants to be able to read back the values it wrote,
> you need a big readback multiplexer (unless your IGLOO has internal
> tristate buffers) to select the register you want to read back.

Got your point about the multiplexer.

> What I do is to have a RAM that keeps a copy of what was written by
> the CPU.

I tend to avoid local copies of information since they may not mirror
each other reliably, leading to multiple sources of 'truth' which
eventually may bite you.
How do you guarantee on a cycle-by-cycle basis that the two locations
match perfectly? What happens if they differ? If you do not need
cycle accuracy, then which location do you rely upon?

> Then the readback mux defaults to the output of this (simple
> single-port) RAM unless the register is read-only or has some
> side-effects that could change the register's value when it's not
> being written by the CPU. If you have a design with a whole lot of
> registers, you can really reduce the size of the readback mux.

I now understand your, indeed valid, point.

> Of course you could save even more logic by not having readback for
> values that only change when written by the CPU. These become
> "write-only" registers, and the software guy then needs to keep his
> own "shadow" copy of the values he wrote if he needs to read it back
> later.

See my opinion on multiple copies above.

[]

> Someone said, "Opinions are like a**holes. Everyone has one, and
> they all stink."

See, we are not too far apart with our own personal opinion on 'opinions'.

> In any case I see you removed your signature from the latest post.
> ;-)

That is done automatically by my mailer when I'm not the OP, so do not
get too excited about that ;-)

GaborSzakacs · Jan 21, 2014

alb wrote:

> Hi Gabor,
>
> On 1/19/2014 4:30 AM, Gabor wrote:
> []
>>> I guess I do not completely follow you here, which mux are you
>>> talking about?
>>
>> In a system with a processor (external or embedded) you typically
>> have some form of bus to read and write registers within the FPGA.
>> Normally you need the outputs of these registers all the time, so
>> you can't just implement the whole thing as RAM.
>
> I follow you if you talk about 'state registers', which of course are
> needed to keep the current state of the logic, but there are lots of
> 'configuration registers' which do not need constant access to
> their values.
>
> A simple example would be the configuration of a UART: you do not need
> to know *constantly* that you need a parity bit or two stop bits. This
> type of 'memory' can go in a RAM. Would you agree?

Not at all. The UART needs to know how many stop bits and what sort of
parity to use whenever it transmits data. That can be completely
asynchronous to the CPU data bus. If the UART needed to get this info
from RAM, it would need another address port to that RAM. That's a very
inefficient use of hardware to avoid storing 2 or 3 bits in a separate
register. If you meant that the UART would read the RAM and then keep
a local copy, how is this different (in terms of resource usage) from
just having the register implemented in flip-flops?

>> Now if the CPU wants to be able to read back the values it wrote,
>> you need a big readback multiplexer (unless your IGLOO has internal
>> tristate buffers) to select the register you want to read back.
>
> Got your point about the multiplexer.
>
>> What I do is to have a RAM that keeps a copy of what was written by
>> the CPU.
>
> I tend to avoid local copies of information since they may not mirror
> each other reliably, leading to multiple sources of 'truth' which
> eventually may bite you.
> How do you guarantee on a cycle-by-cycle basis that the two locations
> match perfectly? What happens if they differ? If you do not need
> cycle accuracy, then which location do you rely upon?
>
>> Then the readback mux defaults to the output of this (simple
>> single-port) RAM unless the register is read-only or has some
>> side-effects that could change the register's value when it's not
>> being written by the CPU. If you have a design with a whole lot of
>> registers, you can really reduce the size of the readback mux.
>
> I now understand your, indeed valid, point.
>
>> Of course you could save even more logic by not having readback for
>> values that only change when written by the CPU. These become
>> "write-only" registers, and the software guy then needs to keep his
>> own "shadow" copy of the values he wrote if he needs to read it back
>> later.
>
> See my opinion on multiple copies above.

This is indeed an issue whenever you use this technique to save
resources. I look at it as a trade-off. In the case of readback
for read/write bits that only change when written by the CPU, the
only time you would be out of sync is at start-up. In my case
I would either make a rule that the software must write every
register at least once before it could be read back, or I would
program the "RAM" with the initial register values at config time.
This works on Xilinx parts, where the configuration bitstream has
bits for all BRAM locations. Not all FPGAs can do this, though.
Anyway, I thought this thread was about saving device resources...
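
The two start-up policies mentioned above might look like this in a behavioural Python sketch. `ShadowRam` and `init_image` are invented names, and the per-location `valid` flags are an added assumption used only to check the write-before-read rule in simulation; implementing them in hardware would cost the very flip-flops the technique is meant to save.

```python
class ShadowRam:
    """Shadow copy of CPU-written registers, with two start-up options."""

    def __init__(self, depth, init_image=None):
        if init_image is not None:
            # Option 2: contents preloaded at configuration time (only possible
            # on devices whose bitstream can initialise block RAM).
            if len(init_image) != depth:
                raise ValueError("init image must cover every location")
            self.mem = list(init_image)
            self.valid = [True] * depth
        else:
            # Option 1: contents are unknown until software writes each register.
            self.mem = [None] * depth
            self.valid = [False] * depth

    def write(self, addr, value):
        self.mem[addr] = value
        self.valid[addr] = True

    def read(self, addr):
        if not self.valid[addr]:
            raise RuntimeError(f"register {addr} read back before first write")
        return self.mem[addr]


reset_values = [0x00, 0x03, 0xFF, 0x10]   # hypothetical flip-flop reset values
preloaded = ShadowRam(4, init_image=reset_values)
print(preloaded.read(1))   # matches the reset value, no write needed

blank = ShadowRam(4)
blank.write(2, 0x55)
print(blank.read(2))       # fine: written before being read
```

Reading `blank.read(0)` here would raise, which is the simulation-time equivalent of breaking the "write every register once" rule.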

--
Gabor

glen herrmannsfeldt · Jan 21, 2014

GaborSzakacs <gabor@alacron.com> wrote:

(snip)

>>> In a system with a processor (external or embedded) you typically
>>> have some form of bus to read and write registers within the FPGA.
>>> Normally you need the outputs of these registers all the time, so
>>> you can't just implement the whole thing as RAM.
>>
>> I follow you if you talk about 'state registers', which of course are
>> needed to keep the current state of the logic, but there are lots of
>> 'configuration registers' which do not need constant access to
>> their values.
>>
>> A simple example would be the configuration of a UART: you do not need
>> to know *constantly* that you need a parity bit or two stop bits. This
>> type of 'memory' can go in a RAM. Would you agree?
>
> Not at all. The UART needs to know how many stop bits and what sort of
> parity to use whenever it transmits data. That can be completely
> asynchronous to the CPU data bus. If the UART needed to get this info
> from RAM, it would need another address port to that RAM. That's a very
> inefficient use of hardware to avoid storing 2 or 3 bits in a separate
> register. If you meant that the UART would read the RAM and then keep
> a local copy, how is this different (in terms of resource usage) from
> just having the register implemented in flip-flops?

If you think of it that way, (and sometimes I do) then the
microprocessor is the biggest waste of transistors ever
invented. A huge number of transistors, now in the billions,
to get data into, and out of, an arithmetic-logic-unit
containing thousands of transistors.

Most of the time, a large fraction of the logic isn't doing
anything at all!

Consider the old favorite of introductory digital logic
laboratory courses, the digital clock. Almost nothing happens
most of the time (ignore display multiplex for now), but once
a second the display is updated. In the 1970s, you would build
one out of TTL chips. Though the FFs had the ability to
switch at MHz rates, here they ran at 1 Hz or less. (Well,
divide down from 60 Hz.) Again, the transistors are being
wasted, but now in the time domain instead of the spatial
domain.

A small MCU, with small, built-in RAM and ROM (maybe external
ROM) has plenty of power to run a digital clock. Many more
transistors than the TTL version, and they are used more often
than the TTL version, but the economy of scale of building
small MCUs more than makes up for it.

As to the previous question, how to build a UART.

If you look inside a terminal server (not that anyone uses
them anymore) you find a microprocessor in place of 8 UARTs.
A single microprocessor is fast enough to collect the bits
from eight incoming serial ports, and drive the bits into
eight outgoing ports, along with keeping up the TCP connections
to the ethernet port.

I am sure the people who designed and built some of the early
computers would think it strange that we now have a loop waiting
for the user to type on the keyboard.

In the early days, single task batch processing made more
efficient use of the available resources. Not so much later,
multitasking allowed one to keep a single CPU busy, though with
less efficient use of RAM. (Decreasing cost of RAM vs. CPU.)

With an FPGA, one has the ability to keep a large number of
transistors (gates) busy a large fraction of the time, if one
has a problem big enough.

-- glen

Jan 22, 2014

Al,

Most "automatic" conversion of logic from LUTs to RAMs involves using the RAMs like ROMs, preloaded with constant data during configuration. Flash-based FPGAs from Microsemi do not have the ability to preload their BRAMs during "configuration." There is no "configuration" phase at/during startup during which they could automatically be preloaded.

Furthermore, the IGLOO/ProASIC3 series only provide synchronous BRAMs with a clock cycle delay between address in and data out. They can be inferred from RTL, so long as your RTL includes that clock cycle delay.
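
To make the one-cycle delay concrete, here is a small behavioural model of a synchronous-read RAM in Python. It is a sketch only: `SyncRam` and its `cycle` method are hypothetical names, and returning the old contents when the same address is written and read on one edge is an assumption, not a statement about any particular device.

```python
class SyncRam:
    """Single-port RAM whose read data appears one clock after the address."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.dout = 0  # registered read output

    def cycle(self, addr, we=False, din=0):
        """Simulate one clock cycle.

        Returns the read data visible during this cycle, which belongs to
        the address presented in the previous cycle.
        """
        visible = self.dout
        # At the clock edge ending this cycle the RAM samples its inputs.
        self.dout = self.mem[addr]   # assumed: old data on a simultaneous write
        if we:
            self.mem[addr] = din
        return visible


ram = SyncRam(16)
ram.cycle(3, we=True, din=42)   # write 42 to address 3
ram.cycle(5, we=True, din=7)    # write 7 to address 5
a = ram.cycle(3)   # address 3 presented; its data is not visible yet
b = ram.cycle(5)   # data for address 3 is visible now
c = ram.cycle(0)   # data for address 5 is visible now
print(a, b, c)     # prints: 0 42 7
```

RTL written so that the consumer of the read data expects it one cycle after the address, as in this model, is the style that can map onto such a RAM.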

If you have several identical slow speed interfaces (e.g. UARTs, SPI, I2C, etc.) that could happily run with an effective clock rate of a fraction of your system clock rate, look at C-slow optimization to reduce utilization. There are a few coding tricks that ease translating a single-channel module into a multi-channel, C-slowed module capable of replacing multiple copies of the original.

Retiming can be combined with C-slowing (the two are very synergistic) to enable the original clock rate to be increased, recovering some of the original per-channel performance.

Repipelining can be combined with C-slowing (also synergistic) to hide original design latency, thus recovering some of the per-channel performance without increasing the system clock rate.
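
A toy illustration of C-slowing in Python, under stated assumptions: the per-channel logic is reduced to an 8-bit running checksum (`next_state`), and `SingleChannel` and `CSlowed` are invented names. Each state register of the original module becomes a chain of C registers, so one copy of the logic serves C channels in round-robin order.

```python
from collections import deque


def next_state(state, data_in):
    """Shared combinational logic of the original single-channel design.

    An 8-bit running checksum stands in for the real per-channel logic.
    """
    return (state + data_in) & 0xFF


class SingleChannel:
    """Original module: one state register, one channel."""

    def __init__(self):
        self.state = 0

    def clock(self, data_in):
        self.state = next_state(self.state, data_in)
        return self.state


class CSlowed:
    """C-slowed module: the state register becomes a chain of C registers."""

    def __init__(self, c):
        self.chain = deque([0] * c)

    def clock(self, data_in):
        # The oldest entry belongs to the channel whose turn it is.
        state = self.chain.popleft()
        new_state = next_state(state, data_in)
        self.chain.append(new_state)
        return new_state


streams = [[1, 2, 3], [10, 20, 30], [100, 100, 100]]   # three channels
reference = [SingleChannel() for _ in streams]          # three module copies
shared = CSlowed(len(streams))                          # one C-slowed module

for t in range(3):                          # per-channel time step
    for ch, stream in enumerate(streams):   # one system clock per channel
        assert shared.clock(stream[t]) == reference[ch].clock(stream[t])

print([r.state for r in reference], list(shared.chain))
# prints: [6, 60, 44] [6, 60, 44]
```

Note that the number of state registers is unchanged (C copies either way); what is shared is the logic, and each channel only advances once every C system clocks.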

Andy

glen herrmannsfeldt · Jan 25, 2014

alb <alessandro.basili@cern.ch> wrote:

(snip)

(snip)

> Thanks for the hint. The way I understand it, this means inserting,
> say, 2 registers for each register, increasing latency without
> affecting the functionality, but allowing retiming. I haven't
> understood why they call it /C/-slowing and why /C/-registers...

I only learned about C-slow a year or two ago, and wasn't sure
why it was so different from the pipelining that computer
designers did in the 1960s and 1970s.

And yes, I don't know what the C is for.

-- glen

alb · Jan 25, 2014

Hi Andy, my apologies for such a delayed reply.

On 1/22/2014 1:27 AM, jonesandy@comcast.net wrote:
[]

> Most "automatic" conversion of logic from LUTs to RAMs involves
> using the RAMs like ROMs, preloaded with constant data during
> configuration. Flash-based FPGAs from Microsemi do not have the
> ability to preload their BRAMs during "configuration." There is no
> "configuration" phase at/during startup during which they could
> automatically be preloaded.

That is quite a good piece of information. So I can stop looking for
any 'automatic' mode. Moreover, any RAM-based logic I might take
advantage of would need to be initialized after power-up via external
commands. While this is certainly possible, it requires system
modifications which are not often welcome.

[]

> If you have several identical slow speed interfaces (e.g. UARTs,
> SPI, I2C, etc.) that could happily run with an effective clock rate
> of a fraction of your system clock rate, look at C-slow
> optimization to reduce utilization. There are a few coding tricks
> that ease translating a single-channel module into a multi-channel,
> C-slowed module capable of replacing multiple copies of the
> original.

Thanks for the hint. The way I understand it, this means inserting,
say, 2 registers for each register, increasing latency without
affecting the functionality, but allowing retiming. I haven't
understood why they call it /C/-slowing and why /C/-registers...

> Retiming can be combined with C-slowing (the two are very
> synergistic) to enable the original clock rate to be increased,
> recovering some of the original per-channel performance.
>
> Repipelining can be combined with C-slowing (also synergistic) to
> hide original design latency, thus recovering some of the
> per-channel performance without increasing the system clock rate.

These techniques do seem attractive when it comes to speed
optimization, but in my specific case I simply wanted to free up some
core logic to allow other functionality to be added to the design.
Since these 'features' were requested at such a late stage of the
design phase, it seems very unlikely that they can be added without an
architectural change.

That's the main reason why I was looking for some 'automatic' feature
to take advantage of on-chip resources without the need to change too
much RTL. I guess there's no easy fix for this.

Al

Jan 27, 2014

From what I understand, C-slowing originated as a state-space transform. I don't know what C stands for either.

The earliest reference I have seen for its application to digital circuits is:

C. Leiserson, F. Rose, and J. Saxe, "Optimizing synchronous circuitry by retiming," Proceedings of the 3rd Caltech Conference On VLSI, pp. 87-116, March 1983

I do not have access to the paper, but it is cited in many later papers as the initial work in the application of C-slowing to digital circuits. From the title, I would assume that they employed C-slowing to remove all single-clock-cycle feedback paths, which otherwise cannot be retimed.

Andy
