Tiny CPUs for Slow Logic

On Tuesday, March 19, 2019 at 1:13:38 AM UTC+1, gnuarm.del...@gmail.com wrote:
Most of us have implemented small processors for logic operations that don't need to happen at high speed. Simple CPUs can be built into an FPGA using a very small footprint much like the ALU blocks. There are stack based processors that are very small, smaller than even a few kB of memory.

If they were easily programmable in something other than C would anyone be interested? Or is a C compiler mandatory even for processors running very small programs?

I am picturing this not terribly unlike the sequencer I used many years ago on an I/O board for an array processor, which had its own assembler. It was very simple and easy to use, but very much not a high-level language. This would have a language that was high level, just not C; rather, something extensible, simple to use, and potentially interactive.
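To make "stack based processors that are very small" concrete, here is a minimal stack-machine interpreter sketched in Python; the instruction set (push/add/dup/jz/halt) is invented for illustration and does not correspond to any particular soft core:

```python
# Minimal stack-machine interpreter: a sketch of the kind of tiny core
# being discussed. The instruction set and encoding are invented for
# illustration, not taken from any real FPGA soft core.

def run(program, data_stack=None):
    """Execute a list of (op, arg) tuples; return the final data stack."""
    stack = list(data_stack or [])
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "jz":            # jump to arg if top of stack is zero
            if stack.pop() == 0:
                pc = arg
        elif op == "halt":
            break
    return stack

# Example: compute 2 + 3
result = run([("push", 2), ("push", 3), ("add", None), ("halt", None)])
print(result)  # [5]
```

The point is only that the whole "CPU" fits in a few dozen lines of behavior, which is why such cores synthesize so small.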

Rick C.

PicoBlaze is such a small CPU, and I would like to program it in something other than its assembly language.

--
svenn
 
On 19/03/19 14:29, Theo Markettos wrote:
Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
Understand XMOS's xCORE processors and xC language, see how
they complement and support each other. I found the net result
stunningly easy to get working first time, without having to
continually read obscure errata!

I can see the merits of the XMOS approach. But I'm unclear how this relates
to the OP's proposal, which (I think) is having tiny CPUs as hard
logic blocks on an FPGA, like DSP blocks.

A reasonable question.

A major problem with lots of communicating sequential
processors (such as the OP suggests) is how to /think/
about orchestrating them so they compute and communicate
to produce a useful result.

Once you have such a conceptual framework, thereafter you
can develop tools to help.

Oddly enough that occurred to CAR (Tony) Hoare back in
the 70s, and he produced the CSP (communicating sequential
processes) calculus.

In the 80s that was embodied in hardware and software, the
transputers and occam respectively. The modern variant is
the xCORE processors and xC.

They provide a concrete demonstration of one set of tools
and techniques that allow a cloud of processors to do
useful work.

That's something the GA144 conspicuously failed to achieve.
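As a rough illustration of the CSP idea, here is a sketch using Python threads and queues to stand in for occam/xC channels; note that real occam channels are a synchronous rendezvous, which Queue(maxsize=1) only approximates:

```python
# CSP-style communicating sequential processes, approximated with Python
# threads and queues. occam/xC channels are synchronous (rendezvous);
# Queue(maxsize=1) is only an approximation of that behavior.
import threading
import queue

def producer(chan):
    for i in range(5):
        chan.put(i)             # "chan ! i" in occam terms
    chan.put(None)              # end-of-stream sentinel

def doubler(chan_in, chan_out):
    while True:
        v = chan_in.get()       # "chan ? v"
        if v is None:
            chan_out.put(None)
            return
        chan_out.put(v * 2)

def collect(chan):
    out = []
    while True:
        v = chan.get()
        if v is None:
            return out
        out.append(v)

a, b = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
threads = [threading.Thread(target=producer, args=(a,)),
           threading.Thread(target=doubler, args=(a, b))]
for t in threads:
    t.start()
result = collect(b)
for t in threads:
    t.join()
print(result)  # [0, 2, 4, 6, 8]
```

Each process is sequential and only interacts through its channels, which is what makes a cloud of such processes tractable to reason about.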

The OP appears to have a vague concept of something running
through his head, but appears unwilling to understand what
has been tried, what has failed, and where the /conceptual/ and
practical problems lie.

Overall the OP is a bit like the UK Parliament at the moment.
Both know what they don't want, but can't articulate/decide
what they do want.

The UK Parliament is an unmitigated dysfunctional mess.



I completely understand the problem of running out of hardware threads, so
a means of 'just add another one' is handy. But the issue is how to combine
such things with other synthesised logic.

I don't think it is difficult to combine those, any more
or less than it is difficult to combine current traditional
hardware and software.


The XMOS approach is fine when the hardware is uniform and the software sits
on top, but when the hardware is synthesised and the 'CPUs' sit as pieces in
a fabric containing random logic (as I think the OP is suggesting) it
becomes a lot harder to reason about what the system is doing and what the
software running on such heterogeneous cores should look like. Only the
FPGA tools have a full view of what the system looks like, and it seems
stretching them to have them also generate software to run on these cores.

Through long experience, I'm wary of any single tool that
claims to do everything from top to bottom. They always
work well for things that fit their constraints, but badly
otherwise.

N.B. that includes a single programming style from top to
bottom of a software application. I've used top-level FSMs
expressed in GC'ed OOP languages that had procedural runtimes.
Why? Because the application domain was inherently FSM based,
the GC'ed OOP tools were the best way to create distributed high
availability systems, and the procedural language was the best
way to create the runtime.

I have comparable examples involving hardware all the
way from low-noise analogue electronics upwards.

Moral: choose the right conceptual framework for each part
of the problem.


We are not talking about a multi- or many- core chip here, with the CPUs as
the primary element of compute, but the CPUs scattered around as 'state
machine elements' just ups the complexity and makes it harder to understand
compared with the same thing synthesised out of flip-flops.

It is up to the OP to give us a clue as to example problems
and solutions, and why his concepts are significantly better
than existing techniques.


I would be interested to know what applications might use heterogeneous
many-cores and what performance is achievable.

Yup.

The "granularity" of the computation and communication will
be a key to understanding what the OP is thinking.
 
On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:
On 19/03/19 14:29, Theo Markettos wrote:
Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
Understand XMOS's xCORE processors and xC language, see how
they complement and support each other. I found the net result
stunningly easy to get working first time, without having to
continually read obscure errata!

I can see the merits of the XMOS approach. But I'm unclear how this relates
to the OP's proposal, which (I think) is having tiny CPUs as hard
logic blocks on an FPGA, like DSP blocks.

A reasonable question.

A major problem with lots of communicating sequential
processors (such as the OP suggests) is how to /think/
about orchestrating them so they compute and communicate
to produce a useful result.

Once you have such a conceptual framework, thereafter you
can develop tools to help.

Oddly enough that occurred to CAR (Tony) Hoare back in
the 70s, and he produced the CSP (communicating sequential
processes) calculus.

Which had surprisingly little influence on how the majority (not majority in the sense of 70%; majority in the sense of 99.7%) of the industry solves its problems.

In the 80s that was embodied in hardware and software, the
transputers and occam respectively. The modern variant is
the xCORE processors and xC.

The same as above.

They provide a concrete demonstration of one set of tools
and techniques that allow a cloud of processors to do
useful work.

That's something the GA144 conspicuously failed to achieve.

The OP appears to have a vague concept of something running
through his head, but appears unwilling to understand what
has been tried, what has failed, and where the /conceptual/ and
practical problems lie.

Overall the OP is a bit like the UK Parliament at the moment.
Both know what they don't want, but can't articulate/decide
what they do want.

The UK Parliament is an unmitigated dysfunctional mess.

Do you prefer dysfunctional mesh ;)

I completely understand the problem of running out of hardware threads, so
a means of 'just add another one' is handy. But the issue is how to combine
such things with other synthesised logic.

I don't think it is difficult to combine those, any more
or less than it is difficult to combine current traditional
hardware and software.


The XMOS approach is fine when the hardware is uniform and the software sits
on top, but when the hardware is synthesised and the 'CPUs' sit as pieces in
a fabric containing random logic (as I think the OP is suggesting) it
becomes a lot harder to reason about what the system is doing and what the
software running on such heterogeneous cores should look like. Only the
FPGA tools have a full view of what the system looks like, and it seems
stretching them to have them also generate software to run on these cores.

Through long experience, I'm wary of any single tool that
claims to do everything from top to bottom. They always
work well for things that fit their constraints, but badly
otherwise.

N.B. that includes a single programming style from top to
bottom of a software application. I've used top-level FSMs
expressed in GC'ed OOP languages that had procedural runtimes.
Why? Because the application domain was inherently FSM based,
the GC'ed OOP tools were the best way to create distributed high
availability systems, and the procedural language was the best
way to create the runtime.

I have comparable examples involving hardware all the
way from low-noise analogue electronics upwards.

Moral: choose the right conceptual framework for each part
of the problem.


We are not talking about a multi- or many- core chip here, with the CPUs as
the primary element of compute, but the CPUs scattered around as 'state
machine elements' just ups the complexity and makes it harder to understand
compared with the same thing synthesised out of flip-flops.

It is up to the OP to give us a clue as to example problems
and solutions, and why his concepts are significantly better
than existing techniques.


I would be interested to know what applications might use heterogeneous
many-cores and what performance is achievable.

Yup.

The "granularity" of the computation and communication will
be a key to understanding what the OP is thinking.

I don't know what Rick had in mind.
I personally would go for one "hard-CPU" block per 4000-5000 6-input logic elements (i.e. Altera ALMs or Xilinx CLBs). Each block could be configured either as one 64-bit core or as a pair of 32-bit cores. The block would contain hard instruction decoders/ALUs/shifters and hard register files. It can optionally borrow adjacent DSP blocks for multipliers. Adjacent embedded memory blocks can be used for data memory. Code memory should be a bit more flexible, giving the designer a choice between embedded memory blocks or distributed memory (X)/MLABs (A).
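Back-of-envelope arithmetic for that ratio; the device sizes below are round illustrative numbers, not vendor specifications:

```python
# Rough budget for "one hard-CPU block per 4000-5000 logic elements".
# The device sizes used here are illustrative round numbers, not specs
# of any actual Altera/Xilinx part.
def cpu_blocks(logic_elements, les_per_block=4500):
    """How many hard-CPU blocks fit at the proposed ratio."""
    return logic_elements // les_per_block

for name, les in [("small", 25_000), ("mid-range", 110_000), ("large", 450_000)]:
    blocks = cpu_blocks(les)
    print(f"{name}: {les} LEs -> {blocks} CPU blocks "
          f"({blocks} x 64-bit or {2 * blocks} x 32-bit cores)")
```

So even a mid-range device would carry a couple of dozen such blocks, which is the scale of "scattered state-machine elements" under discussion.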
 
On 19/03/19 17:35, already5chosen@yahoo.com wrote:
On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:
On 19/03/19 14:29, Theo Markettos wrote:
Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
Understand XMOS's xCORE processors and xC language, see how they
complement and support each other. I found the net result stunningly
easy to get working first time, without having to continually read
obscure errata!

I can see the merits of the XMOS approach. But I'm unclear how this
relates to the OP's proposal, which (I think) is having tiny CPUs as
hard logic blocks on an FPGA, like DSP blocks.

A reasonable question.

A major problem with lots of communicating sequential processors (such as
the OP suggests) is how to /think/ about orchestrating them so they compute
and communicate to produce a useful result.

Once you have such a conceptual framework, thereafter you can develop tools
to help.

Oddly enough that occurred to CAR (Tony) Hoare back in the 70s, and he
produced the CSP (communicating sequential processes) calculus.


Which had surprisingly little influence on how the majority (not majority in the
sense of 70%; majority in the sense of 99.7%) of the industry solves its problems.

That's principally because Moore's "law" enabled people to
avoid confronting the issues. Now that Moore's "law" has run
out of steam, the future becomes more interesting.

Note that TI included some of the concepts in its DSP processors.

Golang has included some of the concepts.

Many libraries included some of the concepts.



In the 80s that was embodied in hardware and software, the transputers and
occam respectively. The modern variant is the xCORE processors and xC.


The same as above.

They provide a concrete demonstration of one set of tools and techniques
that allow a cloud of processors to do useful work.

That's something the GA144 conspicuously failed to achieve.

The OP appears to have a vague concept of something running through his
head, but appears unwilling to understand what has been tried, what has
failed, and where the /conceptual/ and practical problems lie.

Overall the OP is a bit like the UK Parliament at the moment. Both know
what they don't want, but can't articulate/decide what they do want.

The UK Parliament is an unmitigated dysfunctional mess.


Do you prefer dysfunctional mesh ;)

:) I'll settle for anything that /works/ predictably :(



I completely understand the problem of running out of hardware threads,
so a means of 'just add another one' is handy. But the issue is how to
combine such things with other synthesised logic.

I don't think it is difficult to combine those, any more or less than it is
difficult to combine current traditional hardware and software.


The XMOS approach is fine when the hardware is uniform and the software
sits on top, but when the hardware is synthesised and the 'CPUs' sit as
pieces in a fabric containing random logic (as I think the OP is
suggesting) it becomes a lot harder to reason about what the system is
doing and what the software running on such heterogeneous cores should
look like. Only the FPGA tools have a full view of what the system looks
like, and it seems stretching them to have them also generate software to
run on these cores.

Through long experience, I'm wary of any single tool that claims to do
everything from top to bottom. They always work well for things that fit
their constraints, but badly otherwise.

N.B. that includes a single programming style from top to bottom of a
software application. I've used top-level FSMs expressed in GC'ed OOP
languages that had procedural runtimes. Why? Because the application domain
was inherently FSM based, the GC'ed OOP tools were the best way to create
distributed high availability systems, and the procedural language was the
best way to create the runtime.

I have comparable examples involving hardware all the way from low-noise
analogue electronics upwards.

Moral: choose the right conceptual framework for each part of the problem.


We are not talking about a multi- or many- core chip here, with the CPUs
as the primary element of compute, but the CPUs scattered around as
'state machine elements' just ups the complexity and makes it harder to
understand compared with the same thing synthesised out of flip-flops.

It is up to the OP to give us a clue as to example problems and solutions,
and why his concepts are significantly better than existing techniques.


I would be interested to know what applications might use heterogeneous
many-cores and what performance is achievable.

Yup.

The "granularity" of the computation and communication will be a key to
understanding what the OP is thinking.

I don't know what Rick had in mind. I personally would go for one "hard-CPU"
block per 4000-5000 6-input logic elements (i.e. Altera ALMs or Xilinx CLBs).
Each block could be configured either as one 64-bit core or pair of 32-bit
cores. The block would contain hard instruction decoders/ALUs/shifters and
hard register files. It can optionally borrow adjacent DSP blocks for
multipliers. Adjacent embedded memory blocks can be used for data memory.
Code memory should be a bit more flexible, giving the designer a choice between
embedded memory blocks or distributed memory (X)/MLABs (A).

It would be interesting to find an application level
description (i.e. language constructs) that
- could be automatically mapped onto those primitives
by a toolset
- was useful for more than a niche subset of applications
- was significantly better than existing tools

I wouldn't hold my breath :)
 
On Tuesday, March 19, 2019 at 10:29:07 AM UTC-4, Theo Markettos wrote:
Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
Understand XMOS's xCORE processors and xC language, see how
they complement and support each other. I found the net result
stunningly easy to get working first time, without having to
continually read obscure errata!

I can see the merits of the XMOS approach. But I'm unclear how this relates
to the OP's proposal, which (I think) is having tiny CPUs as hard
logic blocks on an FPGA, like DSP blocks.

I completely understand the problem of running out of hardware threads, so
a means of 'just add another one' is handy. But the issue is how to combine
such things with other synthesised logic.

The XMOS approach is fine when the hardware is uniform and the software sits
on top, but when the hardware is synthesised and the 'CPUs' sit as pieces in
a fabric containing random logic (as I think the OP is suggesting) it
becomes a lot harder to reason about what the system is doing and what the
software running on such heterogeneous cores should look like. Only the
FPGA tools have a full view of what the system looks like, and it seems
stretching them to have them also generate software to run on these cores.

When people talk about things like "software running on such heterogeneous cores" it makes me think they don't really understand how this could be used. If you treat these small cores like logic elements, you don't have such lofty descriptions of "system software" since the software isn't created out of some global software package. Each core is designed to do a specific job just like any other piece of hardware and it has discrete inputs and outputs just like any other piece of hardware. If the hardware clock is not too fast, the software can synchronize with and literally function like hardware, but implementing more complex logic than the same area of FPGA fabric might.

There is no need to think about how the CPUs would communicate unless there is a specific need for them to do so. The F18A uses a handshaked parallel port in their design. They seem to have done a pretty slick job of it and can actually hang the processor waiting for the acknowledgement saving power and getting an instantaneous wake up following the handshake. This can be used with other CPUs or
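A sketch of the general req/ack idea (not the actual F18A protocol, which is not detailed here): the reader blocks until the writer signals data-valid, then acknowledges, so the "core" simply hangs until data arrives:

```python
# Sketch of a handshaked parallel port between two cores: the reader
# blocks (as the F18A reportedly hangs, saving power) until the writer
# raises data-valid, then acknowledges. This models the generic req/ack
# idea only, not the actual F18A port design.
import threading

class HandshakePort:
    def __init__(self):
        self._cond = threading.Condition()
        self._data = None
        self._valid = False

    def write(self, value):
        with self._cond:
            while self._valid:          # wait for previous word to be taken
                self._cond.wait()
            self._data, self._valid = value, True
            self._cond.notify_all()     # "data valid" edge

    def read(self):
        with self._cond:
            while not self._valid:      # core "hangs" here until data arrives
                self._cond.wait()
            value, self._valid = self._data, False
            self._cond.notify_all()     # "acknowledge" edge
            return value

port = HandshakePort()
received = []
reader = threading.Thread(
    target=lambda: received.extend(port.read() for _ in range(3)))
reader.start()
for word in (10, 20, 30):
    port.write(word)
reader.join()
print(received)  # [10, 20, 30]
```

The blocking read is the key property: no polling loop burns cycles, and wake-up is immediate when the handshake completes.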


We are not talking about a multi- or many- core chip here, with the CPUs as
the primary element of compute, but the CPUs scattered around as 'state
machine elements' just ups the complexity and makes it harder to understand
compared with the same thing synthesised out of flip-flops.

Not sure what is hard to think about. It's a small CPU with limited memory implementing small tasks; it can do rather complex operations compared to a state machine, and it includes memory, arithmetic and logic as well as I/O without having to write a single line of HDL. Only the actual app needs to be written.


I would be interested to know what applications might use heterogeneous
many-cores and what performance is achievable.

Yes, clearly not getting the concept. Asking about heterogeneous performance is totally antithetical to this idea.

Rick C.
 
On Tuesday, March 19, 2019 at 11:24:33 AM UTC-4, Svenn Are Bjerkem wrote:
On Tuesday, March 19, 2019 at 1:13:38 AM UTC+1, gnuarm.del...@gmail.com wrote:
Most of us have implemented small processors for logic operations that don't need to happen at high speed. Simple CPUs can be built into an FPGA using a very small footprint much like the ALU blocks. There are stack based processors that are very small, smaller than even a few kB of memory.

If they were easily programmable in something other than C would anyone be interested? Or is a C compiler mandatory even for processors running very small programs?

I am picturing this not terribly unlike the sequencer I used many years ago on an I/O board for an array processor, which had its own assembler. It was very simple and easy to use, but very much not a high-level language. This would have a language that was high level, just not C; rather, something extensible, simple to use, and potentially interactive.

Rick C.

PicoBlaze is such a small CPU, and I would like to program it in something other than its assembly language.

Yes, it is small. How large is the program you are interested in?

Rick C.
 
On 20/03/2019 03:30, gnuarm.deletethisbit@gmail.com wrote:
On Tuesday, March 19, 2019 at 10:29:07 AM UTC-4, Theo Markettos
wrote:
Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
Understand XMOS's xCORE processors and xC language, see how they
complement and support each other. I found the net result
stunningly easy to get working first time, without having to
continually read obscure errata!

I can see the merits of the XMOS approach. But I'm unclear how
this relates to the OP's proposal, which (I think) is having tiny
CPUs as hard logic blocks on an FPGA, like DSP blocks.

I completely understand the problem of running out of hardware
threads, so a means of 'just add another one' is handy. But the
issue is how to combine such things with other synthesised logic.

The XMOS approach is fine when the hardware is uniform and the
software sits on top, but when the hardware is synthesised and the
'CPUs' sit as pieces in a fabric containing random logic (as I
think the OP is suggesting) it becomes a lot harder to reason about
what the system is doing and what the software running on such
heterogeneous cores should look like. Only the FPGA tools have a
full view of what the system looks like, and it seems stretching
them to have them also generate software to run on these cores.

When people talk about things like "software running on such
heterogeneous cores" it makes me think they don't really understand
how this could be used. If you treat these small cores like logic
elements, you don't have such lofty descriptions of "system software"
since the software isn't created out of some global software package.
Each core is designed to do a specific job just like any other piece
of hardware and it has discrete inputs and outputs just like any
other piece of hardware. If the hardware clock is not too fast, the
software can synchronize with and literally function like hardware,
but implementing more complex logic than the same area of FPGA fabric
might.

That is software.

If you want to try to get cycle-precise control of the software and use
that precision for direct hardware interfacing, you are almost certainly
going to have a poor, inefficient and difficult design. It doesn't
matter if you say "think of it like logic" - it is /not/ logic, it is
software, and you don't use that for cycle-precise control. You use it
when you need flexibility, calculations, and decisions.

There is no need to think about how the CPUs would communicate unless
there is a specific need for them to do so. The F18A uses a
handshaked parallel port in their design. They seem to have done a
pretty slick job of it and can actually hang the processor waiting
for the acknowledgement saving power and getting an instantaneous
wake up following the handshake. This can be used with other CPUs or

Fair enough.

We are not talking about a multi- or many- core chip here, with the
CPUs as the primary element of compute, but the CPUs scattered
around as 'state machine elements' just ups the complexity and
makes it harder to understand compared with the same thing
synthesised out of flip-flops.

Not sure what is hard to think about. It's a small CPU with limited
memory implementing small tasks; it can do rather complex operations
compared to a state machine, and it includes memory, arithmetic and
logic as well as I/O without having to write a single line of HDL.
Only the actual app needs to be written.


I would be interested to know what applications might use
heterogeneous many-cores and what performance is achievable.

Yes, clearly not getting the concept. Asking about heterogeneous
performance is totally antithetical to this idea.

Rick C.
 
Am 19.03.19 um 16:24 schrieb Svenn Are Bjerkem:
PicoBlaze is such a small CPU, and I would like to program it in something other than its assembly language.

It would be possible to write a C compiler for it (with some
restrictions, such as functions being non-reentrant). The architecture
doesn't seem any worse than PIC. And there are / were pic14 and pic16
backends in SDCC.

Philipp
 
On Wednesday, March 20, 2019 at 4:32:07 AM UTC+2, gnuarm.del...@gmail.com wrote:
On Tuesday, March 19, 2019 at 11:24:33 AM UTC-4, Svenn Are Bjerkem wrote:
On Tuesday, March 19, 2019 at 1:13:38 AM UTC+1, gnuarm.del...@gmail.com wrote:
Most of us have implemented small processors for logic operations that don't need to happen at high speed. Simple CPUs can be built into an FPGA using a very small footprint much like the ALU blocks. There are stack based processors that are very small, smaller than even a few kB of memory.

If they were easily programmable in something other than C would anyone be interested? Or is a C compiler mandatory even for processors running very small programs?

I am picturing this not terribly unlike the sequencer I used many years ago on an I/O board for an array processor, which had its own assembler. It was very simple and easy to use, but very much not a high-level language. This would have a language that was high level, just not C; rather, something extensible, simple to use, and potentially interactive.

Rick C.

PicoBlaze is such a small CPU, and I would like to program it in something other than its assembly language.

Yes, it is small. How large is the program you are interested in?

Rick C.

I don't know about Svenn Are Bjerkem, but can tell you about myself.
Last time I considered something like that and wrote enough of the program to make measurements, the program contained ~250 Nios2 instructions. I'd guess that on a minimalistic stack machine it would take 350-400 instructions.
In the end, I didn't do it in software. Coding the same functionality in HDL turned out not to be hard, which probably suggests that my case was smaller than average.

At the other extreme, where I did end up using a "small" soft core, it was much more like "real" software: 2300 Nios2 instructions.
 
On Tuesday, March 19, 2019 at 10:07:38 PM UTC+2, Tom Gardner wrote:
On 19/03/19 17:35, already5chosen@yahoo.com wrote:
On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:

The UK Parliament is an unmitigated dysfunctional mess.


Do you prefer dysfunctional mesh ;)

:) I'll settle for anything that /works/ predictably :(

The UK political system is completely off-topic in comp.arch.fpga. However, I'd say that IMHO right now your parliament is facing an unusually difficult problem, but at the same time it's not really a "life or death" sort of problem. Having trouble and appearing indecisive in such a situation is normal. It does not mean that the system is broken.
 
gnuarm.deletethisbit@gmail.com wrote:
On Tuesday, March 19, 2019 at 10:29:07 AM UTC-4, Theo Markettos wrote:

When people talk about things like "software running on such heterogeneous
cores" it makes me think they don't really understand how this could be
used. If you treat these small cores like logic elements, you don't have
such lofty descriptions of "system software" since the software isn't
created out of some global software package. Each core is designed to do
a specific job just like any other piece of hardware and it has discrete
inputs and outputs just like any other piece of hardware. If the hardware
clock is not too fast, the software can synchronize with and literally
function like hardware, but implementing more complex logic than the same
area of FPGA fabric might.

The point is that we need to understand what the whole system is doing. In
the XMOS case, we can look at a piece of software with N threads, running
across the cores provided on the chip. One piece of software, distributed
over the hardware resource available - the system is doing one thing.

Your bottom-up approach means it's difficult to see the big picture of
what's going on. That means it's hard to understand the whole system, and
to program from a whole-system perspective.

Not sure what is hard to think about. It's a CPU, a small CPU with
limited memory to implement small tasks that can do rather complex
operations compared to a state machine really and includes memory,
arithmetic and logic as well as I/O without having to write a single line
of HDL. Only the actual app needs to be written.

Here are the semantic descriptions of basic logic elements:

LUT: q = f(x,y,z)
FF: q <= d_in (delay of one cycle)
BRAM: q = array[addr]
DSP: q = a*b + c

A P&R tool can build a system out of these building blocks. It's notable
that the state-holding elements in this schema do nothing except hold
state. That makes writing the tools easier (and we all know how
difficult the tools already are). In general, we don't tend to instantiate
these primitives manually but describe the higher level functions (eg a 64
bit add) in HDL and allow the tools to select appropriate primitives for us
(eg a number of fast-adder blocks chained together).
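Those four semantics can be modeled directly; the sketch below (names invented for illustration) shows that only the state-holding element carries values across cycles:

```python
# The primitive semantics listed above, modeled directly. The point the
# post makes - state-holding elements do nothing but hold state - shows
# up here as: only FF carries a value from one cycle to the next.
def lut(f, *inputs):            # LUT: q = f(x, y, z)
    return f(*inputs)

def dsp(a, b, c):               # DSP: q = a*b + c
    return a * b + c

class FF:                       # FF: q <= d_in (delay of one cycle)
    def __init__(self):
        self.q = 0
    def clock(self, d_in):
        self.q, out = d_in, self.q
        return out              # value that was captured on the previous edge

# A 1-bit toggle: a LUT inverts, the FF delays by one cycle.
ff = FF()
trace = []
for _ in range(4):
    d = lut(lambda q: 1 - q, ff.q)   # combinational: invert current state
    trace.append(ff.clock(d))        # sequential: q updates at the "edge"
print(trace)  # [0, 1, 0, 1]
```

A processor has no comparably tiny equation, which is exactly the inference problem the following paragraphs raise.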

What's the logic equation of a processor? It has state, but vastly more
state than the simplicity of a flipflop. What pattern does the P&R tool
need to match to infer a processor? How is any verification tool going
to understand whether the processor with software is doing the right thing?

If your answer is 'we don't need verification tools, we program by hand'
then a) software has bugs, and automated verification is a handy way to
catch them, and b) you're never going to be writing hundreds of different
mini-programs to run on each core, let alone make them correct.

If we scale the processors up a bit, I could see the merits in a bank
of, say, 32 Cortex M0s that could be interconnected as part of the FPGA
fabric and programmed in software for dedicated tasks (for instance, read
the I2C EEPROM on the DRAM DIMM and configure the DRAM controller at boot).
But this is an SoC construct (built using SoC builder tools, and over which
the programmer has some purview although, as it turns out, sketchier than
you might think[1]). Such CPUs would likely be running bigger corpora of
software (for instance, the DRAM controller vendor's provided initialisation
code) which would likely be in C. But in this case we could just use a
soft-core today (the CPU ISA is mostly irrelevant for this application, so a
RISC-V/Microblaze/NIOS would be fine).

[1] https://inf.ethz.ch/personal/troscoe/pubs/hotos15-gerber.pdf

I can also see another niche, at the extreme bottom end, where a CPLD might
have one of your processors plus a few hundred logic cells. That's
essentially a microcontroller with FPGA, or an FPGA with microcontroller -
which some of the vendors already produce (although possibly not
small/cheap/low power enough). Here I can't see the advantages of using a
stack-based CPU versus paying a bit more to program in C. Although I don't
have experience in markets where the retail price of the product is $1, and so
every $0.001 matters.

I would be interested to know what applications might use heterogeneous
many-cores and what performance is achievable.

Yes, clearly not getting the concept. Asking about heterogeneous
performance is totally antithetical to this idea.

You keep mentioning 700 MIPS, which suggests performance is important. If
these are simple state machine replacements, why do we care about
performance?


In essence, your proposal has a disconnect between the situations existing
FPGA blocks are used (implemented automatically by P&R tools) and the
situations software is currently used (human-driven software and
architectural design). It's unclear how you claim to bridge this gap.

Theo
 
On Tuesday, March 19, 2019 at 10:07:38 PM UTC+2, Tom Gardner wrote:
On 19/03/19 17:35, already5chosen@yahoo.com wrote:
On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:
The "granularity" of the computation and communication will be a key to
understanding what the OP is thinking.

I don't know what Rick had in mind. I personally would go for one "hard-CPU"
block per 4000-5000 6-input logic elements (i.e. Altera ALMs or Xilinx CLBs).
Each block could be configured either as one 64-bit core or pair of 32-bit
cores. The block would contain hard instruction decoders/ALUs/shifters and
hard register files. It can optionally borrow adjacent DSP blocks for
multipliers. Adjacent embedded memory blocks can be used for data memory.
Code memory should be a bit more flexible, giving the designer a choice between
embedded memory blocks or distributed memory (X)/MLABs (A).

It would be interesting to find an application level
description (i.e. language constructs) that
- could be automatically mapped onto those primitives
by a toolset
- was useful for more than a niche subset of applications
- was significantly better than existing tools

I wouldn't hold my breath :)

I think you are looking at it from the wrong angle.
One doesn't really need new tools to design and simulate such things. What's needed is a combination of existing tools - compilers, assemblers, and probably software-simulator plug-ins for existing HDL simulators; but the latter is just a luxury for speeding up simulations - in principle, feeding the HDL simulator an RTL model of the CPU core will work too.

As to niches, all "hard" blocks that we currently have in FPGAs are about niches. It's extremely rare that a user's design uses all, or even a majority, of the features of a given FPGA device and needs LUTs, embedded memories, PLLs, multipliers, SERDESes, DDR DRAM I/O blocks, etc. in exactly the amounts appearing in the device.
It still makes sense, economically, to have them all built in, because masks and other NREs are mighty expensive while silicon itself is relatively cheap. Multiple small hard CPU cores are really not very different from the features mentioned above.
 
On 20/03/19 10:41, already5chosen@yahoo.com wrote:
On Tuesday, March 19, 2019 at 10:07:38 PM UTC+2, Tom Gardner wrote:
On 19/03/19 17:35, already5chosen@yahoo.com wrote:
On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:

The UK Parliament is an unmitigated dysfunctional mess.


Do you prefer dysfunctional mesh ;)

:) I'll settle for anything that /works/ predictably :(


The UK political system is completely off-topic in comp.arch.fpga. However, I'd say that IMHO your parliament is currently facing an unusually difficult problem, but at the same time it's not really a "life or death" sort of problem. Having trouble and appearing indecisive in such a situation is normal. It does not mean that the system is broken.

Firstly, you chose to snip the analogy, thus removing the context.

Secondly, actually currently there are /very/ plausible reasons
to believe it might be life or death for my 98yo mother, and
may hasten my death. No, I'm not going to elaborate on a public
forum.

I will note that Operation Yellowhammer will, barring miracles,
be started on Monday, and that a prominent *brexiteer* (Michael Gove)
is shit scared of a no-deal exit because all the chemicals required
to purify our drinking water come from Europe.
 
already5chosen@yahoo.com wrote:
As to niches, all "hard" blocks that we currently have in FPGAs are about
niches. It's extremely rare that a user's design uses all, or even a
majority, of the features of a given FPGA device and needs LUTs, embedded
memories, PLLs, multipliers, SERDESes, DDR DRAM I/O blocks, etc. in exactly
the amounts appearing in the device. It still makes sense, economically,
to have them all built in, because masks and other NREs are mighty expensive
while silicon itself is relatively cheap. Multiple small hard CPU cores are
really not very different from the features mentioned above.

A lot of these 'niches' have been proven in soft-logic.

Implement your system in soft-logic, discover that there's lots of
multiply-adds and they're slow and take up area. A DSP block is thus an
'accelerator' (or 'most compact representation') of the same concept in
soft-logic.
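To make the multiply-add point concrete, here is a sketch in C of the kernel Theo is describing (the tap count, types, and function name are invented for illustration). A loop like this, implemented in soft logic, is what a hard DSP block "accelerates":

```c
#include <stdint.h>

#define NTAPS 8

/* FIR-style multiply-accumulate kernel: the pattern that FPGA
 * tools map onto hard DSP blocks rather than LUT fabric.
 * Tap count and widths are illustrative only. */
int32_t fir_step(const int16_t coeff[NTAPS], const int16_t delay[NTAPS])
{
    int32_t acc = 0;
    for (int i = 0; i < NTAPS; i++)
        acc += (int32_t)coeff[i] * delay[i];  /* one MAC per DSP slice */
    return acc;
}
```

Each 16x16 multiply feeding a wide accumulator is exactly the shape of a DSP slice, which is why the hard block wins roughly an order of magnitude in area over the LUT version.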

The same goes for BRAMs (can be implemented via registers but too much
area), adders (slow when implemented with generic LUTs), etc.

Other features (SERDES, PLLs, DDR, etc) can't be done at all without
hard-logic support. If you want those features, you need the hard logic,
simple as that.

Through analysis of existing designs we can show a provable win of hard
logic over soft, making it worthwhile to put it on the silicon and
integrate it into the tools. In some of these cases, I'd guess the win over
the soft-logic is a 10x or more saving in area.

Rick's idea can be done today in soft-logic. So someone could build a proof
of concept and measure the cases where it improves things over the baseline.
If that case is compelling, let's put it in the hard logic.

But thus far we haven't seen a clear case for why someone should build a
proof of concept. I'm not saying it doesn't exist, but we need a clear
elucidation of the problem that it might solve.

Theo
 
On 20/03/19 10:56, already5chosen@yahoo.com wrote:
On Tuesday, March 19, 2019 at 10:07:38 PM UTC+2, Tom Gardner wrote:
On 19/03/19 17:35, already5chosen@yahoo.com wrote:
On Tuesday, March 19, 2019 at 6:19:36 PM UTC+2, Tom Gardner wrote:
The "granularity" of the computation and communication will be a key
to understanding what the OP is thinking.

I don't know what Rick had in mind. I personally would go for one
"hard-CPU" block per 4000-5000 6-input logic elements (i.e. Altera ALMs
or Xilinx CLBs). Each block could be configured either as one 64-bit core
or a pair of 32-bit cores. The block would contain hard instruction
decoders/ALUs/shifters and hard register files. It could optionally borrow
adjacent DSP blocks for multipliers. Adjacent embedded memory blocks could
be used for data memory. Code memory should be a bit more flexible, giving
the designer a choice between embedded memory blocks or distributed memory
(X)/MLABs(A).

It would be interesting to find an application level description (i.e.
language constructs) that - could be automatically mapped onto those
primitives by a toolset - was useful for more than a niche subset of
applications - was significantly better than existing tools

I wouldn't hold my breath :)


I think you are looking at it from the wrong angle. One doesn't really need
new tools to design and simulate such things. What's needed is a combination
of existing tools - compilers, assemblers, and probably software-simulator
plug-ins for existing HDL simulators; but the latter is just a luxury for
speeding up simulations - in principle, feeding the HDL simulator an RTL
model of the CPU core will work too.

That would be one perfectly acceptable embodiment of the toolset
that I mentioned.

But more difficult than creating such a toolset is defining
an application-level description that a toolset can munge.

So, define (initially by example, later more formally) inputs
to the toolset and outputs from it. Then we can judge whether
the concepts are more than handwaving wishes.



As to niches, all "hard" blocks that we currently have in FPGAs are about
niches. It's extremely rare that user's design uses all or majority of the
features of given FPGA device and need LUTs, embedded memories, PLLs,
multiplies, SERDESs, DDR DRAM I/O blocks etc in exact amounts appearing in
the device. It still makes sense, economically, to have them all built in,
because masks and other NREs are mighty expensive while silicon itself is
relatively cheap. Multiple small hard CPU cores are really not very different
from features, mentioned above.

All the blocks you mention have a simple API and an easily
enumerated set of behaviours.

The whole point of processors is that they enable much more
complex behaviour that is practically impossible to enumerate.

Alternatively, if it is possible to enumerate the behaviour
of a processor, then it would be easy and more efficient to
implement the behaviour in conventional logic blocks.
 
On Wednesday, March 20, 2019 at 3:37:17 PM UTC+2, Tom Gardner wrote:
But more difficult than creating such a toolset is defining
an application-level description that a toolset can munge.

So, define (initially by example, later more formally) inputs
to the toolset and outputs from it. Then we can judge whether
the concepts are more than handwaving wishes.

I don't understand what you are asking for.

If I had such a thing, I'd use it in exactly the same way that I use soft cores (Nios2) today. I would just use them more frequently, because today they cost me logic resources (often acceptable, but not always) and synthesis and fitter time (and that's what I really hate). A "hard" core, on the other hand, would be almost free in both respects.
In HDL simulations it would be as expensive as a "soft" core, or even costlier, but until now I have managed to avoid "full system" simulations that cover everything including the CPU core and the program that runs on it. Or maybe I did it once or twice years ago and no longer remember. Anyway, for me it's not an important concern, and I consider myself a rather heavy user of soft cores.

Also, theoretically, if the performance of the hard core is non-trivially higher than that of soft cores, either due to higher IPC (I didn't measure, but would guess that for the majority of tasks Nios2-f IPC is 20-30% lower than ARM Cortex-M4) or due to a higher clock rate, then it will open up even more niches. However, I'd expect the performance factor to be less important for me, personally, than the other factors mentioned above.
 
On 20/03/19 14:11, already5chosen@yahoo.com wrote:
On Wednesday, March 20, 2019 at 3:37:17 PM UTC+2, Tom Gardner wrote:

But more difficult than creating such a toolset is defining an application
level description that a toolset can munge.

So, define (initially by example, later more formally) inputs to the
toolset and outputs from it. Then we can judge whether the concepts are
more than handwaving wishes.


I don't understand what you are asking for.

Go back and read the parts of my post that you chose to snip.

Give a handwaving indication of the concepts that avoid the
conceptual problems that I mentioned.

Or better still, get the OP to do it.



If I had such a thing, I'd use it in exactly the same way that I use soft cores
(Nios2) today. I would just use them more frequently, because today they cost
me logic resources (often acceptable, but not always) and synthesis and
fitter time (and that's what I really hate). A "hard" core, on the other hand,
would be almost free in both respects. In HDL simulations it would be as
expensive as a "soft" core, or even costlier, but until now I have managed to
avoid "full system" simulations that cover everything including the CPU core
and the program that runs on it. Or maybe I did it once or twice years ago
and no longer remember. Anyway, for me it's not an important concern, and I
consider myself a rather heavy user of soft cores.

Also, theoretically, if performance of the hard core is non-trivially higher
than that of soft cores, either due to higher IPC (I didn't measure, but
would guess that for majority of tasks Nios2-f IPC is 20-30% lower than ARM
Cortex-M4) or due to higher clock rate, then it will open up even more
niches. However I'd expect that performance factor would be less important
for me, personally, than other factors mentioned above.
 
On Wednesday, March 20, 2019 at 6:14:21 AM UTC-4, David Brown wrote:
On 20/03/2019 03:30, gnuarm.deletethisbit@gmail.com wrote:
On Tuesday, March 19, 2019 at 10:29:07 AM UTC-4, Theo Markettos
wrote:
Tom Gardner <spamjunk@blueyonder.co.uk> wrote:
Understand XMOS's xCORE processors and xC language, see how they
complement and support each other. I found the net result
stunningly easy to get working first time, without having to
continually read obscure errata!

I can see the merits of the XMOS approach. But I'm unclear how
this relates to the OP's proposal, which (I think) is having tiny
CPUs as hard logic blocks on an FPGA, like DSP blocks.

I completely understand the problem of running out of hardware
threads, so a means of 'just add another one' is handy. But the
issue is how to combine such things with other synthesised logic.

The XMOS approach is fine when the hardware is uniform and the
software sits on top, but when the hardware is synthesised and the
'CPUs' sit as pieces in a fabric containing random logic (as I
think the OP is suggesting) it becomes a lot harder to reason about
what the system is doing and what the software running on such
heterogeneous cores should look like. Only the FPGA tools have a
full view of what the system looks like, and it seems stretching
them to have them also generate software to run on these cores.

When people talk about things like "software running on such
heterogeneous cores" it makes me think they don't really understand
how this could be used. If you treat these small cores like logic
elements, you don't have such lofty descriptions of "system software"
since the software isn't created out of some global software package.
Each core is designed to do a specific job just like any other piece
of hardware and it has discrete inputs and outputs just like any
other piece of hardware. If the hardware clock is not too fast, the
software can synchronize with, and literally function as, hardware,
while implementing more complex logic than the same area of FPGA fabric
could.


That is software.

If you want to try to get cycle-precise control of the software and use
that precision for direct hardware interfacing, you are almost certainly
going to have a poor, inefficient and difficult design. It doesn't
matter if you say "think of it like logic" - it is /not/ logic, it is
software, and you don't use that for cycle-precise control. You use it
when you need flexibility, calculations, and decisions.

I suppose you can make anything difficult if you try hard enough.

The point is you don't have to make it difficult by talking about "software running on such heterogeneous cores". Just talk about it being a small hunk of software that is doing a specific job. Then the mystery is gone and the task can be made as easy as it actually is.

In VHDL this would be a process(). VHDL programs are typically chock full of processes and no one wrings their hands worrying about how they will design the "software running on such heterogeneous cores".

BTW, VHDL is software too.
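To make "a small hunk of software doing a specific job" concrete, here is a sketch in C of the entire program one such tiny core might run. The register addresses and the debounce job are invented for illustration; the point is that structurally this is just a VHDL process written as software: discrete inputs, discrete outputs, one fixed task.

```c
#include <stdint.h>

/* Hypothetical memory-mapped registers for one tiny core's pins.
 * The addresses are illustrative only, not from any real device. */
#define IN_REG   (*(volatile uint8_t *)0x1000)
#define OUT_REG  (*(volatile uint8_t *)0x1001)

/* Bitwise 3-sample majority vote: the kind of small, fixed job
 * such a core would do in place of a state machine. */
uint8_t debounce(uint8_t s0, uint8_t s1, uint8_t s2)
{
    return (s0 & s1) | (s1 & s2) | (s0 & s2);
}

/* The whole program for the core: one process-like loop with
 * discrete inputs and outputs - no OS, no "system software". */
void core_main(void)
{
    uint8_t h0 = 0, h1 = 0;
    for (;;) {
        uint8_t in = IN_REG;             /* sample input pins  */
        OUT_REG = debounce(h0, h1, in);  /* drive output pins  */
        h0 = h1;
        h1 = in;
    }
}
```

There is nothing here to wring one's hands over, any more than over the equivalent process() in VHDL.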


There is no need to think about how the CPUs would communicate unless
there is a specific need for them to do so. The F18A uses a
handshaked parallel port in its design. Its designers seem to have done a
pretty slick job of it and can actually hang the processor waiting
for the acknowledgement, saving power and getting an instantaneous
wake up following the handshake. This can be used with other CPUs or
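That handshake can be sketched in C as a one-place valid/acknowledge channel. This is only a software model with invented names: the real F18A suspends the waiting core in hardware (zero power) rather than returning "not ready", but the protocol is the same.

```c
#include <stdbool.h>
#include <stdint.h>

/* One-place handshaked channel between two cores, modelled in C.
 * try_send/try_recv return false instead of blocking, so the two
 * sides can be stepped alternately in a simulation; real silicon
 * would halt the waiting core until the handshake completes. */
typedef struct {
    volatile bool    valid;  /* sender asserts: data is ready */
    volatile uint8_t data;
} channel;

bool try_send(channel *ch, uint8_t d)
{
    if (ch->valid) return false;   /* receiver hasn't taken the last word */
    ch->data  = d;
    ch->valid = true;              /* assert the handshake                */
    return true;
}

bool try_recv(channel *ch, uint8_t *d)
{
    if (!ch->valid) return false;  /* nothing offered yet                 */
    *d = ch->data;
    ch->valid = false;             /* acknowledge, freeing the sender     */
    return true;
}
```

Because a transfer completes only when both sides have participated, the channel doubles as synchronization, which is exactly what makes the instantaneous wake-up possible.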


Fair enough.

Ok, that's a start.

Rick C.
 
On Wednesday, March 20, 2019 at 4:31:27 PM UTC+2, Tom Gardner wrote:
On 20/03/19 14:11, already5chosen@yahoo.com wrote:
On Wednesday, March 20, 2019 at 3:37:17 PM UTC+2, Tom Gardner wrote:

But more difficult than creating such a toolset is defining an application
level description that a toolset can munge.

So, define (initially by example, later more formally) inputs to the
toolset and outputs from it. Then we can judge whether the concepts are
more than handwaving wishes.


I don't understand what you are asking for.

Go back and read the parts of my post that you chose to snip.

Give a handwaving indication of the concepts that avoid the
conceptual problems that I mentioned.

Frankly, it's starting to sound like you have never used soft CPU cores in your designs.
So, for somebody like myself, who has used them routinely for different tasks since 2006, you are really not easy to understand.
Concepts? Concepts are good for new things, not for a variation of something old, routine, and obviously working.

Or better still, get the OP to do it.

With that part I agree.
 
On Wednesday, March 20, 2019 at 6:29:50 AM UTC-4, already...@yahoo.com wrote:
On Wednesday, March 20, 2019 at 4:32:07 AM UTC+2, gnuarm.del...@gmail.com wrote:
On Tuesday, March 19, 2019 at 11:24:33 AM UTC-4, Svenn Are Bjerkem wrote:
On Tuesday, March 19, 2019 at 1:13:38 AM UTC+1, gnuarm.del...@gmail.com wrote:
Most of us have implemented small processors for logic operations that don't need to happen at high speed. Simple CPUs can be built into an FPGA using a very small footprint, much like the ALU blocks. There are stack-based processors that are very small, smaller than even a few kB of memory.

If they were easily programmable in something other than C would anyone be interested? Or is a C compiler mandatory even for processors running very small programs?

I am picturing this not terribly unlike the sequencer I used many years ago on an I/O board for an array processor, which had its own assembler. It was very simple and easy to use, but very much not a high-level language. This would have a language that is high level, just not C; rather, something extensible, simple to use, and potentially interactive.

Rick C.

picoblaze is such a small cpu and I would like to program it in something other than its assembler language.

Yes, it is small. How large is the program you are interested in?

Rick C.

I don't know about Svenn Are Bjerkem, but can tell you about myself.
Last time, when I considered something like that and wrote enough of the program to make measurements, the program contained ~250 Nios2 instructions. I'd guess that on a minimalistic stack machine it would take 350-400 instructions.
In the end, I didn't do it in software. Coding the same functionality in HDL turned out not to be hard, which probably suggests that my case was smaller than average.
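For a feel of where that roughly 1.5x instruction-count growth comes from, here is a toy stack machine sketched in C (the opcodes are invented for illustration). Computing (3+4)*5 takes five stack operations plus HALT, where a register machine might need perhaps three instructions, because every operand must be pushed before it can be consumed.

```c
#include <stdint.h>

/* A minimal stack machine of the kind discussed in the thread.
 * Opcodes are invented for illustration only. */
enum { OP_LIT, OP_ADD, OP_MUL, OP_DUP, OP_HALT };

int32_t run(const int32_t *prog)
{
    int32_t stack[32];
    int sp = 0;                                    /* next free slot */
    for (int pc = 0; ; ) {
        switch (prog[pc++]) {
        case OP_LIT: stack[sp++] = prog[pc++];       break;
        case OP_ADD: sp--; stack[sp-1] += stack[sp]; break;
        case OP_MUL: sp--; stack[sp-1] *= stack[sp]; break;
        case OP_DUP: stack[sp] = stack[sp-1]; sp++;  break;
        case OP_HALT: return stack[sp-1];
        default:      return -1;                   /* invalid opcode */
        }
    }
}
```

The flip side of the higher instruction count is that each instruction needs no register fields, which is exactly why such cores can be so tiny in an FPGA.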

Another extreme, where I did end up using "small" soft core, it was much more like "real" software: 2300 Nios2 instructions.

What sorts of applications were these?

Rick C.
 
