supercomputer progress...

On Friday, April 29, 2022 at 4:39:05 AM UTC-4, Martin Brown wrote:
On 28/04/2022 18:47, Jeroen Belleman wrote:
On 2022-04-28 18:26, boB wrote:
[...]
I would love to have a super computer to run LTspice.

boB

In fact, what you have on your desk *is* a super computer,
in the 1970's meaning of the words. It's just that it's
bogged down running bloatware.
Indeed. The Cray X-MP in its 4 CPU configuration with a 105MHz clock and
a whopping-for-the-time 128MB of fast core memory with 40GB of disk. The
one I used had an amazing-for-the-time 1TB tape cassette backing store.
It did 600 MFLOPS with the right sort of parallel vector code.

That was back in the day when you needed special permission to use more
than 4MB of core on the timesharing IBM 3081 (approx 7 MIPS).

Current Intel 12 gen CPU desktops are ~4GHz, 16GB ram and >1TB of disk.
(and the upper limits are even higher) That combo does ~66,000 MFLOPS.
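
For what it's worth, one plausible accounting that lands near that figure
(assuming double-precision AVX2 FMA on a single core; the original figure
may have been derived differently):

  #include <cstdio>

  int main() {
      // 2 FMA ports x 4 doubles per 256-bit vector x 2 FLOPs per FMA = 16 FLOPs/cycle
      constexpr double flops_per_cycle = 2 * 4 * 2;
      constexpr double clock_ghz = 4.0;
      std::printf("peak ~ %.0f GFLOPS per core\n", clock_ghz * flops_per_cycle);
  }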

Spice simulation doesn't scale particularly well to large-scale
multiprocessor environments due to the many long-range interactions.

The Crays were nice if you had a few million dollars to spend. I worked for a startup building more affordable supercomputers in the same ballpark of performance at a fraction of the price. The Star Technologies ST-100 supported 100 MFLOPS and 32 MB of memory; a roughly $200,000 system with 256 KB of RAM was a fraction of the cost of the only slightly faster Cray X-MP available at the same time.

--

Rick C.

+ Get 1,000 miles of free Supercharging
+ Tesla referral code - https://ts.la/richard11209
 
Mike Monett wrote:
Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> wrote:

John Larkin wrote:
On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <dennis@none.none> wrote:

On 4/28/22 11:26, boB wrote:

I would love to have a super computer to run LTspice.

I thought one of the problems with LTspice (and spice in general)
performance is that the algorithms don't parallelize very well.

LT runs on multiple cores now. I'd love the next gen LT Spice to run
on an Nvidia card. 100x at least.


The "number of threads" setting doesn't do anything very dramatic,
though, at least last time I tried. Splitting up the calculation
between cores would require all of them to communicate a couple of times
per time step, but lots of other simulation codes do that.

The main trouble is that the matrix defining the connectivity between
nodes is highly irregular in general.

Parallelizing that efficiently might well need a special-purpose
compiler, sort of similar to the profile-guided optimizer in the guts of
the FFTW code for computing DFTs. Probably not at all impossible, but
not that straightforward to implement.

Cheers

Phil Hobbs

Supercomputers have thousands or hundreds of thousands of cores.

Quote:

"Cerebras Systems has unveiled its new Wafer Scale Engine 2 processor with
a record-setting 2.6 trillion transistors and 850,000 AI-optimized cores.
It’s built for supercomputing tasks, and it’s the second time since 2019
that Los Altos, California-based Cerebras has unveiled a chip that is
basically an entire wafer."

https://venturebeat.com/2021/04/20/cerebras-systems-launches-new-ai-supercomputing-processor-with-2-6-trillion-transistors/

Number of cores isn't the problem. For fairly tightly-coupled tasks
such as simulations, the issue is interconnect latency between cores,
and the required bandwidth goes roughly as the cube of Moore's law, so
it ran out of gas long ago.

One thing that zillions of cores could do for SPICE is to do all the
stepped parameter runs simultaneously. At that point all you need is
infinite bandwidth to disk.
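
As a minimal sketch of that idea (C++17, with a hypothetical run_one() that
shells out one pre-generated batch netlist per parameter value; this is not
LTspice's own stepping mechanism):

  #include <cstdlib>
  #include <future>
  #include <string>
  #include <vector>

  // Hypothetical helper: run one independent simulation for one parameter value.
  int run_one(double rload) {
      std::string cmd = "ngspice -b sweep_" + std::to_string(rload) + ".cir";
      return std::system(cmd.c_str());   // each run is embarrassingly parallel
  }

  int main() {
      std::vector<std::future<int>> jobs;
      for (double r = 10.0; r <= 100.0; r += 10.0)
          jobs.push_back(std::async(std::launch::async, run_one, r));
      for (auto &j : jobs) j.get();      // wait; only the disk is shared
  }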

> Man, I wish I were back living in Los Altos again.

I couldn't get out of there fast enough, and have never looked back.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
Martin Brown wrote:
On 29/04/2022 07:09, Phil Hobbs wrote:
John Larkin wrote:
On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <dennis@none.none> wrote:

On 4/28/22 11:26, boB wrote:

I would love to have a super computer to run LTspice.

I thought one of the problems with LTspice (and spice in general)
performance is that the algorithms don't parallelize very well.

LT runs on multiple cores now. I'd love the next gen LT Spice to run
on an Nvidia card. 100x at least.


The "number of threads" setting doesn't do anything very dramatic,
though, at least last time I tried.  Splitting up the calculation
between cores would require all of them to communicate a couple of
times per time step, but lots of other simulation codes do that.

If it is anything like chess problems then the memory bandwidth will
saturate long before all cores+threads are used to optimum effect. After
that point the additional threads merely cause it to run hotter.

I found setting max threads to about 70% of those notionally available
produced the most computing power with the least heat. After that the
performance gain per thread was negligible but the extra heat was not.

Having everything running full bore was actually slower and much hotter!
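
A generic way to find that knee (nothing LTspice-specific; kernel() is only a
stand-in for memory-bandwidth-bound solver work):

  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <thread>
  #include <vector>

  // Stand-in for a memory-bandwidth-bound inner loop.
  void kernel(std::vector<double> &v) {
      for (int pass = 0; pass < 50; ++pass)
          for (double &x : v) x = x * 1.0000001 + 1e-9;
  }

  int main() {
      const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
      for (unsigned n = 1; n <= hw; ++n) {
          std::vector<std::vector<double>> bufs(n, std::vector<double>(1 << 22));
          auto t0 = std::chrono::steady_clock::now();
          std::vector<std::thread> pool;
          for (unsigned i = 0; i < n; ++i)
              pool.emplace_back(kernel, std::ref(bufs[i]));
          for (auto &t : pool) t.join();
          std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
          // Throughput usually flattens well before n == hw once memory
          // bandwidth saturates -- the knee described above (~70% here).
          std::printf("%2u threads: %.2f jobs/s\n", n, n / dt.count());
      }
  }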

The main trouble is that the matrix defining the connectivity between
nodes is highly irregular in general.

Parallelizing that efficiently might well need a special-purpose
compiler, sort of similar to the profile-guided optimizer in the guts
of the FFTW code for computing DFTs.  Probably not at all impossible,
but not that straightforward to implement.

I'm less than impressed with profile-guided optimisers in compilers. The
only time I tried it in anger the instrumentation code interfered with
the execution of the algorithms to such an extent as to be meaningless.

It wouldn't need to be as general as that--one could simply sort for the
most-connected nodes, and sort by weighted graph distance so as to
minimize the number of connections across the chunks of netlist, then
adjust the data structures for communication appropriately.
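
A toy sketch of that kind of preprocessing (a BFS ordering from the
most-connected node, then a cut; real partitioners are far more
sophisticated than this):

  #include <cstdio>
  #include <queue>
  #include <vector>

  using Graph = std::vector<std::vector<int>>;   // node -> connected nodes

  // Order nodes by BFS from the most-connected node, then cut the ordering
  // in half: tightly coupled nodes stay in the same chunk, so relatively
  // few matrix entries straddle the cut.
  std::vector<int> bfs_order(const Graph &g) {
      int start = 0;
      for (int i = 1; i < (int)g.size(); ++i)
          if (g[i].size() > g[start].size()) start = i;

      std::vector<int>  order;
      std::vector<bool> seen(g.size(), false);
      std::queue<int>   q;
      q.push(start);
      seen[start] = true;
      while (!q.empty()) {
          int u = q.front(); q.pop();
          order.push_back(u);
          for (int v : g[u])
              if (!seen[v]) { seen[v] = true; q.push(v); }
      }
      return order;                      // assumes a connected netlist
  }

  int main() {
      Graph g = {{1,2},{0,2,3},{0,1},{1,4},{3}};   // 5-node toy circuit
      auto order = bfs_order(g);
      const std::size_t half = order.size() / 2;
      for (std::size_t i = 0; i < order.size(); ++i)
          std::printf("node %d -> chunk %c\n", order[i], i < half ? 'A' : 'B');
  }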

It also wouldn't parallelize as well as FDTD, say, because there's less
computation going on per time step, so the communication overhead is
proportionately much greater.

One gotcha I have identified in the latest MSC is that when it uses
higher order SSE2, AVX, and AVX-512 implicitly in its code generation it
does not align them on the stack properly so that sometimes they are
split across two cache lines. I see two distinct speeds for each
benchmark code segment depending on how the cache alignment falls.

Basically the compiler forces stack alignment to 8 bytes and cache lines
are 64 bytes but the compiler generated objects in play are 16 bytes, 32
bytes or 64 bytes. Alignment failure fractions 1:4, 2:4 and 3:4.

If you manually allocate such objects you can use pragmas to force
optimal alignment but when the code generator chooses to use them
internally you have no such control. Even so the MS compiler does
generate blisteringly fast code compared to either Intel or GCC.
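
For the objects you do control, the portable C++ spelling (rather than
vendor pragmas) looks like the sketch below; the compiler's own stack
temporaries remain out of reach, as noted above:

  #include <immintrin.h>

  // A manually declared object can be pinned to a 64-byte cache line:
  struct alignas(64) Accumulator {
      __m256d lanes[2];          // 2 x 32 bytes = exactly one cache line
  };

  int main() {
      Accumulator acc{};         // the compiler realigns the stack for this
      acc.lanes[0] = _mm256_set1_pd(1.0);

      // C++17: over-aligned types also get correctly aligned heap storage.
      Accumulator *pool = new Accumulator[1024];
      pool[0] = acc;
      delete[] pool;
  }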

The FFTW profiler works pretty well IME, but I agree, doing it with the
whole program isn't trivial.
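
For concreteness, FFTW's planner times several candidate decompositions on
the actual machine and keeps the fastest; a minimal sketch using the real
FFTW API (nothing SPICE-specific here):

  #include <fftw3.h>

  int main() {
      const int n = 1 << 20;
      fftw_complex *in  = fftw_alloc_complex(n);
      fftw_complex *out = fftw_alloc_complex(n);

      // FFTW_MEASURE runs and times several candidate plans on this
      // machine and keeps the fastest -- the "profile-guided" step.
      fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

      fftw_execute(p);             // reuse the tuned plan as often as needed

      fftw_destroy_plan(p);
      fftw_free(in);
      fftw_free(out);
  }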

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
Jeroen Belleman wrote:
On 2022-04-28 18:26, boB wrote:
[...]
I would love to have a super computer to run LTspice.

boB



In fact, what you have on your desk *is* a super computer,
in the 1970's meaning of the words. It's just that it's
bogged down running bloatware.

Jeroen Belleman

In the 1990s meaning of the words, in fact. My 2011-vintage desktop box
runs 250 Gflops peak (2x 12-core Magny Cours, 64G main memory, RAID5 disks).

My phone is a supercomputer by 1970s standards. ;)

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On Friday, April 29, 2022 at 10:12:30 AM UTC-4, Phil Hobbs wrote:
Jeroen Belleman wrote:
On 2022-04-28 18:26, boB wrote:
[...]
I would love to have a super computer to run LTspice.

boB



In fact, what you have on your desk *is* a super computer,
in the 1970's meaning of the words. It's just that it's
bogged down running bloatware.

Jeroen Belleman
In the 1990s meaning of the words, in fact. My 2011-vintage desktop box
runs 250 Gflops peak (2x 12-core Magny Cours, 64G main memory, RAID5 disks).

My phone is a supercomputer by 1970s standards. ;)

And no more possible to build at that time than in ancient Rome. It's amazing how rapidly technology changes when spurred by the profit motive.

--

Rick C.

-- Get 1,000 miles of free Supercharging
-- Tesla referral code - https://ts.la/richard11209
 
On Fri, 29 Apr 2022 02:09:19 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

John Larkin wrote:
On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <dennis@none.none> wrote:

On 4/28/22 11:26, boB wrote:

I would love to have a super computer to run LTspice.

I thought one of the problems with LTspice (and spice in general)
performance is that the algorithms don't parallelize very well.

LT runs on multiple cores now. I'd love the next gen LT Spice to run
on an Nvidia card. 100x at least.


The "number of threads" setting doesn't do anything very dramatic,
though, at least last time I tried. Splitting up the calculation
between cores would require all of them to communicate a couple of times
per time step, but lots of other simulation codes do that.

The main trouble is that the matrix defining the connectivity between
nodes is highly irregular in general.

Parallelizing that efficiently might well need a special-purpose
compiler, sort of similar to the profile-guided optimizer in the guts of
the FFTW code for computing DFTs. Probably not at all impossible, but
not that straightforward to implement.

Cheers

Phil Hobbs

Climate simulation uses enormous multi-CPU supercomputer rigs.

OK, I suppose that makes your point.



--

Anybody can count to one.

- Robert Widlar
 
On Thu, 28 Apr 2022 19:47:03 +0200, Jeroen Belleman
<jeroen@nospam.please> wrote:

On 2022-04-28 18:26, boB wrote:
[...]
I would love to have a super computer to run LTspice.

boB



In fact, what you have on your desk *is* a super computer,
in the 1970's meaning of the words. It's just that it's
bogged down running bloatware.

Jeroen Belleman

My phone probably has more compute power than all the computers in the
world in about 1960.



--

Anybody can count to one.

- Robert Widlar
 
On Friday, April 29, 2022 at 10:32:07 AM UTC-4, jla...@highlandsniptechnology.com wrote:
On Thu, 28 Apr 2022 19:47:03 +0200, Jeroen Belleman
jer...@nospam.please> wrote:

On 2022-04-28 18:26, boB wrote:
[...]
I would love to have a super computer to run LTspice.

boB



In fact, what you have on your desk *is* a super computer,
in the 1970's meaning of the words. It's just that it's
bogged down running bloatware.

Jeroen Belleman
My phone probably has more compute power than all the computers in the
world in about 1960.

And lets you watch cat videos anywhere you go.

--

Rick C.

-+ Get 1,000 miles of free Supercharging
-+ Tesla referral code - https://ts.la/richard11209
 
On 29/04/2022 14:46, Ricky wrote:
On Friday, April 29, 2022 at 4:39:05 AM UTC-4, Martin Brown wrote:
On 28/04/2022 18:47, Jeroen Belleman wrote:
On 2022-04-28 18:26, boB wrote: [...]
I would love to have a super computer to run LTspice.

boB

In fact, what you have on your desk *is* a super computer, in the
1970's meaning of the words. It's just that it's bogged down
running bloatware.
Indeed. The Cray X-MP in its 4 CPU configuration with a 105MHz
clock and a whopping-for-the-time 128MB of fast core memory with
40GB of disk. The one I used had an amazing-for-the-time 1TB tape
cassette backing store. It did 600 MFLOPS with the right sort of
parallel vector code.

That was back in the day when you needed special permission to use
more than 4MB of core on the timesharing IBM 3081 (approx 7 MIPS).

Current Intel 12 gen CPU desktops are ~4GHz, 16GB ram and >1TB of
disk. (and the upper limits are even higher) That combo does
~66,000 MFLOPS.

Spice simulation doesn't scale particularly well to large-scale
multiprocessor environments due to the many long-range interactions.

The Crays were nice if you had a few million dollars to spend. I
worked for a startup building more affordable supercomputers in the
same ballpark of performance at a fraction of the price. The Star
Technologies ST-100 supported 100 MFLOPS and 32 MB of memory; a
roughly $200,000 system with 256 KB of RAM was a fraction of the cost
of the only slightly faster Cray X-MP available at the same time.

At the time I was doing that stuff the FPS-120 array processor attached
to a PDP-11 or Vax was the poor man's supercomputer. Provided you had
the right sort of problem it was very good indeed for price performance.
(it was still fairly pricey)

I got to port our code to everything from a humble Z80 (where it could
only solve trivial toy problems) upwards to the high end Cray. The more
expensive the computer the more tolerant of IBM extensions they tended
to be. The Z80 FORTRAN IV I remember as being a stickler for the rules.


--
Regards,
Martin Brown
 
On 29/04/2022 15:30, jlarkin@highlandsniptechnology.com wrote:
On Fri, 29 Apr 2022 02:09:19 -0400, Phil Hobbs
pcdhSpamMeSenseless@electrooptical.net> wrote:

John Larkin wrote:
On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <dennis@none.none> wrote:

On 4/28/22 11:26, boB wrote:

I would love to have a super computer to run LTspice.

I thought one of the problems with LTspice (and spice in general)
performance is that the algorithms don't parallelize very well.

LT runs on multiple cores now. I'd love the next gen LT Spice to run
on an Nvidia card. 100x at least.


The "number of threads" setting doesn't do anything very dramatic,
though, at least last time I tried. Splitting up the calculation
between cores would require all of them to communicate a couple of times
per time step, but lots of other simulation codes do that.

The main trouble is that the matrix defining the connectivity between
nodes is highly irregular in general.

Parallelizing that efficiently might well need a special-purpose
compiler, sort of similar to the profile-guided optimizer in the guts of
the FFTW code for computing DFTs. Probably not at all impossible, but
not that straightforward to implement.

Cheers

Phil Hobbs

Climate simulation uses enormous multi-CPU supercomputer rigs.

They are basically fluid in cell models with a fair number of parameters
per cell but depending on your exact choice of geometry only 6 nearest
neighbours in a 3D cubic computational grid (worst case 26 cells).

That is a very regular interconnectivity and lends itself to vector
processing (which is why we were using them) though for another problem.

A handful of FLIC practitioners used tetrahedral or hexagonal close
packed grids (4 nearest neighbours or 12 nearest neighbours).
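
For reference, the regular 6-neighbour update is the minimal sketch below
(real codes carry many more fields per cell and far more physics):

  #include <vector>

  // One sweep of a 6-point nearest-neighbour stencil on an N^3 cubic grid
  // (interior cells only). The access pattern is completely regular, which
  // is what makes this kind of model vectorize and distribute so well.
  void sweep(std::vector<double> &next, const std::vector<double> &cur, int N) {
      auto at = [N](int i, int j, int k) { return (i * N + j) * N + k; };
      for (int i = 1; i < N - 1; ++i)
          for (int j = 1; j < N - 1; ++j)
              for (int k = 1; k < N - 1; ++k)
                  next[at(i,j,k)] = (cur[at(i-1,j,k)] + cur[at(i+1,j,k)] +
                                     cur[at(i,j-1,k)] + cur[at(i,j+1,k)] +
                                     cur[at(i,j,k-1)] + cur[at(i,j,k+1)]) / 6.0;
  }

  int main() {
      const int N = 64;
      std::vector<double> a(N * N * N, 1.0), b(a);
      sweep(b, a, N);
  }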
> OK, I suppose that makes your point.

When I was involved in such codes for relativistic particle beams we
used its cylindrical symmetry to make the problem more tractable in 2D.
The results agreed remarkably well with experiments so I see no need to
ridicule other FLIC models as used in weather and climate research.

--
Regards,
Martin Brown
 
On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

Mike Monett wrote:
Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> wrote:

John Larkin wrote:
On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <dennis@none.none> wrote:

On 4/28/22 11:26, boB wrote:

I would love to have a super computer to run LTspice.

I thought one of the problems with LTspice (and spice in general)
performance is that the algorithms don't parallelize very well.

LT runs on multiple cores now. I'd love the next gen LT Spice to run
on an Nvidia card. 100x at least.


The "number of threads" setting doesn't do anything very dramatic,
though, at least last time I tried. Splitting up the calculation
between cores would require all of them to communicate a couple of times
per time step, but lots of other simulation codes do that.

The main trouble is that the matrix defining the connectivity between
nodes is highly irregular in general.

Parallelizing that efficiently might well need a special-purpose
compiler, sort of similar to the profile-guided optimizer in the guts of
the FFTW code for computing DFTs. Probably not at all impossible, but
not that straightforward to implement.

Cheers

Phil Hobbs

Supercomputers have thousands or hundreds of thousands of cores.

Quote:

"Cerebras Systems has unveiled its new Wafer Scale Engine 2 processor with
a record-setting 2.6 trillion transistors and 850,000 AI-optimized cores.
It’s built for supercomputing tasks, and it’s the second time since 2019
that Los Altos, California-based Cerebras has unveiled a chip that is
basically an entire wafer."

https://venturebeat.com/2021/04/20/cerebras-systems-launches-new-ai-supercomputing-processor-with-2-6-trillion-transistors/

Number of cores isn't the problem. For fairly tightly-coupled tasks
such as simulations, the issue is interconnect latency between cores,
and the required bandwidth goes roughly as the cube of Moore's law, so
it ran out of gas long ago.

One thing that zillions of cores could do for SPICE is to do all the
stepped parameter runs simultaneously. At that point all you need is
infinite bandwidth to disk.

This whole hairball is summarized in Amdahl's Law:

<https://en.wikipedia.org/wiki/Amdahl%27s_law>
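
The formula itself is one line, and even a generously parallel job hits a
wall quickly (p is the parallelizable fraction):

  #include <cstdio>

  // Amdahl's law: speedup(N) = 1 / ((1 - p) + p / N), p = parallel fraction.
  double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

  int main() {
      // Even with 95% of the work parallel, 1000 cores give only ~20x.
      std::printf("p = 0.95, N = 1000 -> %.1fx\n", amdahl(0.95, 1000.0));
  }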


Joe Gwinn
 
Joe Gwinn wrote:
On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
pcdhSpamMeSenseless@electrooptical.net> wrote:

Mike Monett wrote:
Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> wrote:

John Larkin wrote:
On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <dennis@none.none
wrote:

On 4/28/22 11:26, boB wrote:

I would love to have a super computer to run LTspice.

I thought one of the problems with LTspice (and spice in
general) performance is that the algorithms don't
parallelize very well.

LT runs on multiple cores now. I'd love the next gen LT Spice
to run on an Nvidia card. 100x at least.


The "number of threads" setting doesn't do anything very
dramatic, though, at least last time I tried. Splitting up the
calculation between cores would require all of them to
communicate a couple of times per time step, but lots of other
simulation codes do that.

The main trouble is that the matrix defining the connectivity
between nodes is highly irregular in general.

Parallelizing that efficiently might well need a
special-purpose compiler, sort of similar to the profile-guided
optimizer in the guts of the FFTW code for computing DFTs.
Probably not at all impossible, but not that straightforward to
implement.

Cheers

Phil Hobbs

Supercomputers have thousands or hundreds of thousands of cores.

Quote:

"Cerebras Systems has unveiled its new Wafer Scale Engine 2
processor with a record-setting 2.6 trillion transistors and
850,000 AI-optimized cores. It’s built for supercomputing tasks,
and it’s the second time since 2019 that Los Altos,
California-based Cerebras has unveiled a chip that is basically
an entire wafer."

https://venturebeat.com/2021/04/20/cerebras-systems-launches-new-ai-supercomputing-processor-with-2-6-trillion-transistors/

Number of cores isn't the problem. For fairly tightly-coupled
tasks such as simulations, the issue is interconnect latency
between cores, and the required bandwidth goes roughly as the cube
of Moore's law, so it ran out of gas long ago.

One thing that zillions of cores could do for SPICE is to do all
the stepped parameter runs simultaneously. At that point all you
need is infinite bandwidth to disk.

This whole hairball is summarized in Amdahl's Law:

<https://en.wikipedia.org/wiki/Amdahl%27s_law>

Not exactly. There's very little serial execution required to
parallelize parameter stepping, or even genetic-algorithm optimization.

Communications overhead isn't strictly serial either--N processors can
have several times N communication channels. It's mostly a latency issue.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On Fri, 29 Apr 2022 20:51:43 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

Joe Gwinn wrote:
On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
pcdhSpamMeSenseless@electrooptical.net> wrote:

Mike Monett wrote:
Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> wrote:

John Larkin wrote:
On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <dennis@none.none
wrote:

On 4/28/22 11:26, boB wrote:

I would love to have a super computer to run LTspice.

I thought one of the problems with LTspice (and spice in
general) performance is that the algorithms don't
parallelize very well.

LT runs on multiple cores now. I'd love the next gen LT Spice
to run on an Nvidia card. 100x at least.


The "number of threads" setting doesn't do anything very
dramatic, though, at least last time I tried. Splitting up the
calculation between cores would require all of them to
communicate a couple of times per time step, but lots of other
simulation codes do that.

The main trouble is that the matrix defining the connectivity
between nodes is highly irregular in general.

Parallelizing that efficiently might well need a
special-purpose compiler, sort of similar to the profile-guided
optimizer in the guts of the FFTW code for computing DFTs.
Probably not at all impossible, but not that straightforward to
implement.

Cheers

Phil Hobbs

Supercomputers have thousands or hundreds of thousands of cores.

Quote:

"Cerebras Systems has unveiled its new Wafer Scale Engine 2
processor with a record-setting 2.6 trillion transistors and
850,000 AI-optimized cores. It’s built for supercomputing tasks,
and it’s the second time since 2019 that Los Altos,
California-based Cerebras has unveiled a chip that is basically
an entire wafer."

https://venturebeat.com/2021/04/20/cerebras-systems-launches-new-ai-supercomputing-processor-with-2-6-trillion-transistors/

Number of cores isn't the problem. For fairly tightly-coupled
tasks such as simulations, the issue is interconnect latency
between cores, and the required bandwidth goes roughly as the cube
of Moore's law, so it ran out of gas long ago.

One thing that zillions of cores could do for SPICE is to do all
the stepped parameter runs simultaneously. At that point all you
need is infinite bandwidth to disk.

This whole hairball is summarized in Amdahl's Law:

<https://en.wikipedia.org/wiki/Amdahl%27s_law>

Not exactly. There's very little serial execution required to
parallelize parameter stepping, or even genetic-algorithm optimization.

Communications overhead isn't strictly serial either--N processors can
have several times N communication channels. It's mostly a latency issue.

In general, yes. But far too far down in the weeds.

Amdahl's Law is easier to explain to a business manager who thinks
that parallelism solves all performance issues, if only the engineers
would stop carping and do their jobs.

And then there are the architectures that would do wondrous things, if
only light were not so damn slow.

Joe Gwinn
 
On Saturday, April 30, 2022 at 9:04:50 AM UTC-4, Joe Gwinn wrote:
On Fri, 29 Apr 2022 20:51:43 -0400, Phil Hobbs
pcdhSpamM...@electrooptical.net> wrote:

Joe Gwinn wrote:
On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
pcdhSpamM...@electrooptical.net> wrote:

Mike Monett wrote:
Phil Hobbs <pcdhSpamM...@electrooptical.net> wrote:

John Larkin wrote:
On Thu, 28 Apr 2022 12:01:59 -0500, Dennis <den...@none.none
wrote:

On 4/28/22 11:26, boB wrote:

I would love to have a super computer to run LTspice.

I thought one of the problems with LTspice (and spice in
general) performance is that the algorithms don't
parallelize very well.

LT runs on multiple cores now. I'd love the next gen LT Spice
to run on an Nvidia card. 100x at least.


The "number of threads" setting doesn't do anything very
dramatic, though, at least last time I tried. Splitting up the
calculation between cores would require all of them to
communicate a couple of times per time step, but lots of other
simulation codes do that.

The main trouble is that the matrix defining the connectivity
between nodes is highly irregular in general.

Parallelizing that efficiently might well need a
special-purpose compiler, sort of similar to the profile-guided
optimizer in the guts of the FFTW code for computing DFTs.
Probably not at all impossible, but not that straightforward to
implement.

Cheers

Phil Hobbs

Supercomputers have thousands or hundreds of thousands of cores.

Quote:

"Cerebras Systems has unveiled its new Wafer Scale Engine 2
processor with a record-setting 2.6 trillion transistors and
850,000 AI-optimized cores. It’s built for supercomputing tasks,
and it’s the second time since 2019 that Los Altos,
California-based Cerebras has unveiled a chip that is basically
an entire wafer."

https://venturebeat.com/2021/04/20/cerebras-systems-launches-new-ai-supercomputing-processor-with-2-6-trillion-transistors/

Number of cores isn't the problem. For fairly tightly-coupled
tasks such as simulations, the issue is interconnect latency
between cores, and the required bandwidth goes roughly as the cube
of Moore's law, so it ran out of gas long ago.

One thing that zillions of cores could do for SPICE is to do all
the stepped parameter runs simultaneously. At that point all you
need is infinite bandwidth to disk.

This whole hairball is summarized in Amdahl's Law:

<https://en.wikipedia.org/wiki/Amdahl%27s_law>

Not exactly. There's very little serial execution required to
parallelize parameter stepping, or even genetic-algorithm optimization.

Communications overhead isn't strictly serial either--N processors can
have several times N communication channels. It's mostly a latency issue.
In general, yes. But far too far down in the weeds.

Amdahl's Law is easier to explain to a business manager who thinks
that parallelism solves all performance issues, if only the engineers
would stop carping and do their jobs.

And then there are the architectures that would do wondrous things, if
only light were not so damn slow.

People often focus on the fact that the size of the chip limits the speed without considering how the size might be reduced (and the speed increased) using multi-valued logic. I suppose the devil is in the details, but if more information can be carried on fewer wires, the routing area of a chip can be reduced, speeding the entire chip.

I've only heard of memory-type circuits being implemented with multi-valued logic, since the bulk of the die area is storage and that shrinks considerably. I believe they are up to 16 values per cell, so four bits in place of one, but mostly I only see TLC, which has 8 values per cell. Logic chips using multi-valued logic are much harder to find. Obviously there are issues with making them work well.
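
The arithmetic behind those cell names is just bits = log2(levels), so 8
levels (TLC) is 3 bits and 16 levels (QLC) is 4 bits per cell:

  #include <cmath>
  #include <cstdio>

  int main() {
      // 2 = SLC, 4 = MLC, 8 = TLC, 16 = QLC
      for (int levels : {2, 4, 8, 16})
          std::printf("%2d levels/cell -> %.0f bits/cell\n",
                      levels, std::log2(levels));
  }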

--

Rick C.

+- Get 1,000 miles of free Supercharging
+- Tesla referral code - https://ts.la/richard11209
 
On 30/04/2022 01:51, Phil Hobbs wrote:
Joe Gwinn wrote:
On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
pcdhSpamMeSenseless@electrooptical.net> wrote:


Number of cores isn't the problem. For fairly tightly-coupled
tasks such as simulations, the issue is interconnect latency
between cores, and the required bandwidth goes roughly as the cube
of Moore's law, so it ran out of gas long ago.

One thing that zillions of cores could do for SPICE is to do all
the stepped parameter runs simultaneously.  At that point all you
need is infinite bandwidth to disk.

Parallelism for exploring a wide range of starting parameters and then
evolving them based on how well the model fits seems to be in vogue now, e.g.

https://arxiv.org/abs/1804.04737

This whole hairball is summarized in Amdahl's Law:

<https://en.wikipedia.org/wiki/Amdahl%27s_law>


Not exactly. There's very little serial execution required to
parallelize parameter stepping, or even genetic-algorithm optimization.

Communications overhead isn't strictly serial either--N processors can
have several times N communication channels. It's mostly a latency issue.

Anyone who has ever done it quickly learns that by far the most
important, highest-priority task is not the computation itself but
the management required to keep all of the cores doing useful work!

It is easy to have all cores working flat out, but if most of the
parallelised work being done so quickly will later be shown to be
redundant by some higher-level pruning algorithm, all you are doing
is generating more heat for only a minuscule performance gain (if that).

SIMD has made quite a performance improvement for some problems on the
Intel and AMD platforms. The compilers still haven't quite caught up
with the hardware though. Alignment is now a rather annoying issue if
you care about avoiding unnecessary cache misses and pipeline stalls.

You can align your own structures correctly but can do nothing about
virtual structures that the compiler creates and puts on the stack
misaligned spanning two cache lines. The result is code which executes
with two distinct characteristic times depending on where the cache line
boundaries are in relation to the top of stack when it is called!

It really only matters in the very deepest levels of computationally
intensive code which is probably why they don't try quite hard enough.
Most people probably wouldn't notice ~5% changes unless they were
benchmarking or monitoring MSRs for cache misses and pipeline stalls.

--
Regards,
Martin Brown
 
Martin Brown wrote:
On 30/04/2022 01:51, Phil Hobbs wrote:
Joe Gwinn wrote:
On Fri, 29 Apr 2022 10:03:23 -0400, Phil Hobbs
pcdhSpamMeSenseless@electrooptical.net> wrote:


Number of cores isn't the problem. For fairly tightly-coupled
tasks such as simulations, the issue is interconnect latency
between cores, and the required bandwidth goes roughly as the cube
of Moore's law, so it ran out of gas long ago.

One thing that zillions of cores could do for SPICE is to do all
the stepped parameter runs simultaneously.  At that point all you
need is infinite bandwidth to disk.

Parallelism for exploring a wide range of starting parameters and then
evolving them based on how well the model fits seems to be in vogue now, e.g.

https://arxiv.org/abs/1804.04737

This whole hairball is summarized in Amdahl's Law:

<https://en.wikipedia.org/wiki/Amdahl%27s_law>


Not exactly. There's very little serial execution required to
parallelize parameter stepping, or even genetic-algorithm optimization.

Communications overhead isn't strictly serial either--N processors can
have several times N communication channels. It's mostly a latency
issue.

Anyone who has ever done it quickly learns that by far the most
important, highest-priority task is not the computation itself but
the management required to keep all of the cores doing useful work!

Yup. In my big EM code, that's handled by The Cluster Script From Hell. ;)

It is easy to have all cores working flat out, but if most of the
parallelised work being done so quickly will later be shown to be
redundant by some higher-level pruning algorithm, all you are doing
is generating more heat for only a minuscule performance gain (if that).

SIMD has made quite a performance improvement for some problems on the
Intel and AMD platforms. The compilers still haven't quite caught up
with the hardware though. Alignment is now a rather annoying issue if
you care about avoiding unnecessary cache misses and pipeline stalls.

You can align your own structures correctly but can do nothing about
virtual structures that the compiler creates and puts on the stack
misaligned spanning two cache lines. The result is code which executes
with two distinct characteristic times depending on where the cache line
boundaries are in relation to the top of stack when it is called!

It really only matters in the very deepest levels of computationally
intensive code which is probably why they don't try quite hard enough.
Most people probably wouldn't notice ~5% changes unless they were
benchmarking or monitoring MSRs for cache misses and pipeline stalls.

Well, your average hardcore numerical guy would probably just buy two
clusters and pick the one that finished first. ;)

Fifteen or so years ago, I got about a 3:1 improvement in FDTD speed by
precomputing a strategy that let me iterate over a list containing runs
of voxels with the same material, vs. just putting a big switch
statement inside a triply-nested loop (the usual approach).
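
A sketch of that precomputation (the run list replaces the per-voxel switch;
the update body here is only a stand-in for the real FDTD kernel):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct Run { std::size_t start, length; std::uint8_t material; };

  // Precompute runs of identical material along the fastest axis.
  std::vector<Run> make_runs(const std::vector<std::uint8_t> &mat) {
      std::vector<Run> runs;
      for (std::size_t i = 0; i < mat.size();) {
          std::size_t j = i;
          while (j < mat.size() && mat[j] == mat[i]) ++j;
          runs.push_back({i, j - i, mat[i]});
          i = j;
      }
      return runs;
  }

  // One dispatch per run instead of a switch per voxel.
  void update(std::vector<double> &field, const std::vector<Run> &runs,
              const std::vector<double> &coeff /* indexed by material */) {
      for (const Run &r : runs) {
          const double c = coeff[r.material];
          for (std::size_t i = r.start; i < r.start + r.length; ++i)
              field[i] *= c;            // stand-in for the real FDTD update
      }
  }

  int main() {
      std::vector<std::uint8_t> mat = {0,0,0,1,1,2,2,2,2,0};
      std::vector<double> field(mat.size(), 1.0);
      std::vector<double> coeff = {0.5, 0.9, 1.1};   // per-material constant
      update(field, make_runs(mat), coeff);
  }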

I mentioned it to another EM simulation guy at a conference once, who
said, "So what? I'd just get a bigger cluster."

Cheers

Phil Hobbs


--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
Martin Brown <'''newspam'''@nonad.co.uk> wrote:
On 28/04/2022 18:47, Jeroen Belleman wrote:
On 2022-04-28 18:26, boB wrote:
[...]
I would love to have a super computer to run LTspice.

boB

In fact, what you have on your desk *is* a super computer,
in the 1970's meaning of the words. It's just that it's
bogged down running bloatware.

Indeed. The Cray X-MP in its 4 CPU configuration with a 105MHz clock and
a whopping-for-the-time 128MB of fast core memory with 40GB of disk. The

what is fast core memory?

one I used had an amazing-for-the-time 1TB tape cassette backing store.
It did 600 MFLOPS with the right sort of parallel vector code.

That was back in the day when you needed special permission to use more
than 4MB of core on the timesharing IBM 3081 (approx 7 MIPS).

Current Intel 12 gen CPU desktops are ~4GHz, 16GB ram and >1TB of disk.
(and the upper limits are even higher) That combo does ~66,000 MFLOPS.

Spice simulation doesn't scale particularly well to large-scale
multiprocessor environments due to the many long-range interactions.
 
On Friday, April 29, 2022 at 7:30:55 AM UTC-7, jla...@highlandsniptechnology.com wrote:

> Climate simulation uses enormous multi-CPU supercomputer rigs.

Not so; it's WEATHER mapping and prediction that uses the complex data sets
from a varied bunch of sensing locations around the globe, to make a 3-D map of
the planet's atmosphere. Climate is a much cruder problem, no details
required. Much of the greenhouse gas analysis comes out of models
that a PC spreadsheet would handle easily.
 
On 05/03/2022 03:12 PM, Cydrome Leader wrote:
Martin Brown <'''newspam'''@nonad.co.uk> wrote:
On 28/04/2022 18:47, Jeroen Belleman wrote:
On 2022-04-28 18:26, boB wrote:
[...]
I would love to have a super computer to run LTspice.

boB
In fact, what you have on your desk *is* a super computer,
in the 1970's meaning of the words. It's just that it's
bogged down running bloatware.
Indeed. The Cray X-MP in its 4 CPU configuration with a 105MHz clock and
a whopping-for-the-time 128MB of fast core memory with 40GB of disk. The
what is fast core memory?

A very expensive item:

https://en.wikipedia.org/wiki/Magnetic-core_memory

Fortunately by the X-MP's time SRAMs had replaced magnetic core.
 
On Tuesday, May 3, 2022 at 5:12:59 PM UTC-4, Cydrome Leader wrote:
Martin Brown <'''newspam'''@nonad.co.uk> wrote:
On 28/04/2022 18:47, Jeroen Belleman wrote:
On 2022-04-28 18:26, boB wrote:
[...]
I would love to have a super computer to run LTspice.

boB

In fact, what you have on your desk *is* a super computer,
in the 1970's meaning of the words. It's just that it's
bogged down running bloatware.

Indeed. The Cray X-MP in its 4 CPU configuration with a 105MHz clock and
a whopping-for-the-time 128MB of fast core memory with 40GB of disk. The
what is fast core memory?

An oxymoron.

--

Rick C.

++ Get 1,000 miles of free Supercharging
++ Tesla referral code - https://ts.la/richard11209
 
