myStiC bYTeS - atari fALcoN demoCREW

- - --- -- --------------------------------------------------------------------  
Optimizing for FastRAM
& 68060 CPU
------------------------------------------------------------------ - - -- -----

Well, so it happened, the CT60 has arrived in our hands =) Shortly after
receiving my CT60 I got a nice c2p source from Evil/DHS, written by Michael
Kalms aka Scout/Appendix, from which I learned a few things about the 060, so
why not write them down here :)

Burst mode
==========
I was very surprised when I didn't find a 'burst mode' bit in the CACR
register. But this doesn't mean the 060 has no burst... the 060 operates in
burst mode all the time :) And what is this burst mode? If you read my article
about 030 timing you know how the 030 data cache works: every word or long
read gets its place in the data cache (unless the data cache is disabled
and/or frozen, of course)... so if you want to load, let's say, 32 bytes into
the data cache, on the 030 you have to do:

        tst.w   0(a0)           ;  4 bytes loaded
        tst.w   4(a0)           ;+ 4 bytes
        tst.w   8(a0)           ;+ 4 bytes
        tst.w   12(a0)          ;+ 4 bytes
        :
        :
        tst.w   28(a0)          ;+ 4 bytes = 32 bytes

Since every data entry is stored as a long, you can take advantage of
misaligned operands:

        tst.w   3(a0)           ;  4+4 bytes loaded
        tst.w   11(a0)          ;+ 4+4 bytes
        tst.w   19(a0)          ;+ 4+4 bytes
        tst.w   27(a0)          ;+ 4+4 bytes = 32 bytes

Here every access fills two entries at once (because the misaligned word
straddles two longs). And now the Burst Mode: it lets you load 16 bytes at once
if your cache line is 'clean', that means all its entries are marked as
'invalid', and your data is read from a 16-byte boundary:

        tst.w   (a0)            ;  16 bytes loaded
        tst.w   16(a0)          ;+ 16 bytes = 32 bytes

And again we can take advantage of misaligned operands:

        tst.w   0*16+15(a0)     ; 32 bytes loaded
        tst.w   2*16+15(a0)     ; next 32 bytes loaded

Cool, isn't it? :) By the way, you can use this trick on the CT2, too, since
the CT2 has FastRAM with burst support. But don't forget to enable the
previously mentioned burst bit in the CACR there !!!
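
If you want to check which offsets touch how many cache entries, here is a
small Python sketch (not from the original sources; the helper names
touched_entries and bytes_loaded are made up for this illustration). It assumes
a0 is aligned to the entry size: 4 bytes for a single 030 long entry, 16 bytes
for a burst-filled line.

```python
def touched_entries(offset, access_size, entry_size):
    """Return the indices of the cache entries covered by one access.

    offset      -- byte displacement from a base aligned to entry_size
                   (assumption: a0 itself is aligned)
    access_size -- 2 for a .w access, 4 for a .l access
    entry_size  -- 4 for one 030 long entry, 16 for a burst-filled line
    """
    first = offset // entry_size
    last = (offset + access_size - 1) // entry_size
    return list(range(first, last + 1))


def bytes_loaded(offsets, access_size, entry_size):
    """Total bytes pulled into the cache by a series of accesses."""
    entries = set()
    for off in offsets:
        entries.update(touched_entries(off, access_size, entry_size))
    return len(entries) * entry_size


# 030 without burst: aligned words fill one long each, misaligned ones two.
aligned_030 = bytes_loaded(range(0, 32, 4), 2, 4)
misaligned_030 = bytes_loaded([3, 11, 19, 27], 2, 4)

# Burst fills: one 16-byte line per aligned access, two per straddling access.
aligned_burst = bytes_loaded([0, 16], 2, 16)
misaligned_burst = bytes_loaded([0 * 16 + 15, 2 * 16 + 15], 2, 16)

print(aligned_030, misaligned_030, aligned_burst, misaligned_burst)
```

So eight aligned reads, four misaligned reads and two burst reads all bring in
the same 32 bytes, and the two misaligned burst reads bring in 64.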

Writing to ST RAM
=================
Yeah, yeah... we have got a superb 68060 CPU, superb FastRAM, and still we have
to do what? Write to ST RAM! Now someone could ask whether the 68060 and
FastRAM help in this area, too. So, for the very curious people: YES, THEY DO
:) How?

1. Store Buffer
---------------
Even though our 8 KB data cache is a lot of space, for copying thousands of
bytes it isn't very useful :) And so here our store buffer comes into play:
it's a four-entry (that means 4 longs) first-in-first-out buffer used when
writing to slow memory. So, if we want to write a word or long to memory and
the data bus is still busy with the previous memory write, the value is stored
in this buffer and the program continues with the next instruction (hopefully
one that doesn't access memory).
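
To get a feeling for what the buffer buys you, here is a toy Python model (my
own simplification, not how Motorola describes the hardware): the CPU does
some non-memory work, then issues one store; each store keeps the slow bus
busy for a fixed time. The function name run and all its parameters are
hypothetical, and the 33 cycles per ST RAM long write is the rough estimate
worked out further down in this article.

```python
from collections import deque


def run(writes, work_cycles, bus_cycles, depth):
    """Toy model: return the CPU time needed to issue `writes` stores.

    work_cycles -- non-memory work the CPU does before each store
    bus_cycles  -- how long one store occupies the slow bus
    depth       -- store buffer entries (0 = CPU waits for every store)
    """
    now = 0            # current CPU time
    bus_free = 0       # time at which the bus finishes its last queued store
    pending = deque()  # completion times of stores still in the buffer
    for _ in range(writes):
        now += work_cycles
        # Retire stores that have already reached memory.
        while pending and pending[0] <= now:
            pending.popleft()
        if depth == 0:
            # No buffer: the CPU itself sits through the whole bus access.
            now = max(now, bus_free) + bus_cycles
            bus_free = now
            continue
        if len(pending) == depth:
            # Buffer full: stall until the oldest store has drained.
            now = pending.popleft()
        bus_free = max(now, bus_free) + bus_cycles
        pending.append(bus_free)
    return now


print(run(100, 33, 33, 0))  # no buffer, work and bus time add up
print(run(100, 33, 33, 4))  # buffered, work hides the bus time
print(run(100, 0, 33, 4))   # buffered but no work in between: bus bound
```

In this model the buffer only pays off when there is other work to do between
the stores; with back-to-back stores it fills up and the CPU waits for the bus
anyway.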

2. Instruction overlapping
--------------------------
I touched on this topic a little bit in the 030 timing article: if your code
isn't only about writing to ST RAM, you can use this very nice trick with
fantastic results. I mean the famous chunky to planar routines here, of course.
Let's do some analysis:

For 320*240/TC you need to transfer/clear 320*240*2 = 153600 bytes, which is
76800 words. If one word takes 4 cycles to write to memory, we need 307200
cycles. And most demos didn't use 'true' truecolor: they used a lookup table
for 256 colours + additional values for lighting, shading or pixel
overlapping...

What about 256 colour modes? On a standard Falcon we can't really use them
because of ... bitplanes. Simply, without FastRAM you have to:
- clear chunky buffer in ST RAM (320*240 bytes = 38400 words)
- do some nice 3D stuff (variable amount of writes)
- copy from chunky buffer to screen memory (2*38400 words since both chunky
  buffer and screen are in ST RAM)

This gives us 3*38400 words, which is 3*38400*4 = 460800 cycles, and that is
still without instruction timings.. so it's slower than TC..
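
The same counting as a small Python sketch (just the arithmetic from the text;
the names are mine, and 4 cycles per word is the figure assumed above):

```python
WIDTH, HEIGHT = 320, 240
ST_CYCLES_PER_WORD = 4  # the article's figure for one word access to ST RAM


def st_ram_cycles(words):
    """Bus cycles (16 MHz) for moving `words` words to or from ST RAM."""
    return words * ST_CYCLES_PER_WORD


# Truecolor: 2 bytes per pixel, written once.
tc_words = WIDTH * HEIGHT * 2 // 2
tc_cycles = st_ram_cycles(tc_words)

# 256 colours, everything in ST RAM: clear + read chunky + write screen.
chunky_words = WIDTH * HEIGHT // 2
chunky_cycles = st_ram_cycles(3 * chunky_words)

print(tc_words, tc_cycles)          # 76800 307200
print(chunky_words, chunky_cycles)  # 38400 460800
```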

OK, but FastRAM comes into play! The situation looks much much better:
- clearing chunky buffer in FastRAM (19200 longs)
- do some nice 3D stuff (still in FastRAM)
- c2p conversion (19200 longs to transfer = 38400 words)

So... if one write to SDRAM takes one 66.666 MHz clock cycle, which is
16/66.666 = 0.24 of one 16 MHz clock cycle, we get:

19200*0.24 + 38400*4 = 158208 cycles! Let's compare:

320*240/TC: 307 200 cycles + reads from lookup table
320*240/256: 158 208 cycles
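
The FastRAM figure can be re-checked with a few lines of Python (again only a
sketch of the arithmetic above; one SDRAM write per 66.666 MHz clock is the
assumption made in the text, and the variable names are mine):

```python
ST_CLOCK_MHZ = 16.0
SDRAM_CLOCK_MHZ = 66.666
ST_CYCLES_PER_WORD = 4

# One SDRAM write, expressed in 16 MHz cycles (rounded like in the text).
fast_write = round(ST_CLOCK_MHZ / SDRAM_CLOCK_MHZ, 2)

clear_longs = 320 * 240 // 4   # chunky buffer cleared in FastRAM
c2p_words = 320 * 240 // 2     # planar output written to ST RAM

total_256 = clear_longs * fast_write + c2p_words * ST_CYCLES_PER_WORD
print(fast_write, round(total_256))
```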

Maybe you ask why I'm so sure the c2p conversion itself will not cost any extra
cycles ;) It's because of instruction overlapping. If you write a long to ST
RAM (typical for c2p, where we are writing longs) it takes eight 16 MHz cycles,
which gives us:

2*4*(1/16000000) / (1/66666000) ~ 33 cycles at 66.666 MHz between two
consecutive c2p writes. And be sure that in this time you can do everything :)
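
Written out step by step in Python (a sketch of the conversion above, nothing
more; the variable names are mine):

```python
ST_CLOCK_HZ = 16_000_000
CPU_CLOCK_HZ = 66_666_000

ST_CYCLES_PER_WORD = 4
st_cycles_per_long = 2 * ST_CYCLES_PER_WORD   # a long is two words

# Time of one long write to ST RAM, converted into 060 clock cycles.
seconds_per_long = st_cycles_per_long / ST_CLOCK_HZ
cpu_cycles_per_long = seconds_per_long * CPU_CLOCK_HZ

print(round(cpu_cycles_per_long, 2))
```

That is the budget the 060 has for the actual c2p shuffling while one long is
still on its way to ST RAM.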

Here we see that the idea of putting our truecolor screen buffer into FastRAM
makes practically no sense - ok, say we have our buffer in SDRAM:

- clearing of buffer: 320*240*2 bytes = 38400 longs
- doing stuff in FastRAM...
- copying to ST RAM: 38400 longs to transfer = 76800 words

38400*0.24 + 76800*4 = 316416... we didn't help ourselves very much... 2 times
slower than the 256 colour mode....
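
Both cases side by side, as a Python sketch with the pixel size as a parameter
(frame_cycles is a hypothetical helper; 0.24 and 4 are the per-write costs
assumed in the text):

```python
def frame_cycles(bytes_per_pixel, width=320, height=240,
                 fast_write=0.24, st_cycles_per_word=4):
    """16 MHz cycles to clear a FastRAM buffer and push it to ST RAM."""
    size = width * height * bytes_per_pixel
    clear = (size // 4) * fast_write          # long writes to SDRAM
    copy = (size // 2) * st_cycles_per_word   # word-sized ST RAM accesses
    return clear + copy


chunky = frame_cycles(1)      # 256 colours via c2p
truecolor = frame_cycles(2)   # truecolor buffer copied to the screen
print(round(chunky), round(truecolor), round(truecolor / chunky, 2))
```

Both terms simply double with the pixel size, and the ST RAM part dominates in
either case.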

Caches
======
Only a few words on this topic: unroll your loops !!!!!!!!!!!!!! =) And for ST
RAM operations... try to optimize the program pipeline to the max...

Superscalar architecture
========================
People, this thing rules =) It's a little bit similar to the DSP pipeline, but
with much more freedom. Here's a copy&paste from a mail by the Amiga guy Thomas
Richter:

---------------
Actually, the '060 UM is sufficient  in  this  topic.  The '060 has two ALUs of
which one has only  a  restricted  instruction  set  (the sOEP).  Most *simple*
operations can run in parallel in the  pOEP and sOEP provided the results don't
depend on each other (and provided you don't trip on a bug in the '060 of which
- unfortunately - there are some).

Thus,

 add.l d0,d1
 add.l d2,d3

can be executed in parallel since "add"  can  be executed in both ALUs, and the
source of the second instruction does not depend on a result of the first.

Instructions marked as "sOEP|pOEP" run on both ALUs. Those marked as "pOEP" can
only run on the primary ALU and hence may cause stalls. Furthermore, the FPU
runs in parallel with the integer unit, so it makes quite some sense to 'fire
off' the FPU and to perform integer arithmetic while the FPU is busy.

Thus, programming hint:  Try  to  'interleave'  instructions  from two separate
instruction pipelines to keep both ALUs  busy.  For  example, if you have tight
inner loops, unroll  the  loop  (if  possible)  into  two  parallel instruction
streams.
---------------

I proved this to myself by modifying the c2p routine which Evil sent me and I
have to say, it's faster! Even with the incredibly slow ST RAM writes!
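
Why interleaving helps can be shown with a toy Python model of the pairing
rule from the mail above. This is my own sketch and far simpler than the real
060: it only knows one rule (an assumption), namely that two neighbouring
instructions issue together unless the second one reads or writes the register
the first one writes. The function paired_cycles and the instruction tuples
are hypothetical.

```python
def paired_cycles(program):
    """Count issue slots for a list of (dest, sources) register tuples.

    Toy rule (an assumption, not the full 060 rule set): two neighbouring
    instructions issue together unless the second reads or writes the
    register the first one writes.
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program):
            dest, _ = program[i]
            next_dest, next_srcs = program[i + 1]
            if dest != next_dest and dest not in next_srcs:
                i += 1  # second instruction rides along in the sOEP
        i += 1
    return cycles


# add.l d0,d1 / add.l d1,d2 / add.l d3,d4 / add.l d4,d5
# Two dependent chains written one after the other...
serial = [("d1", ("d0", "d1")), ("d2", ("d1", "d2")),
          ("d4", ("d3", "d4")), ("d5", ("d4", "d5"))]
# ...and the same two chains interleaved.
interleaved = [serial[0], serial[2], serial[1], serial[3]]

print(paired_cycles(serial), paired_cycles(interleaved))
```

The interleaved order needs fewer issue slots in this model because every
neighbouring pair is independent, which is exactly the 'two parallel
instruction streams' hint.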


And that is it... just a short overview if you are too lazy to read Motorola's
docs ;) What to say at the end... make sure you have enabled the instruction &
data & branch caches, enabled the FIFO buffer for the data cache and enabled
"superscalar mode" in the PCR !
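
For reference, this is how the corresponding register values can be put
together, sketched in Python. The bit positions are the ones I know from
Motorola's MC68060 User's Manual (please double-check them there before
relying on this), the constant names are mine, and I read "FIFO buffer for data
cache" as the store buffer. On the real machine the values are written with
movec in supervisor mode.

```python
# CACR bits of the 68060 (assumption: positions as in the MC68060 manual)
CACR_EDC = 1 << 31  # enable data cache
CACR_ESB = 1 << 29  # enable store buffer (the four-entry FIFO)
CACR_EBC = 1 << 23  # enable branch cache
CACR_EIC = 1 << 15  # enable instruction cache

# PCR bit of the 68060
PCR_ESS = 1 << 0    # enable superscalar dispatch

cacr = CACR_EDC | CACR_ESB | CACR_EBC | CACR_EIC
print(f"CACR = ${cacr:08X}")
print(f"PCR |= ${PCR_ESS:08X}")
```

The PCR should be read, OR-ed with the superscalar bit and written back, since
it also carries the CPU identification and revision fields.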

I have attached the mentioned c2p routine to this article; I don't know a
faster one at this time ;)

CT60 rules !!!!!!!!!!!! =)


-------------------------------------------------------------------------------
   MiKRO              XE/XL/MegaSTE/Falcon/CT60              mikro.atari.org
-------------------------------------------------------------------------------