myStiC bYTeS - atari fALcoN demoCREW

--- - -- - --------------------------------------------------------------------
68030
ST RAM and things around it
---------------------------------------------------------------------- ---- - -

To tell you the truth - I'm always reading about how slow ST RAM is and that
Fast RAM is something unbeatable etc, but  I  never knew WHY. There is only one
Falcon timing document - the one published  by  Rodolphe Czuba with CT2. But in
my eyes, this doc is so  short/unclear  and SHITTILY translated, there's no way
to get the idea for non-hw freaks like me.  But one nice day I told myself: "It
can't be so hard... other ppl understood  that,  I can too" :-) So I downloaded
both the UK and FR versions of the mentioned doc and started to read...

OK, after some weeks :) I've got  that  damned  idea... I have to say I haven't
got any oscilloscope or  any  other  device  to  verify  my  results, but there
shouldn't be big differences. From time to time I use some numbers by Rodolphe.
So, let's start!

Let's talk about timing tables first:

We want to know how many cycles the move.l (an)+,d0 instruction takes. If you
take a look into the official timing tables by Motorola (rewritten by
Zyx/Inferiors, JHL/TRSI, aRt/TBL and others), you'll find:

                head    tail    i-cache case    no-cache case
MOVE    EA,Dn   0       0       2(0/0/0)        2(0/1/0)

and for EA calculation:

        (An)+   0       1       3(1/0/0)        3(1/0/0)
                ----------------------------------------
                                5(1/0/0)        5(1/1/0)

From this example you can see:
- no overlapping will occur
- move (an)+,dn takes 5 clock periods in total (cache setting doesn't matter)
- data reading takes 1 bus-cycle
- i-prefetching takes 0 (if it's in i-cache) or 1 (if it isn't) bus-cycle
- data writing takes 0 bus-cycles

According to Motorola, this table assumes:
- 32 bit data bus
- 1 bus-cycle = 2 clock (cpu) periods
- 2 i-prefetches per 1 bus cycle
- long aligned operands incl. system stack
- no wait states
- data cache disabled

Instruction prefetch means "loading"  of  the  instruction  into CPU, including
additional  extensions. In the  simplest  case,  one instruction prefetch =
one word prefetch (since basic instruction on 68k  is 16 bit). In case of movem
instruction for example, CPU has to prefetch  one long = 2 word prefetches. But
we see, it still takes the same time  (it isn't important if we transfer a word
or long with 32 bit  data  bus)  Conclusion:  it  doesn't  matter if we have to
prefetch one or two words for a complete instruction, it takes the same time.

bus activity:  1*2 (data read) + 0*2 (prefetching) = 2 in cache-case or
               1*2 (data read) + 1*2 (prefetching) = 4 in non-cache-case

So, number of internal cycles for execution:
all cycles - bus activity = 5 - 2 = 3 for cache-case or
                            5 - 4 = 1 for non-cache-case
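
The subtraction above can be sketched in a few lines of Python. It only
illustrates the arithmetic: the function name and its parameters are made up
for this example, and the 2 clock periods per bus-cycle are the table
assumption listed above.

```python
# Sketch: split a table entry "total(r/p/w)" into bus and internal clocks.
# Names are illustrative only; 2 clocks per bus-cycle is the table assumption.
CLOCKS_PER_BUS_CYCLE = 2


def internal_clocks(total, reads, prefetches, writes,
                    bus_clocks=CLOCKS_PER_BUS_CYCLE):
    """Return the clock periods left once bus activity is subtracted."""
    bus_activity = (reads + prefetches + writes) * bus_clocks
    return total - bus_activity


print(internal_clocks(5, 1, 0, 0))  # cache case,    entry 5(1/0/0)
print(internal_clocks(5, 1, 1, 0))  # no-cache case, entry 5(1/1/0)
```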

Please note there's no dependency between the internal cycles of the cache and
non-cache cases. Sometimes executing/decoding is faster with cache, sometimes
without. Of course we're talking only about internal cycles and very fast (2
clock periods per 32 bits) RAM access; with slow memory you will see the
difference. Also keep in mind the values for i-prefetching are AVERAGE ones for
odd/even word -> the real bus activity is always less than or equal to the
calculated one. In practice: let's take our movem example. In all cases it is
min. 2 words long, right? But in calculations it's always divided into 2
i-prefetches -> i-prefetching takes 2*2 = 4 clock periods while the real bus
activity is only 1*2 clock periods (with long aligned operands and a 32 bit
data bus, of course !!!). Also don't be surprised if you sometimes get a result
of 0 internal cycles.

But hey, that's our dream machine and not Falcon030! =) Reference for us should
be:

- 16 bit data bus
- 1 WORD bus-cycle = 4 clock periods (according to Czuba's doc)
- refreshing every 15.6 us [wait states] (again Czuba's doc)

This implies the table above doesn't hold for us, at least not the overall
number of cycles. Some changes are needed:

- if you have 32 bit data bus, doesn't matter if you read (long aligned)
  long or word - it still takes the same time. In our case it DOES matter
  since 1 bus read = 1 WORD read, NOT LONG
- as you can see, bus activity is 2 times longer (2 vs 4 clock periods)

So correct timing should look like this:

move.w (an)+,dn 0       1        7(1/0/0)        9(1/1/0)
move.l (an)+,dn 0       1       11(2/0/0)       13(2/1/0)

How did I get these numbers?

bus activity for move.w:
1*4 (data read) + 0*4 (prefetching) = 4 in cache-case or
1*4 (data read) + 1*4 (prefetching) = 8 in non-cache-case

bus activity for move.l:
2*4 (data read) + 0*4 (prefetching) = 8  in cache-case or
2*4 (data read) + 1*4 (prefetching) = 12 in non-cache-case

And from example above we know this move takes 3 (cache) / 1 (non-cache)
internal cycles: 4+3=7 / 8+1=9 (move.w) and 8+3=11 / 12+1=13 (move.l)
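
The same recalculation as a small Python sketch. The helper name is invented
for illustration; the 4 clock periods per word are Czuba's figure quoted above
and the internal cycles (3 with cache, 1 without) come from the 32 bit example.

```python
# Sketch: rebuild the Falcon timings from internal clocks + word bus accesses.
# FALCON_CLOCKS_PER_WORD is the figure from Czuba's doc as quoted in the text.
FALCON_CLOCKS_PER_WORD = 4


def falcon_clocks(internal, word_reads, word_prefetches, word_writes=0):
    """Total clock periods on a 16 bit bus with 4 clocks per word access."""
    accesses = word_reads + word_prefetches + word_writes
    return internal + accesses * FALCON_CLOCKS_PER_WORD


# (cache case, no-cache case)
move_w = (falcon_clocks(3, 1, 0), falcon_clocks(1, 1, 1))
move_l = (falcon_clocks(3, 2, 0), falcon_clocks(1, 2, 1))
print("move.w (an)+,dn:", move_w)
print("move.l (an)+,dn:", move_l)
```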

Here we see we needn't care  about  1-word vs 2-word instructions anymore since
our bus can transfer only words :) That means if an instruction takes one word,
it needs 1*4 clock periods, if it  takes  two words, it needs 2*4 clock periods
etc.

We're more than 2 times slower than an 'original' 68030 in long transfers!
I'll stop talking about timing tables here; it's more complicated than you
might expect, especially the pipelining & instruction/data cache stuff, which
is a topic for a separate article. (mail me and I'll write about it)

So, back to ST-RAM:

The Falcon is equipped with a Motorola 68030 @ 16 MHz and a 16 bit data bus.
That means the CPU can access only 1 word of RAM per bus-cycle. One bus-cycle
is 4 clock periods long -> the CPU can read/write one word every 4th clock
period (in the following text I assume one 16 MHz clock period = 1 cycle).
Luckily, the CPU isn't halted totally during this time - as long as it doesn't
access any external source (ST/TT RAM, hardware regs, ...) it can execute other
instructions that work purely inside the CPU (this should allow use of the data
cache, too, but I didn't get very good results..) By the way, instructions...
the fastest instruction on the 030 takes 2 cycles, but you can't fit two of
them into one bus write since you've got less than 4 cycles, I don't know
exactly why.. Or you can write one long to ST-RAM (that means a <8 cycle pause)
and here you have time for three 2-cycle instructions.. not very much, I know
=)

So, if 1 cycle is 1/16000000 s = 62.5 * 10^-9 s = 62.5 ns and the CPU needs 4
cycles for a RAM access, a complete read/write is 4 * 62.5 = 250 ns long.
Remember our SIMMs are usually 80 ns parts, which means the RAM itself could be
accessed 3.125 times faster! And this is only the beginning of the time
wasting :)

1 word = 2 bytes -> we can  transfer  2  bytes per 4 cycles -> 1/2 byte[4 bits]
per cycle. If one cycle is 62.5 ns:

0.5/62.5ns = 8 000 000 bytes per second. Fiction.

Let's take a look at the move.l (an)+,dn again. We calculated in worst case (no
instruction cache) it takes 13 cycles, right? What is the max reading speed?

16000000/13 longs/s = 16000000*4/13 bytes/s = 4 923 076 bytes/s. Shame!
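
Both numbers follow from the same division, sketched here in Python. The
function and constant names are only illustrative; results are truncated to
whole bytes like in the text.

```python
# Sketch: bytes per second = clocks per second * bytes per op / clocks per op.
CPU_HZ = 16_000_000  # 16 MHz, 1 cycle = 62.5 ns


def bytes_per_second(bytes_per_op, clocks_per_op, clocks_available=CPU_HZ):
    """Throughput in whole bytes/s for an operation repeated back to back."""
    return clocks_available * bytes_per_op // clocks_per_op


print(bytes_per_second(2, 4))   # raw bus: one word every 4 cycles
print(bytes_per_second(4, 13))  # move.l (an)+,dn without instruction cache
```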

Do you think it can't be worse? Okie, let's continue...

Here is another bus-cycle-stealer: DMA. DMA means Direct Memory Access. That
'direct' means that if a DMA device wants something from RAM, the CPU can fuck
off and has to wait until the DMA device has finished. In the Falcon we have 3
DMA devices: Videl, Blitter and Sound DMA. I'm not going to talk about the
Blitter or SDMA since these chips aren't necessary for EVERY application, but
grafix we need from time to time :-)

Eeeehhh... what example to give? I think 320x240xTC could be interesting,
couldn't it?

320 x 240 x 2 bytes = 153600 bytes = 38400 longs. Videl accesses video data in
so-called BURST mode, which means 17 longs per Videl (!) access. This BURST
mode looks like this:

1st long = 2 (RAS cycle = init) + 1 (CAS cycle = data)  =  3 cycles
2nd ~ 17th long =                 16 (CAS cycles)       = 16 cycles
                                                          ---------
                                                          19 cycles
(thanks to Rodolphe for this explanation)

So one BURST access = 19 cycles.
38400/17 = 2259 BURST accesses and 2259 * 19 = 42921 cycles per screen.

That gives us 42921 * 60 = 2 575 260 cycles per second for Videl (!!!). 60
stands for 60 Hz of course; use 50 for 50 Hz, 100 for 100 Hz etc.

After DMA stealing we're left with: 16000000 - 2575260 = 13 424 740 cycles/s !!!

So, what about our move.l ?

13424740/13 longs/s = 13424740*4/13 bytes/s = 4 130 689 bytes per second...
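
If you want to try other resolutions or refresh rates, the Videl calculation
can be sketched like this. The names are invented for this example; the 17
longs per burst and the 2+1+16 cycles are the numbers given above, nothing
measured.

```python
# Sketch: cycles per second taken by Videl bursts, using the text's numbers.
CPU_HZ = 16_000_000
LONGS_PER_BURST = 17
CYCLES_PER_BURST = 2 + 1 + 16  # RAS init + first CAS + 16 further CAS cycles


def videl_cycles_per_second(width, height, bytes_per_pixel, refresh_hz):
    """Cycles/s spent on screen fetches (bursts rounded up per screen)."""
    longs = width * height * bytes_per_pixel // 4
    bursts = (longs + LONGS_PER_BURST - 1) // LONGS_PER_BURST  # ceiling
    return bursts * CYCLES_PER_BURST * refresh_hz


stolen = videl_cycles_per_second(320, 240, 2, 60)
left = CPU_HZ - stolen
print(stolen, left)
print(left * 4 // 13)  # bytes/s for move.l (an)+,dn at 13 cycles per long
```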

Are you still thinking Falcon is fantastically developed? =) Ok...

This example assumes all  your  RAM  reading  is  sequential, that means you're
reading address n+$0, n+$2, n+$4, n+$6 etc.  Maybe  you think: "nah and what? I
have all my data together and  my  program  doesn't  jump every time". Hahah my
little lady! =) How does your RAM accessing look?

lea     something,a0
move.l  d0,(a0)+
move.l  d3,(a0)+
:
:
for example. But remember  the  CPU  fetches  instructions from your code-space
(text-segment), then writes some long data  to  ANOTHER area, then fetches next
instruction, ... etc. And we still assume  all  your code and data is word/long
aligned ! (for more info  about  aligning  see below) For non-sequential access
you have  to  add  2  cycles  for  precharge  time  (again  Czuba's  number) to
instruction time. In our case:

13424740/(13+2) longs/s = 13424740*4/(13+2) bytes/s = 3 579 930 bytes/s...
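
A tiny Python sketch of what those 2 extra cycles cost. The function name is
made up; the 13 cycles, the 2 precharge cycles and the 13 424 740 cycles/s are
the values used above.

```python
# Sketch: throughput of long transfers with and without precharge cycles.
def long_bytes_per_second(cycles_available, cycles_per_long, precharge=0):
    """Whole bytes/s when every long costs cycles_per_long + precharge."""
    return cycles_available * 4 // (cycles_per_long + precharge)


sequential = long_bytes_per_second(13_424_740, 13)
scattered = long_bytes_per_second(13_424_740, 13, precharge=2)
print(sequential, scattered, sequential - scattered)
```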

Our 256 byte cache can solve these troubles with precharging. If you put your
loop into the instruction cache, the data bus will be used only for
reading/writing data. If you're a really, really lucky boy, in the case of
reading you can have both the program and the read data in the instr/data
caches and no bus transfer will be done. In the case of writing, data is ALWAYS
transferred to memory, since our cache works in so-called 'write-through' mode.
The 040 and higher work in 'write-back' mode, where it isn't always necessary
to access RAM (better internal logic)

Now, we killed the last cycle in our  Falcon  :-)  Ah, not the last - there are
still those stupid wait  states...  but  as  Rodolphe  wrote, they don't affect
calculations a lot. If you want some additional info how to implement them into
timing tables, I refer you to the 68030 User's Manual by Motorola.

Aligning & cache stuff
======================

OK, let's start with a cool 030 feature - burst mode. Our data/instruction
cache has lines of 16 bytes = 4 longs. If you align your data to a 16 byte
boundary and enable the so-called BURST mode (not the same as the Videl one :),
the 030 will read 4 longs at once, which makes the four bus accesses faster.
But here's a bottleneck... This needs hw support! Most dynamic RAMs allow it
(faster access within a page compared to plain sequential reading), but in the
Falcon we can forget it completely. I'll kill that bastard who came up with the
idea of a 16 bit wide data bus !!! Conclusion: enabling or disabling instr/data
burst has no effect on a standard Falcon :(

When I began coding, I thought  the  cache  solves everything. I've got a loop,
ok, let's put it into the cache  and  do  things. Shit, again not 100% true :-)
030 has two caches: instruction and data.

Instruction cache:

It's more "friendly" to us :-) Only one thing you have to keep in mind:
instruction fetching is a longword operation! That means if the beginning of
your code isn't long aligned (not only the first line of code, but the target
of every jump, too) you will fetch the unused word in front of it, too.

Data cache [reading]:

If you read some new data and the data cache is enabled/unfrozen, this data
will be stored as an entry in one of 64 locations. 64? Yeah ppl, the data cache
is a longword one, again! So, if you read a word from RAM at address $02, the
entry in the data cache will be: ($00):($02) = long ! In a normal situation, no
tragedy. But with a 16 bit data bus this really sux !!! One more bus access for
every read !!! Solution: always read longs from (long-aligned) memory if you
can. Or disable the data cache ;-) Here's an extremely important aligning tip.
If your long lies at address $03 for example, you'll get _four_ bus accesses
!!! How is that possible?

line in cache: $00         $04           ...
               -------------------------
               |b7|b6|b5|b4|b3|b2|b1|b0| ...
               -------------------------
and reading:             ** ** ** ** <- our long

1) b5:b4 [16 bits] - b4 is MSB of our long
2) b7:b6 [16 bits] - unimportant values
3) b3:b2 [16 bits] - next 16 bits of our long
4) b1:b0 [16 bits] - b1 is LSB of our long
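
The counting rule behind this can be sketched in Python. This is a simplified
model of the description above, not of the real cache logic: it assumes every
longword entry touched by the operand is filled completely over the 16 bit
bus. The function and parameter names are made up for the example.

```python
# Sketch (simplified model): word bus accesses for one cached data read.
# Assumption: each longword-aligned entry the operand touches is read in full.
def bus_accesses(address, size, entry_bytes=4, bus_bytes=2):
    """Count 16 bit bus accesses for reading `size` bytes at `address`."""
    first_entry = address // entry_bytes
    last_entry = (address + size - 1) // entry_bytes
    entries = last_entry - first_entry + 1
    return entries * (entry_bytes // bus_bytes)


print(bus_accesses(0x00, 4))  # long-aligned long
print(bus_accesses(0x02, 2))  # word at $02, the other word comes along
print(bus_accesses(0x03, 4))  # long at $03, straddles two entries
```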

Data cache [writing]:

Heh, even more of a performance killer than reading. If you write something to
RAM, it doesn't matter if you have it in the cache or not, it's always written
to RAM. I wonder why Motorola implemented this one; there's only one use for it
- if you save something to RAM and shortly afterwards want to restore it back
to the regs, and you do that in a loop. Maybe here it would be useful... but in
other cases, especially for things like clearing, always disable the data cache
!!! For reading it's mostly useless, too. Maybe for things like matrix
multiplication it's useful (more values than registers). I recommend flushing
the data cache, "loading" these values via something like tst.w (a0), tst.w
4(a0), ... then clearing the WA bit in CACR and doing a loop with muls.w (a0)+,
...


Practical example
=================

I think the famous movem clear routine for TC may be interesting...

We want to clear 320*240*2 = 153600 bytes. Some timings (columns: head, tail,
i-cache case, no-cache case):
movem.l d0-d6/a0-a5,-(a6)       4+op    0       110(0/0/26)     116(0/2/26)
movem.l d0-d6/a0-a4,-(a6)       4+op    0       102(0/0/24)     108(0/2/24)
movem.l d0-d6/a0-a3,-(a6)       4+op    0        94(0/0/22)     100(0/2/22)
dbra.w  [counter not expired]   6       0         6(0/0/0)       12(0/2/0)
dbra.w  [counter expired]       10      0        10(0/0/0)       16(0/3/0)

Let's look at 3 examples:

1) minimal length
=================
loop:   movem.l d0-d6/a0-a5,-(a6)
        dbra    d7,loop                 ;d7 = 2953-1
        movem.l d0-d6/a0-a3,-(a6)

Since no instruction has a tail, no  overlapping will occur. We divide clearing
into 3 phases:

a) executing & loading into cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [116+2] (2 stands for precharge time)
dbra    d7,loop           [012]

b) executing from cache
~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [110]
dbra    d7,loop           [006]
(2951 times)

c) last execution
~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [110]
dbra    d7,loop           [010]
movem.l d0-d6/a0-a3,-(a6) [100]

result = (116+2+12) + (110+6)*2951 + (110+10+100) = 342 666 cycles

2) one movem
============
loop:   rept 50
        movem.l d0-d6/a0-a4,-(a6)
        endr
        dbra    d7,loop                 ;d7 = 64-1

a) executing & loading into cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a4,-(a6) [108+2]*50
dbra    d7,loop           [012]

b) executing from cache
~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a4,-(a6) [102]*50
dbra    d7,loop           [006]
(62 times)

c) last execution
~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a4,-(a6) [102]*50
dbra    d7,loop           [010]

result = ((108+2)*50+12) + (102*50+6)*62 + (102*50+10) = 327 194 cycles

3) 100% of cache filled
=======================
loop:   rept 63
        movem.l d0-d6/a0-a5,-(a6)
        endr
        dbra    d7,loop                 ;d7 = 47-1

a) executing & loading into cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [116+2]*63
dbra    d7,loop           [012]

b) executing from cache
~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [110]*63
dbra    d7,loop           [006]
(45 times)

c) last execution
~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [110]*63
dbra    d7,loop           [010]

result = ((116+2)*63+12) + (110*63+6)*45 + (110*63+10) = 326 506 cycles.
Please note in this case we're clearing 372 bytes more than needed.
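
All three results use the same three-phase sum, so here it is as a Python
sketch for playing with other loop sizes. The function and its parameters are
invented for this example; the cycle values are the ones from the table above
and the +2 is the precharge time.

```python
# Sketch: first pass (fills the i-cache) + cached passes + last pass.
PRECHARGE = 2


def clear_cycles(first_pass, cached_pass, passes, dbra=(12, 6, 10), tail=0):
    """first_pass / cached_pass: loop body cycles without / with i-cache.
    dbra: cycles for (uncached, cached, counter expired).
    tail: cycles of code executed once after the loop."""
    miss, hit, expired = dbra
    return ((first_pass + miss)
            + (cached_pass + hit) * (passes - 2)
            + (cached_pass + expired)
            + tail)


v1 = clear_cycles(116 + PRECHARGE, 110, 2953, tail=100)
v2 = clear_cycles((108 + PRECHARGE) * 50, 102 * 50, 64)
v3 = clear_cycles((116 + PRECHARGE) * 63, 110 * 63, 47)
print(v1, v2, v3)

# bytes cleared: 13 regs = 52 bytes, 12 regs = 48 bytes, 11 regs = 44 bytes
print(2953 * 52 + 44, 64 * 50 * 48, 47 * 63 * 52)
```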

One VBL on VGA 60 Hz can be max. 1/60 = 0.0167 s long, that means 0.0167 s /
62.5 ns = 266 666 cycles per VBL. Keep in mind this is the absolute maximum; in
practice this number is much lower. That gives us:

342 666 / 266 666 = 1.29 frames for clearing
327 194 / 266 666 = 1.23 frames for clearing
326 506 / 266 666 = 1.22 frames for clearing

So even if your fx is able to run at 60 fps, in the ideal case the frame rate
will drop to 30 fps or lower (approx)
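
The frame numbers can be checked with a short Python sketch. The names are
illustrative, and the fps value is only the best case for the clear routine
alone (whole VBLs, nothing else running).

```python
# Sketch: how many VBLs a routine needs and the best-case frame rate.
import math

CPU_HZ = 16_000_000


def frames_needed(cycles, refresh_hz=60):
    """Fraction of VBLs used, with whole cycles per VBL as in the text."""
    cycles_per_vbl = CPU_HZ // refresh_hz  # 266 666 at 60 Hz
    return cycles / cycles_per_vbl


def best_case_fps(cycles, refresh_hz=60):
    """Frame rate if the routine has to wait for the next whole VBL."""
    return refresh_hz / math.ceil(frames_needed(cycles, refresh_hz))


for cycles in (342_666, 327_194, 326_506):
    print(cycles, round(frames_needed(cycles), 3), best_case_fps(cycles))
```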

We reached the end, finally :-) As you can see, the 16 bit data bus at 16 MHz
is the single biggest performance killer, hand in hand with the Videl accesses.
This is the reason why the CT60+Fastram+SuperVidel combination is so eagerly
awaited. This hardware eliminates most of the stupid things in the Falcon
architecture since we will get:
- ultra-fast CPU [ct60]
- 32 bit BURST access [fastram]
- ultra-fast data bus [ct60+fastram]
- no accesses to old st-ram via slow databus for program and data
- even no accesses to st-ram for gfx data !!! [supervidel]

Fantastic, isn't it?


-------------------------------------------------------------------------------
   MiKRO              XE/XL/MegaSTE/Falcon/CT60              mikro.atari.org
-------------------------------------------------------------------------------