--- - -- - --------------------------------------------------------------------
68030
ST RAM and things around it
---------------------------------------------------------------------- ---- - -
To tell you the truth - I'm always reading about how slow ST RAM is and that
Fast RAM is something unbeatable etc, but I never knew WHY. There is only one
Falcon timing document - the one published by Rodolphe Czuba with CT2. But in
my eyes, this doc is so short/unclear and SHITTILY translated, there's no way
for non-hw freaks like me to get the idea. But one nice day I told myself: "It
can't be so hard... other ppl understood that, I can too" :-) So I downloaded
both the uk and fr versions of the mentioned doc and started to read...
OK, after some weeks :) I got that damned idea... I have to say I haven't
got an oscilloscope or any other device to verify my results, but there
shouldn't be big differences. From time to time I use some numbers by Rodolphe.
So, let's start!
Let's talk about timing tables first:
We want to know how many cycles the move.l (an)+,d0 instruction takes. If you
take a look into the official timing tables by Motorola (rewritten by
Zyx/Inferiors, JHL/TRSI, aRt/TBL and others), you'll find:
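                 head  tail   cache-case   no-cache-case
MOVE EA,Dn        0     0     2(0/0/0)     2(0/1/0)
and for EA calculation:
(An)+             0     1     3(1/0/0)     3(1/0/0)
--------------------------------------------------------
                              5(1/0/0)     5(1/1/0)
(numbers in brackets = bus-cycles for data read/i-prefetch/data write)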
From this example you can see:
- no overlapping will occur
- move (an)+,dn takes 5 cycles in total (cache setting doesn't matter)
- data reading takes 1 bus-cycle
- i-prefetching takes 0 (if it's in i-cache) or 1 (if isn't) bus-cycle
- data writing takes 0 bus-cycles
According to Motorola, this table assumes:
- 32 bit data bus
- 1 bus-cycle = 2 clock (cpu) periods
- 2 i-prefetches per 1 bus cycle
- long aligned operands incl. system stack
- no wait states
- data cache disabled
Instruction prefetch means "loading" of the instruction into the CPU, including
additional extension words. In the simplest case, one instruction prefetch =
one word prefetch (since the basic instruction on 68k is 16 bit). In case of a
movem instruction for example, the CPU has to prefetch one long = 2 word
prefetches. But as we see, it still takes the same time (with a 32 bit data bus
it isn't important if we transfer a word or a long). Conclusion: it doesn't
matter if we have to prefetch one or two words for a complete instruction, it
takes the same time.
bus activity: 1*2 (data read) + 0*2 (prefetching) = 2 in cache-case or
1*2 (data read) + 1*2 (prefetching) = 4 in non-cache-case
So, number of internal cycles for execution:
all cycles - bus activity = 5 - 2 = 3 for cache-case or
5 - 4 = 1 for non-cache-case
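If you want to play with these numbers yourself, here's a tiny Python sketch of
the subtraction above. The function name and its arguments are made up for this
article (nothing official from Motorola), and it only assumes what the table
assumes: 1 bus-cycle = 2 clock periods.

```python
# Sketch only: split a timing-table entry "total(r/p/w)" into bus activity
# and internal cycles. Names are made up for this article.

def internal_cycles(total, reads, prefetches, writes, clocks_per_bus_cycle=2):
    """Return (bus_activity, internal), both in clock periods."""
    bus_activity = (reads + prefetches + writes) * clocks_per_bus_cycle
    return bus_activity, total - bus_activity

# move.l (an)+,dn = MOVE EA,Dn + (An)+ from the tables above
print(internal_cycles(5, 1, 0, 0))  # cache-case     -> (2, 3)
print(internal_cycles(5, 1, 1, 0))  # non-cache-case -> (4, 1)
```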
Please note there's no dependency between the internal cycles of the cache and
non-cache cases. Sometimes executing/decoding is faster with cache, sometimes
without. Of course we're talking only about internal cycles and very fast
(2 clock periods per 32 bits) RAM access, in case of slow memory you will see
the difference. Also keep in mind the values for i-prefetching are AVERAGE ones
for odd/even words -> the final value is always less than or equal to the
current bus status. In practice: let's take our movem example. In all cases it
is min. 2 words long, right? But in calculations, it's always divided into 2
i-prefetches -> i-prefetching takes 2*2 = 4 clock periods while the real bus
status is only 1*2 clock periods (in case of long aligned operands and a 32 bit
data bus of course !!!). Also don't be surprised if you sometimes get a result
of 0 internal cycles.
But hey, that's our dream machine and not the Falcon030! =) The reference for
us should be:
- 16 bit data bus
- 1 WORD bus-cycle = 4 clock periods (according to Czuba's doc)
- refreshing every 15.6 us [wait states] (again Czuba's doc)
This stuff implies that the table above doesn't fit the Falcon. At least not
the overall number of cycles. Some changes are needed:
- if you have a 32 bit data bus, it doesn't matter if you read a (long
aligned) long or a word - it still takes the same time. In our case it DOES
matter since 1 bus read = 1 WORD read, NOT LONG
- as you can see, bus activity takes 2 times longer (2 vs 4 clock periods)
So the correct timing should look like this:
                 head  tail   cache-case   no-cache-case
move.w (an)+,dn   0     1     7(1/0/0)     9(1/1/0)
move.l (an)+,dn   0     1    11(2/0/0)    13(2/1/0)
How did I get these numbers?
bus activity for move.w:
1*4 (data read) + 0*4 (prefetching) = 4 in cache-case or
1*4 (data read) + 1*4 (prefetching) = 8 in non-cache-case
bus activity for move.l:
2*4 (data read) + 0*4 (prefetching) = 8 in cache-case or
2*4 (data read) + 1*4 (prefetching) = 12 in non-cache-case
And from example above we know this move takes 3 (cache) / 1 (non-cache)
internal cycles: 4+3=7 / 8+1=9 (move.w) and 8+3=11 / 12+1=13 (move.l)
Here we see we needn't care about 1-word vs 2-word instructions anymore since
our bus can transfer only words :) That means if an instruction takes one word,
it needs 1*4 clock periods, if it takes two words, it needs 2*4 clock periods
etc.
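The Falcon numbers above can be rebuilt with a few lines of Python. This is
just a sketch: falcon_cycles() and WORD_BUS_CYCLE are names I made up, the 4
clock periods per word are Czuba's number and the internal cycles (3 with
cache, 1 without) come from the Motorola example.

```python
# Sketch only: rebuild a Falcon (16 bit bus) timing from the internal cycles
# derived from Motorola's table. Function and constant names are made up.

WORD_BUS_CYCLE = 4  # clock periods per WORD bus-cycle (Czuba's number)

def falcon_cycles(data_words, prefetch_words, internal):
    """Clock periods = bus activity on a 16 bit bus + internal cycles."""
    return (data_words + prefetch_words) * WORD_BUS_CYCLE + internal

# internal cycles: 3 in cache-case, 1 in non-cache-case (see above)
for name, words in (("move.w (an)+,dn", 1), ("move.l (an)+,dn", 2)):
    cached = falcon_cycles(words, 0, 3)
    uncached = falcon_cycles(words, 1, 1)
    print(name, cached, uncached)   # 7 9 for move.w, 11 13 for move.l
```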
We're more than 2 times slower than an 'original' 68030 in long transfer ! Here
I stop talking about timing tables, it's more complicated than you would
expect, I mean especially the pipelining & instruction/data cache stuff, that's
a topic for a separate article. (mail me and I'll write about it)
So, back to ST-RAM:
The Falcon is equipped with a Motorola 68030@16 MHz and a 16 bit data bus. That
means the CPU can access only 1 word of RAM per bus-cycle. One bus-cycle is 4
clock periods long -> the CPU can read/write one word every 4th clock period
(in the following text I assume 16 MHz clock period = 1 cycle). Luckily, the
CPU isn't halted totally during this time - as long as it doesn't access any
external source (ST/TT RAM, hardware regs, ...) it can execute some other
instructions which do their stuff inside the CPU (this should allow use of the
data cache, too, but I didn't get very good results..) By the way,
instructions... the fastest instruction on 030 takes 2 cycles, but you can't
fit two of these instructions into the bus write since you have got less than
4 cycles, I don't know exactly why.. Or, you can write one long to ST-RAM (that
means <8 cycles pause) and here you have time for three 2-cycle instructions..
not very much, I know =)
So, if 1 cycle is 1/16000000 s = 62.5 * 10^-9 s = 62.5 ns and the CPU needs 4
cycles for a RAM access, a complete read/write is 4 * 62.5 = 250 ns long.
Remember our SIMMs usually have 80 ns, that means the RAM itself could be
accessed 3.125 times faster ! And this is only the beginning of time wasting :)
1 word = 2 bytes -> we can transfer 2 bytes per 4 cycles -> 1/2 byte[4 bits]
per cycle. If one cycle is 62.5 ns:
0.5/62.5ns = 8 000 000 bytes per second. Fiction.
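Here's the same arithmetic as a small Python sketch, in case you want to try
another clock or RAM speed. All constant names are mine; the 4 cycles per word
are Czuba's number and the 80 ns is just the typical SIMM speed mentioned
above.

```python
# Sketch only: raw access time and peak bandwidth of the ST RAM bus.
CPU_HZ = 16_000_000
CYCLE_NS = 1e9 / CPU_HZ              # 62.5 ns per clock period
WORD_ACCESS_CYCLES = 4               # one WORD bus-cycle (Czuba's number)
SIMM_NS = 80                         # typical SIMM speed mentioned above

access_ns = WORD_ACCESS_CYCLES * CYCLE_NS            # ns per word access
slowdown = access_ns / SIMM_NS                       # bus vs. what the RAM could do
peak_bytes_per_s = CPU_HZ * 2 // WORD_ACCESS_CYCLES  # 2 bytes every 4 cycles

print(access_ns, slowdown, peak_bytes_per_s)   # 250.0 3.125 8000000
```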
Let's take a look at the move.l (an)+,dn again. We calculated in worst case (no
instruction cache) it takes 13 cycles, right? What is the max reading speed?
16000000/13 longs/s = 16000000*4/13 bytes/s = 4 923 076 bytes/s. Shame!
Do you think it can't be worse? Okie, let's continue...
Here is another bus-cycle-stealer: DMA. DMA means Direct Memory Access. That
'direct' means if a DMA device wants something from RAM, the CPU can fuck off
and has to wait until the DMA device has finished. In the Falcon we have 3 DMA
devices: Videl, Blitter and Sound DMA. I'm not going to talk about Blitter or
SDMA since these chips aren't necessary for EVERY application, but grafix we
need from time to time :-)
Eeeehhh... what example to give? I think 320x240xTC could be interesting,
couldn't it?
320 x 240 x 2 bytes = 153600 bytes = 38400 longs. Videl accesses video data in
so called BURST mode, that means 17 longs per one Videl (!) access. This BURST
mode looks like:
1st long        = 2 (RAS cycle = init) + 1 (CAS cycle = data) =  3 cycles
2nd ~ 17th long = 16 (CAS cycles)                             = 16 cycles
                                                                ---------
                                                                19 cycles
(thanks to Rodolphe for this explanation)
So one BURST access = 19 cycles.
38400/17 = 2259 BURST accesses (rounded up) and 2259 * 19 = 42921 cycles per
screen. That gives us 42921 * 60 = 2 575 260 cycles per second for Videl (!!!).
60 stands for 60 Hz of course. It would be 50 for 50 Hz, 100 for 100 Hz etc.
DMA stealing leaves us with: 16000000 - 2575260 = 13 424 740 cycles/s !!!
So, what about our move.l ?
13424740/13 longs/s = 13424740*4/13 bytes/s = 4 130 689 bytes per second...
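If you want to check other resolutions or refresh rates, the Videl arithmetic
above fits into a small Python sketch. The function name and parameters are
made up by me; the burst numbers (17 longs in 19 cycles) are the ones from
Rodolphe's explanation, and rounding the last burst up is my assumption.

```python
import math

# Sketch only: cycles stolen by Videl, using the burst numbers quoted above.
CPU_HZ = 16_000_000
LONGS_PER_BURST = 17
CYCLES_PER_BURST = 2 + 1 + 16   # RAS init + first CAS + 16 more CAS cycles

def videl_cycles_per_second(width, height, bytes_per_pixel, refresh_hz):
    longs = width * height * bytes_per_pixel // 4
    bursts = math.ceil(longs / LONGS_PER_BURST)   # assumption: round up
    return bursts * CYCLES_PER_BURST * refresh_hz

stolen = videl_cycles_per_second(320, 240, 2, 60)
left = CPU_HZ - stolen
print(stolen, left)   # 2575260 13424740
```

Calling videl_cycles_per_second(320, 240, 2, 50) gives you the 50 Hz case.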
Do you still think the Falcon is fantastically designed? =) Ok...
This example assumes all your RAM reading is sequential, that means you're
reading address n+$0, n+$2, n+$4, n+$6 etc. Maybe you think: "so what? I have
all my data together and my program doesn't jump all the time". Hahah my
little lady! =) How does your RAM accessing look?
lea something,a0
move.l d0,(a0)+
move.l d3,(a0)+
:
:
for example. But remember the CPU fetches an instruction from your code-space
(text-segment), then writes some long data to ANOTHER area, then fetches the
next instruction, ... etc. And we still assume all your code and data is
word/long aligned ! (for more info about aligning see below) For non-sequential
access you have to add 2 cycles of precharge time (again Czuba's number) to
the instruction time. In our case:
13424740/(13+2) longs/s = 13424740*4/(13+2) bytes/s = 3 579 930 bytes/s...
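To see the three move.l cases side by side, here's a short Python sketch. The
helper name is made up; the 13 cycles are the non-cache move.l (an)+,dn from
above, the 2 cycles are Czuba's precharge number and 13 424 740 is what Videl
leaves us in 320x240 TC at 60 Hz.

```python
# Sketch only: move.l throughput for the cases discussed above.

def bytes_per_second(cycles_available, cycles_per_long):
    """Whole bytes per second if every long (4 bytes) costs cycles_per_long."""
    return cycles_available * 4 // cycles_per_long

CPU_HZ = 16_000_000
AFTER_VIDEL = 13_424_740   # cycles/s left in 320x240 TC at 60 Hz

print(bytes_per_second(CPU_HZ, 13))           # no DMA at all    -> 4923076
print(bytes_per_second(AFTER_VIDEL, 13))      # Videl stealing   -> 4130689
print(bytes_per_second(AFTER_VIDEL, 13 + 2))  # ... + precharge  -> 3579930
```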
Our 256 byte cache can solve these troubles with precharging. If you put your
loop into the instruction cache, the data bus will be used only for
reading/writing data. If you're a really, really lucky boy, in case of reading
you can have both the program and the read data in the instr/data cache and no
bus transfer will be done. In case of writing, data is ALWAYS transferred to
memory, since our cache works in so called 'write-through' mode. The 040 and
higher work in 'write-back' mode, where it isn't always necessary to access RAM
(better internal logic)
Now, we killed the last cycle in our Falcon :-) Ah, not the last - there are
still those stupid wait states... but as Rodolphe wrote, they don't affect the
calculations a lot. If you want some additional info on how to include them in
the timing tables, I refer you to the 68030 User's Manual by Motorola.
Aligning & cache stuff
======================
OK, let's start with a cool 030 feature - burst mode. Our data/instruction
cache has lines of 16 bytes = 4 longs. If you align your data to a 16 byte
boundary and you enable the so called BURST mode (not the same as the Videl one
:), the 030 will fill a whole cache line with 4 longs at once, which makes the
four bus accesses faster. But here's a bottleneck... This needs hw-support !
Most dynamic RAMs allow it (faster access within a whole page compared to
plain sequential reading), but in the Falcon we can forget it completely.
I'll kill that bastard who came up with the idea of a 16 bit wide data bus !!!
Conclusion: enabling/disabling instr/data burst has no effect on a standard
Falcon :(
When I began coding, I thought the cache solves everything. I've got a loop,
ok, let's put it into the cache and do things. Shit, again not 100% true :-)
The 030 has two caches: instruction and data.
Instruction cache:
It's more "friendly" to us :-) There's only one thing you have to keep in mind:
instruction fetching is a longword operation! That means if the beginning of
your code isn't long aligned (not only the first line of code, but every jump
target, too) you will fetch the unused word in front of it, too.
Data cache [reading]:
If you read some new data and the data cache is enabled/not frozen, this data
will be stored as an entry in one of 64 locations. 64? Yeah ppl, the data cache
is a longword one, again! So, if you read a word from RAM at address $02, the
entry in the data cache will be: ($00):($02) = long ! In a normal situation, no
tragedy. But in the case of a 16 bit data bus this really sux !!! One more bus
access for every reading !!! Solution: always read longs from (long-aligned)
memory if you can. Or disable the data cache ;-) Here's an extremely important
aligning tip. If your long lies at address $03 for example, you'll get _four_
bus accesses !!! How is that possible?
line in cache: $00 $04 ...
-------------------------
|b7|b6|b5|b4|b3|b2|b1|b0| ...
-------------------------
and reading: ** ** ** ** <- our long
1) b5:b4 [16 bits] - b4 is MSB of our long
2) b7:b6 [16 bits] - unimportant values
3) b3:b2 [16 bits] - next 16 bits of our long
4) b1:b0 [16 bits] - b1 is LSB of our long
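The counting above can be turned into a little Python sketch. It's a simplified
model of my description only (every long touched by a cached read gets fetched
completely, 2 word accesses per long on the 16 bit bus), not something measured
on real hardware, and the function name is made up.

```python
# Sketch only: count 16 bit bus accesses for a cached read, following the
# description above (every touched long is fetched completely).

def bus_accesses(address, size):
    """Word accesses needed to read `size` bytes at `address` via the data cache."""
    first_long = address // 4
    last_long = (address + size - 1) // 4
    touched_longs = last_long - first_long + 1
    return touched_longs * 2          # 2 word accesses per long on a 16 bit bus

print(bus_accesses(0x00, 4))  # aligned long                     -> 2
print(bus_accesses(0x02, 2))  # word, the whole long gets cached -> 2
print(bus_accesses(0x03, 4))  # misaligned long                  -> 4
```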
Data cache [writing]:
Heh, even more of a performance killer than reading. If you write something to
RAM, it doesn't matter if you have it in the cache or not, it's always written
to RAM. I wonder why Motorola implemented this, there's only one use for it -
if you save something to RAM and a short time later you want to restore it
back to the regs, and you do that in a loop. Maybe there it would be useful...
but in other cases, especially for things like clearing, always disable the
data cache !!! For reading it's pretty useless, too. Maybe for things like
matrix multiplication it's useful (more values than registers). I recommend to
flush the data cache, "load" these values via something like tst.w (a0),
tst.w 4(a0), ..., then clear the WA bit in CACR and do a loop with
muls.w (a0)+, ...
Practical example
=================
I think the famous movem clear routine for tc may be interesting...
We want to clear 320*240*2 = 153600 bytes. Some movem.l timings:
                              head  tail   cache-case    no-cache-case
movem.l d0-d6/a0-a5,-(a6)     4+op   0     110(0/0/26)   116(0/2/26)
movem.l d0-d6/a0-a4,-(a6)     4+op   0     102(0/0/24)   108(0/2/24)
movem.l d0-d6/a0-a3,-(a6)     4+op   0      94(0/0/22)   100(0/2/22)
dbra.w [counter not expired]   6     0       6(0/0/0)     12(0/2/0)
dbra.w [counter expired]      10     0      10(0/0/0)     16(0/3/0)
Let's look at 3 examples:
1) minimal length
=================
loop: movem.l d0-d6/a0-a5,-(a6)
dbra d7,loop ;d7 = 2953-1
movem.l d0-d6/a0-a3,-(a6)
Since no instruction has a tail, no overlapping will occur. We divide clearing
into 3 phases:
a) executing & loading into cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [116+2] (2 stands for precharge time)
dbra d7,loop [012]
b) executing from cache
~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [110]
dbra d7,loop [006]
(2951 times)
c) last execution
~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [110]
dbra d7,loop [010]
movem.l d0-d6/a0-a3,-(a6) [100]
result = (116+2+12) + (110+6)*2951 + (110+10+100) = 342 666 cycles
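If you don't trust my calculator, here's the same sum as a Python sketch. The
variable names are made up, the timings are the ones from the movem/dbra table
above and the 2 cycles of precharge are Czuba's number.

```python
# Sketch only: cycle count for the minimal-length clear loop, using the
# movem/dbra timings listed above. Variable names are made up.

SCREEN_BYTES = 320 * 240 * 2
REGS = 13                                     # d0-d6/a0-a5
loops, rest = divmod(SCREEN_BYTES, REGS * 4)  # full loops, leftover bytes

PRECHARGE = 2
first = (116 + PRECHARGE) + 12                # loop not in cache yet
cached = (110 + 6) * (loops - 2)              # running from cache
last = 110 + 10 + 100                         # dbra expires, final 11-reg movem

total = first + cached + last
print(loops, rest // 4, total)                # 2953 11 342666
```

The leftover of 11 longs is exactly why the last movem uses d0-d6/a0-a3.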
2) one movem
============
loop: rept 50
movem.l d0-d6/a0-a4,-(a6)
endr
dbra d7,loop ;d7 = 64-1
a) executing & loading into cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a4,-(a6) [108+2]*50
dbra d7,loop [012]
b) executing from cache
~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a4,-(a6) [102]*50
dbra d7,loop [006]
(62 times)
c) last execution
~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a4,-(a6) [102]*50
dbra d7,loop [010]
result = ((108+2)*50+12) + (102*50+6)*62 + (102*50+10) = 327 194 cycles
3) 100% of cache filled
=======================
loop: rept 63
movem.l d0-d6/a0-a5,-(a6)
endr
dbra d7,loop ;d7 = 47-1
a) executing & loading into cache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [116+2]*63
dbra d7,loop [012]
b) executing from cache
~~~~~~~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [110]*63
dbra d7,loop [006]
(45 times)
c) last execution
~~~~~~~~~~~~~~~~~
movem.l d0-d6/a0-a5,-(a6) [110]*63
dbra d7,loop [010]
result = ((116+2)*63+12) + (110*63+6)*45 + (110*63+10) = 326 506 cycles.
Please note in this case we're clearing 372 bytes more than needed.
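Examples 2 and 3 follow the same pattern, so they fit into one Python sketch.
The function and its parameters are made up by me; the timings come from the
table above, and the code size assumes 4 bytes per movem.l and 4 bytes for the
dbra.

```python
import math

# Sketch only: cycle count for an unrolled movem clear loop.
# Timings come from the table above; the function itself is made up.

SCREEN_BYTES = 320 * 240 * 2

def unrolled_clear(regs, repeats, nocache, cache):
    """Return (loops, code_bytes, extra_bytes, cycles)."""
    bytes_per_loop = regs * 4 * repeats
    loops = math.ceil(SCREEN_BYTES / bytes_per_loop)
    code_bytes = repeats * 4 + 4             # movem.l = 4 bytes, dbra = 4 bytes
    extra = loops * bytes_per_loop - SCREEN_BYTES
    first = (nocache + 2) * repeats + 12     # +2 = precharge, nothing cached yet
    middle = (cache * repeats + 6) * (loops - 2)
    last = cache * repeats + 10              # dbra counter expired
    return loops, code_bytes, extra, first + middle + last

print(unrolled_clear(12, 50, 108, 102))   # example 2 -> (64, 204, 0, 327194)
print(unrolled_clear(13, 63, 116, 110))   # example 3 -> (47, 256, 372, 326506)
```

The second value shows how much of the 256 byte instruction cache the loop
takes, the third one how many bytes get cleared beyond the screen.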
One VBL on VGA 60 Hz can be max. 1/60 = 0.0167 s long, that means 0.0167 s /
62.5 ns = 266 666 cycles per VBL. Keep in mind this is the absolute maximum, in
practice this number is much lower. That gives us:
342 666 / 266 666 = 1.29 frames for clearing
327 194 / 266 666 = 1.23 frames for clearing
326 506 / 266 666 = 1.22 frames for clearing
So, if your fx is able to run at 60 fps, in an ideal case the clearing alone
will push it down to 30 fps or lower (approx)
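And the frame numbers once more as a Python sketch, so you can plug in your own
routine or another refresh rate. Names are made up, and the cycles per VBL are
the theoretical maximum without any DMA stealing.

```python
# Sketch only: how many 60 Hz frames each clear routine eats.
CPU_HZ = 16_000_000
REFRESH_HZ = 60
cycles_per_vbl = CPU_HZ // REFRESH_HZ       # 266666, the theoretical maximum

def frames(cycles):
    """Frames needed for a routine taking `cycles` clock periods."""
    return cycles / cycles_per_vbl

for cycles in (342_666, 327_194, 326_506):
    print(cycles, round(frames(cycles), 2))  # 1.29, 1.23, 1.22
```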
We reached the end, finally :-) As you can see, the 16 bit data bus at 16 MHz
is the single biggest performance killer, hand in hand with the Videl accesses.
This is the reason why the CT60+Fastram+SuperVidel combination is so awaited.
This hardware eliminates most of the stupid things in the Falcon architecture
since we will get:
- ultra-fast CPU [ct60]
- 32 bit BURST access [fastram]
- ultra-fast data bus [ct60+fastram]
- no accesses to old st-ram via slow databus for program and data
- even no accesses to st-ram for gfx data !!! [supervidel]
Fantastic, isn't it?
-------------------------------------------------------------------------------
MiKRO XE/XL/MegaSTE/Falcon/CT60 mikro.atari.org
-------------------------------------------------------------------------------