Re: [MiNT] FreeMiNT & 68000
Alan Hourihane wrote:
Hi all,
So I've fixed the bug that was preventing FreeMiNT booting on normal
ST's. But unfortunately, there's more kernel problems running programs.
I'll need to tackle those next.
Alan.
I am not sure, but this is a document I have had for some time. It talks
about optimization for 68K in general. It was something I had for a project.
josephus
Optimizing 680x0 Applications
By Tomas Evensen, Diab Data, Inc.
The Motorola 68K family consists of a wide range of members from the
micro-coded MC68000 to the super-scalar, hard-wired MC68060.
This paper discusses the integer performance of the 68K family and ways
to optimize both for individual processors and to produce code that
runs well on any 68K processor.
In general, optimizations can take place on three different levels:
1. On the assembly level for code that is written in assembly language.
2. On the compiler level if the application is written in a high level
language (e.g. C).
3. On the user level by changing algorithms.
We will discuss what can be done to optimize 68K code on all three levels.
THE 68000 FAMILY
The 68K family includes the following members:
MC68000
First generation 68K processor. 16 bit internal/external data paths. 16
MB address space.
MC68008
8 bit external data path. 1-4 MB address space.
MC68010
Similar to MC68000, but with restartable instructions. Can be used in a
virtual memory environment. Loop mode.
MC68EC000
Low-power MC68000. 8 or 16 bit external data bus.
MC68020
32 bit virtual memory microprocessor. 32 bit internal/external data
paths. 4 GB address space. Can be used with floating point coprocessor.
New instructions added including bitfield instructions. New addressing
modes added. 256-byte instruction cache.
MC68EC020
16 MB address space.
MC68030
Similar to MC68020 but slightly faster. 256-byte data cache added.
On-chip MMU.
MC68EC030
Low-power MC68030. No MMU.
CPU32
Basically a 68020 core but without cache, bitfield instructions, and
memory indirect addressing modes. 16 bit external data path. No
coprocessor.
CPU32+
Same as CPU32 but with a 32 bit external data path.
MC68040
Third generation 32 bit processor. 4K instruction cache. 4K data cache.
On chip floating point processor. On chip MMU. Most instructions take
one cycle.
MC68EC040
Low-power MC68040. No MMU. No FPU.
MC68060
Super scalar implementation of the 68K architecture. Can issue up to two
instructions per cycle. 8K instruction cache. 8K data cache.
MC68EC060
Similar to MC68060. No FPU. No MMU.
The following table summarizes the characteristics of the different members
in the 68000 family:
Proc   Cache    RAdd  MAdd  Mul  Index  Bra   UAcc  HWFP
68000  0/0      6     18    40   18     10/6  no    no
68020  256/0    2     6     28   9      6/4   yes   68881/2
68030  256/256  2     5     28   8      6/4   yes   68881/2
CPU32  0/0      2     9     26   12     8/4   no    no
68040  4K/4K    1     1     16   3      2/3   yes   yes
68060  8K/8K    1     1     2    1      0/1   yes   yes
RAdd: Register to register 32 bit add (add.l d0,d1).
MAdd: Absolute long address to register add (add.l _mem,d1).
Mul: 16x16 multiplication (max. time) (mulu.w d0,d1).
Index: Indexed addressing mode (move.l 2(a0,d0),d1).
Bra: Byte conditional branch taken/not taken (bne.b label).
UAcc: Unaligned access allowed (move.l 0xffff0001,d1).
HWFP: Hardware floating point support.
When optimizing for the 68K family, we divide the members into the
following groups:
68000
Optimize for the following processors:
MC68000/MC68008/MC68010/MC68EC000.
68020
Optimize for the following processors:
MC68020/MC68EC020/MC68030/MC68EC030/CPU32/CPU32+
68040
Optimize for the following processors:
MC68040/MC68EC040
68060
Optimize for the following processors:
MC68060/MC68EC060
680xx
Optimize so the code will execute reasonably on any 68K processor.
ASSEMBLY LEVEL OPTIMIZATIONS
Since optimizations for one 68K processor can make another one execute
more slowly, it is important to know the individual instruction timings
for each member.
Here are some examples of different ways of doing operations and the
preferred method for each 68K processor:
Operations with long immediate values between -128 and 127:
A: add.l #20,d1 B: moveq.l #20,d0
add.l d0,d1
68040/xx: A
68000/20/60: B
Byte/word operations that could be replaced with long operations:
A: add.w d0,d1 B: add.l d0,d2
68000/xx: A
68020/40: Any
68060: B
Keep memory operands in registers:
A: add.l _var,d1 B: move.l _var,d0
add.l _var,d2 add.l d0,d1
add.l d0,d2
68040: A (as long as the total number of instructions is smaller)
68000/20/60/xx: B
Reschedule operations using address registers:
A: add.l d0,d1 B: move.l (a1),a0
move.l (a1),a0 add.l d0,d1
move.l (a0),d2 move.l (a0),d2
68000/20: Any
68040/60/xx: B
Replace constant multiplications with adds/subs/shifts:
A: mulu.w #254,d1 B: move.l d1,d0
lsl.l #8,d1
lsl.l #1,d0
sub.l d0,d1
68060: A
68000/20/40/xx: B
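The shift-and-subtract replacement above can be expressed in C as well. A minimal sketch (the helper name is ours, not from the paper):

```c
#include <stdint.h>

/* Multiply by 254 without a multiply instruction, mirroring the
 * assembly sequence above: (x << 8) - (x << 1) == 256*x - 2*x == 254*x.
 * Illustrative helper; not part of the original paper. */
static uint32_t mul254(uint32_t x)
{
    return (x << 8) - (x << 1);
}
```

A good compiler performs this strength reduction automatically for constant multipliers, so hand-coding it in C is rarely necessary.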
Operations using indexing modes:
A: add.l (a0,d7),d1 B: add.l d7,a0
add.l (a0,d7),d2 add.l (a0),d1
add.l (a0),d2
68000/60: A
68020/40/xx: B
Saving/restoring registers:
A: movem.l d4-d7,-(a7) B: move.l d7,-(a7)
move.l d6,-(a7)
move.l d5,-(a7)
move.l d4,-(a7)
68000/20/60/xx: A
68040: B (if time critical)
Summary of characteristics for each processor:
68000:
Lacks 68020 instruction extensions:
- No extb.l instruction
- No 32 bit multiply
- No scaled indexing mode
- No 32 bit PC relative branches
Use short instructions
Keep values in registers
No scheduling necessary
Code optimized for 68020 or 68060 runs great
68020:
Use short instructions
Keep values in registers
Almost no scheduling necessary
Code optimized for the 68060 runs great
68040:
Use as few instructions as possible (even if they are longer)
Values can be kept in memory
Avoid pipe-line stalls for some effective addresses
Avoid subtracts to address registers
68060:
Use short instructions
Keep values in registers
Schedule instructions for superscalar execution
Inline short functions
680xx:
If the code is to be executed on a 68000 processor, the 68000
instruction subset must be used.
Avoid bitfield instructions.
Align all data.
Schedule the instructions for a 68060.
Avoid complex addressing modes (memory indirect).
COMPILER LEVEL OPTIMIZATIONS
When programming in a high level language, the speed of a given algorithm
depends on the hardware (processor, frequency, memory system, etc.), the
compiler (capability of doing optimizations and generating the fastest
instruction sequence), and on the programming style. The better the
compiler is, the less the programming style influences the resulting
efficiency.
Rule: Let the compiler do the optimizations
By using an optimizing compiler, programs stay much cleaner and are
easier to maintain, since hand-optimization is avoided.
The following example is a simple array copy in C:
#define SIZE 100
int src[SIZE], dst[SIZE];
copy()
{
int i;
for(i = 0; i < SIZE; i++) {
dst[i] = src[i];
}
}
A compiler without sophisticated optimizations might produce the
following code:
_copy:
move.l d7,-(a7)
clr.l d7
.Loop:
move.l #_src,a0
move.l #_dst,a1
move.l (a0,d7*4),(a1,d7*4)
addq.l #1,d7
cmp.l #100,d7
bne.b .Loop
move.l (a7)+,d7
rts ; 1409 cycles (040)
An optimizing C compiler would generate something similar to the
following:
_copy:
moveq.l #25,d0
move.l #_src,a0
move.l #_dst,a1
.Loop:
move.l (a0)+,(a1)+ ; Loop is unrolled
move.l (a0)+,(a1)+ ; 4 times
move.l (a0)+,(a1)+
move.l (a0)+,(a1)+
subq.l #1,d0
bne.b .Loop
rts ; 284 cycles (040)
By "hand-optimizing" the code, the programmer can improve the speed
when using a non-optimizing compiler:
#define SIZE 100
int src[SIZE], dst[SIZE];
copy()
{
int i, *s, *d;
for(i = SIZE, s = src, d = dst; i != 0; i--) {
*d++ = *s++;
}
}
The non-optimizing compiler will generate the following code:
_copy:
movem.l d7/a4/a5,-(a7)
moveq.l #100,d7
move.l #_src,a4
move.l #_dst,a5
.Loop:
move.l (a4)+,(a5)+
subq.l #1,d7
bne.b .Loop
movem.l (a7)+,d7/a4/a5
rts ; 518 cycles (040)
To get the same execution speed as the optimizing compiler, the
hand-optimized code must also be unrolled, which is easy as long as
SIZE is a multiple of four, but harder in the general case. Also the
hand optimized code will run slower on many other architectures.
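One common way to handle the general case is a main loop unrolled four times plus a short tail loop for the remainder. A sketch (the function name is ours):

```c
#include <stddef.h>

/* Hand-unrolled copy that works for any element count, not just
 * multiples of four: the main loop copies four ints per iteration,
 * and a short tail loop handles the remaining 0-3 elements. */
void copy_unrolled(int *dst, const int *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled 4x */
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; i++)             /* remainder */
        dst[i] = src[i];
}
```

An optimizing compiler that unrolls the simple indexed loop generates equivalent code without the clutter, which is the point of the rule above.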
Another example:
i += p[1];
j += p[1];
A good 68K compiler will translate this code differently depending on
which 68K processor it generates code for:
68000/20/60:
move.l 4(a5),d0 ; remember p[1] in d0
add.l d0,d7 ; i += p[1]
add.l d0,d6 ; j += p[1]
68040:
add.l 4(a5),d7 ; i += p[1]
add.l 4(a5),d6 ; j += p[1]
In the above case, trying to optimize the code in C will slow down the
68040:
i += (tmp = p[1]);
j += tmp;
As an example of how important the choice of compiler is, Motorola
improved their SPECint89 numbers for the 68040 processor by 29% just
by switching to a compiler that took advantage of the strengths of the
68040 processor.
Rule: Avoid taking the address of variables
In many cases it is impossible for the compiler to keep variables in
fast registers if their address is taken:
int func1(int var) {
far_away1(&var);
far_away2(var);
return var;
}
If far_away1() modifies var, but does not save the address, the
following code is more efficient:
int func1(int var) {
var = new_far_away1(var);
far_away2(var);
return var;
}
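A compilable sketch of this transformation, with stand-in bodies for the far_away functions (the bodies below are our own illustration, not from the paper):

```c
/* Stand-in bodies, purely for illustration. */
static int new_far_away1(int v) { return v + 1; }
static void far_away2(int v) { (void)v; }

/* var is passed and returned by value, so its address is never taken
 * and the compiler is free to keep it in a register throughout. */
int func1(int var)
{
    var = new_far_away1(var);
    far_away2(var);
    return var;
}
```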
Rule: Use the static keyword
By using the static keyword for local functions and file local variables,
the compiler can do a much better job optimizing. For example:
static int si;
void func2(int *p) {
int j, i;
i = si;
*p = 0;
j = si;
...
}
In this simple example the compiler knows that the *p = 0 statement does
not kill the variable si since it is static and its address is not taken.
USER LEVEL OPTIMIZATIONS
The most dramatic speed improvements of a program typically come from
finding the bottlenecks and changing the algorithm to something more
efficient, for example by setting flags instead of re-calculating some
expressions, or by replacing a linear search with a hash table.
A profiler is a program that helps the programmer find the code that
is executed the most. Typically, profiling is done in three steps:
- Compiling the code with profiling enabled.
- Executing the program with typical input data.
- Displaying the profiling data, for example how many times each
source line is executed.
After profiling, the programmer can examine the most frequently
executed source lines and rewrite the bottlenecks.
Modern compilers take profiling one step further. By feeding the
profiling information back into the compiler, the optimizer can use
actual execution data to fine tune many optimizations, including the
following:
Loop unrolling
Only loops in time critical sections are unrolled
Function inlining
Only functions in time critical sections are inlined
Partial inlining
Functions that often return after the first conditional test can be
inlined up to that test
Basic block reorganization
Most executed block in if-then-else statement is placed at the end
Register allocation
Variables actually used the most are put in registers
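The partial inlining case above can be sketched in C by splitting the common early-out test from the rare slow path (the names are illustrative; a profile-guided compiler performs this split automatically):

```c
/* Rarely executed body, kept out of line. */
static int slow_path(int x)
{
    int r = 0;
    while (x > 0) {
        r += x;
        x -= 2;
    }
    return r;
}

/* Only the first conditional test is inlined at each call site;
 * the common case returns without a function call. */
static inline int fast_lookup(int x)
{
    if (x <= 0)          /* frequent early return */
        return 0;
    return slow_path(x); /* infrequent slow path */
}
```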
CONCLUSION
The characteristics of the different members of the 68K family vary
a great deal, which makes it impossible to write code that will run as
fast as possible on all processors. When writing programs in a high
level language, the compiler can do the processor specific
optimizations, but in assembly language a decision has to be made on
which processor to optimize for.