Optimizing 680x0 Applications Optimizing 680x0 Applications By Tomas Evensen, Diab Data, Inc. The Motorola 68K family consists of a wide range of members from the micro-coded MC68000 to the super-scalar, hard-wired MC68060. This paper will discuss the integer performance of the 68K family and ways to optimize both for the individual processors as well as produce code that runs well on any 68K processor. In general, optimizations can take place on three different levels: 1. On the assembly level for code that is written in assembly language. 2. On the compiler level if the application is written in a high level language (i.e. C). 3. On the user level by changing algorithms. We will discuss what can be done to optimize 68K code on all three levels. THE 68000 FAMILY The 68K family includes the following members: MC68000 First generation 68K processor. 16 bit internal/external data paths. 16 Mb address space. MC68008 8 bit external data path. 1-4 MB address space. MC68010 Similar to MC68000, but with restartable instructions. Can be used in a virtual memory environment. Loop mode. MC68EC000 Low-power MC68000. 8 or 16 bit external data bus. MC68020 32 bit virtual memory microprocessor. 32 bit internal/external data paths. 4 GB address space. Can be used with floating point coprocessor. New instructions added including bitfield instructions. New addressing modes added. 256 bytes instruction cache. MC68EC020 16 Mb address space. MC68030 Similar to MC68020 but slightly faster. 256 bytes data cache added. On- chip MMU. MC68EC030 Low-power MC68030. No MMU. CPU32 Basically a 68020 core but without cache, bitfield instructions, and memory indirect addressing modes. 16 bit external data path. No coprocessor. CPU32+ Same as CPU32 but with 32 bit external data path. MC68040 Third generation 32 bit processor. 4K instruction cache. 4K data cache. On chip floating point processor. On chip MMU. Most instructions take one cycle. MC68EC040 Low-power MC68040. No MMU. No FPU. MC68060 Super scalar implementation of the 68K architecture. Can issue up to two instructions per cycle. 8K instruction cache. 8K data cache. MC68EC060 Similar to MC68060. No FPU. No MMU. The following table summarizes the characteristics of the different members in the 68000 family: PROC CACHE RADD MADD MUL INDEX BRA UACC HWFP 68000 0/0 6 18 40 18 10/6 no no 68020 256/0 2 6 28 9 6/4 yes 68881/2 68030 256/256 2 5 28 8 6/4 yes 68881/2 CPU32 0/0 2 9 26 12 8/4 no no 68040 4K/4K 1 1 16 3 2/3 yes yes 68060 8K/8K 1 1 2 1 0/1 yes yes RAdd: Register to register 32 bit add (add.l d0,d1). MAdd: Absolute long address to register add (add.l _mem,d1). Mul: 16x16 multiplication (max. time) (mulu.w d0,d1). Index: Indexed addressing mode (move.l 2(a0,d0),d1). Bra: Byte conditional branch taken/not taken (bne.b label). UAcc: Unaligned access allowed (move.l 0xffff0001,d1). HWFP: Hardware floating point support. When optimizing for the 68K family, we divide the members into the following groups: 68000 Optimize for the following processors: MC68000/MC68008/MC68010/MC68EC000. 68020 Optimize for the following processors: MC68020/MC68EC020/MC68030/MC68EC030/CPU32/CPU32+ 68040 Optimize for the following processors: MC68040/MC68EC040 68060 Optimize for the following processors: MC68060/MC68EC060 680xx Optimize so the code will execute reasonably on any 68K processor. ASSEMBLY LEVEL OPTIMIZATIONS Since optimizations for one 68K processor can make another one execute slower, it is fairly important to know the individual instruction timings for each member. Here are some examples of different ways of doing operations and the preferred method for each 68K processor: Operations with long immediate values between -128 and 127: A: add.l #20,d1 B: moveq.l #20,d0 add.l d0,d1 68040/xx: A 68000/20/60: B Byte/word operations that could be replaced with long operations: A: add.w d0,d1 B: add.l.d0,d2 68000/xx: A 68020/40: Any 68060: B Keep memory operands in registers: A: add.l _var,d1 B: move.l _var,d0 add.l _var,d2 add.l d0,d1 add.l d0,d2 68040: A (as long as total # of instructions are less) 68000/20/60/xx: B Reschedule operations using address registers: A: add.l d0,d1 B: move.l (a1),a0 move.l (a1),a0 add.l d0,d1 move.l (a0),d2 move.l (a0),d2 68000/20: Any 68040/60/xx: B Replace constant multiplications with adds/subs/shifts: A: mulu.w #254,d1 B: move.l d1,d0 lsl.l #8,d1 lsl.l #1,d0 sub.l d0,d1 68060: A 68000/20/40/xx: B Operations using indexing modes: A: add.l (a0,d7),d1 B: add.l d7,a0 add.l (a0,d7),d2 add.l (a0),d1 add.l (a0),d2 68000/60: A 68020/40/xx: B Saving/restoring registers: A: movem.l d4-d7,-(a7) B: move.l d7,-(a7) move.l d6,-(a7) move.l d5,-(a7) move.l d4,-(a7) 68000/20/60/xx: A 68040: B (if time critical) Summary of characteristics for each processor: 68000: Lacks 68020 instruction extensions: - No extb.l instruction - No 32 bit multiply - No scaled indexing mode - No 32 bit PC relative branches Use short instructions Keep values in registers No scheduling necessary Code optimized for 68020 or 68060 runs great 68020: Use short instructions Keep values in registers Almost no scheduling necessary Code optimized for the 68060 runs great 68040: Use as few instructions as possible (even if they are longer) Values can be kept in memory Avoid pipe-line stalls for some effective addresses Avoid subtracts to address registers 68060: Use short instructions Keep values in registers Schedule instructions for superscalar execution Inline short functions 680xx: If the code is to be executed on a 68000 processor, the 68000 instruction subset must be used. Avoid bitfield instructions. Align all data. Schedule the instructions for an 68060. Avoid complex addressing modes (memory indirect). Compiler level optimizations When programming in a high level language, the speed of a given algorithm depends on the hardware (processor, frequency, memory system, etc.), the compiler (capability of doing optimizations and generating the fastest instruction sequence), and on the programming style. The better the compiler is, the less the programming style influences the resulting efficiency. Rule: Let the compiler do the optimizations By using an optimizing compiler the programs will stay much cleaner and will be easier to maintain if hand-optimizations is avoided. The following example is a simple array copy in C: #define SIZE 1000 int src[SIZE], dst[SIZE]; copy() { int i; for(i = 0; i < SIZE; i++) { dst[i] = src[i]; } } A compiler without sophisticated optimizations might produce the following code: _copy: move.l d7,-(a7) clr.l d7 .Loop: move.l #_src,a0 move.l #_dst,a1 move.l (a0,d7*4),(a1,d7*4) addq.l #1,d7 cmp.l #100,d7 bne.b .Loop move.l (a7)+,d7 rts ; 1409 cycles (040) An optimizing C compiler would generate something similar to the following: _copy: moveq.l #25,d0 move.l #_src,a0 move.l #_dst,a1 .Loop: move.l (a0)+,(a1)+ ; Loop is unrolled move.l (a0)+,(a1)+ ; 4 times move.l (a0)+,(a1)+ move.l (a0)+,(a1)+ subq.l #1,d0 bne.b .Loop rts ; 284 cycles (040) By "hand-optimizing" the code, the programmer can improve the speed when using a non-optimizing compiler: #define SIZE 1000 int src[SIZE], dst[SIZE]; copy() { int i, *s, *d; for(i = SIZE, s = src, d = dst; i != 0; i--) { *d++ = *s++; } } The non-optimizing compiler will generate the following code: _copy: movem.l d7/a4/a5,-(a7) moveq.l #100,d7 move.l #_src,a4 move.l #_dst,a5 .Loop: move.l (a4)+,(a5)+ subq.l #1,d7 bne.b .Loop movem.l (a7)+,d7/a4/a5 rts ; 518 cycles (040) To get the same execution speed as the optimizing compiler, the hand-optimized code must also be unrolled, which is easy as long as SIZE is a multiple of four, but harder in the general case. Also the hand optimized code will run slower on many other architectures. Another example: i += p[1]; j += p[1]; A good 68K compiler will translate this code differently depending on which 68K processor it generates code to: 68000/20/60: move.l 4(a5),d0 ; remember p[1] in d0 add.l d0,d7 ; i += p[1] add.l d0,d6 ; j += p[1] 68040: add.l.4(a5),d7 ; i += p[1] add.l.4(a5),d6 ; j += p[1] In the above case, trying to optimize the code in C will slow down the 68040: i += (tmp = p[1]); j += tmp; As an example on how important the choice of compiler is, Motorola improved their SPECint89 numbers for the 68040 processor by 29% just by switching to a compiler that took advantage of the strengths of the 68040 processor. Rule: Avoid taking the address of variables In many cases it is impossible for the compiler to keep variables in fast registers if their address is taken: int func1(int var) { far_away1(&var); far_away2(var); return var; } If far_away1() modifies var, but does not save the address, the following code is more efficient: int func1(int var) { var = new_far_away1(var); far_away2(var); return var; } Rule: Use the static keyword By using the static keyword for local functions and file local variables, the compiler can do a much better job optimizing. For example: static int si; void func2(int *p) { int j, i; i = si; *p = 0; j = si; ... } In this simple example the compiler knows that the *p = 0 statement does not kill the variable si since it is static and its address is not taken. USER LEVEL OPTIMIZATIONS The most dramatic speed improvements of a program typically comes from finding the bottle-necks and changing the algorithm to something more efficient. For example by setting flags instead of re-calculating some expressions or replacing a linear search with a hash table. A profiler is a program that helps the programmer to find the code that is executed the most. Typically the profiling is done in three steps: - Compiling the code with profiling enabled. - Executing the program with typical input data. - Displaying the profiling data, for example how many times each source line is executed. After profiling, the programmer can examine the most frequently executed source lines and re-write the bottle-necks. Modern compilers take profiling one step further. By feeding the profiling information back into the compiler, the optimizer can use actual execution data to fine tune many optimizations, including the following: Loop unrolling Only loops in time critical sections are unrolled Function inlining Only functions in time critical sections are inlined Partial inlining Functions that often return after the first conditional test can be inlined up to that test Basic block reorganization Most executed block in if-then-else statement is placed at the end Register allocation Variables actually used the most are put in registers CONCLUSION The characteristics of the different members of the 68K family varies a great deal, which makes it impossible to write code that will run as fast as possible on all processors. When writing programs in a high level language, the compiler can do the processor specific optimizations, but in assembly language a decision has to be made on which processor to optimize for.