[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [MiNT] Re: GCC
> > > Interesting. So far I haven't come across a PureC problem which made the
> > > compiler itself non-ANSI...
> >
> > Well, this problem (whatever it was exactly) definitely was a violation of
> > the standard.
>
> Well, I would like to see that... :-) I'll try to test it this evening.
As I said, though, it might not have been the way I described it.
It might have had to do with enumerations, for example.
> > Also, GCC could inline them for you...
>
> But that would make the code larger.
Yes, of course. But if you need the best possible speed for something,
it probably isn't all that much code, anyway.
> Fact is, back when it was possible to compile MiNT with PureC, the resulting
> code was both considerably smaller and slightly faster... (just to make this
> thread a bit MiNT related :-).
;-)
> Could you post the GCC code? I'll check with PureC.
I can do better than that. I just found one of my old Usenet articles on
the subject (couldn't locate it on dejanews for some reason). The following
includes the code generated by PureC (or perhaps TurboC, you could check
yourself if there's a difference), Lattice C and GCC:
In another thread the question of how well different C compilers
on the Atari optimized came up.
It seems to be a widely spread misconception that TurboC/PureC
produces good code, while Lattice C produces bad code.
Most people seem to agree that GCC is the best, though.
To shed some light on the issue, I threw together a small program
and compiled it under PureC, Lattice C and GCC. In each case I
tried to get the best code possible, but the program was the same.
The GCC code is the result of asking it to do 68030 code, but there
were no major differences between '000 and '030 code from any of
the three compilers on this test program.
My results and some explanations can be found below and I'd very
much like to hear what other compilers, such as Sozobon C, Laser C,
Mark Williams C, Megamax C, Prosepero C etc do with it.
Remember to use whatever options give the best possible code.
This is the C program that I used for this test.
It's not supposed to do anything useful.
#define ROWS 25
#define COLS 25
void myfunc(void)
{
long i, j, k;
long array[ROWS][COLS];
for(i = 0; i < ROWS * COLS; i++)
*((long *)array + i) = 0;
k = COLS * ROWS;
for(i = 0; i < ROWS; i++)
for(j = 0; j < COLS; j++)
array[i][j] += i * k;
}
I omit the function startup/end code below, but that was not more
than two or three instructions for any of the compilers.
GCC does not use the standard Motorola instruction format, but
with a little thought there should be no problem understanding it.
OK, here we go:
PureC, the array clear loop
T000008: MOVEQ.L #$00,D3
T00000A: BRA.B T000016
T00000C: MOVE.W D3,D0
T00000E: LSL.W #2,D0
T000010: CLR.L (A7,D0.W)
T000014: ADDQ.L #1,D3
T000016: CMP.L #$00000271,D3
T00001C: BLT.B T00000C
This looks exactly like the C code, doesn't it?
Set the start value, jump and check the condition by comparing against
the number at the bottom and then loop, adding one to 'i' every time.
A7 points to the start of the array, so just index with a properly
scaled 'i' and clear.
GCC, the array clear loop
clrl d0
movel #624,d1
lea sp@(2516),a0
L5:
clrl a0@(-2500)
addqw #4,a0
addql #1,d0
cmpl d0,d1
jge L5
GCC puts ROWS*COLS-1 in a register at the beginning and then uses
that for the compare after every loop.
There is no check for the first iteration since the compiler has
figured out that the loop will run more than once.
As you can see the variable 'i' is not used to update or index the
pointer. Instead the number 4 is added directly.
Lattice C, the array clear loop
| 0008 42ae ffec clr.l $ffec(a6)
| 000c 303c 09c3 move.w #$09c3,d0
| 0010 7200 moveq #$00,d1
| 0012 41ee f628 lea $f628(a6),a0
| 0016 10c1 move.b d1,(a0)+
| 0018 51c8 fffc dbf d0,$0016
Lattice C has discovered that the loop is a sequential access of an
area of memory and has therefore decided to use (a0)+ addressing.
But it has also decided to run the loop count backwards to be able to
use a dbra!
This really is quite smart, but since this is Lattice it couldn't stop
there of course. The Lattice optimizer often decides not to optimize very
well, for reasons like finding a 16 bit integer somewhere etc.
The array was declared as containing long integers.
That means that the compiler will have to allocate it on an even address.
Lattice C has decided, though, that it can't address the elements as
long integers, but only as bytes, thus the 'move.b' instead of 'move.l'.
The compiler does know that it has to quadruple the loop count to be able
to clear the same amount of memory, though...
PureC, the double loop (I will comment this directly)
T00001E: MOVE.L #$00000271,D4 k = COLS * ROWS
T000024: MOVEQ.L #$00,D3 i = 0
T000026: BRA.B T00004E Jump to the (i < ROWS) test
T000028: MOVEQ.L #$00,D5 j = 0
T00002A: BRA.B T000046 Jump to the (j < COLS) test
T00002C: MOVE.L D3,D0 Calculate i * k
T00002E: MOVE.L D4,D1 PureC would have used a
T000030: JSR _lmul muls if the values were of type int
T000036: MOVEQ.L #$64,D1 Calculate i * row width (COLS*4)
T000038: MULS D3,D1 for the array access
T00003A: MOVE.W D5,D2 Add on j * 4 for correct element
T00003C: LSL.W #2,D2
T00003E: ADD.W D2,D1
T000040: ADD.L D0,(A7,D1.W) Add i * k to the array element
T000044: ADDQ.L #1,D5 Increase j and test against COLS
T000046: MOVEQ.L #$19,D0
T000048: CMP.L D5,D0
T00004A: BGT.B T00002C
T00004C: ADDQ.L #1,D3 Increase i and test against ROWS
T00004E: MOVEQ.L #$19,D0
T000050: CMP.L D3,D0
T000052: BGT.B T000028
As you can see, again PureC did a direct translation of the C code.
Every loop does one ordinary 'muls' and one call to '_lmul', which in this
case will do one more 'muls'.
You think this code was good? Read on!
GCC, the double loop
movel #625,d3 This is ROWS*COLS
movel #-2500,d2 -(size of the array)
clrl d1 What might this be?
lea sp@(2516),a3 Point to the end of the array
subl a2,a2 What might this be?
L13:
lea a3@(0,d2:l),a0
lea a0@(96),a1
On the first loop, a0 will point to the first element in the array
and a1 will point to the last element on the first row.
L12:
movel d1,d0
addl a0@,d0
movel d0,a0@+
cmpl a0,a1
jge L12
Uh, this will only add the value in d1 (zero at the start) to the all
the elements on the first row. Can this be right?
addl d3,d1
addw #100,a3
addw #100,a2
cmpl #2400,a2
jle L13
What? It's finished already? What's happening here?
Well, clearly a3 is updated so that after executing the code at L13,
a0 will point to the next row in the array.
Since a2 is used for the final comparison, it must be a strange version
of the variable 'i'.
D1 was the value we added to every element on the first row and now that
has been increased by adding d3 (=ROWS*COLS) to it...
...but that means that the i*k operation has been reduced to a simple
add for every new row!
It should be obvious that the code produced by GCC will run _much_
faster than what we got from PureC.
I've not timed it, but the difference seems to be something like five
to ten times!
Lattice C, the double loop
| 001c 4bee f628 lea $f628(a6),a5
| 0020 7e00 moveq #$00,d7
| 0022 6018 bra.b $003c
This is much the same that GCC did, but Lattice C writes a bit smarter
assembly. It doesn't realize that the loop will run more than once,
though, so it jumps to the test.
| 0024 284d movea.l a5,a4
| 0026 47ec 0064 lea $0064(a4),a3
| 002a 6002 bra.b $002e
Like above, this is better code than GCC did, but again there's an
extra jump to the test.
| 002c df9c add.l d7,(a4)+
| 002e b9cb cmpa.l a3,a4
| 0030 65fa bcs.b $002c
Wow! Only _three_ instructions in the inner loop!
Again, this is the same thing that GCC did, only Lattice C has realized
that there is an instruction that does what GCC spends three on.
| 0032 dafc 0064 adda.w #$0064,a5
| 0036 0687 0000 0271 addi.l #$00000271,d7
| 003c 41ee ffec lea $ffec(a6),a0
| 0040 bbc8 cmpa.l a0,a5
| 0042 65e0 bcs.b $0024
Lattice C didn't put ROWS*COLS in a register like GCC did and the 'lea' is
not needed inside the loop since a0 is not changed anywhere. Still those
two instructions are not executed very often so they don't matter much.
Conclusions
-----------
GCC and Lattice optimized very well, but the code generation by GCC is not
as good as Lattice's. Both compilers do some unnecessary things and the
Lattice optimizer fools itself a bit on the first loop.
PureC shows itself for what it is, a compiler that does as well as it can
with next to no knowledge of optimization.
PureC keeps everything in registers all the time, but that doesn't help
much when it generates two unnecessary multiplies among lots of totally
unoptimized code.
Lattice C could definitely keep more things in registers. In the double
loop it only uses a single data register.
GCC seems not to know the M68k as well as it should, since some things
it did could easily be done with fewer instructions.
--
Chalmers University | Why are these | e-mail: rand@cd.chalmers.se
of Technology | .signatures | johan@rand.thn.htu.se
| so hard to do | WWW/ftp: rand.thn.htu.se
Gothenburg, Sweden | well? | (MGIFv5, QLem, BAD MOOD)