[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [MiNT] Re: GCC

To: mint@fishpool.com
Subject: Re: [MiNT] Re: GCC
From: Johan Klockars <rand@cd.chalmers.se>
Date: Tue, 23 Feb 1999 17:08:19 +0100 (MET)
In-reply-to: <001101be5f45$62b90f70$2e00a8c0@julian> from "Julian Reschke" at Feb 23, 99 04:58:42 pm
Sender: owner-mint@fishpool.com
> > > Interesting. So far I haven't come across a PureC problem which made the
> > > compiler itself non-ANSI...
> >
> > Well, this problem (whatever it was exactly) definitely was a violation of
> > the standard.
> 
> Well, I would like to see that... :-) I'll try to test it this evening.

As I said, though, it might not have been the way I described it.
It might have had to do with enumerations, for example.

> > Also, GCC could inline them for you...
> 
> But that would make the code larger.

Yes, of course. But if you need the best possible speed for something,
it probably isn't all that much code, anyway.

> Fact is, back when it was possible to compile MiNT with PureC, the resulting
> code was both considerably smaller and slightly faster... (just to make this
> thread a bit MiNT related :-).

;-)

> Could you post the GCC code? I'll check with PureC.

I can do better than that. I just found one of my old Usenet articles on
the subject (couldn't locate it on dejanews for some reason). The following
includes the code generated by PureC (or perhaps TurboC, you could check
yourself if there's a difference), Lattice C and GCC:


In another thread the question of how well different C compilers
on the Atari optimized came up.
It seems to be a widely spread misconception that TurboC/PureC
produces good code, while Lattice C produces bad code.
Most people seem to agree that GCC is the best, though.

To shed some light on the issue, I threw together a small program
and compiled it under PureC, Lattice C and GCC. In each case I
tried to get the best code possible, but the program was the same.
The GCC code is the result of asking it to do 68030 code, but there
were no major differences between '000 and '030 code from any of
the three compilers on this test program.

My results and some explanations can be found below and I'd very
much like to hear what other compilers, such as Sozobon C, Laser C,
Mark Williams C, Megamax C, Prosepero C etc do with it.
Remember to use whatever options give the best possible code.


This is the C program that I used for this test.
It's not supposed to do anything useful.

#define ROWS 25
#define COLS 25

void myfunc(void)
{
	long i, j, k;
	long array[ROWS][COLS];

	for(i = 0; i < ROWS * COLS; i++)
		*((long *)array + i) = 0;
   
	k = COLS * ROWS;
	for(i = 0; i < ROWS; i++)
		for(j = 0; j < COLS; j++)
			array[i][j] += i * k;
}


I omit the function startup/end code below, but that was not more
than two or three instructions for any of the compilers.

GCC does not use the standard Motorola instruction format, but
with a little thought there should be no problem understanding it.


OK, here we go:

PureC, the array clear loop
T000008:   MOVEQ.L   #$00,D3
T00000A:   BRA.B     T000016
T00000C:   MOVE.W    D3,D0
T00000E:   LSL.W     #2,D0
T000010:   CLR.L     (A7,D0.W)
T000014:   ADDQ.L    #1,D3
T000016:   CMP.L     #$00000271,D3
T00001C:   BLT.B     T00000C

This looks exactly like the C code, doesn't it?
Set the start value, jump and check the condition by comparing against
the number at the bottom and then loop, adding one to 'i' every time.
A7 points to the start of the array, so just index with a properly
scaled 'i' and clear.


GCC, the array clear loop
	clrl d0
	movel #624,d1
	lea sp@(2516),a0
L5:
	clrl a0@(-2500)
	addqw #4,a0
	addql #1,d0
	cmpl d0,d1
	jge L5

GCC puts ROWS*COLS-1 in a register at the beginning and then uses
that for the compare after every loop.
There is no check for the first iteration since the compiler has
figured out that the loop will run more than once.
As you can see the variable 'i' is not used to update or index the
pointer. Instead the number 4 is added directly.


Lattice C, the array clear loop
       | 0008  42ae ffec                      clr.l     $ffec(a6)
       | 000c  303c 09c3                      move.w    #$09c3,d0
       | 0010  7200                           moveq     #$00,d1
       | 0012  41ee f628                      lea       $f628(a6),a0
       | 0016  10c1                           move.b    d1,(a0)+
       | 0018  51c8 fffc                      dbf       d0,$0016

Lattice C has discovered that the loop is a sequential access of an
area of memory and has therefore decided to use (a0)+ addressing.
But it has also decided to run the loop count backwards to be able to
use a dbra!
This really is quite smart, but since this is Lattice it couldn't stop
there of course. The Lattice optimizer often decides not to optimize very
well, for reasons like finding a 16 bit integer somewhere etc.

The array was declared as containing long integers.
That means that the compiler will have to allocate it on an even address.
Lattice C has decided, though, that it can't address the elements as
long integers, but only as bytes, thus the 'move.b' instead of 'move.l'.
The compiler does know that it has to quadruple the loop count to be able
to clear the same amount of memory, though...


PureC, the double loop  (I will comment this directly)
T00001E:   MOVE.L    #$00000271,D4     k = COLS * ROWS

T000024:   MOVEQ.L   #$00,D3           i = 0
T000026:   BRA.B     T00004E           Jump to the (i < ROWS) test

T000028:   MOVEQ.L   #$00,D5           j = 0
T00002A:   BRA.B     T000046           Jump to the (j < COLS) test

T00002C:   MOVE.L    D3,D0             Calculate i * k
T00002E:   MOVE.L    D4,D1             PureC would have used a 
T000030:   JSR       _lmul             muls if the values were of type int

T000036:   MOVEQ.L   #$64,D1           Calculate i * row width (COLS*4)
T000038:   MULS      D3,D1             for the array access

T00003A:   MOVE.W    D5,D2             Add on j * 4 for correct element
T00003C:   LSL.W     #2,D2
T00003E:   ADD.W     D2,D1

T000040:   ADD.L     D0,(A7,D1.W)      Add i * k to the array element

T000044:   ADDQ.L    #1,D5             Increase j and test against COLS
T000046:   MOVEQ.L   #$19,D0
T000048:   CMP.L     D5,D0
T00004A:   BGT.B     T00002C

T00004C:   ADDQ.L    #1,D3             Increase i and test against ROWS
T00004E:   MOVEQ.L   #$19,D0
T000050:   CMP.L     D3,D0
T000052:   BGT.B     T000028

As you can see, again PureC did a direct translation of the C code.
Every loop does one ordinary 'muls' and one call to '_lmul', which in this
case will do one more 'muls'.
You think this code was good? Read on!


GCC, the double loop
	movel #625,d3                  This is ROWS*COLS
	movel #-2500,d2                -(size of the array)
	clrl d1                        What might this be?
	lea sp@(2516),a3               Point to the end of the array
	subl a2,a2                     What might this be?

L13:
	lea a3@(0,d2:l),a0
	lea a0@(96),a1

On the first loop, a0 will point to the first element in the array
and a1 will point to the last element on the first row.

L12:
	movel d1,d0
	addl a0@,d0
	movel d0,a0@+
	cmpl a0,a1
	jge L12

Uh, this will only add the value in d1 (zero at the start) to the all
the elements on the first row. Can this be right?

	addl d3,d1
	addw #100,a3
	addw #100,a2
	cmpl #2400,a2
	jle L13

What? It's finished already? What's happening here?

Well, clearly a3 is updated so that after executing the code at L13,
a0 will point to the next row in the array.
Since a2 is used for the final comparison, it must be a strange version
of the variable 'i'.
D1 was the value we added to every element on the first row and now that
has been increased by adding d3 (=ROWS*COLS) to it...
...but that means that the i*k operation has been reduced to a simple
add for every new row!

It should be obvious that the code produced by GCC will run _much_
faster than what we got from PureC.
I've not timed it, but the difference seems to be something like five
to ten times!


Lattice C, the double loop
       | 001c  4bee f628                      lea       $f628(a6),a5
       | 0020  7e00                           moveq     #$00,d7
       | 0022  6018                           bra.b     $003c

This is much the same that GCC did, but Lattice C writes a bit smarter
assembly. It doesn't realize that the loop will run more than once,
though, so it jumps to the test.

       | 0024  284d                           movea.l   a5,a4
       | 0026  47ec 0064                      lea       $0064(a4),a3
       | 002a  6002                           bra.b     $002e

Like above, this is better code than GCC did, but again there's an
extra jump to the test.

       | 002c  df9c                           add.l     d7,(a4)+
       | 002e  b9cb                           cmpa.l    a3,a4
       | 0030  65fa                           bcs.b     $002c

Wow! Only _three_ instructions in the inner loop!
Again, this is the same thing that GCC did, only Lattice C has realized
that there is an instruction that does what GCC spends three on.

       | 0032  dafc 0064                      adda.w    #$0064,a5
       | 0036  0687 0000 0271                 addi.l    #$00000271,d7
       | 003c  41ee ffec                      lea       $ffec(a6),a0
       | 0040  bbc8                           cmpa.l    a0,a5
       | 0042  65e0                           bcs.b     $0024

Lattice C didn't put ROWS*COLS in a register like GCC did and the 'lea' is
not needed inside the loop since a0 is not changed anywhere. Still those
two instructions are not executed very often so they don't matter much.


Conclusions
-----------
GCC and Lattice optimized very well, but the code generation by GCC is not
as good as Lattice's. Both compilers do some unnecessary things and the
Lattice optimizer fools itself a bit on the first loop.

PureC shows itself for what it is, a compiler that does as well as it can
with next to no knowledge of optimization.


PureC keeps everything in registers all the time, but that doesn't help
much when it generates two unnecessary multiplies among lots of totally
unoptimized code.

Lattice C could definitely keep more things in registers. In the double
loop it only uses a single data register.

GCC seems not to know the M68k as well as it should, since some things
it did could easily be done with fewer instructions.


-- 
  Chalmers University   | Why are these |  e-mail:   rand@cd.chalmers.se
     of Technology      |  .signatures  |            johan@rand.thn.htu.se
                        | so hard to do |  WWW/ftp:  rand.thn.htu.se
   Gothenburg, Sweden   |     well?     |            (MGIFv5, QLem, BAD MOOD)
Follow-Ups:
- Re: [MiNT] Re: GCC
  - From: Frank Naumann <fnaumann@prinz-atm.CS.Uni-Magdeburg.De>
References:
- RE: [MiNT] Re: GCC
  - From: "Julian Reschke" <reschke@medicaldataservice.de>
Prev by Date: RE: [MiNT] Re: GCC
Next by Date: RE: [MiNT] Re: GCC
Previous by thread: RE: [MiNT] Re: GCC
Next by thread: Re: [MiNT] Re: GCC
Index(es):
- Date
- Thread