
Re: [MiNT] C bit



> > Compiler output (-m68030 -O3):
> > 
> > _calc_load_average:
> > 	...
> > 	move.l	_uptime,a0 
> > 	lea	$0001(a0),a1
> > 	move.l	a1,_uptime
> > 	add.w	#$00c8,_uptimetick
> > 	move.l	a0,d1 /* notice: this is the UNUPDATED value of _uptime!!!! */
> > 	addq.l	#$01,d1 /* how sucky: now they update it again, while this value is already in a1!!! */
> > 	move.l	d1,d2
> > 	mulu.l	#$cccccccd,d0:d2
> > 	lsr.l	#$02,d0
> > 	move.l	d0,a1
> > 	lea	(a1,d0.l*4),a1
> > 	move.l	a1,d0
> > 	cmp.l	d1,d0
> > 	bne.s	L459
> > 	...
> > 
> > Comments?
> > 
> 	Yeah:
...
> 	NB!! To make things worse there is redundancy. Notice how they first
> add 1 to uptime and then do the same thing all over again.

The optimization of the modulo is probably done separately from the
increment. Anyway, this inefficiency only costs two cycles.

> 	It seems this code is quite long and the GNU dudes obviously did
> their best to get rid of the costly divu.l dn:dn instruction. Very smart

I'd guess that this optimization is done in the hardware-independent part
of GCC, and that code may assume a larger difference between MUL and DIV
timings than the '030 has.
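
For reference, the constant isn't arbitrary: $cccccccd is the smallest
integer >= 2^34/5, so the MULU.L/LSR.L pair is an unsigned divide by 5 in
disguise, and the LEA with the *4 scale rebuilds q*5 for the modulo test.
A rough C sketch of what the generated code computes (the function name
is invented here):

	#include <stdint.h>
	#include <assert.h>

	/* 0xCCCCCCCD == ceil(2^34 / 5), so for any 32-bit unsigned x:
	 *   x / 5 == (uint32_t)(((uint64_t)x * 0xCCCCCCCD) >> 34)
	 * MULU.L forms the 64-bit product, LSR.L #2 of the high half is
	 * the remaining shift, and LEA (a1,d0.l*4),a1 forms q + 4*q = 5*q
	 * for the final compare against the original value. */
	static int multiple_of_5(uint32_t x)	/* hypothetical helper */
	{
		uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDULL) >> 34);
		return q * 5 == x;		/* true iff x % 5 == 0 */
	}

	int main(void)
	{
		uint32_t x;
		for (x = 0; x < 100000; x++)	/* quick sanity check */
			assert(multiple_of_5(x) == (x % 5 == 0));
		return 0;
	}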

> though I ask myself if they really achieved something here. This is a
> typical generic kind of optimisation I expect from a compiler and can hardly
> be called efficient. Infact the code grows quite large.

Whether it's efficient or not depends on how long the code actually takes
to execute, of course. Code size isn't very often a problem nowadays, but
it would of course be bad if it caused the cache to be poorly utilized.

> 	You should try higher modulo values instead of 5 and check if the
> compiler still tries to "optimise".

Could be interesting.
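
A quick way to check would be to feed GCC a couple of one-line functions
and look at the -S output (the larger divisor below is arbitrary):

	/* mod.c -- try e.g. "gcc -m68030 -O3 -S mod.c" and inspect mod.s
	 * to see whether the divide is still replaced by a multiply. */
	unsigned int mod5(unsigned int x)     { return x % 5; }
	unsigned int mod30011(unsigned int x) { return x % 30011; }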

> 	The compiler generates:
> 	* 14 instructions covering 50 bytes, which include
> 	*  3 full 32bit address fetches
> 	*  3 memory accesses
> 	*  1 unsigned 32*32 multiply (Quite costly.. And since $cccccccd has
> a lot of bits set, this is prolly 60 cycles)

MULU.L never takes more than 44 cycles on the '030.

> 	I'd do it like this (cycles for 030): (Though I am uncertain if 
> 
> 		lea	_uptime(pc),a0

What makes you think you can use PC-relative addressing here?
C compilers put variables in the BSS or DATA sections, which are likely
to be far away. Anyway, the compiler can't know whether the variable ends
up nearby, since it's up to the linker to place these things.

> 		addq.l	#1,(a0)
> 		move.l	(a0),d0

	move.l	(a0),d0
	addq.l	#1,d0
	move.l	d0,(a0)

Would most likely be better. It only requires two memory accesses
compared to your three (your ADD needs to read memory as well as write
it). Of course it takes another instruction, which may hurt somewhat,
but... Note that this was what GCC did, although with other
registers/instructions. Since memory is normally cached on an '030, it
shouldn't matter all that much, though.

> 		addi.w	#200,_uptimetick

The same as GCC.

> 		divul.l	#5,d1:d0				; +/- 40 cycles ??

DIVU.L can take as much as 78...
It is data dependent, though, and I don't know how far it might drop
with a simple enough number.
(The processor's idea of 'simple' may not be very obvious.)

> 		move.l	d1,d1					; 2 cycles (faster than tst.l I think)

TST.L is also 2 cycles, but for some reason those cycles aren't in a
'head', unlike the MOVE.L's. Since DIVU.L doesn't have any 'tail', it
doesn't matter here anyway.

> 	This is:
> 	* 7 instructions covering 24 bytes, which include:

Size can indeed sometimes be important.

> 	* 1 16bit address fetch

Which the compiler can't do.

> 	* 1 32bit address fetch

It would have been nice if the compiler had used LEA for the address, but
that would have required another register (or not using LEA for the +1).

> 	* 3 mem accesses

No, you do four.

> 	* 1 unsigned 32:32 divide (Costly, maybe as much as 60. I'd like to
> run a test of this! )

Could be as much as 78. Testing is probably a good idea.
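
A crude C-level test could look something like the sketch below. The
volatile divisor keeps GCC from applying the reciprocal trick in the
second loop, so it has to emit a real divide; loop overhead and caching
will muddy the numbers, so the generated assembly should be checked to
see what each loop really turned into:

	#include <stdio.h>
	#include <time.h>

	#define N 1000000UL

	volatile unsigned long sink;		/* keep results "used" */
	volatile unsigned long divisor = 5;	/* forces a real DIVU.L */

	int main(void)
	{
		unsigned long i;
		clock_t t0, t1, t2;

		t0 = clock();
		for (i = 1; i <= N; i++)
			sink = i % 5;		/* constant: multiply path */
		t1 = clock();
		for (i = 1; i <= N; i++)
			sink = i % divisor;	/* unknown divisor: divide */
		t2 = clock();

		printf("constant %%5 : %ld ticks\n", (long)(t1 - t0));
		printf("divu.l path : %ld ticks\n", (long)(t2 - t1));
		return 0;
	}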

> 	Optimising a divide with a multiply might be smart, but not in this
> case. Since $cccccccd has alot of set bits, the ALU has alot of work to do.

This depends on the implementation of the multiply. On most modern
processors, I'd assume all multiplies take the same amount of time
(perhaps 2-5 cycles). The '030 may actually use the normal ALU for the
job, but nothing says that it has to iterate over the $cccccccd rather
than the _uptime value. Anyway, without information on what the hardware
actually does, you'd need to test it to find out.
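
To illustrate the point, a toy model of a shift-and-add multiplier shows
how much the choice of scanned operand could matter (this is bookkeeping
only, not a claim about the '030's real algorithm; the example uptime
value is made up):

	#include <stdio.h>
	#include <stdint.h>

	/* Toy sequential multiplier: one step per bit of the scanned
	 * operand, one extra step for each set bit (a partial-product
	 * add).  Purely illustrative. */
	static unsigned mul_steps(uint32_t scanned, uint32_t other,
				  uint64_t *product)
	{
		unsigned steps = 0;
		uint64_t acc = 0;
		int bit;

		for (bit = 0; bit < 32; bit++) {
			steps++;			/* shift */
			if (scanned & ((uint32_t)1 << bit)) {
				acc += (uint64_t)other << bit;
				steps++;		/* add */
			}
		}
		*product = acc;
		return steps;
	}

	int main(void)
	{
		uint64_t p;

		/* $cccccccd has 17 set bits; 12345 has only 6, so the
		 * estimate changes a lot depending on which operand is
		 * scanned, even though the product is the same. */
		printf("scan $cccccccd: %u steps\n",
		       mul_steps(0xCCCCCCCDUL, 12345UL, &p));
		printf("scan uptime:    %u steps\n",
		       mul_steps(12345UL, 0xCCCCCCCDUL, &p));
		return 0;
	}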

> The good thing about the divide is, that it is relatively cheap when the
> _uptime value is low.

But the compiler can't know anything about that.

> 	Concluding: this optimisation makes:
> 	1) no difference in execution time at all, maybe even for the worse!

You don't know that.
I agree that it's unlikely to give substantial speedup on an '030, but
it definitely could be faster.
If it's much the same on the '030 and significantly better on a 68000,
68040 or 68060, it's still definitely worth doing, though.

> 	2) the code grows bigger.

Rather irrelevant, IMO.

> 	3) the generated code less readable (the life of a ripper is just

You get that a lot with optimizing compilers.

-- 
  Chalmers University   | Why are these |  e-mail:   rand@cd.chalmers.se
     of Technology      |  .signatures  |            johan@rand.thn.htu.se
                        | so hard to do |  WWW/ftp:  rand.thn.htu.se
   Gothenburg, Sweden   |     well?     |            (MGIFv5, QLem, BAD MOOD)