
RE: [MiNT] C bit



	Hi Draco, 

> Compiler output (-m68030 -O3):
> 
> _calc_load_average:
> 	...
> 	move.l	_uptime,a0 
> 	lea	$0001(a0),a1
> 	move.l	a1,_uptime
> 	add.w	#$00c8,_uptimetick
> 	move.l	a0,d1 /* notice: this is the UNUPDATED value of _uptime!!!!
> */
> 	addq.l	#$01,d1 /* how sucky: now they update it again, while this
> value is already in a1!!!*/  
> 	move.l	d1,d2
> 	mulu.l	#$cccccccd,d0-d2
> 	lsr.l	#$02,d0
> 	move.l	d0,a1
> 	lea	(a1,d0.l*4),a1
> 	move.l	a1,d0
> 	cmp.l	d1,d0
> 	bne.s	L459
> 	...
> 
> Comments?
> 
	Yeah:

	The compiler seems to replace:

	"
	if (uptime % 5)
	"

	with:

	"
	/* high 32 bits of uptime*$cccccccd; no chance of errors, but
	   this might be a lot different for modulo values > 5 */
	prod = (ulong) uptime*0.8;
	/* no problem here, basically: (uptime/5) */
	prod = prod>>2;
	/* basically: (uptime - uptime%5) */
	prod += prod<<2;
	/* not equal means the remainder is non-zero (the bne.s above) */
	if (prod != uptime)
	"
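
	To check that the multiply trick really gives the same answer as
the modulo, here is a minimal C sketch. The function names are my own
(hypothetical), and it assumes a compiler with a 64-bit integer type to
stand in for the register pair that mulu.l writes. It is an illustration
of the arithmetic, not what the compiler emits.

```c
#include <assert.h>
#include <stdint.h>

/* High 32 bits of the 64-bit product, i.e. the upper half of the
   register pair that a 32x32->64 multiply produces. */
static uint32_t mulhi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* x / 5 without a divide: $cccccccd is 2^34 / 5 rounded up, so the
   high word shifted right by 2 is the quotient. */
static uint32_t div5_by_multiply(uint32_t x)
{
    return mulhi32(x, 0xCCCCCCCDu) >> 2;
}

/* Non-zero when x % 5 != 0, using the same steps as the listing. */
static int mod5_nonzero(uint32_t x)
{
    uint32_t q = div5_by_multiply(x);
    uint32_t prod = q + (q << 2);   /* 5 * q == x - x % 5 */
    return prod != x;
}

/* Compare against the plain C operators on a few sample values,
   including the top of the 32-bit range. */
static void self_check(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 4u, 5u, 6u, 9u, 10u, 1000003u,
        0x7FFFFFFFu, 0xFFFFFFFBu, 0xFFFFFFFFu
    };
    unsigned i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t x = samples[i];
        assert(div5_by_multiply(x) == x / 5u);
        assert(mod5_nonzero(x) == (x % 5u != 0u));
    }
}
```

	Calling self_check() from a test program exercises both helpers;
the asserts fail if the multiply version ever disagrees with / and %.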

	NB!! To make things worse there is redundancy. Notice how they first
add 1 to uptime and then do the same thing all over again.

	It seems this code is quite long and the GNU dudes obviously did
their best to get rid of the costly divu.l dn:dn instruction. Very smart,
though I ask myself whether they really achieved anything here. This is a
typical generic kind of optimisation I expect from a compiler and can hardly
be called efficient. In fact the code grows quite large.

	You should try higher modulo values instead of 5 and check if the
compiler still tries to "optimise".
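
	A small test file along these lines would do for that experiment.
This is only a sketch: the file and function names are hypothetical, and
what the compiler makes of each function is exactly the thing to find
out. Compiling with the flags from your listing plus -S (for example
"gcc -m68030 -O3 -S modtest.c") leaves the assembly in modtest.s.

```c
#include <stdint.h>

/* One function per divisor, so each shows up under its own label in
   the generated assembly. Not static, so the compiler has to emit
   them even though nothing in this file calls them. */

uint32_t mod5(uint32_t x)    { return x % 5u; }
uint32_t mod7(uint32_t x)    { return x % 7u; }
uint32_t mod100(uint32_t x)  { return x % 100u; }
uint32_t mod1000(uint32_t x) { return x % 1000u; }

/* Power of two, for comparison with the odd divisors above. */
uint32_t mod8(uint32_t x)    { return x % 8u; }

/* Divisor not known at compile time; the caller must pass n != 0. */
uint32_t mod_var(uint32_t x, uint32_t n) { return x % n; }
```

	Comparing mod_var with the constant versions shows which divisors
get the multiply treatment and which get a real divide.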

	The compiler generates:
	* 14 instructions covering 50 bytes, which include
	*  3 full 32bit address fetches
	*  3 memory accesses
	*  1 unsigned 32*32 multiply (quite costly, and since $cccccccd has
a lot of bits set, this is probably 60 cycles)

	I'd do it like this (cycles for 030): (Though I am uncertain if 

		lea	_uptime(pc),a0
		addq.l	#1,(a0)
		move.l	(a0),d0
		addi.w	#200,_uptimetick
		divul.l	#5,d1:d0	; +/- 40 cycles ??
		move.l	d1,d1		; 2 cycles (faster than tst.l I think)
		bne.s	L459		; same as compiler code

	This is:
	* 7 instructions covering 24 bytes, which include:
	* 1 16bit address fetch
	* 1 32bit address fetch
	* 3 mem accesses
	* 1 unsigned 32:32 divide (costly, maybe as much as 60 cycles. I'd
like to run a test of this!)

	Optimising a divide with a multiply might be smart, but not in this
case. Since $cccccccd has a lot of set bits, the ALU has a lot of work to do.
The good thing about the divide is that it is relatively cheap when the
_uptime value is low.

	Concluding: this optimisation:
	1) makes no difference in execution time at all, and may even make
it worse!
	2) makes the code grow bigger.
	3) makes the generated code less readable (the life of a ripper is
just SOOOO hard =))).

	Your bitfucking atari buddy,

	Pieter van der Meer (aka EarX/fun)

	PS I'm sorry about the mails in Word format. I'll try to watch out
in the future. I don't have much of an option at work.