[Mono-gc-list] Fast allocation vs lightweight collection

Torstensson, Patrik patrik.torstensson@intel.com
Tue, 26 Aug 2003 08:59:02 +0100


Hello guys,=20

> It's pretty easy to avoid locking using the atomic test/set/exchange
> operations of modern processors.
> For example, x86 has "cmpxchg", the psudocode of this operation is:
>=20
>   [ http://www.geocities.com/bkartheek/files/chap6.htm ]
>=20
>         cmpxchg         operand1, operand2  [ ax/al/eax =3D compare_to =
]
>=20
>         if ({al/ax/eax} =3D operand1) then
>=20
>                 zero :=3D 1               ;Set the zero flag
>                 operand1 :=3D operand2
>=20
>         else
>=20
>                 zero :=3D 0               ;Clear the zero flag
>                 {al/ax/eax} :=3D operand1
>=20
>         endif
>=20
> Using this, you can do:
>=20
> try_again:
>   reg =3D mem(count)
>   compare_to =3D reg
>   reg =3D reg + 1      // or reg =3D reg -1
>   if (! cmpxchg mem(count),reg [ compare_to ] ) {
>     yeild(); // optional
>     goto try_again; =20
>   }=20
>=20
> When two threads collide, this looks like:
>  =20
> thread1 		thread2 		mem(count)
> reg =3D mem(count)	reg =3D mem(count) 	1
> reg =3D reg+1 		reg =3D reg+1 		1
> cmpxchg suceeds 	cmpxchg fails		2
> ...			reg =3D mem(count)	2
> 			reg =3D reg+1		2
> 			cmpxchg suceeds		3
> 			...
>=20
> No locking required.

Not really true. The cmpxchg (and other) will lock the CPU cache (or
cause a cache invalid signal to happen) on a x86. This causes serious
performance problems (it's better than a kernel lock but still..), the
CPU needs to refetch data into the cache again.

The biggest issue with the cache locking is that the performance hit
increases with the number of CPU:s. There is schemas to minimize the
locking for example building locks by using read and write operations
only (they are atomic on x86) and use a kernel lock when you need to
wait (spinning first).

Still, Paolo is right, reference counting is very expensive for
Multithreaded programs even if you use x86 specific helper functions.=20

(and another thing, the yield instruction is very important in that loop
if you want performance on a SMP box (use the PAUSE instruciton on the
new CPU:s))

Cheers,
 Patrik Torstensson
=20