[Mono-devel-list] mono AES performance woes (was: poor PPC JIT output)
allan at counterpop.net
Fri Jul 15 20:42:11 EDT 2005
On Jul 15, 2005, at 3:39 AM, Paolo Molaro wrote:
> On 07/14/05 Allan Hsu wrote:
>> Code generated by the PPC code emitter performs very poorly in
>> comparison to the same code emitted for other platforms (most
>> notably, x86). I had a brief conversation about this with Miguel in
>> #mono today and he suggested that I post some examples.
> I'm sure he meant an actual test case, which you didn't provide.
I apologize for that. I was sharing the information I had already
gathered as part of an investigation into the poor performance of the
OS X port of our product. I was not sure if this sort of data was
useful or if, as seems to be the case, I was doing something wrong. It
looks like the performance problems I was running into are not
specific to PPC, but the lack of JIT optimization (which I've
remedied) made them *very* apparent.
>> Preliminary profiling with Shark (a profiling tool that is part of
>> the Apple CHUD tools) shows some heinously inefficient JIT output on
>> both G4 and G5 machines. Here's some sample Shark analysis on the
>> code emitted by mono 18.104.22.168 from
>> System.Security.Cryptography.RijndaelTransform.ECB(byte, byte)
>> and System.Security.Cryptography.RijndaelTransform.ShiftRow(bool):
> It looks like optimizations are not enabled: are you embedding mono
> in your app?
> You should try adding:
> mono_set_defaults (0, mono_parse_default_optimizations (NULL));
> before the call to mono_jit_init ().
I am indeed using embedded mono, and I was not at all aware that
optimizations were disabled by default. This does not occur in any of
the sample code that I've seen and this is the first I've heard of it.
Is there any reference on what sorts of things you can change using
mono_set_defaults? Following the mono source for references to that
function wasn't particularly enlightening. It would be useful if the
Wiki page on embedding mono mentioned JIT optimization.
I have done some more isolated testing of AES performance after
turning on optimization and it seems that the JIT-emitted PPC code is
roughly on par with x86 mono performance. Here is the code I used for
some simple benchmarking:
Here are some times for 1000 encrypts/decrypts of 32768-byte chunks
on some machines we have here in the office, ordered from slowest to fastest:
57.7 seconds under mono 22.214.171.124, OS X 10.4.2 (1.67 GHz G4 1.2)
55.0 seconds under mono 126.96.36.199, Linux 2.6.9 (1.8 GHz Athlon XP 2500+)
45.8 seconds under mono 188.8.131.52, Linux 2.6.9 (2.2 GHz Athlon 64 3200+)
42.4 seconds under mono 184.108.40.206, OS X 10.4.2 (2.0 GHz G5 3.0)
9.01 seconds under Microsoft .NET 1.1.4322, Windows XP Pro SP2 (2.0 GHz Athlon 64 3200+)
If you look at the benchmark code, it uses RijndaelManaged to do
encrypt/decrypt. This class is supposedly 100% managed code in the
Included in the tarball is some native code that links against
OpenSSL to do the same thing. This is what native performance for the
same sized chunks looks like:
1.67 seconds under OpenSSL 0.9.7a, Linux 2.6.9 (1.8 GHz Athlon XP 2500+)
1.44 seconds under OpenSSL 0.9.7, OS X 10.4.2 (1.67 GHz G4 1.2)
1.05 seconds under OpenSSL 0.9.7, OS X 10.4.2 (2.0 GHz G5 3.0)
0.67 seconds under OpenSSL 0.9.7a, Linux 2.6.9 (2.2 GHz Athlon 64 3200+)
To be fair, the native implementation is able to take advantage of
64-bit processors when available, while all mono builds in the above
benchmarks are 32-bit. The Windows XP machine is the standard 32-bit
install, even though the processor is 64-bit. This is a pretty
informal benchmark, but all I'm interested in showing here is how bad
the AES performance under mono is.
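To put the managed and native timings quoted in this message side by side, here is a rough C sketch that converts a run time into throughput and computes how many times slower one run is than another. The function names are made up for illustration, and it assumes each of the 1000 chunks of 32768 bytes is counted once per run (the message does not say whether an "encrypt/decrypt" counts as one pass or two).

```c
#include <assert.h>
#include <stdio.h>

/* Throughput in MiB/s for `chunks` buffers of `chunk_bytes` bytes each,
 * processed in `seconds`. Assumption: each chunk is counted once. */
double throughput_mib_per_s(long chunks, long chunk_bytes, double seconds)
{
    assert(seconds > 0.0);
    return ((double)chunks * (double)chunk_bytes) / (1024.0 * 1024.0) / seconds;
}

/* How many times slower the managed run is than the native run. */
double slowdown(double managed_seconds, double native_seconds)
{
    assert(native_seconds > 0.0);
    return managed_seconds / native_seconds;
}

/* Print one comparison row for a machine, given its two timings. */
void report(const char *machine, double mono_s, double openssl_s)
{
    printf("%-22s mono %6.2f MiB/s   OpenSSL %6.2f MiB/s   slowdown %5.1fx\n",
           machine,
           throughput_mib_per_s(1000, 32768, mono_s),
           throughput_mib_per_s(1000, 32768, openssl_s),
           slowdown(mono_s, openssl_s));
}
```

Calling, for example, `report("1.67 GHz G4", 57.7, 1.44)` or `report("2.2 GHz Athlon 64", 45.8, 0.67)` with the figures listed above prints one row per machine.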
It was suggested in #mono that I try compiling the mono AES
implementation under VS.NET and run it under the Microsoft VM to
The resulting project is available here:
The same operation benchmarks thusly:
22.76 seconds under Microsoft .NET 1.1.4322, Windows XP Pro SP2 (2.0 GHz Athlon 64 3200+)
The AES code is taken from mono svn, so it may be different from the
code used in the mono 220.127.116.11 benchmarks above.
While switching to the Microsoft VM boosts speed significantly, it
looks like significant gains could be made by optimizing the mono
(some insightful comment would go here if I weren't so tired of
writing this email).
<everything below doesn't matter so much, since it was based on
information gathered from unoptimized JIT output>
>> Information on how to read Shark analysis comes with Shark (available
>> for free from the Apple Developer Connection website).
> A direct pointer to the doc would be useful.
Unfortunately, I can't find a copy of the documentation that's
available online (otherwise, I would have linked it). The closest
thing I can find to online documentation is this document: http://
>> (A summary:
>> numerous and frequent pipeline stalls, unoptimized loops).
> Some of the data looks definitely bogus: it reports a stall even on
> the addi, here:
> 0x2e143c8 lwz r4,32(r1) 3:1 Stall=2
> 0x2e143cc lwz r5,12(r4) 3:1 Stall=2
> 0x2e143d0 cmplwi r5,0x0000 3:1 Stall=2
> 0x2e143d4 blel $+696 <0x2e1468c [8B]> 2:1
> 0.4% 0x2e143d8 addi r4,r4,16 2:1 Stall=1
> How can it stall while adding an immediate value to a register
> that was loaded several instructions before? Anyway, maybe the
> for the output format will shed some light, once provided.
> As for the loop commentary: did you actually test how much you
> gain by aligning loop starts on 32 byte boundaries? It would be
> a huge waste of memory in most cases.
I was not implying that all of the Shark suggestions were useful. I
was simply summarizing the bulk of the suggestions. There are other sorts
of optimizations that Shark often suggests that are absent from the
analysis of JIT code. I agree that loop alignment is probably
wasteful in the majority of cases.
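As a rough illustration of the memory cost being discussed, the following C sketch computes how many padding bytes would be needed to push a loop start up to the next 32-byte boundary. The helper names are hypothetical, and the average assumes 4-byte PPC instructions with loop starts spread evenly over the possible offsets, which real code may not match.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of padding needed before `addr` so that it lands on the next
 * `alignment`-byte boundary. `alignment` must be a power of two. */
uint32_t padding_to_align(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (addr & (alignment - 1))) & (alignment - 1);
}

/* Average padding per loop start, assuming instructions are 4 bytes wide
 * and every 4-byte offset within an aligned block is equally likely. */
double average_padding(uint32_t alignment)
{
    uint32_t total = 0, slots = 0;
    for (uint32_t off = 0; off < alignment; off += 4) {
        total += padding_to_align(off, alignment);
        slots++;
    }
    return (double)total / slots;
}
```

Under those assumptions, aligning to 32 bytes costs 14 bytes per loop on average, which is three to four instructions' worth of padding for every loop in the JIT output. An address such as 0x2e143c8 from the quoted listing would need 24 bytes.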
As for the stall statistics, you have misread them. Each line that
says "Stall=N" is saying that the instruction latency of the marked
instruction will cause a subsequent dependent instruction to stall,
not that the marked instruction itself will stall. N is the maximum
number of stall cycles for the nearest dependent instruction. The
documentation claims that the register analysis algorithm they use is
"very conservative" and the actual stall cycles observed may be higher.
Allan Hsu <allan at counterpop dot net>
1E64 E20F 34D9 CBA7 1300 1457 AC37 CBBB 0E92 C779