[Mono-dev] difference in performance between mono OSX and linux?
behrends at gmail.com
Sun Jan 22 11:39:00 UTC 2012
On 21/01/2012 19:28, Jonathan Shore wrote:
> So I am wondering whether there are differences in implementation
> between mono on these platforms that could account for a significant
> performance difference?
First of all, since your code appears to be multi-threaded, is your code
using thread-static variables extensively (including as part of a
library)? The Darwin ABI does not natively support thread-local
storage, so Apple only supports it through pthread_getspecific() [1,2].
This makes thread-static variables comparatively slow in Mono 2.10.
This is somewhat fixed in the current github master (and will presumably
also be fixed in 2.12). The new code attempts to disassemble
pthread_getspecific() to find the gs register offset that the OS uses
and then uses that as a basis for generating thread-local code. The
performance difference is pretty dramatic if you use thread-static
variables a lot (caveat: if you want to experiment, from what I can
tell it so far only works properly for the x86 target; the amd64
target, i.e. 64-bit, for some reason doesn't, so you'll want to build
for a 32-bit host if experimenting with it).
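For illustration, here is a minimal C sketch (not Mono's actual code)
contrasting the two thread-local mechanisms: the pthread_getspecific()
API, which pays a function call on every access, and a __thread
variable, which on ABIs with native TLS support can compile down to a
single segment-relative load (gs-relative on Darwin/x86):

```c
#include <pthread.h>
#include <stdint.h>

/* Illustrative sketch only: contrasts API-based TLS with native TLS. */

static pthread_key_t key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

/* Native TLS: on supporting ABIs this access compiles to a single
 * segment-relative load, with no function call. */
static __thread int fast_value;

static void make_key(void)
{
    pthread_key_create(&key, NULL);
}

/* API-based TLS: what Apple officially supports on Darwin; every
 * access goes through a library call. */
static void slow_set(int v)
{
    pthread_once(&key_once, make_key);
    pthread_setspecific(key, (void *)(intptr_t)v);
}

static int slow_get(void)
{
    pthread_once(&key_once, make_key);
    return (int)(intptr_t)pthread_getspecific(key);
}

static void fast_set(int v) { fast_value = v; }
static int  fast_get(void)  { return fast_value; }
```

Each thread sees its own copy of both fast_value and the value stored
under key; the difference between the two is purely in access cost.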
Second, if you're running a benchmark that aggressively has multiple
threads use a single shared lock, that can lead to a form of
"thrashing", independently of the OS used. Basically, if a thread blocks
because of a contended lock, most simple lock implementations suspend
the thread (which involves an expensive kernel trap). If timing is
unfortunate, you can waste a lot of time having threads suspend
themselves and get immediately reawakened; the specific overhead and
the circumstances under which that happens vary by OS, but the effect
can be very unpretty (you can easily make a program 10x slower on most
machines by parallelizing it in a way the architecture doesn't like).
You can recognize this scenario with /usr/bin/time or something
similar: an otherwise CPU-bound process will have a disproportionate
amount of time charged to system rather than user time.
A relatively simple workaround, when you have this problem but expect
the critical section to be short-lived, is to repeatedly use a "try
lock" operation (such as Monitor.TryEnter()) before falling back to a
lock-or-suspend type of operation. While this can be more expensive
(and potentially problematic if you have more threads than available
processors, or if you have a LOT of processors), in many normal
situations it prevents unnecessary thread suspensions (essentially, it
treats the lock as a spin lock and only falls back to a blocking
implementation if that seems unworkable).
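In C terms (Monitor.TryEnter() plays the analogous role in managed
code), the spin-then-block pattern described above might look like the
following sketch; SPIN_ATTEMPTS is an illustrative guess that real code
would tune:

```c
#include <pthread.h>

/* Sketch of the spin-then-block locking pattern described above.
 * SPIN_ATTEMPTS is an illustrative guess; real code would tune it. */
#define SPIN_ATTEMPTS 4000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void locked_increment(void)
{
    int i, have_lock = 0;

    /* Try to take the lock without blocking first: for a short
     * critical section the holder usually releases it before we give
     * up, so we avoid an expensive kernel-level suspend. */
    for (i = 0; i < SPIN_ATTEMPTS && !have_lock; i++)
        have_lock = (pthread_mutex_trylock(&lock) == 0);

    /* Spinning didn't work out; fall back to a blocking acquire,
     * which may suspend the thread in the kernel. */
    if (!have_lock)
        pthread_mutex_lock(&lock);

    counter++;
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < 100000; i++)
        locked_increment();
    return NULL;
}
```

Whether this wins depends on how long the critical section really is;
if holders routinely keep the lock for a long time, the spinning is
pure waste and blocking immediately is cheaper.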
Third, the Boehm GC may be causing trouble here; it (unavoidably) has a
central lock, and you say you're allocating millions of objects. While
the Boehm GC specifically tries to mitigate the high-contention
scenario above (and, if enabled, has thread-local allocation that
largely avoids it in a lot of cases), there may still be
system-specific differences. Running with --gc=sgen may help to either
identify or exclude this as a source of the performance difference.
And, of course, there are a gazillion other reasons why there may be a
performance difference, but these are common ones you may encounter.
[1] As on Linux, Darwin stores thread-local variables relative to the
segment register gs; unlike Linux, Darwin gives you no way to tell at
what offset thread-local data is (or can be) stored, nor does it
promise not to change its implementation completely in a later version
of the OS.
[2] There are alternative implementations of fast thread-local storage,
but most of them have their own up- and downsides.