[Mono-dev] Possible deadlock in sgen garbage collector
blinke at CeBiTec.Uni-Bielefeld.DE
Wed May 26 08:39:22 EDT 2010
I stumbled over a possible deadlock in the Boehm GC some time ago. Since the
sgen GC uses the same mechanism for stopping the world, it may also be a
problem in that implementation.
Thread termination is signalled to the GC by means of a thread exit handler
(boehm) or a thread data key destructor (sgen). The function in question
removes the thread from the internal management tables and does the necessary
cleanup.
POSIX does not specify the state of the thread's signal mask during exit
handlers or data key destructors. Linux pthreads, afaik, leaves signals
enabled, so the signal-based suspend/restart mechanism is OK. But Solaris/x86
blocks signals during these handlers. From the pthread_exit(3) manpage:
    An exiting thread runs with all signals blocked. All thread
    termination functions, including cancellation cleanup
    handlers and thread-specific data destructor functions, are
    called with all signals blocked.
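Whether a given platform blocks signals in these handlers can be checked directly. The following is only a minimal sketch: it registers a thread-specific data destructor and queries the signal mask from inside it. The function and variable names are made up for illustration and do not come from Boehm GC or sgen.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <signal.h>

/* Hypothetical helper names; not taken from Boehm GC or sgen. */
static int probe_signo;        /* signal whose blocked state is queried */
static int probe_result = -1;  /* 1 = blocked, 0 = not blocked, -1 = error */

/* Runs as a thread-specific data destructor while the thread is exiting. */
static void probe_destructor(void *value)
{
    sigset_t current;

    /* POSIX only calls destructors for keys holding a non-NULL value. */
    assert(value != NULL);
    /* A NULL new set makes pthread_sigmask only report the current mask. */
    if (pthread_sigmask(SIG_SETMASK, NULL, &current) == 0)
        probe_result = sigismember(&current, probe_signo);
}

static void *probe_thread(void *arg)
{
    pthread_key_t *key = arg;

    /* Store a non-NULL value so the destructor fires at thread exit. */
    pthread_setspecific(*key, (void *)1);
    return NULL;
}

/* Returns 1 if signo is blocked inside the destructor, 0 if not, -1 on error. */
int signal_blocked_in_destructor(int signo)
{
    pthread_key_t key;
    pthread_t tid;

    probe_signo = signo;
    probe_result = -1;
    if (pthread_key_create(&key, probe_destructor) != 0)
        return -1;
    if (pthread_create(&tid, NULL, probe_thread, &key) != 0) {
        pthread_key_delete(key);
        return -1;
    }
    pthread_join(tid, NULL);   /* the destructor has finished once join returns */
    pthread_key_delete(key);
    return probe_result;
}
```

Calling signal_blocked_in_destructor() with the signal number the collector uses for suspending threads should return 1 on a platform that behaves as the Solaris manpage describes, and 0 where the destructor runs with the signal deliverable.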
And at this point an (unlikely, but possible) race condition occurs. If thread
A stops the world, it examines the thread table for active threads and sends a
suspend signal to each of them. If this happens while thread B is terminating
and executing its termination handlers, the signal will be blocked (and
probably lost, since the manpage does not mention unblocking the signals
again). The suspend handler posts to a semaphore that thread A is waiting on.
Thread B's post never happens and thread A blocks forever. This error is not
reliably reproducible, so no test case can be provided.
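The real race cannot be triggered on demand, but the lost acknowledgement itself can be simulated. The sketch below is an illustration under assumptions, not collector code: SIGUSR1 stands in for the suspend signal, thread B blocks it explicitly to mimic a Solaris termination handler, and thread A waits with a timeout instead of forever so the program can report the outcome. It also assumes unnamed POSIX semaphores (sem_init) are available. All names are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <time.h>

/* Hypothetical names; SIGUSR1 stands in for the collector's suspend signal. */
static sem_t ack_sem;      /* posted by the suspend handler */
static sem_t ready_sem;    /* thread B has set up its signal mask */
static sem_t release_sem;  /* lets thread B finish */

static void suspend_handler(int signo)
{
    int saved = errno;
    (void)signo;
    sem_post(&ack_sem);    /* sem_post is async-signal-safe */
    errno = saved;
}

static void wait_uninterrupted(sem_t *sem)
{
    while (sem_wait(sem) == -1 && errno == EINTR)
        ;                  /* retry if a signal handler interrupted the wait */
}

static void *thread_b(void *arg)
{
    int block = *(int *)arg;
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    /* block != 0 mimics a termination handler that runs with signals blocked. */
    pthread_sigmask(block ? SIG_BLOCK : SIG_UNBLOCK, &set, NULL);
    sem_post(&ready_sem);
    wait_uninterrupted(&release_sem);
    return NULL;
}

/* Returns 1 if thread B acknowledged the suspend signal in time, else 0. */
int suspend_ack_received(int target_blocks_signal)
{
    struct sigaction sa, old;
    struct timespec deadline;
    pthread_t b;
    int rc;

    sa.sa_handler = suspend_handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, &old);
    sem_init(&ack_sem, 0, 0);
    sem_init(&ready_sem, 0, 0);
    sem_init(&release_sem, 0, 0);

    pthread_create(&b, NULL, thread_b, &target_blocks_signal);
    wait_uninterrupted(&ready_sem);
    pthread_kill(b, SIGUSR1);          /* thread A: "stop the world" */

    /* Wait up to 500 ms; the real collector would wait without a timeout. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += 500000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    while ((rc = sem_timedwait(&ack_sem, &deadline)) == -1 && errno == EINTR)
        ;

    sem_post(&release_sem);
    pthread_join(b, NULL);
    sigaction(SIGUSR1, &old, NULL);
    sem_destroy(&ack_sem);
    sem_destroy(&ready_sem);
    sem_destroy(&release_sem);
    return rc == 0;
}
```

With the signal deliverable, thread B's handler posts the semaphore and thread A continues. With the signal blocked, the wait times out, which is the point where a collector waiting without a timeout would hang.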
The patch for boehm GC requires adding another mutex for thread
termination/garbage collection and explicitly checking for pending signals in
the termination handler. I'll try to port this patch to sgen GC, unless
someone else has a better solution.
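To make the "checking for pending signals" part more concrete, here is a rough sketch of what such a check could look like. It is not the actual Boehm GC patch. The helper names are hypothetical, SIGUSR1 again stands in for the suspend signal, and the additional mutex is not modelled at all (presumably it is what keeps a new suspend signal from arriving after the check). A real collector would likely also have to wait for the restart before letting the thread go on.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>

/*
 * Hypothetical helper for a termination handler: if the suspend signal is
 * pending (sent but blocked), consume it and post the acknowledgement the
 * suspend handler would have posted. Returns 1 if it acknowledged, else 0.
 */
int ack_pending_suspend(int suspend_signo, sem_t *ack)
{
    sigset_t pending, only;
    int received;

    if (sigpending(&pending) != 0 || sigismember(&pending, suspend_signo) != 1)
        return 0;
    sigemptyset(&only);
    sigaddset(&only, suspend_signo);
    /* The signal is already pending, so sigwait returns without blocking. */
    if (sigwait(&only, &received) != 0)
        return 0;
    assert(received == suspend_signo);
    sem_post(ack);
    return 1;
}

/* Self-check: simulate a termination handler by blocking SIGUSR1 first. */
int demo_ack_pending_suspend(void)
{
    sigset_t set, old;
    sem_t ack;
    int acked, count = -1;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &set, &old);
    sem_init(&ack, 0, 0);

    pthread_kill(pthread_self(), SIGUSR1);   /* stays pending: it is blocked */
    acked = ack_pending_suspend(SIGUSR1, &ack);
    sem_getvalue(&ack, &count);

    sem_destroy(&ack);
    pthread_sigmask(SIG_SETMASK, &old, NULL);
    return acked == 1 && count == 1;
}
```

Consuming the signal with sigwait() rather than only inspecting it with sigpending() is a design choice of this sketch: it makes sure the signal cannot be delivered later if the mask is ever changed.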