[SCore-users] sleep or signal problems

Bogdan Costescu bogdan.costescu at iwr.uni-heidelberg.de
Wed Oct 9 05:47:27 JST 2002


On Tue, 8 Oct 2002, Atsushi HORI wrote:

> Hi, I am sorry for this late answer. I have been busy for preparing 
> the next SCore release.

Well, I'm waiting equally impatiently for both this answer and the next 
SCore release, so it's up to you how to divide your time between them :-)
But thank you for doing both !

> Well, this can happen because user processes is kept receiving 
> SIGSTOP and SIGCONT for gang scheduling.

I missed this in the documentation, but now it's clear. However, this is 
not the real problem... read on.

> The easiest way is to change the function name of sleep() to sc_sleep().

Yes, however this is not my program, so I don't know if the original 
programmer also intended or not to be interrupted by some signals. Using 
sc_sleep() ignores the signals, so it's not a general replacement. 
However, this temporary change allowed me to go further in finding the 
real problem.

I "lost" about half a day realizing the same thing that was mentioned in 
Kameyama's last message on the GlobalArray topic, though maybe not clearly 
enough: the application installs a signal handler for SIGCHLD in which a 
wait(2) call tries to get more data about the dead child, then prints it 
and exits. However, the return value of the wait(2) call is NOT checked. 
By checking it, I found that it is -1, which indicates an error, the error 
being (surprise! surprise!) ECHILD, which indicates (if I interpret the 
description correctly) that there was actually NO child that sent that 
signal. This is actually what I expected, as there was no child process 
created at that point! So, I tried and succeeded in reproducing the 
problem with a simple non-MPI program:

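#include <stdio.h>
#include <signal.h>
#include <errno.h>
#include <unistd.h>	/* pause() */
#include <sys/types.h>
#include <sys/wait.h>	/* wait() */

/* SIGCHLD handler: call wait(2) and report what it returned.
   printf() is not async-signal-safe; it is used here for illustration. */
void sig_chld(int signo)
{
	pid_t r;
	int status;	/* only meaningful when wait() succeeds */

	(void)signo;
	signal(SIGCHLD, sig_chld);	/* re-arm the handler */
	r = wait(&status);
	printf("%d : ", (int)r);
	if (r == -1) {
		switch (errno) {
		case ECHILD: printf("ECHILD : "); break;
		case EINVAL: printf("EINVAL : "); break;
		case EINTR: printf("EINTR/ERESTARTSYS : "); break;
		default: printf("Other code : "); break;
		}
	}
	/* Printed even on failure, to show the bogus value */
	printf("Status = %d\n", status);
}

int main(int argc, char *argv[])
{
	(void)argc;
	(void)argv;

	signal(SIGCHLD, sig_chld);
	printf("Signal handler installed...\n");
	pause();	/* returns once a handled signal has arrived */
	printf("Finalizing\n");

	return 1;
}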

[ I know this is not 100% correct, there may be races between the
signal(2)  call and delivery of signal, plus the return value of signal(2)
is not checked, but it's only here to illustrate the point]

This program prints the "Signal handler installed..." message then pauses.
There is no child created by this program, so there could be no SIGCHLD
signal received.  However, using kill(1) to send a SIGCHLD signal will
result in exactly the same behaviour that I've observed: wait(2) returns
-1 with errno=ECHILD and the "status" value is just bogus (in my attempts
with the MPI program I obtained various values: 0, -1, two-digit positive
numbers, 9 digit positive and negative numbers; I failed to interpret them
based on the macros described in the wait(2) man page).
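
The same experiment can be made without an external kill(1) by having the
process raise SIGCHLD itself. The following is only a sketch, not part of
the original application: the names on_sigchld() and
probe_spurious_sigchld() are made up for illustration, and it assumes a
POSIX system with sigaction(2). The handler does no printing; it stores the
outcome of wait(2) so that it can be inspected after the handler returns.
A caller would invoke probe_spurious_sigchld() from main().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t wait_failed;  /* 1 if wait(2) returned -1 */
static volatile sig_atomic_t wait_errno;   /* errno recorded in that case */

/* Hypothetical handler: record the outcome of wait(2) instead of printing. */
static void on_sigchld(int signo)
{
    int saved_errno = errno;   /* do not disturb the interrupted code */
    int status = 0;
    pid_t r;

    (void)signo;
    r = wait(&status);
    wait_failed = (r == (pid_t)-1);
    wait_errno = wait_failed ? errno : 0;
    errno = saved_errno;
}

/* Install the handler, send SIGCHLD to ourselves, and return the errno
   that wait(2) reported inside the handler (0 if wait succeeded,
   -1 if the handler could not be installed). */
int probe_spurious_sigchld(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old) == -1)
        return -1;

    wait_failed = 0;
    wait_errno = 0;
    raise(SIGCHLD);                  /* no child has exited: a "spurious" signal */
    sigaction(SIGCHLD, &old, NULL);  /* restore the previous disposition */

    return wait_failed ? (int)wait_errno : 0;
}
```

In a process that has not created any children, the returned value should
be ECHILD, matching what the test program above prints when the signal
comes from kill(1).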

Things became even stranger after I put in the signal handler of the MPI 
application code to print the time when the signal occurs. At the 
beginning, the test program had only code to initialize everything, 
sleep/pause then finalize and exit so I observed some erratic timings; but 
at some point I added some code which does no function call, basically a 
tight loop that does nothing - and I observed that the signal handler got 
executed every half second...

So the big question is: what is generating these spurious SIGCHLD signals?

There is certainly no dead child every half a second as there is no child 
created... and the man page of sc_signal_bcast() mentions that SCore-D 
uses only SIGSTOP, SIGCONT and SIGKILL.
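
One way to narrow this down might be to let the kernel report where each
SIGCHLD comes from. The sketch below is not from the application or from
SCore; it assumes a system where sigaction(2) supports SA_SIGINFO, and the
names trace_sigchld(), install_sigchld_tracer(), last_sigchld_code() and
last_sigchld_sender() are hypothetical. The handler records si_code and
si_pid from the siginfo_t it receives. An si_code of SI_USER means the
signal was sent with kill(2), while the CLD_* codes describe a real change
of state in a child.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t last_code;    /* si_code of the latest SIGCHLD */
static volatile sig_atomic_t last_sender;  /* si_pid of the latest SIGCHLD */

/* Hypothetical handler: remember who sent the signal and why. */
static void trace_sigchld(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)context;
    last_code = info->si_code;
    last_sender = (sig_atomic_t)info->si_pid;
}

/* Install the tracing handler; returns 0 on success, -1 on error. */
int install_sigchld_tracer(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = trace_sigchld;
    sa.sa_flags = SA_SIGINFO;      /* ask for the three-argument handler */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Accessors for the values recorded by the handler. */
int last_sigchld_code(void)    { return (int)last_code; }
long last_sigchld_sender(void) { return (long)last_sender; }

/* Nonzero if the latest SIGCHLD reported an actual child state change. */
int last_sigchld_from_child(void)
{
    switch (last_code) {
    case CLD_EXITED: case CLD_KILLED: case CLD_DUMPED:
    case CLD_TRAPPED: case CLD_STOPPED: case CLD_CONTINUED:
        return 1;
    default:
        return 0;
    }
}
```

After install_sigchld_tracer() has been called, sending the signal with
kill(getpid(), SIGCHLD) should leave SI_USER in last_sigchld_code() and the
process's own pid in last_sigchld_sender(). For the half-second signals,
the recorded pid would point at whichever process sends them, provided
that si_pid is filled in for the si_code reported.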

[ Sorry about the long message. I wanted to show you my line of thinking 
so that you might try to find flaws in it... :-) ]

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De



