[SCore-users] Kernel oops

Bogdan Costescu bogdan.costescu at iwr.uni-heidelberg.de
Fri Mar 7 04:58:42 JST 2003


Dear SCore developers,

I've postponed testing GM on our nodes because I have observed that 
whenever SCoreD crashes and takes one node down with it, an Oops is also 
displayed on that node. This is with the SCore 4.2.1 kernel patch applied 
to RH 2.4.18-24, so it might be an error that I have introduced, but the 
behaviour (SCoreD taking down one node) is the same with SCore 5.4 and 
kernel 2.4.19-1SCORE, a combination I plan to check for the Oops tomorrow.

So, the (decoded) Oops looks like this:

EIP is at __wake_up [kernel] 0x3c (2.4.18-24SCORE)
eax: c041c998   ebx: c25a4d80     ecx: 00000000       edx: 00000000
esi: 00000001   edi: c041c994     ebp: c25abf1c       esp: c25abf08
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c25ab000)
Stack:  00000282 00000001 c041c96c c041c840 c041c994 00000002 c019ec9a c25a4d80
        00000001 dbdfa015 00000010 c010a6e3 00000010 c041c840 c25abf7c c25abf7c
        c0398000 00000010 c25a4d80 c010a872 00000010 c25abf7c c25a4d80 00000001
Call trace: [<c019ec9a>] myri_pm_intr [kernel] 0x7a (0xc25abf20))
[<c010a63e>] handle_IRQ_event [kernel] 0x5e (0xc25abf34))
[<c010a872>] do_IRQ [kernel] 0xc2 (0xc25abf54))
[<c0106e60>] default_idle [kernel] 0x0 (0xc25abf68))
[<c0106e60>] default_idle [kernel] 0x0 (0xc25abf74))
[<c010d098>] call_do_IRQ [kernel] 0x5 (0xc25abf78))
[<c0106e60>] default_idle [kernel] 0x0 (0xc25abf7c))
[<c0106e60>] default_idle [kernel] 0x0 (0xc25abf90))
[<c0106e89>] default_idle [kernel] 0x29 (0xc25abfa4))
[<c0106f02>] cpu_idle [kernel] 0x32 (0xc25abfb0))
[<c011dafb>] call_console_drivers [kernel] 0xeb (0xc25abfd0))
[<c011dca9>] printk [kernel] 0x129 (0xc25abffc))
Code: 8b 02 85 45 f0 74 ed 6a 00 52 e8 75 f0 ff ff 5a 85 c0 59 74
Using defaults from ksymoops -t elf32-i386 -a i386

Trace; c019ec9a <myri_pm_intr+7a/90>
Trace; c010a63e <handle_IRQ_event+5e/90>
Trace; c010a872 <do_IRQ+c2/110>
Trace; c0106e60 <default_idle+0/40>
Trace; c0106e60 <default_idle+0/40>
Trace; c010d098 <call_do_IRQ+5/d>
Trace; c0106e60 <default_idle+0/40>
Trace; c0106e60 <default_idle+0/40>
Trace; c0106e89 <default_idle+29/40>
Trace; c0106f02 <cpu_idle+32/50>
Trace; c011dafb <call_console_drivers+eb/100>
Trace; c011dca9 <printk+129/140>
Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   8b 02                     mov    (%edx),%eax
Code;  00000002 Before first symbol
   2:   85 45 f0                  test   %eax,0xfffffff0(%ebp)
Code;  00000005 Before first symbol
   5:   74 ed                     je     fffffff4 <_EIP+0xfffffff4> fffffff4 <END_OF_CODE+1f463075/????>
Code;  00000007 Before first symbol
   7:   6a 00                     push   $0x0
Code;  00000009 Before first symbol
   9:   52                        push   %edx
Code;  0000000a Before first symbol
   a:   e8 75 f0 ff ff            call   fffff084 <_EIP+0xfffff084> fffff084 <END_OF_CODE+1f462105/????>
Code;  0000000f Before first symbol
   f:   5a                        pop    %edx
Code;  00000010 Before first symbol
  10:   85 c0                     test   %eax,%eax
Code;  00000012 Before first symbol
  12:   59                        pop    %ecx
Code;  00000013 Before first symbol
  13:   74 00                     je     15 <_EIP+0x15> 00000015 Before first symbol

 <0>Kernel panic: Aiee, killing interrupt handler!
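
One detail that may help when reading the dump: the first decoded instruction at EIP is `mov (%edx),%eax`, and the register dump shows edx: 00000000, which appears consistent with a NULL pointer being dereferenced inside __wake_up. Below is a minimal sketch, in Python, of how that check could be automated over saved Oops text. The helper names `parse_registers` and `base_register` are made up for illustration and are not part of ksymoops or SCore.

```python
import re

# Matches register fields of a 2.4-style i386 Oops, e.g. "edx: 00000000".
REG_RE = re.compile(r"\b(eax|ebx|ecx|edx|esi|edi|ebp|esp):\s*([0-9a-f]{8})")


def parse_registers(oops_text):
    """Return a dict mapping register name to its integer value."""
    return {name: int(value, 16) for name, value in REG_RE.findall(oops_text)}


def base_register(instruction):
    """Return the base register of an AT&T-syntax operand like (%edx), or None."""
    match = re.search(r"\(%(\w+)\)", instruction)
    return match.group(1) if match else None


# Register dump and first disassembled instruction, copied from the Oops above.
regs_dump = """\
eax: c041c998   ebx: c25a4d80     ecx: 00000000       edx: 00000000
esi: 00000001   edi: c041c994     ebp: c25abf1c       esp: c25abf08
"""
first_insn = "mov    (%edx),%eax"

regs = parse_registers(regs_dump)
base = base_register(first_insn)

if base is not None and regs.get(base) == 0:
    print(f"{base} is 0: the instruction at EIP reads through a NULL pointer")
```

This only looks at the simple `(%reg)` addressing form, which is enough for the instruction in this Oops; operands with displacements or index registers would need a fuller parser.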


Today I was able to reproduce this Oops several times on different nodes. 
The trace is always the same, except for the line(s) after cpu_idle, which 
can be replaced by:
[<c0105000>] stext [kernel] 0x0 (...))
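
To check that several Oopses really share the same trace up to cpu_idle, the call traces can be compared mechanically. The following is a minimal sketch in Python, assuming each Oops has been saved as text. The function names `trace_functions` and `common_prefix` are hypothetical, the sample traces are abbreviated from the one above, and the stack addresses of the second trace are left elided because only its last line was recorded.

```python
import re

# Matches call-trace entries such as:
#   [<c019ec9a>] myri_pm_intr [kernel] 0x7a (0xc25abf20))
TRACE_RE = re.compile(r"\[<([0-9a-f]{8})>\]\s+(\S+)\s+\[kernel\]\s+(0x[0-9a-f]+)")


def trace_functions(oops_text):
    """Return the (function, offset) pairs of the call trace, in order."""
    return [(m.group(2), m.group(3)) for m in TRACE_RE.finditer(oops_text)]


def common_prefix(traces):
    """Return the longest leading run of frames shared by all traces."""
    prefix = []
    for frames in zip(*traces):
        if any(frame != frames[0] for frame in frames):
            break
        prefix.append(frames[0])
    return prefix


# Abbreviated traces: the first ends in call_console_drivers, the second in stext.
oops_a = """\
Call trace: [<c019ec9a>] myri_pm_intr [kernel] 0x7a (0xc25abf20))
[<c010a63e>] handle_IRQ_event [kernel] 0x5e (0xc25abf34))
[<c010a872>] do_IRQ [kernel] 0xc2 (0xc25abf54))
[<c0106e89>] default_idle [kernel] 0x29 (0xc25abfa4))
[<c0106f02>] cpu_idle [kernel] 0x32 (0xc25abfb0))
[<c011dafb>] call_console_drivers [kernel] 0xeb (0xc25abfd0))
"""
oops_b = """\
Call trace: [<c019ec9a>] myri_pm_intr [kernel] 0x7a (...))
[<c010a63e>] handle_IRQ_event [kernel] 0x5e (...))
[<c010a872>] do_IRQ [kernel] 0xc2 (...))
[<c0106e89>] default_idle [kernel] 0x29 (...))
[<c0106f02>] cpu_idle [kernel] 0x32 (...))
[<c0105000>] stext [kernel] 0x0 (...))
"""

shared = common_prefix([trace_functions(oops_a), trace_functions(oops_b)])
for name, offset in shared:
    print(name, offset)
```

Comparing function names and offsets rather than stack addresses keeps the comparison meaningful across nodes, where the stack location may differ.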

I looked through the code a bit, but I don't understand Myrinet 
programming well enough, so maybe this gives you some idea. Spurious 
interrupts? Lost interrupts? I'm still not comfortable with the 
interrupt state on my machines, as they have Tyan 760MP boards, which are 
known for instabilities.
Anyway, as I said, I plan to try SCore 5.4 with kernel 2.4.19-1SCORE 
tomorrow to see whether the lockups there are also associated with such 
Oopses.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De



