[SCore-users-jp] Error during SCore-D automatic operation and automatic recovery

matsuoka matsuoka @ arch.ce.hiroshima-cu.ac.jp
Mon Oct 23 14:48:26 JST 2006


Dear mailing list members,

My name is Matsuoka, from Hiroshima City University.
Thank you in advance for your help.

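I am currently trying to set up automatic operation and automatic recovery of SCore-D with SCore 5.8.3. However, when score-d restarts after a failure, the following message is printed:

<1> ULT: Exception Signal (11)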
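
For reference, on Linux signal number 11 is SIGSEGV (segmentation fault). The following is a minimal Python sketch, not part of SCore, that decodes such a message. The function name decode_ult_exception is hypothetical, reading the <1> prefix as a node number is an assumption, and the signal name is looked up on the machine running the script, so it should be run on Linux to match the cluster.

```python
import re
import signal


def decode_ult_exception(line):
    """Parse a 'ULT: Exception Signal (N)' line.

    Returns (prefix number, signal number, signal name), or None if the
    line does not match. Treating the <N> prefix as a node number is an
    assumption, not something the log itself states.
    """
    m = re.search(r"<(\d+)>\s+ULT: Exception Signal \((\d+)\)", line)
    if m is None:
        return None
    prefix, signo = int(m.group(1)), int(m.group(2))
    try:
        name = signal.Signals(signo).name
    except ValueError:
        # Signal number not defined on this platform.
        name = "UNKNOWN"
    return prefix, signo, name


print(decode_ult_exception("<1> ULT: Exception Signal (11)"))
```

A segmentation fault is consistent with the GDB backtrace in the log, where the signal handler is entered from close_pm_context ().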


The server is master.pccluster.org, and the compute hosts are pc1 through pc6.pccluster.org.
The seven machines master.pccluster.org and pc1 through pc6.pccluster.org form the pcc group, and pc7.pccluster.org
is designated as the spare.


As the first step, I ran the following as root:
sc_watch -g pcc -t 1m -l ./replace.sh scored_dev
The contents of replace.sh are the same as the version on the Web.

Once SCore-D had booted up, I ran the following in another shell:
scrun -scored=master,checkpoint=4m,nodes=2x1 ./a.out
a.out is an MPI program.

After a checkpoint had been taken, I shut down pc1.


The log from that run follows.

[root @ master ft]# sc_watch -g pcc -t 1m -l ./replace.sh scored_dev
[20/Oct/2006,15:48:37] SC_WATCH (5.8.3) started.
[20/Oct/2006,15:48:37] Interval is set to 1 minutes.
[20/Oct/2006,15:48:37] Init. Action  = (none)
[20/Oct/2006,15:48:37] Local Action  = ./replace.sh
[20/Oct/2006,15:48:37] Remote Action = (none)
[20/Oct/2006,15:48:37] Abort action  = (none)
[20/Oct/2006,15:48:37] Boot Retry Max. = 10
[20/Oct/2006,15:48:37] Booting System: scored_dev
SCOUT: Spawning done.
20/Oct/2006 15:48:39 SYSLOG: /opt/score5.8.3/deploy/scored_dev
20/Oct/2006 15:48:39 SYSLOG: SCore-D 5.8.3 $Id: init.cc,v 1.74 2005/02/24 07:47:54 hori Exp $
20/Oct/2006 15:48:39 SYSLOG: Compile option(s): DEVELOPMENT ULT_DO_TRACE SCORE_DO_TRACE
20/Oct/2006 15:48:39 SYSLOG: SCore-D network: myrinetxp/myrinetxp
20/Oct/2006 15:48:39 SYSLOG: Cluster[0]: (0..6)x2.i386-fedoracore3-linux2_6.pentium4.2800
20/Oct/2006 15:48:39 SYSLOG:   Memory: 1010[MB], Swap: 1028[MB], Disk: 9845[MB]
20/Oct/2006 15:48:39 SYSLOG:   Network[0]: myrinetxp/myrinetxp
20/Oct/2006 15:48:39 SYSLOG:   Network[1]: ethernet/ethernet
20/Oct/2006 15:48:39 SYSLOG: Scheduler initiated: Timeslice = 200 [msec]
20/Oct/2006 15:48:39 SYSLOG:   Queue[0] activated, exclusive scheduling
20/Oct/2006 15:48:39 SYSLOG:   Queue[1] activated, time-sharing scheduling
20/Oct/2006 15:48:39 SYSLOG:   Queue[2] activated, time-sharing scheduling
20/Oct/2006 15:48:39 SYSLOG: Session ID: 0
20/Oct/2006 15:48:39 SYSLOG: Server Host: master.pccluster.org
20/Oct/2006 15:48:39 SYSLOG: Backup Host: pc3.pccluster.org @ 2
20/Oct/2006 15:48:39 SYSLOG: Backup file is lost but created.
20/Oct/2006 15:48:39 SYSLOG: Server file is lost but created.
20/Oct/2006 15:48:39 SYSLOG: Operated by: root
20/Oct/2006 15:48:39 SYSLOG: SCore-D Watcher (master.pccluster.org:33245)
20/Oct/2006 15:48:39 SYSLOG: --------- SCore-D (5.8.3) bootup --------
20/Oct/2006 15:49:04 SYSLOG: Login request: matsuoka @ master.pccluster.org:33247
20/Oct/2006 15:49:04 SYSLOG: Login accepted: matsuoka @ master.pccluster.org:33247, JID: 1, Hosts: 2(2x1)@0, Priority: 1, Command: ./a.out
[20/Oct/2006,15:57:08] System failure detected.
[20/Oct/2006,15:57:28] System has been shutdown.
[20/Oct/2006,15:57:28] Local Action: ./replace.sh
1 host not responding.
defected hosts
pc1.pccluster.org
new host list
pc1.pccluster.org pc2.pccluster.org pc3.pccluster.org pc4.pccluster.org
pc5.pccluster.org pc6.pccluster.org master.pccluster.org
7 hosts found.
Shutting down scoreboard services:                         [  OK  ]
Starting scoreboard services:                              [  OK  ]
scoreboard is restarted.
[20/Oct/2006,15:57:35] Rebooting System [2 times, first retry]: scored_dev
SCOUT: Spawning done.
20/Oct/2006 15:57:37 SYSLOG: /opt/score5.8.3/deploy/scored_dev
<2> SCore-D:WARNING Host pc1.pccluster.org is replaced by pc7.pccluster.org.
20/Oct/2006 15:57:37 SYSLOG: SCore-D 5.8.3 $Id: init.cc,v 1.74 2005/02/24 07:47:54 hori Exp $
20/Oct/2006 15:57:37 SYSLOG: Compile option(s): DEVELOPMENT ULT_DO_TRACE SCORE_DO_TRACE
20/Oct/2006 15:57:37 SYSLOG: SCore-D network: myrinetxp/myrinetxp
20/Oct/2006 15:57:37 SYSLOG: Cluster[0]: (0..6)x2.i386-fedoracore3-linux2_6.pentium4.2800
20/Oct/2006 15:57:37 SYSLOG:   Memory: 1010[MB], Swap: 1028[MB], Disk: 9845[MB]
20/Oct/2006 15:57:37 SYSLOG:   Network[0]: myrinetxp/myrinetxp
20/Oct/2006 15:57:37 SYSLOG:   Network[1]: ethernet/ethernet
20/Oct/2006 15:57:37 SYSLOG: Scheduler initiated: Timeslice = 200 [msec]
20/Oct/2006 15:57:37 SYSLOG:   Queue[0] activated, exclusive scheduling
20/Oct/2006 15:57:37 SYSLOG:   Queue[1] activated, time-sharing scheduling
20/Oct/2006 15:57:37 SYSLOG:   Queue[2] activated, time-sharing scheduling
20/Oct/2006 15:57:37 SYSLOG: Session ID: 0
20/Oct/2006 15:57:37 SYSLOG: Server Host: master.pccluster.org
20/Oct/2006 15:57:37 SYSLOG: Backup Host: pc3.pccluster.org @ 2
<6> SCore-D:WARNING Host pc1.pccluster.org is replaced by pc7.pccluster.org.
20/Oct/2006 15:57:37 SYSLOG: Operated by: root
20/Oct/2006 15:57:37 SYSLOG: SCore-D Watcher (master.pccluster.org:33248)
20/Oct/2006 15:57:37 SYSLOG: Recovery: matsuoka @ master.pccluster.org:33247, JOB-ID: 1
20/Oct/2006 15:57:37 SYSLOG: --------- SCore-D (5.8.3) bootup --------
<1> ULT: Exception Signal (11)

<1> Attaching GDB: Exception signal
(no debugging symbols found)...Using host libthread_db library "/lib/tls/libthread_db.so.1".
(no debugging symbols found)...`shared object read from target memory' has disappeared; keeping its symbols.
0xffffe410 in __kernel_vsyscall ()
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4683296d in wait () from /lib/tls/libc.so.6
#2  0x080de358 in score_attach_debugger ()
#3  0x080d8621 in ult_exception ()
#4  <signal handler called>
#5  0x08093fa8 in close_pm_context ()
#6  0x080940bc in close_context ()
#7  0x08094398 in free_user_network ()
#8  0x08097860 in reallocate_user_network ()
#9  0x080b8eb4 in fork_rstr_proc ()
#10 0x080681e9 in createPE ()
#11 0x080694f5 in createPPE ()
#12 0x08069e92 in fork_pegroup ()
#13 0x0809adb1 in fork_all ()
#14 0x080a1bec in _ainvoker4<int, GlobalPtr<ControlTree>, int, int, int>::invoke ()
#15 0x4054ff0c in ?? ()
#16 0x0809acdc in setlimit_all ()
/opt/score5.8.3/deploy/score.gdb:1: Error in sourced command file:
Previous frame inner to this frame (corrupt stack?)
・・・


The following is the output from running the MPI program a.out in the other shell.

[matsuoka @ master test]$ scrun -scored=master,checkpoint=4m,nodes=2x1 ./a.out
SCore-D 5.8.3 connected (jid=1,reconnect=33247).
<0:0> SCORE: 2 nodes (2x1) ready.

SCORE: Checkpointing ... done.

FEP: [20/Oct/2006 15:57:37] Restarted.
SCore-D 5.8.3 connected (jid=1,reconnect=33247).
<0> SCORE-D: System checkpoint file was lost, but recovered.
<0> SCORE-D: User checkpoint file was lost, but recovered.
FEP:WARNING SCore-D unexpectedly terminated.
FEP: [20/Oct/2006 15:57:45] Waiting for SCore-D restarted ...
・・・

Judging from the log, both the switch from the failed host to the spare host
and the recovery of the user program appear to be carried out.
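
The same reading of the log can be sketched as a small Python script that reports which stages show up in a saved copy of the sc_watch output. This is only an illustration under stated assumptions: the function name summarize_recovery and the dictionary keys are hypothetical, the matched strings are taken from the log in this message, and other SCore versions may word them differently.

```python
import re


def summarize_recovery(log_text):
    """Report which recovery stages appear in an sc_watch/SCore-D log excerpt."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    replaced = re.search(r"Host (\S+) is replaced by (\S+)", flat)
    signal_hit = re.search(r"ULT: Exception Signal \((\d+)\)", flat)
    return {
        "failure_detected": "System failure detected" in flat,
        # The replacement host is followed by a sentence-ending period.
        "replacement": (
            (replaced.group(1), replaced.group(2).rstrip("."))
            if replaced
            else None
        ),
        "job_recovery_logged": "SYSLOG: Recovery:" in flat,
        "exception_signal": int(signal_hit.group(1)) if signal_hit else None,
    }


# Excerpt copied from the log in this message.
sample = """\
[20/Oct/2006,15:57:08] System failure detected.
<2> SCore-D:WARNING Host pc1.pccluster.org is replaced by
pc7.pccluster.org.
20/Oct/2006 15:57:37 SYSLOG: Recovery:
matsuoka @ master.pccluster.org:33247, JOB-ID: 1
<1> ULT: Exception Signal (11)
"""

print(summarize_recovery(sample))
```

For this excerpt, the host replacement and the job recovery entry are both found, and the exception is logged only after them.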

I would appreciate any advice on the cause of this exception and how to resolve it.




