From jure.jerman ＠ rzs-hm.si  Thu May  1 15:59:10 2003
From: jure.jerman ＠ rzs-hm.si (Jure Jerman)
Date: Thu, 01 May 2003 06:59:10 +0000 (UTC)
Subject: [SCore-users-jp] Re: [SCore-users] Questions about Score scheduling scheme
In-Reply-To: <3134552759.hori0000@swimmy-soft.com>
Message-ID: <Pine.LNX.4.10.10305010618341.18750-100000@calvus.rzs-hm.si>

Hi,

I have to apologize for wasting bandwidth with my question
about checkpointing:

The problem was on our side: the user was running the SCore job via
a system command.
The job ran fine, but checkpointing it failed.
After the script was corrected, checkpoint/restart worked well.

However, I have another question: 
We wanted to trick SCore by including a call to mpi_init at the
beginning of a serial application, so that SCore would treat
the application as a SCore application (perhaps there are other ways
to do this?). Everything worked (checkpointing, aborting, restarting), except
that sc_watch crashed with the message:
<1> ULT:PANIC No more thread stack space

Do you think this error might be connected with the "dirty" trick we
use? In other words: is there a way to turn a serial job into
a SCore application?

Many thanks in advance for any hint,

Best regards, Jure Jerman

_______________________________________________
SCore-users mailing list
SCore-users ＠ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


From hori ＠ swimmy-soft.com  Fri May  2 15:29:03 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Fri, 2 May 2003 15:29:03 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Questions about Score scheduling scheme
In-Reply-To: <Pine.LNX.4.10.10305010618341.18750-100000@calvus.rzs-hm.si>
References: <3134552759.hori0000@swimmy-soft.com>
Message-ID: <3134734143.hori0002@swimmy-soft.com>

Hi,

>However, I have another question: 
>We wanted to trick the score including call to mpi_init into the 
>beginning of serial application in order that score would consider
>the application as score application (perhaps there are other ways
>to do this?). Everything was OK, checkpointing/aborting/restarting except
>that sc_watch crashed with the message:
><1> ULT:PANIC No more thread stack space

This message says that SCore-D (the cluster operating system) is out of
memory. I suppose your sequential (non-SCore) job consumes most of
the memory. How much swap space do you have?

>Do you think that this error might be connected with the "dirty" trick we
>use or with another words:  is there a way to force serial job into 
>a score application?

As far as I understand what you did, there should be no problem with
your "trick". The panic message comes from the system software (SCore-D),
not from the SCore runtime library.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



From nrcb ＠ streamline-computing.com  Tue May  6 22:31:10 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Tue, 6 May 2003 14:31:10 +0100
Subject: [SCore-users-jp] [SCore-users] Fwd: problem launching some jobs
Message-ID: <200305061431.10857.nrcb@streamline-computing.com>

Hi, I have received this from one of our users.

The system is 128 dual Xeon compute nodes with 
fibre optic Myrinet 2000.

Has anyone seen a similar thing (SCore 5.0.1)?

-------------------------------------------------------------------------

An error has been reported to me that occasionally occurs when launching
parallel jobs on Snowdon through Sun Grid Engine.

Basically, the only strange output is in the .e file which specifies:

SCOUT: bind: Address already in use.

The code does not get launched.

Trying to reproduce this error with any degree of consistency has been
tricky. The only way of achieving the error has been to submit multiple
copies of my "hello world" program to SGE and then grep-ing the output files
for the error.

I thought that there would probably be something wrong with a particular
node in the cluster. Looking at the .pe files, my code sometimes works and
sometimes doesn't work, using exactly the same nodes specified in the pe
file.

The error has been seen when running on anything from one node upwards.

I have not been able to reproduce the error by running the program directly
through "scout".

If you want to look at some output files, you're welcome to search through
the output files in ~issanr/test/

We are unsure whether the problem is with SGE, SCore, or a node
configuration.

Thanks for your help,



From hori ＠ swimmy-soft.com  Wed May  7 09:59:55 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Wed, 7 May 2003 09:59:55 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Fwd: problem launching some jobs
In-Reply-To: <200305061431.10857.nrcb@streamline-computing.com>
References: <200305061431.10857.nrcb@streamline-computing.com>
Message-ID: <3135146395.hori0000@swimmy-soft.com>

Hi,

>Basically, the only strange output is in the .e file which specifies:
>
>SCOUT: bind: Address already in use.
>
>The code does not get launched.

I checked the source code and found that the message is output by
the scout program, but I wonder how this could happen.

The scout program opens a UNIX (not INET) socket to which scout
commands in the scout environment connect. The path name is
associated with the PID. The path is unlink()ed before the socket is
opened (created), and unlink()ed again when the scout environment
terminates. Sometimes obsolete sockets can be found in the /tmp
directory.

So I wonder how the message can be output. It may be worth trying
to delete the /tmp/scout* files before submitting SGE jobs. But be
sure that nobody else is using a scout environment at that time;
otherwise, a user in a scout environment will no longer be able to
run scout commands.
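
Before deleting anything, it may help to see what is actually left in /tmp. The following is a minimal sketch and not part of SCore: the function name list_scout_files is made up for illustration, and the scout* pattern is simply the one mentioned above. It only lists matching entries and marks which ones are sockets; it removes nothing.

```shell
#!/bin/sh
# Hypothetical helper (not an SCore tool): list entries matching scout*
# in a directory and label each one as a socket or something else.
# The directory defaults to /tmp, as in the message above.
list_scout_files() {
    dir=${1:-/tmp}
    for f in "$dir"/scout*; do
        # When nothing matches, the unexpanded pattern is skipped here.
        [ -e "$f" ] || continue
        if [ -S "$f" ]; then
            kind=socket
        else
            kind=other
        fi
        printf '%s %s\n' "$kind" "$f"
    done
}
```

Calling list_scout_files with no argument inspects /tmp. Whether a listed socket is stale still has to be judged by hand, for example by checking that no scout environment is running.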

----
Atsushi HORI
Swimmy Software, Inc.



From gchoi ＠ cse.psu.edu  Fri May  9 04:43:06 2003
From: gchoi ＠ cse.psu.edu (Gyu Sang Choi)
Date: Thu, 08 May 2003 15:43:06 -0400
Subject: [SCore-users-jp] [SCore-users] pmGetNodeList: No route to host(113)
Message-ID: <3EBAB34A.8030208@cse.psu.edu>

Hi,

I installed SCore on a Linux cluster.
The cluster has dual AMD Athlon CPUs, Myrinet 2000 cards, and 100 Mbps
Ethernet cards.


After the installation, I ran the SCOUT test and got the correct results.
Then I tested PM/Ethernet, and it works.

Here is the problem.
I tested PM/Myrinet using the loopback test and the point-to-point test, but I
got this error message: "pmGetNodeList: No route to host(113)".
My server node name is "aum0.cse.psu.edu" and I ran the PM/Myrinet tests on
aum0.cse.psu.edu. My Myrinet card is an M3F-PCI64C-8.


Could anyone tell me why this happens and how to fix this problem?

Thanks,

-- 

  ---------------------------------------------------------------
| Gyu Sang Choi							|
| TA : CSE/EE 554						|
| Office Hour : Monday 3:00-4:30 and Tuesday 2:30-4:00		|
| Office : 313 Pond Lab (863-3814) 					|
| email : gchoi ＠ cse.psu.edu					|
| Tel : (814)861-6986					       	|
| Address : 425 Waupelani Drive #327, State College, PA 16801	|
  ---------------------------------------------------------------




From kameyama ＠ pccluster.org  Fri May  9 09:36:05 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Fri, 09 May 2003 09:36:05 +0900
Subject: [SCore-users-jp] Re: [SCore-users] pmGetNodeList: No route to host(113)
In-Reply-To: Your message of "Thu, 08 May 2003 15:43:06 JST."
             <3EBAB34A.8030208@cse.psu.edu>
Message-ID: <20030509003605.C90DA20066@neal.il.is.s.u-tokyo.ac.jp>

In article <3EBAB34A.8030208 ＠ cse.psu.edu> Gyu Sang Choi <gchoi ＠ cse.psu.edu> wrote:
> The problem is here.
> I tested PM/Myrinet using Loopback test and Point-to-Point test, but I 
> got this error messages, "pmGetNodeList: No route to host(113)".
> My server node name is "aum0.cse.psu.edu" and I tested PM/Myrinet in 
> aum0.cse.psu.edu. My myrinet card is M3F-PCI64C-8.

I think the Myrinet configuration file has a problem.

Please set the PM_DEBUG environment variable to 1 and rerun the test for more information:
    % setenv PM_DEBUG 1
    % rpmtest ...
Or
    $ export PM_DEBUG=1
    $ rpmtest ...

                       from Kameyama Toyohisa


From hori ＠ swimmy-soft.com  Fri May  9 09:40:58 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Fri, 9 May 2003 09:40:58 +0900
Subject: [SCore-users-jp] Re: [SCore-users] pmGetNodeList: No route to host(113)
In-Reply-To: <3EBAB34A.8030208@cse.psu.edu>
References: <3EBAB34A.8030208@cse.psu.edu>
Message-ID: <3135318058.hori0000@swimmy-soft.com>

Hi,

>The problem is here.
>I tested PM/Myrinet using Loopback test and Point-to-Point test, but I 
>got this error messages, "pmGetNodeList: No route to host(113)".
>My server node name is "aum0.cse.psu.edu" and I tested PM/Myrinet in 
>aum0.cse.psu.edu. My myrinet card is M3F-PCI64C-8.

I believe your PM/Myrinet configuration file (usually located in
/opt/score/etc/pm-myrinet.conf) is wrong. Check the following:

1. In a scout environment

% scout officialname

then you will get a list of "official hostnames". Check whether the
hostnames in the output and the hostnames in the Myrinet configuration
file are the same.

2. If the above is OK, then the Myrinet configuration (routing) is wrong.
Send the file to me, and I will check it.
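
The comparison in step 1 can be done mechanically. This is only a sketch under assumptions: it expects two plain text files with one hostname per line, one saved from the scout officialname output and one with the hostnames copied by hand out of the Myrinet configuration file (the configuration file format itself is not parsed here). The function name hosts_missing_from is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print hostnames that appear in the first file
# but not in the second. Both files hold one hostname per line.
hosts_missing_from() {
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    # comm -23 keeps only the lines unique to the first (sorted) input.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

Running it in both directions (official names against configuration names, then the reverse) exposes mismatches such as a short name on one side and a fully qualified name on the other.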

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



From nrcb ＠ streamline-computing.com  Fri May  9 16:01:54 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Fri, 9 May 2003 08:01:54 +0100
Subject: [SCore-users-jp] [SCore-users] invalid fragment error
Message-ID: <200305090759.13497.nrcb@streamline-computing.com>

Hi, I have just upgraded a customer's machine from 8 nodes to 12 nodes.

Previously it had 8 copper myrinet2k nodes using SCore 5.0.1 and worked
without problems.
I added 4 nodes with new fibre myrinet2k cards and added a fibre optic spine
card to the switch. So the switch now has one copper line card and one fibre
spine card.

I ran the scstest over all nodes successfully. 

Now we get an intermittent problem. Most of the jobs run fine, but sometimes
we get a strange error.

This is the Pallas PMB benchmark on 24 CPUs over myrinet2k.

SCOUT: Spawning                    com
p03.fdone
.
<0:0> SCORE: 24 nodes (12x2) ready.
<0:0> SCORE:WARNING MPICH/SCore: receive-request-queue:
<0:0> SCORE:WARNING MPICH/SCore    [buffer=0x91645b0, type=1025, from=2, size=262144, offset=8240]
<0:0> SCORE:WARNING MPICH/SCore: receive-message-queue:
<0:0> SCORE:WARNING MPICH/SCore    (empty)
<0:0> SCORE:WARNING MPICH/SCore: received-fragment:
<0:0> SCORE:WARNING MPICH/SCore    [buffer=0x40069d28, type=1213202598, from=1213202854, size=1213203110, fragment_size=8240, offset=1213203366]
<0:0> SCORE:WARNING MPICH/SCore: received an invalid fragment (no previous fragment)
<0:0> SCORE:PANIC MPICH/SCore: critical error on message transfer
<0:0> Trying to attach GDB (no DISPLAY): PANIC
SCORE: Program aborted.
SCOUT: Session done.


However, when we run the same job again, it runs fine.

Does anyone know what might be the cause of this?


Thanks,

Nick



From hori ＠ swimmy-soft.com  Fri May  9 16:16:26 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Fri, 9 May 2003 16:16:26 +0900
Subject: [SCore-users-jp] Re: [SCore-users] invalid fragment error
In-Reply-To: <200305090759.13497.nrcb@streamline-computing.com>
References: <200305090759.13497.nrcb@streamline-computing.com>
Message-ID: <3135341786.hori0004@swimmy-soft.com>

Hi,

>This is the pallas PMB benchmarks on 24 cpus over myrinet2k.

Split the cluster into two halves and then run the benchmark
program on each subcluster. I have no experience with the
combination of copper and fiber. So if you split the cluster into
a former half and a latter half, and the problem is found in the
latter half, then the combination might be the problem.

BTW, have you run scstest with a length of 8192?
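
One way to prepare the two halves is to split a host list file. This is a minimal sketch, assuming a plain text file with one hostname per line in cluster order; the function name split_hosts and the output file names are made up, and turning the resulting lists into SCore host groups is not shown.

```shell
#!/bin/sh
# Hypothetical helper: split a host list (one hostname per line) into
# <prefix>.first and <prefix>.second. The optional third argument is the
# number of hosts in the first part; it defaults to half, rounded down.
split_hosts() {
    list=$1
    prefix=$2
    total=$(wc -l < "$list")
    first=${3:-$((total / 2))}
    head -n "$first" "$list" > "$prefix.first"
    tail -n "+$((first + 1))" "$list" > "$prefix.second"
}
```

For the 8 copper and 4 fibre nodes described in this thread, passing 8 as the third argument would separate the two kinds of node, provided the list is ordered that way.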

----
Atsushi HORI
Swimmy Software, Inc.



From kate ＠ pfu.fujitsu.com  Fri May  9 17:22:12 2003
From: kate ＠ pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Fri, 09 May 2003 17:22:12 +0900
Subject: [SCore-users-jp] Garbage?
Message-ID: <200305090822.AA16332@flash.tokyo.pfu.co.jp>

This is Katayama from PFU.

While looking through the SCore 5.4 CD-ROM, I found this file:

[root ＠ pcc2 RPMS]# pwd
/mnt/cdrom/RedHat/RPMS
[root ＠ pcc2 RPMS]# ls -l ,
-r--r--r--    1 root     root        33025 Sep  4  2002 ,
[root ＠ pcc2 RPMS]# file ,
,: RPM v3 bin i386 portmap-4.0-41

Its contents match portmap-4.0-41.i386.rpm,
so I think it is garbage. Is that right?
--
PFU Limited, 2nd Systems Division, Linux Systems Department
KATAYAMA Yoshio
Tel 044-520-6617  Fax 044-556-1022


From hasebe ＠ civil.cst.nihon-u.ac.jp  Fri May  9 21:05:36 2003
From: hasebe ＠ civil.cst.nihon-u.ac.jp (Hiroshi Hasebe)
Date: Fri, 9 May 2003 21:05:36 +0900
Subject: [SCore-users-jp] Error when installing a compute host with EIT
Message-ID: <000901c31623$53266f00$0569a8c0@kamukaze>

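Hello everyone, this is my first post.
My name is Hasebe, from Nihon University.

Our laboratory is currently building a PC cluster
using RedHat 7.3 and SCore 5.4.

We have been following the installation manual,
but when installing the first compute host with EIT,
the following error messages appear
and the installation stops.

(first part omitted)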
No dhcp_server specified. Used Broadcast
SIOCSIFADDR: No such device
Try it again
SIOCSIFADDR: No such device
Try it again
SIOCSIFADDR: No such device
Try it again
Configure Network fails
done
NFS mount : /mnt/runtime
Cannot mount
exiting
See the documentation for this trouble
(end)

The machine configuration is:

CPU: Pentium III 733 MHz
Memory: 256 MB
HDD: 20 GB
NIC: Corega FastEtherII PCI-TX (100Base)

From reading the archive,
the cause seems to be that the NIC device is not on the boot floppy.
Is that right?

If there is anything else I should check,
please let me know.


I am sorry to ask, since several similar questions have been posted before,
but I could not quite understand how they differ from our error.

I would appreciate any advice.

================================
Hiroshi Hasebe
  Department of Civil Engineering, College of Science and Technology, Nihon University
  1-8-14 Kanda-Surugadai, Chiyoda-ku, Tokyo 101-8308, Japan
  TEL/FAX: 03-3259-0411
  Email: hasebe ＠ civil.cst.nihon-u.ac.jp
================================


From kameyama ＠ pccluster.org  Mon May 12 16:52:23 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Mon, 12 May 2003 16:52:23 +0900
Subject: [SCore-users-jp] Garbage?
In-Reply-To: Your message of "Fri, 09 May 2003 17:22:12 JST."
             <200305090822.AA16332@flash.tokyo.pfu.co.jp>
Message-ID: <20030512075223.BEBA320068@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200305090822.AA16332 ＠ flash.tokyo.pfu.co.jp> KATAYAMA Yoshio <kate ＠ pfu.fujitsu.com> wrote:
> [root ＠ pcc2 RPMS]# ls -l ,
> -r--r--r--    1 root     root        33025 Sep  4  2002 ,
> [root ＠ pcc2 RPMS]# file ,
> ,: RPM v3 bin i386 portmap-4.0-41
> 
> I found this file. Its contents match portmap-4.0-41.i386.rpm,
> so I think it is garbage. Is that right?

Sorry, it is garbage.
Please ignore it.

                       from Kameyama Toyohisa


From kameyama ＠ pccluster.org  Mon May 12 17:57:14 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Mon, 12 May 2003 17:57:14 +0900
Subject: [SCore-users-jp] Error when installing a compute host with EIT
In-Reply-To: Your message of "Fri, 09 May 2003 21:05:36 JST."
             <000901c31623$53266f00$0569a8c0@kamukaze>
Message-ID: <20030512085714.9E1712006C@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <000901c31623$53266f00$0569a8c0 ＠ kamukaze> "Hiroshi Hasebe" <hasebe ＠ civil.cst.nihon-u.ac.jp> wrote:
> From reading the archive,
> the cause seems to be that the NIC device is not on the boot floppy.
> Is that right?

I think either the installer cannot find the device driver,
or the device driver is not included.
(The driver is probably 8139too, I think...)

Just to confirm: at
       Select Boot Network Device
you selected
   100 Mbps Ethernet
didn't you?


On a host in that state, pressing ALT-CTRL-F3 shows which devices
the installer is trying to load.
If no network device driver seems to be loaded there,
the problem is probably in the file inside the initrd that maps
PCI device IDs to device drivers.
ALT-CTRL-F4 shows the kernel log.
If ALT-CTRL-F3 shows that the device driver was recognized but failed to load,
the error should appear there.

First, please look at these two outputs.

                       from Kameyama Toyohisa


From kate ＠ pfu.fujitsu.com  Mon May 12 18:56:03 2003
From: kate ＠ pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Mon, 12 May 2003 18:56:03 +0900
Subject: [SCore-users-jp] Garbage?
In-Reply-To: Your message of Mon, 12 May 2003 16:52:23 +0900.
             <20030512075223.BEBA320068@neal.il.is.s.u-tokyo.ac.jp> 
Message-ID: <200305120956.AA18849@flash.tokyo.pfu.co.jp>

This is Katayama from PFU.

Date: Mon, 12 May 2003 16:52:23 +0900
From: kameyama ＠ pccluster.org

>> I found this file. Its contents match portmap-4.0-41.i386.rpm,
>> so I think it is garbage. Is that right?

>Sorry, it is garbage.
>Please ignore it.

Understood. Thank you very much.
--
PFU Limited, 2nd Systems Division, Linux Systems Department
KATAYAMA Yoshio
Tel 044-520-6617  Fax 044-556-1022


From ulrich.oelmann ＠ iws.uni-stuttgart.de  Tue May 13 19:04:47 2003
From: ulrich.oelmann ＠ iws.uni-stuttgart.de (Ulrich Oelmann)
Date: Tue, 13 May 2003 12:04:47 +0200
Subject: [SCore-users-jp] [SCore-users] Information provided by 'sctop'
Message-ID: <20030513120445.N21466@wum.bauingenieure.uni-stuttgart.de>

Hi there,

does anybody know where I can find more information concerning the
output of the 'sctop' command other than its man-page? I am interested
in the interpretation of the column titled "Resource" and what the
data in the "Memory" and "Disk" columns exactly mean. Is there a
paper where these details are explained? Or should I take a look at
the sources of 'scbcast'?

Best regards
Ulrich Oelmann


From jurij.jerman ＠ gov.si  Tue May 13 21:37:43 2003
From: jurij.jerman ＠ gov.si (Jure Jerman)
Date: Tue, 13 May 2003 14:37:43 +0200
Subject: [SCore-users-jp] [SCore-users] Adding an additional computing node
Message-ID: <3EC0E717.2030505@gov.si>

Hi,

some time ago we posted a question about adding an additional SCore compute node.

This time we followed a clean procedure: the additional node was installed from scratch with
RedHat 7.2, and the SCore RPMs were installed via the bininstall -comp command.

The scout tests work fine, and the pm tests work fine when this host
is included in a group, but the SCore tests fail.

For instance, in single-user mode the hello program fails
with the message:

scrun -nodes=28 ./hello
FEP: Unable to connect with SCore-D (tuba0)
FEP:WARNING checkpoint option is ignored in single-user mode.
SCore-D 5.4.0 connected.
<13> SCORE: Program signaled (SIGSEGV).

or another time, the same hello test:

scrun -nodes=28 ./hello
FEP: Unable to connect with SCore-D (tuba0)
FEP:WARNING checkpoint option is ignored in single-user mode.
SCore-D 5.4.0 connected.
<13:1> SCORE:ERROR pmAttachContext(type=ethernet,fd=9)=135088608
<13:0> SCORE:ERROR pmAttachContext(type=ethernet,fd=9)=135088608
<0:0> SCORE: 28 nodes (14x2) ready.hello, world (from node 11)
hello, world (from node 2)
....

As one would guess, 13 is the number of the newly added host :-)
Does anyone have any idea what could be wrong?

Thank you very much in advance,

Jure Jerman



From kameyama ＠ pccluster.org  Wed May 14 10:14:35 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Wed, 14 May 2003 10:14:35 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Adding an additional computing node
In-Reply-To: Your message of "Tue, 13 May 2003 14:37:43 JST."
             <3EC0E717.2030505@gov.si>
Message-ID: <20030514011435.6934D2006C@neal.il.is.s.u-tokyo.ac.jp>

In article <3EC0E717.2030505 ＠ gov.si> Jure Jerman <jurij.jerman ＠ gov.si> wrote:
> scrun -nodes=28 ./hello
> FEP: Unable to connect with SCore-D (tuba0)
> FEP:WARNING checkpoint option is ignored in single-user mode.
> SCore-D 5.4.0 connected.
> <13:1> SCORE:ERROR pmAttachContext(type=ethernet,fd=9)=135088608
> <13:0> SCORE:ERROR pmAttachContext(type=ethernet,fd=9)=135088608
> <0:0> SCORE: 28 nodes (14x2) ready.hello, world (from node 11)

Please set the PM_DEBUG environment variable to 1 and retry scrun to get more
information.

                       from Kameyama Toyohisa


From jure.jerman ＠ rzs-hm.si  Wed May 14 14:24:59 2003
From: jure.jerman ＠ rzs-hm.si (jure.jerman ＠ rzs-hm.si)
Date: Wed, 14 May 2003 07:24:59 +0200 (CEST)
Subject: [SCore-users-jp] Re: [SCore-users] Adding an additional computing node
In-Reply-To: <20030514011435.6934D2006C@neal.il.is.s.u-tokyo.ac.jp>
References: <20030514011435.6934D2006C@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <1052889899.3ec1d32baf153@webmail.xenya.si>

Hi,

I got the pmAttachContext error just once and have not been able to
reproduce it since.

If I export PM_DEBUG=1, I just get

[root ＠ tuba0 jure]# export PM_DEBUG=1
[root ＠ tuba0 jure]# scrun -nodes=28 ./hello
FEP: Unable to connect with SCore-D (tuba0)
FEP:WARNING checkpoint option is ignored in single-user mode.
SCore-D 5.4.0 connected.
<13> SCORE: Program signaled (SIGILL).

I believe that is quite normal, since the pm tests
run through without any problem.

I simply have no idea where to look for a solution to
the problem.

Jure
Quoting kameyama ＠ pccluster.org:

> In article <3EC0E717.2030505 ＠ gov.si> Jure Jerman <jurij.jerman ＠ gov.si>
> wrotes:
> > scrun -nodes=28 ./hello
> > FEP: Unable to connect with SCore-D (tuba0)
> > FEP:WARNING checkpoint option is ignored in single-user mode.
> > SCore-D 5.4.0 connected.
> > <13:1> SCORE:ERROR pmAttachContext(type=ethernet,fd=9)=135088608
> > <13:0> SCORE:ERROR pmAttachContext(type=ethernet,fd=9)=135088608
> > <0:0> SCORE: 28 nodes (14x2) ready.hello, world (from node 11)
> 
> Please set PM_DEBUG environment variable to 1 and retry scrun to get
> more
> information.
> 
>                        from Kameyama Toyohisa


From kameyama ＠ pccluster.org  Wed May 14 14:46:01 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Wed, 14 May 2003 14:46:01 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Adding an additional computing node
In-Reply-To: Your message of "Wed, 14 May 2003 07:24:59 JST."
             <1052889899.3ec1d32baf153@webmail.xenya.si>
Message-ID: <20030514054601.7304220074@neal.il.is.s.u-tokyo.ac.jp>

In article <1052889899.3ec1d32baf153 ＠ webmail.xenya.si> jure.jerman ＠ rzs-hm.si wrote:
> I got the pmAtachContext error just once and I was not able to
> reproduce it anymore.
> 
> If I export PM_DEBUG=1, I just get
> 
> [root ＠ tuba0 jure]# export PM_DEBUG=1
> [root ＠ tuba0 jure]# scrun -nodes=28 ./hello
> FEP: Unable to connect with SCore-D (tuba0)
> FEP:WARNING checkpoint option is ignored in single-user mode.
> SCore-D 5.4.0 connected.
> <13> SCORE: Program signaled (SIGILL).

All scored binaries (/opt/score/deploy/bin.*/scored*.exe) must be the same.
Is SCore 5.4 installed on your whole cluster?

To check the SCore version on all compute hosts,
please issue:
    % scout cat /opt/score/etc/version
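
The version file only shows the release number, so comparing checksums of the binaries themselves can be a useful extra check. The sketch below is an illustration under assumptions: report_mismatches is a made-up name, and it expects lines in the "checksum  path" form that md5sum prints, collected from every compute host by whatever means is available (it does not rely on how scout handles the bin.* pattern). It prints each path that was seen with more than one distinct checksum.

```shell
#!/bin/sh
# Hypothetical helper: read "checksum path" lines on standard input
# (the format printed by md5sum) and print every path that appears
# with more than one distinct checksum.
report_mismatches() {
    awk '
        {
            key = $2 SUBSEP $1
            if (!(key in seen)) {   # count each (path, checksum) pair once
                seen[key] = 1
                distinct[$2]++
            }
        }
        END {
            for (path in distinct)
                if (distinct[path] > 1)
                    print path
        }
    ' | sort
}
```

Empty output means every host reported the same checksum for each path.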

                       from Kameyama Toyohisa


From m-kawaguchi ＠ pst.fujitsu.com  Wed May 14 17:57:47 2003
From: m-kawaguchi ＠ pst.fujitsu.com (Mitsugu Kawaguchi)
Date: Wed, 14 May 2003 17:57:47 +0900
Subject: [SCore-users-jp] Kernel panic with rcstest
Message-ID: <20030514175747.5e6de26b.m-kawaguchi@pst.fujitsu.com>

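This is Kawaguchi from Fujitsu Prime Software Technologies.
Thank you for your continued support.

We are currently seeing a kernel panic when running rcstest.
The environment is as follows:

 - SCore 5.0.1
 - Based on kernel 2.4.18-3
 - The management node also runs the compute-node kernel
   (because the management node is also used as a compute node)

When rcstest is run on the management node against the compute node (box00),
box00 panics.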

# rcstest box00 ethernet -v -timeout 10
ethernet_open_device(): -config /var/scored/scoreboard/paradox.0000V300EiDY
pmEthernetOpenDevice: Library version
  $Id: pm_ethernet.c,v 1.64 2002/03/04 09:44:42 s-sumi Exp $
pmEthernetReadConfig(0x83dafe8, unit, 0): set unit number "0" (MAX: 4).
pmEthernetReadConfig(0x83dafe8, maxnsend, 16): set maxnsend "16".
pmEthernetReadConfig(0x83dafe8, backoff, 4800): set backoff "4800" usec.
pmEthernetReadConfig(0x83dafe8, checksum, 0): set checksum "0" off.
pmEthernetOpenDevice("/var/scored/scoreboard/paradox.0000V300EiDY", 0xbffff894): pmEthernetMapEthernet(0, 0xbffff5d8): 0
Ethernet(0): fd=512
self box00.test.domain n 0 of 9 nodes 
pm_ethernetCalibrateTimer(): loop t:1.613887e+07, vt: 1.867100e-02
pm_ethernetCalibrateTimer(): loop t:1.723504e+07, vt: 1.993800e-02
pm_ethernetCalibrateTimer(): end loop t:1.723504e+07, vt: 1.993800e-02
pm_ethernetCalibrateTimer(): d0:8.643818e+08, d1:8.644319e+08
pm_ethernetCalibrateTimer(): clk:864, clock 8.644068e+02
pmEthernetOpenDevice: Driver version
  $Id: pm_ethernet_dev.c,v 1.1.1.1 2002/08/01 07:47:11 kameyama Exp $
ethernet_open_device(): success
 [0] pmEthernetCloseDevice(0x83db028): called
 starting master 0 : pe=9
starting slave:  3 2 6 7 1 5 4 8.
testing*..**..*.*.**.*.*.*..*.*.*.*.*.*.*.

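At this point, box00 panics.
The message at that time:

<0> Kernel panic : Aiee,killing interrupt handler:
In interrupt handler -not syncing

Note that scstest and the other tests work normally.

Also, I do not know whether it is related, but
when I try to copy a large file (50 MB in that case) from the management node
to each compute node with the rcp-all command, the management node hangs.
(This does not happen with small files.)

Is this a problem in SCore 5.0.1?
Thank you in advance.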

-- 
Kawaguchi ==>  m-kawaguchi ＠ pst.fujitsu.com


From kameyama ＠ pccluster.org  Wed May 14 18:28:50 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Wed, 14 May 2003 18:28:50 +0900
Subject: [SCore-users-jp] Kernel panic with rcstest
In-Reply-To: Your message of "Wed, 14 May 2003 17:57:47 JST."
             <20030514175747.5e6de26b.m-kawaguchi@pst.fujitsu.com>
Message-ID: <20030514092851.02A1F20074@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <20030514175747.5e6de26b.m-kawaguchi ＠ pst.fujitsu.com> Mitsugu Kawaguchi <m-kawaguchi ＠ pst.fujitsu.com> wrote:
> We are currently seeing a kernel panic when running rcstest.
> The environment is as follows:
> 
>  - SCore 5.0.1
>  - Based on kernel 2.4.18-3
>  - The management node also runs the compute-node kernel
>    (because the management node is also used as a compute node)

What NIC does the management node have?

> At this point, box00 panics.
> The message at that time:
> 
> <0> Kernel panic : Aiee,killing interrupt handler:
> In interrupt handler -not syncing

It looks like an exit() was attempted while the kernel was handling an interrupt.
The NIC driver may also be involved.

                       from Kameyama Toyohisa


From jure.jerman ＠ rzs-hm.si  Wed May 14 18:58:37 2003
From: jure.jerman ＠ rzs-hm.si (Jure Jerman)
Date: Wed, 14 May 2003 11:58:37 +0200
Subject: [SCore-users-jp] Re: [SCore-users] Adding an additional computing node
References: <20030514054601.7304220074@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <3EC2134D.8000105@rzs-hm.si>

Hi,

I checked, the score version is 5.4.0 and
binaries in /opt/score/deploy/* are the same
everywhere.

The behaviour of node number 13 is getting stranger.
After rebooting SCore I was able to run the hello
program, but not always.

The output is now:

[root ＠ tuba0 root]# scrun -nodes=28 /home/jure/hello
SCore-D 5.4.0 connected (jid=7).
_pmEthernetAttachContext(9, 0x40017ebc): _pmEthernetMapContext(9, 0x40017ebc): 135088608
<13:0> SCORE:ERROR _pmEthernetAttachContext(9, 0x40017ebc): _pmEthernetMapContext(9, 0x40017ebc): 135088608
pmAttachContext(type=ethernet,fd=9)=135088608
<13:1> SCORE:ERROR pmAttachContext(type=ethernet,fd=9)=135088608
<0:0> SCORE: 28 nodes (14x2) ready.
hello, world (from node 12)
hello, world (from node 10)
hello, world (from node 14)
hello, world (from node 4)
hello, world (from node 23)
hello, world (from node 2)
hello, world (from node 8)
hello, world (from node 6)
hello, world (from node 15)
hello, world (from node 21)
hello, world (from node 22)
hello, world (from node 13)
hello, world (from node 19)
hello, world (from node 17)
hello, world (from node 18)
hello, world (from node 16)
hello, world (from node 20)
hello, world (from node 11)
hello, world (from node 5)
hello, world (from node 3)
hello, world (from node 9)
hello, world (from node 7)
hello, world (from node 1)
hello, world (from node 25)
hello, world (from node 24)
hello, world (from node 0)
[root ＠ tuba0 root]# scrun -nodes=28 /home/jure/hello
SCore-D 5.4.0 connected (jid=8).
<13> SCORE: Program signaled (SIGSEGV).


Note that the behaviour is quite random: it can even happen that
hello runs without any problem, which is not the case with cpi,
which fails every time.

On the other hand, the pm tests run without any problem.

Any additional ideas about where to look would be much appreciated.

Thanks, Jure

kameyama ＠ pccluster.org wrote:
> In article <1052889899.3ec1d32baf153 ＠ webmail.xenya.si> jure.jerman ＠ rzs-hm.si wrotes:
> 
>>I got the pmAtachContext error just once and I was not able to
>>reproduce it anymore.
>>
>>If I export PM_DEBUG=1, I just get
>>
>>[root ＠ tuba0 jure]# export PM_DEBUG=1
>>[root ＠ tuba0 jure]# scrun -nodes=28 ./hello
>>FEP: Unable to connect with SCore-D (tuba0)
>>FEP:WARNING checkpoint option is ignored in single-user mode.
>>SCore-D 5.4.0 connected.
>><13> SCORE: Program signaled (SIGILL).
> 
> 
> All scored binary (/opt/score/deploy/bin.*/scored*.exe) must be same.
> Your cluster installed SCore 5.4?
> 
> If you want to check score version on all compute hosts.
> Please issue:
>     % scout cat /opt/score/etc/version
> 
>                        from Kameyama Toyohisa


-- 
--------------------------------------------------------------
Jure Jerman                       Email: jure.jerman ＠ rzs-hm.si
Environmental Agency of Slovenia  tel:   xx 386 1 478 41 43
Meteorological office             fax:   xx 386 1 478 40 54
Vojkova 1b
SI-1001 Ljubljana
SLOVENIA
--------------------------------------------------------------



From m-kawaguchi ＠ pst.fujitsu.com  Wed May 14 19:01:47 2003
From: m-kawaguchi ＠ pst.fujitsu.com (Mitsugu Kawaguchi)
Date: Wed, 14 May 2003 19:01:47 +0900
Subject: [SCore-users-jp] Kernel panic with rcstest
In-Reply-To: <20030514092851.02A1F20074@neal.il.is.s.u-tokyo.ac.jp>
References: <20030514175747.5e6de26b.m-kawaguchi@pst.fujitsu.com>
	<20030514092851.02A1F20074@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <20030514190147.4f6e8d67.m-kawaguchi@pst.fujitsu.com>

Dear Kameyama-san,

This is Kawaguchi from Fujitsu Prime Software Technologies.
Thank you for your continued support.

On Wed, 14 May 2003 18:28:50 +0900
kameyama ＠ pccluster.org wrote:

> This is Kameyama.
> 
> In article <20030514175747.5e6de26b.m-kawaguchi ＠ pst.fujitsu.com> Mitsugu Kawaguchi <m-kawaguchi ＠ pst.fujitsu.com> wrote:
> > We are currently seeing a kernel panic when running rcstest.
> > The environment is as follows:
> > 
> >  - SCore 5.0.1
> >  - Based on kernel 2.4.18-3
> >  - The management node also runs the compute-node kernel
> >    (because the management node is also used as a compute node)
> 
> What NIC does the management node have?

The NIC is onboard (Fujitsu PRIMERGY BX300).
Here is an excerpt from /proc/pci:
 BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet(rev21)


> > At this point, box00 panics.
> > The message at that time:
> > 
> > <0> Kernel panic : Aiee,killing interrupt handler:
> > In interrupt handler -not syncing
> 
> It looks like an exit() was attempted while the kernel was handling an interrupt.
> The NIC driver may also be involved.

We are using the bcm5700.o driver.
Thank you in advance.

> 
>                        from Kameyama Toyohisa


-- 
Kawaguchi ==> m-kawaguchi ＠ pst.fujitsu.com


From kameyama ＠ pccluster.org  Wed May 14 19:26:02 2003
From: kameyama ＠ pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=)
Date: Wed, 14 May 2003 19:26:02 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Adding an additional computing node
In-Reply-To: Your message of "Wed, 14 May 2003 11:58:37 JST."
             <3EC2134D.8000105@rzs-hm.si>
Message-ID: <20030514102602.2388F20068@neal.il.is.s.u-tokyo.ac.jp>

In article <3EC2134D.8000105 ＠ rzs-hm.si> Jure Jerman <jure.jerman ＠ rzs-hm.si> wrote:
> On the other hand, pm tests run without any problem.

Did you try the pm tests with node 13?
For example:
    % rpmtest node13_hostname ethernet -reply
And in another terminal:
    % rpmtest node0_hostname ethernet -dest 13 -ping

                       from Kameyama Toyohisa


From jure.jerman ＠ rzs-hm.si  Wed May 14 19:58:38 2003
From: jure.jerman ＠ rzs-hm.si (Jure Jerman)
Date: Wed, 14 May 2003 12:58:38 +0200
Subject: [SCore-users-jp] Re: [SCore-users] Adding an additional computing node
References: <20030514102602.2388F20068@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <3EC2215E.5040508@rzs-hm.si>

Hi,

all pm tests are running:

[root ＠ tuba0 sbin]# ./rpmtest tuba0 gigaethernet -dest 13 -ping
8       4.68597e-05

The same goes for the burst test. I would say that pm is not the problem,
but rather something connected with score.

I have no idea how to debug scored.


Jure


kameyama ＠ pccluster.org wrote:
> In article <3EC2134D.8000105 ＠ rzs-hm.si> Jure Jerman <jure.jerman ＠ rzs-hm.si> wrote:
> 
>>On the other hand, pm tests run without any problem.
> 
> 
> Did you try the pm tests with node 13?
> For example:
>     % rpmtest node13_hostname ethernet -reply
> And in another terminal:
>     % rpmtest node0_hostname ethernet -dest 13 -ping
> 
>                        from Kameyama Toyohisa
> 
> 




From kameyama ＠ pccluster.org  Wed May 14 20:24:23 2003
From: kameyama ＠ pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=)
Date: Wed, 14 May 2003 20:24:23 +0900
Subject: [SCore-users-jp] rcstestでカーネルパニック
In-Reply-To: Your message of "Wed, 14 May 2003 19:01:47 JST."
             <20030514190147.4f6e8d67.m-kawaguchi@pst.fujitsu.com>
Message-ID: <20030514112423.7235B2006C@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <20030514190147.4f6e8d67.m-kawaguchi ＠ pst.fujitsu.com> Mitsugu Kawaguchi <m-kawaguchi ＠ pst.fujitsu.com> wrote:
> > What NIC does the management node have?
> 
> The NIC is onboard (Fujitsu PRIMERGY BX300).
> Below is an excerpt from /proc/pci:
>  BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet(rev21)

(snip)

> The driver in use is bcm5700.o.

The bcm5700 driver in SCore 5.0.1 is version 2.0.28.
The one in SCore 5.2.0 is 2.2.27.
The release notes at
    http://support.3com.com/infodeli/tools/nic/linux/linux996release505.txt
contain the entry

   v2.2.19 (04/10/02)

- Fixed a panic problem on 5700 under heavy traffic on certain machines.

so you may have hit this bug.
Updating the bcm5700 driver may make it work.
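
As a rough illustration of the version reasoning in this message, the sketch below compares dotted driver version strings against the v2.2.19 release whose notes mention the panic fix. The function names are hypothetical, and it assumes that bcm5700 versions order numerically field by field and that later releases keep the fix; the release notes quoted here do not confirm either point.

```python
def parse_version(text):
    """Split a dotted version such as "v2.2.19" into a tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Release whose notes mention the panic fix (taken from this message).
FIX_RELEASE = parse_version("2.2.19")


def includes_panic_fix(driver_version):
    """Assumption: any release at or after 2.2.19 still carries the fix."""
    return parse_version(driver_version) >= FIX_RELEASE


if __name__ == "__main__":
    # Driver versions named in this message.
    for version in ("2.0.28", "2.2.27"):
        print(version, includes_panic_fix(version))
```

Comparing tuples of integers rather than raw strings avoids the trap where "2.2.9" would sort after "2.2.19" as text.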

                       from Kameyama Toyohisa


From nrcb ＠ streamline-computing.com  Thu May 15 06:43:32 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Wed, 14 May 2003 22:43:32 +0100
Subject: [SCore-users-jp] [SCore-users] myrinet2k hardware failure  ??
Message-ID: <200305142243.32561.nrcb@streamline-computing.com>

System : RedHat Linux, Score 5.0.1, myrinet2k network

I presume this means we have a hardware failure:

[nrcb ＠ snowdon sbin]$ export PM_DEBUG=2
[nrcb ＠ snowdon sbin]$ ./rpmtest comp058 myrinet2k -dest 58 -ping
myriMapContext(1, 513, 0xbffff8fc): mmap((nil), 14000, 3, 1, 201, 0): Sys: Bad address(14)
myriCreateContext(0x83f66e0, 0, 0xbffffa38): myriMapContext(1, 513, 0xbffff8fc): Bad address(14)
myri_open_context(0x83f66e0, 0, 0x80adc98): myriOpenContext(0x83f66e0, 0, 0xbffffa38): Bad address(14)
pmOpenContext: Bad address(14)

?

I have rebooted the compute node comp058 and still get the same error.

The other 127 nodes seem to work ok.

Thanks,

Nick




From kameyama ＠ pccluster.org  Thu May 15 09:40:09 2003
From: kameyama ＠ pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=)
Date: Thu, 15 May 2003 09:40:09 +0900
Subject: [SCore-users-jp] Re: [SCore-users] myrinet2k hardware failure ??
In-Reply-To: Your message of "Wed, 14 May 2003 22:43:32 JST."
             <200305142243.32561.nrcb@streamline-computing.com>
Message-ID: <20030515004010.0010620068@neal.il.is.s.u-tokyo.ac.jp>

In article <200305142243.32561.nrcb ＠ streamline-computing.com> Nick Birkett <nrcb ＠ streamline-computing.com> wrote:
> System : RedHat Linux, Score 5.0.1, myrinet2k network
> 
> I presume this means we have a hardware failure:

I think so...

> [nrcb ＠ snowdon sbin]$ export PM_DEBUG=2
> [nrcb ＠ snowdon sbin]$ ./rpmtest comp058 myrinet2k -dest 58 -ping
> myriMapContext(1, 513, 0xbffff8fc): mmap((nil), 14000, 3, 1, 201, 0): Sys: Bad address(14)
> myriCreateContext(0x83f66e0, 0, 0xbffffa38): myriMapContext(1, 513, 0xbffff8fc): Bad address(14)
> myri_open_context(0x83f66e0, 0, 0x80adc98): myriOpenContext(0x83f66e0, 0, 0xbffffa38): Bad address(14)
> pmOpenContext: Bad address(14)

Please check the kernel log on that node.
In the log, you should find a message like this:
    myir_paddr: offset=xxx
Probably xxx is greater than the Myrinet SRAM size.
(To find the SRAM size, look at
    /proc/pm/myrinet/0/info
on the node.)
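
A small sketch of that check, under assumptions: the log line is taken to look like `myir_paddr: offset=<number>` as written in this message, with the number in hex or decimal, and the SRAM size is passed in by hand because the layout of /proc/pm/myrinet/0/info is not shown here. The helper names are made up for illustration.

```python
import re

# Pattern for the kernel log line quoted in this message; the numeric
# format (hex with 0x prefix, or plain decimal) is an assumption.
OFFSET_RE = re.compile(r"myir_paddr:\s*offset=(0x[0-9a-fA-F]+|\d+)")


def to_int(text):
    """Parse a hex (0x...) or decimal number, tolerating leading zeros."""
    return int(text, 16) if text.lower().startswith("0x") else int(text, 10)


def offsets_beyond_sram(log_text, sram_size):
    """Return every logged offset that is greater than sram_size (bytes)."""
    offsets = [to_int(value) for value in OFFSET_RE.findall(log_text)]
    return [offset for offset in offsets if offset > sram_size]


if __name__ == "__main__":
    sample = "myir_paddr: offset=0x400000\nmyir_paddr: offset=4096\n"
    # 2 MiB is a made-up SRAM size used only for this demonstration.
    print(offsets_beyond_sram(sample, 2 * 1024 * 1024))
```

Feeding it the output of dmesg on the failing node, together with the SRAM size read from the info file, would show whether any mapping request falls outside the card's memory.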

                       from Kameyama Toyohisa


From hasebe ＠ civil.cst.nihon-u.ac.jp  Mon May 19 11:33:27 2003
From: hasebe ＠ civil.cst.nihon-u.ac.jp (Hiroshi Hasebe)
Date: Mon, 19 May 2003 11:33:27 +0900
Subject: [SCore-users-jp] EITによる計算ホストインストール時のエラー
References: <20030512085714.9E1712006C@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <000901c31daf$0f734240$0569a8c0@kamukaze>

Dear Kameyama-san,

Thank you for your help.
This is Hasebe at Nihon University.


> Just to confirm: at
>        Select Boot Network Device
> you selected
>    100 Mbps Ethernet
> didn't you?

To make sure, I selected "100 Mbps Ethernet" and
ran the installation again,
but the result was the same.


> On a host in that state, pressing ALT-CNTL-F3 shows which devices
> the installer is trying to load.
> If no network device driver is being loaded there,
> the problem is probably in the file inside the initrd that maps
> PCI device IDs to device drivers.

The ALT-CNTL-F3 output is shown below.

*  probing  buses
*  finished  bus  probing
*  found  nothing
*  writing  /tmp/modules.conf
*  going  to  insmod  sunrpc.o  ( path  is  NULL )
*  going  to  insmod  locked.o  ( path  is  NULL )
*  going  to  insmod  nfs.o  ( path  is  NULL )

In other words, am I right in understanding that
the installer cannot find the device?

If so, could you tell me how to deal with it?
Thank you in advance.

================================
Hiroshi Hasebe
  Department of Civil Engineering,
  College of Science and Technology, Nihon University
  1-8-14 Kanda-Surugadai, Chiyoda-ku, Tokyo 101-8308
  TEL/FAX: 03-3259-0411
  Email: hasebe ＠ civil.cst.nihon-u.ac.jp
================================


From kameyama ＠ pccluster.org  Mon May 19 12:31:28 2003
From: kameyama ＠ pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=)
Date: Mon, 19 May 2003 12:31:28 +0900
Subject: [SCore-users-jp] EITによる計算ホストインストール時のエラー
In-Reply-To: Your message of "Mon, 19 May 2003 11:33:27 JST."
             <000901c31daf$0f734240$0569a8c0@kamukaze>
Message-ID: <20030519033128.45E3820074@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <000901c31daf$0f734240$0569a8c0 ＠ kamukaze> "Hiroshi Hasebe" <hasebe ＠ civil.cst.nihon-u.ac.jp> wrote:
> > On a host in that state, pressing ALT-CNTL-F3 shows which devices
> > the installer is trying to load.
> > If no network device driver is being loaded there,
> > the problem is probably in the file inside the initrd that maps
> > PCI device IDs to device drivers.
> 
> The ALT-CNTL-F3 output is shown below.
> 
> *  probing  buses
> *  finished  bus  probing
> *  found  nothing
> *  writing  /tmp/modules.conf
> *  going  to  insmod  sunrpc.o  ( path  is  NULL )
> *  going  to  insmod  locked.o  ( path  is  NULL )
> *  going  to  insmod  nfs.o  ( path  is  NULL )
> 
> In other words, am I right in understanding that
> the installer cannot find the device?

It looks that way.

> If so, could you tell me how to deal with it?
> Thank you in advance.

First, you need to find out the PCI vendor ID and product ID
of that NIC.
Once you know the IDs, edit the pcitable inside the initrd.img
on the floppy.
For how to edit it,
    http://www.pccluster.org/pipermail/score-users/2002-October/000237.html
should be a useful reference.
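
To make the pcitable step a little more concrete, here is a minimal sketch that formats one entry. It assumes the Red Hat style layout of four tab-separated fields (vendor ID, device ID, quoted driver module name, quoted description); the linked post should be treated as the authority on the exact format for this installer. The IDs, driver name, and description below are placeholders, not values for any real NIC.

```python
def pcitable_entry(vendor_id, device_id, driver, description):
    """Format one tab-separated pcitable line (assumed Red Hat layout)."""
    return '0x{:04x}\t0x{:04x}\t"{}"\t"{}"'.format(
        vendor_id, device_id, driver, description
    )


if __name__ == "__main__":
    # Hypothetical IDs and module name, for illustration only.
    print(pcitable_entry(0x1234, 0x5678, "mydriver", "Example Vendor|Example NIC"))
```

The real vendor and product IDs can be read from /proc/pci or lspci on a machine where the card is visible.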

                       from Kameyama Toyohisa


From m-kawaguchi ＠ pst.fujitsu.com  Mon May 19 21:24:22 2003
From: m-kawaguchi ＠ pst.fujitsu.com (Mitsugu Kawaguchi)
Date: Mon, 19 May 2003 21:24:22 +0900
Subject: [SCore-users-jp] rcstestでカーネルパニック
In-Reply-To: <20030514112423.7235B2006C@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <000401c31e01$942b6ba0$58bd220a@Globus>

Dear Kameyama-san,

This is Kawaguchi at Fujitsu Prime Software Technologies.
Thank you for your reply.

> -----Original Message-----
> From: score-users-jp-admin ＠ pccluster.org
> [mailto:score-users-jp-admin ＠ pccluster.org] On Behalf Of
> kameyama ＠ pccluster.org
> Sent: Wednesday, May 14, 2003 8:24 PM
> To: Mitsugu Kawaguchi
> Cc: kameyama ＠ pccluster.org; score-users-jp ＠ pccluster.org
> Subject: Re: [SCore-users-jp] rcstestでカーネルパニック
>
>
> This is Kameyama.
>
> In article
> <20030514190147.4f6e8d67.m-kawaguchi ＠ pst.fujitsu.com> Mitsugu
> Kawaguchi <m-kawaguchi ＠ pst.fujitsu.com> wrote:
> > > What NIC does the management node have?
> >
> > The NIC is onboard (Fujitsu PRIMERGY BX300).
> > Below is an excerpt from /proc/pci:
> >  BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet(rev21)
>
> (snip)
>
> > The driver in use is bcm5700.o.
>
> The bcm5700 driver in SCore 5.0.1 is version 2.0.28.
> The one in SCore 5.2.0 is 2.2.27.
> The release notes at
>     http://support.3com.com/infodeli/tools/nic/linux/linux996release505.txt
> contain the entry
>
>    v2.2.19 (04/10/02)
>
> - Fixed a panic problem on 5700 under heavy traffic on certain machines.
>
> so you may have hit this bug.
> Updating the bcm5700 driver may make it work.

I updated the driver and tried again, but the result was the same
(kernel panic).
The drivers I tried were v2.2.19 and v2.2.27.
However, compared with before the update, it now takes longer to reach
the kernel panic (from a few seconds to roughly 10-20 seconds).

Thank you in advance.

--
Kawaguchi ==>  m-kawaguchi ＠ pst.fujitsu.com


From kameyama ＠ pccluster.org  Mon May 19 21:49:15 2003
From: kameyama ＠ pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=)
Date: Mon, 19 May 2003 21:49:15 +0900
Subject: [SCore-users-jp] rcstestでカーネルパニック
In-Reply-To: Your message of "Mon, 19 May 2003 21:24:22 JST."
             <000401c31e01$942b6ba0$58bd220a@Globus>
Message-ID: <20030519124916.F310220068@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <000401c31e01$942b6ba0$58bd220a ＠ Globus> "Mitsugu Kawaguchi" <m-kawaguchi ＠ pst.fujitsu.com> wrote:
> > > The driver in use is bcm5700.o.
> >
> > The bcm5700 driver in SCore 5.0.1 is version 2.0.28.
> > The one in SCore 5.2.0 is 2.2.27.

I was mistaken.

SCore 5.2.0 has 2.2.15 (specifically, the one included in the
ia64 020508 patch).

> > - Fixed a panic problem on 5700 under heavy traffic on certain machines.
> >
> > so you may have hit this bug.
> > Updating the bcm5700 driver may make it work.
> 
> I updated the driver and tried again, but the result was the same
> (kernel panic).
> The drivers I tried were v2.2.19 and v2.2.27.

What happens with the latest, 5.0.5?

                       from Kameyama Toyohisa


From m-kawaguchi ＠ pst.fujitsu.com  Tue May 20 22:01:40 2003
From: m-kawaguchi ＠ pst.fujitsu.com (Mitsugu Kawaguchi)
Date: Tue, 20 May 2003 22:01:40 +0900
Subject: [SCore-users-jp] rcstestでカーネルパニック
In-Reply-To: <20030519124916.F310220068@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <001101c31ecf$f55926f0$58bd220a@Globus>

Dear Kameyama-san,

This is Kawaguchi at Fujitsu Prime Software Technologies.
Thank you for your reply.

I also checked with v5.0.5, but the result was again the same
(on the same level as v2.2.19 and v2.2.27).

> -----Original Message-----
> From: score-users-jp-admin ＠ pccluster.org
> [mailto:score-users-jp-admin ＠ pccluster.org] On Behalf Of
> kameyama ＠ pccluster.org
> Sent: Monday, May 19, 2003 9:49 PM
> To: Mitsugu Kawaguchi
> Cc: kameyama ＠ pccluster.org; score-users-jp ＠ pccluster.org
> Subject: Re: RE: [SCore-users-jp] rcstestでカーネルパニック
>
>
> This is Kameyama.
>
> In article <000401c31e01$942b6ba0$58bd220a ＠ Globus> "Mitsugu
> Kawaguchi" <m-kawaguchi ＠ pst.fujitsu.com> wrote:
> > > > The driver in use is bcm5700.o.
> > >
> > > The bcm5700 driver in SCore 5.0.1 is version 2.0.28.
> > > The one in SCore 5.2.0 is 2.2.27.
>
> I was mistaken.
>
> SCore 5.2.0 has 2.2.15 (specifically, the one included in the
> ia64 020508 patch).
>
> > > - Fixed a panic problem on 5700 under heavy traffic on certain machines.
> > >
> > > so you may have hit this bug.
> > > Updating the bcm5700 driver may make it work.
> >
> > I updated the driver and tried again, but the result was the same
> > (kernel panic).
> > The drivers I tried were v2.2.19 and v2.2.27.
>
> What happens with the latest, 5.0.5?
>
>                        from Kameyama Toyohisa
> _______________________________________________
> SCore-users-jp mailing list
> SCore-users-jp ＠ pccluster.org
> http://www.pccluster.org/mailman/listinfo/score-users-jp

Thank you in advance.

--
Kawaguchi ==>  m-kawaguchi ＠ pst.fujitsu.com


From nrcb ＠ streamline-computing.com  Tue May 27 04:41:58 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Mon, 26 May 2003 20:41:58 +0100
Subject: [SCore-users-jp] [SCore-users] some errors compiling on ia64
Message-ID: <200305262041.58414.nrcb@streamline-computing.com>

I am compiling Score for IA64 (Itanium2) .

I have setup the IA64 Score kernel 2.4.19-1SCORE and compiled it from source.

I am using Red Hat Linux Advanced Server release 2.1AS (Derry).

Score configure does not recognise this as a redhat7 system, so I have tried to configure it
as ia64-redhat-linux2_4.

Most of the source code builds fine except I don't have javac installed:

make[1]: *** [tlogview.jar] Error 127
make[1]: *** [mpstat.jar] Error 127
make[1]: *** [exc.jar] Error 127
make[1]: *** [ompsm_scash.o] Error 4
make[1]: *** [tlogview.jar] Error 127
make[1]: *** [mpstat.jar] Error 127


I hope this is not important ?

Next scash fails to compile:

/usr/src/linux-2.4/include/asm/page.h(26): catastrophic error: #error directive: Unsupported page size!^M
  # error Unsupported page size!

Again I am not too concerned about this yet as we are running mpi.

When installing I get this error:

[root ＠ tiger4-2 score-src]# /opt/score/install/setup -score_server
/opt/score/install/setup: /opt/score/install/bin.ia64-redhat-linux2_4/setup: No such file or directory
/opt/score/install/setup: exec: /opt/score/install/bin.ia64-redhat-linux2_4/setup: cannot execute: No such file or directory
[root ＠ tiger4-2 score-src]#

I am a bit worried that the above failed. 

The mpi compiler wrappers for gnu and Intel appear to work.

If anyone has Score 5.4 on Itanium 2 working I would appreciate your help.

Many thanks,

Nick


From nrcb ＠ streamline-computing.com  Tue May 27 05:06:57 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Mon, 26 May 2003 21:06:57 +0100
Subject: [SCore-users-jp] [SCore-users] mixed clusters
Message-ID: <200305262106.57885.nrcb@streamline-computing.com>

I know Score supports heterogeneous clusters (mixed architecture).

However is it possible to run a single mpi job across different types
of compute nodes (eg Xeon 32bit, Itanium 64 bit) ?

I assume the answer is no ?

Thanks,

Nick


From hori ＠ swimmy-soft.com  Tue May 27 10:48:33 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Tue, 27 May 2003 10:48:33 +0900
Subject: [SCore-users-jp] Re: [SCore-users] mixed clusters
In-Reply-To: <200305262106.57885.nrcb@streamline-computing.com>
References: <200305262106.57885.nrcb@streamline-computing.com>
Message-ID: <3136877313.hori0001@swimmy-soft.com>

Hi, Nick,

>I know Score supports heterogeneous clusters (mixed architecture).
>
>However is it possible to run a single mpi job across different types
>of compute nodes (eg Xeon 32bit, Itanium 64 bit) ?
>
>I assume the answer is no ?

The answer is yes. SCore is designed for this, and it has run on a
cluster with both alpha and ia32 processors. But as far as I know,
nobody has tried it on a cluster with both ia32 and ia64 processors.

----
Atsushi HORI
Swimmy Software, Inc.



From n-masuda ＠ sp.nas.nec.co.jp  Wed May 28 16:31:31 2003
From: n-masuda ＠ sp.nas.nec.co.jp (増田　尚美)
Date: Wed, 28 May 2003 16:31:31 +0900
Subject: [SCore-users-jp] fork(),execl()を使用したプログラムの並列化
Message-ID: <003301c324eb$2b52ff80$6400a8c0@sein>

This is a multi-part message in MIME format.

------=_NextPart_000_002F_01C32536.985588F0
Content-Type: text/plain;
	charset="iso-2022-jp"
Content-Transfer-Encoding: 7bit

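Dear mailing list members,

This is Masuda.
Thank you for your help.

I am currently investigating SCore as part of
a system design phase.
I apologize if my explanation is inadequate because of my limited
knowledge; thank you for your patience.

[Task]
  Verifying the behavior of various system functions in the SCore environment
    We take C programs written with various system functions,
    parallelize them with MPI and OpenMP, and verify whether they
    also run correctly in the SCore environment.

[Environment]
    SCore        : 5.4.0
    RedHat Linux : 7.3

    One server that is also a compute host
    One compute host
                  ... both single-CPU

[Compilation]
    MPI    : mpicc -o main main.c
    OpenMP : omcc -omniconfig=scash -o main main.c

[Execution environment]
    Multi-user mode

I have some questions about this.

[Question]
  I searched the net and elsewhere for information on parallelizing
  programs of types (1) and (2) below, but found nothing at all.
  Is this uncommon?

    (1) Programs that create child processes (using fork() and execl())
    (2) Programs that create threads (using POSIX threads)

[Verification results]
  I tested both (1) and (2) from [Question].

    (1) Programs that create child processes (using fork() and execl())
      <With MPI>
        When the program tries to create multiple child processes,
        some of the child processes fail to start more than 50% of the time.
        However, there are also cases where they start normally and the
        parallel processing runs as expected.

      <With OpenMP>
        It does not work. For example, running a program whose main routine
        fork()s and execl()s a single child process prints the following
        messages and exits.
        # scrun -nodes=2,scored=comp0.pccluster.org ./main
        SCore-D 5.4.0 connected (jid=25).
        <0:0> SCORE: 2 nodes (2x1) ready.
        <0:0> SCORE: 2 nodes (2x1) ready.
        <0>scash_pa_init:can not create page table file: Operation not permitted[1]
        <1>scash_pa_init:can not create page table file: Operation not permitted[1]

    (2) Programs that create threads (using POSIX threads)
      <With MPI>
        It appears to work normally.
        However, a defunct (zombie) process is created.
        For example, when running a program whose main routine creates one
        thread (pthread_create()), the process state is as follows.

        - Master side
          # ps -ef |grep main
          UID        PID  PPID  C STIME TTY          TIME CMD
          msv       3761  3620  0 14:45 ?        00:00:00 ./main
          msv       3767  3761  0 14:45 ?        00:00:00 [main.1 <defunct>]
          msv       3768  3761  0 14:45 ?        00:00:00 ./main
          msv       3769  3768  0 14:45 ?        00:00:00 ./main

        - Slave side
          # ps -ef |grep main
          UID        PID  PPID  C STIME TTY          TIME CMD
          #501      1184  1161 49 14:49 ?        00:03:07 ./main
          #501      1185  1184  0 14:49 ?        00:00:00 [main.1 <defunct>]
          #501      1186  1184  0 14:49 ?        00:00:00 ./main
          #501      1187  1186 50 14:49 ?        00:03:11 ./main

      <With OpenMP>
        I have not tested this yet.
        In this case, do I need to rewrite the code to use the library
        functions provided by OpenMP?

[What I want to know]
  The following two points:
  - Are (1) and (2) from [Question] guaranteed to work when parallelized
    with MPI or OpenMP?
  - If they are guaranteed, why do the phenomena described in
    [Verification results] occur?
      (Is there a problem with how I incorporated the MPI/OpenMP constructs?)

That is all.
Thank you in advance.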

------=_NextPart_000_002F_01C32536.985588F0--


From nagaoka ＠ de.takuma-ct.ac.jp  Wed May 28 17:00:02 2003
From: nagaoka ＠ de.takuma-ct.ac.jp (長岡　史郎)
Date: Wed, 28 May 2003 17:00:02 +0900
Subject: [SCore-users-jp] PM/Ethernet テストで立ち往生
Message-ID: <200305280800.RAA26308@neptune.de.takuma-ct.ac.jp>

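 TO  : score-users-jp ＠ pccluster.org
 FROM: Shiro Nagaoka
 DATE: Wednesday, May 28, 2003
 RE  : Stuck at the PM/Ethernet test


 My name is Nagaoka, from Takuma National College of Technology. I am stuck at the
 PM/Ethernet test after installing SCORE 5.0.1. There were exchanges about similar issues
 in the archives, but I could not find a solution to this problem there, so I decided to
 ask on the mailing list. I am sorry that this is a beginner's question, but I would
 appreciate your help.
 
 I am trying to build a SCORE system with four PCs (AMD K6-2 500MHz; the configuration is
 shown at the end). One is the server host and the remaining three are compute hosts.
 There is only one set of keyboard, mouse, and monitor, shared through a switch box.
 
 I installed SCORE with EIT, using the CD-ROM bundled with the book "Linuxで並列処理をしよう"
 ("Let's Do Parallel Processing on Linux") and RedHat 7.2 (Impress). Then, as described in
 the book, I ran the tests following the system test procedure in
 /opt/score/doc/html/ja/installation/index.html on the server host. The SCOUT test
 completed without problems, but the PM/Ethernet test produces the following errors and
 I cannot proceed.
 
  [root ＠ server root]# /etc/rc.d/init.d/pm_ethernet start
   bash: /etc/rc.d/init.d/pm_ethernet: No such file or directory
 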
  [root ＠ server sbin]# ./rpmtest comp0 ethernet -dest 1 -ping
  Ethernet PM context #0 information (unit 0)
  channel 0 descripter information
   rx_p=00000000, rx_c=00000000, rx_bp=00000000, rx_bc=00000000
   tx_p=00000001, tx_c=00000000, tx_bp=00000080, tx_bc=00000000
   channel 0 statistics information
   st_txmit=ff0101ff, st_rexmit=00000000, st_xmit_ctl=00000001
   st_xmit_ack=00000008 st_xmit_lost=00000002, st_xmit_stop=0000001f
   st_xmit_err=00000000, st_xmit_received=00000000, st_rcv_valid=0000001f
   st_rcv_ackonly=00000000, st_rcv_igonore=00000000, st_rcv_lose=00000000
   st_rcv_ov=00000000,st_rcv_ov=00000000
   st_rcv_stop=00000000, st_rcv_go=00000000
   pmReceive: Connection timed out(110)
 
 Checking with reference to the instructions in the archives (from Yamamoto-san of Saga
 University), I found the following.
 
  (1) The server host's kernel is not the SCORE kernel (uname -r shows 2.4.10).
  (2) The compute hosts' kernel is the SCORE kernel (2.4.18-2SCORE).
  (3) Running $ dmesg | grep PM on the three compute hosts gave the same result on all three:
 
       PM memory support
       PM/Ethernet: "$Id:pm_Ethernet_dev.c,v 1.1.2.1 2002/03/28 03:05:14 kameyama Exp $"
       PM/Ethernet:register etherpm device as major(122)
 
  (4) I also noticed that the LAN card IRQs of the four PCs are not all the same
      (comp2 uses 10, the other three use 5; there are no IRQ conflicts, as confirmed
      with cat /proc/interrupts).
 
 The part of the server host's dmesg output that seems related to eth0 is as follows.
 
 ・・・・・・・
 epro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
 eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw ＠ saw.sw.com.sg> and others
 PCI: Found IRQ 5 for device 00:0b.0
 eth0: Intel Corporation 82557 [Ethernet Pro 100], 00:A0:C9:2A:09:55, IRQ 5.
 Receiver lock-up bug exists -- enabling work-around.
 Board assembly 352509-003, Physical connectors present: RJ45
 Primary interface chip DP83840 PHY #1.
 DP83840 specific setup, setting register 23 to 8462.
 General self-test: passed.
 Serial sub-system self-test: passed.
 Internal registers self-test: passed.
 ROM checksum self-test: passed (0x49caa8d6).
 Receiver lock-up workaround activated.
 Installing knfsd (copyright (C) 1996 okir ＠ monad.swb.de).  
 SCSI subsystem driver Revision: 1.00
 iSCSI version 2.0.1.8 ( 8-Aug-2001)
 iSCSI control device major number 254
 ・・・・・・・・・

 The third line differs between compute hosts:
  comp0: PCI found IRQ 10 for device 00:0a:0
         IRQ routing conflict for 00:0a:0, have irq5, want irq10
  comp1: same as the server
  comp2: PCI:Assigned IRQ10 00:0a:0

 Also, from the fourth line from the bottom (Installing knfsd ...) onward, the output on
 comp0 through comp2 continues with:

  etherpm0: 16context using 4096KB MEM, maxunit=4, maxnodes=512, mtu=1468,eth0
  etherpm0: Interrupt Reaping on eth0, irq5 (irq10 on comp2) ...
 
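 The four PCs used have nearly identical specifications. They differ in two respects:
 (1) HDD: 40GB on the server host, 20GB on the three compute hosts; (2) the graphics card
 is different in all four. Everything else is the same.

 However, the server host dual-boots with Windows 2000. Could these things be the cause
 of the problem? Also, at the end of the score installation, the messages
 "Setup Server Host Done" and "Congratulation! ..." did appear, but I have the impression
 that other messages appeared besides those shown in the Server Setup screen of Figure 9.22
 in the book. I neglected to record them and no longer know what they were.

 Since the server host's kernel is not the SCORE kernel, does it need to be replaced?
 If so, please tell me the concrete steps.
 

 That is the situation. I apologize for the length. Being a beginner, I have no idea how
 to proceed and am at a loss. Any advice would be appreciated.

 I am sorry to trouble you when you are busy; thank you in advance.
 
 
 For reference, an overview of the PC is attached below.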
 
 [root ＠ server /]# /sbin/lspci
 00:00.0 Host bridge: VIA Technologies, Inc. VT82C598 [Apollo MVP3] (rev 04)
 00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP]
 00:07.0 ISA bridge: VIA Technologies, Inc. VT82C586/A/B PCI-to-ISA [Apollo VP] (rev 47)
 00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
 00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 02)
 00:07.3 Host bridge: VIA Technologies, Inc. VT82C586B ACPI (rev 10)
 00:0b.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100] (rev 01)
 01:00.0 VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 04)
 [root ＠ server /]# 
 
 
 [root ＠ server /]# cat /proc/interrupts
            CPU0       
   0:    1253317          XT-PIC  timer
   1:       1370          XT-PIC  keyboard
   2:          0          XT-PIC  cascade
   5:      16805          XT-PIC  eth0　　　　
   8:          1          XT-PIC  rtc
  10:          0          XT-PIC  usb-uhci
  12:      52225          XT-PIC  PS/2 Mouse
  14:      25320          XT-PIC  ide0
  15:     136376          XT-PIC  ide1
 NMI:          0 
 ERR:          0

 On comp0 through comp2 the IRQ of eth0 differs, but the result was similar (no conflicts).

end


From kameyama ＠ pccluster.org  Wed May 28 17:14:59 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Wed, 28 May 2003 17:14:59 +0900
Subject: [SCore-users-jp] Stuck at the PM/Ethernet test
In-Reply-To: Your message of "Wed, 28 May 2003 17:00:02 JST."
             <200305280800.RAA26308@neptune.de.takuma-ct.ac.jp>
Message-ID: <20030528081332.E5BA1128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200305280800.RAA26308 ＠ neptune.de.takuma-ct.ac.jp> "長岡 史郎" <nagaoka ＠ de.takuma-ct.ac.jp> wrote:
>   [root ＠ server root]# /etc/rc.d/init.d/pm_ethernet start
>   bash: /etc/rc.d/init.d/pm_ethernet: No such file or directory

This looks a little odd, though...

>   etherpm0：16context using 4096KB MEM, maxunit=4, maxnodes=512, mtu=1468,eth0
>   ehterpm0: Interrupt Reaping on eth0, irq5 (irq10 on comp2) ...

If that is what is displayed, I think there is no problem.
>


>   [root ＠ server sbin]# ./rpmtest comp0 ethernet -dest 1 -ping

Before running this command, you need to run
    [root ＠ server sbin]# ./rpmtest comp1 ethernet -reply
in another window.
(The rpmtest -reply running on this dest 1 host answers the
rpmtest -ping.)
Is it running?
(And once -ping has finished, stop the -reply side as well.)

>  Following the instructions in the archives (from Mr. Yamamoto of Saga University),
>  I checked and found the following.
>
>   (1)  The kernel on the server host is not the SCore one (uname -r shows
>  2.4.10).

If the server does not double as a compute host, that is not a problem.

>   (4)  I also noticed that the IRQs of the LAN cards in the four PCs are not all
>  the same.
>      (comp2 is 10, the other three are 5. There are no IRQ conflicts, though;
>  I checked with cat /proc/interrupts.)

I think this is not a problem either.
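
The "no IRQ conflict" check described in the quoted text can also be done mechanically. The following is only a minimal sketch, assuming the Linux 2.4 layout of /proc/interrupts shown earlier in this thread (IRQ number, per-CPU counts, controller name, then a comma-separated device list). The function name `shared_irqs` and the sample text are illustrative, not part of SCore.

```python
def shared_irqs(interrupts_text):
    """Return {irq: [devices]} for IRQ lines that list more than one device."""
    shared = {}
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        irq = head.strip()
        if not sep or not irq.isdigit():
            continue  # skip the CPU header and the NMI/ERR summary lines
        # Drop the per-CPU interrupt counts; what remains is the controller
        # name (e.g. XT-PIC) followed by the device list.
        names = [f for f in rest.split() if not f.isdigit()]
        devices = " ".join(names[1:])
        devs = [d.strip() for d in devices.split(",") if d.strip()]
        if len(devs) > 1:
            shared[int(irq)] = devs
    return shared


# Hypothetical sample in the same layout as the listing in this thread.
SAMPLE = """\
           CPU0
  0:    1253317          XT-PIC  timer
  5:      16805          XT-PIC  eth0
 10:          0          XT-PIC  usb-uhci
 12:      52225          XT-PIC  PS/2 Mouse
NMI:          0
"""

if __name__ == "__main__":
    print(shared_irqs(SAMPLE))  # an empty dict means no IRQ line is shared
```

On a real host the text would come from reading /proc/interrupts. A shared line is not necessarily an error, so the result is best treated as a hint rather than a diagnosis.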

>  However, the server host dual-boots with Windows 2000. Could any of these be
>  causing the current problem?

I don't think it is related.
(Although SCore cannot be used while the server is booted into Windows 2000...)

                       from Kameyama Toyohisa


From nrcb ＠ streamline-computing.com  Wed May 28 19:46:16 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Wed, 28 May 2003 11:46:16 +0100
Subject: [SCore-users-jp] [SCore-users] Myrinet2k problems
Message-ID: <200305281146.16947.nrcb@streamline-computing.com>

We had a power failure on one of our systems which had been working
fine for 1 month.

System: Score 5.4.0, copper Myrinet 2k. dual Xeon cpus.

After reboot we now get this error:

=========================================================
SGE job: submitted date = Tue May 27 18:07:30 BST 2003
32 cpus on 16 nodes ( SMP=2 )
Executable file: /home/users/nrcb/bin/pgi/jm33_parallel_pgi
MPI parallel job.
16 hosts used:
-------------
comp08 comp09 comp10 comp11
comp12 comp13 comp14 comp15
comp17 comp18 comp19 comp20
comp21 comp22 comp23 comp24
=========================================================
Job output begins
-----------------

<11> SCore-D:WARNING Unable to open PM myrinet2k/myrinet2k (error=19).
<11> SCore-D:WARNING   argv[0] -firmware
<11> SCore-D:WARNING   argv[1] /var/scored/scoreboard/eagle.0000V3000VHR
<11> SCore-D:WARNING   argv[2] -config
<11> SCore-D:WARNING   argv[3] /var/scored/scoreboard/eagle.0000V300370H
<11> SCore-D:ERROR No PM device opened.


Does anyone know what this indicates?

We have tried restarting everything and using different compute nodes, but we get the
same error.

Thanks,

Nick



From nagaoka ＠ de.takuma-ct.ac.jp  Wed May 28 20:27:46 2003
From: nagaoka ＠ de.takuma-ct.ac.jp (長岡　史郎)
Date: Wed, 28 May 2003 20:27:46 +0900
Subject: [SCore-users-jp] Stuck at the PM/Ethernet test (2)
Message-ID: <200305281127.UAA26637@neptune.de.takuma-ct.ac.jp>

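Dear Mr. Kameyama,

Thank you very much for your prompt reply. Below are the results of
following the instructions in your message.


>>   [root ＠ server sbin]# ./rpmtest comp0 ethernet -dest 1 -ping
>
>Before running this command, you need to run
>    [root ＠ server sbin]# ./rpmtest comp1 ethernet -reply
>in another window.
>(The rpmtest -reply running on this dest 1 host answers the
>rpmtest -ping.)
>Is it running?
>(And once -ping has finished, stop the -reply side as well.)
>


As instructed, I opened a window (terminal emulator) separate from the one in
which I had checked SCOUT and ran the following commands there. scout had
already been terminated at this point.

[root ＠ server root]# cd /opt/score/sbin
[root ＠ server sbin]# ./rpmtest comp1 ethernet -reply

No matter how many minutes I waited, nothing came back, so I interrupted it with
Ctrl+C. The following is what was printed at that point.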

Ethernet PM context #0 information (unit 0)
 channel 0 descripter information
  rx_p=00000000, rx_c=00000000, rx_bp=00000000, rx_bc=00000000
  tx_p=00000000, tx_c=00000000, tx_bp=00000000, tx_bc=00000000

 channel 0 statistics information
  st_txmit=ff0101ff, st_rexmit=00000000, st_xmit_ctl=00000001
  st_xmit_ack=00000000 st_xmit_lost=00000002, st_xmit_stop=00000000
  st_xmit_err=00000000, st_xmit_received=00000000, st_rcv_valid=00000000
  st_rcv_ackonly=00000000, st_rcv_igonore=00000000, st_rcv_lose=00000000
  st_rcv_ov=00000000,st_rcv_ov=00000000
  st_rcv_stop=00000000, st_rcv_go=00000000
[root ＠ server sbin]# 

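Should I take this to mean that it is not working? Or does the first run take quite
a long time, so that I should have waited longer? (Am I right in understanding that,
when it works normally, two numbers are returned?)


After this, I ran ./rpmtest comp0 ethernet -dest 1 -ping in the same window, but got
the same result as above. (That is, nothing came back for several minutes, and when
I forcibly interrupted it, exactly the same output as above was printed.)

In my previous mail I skipped writing the result of running ./rpmtest comp1 ethernet -reply.
My apologies.


I have one question. The system test section of
/opt/score/doc/html/ja/installation/index.html on the server host reads as follows.


>  1. The etherpmctl command
>    Make sure that the etherpmctl command is executed at boot time by the
>    pm_ethernet rc script on all hosts. If it has not been executed yet,
>    run it on all hosts as follows:
>
>     $ su
>     Password:
>     # /etc/rc.d/init.d/pm_ethernet start


I am embarrassed to say that I did not know how to confirm that it "is executed at boot
time by the pm_ethernet rc script" (could you please tell me how to check this?), so I
assumed that it had not been executed and ran the command above,

    # /etc/rc.d/init.d/pm_ethernet start

The result was as follows.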

[root ＠ comp0 root]# /etc/rc.d/init.d/pm_ethernet start
n Starting PM/Ethernet:
 devoce:eth0
etherpmctl:ERROR on unit 0:"Device or resource busy (16)" Check dmesg log!!

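Am I right in understanding that "Device or resource busy (16)" appears because
etherpmctl has already been executed?

comp1 and comp2 gave the same result. The file pm_ethernet exists in this directory on
all three machines. It did not exist on the server host.

    As an aside, when I ran it on the server host, the result was, as I wrote in my previous question,
      [root ＠ server root]# /etc/rc.d/init.d/pm_ethernet start
      bash: /etc/rc.d/init.d/pm_ethernet: No such file or directory
    (the result I reported last time).


Since the output of /etc/rc.d/init.d/pm_ethernet start said "Check dmesg log!!", I then
ran the following on the compute hosts:

[root ＠ comp0 root]# dmesg | grep PM

All three machines again returned the same result, shown below.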
 
PM memory support
PM/Ethernet: "$Id:pm_Ethernet_dev.c,v 1.1.2.1 2002/03/28 03:05:14 kameyama Exp $"
PM/Ethernet:register etherpm device as major(122)


Apologies for the late reply. Thank you in advance.

By the way, should I send my replies directly to you, Mr. Kameyama, or to the SCore
mailing list?


end


From kameyama ＠ pccluster.org  Wed May 28 20:43:25 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Wed, 28 May 2003 20:43:25 +0900
Subject: [SCore-users-jp] Stuck at the PM/Ethernet test (2)
In-Reply-To: Your message of "Wed, 28 May 2003 20:27:46 JST."
             <200305281127.UAA26637@neptune.de.takuma-ct.ac.jp>
Message-ID: <20030528114157.96812128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200305281127.UAA26637 ＠ neptune.de.takuma-ct.ac.jp> "長岡 史郎" <nagaoka ＠ de.takuma-ct.ac.jp> wrote:
> >>   [root ＠ server sbin]# ./rpmtest comp0 ethernet -dest 1 -ping
> >
> >Before running this command, you need to run
> >    [root ＠ server sbin]# ./rpmtest comp1 ethernet -reply
> >in another window.
> >(The rpmtest -reply running on this dest 1 host answers the
> >rpmtest -ping.)
> >Is it running?
> >(And once -ping has finished, stop the -reply side as well.)
> >
> 
> 
> As instructed, I opened a window (terminal emulator) separate from the one in
> which I had checked SCOUT and ran the following commands there. scout had
> already been terminated at this point.
> 
> [root ＠ server root]# cd /opt/score/sbin
> [root ＠ server sbin]# ./rpmtest comp1 ethernet -reply
> 
> No matter how many minutes I waited, nothing came back, so I interrupted it with
> Ctrl+C. The following is what was printed at that point.

This side prints nothing.
It only acts as the counterpart for rpmtest -ping.
Please do not interrupt it.

> After this, I ran ./rpmtest comp0 ethernet -dest 1 -ping in the same window, but got
> the same result as above. (That is, nothing came back for several minutes, and when
> I forcibly interrupted it, exactly the same output as above was printed.)

When you run this command, the -reply side must be kept running.

With rpmtest, the program started with -ping sends a packet, and
the program started with -reply receives that packet and sends a
packet back to the -ping side.
The -ping side then prints the elapsed time when it receives the packet.
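
The ping/reply exchange described above can be illustrated outside SCore. The following is only a rough analogue using UDP sockets on the loopback interface, not PM/Ethernet and not the actual rpmtest implementation. The names `reply_side`, `ping_side` and `run_demo` are made up for this sketch. The point it shows is the ordering: the replying side has to be up before the pinging side starts, and only the pinging side prints anything.

```python
import socket
import threading
import time


def reply_side(sock, count):
    """Echo `count` datagrams back to their sender (the -reply role)."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def ping_side(dest, size=8, count=100):
    """Send `count` datagrams of `size` bytes and return the mean
    round-trip time in seconds (the -ping role)."""
    payload = b"x" * size
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(2.0)  # fail instead of hanging if nobody replies
        start = time.perf_counter()
        for _ in range(count):
            s.sendto(payload, dest)
            data, _ = s.recvfrom(2048)
            assert data == payload
        return (time.perf_counter() - start) / count


def run_demo(size=8, count=100):
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    replier = threading.Thread(target=reply_side, args=(server, count), daemon=True)
    replier.start()  # the replying side must be running first
    rtt = ping_side(server.getsockname(), size, count)
    replier.join(timeout=2.0)
    server.close()
    return size, rtt


if __name__ == "__main__":
    size, rtt = run_demo()
    print(f"{size}\t{rtt:.9f}")  # message size and mean time per exchange
```

If the replying side is not running, `ping_side` times out here, whereas the -ping run described in this thread simply waited until it was interrupted.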


> >  1. The etherpmctl command
> >    Make sure that the etherpmctl command is executed at boot time by the
> >    pm_ethernet rc script on all hosts. If it has not been executed yet,
> >    run it on all hosts as follows:
> >
> >     $ su
> >     Password:
> >     # /etc/rc.d/init.d/pm_ethernet start
> 
> 
> I am embarrassed to say that I did not know how to confirm that it "is executed at
> boot time by the pm_ethernet rc script" (could you please tell me how to check
> this?), so I assumed that it had not been executed

On Red Hat you can find out with
    # /sbin/chkconfig --list pm_ethernet

> [root ＠ comp0 root]# /etc/rc.d/init.d/pm_ethernet start
> n Starting PM/Ethernet:
>  devoce:eth0
> etherpmctl:ERROR on unit 0:"Device or resource busy (16)" Check dmesg log!!
> 
> Am I right in understanding that "Device or resource busy (16)" appears because
> etherpmctl has already been executed?

Yes.

>     As an aside, when I ran it on the server host, the result was, as I wrote in my previous question,
>       [root ＠ server root]# /etc/rc.d/init.d/pm_ethernet start
>       bash: /etc/rc.d/init.d/pm_ethernet: No such file or directory
>     (the result I reported last time).

The server does not use PM, so this is not a problem.

> By the way, should I send my replies directly to you, Mr. Kameyama, or to the SCore
> mailing list?

Please send them to the mailing list.

                       from Kameyama Toyohisa


From nagaoka ＠ de.takuma-ct.ac.jp  Wed May 28 22:37:50 2003
From: nagaoka ＠ de.takuma-ct.ac.jp (長岡　史郎)
Date: Wed, 28 May 2003 22:37:50 +0900
Subject: [SCore-users-jp] Stuck at the PM/Ethernet test (3)
Message-ID: <200305281337.WAA26754@neptune.de.takuma-ct.ac.jp>

Dear Mr. Kameyama,

Thank you for your answer. I had been running everything in a single window,
and I am embarrassed to say I had not understood that the test is carried out
using two windows.

At the risk of embarrassing myself, let me summarize. Is the following right?
(1) This test is performed using two windows (call them W-1 and W-2).
(2) In W-1, an arbitrary compute host (for example comp0) is told with -reply
    to wait for packets so that it can answer, and is left waiting.
(3) In the other window (W-2), the -ping option makes another arbitrary compute
    host "call" the waiting compute host, so that the two play catch with
    packets.
    The one-to-one communication test is done by measuring the time taken to
    exchange packets of a given size.

Here are the results on my PCs.

In W-1 I ran ./rpmtest comp1 ethernet -reply. The cursor (■) just sits there.
(Last time I forcibly terminated it because nothing happened.)
Leaving W-1 as it is, in W-2 I ran
[root ＠ server sbin]# ./rpmtest comp0 ethernet -dest 1 -ping
After waiting a while,
8	0.000123455
was returned.
Likewise, with comp2 waiting in W-1 (./rpmtest comp2 ethernet -reply),
communicating from comp0 gives
[root ＠ server sbin]# ./rpmtest comp0 ethernet -dest 2 -ping
8	0.000122487
and communicating from comp1 gives
[root ＠ server sbin]# ./rpmtest comp1 ethernet -dest 2 -ping
8	0.000123415
With comp0 waiting in W-1 (./rpmtest comp0 ethernet -reply), communicating
from comp1 gives
[root ＠ server sbin]# ./rpmtest comp1 ethernet -dest 0 -ping
8	0.000123476
The rest follow the same pattern.
comp2 -> comp0
[root ＠ server sbin]# ./rpmtest comp2 ethernet -dest 0 -ping
8	0.000124038
comp2 -> comp1
[root ＠ server sbin]# ./rpmtest comp2 ethernet -dest 1 -ping
8	0.00012301
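
The six measurements above are easier to compare once they are put on one scale. The following is a small sketch that parses the reported lines and prints their mean and spread in microseconds. It assumes the first column is the message size in bytes and the second column is a time in seconds; neither unit is stated in this thread. The helper name `parse_rpmtest` is made up for the sketch.

```python
def parse_rpmtest(lines):
    """Turn 'size<whitespace>seconds' lines into (size, seconds) tuples."""
    results = []
    for line in lines:
        size, seconds = line.split()
        results.append((int(size), float(seconds)))
    return results


# Values copied from the runs reported in this message.
REPORTED = [
    "8\t0.000123455",  # comp0 -> comp1
    "8\t0.000122487",  # comp0 -> comp2
    "8\t0.000123415",  # comp1 -> comp2
    "8\t0.000123476",  # comp1 -> comp0
    "8\t0.000124038",  # comp2 -> comp0
    "8\t0.00012301",   # comp2 -> comp1
]

times_us = [seconds * 1e6 for _, seconds in parse_rpmtest(REPORTED)]
mean_us = sum(times_us) / len(times_us)
spread_us = max(times_us) - min(times_us)

if __name__ == "__main__":
    print(f"mean {mean_us:.1f} us, spread {spread_us:.1f} us")
```

Under those assumptions all six host pairs land within about two microseconds of each other, which suggests that no single link or NIC stands out.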

When I try it from the server host, I get

[root ＠ server sbin]# ./rpmtest server ethernet -dest 1 -ping
env: /opt/score5.0.0/deploy/scbpmexec: No such file or directory

Am I right in understanding that this fails because the communication
program is not installed on the server?


I went on to run the stress test, and report the results here.
[root ＠ server sbin]# cd /opt/score/deploy
[root ＠ server deploy]# scout -g pcc2
SCOUT: Spawning done.                         
SCOUT: session started.
[root ＠ server deploy]# ./scstest -network ethernet
SCSTEST: BURST on ethernet(chan=0,ctx=0,len=16)

[root ＠ server root]# cd /opt/score/deploy
[root ＠ server deploy]# ./scstest -network ethernet
SCSTEST: BURST on ethernet(chan=0,ctx=0,len=16)
50 K packets.
100 K packets.
150 K packets.
200 K packets.
250 K packets.
300 K packets.

As you can see, I obtained the results described in the test procedure.


Finally, regarding confirming that it
"is executed at boot time by the pm_ethernet rc script":
I ran the command you gave me and got the following result.

[root ＠ comp0 root]# /sbin/chkconfig --list pm_ethernet
pm_ethernet 0:off 1:off 2:off 3:on 4:on 5:on 6:off

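comp1 and comp2 gave the same result.


One last question: may I take it that the system I built from these four PCs
is now (probably) working without problems?


Thank you very much. Thanks to you I was able to complete the PM/Ethernet
tests. Without your advice I would still be tearing my hair out.

I expect to have many more elementary questions from here on. When that happens
I will swallow my pride and ask. I look forward to your continued help.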


end


From a-hasega ＠ ats.nis.nec.co.jp  Wed May 28 23:26:45 2003
From: a-hasega ＠ ats.nis.nec.co.jp (長谷川　篤史)
Date: Wed, 28 May 2003 23:26:45 +0900
Subject: [SCore-users-jp] Parallelizing a program that uses fork() and execl()
References: <003301c324eb$2b52ff80$6400a8c0@sein>
Message-ID: <3ED4C725.1010200@ats.nis.nec.co.jp>

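This is Hasegawa of NEC情報システムズ.

I took part in the development of Omni OpenMP.


>       <When using OpenMP>
>         It does not work. For example, running a program whose main routine
>         fork()s and execl()s one child process prints the following messages and exits.
>         # scrun -nodes=2,scored=comp0.pccluster.org ./main
>         SCore-D 5.4.0 connected (jid=25).
>         <0:0> SCORE: 2 nodes (2x1) ready.
>         <0:0> SCORE: 2 nodes (2x1) ready.
>         <0>scash_pa_init:can not create page table file: Operation not permitted[1]
>         <1>scash_pa_init:can not create page table file: Operation not permitted[1]

Did you run ./main via fork and execl?

I have never evaluated this with either MPI or OpenMP, so I cannot guarantee it, but I
think fork and execl themselves work. However, if execl is used to run a SCore program
(a program written with OpenMP or MPI), it should not work.

The error above is caused by a separate problem, but even if that is fixed, the
implementation of Omni OpenMP does not allow that kind of execution.

>       <When using OpenMP>
>         I have not tried this yet, but in this case
>         do I need to rewrite the code to use the library functions provided
>         by OpenMP?

The assumption is that you parallelize with "#pragma omp parallel".
Use together with pthread has not been considered, so I expect problems in the
OpenMP runtime library.

I do not understand what you want to do, but would it not be enough to allocate
the required nodes when running scrun?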

-- 
長谷川 篤史  E-Mail:a-hasega ＠ ats.nis.nec.co.jp
株式会社NEC情報システムズ 基盤ソフトウェア事業部 サイエンス基盤部
外線:03-3798-9991(Fax.03-3798-9198) / 内線:8-115-2410(Fax.8-115-2419)


From nrcb ＠ streamline-computing.com  Thu May 29 07:48:34 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Wed, 28 May 2003 23:48:34 +0100
Subject: [SCore-users-jp] [SCore-users] Myrinet2k problems fixed
Message-ID: <200305282348.34184.nrcb@streamline-computing.com>

<11> SCore-D:WARNING Unable to open PM myrinet2k/myrinet2k (error=19).
<11> SCore-D:WARNING   argv[0] -firmware
<11> SCore-D:WARNING   argv[1] /var/scored/scoreboard/eagle.0000V3000VHR
<11> SCore-D:WARNING   argv[2] -config
<11> SCore-D:WARNING   argv[3] /var/scored/scoreboard/eagle.0000V300370H
<11> SCore-D:ERROR No PM device opened.

We are running a modular kernel, and 2 of the compute nodes had the pm modules
disabled.

Sorry !!

Nick



From kameyama ＠ pccluster.org  Thu May 29 09:38:18 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Thu, 29 May 2003 09:38:18 +0900
Subject: [SCore-users-jp] Stuck at the PM/Ethernet test (3)
In-Reply-To: Your message of "Wed, 28 May 2003 22:37:50 JST."
             <200305281337.WAA26754@neptune.de.takuma-ct.ac.jp>
Message-ID: <20030529003649.BB25E128944@neal.il.is.s.u-tokyo.ac.jp>

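This is Kameyama.

In article <200305281337.WAA26754 ＠ neptune.de.takuma-ct.ac.jp> "長岡 史郎" <nagaoka ＠ de.takuma-ct.ac.jp> wrote:
> Thank you for your answer. I had been running everything in a single window,
> and I am embarrassed to say I had not understood that the test is carried out
> using two windows.

Not so much alternately; rather, you keep -reply running and then run -ping...

> At the risk of embarrassing myself, let me summarize. Is the following right?
> (1) This test is performed using two windows (call them W-1 and W-2).
> (2) In W-1, an arbitrary compute host (for example comp0) is told with -reply
>     to wait for packets so that it can answer, and is left waiting.
> (3) In the other window (W-2), the -ping option makes another arbitrary compute
>     host "call" the waiting compute host, so that the two play catch with
>     packets.
>     The one-to-one communication test is done by measuring the time taken to
>     exchange packets of a given size.

That is right.
Was this not clear from the installation guide?

> When I try it from the server host, I get
> 
> [root ＠ server sbin]# ./rpmtest server ethernet -dest 1 -ping
> env: /opt/score5.0.0/deploy/scbpmexec: No such file or directory
> 
> Am I right in understanding that this fails because the communication
> program is not installed on the server?

Yes, that is right.
With the standard configuration, SCore does not run SCore programs on
the server host.
If you want the server to double as a compute host, the server must also
be set up as a compute host.

> [root ＠ comp0 root]# /sbin/chkconfig --list pm_ethernet
> pm_ethernet 0:off 1:off 2:off 3:on 4:on 5:on 6:off
> 
> comp1 and comp2 gave the same result.

This means that pm_ethernet is started in run levels 3, 4 and 5.
Normally the run level is 3 when the system is running in multi-user mode,
so the pm_ethernet script runs at boot.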
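
Reading the chkconfig line by eye is fine for three hosts. For a larger cluster the same check can be scripted. The following is a minimal sketch that parses one line of `chkconfig --list <service>` output, in the format shown above, into the list of run levels where the service is on. The function name `enabled_runlevels` is made up for this sketch.

```python
def enabled_runlevels(chkconfig_line):
    """Parse one `chkconfig --list <service>` line into
    (service, [run levels where the service is on])."""
    fields = chkconfig_line.split()
    service, levels = fields[0], fields[1:]
    on = []
    for item in levels:
        level, _, state = item.partition(":")
        if state == "on":
            on.append(int(level))
    return service, on


line = "pm_ethernet 0:off 1:off 2:off 3:on 4:on 5:on 6:off"
service, on = enabled_runlevels(line)
print(service, on)  # pm_ethernet [3, 4, 5]
print(3 in on)      # True: started at boot in run level 3
```

A host whose list does not contain its normal run level would be one where the rc script has to be started by hand.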

> One last question: may I take it that the system I built from these four PCs
> is now (probably) working without problems?

Yes.

                       from Kameyama Toyohisa


From kameyama ＠ pccluster.org  Thu May 29 10:06:11 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Thu, 29 May 2003 10:06:11 +0900
Subject: [SCore-users-jp] Parallelizing a program that uses fork() and execl()
In-Reply-To: Your message of "Wed, 28 May 2003 16:31:31 JST."
             <003301c324eb$2b52ff80$6400a8c0@sein>
Message-ID: <20030529010442.C7D9E128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <003301c324eb$2b52ff80$6400a8c0 ＠ sein> "増田 尚美" <n-masuda ＠ sp.nas.nec.co.jp> wrote:
>     (2) Programs that create threads (using POSIX threads)

SCore does, in principle, anticipate the use of POSIX threads, but
because it replaces the pthread library of glibc, it may not work with non-GNU
compilers that ship their own thread library.

Separately from that, MPICH itself states at http://www-unix.mcs.anl.gov/mpi/mpich/ :

Thread Safety

The MPICH implementation is not thread-safe. In many cases, it may be
possible to use MPICH in what in MPI-2 are called MPI_THREAD_FUNNELED
or MPI_THREAD_SERIALIZED modes when kernel (as opposed to user) threads
are used. We plan to support a MPI_THREAD_MULTIPLE in a later release.


so there may be problems.

In any case, I do not think checkpoint/restart will be possible.

                       from Kameyama Toyohisa


From n-masuda ＠ sp.nas.nec.co.jp  Thu May 29 11:23:55 2003
From: n-masuda ＠ sp.nas.nec.co.jp (増田　尚美)
Date: Thu, 29 May 2003 11:23:55 +0900
Subject: [SCore-users-jp] Parallelizing a program that uses fork() and execl() (2)
Message-ID: <002501c32589$5d5ed870$6400a8c0@sein>

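Dear Mr. Hasegawa,

This is Masuda.
Thank you for your help.

Thank you for your reply.


> Did you run ./main via fork and execl?

I am sorry that my explanation was hard to follow.
./main is run from the command line.
For example, given the executables
    main
    subpro1
    subpro2
I am trying to start subpro1 and subpro2 from main
with fork and execl.


> I have never evaluated this with either MPI or OpenMP, so I cannot guarantee it, but I
> think fork and execl themselves work. However, if execl is used to run a SCore program
> (a program written with OpenMP or MPI), it should not work.

Understood.
The subpro1 and subpro2 mentioned above are also programs that use MPI and OpenMP.
So the behavior is unstable because they are being started with execl.

Is there somewhere I can find information about commonly used
system functions such as execl not being guaranteed to work
under SCore?
If you know of any, I would be grateful if you could tell me.


> The error above is caused by a separate problem, but even if that is fixed, the
> implementation of Omni OpenMP does not allow that kind of execution.

The OpenMP error is probably a problem with how I wrote the directives.
I will study OpenMP a bit more regarding this error.


> The assumption is that you parallelize with "#pragma omp parallel".
> Use together with pthread has not been considered, so I expect problems in the
> OpenMP runtime library.

Understood.
I would like to try it once and see what actually happens.


> I do not understand what you want to do, but would it not be enough to allocate
> the required nodes when running scrun?

We already have a system that makes heavy use of child processes and threads,
and I am investigating how much impact there would be if we reused it
and ran it under SCore:
  - whether we should redo it from the design stage on the assumption that it runs under SCore
  - the risk of introducing bugs if we reuse it
  - whether MPI or OpenMP is more appropriate
and so on.

Since SCore is itself meant for distributing processing,
is it unreasonable to think of reusing a program that creates child processes
and threads in order to distribute its processing?

I have never seen a system actually running under SCore,
so this may be a silly question.
When running something under SCore,
is it usual to have a single executable?
In other words, is there no need to create child processes or threads?

That is all.
Thank you in advance.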


From n-masuda ＠ sp.nas.nec.co.jp  Thu May 29 14:14:27 2003
From: n-masuda ＠ sp.nas.nec.co.jp (増田　尚美)
Date: Thu, 29 May 2003 14:14:27 +0900
Subject: [SCore-users-jp] Parallelizing a program that uses fork() and execl()
References: <20030529010442.C7D9E128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <004301c325a1$2fdcfcc0$6400a8c0@sein>

Dear Mr. Kameyama,

This is Masuda.
Thank you for your help.

Thank you for your reply.

> SCore does, in principle, anticipate the use of POSIX threads, but
> because it replaces the pthread library of glibc, it may not work with non-GNU
> compilers that ship their own thread library.

Indeed, I had not been aware of that aspect at all.
At first, compiling and linking against libpthread.a in /usr/lib/
produced a link error, so
I changed it to link against libpthread.a in
/opt/score5.4.0/lib/i386-redhat7-linux2_4/.
The errors then went away and the link succeeded, so
I had assumed it could be used normally.


> Separately from that, MPICH itself states at http://www-unix.mcs.anl.gov/mpi/mpich/ :
>
> Thread Safety
>
> The MPICH implementation is not thread-safe. In many cases, it may be
> possible to use MPICH in what in MPI-2 are called MPI_THREAD_FUNNELED
> or MPI_THREAD_SERIALIZED modes when kernel (as opposed to user) threads
> are used. We plan to support a MPI_THREAD_MULTIPLE in a later release.
>
>
> so there may be problems.
>
> In any case, I do not think checkpoint/restart will be possible.
>

Understood.
I also cannot get rid of the zombie processes that keep appearing, so
it seems wiser to avoid using POSIX threads
while these uncertainties remain.
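
On the zombie processes: in general POSIX terms, a child that has exited stays a zombie until its parent waits for it. The following is a minimal sketch of the fork/exec/wait pattern in Python, using ordinary executables rather than SCore programs (per the earlier reply, exec'ing a SCore program is not expected to work). Whether a missing wait is really what causes the zombies in the system discussed here is an assumption. The names `spawn` and `reap` are made up for this sketch.

```python
import os
import sys


def spawn(argv):
    """fork() a child and replace it with argv via execv(); return the child's pid."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # only reached if execv() failed
    return pid


def reap(pids):
    """Wait for every child so that none is left as a zombie.
    Returns {pid: exit status, or None if the child was killed by a signal}."""
    statuses = {}
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        statuses[pid] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return statuses


if __name__ == "__main__":
    # Two trivial children standing in for subpro1 and subpro2.
    children = [spawn([sys.executable, "-c", "pass"]) for _ in range(2)]
    print(reap(children))
```

The equivalent in C is a `waitpid()` call in the parent for each forked child, or a SIGCHLD handler that does the same.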


Also, if there are any useful books on MPI
that have been translated into Japanese,
could you let me know?

That is all.
I look forward to your continued help.


From nagaoka ＠ de.takuma-ct.ac.jp  Thu May 29 17:43:18 2003
From: nagaoka ＠ de.takuma-ct.ac.jp (長岡　史郎)
Date: Thu, 29 May 2003 17:43:18 +0900
Subject: [SCore-users-jp] Stuck at the PM/Ethernet test (4): thanks
Message-ID: <200305290843.RAA28737@neptune.de.takuma-ct.ac.jp>

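Dear Mr. Kameyama,

Thank you for answering my elementary questions so carefully the other day.
As for my last question, I had not read the procedure properly and jumped to
conclusions, which led to an odd reply. I am blushing with embarrassment.

Above all, I am glad that I was able to build a parallel computer out of PCs
assembled from parts I had on hand. I look forward to your continued help.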


From makino ＠ fbox.ath.cx  Thu May 29 23:49:46 2003
From: makino ＠ fbox.ath.cx (Tetsuhisa Makino)
Date: Thu, 29 May 2003 23:49:46 +0900 (   )
Subject: [SCore-users-jp] Installing on an SMP cluster
Message-ID: <20030529.234946.45281613.makino@fbox.ath.cx>

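Hello, this is my first post.
My name is Makino, from Okayama University.

I recently decided to install SCore on an SMP cluster system in my laboratory,
but I cannot get it configured properly.
I am posting in the hope that someone can lend me their expertise.

I will give the details later; here are the problems.

1. The SMP kernel does not boot properly.
   It hangs during boot.
2. PM Test cannot be run.
   An error message is displayed. (Using the UP kernel.)
   > No route to host(113)

Below is a brief description of the environment in our laboratory.

Server host:
  Host name "server"
  Processor Pentium4 2.0GHz
  NIC       Intel Pro/100
Compute hosts:
  Host name "smp[0-3]" (4 machines)
  Processor Pentium3 1.0GHz( x2 )
  NIC       Intel Pro/100
            Intel Pro/1000

Topology: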
              +--------+
              | server |
              +--------+
                  |                       IP: xxx.yyy.zzz.95 (*1)
   +------------------------------+
   |    100 Base Switching HUB    +------- to other hosts (laboratory LAN)
   +------------------------------+
     |        |        |        |         IP: xxx.yyy.zzz.[96-99] (*1)
+------+  +------+  +------+  +------+
| smp0 |  | smp1 |  | smp2 |  | smp3 |
+------+  +------+  +------+  +------+
     |        |        |        |         IP: 192.168.10.[1-3]
   +------------------------------+
   |    Giga bit Switching HUB    |
   +------------------------------+

(*1) xxx.yyy.zzz is the network address of the laboratory.

This is the environment in which I want to set up the cluster.

First, as a test, I tried an installation that uses only smp0 and smp1 as compute
hosts.
I did a full install of Red Hat Linux 7.3 on server and wrote /etc/hosts as
follows.

--------------- /etc/hosts ---------------
127.0.0.1 localhost.localdomain localhost
xxx.yyy.zzz.95 server.lab.ac.jp server
xxx.yyy.zzz 96 smp0.lab.ac.jp smp0
xxx.yyy.zzz 97 smp1.lab.ac.jp smp1
#xxx.yyy.zzz 98 smp2.lab.ac.jp smp2
#xxx.yyy.zzz 99 smp3.lab.ac.jp smp3
--------------- /etc/hosts ---------------

# Note that these IP addresses are also registered in the laboratory DNS.

For NIS, the laboratory NIS server is up, so I set things up to use it.
# nis nis-server

I installed on server from a CD-ROM created from the SCore 5.4 ISO image,
then started EIT.

# /opt/score/bin/eit -nisonly

I proceeded to the network settings dialog and configured it as follows.

Server Name:  server.lab.ac.jp
Domain Name:  lab.ac.jp
Net mask   :  255.255.255.0
Gateway    :  xxx.yyy.zzz.1
NIS        :  nis
Name Server:  (blank)
Mount point:  /mnt/cdrom
Display    :  server.lab.ac.jp

Disk and NFS were configured appropriately.

The Host Information settings are as follows.

Number of hosts     : 2
Number of processors: 2
Name prefix         : smp
numbered            : digit
start - end         : 0 - 1

The Group Creation settings are as follows.

Group name: smpc
Network   : 100M Eth, shmem[2] (*2)
Hosts     : smp0, smp1

(*2) Since this is just a test for now, I selected 100M.

After the above configuration, I created the boot disk for the compute hosts and
installed the compute hosts.
Thank you in advance.

--
Name Tetsuhisa Makino (makino ＠ fbox.ath.cx)


From ersoz ＠ cse.psu.edu  Fri May 30 04:28:09 2003
From: ersoz ＠ cse.psu.edu (Deniz Ersoz)
Date: Thu, 29 May 2003 15:28:09 -0400
Subject: [SCore-users-jp] [SCore-users] Execution Time Problem
Message-ID: <20030529192809.GA9244@titan.cse.psu.edu>


Hi,

I have been running some NAS benchmarks using Score-D on a myrinet network and the results are much 
worse than the ones on GM (without score). For example SP (class B, 9 nodes) finishes in around 230 
seconds on GM and it takes 1000 seconds when I use score...

Is there something wrong or is it normal???

PS: I tried changing the timeslice and also the "mpi_zerocopy=on" option, but the times didn't 
change...

Thank you very much...

Deniz Ersoz


From nrcb ＠ streamline-computing.com  Fri May 30 06:24:31 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Thu, 29 May 2003 22:24:31 +0100
Subject: [SCore-users-jp] [SCore-users] ia32/ia64 cluster
Message-ID: <200305292224.31997.nrcb@streamline-computing.com>

We need some help on configuring mixed clusters.

Here is our test system:

8 dual Xeon ia32 cluster nodes.
1 quad Itanium2   cluster node.

Score 5.4

Network = gigabit

The Xeon cluster has been running Score MPI jobs for the last 2 months.

The Itanium has kernel 2.4.19-1SCORE patched and compiled for Itanium as in
user documents.

The Itanium has been added, and is running SCore multiprocessor kernel (we have run several OpenMP codes and it works fine).


The front end server is called server and is a Xeon 32 bit.

scout works fine:

server nrcb:$ cat hosts
itanium01
server nrcb:$ scout -F hosts
done.
SCOUT: session started.
server nrcb:$ scout uname -a
[itanium01]:
Linux tiger4.streamline 2.4.19-1SCORE_ia64 #2 SMP Mon May 26 12:36:20 PDT 2003 ia64 unknown
server nrcb:$ 

Also mpi compilers work fine on Itanium:

[nrcb ＠ tiger4 mpi]$ mpif77 -compiler intel7 -O3 -w  -o jacobi_mpi_64 jacobi_mpi_param.f
   program JACOBI
   external subroutine OUTPUT
   external function UEXACT
   external function FEXT
   external subroutine INITIALISE
   external subroutine ERROR
   external subroutine ITERATE

2057 Lines Compiled
[nrcb ＠ tiger4 mpi]$  ./jacobi_mpi
<0:0> SCORE: One local node ready.
  Running with nprocs=           1
  Array size nxg,nyg =         4096        4096
  Iteration count    =          128
 cpus=           1 : Iteration =            1   2.24994472335565D+015
 cpus=           1 : Iteration =            2   2.24924498762084D+015
 cpus=           1 : Iteration =            3   2.24870040652968D+015
 cpus=           1 : Iteration =            4   2.24823845276853D+015
 cpus=           1 : Iteration =            5   2.24783004116974D+015
 cpus=           1 : Iteration =            6   2.24745998334097D+015
 cpus=           1 : Iteration =            7   2.24711915332825D+015

So code runs both on Itanium and Xeon (2 binaries).

I have copied 32 bit and 64 bit binaries to: 

/opt/score/bin/bin.ia64-redhat-linux2_4/jacobi_mpi.exe
/opt/score/bin/bin.i386-redhat7-linux2_4/jacobi_mpi.exe

on  itanium 64 bit and server 32 bit

and set link for .wrapper :

[nrcb ＠ tiger4 mpi]$ ls -al jacobi_mpi
lrwxrwxrwx    1 nrcb     streamc        23 May 29 11:46 jacobi_mpi -> /opt/score/bin/.wrapper


(server and Itanium share a common user filesystem via nfs).

I have an entry for scorehosts.db for Itanium:

server mpi:$ grep IA /opt/score/etc/scorehosts.db
itanium01.streamline    HOST_8 network=gigabit,shmem0,shmem1,shmem2,shmem3 group=_scoreall_,IA64,SHMEM smp=4 MSGBSERV
server mpi:$ 


scoreboard and msgbserv services are restarted.

Running a Xeon application on a Xeon node works fine:


server mpi:$  scout -F hosts -e scrun -nodes=2 ./jacobi_mpi
done.
FEP: Unable to connect with SCore-D (comp07)
FEP:WARNING checkpoint option is ignored in single-user mode.
SCore-D 5.4.0 connected.
<0:0> SCORE: 2 nodes (1x2) ready.
  Running with nprocs=           2
  Array size nxg,nyg =         4096        4096
  Iteration count    =          128
  Running with nprocs=           2


Trying to run Itanium application from Xeon server  gives this:

server mpi:$  scout -F hosts -e scrun -nodes=2 ./jacobi_mpi
done.
FEP: Unable to connect with SCore-D (comp07)
FEP:WARNING checkpoint option is ignored in single-user mode.
<0> SCore-D:WARNING Unable to open a network configuration file (2):
network='gigabit', attribute='-config:file'
<0> SCore-D:ERROR No PM device opened.


Itanium has these modules loaded:

pm_shmem               42976   0 (unused)
pm_ethernet_dev       148296   0 (unused)
pm_memory              19216   0 [pm_shmem pm_ethernet_dev]


Do the scoreboard and PM config files need to be set up on the Itanium as well as on
the main Xeon server?


Any help appreciated.

Many thanks,

Nick


From kameyama ＠ pccluster.org  Fri May 30 08:54:28 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Fri, 30 May 2003 08:54:28 +0900
Subject: [SCore-users-jp] Installing on an SMP cluster
In-Reply-To: Your message of "Thu, 29 May 2003 23:49:46 JST."
             <20030529.234946.45281613.makino@fbox.ath.cx>
Message-ID: <20030529235256.39038128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.
The installation procedure does not appear to be wrong, but...

In article <20030529.234946.45281613.makino ＠ fbox.ath.cx> Tetsuhisa Makino <makino ＠ fbox.ath.cx> wrote:
> 1. The SMP kernel does not boot properly.
>    It hangs during boot.

Can you tell where it hangs?

> 2. PM Test cannot be run.
>    An error message is displayed. (Using the UP kernel.)
>    > No route to host(113)

What is the result of the following?
    % cluster-hostname-check smpc

                       from Kameyama Toyohisa


From kameyama ＠ pccluster.org  Fri May 30 09:25:10 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Fri, 30 May 2003 09:25:10 +0900
Subject: [SCore-users-jp] Re: [SCore-users] ia32/ia64 cluster
In-Reply-To: Your message of "Thu, 29 May 2003 22:24:31 JST."
             <200305292224.31997.nrcb@streamline-computing.com>
Message-ID: <20030530002338.5E84E128944@neal.il.is.s.u-tokyo.ac.jp>

In article <200305292224.31997.nrcb ＠ streamline-computing.com> Nick Birkett <nrcb ＠ streamline-computing.com> wrote:
> Trying to run Itanium application from Xeon server  gives this:
> 
> server mpi:$  scout -F hosts -e scrun -nodes=2 ./jacobi_mpi
> done.
> FEP: Unable to connect with SCore-D (comp07)
> FEP:WARNING checkpoint option is ignored in single-user mode.
> <0> SCore-D:WARNING Unable to open a network configuration file (2):
> network='gigabit', attribute='-config:file'
> <0> SCore-D:ERROR No PM device opened.

Is there a /var/scored/scoreboard directory on the Itanium hosts?
If the /var/scored directory is not found, please execute the following command
to create the /var/scored/* directories:
    # /opt/score/install/setup -score_comp

                       from Kameyama Toyohisa


From makino ＠ giga.it.okayama-u.ac.jp  Fri May 30 10:30:59 2003
From: makino ＠ giga.it.okayama-u.ac.jp (Tetsuhisa MAKINO)
Date: Fri, 30 May 2003 10:30:59 +0900
Subject: [SCore-users-jp] Installing on an SMP cluster
Message-ID: <20030530103059.76a08f5d.makino@giga.it.okayama-u.ac.jp>

This is Makino at Okayama University.

The SMP kernel boot problem was solved using the information on the following site.

http://mandrakeforum.com/article.php?sid=869&lang=en

It seems it was a bug in the motherboard itself.
However, /opt/score/sbin/rpmtest still fails.
>> 2. PM Test cannot be run.
>>    An error message is displayed. (Using the UP kernel.)
>>    > No route to host(113)
>
>What is the result of the following?
>    % cluster-hostname-check smpc

The output was as follows.
I removed the Perl errors because they were a nuisance.

[makino ＠ anzu ~]$ cluster-hostname-check smpc
SCOUT: Spawning done.                       
[smp0-1]:
smp0.giga.it.okayama-u.ac.jp is OK
smp1.giga.it.okayama-u.ac.jp is OK
SCOUT: Session done.

Thank you in advance.

-- 
Okayama University, Graduate School of Natural Science and Technology, master's course, 2nd year
Name:	Tetsuhisa Makino(牧野 哲久)
E-MAIL: makino ＠ giga.it.okayama-u.ac.jp


From kameyama ＠ pccluster.org  Fri May 30 15:30:02 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Fri, 30 May 2003 15:30:02 +0900
Subject: [SCore-users-jp] Installing on an SMP cluster
In-Reply-To: Your message of "Fri, 30 May 2003 10:30:59 JST."
             <20030530103059.76a08f5d.makino@giga.it.okayama-u.ac.jp>
Message-ID: <20030530062829.D6FFE128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <20030530103059.76a08f5d.makino ＠ giga.it.okayama-u.ac.jp> Tetsuhisa MAKINO <makino ＠ giga.it.okayama-u.ac.jp> wrote:
> >> 2. The PM test cannot be run.
> >>    An error message is displayed. (The UP kernel is in use.)
> >>    > No route to host(113)
> >
> >    % cluster-hostname-check smpc
> >What is the output of this command?

I actually suspected the hostname settings, but...

> The output was as follows.
> I have removed the perl errors because they cluttered the output.
> 
> [makino ＠ anzu ~]$ cluster-hostname-check smpc
> SCOUT: Spawning done.                       
> [smp0-1]:
> smp0.giga.it.okayama-u.ac.jp is OK
> smp1.giga.it.okayama-u.ac.jp is OK
> SCOUT: Session done.


That looks correct.

In that case, the problem may be in the PM config file. Please try running
    $ cluster-network-check -v -group smpc
This command checks the SCore config files.
If everything is normal, the output should look like this:
    smp0.giga.it.okayama-u.ac.jp has 2 cpu, network: ethernet shmem0 shmem1
        scored use ethernet
    smp1.giga.it.okayama-u.ac.jp has 2 cpu, network: ethernet shmem0 shmem1
        scored use ethernet
    2 hosts has 2 cpu
    all hosts has  ethernet
    scored use ethernet
If any other odd messages appear, please tell me what they are.
If the output is normal, set the environment variable PM_DEBUG to 5, run
pmtest, and send me the output.

                       from Kameyama Toyohisa


From hori ＠ swimmy-soft.com  Fri May 30 15:56:00 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Fri, 30 May 2003 15:56:00 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Execution Time Problem
In-Reply-To: <20030529192809.GA9244@titan.cse.psu.edu>
References: <20030529192809.GA9244@titan.cse.psu.edu>
Message-ID: <3137154960.hori0000@swimmy-soft.com>

Hi,

>I have been running some NAS benchmarks using Score-D on a myrinet 
>network and the results are much 
>worse than the ones on GM (without score). For example SP (class B, 
>9 nodes) finishes in around 230 
>seconds on GM and it takes 1000 seconds when I use score...
>
>Is there something wrong or is it normal???

How did you switch between GM and SCore/PM? Both GM and SCore try to 
open the Myrinet device, so they conflict. I suspect you are actually using 
Ethernet instead of Myrinet.

Try running with the stat=all option for SCore, for example

% scrun -nodes=XXX,stat=all ./a.out

or 

% mpirun -score stat=all ./a.out

and you will get statistics information, like the following

-=-=-=-= SCore-D Statistics =-=-=-=-
Nodes:8, User:287.0[m], Elapsed:3.857[S], CSW:1, CKPT:0
 [0:0] 4[hosts]x2[procs], comp0.pccluster.org...comp3.pccluster.org

Network:
shmem0/shmem[1], myrinet/myrinet:1[2];
 [0:1] shmem1/shmem[1], myrinet/myrinet:1[2];
 [1:0] shmem0/shmem[1], myrinet/myrinet:1[2];
 [1:1] shmem1/shmem[1], myrinet/myrinet:1[2];
 [2:0] shmem0/shmem[1], myrinet/myrinet:1[2];
 [2:1] shmem1/shmem[1], myrinet/myrinet:1[2];
 [3:0] shmem0/shmem[1], myrinet/myrinet:1[2];
 [3:1] shmem1/shmem[1], myrinet/myrinet:1[2];

#Node UsrTime  SysTime  Mem   Disk   #SC  IO    Exit
    0 2.450[S]  90.0[m] 0[KB]  1[MB]   8 0[B]      1
    1 2.920[S]  20.0[m] 0[KB]  1[MB]   6 0[B] 9[sig]
    2 2.500[S]  90.0[m] 0[KB]  1[MB]   6 0[B] 9[sig]
    3 2.960[S]  10.0[m] 0[KB]  1[MB]   6 0[B] 9[sig]
    4 2.370[S]  60.0[m] 0[KB]  1[MB]   6 0[B] 9[sig]
    5 2.830[S]  10.0[m] 0[KB]  1[MB]   6 0[B] 9[sig]
    6 2.560[S]  60.0[m] 0[KB]  1[MB]   6 0[B] 9[sig]
    7 3.010[S]  10.0[m] 0[KB]  1[MB]   6 0[B] 9[sig]
  Min 2.370[S]  10.0[m] 0[KB]  1[MB]   6 0[B]    ---
  Max 3.010[S]  90.0[m] 0[KB]  1[MB]   8 0[B]    ---
  Ave 2.700[S]  43.0[m] 0[KB]  1[MB]   6 0[B]    ---


---
As you can see, if the network is Myrinet then you will find Myrinet 
in the statistics.
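
The same check can be scripted. The following is a minimal Python sketch, not
part of SCore: it assumes each entry in the Network section has the form
name/type, as in the sample above, and takes the part after the slash as the
network type. The function name and the sample string are illustrative only.

```python
import re


def networks_in_stats(stats_text):
    """Return the set of network types named in a SCore-D statistics dump."""
    # Entries look like "myrinet/myrinet:1[2]" or "shmem0/shmem[1]";
    # the letters after the slash are treated here as the network type
    # (an assumption based on the sample output, not on SCore documentation).
    return set(re.findall(r"\b\w+/([a-z]+)", stats_text))


sample = """\
Network:
shmem0/shmem[1], myrinet/myrinet:1[2];
 [0:1] shmem1/shmem[1], myrinet/myrinet:1[2];
"""

types = networks_in_stats(sample)
print(sorted(types))
print("Myrinet in use" if "myrinet" in types else "Myrinet not in use")
```

If the job were running over Ethernet, one would expect an ethernet entry in
place of myrinet (an assumption; no such output appears in this thread).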

----
Atsushi HORI
Swimmy Software, Inc.



From nrcb ＠ streamline-computing.com  Fri May 30 17:04:28 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Fri, 30 May 2003 09:04:28 +0100
Subject: [SCore-users-jp] Re: [SCore-users] ia32/ia64 cluster
In-Reply-To: <20030530002338.5E84E128944@neal.il.is.s.u-tokyo.ac.jp>
References: <20030530002338.5E84E128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200305300904.28514.nrcb@streamline-computing.com>

On Friday 30 May 2003 01:25 am, kameyama ＠ pccluster.org wrote:
> In article <200305292224.31997.nrcb ＠ streamline-computing.com> Nick Birkett <nrcb ＠ streamline-computing.com> wrote:
> > Trying to run Itanium application from Xeon server  gives this:
> >
> > server mpi:$  scout -F hosts -e scrun -nodes=2 ./jacobi_mpi
> > done.
> > FEP: Unable to connect with SCore-D (comp07)
> > FEP:WARNING checkpoint option is ignored in single-user mode.
> > <0> SCore-D:WARNING Unable to open a network configuration file (2):
> > network='gigabit', attribute='-config:file'
> > <0> SCore-D:ERROR No PM device opened.
>
> Is there directory /var/scored/scoreboard on Itanium hosts?
> If /var/scored directory is not found, please execute following command
> to create /var/scored/* directories:
>     # /opt/score/install/setup -score_comp

>

Dear Mr Kameyama, thanks.  OK, I think it is nearly there.

server - Xeon front end to cluster
itanium01  - Itanium2 comp node 


server mpi:$ rsh itanium01 cat /var/scored/scoreboard/server.0000V2002zBJ
unit 0
maxnsend 24
backoff 2000
0 00:30:48:27:5F:B6 comp00.streamline
1 00:30:48:27:5F:10 comp01.streamline
2 00:30:48:27:5F:02 comp02.streamline
3 00:30:48:27:72:CC comp03.streamline
4 00:30:48:27:5F:B0 comp04.streamline
5 00:30:48:27:5F:C4 comp05.streamline
6 00:30:48:27:72:C8 comp06.streamline
7 00:30:48:27:5F:0C comp07.streamline
8 00:07:E9:D8:0E:37 itanium01.streamline


ok rpmtest and scstest both run across Xeon and Itanium.

My .wrapper links to jacobi_mpi, and I have the two executable files, both
named jacobi_mpi.exe, under

 /opt/score/bin/bin.ia64-redhat-linux2_4 on the Itanium node and the Xeon server,
 and under /opt/score/bin/bin.i386-redhat7-linux2_4 on the server.

Do I need the i386 binary also on the itanium compute node ? 
 

server mpi:$ cat hosts
comp00.streamline
itanium01.streamline
server mpi:$ scout -F hosts -e scrun -nodes=2x1 ./jacobi_mpi
SCOUT: Spawning done.            
FEP: Unable to connect with SCore-D (comp07)
FEP:WARNING checkpoint option is ignored in single-user mode.
<1> ULT:ERROR Unable to open binary file 
(/opt/score/deploy/bin.ia64-redhat-linux2_4/scored.exe)=0
server mpi:$ 

I have mounted the itanium01  /opt/score/deploy/bin.ia64-redhat-linux2_4/ at 
the same location on server, and scored.exe is there:

[root ＠ tiger4 bin.ia64-redhat-linux2_4]# pwd
/opt/score/deploy/bin.ia64-redhat-linux2_4
[root ＠ tiger4 bin.ia64-redhat-linux2_4]# ls -al score*
-rwxr-xr-x    1 root     root     18676089 May 26 15:19 scored_dev.exe
-rwxr-xr-x    1 root     root     17489790 May 26 15:19 scored.exe
[root ＠ tiger4 bin.ia64-redhat-linux2_4]# 


Thanks for your help,

Nick


From makino ＠ giga.it.okayama-u.ac.jp  Fri May 30 17:19:25 2003
From: makino ＠ giga.it.okayama-u.ac.jp (Tetsuhisa MAKINO)
Date: Fri, 30 May 2003 17:19:25 +0900
Subject: [SCore-users-jp] Installing on an SMP cluster
In-Reply-To: <20030530062829.D6FFE128944@neal.il.is.s.u-tokyo.ac.jp>
References: <20030530103059.76a08f5d.makino@giga.it.okayama-u.ac.jp>
	<20030530062829.D6FFE128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <20030530171925.3c3fa418.makino@giga.it.okayama-u.ac.jp>

This is Makino.

Some warnings did appear... (^^;

On Fri, 30 May 2003 15:30:02 +0900
kameyama ＠ pccluster.org wrote:

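> This is Kameyama.
> 
> > >> 2. The PM test cannot be run.
> > >>    An error message is displayed. (The UP kernel is in use.)
> > >>    > No route to host(113)
> > >
> > >    % cluster-hostname-check smpc
> > >What is the output of this command?
> 
> I actually suspected the hostname settings, but...
> 
> > The output was as follows.
> > I have removed the perl errors because they cluttered the output.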
> > 
> > [makino ＠ anzu ~]$ cluster-hostname-check smpc
> > SCOUT: Spawning done.                       
> > [smp0-1]:
> > smp0.giga.it.okayama-u.ac.jp is OK
> > smp1.giga.it.okayama-u.ac.jp is OK
> > SCOUT: Session done.
> 
> 
> That looks correct.
> 
> In that case, the problem may be in the PM config file. Please try running
>     $ cluster-network-check -v -group smpc
> This command checks the SCore config files.
> If everything is normal, the output should look like this:
>     smp0.giga.it.okayama-u.ac.jp has 2 cpu, network: ethernet shmem0 shmem1
>         scored use ethernet
>     smp1.giga.it.okayama-u.ac.jp has 2 cpu, network: ethernet shmem0 shmem1
>         scored use ethernet
>     2 hosts has 2 cpu
>     all hosts has  ethernet
>     scored use ethernet
> If any other odd messages appear, please tell me what they are.

# cluster-network-check -v -group smpc
smp0.giga.it.okayama-u.ac.jp has 2 cpu, network:gigaethernet shmem0 shmem1
WARNING: smp0.giga.it.okayama-u.ac.jp use gigaethernet in scoreboard but it is not found in config file
WARNING: No full coverage networks
smp1.giga.it.okayama-u.ac.jp has 2 cpu, network:gigaethernet shmem0 shmem1
WARNING: smp1.giga.it.okayama-u.ac.jp use gigaethernet in scoreboard but it is not found in config file
2 hosts has 2 cpu
scored not work this configuration

Warning messages are shown.
Judging from the messages, the network seems to be set to gigaethernet.
gigaethernet and ethernet are different things, aren't they?
Also, where is the config file located?

Sorry for the trouble, and thank you for your help.

-- 
Second-year master's student, Graduate School of Natural Science and Technology, Okayama University
Name:	Tetsuhisa Makino(牧野 哲久)
E-MAIL: makino ＠ giga.it.okayama-u.ac.jp


From kameyama ＠ pccluster.org  Fri May 30 17:50:13 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Fri, 30 May 2003 17:50:13 +0900
Subject: [SCore-users-jp] Installing on an SMP cluster
In-Reply-To: Your message of "Fri, 30 May 2003 17:19:25 JST."
             <20030530171925.3c3fa418.makino@giga.it.okayama-u.ac.jp>
Message-ID: <20030530084839.B93C5128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <20030530171925.3c3fa418.makino ＠ giga.it.okayama-u.ac.jp> Tetsuhisa MAKINO <makino ＠ giga.it.okayama-u.ac.jp> wrote:
> > In that case, the problem may be in the PM config file. Please try running
> >     $ cluster-network-check -v -group smpc
> > This command checks the SCore config files.
> > If everything is normal, the output should look like this:
> >     smp0.giga.it.okayama-u.ac.jp has 2 cpu, network: ethernet shmem0 shmem1
> >         scored use ethernet
> >     smp1.giga.it.okayama-u.ac.jp has 2 cpu, network: ethernet shmem0 shmem1
> >         scored use ethernet
> >     2 hosts has 2 cpu
> >     all hosts has  ethernet
> >     scored use ethernet
> > If any other odd messages appear, please tell me what they are.
> 
> # cluster-network-check -v -group smpc
> smp0.giga.it.okayama-u.ac.jp has 2 cpu, network:gigaethernet shmem0 shmem1
> WARNING: smp0.giga.it.okayama-u.ac.jp use gigaethernet in scoreboard but it is not found in config file
> WARNING: No full coverage networks
> smp1.giga.it.okayama-u.ac.jp has 2 cpu, network:gigaethernet shmem0 shmem1
> WARNING: smp1.giga.it.okayama-u.ac.jp use gigaethernet in scoreboard but it is not found in config file
> 2 hosts has 2 cpu
> scored not work this configuration
> 
> Warning messages are shown.
> Judging from the messages, the network seems to be set to gigaethernet.

> gigaethernet and ethernet are different things, aren't they?

Yes, they are different.

> Also, where is the config file located?

In this case, please check
    /opt/score/etc/scorehosts.db
on the server host.
Its format is described at
    http://www.pccluster.org/score/dist/score/html/ja/man/man5/scorehosts.db.html
Near the top of the file there should be lines such as

ethernet        type=ethernet \
                -config:file=/opt/score/etc/pm-ethernet.conf

This is how SCore describes a network. The leading ethernet is the name
you give as the second argument of rpmtest or with -network for scstest.
type gives the type of the network, and -config:file gives the location of
that network's config file.

Further down there should be entries such as
    smp0.giga.it.okayama-u.ac.jp HOST_0 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV
(If the file was created with EIT, HOST_0 and MSGBSERV should be defined
above this line or in a file that is #included.)
Each such line describes one host: network lists the networks available on
that host, and group lists the groups used by scout and similar commands.
Every name listed in network must be defined in this file.
In your case, I think the cause is that the gigaethernet network used by the
hosts is not defined.
If there is a definition for ethernet, changing gigaethernet to ethernet
should fix it.
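
The rule that every name in network= must be defined can be checked
mechanically. Below is a minimal Python sketch under simplifying assumptions:
it treats any line carrying a type= attribute as a network definition and any
line carrying network= as a host entry, joins backslash-continued lines, and
does not follow #include files or expand macros such as HOST_0. The function
name and the shmem definitions in the sample are illustrative, not taken from
the real file.

```python
def undefined_networks(db_text):
    """Map each host to the network names it uses that are never defined."""
    # Join backslash-continued lines into single logical lines.
    logical_lines = db_text.replace("\\\n", " ").splitlines()
    defined, used = set(), {}
    for line in logical_lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # blank line, comment, or #include (not followed here)
        name = fields[0]
        for attr in fields[1:]:
            if attr.startswith("type="):
                defined.add(name)  # network definition line
            elif attr.startswith("network="):
                used[name] = attr[len("network="):].split(",")  # host line
    missing = {}
    for host, nets in used.items():
        bad = [n for n in nets if n not in defined]
        if bad:
            missing[host] = bad
    return missing


# Sample modelled on the lines quoted in this message; the shmem
# definitions are assumed for illustration.
sample_db = r"""
ethernet        type=ethernet \
                -config:file=/opt/score/etc/pm-ethernet.conf
shmem0          type=shmem -node=0
shmem1          type=shmem -node=1
smp0.giga.it.okayama-u.ac.jp HOST_0 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV
"""

print(undefined_networks(sample_db))
```

For the sample, the only name reported is gigaethernet, which matches the
warning printed by cluster-network-check.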

However, I guess that you probably ran rpmtest like this:
    % rpmtest smp1 ethernet -reply &
    % rpmtest smp0 ethernet -dest 1 -ping
If that is what failed, the ethernet config file may be wrong.
It may be better to regenerate the config file using
   /opt/score/deploy/mkpmethernetconf

For more details, see
    http://www.pccluster.org/score/dist/score/html/ja/installation/sys-server.html

                       from Kameyama Toyohisa


From hori ＠ swimmy-soft.com  Fri May 30 18:20:18 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Fri, 30 May 2003 18:20:18 +0900
Subject: [SCore-users-jp] Re: [SCore-users] ia32/ia64 cluster
In-Reply-To: <200305300904.28514.nrcb@streamline-computing.com>
References: <20030530002338.5E84E128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <3137163618.hori0001@swimmy-soft.com>

Hi,

><1> ULT:ERROR Unable to open binary file 
>(/opt/score/deploy/bin.ia64-redhat-linux2_4/scored.exe)=0

The ULT, which is a (user-level) thread library used by SCore-D, 
opens its binary file to obtain symbol information when a cluster is 
heterogeneous, so that every remote thread invocation can be translated 
based on the symbol information. 
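
As a rough illustration of what such a translation could look like, here is a
minimal Python sketch. It is one plausible reading of the description above,
not SCore's actual code: a function address in the local binary is mapped to
its symbol name, and the name is then looked up in the symbol table of the
remote architecture's binary. The function name, symbol names, and addresses
are all hypothetical.

```python
def translate_entry_point(addr, local_symbols, remote_symbols):
    """Map a function address in the local binary to the remote binary."""
    # Invert the local table: address -> symbol name.
    name_at = {address: name for name, address in local_symbols.items()}
    name = name_at[addr]
    # Look the same name up in the remote architecture's table.
    return remote_symbols[name]


# Hypothetical symbol tables, as might be read from each scored.exe.
i386_symbols = {"worker_main": 0x08049000, "idle_loop": 0x08049400}
ia64_symbols = {"worker_main": 0x4000000000012340,
                "idle_loop": 0x4000000000012A00}

remote = translate_entry_point(0x08049000, i386_symbols, ia64_symbols)
print(hex(remote))
```

A scheme like this needs the symbol table of every architecture in the
cluster, which would explain why the i386 side has to read the ia64 binary.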

This message says that ULT tried to open its binary file (with bfd_openr) 
but failed. I have no ia64 environment, so I cannot investigate 
further. 

Kameyama-san, can you?

----
Atsushi HORI
Swimmy Software, Inc.



From kameyama ＠ pccluster.org  Fri May 30 18:27:20 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Fri, 30 May 2003 18:27:20 +0900
Subject: [SCore-users-jp] Re: [SCore-users] ia32/ia64 cluster
In-Reply-To: Your message of "Fri, 30 May 2003 18:20:18 JST."
             <3137163618.hori0001@swimmy-soft.com>
Message-ID: <20030530092547.850C3128944@neal.il.is.s.u-tokyo.ac.jp>

In article <3137163618.hori0001 ＠ swimmy-soft.com> Atsushi HORI <hori ＠ swimmy-soft.com> wrote:
> This message says ULT failed to open (bfd_openr) its binary file, but 
> it fails. I have no ia64 environment, and I can not investigate 
> further. 

On Red Hat 7.3 (and 8.0 and 9) on i386, libbfd does not support ia64 binaries.

You must rebuild the binutils packages on the server host with the following flag:
    --enable-targets=ia64-linux
Then please rebuild SCore (at least scored).
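
One way to see which targets a binutils build handles is `objdump -i`, which
lists the architectures and object formats it was built for. The Python sketch
below searches that listing for ia64. It is only a proxy, under the assumption
that objdump and the libbfd that scored is linked against come from the same
build; the function names and the sample text are illustrative.

```python
import subprocess


def bfd_supports(target_substring, objdump_info):
    """True if any line of `objdump -i` output mentions the given target."""
    return any(target_substring in line
               for line in objdump_info.splitlines())


def local_objdump_info():
    """Return the output of `objdump -i` from the binutils on this host."""
    result = subprocess.run(["objdump", "-i"], capture_output=True,
                            text=True, check=True)
    return result.stdout


# Illustrative excerpt only; real output is much longer.
sample_info = "elf32-i386\nelf64-ia64-little\n"
print(bfd_supports("ia64", sample_info))
```

On the server host the check would be run as
bfd_supports("ia64", local_objdump_info()).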

                       from Kameyama Toyohisa


From nrcb ＠ streamline-computing.com  Sat May 31 02:48:34 2003
From: nrcb ＠ streamline-computing.com (Nick Birkett)
Date: Fri, 30 May 2003 18:48:34 +0100
Subject: [SCore-users-jp] Re: [SCore-users] ia32/ia64 cluster
In-Reply-To: <20030530092547.850C3128944@neal.il.is.s.u-tokyo.ac.jp>
References: <20030530092547.850C3128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200305301848.34646.nrcb@streamline-computing.com>

On Friday 30 May 2003 10:27 am, kameyama ＠ pccluster.org wrote:
> In article <3137163618.hori0001 ＠ swimmy-soft.com> Atsushi HORI <hori ＠ swimmy-soft.com> wrote:
> > This message says ULT failed to open (bfd_openr) its binary file, but
> > it fails. I have no ia64 environment, and I can not investigate
> > further.
>
> On redhat 7.3 (and 8.0 and 9) on i386, libbfd don't support ia64 binaries.
>
> You must rebuild binutils packages on server host with following flag:
>     --enable-targets=ia64-linux
> And please rebuild SCore (at least scored).
>
>                        from Kameyama Toyohisa

Ok I have rebuilt libbfd.a  (static lib) on the Xeon main server and
recompiled Score 5.4 (score and mpi) from source.

Everything is working fine on the Xeon nodes. 

scored.exe seems to have ia64 symbols:

nm /opt/score/deploy/bin.i386-redhat7-linux2_4/scored.exe | grep ia64  
08255d40 R bfd_efi_app_ia64_vec
08254ea0 R bfd_elf64_ia64_aix_big_vec
08255020 R bfd_elf64_ia64_aix_little_vec
08254aa0 R bfd_elf64_ia64_big_vec
082552a0 R bfd_elf64_ia64_hpux_big_vec
08254c20 R bfd_elf64_ia64_little_vec
08256440 R bfd_ia64_arch

etc

However I still have the same error:
 
server mpi:$ cat hosts
comp00.streamline
itanium01.streamline 
server mpi:$ scout -F hosts -e scrun -nodes=2x1 ./jacobi_mpi
SCOUT: Spawning done.            
FEP: Unable to connect with SCore-D (comp07)
FEP:WARNING checkpoint option is ignored in single-user mode.
<1> ULT:ERROR Unable to open binary file (/opt/score/deploy/bin.ia64-redhat-linux2_4/scored.exe)=0

The itanium01 bin.ia64-redhat-linux2_4 directories are mounted on the Xeon server as well.

server mpi:$ df
itanium01:/opt/score/deploy/bin.ia64-redhat-linux2_4
                      33171392  15447184  16039152  50% /opt/score/deploy/bin.ia64-redhat-linux2_4
itanium01:/opt/score/bin/bin.ia64-redhat-linux2_4/
                      33171392  15447184  16039152  50% /opt/score/bin/bin.ia64-redhat-linux2_4
server mpi:$ 


Many thanks for your help.

Nick