From gchoi ＠ cse.psu.edu Tue Jul 1 03:25:06 2003
From: gchoi ＠ cse.psu.edu (Gyu Sang Choi)
Date: Mon, 30 Jun 2003 14:25:06 -0400
Subject: [SCore-users-jp] [SCore-users] rpmtest with pingpong test
Message-ID: <3F008082.6090500@cse.psu.edu>

Hi there,

I am testing the latency between two nodes using rpmtest.
The command on one node is "./rpmtest aum1 myrinet2k -reply" and the command
on the other node is "./rpmtest aum0 myrinet2k -dest 1 -ping -iter 100 -len 10000".
But it doesn't work.

When the message size is small, rpmtest works.
However, when the message size gets large, rpmtest does not work.

Best Regards,
--
---------------------------------------------------------------
| Gyu Sang Choi |
| TA : CSE/EE 554 |
| Office Hour : Monday 3:00-4:30 and Tuesday 2:30-4:00 |
| Office : 313 Pond Lab (863-3814) |
| email : gchoi ＠ cse.psu.edu |
| Tel : (814)861-6986 |
| Address : 425 Waupelani Drive #327, State College, PA 16801 |
---------------------------------------------------------------
_______________________________________________
SCore-users mailing list
SCore-users ＠ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users

From kameyama ＠ pccluster.org Tue Jul 1 09:10:48 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Tue, 01 Jul 2003 09:10:48 +0900
Subject: [SCore-users-jp] Re: [SCore-users] rpmtest with pingpong test
In-Reply-To: Your message of "Mon, 30 Jun 2003 14:25:06 JST." <3F008082.6090500@cse.psu.edu>
Message-ID: <20030701000736.05D54128944@neal.il.is.s.u-tokyo.ac.jp>

In article <3F008082.6090500 ＠ cse.psu.edu> Gyu Sang Choi wrote:
> When the messages size is small, rpmtest works.
> However, if the message size is getting big, rpmtest doesn't work?

rpmtest (without the -vwrite option) calls pmGetSendBuffer() to send the message.
(http://www.pccluster.org/score/dist/score/html/en/man/man3/pmGetSendBuffer.html)
The length passed to pmGetSendBuffer() must be between 1 and the MTU.
The MTU depends on the PM device; you can get it with pmGetMtu().
For the myrinet2k device (and myrinet), the MTU is 8256.
So you must specify a message size of at most 8256.

from Kameyama Toyohisa

From h035102m ＠ mbox.nagoya-u.ac.jp Thu Jul 3 03:18:20 2003
From: h035102m ＠ mbox.nagoya-u.ac.jp (ueda)
Date: Thu, 3 Jul 2003 03:18:20 +0900
Subject: [SCore-users-jp] About kernel panic
Message-ID: <200307030318.HCC01712.4I20N306@mbox.nagoya-u.ac.jp>

This is Ueda from Nagoya University.
Could someone tell me the cause of the following problem?

I am currently running parallel computations with a self-written MPI program. The computation is a structural analysis using the domain decomposition method, but the run always stops partway through. When I looked into it, I found that while MPI_REDUCE was being executed inside a DO loop, the words 'kernel panic' appeared on the screen of the destination PC. I also found that this happens around the point where the DO loop passes 4000 iterations, but I do not understand why it happens.

The PC that panicked does not boot normally even after a restart, and because of my lack of knowledge I am currently dealing with it by reinstalling. If anyone has had a similar experience or knows a solution, please let me know.

--------------------------------------
Parallel computing environment
SCore 5.2.0
OS: Red Hat Linux 7.3
PCs: 1 server + 8 hosts
Compiler: g77
--------------------------------------

P.S.
The screen of the PC that panicked is filled with text such as
  [] [] .... (continues)
and near the bottom of the screen it shows
  Code: 8b b6 80 00 00 00 77 1d 89 ......... (continues)
  <0>Kernel panic : Aiee, killing interrupt handler!
  In interrupt handler - not syncing
I suspected a memory problem, so I checked that the arrays in the program are large enough, and they are.
This may not be enough information, but thank you in advance for your help.

===============================================
Naoshi Ueda
E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
===============================================

From kameyama ＠ pccluster.org Thu Jul 3 09:23:08 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Thu, 03 Jul 2003 09:23:08 +0900
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: Your message of "Thu, 03 Jul 2003 03:18:20 JST."
<200307030318.HCC01712.4I20N306@mbox.nagoya-u.ac.jp>
Message-ID: <20030703001950.80366128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200307030318.HCC01712.4I20N306 ＠ mbox.nagoya-u.ac.jp> ueda wrote:
> When I looked into it, I found that while MPI_REDUCE was being executed
> inside a DO loop, the words 'kernel panic' appeared on the screen of the
> destination PC. I also found that this happens around the point where the
> DO loop passes 4000 iterations, but I do not understand why it happens.
> The PC that panicked does not boot normally even after a restart, and
> because of my lack of knowledge I am currently dealing with it by
> reinstalling.

A similar error has been reported in the past when PM/Ethernet was used with a BCM5701 NIC (that was with SCore 5.0.1, though).
http://www.pccluster.org/pipermail/score-users-jp/2003-May/001409.html
(Come to think of it, I do not think that one was ever resolved...)

Which PM network are you using? If it is PM/Ethernet, which NIC are you using?

from Kameyama Toyohisa

From h035102m ＠ mbox.nagoya-u.ac.jp Thu Jul 3 12:38:31 2003
From: h035102m ＠ mbox.nagoya-u.ac.jp (ueda)
Date: Thu, 3 Jul 2003 12:38:31 +0900
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: <20030703001950.80366128944@neal.il.is.s.u-tokyo.ac.jp>
References: <200307030318.HCC01712.4I20N306@mbox.nagoya-u.ac.jp> <20030703001950.80366128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200307031238.FBJ97167.N6I34020@mbox.nagoya-u.ac.jp>

This is Ueda from Nagoya University.
Thank you for your reply.

The NIC is a 3Com Gigabit Server NIC, model 3C996B-T.
I believe the driver is tg3.

Thank you.

===============================================
Naoshi Ueda
E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
===============================================

From s-sumi ＠ flab.fujitsu.co.jp Thu Jul 3 12:49:32 2003
From: s-sumi ＠ flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Thu, 03 Jul 2003 12:49:32 +0900 (JST)
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: <200307031238.FBJ97167.N6I34020@mbox.nagoya-u.ac.jp>
References: <200307030318.HCC01712.4I20N306@mbox.nagoya-u.ac.jp> <20030703001950.80366128944@neal.il.is.s.u-tokyo.ac.jp> <200307031238.FBJ97167.N6I34020@mbox.nagoya-u.ac.jp>
Message-ID: <20030703.124932.846950508.s-sumi@flab.fujitsu.co.jp>

This is Sumimoto from Fujitsu Laboratories.

From: ueda
Subject: Re: [SCore-users-jp] About kernel panic
Date: Thu, 3 Jul 2003 12:38:31 +0900
Message-ID: <200307031238.FBJ97167.N6I34020 ＠ mbox.nagoya-u.ac.jp>

h035102m>
h035102m> This is Ueda from Nagoya University.
h035102m> Thank you for your reply.
h035102m>
h035102m> The NIC is a 3Com Gigabit Server NIC, model 3C996B-T.
h035102m> I believe the driver is tg3.

You can check this with the /sbin/lsmod command or by looking at /etc/modules.conf.

Please use the proven bcm5700 device driver instead of tg3.
With PM/Ethernet, tg3 appears to be unstable and also performs poorly.

h035102m> Thank you.
h035102m>
h035102m> ===============================================
h035102m> Naoshi Ueda
h035102m> E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
h035102m> ===============================================
h035102m>
h035102m>
h035102m> _______________________________________________
h035102m> SCore-users-jp mailing list
h035102m> SCore-users-jp ＠ pccluster.org
h035102m> http://www.pccluster.org/mailman/listinfo/score-users-jp
h035102m>
------
Shinji Sumimoto, Fujitsu Labs

From masa ＠ soldec-solution.jp Thu Jul 3 20:40:31 2003
From: masa ＠ soldec-solution.jp (MASA(tm))
Date: Thu, 03 Jul 2003 20:40:31 +0900
Subject: [SCore-users-jp] SCore on a single PC with Hyper Threading enabled
Message-ID: <200307031157.h63BvHAd026572@soldec-solution.jp>

Hello, this is Kikuchi.

I am trying to set up SCore with the configuration below.
Is it possible to build an SCore cluster from just one PC with Hyper Threading enabled (plus a server host)? If so, please tell me how.
(This is a test for adopting a new CPU and chipset. Once we confirm that SCore works, we plan to deploy on multiple nodes with HT turned off.)

Configuration:
- Compute host (self-assembled)
    CPU: Pentium4 3.0GHz
    M/B: Intel D875PBZ
    Memory: DDR400 512MB x 2
    NIC: On-board Intel Pro1000 (82547EI)
    HDD: UATA5 120GB
    Video: AOpen GeForce FX5200
    Other: FDD, ATAPI CD-ROM
  Red Hat Linux 7.3 runs on it (both UP and SMP kernels), and I have confirmed that the NIC works with the e1000 driver obtained from Intel's site.
- Server host (Gateway GP7-500)
    CPU: PentiumIII 500MHz
    Memory: 384MB

What I did:
a. With EIT
  1. Ran bininstall on the server host.
  2. Created a boot disk. (One host, with Shmem[2] only.)
     Booted the compute host, but the screen blacked out.
  3. On a whim, I entered
       boot: lowres
     on the compute host and found that it was in fact booting. The NIC was not detected.
  4. I thought replacing the driver on the boot disk with the one provided by Intel might help, but I did not know how, so I stopped there.
b. From binary RPMs
  1. Installed from the RPMs following the procedure.
  2. Panic on reboot.
  3. Gave up and reinstalled RHL.
c. From source
  1. Unpacked the source, applied the patch, and loaded config.daily.pccc.
     Rebuilt the kernel; the compute host appears to be fine.
  2. Tested according to
     http://www.pccluster.org/score/dist/score/html/ja/installation/scout-test.html
       $ scorehosts -l -g pcc
       comp1.score.example
       1 host found.
       $ sceptic -v -g pcc
       comp1.score.example: scping FAILED
       comp1.score.example: OK
       All host responding.
       $ msgb -group pcc &
       $ scout -g pcc
       bash: /opt/score5.4.0/deploy/scremote: No such file or directory
       ^C
     (msgb shows the host in red, meaning it is locked. It never returns, so I stop it.)
     So it does not work properly.
     Another cluster (being tested for a separate project, not in operation) has no file called scremote either, so I am stuck.
  3. I confirmed that the two sceptic lines correspond to scping and rsh respectively, and tried the following.
       $ scping comp1
       Unable to make connection.
       $ scping comp1.score.example
       Unable to make connection.
       $ ping comp1
       (reached normally)
       $ rsh comp1 uname -a
       (displayed normally)
     (Same with comp1.score.example.)
     I am stuck.

That is all. Thank you in advance.

--
-----------------------------------
菊池　匡洋
mailto:masa ＠ soldec-solution.jp
-----------------------------------

From kameyama ＠ pccluster.org Thu Jul 3 21:17:52 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Thu, 03 Jul 2003 21:17:52 +0900
Subject: [SCore-users-jp] SCore on a single PC with Hyper Threading enabled
In-Reply-To: Your message of "Thu, 03 Jul 2003 20:40:31 JST." <200307031157.h63BvHAd026572@soldec-solution.jp>
Message-ID: <20030703121432.72032128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200307031157.h63BvHAd026572 ＠ soldec-solution.jp> "MASA(tm)" wrote:
> I am trying to set up SCore with the configuration below.
> Is it possible to build an SCore cluster from just one PC with
> Hyper Threading enabled (plus a server host)? If so, please tell me how.

With an E7505 chipset and a 2.4 GHz Xeon, it does at least run as 2 CPUs.
(The environment is Red Hat 8.0 with kernel 2.4.20 + the SCore patch.
I fixed the few parts where the patch failed by hand.
I have not tried 2.4.19.)
However, when I actually ran the NAS Parallel Benchmarks, performance was very poor.

> What I did:
> a. With EIT
>   1. Ran bininstall on the server host.
>   2. Created a boot disk. (One host, with Shmem[2] only.)
>      Booted the compute host, but the screen blacked out.
>   3. On a whim, I entered
>        boot: lowres
>      on the compute host and found that it was in fact booting. The NIC was not detected.

You did use the floppy for Gigabit Ethernet, right?
SCore 5.4.0 should include e1000 version 4.4.19, though...
> c. From source
>   1. Unpacked the source, applied the patch, and loaded config.daily.pccc.
>      Rebuilt the kernel; the compute host appears to be fine.
>   2. Tested according to
>      http://www.pccluster.org/score/dist/score/html/ja/installation/scout-test.html
>        $ scorehosts -l -g pcc
>        comp1.score.example
>        1 host found.
>        $ sceptic -v -g pcc
>        comp1.score.example: scping FAILED
>        comp1.score.example: OK
>        All host responding.
>        $ msgb -group pcc &
>        $ scout -g pcc
>        bash: /opt/score5.4.0/deploy/scremote: No such file or directory
>        ^C
>      (msgb shows the host in red, meaning it is locked. It never returns, so I stop it.)
>      So it does not work properly.

Is /opt/score/deploy present on the compute host?
scremote is not installed on the server host; it is installed only on compute hosts.
If you installed the server from RPMs, please install the commands on the compute host by running
	# ./bininstall -compute
as described in
	http://www.pccluster.org/score/dist/score/html/ja/installation/sys-compute.html

from Kameyama Toyohisa

From nakata ＠ bestsystems.co.jp Fri Jul 4 17:27:32 2003
From: nakata ＠ bestsystems.co.jp (Hisaho Nakata)
Date: Fri, 4 Jul 2003 17:27:32 +0900
Subject: [SCore-users-jp] SCore on a single PC with Hyper Threading enabled
In-Reply-To: <20030703121432.72032128944@neal.il.is.s.u-tokyo.ac.jp>
References: <200307031157.h63BvHAd026572@soldec-solution.jp> <20030703121432.72032128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <20030704172732.2e24d557.nakata@bestsystems.co.jp>

This is Nakata of Best Systems.

With the original 2.4.20 kernel, I believe the ICH5 is not recognized properly, which causes problems such as DMA being unavailable for IDE HDDs.
This can be fixed by applying some of the patches used in the Red Hat 2.4.20-13 kernel.

Hyper Threading is not very effective for HPC applications, and enabling it often makes things slower, so it is safer to leave it disabled.

>In article <200307031157.h63BvHAd026572 ＠ soldec-solution.jp> "MASA(tm)" wrote:
>> I am trying to set up SCore with the configuration below.
>> Is it possible to build an SCore cluster from just one PC with
>> Hyper Threading enabled (plus a server host)? If so, please tell me how.
>
>With an E7505 chipset and a 2.4 GHz Xeon, it does at least run as 2 CPUs.
>(The environment is Red Hat 8.0 with kernel 2.4.20 + the SCore patch.
>I fixed the few parts where the patch failed by hand.
>I have not tried 2.4.19.)
>However, when I actually ran the NAS Parallel Benchmarks, performance was very poor.
>
>> What I did:
>> a. With EIT
>>   1. Ran bininstall on the server host.
>>   2. Created a boot disk. (One host, with Shmem[2] only.)
>>      Booted the compute host, but the screen blacked out.
>>   3. On a whim, I entered
>>        boot: lowres
>>      on the compute host and found that it was in fact booting. The NIC was not detected.
>
>You did use the floppy for Gigabit Ethernet, right?
>SCore 5.4.0 should include e1000 version 4.4.19, though...
>
>> c. From source
>>   1. Unpacked the source, applied the patch, and loaded config.daily.pccc.
>>      Rebuilt the kernel; the compute host appears to be fine.
>>   2. Tested according to
>>      http://www.pccluster.org/score/dist/score/html/ja/installation/scout-test.html
>>        $ scorehosts -l -g pcc
>>        comp1.score.example
>>        1 host found.
>>        $ sceptic -v -g pcc
>>        comp1.score.example: scping FAILED
>>        comp1.score.example: OK
>>        All host responding.
>>        $ msgb -group pcc &
>>        $ scout -g pcc
>>        bash: /opt/score5.4.0/deploy/scremote: No such file or directory
>>        ^C
>>      (msgb shows the host in red, meaning it is locked. It never returns, so I stop it.)
>>      So it does not work properly.
>
>Is /opt/score/deploy present on the compute host?
>scremote is not installed on the server host; it is installed only on compute hosts.
>If you installed the server from RPMs, please install the commands on the compute host by running
>	# ./bininstall -compute
>as described in
>	http://www.pccluster.org/score/dist/score/html/ja/installation/sys-compute.html
>
>			from Kameyama Toyohisa

========================================================================
Best Systems Inc.
System Solutions Division, Technical Support
Hisaho Nakata (nakata ＠ bestsystems.co.jp)
Tokyo Office No. 2, Heaven Torigoe 1-2F, 2-7-4 Torigoe, Taito-ku, Tokyo 111-0054
Tel: 03-5825-0652 Fax: 03-5825-0645
========================================================================

From akaikoji ＠ po.cc.yamaguchi-u.ac.jp Fri Jul 4 11:01:01 2003
From: akaikoji ＠ po.cc.yamaguchi-u.ac.jp (akaikoji ＠ po.cc.yamaguchi-u.ac.jp)
Date: Fri, 4 Jul 2003 11:01:01 +0900
Subject: [SCore-users-jp] NFS using PM-Myrinet
Message-ID: <5CB3EF5C-ADC3-11D7-82D6-00039377AE42@po.cc.yamaguchi-u.ac.jp>

To all SCore users,

I have installed SCore 5.4.0 and am using it.
The environment is as follows.

Xeon2.4 dual x 64(128cpu) + file server
SCore 5.4.0
RedHat 7.3
Myrinet2000
100base(on Board)

Each compute node's home directory uses NFS over 100BASE to mount the file server's disk. Since the file server also has a Myrinet2000 card, I am wondering whether NFS can be run using PM-Myrinet2k. Is that possible? It does not have to be the PM driver.

Thank you in advance.

############################################
Media and Information Technology Center, Yamaguchi University
Koji Akai

phone: 0836 85 9900
fax: 0836 85 9901
e-mail: akaikoji ＠ po.cc.yamaguchi-u.ac.jp
########################################

From nakata ＠ bestsystems.co.jp Fri Jul 4 20:22:11 2003
From: nakata ＠ bestsystems.co.jp (Hisaho Nakata)
Date: Fri, 4 Jul 2003 20:22:11 +0900
Subject: [SCore-users-jp] NFS using PM-Myrinet
In-Reply-To: <5CB3EF5C-ADC3-11D7-82D6-00039377AE42@po.cc.yamaguchi-u.ac.jp>
References: <5CB3EF5C-ADC3-11D7-82D6-00039377AE42@po.cc.yamaguchi-u.ac.jp>
Message-ID: <20030704202211.13347d59.nakata@bestsystems.co.jp>

This is Nakata of Best Systems.

It is not impossible if you use the GM driver and run TCP over GM, but the performance will be very poor.

>I have installed SCore 5.4.0 and am using it.
>The environment is as follows.
>
>Xeon2.4 dual x 64(128cpu) + file server
>SCore 5.4.0
>RedHat 7.3
> Myrinet2000
> 100base(on Board)
>
>Each compute node's home directory uses NFS over 100BASE to mount the file server's
>disk. Since the file server also has a Myrinet2000 card, I am wondering
>whether NFS can be run using PM-Myrinet2k. Is that possible?
>It does not have to be the PM driver.
>
>Thank you in advance.
>
>############################################
> Media and Information Technology Center, Yamaguchi University
> Koji Akai
>
> phone: 0836 85 9900
> fax: 0836 85 9901
> e-mail: akaikoji ＠ po.cc.yamaguchi-u.ac.jp
>########################################
>

========================================================================
Best Systems Inc.
System Solutions Division, Technical Support
Hisaho Nakata (nakata ＠ bestsystems.co.jp)
Tokyo Office No. 2, Heaven Torigoe 1-2F, 2-7-4 Torigoe, Taito-ku, Tokyo 111-0054
Tel: 03-5825-0652 Fax: 03-5825-0645
========================================================================

From s-sumi ＠ flab.fujitsu.co.jp Fri Jul 4 11:25:30 2003
From: s-sumi ＠ flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Fri, 04 Jul 2003 11:25:30 +0900 (JST)
Subject: [SCore-users-jp] NFS using PM-Myrinet
In-Reply-To: <5CB3EF5C-ADC3-11D7-82D6-00039377AE42@po.cc.yamaguchi-u.ac.jp>
References: <5CB3EF5C-ADC3-11D7-82D6-00039377AE42@po.cc.yamaguchi-u.ac.jp>
Message-ID: <20030704.112530.730575858.s-sumi@flab.fujitsu.co.jp>

This is Sumimoto from Fujitsu Laboratories.

We are considering implementing file sharing over PM in the future, but at present, existing distributed file systems such as NFS cannot be used as-is on top of PM.

From: akaikoji ＠ po.cc.yamaguchi-u.ac.jp
Subject: [SCore-users-jp] NFS using PM-Myrinet
Date: Fri, 4 Jul 2003 11:01:01 +0900
Message-ID: <5CB3EF5C-ADC3-11D7-82D6-00039377AE42 ＠ po.cc.yamaguchi-u.ac.jp>

akaikoji> To all SCore users,
akaikoji>
akaikoji> I have installed SCore 5.4.0 and am using it.
akaikoji> The environment is as follows.
akaikoji>
akaikoji> Xeon2.4 dual x 64(128cpu) + file server
akaikoji> SCore 5.4.0
akaikoji> RedHat 7.3
akaikoji> Myrinet2000
akaikoji> 100base(on Board)
akaikoji>
akaikoji> Each compute node's home directory uses NFS over 100BASE to mount
akaikoji> the file server's disk. Since the file server also has a Myrinet2000
akaikoji> card, I am wondering whether NFS can be run using PM-Myrinet2k.
akaikoji> Is that possible? It does not have to be the PM driver.
akaikoji>
akaikoji> Thank you in advance.
akaikoji>
akaikoji> ############################################
akaikoji> Media and Information Technology Center, Yamaguchi University
akaikoji> Koji Akai
akaikoji>
akaikoji> phone: 0836 85 9900
akaikoji> fax: 0836 85 9901
akaikoji> e-mail: akaikoji ＠ po.cc.yamaguchi-u.ac.jp
akaikoji> ########################################
akaikoji>
------
Shinji Sumimoto, Fujitsu Labs

From masa ＠ soldec-solution.jp Fri Jul 4 14:02:54 2003
From: masa ＠ soldec-solution.jp (MASA(tm))
Date: Fri, 04 Jul 2003 14:02:54 +0900
Subject: [SCore-users-jp] Re: SCore on a single PC with Hyper Threading enabled
In-Reply-To: <200307031157.h63BvHAd026572@soldec-solution.jp>
References: <200307031157.h63BvHAd026572@soldec-solution.jp>
Message-ID: <200307040519.h645JrAd000555@soldec-solution.jp>

Hello, this is Kikuchi.

Thank you for your answers.

On the compute host I ran
    # cd /mnt/cdrom/score.rpm
    # ./bininstall -compute
and it worked right away.
I had forgotten that the directory layout of the server and compute hosts changed as of 5.4 (?). (Even though I had read about it on the web and had previously confirmed that this was the case...)
# Since I was working with only the two machines connected, I had
# referred to the documentation on the CD-ROM and copied the server
# host's deploy/ directory.

kameyama ＠ pccluster.org wrote in <20030703121432.72032128944 ＠ neal.il.is.s.u-tokyo.ac.jp>
at Thu, 03 Jul 2003 21:17:52 +0900
> With an E7505 chipset and a 2.4 GHz Xeon, it does at least run as 2 CPUs.
To confirm: this was with HT on for one physical CPU, and the result was
> However, when I actually ran the NAS Parallel Benchmarks, performance was very poor.
Is that right?

Incidentally, regarding
> (The environment is Red Hat 8.0 with kernel 2.4.20 + the SCore patch.
> I fixed the few parts where the patch failed by hand.)
how difficult is that?
# I am daunted by the size of the patch.
It would be easier if we could use a recent distribution, for chipset support and so on. However,

Hisaho Nakata wrote in <20030704172732.2e24d557.nakata ＠ bestsystems.co.jp>
at Fri, 4 Jul 2003 17:27:32 +0900
> With the original 2.4.20 kernel, I believe the ICH5 is not recognized properly,
> which causes problems such as DMA being unavailable for IDE HDDs.
> This can be fixed by applying some of the patches used in the
> Red Hat 2.4.20-13 kernel.
given this, it seems there are still plenty of hurdles (by which I mean my own skill level).

> Hyper Threading is not very effective for HPC applications, and enabling it
> often makes things slower, so it is safer to leave it disabled.
Yes.
This time the request was: "There is only one CPU, but the SMP kernel runs on it, so give it a try. We want to try the new hardware (3GHz+ CPU, DDR400)." We will turn HT off in production.

By the way,
kameyama ＠ pccluster.org wrote in <20030703121432.72032128944 ＠ neal.il.is.s.u-tokyo.ac.jp>
at Thu, 03 Jul 2003 21:17:52 +0900
> You did use the floppy for Gigabit Ethernet, right?
Yes. At boot, the following messages are displayed and it stops.
    SIOCSIFADDR: No such device
    Try it again
    (repeated twice)
    Configure Network fails
On the Alt+F3 console there was a line saying
    found nothing

> SCore 5.4.0 should include e1000 version 4.4.19, though...
The version distributed by Intel was 5.0.43.

That is all.

--
-----------------------------------
菊池　匡洋
mailto:masa ＠ soldec-solution.jp
-----------------------------------

From kameyama ＠ pccluster.org Fri Jul 4 14:44:21 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Fri, 04 Jul 2003 14:44:21 +0900
Subject: [SCore-users-jp] Re: SCore on a single PC with Hyper Threading enabled
In-Reply-To: Your message of "Fri, 04 Jul 2003 14:02:54 JST." <200307040519.h645JrAd000555@soldec-solution.jp>
Message-ID: <20030704054059.75D55128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200307040519.h645JrAd000555 ＠ soldec-solution.jp> "MASA(tm)" wrote:
> kameyama ＠ pccluster.org wrote in <20030703121432.72032128944 ＠ neal.il.is.s.u-tokyo.ac.jp>
> at Thu, 03 Jul 2003 21:17:52 +0900
> > With an E7505 chipset and a 2.4 GHz Xeon, it does at least run as 2 CPUs.
> To confirm: this was with HT on for one physical CPU, and the result was
>
> > However, when I actually ran the NAS Parallel Benchmarks, performance was very poor.
> Is that right?

Yes.
(That is only to be expected, since it has no benefit for floating-point computation...)

> Incidentally, regarding
> > (The environment is Red Hat 8.0 with kernel 2.4.20 + the SCore patch.
> > I fixed the few parts where the patch failed by hand.)
> how difficult is that?
> # I am daunted by the size of the patch.

Most of the failures are in the network driver parts, so if you are going to use e1000 you can ignore them.
I think there were only about two manual fixes.

> > which causes problems such as DMA being unavailable for IDE HDDs.
> > This can be fixed by applying some of the patches used in the
> > Red Hat 2.4.20-13 kernel.
> given this, it seems there are still plenty of hurdles (by which I mean my own skill level).

2.4.21 may already handle this...

> By the way,
> kameyama ＠ pccluster.org wrote in <20030703121432.72032128944 ＠ neal.il.is.s.u-tokyo.ac.jp>
> at Thu, 03 Jul 2003 21:17:52 +0900
> > You did use the floppy for Gigabit Ethernet, right?
> Yes. At boot, the following messages are displayed and it stops.
>     SIOCSIFADDR: No such device
>     Try it again
>     (repeated twice)
>     Configure Network fails
> On the Alt+F3 console there was a line saying
>     found nothing

The PCI ID may have changed.

> > SCore 5.4.0 should include e1000 version 4.4.19, though...
> The version distributed by Intel was 5.0.43.

e1000_main.c says
 * o Feature: Added support for 82541 and 82547 hardware.

(And the current version is now 5.1.11...)

from Kameyama Toyohisa

From masa ＠ soldec-solution.jp Fri Jul 4 16:19:08 2003
From: masa ＠ soldec-solution.jp (MASA(tm))
Date: Fri, 04 Jul 2003 16:19:08 +0900
Subject: [SCore-users-jp] Re: SCore on a single PC with Hyper Threading enabled
In-Reply-To: <20030704054059.75D55128944@neal.il.is.s.u-tokyo.ac.jp>
References: <200307040519.h645JrAd000555@soldec-solution.jp> <20030704054059.75D55128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200307040736.h647a9Ad002160@soldec-solution.jp>

Hello, this is Kikuchi.

Thank you for your answers.

kameyama ＠ pccluster.org wrote in <20030704054059.75D55128944 ＠ neal.il.is.s.u-tokyo.ac.jp>
at Fri, 04 Jul 2003 14:44:21 +0900
> > > However, when I actually ran the NAS Parallel Benchmarks, performance was very poor.
  :
> (That is only to be expected, since it has no benefit for floating-point computation...)
I thought so.

> Most of the failures are in the network driver parts, so if you are going
> to use e1000 you can ignore them.
> I think there were only about two manual fixes.
  :
> 2.4.21 may already handle this...
I will get the kernel source and give it a try first.

> > The version distributed by Intel was 5.0.43.
>
> e1000_main.c says
>  * o Feature: Added support for 82541 and 82547 hardware.
>
> (And the current version is now 5.1.11...)
I found 5.1.11 at
http://downloadfinder.intel.com/scripts-df/File_Filter.asp?FileName=e1000-
Following the links from the motherboard page
http://developer.intel.com/design/motherbd/bz/bz_drive.htm
it seems to be 5.0.43, dated 3 Apr 2003.
Intel's site is hard to figure out...

That is all. Thank you very much.

--
-----------------------------------
菊池　匡洋
mailto:masa ＠ soldec-solution.jp
-----------------------------------

From kraehe ＠ copyleft.de Sat Jul 5 00:35:59 2003
From: kraehe ＠ copyleft.de (Michael Koehne)
Date: Fri, 4 Jul 2003 17:35:59 +0200
Subject: [SCore-users-jp] [SCore-users] urgent: backup needed
Message-ID: <20030704153559.GA6180@bakunin.copyleft.de>

Moin Gurus,

the main machine of our cluster had an ext2 crash on an ext3fs for /opt
containing /opt/score ;(

I reinstalled SCore 5.0.0 from CD, but *blush* I do not have a backup
of this directory.

My plan of restoring the cluster into a workable state would be :

- get some template files, so I know what should be there.
- ping the ethernet on broadcast, to gain the MACs of the 40 dual P4
  computing hosts.
- write a Perl script to merge the arp list with the templates.

So mail me your opt.score.etc.tgz, if you have a SCore 5.0.0 cluster with
PBS and Ethernet. You could save my weekend!

Bye Michael
--
mailto:kraehe ＠ copyleft.de UNA:+.?
'CED+2+:::Linux:2.4.18'UNZ+1'
http://www.xml-edifact.org/ CETERUM CENSEO WINDOWS ESSE DELENDAM

From bogdan.costescu ＠ iwr.uni-heidelberg.de Fri Jul 4 23:31:59 2003
From: bogdan.costescu ＠ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 4 Jul 2003 16:31:59 +0200 (CEST)
Subject: [SCore-users-jp] Re: [SCore-users] urgent: backup needed
In-Reply-To: <20030704153559.GA6180@bakunin.copyleft.de>
Message-ID:

On Fri, 4 Jul 2003, Michael Koehne wrote:

> My plan of restoring the cluster into a workable state would be :

Maybe you should start by reading:

http://www.pccluster.org/score/dist/score/html/en/installation/sys-server.html

(one long line above) which explains the files and settings and also
mentions a tool for setting up the Ethernet parameters (MAC addresses)
automatically.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu ＠ IWR.Uni-Heidelberg.De

From s-sumi ＠ bd6.so-net.ne.jp Sun Jul 6 22:00:49 2003
From: s-sumi ＠ bd6.so-net.ne.jp (Shinji Sumimoto)
Date: Sun, 06 Jul 2003 22:00:49 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] 128 limit on score 5.0.1
In-Reply-To: <200307051429.35075.nick@streamline-computing.com>
References: <200307051429.35075.nick@streamline-computing.com>
Message-ID: <20030706.220049.730554846.s-sumi@bd6.so-net.ne.jp>

Hi, Nick.

Could you try the following modification?
In a heterogeneous cluster like this, the first SCore network must cover all of the cluster nodes.
============================================
Original scorehosts.db

comp000.leeds.ac.uk HOST_0 network=myrinet2k,ethernet,shmem0,shmem1 group=_scoreall_,ETHER,MYRI,SHMEM smp=2 MSGBSERV
... /* other */ ...
comp128.leeds.ac.uk HOST_128 network=myrinet2kII,ethernet,shmem0,shmem1 group=MYRI2 smp=2 MSGBSERV
... /* other */ ...
============================================
New scorehosts.db

comp000.leeds.ac.uk HOST_0 network=ethernet,myrinet2k,shmem0,shmem1 group=_scoreall_,ETHER,MYRI,SHMEM smp=2 MSGBSERV
... /* other */ ...
comp128.leeds.ac.uk HOST_128 network=ethernet,myrinet2kII,shmem0,shmem1 group=_scoreall_,MYRI2 smp=2 MSGBSERV
... /* other */ ...
============================================

If the situation does not change, could you send more information
(rpmtest test with -debug 3)?

Shinji.

From: Nick Birkett
Subject: [SCore-users] 128 limit on score 5.0.1
Date: Sat, 5 Jul 2003 14:29:35 +0100
Message-ID: <200307051429.35075.nick ＠ streamline-computing.com>

nick> Just wondering if we have hit the 128 compute node limit on Score 5.0.1 ?
nick>
nick> We plan to upgrade to 5.4 in the next 4 weeks.
nick>
nick> We have 136 compute nodes configured with 2 Myrinet 2k fibre switches
nick>
nick> 128 hosts on a 128 port switch running batch
nick> 8 hosts on an 8 port switch running mult-user
nick>
nick> There are 2 pm-myrinet.conf files pm-myrinet.conf (128 hosts) and
nick> pm-myrinet2.conf (8 hosts), 2 myrinet groups MYRI and MYRI2 and
nick> 2 Myrinet networks - myrinet2k (128) and myrinet2kII (8 hosts) in
nick> scorehosts.db.
nick>
nick> (The opt/score/etc directory is attached as compressed tar).
nick>
nick> Both systems work ok as long as there are no more than 4 compute
nick> nodes (128,129,130,131) listed in scorehosts.db.
nick>
nick> The compute nodes 132,133,134,135 are commented out in scorehosts.db.
nick> Score multi-user is working fin on comp128-131 (4 hosts) on second switch.
nick>
nick> However the mult-user system always fails the basic ping tests with
nick> [root@snowdon sbin]# ./rpminit comp128 myrinet2kII
nick> [root@snowdon sbin]# ./rpmtest comp128 myrinet2kII -dest 128 -ping
nick> pmGetNodeList: No route to host(113)
nick> [root@snowdon sbin]#
nick>
nick>
nick> Regards,
nick>
nick> Nick
nick>
------
Shinji Sumimoto, Fujitsu Labs

From h035102m ＠ mbox.nagoya-u.ac.jp Mon Jul 7 15:13:40 2003
From: h035102m ＠ mbox.nagoya-u.ac.jp (Naoshi Ueda)
Date: Mon, 7 Jul 2003 15:13:40 +0900
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: <20030703.124932.846950508.s-sumi@flab.fujitsu.co.jp>
References: <200307030318.HCC01712.4I20N306@mbox.nagoya-u.ac.jp> <20030703001950.80366128944@neal.il.is.s.u-tokyo.ac.jp> <200307031238.FBJ97167.N6I34020@mbox.nagoya-u.ac.jp> <20030703.124932.846950508.s-sumi@flab.fujitsu.co.jp>
Message-ID: <200307071513.IBG56484.0362I0N4@mbox.nagoya-u.ac.jp>

This is Ueda from Nagoya University.

> Please use the proven bcm5700 device driver instead of tg3.
> With PM/Ethernet, tg3 appears to be unstable and also performs poorly.
>

I changed the driver on the server to bcm5700, but I do not know how to install the driver on the compute hosts. Is there some special procedure?

Thank you in advance.

===============================================
Nagoya University, Graduate School of Engineering
Master's Program, Department of Civil Engineering, 1st year
Concrete Structures Laboratory
Naoshi Ueda
E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
        nao4_22 ＠ hotmail.com
TEL: 052-789-5478
===============================================

From kameyama ＠ pccluster.org Mon Jul 7 15:21:11 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Mon, 07 Jul 2003 15:21:11 +0900
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: Your message of "Mon, 07 Jul 2003 15:13:40 JST." <200307071513.IBG56484.0362I0N4@mbox.nagoya-u.ac.jp>
Message-ID: <20030707061739.A6E29128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.
In article <200307071513.IBG56484.0362I0N4 ＠ mbox.nagoya-u.ac.jp> Naoshi Ueda wrote:
> > Please use the proven bcm5700 device driver instead of tg3.
> > With PM/Ethernet, tg3 appears to be unstable and also performs poorly.
> >
>
> I changed the driver on the server to bcm5700, but I do not know how to
> install the driver on the compute hosts. Is there some special procedure?

If you are using the kernel that comes with the SCore 5.2.0 binary distribution, the bcm5700 driver module is already included.
(It is not the latest version, though...)
You can probably use it by changing
	tg3.o
to
	bcm5700.o
in
	/etc/modules.conf
and rebooting.

from Kameyama Toyohisa

From h035102m ＠ mbox.nagoya-u.ac.jp Mon Jul 7 16:11:34 2003
From: h035102m ＠ mbox.nagoya-u.ac.jp (ueda)
Date: Mon, 7 Jul 2003 16:11:34 +0900
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: <20030707061739.A6E29128944@neal.il.is.s.u-tokyo.ac.jp>
References: <200307071513.IBG56484.0362I0N4@mbox.nagoya-u.ac.jp> <20030707061739.A6E29128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200307071611.FJF97350.432IN600@mbox.nagoya-u.ac.jp>

This is Ueda from Nagoya University. Thank you for your reply.

> You can probably use it by changing tg3.o to bcm5700.o in
> /etc/modules.conf and rebooting.
>

Regarding the above: when I tried it before, the MAC addresses changed automatically and the network stopped working properly.
In more detail, there are currently 8 compute hosts, and on every PC, after switching to bcm5700 in /etc/modules.conf, the MAC address became the same value on all of them. So I changed the MAC addresses, and also changed the MAC addresses in /opt/score/etc/ethernet.conf to match, but then nothing worked at all.

Is some special procedure required?
Thank you in advance.

===============================================
Naoshi Ueda
E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
===============================================

From kameyama ＠ pccluster.org Mon Jul 7 16:19:02 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Mon, 07 Jul 2003 16:19:02 +0900
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: Your message of "Mon, 07 Jul 2003 16:11:34 JST." <200307071611.FJF97350.432IN600@mbox.nagoya-u.ac.jp>
Message-ID: <20030707071530.4EB2F128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.
In article <200307071611.FJF97350.432IN600 ＠ mbox.nagoya-u.ac.jp> ueda wrote:
> > You can probably use it by changing tg3.o to bcm5700.o in
> > /etc/modules.conf and rebooting.
> >
>
> Regarding the above: when I tried it before, the MAC addresses changed
> automatically and the network stopped working properly.
> In more detail, there are currently 8 compute hosts, and on every PC,
> after switching to bcm5700 in /etc/modules.conf, the MAC address became
> the same value on all of them. So I changed the MAC addresses, and also
> changed the MAC addresses in /opt/score/etc/ethernet.conf to match,
> but then nothing worked at all.

If all the MAC addresses were the same, I would expect that even TCP/IP could not be used in the first place...

Is the device in use the same with tg3 and with bcm5700?
(For example, could it be that what was eth0 with tg3 becomes eth1 with bcm5700?)

If possible, please show us the kernel log, at least the Ethernet-related part.

from Kameyama Toyohisa

From s-sumi ＠ flab.fujitsu.co.jp Mon Jul 7 16:28:56 2003
From: s-sumi ＠ flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Mon, 07 Jul 2003 16:28:56 +0900 (JST)
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: <200307071611.FJF97350.432IN600@mbox.nagoya-u.ac.jp>
References: <200307071513.IBG56484.0362I0N4@mbox.nagoya-u.ac.jp> <20030707061739.A6E29128944@neal.il.is.s.u-tokyo.ac.jp> <200307071611.FJF97350.432IN600@mbox.nagoya-u.ac.jp>
Message-ID: <20030707.162856.719917728.s-sumi@flab.fujitsu.co.jp>

This is Sumimoto from Fujitsu Laboratories.

From: ueda
Subject: Re: [SCore-users-jp] About kernel panic
Date: Mon, 7 Jul 2003 16:11:34 +0900
Message-ID: <200307071611.FJF97350.432IN600 ＠ mbox.nagoya-u.ac.jp>

h035102m>
h035102m> This is Ueda from Nagoya University. Thank you for your reply.
h035102m>
h035102m> > You can probably use it by changing tg3.o to bcm5700.o in
h035102m> > /etc/modules.conf and rebooting.
h035102m> >
h035102m>
h035102m> Regarding the above procedure: when I tried it before, the MAC addresses
h035102m> changed automatically and the network stopped working properly.
h035102m> In detail, there are currently 8 compute hosts, and on every PC,
h035102m> switching to bcm5700 in /etc/modules.conf made the MAC address
h035102m> the same on all of them. I then changed the MAC addresses and
h035102m> updated the MAC addresses in /opt/score/etc/ethernet.conf to match,
h035102m> but nothing worked at all.
h035102m>
h035102m> Is some special procedure required?
h035102m> Thank you in advance.

We use the 3Com 996B-T without doing anything special, and it runs stably.

http://www.pccluster.org/score/dist/score/html/ja/overview/pm-perf.html

Has the chip perhaps been changed? What does lspci report?

Our result is shown below.

01:03.0 Ethernet controller: BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)

------
Shinji Sumimoto, Fujitsu Labs

From nick ＠ streamline-computing.com Sat Jul 5 22:29:35 2003
From: nick ＠ streamline-computing.com (Nick Birkett)
Date: Sat, 5 Jul 2003 14:29:35 +0100
Subject: [SCore-users-jp] [SCore-users] 128 limit on score 5.0.1
Message-ID: <200307051429.35075.nick@streamline-computing.com>

Just wondering if we have hit the 128 compute node limit on Score 5.0.1 ?
We plan to upgrade to 5.4 in the next 4 weeks.

We have 136 compute nodes configured with 2 Myrinet 2k fibre switches:

128 hosts on a 128 port switch running batch
8 hosts on an 8 port switch running multi-user

There are 2 pm-myrinet.conf files, pm-myrinet.conf (128 hosts) and pm-myrinet2.conf (8 hosts), 2 myrinet groups, MYRI and MYRI2, and 2 Myrinet networks, myrinet2k (128 hosts) and myrinet2kII (8 hosts), in scorehosts.db. (The opt/score/etc directory is attached as a compressed tar.)

Both systems work ok as long as there are no more than 4 compute nodes (128,129,130,131) listed in scorehosts.db. The compute nodes 132,133,134,135 are commented out in scorehosts.db.

Score multi-user is working fine on comp128-131 (4 hosts) on the second switch.
However the multi-user system always fails the basic ping tests with

[root ＠ snowdon sbin]# ./rpminit comp128 myrinet2kII
[root ＠ snowdon sbin]# ./rpmtest comp128 myrinet2kII -dest 128 -ping
pmGetNodeList: No route to host(113)
[root ＠ snowdon sbin]#

Regards,

Nick
-------------- next part --------------
A non-text attachment was scrubbed...
Name: opt_score_etc.tar
Type: application/x-tar
Size: 327680 bytes
Desc: not available
URL:

From rene.storm ＠ emplics.com Mon Jul 7 18:39:45 2003
From: rene.storm ＠ emplics.com (Rene Storm)
Date: Mon, 7 Jul 2003 11:39:45 +0200
Subject: [SCore-users-jp] [SCore-users] gm while pm
Message-ID: <29B376A04977B944A3D87D22C495FB23012764@vertrieb.emplics.com>

Dear Score Users,

I would like to know if there is any chance to get gm running on a 5.x.x cluster (kernel)?

To be more precise, we would like to use gm_debug and gm_board_info to get information which will help us to analyse hardware and configuration problems (e.g. CRC, PCI bus speed). This information could also be necessary to get Myricom support.

This means we have to load the gm module, but we don't use modules for our score installation, so (of course) we could not unload them first (e.g. pm_myrinet, pm_shmem). In my opinion, the gm module wouldn't work while the score drivers block the hardware.

Is there a chance to get these programs running without installing a new, "un-scored" kernel or recompiling the "scored" one with pm device support as modules ?

Some magic 'echo "do not use" > /proc/.... ' would be nice. ;o)

Thanks in advance
Rene Storm
__________________________
emplics AG
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kameyama ＠ pccluster.org Mon Jul 7 18:51:20 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Mon, 07 Jul 2003 18:51:20 +0900
Subject: [SCore-users-jp] Re: [SCore-users] gm while pm
In-Reply-To: Your message of "Mon, 07 Jul 2003 11:39:45 JST."
<29B376A04977B944A3D87D22C495FB23012764@vertrieb.emplics.com>
Message-ID: <20030707094748.4FDB1128944@neal.il.is.s.u-tokyo.ac.jp>

In article <29B376A04977B944A3D87D22C495FB23012764 ＠ vertrieb.emplics.com> "Rene Storm" wrote:
> This means we have to load the gm module, but we don't use modules for
> our score installation, so (of course)
> we could not unload them at first (e.g. pm_myrinet, pm_shmem).

With the SCore 5.4.0 kernel, you can disable the PM/Myrinet driver by adding the following kernel option at boot time:

pmmyri=0

Please see also the release notes of SCore 5.4.0.
http://www.pccluster.org/score/dist/score/html/en/release/new5-4.html

from Kameyama Toyohisa

From h035102m ＠ mbox.nagoya-u.ac.jp Tue Jul 8 12:26:49 2003
From: h035102m ＠ mbox.nagoya-u.ac.jp (Naoshi Ueda)
Date: Tue, 8 Jul 2003 12:26:49 +0900
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: <20030707.162856.719917728.s-sumi@flab.fujitsu.co.jp>
References: <200307071513.IBG56484.0362I0N4@mbox.nagoya-u.ac.jp> <20030707061739.A6E29128944@neal.il.is.s.u-tokyo.ac.jp> <200307071611.FJF97350.432IN600@mbox.nagoya-u.ac.jp> <20030707.162856.719917728.s-sumi@flab.fujitsu.co.jp>
Message-ID: <200307081226.JCD51586.203N64I0@mbox.nagoya-u.ac.jp>

This is Ueda from Nagoya University.

----------------------------------------------------------------
Regarding Kameyama-san's questions

>
> Is the device in use the same with tg3 and with bcm5700?
> (For example, could it be that what was eth0 with tg3 becomes
> eth1 with bcm5700?)

I checked with ifconfig and /etc/modules.conf, and it was eth0 with bcm5700 as well.

> If possible, please show us the kernel log, at least the Ethernet part.
>

First, with tg3:

kernel 3c59x : Donald Becker and others.
www.scyld....(omitted)
kernel 02:08.0 : 3Com PCI 3c905c Tornado at 0x2000.VersLK1.1.16
kernel tg3.c : v0.99(Jun 11,2002)
kernel eth1 : Tigon3 [partno(3C996B-T) rev0105 PHY(5701)] (PCI:66MHz:64-bit) 10/100/1000BaseT Ethernet 00:04:76:f6:24:5c

With bcm5700, this becomes:

kernel 3c59x : Donald Becker and others. www.scyld....(omitted)
kernel 02:08.0 : 3Com PCI 3c905c Tornado at 0x2000.VersLK1.1.16

That is all; the last two lines were not displayed.

----------------------------------------------------------------
Regarding Sumimoto-san's question

> Has the chip perhaps been changed? What does lspci report?
>
> Our result is shown below.
> 01:03.0 Ethernet controller: BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)

I checked and found:

00:09.0 Ethernet controller: BROADCOM Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)

----------------------------------------------------------------

I forgot to mention that every PC in use has an onboard 100M NIC, so together with the 1000M NIC each machine has two NICs. Also, when installing the compute hosts, no address could be obtained while the network was connected through the 1000M NIC, so I installed over the 100M NIC, then moved the cable to the 1000M NIC and changed /etc/modules.conf and /opt/score/etc/ethernet.conf.

So could it be that, because I installed over the 100M NIC, bcm5700 does not work properly, or the driver is not installed? If so, is there some way to fix it?

Thank you in advance.

===============================================
Naoshi Ueda
E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
===============================================

From kameyama ＠ pccluster.org Tue Jul 8 14:48:51 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Tue, 08 Jul 2003 14:48:51 +0900
Subject: [SCore-users-jp] About kernel panic
In-Reply-To: Your message of "Tue, 08 Jul 2003 12:26:49 JST." <200307081226.JCD51586.203N64I0@mbox.nagoya-u.ac.jp>
Message-ID: <20030708054516.E93A5128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200307081226.JCD51586.203N64I0 ＠ mbox.nagoya-u.ac.jp> Naoshi Ueda wrote:
>
> > Is the device in use the same with tg3 and with bcm5700?
> > (For example, could it be that what was eth0 with tg3 becomes
> > eth1 with bcm5700?)
>
> I checked with ifconfig and /etc/modules.conf, and it was
> eth0 with bcm5700 as well.

According to the log below, it is eth1.
> > If possible, please show us the kernel log, at least the Ethernet part.
> >
>
> First, with tg3:
>
> kernel 3c59x : Donald Becker and others. www.scyld....(omitted)
> kernel 02:08.0 : 3Com PCI 3c905c Tornado at 0x2000.VersLK1.1.16
> kernel tg3.c : v0.99(Jun 11,2002)
> kernel eth1 : Tigon3 [partno(3C996B-T) rev0105 PHY(5701)]
> (PCI:66MHz:64-bit) 10/100/1000BaseT Ethernet 00:04:76:f6:24:5c
>
> With bcm5700, this becomes:
>
> kernel 3c59x : Donald Becker and others. www.scyld....(omitted)
> kernel 02:08.0 : 3Com PCI 3c905c Tornado at 0x2000.VersLK1.1.16
>
> That is all; the last two lines were not displayed.

In other words, eth1 was not recognized?
The 3C996B-T does appear to be supported by the bcm5700 driver in SCore 5.2.0 as well, though...

from Kameyama Toyohisa

From ersoz ＠ cse.psu.edu Wed Jul 9 03:21:16 2003
From: ersoz ＠ cse.psu.edu (Deniz Ersoz)
Date: Tue, 8 Jul 2003 14:21:16 -0400
Subject: [SCore-users-jp] [SCore-users] limiting the number of jobs running at each node ??
Message-ID: <20030708182116.GA24010@zerg.cse.psu.edu>

Hi,

I am using SCore 5.4.0 to get some simulation results using NAS benchmarks for my research and I need to limit the number of jobs running on each node. More specifically, I want no more than 3 jobs/benchmarks running on each node... Can you give me some suggestions on how to do that in SCored?

Thanks,

Deniz Ersoz

From kameyama ＠ pccluster.org Wed Jul 9 09:11:10 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Wed, 09 Jul 2003 09:11:10 +0900
Subject: [SCore-users-jp] Re: [SCore-users] limiting the number of jobs running at each node ??
In-Reply-To: Your message of "Tue, 08 Jul 2003 14:21:16 JST."
<20030708182116.GA24010@zerg.cse.psu.edu>
Message-ID: <20030709000733.28F2D128944@neal.il.is.s.u-tokyo.ac.jp>

In article <20030708182116.GA24010 ＠ zerg.cse.psu.edu> Deniz Ersoz wrote:
> I am using SCore 5.4.0 to get some simulation results using NAS benchmarks for my
> research and I need to limit the number of jobs running on each node. More specifically, I
> want no more than 3 jobs/benchmarks running on each node... Can you give me some
> suggestions on how to do that in SCored?

Please use sc_console and the following sc_console command:

limit cluster all jobs 3

Please see also the sc_console man page:
http://www.pccluster.org/score/dist/score/html/en/man/man8/sc_console.html
And the sc_console commands:
http://www.pccluster.org/score/dist/score/html/en/reference/scored/console_command.html

from Kameyama Toyohisa

From hori ＠ swimmy-soft.com Wed Jul 9 17:47:07 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Wed, 9 Jul 2003 17:47:07 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Question: Checkpointing and Sequential Programs
In-Reply-To: <20030625120537.Y23276@wum.bauingenieure.uni-stuttgart.de>
References: <20030625120537.Y23276@wum.bauingenieure.uni-stuttgart.de>
Message-ID: <3140617627.hori0007@swimmy-soft.com>

Hi,

I found that nobody has answered you !!

>If I start a sequential program on our SCore_5.0 Cluster using SCores
>system(6) command and try to generate a checkpoint later, the program
>catches the sent SIGQUIT signal and immediately terminates instead of
>producing a checkpoint.
>
>As the program does not violate any of the limitations that are
>necessary for checkpointing, I suspect the way system(6) works to make
>checkpointing impossible.

Aha !! There is another undocumented SCore checkpoint limitation.
Any sequential program, I mean any program which is not linked with the SCore libraries, is NOT checkpointable. My feeling is that a sequential program running on SCore is SPECIAL, and the system command is the only way to execute a sequential program in a SPECIAL way.

I am sorry for this inconvenience, but I cannot avoid this.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.

From tachi ＠ mickey.ai.kyutech.ac.jp Thu Jul 10 14:17:30 2003
From: tachi ＠ mickey.ai.kyutech.ac.jp (立川純)
Date: Thu, 10 Jul 2003 14:17:30 +0900
Subject: [SCore-users-jp] About heterogeneous network support
Message-ID: <20030710141730.511bac11.tachi@mickey.ai.kyutech.ac.jp>

My name is Tachikawa, from Kyushu Institute of Technology.

I have a question about cluster environments in which the network configuration differs between hosts. For example, can a host configured for PM/UDP and a host configured for PM/Ethernet communicate with each other correctly?

I believe it was stated that, with the PMv2 protocol, the PM/Composite device lets such hosts communicate transparently. However, there seems to be no documentation on PM/Composite, so may I conclude that it is not officially supported?

Likewise, is communication between hosts that use PM/Ethernet trunking and hosts that do not also unsupported?

If anyone knows, I would appreciate your help.

--
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
_/ 立川純 (E-mail: tachi ＠ mickey.ai.kyutech.ac.jp) /_/
_/ Kyushu Institute of Technology, Graduate School of Computer Science and Systems Engineering, Information Science /_/
_/ Intelligent Information Architecture, Koide Laboratory /_/
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/

From kameyama ＠ pccluster.org Thu Jul 10 15:27:46 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Thu, 10 Jul 2003 15:27:46 +0900
Subject: [SCore-users-jp] About heterogeneous network support
In-Reply-To: Your message of "Thu, 10 Jul 2003 14:17:30 JST." <20030710141730.511bac11.tachi@mickey.ai.kyutech.ac.jp>
Message-ID: <20030710062405.733BC128944@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <20030710141730.511bac11.tachi ＠ mickey.ai.kyutech.ac.jp> 立川純 wrote:
> I have a question about cluster environments in which the network
> configuration differs between hosts. For example, can a host configured
> for PM/UDP and a host configured for PM/Ethernet communicate with each
> other correctly?

Unfortunately, they cannot.
Even when the same medium is used, the protocol differs completely from one PM network to another.
PM assumes that every pair of hosts can communicate directly; no host relays traffic on behalf of another.

> I believe it was stated that, with the PMv2 protocol, the PM/Composite device
> lets such hosts communicate transparently. However, there seems to be no
> documentation on PM/Composite, so may I conclude that it is not officially supported?

PM/Composite itself is supported. (SCore-D uses it.)
With PM/Composite you can, for example, use PM/Ethernet between host A and host B and PM/Myrinet between host A and host C.
(This is done with pmAddNode(), which is available only to PM/Composite.)
However, a host cannot communicate with a host that it cannot reach at the PM level.

> Likewise, is communication between hosts that use PM/Ethernet trunking and
> hosts that do not also unsupported?

That is likewise not possible.

Incidentally, SCore-D requires a (non-composite) PM network that is usable on every host where SCore-D runs.

from Kameyama Toyohisa

From arpiruk ＠ yahoo.com Thu Jul 10 20:52:59 2003
From: arpiruk ＠ yahoo.com (arpiruk ＠ yahoo.com)
Date: Thu, 10 Jul 2003 04:52:59 -0700 (PDT)
Subject: [SCore-users-jp] [SCore-users] race condition in reduction clause in a function call
In-Reply-To: <20030709030000.16676.69745.Mailman@www.pccluster.org>
Message-ID: <20030710115259.53684.qmail@web13903.mail.yahoo.com>

I have a race condition problem in a subfunction. The code is shown below.

main.c
------------------------------------------
#pragma omp parallel private(it)
{
for (it=1;it

Hi,

I have a defect node-hd. What must I do to reinstall score on the new hd?

Thanks

Michael

From hori ＠ swimmy-soft.com Fri Jul 11 10:40:14 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Fri, 11 Jul 2003 10:40:14 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Defect HD of SCore node
In-Reply-To:
References:
Message-ID: <3140764814.hori0000@swimmy-soft.com>

Hi,

>I have a defect node-hd. What must I do to reinstall score on the new hd?
I suggest cloning the hard disk of one of the live nodes onto the new hard disk using the dd command, then changing its hostname and IP address settings.

----
Atsushi HORI
Swimmy Software, Inc.

From rene.storm ＠ emplics.com Fri Jul 11 18:49:55 2003
From: rene.storm ＠ emplics.com (Rene Storm)
Date: Fri, 11 Jul 2003 11:49:55 +0200
Subject: [SCore-users-jp] AW: [SCore-users] Defect HD of SCore node
Message-ID: <29B376A04977B944A3D87D22C495FB2301276D@vertrieb.emplics.com>

Hi Michael,

maybe it would be easier to do something like this:

start eit in /opt/score/bin
hit the "load configuration" button
make a boot disk (if you haven't got one already) and reinstall the node

All necessary information (like the HW address) should have been stored by the score installer.

Please look at the installation guide on www.pccluster.org. It is described well there.

Broken HDs are very easy to handle with score. It needs a little hack if you want to change the MAC address, but that's easy too.

Greetings
Rene

--------------
emplics AG

-----Original Message-----
From: Michael N. [mailto:lunetix83 ＠ hotmail.com]
Sent: Friday, 11 July 2003 01:48
To: score-users ＠ pccluster.org
Subject: [SCore-users] Defect HD of SCore node

Hi,

I have a defect node-hd. What must I do to reinstall score on the new hd?

Thanks

Michael

From h035102m ＠ mbox.nagoya-u.ac.jp Tue Jul 15 17:08:32 2003
From: h035102m ＠ mbox.nagoya-u.ac.jp (ueda)
Date: Tue, 15 Jul 2003 17:08:32 +0900
Subject: [SCore-users-jp] bc5700
In-Reply-To: <20030707071530.4EB2F128944@neal.il.is.s.u-tokyo.ac.jp>
References: <200307071611.FJF97350.432IN600@mbox.nagoya-u.ac.jp> <20030707071530.4EB2F128944@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200307151708.DDJ01078.60024NI3@mbox.nagoya-u.ac.jp>

This is Ueda from Nagoya University.

Regarding bcm5700: it seems the problem was caused by my having installed the latest driver on the server side. After reinstalling SCore from scratch and changing eth0 in /etc/modules.conf from tg3 to bcm5700, it worked.

I apologize for the trouble.

===============================================
Naoshi Ueda
E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
===============================================

From ersoz ＠ cse.psu.edu Thu Jul 17 03:49:16 2003
From: ersoz ＠ cse.psu.edu (Deniz Ersoz)
Date: Wed, 16 Jul 2003 14:49:16 -0400
Subject: [SCore-users-jp] [SCore-users] physical memory limit??
In-Reply-To: <20030708182116.GA24010@zerg.cse.psu.edu>
References: <20030708182116.GA24010@zerg.cse.psu.edu>
Message-ID: <20030716184916.GA12704@titan.cse.psu.edu>

Hi,

I was trying to see the effect of swapping in SCore. So, I tried to run three MG (NAS) benchmarks at the same time, but one of them was rejected because of memory limitations. I think it rejects jobs when the physical memory is exhausted. Is there a way to use the swap memory also?
Thanks in advance,

Deniz Ersoz

From hori ＠ swimmy-soft.com Thu Jul 17 10:27:26 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Thu, 17 Jul 2003 10:27:26 +0900
Subject: [SCore-users-jp] Re: [SCore-users] physical memory limit??
In-Reply-To: <20030716184916.GA12704@titan.cse.psu.edu>
References: <20030708182116.GA24010@zerg.cse.psu.edu>
Message-ID: <3141282446.hori0002@swimmy-soft.com>

Hi,

>I was trying to see the effect of swapping in SCore. So, I tried to
>run three MG (NAS)
>benchmarks at the same time but one of them is rejected because of
>memory limitations. I think
>it rejects jobs when the physical memory is exhausted. Is there a
>way to use the swap memory
>also?

Use the sc_console command and type the following:

% sc_console
SCore-D Console: limit queue all memory 300
SCore-D Console: limit
SCore-D Console: exit
Disconnected. Reconnect ? (or ^D to exit) ^D
%

In this case, the limit value is set to 3 times (300%) the physical memory size.

I once tried what you are about to do. Amazingly, the NPB jobs could run very well (scale ?) even when the total memory consumption was more than the physical memory size at that time. I would appreciate it if you would let me know your result.

----
Atsushi HORI
Swimmy Software, Inc.

From M.Newiger ＠ deltacomputer.de Thu Jul 17 17:09:01 2003
From: M.Newiger ＠ deltacomputer.de (Martin Newiger)
Date: Thu, 17 Jul 2003 10:09:01 +0200
Subject: [SCore-users-jp] [SCore-users] New SCore-Version
Message-ID:

Hi,

when will the next version of SCore be released, and which RedHat Linux versions and which hardware will it support?
>Regards
>
>Martin Newiger
>

From hchen ＠ Mdl.ipc.pku.edu.cn Thu Jul 17 23:00:13 2003
From: hchen ＠ Mdl.ipc.pku.edu.cn (Chen Hao)
Date: Thu, 17 Jul 2003 22:00:13 +0800
Subject: [SCore-users-jp] [SCore-users] score 5.4 install problem
Message-ID: <002701c34c6b$dbba09c0$9101a8c0@chen>

Hi all,

I wanted to install score 5.4 on my small pc cluster, but when I tried to install the client machines, I got the following error:

-----------------------------------------
VFS :Mounted root (ext2 filesystem).
Using EIT5 feature
mounting /proc filesystem.... done
Testing..........
No dhcp_server specified. Used Broadcast
setupNetwork cannot set the gateway address
done
NFS mount 192.168.1.254: /mnt/runtime
Cannot mount
exiting
See the documentation for this trouble
------------------------------------------

After I checked the log on my server (RH 7.3), I found that the client tried to mount the / of the server through NFS. What's the matter?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From kameyama ＠ pccluster.org Fri Jul 18 09:31:42 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Fri, 18 Jul 2003 09:31:42 +0900
Subject: [SCore-users-jp] Re: [SCore-users] score 5.4 install problem
In-Reply-To: Your message of "Thu, 17 Jul 2003 22:00:13 JST." <002701c34c6b$dbba09c0$9101a8c0@chen>
Message-ID: <20030718002736.6FA57128944@neal.il.is.s.u-tokyo.ac.jp>

In article <002701c34c6b$dbba09c0$9101a8c0 ＠ chen> "Chen Hao" wrote:
> After I checked my server(RH 7.3)'s log, I found the client tried to mount the
> / of the server through NFS. What's the matter?

The client hosts mount the server's /mnt/cdrom and /opt/score/setup.
Please see also the following kickstart files used to install client hosts:
/opt/score/setup/RedHat/instimage/compconf/*

Please check /etc/exports on the server host.
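
(Added illustration, not part of the original reply.) The /etc/exports check suggested above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the helper names exported_dirs and missing_exports are hypothetical, the two required directories are the ones named in the reply, and the parser handles only the simple one-export-per-line form (no quoted paths, no backslash continuations, no check of the client list or options).

```python
def exported_dirs(exports_text):
    """Return the set of directories listed in /etc/exports-style text."""
    dirs = set()
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        # The first whitespace-separated field is the exported directory.
        dirs.add(line.split()[0].rstrip("/") or "/")
    return dirs


def missing_exports(exports_text, required=("/mnt/cdrom", "/opt/score/setup")):
    """List the required directories that are not exported at all."""
    have = exported_dirs(exports_text)
    return [d for d in required if d not in have]


if __name__ == "__main__":
    # Hypothetical sample; on a real server, read /etc/exports instead.
    sample = """
    # sample exports file
    /mnt/cdrom        192.168.1.0/255.255.255.0(ro)
    """
    print(missing_exports(sample))
```

An empty list means both directories appear in the file; it does not prove that the client hosts are actually allowed to mount them.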
from Kameyama Toyohisa

From hqzhou ＠ nju.edu.cn Fri Jul 18 14:06:14 2003
From: hqzhou ＠ nju.edu.cn (Huiqun Zhou)
Date: Fri, 18 Jul 2003 13:06:14 +0800
Subject: [SCore-users-jp] [SCore-users] Help: Too slow running any program
Message-ID: <002401c34cea$4fe9b380$1a00a8c0@goofy>

Hi, Score users,

We built a small cluster with 8 compute nodes, but it is too slow when running a program. It even took 2 minutes or more to run the simple examples, hello and cpi. Although the results are correct, it seems that it keeps looking for something at the beginning of the run. What's wrong?

On each of the compute nodes there are two NICs: one is a 3COM 100Mb card, the other a D-Link DGE-550T Gigabit card. As the latter has some problems, we removed the configuration for the DGE-550T, so only the 3COM card is in use.

Thanks in advance.

Huiqun Zhou
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From nick ＠ streamline-computing.com Fri Jul 18 16:14:32 2003
From: nick ＠ streamline-computing.com (Nick Birkett)
Date: Fri, 18 Jul 2003 08:14:32 +0100
Subject: [SCore-users-jp] [SCore-users] Network freezing
Message-ID: <200307180814.32877.nick@streamline-computing.com>

Dear Score.

One of our users is trying to run a very large 64 cpu job using Myrinet2k (C cards) and Score 5.4.

The job uses most of the memory on 32 compute nodes (32x2 job).

<0:0> SCORE: 64 nodes (32x2) ready.
<6> SCORE WARNING: Physical memory might be exhausted.
<13> SCORE WARNING: Physical memory might be exhausted.
<17> SCORE WARNING: Physical memory might be exhausted.
<14> SCORE WARNING: Physical memory might be exhausted.
<10> SCORE WARNING: Physical memory might be exhausted.
<12> SCORE WARNING: Physical memory might be exhausted.
<11> SCORE WARNING: Physical memory might be exhausted.
<22> SCORE WARNING: Physical memory might be exhausted.
<15> SCORE WARNING: Physical memory might be exhausted.
<3> SCORE WARNING: Physical memory might be exhausted.
<20> SCORE WARNING: Physical memory might be exhausted.
<28> SCORE WARNING: Physical memory might be exhausted.
<31> SCORE WARNING: Physical memory might be exhausted.
<25> SCORE WARNING: Physical memory might be exhausted.
<29> SCORE WARNING: Physical memory might be exhausted.
<24> SCORE WARNING: Physical memory might be exhausted.
<30> SCORE WARNING: Physical memory might be exhausted.
<27> SCORE WARNING: Physical memory might be exhausted.
<1> SCORE WARNING: Physical memory might be exhausted.
<16> SCORE WARNING: Physical memory might be exhausted.
<21> SCORE WARNING: Physical memory might be exhausted.
<7> SCORE WARNING: Physical memory might be exhausted.

The job starts to run but then we get an error:

<13> SCore-D:PANIC Network freezing timed out !!
<15> SCore-D:PANIC Network freezing timed out !!
<12> SCore-D:PANIC Network freezing timed out !!
<2> SCore-D:PANIC Network freezing timed out !!
<4> SCore-D:PANIC Network freezing timed out !!
<26> SCore-D:PANIC Network freezing timed out !!

The system has been well tested using the Pallas benchmarks (full suite of tests) and has also run for 2 days with the top 500 HPL benchmark (180 Gflops using all cpus).

Any suggestions as to why we have this problem ? Is it hardware or software ?
Thanks,

Nick

From hori ＠ swimmy-soft.com Fri Jul 18 16:27:13 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Fri, 18 Jul 2003 16:27:13 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Network freezing
In-Reply-To: <200307180814.32877.nick@streamline-computing.com>
References: <200307180814.32877.nick@streamline-computing.com>
Message-ID: <3141390433.hori0001@swimmy-soft.com>

Hi,

>The system has been welll tested using Pallas benchmarks (full suite
>of tests)
>and has also run for 2 days with the top 500 HPL benchmark (180 Gflops using
>all cpus).
>
>Any suggestions as to why we have this problem ? Is it hardware or software ?

It is very difficult to tell which, given the lack of information. In any case, I suggest running the hardware tests again, simply because that is much easier than finding (or proving) software bug(s).

----
Atsushi HORI
Swimmy Software, Inc.

From pvenka ＠ yahoo.com Mon Jul 21 03:26:41 2003
From: pvenka ＠ yahoo.com (parthasarathy venkataraman)
Date: Sun, 20 Jul 2003 11:26:41 -0700 (PDT)
Subject: [SCore-users-jp] [SCore-users] qsub problem
Message-ID: <20030720182641.25892.qmail@web12303.mail.yahoo.com>

Hi,

When I submit a job using qsub, some shell script commands like 'cd' etc. do not work. Any suggestions?

Here is a copy of the error file:

/var/scored/pbs/mom_priv/jobs/305.master..SC: cd: /xtal/ccp4-4.2.2/ccp4i/bin: No such file or directory
/var/scored/pbs/mom_priv/jobs/305.master..SC: /opt/score/bin/scrun: No such file or directory
~

In both cases the files exist, and I am able to run the commands on the command line.

sincerely,
Venkat

From bogdan.costescu ＠ iwr.uni-heidelberg.de Mon Jul 21 23:04:15 2003
From: bogdan.costescu ＠ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Mon, 21 Jul 2003 16:04:15 +0200 (CEST)
Subject: [SCore-users-jp] Re: [SCore-users] Help: Too slow running any program
In-Reply-To: <002401c34cea$4fe9b380$1a00a8c0@goofy>
Message-ID:

On Fri, 18 Jul 2003, Huiqun Zhou wrote:

> It even took 2 minutes or more to run the simple examples, hello and
> cpi. Although the results are correct, it seems that it keeps looking
> for something at the begining of the run.

Is the beginning of the run the only part affected ? I mean, does it take a lot of time before printing the first line of output, but afterwards everything works at normal speed ? What is the load of the nodes when this happens ?

> only 3COM card is in use.

With Ethernet hardware, especially FastEthernet, care must be taken that the speed and duplex settings of the switch ports and the cards are in harmony. The best solution is to set both the card (the default with the 3c59x driver) and the switch port to autonegotiate/NWAY. Failure to do so produces a link with a very high probability of packet loss, which drastically reduces the maximum achievable data transfer rate because packets have to be retransmitted. Obviously, any kind of parallel computation that uses such a broken link would be very slow.
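
(Added illustration, not part of the original post.) The mismatch described above can be modelled in a short Python sketch. It assumes the usual autonegotiation behaviour: an autonegotiating port whose partner is forced can still detect the partner's speed, but falls back to half duplex. PortConfig, resolve_link and link_problem are hypothetical names, and the both-autonegotiate case simply assumes that both ends support 100 Mb/s full duplex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortConfig:
    autoneg: bool
    speed: int = 100          # Mb/s, used only when autoneg is False
    full_duplex: bool = True  # used only when autoneg is False


def resolve_link(a, b, best_speed=100):
    """Return ((speed_a, full_a), (speed_b, full_b)) as each end would run."""
    if a.autoneg and b.autoneg:
        mode = (best_speed, True)  # assumption: both ends support this mode
        return mode, mode
    if not a.autoneg and not b.autoneg:
        return (a.speed, a.full_duplex), (b.speed, b.full_duplex)
    # Exactly one side autonegotiates: it learns the partner's speed
    # but has to assume half duplex.
    forced, auto_first = (b, True) if a.autoneg else (a, False)
    forced_mode = (forced.speed, forced.full_duplex)
    auto_mode = (forced.speed, False)
    return (auto_mode, forced_mode) if auto_first else (forced_mode, auto_mode)


def link_problem(a, b):
    """Describe the mismatch between two ports, or return None if they agree."""
    (speed_a, full_a), (speed_b, full_b) = resolve_link(a, b)
    if speed_a != speed_b:
        return "speed mismatch"
    if full_a != full_b:
        return "duplex mismatch"
    return None


if __name__ == "__main__":
    card = PortConfig(autoneg=True)
    switch_port = PortConfig(autoneg=False, speed=100, full_duplex=True)
    print(link_problem(card, switch_port))
```

In this model, an autonegotiating card facing a switch port forced to 100 Mb/s full duplex ends up with a duplex mismatch, which is the kind of lossy link the post warns about.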
--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu ＠ IWR.Uni-Heidelberg.De

From kameyama ＠ pccluster.org Tue Jul 22 09:09:41 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Tue, 22 Jul 2003 09:09:41 +0900
Subject: [SCore-users-jp] Re: [SCore-users] qsub problem
In-Reply-To: Your message of "Sun, 20 Jul 2003 11:26:41 JST." <20030720182641.25892.qmail@web12303.mail.yahoo.com>
Message-ID: <20030722000522.D256712894C@neal.il.is.s.u-tokyo.ac.jp>

In article <20030720182641.25892.qmail ＠ web12303.mail.yahoo.com> parthasarathy venkataraman wrote:
> When I submit a job sing qsub, some shell script
> commands like 'cd' etc do not work. Any suggestions?
> Here is a copy of error file:
> /var/scored/pbs/mom_priv/jobs/305.master..SC: cd:
> /xtal/ccp4-4.2.2/ccp4i/bin: No such file or directory
> /var/scored/pbs/mom_priv/jobs/305.master..SC:
> /opt/score/bin/scrun: No such file or directory
> ~

If you don't specify the score attribute, the job is executed on a compute host.
(Compute hosts cannot start SCore jobs.)
If you want to execute an SCore job with PBS, you must specify the score attribute:

% qsub -l nodes=4:score score.sh

Please see also the PBS/SCore User's Guide:
http://www.pccluster.org/score/dist/score/html/en/reference/pbs/user.html

from Kameyama Toyohisa

From m.newiger ＠ web.de Thu Jul 24 03:23:25 2003
From: m.newiger ＠ web.de (Martin Newiger)
Date: Wed, 23 Jul 2003 20:23:25 +0200
Subject: [SCore-users-jp] [SCore-users] scout -g pcc no respond
Message-ID: <200307231823.h6NINKQ12624@mailgate5.cinetic.de>

Hi,

I changed my Score interfaces to eth1. All the test routines succeed except scout -g pcc. Where is the mistake?

Regards
M.Newiger

From kameyama ＠ pccluster.org Thu Jul 24 09:09:12 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Thu, 24 Jul 2003 09:09:12 +0900
Subject: [SCore-users-jp] Re: [SCore-users] scout -g pcc no respond
In-Reply-To: Your message of "Wed, 23 Jul 2003 20:23:25 JST." <200307231823.h6NINKQ12624@mailgate5.cinetic.de>
Message-ID: <20030724000448.5AF3F12894C@neal.il.is.s.u-tokyo.ac.jp>

In article <200307231823.h6NINKQ12624 ＠ mailgate5.cinetic.de> "Martin Newiger" wrote:
> i changed my Score-Interfaces to eth1. All the test routines succeed exept
> scout -g pcc. Where is the mistake?

Did you run the SCOUT Test Procedure?
http://www.pccluster.org/score/dist/score/html/en/installation/scout-test.html

If the scorehosts, sceptic and msgb tests are OK, please check rsh-all:

% rsh-all -g pcc date

If rsh-all is successful, please check /etc/hosts.equiv on the compute hosts. Note that scout executes scremote on the first compute host. The first compute host executes scremote on the second compute host, and so on. So /etc/hosts.equiv on the compute hosts must include the server host and ALL compute hosts.

from Kameyama Toyohisa

From yeahme ＠ 163.net Sat Jul 26 14:52:59 2003
From: yeahme ＠ 163.net (yeahme ＠ 163.net)
Date: Sat, 26 Jul 03 14:52:59 Pacific Daylight Time
Subject: [SCore-users-jp] [SCore-users] you didn't reply my email ? why ? ???
Message-ID: <200307261004.h6QA4im16548@pccluster.org>

An HTML attachment was scrubbed...
URL:

From lunetix83 ＠ hotmail.com Mon Jul 28 21:41:28 2003
From: lunetix83 ＠ hotmail.com (Michael N.)
Date: Mon, 28 Jul 2003 12:41:28 +0000
Subject: [SCore-users-jp] [SCore-users] Spawn timed out
Message-ID:

Hi list,

my scout session times out. I get the following messages:

[root ＠ masternode]# scout -g pcc
bash: /opt/score5.4.0/deploy/scremote: No such file or directory
[computenode01.cluster.domain]: Spawn timed out.
SCOUT: Session done.

I assume this happens because one node is missing. How could I test this?

With kind regards
Michael
From kameyama ＠ pccluster.org Tue Jul 29 08:41:44 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Tue, 29 Jul 2003 08:41:44 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Spawn timed out
In-Reply-To: Your message of "Mon, 28 Jul 2003 12:41:28 JST."
Message-ID: <20030728233703.E2AB712894C@neal.il.is.s.u-tokyo.ac.jp>

In article "Michael N." wrote:
> my Scoutd session times out. I get the following messages:
>
> [root ＠ masternode]# scout -g pcc
> bash: /opt/score5.4.0/deploy/scremote: No such file or directory

Please execute the following commands:

% rsh-all -g pcc ls -l /opt/score/deploy/scremote
% rsh-all -g pcc rpm -q score5.4.0-comp

If any host does not have the score-5.4.0-comp rpm installed, please install it there.

from Kameyama Toyohisa

From yeahme ＠ 163.net Mon Jul 28 21:35:54 2003
From: yeahme ＠ 163.net (yeahme ＠ 163.net)
Date: Mon, 28 Jul 03 21:35:54 Pacific Daylight Time
Subject: [SCore-users-jp] [SCore-users] you didn't reply my email ? why ? ???
Message-ID: <200307290842.h6T8gfm12910@pccluster.org>

An HTML attachment was scrubbed...
URL:

From h035102m ＠ mbox.nagoya-u.ac.jp Tue Jul 29 19:35:18 2003
From: h035102m ＠ mbox.nagoya-u.ac.jp (Naoshi Ueda)
Date: Tue, 29 Jul 2003 19:35:18 +0900
Subject: [SCore-users-jp] About an error
Message-ID: <200307291935.GAI82364.2360NI40@mbox.nagoya-u.ac.jp>

This is Ueda from Nagoya University.

Regarding the kernel panic I asked about the other day: even after switching the driver to bcm5700, it still occurs from time to time and the program stops.

In addition, the program has now also stopped because of a different error. It occurs on one particular PC among the compute hosts. The screen shows

eth0:Duplicate entry of the interrupt handler by processor 0.
eth0:Duplicate entry of the interrupt handler by processor 0.
eth0:Duplicate entry of the interrupt handler by processor 0.
eth0:Duplicate entry of the interrupt handler by processor 0.
eth0:Duplicate entry of the interrupt handler by processor 0.
eth0:Duplicate entry of the interrupt handler by processor 0.
eth0:Duplicate entry of the interrupt handler by processor 0.
:
:
: (displayed across the whole screen)
:
:

and the machine can no longer be controlled. Judging from the message, communication does not seem to be working properly, but I do not know how to deal with it.

I would appreciate any advice on these two points.

===============================================
Graduate School of Engineering, Nagoya University (Master's program)
Naoshi Ueda
E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
===============================================

From s-sumi ＠ flab.fujitsu.co.jp Tue Jul 29 21:06:13 2003
From: s-sumi ＠ flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Tue, 29 Jul 2003 21:06:13 +0900 (JST)
Subject: [SCore-users-jp] About an error
In-Reply-To: <200307291935.GAI82364.2360NI40@mbox.nagoya-u.ac.jp>
References: <200307291935.GAI82364.2360NI40@mbox.nagoya-u.ac.jp>
Message-ID: <20030729.210613.596527866.s-sumi@flab.fujitsu.co.jp>

Ueda-san,

This is Sumimoto from Fujitsu Labs.

We also run a cluster in almost the same environment (Xeon x2, bcm5701 NIC) and have not seen this problem. This error usually appears when mutual exclusion inside the device driver is not working correctly.

Please check the following.

1) Is the hardware single or dual CPU?

2) What happens if you add the following option to /etc/modules.conf?

   options bcm5700 adaptive_coalesce=0 rx_coalesce_ticks=1 tx_coalesce_ticks=1

3) What are the contents of pm-ethernet.conf? What happens if you set it to something like

   maxnsend 24
   backoff 2400
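The settings suggested in items 2) and 3) can be sanity-checked with a short script before rebooting a node. The following is only a sketch, not part of the original message: the helper names `parse_module_options` and `parse_key_value_lines` are hypothetical, the sample values are the ones quoted above, and the assumption that pm-ethernet.conf holds plain `key value` lines is taken from how the settings are written in this mail, not from the SCore documentation.

```python
def parse_module_options(text, module):
    """Collect k=v parameters from 'options <module> ...' lines of modules.conf text."""
    params = {}
    for line in text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "options" and fields[1] == module:
            for item in fields[2:]:
                key, sep, value = item.partition("=")
                if sep:
                    params[key] = value
    return params


def parse_key_value_lines(text):
    """Parse 'key value' lines (assumed layout of the pm-ethernet.conf settings)."""
    settings = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2:
            settings[fields[0]] = fields[1]
    return settings


# Sample contents copied from the suggestions in this message.
modules_conf = (
    "options bcm5700 adaptive_coalesce=0 "
    "rx_coalesce_ticks=1 tx_coalesce_ticks=1\n"
)
pm_settings = "maxnsend 24\nbackoff 2400\n"

print(parse_module_options(modules_conf, "bcm5700"))
print(parse_key_value_lines(pm_settings))
```

In practice one would read the real files on each compute host and compare the dictionaries against the intended values; whether the driver actually honours them still has to be confirmed on the machine itself.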
From: Naoshi Ueda
Subject: [SCore-users-jp] About an error
Date: Tue, 29 Jul 2003 19:35:18 +0900
Message-ID: <200307291935.GAI82364.2360NI40 ＠ mbox.nagoya-u.ac.jp>

h035102m> This is Ueda from Nagoya University.
h035102m>
h035102m> Regarding the kernel panic I asked about the other day: even after
h035102m> switching the driver to bcm5700, it still occurs from time to time
h035102m> and the program stops.
h035102m>
h035102m> In addition, the program has now also stopped because of a different
h035102m> error. It occurs on one particular PC among the compute hosts. The
h035102m> screen shows
h035102m>
h035102m> eth0:Duplicate entry of the interrupt handler by processor 0.
h035102m> eth0:Duplicate entry of the interrupt handler by processor 0.
h035102m> eth0:Duplicate entry of the interrupt handler by processor 0.
h035102m> eth0:Duplicate entry of the interrupt handler by processor 0.
h035102m> eth0:Duplicate entry of the interrupt handler by processor 0.
h035102m> eth0:Duplicate entry of the interrupt handler by processor 0.
h035102m> eth0:Duplicate entry of the interrupt handler by processor 0.
h035102m> :
h035102m> :
h035102m> : (displayed across the whole screen)
h035102m> :
h035102m> :
h035102m>
h035102m> and the machine can no longer be controlled. Judging from the
h035102m> message, communication does not seem to be working properly, but I
h035102m> do not know how to deal with it.
h035102m>
h035102m> I would appreciate any advice on these two points.
h035102m>
h035102m> ===============================================
h035102m> Graduate School of Engineering, Nagoya University (Master's program)
h035102m>
h035102m> Naoshi Ueda
h035102m> E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
h035102m> ===============================================

------
Shinji Sumimoto, Fujitsu Labs

From nick ＠ streamline-computing.com Wed Jul 30 16:28:26 2003
From: nick ＠ streamline-computing.com (Nick Birkett)
Date: Wed, 30 Jul 2003 08:28:26 +0100
Subject: [SCore-users-jp] [SCore-users] MPI-IO
Message-ID: <200307300828.26514.nick@streamline-computing.com>

Hi SCore team.

We seem to be having an MPI-IO problem with SCore 5.4 x86 RedHat 7.3.
The users' files are NFS mounted from the front end (RedHat 7.3 x86).

---------- Forwarded Message ----------

Here is an issue that one of our users discovered, which has only started occurring with the new version of SCore.

> <0:0> SCORE: 32 nodes (16x2) ready.
> File locking failed in ADIOI_Set_lock. If the file system is NFS, you need
> to use NFS version 3 and mount the directory with the 'noac' option (no
> attribute caching).
> [0] MPI Abort by user Aborting program !
> [0] Aborting program!

I have remounted the NFS partition '/scratch' on all nodes with this noac option:

(fstab)
server:/scratch /scratch nfs rw,suid,dev,exec,auto,nouser,async,noac 0 0

(output from mount)
server:/scratch on /scratch type nfs (rw,noac,addr=192.168.1.254)

The problem is still there. Any ideas?

Thanks

From hori ＠ swimmy-soft.com Wed Jul 30 16:42:07 2003
From: hori ＠ swimmy-soft.com (Atsushi HORI)
Date: Wed, 30 Jul 2003 16:42:07 +0900
Subject: [SCore-users-jp] Re: [SCore-users] MPI-IO
In-Reply-To: <200307300828.26514.nick@streamline-computing.com>
References: <200307300828.26514.nick@streamline-computing.com>
Message-ID: <3142428127.hori0004@swimmy-soft.com>

Hi, Nick,

>Hi Score team. We seem to be having an MPI-IO problem with Score 5.4 x86
>RedHat 7.3.
>The users files are NFS mounted from the front end (RedHat 7.3 x86)

I am not familiar with MPI-IO at all, but as far as I understand, MPI-IO on NFS is not recommended. I found the MPICH web page, and it says,

---
Problems with MPI-IO and NFS

The network file system (NFS) must be configured extremely carefully for MPI-IO (and many other programs) to work correctly. Unfortunately, few systems are so configured, and doing so can adversely impact performance. As a result, programs using files on an NFS system may hang or produce incorrect results.
Note that this is, officially, a design feature of NFS; unless the NFS system is configured with no attribute caching, any two processes, accessing the same file, may produce incorrect results. You can use the -file_system=ufs option of configure to build an MPICH that supports only UFS (Unix File System); MPI-IO works correctly with UFS, XFS, PIOFS, HFS, SFS, etc. (more precisely, those file system that correctly implement basic Unix I/O system calls; something that NFS does not do).

----
Atsushi HORI
Swimmy Software, Inc.

From kameyama ＠ pccluster.org Wed Jul 30 17:02:45 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Wed, 30 Jul 2003 17:02:45 +0900
Subject: [SCore-users-jp] Re: [SCore-users] MPI-IO
In-Reply-To: Your message of "Wed, 30 Jul 2003 08:28:26 JST." <200307300828.26514.nick@streamline-computing.com>
Message-ID: <20030730075801.393F612894C@neal.il.is.s.u-tokyo.ac.jp>

I have not tried this myself...

In article <200307300828.26514.nick ＠ streamline-computing.com> Nick Birkett wrote:
> > <0:0> SCORE: 32 nodes (16x2) ready.
> > File locking failed in ADIOI_Set_lock. If the file system is NFS, you need
> > to use NFS version 3 and mount the directory with the 'noac' option (no
> > attribute caching).

> (fstab)
> server:/scratch /scratch nfs
> rw,suid,dev,exec,auto,nouser,async,noac 0 0

Please add the following option, too:

    nfsvers=3

nfs(5) says the default NFS version is 2.
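The two requirements named in the ADIOI_Set_lock message (NFS version 3 and 'noac') can be checked mechanically against an fstab entry. This is a minimal sketch, not part of the original message: the function name `nfs_option_problems` is hypothetical, and the check only inspects the option string. It does not prove that file locking works on the mounted file system.

```python
def nfs_option_problems(options):
    """Return a list of problems found in a comma-separated NFS option string."""
    opts = {}
    for item in options.split(","):
        key, _, value = item.strip().partition("=")
        if key:
            opts[key] = value

    problems = []
    if "noac" not in opts:
        problems.append("missing 'noac'")
    # nfs(5) accepts 'vers' as an alternative spelling of 'nfsvers'.
    version = opts.get("nfsvers") or opts.get("vers")
    if version != "3":
        problems.append("NFS version 3 not requested (add 'nfsvers=3')")
    return problems


# The fstab entry quoted above: device, mount point, type, options, dump, pass.
fstab_line = "server:/scratch /scratch nfs rw,suid,dev,exec,auto,nouser,async,noac 0 0"
device, mount_point, fs_type, options = fstab_line.split()[:4]

print(mount_point, nfs_option_problems(options))
print(mount_point, nfs_option_problems(options + ",nfsvers=3"))
```

For the entry quoted above the first call reports the missing version option and the second reports no problems. The same function could be applied to the option field of each NFS line in /proc/mounts on every compute host.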
from Kameyama Toyohisa

From h035102m ＠ mbox.nagoya-u.ac.jp Wed Jul 30 17:32:46 2003
From: h035102m ＠ mbox.nagoya-u.ac.jp (Naoshi Ueda)
Date: Wed, 30 Jul 2003 17:32:46 +0900
Subject: [SCore-users-jp] About an error
In-Reply-To: <20030729.210613.596527866.s-sumi@flab.fujitsu.co.jp>
References: <200307291935.GAI82364.2360NI40@mbox.nagoya-u.ac.jp> <20030729.210613.596527866.s-sumi@flab.fujitsu.co.jp>
Message-ID: <200307301732.AAI60452.3400N26I@mbox.nagoya-u.ac.jp>

This is Ueda from Nagoya University.
Thank you for your reply.
Here are the results of the checks.

> 1) Is the hardware single or dual CPU?
>

All machines are single CPU.

> 2) What happens if you add the following option to /etc/modules.conf?
>
> options bcm5700 adaptive_coalesce=0 rx_coalesce_ticks=1 tx_coalesce_ticks=1
>
> 3) What are the contents of pm-ethernet.conf?
>
> maxnsend 24
> backoff 2400
>
> What happens if you set it to something like this?

I applied both of the above, but the result did not change.
Note that the program stopping with the message
eth0:Duplicate entry of the interrupt handler by processor 0.
happens only occasionally; in most cases the program stops with a kernel panic.
When that happened, the following was displayed on one of the compute hosts.

Unable to handle kernel NULL pointer dereference at virtual address 00000070
printing eip:
f8945743
*pde = 00000000
Oops : 0002
CPU : 0
EIP : 0010 : [] Not tainted
EFLAGS : 00010282
eax : f131ca10 ebx : 00000202 ecx : 00000001 edx : 00000000
esi : f12a8160 edi : f131ca10 ebp : f12aadfc esp : f120fd90
ds : 0018 es : 0018 ss : 0018
Process scored.exe (pid : 933, stackpage = f120f000)
Stack : f12a8160 f12a8160 f131ca48 f12aa488 f12a8000 f894b25a f12a8160 f12a8160
000000d9 00000000 f89433c0 f12a8160 f4102900 04000001 00000011 f120fe14
c0109c59 00000011 f12a8000 f120fe14 f120fe14 00000011 c02ccd40 f4102900
Call Trace : [] [] [] [] []
[] [] [] [] [] []
[] [] [] []

Code : ff 4a 70 0f 94 c0 84 c0 74 1e 9c 5b fa a1 e0 03 2d c0 89 02
<0> Kernel panic : Aiee,killing interrupt handler!
In interrupt handler - not syncing

I suspect this is a memory problem; what do you think?
Thank you in advance.

===============================================
Graduate School of Engineering, Nagoya University (Master's program)
Naoshi Ueda
E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
===============================================

From s-sumi ＠ flab.fujitsu.co.jp Wed Jul 30 17:34:59 2003
From: s-sumi ＠ flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Wed, 30 Jul 2003 17:34:59 +0900 (JST)
Subject: [SCore-users-jp] About an error
In-Reply-To: <200307301732.AAI60452.3400N26I@mbox.nagoya-u.ac.jp>
References: <200307291935.GAI82364.2360NI40@mbox.nagoya-u.ac.jp> <20030729.210613.596527866.s-sumi@flab.fujitsu.co.jp> <200307301732.AAI60452.3400N26I@mbox.nagoya-u.ac.jp>
Message-ID: <20030730.173459.304102105.s-sumi@flab.fujitsu.co.jp>

Ueda-san,

This is Sumimoto from Fujitsu Labs.

Sorry for the trouble, but could you try an SMP kernel?
With bcm5700, a similar problem has been seen with the UP kernel, although that was on Pentium III machines.

From: Naoshi Ueda
Subject: Re: [SCore-users-jp] About an error
Date: Wed, 30 Jul 2003 17:32:46 +0900
Message-ID: <200307301732.AAI60452.3400N26I ＠ mbox.nagoya-u.ac.jp>

h035102m> This is Ueda from Nagoya University.
h035102m> Thank you for your reply.
h035102m> Here are the results of the checks.
h035102m>
h035102m> > 1) Is the hardware single or dual CPU?
h035102m> >
h035102m>
h035102m> All machines are single CPU.
h035102m>
h035102m> > 2) What happens if you add the following option to /etc/modules.conf?
h035102m> >
h035102m> > options bcm5700 adaptive_coalesce=0 rx_coalesce_ticks=1 tx_coalesce_ticks=1
h035102m> >
h035102m> > 3) What are the contents of pm-ethernet.conf?
h035102m> >
h035102m> > maxnsend 24
h035102m> > backoff 2400
h035102m> >
h035102m> > What happens if you set it to something like this?
h035102m>
h035102m> I applied both of the above, but the result did not change.
h035102m> Note that the program stopping with the message
h035102m> eth0:Duplicate entry of the interrupt handler by processor 0.
h035102m> happens only occasionally; in most cases the program stops with a
h035102m> kernel panic.
h035102m> When that happened, the following was displayed on one of the
h035102m> compute hosts.
h035102m>
h035102m> Unable to handle kernel NULL pointer dereference at virtual address
h035102m> 00000070
h035102m> printing eip:
h035102m> f8945743
h035102m> *pde = 00000000
h035102m> Oops : 0002
h035102m> CPU : 0
h035102m> EIP : 0010 : [] Not tainted
h035102m> EFLAGS : 00010282
h035102m> eax : f131ca10 ebx : 00000202 ecx : 00000001 edx : 00000000
h035102m> esi : f12a8160 edi : f131ca10 ebp : f12aadfc esp : f120fd90
h035102m> ds : 0018 es : 0018 ss : 0018
h035102m> Process scored.exe (pid : 933, stackpage = f120f000)
h035102m> Stack : f12a8160 f12a8160 f131ca48 f12aa488 f12a8000 f894b25a f12a8160 f12a8160
h035102m> 000000d9 00000000 f89433c0 f12a8160 f4102900 04000001 00000011 f120fe14
h035102m> c0109c59 00000011 f12a8000 f120fe14 f120fe14 00000011 c02ccd40 f4102900
h035102m> Call Trace : [] [] [] [] []
h035102m> [] [] [] [] [] []
h035102m> [] [] [] []
h035102m>
h035102m> Code : ff 4a 70 0f 94 c0 84 c0 74 1e 9c 5b fa a1 e0 03 2d c0 89 02
h035102m> <0> Kernel panic : Aiee,killing interrupt handler!
h035102m> In interrupt handler - not syncing
h035102m>
h035102m> I suspect this is a memory problem; what do you think?
h035102m> Thank you in advance.
h035102m>
h035102m> ===============================================
h035102m> Graduate School of Engineering, Nagoya University (Master's program)
h035102m> Naoshi Ueda
h035102m> E-MAIL: h035102m ＠ mbox.nagoya-u.ac.jp
h035102m> ===============================================

------
Shinji Sumimoto, Fujitsu Labs

From issmde ＠ leeds.ac.uk Wed Jul 30 20:21:19 2003
From: issmde ＠ leeds.ac.uk (Mark Ellerby)
Date: Wed, 30 Jul 2003 12:21:19 +0100 (BST)
Subject: [SCore-users-jp] [SCore-users] (no subject)
Message-ID:

Hi Kameyama and SCore list,

Thanks for your suggestion (below).

I have added the option nfsvers=3 to the mount options on all compute nodes, and remounted the shared partition:

snowdon.leeds.ac.uk:/scratch on /scratch type nfs (rw,noac,nfsvers=3,addr=192.168.1.254)

But we are still having the same problem:

<0:0> SCORE: 32 nodes (16x2) ready.
File locking failed in ADIOI_Set_lock. If the file system is NFS, you need to use NFS version 3 and mount the directory with the 'noac' option (no attribute caching).
[0] MPI Abort by user Aborting program !
[0] Aborting program!

--------------
>In article <200307300828.26514.nick ＠ streamline-computing.com> Nick Birkett wrote:
>> > <0:0> SCORE: 32 nodes (16x2) ready.
>> > File locking failed in ADIOI_Set_lock. If the file system is NFS, you need
>> > to use NFS version 3 and mount the directory with the 'noac' option (no
>> > attribute caching).
>>
>> (fstab)
>> server:/scratch /scratch nfs
>> rw,suid,dev,exec,auto,nouser,async,noac 0 0
>
>Please add following option, too:
> nfsvers=3
>nfs(5) says default nfs version is 2.
---------------------

Streamline Computing recently did an SCore upgrade from 5.3 to 5.4 for us.
The code was working before the upgrade, and there were no such complaints from SCore about MPI-IO. I mention it just in case it helps you diagnose the problem.

Thanks in advance for your help

Mark

--
Mark Ellerby
ISS, Leeds University
White Rose Grid Support Officer
issmde ＠ leeds.ac.uk
0113 3435429

From hqzhou ＠ nju.edu.cn Thu Jul 31 16:48:17 2003
From: hqzhou ＠ nju.edu.cn (Huiqun Zhou)
Date: Thu, 31 Jul 2003 15:48:17 +0800
Subject: [SCore-users-jp] Re: [SCore-users] Help: Too slow running any program
References:
Message-ID: <003701c35738$1afeff10$1a00a8c0@goofy>

Hi, Bogdan,

Sorry for the delay in responding.

It seems that programs running in my SCore environment not only took a long, long time between "Score-D connected" and the first line of output, but also spent more time on communication during computation. I tried a fluid dynamics code on my cluster. Its serial version took about 2 seconds per time step. The parallel version, besides taking about 3 minutes at the beginning of the run, took 4 to 6 seconds per time step even on a 7 compute node cluster!

The code I tried ran very well on my previous cluster, which consisted of 4 compute nodes and a dedicated server. All machines of my previous cluster are P III 800MHz, 256 MB with Intel NICs. My current cluster consists of 8 machines with P4 2.4GHz, 512MB and a 3Com NIC, plus a D-Link DGE-550T Gigabit NIC (not in use yet). I'm using D-Link's 8 port Gigabit Ethernet switch DGS-1008T.

Looking forward to your help.
----------------------------
Huiqun Zhou
Department of Earth Sciences
Nanjing University
China
e-mail: hqzhou ＠ nju.edu.cn
Tel: 86(25)359-4664
FAX: 86(25)368-6016
Mobil: 13182856800
----------------------------

----- Original Message -----
From: "Bogdan Costescu"
To: "Huiqun Zhou"
Cc:
Sent: Monday, July 21, 2003 10:04 PM
Subject: Re: [SCore-users] Help: Too slow running any program

> On Fri, 18 Jul 2003, Huiqun Zhou wrote:
>
> > It even took 2 minutes or more to run the simple examples, hello and
> > cpi. Although the results are correct, it seems that it keeps looking
> > for something at the beginning of the run.
>
> Is the beginning of the run the only part affected? I mean, does it take a
> lot of time before printing the first line of output, but afterwards
> everything works with normal speed? What is the load of the nodes when
> this happens?
>
> > only 3COM card is in use.
>
> With Ethernet hardware, especially with FastEthernet, care must be
> taken that the speed and duplex settings of the switch ports and cards are
> in harmony. The best solution is to set both the card (default with the
> 3c59x driver) and the switch port to autonegotiate/NWAY. Failure to do so
> will produce a link with a very high probability of packet loss, which
> drastically reduces the maximum data transfer rate that can be obtained, as
> packets have to be retransmitted. Obviously, any kind of parallel
> computation that uses such a broken link would be very slow.
>
> --
> Bogdan Costescu
>
> IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
> Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
> Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
> E-mail: Bogdan.Costescu ＠ IWR.Uni-Heidelberg.De

From bogdan.costescu ＠ iwr.uni-heidelberg.de Thu Jul 31 21:07:54 2003
From: bogdan.costescu ＠ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 31 Jul 2003 14:07:54 +0200 (CEST)
Subject: [SCore-users-jp] Re: [SCore-users] Help: Too slow running any program
In-Reply-To: <003701c35738$1afeff10$1a00a8c0@goofy>
Message-ID:

On Thu, 31 Jul 2003, Huiqun Zhou wrote:

> It seems that programs running in my Score environment not only took
> long, long time between "Score-D connected" and the first line of output
> displayed, but also spent more time on communication during computation.

Then most likely the network is not configured properly: some packets are lost, and because of the retries the run time increases very much. I think that you need to check the network. Start by using netperf or ttcp or something similar, which can give you an idea of the capabilities of the network in terms of (TCP or UDP) transfer speed. Check the duplex settings of the switch and of the network cards. For the cards, a tool like mii-tool (comes with most distributions, including Red Hat) or mii-diag (ftp://ftp.scyld.com/pub/diag) will tell you whether the media was negotiated and what the current values are. If the switch is manageable, the control interface will give the same data; if not, you'll have to assume that it does what it should, which normally means that it tries to autonegotiate.
Also check the cables; I've seen puzzling problems because of bad cables. If you are using the 3c59x driver, you can ask for more help by writing to vortex ＠ scyld.com.

> it took 4 to 6 seconds per time step even on a 7 compute node cluster!

Please compare apples to apples! If your previous results come from a 4-node cluster, please give new results from a 4-node cluster as well (or from a run using 4 nodes on a larger cluster). Otherwise the differences might very well be explained by Amdahl's law.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu ＠ IWR.Uni-Heidelberg.De

From issmde ＠ leeds.ac.uk Thu Jul 31 21:18:15 2003
From: issmde ＠ leeds.ac.uk (Mark Ellerby)
Date: Thu, 31 Jul 2003 13:18:15 +0100 (BST)
Subject: [SCore-users-jp] [SCore-users] Re: SCore-users digest, Vol 1 #256 - 4 msgs
In-Reply-To: <20030731030001.24940.65938.Mailman@www.pccluster.org>
Message-ID:

Hi Atsushi (& SCore team),

Thank you for your reply. I have talked to our key MPI-IO user and he insists that his code was running perfectly, without errors, with the previous version of SCore/MPICH/Redhat 7.2. And the MPI-IO users are mathematicians, so they're pretty handy with numbers ;-)

Assuming that they know their code is giving correct results, is there any way they can bypass the error messages that they are getting now? If they can do that, then they can verify the results for themselves.

If it's not possible to do that, is there any other way around this issue that you can think of (one which allows them to use a shared filesystem)?

Best wishes,

Mark

> I am not familiar with MPI-IO at all, but as far as I undestand,
> MPI-IO on NFS is not recommended.
> I found MPICH web page, and it says,
>
> ---
> Problems with MPI-IO and NFS
> The network file system (NFS) must be configured extremely carefully
> for MPI-IO (and many other programs) to work correctly.
> Unfortunately, few systems are so configured, and doing so can
> adversely impact performance. As a result, programs using files on an
> NFS system may hang or produce incorrect results. Note that this is,
> officially, a design feature of NFS; unless the NFS system is
> configured with no attribute caching, any two processes, accessing
> the same file, may produce incorrect results. You can use the
> -file_system=ufs option of configure to build an MPICH that supports
> only UFS (Unix File System); MPI-IO works correctly with UFS, XFS,
> PIOFS, HFS, SFS, etc. (more precisely, those file system that
> correctly implement basic Unix I/O system calls; something that NFS
> does not do).
>
> ----
> Atsushi HORI
> Swimmy Software, Inc.

--
Mark Ellerby
ISS, Leeds University
White Rose Grid Support Officer
issmde ＠ leeds.ac.uk
0113 3435429