From bogdan.costescu at iwr.uni-heidelberg.de  Wed Mar  5 03:45:35 2003
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Tue, 4 Mar 2003 19:45:35 +0100 (CET)
Subject: [SCore-users-jp] [SCore-users] Myrinet deadlock
Message-ID: <Pine.LNX.4.44.0303041945160.18506-100000@kenzo.iwr.uni-heidelberg.de>

Dear SCore developers,

When trying to test SCore 5.4, I get what looks like a deadlock when 
using Myrinet; combining it with Shmem makes it appear sooner.

Setup:
The cluster is composed of 8 nodes, each with dual Athlon, 512 MB RAM and 
older Myrinet (LANai 4) cards. I kept the configuration files from an 
older (4.2.1) SCore installation which worked flawlessly for more than a 
year, so I believe that there are no errors in this part. I installed the 
kernel RPM provided in the distribution, but compiled here all the 
user-level stuff.

The problem:
When trying to run a job that uses Myrinet with or without Shmem 
(-nodes=8x2 or -nodes=8x1) the job locks at random places. When running a 
job that uses Ethernet (either -nodes=8x1 or -nodes=8x2) the lockup does 
not occur even if I put more load on the nodes, like starting several jobs 
at the same time on the same nodes.
When the job is in this state, it can sometimes (but not always) be 
interrupted with Ctrl-C (if it's still connected to the terminal). But 
sometimes not even pskill is able to get rid of it: the message indicating 
that the job is killed appears every time pskill is executed, but the job 
is still there. At some point SCoreD dies and is restarted by sc_watch.

Attaching gdb to the job in this state gives something like:

#0  0x082c2702 in shmemReceive ()
#1  0x082b2f4d in composite_attach_context ()
#2  0x0829a485 in MPID_SCORE_Recv_Message ()
#3  0x082999f6 in MPID_SCORE_PIwrecv ()
#4  0x08299754 in MPID_SCORE_PIbrecv ()
#5  0x0829e2b1 in MPID_CH_Check_incoming ()
#6  0x082948d7 in MPID_RecvComplete ()
#7  0x0828a1ff in PMPI_Waitall ()

or

#0  0x082c6080 in myriReceive ()
#1  0x0829a485 in MPID_SCORE_Recv_Message ()
#2  0x082999f6 in MPID_SCORE_PIwrecv ()
#3  0x08299754 in MPID_SCORE_PIbrecv ()
#4  0x0829e2b1 in MPID_CH_Check_incoming ()
#5  0x082948d7 in MPID_RecvComplete ()
#6  0x0828a1ff in PMPI_Waitall ()

from which I assume that this is a deadlock. However, the application that 
produced this (CHARMM) is very stable and worked flawlessly with older 
versions of SCore, so deadlocks caused by bad programming in the 
application are to be excluded.

The /proc/pm/myrinet/0/info file on all nodes indicates 0; we never had 
any problems with these cards with older SCore versions.

Do you have any idea about what is going on ?

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De


_______________________________________________
SCore-users mailing list
SCore-users at pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


From s-sumi at flab.fujitsu.co.jp  Wed Mar  5 18:48:14 2003
From: s-sumi at flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Wed, 05 Mar 2003 18:48:14 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: <Pine.LNX.4.44.0303041945160.18506-100000@kenzo.iwr.uni-heidelberg.de>
References: <Pine.LNX.4.44.0303041945160.18506-100000@kenzo.iwr.uni-heidelberg.de>
Message-ID: <20030305.184814.596528202.s-sumi@flab.fujitsu.co.jp>

Hi.

The default MPICH version has changed from mpich 1.2.0 to mpich 1.2.4.
Could you build mpich 1.2.0 from source and test it?

Once mpich 1.2.0 is installed, you can choose between mpich-1.2.0 and mpich-1.2.4 with the -mpi option.

PS: How about the mpi_zerocopy=on option?

Shinji.

------
Shinji Sumimoto, Fujitsu Labs


From bogdan.costescu at iwr.uni-heidelberg.de  Wed Mar  5 19:30:51 2003
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed, 5 Mar 2003 11:30:51 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: <20030305.184814.596528202.s-sumi@flab.fujitsu.co.jp>
Message-ID: <Pine.LNX.4.44.0303051053540.32571-100000@kenzo.iwr.uni-heidelberg.de>

On Wed, 5 Mar 2003, Shinji Sumimoto wrote:

> The default MPICH version has changed from mpich 1.2.0 to mpich 1.2.4.

Yes, I was aware of this.

> Could you build mpich 1.2.0 from source and test it?

As I built from source all user-level stuff, I already got mpi-1.2.0. But 
now I'm wondering how to build the ch_score2 device as this seems not to 
be built by default and I wanted to test it as well.

> Once mpich 1.2.0 is installed, you can choose between mpich-1.2.0 and mpich-1.2.4 with the -mpi option.

Actually the -mpi option doesn't seem to work, but I have now set my PATH 
so that the bin directory of mpi-1.2.0 comes first.

> PS$B!'(B How about mpi_zerocopy=on option?

I tried it and it seemed to lower the chances of locking up, but it still 
happens. When it does, I sometimes get:

SCORE: Deadlock detected
<0:0>SCore: *** SIGNAL EXCEPTION eip=0x08299a6b, cr2=0x       0 ***
...

With mpich-1.2.0 I get the same lock-ups. Another thing worth 
mentioning is that whenever the jobs cannot be interrupted or killed 
with pskill and SCoreD has to restart, it always takes down one of the 
nodes. It's not always the same node (and with older SCore we didn't have 
this problem), so because of this, and because the lock-ups do not depend 
on the MPI library, I am starting to suspect the kernel side.

I'll try next to see if I can get SCore 4.2.1 to work with a newer kernel 
(2.4.18-19 or so, maybe some RedHat variant) to see if the problem comes 
from the newer kernel or from newer SCore.

Thank you for any suggestion!

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De




From hori at swimmy-soft.com  Wed Mar  5 19:46:55 2003
From: hori at swimmy-soft.com (Atsushi HORI)
Date: Wed, 5 Mar 2003 19:46:55 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
References: <20030305.184814.596528202.s-sumi@flab.fujitsu.co.jp>
Message-ID: <3129738415.hori0008@swimmy-soft.com>

Hi.

>I'll try next to see if I can get SCore 4.2.1 to work with a newer kernel 
>(2.4.18-19 or so, maybe some RedHat variant) to see if the problem comes 
>from the newer kernel or from newer SCore.

Another suggestion, or rather a question.

Have you recompiled your application program(s) ?

----
Atsushi HORI
Swimmy Software, Inc.



From kameyama at pccluster.org  Wed Mar  5 19:55:16 2003
From: kameyama at pccluster.org (kameyama at pccluster.org)
Date: Wed, 05 Mar 2003 19:55:16 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: Your message of "Wed, 05 Mar 2003 11:30:51 JST."
             <Pine.LNX.4.44.0303051053540.32571-100000@kenzo.iwr.uni-heidelberg.de>
Message-ID: <20030305105516.5336F20057@neal.il.is.s.u-tokyo.ac.jp>

In article <Pine.LNX.4.44.0303051053540.32571-100000 at kenzo.iwr.uni-heidelberg.de> Bogdan Costescu <bogdan.costescu at iwr.uni-heidelberg.de> wrote:
> On Wed, 5 Mar 2003, Shinji Sumimoto wrote:
> now I'm wondering how to build the ch_score2 device as this seems not to 
> be built by default and I wanted to test it as well.

Note that ch_score2 has not been supported since SCore 4.1!
ch_score2 uses PM internal header files, so if you want to compile
ch_score2, you must install the SCore source files and set the
MADE_CHSCORE2 make variable to yes.

> > Once mpich 1.2.0 is installed, you can choose between mpich-1.2.0 and mpich-1.2.4 
> with the -mpi option.
> 
> Actually the -mpi option doesn't seem to work, but I now set my path to 
> include first the bin directory of mpi-1.2.0.

The -mpi option is specified at compile time.
    $ mpicc -mpi mpich-1.2.0 ...
Or please set SCORE_MPI to mpich-1.2.0.

    
> so now because of this and because of independence of MPI 
> library I start to suspect the kernel-side.

Which kernel rpm do you use?
There are several SCore 5.4.0 kernel rpms.
I think you are probably using kernel-smp-2.4.19-1SCORE.athlon.rpm or
kernel-smp-2.4.19-1SCORE.i686.rpm.
Please try switching to a kernel you have not used yet.
    
                       from Kameyama Toyohisa


From bogdan.costescu at iwr.uni-heidelberg.de  Wed Mar  5 19:56:45 2003
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed, 5 Mar 2003 11:56:45 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: <3129738415.hori0008@swimmy-soft.com>
Message-ID: <Pine.LNX.4.44.0303051155240.32571-100000@kenzo.iwr.uni-heidelberg.de>

On Wed, 5 Mar 2003, Atsushi HORI wrote:

> Have you recompiled your application program(s) ?

Yes, of course. The first sentence in the documentation mentions that there 
is no binary compatibility with older SCore versions. And as I was 
previously using 4.2.1, I didn't even attempt to run those binaries with the 
newer SCore.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De



From bogdan.costescu at iwr.uni-heidelberg.de  Wed Mar  5 20:24:00 2003
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed, 5 Mar 2003 12:24:00 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: <20030305105516.5336F20057@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <Pine.LNX.4.44.0303051156530.32571-100000@kenzo.iwr.uni-heidelberg.de>

On Wed, 5 Mar 2003 kameyama at pccluster.org wrote:

> Note that ch_score2 has not been supported since SCore 4.1!

I wanted to compile it only for 5.4 to compare stability and speed with 
ch_score.

> ch_score2 uses PM internal header files, so if you want to compile
> ch_score2, you must install the SCore source files and set the
> MADE_CHSCORE2 make variable to yes.

OK, thank you.

> The -mpi option is specified at compile time.
>     $ mpicc -mpi mpich-1.2.0 ...

Ahh, now I see what I did wrong: I tried with:

mpicc -mpi 1.2.0

so it's my fault, sorry for the false alarm...

> I think you are probably using kernel-smp-2.4.19-1SCORE.athlon.rpm or
> kernel-smp-2.4.19-1SCORE.i686.rpm.

I tried already both of these (with mpich-1.2.0 only the i686 one).
I also tried with the non-SMP athlon kernel and it also locks up; of 
course as I can't run -nodes=8x2 it takes a bit longer now to lock, but 
that was what I experienced with the SMP kernels as well.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De




From bogdan.costescu at iwr.uni-heidelberg.de  Thu Mar  6 02:32:30 2003
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed, 5 Mar 2003 18:32:30 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: <Pine.LNX.4.44.0303051053540.32571-100000@kenzo.iwr.uni-heidelberg.de>
Message-ID: <Pine.LNX.4.44.0303051815060.32571-100000@kenzo.iwr.uni-heidelberg.de>

On Wed, 5 Mar 2003, Bogdan Costescu wrote:

> I'll try next to see if I can get SCore 4.2.1 to work with a newer kernel 
> (2.4.18-19 or so, maybe some RedHat variant) to see if the problem comes 
> from the newer kernel or from newer SCore.

I managed to patch RedHat's 2.4.18-24 with the kernel patch for SCore 
4.2.1 and built the SMP athlon kernel (I haven't yet tested the SMP i686 
kernel, but I'll do it this evening). However, with this kernel I also 
experience the lockups. So, still using SCore 4.2.1, I went back to the 
2.4.16-based kernel that I used before and was again able to run without 
any lockup for more than 1 hour, which already counts as "stable".

So, some problem with the kernel... I also tried booting with "noapic" for 
both SCore 5.4 with kernel 2.4.19-1SCORE and SCore 4.2.1 with my 2.4.18-24 
based kernel to eliminate any doubt about interrupt problems, but this 
didn't help.
On the other hand, with SCore 5.4 and kernel 2.4.19-1SCORE I was able to 
use the Ethernet-based communication (with or without shmem) without any 
problem. Maybe I did not test enough, but on the same time-scale it 
didn't lock up. This leads me to believe that somehow the 
new (> 2.4.16) kernels and Myrinet cards do not go well together on our 
computers... Does anybody know how the interrupt rate for Myrinet cards 
compares with the interrupt rate for fast Ethernet (3c59x here) for the 
same communication needs ? Any other idea ?
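
One way to approach the interrupt-rate question empirically would be to sample /proc/interrupts twice while the same job runs over each network, then compare the per-device rates. The following is only a minimal Python sketch under that assumption. The function names are hypothetical (they are not part of SCore), and the parser assumes the 2.4-style layout: IRQ number, one count per CPU, controller type, device name.

```python
import time


def read_counts(text):
    """Map each device name to its interrupt count summed over all CPUs."""
    lines = text.strip("\n").splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        fields = line.split(None, ncpus + 2)
        # Keep only numbered IRQ rows that carry a device name.
        if len(fields) < ncpus + 3 or not fields[0].rstrip(":").isdigit():
            continue
        counts = fields[1:1 + ncpus]
        if not all(c.isdigit() for c in counts):
            continue
        totals[fields[-1].strip()] = sum(int(c) for c in counts)
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for every device present in both samples."""
    return {dev: (after[dev] - before[dev]) / seconds
            for dev in after if dev in before}


def sample_rates(path="/proc/interrupts", interval=10.0):
    """Take two samples `interval` seconds apart and return the rates."""
    with open(path) as f:
        before = read_counts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_counts(f.read())
    return interrupt_rates(before, after, interval)
```

Calling sample_rates() on a node once during a Myrinet run and once during an Ethernet run of the same job would give two sets of numbers to compare. It says nothing about why a handler misbehaves, only how often it is entered.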

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De



From kameyama at pccluster.org  Thu Mar  6 09:04:19 2003
From: kameyama at pccluster.org (kameyama at pccluster.org)
Date: Thu, 06 Mar 2003 09:04:19 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: Your message of "Wed, 05 Mar 2003 18:32:30 JST."
             <Pine.LNX.4.44.0303051815060.32571-100000@kenzo.iwr.uni-heidelberg.de>
Message-ID: <20030306000419.5D83920054@neal.il.is.s.u-tokyo.ac.jp>

In article <Pine.LNX.4.44.0303051815060.32571-100000 at kenzo.iwr.uni-heidelberg.de> Bogdan Costescu <bogdan.costescu at iwr.uni-heidelberg.de> wrote:
> On Wed, 5 Mar 2003, Bogdan Costescu wrote:
> 
> > I'll try next to see if I can get SCore 4.2.1 to work with a newer kernel 
> > (2.4.18-19 or so, maybe some RedHat variant) to see if the problem comes 
> > from the newer kernel or from newer SCore.

I forgot to mention the FAQ (http://www.pccluster.org/faq/en/faq-tips/faq.html).
(Category: PM Communication Facility. 6)
Please check whether the IRQ is shared with another device.

                       from Kameyama Toyohisa


From bogdan.costescu at iwr.uni-heidelberg.de  Thu Mar  6 18:58:30 2003
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 6 Mar 2003 10:58:30 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: <20030306000419.5D83920054@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <Pine.LNX.4.44.0303061051180.15567-100000@kenzo.iwr.uni-heidelberg.de>

On Thu, 6 Mar 2003 kameyama at pccluster.org wrote:

> Please check whether the IRQ is shared with another device.

I did go through the FAQ... The interrupts are not shared; as I have only 
a few devices in the computer (one IDE disk, one 3c905C, one Myrinet 
card), each has its own interrupt:

           CPU0       CPU1       
  0:    3202112    3042292    IO-APIC-edge  timer
  1:          2          1    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  5:    1264609    1260821   IO-APIC-level  eth0
  8:          0          1    IO-APIC-edge  rtc
 11:          0          0   IO-APIC-level  usb-ohci
 12:     211923     162138   IO-APIC-level  myri
 14:      11142       8059    IO-APIC-edge  ide0

For non-SMP kernels, they are also not shared.
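
The same check can be done mechanically on every node. Below is a small Python sketch, not an SCore tool: it assumes that the kernel lists the handlers sharing one IRQ line separated by commas in the last column of /proc/interrupts, and the function name shared_irqs is made up for illustration. SAMPLE is the table quoted above.

```python
SAMPLE = """\
           CPU0       CPU1
  0:    3202112    3042292    IO-APIC-edge  timer
  1:          2          1    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  5:    1264609    1260821   IO-APIC-level  eth0
  8:          0          1    IO-APIC-edge  rtc
 11:          0          0   IO-APIC-level  usb-ohci
 12:     211923     162138   IO-APIC-level  myri
 14:      11142       8059    IO-APIC-edge  ide0
"""


def shared_irqs(text):
    """Return {irq: [handlers]} for every IRQ line with more than one handler."""
    lines = text.strip("\n").splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        fields = line.split(None, ncpus + 2)
        # Skip summary rows (NMI, ERR, ...) and rows without a device name.
        if len(fields) < ncpus + 3 or not fields[0].rstrip(":").isdigit():
            continue
        # Assumption: handlers on a shared line are comma-separated.
        handlers = [h.strip() for h in fields[-1].split(",")]
        if len(handlers) > 1:
            shared[int(fields[0].rstrip(":"))] = handlers
    return shared


if __name__ == "__main__":
    print(shared_irqs(SAMPLE))
```

For the table above the result is empty, which matches the statement that each device has its own interrupt.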

Next I'll try to see if I can get GM to work with the same kernel (RH 
2.4.18-24) and whether I see similar bad behaviour there, although given 
the different MPI implementation I guess that the communication pattern 
(rate of interrupts, PCI bus usage, etc.) will be different.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De



From bogdan.costescu at iwr.uni-heidelberg.de  Fri Mar  7 04:58:42 2003
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 6 Mar 2003 20:58:42 +0100 (CET)
Subject: [SCore-users-jp] [SCore-users] Kernel oops
Message-ID: <Pine.LNX.4.44.0303062040390.16049-100000@kenzo.iwr.uni-heidelberg.de>

Dear SCore developers,

I've postponed trying to test GM on our nodes as I have observed that 
whenever SCoreD crashes and takes with it one node there is also an Oops 
displayed on the node. This is with the SCore 4.2.1 kernel patch applied 
to RH 2.4.18-24, so it might be some error that I have introduced, but the 
behaviour (SCoreD taking down one node) is the same with SCore 5.4 and 
kernel 2.4.19-1SCORE which I plan to test tomorrow.

So, the (decoded) Oops looks like this:

EIP is at __wake_up [kernel] 0x3c (2.4.18-24SCORE)
eax: c041c998   ebx: c25a4d80     ecx: 00000000       edx: 00000000
esi: 00000001   edi: c041c994     ebp: c25abf1c       esp: c25abf08
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c25ab000)
Stack:  00000282 00000001 c041c96c c041c840 c041c994 00000002 c019ec9a c25a4d80
        00000001 dbdfa015 00000010 c010a6e3 00000010 c041c840 c25abf7c c25abf7c
        c0398000 00000010 c25a4d80 c010a872 00000010 c25abf7c c25a4d80 00000001
Call trace: [<c019ec9a>] myri_pm_intr [kernel] 0x7a (0xc25abf20))
[<c010a63e>] handle_IRQ_event [kernel] 0x5e (0xc25abf34))
[<c010a872>] do_IRQ [kernel] 0xc2 (0xc25abf54))
[<c0106e60>] default_idle [kernel] 0x0 (0xc25abf68))
[<c0106e60>] default_idle [kernel] 0x0 (0xc25abf74))
[<c010d098>] call_do_IRQ [kernel] 0x5 (0xc25abf78))
[<c0106e60>] default_idle [kernel] 0x0 (0xc25abf7c))
[<c0106e60>] default_idle [kernel] 0x0 (0xc25abf90))
[<c0106e89>] default_idle [kernel] 0x29 (0xc25abfa4))
[<c0106f02>] cpu_idle [kernel] 0x32 (0xc25abfb0))
[<c011dafb>] call_console_drivers [kernel] 0xeb (0xc25abfd0))
[<c011dca9>] printk [kernel] 0x129 (0xc25abffc))
Code: 8b 02 85 45 f0 74 ed 6a 00 52 e8 75 f0 ff ff 5a 85 c0 59 74
Using defaults from ksymoops -t elf32-i386 -a i386

Trace; c019ec9a <myri_pm_intr+7a/90>
Trace; c010a63e <handle_IRQ_event+5e/90>
Trace; c010a872 <do_IRQ+c2/110>
Trace; c0106e60 <default_idle+0/40>
Trace; c0106e60 <default_idle+0/40>
Trace; c010d098 <call_do_IRQ+5/d>
Trace; c0106e60 <default_idle+0/40>
Trace; c0106e60 <default_idle+0/40>
Trace; c0106e89 <default_idle+29/40>
Trace; c0106f02 <cpu_idle+32/50>
Trace; c011dafb <call_console_drivers+eb/100>
Trace; c011dca9 <printk+129/140>
Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   8b 02                     mov    (%edx),%eax
Code;  00000002 Before first symbol
   2:   85 45 f0                  test   %eax,0xfffffff0(%ebp)
Code;  00000005 Before first symbol
   5:   74 ed                     je     fffffff4 <_EIP+0xfffffff4> fffffff4 <END_OF_CODE+1f463075/????>
Code;  00000007 Before first symbol
   7:   6a 00                     push   $0x0
Code;  00000009 Before first symbol
   9:   52                        push   %edx
Code;  0000000a Before first symbol
   a:   e8 75 f0 ff ff            call   fffff084 <_EIP+0xfffff084> fffff084 <END_OF_CODE+1f462105/????>
Code;  0000000f Before first symbol
   f:   5a                        pop    %edx
Code;  00000010 Before first symbol
  10:   85 c0                     test   %eax,%eax
Code;  00000012 Before first symbol
  12:   59                        pop    %ecx
Code;  00000013 Before first symbol
  13:   74 00                     je     15 <_EIP+0x15> 00000015 Before first symbol

 <0>Kernel panic: Aiee, killing interrupt handler!


Today I was able to reproduce this Oops several times on different nodes. 
The trace is always the same, except for the line(s) after cpu_idle, which 
can be replaced by:
[<c0105000>] stext [kernel] 0x0 (...))
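
When several such reports have to be compared across nodes, the function names in the ksymoops "Trace;" lines are enough to confirm that the call chain is the same. The sketch below is only an illustration in Python with hypothetical helper names. It assumes the "Trace; address <function+offset/size>" layout shown above.

```python
import re

# Matches lines such as: Trace; c019ec9a <myri_pm_intr+7a/90>
TRACE_LINE = re.compile(r"Trace;\s+[0-9a-f]+\s+<([^+>]+)\+[0-9a-f]+/[0-9a-f]+>")


def trace_functions(ksymoops_text):
    """Return the function names of all Trace; lines, in order."""
    names = []
    for line in ksymoops_text.splitlines():
        match = TRACE_LINE.search(line)
        if match:
            names.append(match.group(1))
    return names


def shared_prefix(first, second):
    """Return the leading frames that two traces have in common."""
    prefix = []
    for a, b in zip(first, second):
        if a != b:
            break
        prefix.append(a)
    return prefix
```

Feeding two decoded reports through trace_functions() and then shared_prefix() shows how far down the chain from myri_pm_intr the traces agree.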

I looked a bit through the code but I don't really understand Myrinet 
programming too well, so maybe this gives you some idea. Spurious 
interrupts ? Lost interrupts ? I'm still not comfortable with the 
interrupt state on my machines as they have Tyan 760MP boards which are 
known for instabilities.
Anyway, as I said, I plan to try tomorrow with SCore 5.4 and kernel 
2.4.19-1SCORE to see if the locks there are also associated with such 
Oopses.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De



From kate at pfu.fujitsu.com  Sun Mar  9 16:13:10 2003
From: kate at pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Sun, 09 Mar 2003 16:13:10 +0900
Subject: [SCore-users-jp] SCore 5.4.0 binary RPM install
Message-ID: <200303090713.AA12082@flash.tokyo.pfu.co.jp>

I am Katayama of PFU.
Thank you for your continued support.

I am installing SCore 5.4.2 from the binary RPMs, but installing the
SCore kernel on the compute hosts fails with an error.

---- from here ----
[root@comp1 RPMS]# rpm -Uvh kernel-2.4.19-1SCORE.i686.rpm
error: failed dependencies:
        kernel-drm = 4.2.0 is needed by XFree86-4.2.0-8
[root@comp1 RPMS]# rpm -q --provides -p kernel-2.4.19-1SCORE.i686.rpm
module-info  
kernel = 2.4.19
kernel-drm = 4.1.0
kernel = 2.4.19-1SCORE
[root@comp1 RPMS]# rpm -q --provides kernel       
module-info  
kernel = 2.4.18
kernel-drm = 4.1.0
kernel-drm = 4.2.0
kernel = 2.4.18-3
[root@comp1 RPMS]# rpm -q --requires XFree86
Glide3  
XFree86-xfs = 4.2.0
XFree86-libs = 4.2.0
XFree86-base-fonts = 4.2.0-8
/etc/pam.d/system-auth  
kernel-drm = 4.2.0
/bin/ln  
/usr/sbin/chkfontpath  
---- to here ----

For the time being I continued the installation with --nodeps. Could
this cause any problems?
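
The failed dependency follows directly from the rpm queries above: XFree86 requires "kernel-drm = 4.2.0", the installed 2.4.18-3 kernel provides it, and the SCore kernel package provides only "kernel-drm = 4.1.0". Because rpm -U replaces the installed kernel package, nothing would provide 4.2.0 afterwards. The Python sketch below only restates that comparison with the capability strings copied from the output. It is a simplified exact-match check with a hypothetical function name, not how rpm itself resolves dependencies.

```python
def missing_capabilities(required, provided):
    """Return the required capabilities that `provided` does not contain."""
    available = {cap.strip() for cap in provided}
    return [cap for cap in required if cap.strip() not in available]


# Capabilities copied from the rpm -q --provides output above.
OLD_KERNEL = ["module-info", "kernel = 2.4.18", "kernel-drm = 4.1.0",
              "kernel-drm = 4.2.0", "kernel = 2.4.18-3"]
NEW_KERNEL = ["module-info", "kernel = 2.4.19", "kernel-drm = 4.1.0",
              "kernel = 2.4.19-1SCORE"]

# The only kernel-related entry in the rpm -q --requires XFree86 output.
XFREE86_NEEDS = ["kernel-drm = 4.2.0"]

if __name__ == "__main__":
    print("with old kernel:", missing_capabilities(XFREE86_NEEDS, OLD_KERNEL))
    print("with new kernel:", missing_capabilities(XFREE86_NEEDS, NEW_KERNEL))
```

Since the unmet requirement comes from XFree86 and concerns the DRM interface, forcing the install with --nodeps would be expected to matter mainly for X on the compute hosts. That is an inference from the queries, not something confirmed in this thread.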

I do not know whether this is related, but rpmtest has become extremely slow.

---- from here ----
[root@server sbin]# ./rpmtest bioinfo1 ethernet -reply &
[1] 4520
[root@server sbin]# time ./rpmtest bioinfo2 ethernet -dest 0 -ping
8       0.0028161

real    4m42.299s
user    0m0.008s
sys     0m0.000s
[root@server sbin]# 
---- to here ----

(Note) comp1 -> HOST_0, comp2 -> HOST_1

The Ethernet driver was e100, so I changed it to eepro100, and the
time went back to normal.

---- from here ----
[root@server sbin]# ./rpmtest bioinfo1 ethernet -reply &
[1] 4603
[root@server sbin]# time ./rpmtest bioinfo2 ethernet -dest 0 -ping
8       7.59142e-05

real    0m8.165s
user    0m0.006s
sys     0m0.006s
[root@server sbin]# 
---- to here ----

Is this simply a driver problem? Or did something go wrong with the
SCore kernel installation?
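
To put the two rpmtest runs above side by side, the ratio of the reported values can be computed directly. The Python lines below are only a worked calculation on the figures from the output. The post does not state the unit of the second rpmtest column, so it is treated here as an opaque per-ping figure.

```python
def wall_seconds(minutes, seconds):
    """Convert the 'real' value printed by time into seconds."""
    return 60 * minutes + seconds


def slowdown(slow, fast):
    """How many times larger the slow measurement is than the fast one."""
    return slow / fast


# Second column of the rpmtest output for each driver (unit not stated).
E100_PING = 0.0028161
EEPRO100_PING = 7.59142e-05

# 'real' times of the two runs: 4m42.299s and 0m8.165s.
E100_REAL = wall_seconds(4, 42.299)
EEPRO100_REAL = wall_seconds(0, 8.165)

if __name__ == "__main__":
    print("per-ping ratio:  %.1f" % slowdown(E100_PING, EEPRO100_PING))
    print("wall-time ratio: %.1f" % slowdown(E100_REAL, EEPRO100_REAL))
```

Both ratios come out at roughly 35 to 37, so the per-ping figure and the total run time agree with each other.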

For reference, the compute hosts are four FMV-C600 machines
(Pentium 4 2.4 GHz, i845G, 512 MB).

Thank you in advance.
--
PFU Limited, OSSC, Linux Systems Department
KATAYAMA Yoshio
Tel 044-520-6617  Fax 044-556-1022


From nrcb at streamline-computing.com  Sun Mar  9 18:48:28 2003
From: nrcb at streamline-computing.com (Nick Birkett)
Date: Sun, 9 Mar 2003 09:48:28 +0000
Subject: [SCore-users-jp] [SCore-users] score 5.4 build problems
Message-ID: <200303090948.28249.nrcb@streamline-computing.com>

Dear SCore users,

I have upgraded to RedHat 7.3 (original not updates version), 
and installed Score 5.4.
Kernel 2.4.19-1SCORE is installed as binary and source.

I am getting errors when I try the build using source packages:

score-5.4.0.build.tar.gz
score-5.4.0.score.tar.gz
score-5.4.0.mpi.tar.gz
score-5.4.0.utils.tar.gz

./configure works without error, but make gives:

PWD=/opt/score/score-src/SCore/pm2/arch/composite
+ make -w BUILD=/opt/score/score-src/SCore/build host_nickname=i386-redhat7-linux2_4 DIST= all
make[4]: Entering directory `/raid0/opt/score5.4.0/score-src/SCore/pm2/arch/composite'
cd obj.i386-redhat7-linux2_4;VPATH=.. make all BUILD=/opt/score/score-src/SCore/build host_nickname=i386-redhat7-linux2_4 DIST=
make[5]: Entering directory `/raid0/opt/score5.4.0/score-src/SCore/pm2/arch/composite/obj.i386-redhat7-linux2_4'
/usr/bin/gcc `if grep Unportable /usr/include/asm/spinlock.h> /dev/null; then echo -I/usr/src/linux-2.4/include; fi`  -O2 `case i386-unknown-linux in sparc-*-*) echo -Dsparc;; i386-*-*) echo -Di386 -m486;; alpha-*-*) echo -Dalpha;; esac` `case i386-unknown-linux in *-*-sunos4*) echo -Dsunos4;; *-*-netbsd*) echo -Dnetbsd;; *-*-linux*) echo -Dlinux;; *-*-osf*) echo -Dosf1_linux -I/usr/local/linux/linux.include;; esac`  -Wall `case i386-unknown-linux in alpha-*-linux*) echo  -pipe -ffixed-8 -mcpu=ev5 -Wa,-mev6 ;;  esac`  -I../../../include  `if grep Unportable /usr/include/asm/spinlock.h>/dev/null; then echo -I/usr/src/linux-2.4/include; fi`  -o pm_composite.o -c ../pm_composite.c
In file included from /usr/include/linux/spinlock.h:35,
                 from ../../../include/pm_lock.h:79,
                 from ../pm_composite.c:79:
/usr/include/asm/spinlock.h: In function `read_lock':
/usr/include/asm/spinlock.h:168: `LOCK' undeclared (first use in this function)
/usr/include/asm/spinlock.h:168: (Each undeclared identifier is reported only once
/usr/include/asm/spinlock.h:168: for each function it appears in.)
/usr/include/asm/spinlock.h:168: parse error before string constant
/usr/include/asm/spinlock.h:168: parse error before `:'
/usr/include/asm/spinlock.h: In function `write_lock':
/usr/include/asm/spinlock.h:177: `LOCK' undeclared (first use in this function)
/usr/include/asm/spinlock.h:177: parse error before string constant
/usr/include/asm/spinlock.h:177: parse error before `:'
/usr/include/asm/spinlock.h: In function `write_trylock':
/usr/include/asm/spinlock.h:186: warning: implicit declaration of function `atomic_sub_and_test'
/usr/include/asm/spinlock.h:188: warning: implicit declaration of function `atomic_add'
../pm_composite.c: At top level:

Any help appreciated.

Cheers,

Nick


From bogdan.costescu at iwr.uni-heidelberg.de  Mon Mar 10 04:50:01 2003
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Sun, 9 Mar 2003 20:50:01 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] score 5.4 build problems
In-Reply-To: <200303090948.28249.nrcb@streamline-computing.com>
Message-ID: <Pine.LNX.4.44.0303092043440.1589-100000@kenzo.iwr.uni-heidelberg.de>

On Sun, 9 Mar 2003, Nick Birkett wrote:

> I am getting errors when I try the build using source packages:
> 
> score-5.4.0.build.tar.gz
> score-5.4.0.score.tar.gz
> score-5.4.0.mpi.tar.gz
> score-5.4.0.utils.tar.gz

As I wrote in my earlier messages, I did compile all the user-level stuff 
(actually only what I needed, not all packages, but those above were 
included). I did not encounter any such problem... My system is RH 
7.2-based with pretty much all updates installed.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De



From kameyama@pccluster.org  Mon Mar 10 09:30:19 2003
From: kameyama@pccluster.org (kameyama@pccluster.org)
Date: Mon, 10 Mar 2003 09:30:19 +0900
Subject: [SCore-users-jp] Re: [SCore-users] score 5.4 build problems
In-Reply-To: Your message of "Sun, 09 Mar 2003 09:48:28 JST."
             <200303090948.28249.nrcb@streamline-computing.com>
Message-ID: <20030310003019.EAADB20054@neal.il.is.s.u-tokyo.ac.jp>

In article <200303090948.28249.nrcb@streamline-computing.com> Nick Birkett <nrcb@streamline-computing.com> wrote:
> c-*-*) echo -Dsparc;; i386-*-*) echo -Di386 -m486;; alpha-*-*) echo -Dalpha;;
>  esac` `case i386-unknown-linux in *-*-sunos4*) echo -Dsunos4;; *-*-netbsd*) 
> echo -Dnetbsd;; *-*-linux*) echo -Dlinux;; *-*-osf*) echo -Dosf1_linux -I/usr
> /local/linux/linux.include;; esac`  -Wall `case i386-unknown-linux in alpha-*
> -linux*) echo  -pipe -ffixed-8 -mcpu=ev5 -Wa,-mev6 ;;  esac`  -I../../../incl
> ude  `if grep Unportable /usr/include/asm/spinlock.h>/dev/null; then echo -I/
> usr/src/linux-2.4/include; fi`  -o pm_composite.o -c ../pm_composite.c
> In file included from /usr/include/linux/spinlock.h:35,
>                  from ../../../include/pm_lock.h:79,
>                  from ../pm_composite.c:79:
> /usr/include/asm/spinlock.h: In function `read_lock':

Building SCore from source requires the kernel header files.
On Red Hat 7.3 (or later), /usr/include/{asm,linux} does not contain
the full kernel headers.
Please install the kernel-source RPM on the server host.
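
The quoted gcc command line already probes for this case: it greps
/usr/include/asm/spinlock.h for the word "Unportable" and, when it is
found, adds -I/usr/src/linux-2.4/include. Below is a minimal Python sketch
of that same test, for checking a host before building. The function name
extra_include_flags is hypothetical, and reading "Unportable" as the marker
of the cut-down userland headers is inferred from the quoted command, not
taken from SCore documentation.

```python
from pathlib import Path


def extra_include_flags(header="/usr/include/asm/spinlock.h",
                        kernel_include="/usr/src/linux-2.4/include",
                        marker="Unportable"):
    """Mimic the shell test in the quoted gcc command line.

    Returns ["-I<kernel_include>"] when the header contains the marker
    (assumed to mean the userland copy is not a full kernel header),
    otherwise an empty list.
    """
    path = Path(header)
    if not path.is_file():
        # grep fails on a missing file, so the shell `if` adds nothing.
        return []
    text = path.read_text(errors="replace")
    return ["-I" + kernel_include] if marker in text else []


if __name__ == "__main__":
    flags = extra_include_flags()
    print("extra gcc flags:", " ".join(flags) or "(none)")
    # The flag only helps if the kernel source tree is really there.
    if flags and not Path(flags[0][2:]).is_dir():
        print("kernel source tree missing: install the kernel-source RPM")
```

If the flag is printed but the directory does not exist, the include path
points nowhere, which would be consistent with the errors shown above.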

                       from Kameyama Toyohisa


From emile.carcamo@nec.fr  Mon Mar 10 16:00:57 2003
From: emile.carcamo@nec.fr (Emile CARCAMO)
Date: Mon, 10 Mar 2003 08:00:57 +0100
Subject: [SCore-users-jp] [SCore-users] rcp-all usage with SCore 5.4
Message-ID: <200303100701.h2A70vZG006018@emilepc.ess.nec.fr>

Hello,

	I just noticed the following problem when using
	rcp-all after installing the brand-new version 5.4:

<sparepc.ess.nec.fr>[ecarcamo]<117>rsh-all -g essfrance uname -a 
node01.ess.nec.fr
node02.ess.nec.fr
node01.ess.nec.fr: Linux node01.ess.nec.fr 2.4.19-1SCORE #1 Wed Feb 5 14:10:38 
JST 2003 i686 unknown
node02.ess.nec.fr: Linux node02.ess.nec.fr 2.4.19-1SCORE #1 Wed Feb 5 14:10:38 
JST 2003 i686 unknown
<sparepc.ess.nec.fr>[ecarcamo]<118>
<sparepc.ess.nec.fr>[ecarcamo]<118>rcp-all /etc/printcap essfrance:/tmp 
SCOUT: Spawning done.            
[node01-2]:
if: Expression Syntax.
SCOUT: Session done.
<sparepc.ess.nec.fr>[ecarcamo]<119>which rcp-all 
/opt/score/bin/rcp-all
<sparepc.ess.nec.fr>[ecarcamo]<120>

	This command has always worked fine until now. Thanks
	in advance for the help, and best regards.

-- 
Emile_CARCAMO         NEC High Performance            http://www.hpce.nec.com
System Engineer         Computing Europe         mailto:ecarcamo@hpce.nec.com
(+33)6-8063-7003   GSM
(+33)1-3930-6601   FAX   / Your mouse has moved. Windows NT must be restarted \
(+33)1-3930-6613  PHONE  \ for the change to take effect. Reboot now?  [ OK ] /




From kameyama@pccluster.org  Mon Mar 10 16:52:31 2003
From: kameyama@pccluster.org (kameyama@pccluster.org)
Date: Mon, 10 Mar 2003 16:52:31 +0900
Subject: [SCore-users-jp] Re: [SCore-users] rcp-all usage with SCore 5.4
In-Reply-To: Your message of "Mon, 10 Mar 2003 08:00:57 JST."
             <200303100701.h2A70vZG006018@emilepc.ess.nec.fr>
Message-ID: <20030310075231.DF8B220054@neal.il.is.s.u-tokyo.ac.jp>

In article <200303100701.h2A70vZG006018@emilepc.ess.nec.fr> Emile CARCAMO <emile.carcamo@nec.fr> wrote:
> 	I just noticed the following trouble when using
> 	rcp-all after installing brand new version 5.4 :
> 
> <sparepc.ess.nec.fr>[ecarcamo]<117>rsh-all -g essfrance uname -a 
> node01.ess.nec.fr
> node02.ess.nec.fr
> node01.ess.nec.fr: Linux node01.ess.nec.fr 2.4.19-1SCORE #1 Wed Feb 5 14:10:38 
> JST 2003 i686 unknown
> node02.ess.nec.fr: Linux node02.ess.nec.fr 2.4.19-1SCORE #1 Wed Feb 5 14:10:38 
> JST 2003 i686 unknown
> <sparepc.ess.nec.fr>[ecarcamo]<118>
> <sparepc.ess.nec.fr>[ecarcamo]<118>rcp-all /etc/printcap essfrance:/tmp 
> SCOUT: Spawning done.            
> [node01-2]:
> if: Expression Syntax.
> SCOUT: Session done.

Sorry, rcp-all does not work if the login shell on the compute hosts is
csh or tcsh. Please apply this patch to rcp-all.

                       from Kameyama Toyohisa
---------------------------------------cut here---------------------------------
Index: rcp-all.pl
===================================================================
RCS file: /develop/cvsroot/score-src/program/utils/rcp-all/rcp-all.pl,v
retrieving revision 1.10
retrieving revision 1.11
diff -u -r1.10 -r1.11
--- rcp-all.pl	25 Oct 2002 04:44:57 -0000	1.10
+++ rcp-all.pl	10 Mar 2003 07:42:42 -0000	1.11
@@ -82,9 +82,9 @@
     my ($file) = @files[0];
     $file_basename = basename($file);
     $filemode = sprintf "%03o", (stat($file))[2] & 0777;
-    $scout_command = "scout -wait -g $group -re '\"if [ -d $remote_dir ]; then cat > "
-        . "$remote_dir/$file_basename;chmod $filemode $remote_dir/$file_basename;"
-        . "else cat > $remote_dir;chmod $filemode $remote_dir; fi\"'";
+    $scout_command = "scout -wait -g $group -re '\"[ -d $remote_dir ] && (cat > "
+        . "$remote_dir/$file_basename;chmod $filemode $remote_dir/$file_basename);"
+        . "[ -d $remote_dir ] || (cat > $remote_dir;chmod $filemode $remote_dir) \"'";
     open(REMOTE, "|$scout_command") or Error("Cannot exec scout command $!");
     open(LOCAL, $file) or Error("Cannot open $file $!");
     while(sysread(LOCAL, $_, $bufsize)) {
---------------------------------------cut here---------------------------------
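
The patch drops the Bourne-style `if ...; then ...; else ...; fi`, which
csh and tcsh reject with "if: Expression Syntax.", and uses only `[`, `&&`,
`||`, `;` and parenthesised groups instead. The Python sketch below rebuilds
the same remote command string so that the logic is easier to read. The
original is Perl; the function name portable_remote_command is hypothetical,
and paths are interpolated without quoting, as in the original.

```python
def portable_remote_command(remote_dir, file_basename, filemode):
    """Build the remote command used by the patched rcp-all (revision 1.11).

    Only `[`, `&&`, `||`, `;` and `( ... )` appear, so the login shell
    does not have to understand `if ...; then ...; fi`.
    """
    target = f"{remote_dir}/{file_basename}"
    is_dir = f"[ -d {remote_dir} ]"
    # Destination is a directory: write <dir>/<basename>.
    into_dir = f"{is_dir} && (cat > {target};chmod {filemode} {target});"
    # Destination is not a directory: treat it as the file name itself.
    as_file = f"{is_dir} || (cat > {remote_dir};chmod {filemode} {remote_dir}) "
    return into_dir + as_file


if __name__ == "__main__":
    # Values taken from the failing example: rcp-all /etc/printcap essfrance:/tmp
    # (the mode 644 is an assumed value for illustration).
    print(portable_remote_command("/tmp", "printcap", "644"))
```

Only one of the two groups runs on a given host, because both are guarded
by the same `[ -d ... ]` test with opposite operators.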


From emile.carcamo@nec.fr  Mon Mar 10 17:20:53 2003
From: emile.carcamo@nec.fr (Emile CARCAMO)
Date: Mon, 10 Mar 2003 09:20:53 +0100
Subject: [SCore-users-jp] Re: [SCore-users] rcp-all usage with SCore 5.4
In-Reply-To: Your message of "Mon, 10 Mar 2003 17:19:16 +0900."
             <20030310081916.3001420054@neal.il.is.s.u-tokyo.ac.jp> 
Message-ID: <200303100820.h2A8Krkf002374@emilepc.ess.nec.fr>


> > 	Does this patch apply on following file :
> > 
> > /opt/score5.4.0/bin/bin.i386-redhat7-linux2_4/rcp-all.exe
> 
> This patch was made for the rcp-all source file,
> but you can apply it to rcp-all.exe directly.
> When patch asks "File to patch:", specify rcp-all.exe.
> 
> Please issue the following commands:
>     % cd /opt/score5.4.0/bin/bin.i386-redhat7-linux2_4/
>     % patch < patch_file (or my previous mail)
>     File to patch: rcp-all.exe
>                    ~~~~~~~~~~~
> 

Thanks a lot Kameyama-san, problem is fixed now !!
This patch also helps if your login shell is bash,
AFAIK. Best regards,

-- 
Emile_CARCAMO         NEC High Performance            http://www.hpce.nec.com
System Engineer         Computing Europe         mailto:ecarcamo@hpce.nec.com
(+33)6-8063-7003   GSM
(+33)1-3930-6601   FAX   / Your mouse has moved. Windows NT must be restarted \
(+33)1-3930-6613  PHONE  \ for the change to take effect. Reboot now?  [ OK ] /




From uebayasi@pultek.co.jp  Mon Mar 10 22:10:33 2003
From: uebayasi@pultek.co.jp (Masao Uebayashi)
Date: Mon, 10 Mar 2003 22:10:33 +0900 (JST)
Subject: [SCore-users-jp] [SCore-users] Question about Modified Ack/Nack
Message-ID: <20030310.221033.60048924.uebayasi@pultek.co.jp>

Hello.

In terms of Modified Ack/Nack, what should be done if the following
situations happen?

For example, a node S sends messages 0, 1, 2, 3 to another node R.

	a) R receives 1, 0, 2, 3.

	b) R receives 1, 3.

Thanks in advance.

Masao


From nrcb@streamline-computing.com  Mon Mar 10 22:24:10 2003
From: nrcb@streamline-computing.com (Nick Birkett)
Date: Mon, 10 Mar 2003 13:24:10 +0000
Subject: [SCore-users-jp] [SCore-users] Resource problem
Message-ID: <200303101324.10326.nrcb@streamline-computing.com>

Hi - I am getting a funny resource problem on a machine that has been running 
SCore 5.0.1 for 200 days. 

I can run rpmtest and scstest between 2 nodes.
I can run a parallel job on each node separately,
but when I try to run on both hosts together I get a "Resource unavailable"
error.

A Myrinet line card was replaced on Friday. Could it be that a cable is
in the wrong port (but surely rpmtest and scstest would not work then)?

I have tried restarting scoreboard and msgbserv and rebooting comp29,30.

Anyone have an idea about this ?

------------------------------------------------------------------------

[nrcb@saturn mpi]$ cat hosts 
comp29.ex.ac.uk
comp30.ex.ac.uk  

[nrcb@saturn mpi]$ scout -wait  -F hosts -e scrun -nodes=2 ./jacobi_mpi
SCOUT: Spawning done.          
SCore-D 5.0.1 connected (jid=70).
<0:0> SCORE: 2 nodes (1x2) ready.
  Running with nprocs= 2
  Array size nxg,nyg =  1024 1024
  Iteration count    =  1024
  Running with nprocs= 2
 cpus= 2: Iteration =  10  8.66808374E+12
 cpus= 2: Iteration =  20  8.61407852E+12

WORKS

[nrcb@saturn mpi]$ scout -wait  -F hosts -e scrun -nodes=4 ./jacobi_mpi
SCOUT: Spawning done.          
FEP:ERROR SCore-D Login failed: Resource unavailable.
SCOUT: Session done.

DOESN'T WORK

[nrcb@saturn mpi]$ cat hosts 
comp30.ex.ac.uk        
comp29.ex.ac.uk


[nrcb@saturn mpi]$ scout -wait  -F hosts -e scrun -nodes=2 ./jacobi_mpi
SCOUT: Spawning done.          
SCore-D 5.0.1 connected (jid=72).
<0:0> SCORE: 2 nodes (1x2) ready.
  Running with nprocs= 2
  Array size nxg,nyg =  1024 1024
  Iteration count    =  1024
  Running with nprocs= 2
 cpus= 2: Iteration =  10  8.66808374E+12
 cpus= 2: Iteration =  20  8.61407852E+12
 cpus= 2: Iteration =  30  8.57295514E+12

WORKS

[nrcb@saturn mpi]$ scout -wait  -F hosts -e scrun -nodes=4 ./jacobi_mpi
SCOUT: Spawning done.          
FEP:ERROR SCore-D Login failed: Resource unavailable.
SCOUT: Session done.

DOESN'T WORK

The jacobi_mpi application is a standard one that works with up to 64 processes.



From jodell@ad.brown.edu  Tue Mar 11 03:32:43 2003
From: jodell@ad.brown.edu (James O'Dell)
Date: 10 Mar 2003 13:32:43 -0500
Subject: [SCore-users-jp] [SCore-users] Debugger
Message-ID: <1047321162.1540.4.camel@cr1>

I'm a new user of the SCORE system and have a couple of questions about
the debugger.

1) Is there any way to tell score to simply dump core, rather than trying
to invoke GDB?

2) When I look at the stack trace in GDB, all I seem to see are
functions related to score. How do I display the stack of user function
invocations?


Thanks,
Jim


From kameyama@pccluster.org  Tue Mar 11 08:49:12 2003
From: kameyama@pccluster.org (kameyama@pccluster.org)
Date: Tue, 11 Mar 2003 08:49:12 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Resource problem
In-Reply-To: Your message of "Mon, 10 Mar 2003 13:24:10 JST."
             <200303101324.10326.nrcb@streamline-computing.com>
Message-ID: <20030310234912.44EB320054@neal.il.is.s.u-tokyo.ac.jp>

In article <200303101324.10326.nrcb@streamline-computing.com> Nick Birkett <nrcb@streamline-computing.com> wrote:
> [nrcb@saturn mpi]$ scout -wait  -F hosts -e scrun -nodes=2 ./jacobi_mpi
> SCOUT: Spawning done.          
> SCore-D 5.0.1 connected (jid=70).
> <0:0> SCORE: 2 nodes (1x2) ready.

I think the program connected to SCore-D in multi-user mode (running on a 1x2 host).
(In single-user mode, the jid is not displayed.)

Please check your SCORE_OPTIONS environment variable.

                       from Kameyama Toyohisa


From hori@swimmy-soft.com  Tue Mar 11 09:45:58 2003
From: hori@swimmy-soft.com (Atsushi HORI)
Date: Tue, 11 Mar 2003 09:45:58 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Debugger
In-Reply-To: <1047321162.1540.4.camel@cr1>
References: <1047321162.1540.4.camel@cr1>
Message-ID: <3130220758.hori0000@swimmy-soft.com>

Hi.

>I'm a new user of the SCORE system and have a couple of questions about
>the debugger.
>
>1) Is there any way to tell score to simply dump core, rather than trying
>to invoke GDB?

No. If your program runs on 100 processors, do you 
really need 100 core files?

>2) When I look at the stack trace in GDB, all I seem to see are
>functions related to score. How do I display the stack of user function
>invocations?

There are two possible cases.

1) Your program was not compiled with the debug (-g) option, 
   or the executable file is stripped and the symbol information is lost.
2) The exception signal is raised in the SCore runtime library itself.

BTW, which parallel library (MPI ?) or parallel language (OpenMP?) 
are you using ?

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



From uebayasi@pultek.co.jp  Tue Mar 11 13:47:56 2003
From: uebayasi@pultek.co.jp (Masao Uebayashi)
Date: Tue, 11 Mar 2003 13:47:56 +0900 (JST)
Subject: [SCore-users-jp] [SCore-users] Re: Question about Modified Ack/Nack
In-Reply-To: <20030310.221033.60048924.uebayasi@pultek.co.jp>
References: <20030310.221033.60048924.uebayasi@pultek.co.jp>
Message-ID: <20030311.134756.125114941.uebayasi@pultek.co.jp>

> In terms of Modified Ack/Nack, what should be done if the following
> situations happen?
> 
> For example, a node S sends messages 0, 1, 2, 3 to another node R.
> 
> 	a) R receives 1, 0, 2, 3.
> 
> 	b) R receives 1, 3.

I looked at PM/Myrinet.  I can understand its behavior if Myrinet
preserves packet order, but I'm not sure that it does.

Masao


From s-sumi@flab.fujitsu.co.jp  Tue Mar 11 14:11:11 2003
From: s-sumi@flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Tue, 11 Mar 2003 14:11:11 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] Question about Modified Ack/Nack
In-Reply-To: <20030310.221033.60048924.uebayasi@pultek.co.jp>
References: <20030310.221033.60048924.uebayasi@pultek.co.jp>
Message-ID: <20030311.141111.35022320.s-sumi@flab.fujitsu.co.jp>

Hi.

From: Masao Uebayashi <uebayasi@pultek.co.jp>
Subject: [SCore-users] Question about Modified Ack/Nack
Date: Mon, 10 Mar 2003 22:10:33 +0900 (JST)
Message-ID: <20030310.221033.60048924.uebayasi@pultek.co.jp>

uebayasi> Hello.
uebayasi> 
uebayasi> In terms of Modified Ack/Nack, what should be done if the following
uebayasi> situations happen?
uebayasi> 
uebayasi> For example, a node S sends messages 0, 1, 2, 3 to another node R.
uebayasi> 
uebayasi> 	a) R receives 1, 0, 2, 3.
uebayasi> 
uebayasi> 	b) R receives 1, 3.
uebayasi> 

In both cases, all messages except 0 are discarded, and the receiver
sends a nack to the sender.


a) R receives 1, 0, 2, 3.
	      x  o  x  x  
b) R receives 1, 3.
	      x  x

o: received
x: discarded 
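
The rule above (accept only the next expected sequence number, discard
everything else) can be sketched in Python as follows. This is an
illustration of the described behaviour, not PM's actual code: the function
name receive_in_order is hypothetical, and sending one nack per discarded
message is an assumption made for the sketch.

```python
def receive_in_order(arrivals, expected=0):
    """Simulate a receiver that accepts only the next expected message.

    Returns (marks, nacks): marks has "o" for a received message and "x"
    for a discarded one; nacks lists the sequence number requested each
    time a message is discarded (assumed: one nack per discard).
    """
    marks = []
    nacks = []
    for seq in arrivals:
        if seq == expected:
            marks.append("o")
            expected += 1          # now wait for the following message
        else:
            marks.append("x")      # out of order: drop it
            nacks.append(expected) # ask the sender to resend from here
    return marks, nacks


if __name__ == "__main__":
    for label, arrivals in (("a", [1, 0, 2, 3]), ("b", [1, 3])):
        marks, nacks = receive_in_order(arrivals)
        print(label, arrivals, " ".join(marks), "nacks:", nacks)
```

For case a) this yields x o x x, and for case b) x x, matching the table
above; after case a) the receiver is still waiting for message 1.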

Shinji.

------
Shinji Sumimoto, Fujitsu Labs


From kate@pfu.fujitsu.com  Tue Mar 11 16:17:00 2003
From: kate@pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Tue, 11 Mar 2003 16:17:00 +0900
Subject: [SCore-users-jp] _IceTransSocketUNIXConnect
Message-ID: <200303110717.AA13354@flash.tokyo.pfu.co.jp>

This is Katayama from PFU.
Thank you for your continued support.

When I run demo/bin/mandel on SCore 5.4.0,

---------- from here ----------
[root@server tmp]# scrun -monitor /opt/score/demo/bin/mandel
SCore-D 5.4.0 connected.
<0:0> SCORE: 8 nodes (8x1) ready.
_IceTransSocketUNIXConnect: Cannot connect to non-local host bioinfo0.envi.osakafu-u.ac.jp
Warning: Tried to connect to session manager, Could not open network socket
:: -size 320x240 -re 0.000000 -im 0.000000 -radius 2.000000
end: 0 sec  59 msec 305 usec
---------- to here ----------

the warning message above appears. What could be causing it?

I reinstalled the server host today; before that (my memory is hazy),
I do not think this message appeared.

Sorry to trouble you, and thank you in advance.

P.S.
Date: Sun, 09 Mar 2003 16:13:10 +0900
From: KATAYAMA Yoshio <kate@pfu.fujitsu.com>
Subject: [SCore-users-jp] SCore 5.4.0 binary RPM install

I would also appreciate help with the matter above.
--
PFU Limited, OSSC, Linux Systems Department
KATAYAMA Yoshio
Tel 044-520-6617  Fax 044-556-1022


From nrcb@streamline-computing.com  Tue Mar 11 16:27:19 2003
From: nrcb@streamline-computing.com (Nick Birkett)
Date: Tue, 11 Mar 2003 07:27:19 +0000
Subject: [SCore-users-jp] Re: [SCore-users] score 5.4 build problems
In-Reply-To: <20030311004059.21E2020054@neal.il.is.s.u-tokyo.ac.jp>
References: <20030311004059.21E2020054@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200303110727.19138.nrcb@streamline-computing.com>

On Tuesday 11 March 2003 12:40 am, you wrote:

> > > But redhat 7.3 (or later) /usr/include/{asm,linux} does not support
> > > full kernel header file.
> > > Please install kernel-source rpm on server host.

Many thanks. I re-installed the SCore kernel and now have the link

lrwxrwxrwx    1 root     root           19 Mar 11 07:17 linux-2.4 -> linux-2.4.19-1SCORE

Then I ran make menuconfig in linux-2.4.19-1SCORE, saved, and exited.

I guess I have too many kernels and links on my system !!

Regards,

Nick


From kameyama@pccluster.org  Tue Mar 11 18:40:20 2003
From: kameyama@pccluster.org (kameyama@pccluster.org)
Date: Tue, 11 Mar 2003 18:40:20 +0900
Subject: [SCore-users-jp] SCore 5.4.0 binary RPM install
In-Reply-To: Your message of "Sun, 09 Mar 2003 16:13:10 JST."
             <200303090713.AA12082@flash.tokyo.pfu.co.jp>
Message-ID: <20030311094020.4D78620054@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200303090713.AA12082@flash.tokyo.pfu.co.jp> KATAYAMA Yoshio <kate@pfu.fujitsu.com> wrote:
> I am installing SCore 5.4.2 from the binary RPMs, but installing the
> SCore kernel on the compute hosts fails with an error.

Sorry.
That is a mistake in the spec file.

> For now I continued the installation with --nodeps; could that cause
> any problems?

I think there is no problem as long as you do not start an X server on
the compute hosts.
If you do start one, the i830 driver found on a Red Hat 7.3 server (it was
not in the standard 2.4.19 kernel) and one more (which I recall removing
because it failed to compile...) are not present, so that may become a problem.
The chipset of the host in question is i845, so it looks like it could be a problem...

> I do not know whether this is related, but rpmtest has become extremely slow.

(snip)

> The Ethernet driver was e100, so I switched to eepro100, and the
> run time went back to normal.

(snip)

> Is this simply a driver problem? Or did something go wrong with the
> SCore kernel installation?

I think it is a driver problem.

                       from Kameyama Toyohisa


From kameyama@pccluster.org  Tue Mar 11 19:20:33 2003
From: kameyama@pccluster.org (kameyama@pccluster.org)
Date: Tue, 11 Mar 2003 19:20:33 +0900
Subject: [SCore-users-jp] _IceTransSocketUNIXConnect
In-Reply-To: Your message of "Tue, 11 Mar 2003 16:17:00 JST."
             <200303110717.AA13354@flash.tokyo.pfu.co.jp>
Message-ID: <20030311102033.8D79720054@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200303110717.AA13354@flash.tokyo.pfu.co.jp> KATAYAMA Yoshio <kate@pfu.fujitsu.com> wrote:
> When I run demo/bin/mandel on SCore 5.4.0,
> 
> ---------- from here ----------
> [root@server tmp]# scrun -monitor /opt/score/demo/bin/mandel
> SCore-D 5.4.0 connected.
> <0:0> SCORE: 8 nodes (8x1) ready.
> _IceTransSocketUNIXConnect: Cannot connect to non-local host bioinfo0.envi.osakafu-u.ac.jp
> Warning: Tried to connect to session manager, Could not open network socket
> :: -size 320x240 -re 0.000000 -im 0.000000 -radius 2.000000
> end: 0 sec  59 msec 305 usec
> ---------- to here ----------
> 
> the warning message above appears. What could be causing it?
> 
> I reinstalled the server host today; before that (my memory is hazy),
> I do not think this message appeared.

It probably did appear before...
This looks like mandel trying to use ICE (the Inter-Client Exchange protocol)
and failing.

For what it is worth, if you start xsm and launch the program under it,
    Warning: Tried to connect to session manager, Could not open network socket
seems to go away...

                       from Kameyama Toyohisa


From kate@pfu.fujitsu.com  Wed Mar 12 16:34:32 2003
From: kate@pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Wed, 12 Mar 2003 16:34:32 +0900
Subject: [SCore-users-jp] SCore 5.4.0 binary RPM install
In-Reply-To: Your message of Tue, 11 Mar 2003 18:40:20 +0900.
             <20030311094020.4D78620054@neal.il.is.s.u-tokyo.ac.jp> 
Message-ID: <200303120734.AA14066@flash.tokyo.pfu.co.jp>

This is Katayama from PFU.

Thank you for your reply.

Date: Tue, 11 Mar 2003 18:40:20 +0900
From: kameyama@pccluster.org

>> I am installing SCore 5.4.2 from the binary RPMs, but installing the
>> SCore kernel on the compute hosts fails with an error.

>Sorry.
>That is a mistake in the spec file.

That is a relief.

>> For now I continued the installation with --nodeps; could that cause
>> any problems?

>I think there is no problem as long as you do not start an X server on
>the compute hosts.
>If you do start one, the i830 driver found on a Red Hat 7.3 server (it was
>not in the standard 2.4.19 kernel) and one more (which I recall removing
>because it failed to compile...) are not present, so that may become a problem.
>The chipset of the host in question is i845, so it looks like it could be a problem...

Stock Red Hat 7.3 cannot run X properly (*), so I have installed
i830-20030120-i386-linux.tar.gz, downloaded from Intel's web site.

* Switching tty screen -> X -> tty screen works, but after that I can
* no longer get back to X.
* With kernel-2.4.18-24.7.x.i686.rpm it seems to work even without
* this driver, but I install it just to be safe.

The same happened with the SCore kernel, so I installed this driver
there too, and it seems to be working.

>> I do not know whether this is related, but rpmtest has become extremely slow.

>> The Ethernet driver was e100, so I switched to eepro100, and the
>> run time went back to normal.

>> Is this simply a driver problem? Or did something go wrong with the
>> SCore kernel installation?

>I think it is a driver problem.

Thank you very much.
--
PFU Limited, OSSC, Linux Systems Department
KATAYAMA Yoshio
Tel 044-520-6617  Fax 044-556-1022


From kate@pfu.fujitsu.com  Wed Mar 12 16:34:38 2003
From: kate@pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Wed, 12 Mar 2003 16:34:38 +0900
Subject: [SCore-users-jp] _IceTransSocketUNIXConnect
In-Reply-To: Your message of Tue, 11 Mar 2003 19:20:33 +0900.
             <20030311102033.8D79720054@neal.il.is.s.u-tokyo.ac.jp> 
Message-ID: <200303120734.AA14071@flash.tokyo.pfu.co.jp>

This is Katayama from PFU.

Thank you for your reply.

Date: Tue, 11 Mar 2003 19:20:33 +0900
From: kameyama@pccluster.org

>> When I run demo/bin/mandel on SCore 5.4.0,
...
>> the warning message above appears. What could be causing it?
>> 
>> I reinstalled the server host today; before that (my memory is hazy),
>> I do not think this message appeared.

>It probably did appear before...
>This looks like mandel trying to use ICE (the Inter-Client Exchange protocol)
>and failing.

When I ran it as an ordinary user, the message no longer appeared.

I had run it right after the reinstall finished (before creating any user
accounts), saw an unfamiliar message, and got flustered. Sorry for the
noise.
--
PFU Limited, OSSC, Linux Systems Department
KATAYAMA Yoshio
Tel 044-520-6617  Fax 044-556-1022


From nrcb@streamline-computing.com  Wed Mar 12 16:19:26 2003
From: nrcb@streamline-computing.com (Nick Birkett)
Date: Wed, 12 Mar 2003 07:19:26 +0000
Subject: [SCore-users-jp] [SCore-users] Trunked network Score 5.4
Message-ID: <200303120719.26926.nrcb@streamline-computing.com>

Hi, we are having a problem with a trunked network:
2 onboard GB cards connected to 2 fast ether switches.

Simple tests work, but running an application (e.g. charmm)
causes the network to crash.

The application works fine using 1 network card.

Hardware : SuperMicro 1U servers, dual onboard Intel Gbit
Software: kernel 2.4.19-1SCORE smp, Score 5.4
charmm 28b2.

Will try again using 2 gigabit switches.

configuration files attached

Thanks,

Nick

-------------- next part --------------
A non-text attachment was scrubbed...
Name: scorehosts.db
Type: text/x-csrc
Size: 2388 bytes
Desc: not available
URL:  <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030312/8084a541/attachment.bin>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pm-gigabit0.conf
URL:  <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030312/8084a541/attachment.ksh>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pm-gigabit1.conf
URL:  <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030312/8084a541/attachment-0001.ksh>

From hori@swimmy-soft.com  Wed Mar 12 17:27:07 2003
From: hori@swimmy-soft.com (Atsushi HORI)
Date: Wed, 12 Mar 2003 17:27:07 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Trunked network Score 5.4
In-Reply-To: <200303120719.26926.nrcb@streamline-computing.com>
References: <200303120719.26926.nrcb@streamline-computing.com>
Message-ID: <3130334827.hori0002@swimmy-soft.com>

Hi.

>Hi we are having a problem with trunked network.
>2 onboard GB cards connected to 2 fast ether switches.
>
>Simple tests work, but running application (eg charmm)
>causes network to crash.

I always run scstest overnight. You had better do the same.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



From kameyama@pccluster.org  Wed Mar 12 17:40:29 2003
From: kameyama@pccluster.org (kameyama@pccluster.org)
Date: Wed, 12 Mar 2003 17:40:29 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Trunked network Score 5.4
In-Reply-To: Your message of "Wed, 12 Mar 2003 07:19:26 JST."
             <200303120719.26926.nrcb@streamline-computing.com>
Message-ID: <20030312084029.9287420055@neal.il.is.s.u-tokyo.ac.jp>

In article <200303120719.26926.nrcb@streamline-computing.com> Nick Birkett <nrcb@streamline-computing.com> wrote:
> Simple tests work, but running application (eg charmm)
> causes network to crash.
> 
>  Application works fine using 1 network card.

...

> ethernet	type=ethernet \
> 		-config:file=/opt/score/etc/pm-ethernet.conf
> gigabit0	type=ethernet \
> 		-config:file=/opt/score/etc/pm-gigabit0.conf
> gigabit1	type=ethernet \
> 		-config:file=/opt/score/etc/pm-gigabit1.conf
> gigabitx2       type=ethernet \
>                 -config:file=/opt/score/etc/pm-gigabit1.conf \
>                 -trunk0:file=/opt/score/etc/pm-gigabit0.conf

...

> 
> comp00.streamline	HOST_0 network=gigabitx2,gigabit0,gigabit1,shmem0,shmem1 group=_scoreall_,MYRI,ETHER,SHMEM smp=2 MSGBSERV
> comp01.streamline	HOST_1 network=gigabitx2,gigabit0,gigabit1,shmem0,shmem1 group=_scoreall_,MYRI,ETHER,SHMEM smp=2 MSGBSERV

Please remove the gigabit0 and gigabit1 networks, at least from each host line.

    http://www.pccluster.org/score/dist/score/html/en/reference/pm/ether-trunking.html
says:
    In this file, ethernet-0, ethernet-1, ethernet-2 and ethernet-3
    networks should be used for test purpose only, and needless networks
    should be removed after following communication tests are
    finished. Because, these definition causes a trouble in SCore-D
    multiuser environment.
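
As an illustration of the suggested change, the Python sketch below removes
the test-only networks from the network= field of a host line and leaves
every other field untouched. The helper name drop_networks is hypothetical,
and the sketch assumes each host record is a single whitespace-separated
line, as in the quoted scorehosts.db excerpt.

```python
def drop_networks(host_line, unwanted):
    """Return host_line with the named networks removed from network=...

    Fields are re-joined with single spaces; fields other than
    network= are copied as they are.
    """
    out = []
    for field in host_line.split():
        if field.startswith("network="):
            names = field[len("network="):].split(",")
            kept = [name for name in names if name not in unwanted]
            field = "network=" + ",".join(kept)
        out.append(field)
    return " ".join(out)


if __name__ == "__main__":
    line = ("comp00.streamline HOST_0 "
            "network=gigabitx2,gigabit0,gigabit1,shmem0,shmem1 "
            "group=_scoreall_,MYRI,ETHER,SHMEM smp=2 MSGBSERV")
    print(drop_networks(line, {"gigabit0", "gigabit1"}))
```

The trunked gigabitx2 definition itself still refers to both pm-gigabit
configuration files, so only the host lines change.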

                       from Kameyama Toyohisa


From bogdan.costescu@iwr.uni-heidelberg.de  Wed Mar 12 21:03:05 2003
From: bogdan.costescu@iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed, 12 Mar 2003 13:03:05 +0100 (CET)
Subject: [SCore-users-jp] [SCore-users] Back to SCore 4.2.1...
Message-ID: <Pine.LNX.4.44.0303121257230.10954-100000@kenzo.iwr.uni-heidelberg.de>

Dear SCore developers and users,

I've given up on the newest version of SCore and gone back to SCore 4.2.1 
on a 2.4.16-based kernel. During the last few days, I got better 
behaviour from SCore 5.4 (no more node crashes, although I didn't change 
anything), but jobs would still stall or stop with a "Deadlock detected." 
message only minutes after starting.

But I'm not writing this to discourage other people from installing SCore 
5.4. Instead, I would be very interested to hear from other sites, 
especially those with similar hardware, whether SCore 5.4 caused such 
problems. I wouldn't be surprised if our hardware (Tyan 2460, 760MP-based, 
not renowned for stability) were actually part of the problem...

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu бў IWR.Uni-Heidelberg.De



From nrcb бў streamline-computing.com  Thu Mar 13 08:51:02 2003
From: nrcb бў streamline-computing.com (Nick Birkett)
Date: Wed, 12 Mar 2003 23:51:02 +0000
Subject: [SCore-users-jp] [SCore-users] trunking
Message-ID: <200303122351.02223.nrcb@streamline-computing.com>

Some first results.

Hardware: 1U dual Xeon 2.6GHz Superservers with onboard dual gigabit.
Each network via its own gigabit switch.

Pallas benchmarks: Pingpong looks ok but Sendrecv is not good:

backoff 1024
maxnsend 16

on both networks.

See attached benchmarks.

  
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: trunking
URL:  <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030312/a62d0bc8/attachment.ksh>

From bogdan.costescu бў iwr.uni-heidelberg.de  Thu Mar 13 21:05:33 2003
From: bogdan.costescu бў iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 13 Mar 2003 13:05:33 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] trunking
In-Reply-To: <200303122351.02223.nrcb@streamline-computing.com>
Message-ID: <Pine.LNX.4.44.0303131301120.25410-100000@kenzo.iwr.uni-heidelberg.de>

On Wed, 12 Mar 2003, Nick Birkett wrote:

> Hardware: 1U dual Xeon 2.6GHz Superservers with onboard dual gigabit.

Are these Intel or Broadcom NICs ? Or something else...

> Each network via its own gigabit switch.

Could you also tell us which switches you use? I'm asking just to get an 
idea, as we will probably want to set up something very similar in the 
near future: Myrinet is still expensive for a small number of nodes, and 
Gigabit Ethernet seems to have a pretty small latency with SCore (which 
matters more to us than the increased bandwidth).

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu бў IWR.Uni-Heidelberg.De



From s-sumi бў flab.fujitsu.co.jp  Thu Mar 13 21:47:36 2003
From: s-sumi бў flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Thu, 13 Mar 2003 21:47:36 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] trunking
In-Reply-To: <200303122351.02223.nrcb@streamline-computing.com>
References: <200303122351.02223.nrcb@streamline-computing.com>
Message-ID: <20030313.214736.640901912.s-sumi@flab.fujitsu.co.jp>

Hi.

Sorry for the late response.

Are you using a switch that supports jumbo frames?
If so, how about the mpi_zerocopy=on option?

Here are results using Intel PRO/1000XTs on Supermicro motherboards.
These results are not so good compared with Myrinet 2000.

A Broadcom 5701 based NIC also has good communication performance.
See: http://www.pccluster.org/score/dist/score/html/en/overview/pm-perf.html

maxnsend 24
backoff 2000
with mpi_zerocopy=on 

***** Two Intel PRO/1000XTs.
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# ( #processes = 2 )
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        25.06        25.06        25.06         0.00
            1         1000        24.97        25.03        25.00         0.08
            2         1000        24.38        24.41        24.40         0.16
            4         1000        25.18        25.20        25.19         0.30
            8         1000        24.57        24.63        24.60         0.62
           16         1000        25.52        25.54        25.53         1.19
           32         1000        24.98        25.01        25.00         2.44
           64         1000        25.09        25.11        25.10         4.86
          128         1000        24.71        24.71        24.71         9.88
          256         1000        29.80        29.84        29.82        16.36
          512         1000        30.46        30.50        30.48        32.02
         1024         1000        46.70        46.78        46.74        41.75
         2048         1000        63.69        73.19        68.44        53.37
         4096         1000       112.09       112.15       112.12        69.66
         8192         1000       191.11       191.21       191.16        81.71
        16384         1000       247.54       247.60       247.57       126.21
        32768         1000       652.67       652.69       652.68        95.76
        65536         1000       956.54       956.55       956.54       130.68
       131072         1000      1559.37      1559.38      1559.38       160.32
       262144          640      2737.14      2737.14      2737.14       182.67
       524288          320      4946.09      4946.14      4946.12       202.18
      1048576          160      9352.07      9352.14      9352.10       213.85
      2097152           80     24004.33     24004.79     24004.56       166.63
      4194304           40     48974.10     48975.73     48974.91       163.35

***** Three Intel PRO/1000XTs.
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# ( #processes = 2 )
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        30.05        30.06        30.06         0.00
            1         1000        25.05        25.13        25.09         0.08
            2         1000        21.71        21.77        21.74         0.18
            4         1000        23.80        23.85        23.83         0.32
            8         1000        32.45        32.48        32.46         0.47
           16         1000        28.11        28.13        28.12         1.08
           32         1000        24.24        24.30        24.27         2.51
           64         1000        22.63        22.67        22.65         5.38
          128         1000        25.27        25.28        25.27         9.66
          256         1000        28.47        28.54        28.51        17.11
          512         1000        30.25        30.30        30.28        32.23
         1024         1000        45.22        45.29        45.26        43.12
         2048         1000        62.34        71.10        66.72        54.94
         4096         1000        96.77        96.86        96.81        80.66
         8192         1000       178.64       178.79       178.71        87.39
        16384         1000       564.16       564.17       564.16        55.39
        32768         1000       625.56       625.58       625.57        99.91
        65536         1000       798.51       798.52       798.52       156.54
       131072         1000      1289.24      1289.24      1289.24       193.91
       262144          640      2369.67      2369.68      2369.68       211.00
       524288          320      4019.12      4019.12      4019.12       248.81
      1048576          160      7143.38      7143.41      7143.39       279.98
      2097152           80     14925.95     14926.55     14926.25       267.98
      4194304           40     29971.10     29973.05     29972.07       266.91
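
For readers who want to check the last column of the tables above: the figures are consistent with the throughput being computed as 2 * #bytes / t_max, expressed in units of 2**20 bytes per second (each process both sends and receives in Sendrecv). The following is a minimal sketch under that assumption; the function name is made up for illustration and the inputs are rows copied from the two tables.

```python
def sendrecv_mbytes_per_sec(nbytes, t_max_usec):
    """Throughput as the tables appear to report it.

    Assumption: 2 * nbytes moved per iteration, divided by t_max,
    with 1 Mbyte taken as 2**20 bytes.
    """
    if t_max_usec <= 0:
        raise ValueError("t_max_usec must be positive")
    return 2 * nbytes / 2**20 / (t_max_usec * 1e-6)


# 1 MiB rows: t_max = 9352.14 usec (two NICs), 7143.41 usec (three NICs).
two_nics = sendrecv_mbytes_per_sec(1048576, 9352.14)
three_nics = sendrecv_mbytes_per_sec(1048576, 7143.41)
ratio = three_nics / two_nics
print(f"two NICs:   {two_nics:.2f} Mbytes/sec")
print(f"three NICs: {three_nics:.2f} Mbytes/sec")
print(f"gain from the third NIC at 1 MiB: {ratio:.2f}x")
```

At this message size the third NIC gives roughly a 1.3x gain rather than 1.5x, so the trunking does not scale linearly in these measurements.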


PS: I am now re-writing PM/Ethernet to reduce communication cost.

Shinji. 

From: Nick Birkett <nrcb бў streamline-computing.com>
Subject: [SCore-users] trunking
Date: Wed, 12 Mar 2003 23:51:02 +0000
Message-ID: <200303122351.02223.nrcb бў streamline-computing.com>

nrcb> Some first results.
nrcb> 
nrcb> Hardware: 1U dual Xeon 2.6GHz Superservers with onboard dual gigabit.
nrcb> Each network via its own gigabit switch.
nrcb> 
nrcb> Pallas benchmarks: Pingpong looks ok but Sendrecv is not good:
nrcb> 
nrcb> backoff 1024
nrcb> maxnsend 16
nrcb> 
nrcb> on both networks.
nrcb> 
nrcb> See attached benchmarks.
nrcb> 
nrcb>   
------
Shinji Sumimoto, Fujitsu Labs


From iztok.daneu бў rzs-hm.si  Thu Mar 13 23:09:07 2003
From: iztok.daneu бў rzs-hm.si (Iztok Daneu)
Date: Thu, 13 Mar 2003 14:09:07 +0000 (UTC)
Subject: [SCore-users-jp] [SCore-users] Unusual problem with scrun
Message-ID: <Pine.LNX.4.44.0303131402570.20309-100000@calvus.rzs-hm.si>

Dear Score users,

we have a somewhat unusual problem:

The disk on one of our cluster nodes crashed some time ago. Instead of going
through the installation procedure for an additional compute node, we simply
copied the contents of another compute node's hard disk with dd. After some
twiddling with fdisk, fsck and vi the machine boots OK, but when we try to run
an application with scrun (for example: scrun -scored=tuba0,nodes=28 system hostname)
on the cluster with the repaired node included, we get the error
message:

<13> SCORE-D:ERROR open_ddt_socket(STDIN)=111
<13> SCORE-D:ERROR open_ddt_socket(STDIN)=111

13 is the number of the "repaired" compute node ;) and the SCore version
is 5.2.0.
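
A note on reading the number in that message: if the trailing 111 is a Linux errno value (an assumption; the thread does not confirm how SCore-D reports it), it would correspond to ECONNREFUSED. The sketch below is a small lookup for the numeric codes that appear in error messages in this archive. The table holds the standard Linux values for these four codes and is hard-coded so that the result does not depend on the platform running the snippet.

```python
# Standard Linux errno values for the codes seen in this archive.
# Whether SCore really reports raw errno values here is an assumption.
LINUX_ERRNO = {
    12: ("ENOMEM", "Cannot allocate memory"),
    22: ("EINVAL", "Invalid argument"),
    67: ("ENOLINK", "Link has been severed"),
    111: ("ECONNREFUSED", "Connection refused"),
}


def describe(code):
    """Return 'NAME: text' for a known code, or a placeholder otherwise."""
    if code in LINUX_ERRNO:
        name, text = LINUX_ERRNO[code]
        return f"{name}: {text}"
    return f"unknown code {code}"


print(describe(111))
```

Under that reading, the message would mean that something on node 13 refused a socket connection, which points at a service or address setting on the cloned node rather than at SCore itself.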

We know that this is not a SCore problem and that we created the problem ourselves.
Does anyone have a clue what we missed in our procedure?

Thank you very much in advance,
	iztok
-- 
If money can't buy happiness, I guess you'll just have to rent it.




From bogdan.costescu бў iwr.uni-heidelberg.de  Thu Mar 13 23:31:38 2003
From: bogdan.costescu бў iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 13 Mar 2003 15:31:38 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] Unusual problem with scrun
In-Reply-To: <Pine.LNX.4.44.0303131402570.20309-100000@calvus.rzs-hm.si>
Message-ID: <Pine.LNX.4.44.0303131526390.25410-100000@kenzo.iwr.uni-heidelberg.de>

On Thu, 13 Mar 2003, Iztok Daneu wrote:

> we simply copied the content of other compute node hd with dd.

Then you probably also copied the contents of /scored or /var/scored, 
which SCoreD uses to keep data about jobs. You should probably start 
scored with the -reset or -resetall command line argument to clean up any 
old state in this directory.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu бў IWR.Uni-Heidelberg.De




From ce107 бў dam.brown.edu  Fri Mar 14 00:09:04 2003
From: ce107 бў dam.brown.edu (C. Evangelinos)
Date: Thu, 13 Mar 2003 10:09:04 -0500 (EST)
Subject: [SCore-users-jp] [SCore-users] NFS installation
Message-ID: <200303131509.h2DF94210604@fritz.dam.brown.edu>

I'd like to install SCore 5.4 (for development purposes) on a
heterogeneous network of PCs. As these machines are old they do not
really have 1/2GB of free space for /opt/score locally. I was
wondering whether I could install SCore in an NFS partition (and pay
whatever the performance loss this means). If that is doable what
still needs to be installed locally?

Constantinos Evangelinos

Center for Fluid Mechanics
Brown University
and
Ocean Engineering Department
MIT


From bogdan.costescu бў iwr.uni-heidelberg.de  Fri Mar 14 00:47:44 2003
From: bogdan.costescu бў iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 13 Mar 2003 16:47:44 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] NFS installation
In-Reply-To: <200303131509.h2DF94210604@fritz.dam.brown.edu>
Message-ID: <Pine.LNX.4.44.0303131645360.25410-100000@kenzo.iwr.uni-heidelberg.de>

On Thu, 13 Mar 2003, C. Evangelinos wrote:

> I was wondering whether I could install SCore in an NFS partition (and
> pay whatever the performance loss this means).

2-3 years ago I tried to install SCore on diskless clients, and the only 
obstacle that I couldn't get past was that /tmp and/or /var/score could 
not be on NFS and had to be on local disk. I don't know if the requirements 
have changed in the meantime; if they didn't, I think that you have a 
pretty good chance of getting it to work.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu бў IWR.Uni-Heidelberg.De



From kameyama бў pccluster.org  Fri Mar 14 09:54:34 2003
From: kameyama бў pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=)
Date: Fri, 14 Mar 2003 09:54:34 +0900
Subject: [SCore-users-jp] Re: [SCore-users] NFS installation
In-Reply-To: Your message of "Thu, 13 Mar 2003 16:47:44 JST."
             <Pine.LNX.4.44.0303131645360.25410-100000@kenzo.iwr.uni-heidelberg.de>
Message-ID: <20030314005435.32BCE20054@neal.il.is.s.u-tokyo.ac.jp>

In article <Pine.LNX.4.44.0303131645360.25410-100000@kenzo.iwr.uni-heidelberg.de> Bogdan Costescu <bogdan.costescu@iwr.uni-heidelberg.de> wrote:
> On Thu, 13 Mar 2003, C. Evangelinos wrote:
> 
> > I was wondering whether I could install SCore in an NFS partition (and
> > pay whatever the performance loss this means).
> 
> 2-3 years ago I tried to install SCore on diskless clients and the only 
> obstacle that I couldn't pass was that /tmp and/or /var/score could not be 
> on NFS, but on local disk.

Probably this problem has not changed.
You can share /opt/score (and the SCore source files) over NFS,
even if you want to run SCore on a heterogeneous network.
But you cannot share /var/scored.
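
To check which of these directories actually sits on NFS, one possible approach is to look the path up in the mount table. This is a minimal sketch, assuming a Linux-style /proc/mounts layout (device, mount point, type, ...); the function name and the sample mount entries (/dev/hda1, server:/export/score) are made up for illustration.

```python
def fs_type_for(path, mounts_text):
    """Return the filesystem type of the longest mount point containing path."""
    wanted = path.rstrip("/") + "/"
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = parts[1], parts[2]
        prefix = mount_point.rstrip("/") + "/"
        # Later entries win on ties, as with stacked mounts.
        if wanted.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Hypothetical mount table: local root, /opt/score shared over NFS.
SAMPLE_MOUNTS = """\
/dev/hda1 / ext3 rw 0 0
server:/export/score /opt/score nfs rw 0 0
"""

print("/opt/score  ->", fs_type_for("/opt/score/bin", SAMPLE_MOUNTS))
print("/var/scored ->", fs_type_for("/var/scored", SAMPLE_MOUNTS))
```

On a live node the second argument would be the contents of /proc/mounts; a result of "nfs" for /var/scored would indicate the layout that is said above not to work.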

                       from Kameyama Toyohisa


From iztok.daneu бў rzs-hm.si  Fri Mar 14 17:45:29 2003
From: iztok.daneu бў rzs-hm.si (Iztok Daneu)
Date: Fri, 14 Mar 2003 08:45:29 +0000 (UTC)
Subject: [SCore-users-jp] Re: [SCore-users] Unusual problem with scrun
In-Reply-To:  <Pine.LNX.4.44.0303131526390.25410-100000@kenzo.iwr.uni-heidelberg.de>
Message-ID: <Pine.LNX.4.44.0303140838340.12813-100000@calvus.rzs-hm.si>

Hi,

On Thu, 13 Mar 2003, Bogdan Costescu wrote:

> On Thu, 13 Mar 2003, Iztok Daneu wrote:
>
> > we simply copied the content of other compute node hd with dd.
>
> Then you probably copied also the content of the /scored or /var/scored
> which is used by SCoreD to keep data about jobs. You should probably start
> scored with -reset or -resetall command line arguments to clean up any old
> state in this directory.

The machine which was used as the source for dd-ing was properly taken out
of the cluster (no jobs were running at the time, the machine was added to
the scorehosts.defects file ....), so there were no stale job files.
However, we did try your suggestion, but to no avail.

Thank you for the help anyway.

regards,
	iztok
-- 
Your motives for doing whatever good deed you may have in mind will be
misinterpreted by somebody.



From jodell бў ad.brown.edu  Tue Mar 18 07:22:15 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 17 Mar 2003 17:22:15 -0500
Subject: [SCore-users-jp] [SCore-users] Configuring SCore
Message-ID: <1047939735.14922.28.camel@cr1>

We have had our SCore cluster running over fast ethernet for a few weeks
now and we decided to add gigabit ethernet.

My assumption is that to add a new network I:

1) Edit the pm_ethernet file on the nodes to start the gig interface.
2) Add a file pm-gig.conf to the /opt/score/etc directory. This file has
the MAC addresses of the gig cards.
3) Edit the scorehosts.db file to define gigaethernet, include the
pm-gig.conf file and define the nodes to have gigabit ethernet.
4) Reboot the server and the compute hosts.

Unfortunately
when I run  

scstest -network gigaethernet

I get the following messages:

gaethernet/ethernet (error=12).
  argv[0] -config
  argv[1] /var/scored/scoreboard/kansas.0000V3000V7t
Unable to open PM gigaethernet/ethernet (error=12).
  argv[0] -config
  argv[1] /var/scored/scoreboard/kansas.0000V3000V7t

Any help would be greatly appreciated.

Also, is there an easier way to do this?

Thanks,
Jim



From ce107 бў dam.brown.edu  Tue Mar 18 08:27:19 2003
From: ce107 бў dam.brown.edu (C. Evangelinos)
Date: Mon, 17 Mar 2003 18:27:19 -0500 (EST)
Subject: [SCore-users-jp] Re: [SCore-users] NFS installation
In-Reply-To: <20030314005435.32BCE20054@neal.il.is.s.u-tokyo.ac.jp> from "kameyama@pccluster.org" at Mar 14, 2003 09:54:34 AM
Message-ID: <200303172327.h2HNRJH25065@fritz.dam.brown.edu>

Thanks to everybody for the help on the NFS installation. I think I'll
manage and will report back to the list on the results. In the
meantime I'm stuck installing the SCore kernel on my old
heterogeneous equipment. The first two machines I tried (a PII-333 of
a stepping that is fine for SCore and a Pentium-133) have 64MB of RAM,
and as they try to boot with the new kernel they run out of memory and
start killing daemons, so the machine never gets far into the boot
sequence. I'm using the i686 and i586 kernels that come in RPM format
with SCore 5.4. Is >64MB RAM a necessity for SCore?

Constantinos Evangelinos

Center for Fluid Mechanics
Brown University
and
Ocean Engineering Department
MIT


From jodell бў ad.brown.edu  Tue Mar 18 08:39:16 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 17 Mar 2003 18:39:16 -0500
Subject: [SCore-users-jp] [SCore-users] Help with an error message
Message-ID: <1047944356.14938.31.camel@cr1>

Does anyone know what the following messages mean?
I got them while running:

scstest -network gigaethernet


bio-11(-1) pmAssociateNodes: Invalid argument(22)
bio-12(-1) pmAssociateNodes: Invalid argument(22)


Thanks,
Jim



From hori бў swimmy-soft.com  Tue Mar 18 11:30:28 2003
From: hori бў swimmy-soft.com (Atsushi HORI)
Date: Tue, 18 Mar 2003 11:30:28 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Help with an error message
In-Reply-To: <1047944356.14938.31.camel@cr1>
References: <1047944356.14938.31.camel@cr1>
Message-ID: <3130831828.hori0000@swimmy-soft.com>

Hi,

>1) edit the pm_ehternet file on the nodes to start the gig interface.
>2) Add a file pm-gig.conf to the /opt/score/etc directory. This file has
>the MAC addresses of the gig cards.
>3) Edit the scoredhosts.db file to define gigaethernet,include bu
>pm-gig.conf file and define the nodes to have gigabit ethernet.
>4) Reboot the server and the compute hosts.

And you must do the following on all cluster hosts:

5) /etc/rc.d/init.d/pm_ethernet stop
   Edit /etc/rc.d/init.d/pm_ethernet
   /etc/rc.d/init.d/pm_ethernet start

The pm_ethernet script binds the PM unit number to the Linux ethernet 
device (eth0, eth1, ...).

>Does anyoen know what the following messages mean?
>I got them whil running:
>
>scstest -network gigaethernet
>
>
>bio-11(-1) pmAssociateNodes: Invalid argument(22)
>bio-12(-1) pmAssociateNodes: Invalid argument(22)

Send me the files /opt/score/etc/scorehosts.db and 
/opt/score/etc/pm-gig.conf.

----
Atsushi HORI
Swimmy Software, Inc.



From jodell бў ad.brown.edu  Wed Mar 19 03:53:25 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 18 Mar 2003 13:53:25 -0500
Subject: [SCore-users-jp] [SCore-users] Starting Compute Host Lock services: msgbserv:No hosts
Message-ID: <1048013605.14938.56.camel@cr1>

I get an error message when I try to start my msgbserv as below:

/etc/rc.d/init.d/msgbserv start
Starting Compute Host Lock services: msgbserv:No hosts

Here is the output of my scorehosts and sceptic commands

scorehosts -l -g _scoreall_ 
bio-1.cascv.brown.edu
bio-2.cascv.brown.edu
bio-3.cascv.brown.edu
bio-4.cascv.brown.edu
bio-5.cascv.brown.edu
bio-6.cascv.brown.edu
bio-7.cascv.brown.edu
bio-8.cascv.brown.edu
bio-9.cascv.brown.edu
bio-10.cascv.brown.edu
bio-11.cascv.brown.edu
bio-12.cascv.brown.edu
12 hosts found.


sceptic -v -g _scoreall_
bio-2.cascv.brown.edu: OK
bio-9.cascv.brown.edu: OK
bio-1.cascv.brown.edu: OK
bio-10.cascv.brown.edu: OK
bio-6.cascv.brown.edu: OK
bio-8.cascv.brown.edu: OK
bio-4.cascv.brown.edu: OK
bio-7.cascv.brown.edu: OK
bio-5.cascv.brown.edu: OK
bio-12.cascv.brown.edu: OK
bio-3.cascv.brown.edu: OK
bio-11.cascv.brown.edu: OK
All host responding.

Where is msgbserv getting the idea that there are no hosts?

Thanks,
Jim




From jodell бў ad.brown.edu  Wed Mar 19 04:00:09 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 18 Mar 2003 14:00:09 -0500
Subject: [SCore-users-jp] Re: [SCore-users] Help with an error message
In-Reply-To: <3130831828.hori0000@swimmy-soft.com>
References: <1047944356.14938.31.camel@cr1>
	 <3130831828.hori0000@swimmy-soft.com>
Message-ID: <1048014009.14922.64.camel@cr1>

Here is my scorehosts.db


/*
 *       SCore 5.0 scorehosts.db
 *		generated by PCCC EIT 5.2
 */

/* PM/Myrinet */
myrinet		type=myrinet \
		-firmware:file=/opt/score/share/lanai/lanai.mcp \
		-config:file=/opt/score/etc/pm-myrinet.conf

/* PM/Myrinet */
myrinet2k	type=myrinet2k \
		-firmware:file=/opt/score/share/lanai/lanaiM2k.mcp \
		-config:file=/opt/score/etc/pm-myrinet.conf

/* PM/Ethernet */
ethernet	type=ethernet \
		-config:file=/opt/score/etc/pm-ethernet.conf
gigaethernet	type=ethernet \
		-config:file=/opt/score/etc/pm-gig.conf
/* PM/Agent */
udp		type=agent -agent=pmaudp \
		-config:file=/opt/score/etc/pm-udp.conf

/* RHiNET */
rhinet		type=rhinet \
		-firmware:file=/opt/score/share/rhinet/phu_top_0207a.hex \
		-config:file=/opt/score/etc/pm-rhinet.conf
##
##
#include "/opt/score//etc/ndconf/0"
#include "/opt/score//etc/ndconf/1"
#include "/opt/score//etc/ndconf/2"
#include "/opt/score//etc/ndconf/3"
#include "/opt/score//etc/ndconf/4"
#include "/opt/score//etc/ndconf/5"
#include "/opt/score//etc/ndconf/6"
#include "/opt/score//etc/ndconf/7"
#include "/opt/score//etc/ndconf/8"
#include "/opt/score//etc/ndconf/9"
#include "/opt/score//etc/ndconf/10"
#include "/opt/score//etc/ndconf/11"
##
#define MSGBSERV	msgbserv=(kansas-fe.cascv.brown.edu:8764)

bio-1.cascv.brown.edu	HOST_0 network=ethernet group=_scoreall_,100Mb smp=2 MSGBSERV
bio-2.cascv.brown.edu	HOST_1 network=ethernet group=_scoreall_,100Mb smp=2 MSGBSERV
bio-3.cascv.brown.edu	HOST_2 network=ethernet group=_scoreall_,100Mb smp=2 MSGBSERV
bio-4.cascv.brown.edu	HOST_3 network=ethernet group=_scoreall_,100Mb smp=2 MSGBSERV
bio-5.cascv.brown.edu	HOST_4 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-6.cascv.brown.edu	HOST_5 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-7.cascv.brown.edu	HOST_6 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-8.cascv.brown.edu	HOST_7 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-9.cascv.brown.edu	HOST_8 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-10.cascv.brown.edu	HOST_9 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-11.cascv.brown.edu	HOST_10 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-12.cascv.brown.edu	HOST_11 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV


Here is my pm-gig.conf file:
unit 1
maxnsend 8
# Not connected yet
#0 00:30:48:23:70:CF bio-1.cascv.brown.edu
#1 00:30:48:23:70:B1 bio-2.cascv.brown.edu
#2 00:30:48:23:70:D9 bio-3.cascv.brown.edu
#3 00:30:48:23:70:E3 bio-4.cascv.brown.edu
4 00:30:48:23:6E:2B bio-5.cascv.brown.edu
5 00:30:48:23:3F:05 bio-6.cascv.brown.edu
6 00:30:48:23:3E:51 bio-7.cascv.brown.edu
7 00:30:48:23:3E:3D bio-8.cascv.brown.edu
8 00:30:48:23:70:EB bio-9.cascv.brown.edu
9 00:30:48:23:6F:05 bio-10.cascv.brown.edu
10 00:30:48:23:6E:55 bio-11.cascv.brown.edu
11 00:30:48:23:70:E1 bio-12.cascv.brown.edu

I have disabled the first four hosts as we don't have enough room in our
switch for them.
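
For anyone comparing a file like this against scorehosts.db, here is a minimal sketch of a reader for the layout shown above: keyword lines such as "unit" and "maxnsend", "#" comments, and node lines of the form "number MAC hostname". The parser name is hypothetical, the format is inferred only from this one file, and the sample is a shortened copy of it. Whether PM requires node numbers to start at 0 or to be contiguous is not stated in this thread, so the sketch only reports what is defined.

```python
import re

# Six colon-separated hex byte pairs, e.g. 00:30:48:23:6E:2B
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")


def parse_pm_conf(text):
    """Split a PM/Ethernet-style config into (settings, nodes).

    settings: keyword -> value string; nodes: node number -> (mac, hostname).
    """
    settings, nodes = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out entry
        parts = line.split()
        if parts[0].isdigit():
            if len(parts) != 3 or not MAC_RE.match(parts[1]):
                raise ValueError(f"bad node line: {raw!r}")
            nodes[int(parts[0])] = (parts[1], parts[2])
        else:
            settings[parts[0]] = " ".join(parts[1:])
    return settings, nodes


# Shortened copy of the file shown above.
SAMPLE_CONF = """\
unit 1
maxnsend 8
# Not connected yet
#0 00:30:48:23:70:CF bio-1.cascv.brown.edu
4 00:30:48:23:6E:2B bio-5.cascv.brown.edu
5 00:30:48:23:3F:05 bio-6.cascv.brown.edu
"""

settings, nodes = parse_pm_conf(SAMPLE_CONF)
print("settings:", settings)
print("active node numbers:", sorted(nodes))
```

With the full file this would list node numbers 4 through 11 as active, which makes it easy to see that the numbering of the enabled hosts does not start at 0.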

I have also edited the pm_ethernet file to start and stop eth1. When I
run "pm_ethernet stop" and then run "pm_ethernet start" I get the
messages below.

[root бў bio-12 init.d]# ./pm_ethernet stop
Stopping PM/Ethernet: device: eth0
device: eth1

[root бў bio-12 init.d]# ./pm_ethernet start
n Starting PM/Ethernet: 
device: eth0
device: eth1
etherpmctl: ERROR on unit 1: "Link has been severed(67)" Check dmesg
log!!

Many thanks for your help!

Jim



From jodell бў ad.brown.edu  Wed Mar 19 05:29:22 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 18 Mar 2003 15:29:22 -0500
Subject: [SCore-users-jp] Re: [SCore-users] Help with an error message
In-Reply-To: <1048014009.14922.64.camel@cr1>
References: <1047944356.14938.31.camel@cr1>
	 <3130831828.hori0000@swimmy-soft.com>  <1048014009.14922.64.camel@cr1>
Message-ID: <1048019362.16469.81.camel@cr1>

I found part of my problem. The "Link has been severed" message came
about because my gig interface was not marked UP by ifconfig. I am not
using the gigabit ethernet for anything else but SCore, so it was not UP.
I modified the pm_ethernet scripts to do a "/sbin/ifconfig eth1 up"
and an "/sbin/ifconfig eth1 down" before and after, respectively.
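
The same condition can be checked before starting PM/Ethernet. This is a minimal sketch, assuming a Linux system that exposes interface flags under /sys/class/net/<interface>/flags (sysfs is not available on older 2.4 kernels, where ifconfig output would have to be inspected instead). The helper names are made up for illustration; IFF_UP is the standard flag bit from <linux/if.h>.

```python
IFF_UP = 0x1  # "interface is up" bit, as defined in <linux/if.h>


def is_up(flags_text):
    """Return True if the hex flags string (e.g. '0x1003') has IFF_UP set."""
    return bool(int(flags_text.strip(), 16) & IFF_UP)


def read_flags(ifname):
    """Read the raw flags string for an interface from sysfs (Linux only)."""
    with open(f"/sys/class/net/{ifname}/flags") as handle:
        return handle.read()


# Example values: 0x1003 has the UP bit set, 0x1002 does not.
print("0x1003 ->", is_up("0x1003"))
print("0x1002 ->", is_up("0x1002"))
```

On a node one would call is_up(read_flags("eth1")); a False result corresponds to the situation described above, where the interface exists but was never brought up.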

rpmtest indicates that both interfaces are now working.

I cannot test with scstest because I somehow broke my msgbserv.
That problem is in another message.

Jim

On Tue, 2003-03-18 at 14:00, James O'Dell wrote:
> Here is my scorehosts.db
> 
> 
> /*
>  *       SCore 5.0 scorehosts.db
>  *		generated by PCCC EIT 5.2
>  */
> 
> /* PM/Myrinet */
> myrinet		type=myrinet \
> 		-firmware:file=/opt/score/share/lanai/lanai.mcp \
> 		-config:file=/opt/score/etc/pm-myrinet.conf
> 
> /* PM/Myrinet */
> myrinet2k	type=myrinet2k \
> 		-firmware:file=/opt/score/share/lanai/lanaiM2k.mcp \
> 		-config:file=/opt/score/etc/pm-myrinet.conf
> 
> /* PM/Ethernet */
> ethernet	type=ethernet \
> 		-config:file=/opt/score/etc/pm-ethernet.conf
> gigaethernet	type=ethernet \
> 		-config:file=/opt/score/etc/pm-gig.conf
> /* PM/Agent */
> udp		type=agent -agent=pmaudp \
> 		-config:file=/opt/score/etc/pm-udp.conf
> 
> /* RHiNET */
> rhinet		type=rhinet \
> 		-firmware:file=/opt/score/share/rhinet/phu_top_0207a.hex \
> 		-config:file=/opt/score/etc/pm-rhinet.conf
> ##
> ##
> #include "/opt/score//etc/ndconf/0"
> #include "/opt/score//etc/ndconf/1"
> #include "/opt/score//etc/ndconf/2"
> #include "/opt/score//etc/ndconf/3"
> #include "/opt/score//etc/ndconf/4"
> #include "/opt/score//etc/ndconf/5"
> #include "/opt/score//etc/ndconf/6"
> #include "/opt/score//etc/ndconf/7"
> #include "/opt/score//etc/ndconf/8"
> #include "/opt/score//etc/ndconf/9"
> #include "/opt/score//etc/ndconf/10"
> #include "/opt/score//etc/ndconf/11"
> ##
> #define MSGBSERV	msgbserv=(kansas-fe.cascv.brown.edu:8764)
> 
> bio-1.cascv.brown.edu	HOST_0 network=ethernet group=_scoreall_,100Mb
> smp=2 MSGBSERV
> bio-2.cascv.brown.edu	HOST_1 network=ethernet group=_scoreall_,100Mb
> smp=2 MSGBSERV
> bio-3.cascv.brown.edu	HOST_2 network=ethernet group=_scoreall_,100Mb
> smp=2 MSGBSERV
> bio-4.cascv.brown.edu	HOST_3 network=ethernet group=_scoreall_,100Mb
> smp=2 MSGBSERV
> bio-5.cascv.brown.edu	HOST_4 network=ethernet,gigaethernet
> group=_scoreall_,100Mb,gige smp=2 MSGBSERV
> bio-6.cascv.brown.edu	HOST_5 network=ethernet,gigaethernet
> group=_scoreall_,100Mb,gige smp=2 MSGBSERV
> bio-7.cascv.brown.edu	HOST_6 network=ethernet,gigaethernet
> group=_scoreall_,100Mb,gige smp=2 MSGBSERV
> bio-8.cascv.brown.edu	HOST_7 network=ethernet,gigaethernet
> group=_scoreall_,100Mb,gige smp=2 MSGBSERV
> bio-9.cascv.brown.edu	HOST_8 network=ethernet,gigaethernet
> group=_scoreall_,100Mb,gige smp=2 MSGBSERV
> bio-10.cascv.brown.edu	HOST_9 network=ethernet,gigaethernet
> group=_scoreall_,100Mb,gige smp=2 MSGBSERV
> bio-11.cascv.brown.edu	HOST_10 network=ethernet,gigaethernet
> group=_scoreall_,100Mb,gige smp=2 MSGBSERV
> bio-12.cascv.brown.edu	HOST_11 network=ethernet,gigaethernet
> group=_scoreall_,100Mb,gige smp=2 MSGBSERV
> 
> 
> Here is my pm-gig.conf file:
> unit 1
> maxnsend 8
> # Not connected yet
> #0 00:30:48:23:70:CF bio-1.cascv.brown.edu
> #1 00:30:48:23:70:B1 bio-2.cascv.brown.edu
> #2 00:30:48:23:70:D9 bio-3.cascv.brown.edu
> #3 00:30:48:23:70:E3 bio-4.cascv.brown.edu
> 4 00:30:48:23:6E:2B bio-5.cascv.brown.edu
> 5 00:30:48:23:3F:05 bio-6.cascv.brown.edu
> 6 00:30:48:23:3E:51 bio-7.cascv.brown.edu
> 7 00:30:48:23:3E:3D bio-8.cascv.brown.edu
> 8 00:30:48:23:70:EB bio-9.cascv.brown.edu
> 9 00:30:48:23:6F:05 bio-10.cascv.brown.edu
> 10 00:30:48:23:6E:55 bio-11.cascv.brown.edu
> 11 00:30:48:23:70:E1 bio-12.cascv.brown.edu
> 
> I have disabled the first four hosts as we don't have enough room in our
> switch for them.
> 
> I have also edited the pm_ethernet file to start and stop eth1. When I
> run "pm_ethernet stop" and then run "pm_ethernet start" I get the
> messages below.
> 
> [root бў bio-12 init.d]# ./pm_ethernet stop
> Stopping PM/Ethernet: device: eth0
> device: eth1
> 
> [root бў bio-12 init.d]# ./pm_ethernet start
> n Starting PM/Ethernet: 
> device: eth0
> device: eth1
> etherpmctl: ERROR on unit 1: "Link has been severed(67)" Check dmesg
> log!!
> 
> Many thanks for your help!
> 
> Jim
> 
> On Mon, 2003-03-17 at 21:30, Atsushi HORI wrote:
> > Hi,
> > 
> > >1) edit the pm_ethernet file on the nodes to start the gig interface.
> > >2) Add a file pm-gig.conf to the /opt/score/etc directory. This file has
> > >the MAC addresses of the gig cards.
> > >3) Edit the scorehosts.db file to define gigaethernet, include the
> > >pm-gig.conf file and define the nodes to have gigabit ethernet.
> > >4) Reboot the server and the compute hosts.
> > 
> > And you must do the following on all cluster hosts;
> > 
> > 5) /etc/rc.d/init.d/pm_ethernet stop
> >    Edit /etc/rc.d/init.d/pm_ethernet
> >    /etc/rc.d/init.d/pm_ethernet start
> > 
> > > The pm_ethernet script binds a PM unit number to a Linux Ethernet device 
> > > (eth0, eth1, ...).
> > 
> > > >Does anyone know what the following messages mean?
> > > >I got them while running:
> > >
> > >scstest -network gigaethernet
> > >
> > >
> > >bio-11(-1) pmAssociateNodes: Invalid argument(22)
> > >bio-12(-1) pmAssociateNodes: Invalid argument(22)
> > 
> > Send me the files /opt/score/etc/scorehosts.db and 
> > /opt/score/etc/pm-gig.conf.
> > 
> > ----
> > Atsushi HORI
> > Swimmy Software, Inc.
> > 
_______________________________________________
SCore-users mailing list
SCore-users бў pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


From michael бў streamline-computing.com  Wed Mar 19 06:30:59 2003
From: michael бў streamline-computing.com (Michael Rudgyard Streamline)
Date: Tue, 18 Mar 2003 21:30:59 +0000 (GMT)
Subject: [SCore-users-jp] [SCore-users] Benchmarking 256 processor problem
Message-ID: <Pine.LNX.4.21.0303182035170.19595-100000@scgj.streamline>

We have a customer who has an interesting problem with MPI_Bcast,
where SCore seems to hang on large numbers of processors, and in
particular when there are 2 processes per node running. The code
segment is provided below.

My understanding is that the MPICH (and hence the SCore) implementation of
MPI_Bcast is globally asynchronous, and is built using MPI_Send. It is
therefore possible that (in the example below, and in the worst case) the
256th processor may have yet to receive messages from all other
(255) processors. I suspect that this may be problematic because there is a
maximum number of message buffers that may be sent at a given time. I know
this was the case on SGI and Cray systems, and I think this is the case
with MPICH but can't find the corresponding environment variables on the
MPICH web-site.

As far as I am aware, MPI_Send will block if the send cannot be buffered
(so I assume this is the case for MPI_Bcast), and given that MPI_Bcast is
called in the correct order for each processor (avoiding the well-known
deadlock situations), I can't see why this code should necessarily
hang (???) other than there being potentially a lot of
messages floating around... This leads me to believe that it must just be
the number of outstanding messages that is the problem, although in that
case shouldn't the corresponding MPI_Bcast block at the sender's
side? Could there be an issue in particular due to messages sent via 
shared memory (i.e. a performance vs. correctness issue)?

For info, each send is about a kilobyte of information.

Note that making the broadcast synchronous, i.e. by adding an MPI_Barrier,
solves the problem. 

The machine is running Score 5.0.1 with MPI 1.2.4 over Myrinet 2000
(M3F-PCI64-B 2MB).

Thanks in advance,

Michael 

----------------

The code ran fine on up to 128 processors when tested on one process per
node.  It also ran fine on 2 processes per node on up to 32 nodes (ie 64
processes).  However when run on 64x2 then the code would "stop" at
differing points, normally within a minute of execution of an hour long
job.  By "stop"  I mean the processes would remain at 100% CPU but no work
was being done, as though a process was waiting for a message.

Reason
------

Our investigations this afternoon have led us to believe that it comes down
to a loop of MPI_Bcasts:
            DO 300 p = 0, noprocs-1
               JSTART = p*JMAX/noprocs+1
               JFINISH = (p+1)*JMAX/noprocs
               npts = IMAX*(JFINISH-JSTART+1)
               CALL MPI_Bcast(U(1,JSTART), npts,
     :                     MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
     :                     error)
  300       CONTINUE
This broadcast simply sends the next processor's chunk of the array to all
the other processors.  An AllToAll would be similar; however, this was used
to give better control over the number of messages being sent at any time.

However, it appears that this isn't the case.  By adding an MPI_Barrier
call after the MPI_Bcast the problem of the "stopping" wasn't repeated in
our tests.
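
As a rough illustration of the index arithmetic in the loop above, the following shell sketch reproduces JSTART, JFINISH and npts for a given root p. The values of IMAX and JMAX are not given in the message; the ones used in the usage note are hypothetical, chosen only so that each broadcast carries about a kilobyte, and 8 bytes per DOUBLE PRECISION value is assumed.

```shell
# Sketch: reproduce the JSTART/JFINISH/npts arithmetic of the MPI_Bcast loop.
# Arguments: p noprocs IMAX JMAX (IMAX and JMAX values are hypothetical).
# Shell $(( )) truncates integer division, as Fortran INTEGER division does.
bcast_chunk() {
    p=$1; noprocs=$2; imax=$3; jmax=$4
    jstart=$(( p * jmax / noprocs + 1 ))
    jfinish=$(( (p + 1) * jmax / noprocs ))
    npts=$(( imax * (jfinish - jstart + 1) ))
    echo "$jstart $jfinish $npts"
}

# Bytes broadcast by root p, assuming 8 bytes per DOUBLE PRECISION value.
bcast_bytes() {
    set -- $(bcast_chunk "$@")
    echo $(( $3 * 8 ))
}
```

With the hypothetical sizes IMAX=64 and JMAX=512 on 256 processes, `bcast_chunk 0 256 64 512` prints `1 2 128` and `bcast_bytes 0 256 64 512` prints `1024`. Every rank takes part in all noprocs broadcasts, so each one receives noprocs-1 such chunks per pass through the loop.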




From kameyama бў pccluster.org  Wed Mar 19 09:14:35 2003
From: kameyama бў pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=)
Date: Wed, 19 Mar 2003 09:14:35 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Starting Compute Host Lock services: msgbserv:No hosts
In-Reply-To: Your message of "18 Mar 2003 13:53:25 JST."
             <1048013605.14938.56.camel@cr1>
Message-ID: <20030319001435.2629020054@neal.il.is.s.u-tokyo.ac.jp>

In article <1048013605.14938.56.camel бў cr1> "James O'Dell" <jodell бў ad.brown.edu> wrote:
> I get an error message when I try to start my msgbserv as below:
> 
> /etc/rc.d/init.d/msgbserv start
> Starting Compute Host Lock services: msgbserv:No hosts

Your scorehosts.db says:
    #define MSGBSERV        msgbserv=(kansas-fe.cascv.brown.edu:8764)

But the file stored under /var/scored is:
    gaethernet/ethernet (error=12).
      argv[0] -config
      argv[1] /var/scored/scoreboard/kansas.0000V3000V7t
    Unable to open PM gigaethernet/ethernet (error=12).
      argv[0] -config
      argv[1] /var/scored/scoreboard/kansas.0000V3000V7t

What is the official hostname of your server?
msgbserv searches the scoreboard database for the management host by its
*official hostname*.
You can get the official hostname with the officialname command:
    % officialname
If your server's hostname is kansas.cascv.brown.edu, please change the
msgbserv entry.
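
The comparison described above can be sketched in shell. This is only an illustration and not part of SCore: the function names msgbserv_host and check_msgbserv are made up here, and the sketch assumes the definition has the msgbserv=(host:port) form quoted above.

```shell
# Sketch: pull the host out of a "msgbserv=(host:port)" definition on stdin.
msgbserv_host() {
    sed -n 's/.*msgbserv=(\([^:)]*\).*/\1/p' | head -n 1
}

# Compare it with a host name supplied by the caller (for example the
# output of officialname). Prints "match" or "mismatch: <db> != <official>".
check_msgbserv() {
    official=$1
    db_host=$(msgbserv_host)
    if [ "$db_host" = "$official" ]; then
        echo "match"
    else
        echo "mismatch: $db_host != $official"
        return 1
    fi
}
```

On the server this could be run as `check_msgbserv "$(officialname)" < /opt/score/etc/scorehosts.db`; a mismatch means the msgbserv entry needs to be changed.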

                       from Kameyama Toyohisa


From hori бў swimmy-soft.com  Wed Mar 19 15:40:55 2003
From: hori бў swimmy-soft.com (Atsushi HORI)
Date: Wed, 19 Mar 2003 15:40:55 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Help with an error message
In-Reply-To: <1047944356.14938.31.camel@cr1>
References: <1047944356.14938.31.camel@cr1>
Message-ID: <3130933255.hori0006@swimmy-soft.com>

Hi.

>Does anyone know what the following messages mean?
>I got them while running:
>
>scstest -network gigaethernet
>
>
>bio-11(-1) pmAssociateNodes: Invalid argument(22)
>bio-12(-1) pmAssociateNodes: Invalid argument(22)

There are a number of possible causes for this message, so try 
with the PM debug option.

% export PM_DEBUG=1
% scstest -network gigaether

And let me know the output messages.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



From jodell бў ad.brown.edu  Thu Mar 20 05:15:52 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 19 Mar 2003 15:15:52 -0500
Subject: [SCore-users-jp] [Fwd: Re: [SCore-users] Starting Compute Host Lock services: msgbserv:No hosts]
Message-ID: <1048104952.14922.106.camel@cr1>

-----Forwarded Message-----

> From: James O'Dell <jodell бў ad.brown.edu>
> To: kameyama бў pccluster.org
> Subject: Re: [SCore-users] Starting Compute Host Lock services: msgbserv:No hosts
> Date: 19 Mar 2003 12:58:25 -0500
> 
> That was the problem exactly! It makes me wonder how my cluster ever
> worked correctly!
> 
> Thanks for the pointers. My cluster is up and running on gig ethernet.
> 
> Jim
> 
> On Tue, 2003-03-18 at 19:14, kameyama бў pccluster.org wrote:
> > In article <1048013605.14938.56.camel бў cr1> "James O'Dell" <jodell бў ad.brown.edu> wrote:
> > > I get an error message when I try to start my msgbserv as below:
> > > 
> > > /etc/rc.d/init.d/msgbserv start
> > > Starting Compute Host Lock services: msgbserv:No hosts
> > 
> > Your scorehosts.db says:
> >     #define MSGBSERV        msgbserv=(kansas-fe.cascv.brown.edu:8764)
> > 
> > But the file stored under /var/scored is:
> >     gaethernet/ethernet (error=12).
> >       argv[0] -config
> >       argv[1] /var/scored/scoreboard/kansas.0000V3000V7t
> >     Unable to open PM gigaethernet/ethernet (error=12).
> >       argv[0] -config
> >       argv[1] /var/scored/scoreboard/kansas.0000V3000V7t
> > 
> > What is the official hostname of your server?
> > msgbserv searches the scoreboard database for the management host by its
> > *official hostname*.
> > You can get the official hostname with the officialname command:
> >     % officialname
> > If your server's hostname is kansas.cascv.brown.edu, please change the
> > msgbserv entry.
> > 
> >                        from Kameyama Toyohisa


From jodell бў ad.brown.edu  Thu Mar 20 05:15:35 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 19 Mar 2003 15:15:35 -0500
Subject: [SCore-users-jp] [Fwd: Re: [SCore-users] Help with an error message]
Message-ID: <1048104935.14922.104.camel@cr1>

-----Forwarded Message-----

> From: James O'Dell <jodell бў ad.brown.edu>
> To: Atsushi Hori <hori бў swimmy-soft.com>
> Subject: Re: [SCore-users] Help with an error message
> Date: 19 Mar 2003 12:56:31 -0500
> 
> Once I got my gig interfaces configured properly, this message went
> away.
> 
> Thanks for all of your help!
> 
> Jim
> 
> On Wed, 2003-03-19 at 01:40, Atsushi HORI wrote:
> > Hi.
> > 
> > >Does anyone know what the following messages mean?
> > >I got them while running:
> > >
> > >scstest -network gigaethernet
> > >
> > >
> > >bio-11(-1) pmAssociateNodes: Invalid argument(22)
> > >bio-12(-1) pmAssociateNodes: Invalid argument(22)
> > 
> > There are a number of possible causes for this message, so try 
> > with the PM debug option.
> > 
> > % export PM_DEBUG=1
> > % scstest -network gigaether
> > 
> > And let me know the output messages.
> > 
> > ----
> > Atsushi HORI
> > SCore Developer
> > Swimmy Software, Inc.
> > 


From jodell бў ad.brown.edu  Thu Mar 20 08:05:21 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 19 Mar 2003 18:05:21 -0500
Subject: [SCore-users-jp] [SCore-users] Procedure for adjusting networking parameters
Message-ID: <1048115121.16469.128.camel@cr1>

Does anyone have a "best practices" procedure that they'd like to share
on how they adjust their networking parameters for highest performance?

Thanks,
Jim


From hori бў swimmy-soft.com  Thu Mar 20 11:06:17 2003
From: hori бў swimmy-soft.com (Atsushi HORI)
Date: Thu, 20 Mar 2003 11:06:17 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Procedure for adjusting networking parameters
In-Reply-To: <1048115121.16469.128.camel@cr1>
References: <1048115121.16469.128.camel@cr1>
Message-ID: <3131003177.hori0001@swimmy-soft.com>

Hi,

>Does anyone have a "best practices" procedure that they'd like to share
>on how they adjust their networking parameters for highest performance?

What is the definition of "performance" ?

Although many people believe that communication performance can 
be measured with latency and bandwidth, as far as I know latency 
and bandwidth represent only some aspects of communication 
characteristics. Have you ever heard of the LogP model? It is 
another measure of communication performance, but LogP too 
represents only some aspects, not all.

Further, cluster users want to run their jobs as fast as they can. 
So the ultimate goal is to obtain the highest performance of 
applications, not of the communication. Sometimes application performance 
depends on the latency and bandwidth, but sometimes not.
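
To make the latency/bandwidth point concrete, here is a minimal sketch of the usual first-order estimate, time = latency + size / bandwidth. The function name and the numbers in the usage note are hypothetical and are not measurements of any network; the model also ignores the per-message overhead and gap terms that LogP adds.

```shell
# Sketch: first-order transfer time T = latency + size / bandwidth.
# Arguments: latency in microseconds, bandwidth in MB/s (10^6 bytes/s,
# i.e. bytes per microsecond), message size in bytes.
# Output: estimated time in microseconds.
msg_time_us() {
    awk -v l="$1" -v bw="$2" -v n="$3" 'BEGIN { printf "%.2f\n", l + n / bw }'
}
```

With a hypothetical 10 us latency and 100 MB/s bandwidth, `msg_time_us 10 100 1024` gives 20.24, so about half the time of a 1 KB message is latency, while `msg_time_us 10 100 1048576` gives 10495.76, which is almost entirely bandwidth. Which of the two matters depends on the message sizes the application actually sends.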

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



From jodell бў ad.brown.edu  Fri Mar 21 03:48:30 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 20 Mar 2003 13:48:30 -0500
Subject: [SCore-users-jp] Re: [SCore-users] Procedure for adjusting networking parameters
In-Reply-To: <3131003177.hori0001@swimmy-soft.com>
References: <1048115121.16469.128.camel@cr1>
	 <3131003177.hori0001@swimmy-soft.com>
Message-ID: <1048186110.20434.74.camel@cr1>

Good point. I guess what you are really saying is that I should tune my
system against its typical workload.

Jim

On Wed, 2003-03-19 at 21:06, Atsushi HORI wrote:
> Hi,
> 
> >Does anyone have a "best practices" procedure that they'd like to share
> >on how they adjust their networking parameters for highest performance?
> 
> What is the definition of "performance" ?
> 
> Although many people believe that communication performance can 
> be measured with latency and bandwidth, as far as I know latency 
> and bandwidth represent only some aspects of communication 
> characteristics. Have you ever heard of the LogP model? It is 
> another measure of communication performance, but LogP too 
> represents only some aspects, not all.
> 
> Further, cluster users want to run their jobs as fast as they can. 
> So the ultimate goal is to obtain the highest performance of 
> applications, not of the communication. Sometimes application performance 
> depends on the latency and bandwidth, but sometimes not.
> 
> ----
> Atsushi HORI
> SCore Developer
> Swimmy Software, Inc.
> 


From jodell бў ad.brown.edu  Fri Mar 21 07:54:46 2003
From: jodell бў ad.brown.edu (James O'Dell)
Date: 20 Mar 2003 17:54:46 -0500
Subject: [SCore-users-jp] [SCore-users] More on my segmentation violation problem
Message-ID: <1048200886.20434.97.camel@cr1>

I've recompiled GROMACS to include symbols and have managed to get a
debugger backtrace from the process that is experiencing the
segmentation violation.

#0  0x082052cd in syscall ()
#1  0xbfffe5f8 in ?? ()
#2  0x081e045f in vsyscall (handle=0x834a080, retp=0xbfffe5f8,
args=0xbfffe5ec)
    at ../scwrap.c:711
#3  0x081e0495 in score_syscall (handle=0x834a080, retp=0xbfffe5f8)
    at ../scwrap.c:726
#4  0x081e0acf in __nanosleep (req=0xbfffe61c, rem=0xbfffe61c)
    at ../scwrap.c:1166
#5  0x08203c7a in sleep ()
#6  0x081b020e in score_wait_forever () at ../libsc_util.c:154
#7  0x081b04f2 in sc_inspectme (x_display=0xbffffd56 "dev1:0",
signal=11)
    at ../libscio.c:243
#8  0x081a8be0 in MPID_SCORE_Exception ()
#9  <signal handler called>
#10 angles (nbonds=27548, forceatoms=0x8eeed60, forceparams=0x8eeaa20, 
    x=0x8fc0560, f=0x93e8218, fr=0x8cca460, g=0x8ccaca0, box=0x8ad1c98, 
    lambda=0, dvdlambda=0xbfffec98, md=0x8cb61b8, ngrp=2,
egnb=0x8ad12c0, 
    egcoul=0x8ad12a8, fcd=0x8ad15b0) at ../../include/vec.h:235
#11 0x080898ee in calc_bonds (log=0x8ad1440, cr=0x86af708, mcr=0x0, 
    idef=0x8ad402c, x_s=0x8fc0560, f=0x93e8218, fr=0x8cca460,
g=0x8ccaca0, 
    epot=0x8ad11a0, nrnb=0xbffff1d0, box=0x8ad1c98, lambda=0,
md=0x8cb61b8, 
    ngrp=2, egnb=0x8ad12c0, egcoul=0x8ad12a8, fcd=0x8ad15b0, step=0, 
    bSepDVDL=0) at bondfree.c:109
#12 0x0805dd2d in force (fp=0x8ad1440, step=0, fr=0x8cca460,
ir=0x8ad1aa8, 
    idef=0x8ad402c, nsb=0x8ad3008, cr=0x86af708, mcr=0x0,
nrnb=0xbffff1d0, 
    grps=0x8ad1908, md=0x8cb61b8, ngener=2, opts=0x8ad1c28, x=0x8fc0560,
    f=0x93e8218, epot=0x8ad11a0, fcd=0x8ad15b0, bVerbose=0,
box=0x8ad1c98, 
    lambda=0, graph=0x8ccaca0, excl=0x8adf1c4, bNBFonly=0,
lr_vir=0xbffff610, 
    mu_tot=0xbffff1c0, qsum=-6.99999762, bGatherOnly=0) at force.c:960
#13 0x0807eade in do_force (log=0x8ad1440, cr=0x86af708, mcr=0x0, 
    parm=0x8ad1aa8, nsb=0x8ad3008, vir_part=0xbffff640,
pme_vir=0xbffff610, 
    step=0, nrnb=0xbffff1d0, top=0x8ad4028, grps=0x8ad1908, x=0x8fc0560,
    v=0x90454e8, f=0x93e8218, buf=0x9363290, mdatoms=0x8cb61b8, 
    ener=0x8ad11a0, fcd=0x8ad15b0, bVerbose=0, lambda=0,
graph=0x8ccaca0, 
    bNS=1, bNBFonly=0, fr=0x8cca460, mu_tot=0xbffff1c0, bGatherOnly=0)
    at sim_util.c:282
#14 0x0805177e in do_md (log=0x8ad1440, cr=0x86af708, mcr=0x0, nfile=21,
    fnm=0x828bd04, bVerbose=1, bCompact=1, bDummies=0, dummycomm=0x0, 
    stepout=10, parm=0x8ad1aa8, grps=0x8ad1908, top=0x8ad4028,
ener=0x8ad11a0, 
    fcd=0x8ad15b0, x=0x8fc0560, vold=0x94f2128, v=0x90454e8,
vt=0x946d1a0, 
    f=0x93e8218, buf=0x9363290, mdatoms=0x8cb61b8, nsb=0x8ad3008, 
    nrnb=0x8ae0260, graph=0x8ccaca0, edyn=0xbffff7f0, fr=0x8cca460, 
    box_size=0xbffff790, Flags=0) at md.c:508
#15 0x080508b6 in mdrunner (cr=0x86af708, mcr=0x0, nfile=21,
fnm=0x828bd04, 
    bVerbose=1, bCompact=1, nDlb=0, nstepout=10, edyn=0xbffff7f0,
Flags=0)
    at md.c:193

The code at the violation is in this vicinity.
240       a[YY]=y;
241       a[ZZ]=z;
242     }
243
244     static inline void rvec_sub(const rvec a,const rvec b,rvec c)
245     {
246       real x,y,z;
247       
248       x=a[XX]-b[XX];
249       y=a[YY]-b[YY];

I don't believe that this behavior is specific to my hardware or
operating system since I get approximately the same behavior on an IBM
SP.

The segmentation violation seems to happen very early in the run.
In this case I was running on 12 processors. Also, if I perform exactly
the same calculation several times in a row, sometimes it will
segmentation fault and sometimes not. It seems to me that it has all the
classic characteristics of a storage allocation problem in the GROMACS
code. 

Does anybody have suggestions on how to pursue this further?

Thanks,Jim



From s-sumi бў bd6.so-net.ne.jp  Fri Mar 21 12:51:09 2003
From: s-sumi бў bd6.so-net.ne.jp (Shinji Sumimoto)
Date: Fri, 21 Mar 2003 12:51:09 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] Benchmarking 256 processor problem
In-Reply-To: <Pine.LNX.4.21.0303182035170.19595-100000@scgj.streamline>
References: <Pine.LNX.4.21.0303182035170.19595-100000@scgj.streamline>
Message-ID: <20030321.125109.846938101.s-sumi@bd6.so-net.ne.jp>

Hi.

Thank you for the information.

Does the problem also occur with the mpi_zerocopy=on option? 

If it does not occur, there is something wrong in the message transfer in
the PM or MPI level implementation.

If the same problem occurs on a small cluster (e.g. 16 nodes),
please let us know. We can reproduce the situation and fix it.

PS: The new version of MPICH, version 1.2.5, includes a new version of
    mpi_bcast, so your problem may be solved. We will try to port it to
    SCore.

Shinji. 

From: Michael Rudgyard Streamline <michael бў streamline-computing.com>
Subject: [SCore-users] Benchmarking 256 processor problem
Date: Tue, 18 Mar 2003 21:30:59 +0000 (GMT)
Message-ID: <Pine.LNX.4.21.0303182035170.19595-100000 бў scgj.streamline>

michael> 
michael> We have a customer who has an interesting problem with MPI_Bcast,
michael> where SCore seems to hang on large numbers of processors, and in
michael> particular when there are 2 processes per node running. The code
michael> segment is provided below.
michael> 
michael> My understanding is that the MPICH (and hence the SCore) implementation of
michael> MPI_Bcast is globally asynchronous, and is built using MPI_Send. It is
michael> therefore possible that (in the example below, and in the worst case) the
michael> 256th processor may have yet to receive messages from all other
michael> (255) processors. I suspect that this may be problematic because there is a
michael> maximum number of message buffers that may be sent at a given time. I know
michael> this was the case on SGI and Cray systems, and I think this is the case
michael> with MPICH but can't find the corresponding environment variables on the
michael> MPICH web-site.
michael> 
michael> As far as I am aware, MPI_Send will block if the send cannot be buffered
michael> (so I assume this is the case for MPI_Bcast), and given that MPI_Bcast is
michael> called in the correct order for each processor (avoiding the well-known
michael> deadlock situations), I can't see why this code should necessarily
michael> hang (???) other than there being potentially a lot of
michael> messages floating around... This leads me to believe that it must just be
michael> the number of outstanding messages that is the problem, although in that
michael> case shouldn't the corresponding MPI_Bcast block at the sender's
michael> side? Could there be an issue in particular due to messages sent via 
michael> shared memory (i.e. a performance vs. correctness issue)?
michael> 
michael> For info, each send is about a kilobyte of information.
michael> 
michael> Note that making the broadcast synchronous, i.e. by adding an MPI_Barrier,
michael> solves the problem. 
michael> 
michael> The machine is running Score 5.0.1 with MPI 1.2.4 over Myrinet 2000
michael> (M3F-PCI64-B 2MB).
michael> 
michael> Thanks in advance,
michael> 
michael> Michael 
michael> 
michael> ----------------
michael> 
michael> The code ran fine on up to 128 processors when tested on one process per
michael> node.  It also ran fine on 2 processes per node on up to 32 nodes (ie 64
michael> processes).  However when run on 64x2 then the code would "stop" at
michael> differing points, normally within a minute of execution of an hour long
michael> job.  By "stop"  I mean the processes would remain at 100% CPU but no work
michael> was being done, as though a process was waiting for a message.
michael> 
michael> Reason
michael> ------
michael> 
michael> Our investigations this afternoon have led us to believe that it comes down
michael> to a loop of MPI_Bcasts:
michael>             DO 300 p = 0, noprocs-1
michael>                JSTART = p*JMAX/noprocs+1
michael>                JFINISH = (p+1)*JMAX/noprocs
michael>                npts = IMAX*(JFINISH-JSTART+1)
michael>                CALL MPI_Bcast(U(1,JSTART), npts,
michael>      :                     MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
michael>      :                     error)
michael>   300       CONTINUE
michael> This broadcast simply sends the next processor's chunk of the array to all
michael> the other processors.  An AllToAll would be similar; however, this was used
michael> to give better control over the number of messages being sent at any time.
michael> 
michael> However, it appears that this isn't the case.  By adding an MPI_Barrier
michael> call after the MPI_Bcast the problem of the "stopping" wasn't repeated in
michael> our tests.
michael> 
michael> 
michael> 
------
Shinji Sumimoto, Fujitsu Labs


From jducom бў nd.edu  Fri Mar 21 14:09:54 2003
From: jducom бў nd.edu (Jean-Christophe Ducom)
Date: Fri, 21 Mar 2003 00:09:54 -0500
Subject: [SCore-users-jp] [SCore-users] Score with SK-9D21
Message-ID: <3E7A9EA2.7000907@nd.edu>

All,

	I know that it is mentioned in the FAQ but I'd like to double check in 
case it has been fixed with the latest version. Is there any chance that 
Score works with the SysKonnect SK-9D21 card?
Thank you and sorry about this (annoying/redundant) question.

	JC



From nrcb бў streamline-computing.com  Fri Mar 21 17:15:16 2003
From: nrcb бў streamline-computing.com (Nick Birkett)
Date: Fri, 21 Mar 2003 08:15:16 +0000
Subject: [SCore-users-jp] [SCore-users] pghpf score 5.4
Message-ID: <200303210815.16173.nrcb@streamline-computing.com>

I am having a problem getting pghpf to work with score 5.4.0.

I have this working for score 5.0.1.

Here is the error:

mpif90 -compiler pghpf  -c -Mmpi -fast -tp p7 -Mlfs  overlap=size:3 -Ktrap=fp  
math.f
/opt/score/mpi/mpich-1.2.4/i386-redhat7-linux2_4_pghpf/bin/mpif90: eval: 
illegal option: -c
eval: usage: eval [arg ...]
make: *** [math.o] Error 2

Has anyone else tried this ?

See attached mpif90 wrapper, the compiler/site and compiler/pghpf
used to compile from source.

Regards,

Nick


-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpif90
Type: application/x-shellscript
Size: 11747 bytes
Desc: not available
URL:        <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030321/c578f03c/attachment.bin>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: site
URL:  <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030321/c578f03c/attachment.ksh>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pghpf
URL:  <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030321/c578f03c/attachment-0001.ksh>

From arpiruk бў yahoo.com  Fri Mar 21 22:31:17 2003
From: arpiruk бў yahoo.com (=?iso-2022-jp?b?YXJwaXJ1ayAbJEIhdxsoQiB5YWhvby5jb20=?=)
Date: Fri, 21 Mar 2003 05:31:17 -0800 (PST)
Subject: [SCore-users-jp] [SCore-users] rc.config.scoreboard problem
In-Reply-To: <20030318030001.17598.78856.Mailman@www.pccluster.org>
Message-ID: <20030321133117.26480.qmail@web13907.mail.yahoo.com>


I have a question concerning the installation.

During installation of Score5.2 on Suse 2.4.18 the setup reports 

cp: cannot stat `rc.config.scoreboard': No such file or directory
        Exception in ../SRC/services.c, line 299 concerning file not opened

Where should I get this file from, and where should I put it? Is it similar to the Linux rc.config file?

Sincerely,

Arpiruk Hokpunna

CSE student

TU-Munich

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030321/95d91a5d/attachment.html>

From arpiruk бў yahoo.com  Fri Mar 21 22:30:31 2003
From: arpiruk бў yahoo.com (=?iso-2022-jp?b?YXJwaXJ1ayAbJEIhdxsoQiB5YWhvby5jb20=?=)
Date: Fri, 21 Mar 2003 05:30:31 -0800 (PST)
Subject: [SCore-users-jp] [SCore-users] Re: SCore-users digest, Vol 1 #194 - 4 msgs
In-Reply-To: <20030318030001.17598.78856.Mailman@www.pccluster.org>
Message-ID: <20030321133031.35701.qmail@web13901.mail.yahoo.com>


I have a question concerning the installation.

During installation of Score5.2 on Suse 2.4.18 the setup reports 

cp: cannot stat `rc.config.scoreboard': No such file or directory
        Exception in ../SRC/services.c, line 299 concerning file not opened

Where should I get this file from, and where should I put it? Is it similar to the Linux rc.config file?

Sincerely,

Arpiruk Hokpunna

CSE student

TU-Munich

 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://new1.pccluster.org/pipermail/score-users-jp/attachments/20030321/11540b0e/attachment.html>

From kameyama бў pccluster.org  Mon Mar 24 15:08:44 2003
From: kameyama бў pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=)
Date: Mon, 24 Mar 2003 15:08:44 +0900
Subject: [SCore-users-jp] Re: [SCore-users] pghpf score 5.4
In-Reply-To: Your message of "Fri, 21 Mar 2003 08:15:16 JST."
             <200303210815.16173.nrcb@streamline-computing.com>
Message-ID: <20030324060844.61FA620054@neal.il.is.s.u-tokyo.ac.jp>

In article <200303210815.16173.nrcb бў streamline-computing.com> Nick Birkett <nrcb бў streamline-computing.com> wrote:
> Content-Transfer-Encoding: 8bit
> 
> I am having a problem getting pghpf to work with score 5.4.0.
> 
> I have this working for score 5.0.1.
> 
> Here is the error:
> 
> mpif90 -compiler pghpf  -c -Mmpi -fast -tp p7 -Mlfs  overlap=size:3 -Ktrap=fp

If you want to use the communication libraries with MPICH/SCore,
you only need MPICH/SCore with the PGI library; then edit:
    $PGI/linux86/pghpfrc
or
    $HOME/.mypghpfrc

> math.f
> /opt/score/mpi/mpich-1.2.4/i386-redhat7-linux2_4_pghpf/bin/mpif90: eval: 
> illegal option: -c
> eval: usage: eval [arg ...]
> make: *** [math.o] Error 2
> 
> Has anyone else tried this ?

I think the Fortran 90 compiler was not found at MPI build time.
Please check the MPI build log (/opt/score/score-src/out.*/mpi.build).
I get this message:

checking if /work/kameyama/install/bin/scoref77   works with GETARG and IARGC...
 no
...
configure: warning: Could not find a way to access the command line from Fortran
 77
configure: error: Command line access is required for MPICH
Error configuring the Fortran subsystem!
Turning off Fortran support

You must specify the mpicc and mpif77 compilers for pghpf.

Note that you can use a compiler alias file on SCore 5.4:
    http://www.pccluster.org/score/dist/score/html/en/man/man5/compiler_alias.html
If you write the following line in /opt/score/etc/compilers/alias:
    pghpf	pgi
and write the following in /opt/score/etc/compilers/site:
    mpif90	pghpf=pghpf ...
and then type the following command:
    % mpif90 -compiler=pghpf ...
mpif90 converts it to the following:
    % mpif90 -compiler=pgi -compiler-path=pghpf ...

                       from Kameyama Toyohisa


From nrcb бў streamline-computing.com  Tue Mar 25 00:08:12 2003
From: nrcb бў streamline-computing.com (Nick Birkett)
Date: Mon, 24 Mar 2003 15:08:12 +0000
Subject: [SCore-users-jp] [SCore-users] suspending/unsuspending single user jobs
Message-ID: <200303241508.12977.nrcb@streamline-computing.com>

Hi, we would like to be able to suspend single-user jobs (i.e. running under
PBS or Sun Grid Engine) and be able to start another job while jobs are suspended.
This is so we can suspend parallel queues at certain times of the week to run
large jobs. 

We have tried sending SIGTSTP to the scrun.exe process of a job and this
suspends it. However, there is a problem in starting another job with one 
suspended. I have tried restarting msgbserv but still get an error that the
pm device is already opened.

I.e. I would like to know how to suspend a single-user job and close the pm
devices associated with that job so I can run another job using the same 
nodes.

Is there a way to do this?

Cheers,

Nick




From M.Newiger бў deltacomputer.de  Tue Mar 25 02:13:56 2003
From: M.Newiger бў deltacomputer.de (Martin Newiger)
Date: Mon, 24 Mar 2003 18:13:56 +0100
Subject: [SCore-users-jp] [SCore-users] NIS
Message-ID: <c=DE%a=_%p=Delta_GmbH%l=ABAKUS-030324171356Z-1390@abakus.delnet>

When I add users to a SCore system, I want them to be available on my
nodes as well (so that they can log in there too). What must I do? The
NIS server is the SCore master.

Regards 
Martin Newiger


From kameyama @ pccluster.org  Tue Mar 25 09:00:43 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Tue, 25 Mar 2003 09:00:43 +0900
Subject: [SCore-users-jp] Re: [SCore-users] NIS
In-Reply-To: Your message of "Mon, 24 Mar 2003 18:13:56 JST."
             <c=DE%a=_%p=Delta_GmbH%l=ABAKUS-030324171356Z-1390@abakus.delnet>
Message-ID: <20030325000043.7D3D72003B@neal.il.is.s.u-tokyo.ac.jp>

In article <c=DE%a=_%p=Delta_GmbH%l=ABAKUS-030324171356Z-1390 @ abakus.delnet> Martin Newiger <M.Newiger @ deltacomputer.de> wrote:
> When I add users to a SCore-system I want to have them being available
> on my nodes as well (that they can login there too). What must I do? The
> NIS-Server is the SCore-Master.

1. Add the user locally on the NIS server.
2. Run these commands on the NIS server:
      # cd /var/yp
      # make
   These commands update the NIS data.

                       from Kameyama Toyohisa


From haddock @ webgroup.co.jp  Wed Mar 26 16:54:00 2003
From: haddock @ webgroup.co.jp (haddock @ webgroup.co.jp)
Date: Wed, 26 Mar 2003 16:54:00 +0900
Subject: [SCore-users-jp] [SCore-users] Cannot mount ?
Message-ID: <5fcb3441.34415fcb@webgroup.co.jp>

Hi all

  I'm building a PC cluster with SCore 5.4 on Red Hat 7.3.
When I try to install the client machines, I get errors like
this:

  ---------------
 VFS :Mounted root (ext2 filesystem).
 Using EIT5 feature
 mounting /proc filesystem.... done
 Testing..........
No dhcp_server specified. Used Broadcast
setupNetwork cannot set the gateway address
done
NFS mount 192.168.1.2: /mnt/runtime
Cannot mount 
exiting
See the documentation for this trouble

---------------------------

Do you have any idea?
Thanks for your help.


Regards

                                  haddock



From kameyama @ pccluster.org  Wed Mar 26 17:14:22 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Wed, 26 Mar 2003 17:14:22 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Cannot mount ?
In-Reply-To: Your message of "Wed, 26 Mar 2003 16:54:00 JST."
             <5fcb3441.34415fcb@webgroup.co.jp>
Message-ID: <20030326081422.D3B262003B@neal.il.is.s.u-tokyo.ac.jp>

In article <5fcb3441.34415fcb @ webgroup.co.jp> haddock @ webgroup.co.jp wrote:
> Hi all
> 
>   I'm making the pc-cluster with score-5.4 on Redhat7.3.
> When I try install client machines , I got some errors like
> this
> 
>   ---------------
>  VFS :Mounted root (ext2 filesystem).
>  Using EIT5 feature
>  mounting /proc filesystem.... done
>  Testing..........
> No dhcp_server specified. Used Broadcast
> setupNetwork cannot set the gateway address

Did you set the correct gateway address in the Network Configuration window?
This gateway address is the one for the compute hosts (not the server host).

> NFS mount 192.168.1.2: /mnt/runtime

Your server's IP address on the compute-host side is 192.168.1.2.
Is this correct?

If your server host has two network cards,
please use eth0 and the official hostname for the compute-host side.

                       from Kameyama Toyohisa


From kraehe @ copyleft.de  Thu Mar 27 03:10:36 2003
From: kraehe @ copyleft.de (Michael Koehne)
Date: Wed, 26 Mar 2003 19:10:36 +0100
Subject: [SCore-users-jp] [SCore-users] Queue hangs for one user
Message-ID: <20030326181035.GA16787@bakunin.copyleft.de>

Moin Guru's,

  We have a 40-node/80-CPU SCore system at CLAMV (http://www.clamv.iu-bremen.de/)
  that is used by half a dozen people. We had a CPU/fan problem a few days
  ago, and Ulrich, who noticed it, did the following:

  - removed cell05 from /var/scored/pbs/server_priv/nodes
  - inserted cell05 into /opt/score/etc/scorehosts.defects
  - ran /etc/rc.d/init.d/pbs_server restart
  - shut down cell05, which saved the CPU
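
  The file edits in the steps above can be sketched as a small shell
  function. This is an illustration only: mark_node_defective is a
  hypothetical helper, the PBS nodes file is assumed to start each line
  with the hostname (the np=2 suffix is made up), and scorehosts.defects
  is assumed to hold one hostname per line. The demo works on temporary
  copies, not on the real files.

```shell
# Hypothetical helper: mark_node_defective NODE NODES_FILE DEFECTS_FILE
mark_node_defective() {
    node=$1 nodes_file=$2 defects_file=$3
    # Drop the node's line from the PBS nodes file (first field = hostname).
    awk -v n="$node" '$1 != n' "$nodes_file" > "$nodes_file.tmp" &&
        mv "$nodes_file.tmp" "$nodes_file"
    # Record the node as defective unless it is already listed.
    grep -qxF "$node" "$defects_file" 2>/dev/null || echo "$node" >> "$defects_file"
}

demo=$(mktemp -d)
printf 'cell04 np=2\ncell05 np=2\n' > "$demo/nodes"
mark_node_defective cell05 "$demo/nodes" "$demo/scorehosts.defects"
cat "$demo/nodes"                 # prints: cell04 np=2
cat "$demo/scorehosts.defects"    # prints: cell05
rm -rf "$demo"
```

  On the real system the pbs_server restart from the list would still be
  needed afterwards.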

  We got the fan yesterday. I installed it and reversed Ulrich's
  changes. I did not restart the PBS server, as there were
  jobs running that were not mine.

  Now mhoeft has the problem that all of his jobs hang in the
  queue. When he came to me, he was also unable to qdel his jobs,
  so I did the `/etc/rc.d/init.d/pbs_server restart`, as there
  were no other users at that time. Now he is able to submit
  and delete jobs, but his jobs never run; they just stay blocked,
  waiting in the queue.

  I can start jobs, schroedi can start jobs, but Matthias's jobs look like this:

  7933.muscle.clu mhoeft   default  flash         --    4  --    -- --  Q   -- 
     cell10+cell10+cell09+cell09+cell08+cell08+cell07+cell07

  Now the funny part: if I start a job and immediately look at `qstat -rn`, the
  job of mhoeft gets an R status for a fraction of a second and falls
  back to Q almost immediately. Time elapsed stays at --. Any idea?

Bye Michael
-- 
  mailto:kraehe @ copyleft.de             UNA:+.? 'CED+2+:::Linux:2.4.18'UNZ+1'
  http://www.xml-edifact.org/           CETERUM CENSEO WINDOWS ESSE DELENDAM


From M.Newiger @ deltacomputer.de  Thu Mar 27 09:26:53 2003
From: M.Newiger @ deltacomputer.de (Martin Newiger)
Date: Thu, 27 Mar 2003 01:26:53 +0100
Subject: [SCore-users-jp] [SCore-users] No Function
Message-ID: <c=DE%a=_%p=Delta_GmbH%l=ABAKUS-030327002653Z-6@abakus.delnet>

Hi,

I want to add new nodes to an existing cluster configuration. When I press
the load button, it appears to do nothing (no window pops up). Is
there any other way to add a new compute host to an old configuration?

Kind regards 
M.Newiger


From kameyama @ pccluster.org  Thu Mar 27 09:26:36 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Thu, 27 Mar 2003 09:26:36 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Queue hangs for one user
In-Reply-To: Your message of "Wed, 26 Mar 2003 19:10:36 JST."
             <20030326181035.GA16787@bakunin.copyleft.de>
Message-ID: <20030327002636.489502004E@neal.il.is.s.u-tokyo.ac.jp>

In article <20030326181035.GA16787 @ bakunin.copyleft.de> Michael Koehne <kraehe @ copyleft.de> wrote:
>   We got the FAN yesterday - I installed it and reversed Ulrichs
>   changes. I did not restart the pbs server, as there had been
>   jobs running, that had not been my jobs.

Please check
    /var/scored/pbs/server_logs/*
and
    /var/scored/pbs/sched_logs/*
These directories contain the server and scheduler logs.

                       from Kameyama Toyohisa


From kameyama @ pccluster.org  Thu Mar 27 09:34:06 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Thu, 27 Mar 2003 09:34:06 +0900
Subject: [SCore-users-jp] Re: [SCore-users] No Function
In-Reply-To: Your message of "Thu, 27 Mar 2003 01:26:53 JST."
             <c=DE%a=_%p=Delta_GmbH%l=ABAKUS-030327002653Z-6@abakus.delnet>
Message-ID: <20030327003406.1B9DD2004E@neal.il.is.s.u-tokyo.ac.jp>

In article <c=DE%a=_%p=Delta_GmbH%l=ABAKUS-030327002653Z-6 @ abakus.delnet> Martin Newiger <M.Newiger @ deltacomputer.de> wrote:
> I want to add new nodes to an existing cluster configuration. If I press
>  the load-button it appears to have no function (no window pops up).

The load button does not pop up a window.
Please continue the setup by clicking the Next button.

                       from Kameyama Toyohisa


From aixpresso @ web.de  Thu Mar 27 18:17:48 2003
From: aixpresso @ web.de (Daniel Amkreutz)
Date: Thu, 27 Mar 2003 10:17:48 +0100
Subject: [SCore-users-jp] [SCore-users] Master as Compute Node Problem with PM-Ethernet
Message-ID: <200303270917.h2R9Hk205643@mailgate5.cinetic.de>

Hello.

We use a cluster of 3 nodes and 1 master. They are all identical, and we would like to configure the master as
a compute node too.
Here's what we have so far:

- scored and msgb start and recognize 4 nodes
- a scout session can be started with 4 nodes

But when I try to run a job, scored complains about the following:

<3> SCore-D:WARNING Unable to open PM gigaethernet/ethernet (error=2).
<3> SCore-D:WARNING   argv[0] -config
<3> SCore-D:WARNING   argv[1] /var/scored/scoreboard/master.0000B2000RgT
<3> SCore-D:ERROR No PM device opened.

In my opinion the <3> is the node number (OK, it is the master).

These are the configuration files:

PM ETHERNET:

unit 0
# maxnsend 0 - 32
maxnsend 16
# backoff 1000 - 20000 (usec)
backoff 4800
# checksum (0 if off, 1 is on)
checksum 0
# PE    MAC address             base hostname           # comment
0       00:30:48:27:17:4C       node01.cluster.domain   # ip=192.168.222.1 on eth0
1       00:30:48:27:16:E2       node02.cluster.domain   # ip=192.168.222.2 on eth0
2       00:30:48:27:16:56       node03.cluster.domain   # ip=192.168.222.3 on eth0
3       00:30:48:27:17:32       master.cluster.domain   # ip=192.168.222.254 on eth0


SCOREHOSTS.DB

/*
 *       SCore 5.0 scorehosts.db
 *              generated by PCCC EIT 5.2
 */

/* PM/Myrinet */
myrinet         type=myrinet \
                -firmware:file=/opt/score/share/lanai/lanai.mcp \
                -config:file=/opt/score/etc/pm-myrinet.conf

/* PM/Myrinet */
myrinet2k       type=myrinet2k \
                -firmware:file=/opt/score/share/lanai/lanaiM2k.mcp \
                -config:file=/opt/score/etc/pm-myrinet.conf

/* PM/Ethernet */
ethernet        type=ethernet \
                -config:file=/opt/score/etc/pm-ethernet.conf
gigaethernet    type=ethernet \
                -config:file=/opt/score/etc/pm-ethernet.conf
/* PM/Agent */
udp             type=agent -agent=pmaudp \
                -config:file=/opt/score/etc/pm-udp.conf

/* RHiNET */
rhinet          type=rhinet \
                -firmware:file=/opt/score/share/rhinet/phu_top_0207a.hex \
                -config:file=/opt/score/etc/pm-rhinet.conf
##
/* PM/SHMEM */
shmem0          type=shmem -node=0
shmem1          type=shmem -node=1
##
#include "/opt/score//etc/ndconf/0"
#include "/opt/score//etc/ndconf/1"
#include "/opt/score//etc/ndconf/2"
#include "/opt/score//etc/ndconf/3"
##
#define MSGBSERV        msgbserv=(master.cluster.domain:8764)

node01.cluster.domain   HOST_0 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV
node02.cluster.domain   HOST_1 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV
node03.cluster.domain   HOST_2 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV
master.cluster.domain   HOST_3 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV

I've also installed the SCore kernel with PM/Ethernet support on the master and activated PM/Ethernet on eth0 with /etc/init.d/pm_ethernet start

Can anyone tell me what's wrong?

Has anyone else tried to use the master as a compute node?
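
One thing that can be checked mechanically with files like the two quoted
above is whether every host in the PM/Ethernet table also has a host line
in scorehosts.db. The following is only a sketch and does not diagnose
error=2 itself: list_pm_hosts, list_db_hosts and missing_hosts are
hypothetical helper names, and the parsing assumes the layouts shown in
this message (PE number, MAC address, hostname; host lines carrying
network=). The demo uses cut-down temporary copies of the data.

```shell
# Hostnames from a PM/Ethernet config: lines "PE  MAC  hostname ...".
list_pm_hosts() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /:/ && NF >= 3 { print $3 }' "$1"
}

# Hostnames from scorehosts.db: first field of lines carrying network=.
list_db_hosts() {
    awk '/network=/ { print $1 }' "$1"
}

# Print every PM/Ethernet host that has no host line in scorehosts.db.
missing_hosts() {
    for h in $(list_pm_hosts "$1"); do
        list_db_hosts "$2" | grep -qxF "$h" || echo "$h"
    done
}

demo=$(mktemp -d)
cat > "$demo/pm-ethernet.conf" <<'EOF'
unit 0
0       00:30:48:27:17:4C       node01.cluster.domain
3       00:30:48:27:17:32       master.cluster.domain
EOF
cat > "$demo/scorehosts.db" <<'EOF'
node01.cluster.domain   HOST_0 network=gigaethernet,shmem0,shmem1 smp=2
EOF
missing_hosts "$demo/pm-ethernet.conf" "$demo/scorehosts.db"
# prints: master.cluster.domain
rm -rf "$demo"
```

With the full files from this message the function prints nothing, since
all four hosts appear in both.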

Thank You

Daniel


From hori @ swimmy-soft.com  Thu Mar 27 18:27:35 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Thu, 27 Mar 2003 18:27:35 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Master as Compute Node Problem with PM-Ethernet
In-Reply-To: <200303270917.h2R9Hk205643@mailgate5.cinetic.de>
References: <200303270917.h2R9Hk205643@mailgate5.cinetic.de>
Message-ID: <3131634455.hori0000@swimmy-soft.com>

Hi,

>I've also installed the SCORE Kernel with PM-Ethernet support on the 
>master and activated PM Ethernet on eth0 with /etc/init.d/pm_ethernet 
>start
>
>can anyone tell what's wrong ??

This sounds like you installed the master node manually. I suspect that 
the /var/scored/ directory does not exist.

You had better use the "bininstall -compute" command (script), which 
will do most of the installation work.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



From aixpresso @ web.de  Thu Mar 27 18:52:53 2003
From: aixpresso @ web.de (Daniel Amkreutz)
Date: Thu, 27 Mar 2003 10:52:53 +0100
Subject: [SCore-users-jp] [SCore-users] Re: Master as Compute Node Problem with PM-Ethernet
Message-ID: <200303270952.h2R9qr229204@mailgate5.cinetic.de>

Hi.

The /var/scored directories are present.


Your tip about the bininstall script was very helpful.

Thank you!




From ce107 @ dam.brown.edu  Fri Mar 28 08:36:37 2003
From: ce107 @ dam.brown.edu (C. Evangelinos)
Date: Thu, 27 Mar 2003 18:36:37 -0500 (EST)
Subject: [SCore-users-jp] [SCore-users] SCORE_RSH and use of ssh instead of rsh
Message-ID: <200303272336.h2RNabG15621@fritz.dam.brown.edu>

Thanks to the list's suggestions the NIS setup with only /var/scored
local to each compute node works fine. I'm also exporting via NFS
/opt/score to the rest of the nodes (read-only) so they can have full
functionality for compiling, executing etc. Such a setup meant that I
could not use the bininstall way of doing things and ended up doing
quite a few things on my own.

A few comments (mainly SCore but also Omni-related):
1) removing the rpms leaves init scripts behind in /etc/rc.d as well
as the new devices (the latter is not really a problem)
2) It would be nice to have a script that reproduces the effects of
installing the rpms for setting up device and configuration scripts,
local directories etc. for the case of NFS installations like mine
which do not use EIT or the RPMS for the compute nodes. I may end up
writing one myself anyway as I add nodes.
3) I got SCore to work fine (so far) on a system with a Realtek
ethernet card (8139too driver). It cannot handle interrupt reaping
however - the machine becomes highly unstable after a little while,
the system log fills up with
kernel: eth0: Too much work at interrupt, IntrStatus=0x0001.
messages and the machine requires a reboot. With reaping set to off
everything works fine. Performance between such a box and another one
with an Intel eepro100 driven card is so-so: ping-pong latency
(RTT/2) is ~58us, asymptotic ping-pong bandwidth is ~77Mbit/s out of
100 (worse than what LAM gets). BTW, it'd be nice if the SCore documentation
reported RTT/2 instead of RTT numbers, as I've seen people
mistake MPICH/PM numbers for double their actual value.
4) It would be more graceful to set things up so that if one already
has Java installed, the system doesn't look for things in
/opt/score/java/linux
Setting OMNI_JAVAVM seems to fix things for the Omni compiler but
jumpshot ignores setting JAVA_HOME and JVM as environment variables
before calling it.
5) There should be a way to pass back-end specific compiler
optimization flags to the Omni compiler. 
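
To make the RTT versus RTT/2 point in item 3 concrete, here is the
arithmetic for the figures quoted there, as a small shell sketch. The
function names are made up for this example; the inputs are the ~58us
RTT/2 and ~77Mbit/s on a 100Mbit/s link from the text.

```shell
# Full round-trip time in microseconds, given the one-way (RTT/2) latency.
rtt_from_half() {
    awk -v h="$1" 'BEGIN { printf "%d\n", 2 * h }'
}

# Achieved bandwidth as a whole-number percentage of the nominal link rate.
link_percent() {
    awk -v bw="$1" -v link="$2" 'BEGIN { printf "%d\n", 100 * bw / link }'
}

rtt_from_half 58       # prints 116
link_percent 77 100    # prints 77
```

So a document quoting RTT would show 116us for the same measurement that
is reported here as 58us.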

My main remaining problems are:
a) Integration with SGE - I just got someone to translate the Japanese
instructions but I'd like to know whether the source code that comes
with SCore (contrib) is modified or the Sun one as I want to use SCore
with the latest patched version of SGE out of Sun (and I'd prefer if
possible to avoid having to recompile everything but use as much of
Sun's binary installation as possible). 
b) This is the most important problem and related to the title of my
e-mail: 
For various security reasons I cannot use SCore with rsh (beyond
testing). Even with tcp wrappers enabled to limit access I'd prefer to
use ssh instead. SCORE_RSH seems to work with very few SCore binaries
and most importantly cannot work with scout. Is there a quick fix for
that or is rsh hardcoded in too many places in the source code?
Moreover, given the way connections propagate on an SCore cluster,
would running an ssh agent on the machine where scout is entered
be enough to provide for transparent connections, or do ssh-agents need to
run everywhere with some mechanism for new shells to get the required
environment variables set up automatically?
c) Moreover, if running as an SGE job, what mechanism would SCore use?
Normal rsh, ssh (supposing it's fixed as a replacement) or SGE's rsh?

Thanks everyone for their help,

Constantinos Evangelinos

Center for Fluid Mechanics
Brown University
and
Ocean Engineering Department
MIT


PS> On another mini-cluster with IBM nodes with NetXtreme BCM5703X
Gigabit Ethernet cards (tg3 driver) I get for netpipe's ping-pong an
RTT/2 latency of 68us and an asymptotic bandwidth that is around
535Mbit/s though sometimes one gets an extra 200Mbit/s for no
reason... Avoid...



From kameyama @ pccluster.org  Fri Mar 28 10:06:39 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Fri, 28 Mar 2003 10:06:39 +0900
Subject: [SCore-users-jp] Re: [SCore-users] SCORE_RSH and use of ssh instead of rsh
In-Reply-To: Your message of "Thu, 27 Mar 2003 18:36:37 JST."
             <200303272336.h2RNabG15621@fritz.dam.brown.edu>
Message-ID: <20030328010639.A2D6F20058@neal.il.is.s.u-tokyo.ac.jp>

In article <200303272336.h2RNabG15621 @ fritz.dam.brown.edu> "C. Evangelinos" <ce107 @ dam.brown.edu> wrote:
> 2) It would be nice to have a script that reproduces the effects of
> installing the rpms for setting up device and configuration scripts,
> local directories etc. for the case of NFS installations like mine
> which do not use EIT or the RPMS for the compute nodes. I may end up
> writing one myself anyway as I add nodes.

You can use the /opt/score/install/setup command to install the devices,
init.d scripts and local directories:
    # cd /opt/score/install
    # ./setup -score_comp
Please see
    /opt/score/doc/html/en/installation/sys-compute-fromsrc.html

> b) This is the most important problem and related to the title of my
> e-mail: 
> For various security reasons I cannot use SCore with rsh (beyond
> testing). Even with tcp wrappers enabled to limit access I'd prefer to
> use ssh instead. SCORE_RSH seems to work with very few SCore binaries
> and most importantly cannot work with scout. Is there a quick fix for
> that or is rsh hardcoded in too many places in the source code?

SCORE_RSH probably works with most SCore commands except scout and PBS.

scout does not work with SCORE_RSH because scout uses rsh between compute hosts.
For example, if you want to scout to 4 hosts (comp0, comp1, comp2, comp3),

scout executes as follows:
1. scout executes scremote on comp0
      your_host% rsh comp0 scremote ...
2. scremote on comp0 executes scremote on comp1
      comp0% rsh comp1 scremote ...
3. scremote on comp1 executes scremote on comp2
      comp1% rsh comp2 scremote ...
4. scremote on comp2 executes scremote on comp3
      comp2% rsh comp3 scremote ...
If scout used SCORE_RSH, you could not use ssh-agent to run scout.

But if you run scoutd on the compute hosts,
scout uses scoutd instead of rshd.
So you may stop rshd on the compute hosts.
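
The chaining described in the numbered steps can be reproduced for any
host list with a short shell function. This only prints the command each
hop would run; it does not contact any host. print_scout_chain is a
made-up name, and the "scremote ..." arguments stay elided as in the
steps above.

```shell
# print_scout_chain FRONTEND HOST...
# Each host is reached from the previous one, not from the front end.
print_scout_chain() {
    from=$1
    shift
    for to in "$@"; do
        printf '%s%% rsh %s scremote ...\n' "$from" "$to"
        from=$to
    done
}

print_scout_chain your_host comp0 comp1 comp2 comp3
# prints:
#   your_host% rsh comp0 scremote ...
#   comp0% rsh comp1 scremote ...
#   comp1% rsh comp2 scremote ...
#   comp2% rsh comp3 scremote ...
```

Only the first connection starts on the machine where scout was typed;
every later one starts on a compute host.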

                       from Kameyama Toyohisa


From rajeev @ pst.fujitsu.com  Fri Mar 28 10:29:56 2003
From: rajeev @ pst.fujitsu.com (Rajeev S)
Date: Fri, 28 Mar 2003 10:29:56 +0900
Subject: [SCore-users-jp] [SCore-users] Include me
Message-ID: <200303280128.KAA05378@tkns.tk.pst.fujitsu.com>



From yoneya @ nanolc.jst.go.jp  Fri Mar 28 15:29:33 2003
From: yoneya @ nanolc.jst.go.jp (Makoto Yoneya)
Date: Fri, 28 Mar 2003 15:29:33 +0900
Subject: [SCore-users-jp] [SCore-users] How to specify a input data file with scrun?
Message-ID: <BOENJMJEEDNPDCELOJOGIEGECAAA.yoneya@nanolc.jst.go.jp>

Dear SCore users:

I'm a newcomer to the SCore world.
I'd like to run the MD program GROMACS (3.1.4) on a Linux 2.4.18 / SCore 5.0.0
system.
I tried the following.

scrun -scored=cmp***,nodes=4 scatter -file data.tpr :: mdrun_d -np 4 -deffnm
data < data.tpr

Here, mdrun_d is the program executable and data.tpr is an input data file
for it.
The invocation of scrun looks successful.
However, the job appears to halt just around reading the input data.
What is wrong with the usage?
I really need help!

Yokoyama Nano-structured Liquid Crystal Project.
Makoto Yoneya (Dr.)



From Yamamoto.Takaya @ wrc.melco.co.jp  Fri Mar 28 15:48:25 2003
From: Yamamoto.Takaya @ wrc.melco.co.jp (Takaya Yamamoto)
Date: Fri, 28 Mar 2003 15:48:25 +0900
Subject: [SCore-users-jp] Single CPU and dual CPU
Message-ID: <5.0.2.5.2.20030328153729.033c49f0@133.141.16.40>

This is Yamamoto from Mitsubishi Electric.
Thank you for your continued support.

I have a question about building an SMP cluster.

Currently, I am trying to set up the following 3-PC (5-CPU) configuration:
  Server and compute host: single CPU
  2 compute hosts: both dual CPU

I am trying to install with EIT, but
at Group Creation I do not know how to put the single-CPU PC and
the dual-CPU PCs into the same group.
How should I do this?

Thank you in advance.

That is all.


From hori @ swimmy-soft.com  Fri Mar 28 16:06:34 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Fri, 28 Mar 2003 16:06:34 +0900
Subject: [SCore-users-jp] Re: [SCore-users] How to specify a input data file with scrun?
In-Reply-To: <BOENJMJEEDNPDCELOJOGIEGECAAA.yoneya@nanolc.jst.go.jp>
References: <BOENJMJEEDNPDCELOJOGIEGECAAA.yoneya@nanolc.jst.go.jp>
Message-ID: <3131712394.hori0000@swimmy-soft.com>

Hi,

>I'd tried the following.
>
>scrun -scored=cmp***,nodes=4 scatter -file data.tpr :: mdrun_d -np 4 -deffnm
>data < data.tpr
>
>Here, mdrun_d is the program executable, data.tpr is a input data file for
>this mdrun_d.
>The invocation of the scrun looks successful.
>However, this job looks halt just around reading the input data.
>What's wrong the usage?

Here, 

scrun -scored=cmp***,nodes=4 scatter -file data.tpr

creates data.tpr somewhere in the /var/scored/ directory as a 
temporary file on each compute host. Then,

mdrun_d -np 4 -deffnm data

tries to read the file from the current directory, where the scrun 
command was invoked on the server host.

If my assumption above is true, then the job will run if you type the 
following.

scrun -scored=cmp***,nodes=4 scatter -file /tmp/data.tpr :: mdrun_d 
-np 4 -deffnm
/tmp/data < data.tpr

P.S.
I am not sure, but there was a bug in scatter or in the stdin handling of
scrun in 5.0 or thereabouts. You had better upgrade your SCore system.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.



From uebayasi @ pultek.co.jp  Fri Mar 28 16:19:18 2003
From: uebayasi @ pultek.co.jp (Masao Uebayashi)
Date: Fri, 28 Mar 2003 16:19:18 +0900 (JST)
Subject: [SCore-users-jp] [SCore-users] How to specify a input data
 file with scrun?
In-Reply-To: <BOENJMJEEDNPDCELOJOGIEGECAAA.yoneya@nanolc.jst.go.jp>
References: <BOENJMJEEDNPDCELOJOGIEGECAAA.yoneya@nanolc.jst.go.jp>
Message-ID: <20030328.161918.125126176.uebayasi@pultek.co.jp>

> scrun -scored=cmp***,nodes=4 scatter -file data.tpr :: mdrun_d -np 4 -deffnm
> data < data.tpr

In this case, '<' is interpreted by the shell before scrun is invoked,
and it's not passed to scrun.  You need to use ':= data.tpr' to
specify an input file.  See the "INPUT/OUTPUT REDIRECTION" section in
the scrun manual page.
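
The shell behaviour described here is easy to see with a stand-in
command. In this sketch count_args is a made-up function that just prints
how many arguments it received; it stands in for scrun and does not run
anything on a cluster.

```shell
# Stand-in for a real command: print the number of arguments received.
count_args() { echo "$#"; }

input=$(mktemp)
echo "sample" > "$input"

# The shell opens the file on stdin and removes "< file" from the
# argument list, so the command only sees three words.
count_args mdrun_d -np 4 < "$input"          # prints 3

# ':=' is not special to the shell, so it reaches the command as an
# ordinary argument that the command itself can interpret.
count_args mdrun_d -np 4 ':=' data.tpr       # prints 5

rm -f "$input"
```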

Masao


From kameyama @ pccluster.org  Fri Mar 28 16:27:50 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Fri, 28 Mar 2003 16:27:50 +0900
Subject: [SCore-users-jp] Single CPU and dual CPU
In-Reply-To: Your message of "Fri, 28 Mar 2003 15:48:25 JST."
             <5.0.2.5.2.20030328153729.033c49f0@133.141.16.40>
Message-ID: <20030328072750.CA5202005C@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <5.0.2.5.2.20030328153729.033c49f0 @ 133.141.16.40> Takaya Yamamoto <Yamamoto.Takaya @ wrc.melco.co.jp> wrote:
> Currently, I am trying to set up the following 3-PC (5-CPU) configuration:
>   Server and compute host: single CPU
>   2 compute hosts: both dual CPU
> 
> I am trying to install with EIT, but
> at Group Creation I do not know how to put the single-CPU PC and
> the dual-CPU PCs into the same group.
> How should I do this?

(It may be faster to edit scorehosts.db directly, but...)
Create two groups.
First, create a group containing only the SMP hosts, and include shmem in it.
Next, create a separate group containing all the hosts, and do not
include shmem in that one.

In the final scorehosts.db the network is specified per host, so
I think you will be able to use all 5 CPUs if you use the latter group.
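
A sketch of the layout this reply describes, with one group for the SMP
hosts and one for all hosts, is given below in shell. Everything in it is
an assumption made for illustration: the host names, the group names
"all" and "smponly", and the host-line layout (borrowed from the
scorehosts.db quoted earlier on this list, with the MSGBSERV macro left
out). group_cpus is a made-up helper that adds up the smp= values of the
hosts in a group.

```shell
# group_cpus GROUP FILE: sum the smp= values of hosts in GROUP.
group_cpus() {
    awk -v g="$1" '
        {
            grp = ""; smp = 1
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^group=/) grp = "," substr($i, 7) ","
                if ($i ~ /^smp=/)   smp = substr($i, 5)
            }
            if (index(grp, "," g ",")) total += smp
        }
        END { print total + 0 }' "$2"
}

demo_db=$(mktemp)
cat > "$demo_db" <<'EOF'
server.example HOST_0 network=ethernet               group=_scoreall_,all         smp=1
comp0.example  HOST_1 network=ethernet,shmem0,shmem1 group=_scoreall_,all,smponly smp=2
comp1.example  HOST_2 network=ethernet,shmem0,shmem1 group=_scoreall_,all,smponly smp=2
EOF
group_cpus all "$demo_db"        # prints 5
group_cpus smponly "$demo_db"    # prints 4
rm -f "$demo_db"
```

Only the two dual-CPU hosts list shmem, while the group that contains all
three hosts adds up to the 5 CPUs asked about.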

                       from Kameyama Toyohisa


From yoneya @ nanolc.jst.go.jp  Fri Mar 28 17:37:07 2003
From: yoneya @ nanolc.jst.go.jp (Makoto Yoneya)
Date: Fri, 28 Mar 2003 17:37:07 +0900
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To: <3131712394.hori0000@swimmy-soft.com>
Message-ID: <BOENJMJEEDNPDCELOJOGAEGFCAAA.yoneya@nanolc.jst.go.jp>

Thanks Hori-san for comments.

> -----Original Message-----
> From: Atsushi HORI
 
> scrun -scored=cmp***,nodes=4 scatter -file /tmp/data.tpr :: mdrun_d 
> -np 4 -deffnm /tmp/data < data.tpr

It improves the situation!
Now the program appears to be running, since not only the elapsed time but
also the CPU time is increasing (previously only the elapsed time increased).
However, even when I tried a very short run (only 5 steps), the job continued
to run for over 30 CPU minutes.
Also, this time there is no STDOUT or STDERR output on the screen,
nor any log files in the invoking directory.
So some file output problems are now occurring.
I still need help!

Makoto Yoneya
JST/ERATO Yokoyama Nano-LC project


From phmaeda @ med.nagoya-cu.ac.jp  Fri Mar 28 17:55:59 2003
From: phmaeda @ med.nagoya-cu.ac.jp (phmaeda @ med.nagoya-cu.ac.jp)
Date: Fri, 28 Mar 2003 17:55:59 +0900
Subject: [SCore-users-jp] EIT installation trouble
Message-ID: <20030328175559.143f88cf.phmaeda@med.nagoya-cu.ac.jp>

My name is Toru Maeda, from the Department of Pharmacy at Nagoya City University Hospital.
I am planning to build a PC cluster to run molecular dynamics calculations on proteins.

When I try to install SCore 5.2 using EIT on a machine with a full installation of Red Hat 7.3, the following dialog appears:
Error Message
No boot configuration files
The installation proceeds even if I ignore this message, but what do the "boot configuration files" in this message refer to?

Also, when I proceed to the compute host installation, the message
Cannot exec daemon/a.out
appears.
Perhaps because of this, on the compute host booted from the floppy disk, the message
No dhcp_server specified. Used Broadcast
appears, and after several retries the installation fails.
I suspect this is because the DHCP server on the server side is not running. Which program corresponds to the DHCP server, and how can I start it manually?

Thank you in advance for your guidance.

Department of Pharmacy, Nagoya City University Hospital
Toru Maeda
phmaeda @ med.nagoya-cu.ac.jp


From Yamamoto.Takaya @ wrc.melco.co.jp  Fri Mar 28 18:55:07 2003
From: Yamamoto.Takaya @ wrc.melco.co.jp (Takaya Yamamoto)
Date: Fri, 28 Mar 2003 18:55:07 +0900
Subject: [SCore-users-jp] Single CPU and dual CPU
In-Reply-To: <20030328072750.CA5202005C@neal.il.is.s.u-tokyo.ac.jp>
References: <"Your message of Fri, 28 Mar 2003 15:48:25 JST."<5.0.2.5.2.20030328153729.033c49f0@133.141.16.40>
Message-ID: <5.0.2.5.2.20030328185453.0333ed58@133.141.16.40>

This is Yamamoto.
Thank you very much.

At 16:27 03/03/28 +0900, kameyama @ pccluster.org wrote:
>This is Kameyama.
>
>In article <5.0.2.5.2.20030328153729.033c49f0 @ 133.141.16.40> Takaya 
>Yamamoto <Yamamoto.Takaya @ wrc.melco.co.jp> wrote:
> > Currently, I am trying to set up the following 3-PC (5-CPU) configuration:
> >   Server and compute host: single CPU
> >   2 compute hosts: both dual CPU
> >
> > I am trying to install with EIT, but
> > at Group Creation I do not know how to put the single-CPU PC and
> > the dual-CPU PCs into the same group.
> > How should I do this?
>
>(It may be faster to edit scorehosts.db directly, but...)
>Create two groups.
>First, create a group containing only the SMP hosts, and include shmem in it.
>Next, create a separate group containing all the hosts, and do not
>include shmem in that one.
>
>In the final scorehosts.db the network is specified per host, so
>I think you will be able to use all 5 CPUs if you use the latter group.
>
>                        from Kameyama Toyohisa
>_______________________________________________
>SCore-users-jp mailing list
>SCore-users-jp @ pccluster.org
>http://www.pccluster.org/mailman/listinfo/score-users-jp


From bogdan.costescu @ iwr.uni-heidelberg.de  Fri Mar 28 20:35:37 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 28 Mar 2003 12:35:37 +0100 (CET)
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To: <BOENJMJEEDNPDCELOJOGAEGFCAAA.yoneya@nanolc.jst.go.jp>
Message-ID: <Pine.LNX.4.44.0303281231270.15859-100000@kenzo.iwr.uni-heidelberg.de>

On Fri, 28 Mar 2003, Makoto Yoneya wrote:

> Also in this time, there are no screen listing of STDOUT or STDERR
> and also any log files on the invoking directory.

I'm sorry, but I don't exactly understand the problem.


From bogdan.costescu @ iwr.uni-heidelberg.de  Fri Mar 28 20:49:01 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 28 Mar 2003 12:49:01 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] SCORE_RSH and use of ssh instead of rsh
In-Reply-To: <200303272336.h2RNabG15621@fritz.dam.brown.edu>
Message-ID: <Pine.LNX.4.44.0303281235540.15859-100000@kenzo.iwr.uni-heidelberg.de>

On Thu, 27 Mar 2003, C. Evangelinos wrote:

> 1) removing the rpms leaves init scripts behind in /etc/rc.d as well
> as the new devices (the latter is not really a problem)

Indeed, the devices are not a problem. However, the scripts should be 
removed, but only if you did not modify them (= rpm -V still reports them 
as "original").
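
As a rough sketch of that check: rpm -V prints one line per file that 
differs from the package database, and a "5" in the third column of the 
flag string means the file contents changed. The helper name 
modified_files and the sample paths below are made up for illustration; 
on a real node you would pipe the output of rpm -V for the package in 
question into the helper instead of the sample text.

```shell
#!/bin/sh
# Hypothetical helper: reads `rpm -V` style lines on stdin and prints the
# paths whose content digest differs (a "5" in the third flag column).
# Paths containing spaces are not handled by this sketch.
modified_files() {
    awk 'substr($1, 3, 1) == "5" { print $NF }'
}

# Sample input in the format printed by `rpm -V`; the paths are made up.
sample='S.5....T c /etc/rc.d/init.d/example_a
.......T c /etc/rc.d/init.d/example_b'

# Only example_a has a changed digest, so only that path is listed.
printf '%s\n' "$sample" | modified_files
```

A script that is not listed has not been edited and can be removed 
together with the package.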

> 2) It would be nice to have a script that reproduces the effects of
> installing the rpms for setting up device and configuration scripts,
> local directories etc.

It exists. You probably didn't read the whole docs...

http://www.pccluster.org/score/dist/score/html/en/installation/sys-compute-fromsrc.html

and if you think the 4-5 commands that have to be executed are too much, 
then you can put them in a script.

> for the case of NFS installations like mine which do not use EIT or the
> RPMS for the compute nodes.

My install attempt a few weeks ago was not on NFS, but it was done without 
the RPMs, so it's certainly possible if you follow those instructions.

> 3) I got SCore to work fine (so far) on a system with a Realtek
> ethernet card (8139too driver).

I'm surprised that it worked at all! The Realtek cards (not including the 
latest C+ variation, which is a different chip driven by the 8139cp 
driver) are not suitable for anything that requires large amounts of 
communication: they need too much CPU intervention for any kind of 
network activity.

> kernel: eth0: Too much work at interrupt, IntrStatus=0x0001.

This message is typical of a system that is too slow to process all the 
incoming packets. And the system can be slowed down a lot just by 
processing these packets!

> Performance between such a box and another one with an Intel eepro100
> driven card is so-so

You are comparing two different things: if you have eepro100 cards for the 
whole cluster, use them!

> Is there a quick fix for that or is rsh hardcoded in too many places in
> the source code?

Well, if you can get ssh to act like rsh (which is normally the case), 
then you can just rename ssh to rsh (or make a link or ...) and everything 
should just work.
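
A minimal sketch of the "make a link" variant, assuming that ssh is 
installed and that the programs which call rsh find it through PATH. 
WRAPDIR and SSH_BIN are names invented here, not SCore settings, and the 
default WRAPDIR is only a scratch directory for trying this out.

```shell
#!/bin/sh
# Sketch: expose ssh under the name "rsh" through a symlink in a private
# directory that is searched before the system directories.
# WRAPDIR and SSH_BIN are hypothetical names; adjust them to your site.
WRAPDIR="${WRAPDIR:-$(mktemp -d)}"
SSH_BIN="${SSH_BIN:-$(command -v ssh || echo /usr/bin/ssh)}"

mkdir -p "$WRAPDIR"
ln -sf "$SSH_BIN" "$WRAPDIR/rsh"

# Put the wrapper directory first so that "rsh" resolves to the link.
PATH="$WRAPDIR:$PATH"
export PATH

# Show where the link points.
ls -l "$WRAPDIR/rsh"
```

Using a separate directory instead of renaming the binary keeps the real 
ssh and any real rsh untouched, so the change is easy to undo.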

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu ＠ IWR.Uni-Heidelberg.De

_______________________________________________
SCore-users mailing list
SCore-users ＠ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users


From bogdan.costescu ＠ iwr.uni-heidelberg.de  Fri Mar 28 20:55:01 2003
From: bogdan.costescu ＠ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 28 Mar 2003 12:55:01 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] SCORE_RSH and use of ssh instead of rsh
In-Reply-To: <20030328010639.A2D6F20058@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <Pine.LNX.4.44.0303281249150.15859-100000@kenzo.iwr.uni-heidelberg.de>

On Fri, 28 Mar 2003 kameyama ＠ pccluster.org wrote:

> scout does not work with SCORE_RSH, because scout uses rsh between compute hosts.
> For example, if you want to scout to 4 hosts (comp0, comp1, comp2, comp3),

OK, but what about setting up SSH to use HostbasedAuthentication or
RhostsRSAAuthentication, with an ssh_known_hosts file that contains the keys
of all nodes distributed to every node in the cluster? Then you can make an
ssh connection from any node to any node, so the scout scheme should work.
I have a non-SCore cluster here that is configured like that; rsh is 
no longer installed on it.
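
Roughly, the pieces involved look like the sketch below. It only writes 
example fragments into a scratch directory and touches nothing under 
/etc. The OUT and NODES variables and the fragment file names are made 
up for illustration, the node names are taken from the example quoted 
above, and where the real files live depends on how OpenSSH was built.

```shell
#!/bin/sh
# Sketch: generate example fragments in a scratch directory.
# OUT and NODES are hypothetical names used only by this sketch.
OUT="${OUT:-$(mktemp -d)}"
NODES="comp0 comp1 comp2 comp3"

# Server side: to be merged into sshd_config on every node.
cat > "$OUT/sshd_config.fragment" <<'EOF'
HostbasedAuthentication yes
EOF

# Client side: to be merged into the system-wide ssh_config on every node.
# EnableSSHKeysign is needed by newer OpenSSH releases; older ones may
# not know the option, in which case that line has to be left out.
cat > "$OUT/ssh_config.fragment" <<'EOF'
EnableSSHKeysign yes
Host *
    HostbasedAuthentication yes
EOF

# shosts.equiv: the hosts that are trusted to vouch for their users,
# one name per line.
for n in $NODES; do
    printf '%s\n' "$n"
done > "$OUT/shosts.equiv"

# ssh_known_hosts needs one public host key line per node; the keys must
# be collected from the nodes themselves, so this sketch only names them.
echo "collect host keys for: $NODES"
ls "$OUT"
```

The same set of files is then copied to every node, which is what makes 
a connection from any node to any other node possible.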

> If scout uses SCORE_RSH, you cannot use ssh-agent to execute scout.

The idea above doesn't use ssh-agent. It's just ssh on client side and 
sshd on server side.

-- 
Bogdan Costescu




From yoneya ＠ nanolc.jst.go.jp  Fri Mar 28 23:31:44 2003
From: yoneya ＠ nanolc.jst.go.jp (yoneya ＠ nanolc.jst.go.jp)
Date: Fri, 28 Mar 2003 23:31:44 +0900(JST)
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To: <Pine.LNX.4.44.0303281231270.15859-100000@kenzo.iwr.uni-heidelberg.de>
References: <Pine.LNX.4.44.0303281231270.15859-100000@kenzo.iwr.uni-heidelberg.de>
Message-ID: <20030328233144.3d78.yoneya@nanolc.jst.go.jp>

Dear Dr. Costescu

Thanks for your comments.

> From the docs at www.gromacs.org, I find that mdrun can read the input 
> from a file specified with the "-s" command line option. Why aren't you 
> specifying it like this ? Then you don't need to redirect stdin.

The option I tried,
mdrun_d -deffnm data,
works the same as
mdrun_d -s data.tpr, since -deffnm specifies the generic name for all
I/O files: not only *.tpr but also the other files such as *.gro.

> If you want to send a file from stdin to the process, then I don't 
> understand exactly why you are copying it first to the nodes.

As far as I know, GROMACS does not read the input data from stdin
but just opens and reads the specified input data file.
As in your comment, only the primary execution node needs to read
the input file (also as far as I know).
However, since I do not know which node will become the primary node,
I tried to copy the input data file to all the nodes in the group.
If I have misunderstood anything above, please point it out.
It would be a great help in solving my problems.

Thanks again.

Makoto Yoneya
Yokoyama Nano-LC project


From bogdan.costescu ＠ iwr.uni-heidelberg.de  Fri Mar 28 23:55:52 2003
From: bogdan.costescu ＠ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 28 Mar 2003 15:55:52 +0100 (CET)
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To: <20030328233144.3d78.yoneya@nanolc.jst.go.jp>
Message-ID: <Pine.LNX.4.44.0303281533140.17331-100000@kenzo.iwr.uni-heidelberg.de>

On Fri, 28 Mar 2003 yoneya ＠ nanolc.jst.go.jp wrote:

> since -deffnm specify the generic name for I/O files

OK, as I'm not a GROMACS user, I missed this in the docs. Then the 
suggestion from Atsushi should work. But wouldn't it be even simpler to take 
the file(s) from the current directory, i.e. why wouldn't it work like:

scrun [options] mdrun_d -deffnm [options]

started in the directory where data.tpr already exists.

> As far as I know, GROMACS does not read the input data from stdin

Sorry, I misinterpreted the line you first posted.

> However, since I do not know which node will become the primary node,
> I tried to copy the input data file to all the nodes in the group.

No, you don't know on which nodes the job will run, but scrun/scatter do.  
So by using "scatter -node 0", stdin will be sent to the first node of the
job, whatever this is.

-- 
Bogdan Costescu





From yoneya ＠ nanolc.jst.go.jp  Sat Mar 29 09:56:19 2003
From: yoneya ＠ nanolc.jst.go.jp (yoneya ＠ nanolc.jst.go.jp)
Date: Sat, 29 Mar 2003 09:56:19 +0900(JST)
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To: <Pine.LNX.4.44.0303281533140.17331-100000@kenzo.iwr.uni-heidelberg.de>
References: <Pine.LNX.4.44.0303281533140.17331-100000@kenzo.iwr.uni-heidelberg.de>
Message-ID: <20030329095619.8c94.yoneya@nanolc.jst.go.jp>

Dear Dr. Costescu :

Thanks again.

>But wouldn't it be even simpler to take 
> the file(s) from the current directory, i.e. why wouldn't it work like:
> 
> scrun [options] mdrun_d -deffnm [options]
> 
> started in the directory where data.tpr already exists.

I tried the above first, but the job appears to hang (its CPU time does
not increase).
It will work in another configuration of the SCore system, but not on
the system I use.

> No, you don't know on which nodes the job will run, but scrun/scatter do.  
> So by using "scatter -node 0", stdin will be sent to the first node of the
> job, whatever this is.

I have not tried this usage of scatter yet.
I'll try it later.

Thanks again!

Makoto Yoneya
JST/ERATO Yokoyama Nano-LC Project


From kameyama ＠ pccluster.org  Mon Mar 31 11:01:31 2003
From: kameyama ＠ pccluster.org (kameyama ＠ pccluster.org)
Date: Mon, 31 Mar 2003 11:01:31 +0900
Subject: [SCore-users-jp] EIT installation trouble
In-Reply-To: Your message of "Fri, 28 Mar 2003 17:55:59 JST."
             <20030328175559.143f88cf.phmaeda@med.nagoya-cu.ac.jp>
Message-ID: <20030331020131.A856420058@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <20030328175559.143f88cf.phmaeda ＠ med.nagoya-cu.ac.jp> phmaeda ＠ med.nagoya-cu.ac.jp wrote:
> When I try to install SCORE5.2 with EIT on a machine with a full
> installation of RedHat7.3, the following dialog appears.
> Error Message
> No boot configuration files
> The installation proceeds even if I ignore this message, but what are the
> "boot configuration files" that this message refers to?

It seems to be tripping over the check for whether
     /opt/score/ndboot/images
contains the files
    100Mbps_Ethernet.lst
    1Gbps_Ethernet.lst 
> 
> Also, when I proceed to the comphost installation, the message
> Cannot exec daemon/a.out
> appears.

This appears to be a failed attempt to start
    /opt/score/libexec/eitd

> Perhaps because of this, on a comphost booted from the FDD, the message
> No dhcp_server specified. Used Broadcast
> appears, and after several retries the installation fails.
> This is probably because the dhcp server on the server side is not running,
> but which program corresponds to the dhcp server?

The eitd mentioned above is what acts as the dhcp server.

Judging from the symptoms, the SCore installation itself does not seem to
have completed successfully.

Most likely the filesystem that holds /opt has filled up...
Please try running
    # df -h /opt
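
The checks mentioned in this reply can be gathered into one small 
script, sketched below. The function name check_eit_install and its 
optional root-directory argument are invented for this example; the 
paths are the ones named above. It only reports what is present and does 
not repair anything.

```shell
#!/bin/sh
# Hypothetical helper: verify that the files named in this thread exist
# below a SCore root directory (default: /opt/score).
check_eit_install() {
    root="${1:-/opt/score}"
    status=0
    for f in ndboot/images/100Mbps_Ethernet.lst \
             ndboot/images/1Gbps_Ethernet.lst; do
        if [ -f "$root/$f" ]; then
            echo "ok: $root/$f"
        else
            echo "missing: $root/$f"
            status=1
        fi
    done
    if [ -x "$root/libexec/eitd" ]; then
        echo "ok: $root/libexec/eitd"
    else
        echo "missing or not executable: $root/libexec/eitd"
        status=1
    fi
    return $status
}

check_eit_install /opt/score || echo "SCore installation looks incomplete"

# Free space on the filesystem holding /opt (falls back to / if /opt
# does not exist on this machine).
df -h /opt 2>/dev/null || df -h /
```

If files are reported missing and df shows the filesystem at or near 
100%, a full disk during installation is the likely cause.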

                       from Kameyama Toyohisa