From bogdan.costescu @ iwr.uni-heidelberg.de Wed Mar 5 03:45:35 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Tue, 4 Mar 2003 19:45:35 +0100 (CET)
Subject: [SCore-users-jp] [SCore-users] Myrinet deadlock
Message-ID:

Dear SCore developers,

When trying to test SCore 5.4, I get what looks like a deadlock when using Myrinet; it appears sooner when Shmem is used as well.

Setup:
The cluster is composed of 8 nodes, each with dual Athlon, 512 MB RAM and older Myrinet (LANai 4) cards. I kept the configuration files from an older (4.2.1) SCore installation which worked flawlessly for more than a year, so I believe that there are no errors in this part. I installed the kernel RPM provided in the distribution, but compiled all the user-level stuff here.

The problem:
When trying to run a job that uses Myrinet with or without Shmem (-nodes=8x2 or -nodes=8x1) the job locks up at random places. When running a job that uses Ethernet (either -nodes=8x1 or -nodes=8x2) the lockup does not occur even if I put more load on the nodes, like starting several jobs at the same time on the same nodes.
When the job is in this state, it can sometimes (but not always) be interrupted with Ctrl-C (if it's still connected to the terminal). But sometimes not even pskill is able to get rid of it: the message indicating that the job is killed appears every time pskill is executed, but the job is still there. At some point SCoreD dies and it's restarted by sc_watch.

Attaching gdb to the job in this state gives something like:

#0 0x082c2702 in shmemReceive ()
#1 0x082b2f4d in composite_attach_context ()
#2 0x0829a485 in MPID_SCORE_Recv_Message ()
#3 0x082999f6 in MPID_SCORE_PIwrecv ()
#4 0x08299754 in MPID_SCORE_PIbrecv ()
#5 0x0829e2b1 in MPID_CH_Check_incoming ()
#6 0x082948d7 in MPID_RecvComplete ()
#7 0x0828a1ff in PMPI_Waitall ()

or

#0 0x082c6080 in myriReceive ()
#1 0x0829a485 in MPID_SCORE_Recv_Message ()
#2 0x082999f6 in MPID_SCORE_PIwrecv ()
#3 0x08299754 in MPID_SCORE_PIbrecv ()
#4 0x0829e2b1 in MPID_CH_Check_incoming ()
#5 0x082948d7 in MPID_RecvComplete ()
#6 0x0828a1ff in PMPI_Waitall ()

from which I assume that this is a deadlock. However, the application that produced this (CHARMM) is very stable and worked flawlessly with older versions of SCore, so deadlocks caused by bad programming in the application are to be excluded.

The /proc/pm/myrinet/0/info file on all nodes indicates 0; we never had any problems with these cards with older SCore versions.

Do you have any idea about what is going on?

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De

_______________________________________________
SCore-users mailing list
SCore-users @ pccluster.org
http://www.pccluster.org/mailman/listinfo/score-users

From s-sumi @ flab.fujitsu.co.jp Wed Mar 5 18:48:14 2003
From: s-sumi @ flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Wed, 05 Mar 2003 18:48:14 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To:
References:
Message-ID: <20030305.184814.596528202.s-sumi@flab.fujitsu.co.jp>

Hi.

The default version of mpich has changed from mpich 1.2.0 to mpich 1.2.4. Could you build mpich 1.2.0 from source and test it?
Once mpich 1.2.0 is installed, you can choose between mpich 1.2.0 and mpich 1.2.4 with the -mpi option.

PS: How about the mpi_zerocopy=on option?

Shinji.

------
Shinji Sumimoto, Fujitsu Labs

From bogdan.costescu @ iwr.uni-heidelberg.de Wed Mar 5 19:30:51 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed, 5 Mar 2003 11:30:51 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: <20030305.184814.596528202.s-sumi@flab.fujitsu.co.jp>
Message-ID:

On Wed, 5 Mar 2003, Shinji Sumimoto wrote:

> The default version of mpich has changed from mpich 1.2.0 to mpich 1.2.4.

Yes, I was aware of this.

> Could you build mpich 1.2.0 from source and test it?

As I built all the user-level stuff from source, I already have mpi-1.2.0. But now I'm wondering how to build the ch_score2 device, as this seems not to be built by default and I wanted to test it as well.

> Once mpich 1.2.0 is installed, you can choose between mpich 1.2.0 and
> mpich 1.2.4 with the -mpi option.

Actually the -mpi option doesn't seem to work, but I have now set my path to include the bin directory of mpi-1.2.0 first.

> PS: How about the mpi_zerocopy=on option?

I tried it and it seemed to lower the chances of locking up, but it still happens.
When it does, I get sometimes: SCORE: Deadlock detected <0:0>SCore: *** SIGNAL EXCEPTION eip=0x08299a6b, cr2=0x 0 *** ... With mpich-1.2.0 I get the same lock-ups. Another thing which is worth mentioning is that whenever the jobs are not interruptible and killable with pskill and SCoreD has to restart, it always takes down one of the nodes. It's not the same node (and with older SCore we didn't have such problem), so now because of this and because of independence of MPI library I start to suspect the kernel-side. I'll try next to see if I can get SCore 4.2.1 to work with a newer kernel (2.4.18-19 or so, maybe some RedHat variant) to see if the problem comes from the newer kernel or from newer SCore. Thank you for any suggestion! -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From hori @ swimmy-soft.com Wed Mar 5 19:46:55 2003 From: hori @ swimmy-soft.com (Atsushi HORI) Date: Wed, 5 Mar 2003 19:46:55 +0900 Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock References: <20030305.184814.596528202.s-sumi@flab.fujitsu.co.jp> Message-ID: <3129738415.hori0008@swimmy-soft.com> Hi. >I'll try next to see if I can get SCore 4.2.1 to work with a newer kernel >(2.4.18-19 or so, maybe some RedHat variant) to see if the problem comes >from the newer kernel or from newer SCore. Another suggestion, no question. Have you recompiled your application program(s) ? ---- Atsushi HORI Swimmy Software, Inc. 
From kameyama @ pccluster.org Wed Mar 5 19:55:16 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Wed, 05 Mar 2003 19:55:16 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: Your message of "Wed, 05 Mar 2003 11:30:51 JST."
Message-ID: <20030305105516.5336F20057@neal.il.is.s.u-tokyo.ac.jp>

In article Bogdan Costescu wrote:
> On Wed, 5 Mar 2003, Shinji Sumimoto wrote:
> now I'm wondering how to build the ch_score2 device as this seems not to
> be built by default and I wanted to test it as well.

Note that ch_score2 has not been supported since SCore 4.1!
ch_score2 uses a PM internal header file, so if you want to compile ch_score2, you must install the SCore source files and set the MADE_CHSCORE2 make variable to yes.

> > Once mpich 1.2.0 is installed, you can choose between mpich 1.2.0 and
> > mpich 1.2.4 with the -mpi option.
>
> Actually the -mpi option doesn't seem to work, but I now set my path to
> include first the bin directory of mpi-1.2.0.

The -mpi option is specified at compile time:
    $ mpicc -mpi mpich-1.2.0 ...
Or please set SCORE_MPI to mpich-1.2.0.

> so now because of this and because of independence of MPI
> library I start to suspect the kernel-side.

Which kernel rpm do you use? There are several SCore 5.4.0 kernel rpms. You are probably using kernel-smp-2.4.19-1SCORE.athlon.rpm or kernel-smp-2.4.19-1SCORE.i686.rpm. Please try switching to the one you have not used yet.
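The two ways of selecting the MPI library described above can be sketched as follows. This is only an illustration: the helper names `mpicc_command` and `mpicc_environment` are hypothetical, and the only things taken from the thread are the `mpicc -mpi mpich-1.2.0 ...` form and the `SCORE_MPI` variable.

```python
import os


def mpicc_command(sources, mpi_name="mpich-1.2.0"):
    """Build the argv for `mpicc -mpi <name> <sources...>`.

    The full library name (for example "mpich-1.2.0") is passed,
    not just a version number such as "1.2.0".
    """
    return ["mpicc", "-mpi", mpi_name] + list(sources)


def mpicc_environment(mpi_name="mpich-1.2.0", base_env=None):
    """Return a copy of the environment with SCORE_MPI set instead."""
    env = dict(os.environ if base_env is None else base_env)
    env["SCORE_MPI"] = mpi_name
    return env


if __name__ == "__main__":
    # Only prints what would be run; mpicc itself is not invoked here.
    print(" ".join(mpicc_command(["hello.c"])))
    print("SCORE_MPI=" + mpicc_environment(base_env={})["SCORE_MPI"])
```

Either result could be handed to `subprocess.run` on a host where SCore's mpicc is installed.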
from Kameyama Toyohisa _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From bogdan.costescu @ iwr.uni-heidelberg.de Wed Mar 5 19:56:45 2003 From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed, 5 Mar 2003 11:56:45 +0100 (CET) Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock In-Reply-To: <3129738415.hori0008@swimmy-soft.com> Message-ID: On Wed, 5 Mar 2003, Atsushi HORI wrote: > Have you recompiled your application program(s) ? Yes, of course. The first phrase in the documentation mentions that there is no binary compatibility with older SCore versions. And as I was using previously 4.2.1 I didn't even attempt to run those binaries with the newer SCore. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From bogdan.costescu @ iwr.uni-heidelberg.de Wed Mar 5 20:24:00 2003 From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed, 5 Mar 2003 12:24:00 +0100 (CET) Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock In-Reply-To: <20030305105516.5336F20057@neal.il.is.s.u-tokyo.ac.jp> Message-ID: On Wed, 5 Mar 2003 kameyama @ pccluster.org wrote: > Note that ch_score2 dose not support from SCore 4.1! I wanted to compile it only for 5.4 to compare stability and speed with ch_score. > ch_score2 use PM internal header file, so if you want to compile > ch_score2, you must install SCore source file and set > MADE_CHSCORE2 make variable to yes. OK, thank you. > -mpi option is specified on compile time. > $ mpicc -mpi mpich-1.2.0 ... 
Ahh, now I see what I did wrong: I tried with: mpicc -mpi 1.2.0 so it's my fault, sorry for the false alarm... > Probably I think you use kernel-smp-2.4.19-1SCORE.athlon.rpm or > kernel-smp-2.4.19-1SCORE.i686.rpm. I tried already both of these (with mpich-1.2.0 only the i686 one). I also tried with the non-SMP athlon kernel and it also locks up; of course as I can't run -nodes=8x2 it takes a bit longer now to lock, but that was what I experienced with the SMP kernels as well. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From bogdan.costescu @ iwr.uni-heidelberg.de Thu Mar 6 02:32:30 2003 From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed, 5 Mar 2003 18:32:30 +0100 (CET) Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock In-Reply-To: Message-ID: On Wed, 5 Mar 2003, Bogdan Costescu wrote: > I'll try next to see if I can get SCore 4.2.1 to work with a newer kernel > (2.4.18-19 or so, maybe some RedHat variant) to see if the problem comes > from the newer kernel or from newer SCore. I managed to patch RedHat's 2.4.18-24 with the kernel patch for SCore 4.2.1 and built the SMP athlon kernel (haven't tested yet the SMP i686 kernel but I'll do it this evening). However with this kernel I also experience the lockups. So, still using SCore 4.2.1 I went back to the 2.4.16-based kernel that I used before and I was again able to run without any lockup for more than 1 hour which already means "stable". So, some problem with the kernel... 
I also tried to boot with "noapic" for both SCore 5.4 with kernel 2.4.19-1SCORE and SCore 4.2.1 with my 2.4.18-24 based kernel to eliminate any doubt about interrupt problems, but this didn't help.

On the other hand, with SCore 5.4 and kernel 2.4.19-1SCORE I was able to use the Ethernet-based communication (with or without shmem) without any problem. Maybe I did not test enough, but on the same time-scale it didn't lock up. This leads me to believe that somehow the new (> 2.4.16) kernels and Myrinet cards do not go well together on our computers...

Does anybody know how the interrupt rate for Myrinet cards compares with the interrupt rate for fast Ethernet (3c59x here) for the same communication needs? Any other ideas?

--
Bogdan Costescu

From kameyama @ pccluster.org Thu Mar 6 09:04:19 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Thu, 06 Mar 2003 09:04:19 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: Your message of "Wed, 05 Mar 2003 18:32:30 JST."
Message-ID: <20030306000419.5D83920054@neal.il.is.s.u-tokyo.ac.jp>

In article Bogdan Costescu wrote:
> On Wed, 5 Mar 2003, Bogdan Costescu wrote:
>
> > I'll try next to see if I can get SCore 4.2.1 to work with a newer kernel
> > (2.4.18-19 or so, maybe some RedHat variant) to see if the problem comes
> > from the newer kernel or from newer SCore.

I forgot to mention the FAQ (http://www.pccluster.org/faq/en/faq-tips/faq.html)
(Category: PM Communication Facility, 6).
Please check whether the IRQ is shared or not.
from Kameyama Toyohisa

From bogdan.costescu @ iwr.uni-heidelberg.de Thu Mar 6 18:58:30 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 6 Mar 2003 10:58:30 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] Myrinet deadlock
In-Reply-To: <20030306000419.5D83920054@neal.il.is.s.u-tokyo.ac.jp>
Message-ID:

On Thu, 6 Mar 2003 kameyama @ pccluster.org wrote:

> Please check whether the IRQ is shared or not.

I did go through the FAQ... The interrupts are not shared; as I have only a few devices in the computer (one IDE disk, one 3c905C, one Myrinet card), each has its own interrupt:

           CPU0       CPU1
  0:    3202112    3042292    IO-APIC-edge  timer
  1:          2          1    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  5:    1264609    1260821   IO-APIC-level  eth0
  8:          0          1    IO-APIC-edge  rtc
 11:          0          0   IO-APIC-level  usb-ohci
 12:     211923     162138   IO-APIC-level  myri
 14:      11142       8059    IO-APIC-edge  ide0

For non-SMP kernels, they are also not shared.

I'll try next to see if I can get GM to work with the same kernel (RH 2.4.18-24) and whether I get some bad behaviour there as well, although given the different MPI implementation I guess that the communication needs (rate of interrupts, PCI bus usage, etc.) will be different.
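The "is any IRQ shared?" check above can also be done mechanically. The following is a minimal sketch, assuming the usual /proc/interrupts layout in which a shared IRQ lists several device names on one line, separated by commas; the function name `find_shared_irqs` is hypothetical and the sample text is the listing quoted in this message.

```python
SAMPLE = """\
           CPU0       CPU1
  0:    3202112    3042292    IO-APIC-edge  timer
  1:          2          1    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  5:    1264609    1260821   IO-APIC-level  eth0
  8:          0          1    IO-APIC-edge  rtc
 11:          0          0   IO-APIC-level  usb-ohci
 12:     211923     162138   IO-APIC-level  myri
 14:      11142       8059    IO-APIC-edge  ide0
"""


def find_shared_irqs(interrupts_text):
    """Return {irq_number: [device, ...]} for IRQs used by several devices."""
    shared = {}
    lines = interrupts_text.strip().splitlines()
    if not lines:
        return shared
    # The header line names one column per CPU (CPU0, CPU1, ...).
    n_cpus = sum(1 for token in lines[0].split() if token.startswith("CPU"))
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip non-numeric rows such as NMI or ERR
        fields = rest.split()
        # Expected fields: per-CPU counters, controller type, device names.
        if len(fields) < n_cpus + 2:
            continue
        devices = " ".join(fields[n_cpus + 1:])
        names = [name.strip() for name in devices.split(",") if name.strip()]
        if len(names) > 1:
            shared[int(label)] = names
    return shared


if __name__ == "__main__":
    print(find_shared_irqs(SAMPLE))
```

On a live node the same function could be fed the contents of /proc/interrupts instead of the sample text.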
-- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From bogdan.costescu @ iwr.uni-heidelberg.de Fri Mar 7 04:58:42 2003 From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu) Date: Thu, 6 Mar 2003 20:58:42 +0100 (CET) Subject: [SCore-users-jp] [SCore-users] Kernel oops Message-ID: Dear SCore developers, I've postponed trying to test GM on our nodes as I have observed that whenever SCoreD crashes and takes with it one node there is also an Oops displayed on the node. This is with the SCore 4.2.1 kernel patch applied to RH 2.4.18-24, so it might be some error that I have introduced, but the behaviour (SCoreD taking down one node) is the same with SCore 5.4 and kernel 2.4.19-1SCORE which I plan to test tomorrow. 
So, the (decoded) Oops looks like this:

EIP is at __wake_up [kernel] 0x3c (2.4.18-24SCORE)
eax: c041c998   ebx: c25a4d80   ecx: 00000000   edx: 00000000
esi: 00000001   edi: c041c994   ebp: c25abf1c   esp: c25abf08
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c25ab000)
Stack: 00000282 00000001 c041c96c c041c840 c041c994 00000002 c019ec9a c25a4d80
       00000001 dbdfa015 00000010 c010a6e3 00000010 c041c840 c25abf7c c25abf7c
       c0398000 00000010 c25a4d80 c010a872 00000010 c25abf7c c25a4d80 00000001
Call trace:
[] myri_pm_intr [kernel] 0x7a (0xc25abf20))
[] handle_IRQ_event [kernel] 0x5e (0xc25abf34))
[] do_IRQ [kernel] 0xc2 (0xc25abf54))
[] default_idle [kernel] 0x0 (0xc25abf68))
[] default_idle [kernel] 0x0 (0xc25abf74))
[] call_do_IRQ [kernel] 0x5 (0xc25abf78))
[] default_idle [kernel] 0x0 (0xc25abf7c))
[] default_idle [kernel] 0x0 (0xc25abf90))
[] default_idle [kernel] 0x29 (0xc25abfa4))
[] cpu_idle [kernel] 0x32 (0xc25abfb0))
[] call_console_drivers [kernel] 0xeb (0xc25abfd0))
[] printk [kernel] 0x129 (0xc25abffc))
Code: 8b 02 85 45 f0 74 ed 6a 00 52 e8 75 f0 ff ff 5a 85 c0 59 74

Using defaults from ksymoops -t elf32-i386 -a i386

Trace; c019ec9a
Trace; c010a63e
Trace; c010a872
Trace; c0106e60
Trace; c0106e60
Trace; c010d098
Trace; c0106e60
Trace; c0106e60
Trace; c0106e89
Trace; c0106f02
Trace; c011dafb
Trace; c011dca9
Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   8b 02             mov    (%edx),%eax
Code;  00000002 Before first symbol
   2:   85 45 f0          test   %eax,0xfffffff0(%ebp)
Code;  00000005 Before first symbol
   5:   74 ed             je     fffffff4 <_EIP+0xfffffff4> fffffff4
Code;  00000007 Before first symbol
   7:   6a 00             push   $0x0
Code;  00000009 Before first symbol
   9:   52                push   %edx
Code;  0000000a Before first symbol
   a:   e8 75 f0 ff ff    call   fffff084 <_EIP+0xfffff084> fffff084
Code;  0000000f Before first symbol
   f:   5a                pop    %edx
Code;  00000010 Before first symbol
  10:   85 c0             test   %eax,%eax
Code;  00000012 Before first symbol
  12:   59                pop    %ecx
Code;  00000013 Before first symbol
  13:   74 00             je     15 <_EIP+0x15> 00000015 Before first symbol

<0>Kernel panic: Aiee, killing interrupt handler!

Today I was able to reproduce this Oops several times on different nodes. The trace is always the same, except for the line(s) after cpu_idle, which can be replaced by:

[] stext [kernel] 0x0 (...))

I looked a bit through the code but I don't really understand Myrinet programming too well, so maybe this gives you some idea. Spurious interrupts? Lost interrupts? I'm still not comfortable with the interrupt state on my machines, as they have Tyan 760MP boards which are known for instabilities.

Anyway, as I said, I plan to try tomorrow with SCore 5.4 and kernel 2.4.19-1SCORE to see if the locks there are also associated with such Oopses.

--
Bogdan Costescu

From kate @ pfu.fujitsu.com Sun Mar 9 16:13:10 2003
From: kate @ pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Sun, 09 Mar 2003 16:13:10 +0900
Subject: [SCore-users-jp] SCore 5.4.0 binary RPM install
Message-ID: <200303090713.AA12082@flash.tokyo.pfu.co.jp>

This is Katayama from PFU. Thank you for your continued support.

I am installing SCore 5.4.2 from the binary RPMs, but installing the SCore kernel on a compute host fails with an error.

---- begin ----
[root @ comp1 RPMS]# rpm -Uvh kernel-2.4.19-1SCORE.i686.rpm
error: failed dependencies:
        kernel-drm = 4.2.0 is needed by XFree86-4.2.0-8
[root @ comp1 RPMS]# rpm -q --provides -p kernel-2.4.19-1SCORE.i686.rpm
module-info
kernel = 2.4.19
kernel-drm = 4.1.0
kernel = 2.4.19-1SCORE
[root @ comp1 RPMS]# rpm -q --provides kernel
module-info
kernel = 2.4.18
kernel-drm = 4.1.0
kernel-drm = 4.2.0
kernel = 2.4.18-3
[root @ comp1 RPMS]# rpm -q --requires XFree86
Glide3
XFree86-xfs = 4.2.0
XFree86-libs = 4.2.0
XFree86-base-fonts = 4.2.0-8
/etc/pam.d/system-auth
kernel-drm = 4.2.0
/bin/ln
/usr/sbin/chkfontpath
---- end ----

For now I continued the installation with --nodeps. Is there any problem with doing so?

I do not know whether it is related, but rpmtest has become extremely slow.

---- begin ----
[root @ server sbin]# ./rpmtest bioinfo1 ethernet -reply &
[1] 4520
[root @ server sbin]# time ./rpmtest bioinfo2 ethernet -dest 0 -ping
8 0.0028161

real 4m42.299s
user 0m0.008s
sys 0m0.000s
[root @ server sbin]#
---- end ----
(Note) comp1 → HOST_0, comp2 → HOST_1

The Ethernet driver was e100, so I changed it to eepro100, and the time went back to normal.

---- begin ----
[root @ server sbin]# ./rpmtest bioinfo1 ethernet -reply &
[1] 4603
[root @ server sbin]# time ./rpmtest bioinfo2 ethernet -dest 0 -ping
8 7.59142e-05

real 0m8.165s
user 0m0.006s
sys 0m0.006s
[root @ server sbin]#
---- end ----

Is this simply a driver problem? Or was there a problem with the SCore kernel installation?

For reference, the compute hosts are four FMV-C600 machines (Pentium 4 2.4 GHz, i845G, 512 MB).

Thank you in advance.

--
PFU Limited, OSSC, Linux Systems Department
KATAYAMA Yoshio
Tel 044-520-6617 Fax 044-556-1022

From nrcb @ streamline-computing.com Sun Mar 9 18:48:28 2003
From: nrcb @ streamline-computing.com (Nick Birkett)
Date: Sun, 9 Mar 2003 09:48:28 +0000
Subject: [SCore-users-jp] [SCore-users] score 5.4 build problems
Message-ID: <200303090948.28249.nrcb@streamline-computing.com>

Dear SCore users,

I have upgraded to RedHat 7.3 (the original version, without updates) and installed SCore 5.4. Kernel 2.4.19-1SCORE is installed as binary and source.
I am getting errors when I try the build using source packages:

score-5.4.0.build.tar.gz
score-5.4.0.score.tar.gz
score-5.4.0.mpi.tar.gz
score-5.4.0.utils.tar.gz

./configure works without error, but make gives:

PWD=/opt/score/score-src/SCore/pm2/arch/composite
+ make -w BUILD=/opt/score/score-src/SCore/build host_nickname=i386-redhat7-linux2_4 DIST= all
make[4]: Entering directory `/raid0/opt/score5.4.0/score-src/SCore/pm2/arch/composite'
cd obj.i386-redhat7-linux2_4;VPATH=.. make all BUILD=/opt/score/score-src/SCore/build host_nickname=i386-redhat7-linux2_4 DIST=
make[5]: Entering directory `/raid0/opt/score5.4.0/score-src/SCore/pm2/arch/composite/obj.i386-redhat7-linux2_4'
/usr/bin/gcc `if grep Unportable /usr/include/asm/spinlock.h> /dev/null; then echo -I/usr/src/linux-2.4/include; fi` -O2 `case i386-unknown-linux in sparc-*-*) echo -Dsparc;; i386-*-*) echo -Di386 -m486;; alpha-*-*) echo -Dalpha;; esac` `case i386-unknown-linux in *-*-sunos4*) echo -Dsunos4;; *-*-netbsd*) echo -Dnetbsd;; *-*-linux*) echo -Dlinux;; *-*-osf*) echo -Dosf1_linux -I/usr/local/linux/linux.include;; esac` -Wall `case i386-unknown-linux in alpha-*-linux*) echo -pipe -ffixed-8 -mcpu=ev5 -Wa,-mev6 ;; esac` -I../../../include `if grep Unportable /usr/include/asm/spinlock.h>/dev/null; then echo -I/usr/src/linux-2.4/include; fi` -o pm_composite.o -c ../pm_composite.c
In file included from /usr/include/linux/spinlock.h:35,
                 from ../../../include/pm_lock.h:79,
                 from ../pm_composite.c:79:
/usr/include/asm/spinlock.h: In function `read_lock':
/usr/include/asm/spinlock.h:168: `LOCK' undeclared (first use in this function)
/usr/include/asm/spinlock.h:168: (Each undeclared identifier is reported only once
/usr/include/asm/spinlock.h:168: for each function it appears in.)
/usr/include/asm/spinlock.h:168: parse error before string constant
/usr/include/asm/spinlock.h:168: parse error before `:'
/usr/include/asm/spinlock.h: In function `write_lock':
/usr/include/asm/spinlock.h:177: `LOCK' undeclared (first use in this function)
/usr/include/asm/spinlock.h:177: parse error before string constant
/usr/include/asm/spinlock.h:177: parse error before `:'
/usr/include/asm/spinlock.h: In function `write_trylock':
/usr/include/asm/spinlock.h:186: warning: implicit declaration of function `atomic_sub_and_test'
/usr/include/asm/spinlock.h:188: warning: implicit declaration of function `atomic_add'
../pm_composite.c: At top level:

Any help appreciated.

Cheers,

Nick

From bogdan.costescu @ iwr.uni-heidelberg.de Mon Mar 10 04:50:01 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Sun, 9 Mar 2003 20:50:01 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] score 5.4 build problems
In-Reply-To: <200303090948.28249.nrcb@streamline-computing.com>
Message-ID:

On Sun, 9 Mar 2003, Nick Birkett wrote:

> I am getting errors when I try the build using source packages:
>
> score-5.4.0.build.tar.gz
> score-5.4.0.score.tar.gz
> score-5.4.0.mpi.tar.gz
> score-5.4.0.utils.tar.gz

As I wrote in my earlier messages, I did compile all the user-level stuff (actually only what I needed, not all packages, but those above were included). I did not encounter any such problem... My system is RH 7.2-based with pretty much all updates installed.
--
Bogdan Costescu

From kameyama @ pccluster.org Mon Mar 10 09:30:19 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Mon, 10 Mar 2003 09:30:19 +0900
Subject: [SCore-users-jp] Re: [SCore-users] score 5.4 build problems
In-Reply-To: Your message of "Sun, 09 Mar 2003 09:48:28 JST." <200303090948.28249.nrcb@streamline-computing.com>
Message-ID: <20030310003019.EAADB20054@neal.il.is.s.u-tokyo.ac.jp>

In article <200303090948.28249.nrcb @ streamline-computing.com> Nick Birkett wrote:
> c-*-*) echo -Dsparc;; i386-*-*) echo -Di386 -m486;; alpha-*-*) echo -Dalpha;;
> esac` `case i386-unknown-linux in *-*-sunos4*) echo -Dsunos4;; *-*-netbsd*)
> echo -Dnetbsd;; *-*-linux*) echo -Dlinux;; *-*-osf*) echo -Dosf1_linux -I/usr
> /local/linux/linux.include;; esac` -Wall `case i386-unknown-linux in alpha-*
> -linux*) echo -pipe -ffixed-8 -mcpu=ev5 -Wa,-mev6 ;; esac` -I../../../incl
> ude `if grep Unportable /usr/include/asm/spinlock.h>/dev/null; then echo -I/
> usr/src/linux-2.4/include; fi` -o pm_composite.o -c ../pm_composite.c
> In file included from /usr/include/linux/spinlock.h:35,
> from ../../../include/pm_lock.h:79,
> from ../pm_composite.c:79:
> /usr/include/asm/spinlock.h: In function `read_lock':

The SCore source needs the kernel header files, but on Red Hat 7.3 (or later) /usr/include/{asm,linux} does not provide the full kernel header files.
Please install the kernel-source rpm on the server host.
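The build command quoted above decides whether to add -I/usr/src/linux-2.4/include by grepping /usr/include/asm/spinlock.h for the word "Unportable". A minimal sketch of that same decision, plus a check that the chosen include directory actually exists, might look like this; the function names are hypothetical and the default paths are simply the ones that appear in the build log.

```python
import os


def extra_include_flags(header_path="/usr/include/asm/spinlock.h",
                        kernel_include="/usr/src/linux-2.4/include"):
    """Mirror the shell test in the build log.

    Returns ["-I<kernel_include>"] only if the header contains the
    word "Unportable", otherwise an empty list.
    """
    try:
        with open(header_path, errors="replace") as header:
            marked = any("Unportable" in line for line in header)
    except OSError:
        marked = False  # grep on a missing file fails too, so no flag
    return ["-I" + kernel_include] if marked else []


def missing_include_dirs(flags):
    """Return the -I directories from flags that do not exist on this host."""
    dirs = [flag[2:] for flag in flags if flag.startswith("-I")]
    return [d for d in dirs if not os.path.isdir(d)]


if __name__ == "__main__":
    flags = extra_include_flags()
    print("extra flags:", flags)
    print("missing include directories:", missing_include_dirs(flags))
```

If the second list is not empty, the kernel-source rpm is a likely candidate for what is missing, as suggested above.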
from Kameyama Toyohisa _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From emile.carcamo @ nec.fr Mon Mar 10 16:00:57 2003 From: emile.carcamo @ nec.fr (Emile CARCAMO) Date: Mon, 10 Mar 2003 08:00:57 +0100 Subject: [SCore-users-jp] [SCore-users] rcp-all usage with SCore 5.4 Message-ID: <200303100701.h2A70vZG006018@emilepc.ess.nec.fr> Hello, I just noticed the following trouble when using rcp-all after installing brand new version 5.4 : [ecarcamo]<117>rsh-all -g essfrance uname -a node01.ess.nec.fr node02.ess.nec.fr node01.ess.nec.fr: Linux node01.ess.nec.fr 2.4.19-1SCORE #1 Wed Feb 5 14:10:38 JST 2003 i686 unknown node02.ess.nec.fr: Linux node02.ess.nec.fr 2.4.19-1SCORE #1 Wed Feb 5 14:10:38 JST 2003 i686 unknown [ecarcamo]<118> [ecarcamo]<118>rcp-all /etc/printcap essfrance:/tmp SCOUT: Spawning done. [node01-2]: if: Expression Syntax. SCOUT: Session done. [ecarcamo]<119>which rcp-all /opt/score/bin/rcp-all [ecarcamo]<120> This command always worked fine so far... Thanks in advance for the help, and best regards. -- Emile_CARCAMO NEC High Performance http://www.hpce.nec.com System Engineer Computing Europe mailto:ecarcamo @ hpce.nec.com (+33)6-8063-7003 GSM (+33)1-3930-6601 FAX / Your mouse has moved. Windows NT must be restarted \ (+33)1-3930-6613 PHONE \ for the change to take effect. Reboot now? [ OK ] / _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From kameyama @ pccluster.org Mon Mar 10 16:52:31 2003 From: kameyama @ pccluster.org (=?iso-2022-jp?b?a2FtZXlhbWEgGyRCIXcbKEIgcGNjbHVzdGVyLm9yZw==?=) Date: Mon, 10 Mar 2003 16:52:31 +0900 Subject: [SCore-users-jp] Re: [SCore-users] rcp-all usage with SCore 5.4 In-Reply-To: Your message of "Mon, 10 Mar 2003 08:00:57 JST." 
<200303100701.h2A70vZG006018@emilepc.ess.nec.fr>
Message-ID: <20030310075231.DF8B220054@neal.il.is.s.u-tokyo.ac.jp>

In article <200303100701.h2A70vZG006018 @ emilepc.ess.nec.fr> Emile CARCAMO wrote:
> I just noticed the following trouble when using
> rcp-all after installing brand new version 5.4 :
>
> [ecarcamo]<117>rsh-all -g essfrance uname -a
> node01.ess.nec.fr
> node02.ess.nec.fr
> node01.ess.nec.fr: Linux node01.ess.nec.fr 2.4.19-1SCORE #1 Wed Feb 5 14:10:38 JST 2003 i686 unknown
> node02.ess.nec.fr: Linux node02.ess.nec.fr 2.4.19-1SCORE #1 Wed Feb 5 14:10:38 JST 2003 i686 unknown
> [ecarcamo]<118>
> [ecarcamo]<118>rcp-all /etc/printcap essfrance:/tmp
> SCOUT: Spawning done.
> [node01-2]:
> if: Expression Syntax.
> SCOUT: Session done.

Sorry, rcp-all does not work if the compute host's login shell is csh or tcsh.
Please apply this patch to rcp-all.

from Kameyama Toyohisa
---------------------------------------cut here---------------------------------
Index: rcp-all.pl
===================================================================
RCS file: /develop/cvsroot/score-src/program/utils/rcp-all/rcp-all.pl,v
retrieving revision 1.10
retrieving revision 1.11
diff -u -r1.10 -r1.11
--- rcp-all.pl 25 Oct 2002 04:44:57 -0000 1.10
+++ rcp-all.pl 10 Mar 2003 07:42:42 -0000 1.11
@@ -82,9 +82,9 @@
 my ($file) = @files[0];
 $file_basename = basename($file);
 $filemode = sprintf "%03o", (stat($file))[2] & 0777;
- $scout_command = "scout -wait -g $group -re '\"if [ -d $remote_dir ]; then cat > "
- . "$remote_dir/$file_basename;chmod $filemode $remote_dir/$file_basename;"
- . "else cat > $remote_dir;chmod $filemode $remote_dir; fi\"'";
+ $scout_command = "scout -wait -g $group -re '\"[ -d $remote_dir ] && (cat > "
+ . "$remote_dir/$file_basename;chmod $filemode $remote_dir/$file_basename);"
+ . "[ -d $remote_dir ] || (cat > $remote_dir;chmod $filemode $remote_dir) \"'";
 open(REMOTE, "|$scout_command") or Error("Cannot exec scout command $!");
 open(LOCAL, $file) or Error("Cannot open $file $!");
 while(sysread(LOCAL, $_, $bufsize)) {
---------------------------------------cut here---------------------------------

From emile.carcamo @ nec.fr Mon Mar 10 17:20:53 2003
From: emile.carcamo @ nec.fr (Emile CARCAMO)
Date: Mon, 10 Mar 2003 09:20:53 +0100
Subject: [SCore-users-jp] Re: [SCore-users] rcp-all usage with SCore 5.4
In-Reply-To: Your message of "Mon, 10 Mar 2003 17:19:16 +0900." <20030310081916.3001420054@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200303100820.h2A8Krkf002374@emilepc.ess.nec.fr>

> > Does this patch apply on following file :
> >
> > /opt/score5.4.0/bin/bin.i386-redhat7-linux2_4/rcp-all.exe
>
> This patch was made for the rcp-all source file,
> but you can apply it to rcp-all.exe directly.
> When patch asks "File to patch:", specify rcp-all.exe.
>
> Please issue the following commands:
> % cd /opt/score5.4.0/bin/bin.i386-redhat7-linux2_4/
> % patch < patch_file (or my previous mail)
> File to patch: rcp-all.exe
>                ~~~~~~~~~~~
>

Thanks a lot Kameyama-san, problem is fixed now !! This patch also
helps if your login shell is bash, AFAIK.

Best regards,

-- 
Emile_CARCAMO NEC High Performance http://www.hpce.nec.com
System Engineer Computing Europe mailto:ecarcamo @ hpce.nec.com
(+33)6-8063-7003 GSM
(+33)1-3930-6601 FAX / Your mouse has moved. Windows NT must be restarted \
(+33)1-3930-6613 PHONE \ for the change to take effect. Reboot now?
[ OK ] / _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From uebayasi @ pultek.co.jp Mon Mar 10 22:10:33 2003 From: uebayasi @ pultek.co.jp (Masao Uebayashi) Date: Mon, 10 Mar 2003 22:10:33 +0900 (JST) Subject: [SCore-users-jp] [SCore-users] Question about Modified Ack/Nack Message-ID: <20030310.221033.60048924.uebayasi@pultek.co.jp> Hello. In terms of Modified Ack/Nack, what should be done if the following situations happen? For example, a note S sends messages 0, 1, 2, 3 to another note R. a) R receives 1, 0, 2, 3. b) R receives 1, 3. Thanks in advance. Masao _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From nrcb @ streamline-computing.com Mon Mar 10 22:24:10 2003 From: nrcb @ streamline-computing.com (Nick Birkett) Date: Mon, 10 Mar 2003 13:24:10 +0000 Subject: [SCore-users-jp] [SCore-users] Resource problem Message-ID: <200303101324.10326.nrcb@streamline-computing.com> Hi - I am getting a funny resource problem on a machine that has been running Score 5.0.1 for 200 days. I can run rpmtest, scstest between 2 nodes. I can run a parallel job on each node separately, but when I try to run both hosts together I get a Resource unavailable error. A Myrnet line card was replaced on Friday. Could it be a cable is in the wrong hole (but surely rpmtest and scstest would not then work )? I have tried restarting scoreboard and msgbserv and rebooting comp29,30. Anyone have an idea about this ? ------------------------------------------------------------------------ [nrcb @ saturn mpi]$ cat hosts comp29.ex.ac.uk comp30.ex.ac.uk [nrcb @ saturn mpi]$ scout -wait -F hosts -e scrun -nodes=2 ./jacobi_mpi SCOUT: Spawning done. SCore-D 5.0.1 connected (jid=70). <0:0> SCORE: 2 nodes (1x2) ready. 
Running with nprocs= 2 Array size nxg,nyg = 1024 1024 Iteration count = 1024 Running with nprocs= 2 cpus= 2: Iteration = 10 8.66808374E+12 cpus= 2: Iteration = 20 8.61407852E+12 WORKS [nrcb @ saturn mpi]$ scout -wait -F hosts -e scrun -nodes=4 ./jacobi_mpi SCOUT: Spawning done. FEP:ERROR SCore-D Login failed: Resource unavailable. SCOUT: Session done. DOESNT WORK [nrcb @ saturn mpi]$ cat hosts comp30.ex.ac.uk comp29.ex.ac.uk [nrcb @ saturn mpi]$ scout -wait -F hosts -e scrun -nodes=2 ./jacobi_mpi SCOUT: Spawning done. SCore-D 5.0.1 connected (jid=72). <0:0> SCORE: 2 nodes (1x2) ready. Running with nprocs= 2 Array size nxg,nyg = 1024 1024 Iteration count = 1024 Running with nprocs= 2 cpus= 2: Iteration = 10 8.66808374E+12 cpus= 2: Iteration = 20 8.61407852E+12 cpus= 2: Iteration = 30 8.57295514E+12 WORKS [nrcb @ saturn mpi]$ scout -wait -F hosts -e scrun -nodes=4 ./jacobi_mpi SCOUT: Spawning done. FEP:ERROR SCore-D Login failed: Resource unavailable. SCOUT: Session done. DOESNT WORK The jacob_mpi application is a standard one that works up to 64 processes. _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From jodell @ ad.brown.edu Tue Mar 11 03:32:43 2003 From: jodell @ ad.brown.edu (James O'Dell) Date: 10 Mar 2003 13:32:43 -0500 Subject: [SCore-users-jp] [SCore-users] Debugger Message-ID: <1047321162.1540.4.camel@cr1> I'm a new user of the SCORE system and have a couple of questions about the debugger. 1) Is there anyway to tell score to simply dump core, rather than trying to invoke GDB? 2) When I look at the stack trace in GDB, all I seem to see are functions related to score. How do I display the stack of user function invocations? 
Thanks,

Jim

From kameyama @ pccluster.org Tue Mar 11 08:49:12 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Tue, 11 Mar 2003 08:49:12 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Resource problem
In-Reply-To: Your message of "Mon, 10 Mar 2003 13:24:10 JST." <200303101324.10326.nrcb@streamline-computing.com>
Message-ID: <20030310234912.44EB320054@neal.il.is.s.u-tokyo.ac.jp>

In article <200303101324.10326.nrcb @ streamline-computing.com> Nick Birkett wrote:
> [nrcb @ saturn mpi]$ scout -wait -F hosts -e scrun -nodes=2 ./jacobi_mpi
> SCOUT: Spawning done.
> SCore-D 5.0.1 connected (jid=70).
> <0:0> SCORE: 2 nodes (1x2) ready.

I think the program connected to SCore-D in multi-user mode (running on a 1x2 host),
because in single-user mode the jid is not displayed.
Please check your SCORE_OPTIONS environment variable.

from Kameyama Toyohisa

From hori @ swimmy-soft.com Tue Mar 11 09:45:58 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Tue, 11 Mar 2003 09:45:58 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Debugger
In-Reply-To: <1047321162.1540.4.camel@cr1>
References: <1047321162.1540.4.camel@cr1>
Message-ID: <3130220758.hori0000@swimmy-soft.com>

Hi.

>I'm a new user of the SCORE system and have a couple of questions about
>the debugger.
>
>1) Is there anyway to tell score to simply dump core, rather than trying
>to invoke GDB?

No. If your program runs on 100 processors, do you really need 100 core files?

>2) When I look at the stack trace in GDB, all I seem to see are
>functions related to score. How do I display the stack of user function
>invocations?

There are two possible cases.
1) Your program was not compiled with the debug (-g) option, or the executable
   file was stripped and its symbol information lost.
2) The exception signal was raised in the SCore runtime library itself.

BTW, which parallel library (MPI?) or parallel language (OpenMP?) are you using?

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.

From uebayasi @ pultek.co.jp Tue Mar 11 13:47:56 2003
From: uebayasi @ pultek.co.jp (Masao Uebayashi)
Date: Tue, 11 Mar 2003 13:47:56 +0900 (JST)
Subject: [SCore-users-jp] [SCore-users] Re: Question about Modified Ack/Nack
In-Reply-To: <20030310.221033.60048924.uebayasi@pultek.co.jp>
References: <20030310.221033.60048924.uebayasi@pultek.co.jp>
Message-ID: <20030311.134756.125114941.uebayasi@pultek.co.jp>

> In terms of Modified Ack/Nack, what should be done if the following
> situations happen?
>
> For example, a node S sends messages 0, 1, 2, 3 to another node R.
>
> a) R receives 1, 0, 2, 3.
>
> b) R receives 1, 3.

I looked at PM/Myrinet. I can understand its behavior if Myrinet
preserves packet order, but I'm not sure.

Masao

From s-sumi @ flab.fujitsu.co.jp Tue Mar 11 14:11:11 2003
From: s-sumi @ flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Tue, 11 Mar 2003 14:11:11 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] Question about Modified Ack/Nack
In-Reply-To: <20030310.221033.60048924.uebayasi@pultek.co.jp>
References: <20030310.221033.60048924.uebayasi@pultek.co.jp>
Message-ID: <20030311.141111.35022320.s-sumi@flab.fujitsu.co.jp>

Hi.
From: Masao Uebayashi
Subject: [SCore-users] Question about Modified Ack/Nack
Date: Mon, 10 Mar 2003 22:10:33 +0900 (JST)
Message-ID: <20030310.221033.60048924.uebayasi @ pultek.co.jp>

uebayasi> Hello.
uebayasi>
uebayasi> In terms of Modified Ack/Nack, what should be done if the following
uebayasi> situations happen?
uebayasi>
uebayasi> For example, a node S sends messages 0, 1, 2, 3 to another node R.
uebayasi>
uebayasi> a) R receives 1, 0, 2, 3.
uebayasi>
uebayasi> b) R receives 1, 3.
uebayasi>

In both cases, all messages except 0 are discarded.
The receiver sends a nack to the sender.

a) R receives 1, 0, 2, 3.
              x  o  x  x

b) R receives 1, 3.
              x  x

o: received
x: discarded

Shinji.

uebayasi> Masao
uebayasi>
------
Shinji Sumimoto, Fujitsu Labs

From kate @ pfu.fujitsu.com Tue Mar 11 16:17:00 2003
From: kate @ pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Tue, 11 Mar 2003 16:17:00 +0900
Subject: [SCore-users-jp] _IceTransSocketUNIXConnect
Message-ID: <200303110717.AA13354@flash.tokyo.pfu.co.jp>

This is Katayama from PFU.
Thank you for your continued support.

When I run demo/bin/mandel on SCore 5.4.0, I get the following:

-------- from here --------
[root @ server tmp]# scrun -monitor /opt/score/demo/bin/mandel
SCore-D 5.4.0 connected.
<0:0> SCORE: 8 nodes (8x1) ready.
_IceTransSocketUNIXConnect: Cannot connect to non-local host bioinfo0.envi.osakafu-u.ac.jp
Warning: Tried to connect to session manager, Could not open network socket
:: -size 320x240 -re 0.000000 -im 0.000000 -radius 2.000000
end: 0 sec 59 msec 305 usec
-------- to here --------

The warning message above is printed. What causes it?

I reinstalled the server host today; before that (my memory is vague)
I do not think this message appeared.

Sorry for the trouble, and thank you in advance.

PS
I would also appreciate a reply to this message:
Date: Sun, 09 Mar 2003 16:13:10 +0900
From: KATAYAMA Yoshio
Subject: [SCore-users-jp] SCore 5.4.0 binary RPM install

-- 
KATAYAMA Yoshio, Linux Systems Dept., OSSC, PFU Limited
Tel 044-520-6617 Fax 044-556-1022

From nrcb @ streamline-computing.com Tue Mar 11 16:27:19 2003
From: nrcb @ streamline-computing.com (Nick Birkett)
Date: Tue, 11 Mar 2003 07:27:19 +0000
Subject: [SCore-users-jp] Re: [SCore-users] score 5.4 build problems
In-Reply-To: <20030311004059.21E2020054@neal.il.is.s.u-tokyo.ac.jp>
References: <20030311004059.21E2020054@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200303110727.19138.nrcb@streamline-computing.com>

On Tuesday 11 March 2003 12:40 am, you wrote:
> > > But redhat 7.3 (or later) /usr/include/{asm,linux} dose not support
> > > full kernel header file.
> > > Please install kernel-source rpm on server host.

Many thanks. I re-installed the SCore kernel and now have the link

lrwxrwxrwx 1 root root 19 Mar 11 07:17 linux-2.4 -> linux-2.4.19-1SCORE

Then: make menuconfig in linux-2.4.19-1SCORE, save and exit.

I guess I have too many kernels and links on my system !!

Regards,

Nick

From kameyama @ pccluster.org Tue Mar 11 18:40:20 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Tue, 11 Mar 2003 18:40:20 +0900
Subject: [SCore-users-jp] SCore 5.4.0 binary RPM install
In-Reply-To: Your message of "Sun, 09 Mar 2003 16:13:10 JST."
<200303090713.AA12082@flash.tokyo.pfu.co.jp>
Message-ID: <20030311094020.4D78620054@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200303090713.AA12082 @ flash.tokyo.pfu.co.jp> KATAYAMA Yoshio wrote:
> I am installing SCore 5.4.2 from the binary RPMs, but installing the
> SCore kernel on the compute hosts fails with an error.

Sorry. This is a mistake in the spec file.

> For now I continued the installation with --nodeps. Will that cause
> any problems?

I think there is no problem as long as you do not start an X server on the
compute hosts.
If you do start one, the i830 found in Red Hat 7.3's server (it was not in the
standard 2.4.19 kernel) and (removed, as I recall, because it failed to
compile...) are absent, so that may be a problem.
The chipset of the host in question is i845, so it does look like it could be
a problem...

> I do not know whether this is related, but rpmtest has become very slow.
(snip)
> The Ethernet driver was e100, so I changed it to eepro100, and the
> run time went back to normal.
(snip)
> Is this simply a driver problem? Or was there a problem with the
> SCore kernel installation?

I think it is a driver problem.

from Kameyama Toyohisa

From kameyama @ pccluster.org Tue Mar 11 19:20:33 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Tue, 11 Mar 2003 19:20:33 +0900
Subject: [SCore-users-jp] _IceTransSocketUNIXConnect
In-Reply-To: Your message of "Tue, 11 Mar 2003 16:17:00 JST." <200303110717.AA13354@flash.tokyo.pfu.co.jp>
Message-ID: <20030311102033.8D79720054@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <200303110717.AA13354 @ flash.tokyo.pfu.co.jp> KATAYAMA Yoshio wrote:
> When I run demo/bin/mandel on SCore 5.4.0, I get the following:
>
> -------- from here --------
> [root @ server tmp]# scrun -monitor /opt/score/demo/bin/mandel
> SCore-D 5.4.0 connected.
> <0:0> SCORE: 8 nodes (8x1) ready.
> _IceTransSocketUNIXConnect: Cannot connect to non-local host bioinfo0.envi.osakafu-u.ac.jp
> Warning: Tried to connect to session manager, Could not open network socket
> :: -size 320x240 -re 0.000000 -im 0.000000 -radius 2.000000
> end: 0 sec 59 msec 305 usec
> -------- to here --------
>
> The warning message above is printed. What causes it?
>
> I reinstalled the server host today; before that (my memory is vague)
> I do not think this message appeared.

I think it probably did appear before...
It seems that mandel tries to use ICE (the Inter-Client Exchange protocol)
and fails.
For what it is worth, if you start xsm and launch the program under it, the message
Warning: Tried to connect to session manager, Could not open network socket
seems to go away...

from Kameyama Toyohisa

From kate @ pfu.fujitsu.com Wed Mar 12 16:34:32 2003
From: kate @ pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Wed, 12 Mar 2003 16:34:32 +0900
Subject: [SCore-users-jp] SCore 5.4.0 binary RPM install
In-Reply-To: Your message of Tue, 11 Mar 2003 18:40:20 +0900. <20030311094020.4D78620054@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200303120734.AA14066@flash.tokyo.pfu.co.jp>

This is Katayama from PFU.
Thank you for your reply.

Date: Tue, 11 Mar 2003 18:40:20 +0900
From: kameyama @ pccluster.org

>> I am installing SCore 5.4.2 from the binary RPMs, but installing the
>> SCore kernel on the compute hosts fails with an error.

>Sorry.
>This is a mistake in the spec file.

That is a relief.

>> For now I continued the installation with --nodeps. Will that cause
>> any problems?

>I think there is no problem as long as you do not start an X server on the
>compute hosts.
>If you do start one, the i830 found in Red Hat 7.3's server (it was not in the
>standard 2.4.19 kernel) and (removed, as I recall, because it failed to
>compile...) are absent, so that may be a problem.
>The chipset of the host in question is i845, so it does look like it could be
>a problem...

On plain Red Hat 7.3, X does not work properly (*), so I have installed
i830-20030120-i386-linux.tar.gz, downloaded from Intel's web site.

* Going tty screen -> X -> tty screen works, but after that I can no
* longer return to X.
* With kernel-2.4.18-24.7.x.i686.rpm it seems to work even without this
* driver, but I install it just in case.

The same happened with the SCore kernel, so I installed this driver and it
appears to be working.

>> I do not know whether this is related, but rpmtest has become very slow.

>> The Ethernet driver was e100, so I changed it to eepro100, and the
>> run time went back to normal.

>> Is this simply a driver problem? Or was there a problem with the
>> SCore kernel installation?

>I think it is a driver problem.

Thank you very much.

-- 
KATAYAMA Yoshio, Linux Systems Dept., OSSC, PFU Limited
Tel 044-520-6617 Fax 044-556-1022

From kate @ pfu.fujitsu.com Wed Mar 12 16:34:38 2003
From: kate @ pfu.fujitsu.com (KATAYAMA Yoshio)
Date: Wed, 12 Mar 2003 16:34:38 +0900
Subject: [SCore-users-jp] _IceTransSocketUNIXConnect
In-Reply-To: Your message of Tue, 11 Mar 2003 19:20:33 +0900.
<20030311102033.8D79720054@neal.il.is.s.u-tokyo.ac.jp>
Message-ID: <200303120734.AA14071@flash.tokyo.pfu.co.jp>

This is Katayama from PFU.
Thank you for your reply.

Date: Tue, 11 Mar 2003 19:20:33 +0900
From: kameyama @ pccluster.org

>> When I run demo/bin/mandel on SCore 5.4.0, I get the following:
...
>> The warning message above is printed. What causes it?
>>
>> I reinstalled the server host today; before that (my memory is vague)
>> I do not think this message appeared.

>I think it probably did appear before...
>It seems that mandel tries to use ICE (the Inter-Client Exchange protocol)
>and fails.

When I ran it as an ordinary user, the message no longer appeared.

I had run it right after finishing the reinstall (before creating any user
accounts), and the unfamiliar message alarmed me. Sorry for the noise.

-- 
KATAYAMA Yoshio, Linux Systems Dept., OSSC, PFU Limited
Tel 044-520-6617 Fax 044-556-1022

From nrcb @ streamline-computing.com Wed Mar 12 16:19:26 2003
From: nrcb @ streamline-computing.com (Nick Birkett)
Date: Wed, 12 Mar 2003 07:19:26 +0000
Subject: [SCore-users-jp] [SCore-users] Trunked network Score 5.4
Message-ID: <200303120719.26926.nrcb@streamline-computing.com>

Hi, we are having a problem with a trunked network:
2 onboard GB cards connected to 2 fast ether switches.

Simple tests work, but running an application (e.g. charmm)
causes the network to crash.

The application works fine using 1 network card.

Hardware: SuperMicro 1U servers, dual onboard Intel Gbit
Software: kernel 2.4.19-1SCORE smp, Score 5.4, charmm 28b2.

Will try again using 2 gigabit switches.

Configuration files attached.

Thanks,

Nick
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scorehosts.db
Type: text/x-csrc
Size: 2388 bytes
Desc: not available
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pm-gigabit0.conf
URL:
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pm-gigabit1.conf
URL:

From hori @ swimmy-soft.com Wed Mar 12 17:27:07 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Wed, 12 Mar 2003 17:27:07 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Trunked network Score 5.4
In-Reply-To: <200303120719.26926.nrcb@streamline-computing.com>
References: <200303120719.26926.nrcb@streamline-computing.com>
Message-ID: <3130334827.hori0002@swimmy-soft.com>

Hi.

>Hi we are having a problem with trunked network.
>2 onboard GB cards connected to 2 fast ether switches.
>
>Simple tests work, but running application (eg charmm)
>causes network to crash.

I always run scstest overnight. You had better do the same.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.

From kameyama @ pccluster.org Wed Mar 12 17:40:29 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Wed, 12 Mar 2003 17:40:29 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Trunked network Score 5.4
In-Reply-To: Your message of "Wed, 12 Mar 2003 07:19:26 JST." <200303120719.26926.nrcb@streamline-computing.com>
Message-ID: <20030312084029.9287420055@neal.il.is.s.u-tokyo.ac.jp>

In article <200303120719.26926.nrcb @ streamline-computing.com> Nick Birkett wrote:
> Simple tests work, but running application (eg charmm)
> causes network to crash.
>
> Application works fine using 1 network card.
...
> ethernet type=ethernet \
> -config:file=/opt/score/etc/pm-ethernet.conf
> gigabit0 type=ethernet \
> -config:file=/opt/score/etc/pm-gigabit0.conf
> gigabit1 type=ethernet \
> -config:file=/opt/score/etc/pm-gigabit1.conf
> gigabitx2 type=ethernet \
> -config:file=/opt/score/etc/pm-gigabit1.conf \
> -trunk0:file=/opt/score/etc/pm-gigabit0.conf
...
>
> comp00.streamline HOST_0 network=gigabitx2,gigabit0,gigabit1,shmem0,shmem1 group=_scoreall_,MYRI,ETHER,SHMEM smp=2 MSGBSERV
> comp01.streamline HOST_1 network=gigabitx2,gigabit0,gigabit1,shmem0,shmem1 group=_scoreall_,MYRI,ETHER,SHMEM smp=2 MSGBSERV

Please remove the gigabit0 and gigabit1 networks, at least from each host line.

http://www.pccluster.org/score/dist/score/html/en/reference/pm/ether-trunking.html
says:
    In this file, ethernet-0, ethernet-1, ethernet-2 and ethernet-3
    networks should be used for test purpose only, and needless networks
    should be removed after following communication tests are finished.
    Because, these definition causes a trouble in SCore-D multiuser
    environment.

from Kameyama Toyohisa

From bogdan.costescu @ iwr.uni-heidelberg.de Wed Mar 12 21:03:05 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed, 12 Mar 2003 13:03:05 +0100 (CET)
Subject: [SCore-users-jp] [SCore-users] Back to SCore 4.2.1...
Message-ID:

Dear SCore developers and users,

I've given up on the newest version of SCore and went back to SCore 4.2.1
on a 2.4.16-based kernel. During the last few days, I got better behaviour
from SCore 5.4 (no more node crashes, although I didn't change anything),
but jobs would still stall or stop with a "Deadlock detected." message only
minutes after start.

But I'm not writing this to discourage other people from installing SCore
5.4. Instead, I would be very interested to hear from other sites,
especially those with similar hardware, whether SCore 5.4 created such
problems. I wouldn't be surprised if our hardware (the Tyan 2460, 760MP-based,
is not renowned for stability) would actually be part of the problem...
-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De

From nrcb @ streamline-computing.com Thu Mar 13 08:51:02 2003
From: nrcb @ streamline-computing.com (Nick Birkett)
Date: Wed, 12 Mar 2003 23:51:02 +0000
Subject: [SCore-users-jp] [SCore-users] trunking
Message-ID: <200303122351.02223.nrcb@streamline-computing.com>

Some first results.

Hardware: 1U dual Xeon 2.6GHz Superservers with onboard dual gigabit.
Each network via its own gigabit switch.

Pallas benchmarks: Pingpong looks ok but Sendrecv is not good:

backoff 1024
maxnsend 16

on both networks.

See attached benchmarks.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: trunking
URL:

From bogdan.costescu @ iwr.uni-heidelberg.de Thu Mar 13 21:05:33 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Thu, 13 Mar 2003 13:05:33 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] trunking
In-Reply-To: <200303122351.02223.nrcb@streamline-computing.com>
Message-ID:

On Wed, 12 Mar 2003, Nick Birkett wrote:

> Hardware: 1U dual Xeon 2.6GHz Superservers with onboard dual gigabit.

Are these Intel or Broadcom NICs ? Or something else...

> Each network via its own gigabit switch.

Could you also tell us what switches you use ?

This is just to get an idea, as we are probably interested in setting up
something very similar in the near future. Myrinet is still expensive for
a small number of nodes and Gigabit Ethernet seems to have a pretty small
latency with SCore (which is more important for us than the increased
bandwidth).
-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De

From s-sumi @ flab.fujitsu.co.jp Thu Mar 13 21:47:36 2003
From: s-sumi @ flab.fujitsu.co.jp (Shinji Sumimoto)
Date: Thu, 13 Mar 2003 21:47:36 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] trunking
In-Reply-To: <200303122351.02223.nrcb@streamline-computing.com>
References: <200303122351.02223.nrcb@streamline-computing.com>
Message-ID: <20030313.214736.640901912.s-sumi@flab.fujitsu.co.jp>

Hi.

Sorry for the late response.

Are you using a switch that supports jumbo frames?
If so, how about the mpi_zerocopy=on option?

Here are results using Intel PRO/1000XTs on Supermicro motherboards.
These results are not so good compared with Myrinet 2000.

Broadcom 5701 based NICs also have good communication performance.

See: http://www.pccluster.org/score/dist/score/html/en/overview/pm-perf.html

maxnsend 24
backoff 2000
with mpi_zerocopy=on

***** Two Intel PRO/1000XTs.
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# ( #processes = 2 )
#-----------------------------------------------------------------------------
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
        0         1000        25.06        25.06        25.06         0.00
        1         1000        24.97        25.03        25.00         0.08
        2         1000        24.38        24.41        24.40         0.16
        4         1000        25.18        25.20        25.19         0.30
        8         1000        24.57        24.63        24.60         0.62
       16         1000        25.52        25.54        25.53         1.19
       32         1000        24.98        25.01        25.00         2.44
       64         1000        25.09        25.11        25.10         4.86
      128         1000        24.71        24.71        24.71         9.88
      256         1000        29.80        29.84        29.82        16.36
      512         1000        30.46        30.50        30.48        32.02
     1024         1000        46.70        46.78        46.74        41.75
     2048         1000        63.69        73.19        68.44        53.37
     4096         1000       112.09       112.15       112.12        69.66
     8192         1000       191.11       191.21       191.16        81.71
    16384         1000       247.54       247.60       247.57       126.21
    32768         1000       652.67       652.69       652.68        95.76
    65536         1000       956.54       956.55       956.54       130.68
   131072         1000      1559.37      1559.38      1559.38       160.32
   262144          640      2737.14      2737.14      2737.14       182.67
   524288          320      4946.09      4946.14      4946.12       202.18
  1048576          160      9352.07      9352.14      9352.10       213.85
  2097152           80     24004.33     24004.79     24004.56       166.63
  4194304           40     48974.10     48975.73     48974.91       163.35

***** Three Intel PRO/1000XTs.
#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# ( #processes = 2 )
#-----------------------------------------------------------------------------
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
        0         1000        30.05        30.06        30.06         0.00
        1         1000        25.05        25.13        25.09         0.08
        2         1000        21.71        21.77        21.74         0.18
        4         1000        23.80        23.85        23.83         0.32
        8         1000        32.45        32.48        32.46         0.47
       16         1000        28.11        28.13        28.12         1.08
       32         1000        24.24        24.30        24.27         2.51
       64         1000        22.63        22.67        22.65         5.38
      128         1000        25.27        25.28        25.27         9.66
      256         1000        28.47        28.54        28.51        17.11
      512         1000        30.25        30.30        30.28        32.23
     1024         1000        45.22        45.29        45.26        43.12
     2048         1000        62.34        71.10        66.72        54.94
     4096         1000        96.77        96.86        96.81        80.66
     8192         1000       178.64       178.79       178.71        87.39
    16384         1000       564.16       564.17       564.16        55.39
    32768         1000       625.56       625.58       625.57        99.91
    65536         1000       798.51       798.52       798.52       156.54
   131072         1000      1289.24      1289.24      1289.24       193.91
   262144          640      2369.67      2369.68      2369.68       211.00
   524288          320      4019.12      4019.12      4019.12       248.81
  1048576          160      7143.38      7143.41      7143.39       279.98
  2097152           80     14925.95     14926.55     14926.25       267.98
  4194304           40     29971.10     29973.05     29972.07       266.91

PS: I am now re-writing PM/Ethernet to reduce communication cost.

Shinji.

From: Nick Birkett
Subject: [SCore-users] trunking
Date: Wed, 12 Mar 2003 23:51:02 +0000
Message-ID: <200303122351.02223.nrcb @ streamline-computing.com>

nrcb> Some first results.
nrcb>
nrcb> Hardware: 1U dual Xeon 2.6GHz Superservers with onboard dual gigabit.
nrcb> Each network via its own gigabit switch.
nrcb>
nrcb> Pallas benchmarks: Pingpong looks ok but Sendrecv is not good:
nrcb>
nrcb> backoff 1024
nrcb> maxnsend 16
nrcb>
nrcb> on both networks.
nrcb>
nrcb> See attached benchmarks.
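
For readers comparing the two Sendrecv tables above: the Mbytes/sec column appears to count the message size twice per repetition (each process sends and receives one message), divided by t_max, with 1 Mbyte taken as 2^20 bytes. A small Python sketch of that relation; the function name is hypothetical and the formula is inferred from the table rows, not quoted from the benchmark's documentation.

```python
def sendrecv_mbytes_per_sec(nbytes, t_max_usec):
    """Throughput as the tables seem to report it: 2 * nbytes per t_max.

    Assumption inferred from the rows above: 1 Mbyte = 2**20 bytes and each
    repetition moves nbytes out and nbytes in on every process.
    """
    if t_max_usec <= 0:
        raise ValueError("t_max_usec must be positive")
    return 2 * nbytes / 2**20 / (t_max_usec * 1e-6)


if __name__ == "__main__":
    # (bytes, t_max[usec], Mbytes/sec as printed in the tables)
    rows = [
        (1048576, 9352.14, 213.85),   # two NICs
        (1048576, 7143.41, 279.98),   # three NICs
        (4194304, 29973.05, 266.91),  # three NICs
    ]
    for nbytes, t_max, printed in rows:
        computed = sendrecv_mbytes_per_sec(nbytes, t_max)
        print(f"{nbytes:>8} bytes: computed {computed:.2f}, table {printed:.2f}")
```

Under this reading, the drop at 2 MB and above with two NICs (213.85 down to about 165 Mbytes/sec) is a real rise in t_max rather than a reporting artifact.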
nrcb> nrcb> ------ Shinji Sumimoto, Fujitsu Labs _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From iztok.daneu @ rzs-hm.si Thu Mar 13 23:09:07 2003 From: iztok.daneu @ rzs-hm.si (Iztok Daneu) Date: Thu, 13 Mar 2003 14:09:07 +0000 (UTC) Subject: [SCore-users-jp] [SCore-users] Unusual problem with scrun Message-ID: Dear Score users, we have a bit unusual problem: disk on one of our cluster nodes nodes crashed some time ago. Instead of going trough additional computational node installation procedure we simply copied the content of other compute node hd with dd. After some twiddling with fdisk, fsck and vi the machine boots OK , but when we try to run application (for example: scrun -scored=tuba0,nodes=28 system hostname) with scrun on the cluster with repaired node included, we get the error message: <13> SCORE-D:ERROR open_ddt_socket(STDIN)=111 <13> SCORE-D:ERROR open_ddt_socket(STDIN)=111 13 is the number of "repaired" computional node ;) and the score version is 5.2.0. We know that this is not a score problem but we created the problem ourselves. Does anyone have a clue what we missed in our procedure? Thank you very much in advance, iztok -- If money can't buy happiness, I guess you'll just have to rent it. _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From bogdan.costescu @ iwr.uni-heidelberg.de Thu Mar 13 23:31:38 2003 From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu) Date: Thu, 13 Mar 2003 15:31:38 +0100 (CET) Subject: [SCore-users-jp] Re: [SCore-users] Unusual problem with scrun In-Reply-To: Message-ID: On Thu, 13 Mar 2003, Iztok Daneu wrote: > we simply copied the content of other compute node hd with dd. Then you probably copied also the content of the /scored or /var/scored which is used by SCoreD to keep data about jobs. 
You should probably start scored with -reset or -resetall command line arguments to clean up any old state in this directory. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From ce107 @ dam.brown.edu Fri Mar 14 00:09:04 2003 From: ce107 @ dam.brown.edu (C. Evangelinos) Date: Thu, 13 Mar 2003 10:09:04 -0500 (EST) Subject: [SCore-users-jp] [SCore-users] NFS installation Message-ID: <200303131509.h2DF94210604@fritz.dam.brown.edu> I'd like to install SCore 5.4 (for development purposes) on a heterogeneous network of PCs. As these machines are old they do not really have 1/2GB of free space for /opt/score locally. I was wondering whether I could install SCore in an NFS partition (and pay whatever the performance loss this means). If that is doable what still needs to be installed locally? Constantinos Evangelinos Center for Fluid Mechanics Brown University and Ocean Engineering Department MIT _______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From bogdan.costescu @ iwr.uni-heidelberg.de Fri Mar 14 00:47:44 2003 From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu) Date: Thu, 13 Mar 2003 16:47:44 +0100 (CET) Subject: [SCore-users-jp] Re: [SCore-users] NFS installation In-Reply-To: <200303131509.h2DF94210604@fritz.dam.brown.edu> Message-ID: On Thu, 13 Mar 2003, C. Evangelinos wrote: > I was wondering whether I could install SCore in an NFS partition (and > pay whatever the performance loss this means). 

2-3 years ago I tried to install SCore on diskless clients, and the only obstacle that I couldn't get past was that /tmp and/or /var/score could not be on NFS and had to be on local disk. I don't know if the requirements have changed in the meantime; if they haven't, I think that you have a pretty good chance of getting it to work.

--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De

From kameyama @ pccluster.org Fri Mar 14 09:54:34 2003
From: kameyama @ pccluster.org (Kameyama Toyohisa)
Date: Fri, 14 Mar 2003 09:54:34 +0900
Subject: [SCore-users-jp] Re: [SCore-users] NFS installation
In-Reply-To: Your message of "Thu, 13 Mar 2003 16:47:44 JST."
Message-ID: <20030314005435.32BCE20054@neal.il.is.s.u-tokyo.ac.jp>

In article Bogdan Costescu wrote:

> On Thu, 13 Mar 2003, C. Evangelinos wrote:
>
> > I was wondering whether I could install SCore in an NFS partition (and
> > pay whatever the performance loss this means).
>
> 2-3 years ago I tried to install SCore on diskless clients and the only
> obstacle that I couldn't pass was that /tmp and/or /var/score could not be
> on NFS, but on local disk.

Probably this problem has not changed. You can share /opt/score (and the SCore source files) over NFS, even if you want to run SCore on a heterogeneous network. But you cannot share /var/scored.

from Kameyama Toyohisa

From iztok.daneu @ rzs-hm.si Fri Mar 14 17:45:29 2003
From: iztok.daneu @ rzs-hm.si (Iztok Daneu)
Date: Fri, 14 Mar 2003 08:45:29 +0000 (UTC)
Subject: [SCore-users-jp] Re: [SCore-users] Unusual problem with scrun
In-Reply-To:
Message-ID:

Hi,

On Thu, 13 Mar 2003, Bogdan Costescu wrote:

> On Thu, 13 Mar 2003, Iztok Daneu wrote:
>
> > we simply copied the content of other compute node hd with dd.
>
> Then you probably copied also the content of the /scored or /var/scored
> which is used by SCoreD to keep data about jobs. You should probably start
> scored with -reset or -resetall command line arguments to clean up any old
> state in this directory.

The machine that was used as the source for the dd copy had been properly taken out of the cluster (no jobs were running at the time, the machine was added to the scorehosts.defects file ....), so there were no stale job files. We did try your suggestion anyway, but to no avail. Thank you for the help all the same.

regards,
iztok

--
Your motives for doing whatever good deed you may have in mind will be misinterpreted by somebody.

From jodell @ ad.brown.edu Tue Mar 18 07:22:15 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 17 Mar 2003 17:22:15 -0500
Subject: [SCore-users-jp] [SCore-users] Configuring SCore
Message-ID: <1047939735.14922.28.camel@cr1>

We have had our SCore cluster running over fast ethernet for a few weeks now and we decided to add gigabit ethernet. My assumption is that to add a new network I:

1) Edit the pm_ethernet file on the nodes to start the gig interface.
2) Add a file pm-gig.conf to the /opt/score/etc directory. This file has the MAC addresses of the gig cards.
3) Edit the scorehosts.db file to define gigaethernet, include the pm-gig.conf file, and define the nodes to have gigabit ethernet.
4) Reboot the server and the compute hosts.

Unfortunately, when I run

scstest -network gigaethernet

I get the following messages:

gaethernet/ethernet (error=12).
argv[0] -config
argv[1] /var/scored/scoreboard/kansas.0000V3000V7t
Unable to open PM gigaethernet/ethernet (error=12).
argv[0] -config
argv[1] /var/scored/scoreboard/kansas.0000V3000V7t

Any help would be greatly appreciated. Also, is there an easier way to do this?

Thanks,
Jim

From ce107 @ dam.brown.edu Tue Mar 18 08:27:19 2003
From: ce107 @ dam.brown.edu (C. Evangelinos)
Date: Mon, 17 Mar 2003 18:27:19 -0500 (EST)
Subject: [SCore-users-jp] Re: [SCore-users] NFS installation
In-Reply-To: <20030314005435.32BCE20054@neal.il.is.s.u-tokyo.ac.jp> from "kameyama@pccluster.org" at Mar 14, 2003 09:54:34 AM
Message-ID: <200303172327.h2HNRJH25065@fritz.dam.brown.edu>

Thanks to everybody for the help on the NFS installation. I think I'll manage, and I will report the results back to the list.

In the meantime I'm stuck installing the SCore kernel on my old heterogeneous equipment. The first two machines I tried (a PII-333 of a stepping that is fine for SCore and a Pentium-133) have 64MB of RAM, and as they try to boot with the new kernel they run out of memory and start killing daemons; the machines never get far into the boot sequence. I'm using the i686 and i586 kernels that come in RPM format with SCore 5.4. Is >64MB RAM a necessity for SCore?

Constantinos Evangelinos
Center for Fluid Mechanics, Brown University
and Ocean Engineering Department, MIT

From jodell @ ad.brown.edu Tue Mar 18 08:39:16 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 17 Mar 2003 18:39:16 -0500
Subject: [SCore-users-jp] [SCore-users] Help with an error message
Message-ID: <1047944356.14938.31.camel@cr1>

Does anyone know what the following messages mean? I got them while running:

scstest -network gigaethernet

bio-11(-1) pmAssociateNodes: Invalid argument(22)
bio-12(-1) pmAssociateNodes: Invalid argument(22)

Thanks,
Jim

From hori @ swimmy-soft.com Tue Mar 18 11:30:28 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Tue, 18 Mar 2003 11:30:28 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Help with an error message
In-Reply-To: <1047944356.14938.31.camel@cr1>
References: <1047944356.14938.31.camel@cr1>
Message-ID: <3130831828.hori0000@swimmy-soft.com>

Hi,

>1) edit the pm_ethernet file on the nodes to start the gig interface.
>2) Add a file pm-gig.conf to the /opt/score/etc directory. This file has
>the MAC addresses of the gig cards.
>3) Edit the scorehosts.db file to define gigaethernet, include the
>pm-gig.conf file and define the nodes to have gigabit ethernet.
>4) Reboot the server and the compute hosts.

And you must do the following on all cluster hosts:

5) /etc/rc.d/init.d/pm_ethernet stop
   Edit /etc/rc.d/init.d/pm_ethernet
   /etc/rc.d/init.d/pm_ethernet start

The pm_ethernet script binds the PM unit number to the Linux ethernet device (eth0, eth1, ...).

>Does anyone know what the following messages mean?
>I got them while running:
>
>scstest -network gigaethernet
>
>
>bio-11(-1) pmAssociateNodes: Invalid argument(22)
>bio-12(-1) pmAssociateNodes: Invalid argument(22)

Send me the files /opt/score/etc/scorehosts.db and /opt/score/etc/pm-gig.conf.

----
Atsushi HORI
Swimmy Software, Inc.

From jodell @ ad.brown.edu Wed Mar 19 03:53:25 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 18 Mar 2003 13:53:25 -0500
Subject: [SCore-users-jp] [SCore-users] Starting Compute Host Lock services: msgbserv:No hosts
Message-ID: <1048013605.14938.56.camel@cr1>

I get an error message when I try to start my msgbserv, as below:

/etc/rc.d/init.d/msgbserv start
Starting Compute Host Lock services: msgbserv:No hosts

Here is the output of my scorehosts and sceptic commands:

scorehosts -l -g _scoreall_
bio-1.cascv.brown.edu
bio-2.cascv.brown.edu
bio-3.cascv.brown.edu
bio-4.cascv.brown.edu
bio-5.cascv.brown.edu
bio-6.cascv.brown.edu
bio-7.cascv.brown.edu
bio-8.cascv.brown.edu
bio-9.cascv.brown.edu
bio-10.cascv.brown.edu
bio-11.cascv.brown.edu
bio-12.cascv.brown.edu
12 hosts found.

sceptic -v -g _scoreall_
bio-2.cascv.brown.edu: OK
bio-9.cascv.brown.edu: OK
bio-1.cascv.brown.edu: OK
bio-10.cascv.brown.edu: OK
bio-6.cascv.brown.edu: OK
bio-8.cascv.brown.edu: OK
bio-4.cascv.brown.edu: OK
bio-7.cascv.brown.edu: OK
bio-5.cascv.brown.edu: OK
bio-12.cascv.brown.edu: OK
bio-3.cascv.brown.edu: OK
bio-11.cascv.brown.edu: OK
All host responding.

Where is msgbserv getting the idea that there are no hosts?

Thanks,
Jim

From jodell @ ad.brown.edu Wed Mar 19 04:00:09 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 18 Mar 2003 14:00:09 -0500
Subject: [SCore-users-jp] Re: [SCore-users] Help with an error message
In-Reply-To: <3130831828.hori0000@swimmy-soft.com>
References: <1047944356.14938.31.camel@cr1> <3130831828.hori0000@swimmy-soft.com>
Message-ID: <1048014009.14922.64.camel@cr1>

Here is my scorehosts.db

/*
 * SCore 5.0 scorehosts.db
 * generated by PCCC EIT 5.2
 */

/* PM/Myrinet */
myrinet type=myrinet \
    -firmware:file=/opt/score/share/lanai/lanai.mcp \
    -config:file=/opt/score/etc/pm-myrinet.conf

/* PM/Myrinet */
myrinet2k type=myrinet2k \
    -firmware:file=/opt/score/share/lanai/lanaiM2k.mcp \
    -config:file=/opt/score/etc/pm-myrinet.conf

/* PM/Ethernet */
ethernet type=ethernet \
    -config:file=/opt/score/etc/pm-ethernet.conf
gigaethernet type=ethernet \
    -config:file=/opt/score/etc/pm-gig.conf
/* PM/Agent */
udp type=agent -agent=pmaudp \
    -config:file=/opt/score/etc/pm-udp.conf

/* RHiNET */
rhinet type=rhinet \
    -firmware:file=/opt/score/share/rhinet/phu_top_0207a.hex \
    -config:file=/opt/score/etc/pm-rhinet.conf
##
##
#include "/opt/score//etc/ndconf/0"
#include "/opt/score//etc/ndconf/1"
#include "/opt/score//etc/ndconf/2"
#include "/opt/score//etc/ndconf/3"
#include "/opt/score//etc/ndconf/4"
#include "/opt/score//etc/ndconf/5"
#include "/opt/score//etc/ndconf/6"
#include "/opt/score//etc/ndconf/7"
#include "/opt/score//etc/ndconf/8"
#include "/opt/score//etc/ndconf/9"
#include "/opt/score//etc/ndconf/10"
#include "/opt/score//etc/ndconf/11"
##
#define MSGBSERV msgbserv=(kansas-fe.cascv.brown.edu:8764)

bio-1.cascv.brown.edu HOST_0 network=ethernet group=_scoreall_,100Mb smp=2 MSGBSERV
bio-2.cascv.brown.edu HOST_1 network=ethernet group=_scoreall_,100Mb smp=2 MSGBSERV
bio-3.cascv.brown.edu HOST_2 network=ethernet group=_scoreall_,100Mb smp=2 MSGBSERV
bio-4.cascv.brown.edu HOST_3 network=ethernet group=_scoreall_,100Mb smp=2 MSGBSERV
bio-5.cascv.brown.edu HOST_4 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-6.cascv.brown.edu HOST_5 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-7.cascv.brown.edu HOST_6 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-8.cascv.brown.edu HOST_7 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-9.cascv.brown.edu HOST_8 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-10.cascv.brown.edu HOST_9 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-11.cascv.brown.edu HOST_10 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV
bio-12.cascv.brown.edu HOST_11 network=ethernet,gigaethernet group=_scoreall_,100Mb,gige smp=2 MSGBSERV

Here is my pm-gig.conf file:

unit 1
maxnsend 8
# Not connected yet
#0 00:30:48:23:70:CF bio-1.cascv.brown.edu
#1 00:30:48:23:70:B1 bio-2.cascv.brown.edu
#2 00:30:48:23:70:D9 bio-3.cascv.brown.edu
#3 00:30:48:23:70:E3 bio-4.cascv.brown.edu
4 00:30:48:23:6E:2B bio-5.cascv.brown.edu
5 00:30:48:23:3F:05 bio-6.cascv.brown.edu
6 00:30:48:23:3E:51 bio-7.cascv.brown.edu
7 00:30:48:23:3E:3D bio-8.cascv.brown.edu
8 00:30:48:23:70:EB bio-9.cascv.brown.edu
9 00:30:48:23:6F:05 bio-10.cascv.brown.edu
10 00:30:48:23:6E:55 bio-11.cascv.brown.edu
11 00:30:48:23:70:E1 bio-12.cascv.brown.edu

I have disabled the first four hosts as we don't have enough room in our switch for them.

I have also edited the pm_ethernet file to start and stop eth1. When I run "pm_ethernet stop" and then run "pm_ethernet start" I get the messages below.

[root @ bio-12 init.d]# ./pm_ethernet stop
Stopping PM/Ethernet:
device: eth0
device: eth1

[root @ bio-12 init.d]# ./pm_ethernet start
n Starting PM/Ethernet:
device: eth0
device: eth1
etherpmctl: ERROR on unit 1: "Link has been severed(67)" Check dmesg log!!

Many thanks for your help!

Jim

From jodell @ ad.brown.edu Wed Mar 19 05:29:22 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 18 Mar 2003 15:29:22 -0500
Subject: [SCore-users-jp] Re: [SCore-users] Help with an error message
In-Reply-To: <1048014009.14922.64.camel@cr1>
References: <1047944356.14938.31.camel@cr1> <3130831828.hori0000@swimmy-soft.com> <1048014009.14922.64.camel@cr1>
Message-ID: <1048019362.16469.81.camel@cr1>

I found part of my problem.
The "Link has been severed" message came about because my gig interface was not marked UP by ifconfig. I am not using the gigabit ethernet for anything else but SCore, so it was not UP. I modified the pm_ethernet script to do a "/sbin/ifconfig eth1 up" before starting and a "/sbin/ifconfig eth1 down" after stopping. rpmtest indicates that both interfaces are now working. I cannot test with scstest because I somehow broke my msgbserv. That problem is in another message.

Jim

From michael @ streamline-computing.com Wed Mar 19 06:30:59 2003
From: michael @ streamline-computing.com (Michael Rudgyard Streamline)
Date: Tue, 18 Mar 2003 21:30:59 +0000 (GMT)
Subject: [SCore-users-jp] [SCore-users] Benchmarking 256 processor problem
Message-ID:

We have a customer who has an interesting problem with MPI_Bcast, where SCore seems to hang on large numbers of processors, in particular when there are 2 processes per node running. The code segment is provided below.

My understanding is that the MPICH (and hence the SCore) implementation of MPI_Bcast is globally asynchronous, and is built using MPI_Send. It is therefore possible that (in the example below, and in the worst case) the 256th processor may have yet to receive messages from all other (255) processors. I suspect that this may be problematic because there is a maximum number of message buffers that may be sent at a given time. I know this was the case on SGI and Cray systems, and I think this is the case with MPICH, but I can't find the corresponding environment variables on the MPICH web site.

As far as I am aware, MPI_Send will block if the send cannot be buffered (so I assume this is the case for MPI_Bcast), and given that MPI_Bcast is called in the correct order on each processor (avoiding the well-known deadlock situations), I can't see why this code should necessarily hang, other than there being potentially a lot of messages floating around...

This leads me to believe that it must just be the number of outstanding messages that is the problem, although in that case shouldn't the corresponding MPI_Bcast block at the sender's side?
Could there be an issue in particular due to messages sent via shared memory (i.e. a performance vs. correctness issue)?

For info, each send is about a kilobyte of information. Note that by making the broadcast synchronous, i.e. by adding an MPI_Barrier, we solve the problem.

The machine is running SCore 5.0.1 with MPI 1.2.4 over Myrinet 2000 (M3F-PCI64-B 2MB).

Thanks in advance,
Michael

----------------

The code ran fine on up to 128 processors when tested with one process per node. It also ran fine with 2 processes per node on up to 32 nodes (i.e. 64 processes). However, when run on 64x2 the code would "stop" at differing points, normally within a minute of starting an hour-long job. By "stop" I mean the processes would remain at 100% CPU but no work was being done, as though a process was waiting for a message.

Reason
------

Our investigations this afternoon have led us to believe that it comes down to a loop of MPI_Bcasts:

      DO 300 p = 0, noprocs-1
         JSTART = p*JMAX/noprocs+1
         JFINISH = (p+1)*JMAX/noprocs
         npts = IMAX*(JFINISH-JSTART+1)
         CALL MPI_Bcast(U(1,JSTART), npts,
     :        MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
     :        error)
 300  CONTINUE

In each iteration, processor p broadcasts its own chunk of the array to all the other processors. An AllToAll would be similar; the loop was used instead in the hope of giving better control over the number of messages being sent at any time. However, it appears that this isn't the case.

After adding an MPI_Barrier call after the MPI_Bcast, the "stopping" problem was not repeated in our tests.
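
The index arithmetic in the loop above can be checked in isolation. The following is a minimal C sketch, not the customer's code: the helper names chunk_start, chunk_finish, chunk_npts and chunks_tile are made up for illustration, and no MPI calls are made. It only confirms that the JSTART/JFINISH formulas split columns 1..JMAX into contiguous, non-overlapping chunks, one per root rank, so that every MPI_Bcast in the loop has a well-defined buffer.

```c
#include <assert.h>

/* First column (1-based, as in the Fortran loop) owned by rank p. */
static int chunk_start(int p, int jmax, int noprocs) {
    return p * jmax / noprocs + 1;
}

/* Last column (1-based, inclusive) owned by rank p. */
static int chunk_finish(int p, int jmax, int noprocs) {
    return (p + 1) * jmax / noprocs;
}

/* Element count broadcast by rank p: IMAX * (JFINISH - JSTART + 1). */
static int chunk_npts(int p, int imax, int jmax, int noprocs) {
    return imax * (chunk_finish(p, jmax, noprocs)
                   - chunk_start(p, jmax, noprocs) + 1);
}

/* Returns 1 if the chunks cover 1..jmax with no gap and no overlap. */
static int chunks_tile(int jmax, int noprocs) {
    int expected = 1;
    for (int p = 0; p < noprocs; p++) {
        if (chunk_start(p, jmax, noprocs) != expected) {
            return 0;
        }
        /* In the real loop, rank p would be the root of MPI_Bcast here,
           optionally followed by the MPI_Barrier workaround. */
        expected = chunk_finish(p, jmax, noprocs) + 1;
    }
    return expected == jmax + 1;
}
```

For example, with JMAX = 1024 and noprocs = 256 every rank broadcasts 4 columns. If the partition is sound, as it appears to be, the hang is more likely related to how many broadcasts are in flight at once than to the indices, which would be consistent with the MPI_Barrier observation.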

From kameyama @ pccluster.org Wed Mar 19 09:14:35 2003
From: kameyama @ pccluster.org (Kameyama Toyohisa)
Date: Wed, 19 Mar 2003 09:14:35 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Starting Compute Host Lock services: msgbserv:No hosts
In-Reply-To: Your message of "18 Mar 2003 13:53:25 JST." <1048013605.14938.56.camel@cr1>
Message-ID: <20030319001435.2629020054@neal.il.is.s.u-tokyo.ac.jp>

In article <1048013605.14938.56.camel @ cr1> "James O'Dell" wrote:

> I get an error message when I try to start my msgbserv as below:
>
> /etc/rc.d/init.d/msgbserv start
> Starting Compute Host Lock services: msgbserv:No hosts

Your scorehosts.db says:

#define MSGBSERV msgbserv=(kansas-fe.cascv.brown.edu:8764)

But the file stored in your /var/scored is:

gaethernet/ethernet (error=12).
argv[0] -config
argv[1] /var/scored/scoreboard/kansas.0000V3000V7t
Unable to open PM gigaethernet/ethernet (error=12).
argv[0] -config
argv[1] /var/scored/scoreboard/kansas.0000V3000V7t

What is the official hostname of your server? msgbserv searches the scoreboard database for the management host by its *official hostname*. You can get the official hostname with the officialname command:

% officialname

If your server's hostname is kansas.cascv.brown.edu, please change the msgbserv entry.

from Kameyama Toyohisa

From hori @ swimmy-soft.com Wed Mar 19 15:40:55 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Wed, 19 Mar 2003 15:40:55 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Help with an error message
In-Reply-To: <1047944356.14938.31.camel@cr1>
References: <1047944356.14938.31.camel@cr1>
Message-ID: <3130933255.hori0006@swimmy-soft.com>

Hi.

>Does anyone know what the following messages mean?
>I got them while running:
>
>scstest -network gigaethernet
>
>
>bio-11(-1) pmAssociateNodes: Invalid argument(22)
>bio-12(-1) pmAssociateNodes: Invalid argument(22)

There are a number of possible causes for this message. So, try again with the PM debug option:

% export PM_DEBUG=1
% scstest -network gigaether

And let me know the output messages.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.

From jodell @ ad.brown.edu Thu Mar 20 05:15:52 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 19 Mar 2003 15:15:52 -0500
Subject: [SCore-users-jp] [Fwd: Re: [SCore-users] Starting Compute Host Lock services: msgbserv:No hosts]
Message-ID: <1048104952.14922.106.camel@cr1>

-----Forwarded Message-----
> From: James O'Dell
> To: kameyama @ pccluster.org
> Subject: Re: [SCore-users] Starting Compute Host Lock services: msgbserv:No hosts
> Date: 19 Mar 2003 12:58:25 -0500
>
> That was the problem exactly! It makes me wonder how my cluster ever
> worked correctly!
>
> Thanks for the pointers. My cluster is up and running on gig ethernet.
>
> Jim

From jodell @ ad.brown.edu Thu Mar 20 05:15:35 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 19 Mar 2003 15:15:35 -0500
Subject: [SCore-users-jp] [Fwd: Re: [SCore-users] Help with an error message]
Message-ID: <1048104935.14922.104.camel@cr1>

-----Forwarded Message-----
> From: James O'Dell
> To: Atsushi Hori
> Subject: Re: [SCore-users] Help with an error message
> Date: 19 Mar 2003 12:56:31 -0500
>
> Once I got my gig interfaces configured properly, this message went
> away.
>
> Thanks for all of your help!
>
> Jim

From jodell @ ad.brown.edu Thu Mar 20 08:05:21 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 19 Mar 2003 18:05:21 -0500
Subject: [SCore-users-jp] [SCore-users] Procedure for adjusting networking parameters
Message-ID: <1048115121.16469.128.camel@cr1>

Does anyone have a "best practices" procedure that they'd like to share on how they adjust their networking parameters for highest performance?

Thanks,
Jim

From hori @ swimmy-soft.com Thu Mar 20 11:06:17 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Thu, 20 Mar 2003 11:06:17 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Procedure for adjusting networking parameters
In-Reply-To: <1048115121.16469.128.camel@cr1>
References: <1048115121.16469.128.camel@cr1>
Message-ID: <3131003177.hori0001@swimmy-soft.com>

Hi,

>Does anyone have a "best practices" procedure that they'd like to share
>on how they adjust their networking parameters for highest performance?

What is the definition of "performance"?

Although many people believe that communication performance can be measured with latency and bandwidth, as far as I know latency and bandwidth represent only some aspects of communication characteristics. Have you ever heard of the LogP model? It is another performance measure of communication, but LogP still represents only some aspects, not all.

Further, cluster users want to run their jobs as fast as they can. So the ultimate goal is to obtain the highest performance of the applications, not of the communication. Sometimes application performance depends on latency and bandwidth, but sometimes it does not.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.
_______________________________________________ SCore-users mailing list SCore-users @ pccluster.org http://www.pccluster.org/mailman/listinfo/score-users From jodell @ ad.brown.edu Fri Mar 21 03:48:30 2003 From: jodell @ ad.brown.edu (James O'Dell) Date: 20 Mar 2003 13:48:30 -0500 Subject: [SCore-users-jp] Re: [SCore-users] Procedure for adjusting networking parameters In-Reply-To: <3131003177.hori0001@swimmy-soft.com> References: <1048115121.16469.128.camel@cr1> <3131003177.hori0001@swimmy-soft.com> Message-ID: <1048186110.20434.74.camel@cr1> Good point. I guess what you are really saying is that I should tune my system against its typical workload. Jim On Wed, 2003-03-19 at 21:06, Atsushi HORI wrote: > Hi, > > >Does anyone have a "best practices" procedure that they'd like to share > >on how they adjust their networking parameters for highest performance? > > What is the definition of "performance" ? > > Althought many people believe that the communication perfomance can > be measured with latency and bandwidth, but as far as I know, those > latency and bandwidth are representing some aspects of communication > characteristics. Have you ever heard of the LogP model ? This is > another performance measure of communication, but still LogP > represents some aspects, not all. > > Further, cluster users want to run their jobs as fast as they could. > So the ultimate goal is to obtain the highest performance of > applications, not the communication. Somtimes application performace > depends on the latency and bandwidth, but sometimes not. > > ---- > Atsushi HORI > SCore Developer > Swimmy Software, Inc. 
>

From jodell @ ad.brown.edu Fri Mar 21 07:54:46 2003
From: jodell @ ad.brown.edu (James O'Dell)
Date: 20 Mar 2003 17:54:46 -0500
Subject: [SCore-users-jp] [SCore-users] More on my segmentation violation problem
Message-ID: <1048200886.20434.97.camel@cr1>

I've recompiled GROMACS to include symbols and have managed to get a
debugger backtrace from the process that is experiencing the
segmentation violation.

#0  0x082052cd in syscall ()
#1  0xbfffe5f8 in ?? ()
#2  0x081e045f in vsyscall (handle=0x834a080, retp=0xbfffe5f8,
    args=0xbfffe5ec) at ../scwrap.c:711
#3  0x081e0495 in score_syscall (handle=0x834a080, retp=0xbfffe5f8)
    at ../scwrap.c:726
#4  0x081e0acf in __nanosleep (req=0xbfffe61c, rem=0xbfffe61c)
    at ../scwrap.c:1166
#5  0x08203c7a in sleep ()
#6  0x081b020e in score_wait_forever () at ../libsc_util.c:154
#7  0x081b04f2 in sc_inspectme (x_display=0xbffffd56 "dev1:0", signal=11)
    at ../libscio.c:243
#8  0x081a8be0 in MPID_SCORE_Exception ()
#9  <signal handler called>
#10 angles (nbonds=27548, forceatoms=0x8eeed60, forceparams=0x8eeaa20,
    x=0x8fc0560, f=0x93e8218, fr=0x8cca460, g=0x8ccaca0, box=0x8ad1c98,
    lambda=0, dvdlambda=0xbfffec98, md=0x8cb61b8, ngrp=2, egnb=0x8ad12c0,
    egcoul=0x8ad12a8, fcd=0x8ad15b0) at ../../include/vec.h:235
#11 0x080898ee in calc_bonds (log=0x8ad1440, cr=0x86af708, mcr=0x0,
    idef=0x8ad402c, x_s=0x8fc0560, f=0x93e8218, fr=0x8cca460, g=0x8ccaca0,
    epot=0x8ad11a0, nrnb=0xbffff1d0, box=0x8ad1c98, lambda=0, md=0x8cb61b8,
    ngrp=2, egnb=0x8ad12c0, egcoul=0x8ad12a8, fcd=0x8ad15b0, step=0,
    bSepDVDL=0) at bondfree.c:109
#12 0x0805dd2d in force (fp=0x8ad1440, step=0, fr=0x8cca460, ir=0x8ad1aa8,
    idef=0x8ad402c, nsb=0x8ad3008, cr=0x86af708, mcr=0x0, nrnb=0xbffff1d0,
    grps=0x8ad1908, md=0x8cb61b8, ngener=2, opts=0x8ad1c28, x=0x8fc0560,
    f=0x93e8218, epot=0x8ad11a0, fcd=0x8ad15b0, bVerbose=0,
    box=0x8ad1c98, lambda=0, graph=0x8ccaca0, excl=0x8adf1c4, bNBFonly=0,
    lr_vir=0xbffff610, mu_tot=0xbffff1c0, qsum=-6.99999762, bGatherOnly=0)
    at force.c:960
#13 0x0807eade in do_force (log=0x8ad1440, cr=0x86af708, mcr=0x0,
    parm=0x8ad1aa8, nsb=0x8ad3008, vir_part=0xbffff640, pme_vir=0xbffff610,
    step=0, nrnb=0xbffff1d0, top=0x8ad4028, grps=0x8ad1908, x=0x8fc0560,
    v=0x90454e8, f=0x93e8218, buf=0x9363290, mdatoms=0x8cb61b8,
    ener=0x8ad11a0, fcd=0x8ad15b0, bVerbose=0, lambda=0, graph=0x8ccaca0,
    bNS=1, bNBFonly=0, fr=0x8cca460, mu_tot=0xbffff1c0, bGatherOnly=0)
    at sim_util.c:282
#14 0x0805177e in do_md (log=0x8ad1440, cr=0x86af708, mcr=0x0, nfile=21,
    fnm=0x828bd04, bVerbose=1, bCompact=1, bDummies=0, dummycomm=0x0,
    stepout=10, parm=0x8ad1aa8, grps=0x8ad1908, top=0x8ad4028,
    ener=0x8ad11a0, fcd=0x8ad15b0, x=0x8fc0560, vold=0x94f2128,
    v=0x90454e8, vt=0x946d1a0, f=0x93e8218, buf=0x9363290,
    mdatoms=0x8cb61b8, nsb=0x8ad3008, nrnb=0x8ae0260, graph=0x8ccaca0,
    edyn=0xbffff7f0, fr=0x8cca460, box_size=0xbffff790, Flags=0)
    at md.c:508
#15 0x080508b6 in mdrunner (cr=0x86af708, mcr=0x0, nfile=21,
    fnm=0x828bd04, bVerbose=1, bCompact=1, nDlb=0, nstepout=10,
    edyn=0xbffff7f0, Flags=0) at md.c:193

The code at the violation is in this vicinity.

240       a[YY]=y;
241       a[ZZ]=z;
242     }
243
244     static inline void rvec_sub(const rvec a,const rvec b,rvec c)
245     {
246       real x,y,z;
247
248       x=a[XX]-b[XX];
249       y=a[YY]-b[YY];

I don't believe that this behavior is specific to my hardware or
operating system, since I get approximately the same behavior on an
IBM SP. The segmentation violation seems to happen very early in the
run. In this case I was running on 12 processors. Also, if I perform
exactly the same calculation several times in a row, sometimes it will
segmentation fault and sometimes not. It seems to me that it has all
the classic characteristics of a storage allocation problem in the
GROMACS code.

Does anybody have suggestions on how to pursue this further?
Thanks,
Jim

From s-sumi @ bd6.so-net.ne.jp Fri Mar 21 12:51:09 2003
From: s-sumi @ bd6.so-net.ne.jp (Shinji Sumimoto)
Date: Fri, 21 Mar 2003 12:51:09 +0900 (JST)
Subject: [SCore-users-jp] Re: [SCore-users] Benchmarking 256 processor problem
In-Reply-To:
References:
Message-ID: <20030321.125109.846938101.s-sumi@bd6.so-net.ne.jp>

Hi.

Thank you for the information.

Does the problem occur with the mpi_zerocopy=on option? If it does
not, there is something wrong in message transfer in the PM or MPI
level implementation.

If the same problem occurs on a small cluster (e.g. 16 nodes), please
let us know. We can then reproduce the situation and fix it.

PS: The new version of MPICH, version 1.2.5, includes a new version
of mpi_bcast, so your problem may be solved. We will try to port it
to SCore.

Shinji.

From: Michael Rudgyard Streamline
Subject: [SCore-users] Benchmarking 256 processor problem
Date: Tue, 18 Mar 2003 21:30:59 +0000 (GMT)
Message-ID:

michael>
michael> We have a customer who has an interesting problem with MPI_Bcast,
michael> where SCore seems to hang on large numbers of processors, and in
michael> particular when there are 2 processes per node running. The code
michael> segment is provided below.
michael>
michael> My understanding is that the MPICH (and hence the SCore) implementation of
michael> MPI_Bcast is globally asynchronous, and is built using MPI_Send. It is
michael> therefore possible that (in the example below, and in the worst case) the
michael> 256th processor may have yet to receive messages from all other
michael> (255) processors. I suspect that this may be problematic because there is a
michael> maximum number of message buffers that may be sent at a given time.
michael> I know this was the case on SGI and Cray systems, and I think this is the case
michael> with MPICH but can't find the corresponding environment variables on the
michael> MPICH web-site.
michael>
michael> As far as I am aware, MPI_Send will block if the send cannot be buffered
michael> (so I assume this is the case for MPI_Bcast), and given that MPI_Bcast is
michael> called in the correct order for each processor (avoiding the well-known
michael> deadlock situations), I can't see why this code should necessarily cause
michael> the code to hang (???) other than there being potentially a lot of
michael> messages floating around... This leads me to believe that it must just be
michael> the number of outstanding messages that is the problem, although in that
michael> case shouldn't the corresponding MPI_Bcast block at the sender's
michael> side? Could there be an issue in particular due to messages sent via
michael> shared memory (i.e. a performance vs. correctness issue)?
michael>
michael> For info, each send is about a kilobyte of information.
michael>
michael> Note that making the broadcast synchronous, i.e. by adding an MPI_Barrier,
michael> solves the problem.
michael>
michael> The machine is running Score 5.0.1 with MPI 1.2.4 over Myrinet 2000
michael> (M3F-PCI64-B 2MB).
michael>
michael> Thanks in advance,
michael>
michael> Michael
michael>
michael> ----------------
michael>
michael> The code ran fine on up to 128 processors when tested on one process per
michael> node. It also ran fine on 2 processes per node on up to 32 nodes (i.e. 64
michael> processes). However, when run on 64x2 the code would "stop" at
michael> differing points, normally within a minute of execution of an hour-long
michael> job. By "stop" I mean the processes would remain at 100% CPU but no work
michael> was being done, as though a process was waiting for a message.
michael>
michael> Reason
michael> ------
michael>
michael> Our investigations this afternoon have led us to believe that it comes down
michael> to a loop of MPI_Bcasts:
michael>       DO 300 p = 0, noprocs-1
michael>         JSTART = p*JMAX/noprocs+1
michael>         JFINISH = (p+1)*JMAX/noprocs
michael>         npts = IMAX*(JFINISH-JSTART+1)
michael>         CALL MPI_Bcast(U(1,JSTART), npts,
michael>      :       MPI_DOUBLE_PRECISION, p, MPI_COMM_WORLD,
michael>      :       error)
michael>   300 CONTINUE
michael> This broadcast simply sends the next processor's chunk of the array to all
michael> the other processors. An AllToAll would be similar, however this was used
michael> to give better control over the number of messages being sent at any time.
michael>
michael> However, it appears that this isn't the case. By adding an MPI_Barrier
michael> call after the MPI_Bcast the problem of the "stopping" wasn't repeated in
michael> our tests.
michael>
michael>

------
Shinji Sumimoto, Fujitsu Labs

From jducom @ nd.edu Fri Mar 21 14:09:54 2003
From: jducom @ nd.edu (Jean-Christophe Ducom)
Date: Fri, 21 Mar 2003 00:09:54 -0500
Subject: [SCore-users-jp] [SCore-users] Score with SK-9D21
Message-ID: <3E7A9EA2.7000907@nd.edu>

All,

I know that it is mentioned in the FAQ, but I'd like to double-check
in case it has been fixed in the latest version. Is there any chance
that Score works with the SysKonnect SK-9D21 card?
Thank you and sorry about this (annoying/redundant) question.
JC

From nrcb @ streamline-computing.com Fri Mar 21 17:15:16 2003
From: nrcb @ streamline-computing.com (Nick Birkett)
Date: Fri, 21 Mar 2003 08:15:16 +0000
Subject: [SCore-users-jp] [SCore-users] pghpf score 5.4
Message-ID: <200303210815.16173.nrcb@streamline-computing.com>

I am having a problem getting pghpf to work with score 5.4.0.

I have this working for score 5.0.1.

Here is the error:

mpif90 -compiler pghpf -c -Mmpi -fast -tp p7 -Mlfs overlap=size:3 -Ktrap=fp
math.f
/opt/score/mpi/mpich-1.2.4/i386-redhat7-linux2_4_pghpf/bin/mpif90: eval:
illegal option: -c
eval: usage: eval [arg ...]
make: *** [math.o] Error 2

Has anyone else tried this?

See the attached mpif90 wrapper, and the compiler/site and compiler/pghpf
files used to compile from source.

Regards,

Nick
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpif90
Type: application/x-shellscript
Size: 11747 bytes
Desc: not available
URL:
-------------- next part --------------
An attachment with no character set specified was scrubbed...
Name: site
URL:
-------------- next part --------------
An attachment with no character set specified was scrubbed...
Name: pghpf
URL:

From arpiruk @ yahoo.com Fri Mar 21 22:31:17 2003
From: arpiruk @ yahoo.com (arpiruk @ yahoo.com)
Date: Fri, 21 Mar 2003 05:31:17 -0800 (PST)
Subject: [SCore-users-jp] [SCore-users] rc.config.scoreboard problem
In-Reply-To: <20030318030001.17598.78856.Mailman@www.pccluster.org>
Message-ID: <20030321133117.26480.qmail@web13907.mail.yahoo.com>

I have a question concerning the installation.
During installation of Score5.2 on Suse 2.4.18 the setup reports

cp: cannot stat `rc.config.scoreboard': No such file or directory
Exception in ../SRC/services.c, line 299 concerning file not opened

Where should I get this file from and where should I put it?
Is it similar to the Linux rc.config file?
Sincerely,
Arpiruk Hokpunna
CSE student
TU-Munich

From arpiruk @ yahoo.com Fri Mar 21 22:30:31 2003
From: arpiruk @ yahoo.com (arpiruk @ yahoo.com)
Date: Fri, 21 Mar 2003 05:30:31 -0800 (PST)
Subject: [SCore-users-jp] [SCore-users] Re: SCore-users digest, Vol 1 #194 - 4 msgs
In-Reply-To: <20030318030001.17598.78856.Mailman@www.pccluster.org>
Message-ID: <20030321133031.35701.qmail@web13901.mail.yahoo.com>

I have a question concerning the installation.
During installation of Score5.2 on Suse 2.4.18 the setup reports

cp: cannot stat `rc.config.scoreboard': No such file or directory
Exception in ../SRC/services.c, line 299 concerning file not opened

Where should I get this file from and where should I put it?
Is it similar to the Linux rc.config file?

Sincerely,
Arpiruk Hokpunna
CSE student
TU-Munich

From kameyama @ pccluster.org Mon Mar 24 15:08:44 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Mon, 24 Mar 2003 15:08:44 +0900
Subject: [SCore-users-jp] Re: [SCore-users] pghpf score 5.4
In-Reply-To: Your message of "Fri, 21 Mar 2003 08:15:16 JST." <200303210815.16173.nrcb@streamline-computing.com>
Message-ID: <20030324060844.61FA620054@neal.il.is.s.u-tokyo.ac.jp>

In article <200303210815.16173.nrcb @ streamline-computing.com> Nick Birkett wrote:
> Content-Transfer-Encoding: 8bit
>
> I am having problem getting pghpf to work with score 5.4.0
>
> I have this working for score 5.0.1.
>
> Here is the error:
>
> mpif90 -compiler pghpf -c -Mmpi -fast -tp p7 -Mlfs overlap=size:3 -Ktrap=fp

If you want to use communications libraries with MPICH/SCore,
you only need MPICH/SCore with the PGI library, and to edit:
    $PGI/linux86/pghpfrc
or
    $HOME/.mypghpfrc

> math.f
> /opt/score/mpi/mpich-1.2.4/i386-redhat7-linux2_4_pghpf/bin/mpif90: eval:
> illegal option: -c
> eval: usage: eval [arg ...]
> make: *** [math.o] Error 2
>
> Has anyone else tried this ?

I think the Fortran 90 compiler was not found at MPI build time.
Please check the MPI build log (/opt/score/score-src/out.*/mpi.build).

I get this message:

checking if /work/kameyama/install/bin/scoref77 works with GETARG and IARGC... no
...
configure: warning: Could not find a way to access the command line from Fortran 77
configure: error: Command line access is required for MPICH
Error configuring the Fortran subsystem! Turning off Fortran support

You must specify the mpicc and mpif77 compilers for pghpf.

Note that you can use a compiler alias file on SCore 5.4.
    http://www.pccluster.org/score/dist/score/html/en/man/man5/compiler_alias.html
If you write the following line in /opt/score/etc/compilers/alias:
    pghpf pgi
and you write in /opt/score/etc/compilers/site:
    mpif90 pghpf=pghpf
    ...
and you type the following command:
    % mpif90 -compiler=pghpf ...
then mpif90 converts it to the following:
    % mpif90 -compiler=pgi -compiler-path=pghpf ...

from Kameyama Toyohisa

From nrcb @ streamline-computing.com Tue Mar 25 00:08:12 2003
From: nrcb @ streamline-computing.com (Nick Birkett)
Date: Mon, 24 Mar 2003 15:08:12 +0000
Subject: [SCore-users-jp] [SCore-users] suspending/unsuspending single user jobs
Message-ID: <200303241508.12977.nrcb@streamline-computing.com>

Hi, we would like to be able to suspend single-user jobs (i.e. running
under PBS or Sun Grid Engine) and be able to start another job while
those jobs are suspended.
This is so we can suspend parallel queues at certain times of the week
to run large jobs.

We have tried sending SIGTSTP to the scrun.exe process of a job, and
this suspends it.

However, there is a problem in starting another job with one suspended.
I have tried restarting msgbserv but still get the error that the pm
device is already opened.

That is, I would like to know how to suspend a single-user job and
close the pm devices associated with that job so I can run another job
using the same nodes.

Is there a way to do this?

Cheers,

Nick

From M.Newiger @ deltacomputer.de Tue Mar 25 02:13:56 2003
From: M.Newiger @ deltacomputer.de (Martin Newiger)
Date: Mon, 24 Mar 2003 18:13:56 +0100
Subject: [SCore-users-jp] [SCore-users] NIS
Message-ID:

When I add users to a SCore system I want them to be available
on my nodes as well (so that they can log in there too). What must I do?
The NIS server is the SCore master.

Regards
Martin Newiger

From kameyama @ pccluster.org Tue Mar 25 09:00:43 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Tue, 25 Mar 2003 09:00:43 +0900
Subject: [SCore-users-jp] Re: [SCore-users] NIS
In-Reply-To: Your message of "Mon, 24 Mar 2003 18:13:56 JST."
Message-ID: <20030325000043.7D3D72003B@neal.il.is.s.u-tokyo.ac.jp>

In article Martin Newiger wrote:
> When I add users to a SCore-system I want to have them being available
> on my nodes as well (that they can login there too). What must I do? The
> NIS-Server is the SCore-Master.

1. Please add the user locally on the NIS server.
2. Please issue these commands on the NIS server:
    # cd /var/yp
    # make
   These commands update the NIS data.
from Kameyama Toyohisa

From haddock @ webgroup.co.jp Wed Mar 26 16:54:00 2003
From: haddock @ webgroup.co.jp (haddock @ webgroup.co.jp)
Date: Wed, 26 Mar 2003 16:54:00 +0900
Subject: [SCore-users-jp] [SCore-users] Cannot mount ?
Message-ID: <5fcb3441.34415fcb@webgroup.co.jp>

Hi all

I'm building a PC cluster with score-5.4 on Redhat7.3.
When I try to install the client machines, I get errors like
this:

---------------
VFS :Mounted root (ext2 filesystem).
Using EIT5 feature
mounting /proc filesystem.... done
Testing..........
No dhcp_server specified. Used Broadcast
setupNetwork cannot set the gateway address
done
NFS mount 192.168.1.2: /mnt/runtime
Cannot mount
exiting
See the documentation for this trouble
---------------------------

Do you have any idea?

Thanks for your help

Regards
haddock

From kameyama @ pccluster.org Wed Mar 26 17:14:22 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Wed, 26 Mar 2003 17:14:22 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Cannot mount ?
In-Reply-To: Your message of "Wed, 26 Mar 2003 16:54:00 JST." <5fcb3441.34415fcb@webgroup.co.jp>
Message-ID: <20030326081422.D3B262003B@neal.il.is.s.u-tokyo.ac.jp>

In article <5fcb3441.34415fcb @ webgroup.co.jp> haddock @ webgroup.co.jp wrote:
> Hi all
>
> I'm making the pc-cluster with score-5.4 on Redhat7.3.
> When I try install client machines , I got some errors like
> this
>
> ---------------
> VFS :Mounted root (ext2 filesystem).
> Using EIT5 feature
> mounting /proc filesystem.... done
> Testing..........
> No dhcp_server specified.
> Used Broadcast
> setupNetwork cannot set the gateway address

Did you set the correct gateway address in the Network Configuration
window? This gateway address is specified for the compute hosts (not
the server hosts).

> NFS mount 192.168.1.2: /mnt/runtime

Your server's IP address on the compute host side is 192.168.1.2.
Is this correct?
If your server host has two network cards, please use eth0 and the
official hostname for the compute host side.

from Kameyama Toyohisa

From kraehe @ copyleft.de Thu Mar 27 03:10:36 2003
From: kraehe @ copyleft.de (Michael Koehne)
Date: Wed, 26 Mar 2003 19:10:36 +0100
Subject: [SCore-users-jp] [SCore-users] Queue hangs for one user
Message-ID: <20030326181035.GA16787@bakunin.copyleft.de>

Moin Guru's,

we have a 40node/80cpu SCore system at CLAMV
(http://www.clamv.iu-bremen.de/) that is used by half a dozen
people. We had a CPU/FAN problem a few days ago, and Ulrich, who
noticed it, did the following:

 - removed cell05 from /var/scored/pbs/server_priv/nodes
 - inserted cell05 into /opt/score/etc/scorehosts.defects
 - /etc/rc.d/init.d/pbs_server restart
 - and a shutdown of cell05, which saved the CPU

We got the FAN yesterday - I installed it and reversed Ulrich's
changes. I did not restart the pbs server, as there were jobs
running that were not my jobs.

Now mhoeft has the problem that all of his jobs hang in the
queue. When he came to me, he was also unable to qdel his jobs,
so I did the `/etc/rc.d/init.d/pbs_server restart`, as there were
no other users at that time.

Now he is able to submit and delete jobs, but his jobs never
run; they are just blocked and waiting in the queue.
I can start jobs, schroedi can start jobs, but Matthias' jobs look like:

7933.muscle.clu mhoeft default flash -- 4 -- -- -- Q --
   cell10+cell10+cell09+cell09+cell08+cell08+cell07+cell07

Now the funny part: if I start a job and immediately look at
`qstat -rn`, the job of mhoeft gets an R status for a fraction of a
second and falls back to Q almost immediately. Time elapsed stays --

??? Any idea?

Bye Michael
--
  mailto:kraehe @ copyleft.de UNA:+.? 'CED+2+:::Linux:2.4.18'UNZ+1'
  http://www.xml-edifact.org/ CETERUM CENSEO WINDOWS ESSE DELENDAM

From M.Newiger @ deltacomputer.de Thu Mar 27 09:26:53 2003
From: M.Newiger @ deltacomputer.de (Martin Newiger)
Date: Thu, 27 Mar 2003 01:26:53 +0100
Subject: [SCore-users-jp] [SCore-users] No Function
Message-ID:

Hi,

I want to add new nodes to an existing cluster configuration. If I press
the load button it appears to have no function (no window pops up).
Is there any other way to add new compute hosts to an old configuration?

Kind regards
M.Newiger

From kameyama @ pccluster.org Thu Mar 27 09:26:36 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Thu, 27 Mar 2003 09:26:36 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Queue hangs for one user
In-Reply-To: Your message of "Wed, 26 Mar 2003 19:10:36 JST." <20030326181035.GA16787@bakunin.copyleft.de>
Message-ID: <20030327002636.489502004E@neal.il.is.s.u-tokyo.ac.jp>

In article <20030326181035.GA16787 @ bakunin.copyleft.de> Michael Koehne wrote:
> We got the FAN yesterday - I installed it and reversed Ulrichs
> changes. I did not restart the pbs server, as there had been
> jobs running, that had not been my jobs.
Please check
    /var/scored/pbs/server_logs/*
and
    /var/scored/pbs/sched_logs/*
These directories contain the server and scheduler logs.

from Kameyama Toyohisa

From kameyama @ pccluster.org Thu Mar 27 09:34:06 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Thu, 27 Mar 2003 09:34:06 +0900
Subject: [SCore-users-jp] Re: [SCore-users] No Function
In-Reply-To: Your message of "Thu, 27 Mar 2003 01:26:53 JST."
Message-ID: <20030327003406.1B9DD2004E@neal.il.is.s.u-tokyo.ac.jp>

In article Martin Newiger wrote:
> I want to add new nodes to an existing cluster configuration. If I press
> the load-button it appears to have no function (no window pops up).

The load button does not pop up a window.
Please continue the setup by clicking the next button.

from Kameyama Toyohisa

From aixpresso @ web.de Thu Mar 27 18:17:48 2003
From: aixpresso @ web.de (Daniel Amkreutz)
Date: Thu, 27 Mar 2003 10:17:48 +0100
Subject: [SCore-users-jp] [SCore-users] Master as Compute Node Problem with PM-Ethernet
Message-ID: <200303270917.h2R9Hk205643@mailgate5.cinetic.de>

Hello.

We use a cluster of 3 nodes and 1 master. They are all identical, and
we would like to configure the master as a compute node, too.

Here's what we have so far:

- scored & msgb start and recognize 4 nodes
- a scout session can be started with 4 nodes.

But when I try to run a job, scored complains about the following:

<3> SCore-D:WARNING Unable to open PM gigaethernet/ethernet (error=2).
<3> SCore-D:WARNING argv[0] -config
<3> SCore-D:WARNING argv[1] /var/scored/scoreboard/master.0000B2000RgT
<3> SCore-D:ERROR No PM device opened.

In my opinion the <3> is the node number (OK, it is the master).
These are the configuration files:

PM ETHERNET:

unit 0
# maxnsend 0 - 32
maxnsend 16
# backoff 1000 - 20000 (usec)
backoff 4800
# checksum (0 if off, 1 is on)
checksum 0
# PE MAC address base hostname # comment
0 00:30:48:27:17:4C node01.cluster.domain # ip=192.168.222.1 on eth0
1 00:30:48:27:16:E2 node02.cluster.domain # ip=192.168.222.2 on eth0
2 00:30:48:27:16:56 node03.cluster.domain # ip=192.168.222.3 on eth0
3 00:30:48:27:17:32 master.cluster.domain # ip=192.168.222.254 on eth0

SCOREHOSTS.DB

/*
 * SCore 5.0 scorehosts.db
 * generated by PCCC EIT 5.2
 */
/* PM/Myrinet */
myrinet type=myrinet \
    -firmware:file=/opt/score/share/lanai/lanai.mcp \
    -config:file=/opt/score/etc/pm-myrinet.conf
/* PM/Myrinet */
myrinet2k type=myrinet2k \
    -firmware:file=/opt/score/share/lanai/lanaiM2k.mcp \
    -config:file=/opt/score/etc/pm-myrinet.conf
/* PM/Ethernet */
ethernet type=ethernet \
    -config:file=/opt/score/etc/pm-ethernet.conf
gigaethernet type=ethernet \
    -config:file=/opt/score/etc/pm-ethernet.conf
/* PM/Agent */
udp type=agent -agent=pmaudp \
    -config:file=/opt/score/etc/pm-udp.conf
/* RHiNET */
rhinet type=rhinet \
    -firmware:file=/opt/score/share/rhinet/phu_top_0207a.hex \
    -config:file=/opt/score/etc/pm-rhinet.conf
##
/* PM/SHMEM */
shmem0 type=shmem -node=0
shmem1 type=shmem -node=1
##
#include "/opt/score//etc/ndconf/0"
#include "/opt/score//etc/ndconf/1"
#include "/opt/score//etc/ndconf/2"
#include "/opt/score//etc/ndconf/3"
##
#define MSGBSERV msgbserv=(master.cluster.domain:8764)
node01.cluster.domain HOST_0 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV
node02.cluster.domain HOST_1 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV
node03.cluster.domain HOST_2 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV
master.cluster.domain HOST_3 network=gigaethernet,shmem0,shmem1 group=_scoreall_,pcc smp=2 MSGBSERV

I've also installed the SCORE Kernel with PM-Ethernet support on the
master and activated PM Ethernet
on eth0 with /etc/init.d/pm_ethernet start

Can anyone tell what's wrong? Has anyone else tried to use the master
as a compute node?

Thank You
Daniel

From hori @ swimmy-soft.com Thu Mar 27 18:27:35 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Thu, 27 Mar 2003 18:27:35 +0900
Subject: [SCore-users-jp] Re: [SCore-users] Master as Compute Node Problem with PM-Ethernet
In-Reply-To: <200303270917.h2R9Hk205643@mailgate5.cinetic.de>
References: <200303270917.h2R9Hk205643@mailgate5.cinetic.de>
Message-ID: <3131634455.hori0000@swimmy-soft.com>

Hi,

>I've also installed the SCORE Kernel with PM-Ethernet support on the
>master and activated PM Ethernet on eth0 with /etc/init.d/pm_ethernet
>start
>
>can anyone tell what's wrong ??

It sounds like you installed the master node manually. I suspect that
the /var/scored/ directory does not exist. You had better use the
"bininstall -compute" command (script), which does most of the
installation work.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.

From aixpresso @ web.de Thu Mar 27 18:52:53 2003
From: aixpresso @ web.de (Daniel Amkreutz)
Date: Thu, 27 Mar 2003 10:52:53 +0100
Subject: [SCore-users-jp] [SCore-users] Re: Master as Compute Node Problem with PM-Ethernet
Message-ID: <200303270952.h2R9qr229204@mailgate5.cinetic.de>

Hi.

The /var/scored directories are present.
Your tip about the bininstall script was very helpful.

Thank you!
From ce107 @ dam.brown.edu Fri Mar 28 08:36:37 2003
From: ce107 @ dam.brown.edu (C. Evangelinos)
Date: Thu, 27 Mar 2003 18:36:37 -0500 (EST)
Subject: [SCore-users-jp] [SCore-users] SCORE_RSH and use of ssh instead of rsh
Message-ID: <200303272336.h2RNabG15621@fritz.dam.brown.edu>

Thanks to the list's suggestions, the NIS setup with only /var/scored
local to each compute node works fine. I'm also exporting /opt/score via
NFS to the rest of the nodes (read-only) so they can have full
functionality for compiling, executing etc. Such a setup meant that I
could not use the bininstall way of doing things and ended up doing quite
a few things on my own.

A few comments (mainly SCore but also Omni-related):

1) Removing the rpms leaves init scripts behind in /etc/rc.d as well as
the new devices (the latter is not really a problem).

2) It would be nice to have a script that reproduces the effects of
installing the rpms for setting up device and configuration scripts,
local directories etc. for the case of NFS installations like mine
which do not use EIT or the RPMS for the compute nodes. I may end up
writing one myself anyway as I add nodes.

3) I got SCore to work fine (so far) on a system with a Realtek ethernet
card (8139too driver). It cannot handle interrupt reaping, however: the
machine becomes highly unstable after a little while, the system log
fills up with
    kernel: eth0: Too much work at interrupt, IntrStatus=0x0001.
messages and the machine requires a reboot. With reaping set to off
everything works fine.
Performance between such a box and another one with an
Intel eepro100 driven card is so-so: ping-pong latency (RTT/2)
is ~58us, asymptotic ping-pong bandwidth is ~77Mbit/s out of 100 (worse
than what LAM gets). BTW it'd be nice if the SCore documentation reported
RTT/2 instead of RTT numbers, as I've seen people mistake MPICH/PM
numbers for double their actual value.

4) It would be more graceful to set things up so that if one already
has Java installed, the system doesn't look for things in
/opt/score/java/linux. Setting OMNI_JAVAVM seems to fix things for the
Omni compiler, but jumpshot ignores JAVA_HOME and JVM set as environment
variables before calling it.

5) There should be a way to pass back-end specific compiler
optimization flags to the Omni compiler.

My main remaining problems are:

a) Integration with SGE - I just got someone to translate the Japanese
instructions, but I'd like to know whether the source code that comes
with SCore (contrib) is modified or the Sun one, as I want to use SCore
with the latest patched version of SGE out of Sun (and I'd prefer, if
possible, to avoid having to recompile everything and use as much of
Sun's binary installation as possible).

b) This is the most important problem and related to the title of my
e-mail:
For various security reasons I cannot use SCore with rsh (beyond
testing). Even with tcp wrappers enabled to limit access I'd prefer to
use ssh instead. SCORE_RSH seems to work with very few SCore binaries
and most importantly cannot work with scout. Is there a quick fix for
that or is rsh hardcoded in too many places in the source code?
Moreover, given the way connections propagate on an SCore cluster,
would running an ssh agent on the machine where scout is entered be
enough to provide for transparent connections, or do ssh-agents need to
run everywhere, with some mechanism for new shells to get the required
environment variables set up automatically?

c) Moreover, if running as an SGE job, what mechanism would SCore use?
Normal rsh, ssh (supposing it's fixed as a replacement) or SGE's rsh?

Thanks everyone for their help,

Constantinos Evangelinos
Center for Fluid Mechanics Brown University
and Ocean Engineering Department MIT

PS> On another mini-cluster of IBM nodes with NetXtreme BCM5703X Gigabit Ethernet cards (tg3 driver), netpipe's ping-pong gives an RTT/2 latency of 68us and an asymptotic bandwidth of around 535Mbit/s, though sometimes one gets an extra 200Mbit/s for no reason... Avoid...

From kameyama @ pccluster.org Fri Mar 28 10:06:39 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Fri, 28 Mar 2003 10:06:39 +0900
Subject: [SCore-users-jp] Re: [SCore-users] SCORE_RSH and use of ssh instead of rsh
In-Reply-To: Your message of "Thu, 27 Mar 2003 18:36:37 JST." <200303272336.h2RNabG15621@fritz.dam.brown.edu>
Message-ID: <20030328010639.A2D6F20058@neal.il.is.s.u-tokyo.ac.jp>

In article <200303272336.h2RNabG15621 @ fritz.dam.brown.edu> "C. Evangelinos" wrote:
> 2) It would be nice to have a script that reproduces the effects of
> installing the rpms for setting up device and configuration scripts,
> local directories etc. for the case of NFS installations like mine
> which do not use EIT or the RPMS for the compute nodes. I may end up
> writing one myself anyway as I add nodes.

You can use the /opt/score/install/setup command to install the devices, the init.d scripts and the local directories:

    # cd /opt/score/install
    # ./setup -score_comp

Please see /opt/score/doc/html/en/installation/sys-compute-fromsrc.html

> b) This is the most important problem and related to the title of my
> e-mail:
> For various security reasons I cannot use SCore with rsh (beyond
> testing). Even with tcp wrappers enabled to limit access I'd prefer to
> use ssh instead.
SCORE_RSH seems to work with very few SCore binaries
> and most importantly cannot work with scout. Is there a quick fix for
> that or is rsh hardcoded in too many places in the source code?

SCORE_RSH probably works with most SCore commands except scout and PBS.
scout does not work with SCORE_RSH because scout uses rsh between compute hosts.
For example, if you scout to 4 hosts (comp0, comp1, comp2, comp3), scout executes as follows:

1. scout executes scremote on comp0:
       your_host% rsh comp0 scremote ...
2. scremote on comp0 executes scremote on comp1:
       comp0% rsh comp1 scremote ...
3. scremote on comp1 executes scremote on comp2:
       comp1% rsh comp2 scremote ...
4. scremote on comp2 executes scremote on comp3:
       comp2% rsh comp3 scremote ...

If scout used SCORE_RSH, you could not use ssh-agent to run scout, because each connection after the first starts on a compute host.

But if you run scoutd on the compute hosts, scout uses scoutd instead of rshd, so you may stop rshd on the compute hosts.

from Kameyama Toyohisa

From rajeev @ pst.fujitsu.com Fri Mar 28 10:29:56 2003
From: rajeev @ pst.fujitsu.com (Rajeev S)
Date: Fri, 28 Mar 2003 10:29:56 +0900
Subject: [SCore-users-jp] [SCore-users] Include me
Message-ID: <200303280128.KAA05378@tkns.tk.pst.fujitsu.com>

From yoneya @ nanolc.jst.go.jp Fri Mar 28 15:29:33 2003
From: yoneya @ nanolc.jst.go.jp (Makoto Yoneya)
Date: Fri, 28 Mar 2003 15:29:33 +0900
Subject: [SCore-users-jp] [SCore-users] How to specify a input data file with scrun?
Message-ID:

Dear SCore users:

I'm a newcomer to the SCore world. I'd like to run the MD program GROMACS (3.1.4) on a Linux 2.4.18/SCore (5.0.0) system. I tried the following.
scrun -scored=cmp***,nodes=4 scatter -file data.tpr :: mdrun_d -np 4 -deffnm data < data.tpr

Here, mdrun_d is the program executable and data.tpr is an input data file for mdrun_d. The invocation of scrun appears to succeed. However, the job seems to halt right around reading the input data. What is wrong with this usage? I really need help!

Yokoyama Nano-structured Liquid Crystal Project.
Makoto Yoneya (Dr.)

From Yamamoto.Takaya @ wrc.melco.co.jp Fri Mar 28 15:48:25 2003
From: Yamamoto.Takaya @ wrc.melco.co.jp (Takaya Yamamoto)
Date: Fri, 28 Mar 2003 15:48:25 +0900
Subject: [SCore-users-jp] Single CPU and dual CPU
Message-ID: <5.0.2.5.2.20030328153729.033c49f0@133.141.16.40>

This is Yamamoto from Mitsubishi Electric. Thank you for your continued support.

I have a question about building an SMP cluster.

I am currently trying to set up a configuration of 3 PCs (5 CPUs):
  Server and compute host: single CPU
  2 compute hosts: both dual CPU

I am trying to install with EIT, but at the Group Creation step I cannot work out how to put single-CPU PCs and dual-CPU PCs in the same group.
How should I do this?

Thank you in advance.

From hori @ swimmy-soft.com Fri Mar 28 16:06:34 2003
From: hori @ swimmy-soft.com (Atsushi HORI)
Date: Fri, 28 Mar 2003 16:06:34 +0900
Subject: [SCore-users-jp] Re: [SCore-users] How to specify a input data file with scrun?
In-Reply-To:
References:
Message-ID: <3131712394.hori0000@swimmy-soft.com>

Hi,

>I'd tried the following.
>
>scrun -scored=cmp***,nodes=4 scatter -file data.tpr :: mdrun_d -np 4 -deffnm
>data < data.tpr
>
>Here, mdrun_d is the program executable, data.tpr is a input data file for
>this mdrun_d.
>The invocation of the scrun looks successful.
>However, this job looks halt just around reading the input data.
>What's wrong the usage?

Here,

    scrun -scored=cmp***,nodes=4 scatter -file data.tpr

creates data.tpr somewhere in the /var/scored/ directory as a temporary file on each compute host.
Then,

    mdrun_d -np 4 -deffnm data

tries to read the file in the current directory, that is, the directory on the server host where the scrun command was invoked.

If my assumption above is true, then the job will run if you type the following:

    scrun -scored=cmp***,nodes=4 scatter -file /tmp/data.tpr :: mdrun_d -np 4 -deffnm /tmp/data < data.tpr

P.S.
I am not sure, but there was a bug in scatter or in the stdin handling of scrun in 5.0 or thereabouts. You had better upgrade your SCore system.

----
Atsushi HORI
SCore Developer
Swimmy Software, Inc.

From uebayasi @ pultek.co.jp Fri Mar 28 16:19:18 2003
From: uebayasi @ pultek.co.jp (Masao Uebayashi)
Date: Fri, 28 Mar 2003 16:19:18 +0900 (JST)
Subject: [SCore-users-jp] [SCore-users] How to specify a input data file with scrun?
In-Reply-To:
References:
Message-ID: <20030328.161918.125126176.uebayasi@pultek.co.jp>

> scrun -scored=cmp***,nodes=4 scatter -file data.tpr :: mdrun_d -np 4 -deffnm
> data < data.tpr

In this case, '<' is interpreted by the shell before scrun is invoked, so it is never passed to scrun.

You need to use ':= data.tpr' to specify an input file. See the "INPUT/OUTPUT REDIRECTION" section in the scrun manual page.

Masao

From kameyama @ pccluster.org Fri Mar 28 16:27:50 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Fri, 28 Mar 2003 16:27:50 +0900
Subject: [SCore-users-jp] Single CPU and dual CPU
In-Reply-To: Your message of "Fri, 28 Mar 2003 15:48:25 JST." <5.0.2.5.2.20030328153729.033c49f0@133.141.16.40>
Message-ID: <20030328072750.CA5202005C@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.
In article <5.0.2.5.2.20030328153729.033c49f0 @ 133.141.16.40> Takaya Yamamoto wrote:
> I am currently trying to set up a configuration of 3 PCs (5 CPUs):
>   Server and compute host: single CPU
>   2 compute hosts: both dual CPU
>
> I am trying to install with EIT, but at the Group Creation step
> I cannot work out how to put single-CPU PCs and dual-CPU PCs
> in the same group.
> How should I do this?

(It may be quicker to edit scorehosts.db directly, but...)
Create two groups.
First, create a group containing only the SMP hosts, and include shmem in it.
Next, create another group containing all the hosts, and do not include shmem in that one.

In the final scorehosts.db the network is specified per host, so I think you will be able to use all 5 CPUs by using the latter group.

from Kameyama Toyohisa

From yoneya @ nanolc.jst.go.jp Fri Mar 28 17:37:07 2003
From: yoneya @ nanolc.jst.go.jp (Makoto Yoneya)
Date: Fri, 28 Mar 2003 17:37:07 +0900
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To: <3131712394.hori0000@swimmy-soft.com>
Message-ID:

Thanks, Hori-san, for your comments.

> -----Original Message-----
> From: Atsushi HORI
> scrun -scored=cmp***,nodes=4 scatter -file /tmp/data.tpr :: mdrun_d
> -np 4 -deffnm /tmp/data < data.tpr

That improved the situation!
The program now appears to be running, since not only the elapsed time but also the CPU time is increasing (previously only the elapsed time increased).
However, even though I tried a very short run (only 5 steps), the job continued to run for over 30 CPU minutes.
Also, this time nothing is printed to STDOUT or STDERR on the screen, and no log files appear in the invoking directory.
So some file output problem seems to be occurring now.

I still need help!
Makoto Yoneya
JST/ERATO Yokoyama Nano-LC project

From phmaeda @ med.nagoya-cu.ac.jp Fri Mar 28 17:55:59 2003
From: phmaeda @ med.nagoya-cu.ac.jp (phmaeda @ med.nagoya-cu.ac.jp)
Date: Fri, 28 Mar 2003 17:55:59 +0900
Subject: [SCore-users-jp] EIT installation trouble
Message-ID: <20030328175559.143f88cf.phmaeda@med.nagoya-cu.ac.jp>

My name is Maeda (前田 徹), from the Department of Pharmacy, Nagoya City University Hospital.

I am planning to build a PC cluster to run molecular dynamics calculations on proteins.

When I try to install SCore 5.2 with EIT on a machine with a full installation of RedHat 7.3, the following dialog appears:

    Error Message
    No boot configuration files

The installation proceeds if I ignore this message, but what do the "boot configuration files" in this message refer to?

Also, when proceeding to the compute host installation, the following message appears:

    Cannot exec daemon/a.out

Perhaps because of this, the compute host booted from the floppy disk shows

    No dhcp_server specified. Used Broadcast

and the installation fails after several retries.
I suspect this is because the DHCP server on the server side is not running. Which program acts as the DHCP server, and how can I start it manually?

I would appreciate your advice.

Department of Pharmacy, Nagoya City University Hospital
Maeda (前田 徹)
phmaeda @ med.nagoya-cu.ac.jp

From Yamamoto.Takaya @ wrc.melco.co.jp Fri Mar 28 18:55:07 2003
From: Yamamoto.Takaya @ wrc.melco.co.jp (Takaya Yamamoto)
Date: Fri, 28 Mar 2003 18:55:07 +0900
Subject: [SCore-users-jp] Single CPU and dual CPU
In-Reply-To: <20030328072750.CA5202005C@neal.il.is.s.u-tokyo.ac.jp>
References: <"Your message of Fri, 28 Mar 2003 15:48:25 JST."<5.0.2.5.2.20030328153729.033c49f0@133.141.16.40>
Message-ID: <5.0.2.5.2.20030328185453.0333ed58@133.141.16.40>

This is Yamamoto.
Thank you very much.

At 16:27 03/03/28 +0900, kameyama @ pccluster.org wrote:
>This is Kameyama.
>
>In article <5.0.2.5.2.20030328153729.033c49f0 @ 133.141.16.40> Takaya
>Yamamoto wrote:
> > I am currently trying to set up a configuration of 3 PCs (5 CPUs):
> >   Server and compute host: single CPU
> >   2 compute hosts: both dual CPU
> >
> > I am trying to install with EIT, but at the Group Creation step
> > I cannot work out how to put single-CPU PCs and dual-CPU PCs
> > in the same group.
> > How should I do this?
>
>(It may be quicker to edit scorehosts.db directly, but...)
>Create two groups.
>First, create a group containing only the SMP hosts, and include shmem in it.
>Next, create another group containing all the hosts, and do not include
>shmem in that one.
>
>In the final scorehosts.db the network is specified per host, so I think
>you will be able to use all 5 CPUs by using the latter group.
>
> from Kameyama Toyohisa
>_______________________________________________
>SCore-users-jp mailing list
>SCore-users-jp @ pccluster.org
>http://www.pccluster.org/mailman/listinfo/score-users-jp

From bogdan.costescu @ iwr.uni-heidelberg.de Fri Mar 28 20:35:37 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 28 Mar 2003 12:35:37 +0100 (CET)
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To:
Message-ID:

On Fri, 28 Mar 2003, Makoto Yoneya wrote:

> Also in this time, there are no screen listing of STDOUT or STDERR
> and also any log files on the invoking directory.

I'm sorry, but I don't exactly understand the problem.

From bogdan.costescu @ iwr.uni-heidelberg.de Fri Mar 28 20:49:01 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 28 Mar 2003 12:49:01 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] SCORE_RSH and use of ssh instead of rsh
In-Reply-To: <200303272336.h2RNabG15621@fritz.dam.brown.edu>
Message-ID:

On Thu, 27 Mar 2003, C. Evangelinos wrote:

> 1) removing the rpms leaves init scripts behind in /etc/rc.d as well
> as the new devices (the latter is not really a problem)

Indeed, the devices are not a problem. However, the scripts should be removed, but only if you did not modify them (= rpm -V still reports them as "original").

> 2) It would be nice to have a script that reproduces the effects of
> installing the rpms for setting up device and configuration scripts,
> local directories etc.

It exists. You probably didn't read the whole docs...

http://www.pccluster.org/score/dist/score/html/en/installation/sys-compute-fromsrc.html

and if you think the 4-5 commands that have to be executed are too many, then you can put them in a script.
> for the case of NFS installations like mine which do not use EIT or the
> RPMS for the compute nodes.

My install attempt a few weeks ago was not on NFS, but it was without the RPMs, so it's certainly possible if you follow these instructions.

> 3) I got SCore to work fine (so far) on a system with a Realtek
> ethernet card (8139too driver).

I'm surprised that it worked at all! The Realtek cards (not including the latest C+ variation driven by the 8139cp driver, which is a different chip) are not useful for anything that requires large amounts of communication - they need too much CPU intervention to perform any kind of network activity.

> kernel: eth0: Too much work at interrupt, IntrStatus=0x0001.

This message is typical of a system that is too slow to process all the incoming packets. And the system can be slowed down a lot merely by processing these packets!

> Performance between such a box and another one with an Intel eepro100
> driven card is so-and-so

You are comparing two different things: if you have eepro100 cards for the whole cluster, use them!

> Is there a quick fix for that or is rsh hardcoded in too many places in
> the source code?

Well, if you can get ssh to act like rsh (which is normally the case), then you can just rename ssh to rsh (or make a link or ...) and everything should just work.
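
The "make a link" idea could be sketched roughly as follows, assuming a POSIX shell. The function name make_rsh_shim and the shim directory are hypothetical and not part of SCore, and whether every SCore tool then behaves correctly over ssh is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper (not part of SCore): create a directory containing an
# "rsh" symlink that points at ssh, so programs that exec "rsh" get ssh.
make_rsh_shim() {
    shim_dir=$1
    # Second argument is optional; default to whatever ssh is first in PATH.
    ssh_path=${2:-$(command -v ssh)}
    if [ -z "$ssh_path" ]; then
        echo "make_rsh_shim: ssh not found" >&2
        return 1
    fi
    mkdir -p "$shim_dir" || return 1
    # -f replaces an existing link, so the helper can be re-run safely.
    ln -sf "$ssh_path" "$shim_dir/rsh"
}
```

Usage would be along the lines of `make_rsh_shim "$HOME/rsh-shim"`, followed by putting that directory first in PATH on every host that starts connections. A link in a private directory leaves the real rsh binary untouched, which makes the change easy to undo.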
--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De

From bogdan.costescu @ iwr.uni-heidelberg.de Fri Mar 28 20:55:01 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 28 Mar 2003 12:55:01 +0100 (CET)
Subject: [SCore-users-jp] Re: [SCore-users] SCORE_RSH and use of ssh instead of rsh
In-Reply-To: <20030328010639.A2D6F20058@neal.il.is.s.u-tokyo.ac.jp>
Message-ID:

On Fri, 28 Mar 2003 kameyama @ pccluster.org wrote:

> scout is not worked SCORE_RSH, because scout use rsh to inter-compute hosts.
> For example, if you want to scout to 4 hosts (comp0, comp1, comp2, comp3),

OK, but what about setting SSH to use HostbasedAuthentication or RhostsRSAAuthentication, and having an ssh_known_hosts file containing the keys of all nodes distributed to all nodes in the cluster? Then you can make an ssh connection from any node to any node, so the scout scheme should work. I do have a non-SCore cluster here that is configured like that - rsh is no longer installed.

> If scout use SCORE_RSH, you cannot use ssh-agent to execute scout.

The idea above doesn't use ssh-agent. It's just ssh on the client side and sshd on the server side.
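
The settings involved can be sketched as below, assuming OpenSSH. HostbasedAuthentication is an existing sshd_config/ssh_config keyword; the file locations shown are the usual ones but vary by installation, and the helper name hostbased_enabled is hypothetical. The check is deliberately simplified: it ignores Match blocks, Include files and the Keyword=value spelling.

```shell
#!/bin/sh
# Settings this scheme relies on (paths are typical, not guaranteed):
#   /etc/ssh/sshd_config on every node:  HostbasedAuthentication yes
#   /etc/ssh/ssh_config  on every node:  HostbasedAuthentication yes
#   /etc/ssh/ssh_known_hosts:            public host keys of all nodes
#   /etc/ssh/shosts.equiv:               names of hosts allowed to connect
#
# Hypothetical helper: print "yes" if the given config file enables
# HostbasedAuthentication, "no" otherwise. OpenSSH uses the first value it
# obtains for a keyword, so only the first uncommented occurrence counts.
hostbased_enabled() {
    awk '
        tolower($1) == "hostbasedauthentication" { v = tolower($2); exit }
        END { if (v == "yes") print "yes"; else print "no" }
    ' "$1"
}
```

Running `hostbased_enabled /etc/ssh/sshd_config` on each node is a quick way to spot a node that was missed when the configuration was distributed.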
--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De

From yoneya @ nanolc.jst.go.jp Fri Mar 28 23:31:44 2003
From: yoneya @ nanolc.jst.go.jp (yoneya @ nanolc.jst.go.jp)
Date: Fri, 28 Mar 2003 23:31:44 +0900(JST)
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To:
References:
Message-ID: <20030328233144.3d78.yoneya@nanolc.jst.go.jp>

Dear Dr. Costescu

Thanks for your comments.

> From the docs at www.gromacs.org, I find that mdrun can read the input
> from a file specified with the "-s" command line option. Why aren't you
> specifying it like this ? Then you don't need to redirect stdin.

The option I tried,
    mdrun_d -deffnm data
works the same as
    mdrun_d -s data.tpr
since -deffnm specifies the generic name for all I/O files: not only *.tpr but also the *.gro files etc.

> If you want to send a file from stdin to the process, then I don't
> understand exactly why you are copying it first to the nodes.

As far as I know, GROMACS does not read the input data from stdin but simply opens and reads the specified input data file.
As you say, only the primary execution node needs to read the input file (again, as far as I know).
However, since I do not know which node will become the primary node, I tried to copy the input data file to all the nodes in the group.

If I have misunderstood anything above, please point it out. It would be a great help in solving my problems.
Thanks again.
Makoto Yoneya
Yokoyama Nano-LC project

From bogdan.costescu @ iwr.uni-heidelberg.de Fri Mar 28 23:55:52 2003
From: bogdan.costescu @ iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 28 Mar 2003 15:55:52 +0100 (CET)
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To: <20030328233144.3d78.yoneya@nanolc.jst.go.jp>
Message-ID:

On Fri, 28 Mar 2003 yoneya @ nanolc.jst.go.jp wrote:

> since -deffnm specify the generic name for I/O files

OK, as I'm not a GROMACS user, I missed this in the docs. Then the suggestion from Atsushi should work. But wouldn't it be even simpler to take the file(s) from the current directory, i.e. why wouldn't it work like:

    scrun [options] mdrun_d -deffnm [options]

started in the directory where data.tpr already exists?

> As far as I knonw, GROMACS does not read the input data from stdin

Sorry, I misinterpreted the line you first posted.

> However, since I do not know which node will become the primary node,
> I tried to copy the input data file to all the nodes in the group.

No, you don't know on which nodes the job will run, but scrun/scatter do. So by using "scatter -node 0", stdin will be sent to the first node of the job, whichever node that is.
--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu @ IWR.Uni-Heidelberg.De

From yoneya @ nanolc.jst.go.jp Sat Mar 29 09:56:19 2003
From: yoneya @ nanolc.jst.go.jp (yoneya @ nanolc.jst.go.jp)
Date: Sat, 29 Mar 2003 09:56:19 +0900(JST)
Subject: [SCore-users-jp] RE: [SCore-users] How to specify a input data file with scrun?
In-Reply-To:
References:
Message-ID: <20030329095619.8c94.yoneya@nanolc.jst.go.jp>

Dear Dr. Costescu :

Thanks again.

>But wouldn't be even simpler to take
> the file(s) from the current directory, i.e. why wouldn't it work like:
>
> scrun [options] mdrun_d -deffnm [options]
>
> started in the directory where data.tpr already exists.

I tried the above first, but the job appears to halt (without the CPU time increasing).
It may work on a differently configured SCore system, but not on the system I use.

> No, you don't know on which nodes the job will run, but scrun/scatter do.
> So by using "scatter -node 0", stdin will be sent to the first node of the
> job, whatever this is.

I have not tried this usage of scatter yet. I'll try it later.

Thanks again!

Makoto Yoneya
JST/ERATO Yokoyama Nano-LC Project

From kameyama @ pccluster.org Mon Mar 31 11:01:31 2003
From: kameyama @ pccluster.org (kameyama @ pccluster.org)
Date: Mon, 31 Mar 2003 11:01:31 +0900
Subject: [SCore-users-jp] EIT installation trouble
In-Reply-To: Your message of "Fri, 28 Mar 2003 17:55:59 JST."
        <20030328175559.143f88cf.phmaeda@med.nagoya-cu.ac.jp>
Message-ID: <20030331020131.A856420058@neal.il.is.s.u-tokyo.ac.jp>

This is Kameyama.

In article <20030328175559.143f88cf.phmaeda @ med.nagoya-cu.ac.jp> phmaeda @ med.nagoya-cu.ac.jp wrote:
> When I try to install SCore 5.2 with EIT on a machine with a full
> installation of RedHat 7.3, the following dialog appears:
> Error Message
> No boot configuration files
> The installation proceeds if I ignore this message, but what do the
> "boot configuration files" in this message refer to?

It appears to be failing the check for whether the files
    100Mbps_Ethernet.lst
    1Gbps_Ethernet.lst
exist in /opt/score/ndboot/images.

>
> Also, when proceeding to the compute host installation, the message
> Cannot exec daemon/a.out
> appears.

This appears to be a failure while trying to start /opt/score/libexec/eitd.

> Perhaps because of this, the compute host booted from the floppy disk shows
> No dhcp_server specified. Used Broadcast
> and the installation fails after several retries.
> I suspect this is because the DHCP server on the server side is not
> running. Which program acts as the DHCP server,

The eitd mentioned above acts as the DHCP server.

Judging from the symptoms, the SCore installation itself does not seem to have completed properly.
Perhaps the filesystem containing /opt is full...
Please try running
    # df -h /opt

from Kameyama Toyohisa
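
The checks suggested in this reply can be run together. The following is a minimal sketch, assuming a POSIX shell and the paths named above; the function name check_eit_install is hypothetical and not an SCore command.

```shell
#!/bin/sh
# Hypothetical helper: run the checks suggested above against an SCore
# install root (default /opt/score). Returns non-zero if a file is missing.
check_eit_install() {
    root=${1:-/opt/score}
    status=0
    for f in ndboot/images/100Mbps_Ethernet.lst \
             ndboot/images/1Gbps_Ethernet.lst \
             libexec/eitd; do
        if [ -e "$root/$f" ]; then
            echo "ok      $root/$f"
        else
            echo "missing $root/$f"
            status=1
        fi
    done
    # A full filesystem would explain an incomplete installation.
    df -h "$root" 2>/dev/null || echo "cannot run df on $root"
    return "$status"
}
```

Calling `check_eit_install` with no argument inspects /opt/score; any "missing" line, or a filesystem shown as 100% used, would point to an incomplete installation rather than an EIT problem.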