SCore FAQ


This FAQ is based on discussions on the mailing list.

Categories

  1. Hostname (NIS, DNS, /etc/hosts etc) Configuration
  2. Installation of SCore
  3. PM Communication Facility
  4. Compiling from Sources of SCore
  5. SCore-D
  6. OMNI/OpenMP Compiler
  7. MPICH
  8. PBS
  9. Performance
  10. Miscellaneous


Category: Hostname (NIS, DNS, /etc/hosts etc) Configuration


Questions
Answers
1.
The output of "ypcat hosts" differs from the content of the /etc/hosts file.
Run the following to propagate the content of /etc/hosts to NIS.
      # cd /var/yp
      # make
      
Note:
It is not a bug if ypcat prints the same line two or more times. Remove the duplicates before comparing:
      % ypcat hosts | sort -u
      
There is no problem if the output of the above command matches the content of the /etc/hosts file.
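The comparison described above can also be scripted. The following is a minimal sketch and not part of SCore. It assumes that the output of "ypcat hosts" and the content of /etc/hosts are available as text, and the function names parse_hosts and hosts_differences are hypothetical.

```python
def parse_hosts(text):
    """Parse hosts-format text into a set of (address, names) entries.

    Comments and blank lines are skipped. Duplicate lines collapse because
    the result is a set, which has the same effect as "sort -u".
    """
    entries = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        entries.add((fields[0], tuple(fields[1:])))
    return entries


def hosts_differences(nis_text, file_text):
    """Return (only_in_nis, only_in_file) as sorted lists of entries."""
    nis = parse_hosts(nis_text)
    local = parse_hosts(file_text)
    return sorted(nis - local), sorted(local - nis)
```

If both returned lists are empty, the NIS map and /etc/hosts agree. Otherwise the second list shows the entries that "make" in /var/yp still has to publish.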
2.
We cannot run the system test "sceptic -v -g pcc". Also, "telnet" from a client node to the server node fails.
If you are using NIS on the server node, check whether the loopback address is defined with the server name ("127.0.0.1 server.domain server"). If so, remove this line and add a correct line. Finally, run "# cd /var/yp; make" to update the NIS database.
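The check for a misused loopback line can be sketched as follows. This is an illustration only: the function name loopback_problems is hypothetical, and treating "localhost" and "localhost.localdomain" as the only acceptable names for 127.0.0.1 is an assumption.

```python
def loopback_problems(hosts_text):
    """Return names mapped to 127.0.0.1 that are not localhost aliases."""
    # Assumption: only these two names are expected on the loopback line.
    allowed = {"localhost", "localhost.localdomain"}
    bad = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "127.0.0.1":
            bad.extend(name for name in fields[1:] if name not in allowed)
    return bad
```

A non-empty result, for example ["server.domain", "server"], points at the line that should be removed from /etc/hosts before the NIS database is rebuilt.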
3.
"msgbserv" doesn't start up.
Please check the NIS, DNS, and "/etc/hosts" settings.
4.
"scout" doesn't start up.
If you are using SCore 5.0.0 or earlier, set both the FQDN and the short host name in "/etc/hosts". This problem is fixed in SCore 5.0.1.
5.
When executing "scrun", an error message "No self host error" appears only on one host.
Please check "/etc/hosts" on the host where the error message appears.



Category: Installation of SCore


Questions
Answers
1.
My machine does not have a floppy disk drive (FDD).
Can SCore be installed on a machine without an FDD?
How can I boot from a CD with the SCore CD?
You can write the SCore client boot image to a CD-R and install SCore from it.
The boot disk images for SCore clients are in /opt/score/ndboot/images/.

To create a bootable CD-ROM image, run:
      # mkisofs -b /opt/score/ndboot/images/100Mbps_Ethernet.img -c boot.catalog -o /tmp/score-boot.iso -J -r -T
      
Then write the image to a CD-R with cdrecord or xcdroast.
2.
Can EIT set up Compute Hosts using eth2? Currently, EIT is only working with eth0.
If you want to use eth1 or eth2, install SCore using eth0 first and then change the interface from eth0 to eth2.
3.
EIT doesn't start up, because it failed to set "parameter (domainName)".
Please check the NIS settings. Then run "% ypmatch your_servers_hostname hosts | awk '{print $2}'" and confirm that the returned value is the FQDN host name.
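The awk step above only extracts the second field, so the FQDN check itself is left to the reader. A minimal sketch of both steps follows. The function names are hypothetical, and "contains at least two non-empty dot-separated labels" is only a heuristic for "is an FQDN", not a rule taken from SCore.

```python
def second_field(line):
    """Return the second whitespace-separated field, like awk '{print $2}'."""
    fields = line.split()
    return fields[1] if len(fields) >= 2 else ""


def looks_like_fqdn(name):
    """Heuristic: an FQDN has at least two non-empty dot-separated labels."""
    labels = name.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)
```

For a line such as "192.168.1.1 server.pccluster.org server", the second field is "server.pccluster.org" and passes the check. If the short name comes first in the hosts entry, the check fails, which matches the situation this answer describes.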
4.
While installing SCore using EIT, I got the error message "Cannot resolve the host clusterus1.dciem.dnd.ca IP address" and could not continue the installation.

Please check the NIS, DNS, and "/etc/hosts" settings.
5.
While installing SCore using EIT, I got the error message "Cannot resolve the server's hostname from IP address" and could not continue the installation.
Please check the NIS, DNS, and "/etc/hosts" settings.
6.
An error message "grab failed: another application has grab" appears when generating a boot disk.
Another program is using the floppy disk drive. Please check whether the floppy disk drive is in use by another program.
7.
We are installing with the Easy Installation Tool (EIT). Before the boot floppy is created, an error dialog box appears saying that it cannot find the /opt/score/setup//RedHat/instimage/compconf/.conf directory.
What does this mean?
Please check the disk space of the /opt partition on the Server Host as follows:
      % df /opt
      
/opt/score needs about 500 MB.
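The same check can be done programmatically. The sketch below is an illustration under stated assumptions: the 500 MB threshold is taken from the answer above and is read here as required free space, and the function name has_enough_space is hypothetical.

```python
import shutil

# "about 500 MB", taken from the answer above (assumed to mean free space).
REQUIRED_BYTES = 500 * 1024 * 1024


def has_enough_space(path, required=REQUIRED_BYTES):
    """Return True if the file system holding `path` has `required` bytes free."""
    return shutil.disk_usage(path).free >= required
```

On a Server Host, has_enough_space("/opt") would return False when the /opt partition is too full for /opt/score.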
8.
When installing compute nodes with SCore 5.0.0, anaconda fails to start.
This problem is fixed in SCore 5.0.1.
9.
We can't install SCore using EIT with e1000.
Boot floppies before SCore 5.2.0 do not include the e1000 driver.
If you install SCore 5.2.0 or later, please select 1Gbps_Ethernet in the Select Boot Network Device window.
10.
When installing SCore with EIT, the following message appears and the file transfer stops.

The file
mnt/source2/RedHat/RPMS/ cannot be opened.
This is due to a missing file, a bad package, or bad media.
Press to try again.
There are two possible causes.

(1) The CD-ROM cannot be read.

  Copy the CD-ROM image onto the hard disk, mount it there, and run the installation from it.

(2) NFS error.

  In this case, work around the problem as follows.
    1) Switch to the shell screen (press "Alt + F2").
    2) Change directory.
      # cd /mnt/cdrom/RedHat/RPMS/
      # ls -l "Package name"
    3) Then click the OK button of the error message dialog on the Server Host.
    4) If a message like "ls: Package Name: Stale NFS file handle" appears, repeat the ls command until it returns a normal result.
11.
We cannot boot the PC because the root file system is too large. Also, we cannot set up a partition like "/boot" using EIT.
Please set the size of the root partition to less than 8 GB.
EIT in SCore 5.2.0 or later can set up a /boot partition.
12.
When using EIT, do environment variables need to be set up?
No. They are set when you log in to the Server Host.
13.
I want more information about "By binary rpm files" in the "SCore Cluster System Software Installation Guide".
The following notes supplement that section.

3. Compute Host Settings
  - SCore Linux Kernel Installation
    The following kernel image names are the correct ones in the /etc/lilo.conf file.
      *-2.4.18-3SCOREsmp
      *-2.4.18-3SCORE
  - SCore System Installation
    The ./bininstall command must be run two or more times, otherwise some files might not be copied.
4. Server Host Settings
  - The path of the sample scorehosts.db file is corrected as follows.
     doc/html/installation/ -> doc/html/en/installation/
  - The /var/log/msgbserv.out file does not exist after msgbserv has started.
  - The PM-II device is configured as follows.
    # /opt/score/deploy/mkpmethernetconf -speed 100 pm-udp.conf -> /opt/score/deploy/mkpmethernetconf -speed 100 -g pcc

+ Others
  - To use the Server Host as a Compute Host as well, set it up as follows.
    Run the ./bininstall -compute command on the Server Host.
    Then perform the same setup as for a Compute Host.



Category: PM Communication Facility


Questions
Answers
1.
What kind of Gigabit Ethernet does SCore support?
PM/Ethernet does not depend on a particular Ethernet NIC or switch, but its performance does. Please refer to the recommended hardware list.
2.
Is Network Trunking possible with Gigabit Ethernet?
To achieve high bandwidth with Network Trunking on Gigabit Ethernet, use Ethernet switches and NICs that support jumbo frames on a 66 MHz 64-bit PCI bus, because PCI DMA bandwidth is limited. On PCI-X or on multiple PCI buses, the performance may increase.

Please refer to "PM Communication Performance" of "SCore Cluster System Software Overview".

Note:
We have tested SysKonnect 9843, 3Com 996B-T, and Broadcom 5701 NICs using Network Trunking with jumbo frames. We have also tested Intel PRO100T and PRO1000XT NICs, but not with jumbo frames.
3.
How do I write the configuration file when connecting two PCs directly, without a Myrinet switch?
Please write "pm-myrinet.conf" as follows:
0  node0.pccluster.org

1  node1.pccluster.org
4.
When executing "etherpmctl", an error message "resource busy" appears.

5.
PM/Ethernet communication tests such as rpmtest (scstest, rcstest, etc.) fail.
PM/Ethernet communication can fail for the following reasons:
  1. The MAC addresses in pm-ethernet.conf are wrong: check the MAC addresses in pm-ethernet.conf.
  2. The Ethernet switch does not pass the Ethernet frames of PM/Ethernet (X.25): check the Ethernet switch configuration.
  3. The device driver or hardware of the network interface card is not stable: check whether the network interface card is known to run stably with PM/Ethernet.
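For the first cause, a quick sanity check of the MAC addresses can catch typing errors. The sketch below does not parse pm-ethernet.conf, whose format is not described here. It assumes the MAC addresses have already been collected into a list of colon-separated strings, and the name check_macs is hypothetical.

```python
import re

# Six colon-separated pairs of hexadecimal digits.
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}(:[0-9A-Fa-f]{2}){5}$")


def check_macs(macs):
    """Return (malformed, duplicated) lists for a sequence of MAC strings."""
    malformed = [m for m in macs if not MAC_RE.match(m)]
    seen = set()
    duplicated = []
    for m in macs:
        key = m.lower()
        if key in seen and key not in duplicated:
            duplicated.append(key)
        seen.add(key)
    return malformed, duplicated
```

A malformed entry or an address listed for two hosts is worth comparing against the address that ifconfig reports on the host in question.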
6.
When executing PM test, some commands like "scstest" don't work well.
Please check whether an IRQ is shared between devices. If IRQs are assigned automatically, set them manually in the BIOS.
7.
If we try to execute "mandel" with Ethernet and SMP, the program crashes.
This is a bug in PM/Ethernet of SCore 5.0.0. It is fixed in SCore 5.0.1.
8.
If we try to execute a SCore program with "SK-9D21", the program crashes.
This problem is caused by this type of NIC. "SysKonnect SK-984x" and "Intel pro1000/T" achieve high bandwidth and low latency.
9.
We are running over Myrinet 2000, but are now getting errors with a code that used to work (the DLPOLY chemistry code):

SCore-D 4.0 connected.
<3> ULT:SYSCALLPANIC(../recv.c:85) PM Error (pmReceive) (32:Broken pipe)
<5> SCore-D:WARNING Some job(s) will not stop (4 more retry)
<5> SCore-D:WARNING Force to stop JOB 1
...
<5> SCore-D:WARNING Failed to stop job(s).
<5> SCore-D:WARNING Force to kill JOB 1
This error means that the Myrinet NIC was reset because of a timeout while receiving a packet. If the error does not occur again, you can ignore it. If it occurs again, it may be caused by a hardware problem.
10.
The performance of Ethernet Trunking is not good. It's the same level as that of one NIC.
Please check the settings in "scorehosts.db".



Category: Compiling from Sources of SCore


Questions
Answers
1.
Can compilers other than GNU be used for MPICH of SCore?
Yes, it is possible.
After editing the site file, extract only the mpi source and run the following:
     # cd /opt/score/score-src/runtime/mpi
     # smake
     # smake install
     
2.
When executing "make" to compile SCore source codes, we have an error message.
This is a bug in SCore 5.0.0. It is fixed in SCore 5.0.1.
3.
We cannot compile mpi using the PGI compiler.
Please check the path of the pgf90 compiler (/opt/pgi/linux86/bin/pgf90).



Category: SCore-D


Questions
Answers
1.
When an MPICH sample program was compiled with mpicc and executed, the following error message appeared:

<8> ULT: Exception Signal (11)

After that, the system appears to "hang".
Please run the following command in a scout environment.
      % scout ls -l /opt/score/deploy/bin.i386-redhat7-linux2_4/scored*
      
If all the binaries look the same, that is fine. If not, copy the SCore-D binary so that every host has the same binary files.
2.
The environment variable "DISPLAY" is not set automatically, contrary to the description in "howtouse/xwindow.html".
It is not set automatically. The document is outdated and incorrect.
3.
scrun prints the following warnings on an SMP cluster system.

$ scrun ./a.out
<0> SCore-D:WARNING Number of 'smp' (2) is reset to one since there is no SHMEM device.
<1> SCore-D:WARNING Number of 'smp' (2) is reset to one since there is no SHMEM device.
SCore-D 5.0.1 connected.
...

This warning means that the nodes have two CPUs but there is no shmem entry in the scorehosts.db file. To avoid these warnings, check the following in the /opt/score/etc/scorehosts.db file:
  1. Are there shmem0 and shmem1 device entries?
    shmem0 type=shmem -node=0
    shmem1 type=shmem -node=1
  2. Does the network entry of each node include shmem0 and shmem1?
    Ex.
    node0 network=ethernet,shmem0,shmem1 group=pcc
After modifying the scorehosts.db file, run the following command.
# /etc/rc.d/init.d/scoreboard reload
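The second check above can be automated for clusters with many nodes. The following is a minimal sketch, not an SCore tool. It assumes host lines have the shape of the example entry ("node0 network=ethernet,shmem0,shmem1 group=pcc"), it ignores every line without a network= field, and the function name hosts_missing_shmem is hypothetical.

```python
def hosts_missing_shmem(db_text, required=("shmem0", "shmem1")):
    """Return host names whose network= field lacks a required device.

    Only lines of the form "<host> ... network=a,b,c ..." are examined.
    Device definitions such as "shmem0 type=shmem -node=0" are skipped
    because they carry no network= field.
    """
    missing = []
    for line in db_text.splitlines():
        fields = line.split()
        networks = [f for f in fields[1:] if f.startswith("network=")]
        if not networks:
            continue
        devices = set(networks[0][len("network="):].split(","))
        if not set(required) <= devices:
            missing.append(fields[0])
    return missing
```

Every host name in the returned list needs shmem0 and shmem1 added to its network entry before the scoreboard is reloaded.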
4.
The SCore demo applications that use the X Window System fail to run with the following errors:

% scrun -nodes=2 /opt/score/demo/bin/pmandel
Could not open display
Failed to connect to comp1.pcc.org:0 from comp1.pcc.org
Failed to connect to comp1.pcc.org:0 from comp2.pcc.org
One or more processes could not connect to the display.
Exiting
%

Either the DISPLAY environment variable is not set, or other hosts are not permitted to connect to the display. Set the DISPLAY variable and allow access using the following commands.
% export DISPLAY=server.pcc.org:0.0
% xhost +
5.
I used the mpirun command to run one application in the background (with an '&').
I was not able to start a new job because I got an error message saying "SCOUT busy".
Please use the SCore-D Multi-User Environment.
You must run scored as root.
Then run mpirun with the -score scored= option, outside the scout environment.

Please refer to "Getting Started" of "How to Use SCore Cluster System Software" and "Executing SCore-D for the Multi-User Environment" of "SCore Cluster System Software Reference Guide".
6.
How does SCore assign jobs to each CPU?
SCore assigns jobs to hosts, not to CPUs. Therefore, you cannot choose which CPU runs a job.
7.
How can I run a program that needs standard input with SCore?
SCore 5.2.0 or later supports standard input. SCore 5.0.1 or earlier does not support standard input directly.
For those versions, do the following:
      % scrun scatter -node 0 == ./a.out
  
8.
When the Spare Hosts function is used, does a running program keep running even if one Compute Host stops because of a failure?
This is possible only in the multi-user mode of SCore-D.
scored is restarted, and the program resumes from the point where the last checkpoint was taken.

Please refer to "Automatic Operation and High Availability of SCore-D" in the "SCore Cluster System Software Reference Guide" for details.



Category: OMNI/OpenMP Compiler


Questions
Answers
1.
We get an error when compiling NPB with OpenMP.
Please check "make.def" and confirm that CLINKFLAGS contains -lm. If you run this program in the SCASH environment, please add -omniconfig=scash to CFLAGS and CLINKFLAGS.
2.
We can't execute LU of NPB on OpenMP.

Please set environment variables OMNI_SCASH_ARGS_SIZE and OMNI_SCASH_ARGS_SIZE, according to "/opt/omni/doc/omni-scash-status.html".



Category: MPICH


Questions
Answers
1.
I want to do some performance analysis of an MPI-based program in SCore.
You can use the profiling library in MPE.
In MPICH/SCore, you can use the upshot and Jumpshot 3 log viewers.
For example, to use Jumpshot 3:
1. Compile and link the MPI program with the -mpilog option.
      % mpicc -mpilog foo.c -o foo
      
2. Set the MPE_LOG_FORMAT environment variable to SLOG.
      % setenv MPE_LOG_FORMAT SLOG
      
3. Execute the program.
      % scrun ./foo
      
The program creates "program_name.slog".

4. View the log file with logviewer.
      % logviewer foo.slog
      
For details on Jumpshot 3, see:
/opt/score/doc/mpi/jumpshot/index.html
For more details on the MPE profiling library, see also the "MPE user guide".



Category: PBS


Questions
Answers
1.
We cannot use the "-l" option of sc_qsub.
This option is not available in SCore 5.0.0.
It is available in SCore 5.2.0 or later.
2.
Can I use resources_max.walltime with PBS?
You can use it, but the response is very slow.



Category: Performance


Questions
Answers
1.
Scstest or rcstest fails on PM/Ethernet.
Adjust the parameters in pm-ethernet.conf, or use the timeout option, such as:
      % scstest -network ethernet -timeout 10
      % rcstest node00 ethernet -v -timeout 10
       

2.
The time until processing begins gradually gets longer when running a simple self-made MPI program.
Please check the following:

1. Is the IRQ of the Ethernet NIC shared with another device?
  - You can check for shared IRQs by running the following command on a Compute Host.
      % cat /proc/interrupts
      
2. Does the switching hub operate normally?
Power-cycle the switch once.
Also try connecting to another port, because a specific port may be broken.

If neither check reveals a problem, tune the following parameters in the pm-ethernet.conf file:
  maxnsend
  backoff
3.
MPICH/SCore on PM/Myrinet achieves lower performance than MPICH/GM on Myrinet 2000.

MPICH/GM uses zero-copy communication by default, whereas MPICH/SCore does not. Please try the scrun option mpi_zerocopy=on, such as:
      % scrun -nodes=4x1,mpi_zerocopy=on a.out
      

4.
MPICH/SCore on PM/Ethernet achieves lower performance than MPICH/p4 (LAM) on Ethernet.
The default parameters defined in /opt/score/etc/pm-ethernet.conf are not optimized. Please tune the maxnsend and backoff parameters in pm-ethernet.conf, or use the mpi_eager option, such as:
      % scrun -nodes=4x1,mpi_eager=1000000 a.out
	  



Category: Miscellaneous


Questions
Answers
1.
Does SCore depend on the CPU architecture?
If the CPU architecture is x86 or Alpha and the SCore environment is set up correctly, SCore works well. When the hosts have different types of x86 processors, EIT treats all hosts as having the same processor and the same performance, and registers them that way in "/opt/score/etc/scorehosts.db".
2.
Has SCore been ported to PowerPC?
No, SCore has not been ported to PowerPC.
3.
How can I run an MPI program that was not built for SCore on an SCore cluster?
You only need to install another MPI implementation.
4.
Does a commercial MPI program work with SCore?
This is a rather common situation, and there is a workaround.
In addition to SCore, you may install a normal, non-optimized MPICH (using TCP over Ethernet or TCP over Myrinet) and run your application with it.
5.
Can the Compute Hosts be dual bootable?
If RedHat 7.2 is installed on your cluster, you may install SCore from binary rpm or from source without a separate partition.
(On a Compute Host, SCore requires 50 MB on /opt and 1 GB on /var/scored.)

If another distribution is installed on your cluster, install RedHat 7.2 on a separate partition and then install SCore from binary rpm or from source.

Please see "By binary rpm files" and "By source" in the "SCore Cluster System Software Installation Guide".

With either method, you must build the kernel from source and install it so that the host is dual bootable.
How to do this depends on the boot loader.
6.
Does SCore 5.0.1 work on RedHat 7.3?
Yes, it does.
But if you want to recompile SCore itself, please use SCore 5.2.
7.
When Spare Hosts are defined in the scorehosts.db file, may they use the same group name as the other Compute Hosts?
Please do not put Spare Hosts in the same group as Compute Hosts.
However, the settings other than group (network, msgbserv) must match.
8.
The Linux kernel hangs when rpmtest is run on Myrinet.
Check the IRQ assignment as follows:

      % cat /proc/interrupts
      

If the Myrinet NIC has the same IRQ number as another device, change the Myrinet IRQ number by changing the BIOS setting or by moving the Myrinet NIC to another PCI slot.
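Scanning /proc/interrupts by eye is error-prone on hosts with many devices. The sketch below lists the IRQs used by more than one device. It assumes the usual /proc/interrupts layout, in which devices sharing an IRQ appear at the end of the line separated by commas, and that device names contain no spaces. The function name shared_irqs is hypothetical.

```python
def shared_irqs(interrupts_text):
    """Map IRQ number -> device names for IRQs used by more than one device.

    Assumes devices sharing an IRQ are listed at the end of the line,
    separated by commas, as in "11:  1234  XT-PIC  eth0, usb-uhci".
    """
    shared = {}
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep or not irq.strip().isdigit() or "," not in rest:
            continue
        parts = [p.strip() for p in rest.split(",")]
        # The first part still carries the counters and the controller name,
        # so keep only its last word, which is the first device name.
        first = parts[0].split()
        if not first:
            continue
        parts[0] = first[-1]
        shared[int(irq)] = parts
    return shared
```

Passing the text of /proc/interrupts from a Compute Host returns, for example, {11: ["eth0", "usb-uhci"]} when the Ethernet NIC shares IRQ 11 with a USB controller. An empty result means no IRQ is shared.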






CREDIT
This document is a part of the SCore cluster system software developed at PC Cluster Consortium, Japan. Copyright (C) 2003 PC Cluster Consortium.