MPICH-SCore version 2.0 (ch_score2): Running an MPI application

A binary generated by one of the MPI compilers provided by MPICH-SCore runs only on the SCore-D operating system. You therefore need to prepare either the Single user environment or the Multiple user environment of the SCore-D operating system.

Running in the Single user environment

First, run the scout shell program, which provides a remote shell environment. You can then run your MPI application with mpirun in that shell. For example, to run an application on four nodes:

$ setenv SCBDSERV server.pccluster.org
$ msgb -group pcc&
$ scout -g pcc
[comp0-3]:
SCOUT(3.1.0): Ready.
$
. . .
. . .
$ mpirun -np 4 ./mpi_program args ...
. . .
. . .
$ exit
SCOUT: session done
$

MPICH-SCore supports clusters consisting of SMP nodes. The MPICH-SCore runtime spawns multiple MPI processes on each SMP node.

You may specify the number of MPI processes per SMP node with the -np option of mpirun. For example, to run eight MPI processes on four dual-processor machines, specify the option as follows:

$ mpirun -np 4x2 ./my_mpi_program args...

When you do not specify the number of MPI processes per SMP node, MPICH-SCore spawns one MPI process for each processor of the cluster. Thus "-np 4" is equivalent to "-np 2x2" on a cluster of dual-processor nodes.
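
For example, on a cluster of dual-processor nodes the following two commands start the same four MPI processes, two on each of two nodes:

$ mpirun -np 4 ./my_mpi_program args...
$ mpirun -np 2x2 ./my_mpi_program args...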

You may use scrun instead of mpirun. For example, to run the MPI application on 16 processors of eight dual-processor nodes:

$ scrun -nodes=8x2 ./my_mpi_program args...

When you do not specify the -nodes option of scrun, MPICH-SCore spawns one MPI process for each processor of all nodes reserved by scout.
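
For example, in the scout session shown earlier, the following would start one MPI process on every processor of the reserved nodes (a sketch; the actual process count depends on the nodes in your scout group):

$ scrun ./my_mpi_program args...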

Running the sample application in a single user environment

Here is an example of running the sample application alltoall, which measures the performance of the MPI_Alltoall function. The application takes two command line arguments: the message length for the all-to-all communication and the number of iterations. The result consists of three fields: the number of MPI processes, the message length, and the elapsed time of one all-to-all communication in microseconds:

$ setenv SCBDSERV server.pccluster.org
$ msgb -group pcc&
$ scout -g pcc
[comp0-3]:
SCOUT(3.1.0): Ready.
$
. . .
. . .
$ scrun -nodes=4x2 alltoall 3000 10000
SCORE: Connected (jid=1)
<0:0> SCORE: 8 nodes (4x2) ready.
8 3000 1052.230600
$
. . .
. . .
$ exit
SCOUT: session done
$
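
The source of the bundled alltoall program is not reproduced here, but the measurement it performs is roughly the following. This is a minimal sketch written for illustration, not the shipped code; the buffer handling and output format are assumed from the description above.

/* Minimal sketch of an MPI_Alltoall benchmark in the spirit of the
 * bundled "alltoall" sample.  NOT the shipped source; the timing loop
 * and output format are assumed from the description above.           */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int nprocs, rank, len, iters, i;
    char *sbuf, *rbuf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (argc < 3) {
        if (rank == 0) fprintf(stderr, "usage: alltoall length iterations\n");
        MPI_Finalize();
        return 1;
    }
    len   = atoi(argv[1]);                /* message length in bytes      */
    iters = atoi(argv[2]);                /* number of iterations         */

    sbuf = malloc((size_t)len * nprocs);  /* one block per peer process   */
    rbuf = malloc((size_t)len * nprocs);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Alltoall(sbuf, len, MPI_CHAR, rbuf, len, MPI_CHAR, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)  /* three fields: processes, length, microseconds/call */
        printf("%d %d %f\n", nprocs, len, (t1 - t0) * 1e6 / iters);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

Such a program is built with the mpicc command described in the compilation document and run exactly as shown above.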

Running in the Multiple user environment

In the multiple user environment you may run the application simply by issuing mpirun. Remember to specify the hostname of the SCore-D server as follows:

$ mpirun -np 4x2 -score scored=comp3.pccluster.org ./mpi_program args...

You can specify the hostname with the SCORE_OPTIONS environment variable instead of the mpirun option:

$ export SCORE_OPTIONS=scored=comp3.pccluster.org
$ mpirun -np 4x2 ./mpi_program args...

You may also use scrun instead of mpirun:

$ export SCORE_OPTIONS=scored=comp3.pccluster.org
$ scrun -nodes=4x2 ./mpi_program args...

The number of MPI processes per SMP node is specified in the same way as in the single user environment; see the previous section.

Improving performance

Adjusting the thresholds that select the transfer protocol

MPICH-SCore transfers MPI messages using one of three protocols: the short protocol, the eager protocol, and the get protocol. The runtime code chooses among them according to the size of the message being sent. You may change the thresholds at which the protocol switches; this can improve the performance of some applications.

The threshold at which the short protocol switches to the eager protocol is fixed when the MPICH-SCore library is built; the default is 1 kbytes. To change this value, define the macro MPID_PKT_MAX_DATA_SIZE with a new value at the top of the header file /opt/score/src/mpi/mpich-1.2.0/src/mpid/ch_score2/mpid.h, then rebuild and install MPICH-SCore.
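
For example, to raise the short/eager threshold to 2 kbytes you might add a line like the following at the top of mpid.h before rebuilding (a sketch; the value is assumed to be a byte count, and the existing definition in the header should be checked first):

#define MPID_PKT_MAX_DATA_SIZE 2048   /* assumed: short/eager threshold in bytes */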

You can also change the threshold between the eager protocol and the get protocol. Since MPICH-SCore version 2.0 keeps this threshold per PM device, you can set a new value for each device. Use the mpi_max_eager_myrinet option to change it for Myrinet, and the mpi_max_eager_shmem option for Shmem, the inter-process communication within a single node. The default is 300 kbytes for Myrinet and 1.2 kbytes for Shmem. The following example changes the threshold for Myrinet to 64 kbytes:

$ mpirun -np 4x2 -score mpi_max_eager_myrinet=64000 ./mpi_program args...

Alternatively, you can use scrun:

$ scrun -nodes=4x2,mpi_max_eager_myrinet=64000 ./mpi_program args...

Some RMA implementations, such as PM/Myrinet, transmit data using DMA only. We call message transfer over such an RMA Zero-copy transfer, since no memory copy by the CPU is required during transmission. MPICH-SCore realizes Zero-copy transfer when using the RMA of PM/Myrinet. Zero-copy transfer improves the maximum bandwidth of point-to-point message transfer because it reduces memory access congestion. It is effective for some applications, but less so for others, since it involves overhead to synchronize the sender and the receiver.

Message transfer using the PM/Shmem RMA is One-copy transfer. Since PM/Shmem copies between virtual memory spaces through the PM/Shmem device driver, the RMA is implemented with a single copy. One-copy transfer also improves the performance of some applications because it reduces memory access congestion.

Suppressing the PM Remote Memory Access facility (Zero-copy/One-copy transfer)

MPICH-SCore provides the mpi_rma option to suppress the get protocol, because (1) some applications achieve better performance without RMA, and (2) the PM RMA facility does not support checkpointing. The following example suppresses the get protocol:

$ mpirun -np 4x2 -score mpi_rma=off ./mpi_program args...

Alternatively,

$ scrun -nodes=4x2,mpi_rma=off ./mpi_program args...

Using another implementation for all-to-all communications

MPICH-SCore version 2.0 includes an alternative implementation of the MPI_Alltoall and MPI_Alltoallv functions that synchronizes all MPI processes at each step of the communication. Try the -score mpi_synccoll option if your application calls these functions. The option benefits some applications but not others, depending on the network configuration and the message length of the all-to-all communication.
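
For example, following the -score option syntax used above (whether mpi_synccoll also accepts an explicit value is not shown here):

$ mpirun -np 4x2 -score mpi_synccoll ./mpi_program args...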

Getting more power...

Contact score-info@pccluster.org. We hope our advice helps you get more performance.

Restrictions

See also

MPICH-SCore version 2.0 (ch_score2): Compilation of an MPI application

mpic++(1), mpicc(1), mpif77(1), mpirun(1), scrun(1)

Providing Optional Compilers


$Id: mpich-run2.html,v 1.3 2002/03/07 12:03:42 kameyama Exp $