LAM / MPI Parallel Computing
MPI Tutorial: Getting started with LAM/MPI at Dalhousie

[ Introduction | Preliminary setup | Compiling MPI programs | Booting LAM/MPI | Running MPI programs | Shutting down LAM/MPI | Cleaning up after yourself | An example ]

1. INTRODUCTION

    1.1 About this document

    This document is a customized version of the "Getting started with LAM/MPI" tutorial, adapted for using LAM/MPI at the Dalhousie University Faculty of Computer Science. It describes the steps for preparing and running an MPI session under LAM/MPI 6.5.6, and is organized into the eight sections listed above.

    This document does not cover in detail the ethics and etiquette of using public clusters for parallel computing. Users are nevertheless asked to consider the effects that improper use of LAM can have on others sharing these machines.

    1.2 About MPI

    MPI is suitable for dedicated parallel machines such as the IBM SP, SGI Origin, etc., but it also works well on clusters of workstations. Taking advantage of the clusters of workstations available at Dalhousie, we are interested in using MPI to treat such a cluster as a single parallel virtual machine with multiple nodes.

    The MPI-1 standard supports portable, platform-independent computing. As a result, users enjoy cross-platform development as well as heterogeneous communication. For example, MPI code written on the RS/6000 architecture running AIX can be ported to a SPARC architecture running Solaris with almost no modification.

    1.3 About LAM

    LAM is a daemon-based implementation of MPI. Initially, the program lamboot spawns LAM daemons based on the list of host machines provided by the user. These daemons remain idle on the remote machines until they receive a message to load an MPI binary and begin execution. Bottom line: LAM must be booted with lamboot before any MPI program can run, and shut down with lamhalt when you are finished.
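
    For example, booting with the floor3.bhost host file used in the example session at the end of this document:

    caper$ lamboot -v floor3.bhost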

2. PRELIMINARY SETUP

3. COMPILATION UNDER LAM

4. BOOTING LAM

5. RUNNING LAM

    5.1 mpirun

    SPMD

    To run your MPI binaries, use the command mpirun. For example, to run the sample program presented above (assuming that the binary is called ``hello''):
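
    caper$ mpirun N hello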

    The N means "run it on all the machines in your hostfile."

    mpirun has several options that can be supplied:
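
    A few of the more commonly used ones are listed below (this is not an exhaustive list; mpirun -h prints a brief synopsis, and the mpirun man page has the full details):

    -v          verbose mode
    -c <#>      run <#> copies of the program on the scheduled nodes
    -O          the cluster is homogeneous, so no data conversion is performed
    -s <node>   load the binary from the given node instead of from each machine
    -lamd       communicate through the LAM daemons rather than client-to-client
                (required for mpimsg; see section 5.3)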

    Be Conservative!

    1. After using lamboot to start your cluster, always execute tping -c3 N to check that it is working.
    2. Always test your cluster first with known working code. Try a simple C program and its Makefile (see the sketch after this list).
    3. When testing your program for the first time, use a one-node cluster!
    4. Only after completing these first three steps are you ready to try your code on a multi-node cluster!
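
    For step 2, the MPI_C_SAMPLE.c fetched in the example session below is one option. A minimal stand-alone alternative is sketched here (this is not the course sample; it assumes LAM's mpicc compiler wrapper is on your path):

    /* hello.c -- minimal MPI sanity check */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start up MPI              */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                         /* shut down MPI cleanly     */
        return 0;
    }

    Compile and run it with:

    caper$ mpicc -o hello hello.c
    caper$ mpirun N hello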

    MPMD

    Although it is common to write SPMD code, LAM can also handle the MPMD style of execution (i.e., running a different binary on each rank).

    Instead of giving mpirun the name of a single binary, you give mpirun the name of an application schema file. The application schema (or "appschema") simply lists the nodes that you want to use, and the name of the binary to execute on each (along with any relevant command line options that your binary may require).

    For example, the following appschema starts master on n0, and starts slave on all the other nodes (n1-7, in this case). Note that we're passing some flags to the slave program, too:
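
    A sketch of what such an appschema might look like (the two slave flags are placeholders for whatever options your program actually takes):

    n0 master
    n1-7 slave --first_flag --second_flag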

    To run this appschema, you still use mpirun, but no longer need to specify nodes or an application name -- you simply specify the appschema file name (let's say that the above example's file name is esha-homework):
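
    caper$ mpirun -v esha-homework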

    This will start the respective binaries on their respective nodes.

    5.2 mpitask

    Analogous to the sequential UNIX ps command is mpitask, which displays the current status of the MPI program(s) being executed. The -h command line option provides a brief synopsis for this command.
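
    For example, while hello is running in another window:

    caper$ mpitask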

    5.3 mpimsg

    Similar to the mpitask command, the mpimsg command gives information about running MPI programs. mpimsg shows all pending messages in the current MPI environment. With mpimsg, you can see messages that are "left over" (i.e., messages that were never received) even after your MPI program has completed.

    This command is not very useful if you are running in the "client-to-client" mode in LAM/MPI (which is the default). You must explicitly pass -lamd on your mpirun command line for this command to work as expected.
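
    For example (note the -lamd flag on the mpirun line):

    caper$ mpirun -lamd N hello
    caper$ mpimsg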

    REMEMBER: Correct MPI programs do not leave messages lying around; all messages should be received during the run of your program.

    5.4 lamclean

    To kill the running MPI program and erase all pending messages, use lamclean:
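
    caper$ lamclean -v

    (The -v flag simply makes lamclean verbose and can be omitted.)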

    NOTE: lamclean should only be needed when debugging -- i.e., programs that hang, messages that are left around, etc. Correct MPI programs terminate properly and clean up all their messages.

6. SHUTTING DOWN LAM

7. CLEANING UP AFTER YOURSELF

The Dalhousie Unix machines that you are using are a shared resource.  Unless everyone behaves, the machines become unstable.  Please clean up after yourself!  Make sure you leave these machines as you found them!

  1. Be sure to halt and wipe your cluster before you log off. Check that the lamd process is really dead on all machines you have been using! If some of the lamd processes are still around, try:

    caper$ user=`whoami`; for host in `grep -v '^#' floor3.bhost`; do echo $host; ssh $host "ps -fu $user; skill lamd; sleep 2; ps -fu $user"; done
     

  2. Be sure to clean up any files that you have left in the /tmp directory (see the sketch after this list for one way to do this on every node).
  3. Be sure that your program closes all files correctly; otherwise the machines will run out of file handles and crash.
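
  For step 2, the loop from step 1 can be adapted. The sketch below assumes that LAM's per-user session files live under /tmp/lam-<username>@... on each node; list them first and make sure nothing else matches before deleting anything:

    caper$ user=`whoami`; for host in `grep -v '^#' floor3.bhost`; do echo $host; ssh $host "ls -d /tmp/lam-$user* 2>/dev/null && rm -rf /tmp/lam-$user*"; done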

8. AN EXAMPLE SESSION

caper$ wget http://www.cs.dal.ca/~arc/resources/MPI/sampleCode//MPI_C_SAMPLE.c
caper$ wget http://www.cs.dal.ca/~arc/resources/MPI/sampleCode//makefile
caper$ gmake
caper$ cp /opt/MPI/lam/boot/floor3.bhost .
  [ Edit floor3.bhost to make sure the present machine is listed first ]
caper$ eval `ssh-agent`
caper$ ssh-add
   [Enter passphrase for testuser@borg: foobar]

caper$ recon -v floor3.bhost
caper$ lamboot -v floor3.bhost
caper$ tping -c3 N
caper$ mpirun N hello
Enter the number of times around the ring: 2
Process 0 sending 2 to 1
Process 0 received 2
Process 0 decremented num
[....]
caper$ lamhalt
caper$ ssh-agent -k