Parallel Computing
MPI Tutorial: Getting started with OpenMP and Open MPI at ACEnet


1. Introduction

1.1 About this document

This document is a customized version of the "User Guide" tutorial for using OpenMP and Open MPI on the ACEnet network. The document is organized into the following sections:

• Introduction

• Preliminary setup at ACEnet (only needs to be executed once)

• Compilers

• Multithread with OpenMP

• MPI library

• Running jobs

• Debugging with TotalView

• FAQs

This document does not cover in detail the etiquette of using shared public clusters for parallel computing. Users are therefore asked to consider the effects that improper use of OpenMP or Open MPI can have on others.

1.2 About ACEnet

The Atlantic Computational Excellence Network (ACEnet) is a pan-Atlantic High Performance Computing (HPC) consortium providing distributed HPC resources, visualization and collaboration tools to participating research institutions. Currently, the ACEnet hardware resources are located at several universities and include the following clusters:

• Brasdor (brasdor.ace-net.ca) at StFX

• Fundy (fundy.ace-net.ca) at UNB

• Mahone (mahone.ace-net.ca) at Saint Mary's

• Placentia (placentia2.ace-net.ca) at MUN

• Glooscap (glooscap.ace-net.ca) at Dal

• Courtenay (courtenay.ace-net.ca) at UNBSJ

More information is available here.

2. Preliminary setup at ACEnet

2.1 Logging in

First, you need to go to http://www.mun.ca/acenet/applications/ to apply for an ACEnet user account using the Project Account Number provided by Prof. Rau-Chaplin. Your account grants you access to all of the ACEnet clusters with the same username and password. When you log in to a particular cluster, you log in to the head node of this cluster, where you can edit, compile and test your code.

All communication must be performed over the SSH network protocol using an SSH client. If you are using a Unix-like machine, you can run ssh from the command prompt. On Windows systems, we suggest that you download the freely available client PuTTY. For example, if you want to access the Brasdor cluster using the command line from a Unix-like system, you would type

 ssh -X username@brasdor.ace-net.ca

or alternately

 ssh -X brasdor.ace-net.ca -l username

where the optional -X flag enables X11 connection forwarding. A login session from one cluster to another looks like this:

 user@mahone: ~ $ ssh user@fundy.ace-net.ca

 Password:

 Last login: Tue Dec 11 13:35:25 2007 from 140.184.24.8

 user@fundy: ~ $

The first time you connect to an ACEnet machine via SSH, you will see a message like the following:

 The authenticity of host 'fundy.ace-net.ca (131.202.246.6)' can't be established.

 RSA key fingerprint is ee:28:46:48:78:68:e3:28:ad:45:28:fe:c2:14:0c:d8.

 Are you sure you want to continue connecting (yes/no)?

This is expected, and it is safe to answer yes. You will then see the message

 Warning: Permanently added 'fundy.ace-net.ca' (RSA) to the list of known hosts.

After connecting to the machine, you will be prompted for your credentials. Once you have logged in you should change your initial password. The command to do this is simply

 passwd

You will be prompted for your current password and your new password. Within minutes, your password change will be replicated across ACEnet.

2.2 File Transfer

The best way to transfer files to and from the cluster is to use a program that supports SFTP (SSH File Transfer Protocol). SFTP is similar to regular FTP; however, instead of sending your data as readable plain text, SFTP encrypts the traffic. The basic SFTP commands are the same as those of FTP. SFTP is available from the command line on most Unix-like systems. Mac OS X users can use Cyberduck as a graphical SFTP client. Windows users can use PSFTP, a command-line companion to PuTTY, or WinSCP, which provides a graphical interface similar to Windows Explorer.

Connecting with a command-line SFTP program or with PSFTP is similar to connecting via SSH. You can initiate a file transfer session with the following syntax

 sftp user@fundy.ace-net.ca

You will be prompted for your password and, upon successful authentication, will see an interactive SFTP prompt.

 user@mahone: ~ $ sftp user@fundy.ace-net.ca

 Connecting to fundy.ace-net.ca...

 Password:

 sftp>

Type help at this prompt to see a list of available commands.

2.3 UNIX Shell

The recommended and default login and Grid Engine shell is bash. If you want to change your shell from tcsh to bash or vice versa then please contact ACEnet support.

2.4 Bourne shells: bash or sh

The commands /bin/bash and /bin/sh reference the same executable, which behaves a bit differently depending on the name it's invoked with, in order to mimic the behaviour of historical versions of sh. Bash is a UNIX shell written for the GNU Project.

There are two files for bash or sh that you should have in your home directory: .bashrc and .profile.

The content of the default user .bashrc file

 # Load default ACEnet cluster profile

 if [ -f /usr/local/lib/bashrc ]; then

   . /usr/local/lib/bashrc

 fi

 # Add your settings below

The content of the default user .profile file

 # Do not delete or change this file

 [[ -f ~/.bashrc ]] && . ~/.bashrc

For C shells (csh or tcsh), more information is available here.

2.5 Passwordless SSH access

Note: Passwordless SSH access within the cluster is already configured in your account at all sites except Placentia. Grid Engine relies on SSH to start job processes.

If you want to configure passwordless SSH access yourself then you have to generate an SSH key with the following set of commands:

 $ ssh-keygen -t rsa

   (hit enter three times or answer 'y')

 $ cd ~/.ssh

 $ cp id_rsa.pub authorized_keys

 $ chmod 600 authorized_keys

If you want to set up passwordless SSH between different sites, then you need to copy the three files id_rsa, id_rsa.pub and authorized_keys to the ~/.ssh directory on the other clusters. For example, to copy these files to the Fundy cluster, type the following:

 $ cd ~/.ssh

 $ scp id_rsa id_rsa.pub authorized_keys fundy.ace-net.ca:.ssh/

3. Compilers

Several compiler suites are currently available:

• Portland Group compilers (preferred compilers)

• Sun Studio 12 compilers

• GNU compilers (gcc 3 and gcc 4)

3.1 PGI compilers

Description

Portland Group compilers 8.0.1 for FORTRAN, C, C++ and High-Performance Fortran

Resources

Vendor website

Documentation, Tips & Techniques, etc.

PGI Support

Commands

 pgcc

 pgCC

 pgf77

 pgf90

 pgf95

 pghpf

Help for any command

 man pgf90

 pgf90 -help

 pgf90 -flags

3.2 Sun Studio compilers

Description

Sun Studio 12 compiler for Fortran, C, C++

Resources

Sun Studio 12 Collection

Commands

 cc

 CC

 f77

 f90

 f95

Help for any command

 man f90

 f90 -flags

Notes

Other features include: Garbage Collector, IDE, Performance Analyzer, X-Designer

3.3 GNU compilers

Description

The GNU Compiler Collection is a set of programming language compilers produced by the GNU Project and distributed by the Free Software Foundation. Languages include C (gcc), C++ (g++), Fortran (g77), Ada (gnat), and Java (gcj).

Version

• 3.4.6

• 4.1.2 with OpenMP support

Resources

GCC online documentation

Commands for gcc3

 gcc

 g++

 g77

 gnat

 gcj

Commands for gcc4

 gcc4

 g++4

 gfortran

Help for any command

 man gcc

 man gfortran

 gcc --help

Notes

Please note that Red Hat Linux has not yet been updated at some sites, where you may still find a gcc older than 4.1 with no support for OpenMP. We anticipate that the upgrade will happen very soon.

4. Multithread with OpenMP

4.1 About OpenMP

The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and FORTRAN on all architectures, including UNIX platforms and Windows NT platforms. Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.

4.2 How to Compile and run your OpenMP program

OpenMP (Open Multi-Processing) is a standard for building parallel applications on shared-memory computers (multiprocessors). It consists primarily of a set of compiler directives, with some library routines besides. OpenMP is supported by PGI, Sun Studio and gcc4 compilers.

The maximum number of threads available for OpenMP jobs at ACEnet is either 4 or 16, depending on the cluster. Note that the requested number of threads is communicated to the program at execution time through the environment variable OMP_NUM_THREADS. For example, to run your program on four cores, use the following in bash

 export OMP_NUM_THREADS=4

or in csh:

 setenv OMP_NUM_THREADS 4

It may be a good idea to include one of these declarations in your shell profile so that the variable is set as soon as you log in.
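As a minimal sketch of such a profile entry, assuming a bash login shell and the default ~/.bashrc described in section 2.4, the line below sets OMP_NUM_THREADS only when it is not already defined, so that a value exported elsewhere (for example by a job script) is not overwritten. The default of 4 is simply the example value used above, not a site recommendation.

```shell
# Hypothetical addition to ~/.bashrc, placed under "Add your settings below".
# Keep any value that is already set; otherwise fall back to 4 threads.
export OMP_NUM_THREADS="${OMP_NUM_THREADS:-4}"

# Show the value that the next OpenMP program will see.
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Open a new login shell, or run . ~/.bashrc in the current one, for the setting to take effect.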

To make the compiler interpret the OpenMP directives in the source code, you need to specify the appropriate flag at compile time; otherwise, serial code will be generated.

Compiler	OpenMP flag
PGI compilers	-mp
Sun Studio 12	-openmp along with the third optimization level -O3
GNU 4	-fopenmp

A good introduction to OpenMP can be found here.

Notes

Please note that Red Hat Linux has not yet been updated at some sites, where you may still find a gcc older than 4.1 with no support for OpenMP. We anticipate that the upgrade will happen very soon.

4.3 How to Monitor your OpenMP program

You can monitor the parallel execution of your programs using the mpstat command.

4.4 Example Sessions

Here is an example session that compiles and runs a simple hello world C program, omp_hello.c, with the PGI compiler:

$ pgcc -mp -o hello omp_hello.c

$ export OMP_NUM_THREADS=4

$ ./hello

Hello parallel world!

Number of threads is 4

Hello world from thread 3

Hello world from thread 0

Hello world from thread 1

Hello world from thread 2

Back to the sequential world.

5. MPI libraries

5.1 About Open MPI

MPI (Message Passing Interface) is suitable for parallel machines such as the IBM SP, SGI Origin, etc., but it also works well on clusters of workstations. Taking advantage of the availability of the clusters of workstations at Dalhousie, we are interested in using MPI to treat multiple nodes as a single parallel virtual machine.

The default (and preferred) MPI implementation at ACEnet is Open MPI. It is a free, open-source, production-quality MPI-2 implementation. In some rare cases you may still need MPICH; however, please note that support for this library will soon be discontinued.

Note: Do not confuse the Open MPI library with OpenMP.

Resources

• Open MPI web site

• Frequently Asked Questions

• Instructional videos and presentations

Current version installed

Open MPI v1.2.7 is configured with PGI (64-bit).

5.2 Compiling MPI programs

The Open MPI team strongly recommends that you simply use Open MPI's "wrapper" compilers to compile your MPI applications. That is, instead of using (for example) gcc to compile your program, use mpicc. Open MPI provides a wrapper compiler for four languages:

Language	Wrapper compiler name
C	mpicc
C++	mpiCC, mpicxx, or mpic++ (note that mpiCC will not exist on case-insensitive filesystems)
Fortran 77	mpif77
Fortran 90	mpif90

Hence, if you expect to compile your program as:

shell$ gcc my_mpi_application.c -o my_mpi_application

Simply use the following instead:

shell$ mpicc my_mpi_application.c -o my_mpi_application

Note that Open MPI's wrapper compilers do not do any actual compiling or linking; all they do is manipulate the command line and add in all the relevant compiler / linker flags and then invoke the underlying compiler / linker (hence, the name "wrapper" compiler). More specifically, if you run into a compiler or linker error, check your source code and/or back-end compiler -- it is usually not the fault of the Open MPI wrapper compiler.
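To see what a wrapper compiler adds, Open MPI's wrappers accept a --showme option that prints the underlying command line instead of running it. The sketch below wraps this in a hypothetical helper function, show_wrapper_command, and assumes that the mpicc found in your PATH is the Open MPI wrapper; the file names are the illustrative ones used above.

```shell
# Hypothetical helper: print the command that the mpicc wrapper would run,
# without compiling anything. If mpicc is not installed, say so instead.
show_wrapper_command() {
  if command -v mpicc >/dev/null 2>&1; then
    # --showme makes the Open MPI wrapper print the back-end compiler
    # invocation, including the include and library flags it adds.
    mpicc --showme my_mpi_application.c -o my_mpi_application
  else
    echo "mpicc not found in PATH"
  fi
}
```

Calling show_wrapper_command on a cluster head node should print a single compiler command line, which can help when tracking down a compile or link error in the back-end compiler.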

5.3 Sample MPI program in C

We present here a simple C program that passes a message around a ring of processors.

5.4 Makefile

The simplest and most straightforward way to compile MPI programs is to modify an existing Makefile. We suggest that you modify this Makefile to your liking and expand on it as you become more comfortable with Open MPI.

5.5 An example session

andang@fundy.ace-net.ca's password:

Last login: Mon Jun 29 13:53:31 2009 from pcox-imac08.cs.dal.ca

andang@fundy: ~ $ make

mpicc MPI_C_SAMPLE.o -o MPI_C_SAMPLE -L./libs

andang@fundy: ~ $ ./MPI_C_SAMPLE

Enter the number of times around the ring: 2

Process 0 sending 2 to 0

Process 0 received 2

Process 0 decremented num

Process 0 sending 1 to 0

Process 0 received 1

Process 0 decremented num

Process 0 sending 0 to 0

Process 0 exiting

6. Running jobs

6.1 OpenMP jobs

To submit the code hello to the scheduler, which will allocate free computing resources to your job and run it on one of the computing nodes, you need to create a small submission script. With this script you instruct the scheduler where to execute the code, where to write the output, and with how many threads you want your code to be run. Here is an example of such a script called submit_hello.sh.

#$ -S /bin/bash

#$ -cwd

#$ -j y

#$ -o hello.out

#$ -pe openmp 4

#$ -l h_rt=01:00:00

export OMP_NUM_THREADS=$NSLOTS

./hello

Finally, to submit the job, type in the command line

qsub submit_hello.sh

Your code will be submitted and eventually run with 4 threads. To check the status of your code, type

qstat

If the status is qw, the job is waiting in the queue; if it is r, the job is running; if the job is no longer listed, it has finished. Now you can check the results. The output should be in the file hello.out, which we specified in the job submission script.
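The job states mentioned in this guide (qw for waiting, r for running, and no entry once the job has finished) can be captured in a small helper. This is only a sketch: describe_state is a hypothetical function name, the state code is passed in by hand rather than parsed from qstat output, and any state not discussed here is reported unchanged.

```shell
# Hypothetical helper: translate a Grid Engine job state code into words.
# Only the states described in this guide are handled explicitly.
describe_state() {
  case "$1" in
    qw) echo "waiting in the queue" ;;
    r)  echo "running" ;;
    "") echo "finished (no longer listed by qstat)" ;;
    *)  echo "other state: $1" ;;
  esac
}

describe_state qw
describe_state r
describe_state ""
```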

6.2 Open MPI jobs

You should use the ompi* parallel environment for Open MPI jobs.

There is no need to specify the list of hosts and the number of processes for the mpirun command because Open MPI will obtain this information directly from Sun Grid Engine. Here is an example submission script:

#$ -S /bin/bash

#$ -cwd

#$ -N test_parallel

#$ -j y

#$ -o test_parallel.log

#$ -l h_vmem=1G

#$ -l h_rt=01:00:00

#$ -pe ompi* 4

mpirun MPI_C_SAMPLE

Save the script to "<job_script.sh>" and run the job with the following command:

 qsub <job_script.sh>

along with necessary options. The submission script is a handy and flexible tool for setting these options and passing them to the scheduler along with the job name, though it's not strictly required. The typical job submission scripts suited for different types of jobs can be found on the Job control page.

The login node or "head node" on each cluster is intended for managing jobs and files, not for significant computing. As a guideline, any process run on the head node should not consume more than 15 minutes of CPU time. Note that this is not the same as 15 minutes of elapsed time: Login sessions, for example, may last arbitrarily long, but consume little CPU.

All longer jobs must be submitted to the compute hosts via the scheduler, which manages the available resources and assigns them to the waiting jobs. The scheduler used on all ACEnet clusters is Sun Grid Engine (SGE), which is also known as the N1 Grid Engine (N1GE).

Interactive testing on the head node

 $ mpirun -np 4 my_parallel_application

Interactive session through Sun Grid Engine

 $ qrsh -cwd -V -l h_rt=00:10:00,test=true -pe ompi\* 4 my_parallel_application

Please refer to the Job control wiki page for detailed information on how you can manage your jobs. Also, check out the commands qsum and showq.

7. Debugging with TotalView

The TotalView debugger can be used for debugging both serial and parallel (MPI, OpenMP) applications. Users of parallel programs will find it especially useful because of its focus on debugging multi-processor programs. It provides both a graphical and a command-line interface, and it includes several features for MPI and OpenMP debugging.

7.1 Latest Version

8.6.2-2

7.2 Location

Glooscap, Placentia, Mahone, Fundy, Courtenay

7.3 Main source of information

TotalView Support, Documentation, Video Tutorials, Tips & Tricks

7.4 Recommended resources

TotalView Overview and Demo

Printable PDF Documentation

7.5 Additional resources

TotalView Tutorial from Lawrence Livermore National Laboratory

Open MPI FAQ: How do I run with the TotalView parallel debugger?

TotalView Release Notes

7.6 Compiling the code

To provide the symbolic debug information that a debugger needs, you must recompile your code. Usually, this means passing the -g flag to your compiler.

 mpif90 -g -o test test.f90

7.7 Basic

• Graphical Interface: totalview

• Command Line Interface: totalviewcli

If you want to use the GUI-based TotalView parallel debugger, then you need to make sure that you are connecting to the head node of the cluster with X11 forwarding enabled in your SSH client. That will allow the windows of a remotely started application to be shown on your own desktop. Unix users need to run the X11 server on their desktops (if you are running any window manager then you already have the X11 server installed) and connect to the head node with the -X option for the SSH client (ssh -X servername.ace-net.ca). Windows users need to install Xming and connect with the PuTTY program with X11 forwarding enabled.

7.8 Debugging Open MPI programs

Before you start debugging with the TotalView parallel debugger you will need to create a file in your home directory named $HOME/.tvdrc with the following content:

 source /usr/local/openmpi/etc/openmpi-totalview.tcl

This will configure TotalView to skip mpirun and jump right into your MPI application; otherwise it will stop deep in the machine code of mpirun itself, which is not what most users want.

You can use the TotalView debugger either on the head node or through the Grid Engine interactive queues (Placentia and Courtenay do not support debugging through the queues yet). To debug a job, you just need to include --debug in the mpirun command line. Open MPI will automatically invoke TotalView to run your MPI processes.

Debugging on the head node

If your application is not computationally intensive, does not use a lot of memory, and you are running debugging sessions for short periods of time with a small number of processes (no more than 4), then you can debug your program on the head node.

 mpirun --debug -np 4 my_parallel_application

Debugging on the compute nodes

If your debugging sessions do not qualify to run on the head node, then you need to use the dedicated test.q resources, which allow you to run a job for less than 1 hour. This option is available at the following sites: Mahone, Fundy, Glooscap. Depending on the cluster, you can request up to 8 slots/processes from Grid Engine.

 qrsh -V -cwd -pe ompi 4 -l h_rt=00:30:00,test=true mpirun --debug myapplication

If you are debugging large jobs and require more than 4-8 processes, then you can request free slots for an interactive job in the production short.q queue. If free resources are available, they will be granted to you.

 qrsh -V -cwd -pe ompi 20 -l h_rt=00:30:00 mpirun --debug myapplication

8. FAQs

Why can't I log in?

First, check WAVELETS and the front wiki page to ensure that the machine is not in a scheduled maintenance outage. Sometimes during such an outage the machine may present a login prompt but refuse to recognize your credentials.

If that's not the problem, email support at ace-net.ca

Why doesn't my job start right away?

This could be for a variety of reasons. When you submit a job to the N1 Grid Engine you are making a request for resources. There may be times when the cluster is busy and you will be required to wait for resources. If you use the qstat command, you may see qw next to your job. This indicates that it is in the queue and waiting to be scheduled. If you see an r next to your job then your job is running.

That said, it is often not clear which missing resources are preventing your job from being scheduled. Most often it is memory (h_vmem) that is in short supply. If your job needs only modest resources, you may be able to increase its likelihood of being scheduled by reducing its memory request. For example:

qalter -l h_vmem=500M,h_rt=hh:mm:ss job_id

will reduce the virtual memory reserved for the job to 500 megabytes. (You must re-supply the h_rt and any other arguments to -l when you use qalter.) The default values are listed on the Job control page. Note that for parallel jobs, this h_vmem request is per process. The scheduler will only start your job if it can find a host (or hosts) with enough memory unassigned to other jobs. You can determine the vmem available on various hosts with

qhost -F h_vmem

or you can see how many hosts have at least, say, 8 gigabytes free with

qhost -l h_vmem=8G
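Because the h_vmem request is per process, the total memory reserved by a parallel job grows with the number of slots. The arithmetic can be sketched as follows, using illustrative values that match the example Open MPI submission script in section 6.2 (h_vmem=1G and 4 slots); the variable names are hypothetical.

```shell
# Total memory reserved by a parallel job = per-process h_vmem * slots.
# Illustrative values: 1G per process and 4 slots.
h_vmem_mb=1024   # h_vmem=1G expressed in megabytes
slots=4          # slot count requested with -pe

total_mb=$(( h_vmem_mb * slots ))
echo "Total reservation: ${total_mb} MB across ${slots} processes"
```

With these values the job asks the scheduler for 4096 MB in total, so lowering h_vmem has a proportionally larger effect on jobs with many slots.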

You can also try defining a short time limit for the job:

qalter -l h_rt=0:1:0,other args job_id

imposes a hard run-time limit of 0 hours, 1 minute, 0 seconds (0:1:0). In certain circumstances the scheduler will be able to schedule a job that it knows will finish quickly, where it cannot schedule a longer job.

My job is running well but I noticed an error message

 [cl005:00XXX] ras:gridengine: JOB_ID: YYYYY

This is not an error but a diagnostic message generated to the output file for every Open MPI job, and it contains some useful information:

• the name of the shepherd host - cl005

• the process ID (PID) of the mpirun command on the shepherd host - 00XXX

• the grid engine job ID - JOB_ID: YYYYY

I get the error message

 Open RTE was unable to open the hostfile:

   /tmp/XXXXX.1.short.q/machines

 Check to make sure the path and filename are correct.

You should not be using the option -machinefile in the mpirun command in your submission script. Open MPI will obtain all necessary information directly from Sun Grid Engine.

Check out the typical submission script here.

My job was running fine but then it got terminated with the message

[cl0XX:YYYYY] ERROR: A daemon on node cl0ZZ failed to start as expected.

[cl0XX:YYYYY] ERROR: There may be more information available from

[cl0XX:YYYYY] ERROR: the 'qstat -t' command on the Grid Engine tasks.

[cl0XX:YYYYY] ERROR: If the problem persists, please restart the

[cl0XX:YYYYY] ERROR: Grid Engine PE job

or for Myrinet, with the message

MX:cl0XX:Remote endpoint is closed, peer=00:60:dd:xx:yy:zz (cl0XX:0)

or for Ethernet, with the messages

mca_btl_tcp_frag_recv: readv failed with errno=104

mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111

It likely means that the job was killed because of the run-time limit (h_rt). Check the run-time of your job (start_time and end_time) with the following command: qacct -j <job_id>, and compare it to the h_rt parameter in your submission script.

You can also get these error messages when your job fails, and some of the processes die or segfault, and others lose communication because of that and have to be killed.
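The comparison between a job's run time and its h_rt limit can be sketched as below. The helper hms_to_seconds is a hypothetical name, the limit is the 01:00:00 value from the example submission scripts, and the elapsed time is a made-up number standing in for the difference between end_time and start_time reported by qacct.

```shell
# Hypothetical helper: convert an h_rt value of the form hh:mm:ss to seconds.
hms_to_seconds() {
  # Split the argument on ":" into hours, minutes and seconds.
  IFS=: read -r h m s <<EOF
$1
EOF
  # Drop one leading zero so that values such as "08" are not read as octal.
  h=${h#0}; m=${m#0}; s=${s#0}
  echo $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
}

limit=$(hms_to_seconds 01:00:00)   # h_rt from the submission script
elapsed=3605                       # made-up run time in seconds

if [ "$elapsed" -ge "$limit" ]; then
  echo "Job reached its h_rt limit of ${limit}s"
else
  echo "Job ended before its h_rt limit of ${limit}s"
fi
```

If the elapsed time is at or just above the limit, the run-time limit is the likely cause; if it is well below, look for a crash in one of the processes instead.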

How do I run X Windows (X11) on Microsoft Windows?

We recommend Xming, which is very straightforward to install and easy to get working. Check out the guide here.

Permission denied error messages

If you see one of the following messages

 Permission denied (publickey,password,keyboard-interactive).

 Permission denied, please try again.

 (gnome-ssh-askpass:2810): Gtk-WARNING **: cannot open display:

 Permission denied, please try again.

then please check that the passwordless SSH access is configured properly.
