OpenMP Parallel Computing
Using Sun's Forte Developer Compilers and Tools
at Dalhousie University


 

1. Introduction

1.1 About this document

This document is a customized version of the Forte Developer 7: OpenMP API User's Guide, adapted for using OpenMP at the Faculty of Computer Science, Dalhousie University. It is organized into six sections:

    1. Introduction
    2. Preliminary Setup
    3. How to Compile OpenMP programs
    4. How to Monitor OpenMP programs
    5. How to Profile OpenMP programs
    6. Example Sessions
       

1.2 About the Forte OpenMP compiler

The Forte Developer OpenMP compiler is available on Locutus and Borg. With the new -xopenmp option, the Forte C compiler implements the OpenMP interface for explicit parallelization, which consists of a set of source code directives, run-time library routines, and environment variables. It translates C/C++ and Fortran programs containing OpenMP directives into code that is compiled by the native compiler and linked with the Sun OpenMP runtime library.

OpenMP is designed for Symmetric Multi-Processor (SMP) machines such as Locutus and Borg. For example, Locutus is a Sun Enterprise 4500 with 3 GB of RAM and 8 processors running Solaris 7, which makes it an ideal machine for trying out OpenMP programs. NOTE: You must use ssh to log into Locutus, as this machine does not respond to Telnet requests.

 

2. Preliminary Setup

3. How to Compile and run your OpenMP program

3.1 Set Environment Variables

To run a parallelized program in a multithreaded environment, set the OMP_NUM_THREADS or PARALLEL environment variable before running the program. This tells the runtime system the maximum number of threads the program can create. The default is 1. In general, set OMP_NUM_THREADS or PARALLEL to the number of processors available on the target platform; on Locutus, that is 8.

For example, to use 4 threads in csh or tcsh:

        setenv OMP_NUM_THREADS 4     or      setenv PARALLEL 4

or, if you are using the bash shell:

        export OMP_NUM_THREADS=4     or      export PARALLEL=4

Note: If you set both OMP_NUM_THREADS and PARALLEL, set them to the same value.

The OMP_DYNAMIC variable enables or disables dynamic adjustment of the number of threads available for executing parallel regions. Its value is either TRUE or FALSE; if it is not set, the default is TRUE. Example:

        setenv OMP_DYNAMIC FALSE 

3.2 Compiling for OpenMP

To enable explicit parallelization with OpenMP directives, compile your program with the -xopenmp option flag in C/C++ (or with the -openmp option flag in Fortran 95). For example:

        % cc -xopenmp example.c

Then run your OpenMP program as usual. For example:

        % a.out 

The -xopenmp flag sets the optimization level to -xO3. The compiler issues a warning if this raises the optimization level of your program from a lower level to -xO3.

 

4. How to Monitor OpenMP programs

You can monitor the parallel execution of your programs using

        % mpstat

mpstat reports per-processor statistics in tabular form. See man mpstat for details. For a more visual representation try

        % jmpstat 

5. How to Profile your OpenMP program

Sometimes your OpenMP program will not speed up as much as you expect. In such cases, profiling the program may help you analyze and solve the problem.

Forte Program Performance Analysis Tools consist of Collector and Performance Analyzer, a pair of tools that you use to collect and analyze performance data for your application. Both tools can be used from the command line or from a graphical user interface. For example:

        locutus% collect -o omptest.8.er omptest  //collecting data for omptest using 8 threads 
        locutus% analyzer omptest.8.er &          //analyze performance 

More information can be found here.

prof and gprof are traditional profiling tools that generate a statistical profile of the CPU time used by a program and count the number of times each function in the program is entered. tcov is a code coverage tool that reports the number of times each function is called and each source line is executed.

        locutus% cc -xopenmp -p -o hello.exe omp_hello.c
        locutus% hello.exe        //run the program so that mon.out is written
        locutus% prof hello.exe

For more information about prof, gprof, and tcov, see here.

6. Example Sessions

locutus% export OMP_NUM_THREADS=4
locutus% export PARALLEL=4
locutus% wget www.cs.dal.ca/~arc/resources/OpenMP/example2/omp_hello.c
locutus% cc -xopenmp -o hello.exe omp_hello.c 
locutus% hello.exe
   Hello World from thread = 3
   Hello World from thread = 2
   Hello World from thread = 1
   Hello World from thread = 0
   Number of threads = 4


locutus% export OMP_NUM_THREADS=8
locutus% export PARALLEL=8
locutus% hello.exe
   Hello World from thread = 5
   Hello World from thread = 6
   Hello World from thread = 3
   Hello World from thread = 4
   Hello World from thread = 2
   Hello World from thread = 1
   Hello World from thread = 7
   Hello World from thread = 0

   Number of threads = 8

Compile your program with the -p compiler option, run it to generate a profile file called mon.out, and then run prof to generate a profile report.

 locutus% cc -xopenmp -p -o hello.exe omp_hello.c 
 locutus% hello.exe        //run the program so that mon.out is written
 locutus% prof hello.exe
%Time Seconds Cumsecs #Calls msec/call Name
 85.7    0.06    0.06                  __mt_WaitForWork_
 14.3    0.01    0.07     10       1.0 _mprotect
  0.0    0.00    0.07      1        0. main
  0.0    0.00    0.07      4        0. atexit
  0.0    0.00    0.07      1        0. _exithandle
  0.0    0.00    0.07      1        0. _fpsetsticky
  0.0    0.00    0.07      9        0. _xregs_clrptr
  0.0    0.00    0.07      4        0. printf
  0.0    0.00    0.07     78       0.0 __lwp_mutex_lock
  0.0    0.00    0.07      9        0. __lwp_sema_init
  0.0    0.00    0.07      1        0. processor_info
  0.0    0.00    0.07     80       0.0 __lwp_mutex_unlock
  0.0    0.00    0.07      2        0. __door_bind
  0.0    0.00    0.07      3        0. __door_return
  0.0    0.00    0.07      2        0. _gettimeofday
  0.0    0.00    0.07     77       0.0 _lock_try
  0.0    0.00    0.07      1        0. _profil
  0.0    0.00    0.07      1        0. __signotifywait
  0.0    0.00    0.07      9        0. __lwp_create
  0.0    0.00    0.07      9        0. __lwp_continue
  0.0    0.00    0.07      6        0. __lwp_self
  0.0    0.00    0.07     22       0.0 ___lwp_mutex_lock
  0.0    0.00    0.07      2        0. ___lwp_cond_wait
  0.0    0.00    0.07     10       0.0 __lwp_sema_wait
  0.0    0.00    0.07      9        0. __lwp_sema_post
  0.0    0.00    0.07     10       0.0 __lwp_schedctl
  0.0    0.00    0.07      1        0. __lwp_sigredirect
  0.0    0.00    0.07      8        0. _doprnt
  0.0    0.00    0.07      9        0. memchr
  0.0    0.00    0.07     18       0.0 _ferror_unlocked
  0.0    0.00    0.07      2        0. atoi
  0.0    0.00    0.07     10       0.0 getenv
  0.0    0.00    0.07      2        0. _mmap
  0.0    0.00    0.07      2        0. _time
  0.0    0.00    0.07      5        0. _findbuf
  0.0    0.00    0.07      1        0. _isatty
  0.0    0.00    0.07      1        0. _setbufend
  0.0    0.00    0.07     17       0.0 _realbufend
  0.0    0.00    0.07      9        0. _xflsbuf
  0.0    0.00    0.07      8        0. _getorientation
  0.0    0.00    0.07      8        0. _set_orientation_byte
  0.0    0.00    0.07      9        0. calloc
  0.0    0.00    0.07     10       0.0 .umul
  0.0    0.00    0.07     11       0.0 _memset
  0.0    0.00    0.07      2        0. _sysconf
  0.0    0.00    0.07      1        0. _sysconfig
  0.0    0.00    0.07      4        0. _flockget
  0.0    0.00    0.07      9        0. _flockrel
  0.0    0.00    0.07     10       0.0 _fileno_unlocked
  0.0    0.00    0.07     59       0.0 .urem
  0.0    0.00    0.07     52       0.0 .rem
  0.0    0.00    0.07     14       0.0 _sigdelset
  0.0    0.00    0.07      9        0. malloc
  0.0    0.00    0.07      9        0. _memcpy
  0.0    0.00    0.07      3        0. _libc_open
  0.0    0.00    0.07      9        0. .udiv
  0.0    0.00    0.07      5        0. _close
  0.0    0.00    0.07      3        0. __open
  0.0    0.00    0.07      1        0. ___errno
  0.0    0.00    0.07      2        0. __sigaction
  0.0    0.00    0.07      9        0. _write
  0.0    0.00    0.07      1        0. exit

More example sessions to test the efficiency of parallelization strategies:

locutus% cp -r /users/grad/hongyu/openMPtutorial/omptest .
locutus% cd omptest
locutus% make
locutus% export OMP_NUM_THREADS=4
locutus% export PARALLEL=4
locutus% collect -o omptest.4.er omptest  //collecting data for omptest using 4 threads 
locutus% export OMP_NUM_THREADS=8    
locutus% export PARALLEL=8
locutus% collect -o omptest.8.er omptest   //collecting data for omptest using 8 threads
locutus% analyzer omptest.4.er &      
locutus% analyzer omptest.8.er &    //start the Performance Analyzer for both cases

For more details, see the demo here.