OpenMP Parallel Computing
Using Sun's Forte Developer Compilers and Tools
at Dalhousie University

1. Introduction

1.1 About this document

This document is a customized version of the Forte Developer 7: OpenMP API User's Guide for using OpenMP at Dalhousie University, Faculty of Computer Science. The document is organized into six sections:

Introduction
Preliminary Setup
How to Compile OpenMP programs
How to Monitor OpenMP programs
How to Profile OpenMP programs
Example Sessions.

1.2 About the Forte OpenMP compiler

The Forte Developer OpenMP compiler is available on Locutus and Borg. The Forte C compiler implements the OpenMP interface for explicit parallelization, including a set of source code directives, run-time library routines, and environment variables, and enables it with the new option -xopenmp. It translates C/C++ and Fortran programs with OpenMP directives into code suitable for compiling with a native compiler linked with the Sun OpenMP runtime library.

OpenMP is designed for symmetric multiprocessors (SMP) such as Locutus and Borg. For example, Locutus is a Sun Enterprise 4500 with 3 GB of RAM and 8 processors running Solaris 7, which makes it an ideal machine for trying out OpenMP programs.

NOTE: You must use ssh to log into Locutus, as this machine does not respond to Telnet requests.

2. Preliminary Setup

The Forte Developer product components and man pages are not installed into the standard /usr/bin/ and /usr/share/man directories. To access the Forte Developer compilers and tools, you must have the Forte Developer component directory in your PATH environment variable. To access the Forte Developer man pages, you must have the Forte Developer man page directory in your MANPATH environment variable. The FCS installation of Forte is in /opt/, so you should add these lines to your .bashrc:

PATH=$PATH:/opt/SUNWspro/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/SUNWspro/lib
MANPATH=$MANPATH:/opt/SUNWspro/man
export PATH LD_LIBRARY_PATH MANPATH

3. How to Compile and run your OpenMP program

3.1 Set Environment Variables

To run a parallelized program in a multithreaded environment, you must set the OMP_NUM_THREADS or PARALLEL environment variable prior to program execution. This tells the runtime system the maximum number of threads the program can create. The default is 1. In general, set OMP_NUM_THREADS or PARALLEL to the number of processors available on the target platform; on Locutus, that is 8. For example, to use 4 threads under csh or tcsh:

setenv OMP_NUM_THREADS 4

or

setenv PARALLEL 4

or, if you are using the bash shell:

export OMP_NUM_THREADS=4

or

export PARALLEL=4

Note: If you set both OMP_NUM_THREADS and PARALLEL, set them to the same value.

The OMP_DYNAMIC variable enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. Its value is either TRUE or FALSE. If it is not set, a default value of TRUE is used. Example:

setenv OMP_DYNAMIC FALSE

3.2 Compiling for OpenMP

To enable explicit parallelization with OpenMP directives, compile your program with the option flag -xopenmp in C/C++, or with the option flag -openmp in Fortran 95. For example:

% cc -xopenmp example.c

Then run your OpenMP program as usual. For example:

% a.out

The optimization level under the -xopenmp flag is -xO3. The compiler issues a warning if the optimization level of your program is changed from a lower level to -xO3.

4. How to Monitor OpenMP programs

You can monitor the parallel execution of your programs using

% mpstat

mpstat reports per-processor statistics in tabular form. See man mpstat for details. For a more visual representation try

% jmpstat

5. How to Profile your OpenMP program

Sometimes your OpenMP program will not speed up as you expect. In such cases, profiling the program may help you analyze and solve the problem.

The Forte Program Performance Analysis Tools consist of the Collector and the Performance Analyzer, a pair of tools that you use to collect and analyze performance data for your application. Both tools can be used from the command line or from a graphical user interface. For example, to collect data for omptest using 8 threads and then analyze its performance:

locutus% collect -o omptest.8.er omptest
locutus% analyzer omptest.8.er &

More information can be found here.

prof and gprof are traditional profiling tools that generate a statistical profile of the CPU time used by a program and count the number of times each function in the program is entered. tcov is a code coverage tool that reports the number of times each function is called and each source line is executed.

locutus% cc -xopenmp -p -o hello.exe omp_hello.c
locutus% prof hello.exe

For more information about prof, gprof, and tcov, see here.

6. Example Sessions

locutus% export OMP_NUM_THREADS=4
locutus% export PARALLEL=4
locutus% wget www.cs.dal.ca/~arc/resources/OpenMP/example2/omp_hello.c
locutus% cc -xopenmp -o hello.exe omp_hello.c
locutus% hello.exe
Hello World from thread = 3
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 0
Number of threads = 4
locutus% export OMP_NUM_THREADS=8
locutus% export PARALLEL=8
locutus% hello.exe
Hello World from thread = 5
Hello World from thread = 6
Hello World from thread = 3
Hello World from thread = 4
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 7
Hello World from thread = 0
Number of threads = 8

Compile your program with the -p compiler option, run it once to write the profile data file mon.out, then run prof to generate a profile report.

 locutus% cc -xopenmp -p -o hello.exe omp_hello.c 
 locutus% prof hello.exe
 %Time Seconds Cumsecs  #Calls   msec/call  Name
  85.7    0.06    0.06                      __mt_WaitForWork_
  14.3    0.01    0.07      10      1.0     _mprotect
   0.0    0.00    0.07       1      0.      main
   0.0    0.00    0.07       4      0.      atexit
   0.0    0.00    0.07       1      0.      _exithandle
   0.0    0.00    0.07       1      0.      _fpsetsticky
   0.0    0.00    0.07       9      0.      _xregs_clrptr
   0.0    0.00    0.07       4      0.      printf
   0.0    0.00    0.07      78      0.0     __lwp_mutex_lock
   0.0    0.00    0.07       9      0.      __lwp_sema_init
   0.0    0.00    0.07       1      0.      processor_info
   0.0    0.00    0.07      80      0.0     __lwp_mutex_unlock
   0.0    0.00    0.07       2      0.      __door_bind
   0.0    0.00    0.07       3      0.      __door_return
   0.0    0.00    0.07       2      0.      _gettimeofday
   0.0    0.00    0.07      77      0.0     _lock_try
   0.0    0.00    0.07       1      0.      _profil
   0.0    0.00    0.07       1      0.      __signotifywait
   0.0    0.00    0.07       9      0.      __lwp_create
   0.0    0.00    0.07       9      0.      __lwp_continue
   0.0    0.00    0.07       6      0.      __lwp_self
   0.0    0.00    0.07      22      0.0     ___lwp_mutex_lock
   0.0    0.00    0.07       2      0.      ___lwp_cond_wait
   0.0    0.00    0.07      10      0.0     __lwp_sema_wait
   0.0    0.00    0.07       9      0.      __lwp_sema_post
   0.0    0.00    0.07      10      0.0     __lwp_schedctl
   0.0    0.00    0.07       1      0.      __lwp_sigredirect
   0.0    0.00    0.07       8      0.      _doprnt
   0.0    0.00    0.07       9      0.      memchr
   0.0    0.00    0.07      18      0.0     _ferror_unlocked
   0.0    0.00    0.07       2      0.      atoi
   0.0    0.00    0.07      10      0.0     getenv
   0.0    0.00    0.07       2      0.      _mmap
   0.0    0.00    0.07       2      0.      _time
   0.0    0.00    0.07       5      0.      _findbuf
   0.0    0.00    0.07       1      0.      _isatty
   0.0    0.00    0.07       1      0.      _setbufend
   0.0    0.00    0.07      17      0.0     _realbufend
   0.0    0.00    0.07       9      0.      _xflsbuf
   0.0    0.00    0.07       8      0.      _getorientation
   0.0    0.00    0.07       8      0.      _set_orientation_byte
   0.0    0.00    0.07       9      0.      calloc
   0.0    0.00    0.07      10      0.0     .umul
   0.0    0.00    0.07      11      0.0     _memset
   0.0    0.00    0.07       2      0.      _sysconf
   0.0    0.00    0.07       1      0.      _sysconfig
   0.0    0.00    0.07       4      0.      _flockget
   0.0    0.00    0.07       9      0.      _flockrel
   0.0    0.00    0.07      10      0.0     _fileno_unlocked
   0.0    0.00    0.07      59      0.0     .urem
   0.0    0.00    0.07      52      0.0     .rem
   0.0    0.00    0.07      14      0.0     _sigdelset
   0.0    0.00    0.07       9      0.      malloc
   0.0    0.00    0.07       9      0.      _memcpy
   0.0    0.00    0.07       3      0.      _libc_open
   0.0    0.00    0.07       9      0.      .udiv
   0.0    0.00    0.07       5      0.      _close
   0.0    0.00    0.07       3      0.      __open
   0.0    0.00    0.07       1      0.      ___errno
   0.0    0.00    0.07       2      0.      __sigaction
   0.0    0.00    0.07       9      0.      _write
   0.0    0.00    0.07       1      0.      exit

More example sessions to test the efficiency of parallelization strategies. The session below collects performance data for omptest using 4 threads and then 8 threads, and starts the Performance Analyzer for both cases:

locutus% cp -r /users/grad/hongyu/openMPtutorial/omptest .
locutus% cd omptest
locutus% make
locutus% export OMP_NUM_THREADS=4
locutus% export PARALLEL=4
locutus% collect -o omptest.4.er omptest
locutus% export OMP_NUM_THREADS=8
locutus% export PARALLEL=8
locutus% collect -o omptest.8.er omptest
locutus% analyzer omptest.4.er &
locutus% analyzer omptest.8.er &

For more details, see the demo here.
