This document is a customized version of the Forte Developer 7: OpenMP API User's Guide, adapted for using OpenMP at the Dalhousie University Faculty of Computer Science. The document is organized into six sections: Introduction, Preliminary Setup, Compiling OpenMP programs, Monitoring OpenMP programs, Profiling OpenMP programs, and Example Sessions.
1.2 About the Forte OpenMP compiler
The Forte Developer OpenMP compiler is available on Locutus and Borg. The Forte OpenMP C compiler implements the OpenMP interface for explicit parallelization, including a set of source code directives, run-time library routines, and environment variables, and enables it with the new option -xopenmp. It translates C/C++ and Fortran programs with OpenMP directives into code suitable for compiling with a native compiler linked with the Sun OpenMP runtime library.
OpenMP is designed for symmetric multiprocessor (SMP) machines such as Locutus and Borg. For example, Locutus is a Sun Enterprise 4500 with 3 GB of RAM and 8 processors running Solaris 7, which makes it an ideal machine for trying out OpenMP programs. NOTE: You must use ssh to log into Locutus, because this machine does not respond to Telnet requests.
The Forte Developer product components and man pages are not installed into the standard /usr/bin/ and /usr/share/man directories. To access the Forte Developer compilers and tools, you must have the Forte Developer component directory in your PATH environment variable. To access the Forte Developer man pages, you must have the Forte Developer man page directory in your MANPATH environment variable.
The FCS installation of Forte is in /opt/, so you should add these lines to your .bashrc:
PATH=$PATH:/opt/SUNWspro/bin
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/SUNWspro/lib
MANPATH=$MANPATH:/opt/SUNWspro/man
export PATH LD_LIBRARY_PATH MANPATH
3.1 Set Environment Variables
To run a parallelized program in a multithreaded environment, you must set the OMP_NUM_THREADS or PARALLEL environment variable before running the program. This tells the runtime system the maximum number of threads the program can create. The default is 1. In general, set OMP_NUM_THREADS or PARALLEL to the number of processors available on the target platform. On Locutus, that number is 8.
For example, if you want to use 4 threads,
setenv OMP_NUM_THREADS 4
or
setenv PARALLEL 4
or, if you are using the bash shell,
export OMP_NUM_THREADS=4
or
export PARALLEL=4
Note: OMP_NUM_THREADS and PARALLEL should be set to the same value.
The OMP_DYNAMIC variable enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. Its value is either TRUE or FALSE. If it is not set, a default value of TRUE is used. Example:
setenv OMP_DYNAMIC FALSE
3.2 Compiling for OpenMP
To enable explicit parallelization with OpenMP directives, compile your program with the option flag -xopenmp in C/C++ (or with the option flag -openmp in Fortran 95). For example:
% cc -xopenmp example.c
Then run your OpenMP program as usual. For example:
% a.out
The optimization level under flag -xopenmp is -xO3. The compiler issues a warning if the optimization level of your program is changed from a lower level to -xO3.
You can monitor the parallel execution of your programs using
% mpstat
mpstat reports per-processor statistics in tabular form. See man mpstat for details. For a more visual representation try
% jmpstat
Sometimes your OpenMP program will not speed up as much as you expect. In such cases, profiling the program may help you analyze and solve the problem.
Forte Program Performance Analysis Tools consist of Collector and Performance Analyzer, a pair of tools that you use to collect and analyze performance data for your application. Both tools can be used from the command line or from a graphical user interface. For example:
locutus% collect -o omptest.8.er omptest //collecting data for omptest using 8 threads
locutus% analyzer omptest.8.er & //analyze performance
More information can be found here.
prof and gprof are traditional profiling tools that generate a statistical profile of the CPU time used by a program and count the number of times each function in the program is entered. tcov is a code coverage tool that reports the number of times each function is called and each source line is executed.
locutus% cc -xopenmp -p -o hello.exe omp_hello.c
locutus% prof hello.exe
For more information about prof, gprof, and tcov, see here.
locutus% export OMP_NUM_THREADS=4
locutus% export PARALLEL=4
locutus% wget www.cs.dal.ca/~arc/resources/OpenMP/example2/omp_hello.c
locutus% cc -xopenmp -o hello.exe omp_hello.c
locutus% hello.exe
Hello World from thread = 3
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 0
Number of threads = 4
locutus% export OMP_NUM_THREADS=8
locutus% export PARALLEL=8
locutus% hello.exe
Hello World from thread = 5
Hello World from thread = 6
Hello World from thread = 3
Hello World from thread = 4
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 7
Hello World from thread = 0
Number of threads = 8
Compile your program with the -p compiler option so that running it generates a profile file called mon.out, then run prof to generate a profile report.
locutus% cc -xopenmp -p -o hello.exe omp_hello.c
locutus% prof hello.exe
%Time Seconds Cumsecs  #Calls  msec/call  Name
 85.7    0.06    0.06                     __mt_WaitForWork_
 14.3    0.01    0.07      10        1.0  _mprotect
  0.0    0.00    0.07       1         0.  main
  0.0    0.00    0.07       4         0.  atexit
  0.0    0.00    0.07       1         0.  _exithandle
  0.0    0.00    0.07       1         0.  _fpsetsticky
  0.0    0.00    0.07       9         0.  _xregs_clrptr
  0.0    0.00    0.07       4         0.  printf
  0.0    0.00    0.07      78        0.0  __lwp_mutex_lock
  0.0    0.00    0.07       9         0.  __lwp_sema_init
  0.0    0.00    0.07       1         0.  processor_info
  0.0    0.00    0.07      80        0.0  __lwp_mutex_unlock
  0.0    0.00    0.07       2         0.  __door_bind
  0.0    0.00    0.07       3         0.  __door_return
  0.0    0.00    0.07       2         0.  _gettimeofday
  0.0    0.00    0.07      77        0.0  _lock_try
  0.0    0.00    0.07       1         0.  _profil
  0.0    0.00    0.07       1         0.  __signotifywait
  0.0    0.00    0.07       9         0.  __lwp_create
  0.0    0.00    0.07       9         0.  __lwp_continue
  0.0    0.00    0.07       6         0.  __lwp_self
  0.0    0.00    0.07      22        0.0  ___lwp_mutex_lock
  0.0    0.00    0.07       2         0.  ___lwp_cond_wait
  0.0    0.00    0.07      10        0.0  __lwp_sema_wait
  0.0    0.00    0.07       9         0.  __lwp_sema_post
  0.0    0.00    0.07      10        0.0  __lwp_schedctl
  0.0    0.00    0.07       1         0.  __lwp_sigredirect
  0.0    0.00    0.07       8         0.  _doprnt
  0.0    0.00    0.07       9         0.  memchr
  0.0    0.00    0.07      18        0.0  _ferror_unlocked
  0.0    0.00    0.07       2         0.  atoi
  0.0    0.00    0.07      10        0.0  getenv
  0.0    0.00    0.07       2         0.  _mmap
  0.0    0.00    0.07       2         0.  _time
  0.0    0.00    0.07       5         0.  _findbuf
  0.0    0.00    0.07       1         0.  _isatty
  0.0    0.00    0.07       1         0.  _setbufend
  0.0    0.00    0.07      17        0.0  _realbufend
  0.0    0.00    0.07       9         0.  _xflsbuf
  0.0    0.00    0.07       8         0.  _getorientation
  0.0    0.00    0.07       8         0.  _set_orientation_byte
  0.0    0.00    0.07       9         0.  calloc
  0.0    0.00    0.07      10        0.0  .umul
  0.0    0.00    0.07      11        0.0  _memset
  0.0    0.00    0.07       2         0.  _sysconf
  0.0    0.00    0.07       1         0.  _sysconfig
  0.0    0.00    0.07       4         0.  _flockget
  0.0    0.00    0.07       9         0.  _flockrel
  0.0    0.00    0.07      10        0.0  _fileno_unlocked
  0.0    0.00    0.07      59        0.0  .urem
  0.0    0.00    0.07      52        0.0  .rem
  0.0    0.00    0.07      14        0.0  _sigdelset
  0.0    0.00    0.07       9         0.  malloc
  0.0    0.00    0.07       9         0.  _memcpy
  0.0    0.00    0.07       3         0.  _libc_open
  0.0    0.00    0.07       9         0.  .udiv
  0.0    0.00    0.07       5         0.  _close
  0.0    0.00    0.07       3         0.  __open
  0.0    0.00    0.07       1         0.  ___errno
  0.0    0.00    0.07       2         0.  __sigaction
  0.0    0.00    0.07       9         0.  _write
  0.0    0.00    0.07       1         0.  exit
More example sessions to test the efficiency of parallelization strategies:
locutus% cp -r /users/grad/hongyu/openMPtutorial/omptest .
locutus% cd omptest
locutus% make
locutus% export OMP_NUM_THREADS=4
locutus% export PARALLEL=4
locutus% collect -o omptest.4.er omptest //collecting data for omptest using 4 threads
locutus% export OMP_NUM_THREADS=8
locutus% export PARALLEL=8
locutus% collect -o omptest.8.er omptest //collecting data for omptest using 8 threads
locutus% analyzer omptest.4.er &
locutus% analyzer omptest.8.er & //To start the Performance Analyzer for both cases
For more details, please see the demo here.