[ Introduction | Compiling OpenMP programs | Test Programs | Monitoring OpenMP programs | Profiling OpenMP programs | Example Sessions ]

This document is a customized version of the Omni documentation for using OpenMP at Dalhousie University, Faculty of Computer Science. The document is organized into the six sections listed above.
NOTE: Sun's Forte tools for compiling and profiling/tuning OpenMP code are now also available.
1.2 About the Omni OpenMP compiler
The Omni OpenMP compiler is available on Locutus. It is a software suite that translates C and Fortran programs with OpenMP directives into code suitable for compiling with a native compiler and linking with the Omni OpenMP runtime library.
OpenMP is designed for Symmetric Multi-Processor (SMP) machines like Locutus and Borg. Locutus, for example, is a Sun Enterprise 4500 with 3 GB RAM and 8 processors running Solaris 7, an ideal machine for trying out OpenMP programs. NOTE: You must use ssh to log into Locutus; this machine does not respond to Telnet requests!
The command to compile OpenMP C programs is omcc. The command to compile OpenMP Fortran77 programs is omf77.
To run an OpenMP program, just execute the compiled executable on your SMP platform (Locutus). The number of processors (operating system threads) used is controlled as follows:
- The environment variable OMPC_NUM_PROCS specifies the number of threads used to execute the program. For example, to use 4 threads:
setenv OMPC_NUM_PROCS 4
or, if you are using the bash shell:
export OMPC_NUM_PROCS=4
- If the environment variable OMPC_NUM_PROCS is not set, the runtime checks the number of physical processors in your platform and creates one thread per processor. On Locutus, for example, 8 threads are created.
- The environment variable OMPC_BIND_PROCS determines whether threads are bound or unbound. Unbound threads contend for time on the same processor; that is, they get processor time within the scheduler slice of the main process. A bound thread adds another scheduling slot to the process queue and runs on it, making it a lightweight process rather than a simple thread of execution. Setting this variable binds your threads to processors (as many as the machine has, or as set above). It is highly recommended that you set this to "true" on Locutus to get reasonable performance. For example:
setenv OMPC_BIND_PROCS true
or, if you are using the bash shell:
export OMPC_BIND_PROCS=true
Several OpenMP test programs are included in the "tests" directory, which can be found at
/data/courses/csci6702/Omni-1.4a/tests
A set of example programs, including a simple "Helloworld" program, can be obtained as follows:
locutus% wget www.cs.dal.ca/~arc/resources/OpenMP/example2/example2.tar.gz
locutus% gunzip example2.tar.gz
locutus% tar xvf example2.tar
You can monitor the parallel execution of your programs using
mpstat
mpstat reports per-processor statistics in tabular form. See "man mpstat" for details. For a more visual representation, try
jmpstat
Sometimes your OpenMP program will not speed up as you expect. In such cases, profiling the program may help you analyze and solve the problem.
To enable profiling of the execution, set the OMPC_LOG environment variable:
setenv OMPC_LOG
or, if you are using the bash shell:
export OMPC_LOG=1
Then run your OpenMP program as usual. For example:
a.out
When you run your program, a log file is created with the name of your program plus a ".log" extension. For example, if the program is named "a.out", the log file "a.out.log" is created. The log file records the timings of several kinds of events, including parallel regions, barriers, and scheduling. The event kinds used in the log file are defined in "lib/libtlog/tlog.h".
To see the log file, use the "tlogview" command:
tlogview a.out.log
tlogview is a profile visualization tool that lets you examine the states and events of the OpenMP execution in a window. By dragging over a region in the window with the mouse, you can zoom in on the selected region to see its status more precisely. For more information on the usage of tlogview, check tlogview.
To disable profiling, unset the OMPC_LOG environment variable:
unsetenv OMPC_LOG
or, if you are using the bash shell:
unset OMPC_LOG
Note that when profiling is on, execution slows down due to the overhead of recording the timing of each event. By default, the timer routine uses the "gettimeofday" system call, which gives wall-clock time. The resolution of the timings depends on the timer routine; you may replace it with a platform-specific routine in "lib/libtlog/tlog-time.c".
locutus% wget www.cs.dal.ca/~arc/resources/OpenMP/example2/omp_hello.c
locutus% omcc -o hello.exe omp_hello.c
locutus% ./hello.exe
Hello World from thread = 1
Hello World from thread = 6
Hello World from thread = 5
Hello World from thread = 4
Hello World from thread = 7
Hello World from thread = 2
Hello World from thread = 0
Number of threads = 8
Hello World from thread = 3
More example sessions:
locutus:/tmp/Omni-Test$ cp -r /var/http/htdocs/tech_support/Resources/Omni/cg .
locutus:/tmp/Omni-Test$ cd cg
locutus:/tmp/Omni-Test/cg$ gmake
cc -c -o second.o second.c
cc -o cg cg.c second.o -lm
${OMNI_HOME:=/usr/local}/bin/omcc -o cg-omp cg.c second.o -lm
Compiling 'cg.c'...
${OMNI_HOME:=/usr/local}/bin/omcc -o cg-orphan cg-orphan.c second.o -lm
Compiling 'cg-orphan.c'...
cc -o cg-makedata cg-makedata.c -lm
locutus:/tmp/Omni-Test/cg$ export OMPC_BIND_PROCS=true
locutus:/tmp/Omni-Test/cg$ OMPC_NUM_PROCS=1 ./cg-omp
omp_num_thread=1 omp_max_thread=1
0 1.3764e-13 9.99864415791401
1 2.1067e-15 8.57332792032217
2 2.0809e-15 8.59545103740579
3 2.0978e-15 8.59699723407375
4 1.9100e-15 8.59715491517665
5 2.0295e-15 8.59717443116078
6 1.8605e-15 8.59717707049128
7 1.9794e-15 8.59717744406296
8 1.8638e-15 8.59717749839416
9 1.8070e-15 8.59717750644093
10 1.9231e-15 8.59717750764864
11 1.9795e-15 8.59717750783180
12 1.8284e-15 8.59717750785982
13 1.7639e-15 8.59717750786414
14 1.8498e-15 8.59717750786481
time = 5.178689, 0.345246 (0.000000e+00 - 5.178689e+00)/15, NITCG=25
locutus:/tmp/Omni-Test/cg$ OMPC_NUM_PROCS=2 ./cg-omp
omp_num_thread=1 omp_max_thread=2
0 1.3879e-13 9.99864415791401
1 2.1946e-15 8.57332792032217
2 2.1119e-15 8.59545103740579
3 2.0654e-15 8.59699723407375
4 1.9867e-15 8.59715491517665
5 2.1146e-15 8.59717443116078
6 1.8785e-15 8.59717707049128
7 1.8864e-15 8.59717744406296
8 1.9467e-15 8.59717749839416
9 1.8364e-15 8.59717750644093
10 1.8862e-15 8.59717750764864
11 1.7944e-15 8.59717750783180
12 1.8616e-15 8.59717750785982
13 1.8112e-15 8.59717750786414
14 1.8938e-15 8.59717750786481
time = 2.622956, 0.174864 (0.000000e+00 - 2.622956e+00)/15, NITCG=25
locutus:/tmp/Omni-Test/cg$ OMPC_NUM_PROCS=4 ./cg-omp
omp_num_thread=1 omp_max_thread=4
0 1.4000e-13 9.99864415791401
1 2.3034e-15 8.57332792032217
2 2.0425e-15 8.59545103740579
3 1.9940e-15 8.59699723407375
4 1.8712e-15 8.59715491517666
5 2.0707e-15 8.59717443116078
6 1.8496e-15 8.59717707049128
7 1.9984e-15 8.59717744406296
8 1.9273e-15 8.59717749839416
9 1.7626e-15 8.59717750644093
10 1.9777e-15 8.59717750764864
11 1.8091e-15 8.59717750783180
12 1.8619e-15 8.59717750785982
13 1.7448e-15 8.59717750786414
14 1.7789e-15 8.59717750786481
time = 1.347051, 0.089803 (0.000000e+00 - 1.347051e+00)/15, NITCG=25
locutus:/tmp/Omni-Test/cg$ OMPC_NUM_PROCS=8 ./cg-omp
omp_num_thread=1 omp_max_thread=8
0 1.3958e-13 9.99864415791401
1 2.2396e-15 8.57332792032217
2 2.0384e-15 8.59545103740579
3 1.9475e-15 8.59699723407375
4 1.9677e-15 8.59715491517666
5 2.1662e-15 8.59717443116078
6 1.9009e-15 8.59717707049128
7 1.8623e-15 8.59717744406296
8 1.9301e-15 8.59717749839416
9 1.9383e-15 8.59717750644093
10 1.8409e-15 8.59717750764864
11 1.8889e-15 8.59717750783180
12 1.7497e-15 8.59717750785982
13 1.8287e-15 8.59717750786414
14 1.7180e-15 8.59717750786481
time = 1.841307, 0.122754 (0.000000e+00 - 1.841307e+00)/15, NITCG=25
locutus:/tmp/Omni-Test/cg$ OMPC_BIND_PROCS=false OMPC_NUM_PROCS=8 ./cg-omp
omp_num_thread=1 omp_max_thread=8
0 1.3958e-13 9.99864415791401
1 2.2396e-15 8.57332792032217
2 2.0384e-15 8.59545103740579
3 1.9475e-15 8.59699723407375
4 1.9677e-15 8.59715491517666
5 2.1662e-15 8.59717443116078
6 1.9009e-15 8.59717707049128
7 1.8623e-15 8.59717744406296
8 1.9301e-15 8.59717749839416
9 1.9383e-15 8.59717750644093
10 1.8409e-15 8.59717750764864
11 1.8889e-15 8.59717750783180
12 1.7497e-15 8.59717750785982
13 1.8287e-15 8.59717750786414
14 1.7180e-15 8.59717750786481
time = 21.719778, 1.447985 (0.000000e+00 - 2.171978e+01)/15, NITCG=25
locutus:/tmp/Omni-Test/cg$ OMPC_LOG= OMPC_NUM_PROCS=8 ./cg-omp
log on ...
omp_num_thread=1 omp_max_thread=8
0 1.3958e-13 9.99864415791401
[...]
time = 5.941640, 0.396109 (0.000000e+00 - 5.941640e+00)/15, NITCG=25
finalize log ...
locutus:/tmp/Omni-Test/cg$ tlogview cg-omp.log
Notice the possibility of diminishing returns going from 4-way to 8-way.
Also notice the effect of not binding threads to processors on a machine where others are running compute-heavy jobs in the background. It is highly recommended that you set the OMPC_BIND_PROCS=true environment variable.