OpenMP Notes

This article records notes related to OpenMP.

Basics

Components of OpenMP

OpenMP consists of compiler directives, a runtime library, and environment variables.

OpenMP Is Not:

  • Meant for distributed memory parallel systems (by itself)
  • Necessarily implemented identically by all vendors
  • Guaranteed to make the most efficient use of shared memory
  • Required to check for data dependencies, data conflicts, race conditions, deadlocks, or code sequences that cause a program to be classified as non-conforming
  • Designed to handle parallel I/O. The programmer is responsible for synchronizing input and output.

Directive format

#pragma omp directive [clause [clause] ...]

Example:

void mxv_row(int m,int n,double *a,double *b,double *c)
{
  int i, j;
  double sum;
#pragma omp parallel for default(none)          \
  private(i,j,sum) shared(m,n,a,b,c)
  for (i=0; i<m; i++)
    {
      sum = 0.0;
      for (j=0; j<n; j++)
        sum += b[i*n+j]*c[j];
      a[i] = sum;
    } /*-- End of parallel for --*/
}

Some directives

private

  • Private variables are undefined on entry and exit of the parallel region
  • The value of the original variable (before the parallel region) is undefined after the parallel region!
  • A private variable within a parallel region has no storage association with the same variable outside of the region
  • Use the firstprivate/lastprivate clauses to override this behaviour

Example: variables declared inside a parallel region are private (see the sketch below).
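
A minimal sketch of these rules (the variable name tmp and the thread count of 4 are chosen only for illustration):

#include <stdio.h>
#include <omp.h>

int main() {
    int tmp = 42;                 /* value set in the serial part */
#pragma omp parallel private(tmp) num_threads(4)
    {
        /* tmp is a fresh, uninitialized copy in each thread,
           so it must be assigned before it is used */
        tmp = omp_get_thread_num();
        printf("thread %d sees tmp = %d\n", omp_get_thread_num(), tmp);
    }
    /* as stated above, the value of the original tmp here is undefined */
    return 0;
}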

firstprivate is used to:

  • initialize a private variable with the value from the serial part of the code
  • the plain private clause does not initialize the variable

lastprivate

  • the thread that executes the sequentially last iteration copies its value back to the original (master/serial) variable
  • this gives the same result as serial execution (see the sketch below)
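
A combined sketch of the two clauses; the loop bound of 8 and the variable names offset and last are made up for illustration:

#include <stdio.h>

int main() {
    int n = 8;
    int offset = 100;   /* initialized in the serial part of the code */
    int last = -1;
    int i;
#pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < n; i++) {
        /* each thread starts with its own copy of offset == 100 (firstprivate) */
        last = i + offset;   /* the value from the last iteration (i == n-1) survives */
    }
    printf("last = %d\n", last);   /* prints 107, the same as serial execution */
    return 0;
}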

ordered

  • used when part of the loop must execute in serial order
  • ordered clause plus an ordered directive
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(){

    int n;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp for ordered
        for (n=0;n<10;n++)
            #pragma omp ordered
            printf("n = %d\n",n);
    }
    return 0;
}

#pragma omp for ordered schedule(dynamic)
 for(int n=0; n<100; ++n)
 {
   files[n].compress();

   #pragma omp ordered
   send(files[n]);
 }

The collapse clause

When you have nested loops, you can use the collapse clause to merge the iteration spaces of several nested loops and distribute the combined iterations across the threads.

#pragma omp parallel for collapse(2)
 for(int y=0; y<25; ++y)
   for(int x=0; x<80; ++x)
   {
     tick(x,y);
   }

Sections

Sometimes it is handy to indicate that "this and this can run in parallel". The sections construct is exactly for that.

 #pragma omp sections
 {
   { Work1(); }
   #pragma omp section
   { Work2();
     Work3(); }
   #pragma omp section
   { Work4(); }
 }

This code indicates that any of the tasks Work1, Work2 + Work3 and Work4 may run in parallel, but that Work2 and Work3 must be run in sequence.

#pragma omp sections
{
#pragma omp section
{ foo(); }
#pragma omp section
{ bar(); }
#pragma omp section
{ beer(); }
} // end of sections
  • If “too many” sections, some threads execute more than one section (round-robin)
  • If “too few” sections, some threads are idle
  • We don’t know in advance which thread will execute which section

The default clause
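
default(shared) makes variables shared unless listed otherwise, while default(none) forces an explicit data-sharing attribute for every variable referenced in the construct, which helps catch mistakes at compile time. A small sketch (the variable names are only illustrative):

#include <stdio.h>

int main() {
    int i, n = 100;
    double sum = 0.0, x[100];
    for (i = 0; i < n; i++)
        x[i] = 1.0;
    /* with default(none), any variable without an explicit
       data-sharing attribute is rejected by the compiler */
#pragma omp parallel for default(none) private(i) shared(n, x) reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += x[i];
    printf("sum = %f\n", sum);
    return 0;
}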

The reduction clause

 int sum=0;
 #pragma omp parallel for reduction(+:sum)
 for(int n=0; n<1000; ++n) sum += table[n];

The nowait clause
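
nowait removes the implicit barrier at the end of a work-sharing construct, so threads that finish the first loop can move straight on to the next one; this is only safe when the loops touch independent data. A sketch (the arrays a and b are only illustrative):

#include <stdio.h>

int main() {
    const int n = 1000;
    static double a[1000], b[1000];
#pragma omp parallel
    {
        /* no barrier after this loop: a thread may continue immediately */
#pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = i * 0.5;

        /* safe only because this second loop does not read a[] */
#pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = i * 2.0;
    }
    printf("a[10] = %f  b[10] = %f\n", a[10], b[10]);
    return 0;
}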

The parallel region

A parallel region supports the following clauses:

  • if (scalar expression)
  • private (list)
  • shared (list)
  • default (none|shared)
  • reduction (operator: list)
  • copyin (list)
  • firstprivate (list)
  • num_threads (scalar_int_expr)

The schedule clause for Load Balance

  • static: each thread is assigned a fixed-size chunk (default)
  • dynamic: work is assigned as a thread requests it
  • guided: big chunks first, then smaller and smaller chunks
  • runtime: use the OMP_SCHEDULE environment variable to control scheduling (see the sketch below)
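
A sketch of the clause syntax; the chunk size of 4 and the loop length are arbitrary choices. schedule(runtime) would instead read the policy from the OMP_SCHEDULE environment variable listed further below:

#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 16;
    /* dynamic scheduling with chunks of 4 iterations: threads grab the next
       free chunk as they finish, which helps when iteration costs vary */
#pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        printf("iteration %2d done by thread %d\n", i, omp_get_thread_num());
    return 0;
}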

Synchronization Controls

MASTER Directive

#pragma omp master

The MASTER directive specifies a region that is to be executed only by the master thread of the team.

#pragma omp barrier

Each thread waits until all others have reached this point
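
A common master + barrier pattern as a sketch: the master thread prepares some shared data, and the barrier guarantees the data is ready before any thread reads it (the array name data and the thread count are only illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
    static int data[4] = {0, 0, 0, 0};
#pragma omp parallel num_threads(4)
    {
#pragma omp master
        {
            for (int i = 0; i < 4; i++)
                data[i] = 10 * (i + 1);   /* only the master fills the array */
        }
        /* master has no implied barrier, so an explicit one is needed here */
#pragma omp barrier
        int id = omp_get_thread_num();
        printf("thread %d reads data[%d] = %d\n", id, id, data[id]);
    }
    return 0;
}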

Critical region

All threads execute the code, but only one at a time:

#pragma omp critical [(name)]
{<code-block>}

#pragma omp atomic
<statement>

This is a lightweight, special form of a critical section
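
A sketch contrasting the two: atomic protects a single update of one memory location (the counter increment), while critical is needed when the protected work is more than one simple statement (the compare-and-update of max_seen). The names are illustrative only:

#include <stdio.h>

int main() {
    int counter = 0;
    double max_seen = -1.0;
#pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        /* lightweight: one update of one memory location */
#pragma omp atomic
        counter++;

        double value = (i % 7) * 0.5;
        /* heavier: compare-and-update must be done as one unit */
#pragma omp critical
        {
            if (value > max_seen)
                max_seen = value;
        }
    }
    printf("counter = %d, max_seen = %f\n", counter, max_seen);
    return 0;
}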

SINGLE and MASTER constructs

#pragma omp single [clause[[,] clause] ...]
{
<code-block>
}

Only one thread in the team executes the code enclosed

#pragma omp master
{<code-block>}

Only the master thread executes the code block

#pragma omp ordered
{<code-block>}

The enclosed block of code is executed in the order in which iterations would be executed sequentially

#pragma omp flush [(list)]

Ensure that all threads in a team have a consistent view of certain objects in memory (In the absence of a list, all visible variables are flushed)

There is an implicit barrier at the end of the following constructs:

  • loop construct
  • single construct
  • parallel construct

For the loop and single constructs this barrier can be removed with the nowait clause; the barrier at the end of a parallel region cannot be removed.

OpenMP Environment Variables

OpenMP environment variable          Default for Sun OpenMP
OMP_NUM_THREADS n                    1
OMP_SCHEDULE “schedule,[chunk]”      static, “N/P”
OMP_DYNAMIC { TRUE | FALSE }         TRUE
OMP_NESTED { TRUE | FALSE }          FALSE
OMP_STACKSIZE

OpenMP Runtime Functions

Name                  Functionality
omp_set_num_threads   Set number of threads
omp_get_num_threads   Return number of threads in the current team
omp_get_max_threads   Return maximum number of threads
omp_get_thread_num    Get thread ID
omp_get_num_procs     Return number of available processors
omp_in_parallel       Check whether in a parallel region
omp_set_dynamic       Activate dynamic thread adjustment (but the implementation is free to ignore this)
omp_get_dynamic       Check for dynamic thread adjustment
omp_set_nested        Activate nested parallelism (but the implementation is free to ignore this)
omp_get_nested        Check for nested parallelism
omp_get_wtime         Return wall-clock time
omp_get_wtick         Return number of seconds between clock ticks
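
A short sketch exercising a few of these routines; requesting 4 threads is an arbitrary choice:

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4);                    /* request 4 threads */
    double t0 = omp_get_wtime();               /* wall-clock timing */
#pragma omp parallel
    {
        int id = omp_get_thread_num();         /* this thread's ID */
        int nthreads = omp_get_num_threads();  /* size of the team */
#pragma omp single
        printf("team of %d threads, %d processors, in parallel: %d\n",
               nthreads, omp_get_num_procs(), omp_in_parallel());
        printf("hello from thread %d\n", id);
    }
    printf("elapsed: %f s (tick = %g s)\n",
           omp_get_wtime() - t0, omp_get_wtick());
    return 0;
}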

Summary

The following OpenMP directives do not accept clauses:

  • MASTER
  • CRITICAL
  • BARRIER
  • ATOMIC
  • FLUSH
  • ORDERED
  • THREADPRIVATE

Examples

  1. Find the maximum number in an array and get its index
#include <iostream>
#include <vector>
#include "omp.h"
using namespace std;
int main()
{
  int A[18] = {1, 2, 1, 6, 8, 2, 1, 2, 11, 6, 8, 2, 1, 2, 1, 6, 8, 2};
  int max_value = 0, max_row = 0;
  int local_max_value = 0, local_max_row = 0;

#pragma omp parallel firstprivate(local_max_row, local_max_value) shared(max_row, max_value)
  {
#pragma omp for
    for (int j = 0; j < 18; j++) {
      if (A[j] > local_max_value) {
        local_max_value = A[j];
        local_max_row = j;
      }
    }
#pragma omp critical
    if (local_max_value > max_value) {
      max_row = local_max_row;
      max_value = local_max_value;
    }
  }
  cout << "max index " << max_row << " values= " << max_value << endl;
  return 0;
}
  2. Using collapse
#include <iostream>
#include <vector>
#include "omp.h"
using namespace std;
int main()
{
  int j, k;
#pragma omp parallel
  {
#pragma omp for collapse(2)
    for (k=1; k<=2; k++)
      for (j=1; j<=3; j++)
        {
          #pragma omp critical
          {
            int id = omp_get_thread_num();
            cout << "id " << id << " k = " << k << " j = " << j << endl;
          }
        }
  }
  return 0;
}

Output:

id 3 k = 2 j = 1
id 0 k = 1 j = 1
id 2 k = 1 j = 3
id 4 k = 2 j = 2
id 1 k = 1 j = 2
id 5 k = 2 j = 3

OpenMP Notes about Acceleration Performance

Eigen block operation vs. OpenMP for-loop operation

#include <iostream>
#include <vector>
#include <Library/eigen/Eigen/Dense>
#include <Library/eigen/unsupported/Eigen/MatrixFunctions>
#include <Include/Utils/Logger.h>
#include "mpi.h"
#include "omp.h"

using namespace Eigen;
using namespace std;
typedef Eigen::Matrix<double, Dynamic, Dynamic> MatrixD;
int main(int argc, char* argv[]){
  // MPI_Init(&argc, &argv);
  int provided, my_id;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
  int N;
  if (my_id == 0) {
    cin >> N;
  }
  MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
  MatrixD m = MatrixD::Random(2, N);
  double start, finish;
  start = MPI_Wtime();
  m.row(0) += m.row(1);
  finish = MPI_Wtime();
  if (my_id == 0) {
    cout << "block operation cost time: " << finish - start << " seconds" << endl;
  }
  start = MPI_Wtime();
#pragma omp parallel for
  for (int i = 0; i < N; i++) {
    m(0, i) += m(1, i);
  }
  finish = MPI_Wtime();
  if (my_id == 0) {
    cout << "OpenMP for cost time: " << finish - start << " seconds" << endl;
  }
  start = MPI_Wtime();
  #pragma omp parallel
  {
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    int block_size = m.cols() / num_threads;
    int remain_size = m.cols() % num_threads;
    m.block(0, thread_id * block_size, 1, block_size) += m.block(1, thread_id * block_size, 1, block_size);
    #pragma omp master
    m.block(0, num_threads * block_size, 1, remain_size) += m.block(1, num_threads * block_size, 1, remain_size);
  }
  finish = MPI_Wtime();
  if (my_id == 0) {
    cout << "OpenMP block operation cost time: " << finish - start << " seconds" << endl;
  }


  start = MPI_Wtime();
  for (int i = 0; i < N; i++) {
    m(0, i) += m(1, i);
  }
  finish = MPI_Wtime();
  if (my_id == 0) {
    cout << "Normal for cost time: " << finish - start << " seconds" << endl;
  }

  MPI_Finalize();
  return 0;
}

Result

The following results were obtained on a PC with make mpitest file=test.cpp n=2.

1000000
block operation cost time: 0.0114908 seconds
OpenMP for cost time: 0.0650411 seconds
OpenMP block operation cost time: 0.0261152 seconds
Normal for cost time: 0.134705 seconds

block operation cost time: 0.114225 seconds
OpenMP for cost time: 0.461387 seconds
OpenMP block operation cost time: 0.165431 seconds
Normal for cost time: 1.16467 seconds

We can see that the Eigen block operation for the addition is much faster than the OpenMP implementations.

Comparing the new operator

#include <iostream>
#include <vector>
#include <Library/eigen/Eigen/Dense>
#include "mpi.h"
#include "omp.h"

using namespace Eigen;
using namespace std;
typedef Eigen::Matrix<double, Dynamic, Dynamic> MatrixD;

int main(int argc, char* argv[]){
  // MPI_Init(&argc, &argv);
  int provided, my_id;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
  int N, M;
  if (my_id == 0) {
    cin >> N >> M;
    vector <MatrixD*> V;
    double start, finish;
    start = MPI_Wtime();
    for (int i = 0; i < N; i++) {
      MatrixD *tmp = new MatrixD(M, M);
      V.push_back(tmp);
    }
    finish = MPI_Wtime();
    if (my_id == 0) {
      cout << "Single thread cost time: " << finish - start << " seconds" << endl;
    }

    for (int i = 0; i < N; i++) {
      delete V[i];
    }
    V.clear();
    V.resize(N);   // pre-size so each thread writes its own slot; push_back is not thread-safe
    start = MPI_Wtime();
#pragma omp parallel for
    for (int i = 0; i < N; i++) {
      MatrixD *tmp = new MatrixD(M, M);
      V[i] = tmp;
    }
    finish = MPI_Wtime();
    if (my_id == 0) {
      cout << "OpenMP cost time: " << finish - start << " seconds" << endl;
    }
  }
  MPI_Finalize();
  return 0;
}

Result

The following results were measured on a PC with mpirun -n 2.

10000 1000
Single thread cost time: 0.0245256 seconds
OpenMP cost time: 0.0420988 seconds
100000 1000
Single thread cost time: 0.237885 seconds
OpenMP cost time: 0.270465 seconds
1000000 100
Single thread cost time: 1.67924 seconds
OpenMP cost time: 14.3786 seconds

From the results, we can see that when using the new operator, a single thread performs better than multiple threads, likely because concurrent heap allocations contend inside the memory allocator.
