OpenMP Notes

This article records notes related to OpenMP.

Basics

Components of OpenMP

OpenMP consists of compiler directives, a runtime library, and environment variables.

OpenMP Is Not:

  • Meant for distributed memory parallel systems (by itself)
  • Necessarily implemented identically by all vendors
  • Guaranteed to make the most efficient use of shared memory
  • Required to check for data dependencies, data conflicts, race conditions, deadlocks, or code sequences that cause a program to be classified as non-conforming
  • Designed to handle parallel I/O. The programmer is responsible for synchronizing input and output.

Directive format

#pragma omp directive [clause [clause] ...]

Example:

void mxv_row(int m,int n,double *a,double *b,double *c)
{
  int i, j;
  double sum;
#pragma omp parallel for default(none)          \
  private(i,j,sum) shared(m,n,a,b,c)
  for (i=0; i<m; i++)
    {
      sum = 0.0;
      for (j=0; j<n; j++)
        sum += b[i*n+j]*c[j];
      a[i] = sum;
    } /*-- End of parallel for --*/
}

Some directives

private

  • Private variables are undefined on entry and exit of the parallel region
  • The value of the original variable (before the parallel region) is undefined after the parallel region!
  • A private variable within a parallel region has no storage association with the same variable outside of the region
  • Use the firstprivate/lastprivate clauses to override this behaviour

Example: variables declared inside a parallel region are private (see the sketch below).
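
A minimal sketch of these rules (the variable name tmp and the thread count of 4 are chosen only for illustration):

#include <stdio.h>
#include <omp.h>

int main() {
    int tmp = 42;                 /* value set in the serial part */
#pragma omp parallel private(tmp) num_threads(4)
    {
        /* tmp is a fresh, uninitialized copy in each thread,
           so it must be assigned before it is used */
        tmp = omp_get_thread_num();
        printf("thread %d sees tmp = %d\n", omp_get_thread_num(), tmp);
    }
    /* as stated above, the value of the original tmp here is undefined */
    return 0;
}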

firstprivate is used to:

  • initialize a private variable with the value from the serial part of the code
  • the plain private clause does not initialize the variable

lastprivate

  • the thread that executes the sequentially last iteration copies its value back to the original (master/serial) variable
  • this gives the same result as serial execution (see the sketch below)
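
A combined sketch of the two clauses; the loop bound of 8 and the variable names offset and last are made up for illustration:

#include <stdio.h>

int main() {
    int n = 8;
    int offset = 100;   /* initialized in the serial part of the code */
    int last = -1;
    int i;
#pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < n; i++) {
        /* each thread starts with its own copy of offset == 100 (firstprivate) */
        last = i + offset;   /* the value from the last iteration (i == n-1) survives */
    }
    printf("last = %d\n", last);   /* prints 107, the same as serial execution */
    return 0;
}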

ordered

  • used when part of the loop must execute in serial order
  • ordered clause plus an ordered directive
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(){

    int n;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp for ordered
        for (n=0;n<10;n++)
            #pragma omp ordered
            printf("n = %d\n",n);
    }
    return 0;
}

#pragma omp for ordered schedule(dynamic)
 for(int n=0; n<100; ++n)
 {
   files[n].compress();

   #pragma omp ordered
   send(files[n]);
 }

The collapse clause

When you have nested loops, you can use the collapse clause to merge the iteration spaces of several nested loops and distribute the combined iterations across the threads.

#pragma omp parallel for collapse(2)
 for(int y=0; y<25; ++y)
   for(int x=0; x<80; ++x)
   {
     tick(x,y);
   }

Sections

Sometimes it is handy to indicate that "this and this can run in parallel". The sections construct is exactly for that.

 #pragma omp sections
 {
   { Work1(); }
   #pragma omp section
   { Work2();
     Work3(); }
   #pragma omp section
   { Work4(); }
 }

This code indicates that any of the tasks Work1, Work2 + Work3 and Work4 may run in parallel, but that Work2 and Work3 must be run in sequence.

#pragma omp sections
{
#pragma omp section
{ foo(); }
#pragma omp section
{ bar(); }
#pragma omp section
{ beer(); }
} // end of sections
  • If “too many” sections, some threads execute more than one section (round-robin)
  • If “too few” sections, some threads are idle
  • We don’t know in advance which thread will execute which section

The default clause
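
default(shared) makes variables shared unless listed otherwise, while default(none) forces an explicit data-sharing attribute for every variable referenced in the construct, which helps catch mistakes at compile time. A small sketch (the variable names are only illustrative):

#include <stdio.h>

int main() {
    int i, n = 100;
    double sum = 0.0, x[100];
    for (i = 0; i < n; i++)
        x[i] = 1.0;
    /* with default(none), any variable without an explicit
       data-sharing attribute is rejected by the compiler */
#pragma omp parallel for default(none) private(i) shared(n, x) reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += x[i];
    printf("sum = %f\n", sum);
    return 0;
}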

The reduction clause

 int sum=0;
 #pragma omp parallel for reduction(+:sum)
 for(int n=0; n<1000; ++n) sum += table[n];

The nowait clause
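
nowait removes the implicit barrier at the end of a work-sharing construct, so threads that finish the first loop can move straight on to the next one; this is only safe when the loops touch independent data. A sketch (the arrays a and b are only illustrative):

#include <stdio.h>

int main() {
    const int n = 1000;
    static double a[1000], b[1000];
#pragma omp parallel
    {
        /* no barrier after this loop: a thread may continue immediately */
#pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = i * 0.5;

        /* safe only because this second loop does not read a[] */
#pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = i * 2.0;
    }
    printf("a[10] = %f  b[10] = %f\n", a[10], b[10]);
    return 0;
}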

The parallel region

A parallel region supports the following clauses:

  • if (scalar expression)
  • private (list)
  • shared (list)
  • default (none|shared)
  • reduction (operator: list)
  • copyin (list)
  • firstprivate (list)
  • num_threads (scalar_int_expr)

The schedule clause for Load Balance

  • static: each thread is assigned a fixed-size chunk (default)
  • dynamic: work is assigned as a thread requests it
  • guided: big chunks first, then smaller and smaller chunks
  • runtime: use the OMP_SCHEDULE environment variable to control scheduling (see the sketch below)
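
A sketch of the clause syntax; the chunk size of 4 and the loop length are arbitrary choices. schedule(runtime) would instead read the policy from the OMP_SCHEDULE environment variable listed further below:

#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 16;
    /* dynamic scheduling with chunks of 4 iterations: threads grab the next
       free chunk as they finish, which helps when iteration costs vary */
#pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++)
        printf("iteration %2d done by thread %d\n", i, omp_get_thread_num());
    return 0;
}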

Synchronization Controls

MASTER Directive

#pragma omp master

The MASTER directive specifies a region that is to be executed only by the master thread of the team.

#pragma omp barrier

Each thread waits until all others have reached this point
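
A common master + barrier pattern as a sketch: the master thread prepares some shared data, and the barrier guarantees the data is ready before any thread reads it (the array name data and the thread count are only illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
    static int data[4] = {0, 0, 0, 0};
#pragma omp parallel num_threads(4)
    {
#pragma omp master
        {
            for (int i = 0; i < 4; i++)
                data[i] = 10 * (i + 1);   /* only the master fills the array */
        }
        /* master has no implied barrier, so an explicit one is needed here */
#pragma omp barrier
        int id = omp_get_thread_num();
        printf("thread %d reads data[%d] = %d\n", id, id, data[id]);
    }
    return 0;
}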

Critical region

All threads execute the code, but only one at a time:

#pragma omp critical [(name)]
{<code-block>}

#pragma omp atomic
<statement>

This is a lightweight, special form of a critical section
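
A sketch contrasting the two: atomic protects a single update of one memory location (the counter increment), while critical is needed when the protected work is more than one simple statement (the compare-and-update of max_seen). The names are illustrative only:

#include <stdio.h>

int main() {
    int counter = 0;
    double max_seen = -1.0;
#pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        /* lightweight: one update of one memory location */
#pragma omp atomic
        counter++;

        double value = (i % 7) * 0.5;
        /* heavier: compare-and-update must be done as one unit */
#pragma omp critical
        {
            if (value > max_seen)
                max_seen = value;
        }
    }
    printf("counter = %d, max_seen = %f\n", counter, max_seen);
    return 0;
}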

SINGLE and MASTER constructs

#pragma omp single [clause[[,] clause] ...]
{
<code-block>
}

Only one thread in the team executes the code enclosed

#pragma omp master
{<code-block>}

Only the master thread executes the code block

#pragma omp ordered
{<code-block>}

The enclosed block of code is executed in the order in which iterations would be executed sequentially

#pragma omp flush [(list)]

Ensure that all threads in a team have a consistent view of certain objects in memory (In the absence of a list, all visible variables are flushed)

There is an implicit barrier at the end of the following constructs:

  • loop construct
  • single construct
  • parallel construct

For the loop and single constructs this barrier can be removed with the nowait clause; the barrier at the end of a parallel region cannot be removed.

OpenMP Environment Variables

OpenMP environment variable          Default for Sun OpenMP
OMP_NUM_THREADS n                    1
OMP_SCHEDULE “schedule,[chunk]”      static, “N/P”
OMP_DYNAMIC { TRUE | FALSE }         TRUE
OMP_NESTED { TRUE | FALSE }          FALSE
OMP_STACKSIZE

OpenMP Runtime Functions

Name                  Functionality
omp_set_num_threads   Set number of threads
omp_get_num_threads   Return number of threads in the current team
omp_get_max_threads   Return maximum number of threads
omp_get_thread_num    Get thread ID
omp_get_num_procs     Return number of available processors
omp_in_parallel       Check whether in a parallel region
omp_set_dynamic       Activate dynamic thread adjustment (but the implementation is free to ignore this)
omp_get_dynamic       Check for dynamic thread adjustment
omp_set_nested        Activate nested parallelism (but the implementation is free to ignore this)
omp_get_nested        Check for nested parallelism
omp_get_wtime         Return wall-clock time
omp_get_wtick         Return number of seconds between clock ticks
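
A short sketch exercising a few of these routines; requesting 4 threads is an arbitrary choice:

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4);                    /* request 4 threads */
    double t0 = omp_get_wtime();               /* wall-clock timing */
#pragma omp parallel
    {
        int id = omp_get_thread_num();         /* this thread's ID */
        int nthreads = omp_get_num_threads();  /* size of the team */
#pragma omp single
        printf("team of %d threads, %d processors, in parallel: %d\n",
               nthreads, omp_get_num_procs(), omp_in_parallel());
        printf("hello from thread %d\n", id);
    }
    printf("elapsed: %f s (tick = %g s)\n",
           omp_get_wtime() - t0, omp_get_wtick());
    return 0;
}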

Summary

The following OpenMP directives do not accept clauses:

  • MASTER
  • CRITICAL
  • BARRIER
  • ATOMIC
  • FLUSH
  • ORDERED
  • THREADPRIVATE

Examples

  1. Find the maximum number in an array and get its index
#include <iostream>
#include <vector>
#include "omp.h"
using namespace std;
int main()
{
  int A[18] = {1, 2, 1, 6, 8, 2, 1, 2, 11, 6, 8, 2, 1, 2, 1, 6, 8, 2};
  int max_value = 0, max_row = 0;
  int local_max_value = 0, local_max_row = 0;

#pragma omp parallel firstprivate(local_max_row, local_max_value) shared(max_row, max_value)
  {
#pragma omp for
    for (int j = 0; j < 18; j++) {
      if (A[j] > local_max_value) {
        local_max_value = A[j];
        local_max_row = j;
      }
    }
#pragma omp critical
    if (local_max_value > max_value) {
      max_row = local_max_row;
      max_value = local_max_value;
    }
  }
  cout << "max index " << max_row << " values= " << max_value << endl;
  return 0;
}
  2. Using collapse
#include <iostream>
#include <vector>
#include "omp.h"
using namespace std;
int main()
{
  int j, k;
#pragma omp parallel
  {
#pragma omp for collapse(2)
    for (k=1; k<=2; k++)
      for (j=1; j<=3; j++)
        {
          #pragma omp critical
          {
            int id = omp_get_thread_num();
            cout << "id " << id << " k = " << k << " j = " << j << endl;
          }
        }
  }
  return 0;
}

Output:

id 3 k = 2 j = 1
id 0 k = 1 j = 1
id 2 k = 1 j = 3
id 4 k = 2 j = 2
id 1 k = 1 j = 2
id 5 k = 2 j = 3

OpenMP Notes about Acceleration Performance

Eigen block operation vs. OpenMP for-loop operation

#include <iostream>
#include <vector>
#include <Library/eigen/Eigen/Dense>
#include <Library/eigen/unsupported/Eigen/MatrixFunctions>
#include <Include/Utils/Logger.h>
#include "mpi.h"
#include "omp.h"

using namespace Eigen;
using namespace std;
typedef Eigen::Matrix<double, Dynamic, Dynamic> MatrixD;
int main(int argc, char* argv[]){
  // MPI_Init(&argc, &argv);
  int provided, my_id;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
  int N;
  if (my_id == 0) {
    cin >> N;
  }
  MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
  MatrixD m = MatrixD::Random(2, N);
  double start, finish;
  start = MPI_Wtime();
  m.row(0) += m.row(1);
  finish = MPI_Wtime();
  if (my_id == 0) {
    cout << "block operation cost time: " << finish - start << " seconds" << endl;
  }
  start = MPI_Wtime();
#pragma omp parallel for
  for (int i = 0; i < N; i++) {
    m(0, i) += m(1, i);
  }
  finish = MPI_Wtime();
  if (my_id == 0) {
    cout << "OpenMP for cost time: " << finish - start << " seconds" << endl;
  }
  start = MPI_Wtime();
  #pragma omp parallel
  {
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    int block_size = m.cols() / num_threads;
    int remain_size = m.cols() % num_threads;
    m.block(0, thread_id * block_size, 1, block_size) += m.block(1, thread_id * block_size, 1, block_size);
    #pragma omp master
    m.block(0, num_threads * block_size, 1, remain_size) += m.block(1, num_threads * block_size, 1, remain_size);
  }
  finish = MPI_Wtime();
  if (my_id == 0) {
    cout << "OpenMP block operation cost time: " << finish - start << " seconds" << endl;
  }


  start = MPI_Wtime();
  for (int i = 0; i < N; i++) {
    m(0, i) += m(1, i);
  }
  finish = MPI_Wtime();
  if (my_id == 0) {
    cout << "Normal for cost time: " << finish - start << " seconds" << endl;
  }

  MPI_Finalize();
  return 0;
}

Result

The following results were obtained on a PC with make mpitest file=test.cpp n=2.

1000000
block operation cost time: 0.0114908 seconds
OpenMP for cost time: 0.0650411 seconds
OpenMP block operation cost time: 0.0261152 seconds
Normal for cost time: 0.134705 seconds

block operation cost time: 0.114225 seconds
OpenMP for cost time: 0.461387 seconds
OpenMP block operation cost time: 0.165431 seconds
Normal for cost time: 1.16467 seconds

We can see that the Eigen block operation for the addition is much faster than the OpenMP implementations.

Comparing the new operator

#include <iostream>
#include <vector>
#include <Library/eigen/Eigen/Dense>
#include "mpi.h"
#include "omp.h"

using namespace Eigen;
using namespace std;
typedef Eigen::Matrix<double, Dynamic, Dynamic> MatrixD;

int main(int argc, char* argv[]){
  // MPI_Init(&argc, &argv);
  int provided, my_id;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
  int N, M;
  if (my_id == 0) {
    cin >> N >> M;
    vector <MatrixD*> V;
    double start, finish;
    start = MPI_Wtime();
    for (int i = 0; i < N; i++) {
      MatrixD *tmp = new MatrixD(M, M);
      V.push_back(tmp);
    }
    finish = MPI_Wtime();
    if (my_id == 0) {
      cout << "Single thread cost time: " << finish - start << " seconds" << endl;
    }

    for (int i = 0; i < N; i++) {
      delete V[i];
    }
    V.clear();
    V.resize(N);   // pre-size so each thread writes its own slot; push_back is not thread-safe
    start = MPI_Wtime();
#pragma omp parallel for
    for (int i = 0; i < N; i++) {
      MatrixD *tmp = new MatrixD(M, M);
      V[i] = tmp;
    }
    finish = MPI_Wtime();
    if (my_id == 0) {
      cout << "OpenMP cost time: " << finish - start << " seconds" << endl;
    }
  }
  MPI_Finalize();
  return 0;
}

Result

The following results were measured on a PC with mpirun -n 2.

10000 1000
Single thread cost time: 0.0245256 seconds
OpenMP cost time: 0.0420988 seconds
100000 1000
Single thread cost time: 0.237885 seconds
OpenMP cost time: 0.270465 seconds
1000000 100
Single thread cost time: 1.67924 seconds
OpenMP cost time: 14.3786 seconds

From the results, we can see that when using the new operator, a single thread performs better than multiple threads, likely because concurrent heap allocations contend inside the memory allocator.
