CUDA Note

These notes record what I learned while studying CUDA.

Basic

A warp contains 32 threads. The warp is the basic unit of scheduling and execution, while the half-warp is the basic unit of memory operations.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <cstdio>

__global__ void printHelloGPU()
{
    printf("Hello World from the GPU\n");
}

int main()
{
    printHelloGPU<<<1, 1>>>();
    cudaDeviceSynchronize();

    getchar();
    return 0;
}

The cudaDeviceSynchronize function ensures that all processing on the GPU has finished before the host continues.

  • The __global__ keyword indicates that the following function will run on the GPU.
  • The code executed on the CPU is referred to as host code, and code executed on the GPU is referred to as device code.
  • Functions defined with the __global__ keyword must have return type void.
  • When calling a function to run on the GPU, we call this function a kernel (In the example, printHelloGPU is the kernel).
  • When launching a kernel, we must provide an execution configuration, which is done by using the <<< ... >>> syntax.
int myVar = 0;
if( threadIdx.x % 2 )
    myVar = threadIdx.x;

When some of the threads need to execute an instruction while others don’t, this situation is known as thread divergence. Under normal circumstances, divergent branches simply result in some threads remaining idle, while the other threads actually execute the instructions in the branch.

The Execution Configuration

The execution configuration allows programmers to specify the thread hierarchy for a kernel launch, which defines the number of thread blocks as well as how many threads to execute in each block.

<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK>>>

A kernel is executed once for every thread in every thread block configured when the kernel is launched.

A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads.

Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU (except during preemption, debugging, or CUDA dynamic parallelism). One SM can run several concurrent CUDA blocks, depending on the resources the blocks need. Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at one time.

Thread Hierarchy Variables

gridDim.x is the number of blocks in the grid.

blockIdx.x is the index of the current block within the grid.

blockDim.x is the number of threads in a block. All blocks in a grid contain the same number of threads.

threadIdx.x is the index of the thread within its block (starting at 0).
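
A common pattern that combines these variables is to compute a unique global index and then loop with a grid-sized stride, so that a fixed-size grid can cover an arbitrarily large array. The sketch below is illustrative (the kernel name and parameters are assumptions, not from the original examples):

__global__ void incrementAll(int* data, int n)
{
    // unique global index of this thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // total number of threads in the whole grid
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)   // grid-stride loop
        data[i] += 1;
}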

The CUDA thread hierarchy can be up to three-dimensional.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <cstdio>


__global__ void printHelloGPU()
{
    printf("Hello x: #%d  y: #%dn", threadIdx.x, threadIdx.y);
}

int main()
{
    dim3 threads(3, 3);
    printHelloGPU<<<1, threads>>>();
    cudaDeviceSynchronize();

    getchar();
    return 0;
}

We can use the dim3 structure to specify dimensions for blocks and threads. Example:

  • printHelloGPU<<<1, 25>>>() is configured to run in a single thread block which has 25 threads and will, therefore, run 25 times.
  • printHelloGPU<<<1, dim3(5, 5)>>>() is configured to run in a single thread block which has 25 threads and will, therefore, run 25 times.
  • printHelloGPU<<<5, 5>>>() is configured to run in 5 thread blocks, each of which has 5 threads, and will therefore run 25 times.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}
int main()
{
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

Memory Hierarchy

  • Registers—These are private to each thread, which means that registers assigned to a thread are not visible to other threads. The compiler makes decisions about register utilization.
  • L1/Shared memory (SMEM)—Every SM has a fast, on-chip scratchpad memory that can be used as L1 cache and shared memory. All threads in a CUDA block can share shared memory, and all CUDA blocks running on a given SM can share the physical memory resource provided by the SM.
  • Read-only memory—Each SM has an instruction cache, constant memory, texture memory and RO cache, which is read-only to kernel code.
  • L2 cache—The L2 cache is shared across all SMs, so every thread in every CUDA block can access this memory. The NVIDIA A100 GPU has increased the L2 cache size to 40 MB as compared to 6 MB in V100 GPUs.
  • Global memory—This is the DRAM sitting on the GPU (the framebuffer). Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory.

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces.

Constant Memory

Declaring and copying to constant memory:

__constant__ Sphere s[SPHERES];
cudaMemcpyToSymbol( s, temp_s, sizeof(Sphere) * SPHERES );

There are two reasons why reading from the 64KB of constant memory can save bandwidth over standard reads of global memory:

  • A single read from constant memory can be broadcast to other "nearby" threads, effectively saving up to 15 reads.
  • Constant memory is cached, so consecutive reads of the same address will not incur any additional memory traffic.

When it comes to handling constant memory, NVIDIA hardware can broadcast a single memory read to each half-warp. A half-warp—not nearly as creatively named as a warp—is a group of 16 threads: half of a 32-thread warp. If every thread in a half-warp requests data from the same address in constant memory, your GPU will generate only a single read request and subsequently broadcast the data to every thread. If you are reading a lot of data from constant memory, you will generate only 1/16 (roughly 6 percent) of the memory traffic as you would when using global memory.

But the savings don’t stop at a 94 percent reduction in bandwidth when reading constant memory! Because we have committed to leaving the memory unchanged, the hardware can aggressively cache the constant data on the GPU. So after the first read from an address in constant memory, other half-warps requesting the same address, and therefore hitting the constant cache, will generate no additional memory traffic.

The half-warp broadcast feature is in actuality a double-edged sword. Although it can dramatically accelerate performance when all 16 threads are reading the same address, it actually slows performance to a crawl when all 16 threads read different addresses. The trade-off to allowing the broadcast of a single read to 16 threads is that the 16 threads are allowed to place only a single read request at a time. For example, if all 16 threads in a half-warp need different data from constant memory, the 16 different reads get serialized, effectively taking 16 times the amount of time to place the request. If they were reading from conventional global memory, the request could be issued at the same time. In this case, reading from constant memory would probably be slower than using global memory.
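
To make the trade-off concrete, here is a minimal illustrative sketch (the array name coeff and both kernels are assumptions, not from the book): the first kernel makes every thread in a half-warp read the same constant address, so a single request is broadcast; the second makes the 16 threads read 16 different addresses, so the reads are serialized.

__constant__ float coeff[16];

// Uniform access: all threads in a half-warp read coeff[0],
// so one read request is broadcast to the whole half-warp.
__global__ void uniformRead(float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = coeff[0] * i;
}

// Divergent access: threads in the same half-warp read 16 different
// constant addresses, so the 16 reads are serialized.
__global__ void divergentRead(float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = coeff[threadIdx.x % 16] * i;
}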

Measuring Performance with Events

An event in CUDA is essentially a GPU time stamp that is recorded at a user-specified point in time.

cudaEvent_t start;
cudaEventCreate(&start);
cudaEventRecord( start, 0 );
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord( start, 0 );
// do some work on the GPU
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
int main( void ) {
    // capture the start time
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    kernel<<<grids,threads>>>( dev_bitmap );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float   elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate:  %3.1f ms\n", elapsedTime );
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );

    return 0;
}

Texture Memory

Like constant memory, texture memory is cached on chip, so in some situations it will provide higher effective bandwidth by reducing memory requests to off-chip DRAM. Specifically, texture caches are designed for graphics applications where memory access patterns exhibit a great deal of spatial locality. In a computing application, this roughly implies that a thread is likely to read from an address "near" the addresses that nearby threads read.

// these exist on the GPU side
texture<float>  texConstSrc;
texture<float>  texIn;
texture<float>  texOut;
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_inSrc, imageSize ) );
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_outSrc, imageSize ) );
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_constSrc, imageSize ) );
HANDLE_ERROR( cudaBindTexture( NULL, texConstSrc,
                               data.dev_constSrc,
                               imageSize ) );
HANDLE_ERROR( cudaBindTexture( NULL, texIn,
                               data.dev_inSrc,
                               imageSize ) );
HANDLE_ERROR( cudaBindTexture( NULL, texOut,
                               data.dev_outSrc,
                               imageSize ) );

Page-Locked Host Memory

There is a significant difference between the memory that malloc() will allocate and the memory that cudaHostAlloc() allocates. The C library function malloc() allocates standard, pageable host memory, while cudaHostAlloc() allocates a buffer of page-locked host memory. Sometimes called pinned memory, page-locked buffers have an important property: The operating system guarantees us that it will never page this memory out to disk, which ensures its residency in physical memory.
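
A minimal usage sketch (buffer name and size are illustrative; HANDLE_ERROR is the error-checking macro used in the other snippets):

const size_t size = 1024 * sizeof(float);   // illustrative size
float* h_buf = NULL;
// allocate page-locked (pinned) host memory instead of malloc()
HANDLE_ERROR( cudaHostAlloc( (void**)&h_buf, size, cudaHostAllocDefault ) );
// ... use h_buf as the source/destination of (asynchronous) copies ...
HANDLE_ERROR( cudaFreeHost( h_buf ) );      // free with cudaFreeHost, not free()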

CUDA Streams

CUDA streams can play an important role in accelerating your applications. A CUDA stream represents a queue of GPU operations that get executed in a specific order. We can add operations such as kernel launches, memory copies, and event starts and stops into a stream. The order in which operations are added to the stream specifies the order in which they will be executed. You can think of each stream as a task on the GPU, and there are opportunities for these tasks to execute in parallel.

cudaStream_t    stream;
HANDLE_ERROR( cudaStreamCreate( &stream ) );
// now loop over full data, in bite-sized chunks
for (int i=0; i<FULL_DATA_SIZE; i+= N) {
    // copy the locked memory to the device, async
    HANDLE_ERROR( cudaMemcpyAsync( dev_a, host_a+i, N * sizeof(int),
        cudaMemcpyHostToDevice, stream ) );
    HANDLE_ERROR( cudaMemcpyAsync( dev_b, host_b+i, N * sizeof(int),
        cudaMemcpyHostToDevice, stream ) );
    kernel<<<N/256,256,0,stream>>>( dev_a, dev_b, dev_c );
    // copy the data from device to locked memory
    HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c, N * sizeof(int),
        cudaMemcpyDeviceToHost, stream ) );
}
HANDLE_ERROR( cudaStreamSynchronize( stream ) );

cudaMemcpy() executes synchronously: when the function returns, the copy has completed and the output buffer contains the contents that were supposed to be copied into it.

cudaMemcpyAsync(), by contrast, simply places a request to perform a memory copy into the stream specified by the argument stream. When the call returns, there is no guarantee that the copy has even started yet, much less that it has finished. The guarantee we have is that the copy will be performed before the next operation placed into the same stream. Any host memory pointers passed to cudaMemcpyAsync() must have been allocated with cudaHostAlloc().

Functions

  • __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed.
// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);
    // Allocate input vectors h_A and h_B (and output h_C) in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);
    // Initialize input vectors
    ...
    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);
    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
}

Optimization

Coalesced Access to Global Memory

Suppose the GPU memory bus is 256 bits wide. Because byte-addressed accesses have to be aligned to the bus width, memory is fetched in 32-byte segments. If we only need the data at address 0x00001232, the GPU still fetches the whole segment 0x00001220–0x0000123F, so the other 28 bytes are wasted and bandwidth is consumed for nothing.

To make better use of the bus bandwidth, the GPU coalesces the memory accesses of multiple threads into as few memory requests as possible, which improves bandwidth utilization.

// Assume that the number of threads is 1,024 in each block
float Pixel[1024*1024];
__global__ void coal_float_kernel( float *Pixel)
{
    // consecutive threads access consecutive addresses, so the accesses coalesce
    Pixel[blockIdx.x*1024 + threadIdx.x]++;
}
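
For contrast, a hypothetical kernel with a poorly coalesced pattern: consecutive threads in a warp touch addresses 1,024 floats apart, so each access falls into a different memory segment and cannot be merged.

__global__ void strided_float_kernel( float *Pixel)
{
    // threads 0, 1, 2, ... touch Pixel[0], Pixel[1024], Pixel[2048], ...
    Pixel[threadIdx.x*1024 + blockIdx.x]++;
}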

C++ Language Extensions

Variable Memory Space Specifiers

Shared memory

extern __shared__ float array[];
__device__ void func() {
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2 = (int*)&array1[64];
}

__restrict__

It avoids pointer aliasing issues.

void foo(const float* a, const float* b, float* c) {
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    ...
}

Since the pointers a, b, and c may be aliased, any write through c could modify elements of a or b. The compiler cannot take advantage of common sub-expression elimination or reordering. By making a, b, and c restricted pointers, the programmer asserts to the compiler that the pointers are in fact not aliased. The compiler can then reorder and eliminate common sub-expressions at will.

void foo(const float* __restrict__ a, const float* __restrict__ b,
         float* __restrict__ c);

Built-in Vector Types

Basic types

These are vector types derived from the basic integer and floating-point types. They are structures and the 1st, 2nd, 3rd, and 4th components are accessible through the fields x, y, z, and w, respectively. They all come with a constructor function of the form make_<type name>; for example,

int2 make_int2(int x, int y);

which creates a vector of type int2 with value (x, y). Constructors of this form exist for the types char, short, int, long, longlong, float, and double.
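
A small illustrative sketch (kernel name and parameters are assumptions): elementwise addition of arrays of float4, accessing the components through the x, y, z, and w fields and building the result with make_float4.

__global__ void addFloat4(const float4* a, const float4* b, float4* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 va = a[i], vb = b[i];
        c[i] = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
    }
}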

Built-in Variables

Built-in variables specify the grid and block dimensions and the block and thread indices.

  • gridDim: This variable is of type dim3 and contains the dimensions of the grid.
  • blockIdx: This variable is of type uint3 and contains the block index within the grid.
  • blockDim: This variable is of type dim3 and contains the dimensions of the block.
  • threadIdx: This variable is of type uint3 and contains the thread index within the block.
  • warpSize: This variable is of type int and contains the warp size in threads.

Memory Fence Functions

Memory fence functions can be used to enforce some ordering on memory accesses.

void __threadfence_block();
  • All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to __threadfence_block().
  • All reads from all memory made by the calling thread before the call to __threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to __threadfence_block(). This function is used if threads belong to the same block.
void __threadfence();

acts as __threadfence_block(), but the ordering is observed by all threads on the device. This function is used when the communicating threads are CUDA threads from the same device.

void __threadfence_system();

acts as __threadfence_block() for all threads in the device, for host threads, and for all threads in peer devices. This function is used when the communicating threads are CUDA threads from two different devices. (It is only supported by devices of compute capability 2.x and higher.)

__device__ int X = 1, Y = 2;
__device__ void writeXY() {
    X = 10;
    __threadfence();
    Y = 20;
}
__device__ void readXY() {
    int B = Y;
    __threadfence();
    int A = X;
}

For this code, if one thread executes writeXY() and another executes readXY(), the following outcomes can be observed:

  • A = 1, B = 2
  • A = 10, B = 2
  • A = 10, B = 20

The outcome A = 1, B = 20 is not possible: the fence in writeXY() guarantees that the write to X is visible before the write to Y, and the fence in readXY() guarantees that Y is read before X, so observing Y = 20 implies observing X = 10.

Synchronization Functions

void __syncthreads();

waits until all threads in the thread block have reached this point before any of them is allowed to proceed.

int __syncthreads_count(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.
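
A minimal sketch of how this can be used (the kernel name and buffers are assumptions; it also assumes the grid exactly covers the input array): every thread evaluates a predicate, and every thread in the block receives the same count back.

__global__ void countPositive(const float* in, int* blockCounts)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (in[i] > 0.0f);
    // synchronizes the block and returns, to every thread,
    // the number of threads whose predicate was non-zero
    int n = __syncthreads_count(pred);
    if (threadIdx.x == 0)
        blockCounts[blockIdx.x] = n;
}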

int __syncthreads_and(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

int __syncthreads_or(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.

void __syncwarp(unsigned mask = 0xffffffff);

will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask) before resuming execution. All non-exited threads named in mask must execute a corresponding __syncwarp() with the same mask, or the result is undefined.
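
As an illustrative sketch of __syncwarp() (assuming one warp, i.e. 32 threads, per block; the kernel name is an assumption): a shared-memory reduction within a single warp, where __syncwarp() makes the writes of the previous step visible before the next step reads them.

__global__ void warpReduce(const float* data, float* out)
{
    __shared__ float s[32];
    int lane = threadIdx.x;                  // blockDim.x is assumed to be 32
    s[lane] = data[blockIdx.x * 32 + lane];
    __syncwarp();
    for (int offset = 16; offset > 0; offset /= 2) {
        if (lane < offset)
            s[lane] += s[lane + offset];     // readers and writers touch disjoint slots
        __syncwarp();                        // make the partial sums visible to the warp
    }
    if (lane == 0)
        out[blockIdx.x] = s[0];
}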

Store Functions Using Cache Hints

void __stwb(T* address, T value);  // cache write-back to all coherent levels (the default store behavior)
void __stcg(T* address, T value);  // cache at the global level (L2 and below, not L1)
void __stcs(T* address, T value);  // cache streaming, for data likely to be accessed only once
void __stwt(T* address, T value);  // cache write-through (to system memory)
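
A minimal, hypothetical use (the kernel and its parameters are assumptions): when an output buffer is written exactly once and not re-read soon, a streaming store hint such as __stcs keeps it from displacing more useful data in the cache.

__global__ void scaleStreaming(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        __stcs(&out[i], in[i] * factor);   // store with the cache-streaming hint
}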

Memory

Global Memory

The CPU host can access memory on the GPU in three ways:

  • explicit blocking transfers
  • explicit non-blocking transfers
  • implicit zero-copy access

The GPU follows a lazy model: a value is only fetched from memory when it is actually needed for a computation.
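
A rough host-side sketch of the three access modes listed above (variable names are illustrative and error checking is omitted):

const size_t size = 1 << 20;                 // illustrative buffer size in bytes
float *host_buf = (float*)malloc(size);      // ordinary pageable host memory
float *dev_buf;          cudaMalloc((void**)&dev_buf, size);
cudaStream_t stream;     cudaStreamCreate(&stream);

// 1. Explicit blocking transfer: returns only after the copy has finished.
cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);

// 2. Explicit non-blocking transfer: queued into a stream; the host buffer
//    must be page-locked (allocated with cudaHostAlloc).
float *pinned_buf;
cudaHostAlloc((void**)&pinned_buf, size, cudaHostAllocDefault);
cudaMemcpyAsync(dev_buf, pinned_buf, size, cudaMemcpyHostToDevice, stream);

// 3. Implicit zero-copy: map page-locked host memory into the device address
//    space so a kernel can dereference it directly.
float *host_mapped, *dev_alias;
cudaHostAlloc((void**)&host_mapped, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&dev_alias, host_mapped, 0);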

Examples

vector dot product

  1. At the end, a single thread performs the final summation over the per-thread partial sums.
#include <iostream>
#include <stdio.h>
#include <unistd.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
using namespace std;
#define N 100000

__global__ void dot(int* a, int* b, int* c) {
    int tid = threadIdx.x;
    int tmp = 0;
    while (tid < N ) {
      tmp += a[tid] * b[tid];
      tid += blockDim.x;
    }
    extern __shared__ int shared_memory[];
    tid = threadIdx.x;
    shared_memory[tid] = tmp;
    __syncthreads();
    if (tid == 0) {
      *c = 0;
      for (int i = 0; i < blockDim.x; i++) {
        *c += shared_memory[i];
      }
    }
}
int main() {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c,  sizeof(int));
    for (int i = 0; i < N; i++) {
        a[i] = 1;
        b[i] = 2;
    }
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    dot<<<1, 1000, 1000 * sizeof(int)>>>(dev_a, dev_b, dev_c);
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    cout << "result = " << c[0] << endl;
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
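
A possible extension of this example (not in the original; the kernel name is an assumption): with more than one block, each block reduces its own partial sum in shared memory exactly as above, and thread 0 of each block then adds it to the global result with atomicAdd. *c must be set to 0 on the device first (e.g. with cudaMemset).

__global__ void dot_multiblock(const int* a, const int* b, int* c) {
    extern __shared__ int cache[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int tmp = 0;
    while (tid < N) {
      tmp += a[tid] * b[tid];
      tid += blockDim.x * gridDim.x;   // stride over the whole grid
    }
    cache[threadIdx.x] = tmp;
    __syncthreads();
    if (threadIdx.x == 0) {
      int blockSum = 0;
      for (int i = 0; i < blockDim.x; i++)
        blockSum += cache[i];
      atomicAdd(c, blockSum);          // combine the per-block sums
    }
}

A possible launch: cudaMemset(dev_c, 0, sizeof(int)); followed by dot_multiblock<<<32, 256, 256 * sizeof(int)>>>(dev_a, dev_b, dev_c);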

shared memory

#include <stdio.h>

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

__global__ void dynamicReverse(int *d, int n)
{
  extern __shared__ int s[];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

int main(void)
{
  const int n = 64;
  int a[n], r[n], d[n];

  for (int i = 0; i < n; i++) {
    a[i] = i;
    r[i] = n-i-1;
    d[i] = 0;
  }

  int *d_d;
  cudaMalloc(&d_d, n * sizeof(int));

  // run version with static shared memory
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  staticReverse<<<1,n>>>(d_d, n);
  cudaMemcpy(d, d_d, n*sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++)
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);

  // run dynamic shared memory version
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
  cudaMemcpy(d, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++)
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);
}

This example reverses an array using shared memory.

It uses two ways of allocating shared memory:

  1. static shared memory
__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

This form is used when the amount of shared memory to allocate is already known at compile time.

  2. dynamic shared memory
dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);

With dynamic allocation, the size is passed as the third parameter of the execution configuration <<< ... >>> when launching the kernel, and the array must be declared as follows:

extern __shared__ int s[];

To declare several dynamically sized shared-memory arrays this way, request one large contiguous region in the execution configuration and then partition it manually:

myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);

extern __shared__ int s[];
int *integerData = s;                        // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF];      // nC chars

Reference

  1. cuda-c-programming-guide
  2. CUDA by Example: An Introduction to General-Purpose GPU Programming