如何在不使用原子的情况下同步 CUDA 中的线程

How to Synchronize threads in CUDA without using atomic

本文关键字：CUDA 同步线程情况下原子的更新时间：2023-10-16

我正在从在线UDACITY课程中学习CUDA编程。第二课给出了一个示例代码，它基本上有两个内核，第一个内核__global__ void increment_naive(int *g)简单地将 1 添加到驻留在全局内存中的数组*g元素。

根据 UDACITY 的整个代码如下：

#include <stdio.h>
#include "gputimer.h"
#define NUM_THREADS 1000000
#define ARRAY_SIZE  100
#define BLOCK_WIDTH 1000
void print_array(int *array, int size)
{
    printf("{ ");
    for (int i = 0; i < size; i++)  { printf("%d ", array[i]); }
    printf("}n");
}
__global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;  
    g[i] = g[i] + 1;
}
__global__ void increment_atomic(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    i = i % ARRAY_SIZE;  
    atomicAdd(& g[i], 1);
}
int main(int argc,char **argv)
{   
    GpuTimer timer;
    printf("%d total threads in %d blocks writing into %d array elementsn",
           NUM_THREADS, NUM_THREADS / BLOCK_WIDTH, ARRAY_SIZE);
    // declare and allocate host memory
    int h_array[ARRAY_SIZE];
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);
    // declare, allocate, and zero out GPU memory
    int * d_array;
    cudaMalloc((void **) &d_array, ARRAY_BYTES);
    cudaMemset((void *) d_array, 0, ARRAY_BYTES); 
    // launch the kernel - comment out one of these
    timer.Start();

    increment_naive<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    //increment_atomic<<<NUM_THREADS/BLOCK_WIDTH, BLOCK_WIDTH>>>(d_array);
    timer.Stop();
    // copy back the array of sums from GPU and print
    cudaMemcpy(h_array, d_array, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    print_array(h_array, ARRAY_SIZE);
    printf("Time elapsed = %g msn", timer.Elapsed());
    // free GPU memory allocation and exit
    cudaFree(d_array);
    return 0;
}

根据该程序，具有 1000 个块的一百万个线程正在写入 10 个数组元素。因此，每个数组元素的结果为 100000。

第一个内核无法生成所需的输出，因为线程不同步访问，从而产生不良结果。这可以使用障碍（如__syncthreads）或使用原子操作来解决。

第二个内核工作正常并产生正确的输出，如下所示：

1000000 total threads in 1000 blocks writing into 100 array elements
{ 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 }
Time elapsed = 0.367648 ms

如前所述，第一个内核每次都会产生错误的输出。

1000000 total threads in 1000 blocks writing into 100 array elements
{ 75 75 75 75 78 78 78 78 73 73 73 73 82 82 82 82 85 85 85 85 92 92 92 92 104 104 104 104 107 107 107 107 89 89 89 89 88 88 88 88 95 95 95 95 103 103 103 103 106 106 106 106 107 107 107 107 105 105 105 105 113 113 113 113 96 96 96 96 95 95 95 95 95 95 95 95 100 100 100 100 98 98 98 98 104 104 104 104 110 110 110 110 126 126 126 126 90 90 90 90 }
Time elapsed = 0.23392 ms

我试图通过在计算的不同阶段放置障碍来修复第一个内核，但未能获得必要的输出。我修复第一个内核的尝试如下：

    __global__ void increment_naive(int *g)
{
    // which thread is this?
    int i = blockIdx.x * blockDim.x + threadIdx.x; 
    __syncthreads();
    // each thread to increment consecutive elements, wrapping at ARRAY_SIZE
    //i = i % ARRAY_SIZE;
    int temp = i%ARRAY_SIZE;
    __syncthreads();
    i = temp;
    __syncthreads();
    //g[i] = g[i] + 1;
    int temp1 = g[i]+1;
    __syncthreads();
    g[i] = temp1;
     __syncthreads();
}

我希望有人指导我解决这个问题，因为这个问题困扰着我，阻碍了我进一步发展的信心。

__syncthreads() 函数确保块中的所有线程都位于代码中的同一位置。使用这些不会实现您想要的。更糟糕的是 - 假设 CUDA 是一台完美的并行机器，所有线程同步工作。你永远不需要任何__syncthreads.不过，你会有不同的结果。考虑以下伪代码和对正在发生的事情的解释：

__perfect_parallel_machine__ void increment_naive(int *g)
{
    int idx = thisThreadIdx % ARRAY_SIZE;
    int local = g[idx];
                               //*all* threads load the initial value of g[idx]
                               //each thread holds a separate copy of 'local' variable
                               //local=0 in each thread
    local = local + 1;
                               //each thread increment its own private copy of 'local'
                               //local=1 for all threads
    g[idx] = local;
                               //each thread stores the same value (1) into global array
                               //g = {1, 1, 1, 1, 1, ...., 1}
}

由于 CUDA 不是一个完美的并行机器，事情会发生顺序错误，你最终会在数组中获得更高的值。设置更多的同步障碍将使您更接近理想的{1, 1, ... , 1}结果。

还有其他屏障功能，例如 __threadfence() .这会停止当前线程（仅当前线程！），直到保证存储到全局数组对其他线程可见。这与 L1/L2 缓存有关，与线程同步无关。例如，通常将__threadfence与原子结合使用来标记您已完成某些数据的填写。

我想你和你的导师之间一定有什么误会。我建议和他谈谈以澄清一下...