cuda __syncthreads() not working in my code

Updated: 2023-10-16

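Here is the situation.

I have a thread block running a while loop, and I need the loop to continue if and only if at least one of the threads meets some condition. To do that, I use a shared variable as a continue flag. Thread #0 clears the flag at the start of each iteration, followed by a __syncthreads(), and any thread can set the flag during the iteration if the continue condition is met. Another __syncthreads() is then called before the checkpoint of the next iteration to make sure the threads are synchronized. The kernel is basically like this: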

__global__ void foo(void* data) {
    __shared__ int blockContinueFlag;
    do {
        if (!(threadIdx.x || threadIdx.y || threadIdx.z)) { // true only for thread #0
            blockContinueFlag = 0;
        }
        __syncthreads(); //synch1
        //some data manipulations...
        if(some predicate) {
            blockContinueFlag = true;
        }
        //some data manipulations...
        __syncthreads(); //synch2
    } while (blockContinueFlag);
}

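The problem is that the barrier synch2 does not seem to work in my code. Sometimes the kernel terminates even though some threads meet the continue condition (I know this from checking the data returned on the host side). To examine this further, I set a breakpoint right after the do-while loop, as in the code below, and blockContinueFlag is sometimes reported as true there (I can only assume the block exited the loop before some threads could set blockContinueFlag).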

__global__ void foo(void* data) {
    __shared__ int blockContinueFlag;
    do {
        if (!(threadIdx.x || threadIdx.y || threadIdx.z)) { // true only for thread #0
            blockContinueFlag = 0;
        }
        __syncthreads(); //synch1
        //some data manipulations...
        if(some predicate) {
            blockContinueFlag = true;
        }
        //some data manipulations...
        __syncthreads(); //synch2
    } while (blockContinueFlag);
    //a break point is set here
}

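I remember reading in the CUDA manual that __syncthreads() is allowed in conditional code if the predicate evaluates identically for all threads in the block, which should be the case here.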

I have another, simplified version of the code just to illustrate this.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo(int* data, int kernelSize, int threshold) {
    __shared__ int blockContinueFlag;
    do {
        // Thread 0 clears the flag at the start of each iteration.
        if (threadIdx.x == 0) {
            blockContinueFlag = 0;
        }
        __syncthreads();
        if (threadIdx.x < kernelSize)  {
            data[threadIdx.x]--;
            // Empty loop, intended as a per-thread delay.
            for (int i = 0; i < threadIdx.x; i++);
            if (data[threadIdx.x] > threshold)
                blockContinueFlag = true;
        }
        __syncthreads();
    } while (blockContinueFlag);
}
int main()
{
    int hostData[1024], *deviceData;
    for (int i = 0; i < 1024; i++)
        hostData[i] = i;
    cudaMalloc(&deviceData, 1024 * sizeof(int));
    cudaMemcpy(deviceData, hostData, 1024 * sizeof(int), cudaMemcpyHostToDevice);
    // One block of 1024 threads; only the first 512 touch the data.
    foo<<<1, 1024>>>(deviceData, 512, 0);
    cudaDeviceSynchronize();
    cudaMemcpy(hostData, deviceData, 1024 * sizeof(int), cudaMemcpyDeviceToHost);
    // Pass the error string as an argument, not as the format string.
    fprintf(stderr, "%s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(deviceData);
    return 0;
}

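The expected value of hostData[] at the end of main() is {-511, -510, -509, ..., 0, 512, 513, 514, ..., 1023}, and sometimes that is what I get. In some runs, however, it produces the following values in VS 2013 debug mode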

[0]: -95
[1]: -94
...
[29]: -66
[30]: -65
[31]: -64
[32]: 31
[33]: 32
[34]: 33
...
[61]: 60
[62]: 61
[63]: 62
[64]: -31
[65]: -30
[66]: -29
...
[92]: -3
[93]: -2
[94]: -1
[95]: 0
[96]: 95
[97]: 96
[98]: 97
...

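which suggests that the warps are not actually synchronized.

So does anyone know the reason for this, and/or whether there is a way to make the thread barrier work correctly?

Any help would be appreciated.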

So here is my solution: a single __syncthreads_or() in place of the three __syncthreads() calls that would otherwise be required.

__global__ void foo(void* data) {
    int blockContinueFlag;
    do {
        blockContinueFlag = 0;
        //some data manipulations...
        if(some predicate) {
            blockContinueFlag = true;
        }
        //some data manipulations...
    } while (__syncthreads_or(blockContinueFlag));
}

In practice this is slightly faster than three __syncthreads() calls.

Thanks again for your posts.

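In the first example, you check the condition and clear the flag in the same section of code, between two __syncthreads() calls. This is a write-after-read hazard. To better illustrate the problem, let me rewrite your example like this: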

__global__ void foo(void* data) {
  __shared__ int blockContinueFlag;
  blockContinueFlag = true;
  while (true) {
    if (!blockContinueFlag)
        break;
    if (!(threadIdx.x || threadIdx.y || threadIdx.z)) { // true only for thread #0
        blockContinueFlag = 0;
    }
    __syncthreads(); //synch1
    //some data manipulations...
    if(some predicate) {
      blockContinueFlag = true;
    }
    //some data manipulations...
    __syncthreads(); //synch2
  }
}

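In this example the flag check and the loop exit are more verbose, but it is essentially the same code (plus a redundant check at the very beginning).

In this example, as in your code, thread 0 can check the loop condition and clear the flag before thread 33 (in another warp) has performed its check. Thread 33 then sees a cleared flag and leaves the loop while thread 0 keeps going. That causes divergence, and all hell breaks loose.

To fix it, you need to add another __syncthreads() before clearing the flag.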
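As a minimal sketch of that fix, here is the simplified kernel from the question with one extra barrier placed before the flag is cleared. The name fooFixed and the labels synch0/synch1/synch2 are only illustrative, and the host code assumes the same launch parameters as the question (one block of 1024 threads, kernelSize 512, threshold 0).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical name: the question's simplified kernel plus one extra barrier.
__global__ void fooFixed(int* data, int kernelSize, int threshold) {
    __shared__ int blockContinueFlag;
    do {
        // synch0: wait until every thread has read the flag in the
        // while() check of the previous iteration before it is cleared.
        __syncthreads();
        if (threadIdx.x == 0) {
            blockContinueFlag = 0;
        }
        __syncthreads(); // synch1: the cleared flag is visible to all threads
        if (threadIdx.x < kernelSize) {
            data[threadIdx.x]--;
            if (data[threadIdx.x] > threshold)
                blockContinueFlag = 1;
        }
        __syncthreads(); // synch2: all writes to the flag are complete
    } while (blockContinueFlag);
}

int main() {
    const int n = 1024;
    int hostData[n], *deviceData;
    for (int i = 0; i < n; i++)
        hostData[i] = i;
    cudaMalloc(&deviceData, n * sizeof(int));
    cudaMemcpy(deviceData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);
    fooFixed<<<1, n>>>(deviceData, 512, 0);
    cudaDeviceSynchronize();
    cudaMemcpy(hostData, deviceData, n * sizeof(int), cudaMemcpyDeviceToHost);
    fprintf(stderr, "%s\n", cudaGetErrorString(cudaGetLastError()));

    // Compare against the result the question expects:
    // i - 511 for the first 512 elements, untouched values for the rest.
    int mismatches = 0;
    for (int i = 0; i < n; i++) {
        int expected = (i < 512) ? i - 511 : i;
        if (hostData[i] != expected)
            mismatches++;
    }
    printf("mismatches: %d\n", mismatches);
    cudaFree(deviceData);
    return mismatches != 0;
}
```

With synch0 in place, no thread can clear the flag until every thread in the block has evaluated the loop condition, so all threads take the same branch. The __syncthreads_or() version shown earlier gets the same guarantee from a single barrier, because the flag is thread-local and the reduction result is returned to every thread.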