Getting unknown error when launching large kernel sizes

Keywords: unknown error, large kernel launch. Updated: 2023-10-16

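I am having a problem launching a simple kernel when my array size is larger than 591 x 591. With an array size of 591x591 no error is returned, but when I launch the kernel with a grid of 38x38 blocks and 16x16 threads per block, the launch fails and returns "unknown error".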

The following code is the kernel I am calling, along with the call to the kernel in my code:

#include <cstdio>   // fprintf
#include <cstdlib>  // malloc, exit
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_device_runtime_api.h>
using namespace std;
#define BLOCKSIZE 16
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
// Print the CUDA error string with file and line, then optionally exit.
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void IdentityMatrixKernel(float* identity, int size)
{
int index_x = blockIdx.x * blockDim.x + threadIdx.x;
int index_y = blockIdx.y * blockDim.y + threadIdx.y;
// map the two 2D indices to a single linear, 1D index
int grid_width = gridDim.x * blockDim.x;
int index = index_y * grid_width + index_x;
// map the two 2D block indices to a single linear, 1D block index
//int result = blockIdx.y * gridDim.x + blockIdx.x;

if (index % (size+1))
{
    identity[index] = 0;
}
else
{
    identity[index] = 1;
}
}

void foo(float *aArray, int size)
{
float* d_I;
int size2 = size*size*sizeof(float);
gpuErrchk(cudaMalloc(&d_I,size2));
dim3 block_size;
block_size.x = BLOCKSIZE;
block_size.y = BLOCKSIZE;
dim3 grid_size;
// NOTE: size1 is not declared in this snippet; presumably it is the matrix dimension (size).
grid_size.x = size1/ block_size.x + 1;
grid_size.y = size1/ block_size.y + 1;
IdentityMatrixKernel<<<grid_size,block_size>>>(d_I,size);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaMemcpy(aArray,d_I,size2,cudaMemcpyDeviceToHost));
cudaFree(d_I);
}
int main()
{
int size = 591;
float *aArray = (float*)malloc(size*size*sizeof(float));
foo(aArray,size);

return 0;
}

For size = 591 no error is shown and the output is an identity matrix of size 591x591, but for any larger size it prints "unknown error" to the console.

One problem seems to be that you are launching a grid of threads that is larger than your actual matrix:

grid_size.x = size1/ block_size.x + 1;
grid_size.y = size1/ block_size.y + 1;

However, you are not checking for any out-of-bounds accesses in your kernel. You need to add a thread check such as:

if ((index_x >= size)||(index_y >= size)) return;

near the beginning of your kernel. But that is not enough. Another problem is that your index calculation is not correct:

int index = index_y * grid_width + index_x;

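On the surface it appears to be correct, but since your thread array is (potentially) larger than your data array, the row stride grid_width does not match the row length of the data, and this can give incorrect indices. Since you are passing size to the kernel, change it to something like this: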

int index = index_y * size + index_x;

and you should be able to eliminate the out-of-bounds accesses.

I'm extending Robert Crovella's answer.

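If you define block_size.{x, y} with a large number (16 in your case), then you won't be able to process smaller arrays, e.g. 4x4. What you can do is define a smaller block size: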

/* create thread blocks */
dim3 block_size;
block_size.x = 4;
block_size.y = 4;
/* create n x n block grids */
dim3 grid_size;
grid_size.x = size1/block_size.x;
grid_size.y = size1/block_size.y;
/* in case of partial sizes make grid_size 1 x 1 */
if (size1 % block_size.x)
    grid_size.x = 1, grid_size.y = 1;