带有大结构变量的 CUDA 内核函数给出了错误的结果

a cuda kernel function with a big struct variable gives wrong result

本文关键字：结果错误函数 CUDA 结构变量内核更新时间：2023-10-16

我有一个 cuda 内核函数来交换结构数组中的元素，但是当结构元素足够大时，例如占用超过 120 个字节，交换是错误的。下面是一个简单的示例：

#include <iostream>
#include <cuda_runtime.h> 
using namespace std;
const int f_num = 30;
const int d_num = 15;
struct S
{
int constID;
float f[f_num];
//double d[d_num];
};
__global__ void
cudaSwap(S *s, int n)
{
int tid = threadIdx.x;
S temp;
if(tid < n)
{
temp = s[tid];
s[tid] = s[tid + n];
s[tid +n] = temp;
}
}
int main()
{
cout << "sizeof float is " << sizeof(float) << endl;
cout << "sizeof double is " << sizeof(double) << endl;
S *h_s = new S[20];
for(int i = 0; i < 20; ++i)
{
h_s[i].constID = i;
for(int j = 0; j < f_num; ++j)
{
h_s[i].f[j] = (float)i + (float)j/100;
//h_s[i].d[j] = (double)i + (double)j/100;
}
}
cout << "original h_s:" << endl;
for(int i = 0; i < 20; ++i)
{
cout << h_s[i].constID << endl;
}
cout << endl;
S *d_s;
cudaMalloc((void**)&d_s, sizeof(S) * 20);
cudaMemset(d_s, 0, sizeof(S) * 20);
cudaMemcpy(d_s, h_s, sizeof(S) * 20, cudaMemcpyHostToDevice);
cudaSwap<<<1,20>>>(d_s, 5);
cudaMemcpy(h_s, d_s, sizeof(S) * 20, cudaMemcpyDeviceToHost);
cout << "swaped h_s:" << endl;
for(int i = 0; i < 20; ++i)
{
cout << h_s[i].constID << endl;
}
cout << endl;
delete [] h_s;
cudaFree(d_s);
return 0;
}

当结构体由少于 30 个浮点元素或 15 个双精度元素组成时，结果为 5 6 7 8 9 0 1 2 3 4 10 11 ...，但是当结构元素较大时，结果为 5 6 7 8 9 5 6 7 8 9 10 11 ...，这意味着s[tid +n] = temp;不起作用。我是 cuda 的新手，谁能告诉我问题的原因以及如何解决问题？也许这与注册有关？我不确定。。。多谢！

这似乎是 CUDA 7.5 和 CUDA 8 中的一个编译器错误(根据我的测试，它不受 PTX 优化级别的影响，所以我相信该错误进入从 CUDA 源生成 PTX，而不是在 PTX 编译到 SASS 中(。

使用调试开关(-G(进行编译似乎会使问题消失，但这对性能有负面影响。

根据我的测试，它似乎在 CUDA 9 EA 中修复，因此我希望它在公开可用时在 CUDA 9 中修复。

一种可能的简单解决方法是修改内核代码，如下所示：

__global__ void
cudaSwap(S *s, int n)
{
int tid = threadIdx.x;
S temp;
if(tid < n)
{
temp = s[tid];
s[tid] = s[tid + n];
memcpy(s+tid+n, &temp, sizeof(S));  // this line is changed
}
}

根据我的测试，这似乎解决了 CUDA 8 中此处提出的案例的问题。