CUDA not returning values when Array > 1769472

Updated: 2023-10-16

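I am trying to calculate the average of 256 sets of data, each 8192 elements long. I have a kernel that works with 216 sets of data (216 × 8192 = 1769472 elements), but with any more than that the kernel returns 0 for every average. I am using a very basic reduction scheme to calculate the averages.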

Graphics card: GTX 780 Ti

Here is my code:

__global__ void Average(double *Input, int Length, int Sets, double *Average, int N) {
    unsigned int Pos = (blockDim.x * blockIdx.x) + threadIdx.x;
    unsigned int Offset;
    int i = Length / N;
    if (Pos < i * Sets) {
        Offset = ((Pos / i) * Length) + (Pos % i); 
        Input[Offset] += Input[Offset + i];
    }
    __syncthreads();
    if (N == Length) {
        Average[Pos] = Input[Pos*Length] / Length;
    }
}
using namespace std;
int main()
{
    const int Length = 8192;
    const int Sets =256;
    const int Width = Length*Sets;
    double *GPU_Average, *GPU_Data;
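    // CameraWidth is not defined in this snippet; size each buffer by its element count.
    cudaMalloc((void**)&GPU_Average, sizeof(double)*Sets);
    cudaMalloc((void**)&GPU_Data, sizeof(double)*Width);
    // Width doubles is 16 MiB, too large for a typical stack, so keep the host buffers static.
    static double CPU_Data[Width];
    static double CPU_Average[Sets];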
    for (int i = 0; i < Width; i++) {
        CPU_Data[i] = i;
    }
    cudaMemcpy(GPU_Data, CPU_Data, sizeof(double)*Width, cudaMemcpyHostToDevice);
    int N = 2;
    int Total, Blocks, Threads;
    while (N < Length+1) {
        Total = (Sets*Length) / N;
        if (Total > 1024) {
            Threads = 1024;
            Blocks = Total / Threads;
        }
        else {
            Threads = Total;
            Blocks = 1;
        }
        Average<<<Blocks, Threads>>>(GPU_Data, Length, Sets, GPU_Average, N);
        N *= 2;
    }
    cudaMemcpy(CPU_Average, (GPU_Average), sizeof(double)*Sets, cudaMemcpyDeviceToHost);
    return 0;
}

Thanks for any help with this.
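For checking what the kernel should return, the reduction described in the question can be mirrored on the host. The following is a minimal sketch in plain C++, not code from the post: the function name `average_sets` is hypothetical, and it assumes `length` is a power of two, as 8192 is. Each outer pass corresponds to one kernel launch with a given `N`, and each inner iteration to one GPU thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the kernel's index arithmetic.
// Assumes length is a power of two and data holds sets * length values.
std::vector<double> average_sets(std::vector<double> data, int length, int sets) {
    for (int n = 2; n <= length; n *= 2) {
        int i = length / n;  // active elements per set in this pass
        for (int pos = 0; pos < i * sets; ++pos) {  // one iteration per GPU thread
            std::size_t offset =
                static_cast<std::size_t>(pos / i) * length + (pos % i);
            data[offset] += data[offset + i];  // fold the upper half onto the lower half
        }
    }
    // After the last pass, the first element of each set holds that set's sum.
    std::vector<double> averages(sets);
    for (int s = 0; s < sets; ++s) {
        averages[s] = data[static_cast<std::size_t>(s) * length] / length;
    }
    return averages;
}
```

Comparing the values copied back from the GPU against this reference makes it easy to tell a wrong result from a launch or allocation that failed outright.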

I had not realized that in my actual code (not the code above) I had written

 cudaMalloc((void**)&GPU_Data, Width*sizeof(double)*Width); 

instead of

 cudaMalloc((void**)&GPU_Data, sizeof(double)*Width); 

This requested far too much memory and caused the error.
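To put numbers on the difference between the two calls, here is a small sketch of the size arithmetic in plain C++ for the 256-set case from the question. The constant names are illustrative, and it assumes an 8-byte `double` and 64-bit arithmetic; in a 32-bit build the product would wrap around instead.

```cpp
#include <cstdint>

// Element counts from the question.
constexpr std::uint64_t kLength = 8192;
constexpr std::uint64_t kSets = 256;
constexpr std::uint64_t kWidth = kLength * kSets;  // 2,097,152 doubles

// Bytes requested by the intended call: sizeof(double) * Width (16 MiB).
constexpr std::uint64_t kIntendedBytes = sizeof(double) * kWidth;

// Bytes requested by the mistaken call: Width * sizeof(double) * Width (32 TiB).
constexpr std::uint64_t kMistakenBytes = kWidth * sizeof(double) * kWidth;
```

A request of that size cannot be satisfied by a single GPU. `cudaMalloc` returns a `cudaError_t`, so comparing its result against `cudaSuccess` would have reported the failed allocation immediately rather than leaving the averages at 0.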