CUDA kernel causing "display driver not responding" with the addition of 4 lines

Last updated: 2023-10-16

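The basic problem is as follows:

When I run the kernel below with N threads and do not include the 4 lines that instantiate and populate the ScaledLLA variable, it works fine.

When I run the kernel below with N threads and do include the 4 lines that instantiate and populate the ScaledLLA variable, the GPU locks up and Windows throws a "display driver not responding" error.

If I reduce the number of threads by decreasing the grid size, everything works fine.

I'm new to CUDA and have been building up some GIS functionality step by step.

My host code around the kernel call looks like this: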

    MapperKernel<<<g_CUDAControl->aGetGridSize(), g_CUDAControl->aGetBlockSize()>>>(
        g_Deltas.lat, g_Deltas.lon, 32.2,
        g_DataReader->aGetMapper().aGetRPCBoundingBox()[0], g_DataReader->aGetMapper().aGetRPCBoundingBox()[1],
        g_CUDAControl->aGetBlockSize().x,
        g_CUDAControl->aGetThreadPitch(),
        LLA_Offset,
        LLA_ScaleFactor,
        RPC_XN, RPC_XD, RPC_YN, RPC_YD,
        Pixel_Offset, Pixel_ScaleFactor,
        device_array);
    cudaDeviceSynchronize(); // code crashes here
    host_array = (point3D*)malloc(num_bytes);
    cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);
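
Neither the launch nor cudaDeviceSynchronize() has its return value checked in the snippet above, so the error the runtime reports is lost. The following is a minimal sketch of how those checks could look, not the code from the post: cudaOk, FillKernel and launchChecked are hypothetical names, and the stand-in kernel only exists to make the example self-contained.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): prints a readable
// message when a CUDA runtime call fails and returns true on success.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Stand-in kernel for illustration only; it is not MapperKernel.
__global__ void FillKernel(int *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        out[i] = i;
    }
}

// Launches the stand-in kernel and checks both failure points.
bool launchChecked(int *device_array, int n)
{
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    FillKernel<<<blocks, threads>>>(device_array, n);

    // Errors detected at launch time, such as an invalid configuration.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) {
        return false;
    }
    // Errors raised while the kernel runs surface at the next
    // synchronizing call, which is why the crash shows up here.
    return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

With checks like these, a failing run prints the error string from the runtime instead of failing silently at the synchronize call.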

The kernel being called looks like this:

__global__ void MapperKernel(double deltaLat, double deltaLon, double passedAlt,
    double minLat, double minLon,
    int threadsperblock,
    int threadPitch,
    point3D LLA_Offset,
    point3D LLA_ScaleFactor,
    double * RPC_XN, double * RPC_XD, double * RPC_YN, double * RPC_YD,
    point2D pixelOffset, point2D pixelScaleFactor,
    point3D * rValue)
{
    //calculate thread's LLA
    int latindex = threadIdx.x + blockIdx.x*threadsperblock;
    int lonindex = threadIdx.y + blockIdx.y*threadsperblock;
    point3D LLA;
    LLA.lat = ((double)(latindex))*deltaLat + minLat;
    LLA.lon = ((double)(lonindex))*deltaLon + minLon;
    LLA.alt = passedAlt;
    //scale threads LLA - adding these four lines is what causes the problem
    point3D ScaledLLA;
    ScaledLLA.lat = (LLA.lat - LLA_Offset.lat) * LLA_ScaleFactor.lat;
    ScaledLLA.lon = (LLA.lon - LLA_Offset.lon) * LLA_ScaleFactor.lon;
    ScaledLLA.alt = (LLA.alt - LLA_Offset.alt) * LLA_ScaleFactor.alt;
    rValue[lonindex*threadPitch + latindex] = ScaledLLA; //if I assign LLA without calculating ScaledLLA everything works fine
}

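If I assign LLA to rValue, everything executes quickly and I get the expected behavior. However, when I add the four lines for ScaledLLA and try to assign that to rValue, CUDA takes too long in the cudaDeviceSynchronize() call, I get the "display driver not responding" error, and the GPU is then reset. From looking around, the error seems to be what happens when Windows thinks the GPU has stopped responding. I'm sure the kernel is running and performing the correct calculations, because I have stepped through it with the Nsight debugger.

Does anyone have a good explanation for why adding these four lines to the kernel would cause the execution time to skyrocket?

I'm running Win7 with VS 2013 and have Nsight 4.5 installed.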

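For those who arrive here later via a search engine: it turned out the problem was that the card was running out of memory.

This should probably have been one of the first things to consider, since the problem only appeared after adding the instantiation.

The card only has so much memory (~2 GB), and my rValue buffer was taking up most of it (~1.5 GB). With every thread trying to instantiate its own point3D variable, the card ran out of memory.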
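
One way to catch this kind of situation early is to ask the runtime how much device memory is free before allocating a large buffer. The sketch below uses cudaMemGetInfo and cudaMalloc from the CUDA runtime API. The helper names fitsWithReserve and allocateIfItFits, and the idea of keeping a reserve for everything else the kernel needs, are assumptions made for illustration and are not from the original post.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if `requested` bytes fit into `freeBytes`
// while leaving `reserveBytes` untouched for other device-side needs.
static bool fitsWithReserve(size_t requested, size_t freeBytes, size_t reserveBytes)
{
    return freeBytes >= reserveBytes && requested <= freeBytes - reserveBytes;
}

// Hypothetical helper: queries the device, then allocates only if the
// request fits. Returns nullptr when it does not fit or a call fails.
void *allocateIfItFits(size_t num_bytes, size_t reserveBytes)
{
    size_t freeBytes = 0;
    size_t totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }

    const unsigned long long MiB = 1024ULL * 1024ULL;
    std::printf("free: %llu MiB, total: %llu MiB, requested: %llu MiB\n",
                (unsigned long long)freeBytes / MiB,
                (unsigned long long)totalBytes / MiB,
                (unsigned long long)num_bytes / MiB);

    if (!fitsWithReserve(num_bytes, freeBytes, reserveBytes)) {
        std::fprintf(stderr, "request does not fit with the chosen reserve\n");
        return nullptr;
    }

    void *ptr = nullptr;
    err = cudaMalloc(&ptr, num_bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    return ptr;
}
```

A call such as allocateIfItFits(num_bytes, 256 * 1024 * 1024) would refuse a ~1.5 GB buffer on a ~2 GB card whenever less than the chosen reserve would remain free. The size of the reserve is a guess that would need tuning for a real application.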

For those interested, Nsight's profiler reported this as cudaErrorUnknown.

The fix was to reduce the number of threads running the kernel.
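
One way to reduce the number of threads per launch without shrinking the output is to process the grid in bands of rows, one kernel launch per band. The sketch below illustrates that idea and is not the code from the post: the point3D layout (three doubles), the MapParams struct, the names ScaleBandKernel, launchInBands and ceilDiv, the 16x16 block size, and the bounds check are all assumptions. Shorter launches also give each kernel a better chance of finishing before Windows decides the display driver has stopped responding.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Assumed layout: the original post does not show point3D.
struct point3D { double lat, lon, alt; };

// Hypothetical bundle of the scalar inputs used by the scaling step.
struct MapParams {
    double deltaLat, deltaLon, alt, minLat, minLon;
    point3D offset, scale;
};

// Integer ceiling division for positive operands.
static inline int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// Computes scaled coordinates for rows [rowOffset, rowOffset + bandRows).
__global__ void ScaleBandKernel(int width, int height, int rowOffset, int bandRows,
                                MapParams p, point3D *out)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int rowInBand = threadIdx.y + blockIdx.y * blockDim.y;
    int row = rowOffset + rowInBand;
    // Threads outside the band or the grid do nothing.
    if (col >= width || rowInBand >= bandRows || row >= height) return;

    point3D scaled;
    scaled.lat = (col * p.deltaLat + p.minLat - p.offset.lat) * p.scale.lat;
    scaled.lon = (row * p.deltaLon + p.minLon - p.offset.lon) * p.scale.lon;
    scaled.alt = (p.alt - p.offset.alt) * p.scale.alt;
    out[(size_t)row * width + col] = scaled;
}

// Launches one kernel per band and stops at the first reported error.
cudaError_t launchInBands(int width, int height, int bandRows,
                          MapParams p, point3D *d_out)
{
    if (width <= 0 || height <= 0 || bandRows <= 0) return cudaErrorInvalidValue;

    const dim3 block(16, 16);
    for (int rowOffset = 0; rowOffset < height; rowOffset += bandRows) {
        int rows = (height - rowOffset < bandRows) ? (height - rowOffset) : bandRows;
        dim3 grid(ceilDiv(width, (int)block.x), ceilDiv(rows, (int)block.y));

        ScaleBandKernel<<<grid, block>>>(width, height, rowOffset, rows, p, d_out);

        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) return err;
        // Synchronizing per band keeps each launch short and pins any
        // execution error to the band that caused it.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

Note that this only limits how many threads are in flight per launch. The output buffer is still allocated at full size, so it does not by itself lower the memory taken by rValue.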