CUDA, could using shared memory improve my performance?

Keywords: CUDA, shared memory, performance. Last updated: 2023-10-16

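I am implementing an algorithm that converts an image to grayscale using CUDA. It works now, but I am looking for ways to improve performance. Currently the whole color image is transferred to device memory, and each thread then computes one gray pixel value by looking up the corresponding three (r, g, b) color values.

I have already made sure that accesses to global memory are coalesced, although that did not really improve my performance (with coalesced accesses, a 36 MB image takes only 0.003 s less ...). Now I am wondering whether using shared memory could improve my performance. This is what I have right now:

My CUDA kernel: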

__global__ void darkenImage(const unsigned char * inputImage,
    unsigned char * outputImage, const int width, const int height, int iteration){
  int x = ((blockIdx.x * blockDim.x) + (threadIdx.x + (iteration * MAX_BLOCKS * nrThreads))) * 3;
  if(x+2 < (3 * width*height)){
    float grayPix = 0.0f;
    float r = static_cast< float >(inputImage[x]);
    float g = static_cast< float >(inputImage[x+1]);
    float b = static_cast< float >(inputImage[x+2]);
    grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r),__fmul_rn(0.59f, g)), __fmul_rn(0.11f, b));
    grayPix = fma(grayPix,0.6f,0.5f);

    outputImage[(x/3)] = static_cast< unsigned char >(grayPix);
  }
}

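My real question is: since no memory is shared between any two threads, using shared memory should not really help here, should it? Or am I misunderstanding something?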

Asked by:

linus

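If you do not use the same values more than once, using shared memory (as a cache) will not improve performance. However, you can try removing the iteration parameter and processing more data per block. Try a single kernel launch with a loop inside the kernel, so that each thread computes more than one output element.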
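The suggestion above (drop the iteration parameter, launch once, loop inside the kernel) could be sketched roughly as follows. This is only an illustration under assumptions, not the asker's actual code: the names darkenImageLoop and darkenOnDevice are hypothetical, the launch configuration (64 blocks of 256 threads) is an arbitrary choice that would need tuning, and the pixel count is assumed to fit in an int. The per-pixel arithmetic is copied from the kernel in the question.

```cuda
#include <cuda_runtime.h>

// Hypothetical variant of darkenImage without the iteration parameter.
// Each thread walks over the image in a grid-stride loop, so a single
// launch with a fixed-size grid covers every pixel.
__global__ void darkenImageLoop(const unsigned char *inputImage,
                                unsigned char *outputImage,
                                const int width, const int height) {
  const int numPixels = width * height;       // assumed to fit in an int
  const int stride = gridDim.x * blockDim.x;  // total threads in the grid
  for (int pix = blockIdx.x * blockDim.x + threadIdx.x; pix < numPixels;
       pix += stride) {
    const int x = pix * 3;  // offset of this pixel's first (r) byte
    const float r = static_cast<float>(inputImage[x]);
    const float g = static_cast<float>(inputImage[x + 1]);
    const float b = static_cast<float>(inputImage[x + 2]);
    // Same weighted sum and darkening step as in the question.
    float grayPix = __fadd_rn(__fadd_rn(__fmul_rn(0.3f, r), __fmul_rn(0.59f, g)),
                              __fmul_rn(0.11f, b));
    grayPix = fmaf(grayPix, 0.6f, 0.5f);
    outputImage[pix] = static_cast<unsigned char>(grayPix);
  }
}

// Hypothetical host helper: copy the image in, launch once, copy the result out.
cudaError_t darkenOnDevice(const unsigned char *hostIn, unsigned char *hostOut,
                           int width, int height) {
  const size_t numPixels = static_cast<size_t>(width) * height;
  unsigned char *devIn = nullptr;
  unsigned char *devOut = nullptr;

  cudaError_t err = cudaMalloc(&devIn, numPixels * 3);
  if (err != cudaSuccess) return err;
  err = cudaMalloc(&devOut, numPixels);
  if (err != cudaSuccess) {
    cudaFree(devIn);
    return err;
  }

  err = cudaMemcpy(devIn, hostIn, numPixels * 3, cudaMemcpyHostToDevice);
  if (err == cudaSuccess) {
    const int threadsPerBlock = 256;  // assumed value, tune for your GPU
    const int numBlocks = 64;         // assumed value; the loop covers the rest
    darkenImageLoop<<<numBlocks, threadsPerBlock>>>(devIn, devOut, width, height);
    err = cudaGetLastError();
  }
  if (err == cudaSuccess) {
    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(hostOut, devOut, numPixels, cudaMemcpyDeviceToHost);
  }

  cudaFree(devIn);
  cudaFree(devOut);
  return err;
}
```

Calling darkenOnDevice(in, out, width, height) with a packed RGB buffer should fill out with one byte per pixel. Whether this is faster than several launches depends on the GPU and the image size, so it is worth timing both versions.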

No, you are right: shared memory won't help here, because you do not access any piece of data more than once.