Can't understand the behaviour of CUDA kernel launch

#include "utils.h"
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
  // Every thread walks over the entire image, pixel by pixel.
  for (size_t r = 0; r < numRows; ++r) {
    for (size_t c = 0; c < numCols; ++c) {
      uchar4 rgba = rgbaImage[r * numCols + c];
      float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
      greyImage[r * numCols + c] = channelSum;
    }
  }
}
void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
  const dim3 blockSize(1, 1, 1);  //TODO
  const dim3 gridSize( 1, 1, 1);  //TODO
  rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}

This is code for converting a colour image to greyscale. I'm working through this assignment for a course, and after completing it I got the following results.

A.
blockSize = (1, 1, 1)
gridSize = (1, 1, 1)
Your code ran in: 34.772705 msecs.
B.
blockSize = (numCols, 1, 1)
gridSize = (numRows, 1, 1)
Your code ran in: 1821.326416 msecs.
C.
blockSize = (numRows, 1, 1)
gridSize = (numCols, 1, 1)
Your code ran in: 1695.917480 msecs.
D.
blockSize = (1024, 1, 1)
gridSize = (170, 1, 1) [the image size is r=313, c=557, so blockSize*gridSize ~= r*c]
Your code ran in: 1709.109863 msecs.

I tried more combinations, but none of them performed better than A. The differences were small as I increased the block and grid sizes; for example:

blockSize = (10, 1, 1)
gridSize = (10, 1, 1)
Your code ran in: 34.835167 msecs.

I don't understand why larger numbers don't get better performance but instead make it worse. Also, it seems that increasing the block size works better than increasing the grid size.

You compute all of the pixels in every thread you launch, i.e. the kernel is completely serial. Using more blocks or bigger blocks just repeats that same computation, which is why configuration A is the fastest. Why not move the for loops out of the kernel and have each thread compute a single pixel instead?
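
A minimal sketch of that approach (the 16x16 block shape and the name rgba_to_greyscale_parallel are illustrative choices, not part of the original assignment): each thread derives its own (row, column) from the block and thread indices, converts exactly one pixel, and the grid is rounded up so the blocks cover the whole image.

__global__
void rgba_to_greyscale_parallel(const uchar4* const rgbaImage,
                                unsigned char* const greyImage,
                                int numRows, int numCols)
{
  // One thread per pixel: x indexes columns, y indexes rows.
  const int c = blockIdx.x * blockDim.x + threadIdx.x;
  const int r = blockIdx.y * blockDim.y + threadIdx.y;
  if (r >= numRows || c >= numCols)
    return;  // threads past the image edge do nothing
  const uchar4 rgba = rgbaImage[r * numCols + c];
  const float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
  greyImage[r * numCols + c] = static_cast<unsigned char>(channelSum);
}

// Round the grid size up so every pixel is covered:
const dim3 blockSize(16, 16, 1);  // 256 threads per block
const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                    (numRows + blockSize.y - 1) / blockSize.y, 1);
rgba_to_greyscale_parallel<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

With the loops gone, adding threads adds parallel work instead of duplicating the whole conversion, so larger launch configurations should now scale the way you expected.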