CUDA: 2D array indexing giving unexpected results

Keywords: unexpected results, indexing, 2D array, CUDA. Updated: 2023-10-16

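I'm just starting to learn CUDA, and I wanted to write a simple program that copies some data to the GPU, modifies it, and transfers it back. I've already googled around trying to find my mistake. I'm fairly sure the problem is in my kernel, but I'm not sure exactly what is wrong.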

Kernel:

__global__ void doStuff(float* data, float* result)
{
    if (threadIdx.x < 9) // take the first 9 threads
    {
        int index = threadIdx.x;
        result[index] = (float) index;
    }
}

Here is the relevant part of my main:

#include <stdlib.h>
#include <stdio.h>
int main(void)
{
    /*
        Setup
    */
    float simple[] = {-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0};
    float* data_array;
    float* result_array;
    size_t data_array_pitch, result_array_pitch;
    int width_in_bytes = 3 * sizeof(float);
    int height = 3;
    /*
        Initialize GPU arrays
    */
    cudaMallocPitch(&data_array, &data_array_pitch, width_in_bytes, height);
    cudaMallocPitch(&result_array, &result_array_pitch, width_in_bytes, height);
    /*
        Copy data to GPU
    */
    cudaMemcpy2D(data_array, data_array_pitch, simple, width_in_bytes, width_in_bytes, height, cudaMemcpyHostToDevice);
    dim3 threads_per_block(16, 16);
    dim3 num_blocks(1,1);
    /*
        Do stuff
    */
    doStuff<<<num_blocks, threads_per_blocks>>>(data_array, result_array);
    /*
        Get the results
    */
    cudaMemcpy2D(simple, width_in_bytes, result_array, result_array_pitch, width_in_bytes, height, cudaMemcpyDeviceToHost);
    for (int i = 1; i <= 9; ++i)
    {
        printf("%f ", simple[i-1]);
        if(!(i%3))
            printf("\n");
    }
    return 0;
}

When I run this, I get 0.000000 1.000000 2.000000 for the first row and garbage for the other two.

If you're just starting to learn CUDA, I'm not sure I would focus on 2D arrays.

I'm also curious whether you typed your code into the question by hand, because you define a threads_per_block variable but then use threads_per_blocks in the kernel launch.

Anyway, there are a couple of problems with your code:

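1. When using 2D (pitched) arrays, it is almost always necessary to pass the pitch parameter (in some fashion) to the kernel. cudaMallocPitch allocates arrays with extra padding at the end of each row, so that the next row starts on a well-aligned boundary. This usually results in an allocation granularity of 128 or 256 bytes. So your first row has 3 valid data elements, followed by enough empty space to fill up to, say, 256 bytes (equal to whatever your pitch variable is). Your kernel writes result[0] through result[8] contiguously, so elements 3 to 8 land in the padding of the first row and the second and third rows are never written. We therefore have to change the kernel launch and the kernel itself to account for the pitch.
2. Your kernel is inherently a 1D kernel (it does not comprehend or use threadIdx.y, for example). Therefore there is no point in launching a 2D thread block. It doesn't hurt anything in this case, but it creates redundancy, which can be confusing and troublesome in other code.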
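To make the first point concrete, here is a minimal host-only C sketch (no GPU involved) of where a flat index lands inside a pitched allocation. The helper names are hypothetical, and the 256-byte pitch used in the note below is only an assumed value; the real pitch is whatever cudaMallocPitch reports.

```c
#include <stddef.h>

/* Hypothetical helper: for a flat element index into a pitched buffer,
   return the pitched row it falls into and store the column (in elements)
   within that row in *col. pitch_bytes is the row stride in bytes. */
size_t pitched_row_of_flat_index(size_t index, size_t pitch_bytes,
                                 size_t elem_size, size_t *col)
{
    size_t elems_per_pitched_row = pitch_bytes / elem_size;
    *col = index % elems_per_pitched_row;
    return index / elems_per_pitched_row;
}

/* Returns 1 if the flat index lands on one of the `width` valid elements
   of its row, 0 if it lands in the padding after them. */
int flat_index_hits_valid_data(size_t index, size_t width,
                               size_t pitch_bytes, size_t elem_size)
{
    size_t col;
    (void)pitched_row_of_flat_index(index, pitch_bytes, elem_size, &col);
    return col < width;
}
```

Under the assumed 256-byte pitch with 4-byte floats, flat indices 0 to 2 hit valid data, while indices 3 to 8 all fall in row 0 at columns 3 to 8, which is padding. That matches the symptom in the question: one good row and two rows that were never written.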

Here is updated code showing some changes, based on the comments above, that will give you the expected results:

#include <stdio.h>

__global__ void doStuff(float* data, float* result, size_t dpitch, size_t rpitch, int width)
{
    if (threadIdx.x < 9) // take the first 9 threads
    {
        int index = threadIdx.x;
        result[((index/width)*(rpitch/sizeof(float)))+ (index%width)] = (float) index;
    }
}
int main(void)
{
    /*
        Setup
    */
    float simple[] = {-1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0, -9.0};
    float* data_array;
    float* result_array;
    size_t data_array_pitch, result_array_pitch;
    int height = 3;
    int width = 3;
    int width_in_bytes = width * sizeof(float);
    /*
        Initialize GPU arrays
    */
    cudaMallocPitch(&data_array, &data_array_pitch, width_in_bytes, height);
    cudaMallocPitch(&result_array, &result_array_pitch, width_in_bytes, height);
    /*
        Copy data to GPU
    */
    cudaMemcpy2D(data_array, data_array_pitch, simple, width_in_bytes, width_in_bytes, height, cudaMemcpyHostToDevice);
    dim3 threads_per_block(16);
    dim3 num_blocks(1,1);
    /*
        Do stuff
    */
    doStuff<<<num_blocks, threads_per_block>>>(data_array, result_array, data_array_pitch, result_array_pitch, width);
    /*
        Get the results
    */
    cudaMemcpy2D(simple, width_in_bytes, result_array, result_array_pitch, width_in_bytes, height, cudaMemcpyDeviceToHost);
    for (int i = 1; i <= 9; ++i)
    {
        printf("%f ", simple[i-1]);
        if(!(i%3))
            printf("\n");
    }
    return 0;
}

You may also find this question interesting reading.

EDIT: responding to a question in the comments:

result[((index/width)*(rpitch/sizeof(float)))+ (index%width)] = (float) index;
              1               2                      3

To compute the correct element index into the pitched array, we must:

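1. Compute the (virtual) row index from the thread index. We do this by integer division of the thread index by the width of each (unpitched) row, measured in elements, not bytes.
2. Multiply the row index by the width of each pitched row. The width of each pitched row is given by the pitch parameter, which is in bytes. To convert this pitch in bytes to a pitch in elements, we divide by the size of each element. Multiplying that quantity by the row index computed in step 1 then gives the index of the start of the correct row.
3. Compute the (virtual) column index from the thread index by taking the remainder (modulo) of the thread index divided by the width (in elements). Once we have the column index (in elements), we add it to the start-of-row index computed in step 2 to identify the element this thread is responsible for.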
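The three steps can be sketched as a small host-side C function. The names pitched_element_index and pitched_row are hypothetical helpers, not part of the CUDA API, and the sketch assumes the pitch is a whole multiple of the element size. The second helper shows the equivalent byte-pointer form, where the row start is found by offsetting a char pointer by row * pitch.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the three steps above.
   index:       flat thread index
   width:       number of valid elements per row
   pitch_bytes: row stride in bytes, as reported by cudaMallocPitch
   elem_size:   size of one element in bytes */
size_t pitched_element_index(size_t index, size_t width,
                             size_t pitch_bytes, size_t elem_size)
{
    size_t row = index / width;                          /* step 1 */
    size_t row_start = row * (pitch_bytes / elem_size);  /* step 2 */
    size_t col = index % width;                          /* step 3 */
    return row_start + col;
}

/* Equivalent pointer form: step to the start of a row in bytes,
   then index within the row in elements. */
float *pitched_row(float *base, size_t row, size_t pitch_bytes)
{
    return (float *)((char *)base + row * pitch_bytes);
}
```

With an assumed pitch of 256 bytes and 4-byte floats, thread index 3 maps to element 64 (row 1, column 0) and thread index 8 maps to element 130 (row 2, column 2).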

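That is a fair amount of effort for a relatively straightforward operation, which is one example of why I suggest focusing on basic CUDA concepts before pitched arrays. For example, I would figure out how to handle 1D and 2D thread blocks, and 1D and 2D grids, before tackling pitched arrays. Pitched arrays are a useful performance enhancer for accessing 2D (or 3D) arrays in some cases, but they are by no means necessary for handling multidimensional arrays in CUDA.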
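As a sketch of the unpitched alternative: if the 3x3 array is stored densely packed (for example in a buffer obtained from plain cudaMalloc and moved with cudaMemcpy), ordinary row-major indexing is enough and no pitch needs to be passed around. The host-only C below illustrates just the index arithmetic; flat_index and fill_with_flat_index are hypothetical names.

```c
#include <stddef.h>

/* Row-major index into a densely packed (unpitched) 2D array. */
size_t flat_index(size_t row, size_t col, size_t width)
{
    return row * width + col;
}

/* Fill a packed height x width array so that each element holds its own
   flat index, which is what the kernel in this thread computes per thread. */
void fill_with_flat_index(float *result, size_t width, size_t height)
{
    for (size_t row = 0; row < height; ++row)
        for (size_t col = 0; col < width; ++col)
            result[flat_index(row, col, width)] =
                (float)flat_index(row, col, width);
}
```

In a packed layout the flat index and the thread index coincide, which is why result[index] from the original kernel would be correct there.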

It can actually also be made to work by replacing

int width_in_bytes = 3 * sizeof(float);

with:

int width_in_bytes = sizeof(float)*9;

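This works because that parameter tells cudaMemcpy2D how many bytes to copy per row from src to dst. In the first code you asked it to copy 3 floats, but the array you want to copy has 9 elements, so the width you need is the size of 9 floats. Note that height should then be 1 rather than 3; otherwise each copy covers 3 rows of 9 floats and runs past the end of the 9-element host array.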
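To illustrate how the width and height arguments interact, here is a host-only C stand-in for a row-by-row 2D copy. copy_2d is a hypothetical helper written for this sketch, not the CUDA function; it only mimics the documented behaviour of copying height rows of width bytes each, with source and destination rows spitch and dpitch bytes apart.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only stand-in for a pitched 2D copy.
   Copies `height` rows of `width_bytes` bytes each and returns the
   total number of bytes written to dst. */
size_t copy_2d(void *dst, size_t dpitch, const void *src, size_t spitch,
               size_t width_bytes, size_t height)
{
    for (size_t row = 0; row < height; ++row)
        memcpy((char *)dst + row * dpitch,
               (const char *)src + row * spitch, width_bytes);
    return width_bytes * height;
}
```

With a width of 9 floats and a height of 1, the copy moves 36 bytes and retrieves all nine values the original kernel wrote into the first pitched row. With the same width and a height of 3 it would write 108 bytes, which is more than a 9-float host array can hold.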

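While this solution works, there is still some inefficiency in your code. For example, if you really want only the first 9 threads of the block to do something, you should add an and (&&) to the 'if':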

threadIdx.y==0
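The effect of that extra condition can be checked with a small host-side C simulation that loops over every (threadIdx.x, threadIdx.y) pair of a block and counts how many pass the guard. count_active_threads is a hypothetical helper for illustration only.

```c
/* Counts how many threads of a block_x by block_y thread block pass the
   guard `tx < limit`. If require_y0 is nonzero, the guard additionally
   requires ty == 0, as suggested above. */
int count_active_threads(int block_x, int block_y, int limit, int require_y0)
{
    int active = 0;
    for (int ty = 0; ty < block_y; ++ty)
        for (int tx = 0; tx < block_x; ++tx)
            if (tx < limit && (!require_y0 || ty == 0))
                ++active;
    return active;
}
```

For the 16x16 block in the question, the original guard lets 144 threads through (9 per row of threads, 16 rows), all writing the same nine values; with the added condition only 9 threads do any work.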