Why does this class member variable not change when calling a CUDA kernel function?

Keywords: member variable, change, CUDA, kernel function call. Updated: 2023-10-16

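In a simple test CUDA application, I have a pointer to an array of class instances, and I copy that data to the GPU. I then run a kernel function many times. On each run, the kernel calls a __device__ member function on every class instance, and that function increments a variable, profitLoss.

For some reason, profitLoss is not being incremented. Here is my code: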

#include <stdio.h>
#include <stdlib.h>
#define N 200000
class Strategy {
    private:
        double profitLoss;
    public:
        __device__ __host__ Strategy() {
            this->profitLoss = 0;
        }
        __device__ __host__ void backtest() {
            this->profitLoss++;
        }
        __device__ __host__ double getProfitLoss() {
            return this->profitLoss;
        }
};
__global__ void backtestStrategies(Strategy *strategies) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        strategies[i].backtest();
    }
}
int main() {
    int threadsPerBlock = 1024;
    int blockCount = 32;
    Strategy *devStrategies;
    Strategy *strategies = (Strategy*)malloc(N * sizeof(Strategy));
    int i = 0;
    // Allocate memory for strategies on the GPU.
    cudaMalloc((void**)&devStrategies, N * sizeof(Strategy));
    // Initialize strategies on host.
    for (i=0; i<N; i++) {
        strategies[i] = Strategy();
    }
    // Copy strategies from host to GPU.
    cudaMemcpy(devStrategies, strategies, N * sizeof(Strategy), cudaMemcpyHostToDevice);
    for (i=0; i<363598; i++) {
        backtestStrategies<<<blockCount, threadsPerBlock>>>(devStrategies);
    }
    // Copy strategies from the GPU.
    cudaMemcpy(strategies, devStrategies, N * sizeof(Strategy), cudaMemcpyDeviceToHost);
    // Display results.
    for (i=0; i<N; i++) {
        printf("%f\n", strategies[i].getProfitLoss());
    }
    // Free memory for the strategies on the GPU.
    cudaFree(devStrategies);
    return 0;
}

The output is as follows:

I expect it to be:

363598.000000
363598.000000
363598.000000
363598.000000
363598.000000
363598.000000
363598.000000
363598.000000
...

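I believe profitLoss is not being incremented because of the way I initialize the objects (automatic storage duration), and I am not sure whether there is a better way to instantiate these objects and cudaMemcpy them to the GPU: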

strategies[i] = Strategy();

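Can anyone offer advice on how to fix this, or on what the cause might be? Thank you in advance!

Update: It seems that the first 32768 output lines contain data, and every line after that is zero. So I may be hitting some kind of limit.

With the grid dimension blockCount and the block dimension threadsPerBlock that you set, you launch only 32 x 1024 = 32768 threads, and each thread updates only one instance. That is why only the first 32768 elements of the array hold non-zero results; the remaining 167232 elements are never touched.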

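To get the expected result, you can either increase the number of GPU threads by raising the grid dimension blockCount until it covers all N elements (for N = 200000 and 1024 threads per block, that means at least 196 blocks), or use a for loop inside the kernel so that each GPU thread updates multiple elements until every element has been updated.

The second way is preferred because it incurs much less block launch overhead. However, you may still need a grid dimension larger than 32 to fully utilize your GPU. You can find more details here.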
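The first option can be sketched as follows. This is a minimal example rather than a definitive fix: the helper blocksFor, the extra kernel parameter n, and the shortened iteration count are assumptions introduced for illustration and are not part of the original code. It rounds the block count up so that blockCount * threadsPerBlock is at least N, and relies on the existing bounds check to idle the surplus threads.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define N 200000

// Same shape as the Strategy class in the question.
class Strategy {
    double profitLoss;
public:
    __device__ __host__ Strategy() : profitLoss(0) {}
    __device__ __host__ void backtest() { profitLoss++; }
    __device__ __host__ double getProfitLoss() const { return profitLoss; }
};

// One thread per element; the bounds check idles the surplus threads
// in the last block.
__global__ void backtestStrategies(Strategy *strategies, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        strategies[i].backtest();
    }
}

// Hypothetical helper: smallest block count with
// blocks * threadsPerBlock >= n (integer ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int threadsPerBlock = 1024;
    const int blockCount = blocksFor(N, threadsPerBlock);  // 196 for N = 200000
    const int iterations = 1000;  // assumption: shortened from 363598

    std::vector<Strategy> strategies(N);  // default constructor zeroes profitLoss
    const size_t bytes = N * sizeof(Strategy);

    Strategy *devStrategies = nullptr;
    cudaMalloc((void**)&devStrategies, bytes);
    cudaMemcpy(devStrategies, strategies.data(), bytes, cudaMemcpyHostToDevice);

    for (int i = 0; i < iterations; i++) {
        backtestStrategies<<<blockCount, threadsPerBlock>>>(devStrategies, N);
    }

    // This blocking copy waits for the queued kernels to finish first.
    cudaMemcpy(strategies.data(), devStrategies, bytes, cudaMemcpyDeviceToHost);

    printf("first: %f\n", strategies[0].getProfitLoss());
    printf("last:  %f\n", strategies[N - 1].getProfitLoss());

    cudaFree(devStrategies);
    return 0;
}
```

With full coverage, the first and the last element should both report the iteration count. Passing n as a kernel argument instead of relying on the macro is a design choice that keeps the kernel reusable for other array sizes.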

https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
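The loop-based option described above is usually written as a grid-stride loop. The sketch below is one way to apply it to the question's code, under stated assumptions: the kernel parameter n, the host-side helper allEqual, the shortened iteration count, and the choice of 256 threads per block with 32 blocks per multiprocessor are illustrative and not taken from the original post. Each thread starts at its global index and advances by the total number of threads in the grid, so any grid size covers all N elements.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define N 200000

// Same shape as the Strategy class in the question.
class Strategy {
    double profitLoss;
public:
    __device__ __host__ Strategy() : profitLoss(0) {}
    __device__ __host__ void backtest() { profitLoss++; }
    __device__ __host__ double getProfitLoss() const { return profitLoss; }
};

// Grid-stride loop: thread t handles elements t, t + stride, t + 2*stride, ...
__global__ void backtestStrategies(Strategy *strategies, int n) {
    int stride = gridDim.x * blockDim.x;  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        strategies[i].backtest();
    }
}

// Hypothetical host-side check: true if every element holds `expected`.
bool allEqual(const std::vector<Strategy> &items, double expected) {
    for (const Strategy &s : items) {
        if (s.getProfitLoss() != expected) return false;
    }
    return true;
}

int main() {
    const int threadsPerBlock = 256;
    const int iterations = 1000;  // assumption: shortened from 363598

    // Size the grid from the multiprocessor count; the factor 32 is a
    // tunable assumption, not a required value.
    int device = 0, smCount = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    if (smCount <= 0) smCount = 1;  // fall back if the query failed
    const int blockCount = 32 * smCount;

    std::vector<Strategy> strategies(N);  // default constructor zeroes profitLoss
    const size_t bytes = N * sizeof(Strategy);

    Strategy *devStrategies = nullptr;
    cudaMalloc((void**)&devStrategies, bytes);
    cudaMemcpy(devStrategies, strategies.data(), bytes, cudaMemcpyHostToDevice);

    for (int i = 0; i < iterations; i++) {
        backtestStrategies<<<blockCount, threadsPerBlock>>>(devStrategies, N);
    }

    // This blocking copy waits for the queued kernels to finish first.
    cudaMemcpy(strategies.data(), devStrategies, bytes, cudaMemcpyDeviceToHost);

    printf("all elements updated: %s\n",
           allEqual(strategies, (double)iterations) ? "yes" : "no");

    cudaFree(devStrategies);
    return 0;
}
```

Because correctness no longer depends on the launch configuration, blockCount and threadsPerBlock can be tuned for performance alone; even the original 32 x 1024 launch would update every element with this kernel.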