Mathematica/CUDA reduce execution time

Keywords: execution time, CUDA, Mathematica. Updated: 2023-10-16

I am writing a simple Monte Carlo simulation of particle transport. My approach is to write a kernel for CUDA and execute it as a Mathematica function.

Kernel:

#include "curand_kernel.h"
#include "math.h"

// Real_t and mint are type aliases supplied by CUDALink.
extern "C" __global__ void monteCarlo(Real_t *transmission, mint seed, mint pathN) {
    curandState rngState;
    int index = threadIdx.x + blockIdx.x*blockDim.x;
    // Every thread initializes its own generator state on every launch.
    curand_init(seed, index, 0, &rngState);
    if (index < pathN) {
        //-------------start one packet run----------------------
        float packetWeight = 1.0f;
        int m = 0;
        while (packetWeight > 0.0f) {
            // MONTE CARLO CODE (elided in the original post; z_coordinate
            // and sampleThickness are presumably defined there)
            // Test: still in the sample?
            if (z_coordinate > sampleThickness) {
                packetWeight = 0;
                z_coordinate = sampleThickness;
                transmission[index] = 1;
            }
        }
        //-------------end one packet run------------------------
    }
}

Mathematica code:

Needs["CUDALink`"];
cudaBM = CUDAFunctionLoad[code, 
"monteCarlo", {{_Real, "Output"}, _Integer, _Integer}, 256, 
"UnmangleCode" -> False];

pathN = 100000;
result = 0;  (*count for transmitted particles*)
For[j = 0, j < 10, j++,
   buffer = CUDAMemoryAllocate["Float", 100000];
   cudaBM[buffer, 1490, pathN];
   resultOneRun = Total[CUDAMemoryGet[buffer]];
   result = result + resultOneRun;
];

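So far everything seems to work, but the speed-up compared with pure C code without CUDA is negligible. I have two problems:

1) The curand_init() function is executed by all threads at the beginning of each simulation step. Could I call this function once for all threads?
2) The kernel returns a very large array of reals (100,000 elements) to Mathematica. I know that the bottleneck of CUDA is the bandwidth of the channel between GPU and CPU. I only need the sum of all elements of the list, so it would be more efficient to compute the sum on the GPU and send a single real number to the CPU.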

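1) If you need to execute curand_init() once for all threads, could you do that on the CPU and pass the result as an argument to the CUDA call?

2) What about a `__device__ float sumTotal` function that sums the values and returns the result? Have you copied as much of the transmission data as possible into a shared memory buffer?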
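The on-device summation suggested above could look roughly like the following. This is a minimal sketch, not code from the thread: the kernel name `sumTransmission` and the host helper `sumOnDevice` are hypothetical, the block size is assumed to be a power of two, and the data is uploaded from the host only to keep the example self-contained (in the simulation the values would already be in device memory).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block reduces its slice of `data` in shared memory, then adds its
// partial sum to *total with a single atomicAdd.
__global__ void sumTransmission(const float *data, int n, float *total) {
    extern __shared__ float partial[];   // blockDim.x floats, sized at launch
    int tid = threadIdx.x;
    int index = blockIdx.x * blockDim.x + tid;
    partial[tid] = (index < n) ? data[index] : 0.0f;
    __syncthreads();
    // Tree reduction; assumes blockDim.x is a power of two (256 here).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(total, partial[0]);
}

// Hypothetical host helper: uploads the data, runs the reduction and
// copies back one float instead of the whole array.
float sumOnDevice(const std::vector<float> &host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    float *dData = nullptr, *dTotal = nullptr;
    float result = 0.0f;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dData, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(float));
    const int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sumTransmission<<<blocks, threads, threads * sizeof(float)>>>(dData, n, dTotal);
    cudaMemcpy(&result, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dTotal);
    return result;
}
```

Because each entry is either 0 or 1, a single-precision sum stays exact for arrays of this size. With CUDALink, the same idea would mean passing a one-element output buffer rather than a 100,000-element one; that wiring is not shown here.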

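According to the CURAND documentation: "Calls to curand_init() are slower than calls to curand() or curand_uniform(). Large offsets to curand_init() take more time than smaller offsets. It is much faster to save and restore random generator state than to recalculate the starting state repeatedly."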

http://docs.nvidia.com/cuda/curand/index.html#topic_1_3_4

Also take a look at this thread for more details: CUDA program causes nvidia driver to crash
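The save-and-restore advice quoted from the CURAND documentation can be sketched as follows. This is an illustration under assumptions, not the asker's code: the kernel names `setupStates` and `drawUniform` and the host helper `countChangedDraws` are hypothetical, and a single `curand_uniform` draw stands in for the Monte Carlo step. The seed 1490 is the one used in the question.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

// Run once: each thread gets its own subsequence of the same seed.
__global__ void setupStates(curandState *states, unsigned long long seed, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n) curand_init(seed, index, 0, &states[index]);
}

// Run many times: restore the state, draw, then save the state again.
__global__ void drawUniform(curandState *states, float *out, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index >= n) return;
    curandState local = states[index];    // restore from global memory
    out[index] = curand_uniform(&local);  // stand-in for the simulation step
    states[index] = local;                // save for the next launch
}

// Hypothetical host helper: initializes the states once, launches the
// drawing kernel twice, and counts how many threads produced a different
// value on the second launch. Returns -1 if a value falls outside (0, 1].
int countChangedDraws(int n) {
    if (n <= 0) return 0;
    curandState *dStates = nullptr;
    float *dOut = nullptr;
    std::vector<float> first(n), second(n);
    cudaMalloc(&dStates, n * sizeof(curandState));
    cudaMalloc(&dOut, n * sizeof(float));
    const int threads = 256;
    int blocks = (n + threads - 1) / threads;
    setupStates<<<blocks, threads>>>(dStates, 1490ULL, n);
    drawUniform<<<blocks, threads>>>(dStates, dOut, n);
    cudaMemcpy(first.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    drawUniform<<<blocks, threads>>>(dStates, dOut, n);
    cudaMemcpy(second.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dStates);
    cudaFree(dOut);
    int changed = 0;
    for (int i = 0; i < n; ++i) {
        if (first[i] <= 0.0f || first[i] > 1.0f) return -1;
        if (second[i] <= 0.0f || second[i] > 1.0f) return -1;
        if (first[i] != second[i]) ++changed;
    }
    return changed;
}
```

Since the state is written back after each launch, the second launch continues each thread's sequence instead of replaying it, and curand_init() is paid for only once. Using this from Mathematica would additionally require keeping the state array alive in device memory between calls, which is not shown here.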