Generating decision tree on CUDA


I want to generate some decision trees on CUDA. Below is pseudocode (the code is very primitive, it is only meant to show what I did):

class Node
{
public :
    Node* father;
    Node** sons;
    int countSons;
    __device__ __host__ Node(Node* father)
    {
        this->father = father;
        sons = NULL;
        countSons = 0; //initialize so the counter is never read uninitialized
    }
};
__global__ void GenerateSons(Node** fathers, int* countFathers, Node** sons, int* countSons)
{
    int Thread_Index = (blockDim.x * blockIdx.x) + threadIdx.x;
    if(Thread_Index < *(countFathers))
    {
        Node* Thread_Father = fathers[Thread_Index];
        Node** Thread_Sons;
        int Thread_countSons;
        //Now we are creating new sons for our Thread_Father
        /*
        * Generating Thread_Sons for Thread_Father;
        */
        Thread_Father->sons = Thread_Sons; 
        Thread_Father->countSons = Thread_countSons;
        //Count all sons generated by the threads of this block;
        //atomicAdd avoids the data race a plain "*(countSons) += ..." would cause
        atomicAdd(countSons, Thread_countSons);
        //Wait until every thread has added its count
        __syncthreads();
        //Get all generated sons from whole Block and copy to sons
        if(threadIdx.x == 0)
        {
            //note: this assigns only to the kernel's local copy of the sons
            //parameter; the caller never sees this pointer, so real code
            //has to return it another way (e.g. via a Node***)
            sons = new Node*[*(countSons)];
        }
        /* __syncthreads here so that the sons array allocated by
           thread 0 is visible to every thread before it is written */
        __syncthreads();
        int Thread_Offset;
        /*
        * Get correct offset for actual thread
        */
        for(int i = 0; i < Thread_countSons; i++)
            sons[Thread_Offset + i] = Thread_Sons[i];
    }
}
int main()
{
    Node* root = new Node(NULL); //the root has no father
    //transfer root to kernel by cudaMalloc and cudaMemcpy
    Node* root_d = root->transfer();
    Node** fathers_d;
    /*
    * prepare an array holding root as the only father and copy it to the device
    */
    int *countFathers, *countSons; //both must be pointers: they are passed to the kernel
    /*
    * prepare the int pointers for the device and set the value behind countFathers to 1
    */
    for(int i = 0; i < LevelTree; i++)
    {
        Node** sons = NULL;
        int threadsPerBlock = 256; 
        int blocksPerGrid = (*(countFathers) /*count of fathers, copied back to the host*/ + threadsPerBlock - 1) / threadsPerBlock;
        GenerateSons<<<blocksPerGrid , threadsPerBlock >>>(fathers_d, countFathers, sons, countSons);
        //Wait for end of kernel call
        cudaDeviceSynchronize();
        //replace
        fathers_d = sons;
        countFathers = countSons;
    }
}
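
The "Get correct offset" step that the pseudocode leaves out is where a race is easiest to introduce: every thread must claim a unique, non-overlapping slice of the shared sons array. A minimal sketch of one way to do it (my assumption, not the asker's code): use the return value of atomicAdd on a counter that starts at zero, since that return value is the old counter value and therefore exactly this thread's start offset. This assumes the sons array was already allocated large enough to hold the block's total.

```cuda
//Hedged sketch with assumed names: nextSlot must be zeroed before the block
//starts copying, and sons must already have room for all generated sons.
//atomicAdd returns the previous value of *nextSlot, so each thread gets a
//disjoint range [Thread_Offset, Thread_Offset + Thread_countSons).
__device__ void StoreSons(Node** sons, int* nextSlot,
                          Node** Thread_Sons, int Thread_countSons)
{
    int Thread_Offset = atomicAdd(nextSlot, Thread_countSons);
    for(int i = 0; i < Thread_countSons; i++)
        sons[Thread_Offset + i] = Thread_Sons[i];
}
```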

So, this works for level 5 (it generates a decision tree for checkers), but at level 6 I get an error: somewhere in the kernel code, malloc returns NULL. To me that means some threads in the block cannot allocate any more memory. I am sure that I delete every object I no longer need at the end of each kernel call, so I think I am misunderstanding something about how memory is used in CUDA.

If I create objects in a thread's local memory and the kernel finishes its work, then during the second launch of the kernel I can still see the Node objects created by the first kernel call. So my question is: where are the Node objects from the first kernel call stored? Are they stored in the local memory of the threads in the block? If that is true, do I shrink this local memory space with every call of my kernel function?

I am using a GT 555M with compute capability 2.1, CUDA SDK 5.0, and Visual Studio 2010 Premium with NSight 3.0.

Okay,

I found out that new/malloc calls inside a kernel are allocated in global memory on the device. I also found this:

By default, CUDA creates an 8 MB heap.

CUDA Application Design and Development, page 128
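
This also answers the question above about where the Node objects live between launches. A small standalone sketch (assumed names, not from the original program): memory allocated with new inside one kernel stays valid in a later kernel launch until it is explicitly deleted, because it lives on this device heap rather than in thread-local memory.

```cuda
#include <cstdio>

__device__ int* g_ptr; //hypothetical device global used to hand the pointer over

__global__ void AllocKernel()
{
    g_ptr = new int(42); //allocated on the device heap, not in local memory
}

__global__ void ReadKernel()
{
    printf("still visible: %d\n", *g_ptr); //still valid in a later launch
    delete g_ptr; //heap allocations must be freed explicitly
}

int main()
{
    AllocKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    ReadKernel<<<1, 1>>>(); //the allocation from the first launch persists
    cudaDeviceSynchronize();
    return 0;
}
```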

So I used cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024); to increase the heap memory on the device to 128 MB, and the program correctly generated the level-6 tree (22110 sons), but now I am running into some memory leaks... which I need to find.
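
The fix can be sketched as follows (a sketch under the assumption that no kernel has run yet: the limit has to be raised before the first kernel launch that calls new/malloc, and cudaDeviceGetLimit can confirm what was actually granted):

```cuda
#include <cstdio>

int main()
{
    size_t heapSize = 0;
    //raise the device malloc/new heap from the default 8 MB to 128 MB;
    //must happen before the first kernel launch that uses the heap
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("device malloc heap: %zu bytes\n", heapSize);
    return 0;
}
```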