在 CUDA 中实现等效的可变长度(局部）数组

Achieving the equivalent of a variable-length (local) array in CUDA

本文关键字：局部数组 CUDA 实现更新时间：2023-10-16

我有一些使用本地内存的代码（我可能使用了寄存器，但我需要动态寻址）。由于我使用的内存量取决于输入和块中的线程数（这也取决于运行时的输入，尽管在启动之前） - 它不能是固定大小的数组。另一方面，我不会写

__global__ foo(short x)
{
    int my_local_mem_array[x];
}

（这是有效但有问题的 C99，但即使在主机端也无效C++。

如何达到相同的效果？到目前为止，我的想法是在内存大小上模板化内核，然后只在内核上使用最大可能的适当 L1 内存调用它，只使用我需要的内存。但这有点丑陋，因为这意味着我必须将实例化的数量乘以不同的可能的最大内存大小。呸。

我认为模板元编程可能是做你想要的事情的唯一现实方法（你为什么真正想要这样做的理由不是很明显，但这是另一个问题）。我知道没有任何其他方法可以声明"可变"长度的本地内存数组，因为本地内存需要静态编译作为每个线程堆栈帧的一部分。

当然，实例化和选择同一模板函数的许多不同版本并不有趣，但您可以使用 boost 预处理器之类的东西来自动化所有乏味。

例如，考虑以下简单内核，它看起来很像您在问题中描述的模型：

#include <boost/preprocessor/arithmetic/inc.hpp>
#include <boost/preprocessor/comparison/not_equal.hpp>
#include <boost/preprocessor/repetition/for.hpp>
#include <boost/preprocessor/tuple/elem.hpp>
template<int N>
__global__ void kernel(int *out, int Nout)
{
    int scratch[N];
    for(int i=0; i<N; i++)
        scratch[i] = i - Nout;
    if (Nout > 1) {
       out[threadIdx.x] = scratch[Nout];
    }
}
#define PRED(r, state) 
   BOOST_PP_NOT_EQUAL( 
      BOOST_PP_TUPLE_ELEM(2, 0, state), 
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 1, state)) 
   ) 
   /**/
#define OP(r, state) 
   ( 
      BOOST_PP_INC(BOOST_PP_TUPLE_ELEM(2, 0, state)), 
      BOOST_PP_TUPLE_ELEM(2, 1, state) 
   ) 
   /**/
#define STUB(n) template __global__ void kernel<n>(int *, int);
#define MACRO(r, state) STUB(BOOST_PP_TUPLE_ELEM(2, 0, state));
BOOST_PP_FOR((10, 20), PRED, OP, MACRO) // generate kernel<10> ... kernel<20>

在这里，我使用 BOOST_PP_FOR 自动生成了 10 个不同的基本内核实例：

>nvcc -arch=sm_21 -cubin -Xptxas="-v" -I ..boost_1_60_0 template.cu
template.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelILi13EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi13EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi17EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi17EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi15EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi15EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi19EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi19EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi11EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi11EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi16EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi16EEvPii
    64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi20EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi20EEvPii
    80 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi12EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi12EEvPii
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi14EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi14EEvPii
    56 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi18EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi18EEvPii
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi10EEvPii' for 'sm_21'
ptxas info    : Function properties for _Z6kernelILi10EEvPii
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 9 registers, 44 bytes cmem[0]

您还可以使用相同的自动化来生成主机包装器函数，该函数在运行时选择正确的实例。虽然它并不理想，但它是便携式的、快速的，并且可以与 CUDA 工具链很好地配合使用。

只需使用

cudaMalloc 使用主机分配的全局内存。在编译时未完全定义的任何数组访问都将导致 CUDA 使用"本地内存"，尽管名称如此，但这只是全局内存。或者，您可以使用 new 或 __device__ malloc 。