How not to repeat myself without macros when writing similar CUDA kernels?

Updated: 2023-10-16

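I have several CUDA kernels that basically do the same thing with some variations. I would like to reduce the amount of code needed. My first thought was to use macros, so my resulting kernels look like this (simplified):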

__global__ void kernelA( ... )
{
   INIT(); // macro to initialize variables
   // do specific stuff for kernelA
   b = a + c;
   END(); // macro to write back the result
}
__global__ void kernelB( ... )
{
   INIT(); // macro to initialize variables
   // do specific stuff for kernelB
   b = a - c;
   END(); // macro to write back the result
}
...

Because macros are nasty, ugly, and evil, I am looking for a better and cleaner way. Any suggestions?

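(A switch statement would not do the job: in reality, the parts that are the same and the parts that are kernel-specific are quite interwoven. Several switch statements would be needed, which would make the code very hard to read. Furthermore, function calls would not initialize the needed variables.)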

(This question might be answerable for general C++ as well; just replace every 'CUDA kernel' with 'function' and remove '__global__'.)

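Update: I was told in the comments that classes and inheritance don't mix well with CUDA. Therefore only the first part of this answer applies to CUDA, while the other parts answer the more general C++ part of your question.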

For CUDA, you will have to use plain functions, "C style":

struct KernelVars {
  int a;
  int b;
  int c;
};
__device__ void init(KernelVars& vars) {
  INIT(); //whatever the actual code is
}
__device__ void end(KernelVars& vars) {
  END(); //whatever the actual code is
}
__global__ void KernelA(...) {
  KernelVars vars;
  init(vars);
  vars.b = vars.a + vars.c; // the variables now live in the struct
  end(vars);
}
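
To make the C-style idea concrete, here is a minimal host-side sketch in plain C++. It assumes that INIT() loads two operands for one element and END() writes the result back; the question does not show the macros, so the `idx` and `out` fields, the extra parameters of `init`, and the `kernelB` variant are hypothetical. In CUDA the helpers would presumably be marked `__device__` and the kernels `__global__`.

```cpp
#include <cassert>
#include <cstddef>

// Per-element state; a, b, c follow the KernelVars struct above.
struct KernelVars {
    int a;
    int b;
    int c;
    std::size_t idx;  // assumed: index of the element being processed
    int* out;         // assumed: destination that end() writes to
};

// Stand-in for INIT(): load the operands for one element.
inline void init(KernelVars& vars, const int* in_a, const int* in_c,
                 int* out, std::size_t idx) {
    assert(in_a != nullptr && in_c != nullptr && out != nullptr);
    vars.a = in_a[idx];
    vars.c = in_c[idx];
    vars.b = 0;
    vars.idx = idx;
    vars.out = out;
}

// Stand-in for END(): write the result back.
inline void end(const KernelVars& vars) {
    vars.out[vars.idx] = vars.b;
}

// Only the middle line differs between the two functions.
void kernelA(const int* in_a, const int* in_c, int* out, std::size_t idx) {
    KernelVars vars;
    init(vars, in_a, in_c, out, idx);
    vars.b = vars.a + vars.c;
    end(vars);
}

void kernelB(const int* in_a, const int* in_c, int* out, std::size_t idx) {
    KernelVars vars;
    init(vars, in_a, in_c, out, idx);
    vars.b = vars.a - vars.c;
    end(vars);
}
```

Because the shared state travels in one struct passed by reference, the helper functions can initialize the variables the kernel-specific code needs, which a bare function call without such a struct could not do.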

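This is the answer for general C++, where you can use OOP techniques such as constructors and destructors (they are perfectly suited to such init/end pairs), or the template method pattern, which can be used in other languages as well: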

Using ctor/dtor and templates, "C++ style":

class KernelBase {
protected:
  int a, b, c;
public:
  KernelBase() {
    INIT(); //replace by the contents of that macro
  }   
  ~KernelBase() {
    END();  //replace by the contents of that macro
  }
  virtual void run() = 0;
};
struct KernelAdd : KernelBase {
  void run() { b = a + c; }
};
struct KernelSub : KernelBase {
  void run() { b = a - c; }
};
template<class K>
void kernel(...)
{
  K k;
  k.run();
}
void kernelA( ... ) { kernel<KernelAdd>(); }
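
As a rough illustration of why the constructor/destructor pair fits here, the sketch below fills in the elided parts with assumptions: the constructor takes the two operands and an output pointer (standing in for INIT()), and the destructor writes `b` back (standing in for END()). Those parameters are hypothetical. Since the concrete type is chosen through the template parameter, this variant drops the `virtual` function; that is a simplification of the code above, not a requirement.

```cpp
#include <cassert>

// Constructor plays the role of INIT(), destructor the role of END().
class KernelBase {
protected:
    int a, b, c;
private:
    int* out_;  // assumed: where the result is written back
public:
    KernelBase(int a_in, int c_in, int* out)
        : a(a_in), b(0), c(c_in), out_(out) {
        assert(out != nullptr);
    }
    // Copying would cause a second write-back, so forbid it.
    KernelBase(const KernelBase&) = delete;
    KernelBase& operator=(const KernelBase&) = delete;
    ~KernelBase() { *out_ = b; }
};

struct KernelAdd : KernelBase {
    using KernelBase::KernelBase;  // inherit the constructor
    void run() { b = a + c; }
};

struct KernelSub : KernelBase {
    using KernelBase::KernelBase;
    void run() { b = a - c; }
};

// The shared skeleton: construct (init), run, destroy (end).
template <class K>
void kernel(int a, int c, int* out) {
    K k(a, c, out);
    k.run();
}  // k goes out of scope here, so the destructor writes b to *out
```

A call such as `kernel<KernelAdd>(5, 3, &result)` leaves the sum in `result` once the function returns, because the write-back happens when `k` is destroyed.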

Using the template method pattern, general "OOP style":

class KernelBase {
  virtual void do_run() = 0;
protected:
  int a, b, c;
public:
  void run() { //the template method
    INIT(); 
    do_run();
    END();
  }
};
struct KernelAdd : KernelBase {
  void do_run() { b = a + c; }
};
struct KernelSub : KernelBase {
  void do_run() { b = a - c; }
};
void kernelA(...)
{
  KernelAdd k;
  k.run();
}

You can use device functions as a replacement for "INIT()" and "END()".

__device__ int init()
{
    return threadIdx.x + blockIdx.x * blockDim.x;
}

Another option is to use a function template:

#define ADD 1
#define SUB 2
template <int op> __global__ void calculate(float* a, float* b, float* c)
{
    // init code ...
    int id = threadIdx.x + blockIdx.x * blockDim.x; // assumed: id is the global thread index
    switch (op)
    {
    case ADD:
        c[id] = a[id] + b[id];
        break;
    case SUB:
        c[id] = a[id] - b[id];
        break;
    }
    // end code ...
}

and use it like this:

calculate<ADD><<<...>>>(a, b, c);

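The CUDA compiler does the work: it builds a separate version of the function for each template argument and, because the switch value is known at compile time, removes the dead branches, so the switch costs nothing at run time.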
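
A variation on the same idea, sketched here in plain C++, passes the operation as a functor type instead of an integer, which avoids both the `#define`s and the switch. The `Add` and `Sub` structs, the element count `n`, and the host-side loop are assumptions for illustration; the loop only stands in for the per-thread index. In CUDA one would expect `operator()` to be marked `__device__` and `calculate` to be `__global__`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical operation functors: each supplies the kernel-specific step.
struct Add {
    float operator()(float x, float y) const { return x + y; }
};

struct Sub {
    float operator()(float x, float y) const { return x - y; }
};

// Shared skeleton; Op is chosen at compile time, so each instantiation
// contains only its own operation.
template <class Op>
void calculate(const float* a, const float* b, float* c, std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    const Op op{};
    for (std::size_t id = 0; id < n; ++id) {  // stands in for the thread index
        // shared init code would go here
        c[id] = op(a[id], b[id]);
        // shared end code would go here
    }
}
```

Adding a new variant then means writing one more small struct rather than extending every switch statement, which may help when the shared and specific parts are interleaved in several places.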