Alternative way to template struct with static member function in CUDA?

Updated: 2023-10-16

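In C++, I often use static functions wrapped in template structs, so that the implementation can be selected through a template parameter at compile time and the compiler can apply optimizations such as inlining (by the way, is there a name for this?). An example (rather contrived and possibly buggy, but you get the idea):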

#include <cmath>

template <int dim>
struct ImplementationA {
    static float compute(float a) {
        // do stuff, e.g.
        return 2*pow(a,dim);
    }
};
template <int dim>
struct ImplementationB {
    static float compute(float a) {
        // do other stuff, e.g.
        return 3*pow(a,dim);
    }
};
template <template <int> class ImplT, int dim> class Test {
public:
    void compute_stuff(float *dst, const float *src, int N) {
        for(int i=0; i<N; i++)
            dst[i] = ImplT<dim>::compute(src[i]);
    }
};
int main() {
    const int N = 100;
    float v1[N] = {0};  // input, zero-initialized for the example
    float v2[N];        // output
    Test<ImplementationB,3> t;
    t.compute_stuff(v2,v1,N);
    return 0;
}

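However, if I want to do the same thing in CUDA with compute as a kernel, i.e. __global__, this is not possible, because a static member function cannot be __global__. What other options do I have that keep the performance overhead equally small? I am using GCC 4.6, so some C++11 features are not available.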
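For illustration, the restriction applies only to __global__: a static member function may still be __device__. One possible sketch, under that assumption, keeps the per-element work in static __device__ members and moves the loop into a free __global__ function template that takes the implementation as a template template parameter. The names compute_kernel and run_static_policy are hypothetical and not part of the original post, and a plain multiplication loop stands in for pow so that dim stays a compile-time constant. Error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Static __device__ member functions are allowed; only __global__ ones are not.
template <int dim>
struct ImplementationA {
    __device__ static float compute(float a) {
        float r = 1.0f;
        for (int k = 0; k < dim; ++k) r *= a;  // a^dim, dim known at compile time
        return 2.0f * r;
    }
};

template <int dim>
struct ImplementationB {
    __device__ static float compute(float a) {
        float r = 1.0f;
        for (int k = 0; k < dim; ++k) r *= a;
        return 3.0f * r;
    }
};

// Free kernel template (hypothetical name): the implementation is chosen
// at compile time, so the call to compute() can be inlined.
template <template <int> class ImplT, int dim>
__global__ void compute_kernel(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = ImplT<dim>::compute(src[i]);
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel, and copies the result back. CUDA error checks are left out.
template <template <int> class ImplT, int dim>
std::vector<float> run_static_policy(const std::vector<float> &in) {
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;
    const size_t bytes = n * sizeof(float);
    float *d_src = 0;
    float *d_dst = 0;
    cudaMalloc((void **)&d_src, bytes);
    cudaMalloc((void **)&d_dst, bytes);
    cudaMemcpy(d_src, &in[0], bytes, cudaMemcpyHostToDevice);
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    compute_kernel<ImplT, dim><<<blocks, threads>>>(d_dst, d_src, n);
    cudaMemcpy(&out[0], d_dst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    return out;
}
```

A call such as run_static_policy<ImplementationB, 3>(values) then plays the role of Test<ImplementationB,3>::compute_stuff in the host-only version.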

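You can create a template class that carries the implementation in a __device__ method, plus a small __global__ function template that simply takes an instance of that class and calls the method: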

template <int dim> class ImplementationA
{
public:
    // parameters
    float *dst;
    const float *src;
    int N;
    // implementation
    __device__ void compute()
    {
        float a = src[threadIdx.x];
        // ...
    }
};
// The same for ImplementationB
// global function
template <class Impl>
__global__ void compute(Impl impl)
{
    impl.compute();
}
// call (src and dst are assumed to be device pointers, N the element count)
ImplementationA<3> impl;
impl.src = src;
impl.dst = dst;
impl.N = N;
compute<<<1, 32>>>(impl);
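A fuller sketch of this approach, with the host-side setup written out, might look as follows. The body of compute() is an assumption borrowed from the question's example (2 * a^dim, with a multiplication loop in place of pow), the global index and bounds check are additions so that more than one block can be launched, and run_functor is a hypothetical helper name. Error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

template <int dim>
class ImplementationA {
public:
    // parameters (both pointers must refer to device memory)
    float *dst;
    const float *src;
    int N;

    // implementation: each thread handles one element
    __device__ void compute() const {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;                    // guard threads past the end
        float a = src[i];
        float r = 1.0f;
        for (int k = 0; k < dim; ++k) r *= a;  // a^dim
        dst[i] = 2.0f * r;                     // assumed body, taken from the question
    }
};

// The implementation object is passed to the kernel by value.
template <class Impl>
__global__ void compute(Impl impl) {
    impl.compute();
}

// Hypothetical host helper: allocates device buffers, fills in the
// parameter object, launches the kernel, and copies the result back.
template <class Impl>
std::vector<float> run_functor(const std::vector<float> &in) {
    const int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;
    const size_t bytes = n * sizeof(float);
    float *d_src = 0;
    float *d_dst = 0;
    cudaMalloc((void **)&d_src, bytes);
    cudaMalloc((void **)&d_dst, bytes);
    cudaMemcpy(d_src, &in[0], bytes, cudaMemcpyHostToDevice);

    Impl impl;
    impl.src = d_src;
    impl.dst = d_dst;
    impl.N = n;

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    compute<Impl><<<blocks, threads>>>(impl);

    cudaMemcpy(&out[0], d_dst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    return out;
}
```

Because the object is copied into the kernel's parameter space, it should stay small and hold only plain data such as pointers and sizes. Since the type is a template argument of the kernel, the call to impl.compute() is resolved at compile time and can be inlined.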