CUDA - Generating the Halton sequence in parallel

Keywords: Halton sequence, parallel, CUDA. Updated: 2023-10-16

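I want to write a kernel in CUDA that generates the Halton sequence in parallel, with each thread generating and storing one value.

Looking at the sequence, generating each subsequent value seems to involve the work done to generate the previous one. Generating every value from scratch would involve redundant work and would also cause large differences between the execution times of the threads.

Is there a way to do this with a parallel kernel that improves on the serial algorithm? I am really new to parallel programming, so pardon the question if the answer is some well-known pattern.

Note: I did find this link in a textbook (which uses it without describing how it works), but the file link there is dead.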

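The Halton sequence in base p is generated as follows:

1. Write the index i in base p.
2. Reverse the order of the digits and read the result as a base-p fraction (the reversed digits are placed after the radix point).

For example, the base-2 Halton sequence: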

index      binary     reversed     result
1             1           1           1 /   10 = 1 / 2
2            10          01          01 /  100 = 1 / 4
3            11          11          11 /  100 = 3 / 4
4           100         001         001 / 1000 = 1 / 8
5           101         101         101 / 1000 = 5 / 8
6           110         011         011 / 1000 = 3 / 8
7           111         111         111 / 1000 = 7 / 8
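
The rows of the table can be reproduced by computing each value from scratch. The following is a minimal host-side C++ sketch of that direct approach; the function name `radical_inverse` is a hypothetical helper, not something taken from the question or from a library.

```cpp
#include <cassert>

// Hypothetical helper: value of the base-`base` Halton sequence at `index`.
// It peels off the digits of `index` from least to most significant and
// places them after the radix point in that order, which reverses them.
double radical_inverse(unsigned index, unsigned base) {
    assert(base >= 2);
    double result = 0.0;
    double digit_weight = 1.0 / base;  // weight of the first digit after the point
    while (index > 0) {
        result += (index % base) * digit_weight;  // least significant digit first
        index /= base;
        digit_weight /= base;
    }
    return result;
}
```

The loop runs once per base-p digit of the index, so the work per value grows with the number of digits. This is the redundant, uneven work that the question is concerned about.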

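So there is indeed a lot of repeated work in the reversal. The first thing we can do is reuse previous results.

When computing the element with index i of the base-p Halton sequence, we first determine the leading bit of i (its most significant base-p digit) and the remaining part of its base-p representation (this can be done by scheduling the threads in a base-p fashion). Then we have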

out[i] = out[remaining_part] + leading_bit / p^(length_of_i_in_base_p_representation)
//"^" denotes exponentiation here and is used for convenience
//e.g. p = 2, i = 5 = 101: out[5] = out[1] + 1 / 2^3 = 1/2 + 1/8 = 5/8
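
To make the recurrence concrete, here is a serial C++ sketch that fills an array using only previously stored values. It assumes out[0] = 0 as the seed and relies on the fact that a leading digit at position k (counting from 0) contributes digit / p^(k+1). The function name `halton_by_recurrence` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Fills out[0 .. count-1] with base-p Halton values by reusing earlier results:
// out[i] = out[i without its leading digit] + leading digit / p^(number of digits of i).
std::vector<double> halton_by_recurrence(unsigned count, unsigned p) {
    assert(p >= 2);
    std::vector<double> out(count, 0.0);  // out[0] = 0 seeds the recurrence
    unsigned w = 1;                        // weight of the leading digit of i
    for (unsigned i = 1; i < count; ++i) {
        if (i / w >= p) w *= p;            // i has just gained a digit
        unsigned leading = i / w;          // leading digit, in 1 .. p-1
        unsigned remaining = i % w;        // i with the leading digit removed
        out[i] = out[remaining] + leading / (static_cast<double>(w) * p);
    }
    return out;
}
```

Each element costs one load and one addition regardless of how many digits the index has, at the price of a dependency on an earlier element.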

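To avoid unnecessary global memory reads, each thread should handle all elements that share the same "remaining part" but have different "leading bits". If we generate the part of the Halton sequence between p^n and p^(n+1), there should conceptually be p^n parallel tasks. However, assigning a group of tasks to a single thread causes no problems.

Further optimization is possible by mixing recomputation with loading from memory.

Example code:

The total number of threads should be p^m.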

const int m = 3; //any value; the total number of threads must be p^m
__device__ void halton(float* out, int p, int N)
{
    const int tid = ... //globally unique and contiguous thread id
    int step = 1;
    for(int k = 0; k < m; ++k) //step = p^m ("^" is XOR in C, so multiply in a loop)
        step *= p;
    int w = step; //w = p^n is the weight of the leading bit
    for(int n = m; n <= N; ++n) //n is the position of the leading bit
    {
        for(int r = tid; r < w; r += step) //r is the remaining part
            for(int l = 1; l < p; ++l) //l is the leading bit
                out[l*w + r] = out[r] + (float)l / ((float)w * p); //digit l at position n adds l / p^(n+1)
        w *= p;
    }
}

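Note: this example does not compute the first p^m elements of the Halton sequence, but it still needs them. out[0] through out[p^m - 1] must already hold valid values (with out[0] = 0) before the loop runs, and out must have room for p^(N+1) elements.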
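
The indexing scheme can be checked on the host before moving it to the GPU. The sketch below is a plain C++ emulation, not a CUDA kernel: the thread id is an ordinary loop variable, and the function name `halton_emulated` and the choice to seed the first p^m elements by direct digit reversal are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of the per-thread scheme with p^m emulated threads.
// Returns out[0 .. p^(N+1) - 1] for the base-p Halton sequence.
std::vector<float> halton_emulated(int p, int m, int N) {
    assert(p >= 2 && m >= 0 && N >= m);
    int step = 1;
    for (int k = 0; k < m; ++k) step *= p;      // step = p^m = number of emulated threads
    int total = step;
    for (int k = m; k <= N; ++k) total *= p;    // total = p^(N+1) output elements
    std::vector<float> out(total, 0.0f);

    // Seed the first p^m elements by direct digit reversal (assumed seeding method).
    for (int i = 1; i < step; ++i) {
        float weight = 1.0f / p;
        for (int rest = i; rest > 0; rest /= p, weight /= p)
            out[i] += (rest % p) * weight;
    }

    for (int tid = 0; tid < step; ++tid) {      // on a GPU these iterations would run in parallel
        int w = step;                           // w = p^n, weight of the leading digit
        for (int n = m; n <= N; ++n) {
            for (int r = tid; r < w; r += step) // r is the remaining part
                for (int l = 1; l < p; ++l)     // l is the leading digit
                    out[l * w + r] = out[r] + static_cast<float>(l) / (static_cast<float>(w) * p);
            w *= p;
        }
    }
    return out;
}
```

Because w is always a multiple of p^m, every index l*w + r is congruent to tid modulo p^m. Each emulated thread therefore reads only seed values and values it wrote itself, which is why the scheme needs no synchronization between threads.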